LWN: mseal() and subsequent solutions!


mseal() and what comes after

By Jonathan Corbet
October 20, 2023
ChatGPT translation
https://lwn.net/Articles/948129/

Jeff Xu recently proposed a new system call, mseal(), that would allow applications to prevent modifications to selected memory mappings. It would help harden user-space applications against certain types of attacks; some other operating systems already have this capability. There is support for adding such a mechanism to the Linux kernel as well, but it is now clear that mseal() will not enter the mainline in its current form. Instead, it has become, in many ways, an example of how not to do kernel development.

Xu describes the purpose of the new system call as follows:

“Memory sealing provides additional protection against modifications to the mapping itself. This helps mitigate memory corruption problems in cases where a corrupted pointer is passed to a memory management system call. For example, such an attack could compromise control-flow integrity, because supposedly trusted read-only memory may become writable, or .text pages may be remapped.”

This feature is targeted at the Chrome browser, which includes a just-in-time (JIT) compilation engine for JavaScript code. Because it generates executable code on the fly, JIT compilation must be done carefully to avoid creating (and running) problematic code. As Stephen Röttger explains in this blog post, a lot of effort has been put into implementing control-flow integrity to prevent the JIT system from becoming a tool for attackers. However, all of those precautions are worthless if the attacker can somehow force a memory management system call that changes memory permissions. Chrome developers would therefore like a mechanism that puts specific areas of memory off limits to those system calls, strengthening the browser's protection against such attacks.
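The weakness described above can be reproduced with ordinary POSIX calls. The following is a minimal sketch, not Chrome's code: the function name is made up for illustration, and the second mprotect() call stands in for a call that an attacker has managed to force. It shows that nothing in the existing interface stops a read-only page from being made writable again.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a page that was made read-only
 * could be reopened for writing and overwritten, 0 otherwise. */
int readonly_page_can_be_reopened(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(p != MAP_FAILED);

    p[0] = 0x90;                          /* stand-in for trusted data */

    int rc = mprotect(p, len, PROT_READ); /* the application locks the page */
    assert(rc == 0);
    (void)rc;

    /* The call an attacker would try to force: without sealing,
     * the kernel has no reason to refuse it. */
    int reopened = (mprotect(p, len, PROT_READ | PROT_WRITE) == 0);
    if (reopened)
        p[0] = 0xcc;                      /* trusted data is now modified */

    int overwritten = reopened && p[0] == 0xcc;
    munmap(p, len);
    return overwritten;
}
```

Calling readonly_page_can_be_reopened() from a small main() is enough to try it; a sealing mechanism of the kind discussed here would make the second mprotect() call fail.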

The cover letter states that mseal() is similar to the mimmutable() system call recently added to OpenBSD. The prototype of the proposed system call, however, is quite different from that of mimmutable():

int mseal(void *addr, size_t len, unsigned int types, unsigned int flags);

The memory range to be affected is indicated by addr and len. flags must be set to zero, and types controls which system calls are blocked on this address range:

  • MM_SEAL_MPROTECT: mprotect() and pkey_mprotect()

  • MM_SEAL_MMAP: mmap()

  • MM_SEAL_MUNMAP: munmap()

  • MM_SEAL_MREMAP: mremap()

  • MM_SEAL_MSEAL: future mseal() calls
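The per-operation semantics of the types argument can be modeled in user space. This is only a sketch of the list above: the bit values, the struct, the function names and the error codes are assumptions made for illustration, since the article gives none of them, and no kernel code is involved.

```c
#include <assert.h>
#include <errno.h>

/* Placeholder bit values; the real ones are not given in the article. */
#define MM_SEAL_MPROTECT (1u << 0)
#define MM_SEAL_MMAP     (1u << 1)
#define MM_SEAL_MUNMAP   (1u << 2)
#define MM_SEAL_MREMAP   (1u << 3)
#define MM_SEAL_MSEAL    (1u << 4)

/* Hypothetical record of the seals applied to one address range. */
struct sealed_range {
    unsigned int types;
};

/* Model of the proposed call for a single range.
 * Assumption: nonzero flags yield -EINVAL, a sealed seal yields -EPERM. */
int model_mseal(struct sealed_range *r, unsigned int types, unsigned int flags)
{
    if (flags != 0)
        return -EINVAL;          /* flags must be zero */
    if (r->types & MM_SEAL_MSEAL)
        return -EPERM;           /* further mseal() calls are blocked */
    r->types |= types;
    return 0;
}

/* Returns 1 if the operation named by seal_bit is blocked on the range. */
int op_is_blocked(const struct sealed_range *r, unsigned int seal_bit)
{
    return (r->types & seal_bit) != 0;
}
```

In this model, sealing a range with only MM_SEAL_MUNMAP leaves op_is_blocked(&r, MM_SEAL_MMAP) and op_is_blocked(&r, MM_SEAL_MREMAP) returning 0, so each operation has to be sealed separately.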

Linus Torvalds was quick to raise objections to the patch series, saying “I have no objection to adding some kind of ‘locked memory map’ model, but I’m not happy with the current scheme”. He had many complaints about implementation details, but later made it clear that the design of the system call itself was wrong. For example, it makes little sense to prevent munmap() while other operations that can cause addresses to be unmapped (such as mmap() and mremap()) are still allowed. Putting effort into blocking only specific system calls, he said, is clearly the wrong approach; if a range of memory is to be protected from being unmapped (for example), that must be blocked no matter which path is used, otherwise the protection is just an illusion.
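The point about mmap() can be checked directly: mapping over an existing range with MAP_FIXED discards what was there, with no munmap() call involved. The sketch below uses only standard calls; the function name is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if mmap(MAP_FIXED) silently replaced
 * an existing mapping and discarded its contents. */
int map_fixed_replaces_mapping(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t len = (size_t)page;

    unsigned char *old = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(old != MAP_FAILED);
    old[0] = 42;                 /* data living in the original mapping */

    /* Map a fresh anonymous page at the same address. The old mapping
     * is unmapped as a side effect; munmap() is never called. */
    unsigned char *fresh = mmap(old, len, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    assert(fresh == old);

    int replaced = (fresh[0] == 0);  /* new anonymous pages read as zero */
    munmap(fresh, len);
    return replaced;
}
```

A seal that blocks only munmap() would leave this path open, which is why blocking individual system calls does not protect the mapping.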

Matthew Wilcox questioned the complexity of the interface and suggested just adding a few flags to mprotect(). He stated that a memory region should either be immutable (perhaps with the ability to further reduce access permissions) or mutable, regardless of which system call is used. He later added:

“That’s the problem with seccomp, and it’s worse here because you’re trying to deny individual system calls instead of building a list of system calls that are allowed. If we introduced a new system call tomorrow that could affect VMAs, then the application would be to blame for not disabling the new system call. That’s terrible design!”

OpenBSD maintainer Theo de Raadt even made an appearance on linux-kernel; he agreed with Torvalds and suggested that Linux should simply add mimmutable() rather than reinvent this functionality in a more complex form. Torvalds agreed with the idea, although he suggested adding a flags parameter to allow for future changes; de Raadt did not like that idea. The disagreement reflects the fact that OpenBSD controls its own user space, so it can add a flags parameter later if necessary; Linux does not have this luxury, so the parameter must be present from the beginning if it is to exist at all.

Xu resisted the idea, prompting a characteristic (if relatively mild) reply from de Raadt. Indeed, even after receiving all of these comments, Xu stood by his proposed design. That led to a response from Wilcox, who tried to steer the discussion back to what the patch series was actually trying to achieve:

“Let’s start with the purpose. The purpose of mimmutable/mseal/whatever is to fix the mapping of an address range to its underlying object, whether that is a specific file mapping or anonymous memory. After a successful call, it must not be possible to make any address in that virtual range point to any other object.

The secondary purpose is to lock down the permissions on that range. Possibly fixing them where they are, possibly allowing RW->RO transitions.

On the basis of these purposes, you should be able to determine whether any one system call or any madvise()… should be allowed.”
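The two purposes in this quote can be turned into a small decision model. This is a user-space sketch with made-up names, not a proposed kernel interface; in particular, reading "RW->RO" as "any change that only removes permissions" is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative permission bits for the model. */
#define P_READ  1u
#define P_WRITE 2u
#define P_EXEC  4u

/* Hypothetical state of one address range. */
struct range_state {
    bool sealed;          /* one all-or-nothing seal, no per-syscall bits */
    unsigned int prot;    /* current permissions */
};

/* Primary purpose: a sealed range may never be made to point at another
 * object, whichever operation asks (mmap, munmap, mremap, ...). */
bool remap_allowed(const struct range_state *r)
{
    return !r->sealed;
}

/* Secondary purpose: on a sealed range, permissions may at most shrink.
 * Assumption: "RW->RO" is generalized to "no new permission bits". */
bool prot_change_allowed(const struct range_state *r, unsigned int new_prot)
{
    if (!r->sealed)
        return true;
    return (new_prot & ~r->prot) == 0;
}
```

Because the checks ask what an operation does to the range rather than which system call it is, a system call added in the future would be covered without changing the seal itself.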

Wilcox ultimately concluded that Xu needed to do a better job of listening to the developers who were trying to help him.

By now it is clear that mseal() will not enter the kernel in its current form, which raises the question of what to do next. Röttger joined the discussion, pointing out that a pure mimmutable() solution would not provide everything the Chrome developers want; there are some cases where they want to prevent unmapping but still need to be able to change memory protections with mprotect(). De Raadt described that situation as “partially sealed,” meaning that the affected memory is not actually protected.

A follow-up proposal that drops the more complex options offered by mseal() while trying to retain this functionality may appear in the future. Whether that proposal will take the form of mimmutable() or some variant of it remains to be seen.

It is fair to say that some things went wrong here. Many felt that the original proposal simply implemented what the Chrome developers said they wanted without looking into what the real needs were (for both Chrome and any other potential users). Google has many experienced developers who could have reviewed this submission before it was posted publicly, but that does not appear to have happened, leaving a relatively inexperienced developer in a difficult position. Feedback on the proposal was then met with resistance rather than careful listening. The result was a series of interactions that satisfied no one.

Still, everyone seems to agree that this is a legitimate use case. What remains is to find the right solution, and the problem is, one hopes, better understood now. If the next attempt looks more like mimmutable() and reflects the feedback already given, the kernel may yet get a sealing feature that satisfies the Chrome use case and provides broader user-space hardening.

LWN articles are licensed under the CC BY-SA 4.0 license.