Analysis of kernel-mode access to the user-mode address space under the x86 SMAP (Supervisor Mode Access Prevention) mechanism

In Linux, copying data between user mode and kernel mode (leaving aside the zero-copy case, where a shared mapping is established between kernel space and user space) is usually done with the copy_from_user/copy_to_user/put_user/get_user family of macros. These helpers have to cope with two situations: an illegal user-mode pointer (no VMA) and a page fault (a VMA exists but there is no MMU mapping yet). In the first case, the implementation relies on the repair (fixup) table and returns an error code. In the second case, the page fault path establishes the physical PFN mapping and the interrupted access is then carried out again. A legal user-mode pointer can be dereferenced directly in the kernel. In other words, the kernel can directly access the user space of the current process, using exactly the same virtual address that the process uses in user space. The reverse is not allowed.

However, this conclusion seems to meet a counterexample on x86 platforms running a recent Linux kernel: even an access to a legal user-mode address from the kernel is rejected. Let's do an experiment:

Directly read the contents of the buf pointer passed from user mode in the device driver:

When the user-mode test case was run, the test process was killed and the kernel reported a permission violation.

A careful look at the error log shows the cause: #PF: error_code(0x0001) - permissions violation. Examining the crash context, the address being accessed, 0x7ffc922277b0, matches the fault address reported by the kernel, and the four-level page table for this virtual address is mapped (PGD 23144d067 P4D 23144d067 PUD 2315d8067 PMD 23b0e9067 PTE; the P4D and PGD entries coincide). In other words, the accessed virtual address is neither illegal nor unmapped. It is a legal user-mode address. According to the conclusion at the beginning of this article, the kernel should be able to access it safely, yet an error was reported.

Cause analysis:

The cause is related to both the hardware architecture and the kernel version. Fundamentally, the CPU gained a new feature: recent x86 processors provide SMEP and SMAP control bits in the CR4 register, which restrict the kernel's access to the user-mode address space. SMAP (Supervisor Mode Access Prevention) was introduced by Intel with the Broadwell microarchitecture. It adds a new SMAP flag to CR4; when this flag is 1, a page fault is triggered whenever the kernel accesses the address space of a user process. The purpose is to prevent the kernel from unintentionally accessing user space because of its own bugs, and thereby to avoid security problems caused by kernel vulnerabilities. Because the kernel still legitimately needs to access user space at times, Intel provides two instructions, STAC and CLAC, which set and clear the EFLAGS.AC flag to temporarily suspend and restore this protection. Frequent use of STAC and CLAC costs a little performance, but given the security gain it is recommended to keep SMAP enabled.

SMEP: bit 20 of CR4. It prevents the CPU, when running with kernel privileges, from executing code in user pages.
SMAP: bit 21 of CR4. It prevents the CPU, when running with kernel privileges, from reading or writing data in user pages.

You can use the following command to check whether the CPU supports SMAP. As shown in the figure below, every core of my 8-core processor supports it.

$ sudo cpuid | grep -i smap

The same check on another machine, with a 12-core AMD processor, also reports SMAP support.

The virtual machine I tested does not expose SMAP to the guest, so the test in which the kernel directly accesses the user-mode pointer succeeds inside the virtual machine.

There is a configuration option CONFIG_X86_SMAP in the kernel to enable or disable the SMAP function. It is enabled by default.

So why can user-mode pointers be accessed safely through macros such as copy_from_user and put_user?

Take get_user as an example; the other macros are implemented similarly. In __get_user_1, the core of get_user, the code invokes ASM_STAC before the actual user-mode pointer access and ASM_CLAC after it, which respectively open and close the kernel's window for accessing the user address space:

The fixup code registered for label 1 also executes the CLAC instruction to restore SMAP protection, so get_user can safely access the user address space.

The kernel defines the clac/stac instructions as raw opcode bytes. We can dump the implementation of the kernel's __get_user_1 function to check whether its memory access is bracketed by stac/clac instructions:

The dump confirms that the memory access in __get_user_1 is bracketed by stac/clac instructions, so it neither triggers the permission violation exception nor needs the repair code referenced from exception_table_entry.

However, in the disassembly of the vmlinux file that corresponds to the running kernel, the places in __get_user_1 where stac/clac should appear are all NOPs. Presumably the kernel rewrites these NOP areas at run time and replaces them with stac/clac. This matches the kernel's alternatives mechanism, which patches such instruction sites during boot according to the CPU features it detects, although I have not traced exactly when and where the patching happens.

Issuing the CPU instructions directly is not very elegant. The kernel provides two helpers, user_access_begin and user_access_end, with which driver developers can mark the code region that may safely access user space. They are essentially wrappers around the two instructions.

Protect the user memory access with these two calls and retest:

The program now runs normally and is not killed, which shows that the two interfaces work and that kernel mode successfully accessed the user-mode address space.

For comparison, turn SMAP off, recompile the kernel, and check whether the user-mode address can then be accessed directly from the kernel:

The test shows that with the SMAP mechanism disabled, a kernel module can access the user-mode address directly, even without the stac/clac instructions:

In addition, a closer look at the implementation of __get_user_1 shows that it only serves requests whose source address lies in user space. If the source address passed in is a kernel-space address, it jumps straight to the bad_get_user label and returns an error. The API is deliberately specialized. Dereferencing a kernel pointer in kernel mode is of course harmless in itself, but get_user exists for one scenario only, so it refuses everything else: it is not that it cannot do it, but that it should not.

Process Analysis

While the SMAP check is suspended, if the memory access instruction at label 1 raises an exception, do_page_fault is triggered. The system then follows the call chain do_page_fault->….->no_context->fixup_exception->search_exception_tables->handler…

It then executes the repair instruction pointed to by the fixup field in ex_table. The repair instruction for 1b is at .Lbad_get_user_clac. There, the SMAP protection that was suspended before the memory access is first re-enabled, and then an error code is returned to the application.

Why is it so complicated? After all, page fault handling in user mode is transparent to developers: even when a page fault occurs, the system makes sure the access to the given virtual address completes correctly. Why can't the kernel do the same, instead of requiring explicit repair through a repair table?

My personal understanding is this. If only page faults had to be dealt with, the kernel page fault path could likewise commit the physical page and return to the instruction that faulted. There is, however, one special case in which the kernel cannot follow the user-mode approach: the address passed to macros such as copy_to_user is illegal (outside any VMA). When this happens in user mode, the kernel can simply send a KILL signal and terminate the offending process. But copy_to_user runs in kernel mode, and a process that hits an illegal-address exception there cannot simply be killed, because the system does not know whether the kernel execution flow acting on behalf of the process had acquired kernel resources (locks, memory, semaphores and so on) before it reached the faulting point. Only the execution flow itself knows that. The best option is therefore to return an error code, let the execution flow release what it holds, and return to user mode with the error, leaving user mode to decide whether to exit or to retry the system call. The error code is returned by registering bad_get_user in the fixup repair table, as shown below:

This is why, when get_user in the kernel operates on an illegal address passed in from user mode, the call returns an error code and no other exception is reported.

After all, when a user-mode process gets something wrong, the kernel cleans up after it, so the process can afford to be hands-off. In kernel mode there is nobody else to fall back on: everything has to be handled explicitly and nothing may go wrong.

Repair table

The repair table is used to recover from faults on user-mode addresses that the page fault handler cannot resolve. In that case the kernel forcibly rewrites the return address regs->ip to the repair address recorded in the table, so that after returning from the exception the CPU executes the repair instructions instead of re-executing the faulting access.

The repair table is located in a section in the kernel ELF file, and the runtime address is within the range defined by the __start___ex_table and __stop___ex_table symbols.

You can also dump the table for analysis. In the picture below, look at the output of line 4590, an arbitrarily chosen log line:

extable_test line 4590, insn load_ucode_bsp + 0xd7/0x1f0, fixup load_ucode_bsp + 0xd9/0x1f0.

It means that when an exception occurs at load_ucode_bsp + 0xd7/0x1f0, the exception handler must return to load_ucode_bsp + 0xd9/0x1f0 to perform the repair.

Take __get_user_1 repair table entry as an example:

The content of the repair table is:

extable_test line 4632, insn __get_user_1 + 0xd/0x20, fixup __get_user_nocheck_8 + 0x20/0x40.

Its faulting instruction is at __get_user_1 + 0xd/0x20 and its repair instruction is at __get_user_nocheck_8 + 0x20/0x40. The kernel source, the disassembly and the __get_user_1 repair table entry printed above are all consistent; see the figure below:

Summary:

In short, the kernel reports a page fault on a direct access to a user-mode pointer because that access triggers SMAP protection. The kernel provides a configuration option and API interfaces to turn this protection off, and the macros mentioned at the beginning of this article can access user-mode memory safely precisely because they suspend SMAP protection for the duration of the access.

In general, there is nothing wrong with kernel mode accessing the user-mode address space, but you need to watch for the subtle differences between architectures. SMAP itself is an x86 feature; other architectures offer comparable mechanisms of their own, such as PAN (Privileged Access Never) on Arm.

For the user-mode address FAULT branch on the left side of the above figure, the repair function registered in the repair table is called. For an illegal address, the repair function points to bad_get_user, which directly returns the error code. What follows shows the test case for an illegal address and the kernel stack at the point where the repair function is looked up in the repair table:

SMEP

SMEP (Supervisor Mode Execution Prevention) is an execution protection mechanism. It prevents code running in supervisor mode from fetching and executing instructions located in the user-mode address space. Its older relative is the NX bit technology: the XN (execute-never) bit in ARM page tables, which AMD calls Enhanced Virus Protection and Intel calls the XD bit.

Reference article

https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf

Supervisor Memory Protection – OSDev Wiki

End
