Viewing page faults from the perspective of the Linux mmap system call

Questions

1. How does mmap save one memory copy compared with read/write?

2. How does the kernel implementation differ between mmap shared mappings and private mappings?

3. How does the kernel implementation differ between mmap file mappings and anonymous mappings?

4. How is copy-on-write (COW) between parent and child processes implemented?

Overview

In real-world development you will often use or encounter the mmap function (see man mmap for the details). This system call is an all-rounder: you can see mmap at work when user space allocates memory (for example, glibc uses mmap for large allocations), when reading and writing large files, when loading dynamic libraries, and when sharing memory between processes. To really understand this system call, we should approach it from the requirements of these usage scenarios on the one hand and, more importantly, analyze in depth how the kernel source implements each of its parameters.

Number of memory copies

When mmap is used to map a file, reading and writing the file through mmap takes one memory copy fewer than read/write. To really understand this statement, you should walk through the implementation of the read/write system calls yourself. Take the read system call, whose declaration is as follows:

NAME
       read - read from a file descriptor
SYNOPSIS
       #include <unistd.h>
       ssize_t read(int fd, void *buf, size_t count);

read takes a user-space virtual address pointer buf, and at the end of the kernel implementation the file content must be copied into buf. As we all know, the kernel maintains a page cache to speed up file reading and writing (direct I/O aside), so the file content is first read into a page cache page in the kernel, and that page cache page is then copied into the user-space buf. With mmap, by contrast, the kernel page is mapped directly to the user-space address: on a page fault, filemap_fault (in mm/filemap.c) finds or allocates the page cache page, and the user-space virtual address is mapped to that page directly through the page table, so the extra copy into buf disappears.

Important: what is the difference between the pages produced by mmap and by read/write?

Both the mmap and read/write system calls create physical pages in the kernel. With mmap, the page is created on a page fault: an anonymous mapping (MAP_ANONYMOUS) gets an anonymous page, while a file mapping gets a file-backed page, and both kinds are mapped through the user-space page table. The page created by read/write is different: such a page cache page is, in a sense, a "temporary worker", because it has no user-space page-table mapping. Take write as an example: the kernel implementation of the system call merely copies the user-space buf into this "temporary worker" page cache page. So where does the "temporary" part show up?

Since read/write does not go through the user-space page table, when the kernel copies the user-space buf into the page cache it must use a kernel-space virtual address. So the kernel uses kmap_atomic to temporarily map the page cache page to a kernel virtual address:

// The write system call eventually calls this function
ssize_t generic_perform_write(struct file *file,
                struct iov_iter *i, loff_t pos)
{
    ...
    // iov_iter wraps the user-space buf; this call copies the user-space
    // buf into the kernel page cache page
    copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
    ...
}

size_t iov_iter_copy_from_user_atomic(struct page *page,
        struct iov_iter *i, unsigned long offset, size_t bytes)
{
    // Temporarily map the page cache page to a kernel virtual address
    // via kmap_atomic
    char *kaddr = kmap_atomic(page), *p = kaddr + offset;
    if (unlikely(!page_copy_sane(page, offset, bytes))) {
        kunmap_atomic(kaddr);
        return 0;
    }
    if (unlikely(i->type & ITER_PIPE)) {
        kunmap_atomic(kaddr);
        WARN_ON(1);
        return 0;
    }
    iterate_all_kinds(i, bytes, v,
        copyin((p += v.iov_len) - v.iov_len, v.iov_base, v.iov_len),
        memcpy_from_page((p += v.bv_len) - v.bv_len, v.bv_page,
                 v.bv_offset, v.bv_len),
        memcpy((p += v.iov_len) - v.iov_len, v.iov_base, v.iov_len)
    )
    // The mapping is temporary, so release it with kunmap_atomic; the
    // address space reserved for these mappings is limited (about 4 MB
    // on 32-bit)
    kunmap_atomic(kaddr);
    return bytes;
}

The scenario above also shows a typical use of kmap. Chinese books often translate kmap as "permanent mapping", which is misleading; on the contrary, the typical kmap usage scenario is a temporary mapping like this one.

For mmap-based file reading and writing I will not draw a new diagram here; diagrams illustrating the flow are easy to find online.

Detailed explanation of mmap function parameters

NAME
       mmap, munmap - map or unmap files or devices into memory
SYNOPSIS
       #include <sys/mman.h>
       void *mmap(void *addr, size_t length, int prot, int flags,
                  int fd, off_t offset);
       int munmap(void *addr, size_t length);
  • prot: sets the read/write attributes of the memory mapping area vma (vm_flags in the vma), which ultimately affect the read/write attributes of the page table entry (pte). Possible values:
PROT_EXEC Pages may be executed.
PROT_READ Pages may be read.
PROT_WRITE Pages may be written.
PROT_NONE Pages may not be accessed.
  • flags: sets mapping attributes such as sharing, e.g. MAP_SHARED, MAP_PRIVATE, MAP_ANONYMOUS.

MAP_SHARED: "shared" means that a write to the page never triggers copy-on-write; the write simply modifies the original page in place. It divides into file sharing and anonymous sharing:

File sharing: any modification one process makes is visible to the other processes (they all map the same physical page, so changes are naturally visible to each other), and the modifications are synchronized back to the disk file.

Anonymous sharing: the difference from file sharing is that there is no backing disk file, but multiple processes still map the same physical page, so they also see each other's changes; this is often used for inter-process communication. The kernel implements it on top of shmem.

MAP_PRIVATE: "private" means that writing to a page triggers copy-on-write, so that each process ends up with its own independent physical page. It divides into file private and anonymous private:

File private: a process's modification triggers copy-on-write; other processes do not see the change, and the modified content is never written back to disk. The most common scenario is loading a dynamic library.

Anonymous private: no file is mapped, and modifying the content triggers copy-on-write.

Source analysis of the mmap system call: prot and flags

Let's see how the two parameters flags and prot affect the vma and the pte:

mm/mmap.c

unsigned long do_mmap(struct file *file, unsigned long addr,
            unsigned long len, unsigned long prot,
            unsigned long flags, unsigned long pgoff,
            unsigned long *populate, struct list_head *uf)
{
    ...
    /* Do simple checking here so the lower-level routines won't have
     * to. we assume access permissions have been handled by the open
     * of the memory object, so we don't do any here.
     */
    // prot is the prot parameter of mmap; calc_vm_prot_bits converts it
    // into the corresponding vm_flags bits
    vm_flags = calc_vm_prot_bits(prot, pkey) | calc_vm_flag_bits(flags) |
            mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;

    if (file) {
        ...
        switch (flags & MAP_TYPE) {
        case MAP_SHARED:
            /*
             * Force use of MAP_SHARED_VALIDATE with non-legacy
             * flags. E.g. MAP_SYNC is dangerous to use with
             * MAP_SHARED as you don't know which consistency model
             * you will get. We silently ignore unsupported flags
             * with MAP_SHARED to preserve backward compatibility.
             */
            flags &= LEGACY_MAP_MASK;
            fallthrough;
        case MAP_SHARED_VALIDATE:
            ...
            vm_flags |= VM_SHARED | VM_MAYSHARE;
            ...
        case MAP_PRIVATE:
            ...
            break;

        default:
            return -EINVAL;
        }
    } else {
        switch (flags & MAP_TYPE) {
        case MAP_SHARED:
            ...
            vm_flags |= VM_SHARED | VM_MAYSHARE;
            break;
        case MAP_PRIVATE:
            /*
             * Set pgoff according to addr for anon_vma.
             */
            pgoff = addr >> PAGE_SHIFT;
            break;
        default:
            return -EINVAL;
        }
    }
    ...
    addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
    ...
    return addr;
}

unsigned long mmap_region(struct file *file, unsigned long addr,
        unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
        struct list_head *uf)
{
    ...
    vma->vm_mm = mm;
    vma->vm_start = addr;
    vma->vm_end = addr + len;
    vma->vm_flags = vm_flags;
    // Convert vm_flags into vm_page_prot; as the mk_pte call below shows,
    // vm_page_prot ultimately determines the read/write attributes of the pte
    vma->vm_page_prot = vm_get_page_prot(vm_flags);
    vma->vm_pgoff = pgoff;
    ...
    if(file) {
        ...
        // Invoke the filesystem's mmap hook, e.g. ext4_file_mmap
        call_mmap(...);
    }
    ...
}

// When handling a page fault the kernel generally calls mk_pte to create the
// pte; it derives the pte flags from vma->vm_page_prot.
// For example, in the anonymous page fault handler:
static int do_anonymous_page(struct vm_fault *vmf)
{
    ...
    // Build the pte for the newly allocated page from vma->vm_page_prot
    entry = mk_pte(page, vma->vm_page_prot);
    ...
}

The conversion logic of calc_vm_prot_bits is simple: prot takes the values PROT_READ/PROT_WRITE/PROT_EXEC, and the function converts them into VM_READ/VM_WRITE/VM_EXEC respectively.

Summary: vm_flags combines the read/write attributes (from the mmap prot parameter, converted by calc_vm_prot_bits) with the sharing attributes (from the mmap flags parameter), and vm_flags in turn determines the flags of the vma and the pte.

Hierarchy of the page fault handling flow

Source code:

/*
 * These routines also need to handle stuff like marking pages dirty
 * and/or accessed for architectures that don't do it in hardware (most
 * RISC architectures). The early dirty is also good on the i386.
 *
 * There is also a hook called "update_mmu_cache()" that architectures
 * with external mmu caches can use to update those (ie the Sparc or
 * PowerPC hashed page tables that act as extended TLBs).
 *
 * We enter with non-exclusive mmap_lock (to exclude vma changes, but allow
 * concurrent faults).
 *
 * The mmap_lock may have been released depending on flags and our return value.
 * See filemap_fault() and __lock_page_or_retry().
 */
static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
{
    pte_t entry;
    ...

    if (!vmf->pte) {
        if (vma_is_anonymous(vmf->vma))
            return do_anonymous_page(vmf);
        else
            return do_fault(vmf);
    }

    if (!pte_present(vmf->orig_pte))
        return do_swap_page(vmf);

    if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
        return do_numa_page(vmf);

    vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
    spin_lock(vmf->ptl);
    entry = vmf->orig_pte;
    if (unlikely(!pte_same(*vmf->pte, entry))) {
        update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
        goto unlock;
    }
    if (vmf->flags & FAULT_FLAG_WRITE) {
        if (!pte_write(entry))
            return do_wp_page(vmf);
        entry = pte_mkdirty(entry);
    }
    entry = pte_mkyoung(entry);
    if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,
                vmf->flags & FAULT_FLAG_WRITE)) {
        update_mmu_cache(vmf->vma, vmf->address, vmf->pte);
    } else {
        /* Skip spurious TLB flush for retried page fault */
        if (vmf->flags & FAULT_FLAG_TRIED)
            goto unlock;
        /*
         * This is needed only for protection faults but the arch code
         * is not yet telling us if this is a protection fault or not.
         * This still avoids useless tlb flushes for .text page faults
         * with threads.
         */
        if (vmf->flags & FAULT_FLAG_WRITE)
            flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);
    }
unlock:
    pte_unmap_unlock(vmf->pte, vmf->ptl);
    return 0;
}

Call stack (taking an anonymous page fault as an example):

#0 0xffffffff813988ff in do_anonymous_page (vmf=<optimized out>) at mm/memory.c:4409
#1 handle_pte_fault (vmf=<optimized out>) at mm/memory.c:4367
#2 __handle_mm_fault (flags=<optimized out>, address=<optimized out>, vma=<optimized out>) at mm/memory.c:4504
#3 handle_mm_fault (vma=<optimized out>, address=12040240, flags=<optimized out>, regs=<optimized out>) at mm/memory.c:4602
#4 0xffffffff8114b2a4 in do_user_addr_fault (regs=0xffff8880045fff58, hw_error_code=6, address=12040240) at arch/x86/mm/fault.c:1372
#5 0xffffffff824e4c09 in handle_page_fault (address=<optimized out>, error_code=<optimized out>, regs=<optimized out>) at arch/x86/mm/fault.c:1429
#6 exc_page_fault (regs=0xffff8880045fff58, error_code=6) at arch/x86/mm/fault.c:1482
#7 0xffffffff82600ace in asm_exc_page_fault () at ./arch/x86/include/asm/idtentry.h:538

A page fault is essentially a page fault exception: when the CPU takes this exception it dispatches to a specific handler function, here exc_page_fault, and the call chain eventually reaches do_anonymous_page, the anonymous page fault handler.

Anonymous page faults

If the pte does not exist and vma_is_anonymous returns true, the fault is treated as an anonymous page fault:

static inline bool vma_is_anonymous(struct vm_area_struct *vma)
{
    return !vma->vm_ops;
}

That is, if vma->vm_ops is set, it is not an anonymous mapping. Where is vm_ops set? Looking back up, the call_mmap call in mmap_region eventually invokes the filesystem-specific mmap function (for ext4, ext4_file_mmap), which sets vm_ops.

OK, after the checks above the fault is finally confirmed to be an anonymous page fault (for example, when mmap is used with a file mapping, vm_ops is set, so the anonymous fault logic is naturally never entered). For the anonymous page fault handler itself, refer to the following article:

The life cycle of Linux anonymous pages – Programmer Sought

File page fault interrupt

If the logic above does not take the anonymous path, it naturally enters the file page fault handling flow, namely the do_fault function. Depending on the flags passed to mmap, the file fault splits into several cases:

FAULT_FLAG_WRITE is set from the CPU fault state. If it is not set, the fault is a read fault and do_read_fault is called. Otherwise it is a write fault, and we must further distinguish whether copy-on-write applies: if vm_flags does not contain VM_SHARED, the mapping is private and do_cow_fault is called to perform copy-on-write; otherwise do_shared_fault handles the shared case.

As analyzed earlier, vm_flags comes from the flags parameter of mmap: for the write fault here, if MAP_SHARED was set we enter do_shared_fault, otherwise do_cow_fault.

Since a file mapping created with MAP_SHARED also writes modifications back to the disk file, we can expect do_shared_fault to perform write-back logic internally; indeed, fault_dirty_shared_page implements this part.

do_wp_page

(To be filled in later.)

Scenario of parent-child process copy-on-write

Anyone who has studied operating systems knows that, for performance, fork in Linux does not fully copy the parent's physical pages for the child: the two share physical memory, which saves memory, and COW is triggered only when either side writes. Consider the following question:

If the parent process executes the following process:

1. addr = mmap(PROT_READ|PROT_WRITE, MAP_PRIVATE) first creates a virtual address mapping

2. Write data to addr.

3. fork a child process.

4. The child process writes data to addr to trigger COW.

In theory, step 4 should go through the do_wp_page logic, but entering that path requires the following condition to hold:

That is to say, the page table entry must be non-writable. But we clearly mapped with PROT_READ | PROT_WRITE, readable and writable, so the pte installed at the first page fault is also writable. So where exactly is the pte changed to read-only? Answer: during the fork system call, when the kernel finds a private mapping it marks the corresponding pte read-only for both parent and child (in copy_page_range).

Reference articles:

Anonymous Mapping Page Fault Analysis of Linux Kernel Virtual Memory Management – vm_get_page_prot

Explaining the Linux kernel copy-on-write (COW) mechanism in detail (a hands-on source walkthrough) – Programmer Sought