Analysis of anonymous mapping page fault exception in Linux kernel virtual memory management

Before analyzing the anonymous mapping page fault exception, we first need to answer: what is an anonymous page? Its counterpart is the file page. File pages are easy to picture: they are pages that map a file, such as a file mapped into virtual memory through mmap, or the code and data segments of a process. These pages have a backing store, namely files on a block device. Anonymous pages, by contrast, are pages not associated with any file, such as a process's heap and stack. One more point to note: the discussion below covers only private anonymous pages. Shared anonymous pages are turned into file-mapping page faults inside the kernel (via a pseudo filesystem); we will cover that another time. Interested readers can look at the mmap code to see how shared anonymous pages are handled.
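To make the distinction concrete, here is a minimal user-space sketch (illustrative only, error handling trimmed; /etc/hostname stands in for any readable file) showing a file mapping and an anonymous mapping side by side:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* File page: backed by a file on a block device (the page cache). */
        int fd = open("/etc/hostname", O_RDONLY);
        char *file_mem = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);

        /* Anonymous page: no backing file; contents start out as zeros. */
        char *anon_mem = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        char c1 = file_mem[0];  /* first byte of the file */
        char c2 = anon_mem[0];  /* always 0 on first read */
        (void)c1; (void)c2;

        munmap(file_mem, 4096);
        munmap(anon_mem, 4096);
        close(fd);
        return 0;
    }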

1. Triggering of anonymous mapping page fault exception

We explained above what an anonymous page is; so under what circumstances is an anonymous mapping page fault triggered? This exception is in fact very common:

1. When our application requests memory with malloc (a heap allocation), at first only virtual memory is allocated, no physical memory. The first access triggers a page fault, which allocates a physical page and establishes the mapping to the virtual page.

2. When our application creates an anonymous memory mapping with mmap, again only virtual memory is allocated, not physical memory. The first access triggers a page fault that allocates a physical page and establishes the mapping to the virtual page.

3. When a function's local variables are large, or the chain of function calls is deep, the current stack may become insufficient and has to be expanded. All of the above scenarios are transparent to the application program; the kernel does a great deal of work on the user program's behalf, as we will see in the sections below. (A small user-space sketch that makes the first two scenarios observable follows.)
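Here is the sketch, using mincore(2) to report per-page residency (a sketch only; minimal error handling): right after mmap no page is resident, and the first write faults in exactly the touched page.

    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        long pg = sysconf(_SC_PAGESIZE);
        size_t len = 4 * pg;                   /* four pages */
        unsigned char vec[4];
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;

        mincore(p, len, vec);                  /* which pages are resident? */
        printf("before touch: %d %d %d %d\n",
               vec[0] & 1, vec[1] & 1, vec[2] & 1, vec[3] & 1);

        p[0] = 1;                              /* first write: page fault on page 0 of the mapping */

        mincore(p, len, vec);
        printf("after touch : %d %d %d %d\n",
               vec[0] & 1, vec[1] & 1, vec[2] & 1, vec[3] & 1);

        munmap(p, len);
        return 0;
    }

Expected output is "0 0 0 0" before the write and "1 0 0 0" after: only the written page received a physical frame.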

2. What is page 0? Why use page 0?

Why do we speak of "page 0" here? What is page 0? Is it the page at address 0? The answer: it is a page of memory set aside during system initialization and filled with zeros. Let's look at how page 0 comes about, in arch/arm64/mm/mmu.c:

 61 /*
 62  * Empty_zero_page is a special page that is used for zero-initialized data
 63  * and COW.
 64  */
 65 unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)] __page_aligned_bss;
 66 EXPORT_SYMBOL(empty_zero_page);

As you can see, this defines a global variable one page in size, page-aligned and placed in the .bss segment. The .bss is cleared during kernel initialization, so the page contains all zeros; this is what we call page 0.

So why use page 0? First, its contents are all zeros, so any read returns 0, which is exactly what a never-written anonymous page must contain. Second, it saves memory: when an anonymous page is read for the first time its data would be all zeros anyway, so the virtual page is simply mapped to this one shared page (zero-page sharing) instead of getting a fresh physical page. And what if a process then writes to such a page? The answer: COW, copy-on-write, allocates a new page for the write. The sketch below makes the memory saving visible.
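The memory-saving effect is easy to see from a process's own RSS. The sketch below (assuming 4 KiB pages; exact deltas vary a little) reads every page of a 256 MiB anonymous mapping, then writes every page, and prints how much the resident set grew each time. The full free -m experiments in section 4 show the same thing system-wide.

    #include <stdio.h>
    #include <sys/mman.h>

    /* Resident set size in pages, from /proc/self/statm (Linux-specific). */
    static long rss_pages(void)
    {
        long size = 0, resident = 0;
        FILE *f = fopen("/proc/self/statm", "r");
        fscanf(f, "%ld %ld", &size, &resident);
        fclose(f);
        return resident;
    }

    int main(void)
    {
        size_t len = 256UL << 20;              /* 256 MiB, private anonymous */
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;
        long before = rss_pages();

        volatile char sum = 0;
        for (size_t i = 0; i < len; i += 4096)
            sum += p[i];                       /* read every page: all hits on page 0 */
        printf("after reads : rss grew by %ld pages\n", rss_pages() - before);

        for (size_t i = 0; i < len; i += 4096)
            p[i] = 1;                          /* write every page: COW off page 0 */
        printf("after writes: rss grew by %ld pages\n", rss_pages() - before);

        munmap(p, len);
        return 0;
    }

On a typical system the first delta is close to 0 (the shared zero page is not charged to the process), while the second is about 65536 pages, i.e. the full 256 MiB.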

3. Source code analysis

3.1 Trigger conditions

When one of the situations from section 1 occurs, the processor raises a page fault. Control passes from the architecture-specific part of the handler to the architecture-independent part, finally arriving at the handle_pte_fault function:

 3742 static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
  3743 {
  3744 pte_t entry;
  ...
  3782 if (!vmf->pte) {
  3783 if (vma_is_anonymous(vmf->vma))
  3784 return do_anonymous_page(vmf);
  3785 else
  3786 return do_fault(vmf);
  3787 }

Lines 3782 and 3783 are the triggering conditions for anonymous mapping page fault exceptions:

1. The page table entry for the address where the fault occurred does not exist (vmf->pte is NULL).

2. The fault occurs on an anonymous page, i.e. vma->vm_ops is NULL.

When both conditions are met, the do_anonymous_page function is called to handle the anonymous mapping page fault.
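For reference, vma_is_anonymous() (include/linux/mm.h) is essentially just the vm_ops test mentioned above:

    static inline bool vma_is_anonymous(struct vm_area_struct *vma)
    {
        return !vma->vm_ops;
    }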

 2871 /*
  2872 * We enter with non-exclusive mmap_sem (to exclude vma changes,
  2873 * but allow concurrent faults), and pte mapped but not yet locked.
  2874 * We return with mmap_sem still held, but pte unmapped and unlocked.
  2875 */
  2876 static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
  2877 {
  2878 struct vm_area_struct *vma = vmf->vma;
  2879 struct mem_cgroup *memcg;
  2880 struct page *page;
  2881 vm_fault_t ret = 0;
  2882 pte_t entry;
  2883
  2884 /* File mapping without ->vm_ops ? */
  2885 if (vma->vm_flags & VM_SHARED)
  2886 return VM_FAULT_SIGBUS;
  2887
  2888 /*
  2889  * Use pte_alloc() instead of pte_alloc_map(). We can't run
  2890  * pte_offset_map() on pmds where a huge pmd might be created
  2891  * from a different thread.
  2892  *
  2893  * pte_alloc_map() is safe to use under down_write(mmap_sem) or when
  2894  * parallel threads are excluded by other means.
  2895  *
  2896  * Here we only have down_read(mmap_sem).
  2897  */
  2898 if (pte_alloc(vma->vm_mm, vmf->pmd))
  2899 return VM_FAULT_OOM;
  ...

Line 2885 checks whether the vma in which the fault occurred is a shared mapping, and bails out with SIGBUS if so: this function handles private anonymous mappings only.

Line 2898: if the page table is missing, allocate it (it is possible that the last-level page table that should hold the faulting address's entry has not been allocated yet).

3.2 Reading the anonymous page for the first time

 ...
  2905 /* Use the zero-page for reads */
  2906 if (!(vmf->flags & FAULT_FLAG_WRITE) &&
  2907 !mm_forbids_zeropage(vma->vm_mm)) {
  2908 entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
  2909 vma->vm_page_prot));
  2910 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
  2911 vmf->address, &vmf->ptl);
  2912 if (!pte_none(*vmf->pte))
  2913 goto unlock;
  2914 ret = check_stable_address_space(vma->vm_mm);
  2915 if (ret)
  2916 goto unlock;
  2917 /* Deliver the page fault to userland, check inside PT lock */
  2918 if (userfaultfd_missing(vma)) {
  2919 pte_unmap_unlock(vmf->pte, vmf->ptl);
  2920 return handle_userfault(vmf, VM_UFFD_MISSING);
  2921 }
  2922 goto setpte;
  2923 }
  ...
  2968 setpte:
  2969 set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
  

Lines 2906 to 2923 handle reading a private anonymous page; this is where the page 0 described above comes into play.

Lines 2906 and 2907 check that the fault was caused by a read and that use of the zero page is not forbidden.

Lines 2908-2909 are the core: they build a page table entry value that maps the faulting address to page 0.

Let's focus on this statement. pfn_pte splices a page frame number and a set of page table attributes into a page table entry value:

arch/arm64/include/asm/pgtable.h:
77 #define pfn_pte(pfn,prot) \
78 __pte(__phys_to_pte_val((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot))

That is, it shifts pfn left by PAGE_SHIFT bits (usually 12) and ORs in pgprot_val(prot). A toy illustration follows.
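As a toy illustration of that splicing (plain user-space C, not kernel code; the 12-bit shift and flat layout are simplifying assumptions, and the real arm64 macro also passes the address through __phys_to_pte_val()):

    #include <stdint.h>
    #include <stdio.h>

    #define TOY_PAGE_SHIFT 12   /* assuming 4 KiB pages */

    /* Toy model of pfn_pte(): frame number in the high bits,
     * protection/attribute bits in the low bits. */
    static uint64_t toy_pfn_pte(uint64_t pfn, uint64_t prot)
    {
        return (pfn << TOY_PAGE_SHIFT) | prot;
    }

    int main(void)
    {
        uint64_t pte = toy_pfn_pte(0x81234, 0x3);   /* hypothetical pfn and prot bits */
        printf("pte = 0x%llx\n", (unsigned long long)pte);
        return 0;
    }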

Let’s look at my_zero_pfn first:

include/asm-generic/pgtable.h:
   875 static inline unsigned long my_zero_pfn(unsigned long addr)
   876 {
   877 extern unsigned long zero_pfn;
   878 return zero_pfn;
   879 }

↓

mm/memory.c:
   126 unsigned long zero_pfn __read_mostly;
   127 EXPORT_SYMBOL(zero_pfn);
   128
   129 unsigned long highest_memmap_pfn __read_mostly;
   130
   131 /*
   132  * CONFIG_MMU architectures set up ZERO_PAGE in their paging_init()
   133  */
   134 static int __init init_zero_pfn(void)
   135 {
   136 zero_pfn = page_to_pfn(ZERO_PAGE(0));
   137 return 0;
   138 }
   139 core_initcall(init_zero_pfn);

↓

arch/arm64/include/asm/pgtable.h:
   54 /*
   55  * ZERO_PAGE is a global shared page that is always zero: used
   56  * for zero-mapped memory areas etc..
   57  */
   58 extern unsigned long empty_zero_page[PAGE_SIZE / sizeof(unsigned long)];
   59 #define ZERO_PAGE(vaddr) phys_to_page(__pa_symbol(empty_zero_page))

Finally we can see that the page frame number comes from empty_zero_page, the page 0 set up during kernel initialization. Now look at the second parameter of pfn_pte, vma->vm_page_prot: this is the vma's access permission, which is set when the memory mapping (mmap) is created.

So the question we want answered is: when was page 0 made read-only (that is, when was the page table entry marked read-only)?

With this question in mind we go looking through the kernel code. Honestly, staring at the fault-handling code alone won't reveal a clue; but if we find out when the vm_page_prot member of the vma is set, and how, we may find the answer.

Let's go to mm/mmap.c, taking the do_brk_flags function (the one that sets up the heap) as an example. Note line 3040, where vm_page_prot is set:

3040 vma->vm_page_prot = vm_get_page_prot(flags); 

↓

 110 pgprot_t vm_get_page_prot(unsigned long vm_flags)
   111 {
   112 pgprot_t ret = __pgprot(pgprot_val(protection_map[vm_flags &
   113 (VM_READ|VM_WRITE|VM_EXEC|VM_SHARED)]) |
   114 pgprot_val(arch_vm_get_page_prot(vm_flags)));
   115
   116 return arch_filter_pgprot(ret);
   117 }
   118 EXPORT_SYMBOL(vm_get_page_prot);

The vm_get_page_prot function converts the passed vm_flags into a protection-bit combination, using the VM_READ|VM_WRITE|VM_EXEC|VM_SHARED bits as an index into the protection_map table. Continue down:

↓

 78 /* description of effects of mapping type and prot in current implementation.
 79  * this is due to the limited x86 page protection hardware. The expected
 80  * behavior is in parens:
 81  *
 82  * map_type     prot
 83  *              PROT_NONE     PROT_READ     PROT_WRITE    PROT_EXEC
 84  * MAP_SHARED   r: (no) no    r: (yes) yes  r: (no) yes   r: (no) yes
 85  *              w: (no) no    w: (no) no    w: (yes) yes  w: (no) no
 86  *              x: (no) no    x: (no) yes   x: (no) yes   x: (yes) yes
 87  *
 88  * MAP_PRIVATE  r: (no) no    r: (yes) yes  r: (no) yes   r: (no) yes
 89  *              w: (no) no    w: (no) no    w: (copy) copy w: (no) no
 90  *              x: (no) no    x: (no) yes   x: (no) yes   x: (yes) yes
 91  *
 92  * On arm64, PROT_EXEC has the following behavior for both MAP_SHARED and
 93  * MAP_PRIVATE:
 94  * r: (no) no
 95  * w: (no) no
 96  * x: (yes) yes
 97  */
 98 pgprot_t protection_map[16] __ro_after_init = {
 99     __P000, __P001, __P010, __P011, __P100, __P101, __P110, __P111,
100     __S000, __S001, __S010, __S011, __S100, __S101, __S110, __S111
101 };

The protection_map array defines 16 combinations, __P000 through __S111. P means Private, S means Shared. The array is indexed by the low vm_flags bits (VM_READ is bit 0, VM_WRITE bit 1, VM_EXEC bit 2, VM_SHARED bit 3), and each macro name is simply the binary of its index, so the three digits read, from left to right, executable, writable, readable. For example, __S010 means shared, writable, but neither readable nor executable. A small sketch of the index computation follows.
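Here is the sketch (the flag values mirror include/linux/mm.h; the mapping of index 3 to __P011/PAGE_READONLY is taken from the arm64 table below):

    #include <stdio.h>

    /* Low four vm_flags bits, as in include/linux/mm.h. */
    #define VM_READ   0x00000001UL
    #define VM_WRITE  0x00000002UL
    #define VM_EXEC   0x00000004UL
    #define VM_SHARED 0x00000008UL

    int main(void)
    {
        /* A private read/write area such as the heap. */
        unsigned long vm_flags = VM_READ | VM_WRITE;
        unsigned long idx = vm_flags & (VM_READ | VM_WRITE | VM_EXEC | VM_SHARED);
        /* idx == 3 == 0b011, i.e. protection_map[3] == __P011,
         * which on arm64 is PAGE_READONLY. */
        printf("protection_map index = %lu\n", idx);
        return 0;
    }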

↓

arch/arm64/include/asm/pgtable-prot.h:
   93 #define PAGE_NONE __pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
   94 #define PAGE_SHARED __pgprot(_PAGE_DEFAULT | PTE_USER | PTE_NG | PTE_PXN | PTE_UXN | PTE_WRITE)
   95 #define PAGE_SHARED_EXEC __pgprot(_PAGE_DEFAULT | PTE_USER | PTE_NG | PTE_PXN | PTE_WRITE)
   96 #define PAGE_READONLY __pgprot(_PAGE_DEFAULT | PTE_USER | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
   97 #define PAGE_READONLY_EXEC __pgprot(_PAGE_DEFAULT | PTE_USER | PTE_RDONLY | PTE_NG | PTE_PXN)
   98 #define PAGE_EXECONLY __pgprot(_PAGE_DEFAULT | PTE_RDONLY | PTE_NG | PTE_PXN)
   99
  100 #define __P000 PAGE_NONE
  101 #define __P001 PAGE_READONLY
  102 #define __P010 PAGE_READONLY
  103 #define __P011 PAGE_READONLY
  104 #define __P100 PAGE_EXECONLY
  105 #define __P101 PAGE_READONLY_EXEC
  106 #define __P110 PAGE_READONLY_EXEC
  107 #define __P111 PAGE_READONLY_EXEC
  108
  109 #define __S000 PAGE_NONE
  110 #define __S001 PAGE_READONLY
  111 #define __S010 PAGE_SHARED
  112 #define __S011 PAGE_SHARED
  113 #define __S100 PAGE_EXECONLY
  114 #define __S101 PAGE_READONLY_EXEC
  115 #define __S110 PAGE_SHARED_EXEC
  116 #define __S111 PAGE_SHARED_EXEC

Notice that for private mappings every entry carries only the read-only attribute (PTE_RDONLY) and never the writable attribute (PTE_WRITE), lines 100-107, even though VM_WRITE was set earlier! The corresponding shared-mapping entries, by contrast, do get the writable attribute.

And this protection-bit combination is eventually written into the page table during the page fault, in the do_anonymous_page function quoted above:

2908 entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
2909 vma->vm_page_prot));

For a private anonymously mapped page, suppose vm_flags is VM_READ|VM_WRITE: the protection_map index is 0b011, i.e. __P011, which on arm64 is PAGE_READONLY = __pgprot(_PAGE_DEFAULT | PTE_USER | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN). PTE_WRITE is never set.

So its page table entry is set read-only!

Line 2922: jump to setpte to write the computed page table entry value into the page table.

When an anonymous page that has only been read is later written, a COW page fault occurs because the page table attribute is read-only. See the COW-related articles for details; we won't repeat them here. (A small sketch that counts the two faults follows.)
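Here is the sketch, using getrusage(2) (minor-fault counts can be perturbed by unrelated faults, so treat the numbers as approximate):

    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/resource.h>

    /* Minor (soft) page faults taken by this process so far. */
    static long minor_faults(void)
    {
        struct rusage ru;
        getrusage(RUSAGE_SELF, &ru);
        return ru.ru_minflt;
    }

    int main(void)
    {
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;

        long base = minor_faults();

        volatile char c = p[0];      /* 1st fault: read maps the page to page 0 */
        long after_read = minor_faults() - base;

        p[0] = 0x55;                 /* 2nd fault: write COWs a fresh page */
        long after_write = minor_faults() - base;

        (void)c;
        printf("faults after read : %ld\n", after_read);   /* expect ~1 */
        printf("faults after write: %ld\n", after_write);  /* expect ~2 */

        munmap(p, 4096);
        return 0;
    }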

3.3 Writing the anonymous page for the first time

We continue analyzing the do_anonymous_page function:

 2876 static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
  2877 {
  ...
  2924
  2925 /* Allocate our own private page. */
  2926 if (unlikely(anon_vma_prepare(vma)))
  2927 goto oom;
  2928 page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
  2929 if (!page)
  2930 goto oom;
  2931
  2932 if (mem_cgroup_try_charge_delay(page, vma->vm_mm, GFP_KERNEL, &memcg,
  2933 false))
  2934 goto oom_free_page;
  2935
  2936 /*
  2937  * The memory barrier inside __SetPageUptodate makes sure that
  2938  * preceeding stores to the page contents become visible before
  2939  * the set_pte_at() write.
  2940  */
  2941 __SetPageUptodate(page);
  2942
  2943 entry = mk_pte(page, vma->vm_page_prot);
  2944 if (vma->vm_flags & VM_WRITE)
  2945 entry = pte_mkwrite(pte_mkdirty(entry));
2946
  2947 vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
  2948 &vmf->ptl);
  2949 if (!pte_none(*vmf->pte))
  2950 goto release;
  2951
  2952 ret = check_stable_address_space(vma->vm_mm);
  2953 if (ret)
  2954 goto release;
  2955
  2956 /* Deliver the page fault to userland, check inside PT lock */
  2957 if (userfaultfd_missing(vma)) {
  2958 pte_unmap_unlock(vmf->pte, vmf->ptl);
  2959 mem_cgroup_cancel_charge(page, memcg, false);
  2960 put_page(page);
  2961 return handle_userfault(vmf, VM_UFFD_MISSING);
  2962 }
  2963
  2964 inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
  2965 page_add_new_anon_rmap(page, vma, vmf->address, false);
  2966 mem_cgroup_commit_charge(page, memcg, false, false);
  2967 lru_cache_add_active_or_unevictable(page, vma);
  2968 setpte:
  2969 set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
  2970
  2971 /* No need to invalidate - it was non-present before */
  2972 update_mmu_cache(vma, vmf->address, vmf->pte);
  2973 unlock:
  2974 pte_unmap_unlock(vmf->pte, vmf->ptl);
  2975 return ret;
  2976 release:
  2977 mem_cgroup_cancel_charge(page, memcg, false);
  2978 put_page(page);
  2979 goto unlock;
  2980 oom_free_page:
  2981 put_page(page);
  2982 oom:
  2983 return VM_FAULT_OOM;
  2984 }

When the fault is judged not to be caused by a read, it must be a write, and we handle writing a private anonymous page. Keep in mind that this is still the first access to this anonymous page; it just happens to be a write access.

Line 2928 allocates a zero-filled physical page (from highmem if available, marked movable). Line 2941 marks the page's contents as valid (up to date). Line 2943 builds the page table entry value from the page frame number and the vma access permissions (note: at this point the entry is still read-only).

Lines 2944-2945: if the vma is writable, mark the page table entry value dirty and writable (this is the moment it becomes writable), as the arm64 helper below shows.
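For the curious, the arm64 helper that does this is roughly the following (a paraphrased sketch of pte_mkwrite from arch/arm64/include/asm/pgtable.h of kernels of this vintage; details vary by version): it sets the software PTE_WRITE bit and clears the hardware PTE_RDONLY bit, which is exactly what turns the read-only entry writable.

    /* Paraphrased sketch, not verbatim kernel code. */
    static inline pte_t pte_mkwrite(pte_t pte)
    {
        pte = set_pte_bit(pte, __pgprot(PTE_WRITE));    /* mark writable */
        pte = clear_pte_bit(pte, __pgprot(PTE_RDONLY)); /* drop read-only */
        return pte;
    }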

Line 2964: update the anonymous-page counter.
Line 2965: add the page to the anonymous reverse mapping.
Line 2967: add the page to the LRU list.
Line 2969: write the computed page table entry value into the page table.


3.4 Writing an anonymous page after reading it

Writing an anonymous page after it has been read is actually very simple: a COW (copy-on-write) page fault occurs, exactly as described above.

4. Application layer experiments

Experiment 1 demonstrates the kernel's on-demand paging strategy. The code maps 10 * 4096 * 4096 bytes / 1M = 160M of memory with mmap and observes memory usage before and after the mapping and before and after writing the pages:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define MAP_LEN (10 * 4096 * 4096)

    int main(int argc, char **argv)
    {
        char *p;
        int i;

        puts("before mmap -> please exec: free -m\n");
        sleep(10);
        p = (char *)mmap(0, MAP_LEN, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        puts("after mmap -> please exec: free -m\n");
        puts("before write....\n");
        sleep(10);

        for (i = 0; i < 4096 * 10; i++)
            p[4096 * i] = 0x55;     /* touch every page of the 160M mapping once */

        puts("after write -> please exec: free -m\n");

        pause();

        return 0;
    }

Execution results:

Run free -m after "before mmap -> please exec: free -m" appears:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:          15921        6561         462         796        8897        8214
Swap:         16290         702       15588

Run it again after "after mmap -> please exec: free -m" is printed:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:          15921        6565         483         771        8872        8236
Swap:         16290         702       15588

And again after "after write -> please exec: free -m" appears:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:          15921        6727         322         770        8871        8076
Swap:         16290         702       15588

We focus only on the used column. Used memory is essentially unchanged before and after the mapping (allowing for other allocations happening on the system), 6561M versus 6565M, showing that mmap itself allocates no physical memory. After the writes, used memory is 6727M, and 6727 - 6565 = 162M, essentially the size of our mapping: physical memory is allocated only when the anonymous pages are actually written.

Experiment 2 demonstrates what happens when anonymous pages are read first and written afterwards. The code maps the same 10 * 4096 * 4096 bytes / 1M = 160M with mmap, reads every page, then writes every page, observing memory usage at each step:

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define MAP_LEN (10 * 4096 * 4096)

    int main(int argc, char **argv)
    {
        char *p;
        int i;

        puts("before mmap...pls show free:\n");
        sleep(10);
        p = (char *)mmap(0, MAP_LEN, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        puts("after mmap....\n");

        puts("before read...pls show free:\n");
        sleep(10);

        puts("start read....\n");

        for (i = 0; i < 4096 * 10; i++)
            printf("%d ", p[4096 * i]);     /* read every page: prints all zeros */
        printf("\n");

        puts("after read....pls show free:\n");

        sleep(10);

        puts("start write....\n");

        for (i = 0; i < 4096 * 10; i++)
            p[4096 * i] = 0x55;             /* write every page: triggers COW */

        puts("after write...pls show free:\n");

        pause();

        return 0;
    }

Execution results. Run free -m after "before mmap...pls show free:" appears:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:          15921        6590         631         780        8700        8164
Swap:         16290         702       15588

Again after "before read...pls show free:" appears:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:          15921        6586         644         770        8690        8178
Swap:         16290         702       15588

Again after "after read....pls show free:" appears:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:          15921        6587         624         789        8709        8158
Swap:         16290         702       15588

And finally after "after write...pls show free:" appears:

$ free -m
              total        used        free      shared  buff/cache   available
Mem:          15921        6749         462         789        8709        7996
Swap:         16290         702       15588

We can see that memory usage after the reads is basically unchanged (the pages were in fact all mapped to page 0, which was allocated at kernel initialization), and the reads print all zeros. After the writes, 6749 - 6587 = 162M, which matches expectations.

Analysis: mmap itself merely sets up a vma. The first read of each page takes a page fault that maps it to page 0, so no memory is allocated at all. When a page is later written, a COW fault allocates a new page for it (and when COW allocates the new page it checks whether the original page was page 0; if so, there is nothing to copy, and a freshly zero-filled page is handed out directly).
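The zero-page special case lives in the COW handler. Roughly (a paraphrased sketch of the relevant branch of wp_page_copy() in mm/memory.c, not a verbatim quote):

    /* Paraphrased sketch of wp_page_copy(), mm/memory.c. */
    if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
        /* The old page was page 0: nothing worth copying, just hand
         * out a fresh zero-filled page. */
        new_page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
    } else {
        /* Otherwise allocate a page and copy the old contents. */
        new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
        cow_user_page(new_page, old_page, vmf->address, vma);
    }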

5. Summary

The anonymous mapping page fault is one of the most common faults we encounter. For an anonymous mapping, completing the mapping only obtains a range of virtual memory; no physical memory is allocated. On the first access, if it is a read, the virtual page is mapped to page 0 to avoid an unnecessary allocation; if it is a write, a new physical page is allocated, filled with zeros, and mapped to the virtual page. If a page is first read and then written, two page faults occur: the first is the read-side anonymous page fault handled above, the second is a copy-on-write fault.

Original author: Linux code reading field

Original address: Analysis of anonymous mapping page fault exception in Linux kernel virtual memory management – Tencent Cloud Developer Community.

