Linux memory management: high memory mapping, kmap, and persistent kernel mapping

1 High-end memory and kernel mapping

Although the vmalloc function family can be used to map page frames from high memory into the kernel's address space (high memory pages are usually not directly visible in kernel space), this is not the actual purpose of these functions.

It is important to emphasize that the kernel provides other functions for explicitly mapping ZONE_HIGHMEM page frames into kernel space, and these have nothing to do with the vmalloc mechanism; conflating the two is a frequent source of confusion.

Pages in high memory cannot be permanently mapped into the kernel address space. Therefore, page frames obtained with the __GFP_HIGHMEM flag through alloc_pages() may not have logical addresses.

On x86_32, all physical memory above 896 MiB is high memory; it is not permanently or automatically mapped into the kernel address space, even though the processor can address up to 4 GiB of physical RAM (or 64 GiB with PAE enabled). Once such pages are allocated, they must be explicitly mapped into the kernel's virtual address space, i.e. the 3 GiB to 4 GiB portion of the address space.

What is the last 128 MiB of the kernel address space used for?

This region serves three purposes.

  1. Memory areas that are contiguous in virtual memory but not contiguous in physical memory can be allocated in the vmalloc area. This mechanism is usually used for user processes; the kernel itself tries to avoid physically non-contiguous allocations, and it usually succeeds, because most large blocks of memory are allocated at boot time, when fragmentation is not yet severe. On systems that have been running for a long time, however, contiguous physical memory may no longer be available when the kernel needs it. This situation arises mainly when modules are loaded dynamically.
  2. Persistent mapping is used to map non-persistent pages in the high memory domain into the kernel
  3. A fixed mapping is a virtual address space entry associated with a fixed-purpose slot, but the page frame behind it can be chosen freely. In contrast to directly mapped pages, whose association with physical memory follows a fixed formula, the association between a fixed-map virtual address and a physical memory location can be defined at will, and the kernel keeps track of it once it has been established.

Two preprocessor symbols are important here: __VMALLOC_RESERVE sets the length of the vmalloc region, and MAXMEM represents the maximum possible amount of physical memory that the kernel can directly address.

In the kernel, the division of memory into various areas is controlled by various constants shown in Figure 3-15. Depending on the kernel and system configuration, these constants may have different values. The bounds of direct mapping are specified by high_memory.

  1. Direct mapping area

The range from 3 GiB up to 3 GiB + 896 MiB of the linear address space is the direct memory mapping area. Linear and physical addresses in this area are related by a fixed formula: linear address = 3 GiB + physical address (see the sketch after this list).

  2. Dynamic memory mapping area

This area is allocated by the kernel function vmalloc. Its characteristic is that the linear address space is contiguous while the corresponding physical space need not be: the physical pages backing a vmalloc allocation may lie in low memory or in high memory.

  3. Permanent memory mapping area

This area gives access to high memory: allocate a high memory page with alloc_page(__GFP_HIGHMEM), then map it into this area with the kmap function (also illustrated in the sketch after this list).

  4. Fixed mapping area

Only a 4 KiB guard gap separates the top of this area from 4 GiB. Each address entry in this area serves a specific purpose, such as ACPI_BASE.
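
As a rough illustration of how the first three areas are exercised from kernel code, here is a minimal, hypothetical module-init sketch (error handling is omitted, and the names region_demo_init/region_demo_exit are invented for this example):

#include <linux/module.h>
#include <linux/gfp.h>
#include <linux/vmalloc.h>
#include <linux/highmem.h>

static int __init region_demo_init(void)
{
    /* 1. Direct mapping area: low memory obeys virt = PAGE_OFFSET + phys,
          so a logical address exists without any extra mapping work. */
    struct page *low = alloc_page(GFP_KERNEL);
    void *lowaddr = page_address(low);

    /* 2. Dynamic mapping area: virtually contiguous, physically maybe not */
    void *vbuf = vmalloc(64 * 1024);

    /* 3. Permanent mapping area: a high memory page must be kmap'ed first */
    struct page *high = alloc_page(GFP_HIGHUSER); /* may be in ZONE_HIGHMEM */
    void *highaddr = kmap(high);                  /* may sleep */

    pr_info("low %p vmalloc %p high %p\n", lowaddr, vbuf, highaddr);

    kunmap(high);    /* release the persistent mapping again */
    __free_page(high);
    vfree(vbuf);
    __free_page(low);
    return 0;
}

static void __exit region_demo_exit(void) { }

module_init(region_demo_init);
module_exit(region_demo_exit);
MODULE_LICENSE("GPL");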

Note that user space can of course use high memory, and it does so routinely. When the kernel allocates memory that is not frequently accessed, it prefers high memory if any is available. What counts as "infrequently used" is relative: some kernel data structures are accessed constantly, while much user data is not. Every application has 3 GiB of linear address space, and when page tables are populated for those addresses, high memory pages can be used directly.

One further correction: the 128 MiB of kernel linear address space is not used only for the purposes listed above. Loading a device whose memory must be mapped into the kernel also consumes this linear address range; otherwise the kernel could not access the memory on the device. In short, the kernel's high linear addresses exist to reach memory resources outside the kernel's fixed direct mapping.

When a process uses memory, a page fault exception is triggered; exactly which physical pages are mapped into the user process is the kernel's concern. There is no concept of high memory in user space.

In other words, the kernel needs no special mapping mechanism for low memory: the direct mapping suffices to access ordinary memory areas. For high memory, the kernel can use three different mechanisms to map page frames into its address space, called permanent kernel mapping, temporary kernel mapping, and non-contiguous memory allocation.

2 Persistent kernel mapping

If a high memory page frame needs to be mapped into the kernel address space for a longer period (as a persistent mapping), the kmap function must be used. The page to be mapped is specified by a pointer to its page instance, passed as the function's parameter. The function creates a mapping if necessary (i.e. if the page really is a high memory page) and returns the address of the mapped data.

If high memory support is not enabled, the task of this function is simpler: all pages are directly accessible, so it only needs to return the page's address, without explicitly creating a mapping.

If high memory pages do exist, the situation is more complicated. As with vmalloc, the kernel must first establish an association between the high memory page and the address to which it is mapped. An area in the virtual address space must also be reserved to map the page frames, and finally the kernel must keep track of which parts of that virtual area are in use and which are still free.

2.1 Data Structure

On IA-32 platforms, the kernel reserves a region after the vmalloc area, from PKMAP_BASE to FIXADDR_START, for persistent mappings. The scheme used on other architectures is similar.

Permanent kernel mapping allows the kernel to establish long-term mappings of high memory page frames into the kernel address space. It uses a dedicated page table in the kernel's page tables, whose address is stored in the variable pkmap_page_table. The number of entries in this page table is given by the LAST_PKMAP macro, so the kernel can access at most 2 MiB or 4 MiB of high memory at a time.

#define PKMAP_BASE (PAGE_OFFSET - PMD_SIZE)

The linear addresses covered by this page table start at PKMAP_BASE. The pkmap_count array contains LAST_PKMAP counters, one for each entry of the pkmap_page_table page table.
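
For reference, the index/address conversions for this window on x86 are simple shifts relative to PKMAP_BASE (abridged from arch/x86/include/asm/highmem.h); with 1024 entries of 4 KiB each the window spans 4 MiB, and with PAE (512 entries) it spans 2 MiB:

#ifdef CONFIG_X86_PAE
#define LAST_PKMAP 512
#else
#define LAST_PKMAP 1024
#endif

#define LAST_PKMAP_MASK (LAST_PKMAP - 1)
#define PKMAP_NR(virt)  ((virt - PKMAP_BASE) >> PAGE_SHIFT)
#define PKMAP_ADDR(nr)  (PKMAP_BASE + ((nr) << PAGE_SHIFT))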

// http://lxr.free-electrons.com/source/mm/highmem.c?v=4.7#L126
static int pkmap_count[LAST_PKMAP];
static __cacheline_aligned_in_smp DEFINE_SPINLOCK(kmap_lock);

pte_t * pkmap_page_table;

The allocation state of the logical pages in the persistent mapping area is described by the allocation table pkmap_count. On x86 it has 1024 entries, each corresponding to one logical page of the mapping area. An entry equal to 0 is free; an entry equal to 1 is a "buffered" entry (unused, but its TLB entry has not been flushed yet); an entry greater than 1 is a mapped entry. Allocation of mapping pages is based on a scan of this table: when all free entries are used up, the kernel flushes all buffered entries, and if even those are exhausted, the allocating task goes to sleep.


pkmap_count is an integer array with LAST_PKMAP elements, one per persistent mapping page. Each element is effectively a usage counter for the mapped page, but with somewhat unusual semantics.

The kernel obtains the next candidate index into the pkmap_count array through get_next_pkmap_nr.

/*
 * Get next index for mapping inside PKMAP region for page with given color.
 */
static inline unsigned int get_next_pkmap_nr(unsigned int color)
{
    static unsigned int last_pkmap_nr;

    last_pkmap_nr = (last_pkmap_nr + 1) & LAST_PKMAP_MASK;
    return last_pkmap_nr;
}
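
The color parameter only matters for architectures with aliasing data caches, which can supply their own helpers in asm/highmem.h. On all other architectures, mm/highmem.c falls back to trivial defaults along the following lines (abridged; see the source for the exact guards):

/* Defaults when the architecture defines no kmap cache coloring */
static inline unsigned int get_pkmap_color(struct page *page)
{
    return 0;                  /* only one color */
}

static inline unsigned int get_pkmap_entries_count(unsigned int color)
{
    return LAST_PKMAP;         /* all entries share that single color */
}

static inline int no_more_pkmaps(unsigned int pkmap_nr, unsigned int color)
{
    return pkmap_nr == 0;      /* the index has wrapped around */
}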

To record the association between high memory page frames and the linear addresses assigned by permanent kernel mappings, the kernel uses the page_address_htable hash table.

The table holds one page_address_map structure for each currently mapped page frame in high memory. The structure contains a pointer to the page descriptor and the linear address assigned to the page frame.

/*
 * Describes one page->virtual association
 */
struct page_address_map
{
    struct page *page;
    void *virtual;
    struct list_head list;
};

This structure establishes the page → virtual association (hence the structure's name).

The fields have the following meanings:

  • page: a pointer to the page instance in the global mem_map array
  • virtual: the position in the kernel virtual address space at which the page is mapped

For ease of organization the mappings are kept in a hash table, where the list_head element of the structure serves to build an overflow chain for handling hash collisions. The hash table is implemented by the page_address_htable array.
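
The buckets of this hash table are defined in mm/highmem.c as follows (abridged):

static struct page_address_slot {
    struct list_head lh;       /* list of page_address_maps */
    spinlock_t lock;           /* protects this bucket's list */
} ____cacheline_aligned_in_smp page_address_htable[1 << PA_HASH_ORDER];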

static struct page_address_slot *page_slot(const struct page *page)
{
    return &page_address_htable[hash_ptr(page, PA_HASH_ORDER)];
}

2.2 page_address function

page_address is a front-end function that uses the above data structure to determine the linear address of a given page instance.

/**
 * page_address - get the mapped virtual address of a page
 * @page: &struct page to get the virtual address of
 *
 * Returns the page's virtual address.
 */
void *page_address(const struct page *page)
{
    unsigned long flags;
    void *ret;
    struct page_address_slot *pas;

    /* If the page frame is not in high memory ... */
    if (!PageHighMem(page))
        /* ... a linear address always exists: compute the page frame
           number, convert it to a physical address, and derive the
           linear address from that. */
        return lowmem_page_address(page);

    /* Look up the bucket for this page in the page_address_htable */
    pas = page_slot(page);
    ret = NULL;
    spin_lock_irqsave(&pas->lock, flags);
    if (!list_empty(&pas->lh)) {
        /* The list holds page_address_map structures */
        struct page_address_map *pam;

        /* Walk every element in the list */
        list_for_each_entry(pam, &pas->lh, list) {
            if (pam->page == page) {
                /* Found it: return the linear address */
                ret = pam->virtual;
                goto done;
            }
        }
    }
done:
    spin_unlock_irqrestore(&pas->lock, flags);
    return ret;
}

EXPORT_SYMBOL(page_address);

page_address first checks whether the passed-in page instance lies in ordinary memory or in high memory.

  • If the former (ordinary memory area), the page's address can be computed from its position in the mem_map array; lowmem_page_address does this via page_to_virt(page).
  • For the latter, the virtual address is looked up in the hash table described above.
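
The lowmem helper itself is a one-liner in include/linux/mm.h; in v4.7 it reads roughly as follows, where the default page_to_virt performs exactly the direct-mapping arithmetic described above:

/* include/linux/mm.h (abridged) */
#ifndef page_to_virt
#define page_to_virt(x) __va(PFN_PHYS(page_to_pfn(x)))
#endif

static __always_inline void *lowmem_page_address(const struct page *page)
{
    return page_to_virt(page);
}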

2.3 kmap creates mapping

2.3.1 kmap function

To create a mapping via a page pointer, the kmap function must be used.

The definition may differ between architectures, but most of them define it as follows:

/* Map a high memory page: an entry is allocated from the pkmap array,
   and the new mapping is then added to the hash table. */
void *kmap(struct page *page)
{
    might_sleep();
    if (!PageHighMem(page)) /* page frame is not in high memory */
        return page_address(page);
    return kmap_high(page); /* page frame really is in high memory */
}
EXPORT_SYMBOL(kmap);

The kmap function is just a front end for page_address that checks whether the specified page really resides in high memory. If not, the result of page_address is returned. If it does, the kernel delegates the work to kmap_high.
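
A typical usage pattern from process context might look like the following hypothetical helper (clear_possibly_high_page is an invented name; zeroing the page stands in for real work):

#include <linux/highmem.h>
#include <linux/string.h>

/* Clear one page that may live in ZONE_HIGHMEM.
   kmap() may sleep, so call this from process context only. */
static void clear_possibly_high_page(struct page *page)
{
    void *vaddr = kmap(page);   /* create or reuse a persistent mapping */

    memset(vaddr, 0, PAGE_SIZE);
    kunmap(page);               /* drop our reference to the mapping */
}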

2.3.2 kmap_high function

/**
 * kmap_high - map a highmem page into memory
 * @page: &struct page to map
 *
 * Returns the page's virtual memory address.
 *
 * We cannot call this from interrupts, as it may block.
 */
void *kmap_high(struct page *page)
{
    unsigned long vaddr;

    /*
     * For highmem pages, we can't trust "virtual" until
     * after we have the lock.
     */
    lock_kmap(); /* protect the page table against concurrent access on SMP */

    /* Check whether the page is already mapped */
    vaddr = (unsigned long)page_address(page);
    if (!vaddr) /* not yet mapped */
        /* Insert an entry for the page frame into pkmap_page_table
           and add an element to the page_address_htable hash table */
        vaddr = map_new_virtual(page);
    /* Increment the usage count; right after a fresh mapping it must be 2 */
    pkmap_count[PKMAP_NR(vaddr)]++;
    BUG_ON(pkmap_count[PKMAP_NR(vaddr)] < 2);
    unlock_kmap();
    return (void *)vaddr; /* return the mapped address */
}

EXPORT_SYMBOL(kmap_high);

2.3.3 map_new_virtual function

kmap_high first uses the page_address function discussed above to check whether the page is already mapped. If it does not yet correspond to a valid address, the page must be mapped using map_new_virtual.

The following main steps are performed.

static inline unsigned long map_new_virtual(struct page *page)
{
    unsigned long vaddr;
    int count;
    unsigned int last_pkmap_nr;
    unsigned int color = get_pkmap_color(page);

start:
    count = get_pkmap_entries_count(color);
    /* Find an empty entry */
    for (;;) {
        last_pkmap_nr = get_next_pkmap_nr(color); /* advance the index, wrapping at LAST_PKMAP */
        /* When last_pkmap_nr wraps around to 0, all LAST_PKMAP (1024) page
           table entries have been handed out at least once. At that point
           flush_all_zero_pkmaps() is called: it flushes the TLB for every
           entry whose pkmap_count[] is 1 and resets it to 0, so that the
           entry can be used again. Why is the TLB not flushed immediately
           when a count drops to 1 at unmap time? Presumably for efficiency:
           deferring the flush until entries run out batches the expensive
           operation. */
        if (no_more_pkmaps(last_pkmap_nr, color)) {
            flush_all_zero_pkmaps();
            count = get_pkmap_entries_count(color);
        }

        if (!pkmap_count[last_pkmap_nr])
            break; /* Found a usable entry */
        if (--count)
            continue;

        /*
         * Sleep for somebody else to unmap their entries
         */
        {
            DECLARE_WAITQUEUE(wait, current);
            wait_queue_head_t *pkmap_map_wait =
                get_pkmap_wait_queue_head(color);

            __set_current_state(TASK_UNINTERRUPTIBLE);
            add_wait_queue(pkmap_map_wait, &wait);
            unlock_kmap();
            schedule();
            remove_wait_queue(pkmap_map_wait, &wait);
            lock_kmap();

            /* Somebody else might have mapped it while we slept */
            if (page_address(page))
                return (unsigned long)page_address(page);

            /* Re-start */
            goto start;
        }
    }
    /* Compute the linear address vaddr corresponding to this page table entry */
    vaddr = PKMAP_ADDR(last_pkmap_nr);
    /* Set the page table entry */
    set_pte_at(&init_mm, vaddr,
            &(pkmap_page_table[last_pkmap_nr]), mk_pte(page, kmap_prot));
    /* pkmap_count[last_pkmap_nr] is set to 1. Doesn't 1 mean "buffered,
       not usable"? Since the mapping is now established, shouldn't it be 2?
       The increment to 2 is performed by the caller, kmap_high
       (pkmap_count[PKMAP_NR(vaddr)]++). */
    pkmap_count[last_pkmap_nr] = 1;
    /* The mapping is complete; add the page and its linear address
       to the page_address_htable hash list */
    set_page_address(page, (void *)vaddr);

    return vaddr;
}
  1. Starting at the position of the last allocated entry (kept in the static variable last_pkmap_nr), the pkmap_count array is scanned until a free position is found. If no position is free, the function sleeps until another part of the kernel unmaps an entry and frees up space. When the index wraps around past the maximum value, the search restarts from position 0; in that case flush_all_zero_pkmaps is also called to flush the caches and stale TLB entries (discussed below).
  2. The kernel's page table is modified to map the page at the chosen location. The TLB, however, is not yet updated.
  3. The usage counter of the new location is set to 1. As explained above, this means the entry is allocated but not yet usable, because its TLB entry is not current.
  4. set_page_address adds the page to the persistent kernel mapping data structures. The function returns the virtual address of the newly mapped page. On architectures that do not require high memory pages (or if CONFIG_HIGHMEM is not set), a generic version of kmap returns the page's address without modifying virtual memory.

2.4 kunmap unmapping

Pages mapped with kmap must be unmapped with kunmap when they are no longer needed. As usual, this function first checks whether the page (identified by its page instance) really lies in high memory. If so, the actual work is delegated to kunmap_high in mm/highmem.c, whose main task is to decrement the counter at the corresponding position of the pkmap_count array.

This mechanism never decreases the counter below 1, which means the associated mapping is not yet freed: as discussed earlier, the usage counter was incremented by an extra 1 to ensure correct handling of the CPU caches and TLB.

flush_all_zero_pkmaps, also mentioned above, is what finally releases the mappings. It is always called when map_new_virtual's search wraps around to the start of the array.

It performs the following three operations.

  1. flush_cache_kmaps flushes the CPU caches for the kmap region (on most architectures that require explicit flushing, flush_cache_all is used to flush the CPU's entire cache), because the kernel's global page tables are about to be modified.
  2. The whole pkmap_count array is scanned. Entries with a counter value of 1 are set to 0, the associated entries are removed from the page table, and the mappings are deleted from the data structures.
  3. Finally, flush_tlb_kernel_range flushes all TLB entries covering the PKMAP area.
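
For reference, the v4.7 implementation in mm/highmem.c reads as follows (comments lightly abridged):

static void flush_all_zero_pkmaps(void)
{
    int i;
    int need_flush = 0;

    flush_cache_kmaps();

    for (i = 0; i < LAST_PKMAP; i++) {
        struct page *page;

        /*
         * zero means we don't have anything to do,
         * >1 means that it is still in use. Only
         * a count of 1 means that it is free but
         * needs to be unmapped
         */
        if (pkmap_count[i] != 1)
            continue;
        pkmap_count[i] = 0;

        /* sanity check */
        BUG_ON(pte_none(pkmap_page_table[i]));

        /*
         * No atomic fetch-and-clear is needed here: nobody has the
         * page mapped, and nobody can reach its virtual address (and
         * hence its PTE) without first taking kmap_lock, which is
         * held here.
         */
        page = pte_page(pkmap_page_table[i]);
        pte_clear(&init_mm, PKMAP_ADDR(i), &pkmap_page_table[i]);

        set_page_address(page, NULL);
        need_flush = 1;
    }
    if (need_flush)
        flush_tlb_kernel_range(PKMAP_ADDR(0), PKMAP_ADDR(LAST_PKMAP));
}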


2.4.1 kunmap function

Similar to kmap, each architecture implements its own kunmap function. Most architectures define it as follows:

void kunmap(struct page *page)
{
    BUG_ON(in_interrupt());
    if (!PageHighMem(page))
        return;
    kunmap_high(page);
}
EXPORT_SYMBOL(kunmap);

The kernel first checks whether the memory area to be released is in the high-end memory area.

  • If the memory area lies in an ordinary memory area, the kernel never established a persistent mapping for it via kmap_high, so there is nothing for kunmap_high to release.
  • If it lies in high memory, the kernel releases the mapping through kunmap_high.

2.4.2 kunmap_high function

#ifdef CONFIG_HIGHMEM
/**
 * kunmap_high - unmap a highmem page into memory
 * @page: &struct page to unmap
 *
 * If ARCH_NEEDS_KMAP_HIGH_GET is not defined then this may be called
 * only from user context.
 */
void kunmap_high(struct page *page)
{
    unsigned long vaddr;
    unsigned long nr;
    unsigned long flags;
    int need_wakeup;
    unsigned int color = get_pkmap_color(page);
    wait_queue_head_t *pkmap_map_wait;

    lock_kmap_any(flags);
    vaddr = (unsigned long)page_address(page);
    BUG_ON(!vaddr);
    nr = PKMAP_NR(vaddr); /* index of the page within the pkmap area */

    /*
     * A count must never go down to zero
     * without a TLB flush!
     */
    need_wakeup = 0;
    switch (--pkmap_count[nr]) { /* decrement; kmap_high raised it to at least 2 */
    case 0:
        BUG();
    case 1:
        /*
         * Avoid an unnecessary wake_up() function call.
         * The common case is pkmap_count[] == 1, but
         * no waiters.
         * The tasks queued in the wait-queue are guarded
         * by both the lock in the wait-queue-head and by
         * the kmap_lock. As the kmap_lock is held here,
         * no need for the wait-queue-head's lock. Simply
         * test if the queue is empty.
         */
        pkmap_map_wait = get_pkmap_wait_queue_head(color);
        need_wakeup = waitqueue_active(pkmap_map_wait);
    }
    unlock_kmap_any(flags);

    /* do wake-up, if needed, race-free outside of the spin lock */
    if (need_wakeup)
        wake_up(pkmap_map_wait);
}

EXPORT_SYMBOL(kunmap_high);
#endif

3 Temporary kernel mapping

The kmap function just described cannot be used in interrupt handlers, because it may sleep: if there are no free positions in the pkmap array, it blocks until the situation improves. For such cases the kernel provides temporary mappings through kmap_atomic.

3.1 kmap_atomic function

void *kmap_atomic(struct page *page)
{
    unsigned int idx;
    unsigned long vaddr;
    void *kmap;
    int type;

    preempt_disable();
    pagefault_disable();
    if (!PageHighMem(page))
        return page_address(page);

#ifdef CONFIG_DEBUG_HIGHMEM
    /*
     * There is no cache coherency issue when non VIVT, so force the
     * dedicated kmap usage for better debugging purposes in that case.
     */
    if (!cache_is_vivt())
        kmap = NULL;
    else
#endif
        kmap = kmap_high_get(page);
    if (kmap)
        return kmap;

    type = kmap_atomic_idx_push();

    idx = FIX_KMAP_BEGIN + type + KM_TYPE_NR * smp_processor_id();
    vaddr = __fix_to_virt(idx);
#ifdef CONFIG_DEBUG_HIGHMEM
    /*
     * With debugging enabled, kunmap_atomic forces that entry to 0.
     * Make sure it was indeed properly unmapped.
     */
    BUG_ON(!pte_none(get_fixmap_pte(vaddr)));
#endif
    /*
     * When debugging is off, kunmap_atomic leaves the previous mapping
     * in place, so the contained TLB flush ensures the TLB is updated
     * with the new mapping.
     */
    set_fixmap_pte(idx, mk_pte(page, kmap_prot));

    return (void *)vaddr;
}
EXPORT_SYMBOL(kmap_atomic);

This function does not block, so it can be used in interrupt context and in other places where rescheduling is not allowed. It also disables kernel preemption, which is necessary because the mapping slots are per processor (a preempted task might be rescheduled on a different CPU and would then use the wrong slot).
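
An illustrative pattern that is safe in atomic context (copy_page_atomic is an invented name; copying one page stands in for real work):

#include <linux/highmem.h>
#include <linux/string.h>

/* Copy a possibly-highmem page while atomic. kmap_atomic() never
   sleeps, so this may run in interrupt context; the per-CPU mapping
   slots are stack-like, so unmap in reverse order of mapping. */
static void copy_page_atomic(struct page *dst, struct page *src)
{
    void *s = kmap_atomic(src);
    void *d = kmap_atomic(dst);

    memcpy(d, s, PAGE_SIZE);
    kunmap_atomic(d);
    kunmap_atomic(s);
}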

3.2 kunmap_atomic function

Unmapping can be done through the function kunmap_atomic

/*
 * Prevent people trying to call kunmap_atomic() as if it were kunmap()
 * kunmap_atomic() should get the return value of kmap_atomic, not the page.
 */
#define kunmap_atomic(addr) \
do { \
    BUILD_BUG_ON(__same_type((addr), struct page *)); \
    __kunmap_atomic(addr); \
} while (0)

This function also does not block. On many architectures, unless kernel preemption is enabled, kunmap_atomic has almost nothing to do, because the previous temporary mapping simply remains in place until the next temporary mapping replaces it. The kernel can therefore "forget" the kmap_atomic mapping without doing anything practical in kunmap_atomic: the next atomic mapping automatically overwrites the previous one.

Original author: 233333

Original address: High-end memory mapping kmap persistent kernel mapping – Linux memory management (20) – Tencent Cloud Developer Community (copyright belongs to the original author)
