Linux physical page allocation: the buddy algorithm

Foreword

This chapter focuses on the buddy algorithm used by Linux to allocate physical pages. Rather than jumping straight into the algorithm, we first look at when physical page allocation actually happens. (The source code in this article is from kernel version 5.15.)

Linux uses a demand paging mechanism. Demand paging is a dynamic memory allocation technique: it postpones allocating a page until the allocation can no longer be postponed, that is, until the process accesses a page that is not yet in physical memory. That access triggers a page fault, and the page fault handler is what actually allocates physical memory. What we are going to look at today is __get_free_pages(), the function called on that path to actually allocate physical pages, together with the buddy algorithm it relies on.

1. Preparation

Before explaining the allocation algorithm, we need to understand that the address accessed by the CPU is not a real address in physical memory, but a virtual address in the virtual address space. Therefore, for the management of memory pages, virtual memory space is usually allocated first, and then real physical memory pages are allocated and mapped to this virtual memory space as needed.

1. Page descriptor

So how does the kernel represent each physical page in the system? It uses the struct page page descriptor. Its definition can be found in the source (in include/linux/mm_types.h):

struct page {
    // Atomic flags for the page; some may be updated asynchronously
    unsigned long flags;

    // The union below covers the various possible uses of a page.
    // For different page types, the kernel uses different members of this union.
    union {
        // Used for the page cache and anonymous pages
        struct {
            // lru list node, used by the page-replacement code
            struct list_head lru;

            // Pointer to the address space object associated with the page
            struct address_space *mapping;

            // Offset within the mapping
            pgoff_t index;

            // Private data field; its meaning depends on the page type
            unsigned long private;
        };

        // page_pool pages used by the network stack
        struct {
            // A magic value used to ensure that only pages allocated by page_pool are recycled
            unsigned long pp_magic;
            struct page_pool *pp;
            unsigned long _pp_mapping_pad;
            unsigned long dma_addr;

            // Some DMA-related information
            union {
                unsigned long dma_addr_upper;
                atomic_long_t pp_frag_count;
            };
        };

        // Pages used by the slab, slob and slub allocators
        struct {
            union {
                struct list_head slab_list;
                struct {
                    struct page *next;

#ifdef CONFIG_64BIT
                    int pages;
                    int pobjects;
#else
                    short int pages;
                    short int pobjects;
#endif
                };
            };

            // Pointer to the kmem_cache this slab page belongs to
            struct kmem_cache *slab_cache;

            // Head pointer of the freelist
            void *freelist;

            union {
                void *s_mem;
                unsigned long counters;
                struct {
                    unsigned inuse:16;
                    unsigned objects:15;
                    unsigned frozen:1;
                };
            };
        };

        // Tail pages of a compound page
        struct {
            unsigned long compound_head;
            unsigned char compound_dtor;
            unsigned char compound_order;
            atomic_t compound_mapcount;
            unsigned int compound_nr;
        };

        // Second tail page of a compound page
        struct {
            unsigned long _compound_pad_1;
            atomic_t hpage_pinned_refcount;
            struct list_head deferred_list;
        };

        // Used for page-table pages
        struct {
            unsigned long _pt_pad_1;
            pgtable_t pmd_huge_pte;
            unsigned long _pt_pad_2;
            union {
                struct mm_struct *pt_mm;
                atomic_t pt_frag_refcount;
            };

#if ALLOC_SPLIT_PTLOCKS
            spinlock_t *ptl;
#else
            spinlock_t ptl;
#endif
        };

        // Used for ZONE_DEVICE pages
        struct {
            struct dev_pagemap *pgmap;
            void *zone_device_data;
        };

        // Pages freed through RCU use this field
        struct rcu_head rcu_head;
    };

    union {
        atomic_t _mapcount;
        unsigned int page_type;
        unsigned int active;
        int units;
    };

    // Reference count of the page
    atomic_t _refcount;

#ifdef CONFIG_MEMCG
    unsigned long memcg_data;
#endif

#if defined(WANT_PAGE_VIRTUAL)
    // If the page is mapped into the kernel virtual address space,
    // this field holds its virtual address
    void *virtual;
#endif

#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
    int _last_cpupid;
#endif
} _struct_page_alignment;

As you can see, it is quite complicated, but don't be intimidated. Since our current purpose is to analyze the buddy algorithm, we will ignore most of it for now and just highlight the key fields:

flags: used to store the status of the page, including whether the page is dirty and whether it is locked in memory.

_refcount: Stores the number of times the page has been referenced.

virtual: stores the virtual address to which the physical page is mapped.

list_head lru: points to the corresponding node in the least recently used (LRU) list. This list is used for page reclaim.

Now we know how Linux describes a physical page, but consider the next question: once physical memory is divided into pages, a great many struct page instances are needed. How does Linux manage all of them? Quite simply: during initialization the system creates an array of struct page that describes every physical page in the system.
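To make this concrete, here is a tiny user-space sketch of that idea (my own toy code, not the kernel's; toy_mem_map and toy_pfn_to_page are invented names). Under the simplest, flat memory model, the descriptor of page frame number pfn is just entry pfn of one large array, which is essentially what the kernel's pfn_to_page() gives you:

#include <stdio.h>

// Toy stand-in for struct page: just enough fields for the illustration.
struct toy_page {
    unsigned long flags;
    int refcount;
};

#define TOY_NR_PAGES 1024                 // pretend the machine has 1024 physical pages
static struct toy_page toy_mem_map[TOY_NR_PAGES];

// Conceptually what pfn_to_page() does under the flat memory model:
// the descriptor of page frame number pfn is simply the pfn-th array entry.
static struct toy_page *toy_pfn_to_page(unsigned long pfn)
{
    return &toy_mem_map[pfn];
}

int main(void)
{
    struct toy_page *p = toy_pfn_to_page(42);
    p->refcount = 1;
    printf("descriptor of pfn 42 lives at index %ld\n", (long)(p - toy_mem_map));
    return 0;
}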

2. Buddy Algorithm: Theory

If a process needs a large block of physically contiguous pages, the buddy algorithm is used to allocate it. The algorithm can be described as follows:

The buddy algorithm is a method for allocating and reclaiming contiguous blocks of memory whose sizes are integer powers of 2. Two blocks of the same size with contiguous physical addresses, split from the same larger block, are called buddies. The buddy system uses a free_area array to record the free page blocks of each order.

Example:

  • When a block of 2 pages is requested, the system looks at array index 1 (because 2^1 = 2), takes a free block of that size from the list there, and updates the list.
  • If no free block is available at index 1, the search continues upward through the higher orders until one is found; that block is split, the requested part is given to the caller, and the surplus pieces are hung back on the free lists of the corresponding lower orders (see the sketch below).
  • When a block is freed, the system checks whether its buddy is also free. If so, the two buddies are merged into a larger block, and the merging can cascade upward.
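Here is a deliberately simplified user-space model of those steps (my own toy code, not the kernel's; it ignores migrate types, buddy merging and locking): one free list per order, allocation searches upward from the requested order, and the surplus from a larger block is split and re-queued at the lower orders.

#include <stdio.h>

#define TOY_MAX_ORDER 11

// One very small free "list" per order: just a stack of block start PFNs.
static unsigned long free_blocks[TOY_MAX_ORDER][32];
static int nr_free[TOY_MAX_ORDER];

static void toy_free_block(unsigned long pfn, unsigned int order)
{
    free_blocks[order][nr_free[order]++] = pfn;
}

// Allocate 2^order pages: search from 'order' upward, split surplus halves.
static long toy_alloc(unsigned int order)
{
    for (unsigned int cur = order; cur < TOY_MAX_ORDER; cur++) {
        if (nr_free[cur] == 0)
            continue;                              // nothing free at this order
        unsigned long pfn = free_blocks[cur][--nr_free[cur]];
        // Split: hand back the upper half of each surplus doubling.
        while (cur > order) {
            cur--;
            toy_free_block(pfn + (1UL << cur), cur);
        }
        return (long)pfn;
    }
    return -1;                                     // out of memory
}

int main(void)
{
    toy_free_block(0, 3);                          // start with one free 8-page block at pfn 0
    printf("got block at pfn %ld\n", toy_alloc(1)); // pfn 0; leaves blocks at order 1 and 2
    printf("got block at pfn %ld\n", toy_alloc(1)); // pfn 2, from the order-1 remainder
    return 0;
}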

Turning to the Linux implementation, let's take a look at the free_area array:

// The free_area array has MAX_ORDER entries, covering orders 0 through MAX_ORDER - 1
#define MAX_ORDER 11

// Per-order free lists
struct free_area {
    struct list_head free_list[MIGRATE_TYPES];
    unsigned long nr_free;
};

// Each zone holds a free_area array with one entry per order
struct zone {

    ...

    struct free_area free_area[MAX_ORDER];

    ...

};

The algorithm is relatively easy to understand in theory: the core is maintaining the free_area array, so I won't belabor it. Next, let's take a look at how Linux kernel 5.15 implements it.

3. Buddy Algorithm: Source Code Analysis

1. __get_free_pages() (mm/page_alloc.c):

unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)
{
    struct page *page;

    page = alloc_pages(gfp_mask & ~__GFP_HIGHMEM, order);
    if (!page)
        return 0;
    return (unsigned long) page_address(page);
}
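For context, this is roughly how kernel code uses this pair of calls. The snippet below is only an illustrative sketch (error handling trimmed), though __get_free_pages(), free_pages() and GFP_KERNEL are the real interfaces:

#include <linux/gfp.h>
#include <linux/errno.h>

// Sketch: ask the buddy allocator for 2^2 = 4 physically contiguous pages.
static int demo_alloc(void)
{
    unsigned long addr;

    addr = __get_free_pages(GFP_KERNEL, 2);   // order = 2, i.e. 4 contiguous pages
    if (!addr)
        return -ENOMEM;                       // allocation failed

    // ... use the pages starting at kernel virtual address 'addr' ...

    free_pages(addr, 2);                      // release with the same order
    return 0;
}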

We can see that the actual work of allocating physical pages is handed off to alloc_pages(), so let's follow it.

2. alloc_pages() (mm/mempolicy.c):

// Definition of alloc_pages: it takes the allocation flags (gfp_t) and the
// order (the request is for 2^order contiguous pages).
struct page *alloc_pages(gfp_t gfp, unsigned order)
{
    // Start from the default memory policy
    struct mempolicy *pol = &default_policy;
    // Declare a pointer to the page structure
    struct page *page;

    // Check whether we are in interrupt context or allocating for a specific node.
    // If we are not in an interrupt and no specific node was requested,
    // use the memory policy of the current task.
    if (!in_interrupt() && !(gfp & __GFP_THISNODE))
        pol = get_task_policy(current);

    // Check the memory policy mode.
    // In interleave mode (MPOL_INTERLEAVE), pages are allocated alternately across nodes.
    if (pol->mode == MPOL_INTERLEAVE)
        page = alloc_page_interleave(gfp, order, interleave_nodes(pol));
    // In multiple-preferred-node mode, allocate according to the set of preferred nodes
    else if (pol->mode == MPOL_PREFERRED_MANY)
        page = alloc_pages_preferred_many(gfp, order,
                numa_node_id(), pol);
    // For the other policies, use the default __alloc_pages path,
    // passing in the relevant node and nodemask information
    else
        page = __alloc_pages(gfp, order,
                policy_node(gfp, pol, numa_node_id()),
                policy_nodemask(gfp, pol));

    // Return a pointer to the allocated page
    return page;
}

Assuming our policy is neither interleave mode nor the multiple-preferred-node mode, we enter the __alloc_pages() function and continue the analysis there.

3. __alloc_pages():

// __alloc_pages tries to allocate a group of contiguous pages based on the given parameters.
struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
                            nodemask_t *nodemask)
{
    // Declare a pointer to the page structure
    struct page *page;
    // Set the default allocation flags to ALLOC_WMARK_LOW
    unsigned int alloc_flags = ALLOC_WMARK_LOW;
    // Records the gfp_t flags actually used for the allocation
    gfp_t alloc_gfp;
    // Initialize an allocation context structure
    struct alloc_context ac = { };

    // Check whether the requested order exceeds the maximum the system allows.
    // MAX_ORDER bounds the largest allocation order supported.
    if (unlikely(order >= MAX_ORDER)) {
        // If the requested order is out of range and __GFP_NOWARN is not set, warn
        WARN_ON_ONCE(!(gfp & __GFP_NOWARN));
        // Return NULL to indicate allocation failure
        return NULL;
    }

    // Clear any gfp flags that are not allowed, using the gfp_allowed_mask mask
    gfp &= gfp_allowed_mask;

    // Apply the current task's gfp context, mainly to inherit flags such as GFP_NOFS and GFP_NOIO
    gfp = current_gfp_context(gfp);
    alloc_gfp = gfp;

    // Prepare the page allocation: set up the allocation context, node and other parameters
    if (!prepare_alloc_pages(gfp, order, preferred_nid, nodemask, &ac,
                &alloc_gfp, &alloc_flags))
        return NULL;

    // Forbid the first allocation attempt from falling back to migrate types that
    // may fragment memory, until all local zones have been considered
    alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp);

    // Make the first allocation attempt
    page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
    if (likely(page))
        goto out;

    // Reset alloc_gfp to the original gfp flags
    alloc_gfp = gfp;
    ac.spread_dirty_pages = false;

    // Restore the original nodemask, in case it was replaced during the fast-path attempt
    ac.nodemask = nodemask;

    // Fall back to the slow path for page allocation
    page = __alloc_pages_slowpath(alloc_gfp, order, &ac);

out:
    // If kernel memory accounting for memory cgroups is enabled and the allocation
    // succeeded, try to charge the page to the cgroup
    if (memcg_kmem_enabled() && (gfp & __GFP_ACCOUNT) && page &&
        unlikely(__memcg_kmem_charge_page(page, gfp, order) != 0)) {
        // If charging fails, free the page and set page to NULL
        __free_pages(page, order);
        page = NULL;
    }

    // Record the page allocation event through the trace macro
    trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);

    // Return the allocated page, or NULL on failure
    return page;
}

Pay attention to this line of code: page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac). This is where allocation from the free_area lists starts.

4. get_page_from_freelist():

static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
                        const struct alloc_context *ac)
{
    .......

    // Try to allocate a page from the current zone
try_this_zone:
    // Take pages from the preferred zone via rmqueue()
    page = rmqueue(ac->preferred_zoneref->zone, zone, order,
            gfp_mask, alloc_flags, ac->migratetype);
    // Check whether a page was successfully obtained
    if (page) {
        // Prepare the newly allocated page
        prep_new_page(page, order, gfp_mask, alloc_flags);

        // If this is a high-order atomic allocation, check whether the
        // page block should be reserved for future high-order atomic requests
        if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
            reserve_highatomic_pageblock(page, zone, order);

        // Return the newly allocated page
        return page;
    } else {
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
        // If there are deferred-initialized pages, grow the zone and try it again
        if (static_branch_unlikely(&deferred_pages)) {
            if (_deferred_grow_zone(zone, order))
                goto try_this_zone;
        }
#endif
    }

    .......

    // If every zone has been tried without finding a usable page while in
    // fragmentation-avoidance mode, drop that restriction and retry
    if (no_fallback) {
        alloc_flags &= ~ALLOC_NOFRAGMENT;
        goto retry;
    }

    // If no page could be allocated, return NULL
    return NULL;
}

We have finally reached rmqueue(), the function that actually removes pages from the free lists. As the name suggests ("remove from queue"), this is where the core of the buddy algorithm sits. Let's look at its implementation.

5. rmqueue():

// Remove a page block of the requested order from the given zone
struct page *rmqueue(struct zone *preferred_zone,
            struct zone *zone, unsigned int order,
            gfp_t gfp_flags, unsigned int alloc_flags,
            int migratetype)
{
    unsigned long flags;    // used to save the interrupt state
    struct page *page;      // the page that will be returned

    // Check whether this order may be served from the per-CPU lists
    if (likely(pcp_allowed_order(order))) {
        // CMA (contiguous memory allocation) handling:
        // if CMA is not enabled, or the allocation flags allow CMA,
        // or the migration type is not MIGRATE_MOVABLE,
        // then take the page from the per-CPU list
        if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
                migratetype != MIGRATE_MOVABLE) {
            page = rmqueue_pcplist(preferred_zone, zone, order,
                    gfp_flags, migratetype, alloc_flags);
            goto out;
        }
    }

    // WARN_ON_ONCE is a debugging helper that warns once when the condition is true
    WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
    spin_lock_irqsave(&zone->lock, flags); // take the zone spinlock and save the interrupt state

    do {
        page = NULL;
        // For high-order atomic allocations, first try the MIGRATE_HIGHATOMIC migration type
        if (order > 0 && alloc_flags & ALLOC_HARDER) {
            page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
            if (page)
                trace_mm_page_alloc_zone_locked(page, order, migratetype);
        }
        // If no page was found, try allocating from the other migration types
        if (!page)
            page = __rmqueue(zone, order, migratetype, alloc_flags);
    } while (page && check_new_pages(page, order));
    if (!page)
        goto failed;

    // Adjust the zone's free-page counters
    __mod_zone_freepage_state(zone, -(1 << order),
                  get_pcppage_migratetype(page));
    spin_unlock_irqrestore(&zone->lock, flags); // release the spinlock and restore the interrupt state

    // Record the page allocation event
    __count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
    // Update statistics
    zone_statistics(preferred_zone, zone, 1);

out:
    // If the zone's watermark was boosted, clear the flag and wake kswapd to reclaim more pages
    if (test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags)) {
        clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
        wakeup_kswapd(zone, 0, 0, zone_idx(zone));
    }

    // Debugging check: make sure the returned page lies within the given zone
    VM_BUG_ON_PAGE(page && bad_range(zone, page), page);
    return page;

failed:
    spin_unlock_irqrestore(&zone->lock, flags); // on failure, release the spinlock and restore the interrupt state
    return NULL;
}
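Before drilling further down, note the rmqueue_pcplist() branch at the top of rmqueue(): for orders that pcp_allowed_order() accepts, the kernel first tries a per-CPU list of recently freed pages and only takes the zone lock and walks the buddy free lists when that cache cannot help. The user-space toy below is purely my own illustration of that fast-path/slow-path shape (all names invented):

#include <stdio.h>

// Toy per-CPU cache: a handful of single pages kept aside so that the
// common case does not need to take the zone lock.
static unsigned long pcp_cache[8];
static int pcp_count;

static long toy_buddy_alloc(void)              // stand-in for the locked slow path
{
    static unsigned long next_pfn = 100;
    return (long)next_pfn++;
}

static long toy_alloc_one_page(void)
{
    if (pcp_count > 0)
        return (long)pcp_cache[--pcp_count];   // fast path: per-CPU list
    return toy_buddy_alloc();                  // slow path: buddy free lists
}

static void toy_free_one_page(unsigned long pfn)
{
    if (pcp_count < 8) {
        pcp_cache[pcp_count++] = pfn;          // keep it cached for quick reuse
        return;
    }
    // a real kernel would drain a batch back to the buddy lists here
}

int main(void)
{
    long a = toy_alloc_one_page();             // comes from the "buddy" slow path
    toy_free_one_page((unsigned long)a);
    long b = toy_alloc_one_page();             // now served from the per-CPU cache
    printf("a=%ld b=%ld (same page reused: %s)\n", a, b, a == b ? "yes" : "no");
    return 0;
}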

The call to __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC), together with the __rmqueue() fallback that covers the other migration types, is where physical pages are actually taken from the buddy free lists.

6. __rmqueue_smallest():

// Take the smallest suitable free block from the given zone's free lists
static __always_inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
                        int migratetype)
{
    unsigned int current_order; // used to loop through the orders
    struct free_area *area;     // points to the free area of a given order
    struct page *page;          // page found in the free area

    // Starting from the requested order, walk up through the larger orders
    for (current_order = order; current_order < MAX_ORDER; ++current_order) {
        area = &(zone->free_area[current_order]); // the free area of the current order
        page = get_page_from_free_area(area, migratetype); // take a block from this area
        if (!page) // nothing free at this order, continue with the next one
            continue;
        // A block was found: remove it from the free list
        del_page_from_free_list(page, zone, current_order);
        // Split the block into smaller ones if it is larger than requested
        expand(zone, page, order, current_order, migratetype);
        // Set the page's migration type
        set_pcppage_migratetype(page, migratetype);
        return page; // return the block found
    }

    return NULL; // no suitable block was found
}

Next, let’s take a closer look at the implementation of the del_page_from_free_list() function:

// Remove the given page from the free list of the given zone
static inline void del_page_from_free_list(struct page *page, struct zone *zone,
                       unsigned int order)
{
    // If the page has been reported, clear its reported state
    if (page_reported(page))
        __ClearPageReported(page);

    // Remove the page from the free list
    list_del(&page->lru);

    // Clear the page's Buddy flag: it is no longer a free buddy block
    __ClearPageBuddy(page);

    // Set the page's private field to 0
    set_page_private(page, 0);

    // Decrease the free-block count of the current order
    zone->free_area[order].nr_free--;
}

Simply put, this function does the following:

  1. If the page has been reported before, it clears the reporting status of the page.
  2. It removes the page from the free list (using a doubly linked list).
  3. It clears the Buddy attribute of the page, indicating that the page is no longer part of the free page.
  4. Set the private field of the page to 0.
  5. Update the free page count of the current order and decrease it by 1.

From the analysis above we can see that __rmqueue_smallest() first searches the free_area array at the requested order; if a free block of that size exists it is allocated, and if not, ++current_order moves the search to the next higher order, with any surplus split back onto the lower-order lists by expand(). This matches the buddy algorithm as described in the theory section. With this in hand, analyzing how pages are freed (and buddies merged back together) is left as an exercise for the reader, so I won't cover it in detail here.
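As a starting point for that exercise: the kernel locates a block's buddy purely by arithmetic on the page frame number. For a block of order n starting at pfn, its buddy starts at pfn XOR 2^n, which is what __find_buddy_pfn() computes; on free, the two are merged whenever the buddy is also a free block of the same order. A quick user-space illustration:

#include <stdio.h>

// The buddy of the 2^order block starting at 'pfn' is found by flipping the
// bit that distinguishes the two halves of the enclosing 2^(order+1) block;
// the kernel's __find_buddy_pfn() uses the same XOR.
static unsigned long find_buddy_pfn(unsigned long pfn, unsigned int order)
{
    return pfn ^ (1UL << order);
}

int main(void)
{
    // Order-1 blocks (2 pages each): pfn 4 and pfn 6 are buddies, while
    // pfn 6 and pfn 8 are not, because they have different order-2 parents.
    printf("buddy of pfn 4 at order 1: %lu\n", find_buddy_pfn(4, 1)); // 6
    printf("buddy of pfn 6 at order 1: %lu\n", find_buddy_pfn(6, 1)); // 4
    printf("buddy of pfn 8 at order 1: %lu\n", find_buddy_pfn(8, 1)); // 10
    return 0;
}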

4. Experience

Working through the buddy allocator's source convinced me that learning the kernel cannot stop at theory; you have to read the Linux source alongside it. The process is hard for beginners, but it is genuinely valuable for building one's abilities. When first analyzing source code, don't chase every detail: grasp the main line relevant to your goal first, and only then branch out into the related code.

Looking back at this analysis, there are still many places I did not handle well: many functions and conditions confused me, and I still don't fully understand the design of some kernel flag bits, the locking, and so on. I will keep studying these topics alongside the source code.