34|Quickly allocate and release memory: memory pool

After talking about performance testing in the previous lecture, we can finally return to the topic of memory pools and discuss it in depth.

A test case

If you want to use a memory pool, my first question is: do you really need one?

Here are some reasons why you might want to use memory pools:

Want to reduce the time overhead of memory allocation and release – faster allocation and release

Want to reduce the space overhead of memory allocation and freeing – less overall memory usage

Here are some reasons against using memory pools:

Your general-purpose memory allocator is probably already fast enough

Using a memory pool can make it more difficult for the operating system to reclaim memory you no longer need.

Using a memory pool may make it more difficult for your objects to interact with other objects (see Lecture 32, where allocators were part of the container type before PMR)

Of course, since you have read this far, you presumably still want to use a memory pool. But we need to be able to measure the effect of using one, so we need a test.

To performance-test an operation, you need some kind of “typical scenario”. As an example, I use a sequence of randomized operations as the test scenario. Specifically, what I did was:

Generate random numbers

Insert these random numbers into an unordered_set and measure the time required

Delete these random numbers one by one from this unordered_set and measure the time required

Then reinsert these random numbers into the unordered_set and measure the time required

This is not a perfect example, but it does allow us to observe the role of the memory pool. If you have a real scenario, you can also use this method for testing.

Our object and type under test are very simple:

using TestType = unordered_set<int>;
TestType s;

The code to generate random numbers is a little more complicated:

mt19937 engine;
uniform_int_distribution<int> dist;
array<int, LEN> rand_nums{};
for (int& num : rand_nums) {
  num = dist(engine);
}

We want stable test results across platforms, so we explicitly choose the well-regarded pseudo-random number engine mt19937 (otherwise the default engine default_random_engine would be fine). We only need a simple uniform distribution, so we use a default-constructed uniform_int_distribution, which, with no range given, generates numbers over the entire legal range of int. We then write a random integer to each item of the array of length LEN (note that the range-for over rand_nums must use a reference). Since we do not seed the engine with a truly random seed, the random numbers are the same on every run, which keeps the test stable.

The initial insertion is simple: just insert every item of the array rand_nums into s. Because the operations on s are fairly involved, the compiler is unlikely to optimize them away, regardless of whether s is a global or a local variable. So our test can simply measure the overall time taken:

t1 = rdtsc();
for (int num : rand_nums) {
  s.insert(num);
}
t2 = rdtsc();
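
The rdtsc() helper itself is not shown in this lecture. A minimal sketch, assuming an x86-64 target and the timestamp-counter intrinsic available in GCC, Clang, and MSVC, might look like this:

#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

// Read the CPU timestamp counter; the
// cycle counts reported below are the
// difference between two readings,
// divided by the number of operations.
inline std::uint64_t rdtsc()
{
  return __rdtsc();
}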

The deletion operation is similar. We still use rand_nums to delete each item in s:

t1 = rdtsc();
for (int num : rand_nums) {
  s.erase(num);
}
t2 = rdtsc();

Finally, we repeat the insertion process again to see if there is any change in the performance of reinsertion.

The following are preliminary test results under a certain hardware environment.

Apple Clang 12.0, macOS Catalina 10.15:

It took 449 cycles by average to insert a number

It took 492 cycles by average to erase a number

It took 305 cycles by average to insert a number again

MSVC 19.29, Windows 10:

It took 366 cycles by average to insert a number

It took 185 cycles by average to erase a number

It took 300 cycles by average to insert a number again

GCC 10.3, Ubuntu Linux 20.04 LTS:

It took 307 cycles by average to insert a number

It took 162 cycles by average to erase a number

It took 176 cycles by average to insert a number again

As you can see, the results vary greatly across platforms and compilers. But we do see that reinsertion is faster than the first insertion, especially on Linux.

These are just the results with the default memory allocator; different allocators produce different results. For example, replacing the default allocator with tcmalloc on Linux gives better numbers:

It took 250 cycles by average to insert a number

It took 116 cycles by average to erase a number

It took 117 cycles by average to insert a number again

Depending on the performance of the memory allocator on the platform you are using, and depending on whether you need better memory allocation performance across platforms, memory pools may or may not be of much use to you. For now, I’ll just assume that memory pools will be useful to you (since you’ve read this far).

PMR memory pool

With the test case in hand, we can check how the memory pool provided by the polymorphic allocators (PMR, Lecture 32) performs. We only need a small modification to the test case, changing the two TestType-related lines to the following:

using TestType = pmr::unordered_set<int>;
pmr::unsynchronized_pool_resource res;
pmr::polymorphic_allocator<int> a{&res};
TestType s(a);

Depending on the platform, you may see different performance results. For example, on Linux I got the following test results:

It took 272 cycles by average to insert a number

It took 210 cycles by average to erase a number

It took 169 cycles by average to insert a number again

The results on Linux are mixed (insertion got faster, but erasure got slower), while on macOS and Windows I saw larger, across-the-board improvements. For cross-platform applications, such a memory pool can indeed be effective.

Note that the unsynchronized_pool_resource I used above performs no multi-thread synchronization. A pool with synchronization is another story: on Linux the performance actually drops, and on the other platforms the improvement is not obvious. Generally speaking, general-purpose allocators are already heavily optimized for multi-threaded use, and they may well outperform a typically simple memory-pool implementation. At least from an execution-time perspective, memory pools are best used in single-threaded or thread-local (thread_local) scenarios.

Customized memory pool

In Lecture 31 I mentioned that a highly optimized memory pool can be built by exploiting the fact that objects of the same type have exactly the same size, but that using it only through class-specific allocation and deallocation functions limits its applicability. Below I describe a memory pool built on this idea that can be used both with class-specific allocation and deallocation functions and with a container allocator.

Basic Strategy

The most basic requirement of a memory pool is to reduce the number of requests made to the system’s memory allocator. Therefore, we want to obtain a large memory block (chunk) in a single allocation and then split it up for use by individual objects. Such a chunk is usually an integer multiple of some basic size.

Next, we have two different approaches:

Requests of a given size (or within a given size range) are allocated from and released back to one particular memory pool.

Requests for objects of a specific type are allocated from and released back to one particular memory pool.

The first approach is similar to SGI STL, while the second approach is an additional optimization opportunity given to us by the C++ memory allocation mechanism. Both approaches have some advantages and disadvantages, and I currently adopt the second approach, mainly considering the following factors:

Objects of different types use different memory pools even when they have the same size. In many scenarios, keeping objects of the same type together gives the program better locality.

The object size can be derived from the object type, but not vice versa. In other words, the current approach can always be degraded to a size-based one if needed, so it is the more general of the two.

Another choice I made was to not return memory to the memory allocator in most cases, because:

Returning memory to the allocator is more likely to cause memory fragmentation, which can later lead to allocation failures or higher memory consumption.

Even when memory is returned to the allocator, the allocator usually cannot return it to the operating system (again due to fragmentation), so doing so does not reduce the program’s runtime memory footprint.

Not returning memory keeps the implementation simple, and the code smaller and faster.

Some of my experiments also show that it is hard for a memory pool to decide when to return memory to the allocator: if a memory chunk is returned as soon as it becomes completely empty, the number of requests the program makes to the allocator rises significantly. The only scenario I can think of where my choice hurts is when the number of live objects fluctuates dramatically: at one moment the program creates a large number of A objects and then releases them; at another moment it creates a large number of B objects and then releases them. Only in such cases does the non-returning choice increase the program’s peak memory usage. I will not consider this particular scenario for now.

Object memory pool

Based on the above discussion, we need a data structure for a memory chunk, and we need to decide how many objects to put in each chunk. For the latter we use a variable template that can be specialized:

template <typename T>
inline constexpr size_t
  memory_chunk_size = 64;

That is, the default size of memory_chunk_size is 64, but you can specialize it for a specific type and change its size. For example, if you want to change the size to 32 for a specific Obj type of yours, you can write:

template <>
inline constexpr size_t
  memory_chunk_size<Obj> = 32;

Of course, under normal circumstances you don’t need to do this. In most scenarios where a memory pool is needed, the default size works just fine.

Next, we need to define a data structure whose nodes can either hold an object of some type or be strung together into a linked list. A union is the obvious choice:

union node {
  T data;
  node* next;
};

The advantage of storing a T directly is that we naturally get T’s alignment without resorting to cumbersome tools such as alignas. However, there is a small complication: when T has a non-trivial constructor or destructor, the code above fails to compile, because the compiler does not know what to do when constructing and destroying the union. Since we use node only to manage memory, we simply provide empty constructors and destructors (note that = default cannot be used here). In addition, such memory nodes obviously should not be copied, so we also delete the copy constructor and the copy assignment operator.

union node {
  T data;
  node* next;
  node() {}
  ~node() {}
  node(const node &) = delete;
  node&
  operator=(const node &) = delete;
};

Then, we can define the memory block:

template <typename T>
class memory_chunk {
public:
  union node {
    …
  };
  memory_chunk(
    memory_chunk* next_chunk);
  node* get_free_nodes()
  {
    return storage_.data();
  }
  memory_chunk* get_next() const
  {
    return next_chunk_;
  }

private:
  memory_chunk* next_chunk_{};
  array<node, memory_chunk_size<T>>
    storage_;
};

A memory chunk is an array of nodes plus a pointer to the next chunk, so that the chunks can be strung together into a linked list. We initialize a chunk in its constructor:

template <typename T>
memory_chunk<T>::memory_chunk(
  memory_chunk* next_chunk)
  : next_chunk_(next_chunk)
{
  for (size_t i = 0;
       i < storage_.size() - 1;
        ++i) {
    storage_[i].next =
      &storage_[i + 1];
  }
  storage_[storage_.size() - 1]
    .next = nullptr;
}

The pointer to the “next” chunk is passed in from outside. Within the array of nodes, we make each node’s next pointer point to the following node, except for the last node, whose next pointer is null. In other words, we string the nodes into a linked list for the memory pool to use later.

With these raw materials, our memory pool can be easily written. The class is defined as follows:

template <typename T>
class memory_pool {
public:
  using node =
    typename memory_chunk<T>::node;
  memory_pool() = default;
  memory_pool(const memory_pool &) =
    delete;
  memory_pool& operator=(
    const memory_pool&) = delete;
  ~memory_pool();
  T* allocate();
  void deallocate(T* ptr);

private:
  node* free_list_{};
  memory_chunk<T>* chunk_list_{};
};

As you can see, the memory pool has only two member variables, free_list_ and chunk_list_, and three member functions: the destructor, allocate, and deallocate. free_list_ is the linked list of free nodes, and chunk_list_ is the linked list of all memory chunks. Of the three member functions, the destructor is responsible for releasing all the chunks:

template <typename T>
memory_pool<T>::~memory_pool()
{
  while (chunk_list_) {
    memory_chunk<T>* chunk =
      chunk_list_;
    chunk_list_ =
      chunk_list_->get_next();
    delete chunk;
  }
}

allocate is responsible for memory allocation:

template <typename T>
T* memory_pool<T>::allocate()
{
  if (free_list_ == nullptr) {
    chunk_list_ =
      new memory_chunk<T>(
        chunk_list_);
    free_list_ =
      chunk_list_->get_free_nodes();
  }
  T* result = &free_list_->data;
  free_list_ = free_list_->next;
  return result;
}

We first check whether the free list free_list_ is empty. If it is, the pool has no memory available for objects, so we request a new memory chunk, make chunk_list_ point to this new chunk, and make free_list_ point to its first node. After that, allocation is simply a matter of taking one node off the free list and adjusting the head pointer.

deallocate is of course responsible for the release of memory:

template <typename T>
void memory_pool<T>::deallocate(T* ptr)
{
  auto free_item =
    reinterpret_cast<node*>(ptr);
  free_item->next = free_list_;
  free_list_ = free_item;
}

This is even simpler: we treat the pointer passed in by the user as a node pointer and push it back onto the free list.

By the way, for pointer adjustments like these, the std::exchange utility provided by the standard library can make the code more concise. For example, the last three statements of allocate can be reduced to one: return &exchange(free_list_, free_list_->next)->data;.
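
For reference, this is what allocate looks like with that change (a sketch, assuming <utility> is included and, as elsewhere in this lecture, using namespace std is in effect):

template <typename T>
T* memory_pool<T>::allocate()
{
  if (free_list_ == nullptr) {
    chunk_list_ =
      new memory_chunk<T>(chunk_list_);
    free_list_ =
      chunk_list_->get_free_nodes();
  }
  // exchange returns the old head of the
  // free list while advancing free_list_
  // to the next node
  return &exchange(free_list_,
                   free_list_->next)
            ->data;
}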

Memory pool application: class-specific allocation and release functions

Although class-specific allocation and deallocation functions are not used that often, we can still look at how to use memory pools in this simplest application scenario. This also allows us to measure the memory pool benefits in this extreme case.

As mentioned before, to use class-specific allocation and deallocation functions for some class Obj, we only need to declare two member functions:

class Obj {
public:
  …
  void* operator new(size_t size);
  void operator delete(
    void* ptr) noexcept;
};

Here I have omitted the static before the declarations, which is allowed and has the same effect (these two member functions are static whether or not static is written). We can then use the memory pool by adding the following to the class’s implementation file (not the header):

memory_pool<Obj> obj_pool;

void* Obj::operator new(size_t size)
{
  assert(size == sizeof(Obj));
  return obj_pool.allocate();
}

void Obj::operator delete(
  void* ptr) noexcept
{
  obj_pool.deallocate(
    static_cast<Obj*>(ptr));
}

Running a large number of new and delete operations on such objects, and on otherwise identical objects without class-specific allocation and deallocation functions, I got the following on Linux (the mainstream platform with the best default allocation and deallocation performance):

107 cycles for each allocation and deallocations on normal Obj

8 cycles for each allocation and deallocations on pooled Obj

When the memory pool is not used, each allocation and release takes an average of 107 clock cycles, and when the memory pool is used, it is reduced to 8 clock cycles.
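
The measurement loop itself is not listed here. A rough sketch of how it might be written (reusing the rdtsc() helper above; LEN, the use of vector, and the output format are my assumptions) is:

// Hypothetical measurement loop: allocate
// LEN objects with new, release them with
// delete, and report the average cycles
// per allocation/deallocation pair.
vector<Obj*> ptrs(LEN);
auto t1 = rdtsc();
for (auto& ptr : ptrs) {
  ptr = new Obj();
}
for (auto ptr : ptrs) {
  delete ptr;
}
auto t2 = rdtsc();
cout << (t2 - t1) / LEN
     << " cycles for each allocation"
        " and deallocation\n";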

If you use tcmalloc, the difference is smaller:

27 cycles for each allocation and deallocations on normal Obj

8 cycles for each allocation and deallocations on pooled Obj

It only takes 27 clock cycles without using the memory pool.

Memory pool application: allocator

The above test shows how much benefit a memory pool can bring, but manually using new and delete is no longer recommended. In the most common case we put objects into containers, so we need an allocator that uses the memory pool.

Besides the boilerplate required of any allocator, the core member functions we need to define are allocate and deallocate. The implementation is as follows:

template <typename T>
memory_pool<T> & get_memory_pool()
{
  thread_local memory_pool<T> pool;
  return pool;
}

template <typename T,
          typename Base =
            allocator<T>>
struct pooled_allocator
  : private Base {
  …

  T* allocate(size_t n)
  {
    if (n == 1) {
      return get_memory_pool<T>()
        .allocate();
    } else {
      return Base::allocate(n);
    }
  }

  void deallocate(T* p, size_t n)
  {
    if (n == 1) {
      return get_memory_pool<T>()
        .deallocate(p);
    } else {
      return Base::deallocate(p, n);
    }
  }
};

That is, for each specific type T, we have a dedicated thread-local memory pool. This memory pool will be created when it is first used and destroyed when the thread exits.

In allocate and deallocate we first check how many objects need to be allocated or released. The current implementation cannot handle allocations or deallocations larger than a single object, so such requests go directly to the base-class allocator, which by default is std::allocator, using operator new and operator delete. We use the thread-local memory pool only for single-object allocation and release, so this allocator is suitable for containers such as list, map, and set, which allocate memory for their elements individually, but not for containers such as vector and deque, which allocate memory in batches. The latter basically have no need for a memory pool anyway.
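
The “…” inside pooled_allocator stands for the usual allocator boilerplate, which is not spelled out here. A minimal sketch of what that body might contain (my assumption, not the original code) is:

// Assumed boilerplate inside
// pooled_allocator, not the lecture's
// exact code.
using value_type = T;

pooled_allocator() = default;
// Allow conversion from an allocator
// for another type.
template <typename U, typename B>
pooled_allocator(
  const pooled_allocator<U, B>&) {}

// Rebinding must rebind the base
// allocator as well.
template <typename U>
struct rebind {
  using other = pooled_allocator<
    U,
    typename allocator_traits<Base>::
      template rebind_alloc<U>>;
};

// The allocator is stateless (pools are
// per type and per thread), so all
// instances compare equal.
bool operator==(
  const pooled_allocator&) const
{
  return true;
}
bool operator!=(
  const pooled_allocator& other) const
{
  return !(*this == other);
}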

Using this memory pool is very simple: just set the container’s Allocator template parameter to our pooled_allocator. For the earlier test, we define TestType as follows:

using TestType = unordered_set<
  int, hash<int>, equal_to<int>,
  pooled_allocator<int>>;

Since Allocator is the last template parameter, we must also spell out the defaults of the preceding parameters, namely hash<int> and equal_to<int>. After this simple modification we can see the performance improvement. On Linux I get:

It took 199 cycles by average to insert a number

It took 112 cycles by average to erase a number

It took 110 cycles by average to insert a number again

Indeed the best performance test results yet!

Life cycle trap

Well, I told a little lie. If you implement it yourself according to the code I have given so far, you will probably see the program hang or crash on exit. The problem happens like this:

We have a global object; when it is constructed, its destructor is registered among the code to be executed when the program exits.

The first time this global object needs memory, an instance of the memory pool is initialized, and its destructor is likewise registered to be executed when the thread exits. Note that this happens later than step 1.

The memory pool is therefore destroyed before the global object (objects constructed later are destroyed earlier, whether they are global or thread-local), and its destruction releases all of the pool’s memory.

When the global object is then destroyed, any read or write of the heap memory it previously allocated from the pool is undefined behavior!

So how do we solve the problem? At present the simplest feasible solution appears to be:

Change the global object to thread_local (or, if we only need single-threaded operation, change the thread_local in get_memory_pool to static). Obviously, if the object and the memory pool are not consistent in their use of thread_local, there will be a semantic problem.

Have the constructor of pooled_allocator call get_memory_pool(), ensuring that the memory pool exists before the object is fully constructed (see the sketch after this list).

Before modifying the constructor of pooled_allocator, I could observe the program crashing under GCC and Clang; after the modification, the program runs normally.
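
A minimal sketch of that second fix, replacing the defaulted default constructor from the boilerplate sketch above, could be:

// Touch the per-type pool in the default
// constructor so the thread-local pool is
// constructed before (and thus destroyed
// after) any object allocating from it.
pooled_allocator()
{
  get_memory_pool<T>();
}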

Content summary

In this lecture we discussed memory pools in full, including how to test them and how to implement one. After studying this lecture, you should have a thorough understanding of memory pools and know under what circumstances, and how, to implement one.