Alex Constantin-Gomez

An essential part of memory management is how memory is allocated and eventually deallocated. By default, memory in C++ programs is allocated using the new keyword and deallocated using the delete keyword. However, sometimes we want more control over how and where objects are allocated/deallocated to avoid issues like fragmentation.
The C++ standard library allows programmers to write custom allocators which can be used by STL containers for dynamic memory allocation, rather than using the standard allocator.
Allocators can be used to improve performance-related issues such as fragmentation, per-thread allocation and NUMA-friendly allocation.
We will see some examples in this post and their benefits, but before we should mention the different properties and requirements an allocator should have in C++.

The allocator API

In C++, an allocator is a template class that allocates and deallocates memory for a specific type T. There are two types of allocators:

Equal allocators: two equal allocators can be used to allocate and deallocate memory for a type T interchangeably. These are usually stateless allocators.
Unequal allocators: two unequeal allocators cannot be used to allocate and deallocate memory interchangeably. These are usually stateful allocators.

An allocator class should offer a T* allocate(size_t n) method to allocate n number of objects of type T and a void deallocate(T* p, size_t n) method to deallocate an object of type T.

Additionally, we need to provide an empty copy constructor using a template of type U for full compatibility with STL containers, because the container may also need to allocate internal objects (such as linked list nodes) in addition to objects of type T.

The most simple allocator using malloc() can be implemented as follows:

template <typename T>
class SimpleAllocator {
public:
    using value_type = T;
    using is_always_equal = std::true_type; // bypass writing the equality operators == and !=

    SimpleAllocator() = default;
    
    template <typename U>
    SimpleAllocator(const SimpleAllocator<U>& other) {
        (void) other;
    }
    
    T* allocate(size_t n) {
        auto ptr = static_cast<T*>(malloc(sizeof(T) * n));
        if (ptr)
            return ptr;
            
        throw std::bad_alloc();
    }
    
    void deallocate(T* ptr, size_t n) {
        (void) n;
        free(ptr);
    }
};

Because SimpleAllocator is an equal allocator, it is legal to do the following:

SimpleAllocator<double> a1;
SimpleAllocator<double> a2;

double* ptr = a1.allocate(1);   // allocate a double with a1
a2.deallocate(ptr, 1);          // deallocate the memory with a2

Example 1: A stateless cache-aligned allocator

As an extension to our first stateless allocator which does nothing interesting, we will now implement an allocator that actually does something useful. In this case, our allocator will automatically eliminate false sharing in an STL container being accessed by multiple threads.

Briefly, the solution to false sharing is to align the shared memory locations such that they end up in different cache lines. On x86 CPUs, L1 cache lines are typically 64 bytes, so our allocator should allocate objects at 64 byte boundaries (or whatever the size for your system, which you can get using std::hardware_destructive_interference_size). Here is the code:

template <typename T, size_t Alignment = std::hardware_destructive_interference_size>
class CacheAlignedAllocator {
public:
    using value_type = T;
    using is_always_equal = std::true_type;

    template <typename U>
    struct rebind {
        using other = CacheAlignedAllocator<U, Alignment>;
    };

    CacheAlignedAllocator() = default;

    template <typename U>
    CacheAlignedAllocator(const CacheAlignedAllocator<U, Alignment>& other) {
        (void) other;
    }

    T* allocate(size_t n) {
        auto ptr = static_cast<T*>(aligned_alloc(Alignment, sizeof(T) * n));
        if (ptr)
            return ptr;

        throw std::bad_alloc();
    }

    void deallocate(T* ptr, size_t n) {
        (void) n;
        free(ptr);
    }
};

For allocation, we use the function aligned_alloc() (part of C++). Everything else is the same, except for the struct rebind. Typically, this struct is generated by the compiler automatically for us but because our allocator takes in more than one template argument (in case the user wants to supply a different alignment amount), we must manually define our own rebinder (which is used by STL containers to create new allocators without having to call the copy constructor).
You can now specify the custom allocator as a policy template argument to an STL container and verify that all allocations are indeed 64-byte aligned (by inserting a print statement in the allocate method):

std::vector<int, CacheAlignedAllocator<int>> vec;
for (int i = 0; i < 5; i++) {
    vec.emplace_back(i);
}

Running this benchmark I have prepared to demonstrate the effects of false sharing in a multi-threaded program, I obtain:

std::allocator mean: 2470 ms
CachedAlignedAllocator mean: 2192 ms

Measuring latency in this case may not be fully representative because it takes into account thread creation and context switching which adds jitter to the result. However, you can look at the L1 cache misses using perf on Linux and verify that we get less L1 cache misses using the custom allocator.

Example 2: A stateful pool allocator

Now, we look at a classic use case of custom memory allocators: a pool allocator. The goal of a pool-based allocator is to quickly allocate memory for a fixed-type objects while reducing internal fragmentation of memory.
Pool allocators work by allocating large blocks of memory in advance and dividing this block for individual allocations. This means that memory allocation is much faster than calling malloc(), which is slow.
Because a pool allocator has to manage a list of blocks, it is a stateful allocator (and therefore unequal). Firstly, we need to create a Pool class that manages the memory of chunks of a given size. We use a stack of addresses to quickly pop an available address when we need to allocate an object and push back a newly available address when we deallocate an object at that address.

template <size_t BlockSize, size_t ReservedBlocks = 0>
class Pool {
private:
    size_t size_;
    std::stack<void *> addrs_;
    std::stack<std::unique_ptr<uint8_t[]>> blocks_;

public:
    explicit Pool(size_t size) : size_(size) {
        for (size_t i = 0; i < ReservedBlocks; i++) {
            add_more_addresses();
        }
    }

    void* allocate() {
        if (addrs_.empty()) {
            add_more_addresses();
        }

        auto ptr = addrs_.top();
        addrs_.pop();
        return ptr;
    }

    void deallocate(void *ptr) {
        addrs_.push(ptr);
    }

    void rebind(size_t size) {
        if (!(addrs_.empty() && blocks_.empty())) {
            std::cerr << "Cannot call Pool::rebind() after alloc\n";
            abort();
        }

        size_ = size;
    }

private:
    // Refill the address stack by allocating another block of memory
    void add_more_addresses() {
        auto block = std::make_unique<uint8_t[]>(BlockSize);
        auto total_size = BlockSize % size_ == 0 ? 
                            BlockSize : BlockSize - size_;

        // Divide the block into chunks of size_ bytes, and add their addrs
        for (size_t i = 0; i < total_size; i += size_) {
            addrs_.push(&block.get()[i]);
        }

        // Keep the memory of the block alive by adding it to our stack
        blocks_.push(std::move(block));
    }
};

In our constructor, we specify the fixed-size number of bytes we want to allocate (this will be passed in later as sizeof(T)) and we reserve memory blocks if specified by the template parameter. Importantly, we use a std::unique_ptr to automatically free the memory when we are done using the allocator.
The allocate() method will simply return the first address on the stack in O(1) time (unless the stack is empty) and our deallocate() just puts back the address onto the stack. We also provide a rebind() method to keep STL containers happy.

Now, we need to create the actual PoolAllocator class which manages a Pool instance:

template <typename T, size_t BlockSize = 4096, size_t ReservedBlocks = 0>
class PoolAllocator {
private:
    using PoolType = Pool<BlockSize, ReservedBlocks>;
    std::shared_ptr<PoolType> pool_;

public:
    using value_type = T;
    using is_always_equal = std::false_type;

    template <typename U>
    struct rebind {
        using other = PoolAllocator<U, BlockSize, ReservedBlocks>;
    };

    PoolAllocator() : pool_(std::make_shared<PoolType>(sizeof(T))) {}

    // Rebind copy constructor
    template <typename U>
    PoolAllocator(const PoolAllocator<U>& other) : pool_{other.pool_} {
        pool_->rebind(sizeof(T));
    }

    PoolAllocator(const PoolAllocator& other) = default;
    PoolAllocator(PoolAllocator&& other) = default;
    PoolAllocator& operator=(const PoolAllocator& other) = default;
    PoolAllocator& operator=(PoolAllocator&& other) = default;

    T* allocate(size_t n) {
        if (n > 1) {
            return static_cast<T*>(malloc(sizeof(T) * n));
        }

        return static_cast<T*>(pool_->allocate());
    }

    void deallocate(T* ptr, size_t n) {
        if (n > 1) {
            free(ptr);
            return;
        }

        pool_->deallocate(ptr);
    }
};

Our allocator class looks very similar as our previous one. The constructor creates a Pool instance which will be used to allocate and deallocate memory. The allocate() and deallocate() methods simply pass the call onto the pool instance. Note that our allocator only supports individual allocations: if n > 1, we simply use the standard allocator via malloc() and free().
Again, we must also provide a rebinder because our class has more than one template parameter and a rebind copy constructor which just passes the call down to the pool instance. We also need to provide default copy/move constructors and assignment operators.

Our benchmark in this case will measure the time it takes to add 1 million integers to a std::list (see the code here). We compare the standard allocator, a pool allocator with no reserved blocks in-advance, one with 100 reserved blocks and another with 1000 reserved blocks:

std::allocator<int>            mean: 21611 μs
PoolAllocator<int>             mean: 12885 μs
PoolAllocator<int, 4096, 100>  mean: 5718 μs
PoolAllocator<int, 4096, 1000> mean: 5686 μs

Even without reserving blocks in advance, we get around x2 speed up! And as expected, if we reserve blocks in advance, we get a greater speed up.
Pool allocators are so useful that they were introduced into the standard library in C++17. You can access them in the header (polymorphic memory resource) and use them nicely with STL containers.

Example 3: A huge page allocator

Our final example is just to show off how you can write highly-specific allocators using a concrete feature of Linux: huge pages.

Typically, the OS will allocate memory in fixed-size pages of 4 kB. However, if we have a very large in-memory data structure spanning multiple pages which may be randomly accessed (such as a hash table), we are likely going to suffer from a high number of TLB misses (minor page faults) during virtual address translation. The OS can reduce this miss rate by allocating large-sized pages, known as huge pages. These are typically 2 MB, but can be even larger.

Linux introduces the idea of transparent huge pages. Linux is able to promote automatically a regular page into a huge page transparently (if given the sufficient hints). We can enable THP support for a memory range using the madvise() system call, however Linux does not guarantee that a HP will be allocated. We can use posix_memalign() to hint at the kernel that we really want to allocate a huge page. Let's see the code:

template <typename T, size_t HugePageSize = 1 << 21> 
class THPAllocator {
public:
    using is_always_equal = std::true_type;
    using value_type = T;

    template <typename U>
    struct rebind {
        using other = THPAllocator<U, HugePageSize>;
    };

    THPAllocator() = default;

    template <class U>
    constexpr THPAllocator(const THPAllocator<U>& other) {
        (void) other;
    }

    T *allocate(size_t n) {
        if (n > std::numeric_limits<std::size_t>::max() / sizeof(T)) {
            throw std::bad_alloc();
        }
        const auto total_size = n * sizeof(T);
        void *p = nullptr;
        if (posix_memalign(&p, HugePageSize, total_size) != 0) {
            throw std::bad_alloc();
        }
        
        madvise(p, total_size, MADV_HUGEPAGE);
        if (p == nullptr) {
            throw std::bad_alloc();
        }

        return static_cast<T *>(p);
    }

    void deallocate(T *p, size_t n) { 
        (void) n;
        free(p); 
    }
};

The benchmark is very simple: we try to add 8 MB worth of integers to a vector and see how long it takes. We obtain:

std::allocator mean: 4414 μs
THPAllocator   mean: 2584 μs

It is also worth noting that requesting the kernel to promote a regular page into a huge page may cause latency spikes, because the kernel needs to update the page tables accordingly. Also, the kernel may decide to compact unused pages in order to create a huge page on demand, which can also lead to latency spikes.

Closing thoughts

We have seen how C++ easily lets us implement custom memory allocators for different applications and use cases. The C++ allocator API can go into much more detail and I am not an expert on memory allocators, this post is just an example highlighting the benefits of memory allocators and a simple demo on how to implement them.

Writing custom memory allocators in C++

The allocator API

Example 1: A stateless cache-aligned allocator

Example 2: A stateful pool allocator

Example 3: A huge page allocator

Closing thoughts