Implementing pattern matching in C++17

Pattern matching is one of my favourite features found in many modern programming languages, especially functional ones like Haskell or Scala. Unfortunately, it is not part of the C++ language standard. However, with recent additions to the language and standard library, it is possible to implement very primitive pattern matching constructs. I will explain how to do this in under 50 lines of code using some of the features introduced in C++17.

Variants and visitation

The main additions in C++17 that allow us to easily implement pattern matching are std::variant and std::visit. A variant is simply a type-safe union that allows us to store one of several specified types in a memory location at any given time.
To operate on the value stored in the variant, we typically use the std::visit function, which invokes a user-supplied functor on the data, dispatching to the correct type.

The usage is simple, but verbose and cumbersome:

std::variant<int, bool, double> v{42};

struct visitor
{
    void operator()(int d)     { std::cout << "int" << std::endl; }
    void operator()(bool d)    { std::cout << "bool" << std::endl; }
    void operator()(double d)  { std::cout << "double" << std::endl; }
};

std::visit(visitor{}, v);  // Prints "int"

Alternatively, we use if constexpr statements inside a generic lambda:

std::visit([](auto d)
{
    if constexpr(std::is_same_v<decltype(d), int>)
        std::cout << "int" << std::endl;
    else if constexpr(std::is_same_v<decltype(d), bool>)
        std::cout << "bool" << std::endl;
    // ...
}, v);

Although both techniques do the job, we would like something shorter and more readable, that resembles pattern matching.

Implementing a match statement

The syntax I am aiming to implement looks like this:

std::variant<double, int, std::string> val = "test";

match(val).on(
      [](const std::string& str)    { std::cout << "str: " << str << std::endl; },
      [](int i)                     { std::cout << "int: " << i << std::endl; },
      [](double d)                  { std::cout << "double: " << d << std::endl; }
);

I find this syntax much more readable and it resembles a functional programming language.

As you can see, we take in a set of lambdas, where each one handles a particular type allowed by the variant. So, how do we implement this list of lambdas?
The first important observation is to realise that a lambda is just a struct with the operator() overloaded. Therefore, if we could "compose" multiple structs together such that we have a single struct containing an overloaded operator() for each type of the variant, the problem would be solved.
How do we "compose" these structs together then? Well, we can just use inheritance! This technique is called an overload set.

Building an overload set

As mentioned, we can compose functors through inheritance:

struct visitor_int { void operator()(int x) { std::cout << "int"; } };
struct visitor_float { void operator()(float x) { std::cout << "float"; } };

// Compose both functors via inheritance:
struct visitor : visitor_int, visitor_float
{
    using visitor_int::operator();
    using visitor_float::operator();
};

Note that we explicitly add the two using declarations to bring both overloads into the scope of visitor and enable overload resolution. Without them, a call to visitor::operator() would be ambiguous, because the compiler performs name lookup before overload resolution and finds operator() in two different base classes.

Given the example above, we can generalise the pattern by templating over the base classes:

template <typename A, typename B>
struct overload_set : A, B
{
    using A::operator();
    using B::operator();
};

Furthermore, we can generalise this pattern even more by using variadic templates in conjunction with the variadic using statements which were introduced in C++17:

template <typename... Functors>
struct overload_set : Functors...
{
    using Functors::operator()...;
};

// A C++17 deduction guide so the template arguments can be deduced from the lambdas (not needed in C++20)
template <typename... Functors>
overload_set(Functors...) -> overload_set<Functors...>;

We can now instantiate an overload_set with as many lambdas as we want:

overload_set os{
    [](int d)   { std::cout << "int" << std::endl; },
    [](float d) { std::cout << "float" << std::endl; }
};

Putting everything together

The difficult part is done; now we just need to put everything together to implement the syntax described earlier. We can do this by creating a generic class that accepts a variant and performs visitation using an overload set:

template <typename Variant>
class match
{
public:
    constexpr explicit match(const Variant& variant) : d_variant(variant) {}

    template<typename... Fs>
    constexpr auto on(Fs... fs)
    {
        return std::visit(overload_set<Fs...> { std::forward<Fs>(fs)... }, d_variant);
    }

private:
    Variant d_variant;
};

In very few lines of code, we have implemented a simple but powerful pattern matching construct that allows us to dispatch based on the type of the value of a variant.
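Since on() simply returns whatever std::visit returns, the match construct can also be used as an expression. A small usage sketch of my own (not from the original post), assuming the overload_set, its deduction guide and match defined above are in scope and that every handler returns the same type:

std::variant<double, int, std::string> val = 3.14;

std::string description = match(val).on(
    [](const std::string& str) { return "str: " + str; },
    [](int i)                  { return "int: " + std::to_string(i); },
    [](double d)               { return "double: " + std::to_string(d); }
);
// description == "double: 3.140000"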

The full code with usage examples can be found on my GitHub.

Performance comparison between virtual calls and static dispatch

Background

Virtual functions are the most popular way to achieve polymorphism in C++. Any class that declares a virtual member function causes the compiler to add a pointer to every instance of the class. This pointer is known as the virtual table pointer, which points to a table of virtual functions for that particular class. Thus, whenever a virtual function is invoked, the pointer is dereferenced to jump to the correct function. This all occurs at runtime, so it is known as dynamic dispatch (or dynamic polymorphism). Most OOP languages such as Java follow the same strategy; in fact, it is the only form of dispatching they support for instance methods. However, C++ allows another form of dispatching: static dispatch.

Static dispatch essentially means that the compiler knows in advance which method to jump to. Not only does this avoid the extra layer of indirection (because there is no vtable pointer to dereference), it also allows methods to be inlined, which gives room for even more performance benefits (whereas virtual calls generally cannot be inlined). A final but important benefit is that less space is used per instance because we do not need to store a pointer. This can be beneficial for small objects, since more instances fit in cache.

In C++, static polymorphism is achieved through the Curiously Recurring Template Pattern (CRTP).

As a refresher, the template pattern allows you to delegate specific behaviour in an algorithm to subclasses. With CRTP, the base class is a template class where the template type is a derived class, so it has direct access to the hook method implemented in the derived class. The derived class then inherits from the base class specialised with the derived class itself.

Comparison

Let's start with dynamic dispatch, by creating a simple interface that we want our subclasses to implement:

struct IShape {
    virtual ~IShape() = default;   // needed because we delete through a base pointer later
    virtual double area() = 0;
    virtual void scale(int factor) = 0;
};

class Square : public IShape {
    double m_length;
    
public:
    Square(double len) : m_length(len) {}
    double area() override { return m_length * m_length; }
    void scale(int factor) override { m_length *= factor; }
};

Now, let's implement the equivalent code using CRTP:

template<typename Impl>
class IShapeCRTP {
public:
    double area() { return impl()->area(); }
    
    void scale(int factor) { impl()->scale(factor); }
private:
    Impl* impl() { return static_cast<Impl*>(this); }
};

class SquareCRTP : public IShapeCRTP<SquareCRTP> {
    double m_length;
    
public:
    SquareCRTP(double len) : m_length(len) {}
    double area() { return m_length * m_length; }
    void scale(int factor) { m_length *= factor; }
};

Now, let's write the benchmarks for both versions. I am using Google Benchmark:

#include <benchmark/benchmark.h>

static void DynamicDispatchBenchmark(benchmark::State &state) {
    IShape* square = new Square(10);
    for (auto _ : state) {
        square->scale(3);
    }
    delete square;
}

static void StaticDispatchBenchmark(benchmark::State &state) {
    IShapeCRTP<SquareCRTP>* square = new SquareCRTP(10);
    for (auto _ : state) {
        square->scale(3);
    }
    delete square;
}

BENCHMARK(DynamicDispatchBenchmark);
BENCHMARK(StaticDispatchBenchmark);

BENCHMARK_MAIN();

The results on my laptop are:

DynamicDispatchBenchmark       2.64 ns
StaticDispatchBenchmark        1.17 ns

That's over a 2x speedup! Obviously, as with any microbenchmark, you need to take the results with a pinch of salt, since there are many factors that could affect performance in a real application and which are not captured by an isolated example like the one above. You can see the assembly output here for yourself.
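One such factor is the optimiser itself: since the square's state is never read after the loop, a sufficiently aggressive compiler could simplify the benchmarked work. A hedged tweak (my own, not part of the original benchmark) is to mark the object as observable with Google Benchmark's benchmark::DoNotOptimize, and similarly for the virtual version:

static void StaticDispatchBenchmark(benchmark::State &state) {
    IShapeCRTP<SquareCRTP>* square = new SquareCRTP(10);
    for (auto _ : state) {
        square->scale(3);
        benchmark::DoNotOptimize(*square);  // prevent the compiler from discarding the updates
    }
    delete square;
}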

Either way, from both a practical and theoretical point of view, we see that static dispatch via CRTP provides several benefits that can lead to performance increases, compared with dynamic dispatch.

Intro to lock-free programming

Disclaimer: Before starting, I want to highlight that I am fairly new to lock-free programming and still have a lot to learn. This post is intended for me to self-study and revisit in the future if needed, and is based on what I have learnt so far as a CS undergraduate and from researching online.

Introduction

Lock-free programming is often described as programming without locking and unlocking mutexes. While this is true, it is only a "subset" of the actual definition. A lock-free program/algorithm refers to code that cannot cause the system to lock up, whether through a deadlock or a livelock. In other words, there is no possible scheduling of threads and executions that would lead to a locked-up state. This means that the system makes progress as a whole, ie: at least one thread makes progress. As long as a program is able to invoke lock-free operations, the number of completed invocations keeps increasing no matter what.

Another common term in the world of lock-free programming is wait-free. The latter refers to a program where every thread makes progress and completes its operations in a finite number of steps, regardless of contention. In that sense, wait-free is a strict subset of lock-free, and writing wait-free code is generally harder. However, this post will focus on lock-free programming.

As part of lock-free programming, we need to go through atomics, memory models and orderings.

Memory models

A memory model describes how memory is read from and written to and in what order. This typically means how reads and writes may be reordered and what memory barriers can be used to prevent certain reorderings. We can choose to describe how memory behaves at a hardware level or at a software level.

Hardware memory models

The hardware memory model characterises the CPU's ability to reorder memory operations. Stronger memory models are those that provide more guarantees about how memory operations may be reordered (ie: the stronger the model, the fewer types of reorderings it admits). The strongest memory model is sequential consistency (SC); however, no mainstream architecture implements SC at the hardware level because it is too expensive and prevents many optimisations.

In x86, the memory model is known as Total Store Order (TSO). It allows write-read reordering on different memory locations, ie: a read from location x can be reordered before a write to y. This effect is achieved through store buffering: all writes are enqueued in a per-core store buffer, allowing later reads to "overtake" these buffered writes. These buffered writes will eventually be dequeued at non-deterministic times (a type of out-of-order execution), unless a memory barrier is issued to indicate that all writes should be immediately flushed.

A weaker memory model is Partial Store Order (PSO), which is like TSO but also allows write-write reordering, and an even weaker memory model is Relaxed Memory Order (RMO), which allows even more reorderings. These weaker models can be found on SPARC, ARM and Alpha architectures. Therefore, x86's TSO is actually considered a fairly strong memory model, as it only allows one type of reordering.

Because RMO is weaker than PSO and PSO is weaker than TSO, any TSO interleaving is also a possible PSO and RMO interleaving, but not necessarily the other way around.

Software memory models

We can abstract away from the hardware and instead look at how we want our software to behave with regard to memory reorderings. Each architecture provides memory barrier instructions (mfence on x86) which allow us to prevent reorderings in our code (more on memory barriers in the next section).

We may choose to manually issue memory fence instructions, however it is common for the compiler to do that work for us when using higher-level constructs, such as the std::atomic library in C++. We can tag atomic operations with a specific memory order and the compiler will emit the necessary memory barriers.

An idealised and high-level memory model is sequential consistency. SC corresponds to interleaving semantics and makes it easy to reason about your program's correctness. However, as mentioned earlier, CPUs do not implement SC at the hardware level because it prevents optimisations. SC memory operations are typically implemented by emitting a full memory barrier immediately after the operation, so that other threads see its effect. In C++, atomic operations use SC semantics by default (std::memory_order_seq_cst).

On the other extreme, we can choose not to insert any memory barriers at all and let the CPU reorder the instructions as it sees fit. In C/C++, this is known as relaxed ordering (not related to the RMO hardware model). The semantics of relaxed-tagged atomic operations will correspond to the memory model of the hardware.
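A typical legitimate use of relaxed ordering is a simple statistics counter, where we only need each increment to be atomic and do not care how it is ordered relative to surrounding operations. A minimal sketch of my own:

#include <atomic>
#include <cstddef>

std::atomic<size_t> events_handled{0};

void on_event() {
    // Atomic, but imposes no ordering constraints on surrounding reads/writes
    events_handled.fetch_add(1, std::memory_order_relaxed);
}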

Somewhere in between the two previous models, we have acquire-release ordering. This model will provide an ordering between atomic stores and loads. An atomic store can be tagged as release and an atomic load can be tagged as acquire.
Acquire semantics prevent memory reordering of the read-acquire operation with any other read or write operation that follows it.
Release semantics prevent memory reordering of the write-release operation with any other read or write operation that precedes it.
Depending on the architecture, acquire-release operations may require fewer memory barriers than SC.

Acquire-release orderings are perfect for message passing and can be used to implement spinlocks. In fact, the term "acquiring/releasing a lock" actually comes from the acquire-release memory semantics.
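As an illustration, here is a minimal spinlock sketch of my own built on std::atomic_flag, using exactly these acquire/release semantics (a real implementation would also add backoff and a pause instruction):

#include <atomic>

class Spinlock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;

public:
    void lock() {
        // Acquire: operations inside the critical section cannot be reordered before the lock
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // spin until the previous owner clears the flag
        }
    }

    void unlock() {
        // Release: writes made inside the critical section become visible to the next acquirer
        flag_.clear(std::memory_order_release);
    }
};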

Types of memory barriers

In lock-free programming, we want to make sure that the correct memory barriers are used/emitted by the compiler so that our program behaves correctly and enforces an ordering of instructions.

Since we have four types of reordering (read-read, read-write, write-read and write-write), we would require four memory fences. In practice, we usually don't see all four. For example, Linux provides three barriers that are sufficient for its supported architectures:

  • smp_mb(), full memory barrier that orders both reads and writes (any operation before the barrier will be committed to memory before any operation after the barrier).
  • smp_rmb(), read memory barrier that orders reads.
  • smp_wmb(), write memory barrier that orders writes.

For more details on memory barriers and hardware, I would strongly recommend reading this paper by Paul E. McKenney.

Atomic operations

In a multithreaded program where multiple threads read/write shared memory, we need to somehow synchronise accesses to avoid unwanted data races. This can be done with mutexes or with the appropriate synchronisation provided by atomic operations.

An atomic operation is one that manipulates memory in an indivisible way, such that no thread can observe the operation half-complete. On most processors nowadays, many operations are already atomic. For example, aligned reads and writes of native types (those that fit in the memory bus, such as integers) are atomic; on x86, aligned reads and writes up to 64 bits wide are atomic. This is why you can write code such as std::atomic<int> x or std::atomic<long> y in C++: the compiler will align these memory locations and just use standard MOV instructions to atomically read and write the values. You can use std::atomic<T>::is_lock_free to verify that the type T can be atomically read/written without using locks.
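As a quick sanity check (the output is platform-dependent; on mainstream x86-64 both values typically print as 1):

#include <atomic>
#include <iostream>

int main() {
    std::atomic<int> x{42};
    std::atomic<long> y{42};
    std::cout << x.is_lock_free() << " " << y.is_lock_free() << std::endl;

    // C++17 also provides a compile-time guarantee for the type itself
    static_assert(std::atomic<int>::is_always_lock_free);
}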

You can go a step further than loading and storing data atomically by using Read-Modify-Write (RMW) operations. As the name indicates, they allow you to perform more complex transactions atomically. They are crucial in lock-free code when multiple threads want to write to the same location: when they attempt an RMW on that location, the RMWs are effectively serialised and execute one at a time.

Compare-And-Swap

The most important RMW operation is compare-and-swap (CAS). A CAS operation will compare the shared atomic value with an expected value, and if equal, it will update the shared location with a new value. In this case, the CAS is successful and returns true. If not equal, we say the CAS is unsuccessful and it returns false.

It is the only RMW that is absolutely essential, since every other RMW such as fetch_add() or exchange() can be implemented using CAS.
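For illustration, a fetch_add built purely out of a CAS loop might look like the sketch below (in practice you would of course call the native fetch_add, which compiles to a single instruction on x86):

#include <atomic>

int fetch_add_with_cas(std::atomic<int>& shared, int amount) {
    int old = shared.load();
    // If the CAS fails, `old` is refreshed with the current value and we retry
    while (!shared.compare_exchange_weak(old, old + amount)) {
    }
    return old;
}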

Often, we use CAS loops to repeatedly attempt a transaction until successful. This pattern typically involves:

  1. Copying a shared variable to a local variable (private to other threads)
  2. Performing some speculative work using the local variable
  3. Attempting to publish any changes to the shared variable using CAS

For example, we can make a seemingly complex operation atomic and lock-free using a CAS loop. Below, we implement a function that atomically doubles a value if it is odd or increments it if it is even, and returns the old value:

int atomicDoubleOrIncrement(std::atomic<int>& shared) {
    int old = shared.load();
    int desired;   // note: "new" is a keyword in C++, so we cannot use it as a variable name
    do {
        if (old % 2 != 0) {
            desired = old * 2;
        } else {
            desired = old + 1;
        }
    } while (!shared.compare_exchange_strong(old, desired));
    return old;
}

The idea is that if the CAS fails due to a modification to shared by another thread, the local variable old gets updated to be the latest value of shared (in C++, if a CAS fails then it will update the expected parameter to contain the latest value) and we try again. You can think of the body of a CAS loop as a protected section that cannot be seen by other threads because we perform work using the local variable.

The code above qualifies as lock-free because:

  1. if the CAS succeeds, then we break from the loop and the thread makes progress
  2. if the CAS fails, then it must be because another thread updated the value first and their CAS was successful, so there is another thread that makes progress

However, this function is not wait-free because only one thread can make progress at any given time and a thread's progress will depend on contention.

Closing thoughts

Lock-free programming is hard, as it requires you to have a good understanding of a variety of topics that go all the way down to the hardware. I am still a novice in this area; however, I am really enjoying learning about it, and writing this post has helped solidify my understanding of memory models, atomics, instruction reordering, etc.

Writing custom memory allocators in C++

Introduction

An essential part of memory management is how memory is allocated and eventually deallocated. By default, memory in C++ programs is allocated using the new keyword and deallocated using the delete keyword. However, sometimes we want more control over how and where objects are allocated/deallocated to avoid issues like fragmentation.

The C++ standard library allows programmers to write custom allocators which can be used by STL containers for dynamic memory allocation, rather than using the standard allocator.

Allocators can be used to improve performance-related issues such as fragmentation, per-thread allocation and NUMA-friendly allocation.

We will see some examples and their benefits in this post, but first we should mention the different properties and requirements an allocator must satisfy in C++.

The allocator API

In C++, an allocator is a template class that allocates and deallocates memory for a specific type T. There are two types of allocators:

  • Equal allocators: two equal allocators can be used to allocate and deallocate memory for a type T interchangeably. These are usually stateless allocators.
  • Unequal allocators: two unequal allocators cannot be used to allocate and deallocate memory interchangeably. These are usually stateful allocators.

An allocator class should offer a T* allocate(size_t n) method to allocate n objects of type T and a void deallocate(T* p, size_t n) method to deallocate n objects of type T previously allocated at p.

Additionally, we need to provide an empty converting copy constructor templated on another type U for full compatibility with STL containers, because the container may also need to allocate internal objects (such as linked list nodes) in addition to objects of type T.

The simplest allocator, built on malloc(), can be implemented as follows:

template <typename T>
class SimpleAllocator {
public:
    using value_type = T;

    SimpleAllocator() = default;
    
    template <typename U>
    SimpleAllocator(const SimpleAllocator<U>& other) {
        (void) other;
    }
    
    T* allocate(size_t n) {
        auto ptr = static_cast<T*>(malloc(sizeof(T) * n));
        if (ptr)
            return ptr;
            
        throw std::bad_alloc();
    }
    
    void deallocate(T* ptr, size_t n) {
        (void) n;
        free(ptr);
    }
};

Because this allocator is stateless and just uses malloc(), it is an equal allocator, so it is good practice to define the following equality operators for it:

template <typename T, typename U>
bool operator==(const SimpleAllocator<T>& a1, const SimpleAllocator<U>& a2) {
    (void) a1; (void) a2;
    return true;
}

template <typename T, typename U>
bool operator!=(const SimpleAllocator<T>& a1, const SimpleAllocator<U>& a2) {
    (void) a1; (void) a2;
    return false;
}

In C++17, you can avoid manually writing the two equality operators above by adding the member type alias using is_always_equal = std::true_type if the allocator is equal (or std::false_type if it is unequal).

Because SimpleAllocator is an equal allocator, it is legal to do the following:

SimpleAllocator<double> a1;
SimpleAllocator<double> a2;

double* ptr = a1.allocate(1);   // allocate a double with a1
a2.deallocate(ptr, 1);          // deallocate the memory with a2

Example 1: A stateless cache-aligned allocator

As an extension to our first stateless allocator which does nothing interesting, we will now implement an allocator that actually does something useful. In this case, our allocator will automatically eliminate false sharing in an STL container being accessed by multiple threads.

Briefly, the solution to false sharing is to align the shared memory locations such that they end up in different cache lines. On x86 CPUs, L1 cache lines are 64 bytes, so our allocator should allocate objects at 64 byte boundaries. Here is the code:

template <typename T, size_t Alignment = 64>
class CacheAlignedAllocator {
public:
    using value_type = T;
    using is_always_equal = std::true_type;

    template <typename U>
    struct rebind {
        using other = CacheAlignedAllocator<U, Alignment>;
    };

    CacheAlignedAllocator() = default;

    template <typename U>
    CacheAlignedAllocator(const CacheAlignedAllocator<U, Alignment>& other) {
        (void) other;
    }

    T* allocate(size_t n) {
        auto ptr = static_cast<T*>(aligned_alloc(Alignment, sizeof(T) * n));
        if (ptr)
            return ptr;

        throw std::bad_alloc();
    }

    void deallocate(T* ptr, size_t n) {
        (void) n;
        free(ptr);
    }
};

For allocation, we use the standard library function aligned_alloc(). Everything else is the same, except for the rebind struct. Normally, std::allocator_traits can derive rebind for us automatically, but because our allocator takes more than one template argument (in case the user wants to supply a different alignment), we must define our own rebind struct (which STL containers use to obtain an allocator for their internal objects, such as list nodes).

You can now specify the custom allocator as a policy template argument to an STL container and verify that all allocations are indeed 64-byte aligned (by inserting a print statement in the allocate method):

std::vector<int, CacheAlignedAllocator<int>> vec;
for (int i = 0; i < 5; i++) {
    vec.emplace_back(i);
}

Running this benchmark I have prepared to demonstrate the effects of false sharing in a multi-threaded program, for 10 iterations, I obtain:

        std::allocator mean: 2470 ms
 CacheAlignedAllocator mean: 2192 ms

Measuring latency in this case may not be fully representative because it takes into account thread creation and context switching, which add jitter to the results. However, you can look at the L1 cache misses using perf on Linux and verify that we get fewer L1 cache misses with the custom allocator.

Example 2: A stateful pool allocator

Now, we look at a classic use case for custom memory allocators: a pool allocator. The goal of a pool-based allocator is to quickly allocate memory for objects of a fixed type while reducing fragmentation of memory.

Pool allocators work by allocating large blocks of memory in advance and dividing them up to serve individual allocations. This makes each allocation much faster than a call to malloc().

Because a pool allocator has to manage a list of blocks, it is a stateful allocator (and therefore unequal).
Firstly, we need to create a Pool class that manages memory in chunks of a given size. We use a stack of addresses so that we can quickly pop an available address when we need to allocate an object and push an address back when we deallocate the object stored at it.

template <size_t BlockSize, size_t ReservedBlocks = 0>
class Pool {
private:
    size_t size_;
    std::stack<void *> addrs_;
    std::stack<std::unique_ptr<uint8_t[]>> blocks_;

public:
    explicit Pool(size_t size) : size_(size) {
        for (size_t i = 0; i < ReservedBlocks; i++) {
            add_more_addresses();
        }
    }

    void* allocate() {
        if (addrs_.empty()) {
            add_more_addresses();
        }

        auto ptr = addrs_.top();
        addrs_.pop();
        return ptr;
    }

    void deallocate(void *ptr) {
        addrs_.push(ptr);
    }

    void rebind(size_t size) {
        if (!(addrs_.empty() && blocks_.empty())) {
            std::cerr << "Cannot call Pool::rebind() after alloc\n";
            abort();
        }

        size_ = size;
    }

private:
    // Refill the address stack by allocating another block of memory
    void add_more_addresses() {
        auto block = std::make_unique<uint8_t[]>(BlockSize);
        // Only hand out whole chunks that fit entirely within the block
        const auto usable_size = (BlockSize / size_) * size_;

        // Divide the block into chunks of size_ bytes, and add their addrs
        for (size_t i = 0; i < usable_size; i += size_) {
            addrs_.push(&block.get()[i]);
        }

        // Keep the memory of the block alive by adding it to our stack
        blocks_.push(std::move(block));
    }
};

In the constructor, we specify the fixed chunk size in bytes (this will later be passed in as sizeof(T)) and we reserve memory blocks up front if requested via the ReservedBlocks template parameter.

Importantly, each block is owned by a std::unique_ptr, so all memory is automatically freed when the pool is destroyed.

The allocate() method simply returns the top address on the stack in O(1) time (refilling the stack first if it is empty), and deallocate() just pushes the address back onto the stack.
We also provide a rebind() method, which is used by the allocator's rebind copy constructor shown below.

Now, we need to create the actual PoolAllocator class which manages a Pool instance:

template <typename T, size_t BlockSize = 4096, size_t ReservedBlocks = 0>
class PoolAllocator {
private:
    using PoolType = Pool<BlockSize, ReservedBlocks>;
    std::shared_ptr<PoolType> pool_;

public:
    using value_type = T;
    using is_always_equal = std::false_type;

    template <typename U>
    struct rebind {
        using other = PoolAllocator<U, BlockSize, ReservedBlocks>;
    };

    PoolAllocator() : pool_(std::make_shared<PoolType>(sizeof(T))) {}

    // The rebind copy constructor below needs access to pool_ in other specialisations
    template <typename U, size_t B, size_t R>
    friend class PoolAllocator;

    // Rebind copy constructor
    template <typename U>
    PoolAllocator(const PoolAllocator<U, BlockSize, ReservedBlocks>& other) : pool_{other.pool_} {
        pool_->rebind(sizeof(T));
    }

    PoolAllocator(const PoolAllocator& other) = default;
    PoolAllocator(PoolAllocator&& other) = default;
    PoolAllocator& operator=(const PoolAllocator& other) = default;
    PoolAllocator& operator=(PoolAllocator&& other) = default;

    T* allocate(size_t n) {
        if (n > 1) {
            return static_cast<T*>(malloc(sizeof(T) * n));
        }

        return static_cast<T*>(pool_->allocate());
    }

    void deallocate(T* ptr, size_t n) {
        if (n > 1) {
            free(ptr);
            return;
        }

        pool_->deallocate(ptr);
    }
};

Our allocator class looks very similar to the previous ones. The constructor creates a Pool instance which is used to allocate and deallocate memory. The allocate() and deallocate() methods simply forward the call to the pool_ instance. Note that our allocator only supports individual allocations from the pool: if n > 1, we simply fall back to the standard malloc() and free().

We must also provide a rebind struct, because our class has more than one template parameter, and a rebind copy constructor, which passes the call down to the pool_ instance. We also need to provide default copy/move constructors and assignment operators.
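For reference, plugging the allocator into a container is a one-liner; a small usage sketch (the template arguments assume the defaults defined above):

std::list<int, PoolAllocator<int>> numbers;
for (int i = 0; i < 1000000; i++) {
    numbers.push_back(i);
}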

Our benchmark in this case measures the time it takes to add 1 million integers to a std::list (see the code here). We compare the standard allocator, a pool allocator with no blocks reserved in advance, one with 100 reserved blocks and another with 1000 reserved blocks:

std::allocator<int>            mean: 21611 μs
PoolAllocator<int>             mean: 12885 μs
PoolAllocator<int, 4096, 100>  mean: 5718 μs
PoolAllocator<int, 4096, 1000> mean: 5686 μs

Even without reserving blocks in advance, we get around a 2x speedup! And as expected, if we reserve blocks in advance, we get up to a 3.85x speedup.

Pool allocators are so useful that they were added to the standard library in C++17. You can find them in the <memory_resource> header (polymorphic memory resources, under the std::pmr namespace) and use them directly with STL containers.
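For example, here is a small sketch using the standard facilities rather than the PoolAllocator developed above:

#include <list>
#include <memory_resource>

int main() {
    // A pool resource that hands out fixed-size chunks from pre-allocated blocks
    std::pmr::unsynchronized_pool_resource pool;

    // std::pmr::list is std::list with a polymorphic allocator
    std::pmr::list<int> values{&pool};
    for (int i = 0; i < 1000000; i++) {
        values.push_back(i);
    }
}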

Example 3: A huge page allocator

Our final example is just to show off how you can write highly-tailored allocators using a specific feature of Linux: transparent huge pages.

Typically, the OS allocates memory in fixed-size pages of 4 kB. However, if we have a very large in-memory data structure spanning many pages which is accessed randomly (such as a hash table), we are likely to suffer a high number of TLB misses during virtual address translation. The OS can reduce this miss rate by using larger pages, known as huge pages. These are typically 2 MB, but can be even larger.

A transparent huge page is one that the OS creates automatically by promoting regular pages into a huge page, and is a Linux-specific feature. We can enable THP support for a memory range using the madvise() system call, although Linux does not guarantee that a huge page will actually be used. We use posix_memalign() to get an allocation aligned to the huge page size, which makes it much more likely that the kernel can back it with a huge page. Let's see the code:

template <typename T, size_t HugePageSize = 1 << 21> 
class THPAllocator {
public:
    using is_always_equal = std::true_type;
    using value_type = T;

    template <typename U>
    struct rebind {
        using other = THPAllocator<U, HugePageSize>;
    };

    THPAllocator() = default;

    template <class U>
    constexpr THPAllocator(const THPAllocator<U>& other) {
        (void) other;
    }

    T *allocate(size_t n) {
        if (n > std::numeric_limits<std::size_t>::max() / sizeof(T)) {
            throw std::bad_alloc();
        }
        const auto total_size = n * sizeof(T);
        void *p = nullptr;
        // Get an allocation aligned to the huge page size
        if (posix_memalign(&p, HugePageSize, total_size) != 0) {
            throw std::bad_alloc();
        }

        // Ask the kernel to back this range with huge pages (needs <sys/mman.h>)
        madvise(p, total_size, MADV_HUGEPAGE);

        return static_cast<T *>(p);
    }

    void deallocate(T *p, size_t n) { 
        (void) n;
        free(p); 
    }
};

Our benchmark is very simple: we try to add 8 MB worth of integers to a vector and see how long it takes. We obtain:

std::allocator mean: 4414 μs
THPAllocator   mean: 2584 μs

It is also worth noting that asking the kernel to promote regular pages into a huge page may cause latency spikes, because the kernel needs to update the page tables accordingly. Also, the kernel may decide to compact memory in order to create a huge page on demand, which can also lead to latency spikes. If you run the benchmark with iterations = 1, you will see the large variance.

Closing thoughts

We have seen how C++ lets us implement custom memory allocators for different applications and use cases with relative ease. The C++ allocator API goes into much more detail than covered here, and I am not an expert on memory allocators; this post is just an example highlighting the benefits of custom allocators and a simple demo of how to implement them.

Implementing an optimised SPSC queue for low-latency message passing

Background

A single-producer single-consumer (SPSC) queue is a FIFO buffer that acts as a message passing mechanism between two threads (the producer and the consumer). This can be especially useful when one thread produces (or receives) data which should then be passed on to another thread for processing.

An example usage of such a queue is asynchronous logging: if we have a latency-sensitive section of code and we wish to log something, we should hand the message to a background thread which deals with string formatting and I/O. This way, the only cost incurred on the critical path is pushing the contents of the log message onto the queue.

Naturally, in a multi-threaded environment, we may decide to have multiple producers and multiple consumers passing data around. However, this requires a more careful implementation to avoid data races and often involves locks, which can lead to lower performance (due to context switches). Although implementations without mutexes exist (via atomic RMWs), they are not technically lock-free, nor are they trivial to implement.

On the other hand, if we have exactly 2 threads (one writer and one reader) sharing the queue, it is possible to implement a wait-free SPSC queue using only atomic loads and stores (no RMW loops!).

A wait-free algorithm means that every thread in the system makes progress regardless of contention, with each operation completing in a finite number of steps.

This is largely why SPSC queues are commonly found in high-throughput multi-threaded systems: they are very fast and fairly easy to implement.

V1: A simple wait-free SPSC queue

We will implement the queue as a circular ring buffer using std::vector as the underlying container. We also need to keep track of the head and tail fields, which are atomically updated. The queue will have a bounded capacity, which is specified on construction.

template <typename T>
class SpscQueue {
private:
    std::vector<T> buffer_;
    std::atomic<size_t> head_;
    std::atomic<size_t> tail_;
    
public:
    SpscQueue(size_t cap) : buffer_(cap + 1), head_(0), tail_(0) {
        assert(cap + 1 > 0);      // prevent overflow
    }
    
    // ... (enqueue and dequeue shown below)
};

Notice that we actually give the vector the capacity plus one. The reason is that we sacrifice one slot to distinguish between an empty queue and a full queue, so our empty/full conditions become:

  • Is empty? head_ == tail_
  • Is full? (tail_ + 1) % buffer_.size() == head_

Given this information, we can implement our enqueue and dequeue methods. The code is given first and then explained:

bool enqueue(const T& item) {
    const size_t t = tail_.load(std::memory_order_relaxed);
    const size_t next_t = (t + 1) % buffer_.size();

    if (next_t == head_.load(std::memory_order_acquire)) {
        return false;
    }

    buffer_[t] = item;
    tail_.store(next_t, std::memory_order_release);
    return true;
}

bool dequeue(T& item) {
    const size_t h = head_.load(std::memory_order_relaxed);

    if (h == tail_.load(std::memory_order_acquire)) {
        return false;
    }

    item = buffer_[h];
    const size_t next_h = (h + 1) % buffer_.size();
    head_.store(next_h, std::memory_order_release);
    return true;
}

As you can see, there are no atomic RMW loops or blocking operations, so both methods are guaranteed to be wait-free and lock-free.

Let's start with enqueue(). We first check that the queue is not full by comparing the next possible tail index with the current head index. We use std::memory_order_acquire to synchronise with possible updates to the head_ field by the consumer thread calling dequeue().
We then store the item into the queue at the current tail and update the tail_ field atomically to point to the next location. Because we want to synchronise with the dequeue() method, we use std::memory_order_release so that our local changes are visible to the other thread. This also happens to be our linearisation point, ie the point at which the enqueue() method appears to take effect.

The reasoning for the dequeue() method is symmetric, but instead we check that the queue is not empty and update the head_ field atomically.

Initially, I tried implementing the queue using only sequentially consistent atomic operations; however, the performance was much worse (nearly 4x lower throughput). I recommend benchmarking an SC-only version of the queue to see how much slower it can be (due to the extra memory barriers).

We can now benchmark this unoptimised implementation with a simple program consisting of two threads passing 100 million integers from one to the other, with a queue size of 100k. We measure how long it takes for the consumer thread to read all of the items.

We also pin the consumer and producer threads to different physical cores in our benchmark program. We will see why this matters in the next section.

We compile the benchmark with -O3 and -march=native, and we obtain the following results (run over 10 iterations):

Mean:   46,247,967 elems/s
Median: 46,477,813 elems/s
Min:    44,271,890 elems/s
Max:    46,672,184 elems/s

We are processing around 46 million elements per second, which is quite good, but we can do even better.

V2: Eliminating false sharing

Going back to our class definition, we see that the head and tail fields are defined contiguously in memory. This can lead to false sharing: because both fields lie on the same cache line, when thread 1 modifies one of them, it invalidates thread 2's copy of that cache line. This forces thread 2 to re-fetch the line even though the field it actually needs did not change.

We can easily fix this by adding enough padding between the two fields so that they end up on different cache lines. On most modern processors, an L1 cache line is 64 bytes, so we can align each field to a 64-byte boundary. In C++:

template <typename T>
class SpscQueue {
private:
    std::vector<T> buffer_;
    alignas(64) std::atomic<size_t> head_;
    alignas(64) std::atomic<size_t> tail_;

    // ...
};

Our queue should no longer suffer from false sharing! Our new results show a slight improvement:

Mean:   49,357,127 elems/s
Median: 49,245,343 elems/s
Min:    47,540,700 elems/s
Max:    50,012,641 elems/s

V3: A cache-optimised queue

The final but substantial optimisation aims to improve cache usage and reduce cache misses (inspired by this).

Consider the case when the consumer calls dequeue() to read an item:

  1. The head_ needs to be updated, so that cache line is loaded into the L1 cache in an exclusive state.
  2. The tail_ needs to be read to check that the queue is not empty, so its cache line is loaded into the L1 cache in a shared state.

Now, assume the producer calls enqueue() to push a new item:

  3. The head_ needs to be read (to check that the queue is not full), so it causes the consumer's cache line containing head_ to transition into a shared state.
  4. This causes cache coherency traffic, as the consumer needs to bring the head_ cache line back into an exclusive state.

A symmetric situation occurs the other way round, meaning that in the worst case, there will be one cache line transition from a shared state to an exclusive state for every read and write. In the MESI cache coherency protocol, these transitions are considered cache misses and produce bus traffic.

To reduce bus traffic, the producer and consumer threads will each have their own cached copies of the head and tail indices which can be used to avoid having to always load the head_ or tail_ when checking if the queue is empty or full.

Essentially, when the consumer first observes that N items are available to read, it caches this information and the N-1 subsequent reads won’t need to read the tail_. Similarly when the producer first observes that N items are available for writing, it caches this information and the N-1 subsequent writes won’t need to read the head_.
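The cached indices themselves are plain, non-atomic members, since each one is only ever touched by a single thread (head_cached_ by the producer, tail_cached_ by the consumer). One possible layout, keeping the 64-byte alignment from before (a sketch; the exact placement of the cached copies is a tuning choice):

template <typename T>
class SpscQueue {
private:
    std::vector<T> buffer_;
    alignas(64) std::atomic<size_t> head_;
    alignas(64) size_t tail_cached_ = 0;  // only read/written by the consumer
    alignas(64) std::atomic<size_t> tail_;
    alignas(64) size_t head_cached_ = 0;  // only read/written by the producer

    // ...
};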

Our cache-friendly implementation looks like this now:

bool enqueue(const T& item) {
    const size_t t = tail_.load(std::memory_order_relaxed);
    const size_t next_t = (t + 1) % buffer_.size();

    // Use the cached head first instead of loading the actual head from memory.
    // If they are equal, then we know that the queue may be full, so only then load
    // the actual value of head to check if currently full.
    if (next_t == head_cached_) {
        head_cached_ = head_.load(std::memory_order_acquire);
        if (next_t == head_cached_) {
            return false;
        }
    }

    buffer_[t] = item;
    tail_.store(next_t, std::memory_order_release);
    return true;
}

bool dequeue(T& item) {
    const size_t h = head_.load(std::memory_order_relaxed);

    // Use the cached tail first instead of loading the actual tail from memory.
    // If they are equal, then we know that the queue may be empty, so only then load
    // the actual value of tail to check if currently empty.
    if (h == tail_cached_) {
        tail_cached_ = tail_.load(std::memory_order_acquire);
        if (h == tail_cached_) {
            return false;
        }
    }

    item = buffer_[h];
    const size_t next_h = (h + 1) % buffer_.size();
    head_.store(next_h, std::memory_order_release);
    return true;
}

Compiling and running our benchmark again with our new queue implementation, we obtain:

Mean:   101,138,856 elems/s
Median: 101,566,018 elems/s
Min:    97,563,021 elems/s
Max:    102,558,883 elems/s

We have more than doubled our initial throughput, reaching 100 million items per second. A fantastic improvement! Obviously, the results shown throughout this post are dependent on your hardware, but we have clearly optimised the queue, starting from a simple implementation and finishing with a cache-efficient queue.

The code for the queue and the benchmark can be found here.

Closing thoughts

We have gone from a naive SPSC queue using only sequentially consistent operations to a high-throughput, low-latency message passing queue, by introducing weaker memory orderings and improving cache efficiency through an understanding of the underlying hardware.

I find this process of benchmarking and tuning very interesting, especially when seeing that your initial theories drawn up on paper can be translated into big performance boosts in practice.

There are further micro-optimisations you can try and profile, such as ensuring the buffer size is a power of two (to avoid integer division when wrapping indices) or using huge pages to reduce TLB misses during virtual address translation.
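For instance, if the buffer size is rounded up to a power of two at construction, the modulo in the index arithmetic can be replaced with a bit mask; a sketch of the change (assuming buffer_.size() is guaranteed to be a power of two):

// Inside enqueue()/dequeue(), instead of (t + 1) % buffer_.size():
const size_t mask = buffer_.size() - 1;   // valid only for power-of-two sizes
const size_t next_t = (t + 1) & mask;     // equivalent to (t + 1) % buffer_.size()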
