Introduction
Lock-free programming is often described as programming without locking/unlocking mutexes. While this is partially true, it is only a subset of the actual definition. A lock-free program is one that cannot cause the system to lock up (deadlock or livelock): there is no possible scheduling of threads and operations that leads to a locked-up state. In other words, the system as a whole always makes progress, i.e. at least one thread makes progress. As long as threads keep invoking lock-free operations, the number of completed invocations keeps increasing no matter what.
Another widely used term is "wait-free". It refers to a program where every thread makes progress and completes each operation in a finite number of steps, regardless of contention. In that sense, wait-free is a strict subset of lock-free, and writing wait-free code is generally harder. This post, however, will focus on lock-free programming.
Memory models
A memory model describes how memory is read from and written to and in what order. This typically means how reads and writes may be reordered and what memory barriers can be used to prevent certain reorderings. We can choose to describe how memory behaves at a hardware level or at a software level.
Hardware memory models
The hardware memory model characterises the CPU's ability to reorder memory operations. Stronger memory models provide more guarantees about how memory operations may be reordered (i.e. the stronger the model, the fewer types of reorderings it admits). The strongest memory model is sequential consistency (SC); however, no mainstream architecture implements SC at the hardware level because it is too expensive and prevents many optimisations.
On x86, the memory model is known as Total Store Order (TSO). It allows write-read reordering on different memory locations, i.e. a read from location x can be reordered before an earlier write to a different location y. This effect is a consequence of store buffering: writes are enqueued in a per-core store buffer, allowing later reads to "overtake" the buffered writes. The buffered writes are drained at non-deterministic times (a type of out-of-order execution), unless a memory barrier is issued to force the store buffer to be flushed first.
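The classic store-buffering litmus test illustrates this reordering. Below is a minimal sketch using relaxed C++ atomics, so that neither the compiler nor the CPU adds barriers; the variable names and setup are illustrative:

    #include <atomic>
    #include <cstdio>
    #include <thread>

    std::atomic<int> x{0}, y{0};
    int r1 = 0, r2 = 0;

    int main() {
        std::thread t1([] {
            x.store(1, std::memory_order_relaxed);  // write x
            r1 = y.load(std::memory_order_relaxed); // read y: may effectively happen first
        });
        std::thread t2([] {
            y.store(1, std::memory_order_relaxed);  // write y
            r2 = x.load(std::memory_order_relaxed); // read x: may effectively happen first
        });
        t1.join();
        t2.join();
        // Under SC, at least one of r1, r2 must be 1. With store buffering,
        // r1 == 0 && r2 == 0 is also a possible outcome.
        std::printf("r1=%d r2=%d\n", r1, r2);
    }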
A weaker memory model is Partial Store Order (PSO), which is like TSO but also allows write-write reordering, and weaker still is Relaxed Memory Order (RMO), which allows even more reorderings. These weaker models can be found on SPARC, ARM and Alpha architectures. x86's TSO is therefore considered a fairly strong memory model, as it only allows one type of reordering.
Software memory models
We can abstract away from the hardware and instead look at how we want our software to behave with regard to memory reorderings. Each architecture provides memory barrier instructions in its instruction set (e.g. mfence on x86) which allow us to prevent reorderings in our code (more on memory barriers in the next section).
We may choose to issue memory fence instructions manually, but it is more common to let the compiler do that work for us through higher-level constructs, such as the atomic header in C++. We can tag atomic operations with a specific memory order and the compiler will emit the necessary memory barriers.
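For example, with std::atomic we can keep the default ordering or tag an operation explicitly; a minimal sketch (the variable name is illustrative):

    #include <atomic>

    std::atomic<int> flag{0};

    void example() {
        flag.store(1);                                // defaults to std::memory_order_seq_cst
        flag.store(1, std::memory_order_release);     // explicitly tagged release store
        int v = flag.load(std::memory_order_acquire); // explicitly tagged acquire load
        (void)v;
    }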
An idealised, high-level memory model is sequential consistency. SC corresponds to interleaving semantics and makes it easy to reason about your program's correctness. However, as mentioned earlier, CPUs do not implement SC at the hardware level because it prevents optimisations. SC semantics are typically achieved by emitting a full memory barrier immediately after each SC operation, so that other threads see its effect. In C++, all atomic operations are SC by default.
On the other extreme, we can choose not to insert any memory barriers at all and let the CPU reorder instructions as it sees fit. In C/C++, this is known as relaxed ordering (not to be confused with the RMO hardware model). Relaxed-tagged atomic operations are still atomic, but their ordering semantics correspond to whatever the hardware's memory model provides.
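A common sketch is an event counter that needs atomicity but no ordering with respect to surrounding operations (the names here are illustrative):

    #include <atomic>
    #include <cstdint>

    std::atomic<uint64_t> events{0};

    void recordEvent() {
        // Atomic, but imposes no ordering on surrounding memory operations;
        // fine for a counter that is only read after the threads are joined.
        events.fetch_add(1, std::memory_order_relaxed);
    }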
Somewhere in between the two previous models, we have acquire-release ordering. This model will provide an ordering between atomic stores and loads. An atomic store can be tagged as release and an atomic load can be tagged as acquire. Acquire semantics prevent memory reordering of the read-acquire operation with any other read or write operation that follows it. Release semantics prevent memory reordering of the write-release operation with any other read or write operation that precedes it. Depending on the architecture, acquire-release operations may require fewer memory barriers than SC.
Acquire-release orderings are perfect for message passing and can be used to implement spinlocks. In fact, acquire and release semantics take their names from lock operations: acquiring a lock has acquire semantics, and releasing it has release semantics.
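As a minimal sketch, a spinlock built on std::atomic_flag uses an acquire on lock and a release on unlock:

    #include <atomic>

    class Spinlock {
        std::atomic_flag locked = ATOMIC_FLAG_INIT;
    public:
        void lock() {
            // Acquire: nothing after lock() can be reordered before it.
            while (locked.test_and_set(std::memory_order_acquire)) {
                // spin until the flag is cleared
            }
        }
        void unlock() {
            // Release: nothing before unlock() can be reordered after it.
            locked.clear(std::memory_order_release);
        }
    };

Everything done while holding the lock is thus guaranteed to be visible to the next thread that acquires it.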
Types of memory barriers
In lock-free programming, we want to make sure that the correct memory barriers are used/emitted by the compiler so that our program behaves correctly and enforces an ordering of instructions.
Since we have four types of reordering (read-read, read-write, write-read and write-write), we would in principle require four memory fences. In practice, we usually don't see all four. For example, Linux provides three barriers that are sufficient for its supported architectures (a rough C++ analogue is sketched after the list):
- smp_mb(), full memory barrier that orders both reads and writes (any operation before the barrier will be committed to memory before any operation after the barrier).
- smp_rmb(), read memory barrier that orders reads.
- smp_wmb(), write memory barrier that orders writes.
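In portable C++, std::atomic_thread_fence plays a similar role. The mapping below is approximate (acquire/release fences are not exactly read/write barriers) and is shown only as a sketch:

    #include <atomic>

    void fences() {
        std::atomic_thread_fence(std::memory_order_seq_cst); // roughly smp_mb(): full barrier
        std::atomic_thread_fence(std::memory_order_acquire); // roughly smp_rmb(): orders reads
        std::atomic_thread_fence(std::memory_order_release); // roughly smp_wmb(): orders writes
    }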
For more details on memory barriers and hardware, I would strongly recommend reading this paper by Paul E. McKenney.
Atomic operations
In a multithreaded program where multiple threads read/write shared memory, we need to somehow synchronise accesses to avoid unwanted data races. This can be done with mutexes or with the appropriate synchronisation provided by atomic operations.
An atomic operation is one that manipulates memory in an indivisible way, such that no thread can observe the operation half-complete. On most modern processors, many operations are already atomic. For example, aligned reads and writes of native types (those that fit in a machine word, such as integers) are atomic; on x86-64, aligned reads and writes of up to 64 bits are atomic. This is why code such as std::atomic<int> or std::atomic<long> in C++ can be cheap: the compiler will align these memory locations and use standard MOV operations to read and write the values atomically. You can use std::atomic<T>::is_lock_free in C++ to verify that values of type T can be atomically read/written without using locks.
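A quick sketch of checking this at runtime (Pair is an illustrative 8-byte struct):

    #include <atomic>
    #include <cstdio>

    struct Pair { int a, b; }; // 8 bytes, trivially copyable

    int main() {
        std::atomic<int>  i{0};
        std::atomic<Pair> p{};
        std::printf("int  lock-free: %d\n", (int)i.is_lock_free()); // typically 1
        std::printf("Pair lock-free: %d\n", (int)p.is_lock_free()); // often 1 on x86-64
    }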
You can go a step further than loading and storing data atomically by using Read-Modify-Write (RMW) operations. As the name indicates, they allow you to perform more complex transactions atomically. They're crucial in lock-free code when multiple threads want to write to the same location, because concurrent RMWs on that location effectively get linearised: they execute one at a time, as if in a line.
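A couple of common RMW operations from the C++ atomic header, as a sketch:

    #include <atomic>

    std::atomic<int> shared{0};

    void rmwExamples() {
        int prev = shared.fetch_add(5); // atomically: old = shared; shared += 5; return old
        int old  = shared.exchange(42); // atomically swap in 42, return the previous value
        (void)prev;
        (void)old;
    }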
Compare-And-Swap
The most important RMW operation is compare-and-swap (CAS). A CAS operation compares the shared atomic value with an expected value and, if they are equal, updates the shared location with a new value; in this case the CAS is successful and returns true. If they are not equal, the CAS is unsuccessful and returns false.
It is the only RMW that is absolutely essential, since every other RMW such as fetch_add() or exchange() can be implemented using CAS.
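As a sketch, here is fetch_add() built from a CAS loop (fetchAddWithCAS is an illustrative name):

    #include <atomic>

    int fetchAddWithCAS(std::atomic<int>& shared, int arg) {
        int old_val = shared.load();
        // On failure, compare_exchange_strong refreshes old_val with the
        // current value of shared, so we simply retry with the new snapshot.
        while (!shared.compare_exchange_strong(old_val, old_val + arg)) {
        }
        return old_val; // matches fetch_add(): return the value before the addition
    }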
Often, we use CAS loops to repeatedly attempt a transaction until successful. This pattern typically involves:
- Copying a shared variable to a local variable (private to the current thread)
- Performing some speculative work using the local variable
- Attempting to publish any changes to the shared variable using CAS
For example, we can make a seemingly complex operation atomic and lock-free using a CAS loop. Below, we implement a function that atomically doubles a value if it is odd or increments it if it is even, and returns the old value:
    #include <atomic>

    int atomicDoubleOrIncrement(std::atomic<int>& shared) {
        int old_val = shared.load();
        int new_val;
        do {
            // Speculative work on the thread-local copy.
            if (old_val % 2 != 0) {
                new_val = old_val * 2; // odd: double
            } else {
                new_val = old_val + 1; // even: increment
            }
            // Attempt to publish; on failure, old_val is refreshed and we retry.
        } while (!shared.compare_exchange_strong(old_val, new_val));
        return old_val;
    }
The idea is that if the CAS fails due to a modification of shared by another thread, the local variable old_val gets updated to the latest value of shared (in C++, a failed CAS updates the expected parameter with the value actually found) and we try again. You can think of the body of a CAS loop as a private workspace that cannot be seen by other threads, because the speculative work happens on a local variable.
The code above qualifies as lock-free because:
- if the CAS succeeds, then we break from the loop and the thread makes progress
- if the CAS fails, then it must be because another thread updated the value first and their CAS was successful, so there is another thread that makes progress
However, this function is not wait-free: a thread can lose the CAS race every time another thread wins, so an individual thread's progress depends on contention and is not bounded by a finite number of steps.
Closing thoughts
Lock-free programming isn't easy; it requires a good understanding of a variety of topics that go all the way down to the hardware. Hopefully this post has provided an introduction to memory models, atomics and instruction reordering, as well as some design patterns for writing lock-free algorithms.