Alex Constantin-Gomez

Performance comparison between virtual calls and static dispatch

Background

Virtual functions are the standard approach to polymorphism in C++. Any class with a virtual member function declaration will cause the compiler to add an a pointer to every instance of the class. This pointer is known as the virtual table pointer, which points to a table of virtual functions for that particular class. Thus, whenever a virtual function is invoked, the pointer is dereferenced to jump to the correct function. This all occurs at runtime, so it is known as dynamic dispatch (or dynamic polymorphism). Most OOP languages such as Java follow the same strategy; in fact, it is the only form of dispatching that is supported (for instance methods). However, C++ allows for another form of dispatching: static dispatch.

Static dispatch essentially means that the compiler knows in advance what method to jump to. Not only does this avoid the extra layer of indirection (because there is no vtable pointer), it also allows for methods to be inlined which gives room of even more performance benefits (as opposed to virtual functions which cannot be inlined). An important but final benefit is that less space is used per instance because we do not need to store a pointer. This can be beneficial for small objects, since more instances can fit in cache.

In C++, static polymorphism is achieved through the Curiously Recurring Template Pattern (CRTP). As a refresher, the template pattern allows you to delegate specific behaviour in an algorithm to subclasses. With CRTP, the base class is a template class where the template type is a derived class, so it has direct access to the hook method implemented in the derived class. The derived class then inherits from the base class specialised with the derived class itself.

Comparison

Let's start with dynamic dispatch, by creating a simple interface that we want our subclasses to implement:

struct IShape {
    virtual double area() = 0;
    virtual void scale(int factor) = 0;
};

class Square : public IShape {
    double m_length;
    
public:
    Square(double len) : m_length(len) {}
    double area() override { return m_length * m_length; }
    void scale(int factor) override { m_length *= factor; }
};

Now, let's implement the equivalent code using CRTP:

template
class IShapeCRTP {
public:
    double area() { return impl()->area(); }
    
    void scale(int factor) { impl()->scale(factor); }
private:
    Impl* impl() { return static_cast<Impl*>(this); }
};

class SquareCRTP : public IShapeCRTP<SquareCRTP> {
    double m_length;
    
public:
    SquareCRTP(double len) : m_length(len) {}
    double area() { return m_length * m_length; }
    void scale(int factor) { m_length *= factor; }
};

Now, let's write the benchmarks for both versions. I am using Google Benchmark:

#include 

static void DynamicDispatchBenchmark(benchmark::State &state) {
    IShape* square = new Square(10);
    for (auto _ : state) {
        square->scale(3);
    }
    delete square;
}

static void StaticDispatchBenchmark(benchmark::State &state) {
    IShapeCRTP<SquareCRTP>* square = new SquareCRTP(10);
    for (auto _ : state) {
        square->scale(3);
    }
    delete square;
}

BENCHMARK(DynamicDispatchBenchmark);
BENCHMARK(StaticDispatchBenchmark);

BENCHMARK_MAIN();

Running the benchmarks, I got:

DynamicDispatchBenchmark       CPU time: 15.01 ns
StaticDispatchBenchmark        CPU time: 3.04 ns

That's about a x5 speedup! Obviously, as with any microbenchmark, you need to take the results with a pinch of salt, since there are many factors that could affect performance in a real application which are not taken into account in an isolated example like the one above. You can see the assembly output here for yourself.

Either way, from both a practical and theoretical point of view, we see that static dispatch via CRTP provides several benefits that can lead to performance increases, compared with dynamic dispatch.