Beating std::visit Without Really Trying
One of my current projects is a sum type implementation for the D programming language. It bills itself as having "zero overhead compared to hand-written C," and while there are sound theoretical reasons to think that's true, I'd never actually sat down and verified it empirically. Was I really getting the optimal performance I thought I was?
I was also curious to see how my efforts stacked up next to C++17's <variant> and D's Algebraic, especially after reading a few blog posts and reddit comments about the challenges faced by the C++ implementation. Could D, and my library, SumType, in particular, do better?
Methodology
To answer these questions, I designed a simple program and implemented it in C, in C++, and in D with both Algebraic and SumType, using the most natural and idiomatic code for each language and library. My goal was to find out how well a production-grade, optimizing compiler would handle each sum type implementation in typical use. The compilers I chose were clang 8, clang++ 8, and ldc 1.17.0, all of which use LLVM 8 for their backends. All programs were compiled with optimization level -O2.
Each test program does the following things:
- Defines 10 empty struct types, named T0 through T9.
- Defines a sum type with 10 members, one of each of those types.
- Defines a function that takes an instance of that sum type as an argument, and returns a unique integer for each possible type the sum type can hold.
For illustration's sake, here's the C++ version, slightly abridged:
#include <variant>
// Source: https://en.cppreference.com/w/cpp/utility/variant/visit
template<class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
template<class... Ts> overloaded(Ts...) -> overloaded<Ts...>;
struct T0 {};
// ...
struct T9 {};
using example_variant = std::variant<T0, T1, T2, T3, T4, T5, T6, T7, T8, T9>;
int do_visit(example_variant v)
{
return std::visit(overloaded {
[](T0 val) { return 3; },
[](T1 val) { return 5; },
[](T2 val) { return 8; },
// ...
[](T9 val) { return 233; },
}, v);
}
After compiling each program, I used objdump to disassemble them and compared the generated assembly for each version of do_visit.
All of the code used, along with the commands used for compilation and disassembly, is available on GitHub in the repository pbackus/variant-comparison.
Results
C
The C version uses a union
with an enum
tag as its sum type implementation,
and a switch statement for its "visit" function. It's the straightforward,
obvious implementation you'd use if you were writing the code by hand, without
worrying about making it generic, so it makes a good baseline to compare the
other versions to.
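To make that shape concrete, here is a sketch of the pattern, written in D syntax (whose enum, union, and switch constructs map directly onto their C equivalents) and abridged to three member types; the actual C source is in the repository.
struct T0 {}
struct T1 {}
struct T2 {}

enum Tag { t0, t1, t2 }

// Hand-written tagged union: an enum tag plus a union of the member types.
struct Example
{
    Tag tag;
    union
    {
        T0 v0;
        T1 v1;
        T2 v2;
    }
}

int do_visit(Example e)
{
    // The "visitor" is just a switch over the tag; an out-of-range tag
    // hits the assert, like the error path in the assembly below.
    switch (e.tag)
    {
        case Tag.t0: return 3;
        case Tag.t1: return 5;
        case Tag.t2: return 8;
        default: assert(false, "invalid tag");
    }
}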
00000000000000a0 <do_visit>:
a0: cmp $0xa,%edi # check if type index is in-bounds
a3: jae b0
a5: movslq %edi,%rax
a8: mov 0x0(,%rax,4),%eax # get result from global array
af: retq
# Error path
b0: push %rax
b1: mov $0x0,%edi
b6: mov $0x0,%esi
bb: mov $0x0,%ecx
c0: mov $0x45,%edx
c5: callq ca # __assert_fail
As you might expect, the compiler is able to optimize the entire thing down into a single array lookup. It's so compact, in fact, that the code for calling libc's assertion-failure function takes up more space than the actual logic.
C++
The C++ version uses std::variant and std::visit, with the overloaded helper template from cppreference.com to allow passing a set of lambda functions as a visitor. This is standard-library code, so we should expect it to be as well-optimized as the best and brightest C++ developers around can make it.
0000000000000000 <do_visit(std::__1::example_variant)>:
0: sub $0x18,%rsp
4: mov %rdi,%rax
7: mov %rdi,0x8(%rsp)
c: shr $0x20,%rax
10: mov $0xffffffff,%ecx
15: cmp %rcx,%rax # check if type index is in-bounds
18: je 38
1a: mov %rsp,%rcx
1d: mov %rcx,0x10(%rsp)
22: lea 0x10(%rsp),%rdi
27: lea 0x8(%rsp),%rsi
2c: callq *0x0(,%rax,8) # call function pointer in global array
33: add $0x18,%rsp
37: retq
# Error path
38: callq 3d # __throw_bad_variant_access
The main thing to notice here is that, unlike in the C version, the compiler is
not able to inline the individual visitor functions. Instead, it generates an
indirect call to a function pointer stored in a global array. (The individual
lambda functions, of course, all consist of a single mov
followed by a
return.)
I'm not enough of a C++ expert to understand the implementation of std::visit, so I can only guess why this is the case. Maybe the optimizer just gives up after too many layers of template-metaprogramming gunk? Regardless, it's bad news for fans of zero-cost abstractions, since there's a clear overhead here compared to the C assembly.
[UPDATE: Reddit user matthieum suggests that the function pointers themselves are likely to blame.]
D, with Algebraic
This version uses the sum type implementation from D's standard library, Phobos. Rather than create a dedicated sum type, the Phobos developers opted to make Algebraic a thin wrapper around Variant, D's equivalent of C++17's std::any. That choice turns out to have far-reaching repercussions, as we're about to see.
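For reference, the Algebraic version of do_visit is a straightforward translation of the C++ one, using std.variant's visit. Here is a sketch, abridged in the same way as the C++ listing above (the exact source is in the repository):
import std.variant : Algebraic, visit;

struct T0 {}
// ...
struct T9 {}

alias ExampleAlgebraic = Algebraic!(T0, T1, T2, T3, T4, T5, T6, T7, T8, T9);

int do_visit(ExampleAlgebraic v)
{
    // visit dispatches to whichever handler matches the currently held type.
    return v.visit!(
        (T0 val) => 3,
        (T1 val) => 5,
        (T2 val) => 8,
        // ...
        (T9 val) => 233
    );
}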
0000000000000000 <int dalgebraic.do_visit(ExampleAlgebraic)>:
0: jmpq 5 # int std.variant.visitImpl!(true, ExampleAlgebraic, ...)
0000000000000000 <int std.variant.visitImpl!(true, ExampleAlgebraic, ...)
0: push %rbx
1: sub $0x10,%rsp
5: cmp 0x0(%rip),%rdi # check for uninitialized Variant
c: je 217 # if so, go to error path
12: mov %rdi,%rbx
# This part is repeated 10 times, once for each type
15: movq $0x0,0x8(%rsp) # prepare arguments for VariantN.handler
1e: lea 0x8(%rsp),%rdi
23: xor %esi,%esi
25: xor %edx,%edx
27: callq *%rbx # VariantN.handler: get TypeInfo for current type
29: mov 0x8(%rsp),%rsi
2e: mov 0x0(%rip),%rdi # get TypeInfo for T0
35: callq 3a # Object.opEquals: compare TypeInfos
3a: mov %eax,%ecx
3c: mov $0x3,%eax # load return value for T0 into %eax
41: test %cl,%cl # check if Object.opEquals returned true
43: jne 211 # if so, go to return
...: ... ...
# After checking for T9
20f: je 283 # if none of the types matched, assert(false)
211: add $0x10,%rsp
215: pop %rbx
216: retq
# Error path
217: mov 0x0(%rip),%rdi # get ClassInfo for VariantException
21e: callq 223 # _d_allocclass: allocate VariantException
223: mov %rax,%rbx
226: mov 0x0(%rip),%rax # initialize VariantException vtable
22d: mov %rax,(%rbx)
230: movq $0x0,0x8(%rbx)
238: mov 0x0(%rip),%rax # initialize VariantException
23f: movups 0x50(%rax),%xmm0
243: movups %xmm0,0x50(%rbx)
247: movups 0x10(%rax),%xmm0
24b: movups 0x20(%rax),%xmm1
24f: movups 0x30(%rax),%xmm2
253: movups 0x40(%rax),%xmm3
257: movups %xmm3,0x40(%rbx)
25b: movups %xmm2,0x30(%rbx)
25f: movups %xmm1,0x20(%rbx)
263: movups %xmm0,0x10(%rbx)
267: lea 0x0(%rip),%rdx # get message for VariantException
26e: mov $0x2f,%esi
273: mov %rbx,%rdi
276: callq 27b # VariantException.__ctor
27b: mov %rbx,%rdi
27e: callq 283 # _d_throw_exception
283: lea 0x0(%rip),%rsi # get error message for assert
28a: mov $0x4b,%edi
28f: mov $0xa1c,%edx
294: callq 299 # call D runtime assert function
The good part is that ldc is able to inline the lambdas into the body of visitImpl. The bad part is, well, everything else.
Because Algebraic is implemented using Variant, it has to rely on runtime type information (TypeInfo) to check the type of the current value. And because it uses TypeInfo, rather than an integer index like the C and C++ versions, these checks can't be condensed down into a single array lookup, or even a jump table. Finally, as if that wasn't enough, each TypeInfo comparison requires a virtual method call to Object.opEquals.
Beyond these obvious pessimizations, the other thing that stands out is the sheer volume of the generated code. It's an order of magnitude larger than both the C and C++ versions, with significant bloat in both the normal and error paths. Not only is this bad in isolation, since it puts unnecessary pressure on the instruction cache, it also means that potential optimization opportunities exposed by inlining the C and C++ versions of do_visit won't be available to D code using Algebraic.
D, with SumType
This version uses my own implementation of a sum type in D. SumType
does not
rely at all on runtime type information, and instead uses D's powerful
compile-time reflection and metaprogramming capabilities to try to generate
code as close to the C version as possible. Let's see how well it does.
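First, for reference, here is roughly what the SumType version of do_visit looks like, abridged like the C++ listing above and using the standalone sumtype package's match (the exact source is in the repository):
import sumtype;

struct T0 {}
// ...
struct T9 {}

alias ExampleSumType = SumType!(T0, T1, T2, T3, T4, T5, T6, T7, T8, T9);

int do_visit(ExampleSumType v)
{
    // match picks the handler whose parameter type matches the held value.
    return v.match!(
        (T0 val) => 3,
        (T1 val) => 5,
        (T2 val) => 8,
        // ...
        (T9 val) => 233
    );
}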
0000000000000000 <int dsumtype.do_visit(ExampleSumType)>:
0: push %rax
1: movsbq 0x10(%rsp),%rax # get type index from stack
7: cmp $0xa,%rax # check if type index is in-bounds
b: jae 19
d: lea 0x0(%rip),%rcx # load address of global array
14: mov (%rcx,%rax,4),%eax # get result from global array
17: pop %rcx
18: retq
# Error path
19: lea 0x0(%rip),%rdx # get error message
20: mov $0x4ca,%edi
25: mov $0x9,%esi
2a: callq 2f # __switch_error
As it turns out, it's almost exactly the same as the C version: a single array
lookup, plus some code for error handling. In fact, as far as I can tell, the
only difference has to do with the way the function argument is passed in
registers—the C version is able to avoid spilling anything to the stack,
whereas this version has a push
and a pop
at the beginning and end,
respectively.
Still, I think it's fair to say that this is a true zero-cost abstraction, and that the claim of "zero overhead compared to hand-written C" is justified.
[UPDATE: Reddit user ais52 explains that the purpose of the push/pop pair is to adjust the alignment of the stack pointer.]
Conclusions
If creating a zero-overhead generic sum type is so easy that even I can do it,
why doesn't C++ have one? Are the libc++
developers a bunch of dummies?
No, of course not—but they are working with a significant handicap: C++.
The source code for variant
is terrifyingly complex. In order to get the
results shown above, the authors have had to use every template metaprogramming
trick in the book, and then some. It's clearly the result of immense effort by
some very skilled programmers.
By contrast, the source code of SumType is almost embarrassingly simple. The reason match, SumType's equivalent of std::visit, is able to optimize down to the same code as a C switch statement is that it literally is a switch statement, with a static foreach loop inside to generate the cases.
To be honest, I wasn't even thinking about optimization when I coded SumType. I wanted to make sure the interface was nice, and figured I could worry about performance later. But D makes it so easy to write simple, straightforward code the compiler understands that I ended up falling into the pit of success and beating std::visit entirely by accident.