Aug 23, 2020

What Does Memory Safety Really Mean in D?

Recently, in the D community, there's been a good deal of discussion about what exactly is meant by "memory safety," and under what circumstances (if any) it's appropriate to consider an extern(C) function to be "memory safe." This post is my attempt to clarify my own thoughts on the subject by putting them in writing.

None of these ideas are original to me, and many are not even original to D. Any mistakes you might notice, however, are almost certainly my own. :)

What is Memory Safety?

The D language's specification of memory safety defines "memory safety" as follows:

Memory Safety for a program is defined as it being impossible for the program to corrupt memory.

This definition is frustratingly vague. What does it mean, exactly, for a program to "corrupt memory"? Does an out-of-bounds array access "corrupt memory"? Does it matter if the access is a read or a write? And what about "impossible"—does it count if a D program causes memory corruption when running on faulty hardware? If not, where do we draw the line? The D language specification gives no answers to these questions.

A concept closely related to "memory safety" is "undefined behavior." The C99 standard defines "undefined behavior" as

behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements

Many systems languages, including D, adopt this definition.

It's clear that whatever it may mean to "corrupt memory," a program whose behavior is undefined is free to do so; after all, it's free to do anything. So, at minimum, a program that is memory safe must have no undefined behavior.

Whether there is any defined behavior that is also considered "memory corruption" is a question I'll return to later.

Is @safe Code Memory-Safe?

In D, functions marked with @safe are checked automatically for memory safety. Assuming these checks are implemented correctly, a program consisting only of @safe D code is guaranteed to be memory-safe.

Unfortunately, it's impossible to write a program that does anything useful using only @safe code, because @safe code can't do I/O or make system calls. In order to bridge this gap between the world of @safe and the world outside, D has @trusted functions, which are allowed to perform operations that violate @safe's checks (including calling unchecked @system functions), but can still be called from @safe functions.

When checking @safe code, the D compiler assumes that all @trusted functions are memory-safe. Therefore, in any useful program, your @safe code is truly memory-safe only if your @trusted code is also memory-safe.

Establishing that your @trusted code is memory-safe is left as an exercise for the programmer.

If @safe doesn't actually guarantee memory safety, why is it useful at all? Because it reduces the incredibly difficult problem of proving that your entire program is memory-safe to the relatively tractable one of proving that your @trusted code is memory-safe.

Can External Functions be @trusted?

What about external library functions, whose source code may not be available? Is there any way to prove that such a function is memory-safe?

The strict, hard-line answer is...no. It's impossible. When you write an extern function declaration, the only guarantee you have is that the function you call matches the specified signature. The actual implementation of that function isn't known until your program is compiled and linked--or in the case of dynamic linking, until it's executed!

Does that mean @trusted is useless? Not quite. To see why, let's look at an example.

Suppose we're trying to write a memory-safe @trusted wrapper around the C library function puts. From reading the C99 standard, we know that puts has defined behavior when called with a pointer to a null-terminated string, and undefined behavior otherwise, so we write our function as follows:

@trusted void safePuts(const(char)[] s) {
    import core.stdc.stdio : puts;

    if (s.length == 0)
        throw new Exception("s must not be empty");
    if (s[$ - 1] != '\0')
        throw new Exception("s must be null-terminated");
    puts(s.ptr);
}

Now we ask: what would have to go wrong for this function to corrupt memory?

The array access, s[$ - 1], can't corrupt memory, because array accesses in D are bounds-checked. Similarly, we know that checking the length of an array, comparing two characters, and throwing an exception are all safe operations, because their behavior as defined by the language spec does not include any possibility of memory corruption. The only remaining thing that could possibly go wrong is the call to puts.

Let's assume, for now, that the only way a call to puts can corrupt memory is by causing undefined behavior. There are two possible ways this could happen:

  1. We call puts with something other than a pointer to a null-terminated string.
  2. puts has a bug in its implementation and can cause undefined behavior even when it's called with a null-terminated string.

We can rule out #1 because our function checks for null-termination before calling puts, so the only possibility left is #2. Since the C99 standard defines the behavior of puts when it's called with a null-terminated string, we can conclude that our code is memory-safe as long as puts conforms to the C99 standard.

This is not a 100% ironclad guarantee, but it's a pretty good one. As long as our C library implementation is well-tested, our @trusted code should be fine.

We can apply the same approach to other external functions: assume that they behave as described in the relevant standard, specification, or documentation, and then prove that our @trusted code is memory-safe as long as that assumption holds.
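
For example, here's how the same reasoning might apply to memmove, whose behavior C99 defines even when the source and destination regions overlap. (The wrapper below, including the name safeCopy and its particular checks, is only a rough sketch.)

import core.stdc.string : memmove;

@trusted void safeCopy(ubyte[] dest, const(ubyte)[] src) {
    if (src.length == 0)
        return; // nothing to copy; also avoids handing memmove a null pointer
    if (dest.length < src.length)
        throw new Exception("dest must be at least as long as src");
    // Both slices are valid for src.length bytes, and C99 defines memmove's
    // behavior even when the regions overlap, so this call is memory-safe as
    // long as the C library conforms to the standard.
    memmove(dest.ptr, src.ptr, src.length);
}

Just like before, the proof depends on the assumption that the C library's memmove conforms to C99.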

Is All Well-Defined Behavior Memory-Safe?

The D spec doesn't answer this question directly, but the section on function safety gives us some clues.

Recall that a D program consisting of only @safe code is guaranteed to be memory-safe, according to D's definition of memory safety. It follows from this that any operation a purely-@safe program is allowed to perform must be a memory-safe operation—again, according to the definition of "memory-safe" used in the D spec.

Here's the twist: @safe D code is allowed to dereference a pointer.

At first glance, this seems impossible. In most systems languages, including D, dereferencing a pointer when you're not supposed to is considered undefined behavior. How does D get away with allowing @safe code to do this?

If we scroll down a bit to the section on safe values, we find our answer. The D spec divides pointer values into two categories: "valid" and "invalid". A valid pointer is one that can be dereferenced without causing undefined behavior.

In purely-@safe D code, every operation that could possibly create an invalid pointer is a compile-time error. Pointer arithmetic, casting, and type punning are banned. Escape analysis ensures stack pointers are always valid; GC ensures heap pointers are always valid.
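
For instance, both of the following are rejected at compile time (a quick illustration; the exact error messages vary between compiler versions):

@safe void wontCompile(int* p) {
    auto q = p + 1;             // error: pointer arithmetic not allowed in @safe code
    auto r = cast(int*) 0xDEAD; // error: casting an integer to a pointer not allowed either
}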

Interestingly, @safe D code is allowed to create null pointers, which implies that in D, the behavior of dereferencing null is actually well-defined.

So dereferencing a pointer in @safe code is considered memory-safe because there's no way to access an invalid pointer from @safe code. It follows that any code which allows an invalid pointer to be accessed from @safe code is guilty of violating memory safety. Even if the offending code contains no undefined behavior.

The easiest way for this to happen is for a @trusted function to either (a) create an invalid pointer and return it, or (b) invalidate an existing pointer that's accessible from @safe code.
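
Here's a contrived sketch of (a): the function body below contains no undefined behavior of its own, yet it still violates memory safety by handing @safe callers a pointer that must never be dereferenced.

@trusted int* bogusPointer() {
    // Nothing undefined happens here; we merely manufacture a pointer value.
    // But any @safe code that dereferences the result will invoke undefined
    // behavior, so this function breaks the guarantee @trusted promises.
    return cast(int*) 0xDEADBEEF;
}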

Currently, the only perfectly-reliable way to work with invalid pointers while preventing @safe code from accessing them is to store them in local variables of @trusted or @system functions (or in closures that capture those variables). For global variables and aggregate members, the only access-protection mechanism is private, which still allows access from @safe code in the same module.

What About Arrays?

Array access is considered memory-safe (and therefore @safe) because it's bounds-checked at runtime. But this bounds-checking relies on the fact that the .length property of an array matches the actual length of the array in memory. How is that guaranteed?

Just like with pointers, D forbids, in @safe code, any operation that could possibly create an invalid array. In particular, any change to the array's length that would create an invalid array triggers a reallocation, using the GC.
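
To make that concrete, here's a small @safe example of the runtime stepping in to preserve validity:

@safe void appendExample() {
    int[] a = [1, 2, 3, 4];
    int[] b = a[0 .. 2];
    b ~= 99;            // extending b in place would overwrite a[2], so the
                        // runtime copies b's data to a fresh GC allocation
    assert(a[2] == 3);  // a is untouched; both slices remain valid
    assert(b[2] == 99);
}

Either way, no slice ever ends up with a .length that disagrees with the memory behind it.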

Because arrays are built into the language, their internals do not suffer from the access-protection hole that private variables have with respect to @safe code in the same module. In this way, built-in arrays are privileged over user-defined types, which have no equivalent access-protection mechanism.

Invariants, Generally

Generalizing from the above two examples, we arrive at the following principle: any data that must uphold a run-time invariant in order to be "valid" (i.e., memory-safe to use) must not be directly accessible from @safe code. Instead, it must be encapsulated behind a @safe/@trusted interface that maintains that invariant.

Some examples of such potentially-invalid data include:

  • The tag and union of a tagged union.
  • The reference count and data pointer of a reference-counted pointer.
  • The allocator instance of an allocator-aware container.
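
As a sketch of the first case (illustrative code only, not taken from any particular library), a tagged union's @safe/@trusted interface might look something like this:

struct IntOrPointer {
    private {
        bool isPointer;
        union {
            int value;
            int* pointer;
        }
    }

    @trusted this(int v)  { value = v; isPointer = false; }
    @trusted this(int* p) { pointer = p; isPointer = true; }

    @trusted int* asPointer() {
        // Reading the union's pointer member is memory-safe only if the tag
        // says a pointer is currently stored. That invariant is what this
        // @trusted code relies on, and what private alone cannot protect
        // from @safe code in the same module.
        if (!isPointer)
            throw new Exception("not a pointer");
        return pointer;
    }
}

As long as every write to the tag and the union goes through members like these, @safe code can use IntOrPointer freely. The catch is that @safe code elsewhere in the same module can still assign to the private fields directly, which brings us to the question of how to close that gap.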

To protect potentially-invalid data from access from @safe code, there are several possible solutions:

  1. Hide the data inside closures (sketched briefly after this list). While this solution requires no language changes, it is unsatisfactory because it requires the use of the GC and substantially complicates the implementation of aggregates that wish to use it.

  2. Forbid @safe code from accessing private data, even in the same module. While this would work, it is a breaking change, and would cause many "false positives," since many private member variables are not subject to memory-safety related invariants.

  3. Allow the data to be explicitly marked as inaccessible from @safe code. This is the solution proposed by DIP 1035.

  4. Just rely on private. This is not a total loss, since private does provide some protection, but it means that all @safe functions in a module that contains potentially-invalid variables or aggregate members must be considered effectively @trusted, and subject to manual rather than automatic verification of memory safety.
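
For completeness, here's roughly what option 1 looks like in practice (the names below are invented purely for illustration):

@trusted auto makeHiddenInt(int initial) {
    import core.stdc.stdlib : malloc;

    auto p = cast(int*) malloc(int.sizeof);
    assert(p !is null, "out of memory");
    *p = initial;
    // p exists only in this closure's GC-allocated context, so no other code
    // in the module, @safe or otherwise, can reach or invalidate it. (Freeing
    // the allocation is omitted for brevity.)
    return delegate() @trusted { return *p; };
}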

Summary and Conclusions

Memory safety in D means both the absence of undefined behavior, and the absence of certain kinds of data (which I've termed "invalid") that could lead to undefined behavior. To corrupt memory means to either (a) invoke undefined behavior, or (b) expose invalid data to @safe code (which could then use it to invoke undefined behavior).

To prove that a @safe program is memory safe, all of the following steps must be completed:

  1. Prove that the assumptions made by @trusted code about external functions hold.
  2. Prove that, if those assumptions hold, the @trusted code is memory safe.
  3. Prove that, if the @trusted code is memory safe, the @safe code is memory safe.

Step 1 is performed by the party compiling and linking the program. Step 2 is performed by the author(s) of the @trusted code. And step 3 is performed by the compiler. If any step is skipped, or performed incorrectly, the proof does not hold.

When performing step 3, the D compiler relies on the assumption that invalid data cannot be accessed from @safe code. It upholds this assumption automatically in purely-@safe code, and relies on the programmer to uphold it in @trusted code.

Unfortunately, in the current D language, it is impossible for the programmer to uphold this assumption in @trusted code in all cases without making unacceptable feature, performance, and implementation-quality tradeoffs. DIP 1035 would solve this problem. The current approach, of relying on private, requires manual verification of @safe code in order for the compiler's proof of memory safety to remain valid.

Oct 04, 2019

Beating std::visit Without Really Trying

One of my current projects is a sum type implementation for the D programming language. It bills itself as having "zero overhead compared to hand-written C," and while there are sound theoretical reasons to think that's true, I'd never actually sat down and verified it empirically. Was I really getting the optimal performance I thought I was?

I was also curious to see how my efforts stacked up next to C++17's <variant> and D's Algebraic, especially after reading a few blog posts and reddit comments about the challenges faced by the C++ implementation. Could D, and my library, SumType, in particular, do better?

Methodology

To answer these questions, I designed a simple program and implemented it in C, in C++, and in D with both Algebraic and SumType, using the most natural and idiomatic code for each language and library. My goal was to find out how well a production-grade, optimizing compiler would handle each sum type implementation in typical use. The compilers I chose were clang 8, clang++ 8, and ldc 1.17.0, all of which use LLVM 8 for their backends. All programs were compiled with optimization level -O2.

Each test program does the following things:

  • Defines 10 empty struct types, named T0 through T9.
  • Defines a sum type with 10 members, one of each of those types.
  • Defines a function that takes an instance of that sum type as an argument, and returns a unique integer for each possible type the sum type can hold.

For illustration's sake, here's the C++ version, slightly abridged:

#include <variant>

// Source: https://en.cppreference.com/w/cpp/utility/variant/visit
template<class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
template<class... Ts> overloaded(Ts...) -> overloaded<Ts...>;

struct T0 {};
// ...
struct T9 {};

using example_variant = std::variant<T0, T1, T2, T3, T4, T5, T6, T7, T8, T9>;

int do_visit(example_variant v)
{
    return std::visit(overloaded {
        [](T0 val) { return 3; },
        [](T1 val) { return 5; },
        [](T2 val) { return 8; },
        // ...
        [](T9 val) { return 233; },
    }, v);
}

After compiling each program, I used objdump to disassemble them, and compared the generated assembly for each version of do_visit.

All of the code, along with the commands used for compilation and disassembly, is available on GitHub in the repository pbackus/variant-comparison.

Results

C

The C version uses a union with an enum tag as its sum type implementation, and a switch statement for its "visit" function. It's the straightforward, obvious implementation you'd use if you were writing the code by hand, without worrying about making it generic, so it makes a good baseline to compare the other versions to.

00000000000000a0 <do_visit>:
  a0:    cmp    $0xa,%edi         # check if type index is in-bounds
  a3:    jae    b0
  a5:    movslq %edi,%rax
  a8:    mov    0x0(,%rax,4),%eax # get result from global array
  af:    retq   

  # Error path
  b0:    push   %rax
  b1:    mov    $0x0,%edi
  b6:    mov    $0x0,%esi
  bb:    mov    $0x0,%ecx
  c0:    mov    $0x45,%edx
  c5:    callq  ca                # __assert_fail

As you might expect, the compiler is able to optimize the entire thing down into a single array lookup. It's so compact, in fact, that the code for calling libc's assertion-failure function takes up more space than the actual logic.

C++

The C++ version uses std::variant and std::visit, with the overloaded helper template from cppreference.com to allow passing a set of lambda functions as a visitor. This is standard-library code, so we should expect it to be as well-optimized as the best and brightest C++ developers around can make it.

0000000000000000 <do_visit(std::__1::example_variant)>:
   0:    sub    $0x18,%rsp
   4:    mov    %rdi,%rax
   7:    mov    %rdi,0x8(%rsp)
   c:    shr    $0x20,%rax
  10:    mov    $0xffffffff,%ecx
  15:    cmp    %rcx,%rax        # check if type index is in-bounds
  18:    je     38
  1a:    mov    %rsp,%rcx
  1d:    mov    %rcx,0x10(%rsp)
  22:    lea    0x10(%rsp),%rdi
  27:    lea    0x8(%rsp),%rsi
  2c:    callq  *0x0(,%rax,8)    # call function pointer in global array
  33:    add    $0x18,%rsp
  37:    retq   

  # Error path
  38:    callq  3d               # __throw_bad_variant_access

The main thing to notice here is that, unlike in the C version, the compiler is not able to inline the individual visitor functions. Instead, it generates an indirect call to a function pointer stored in a global array. (The individual lambda functions, of course, all consist of a single mov followed by a return.)

I'm not enough of a C++ expert to understand the implementation of std::visit, so I can only guess why this is the case. Maybe the optimizer just gives up after too many layers of template-metaprogramming gunk? Regardless, it's bad news for fans of zero-cost abstractions, since there's a clear overhead here compared to the C assembly.

[UPDATE: Reddit user matthieum suggests that the function pointers themselves are likely to blame.]

D, with Algebraic

This version uses the sum type implementation from D's standard library, Phobos. Rather than create a dedicated sum type, the Phobos developers opted to make Algebraic a thin wrapper around Variant, D's equivalent of C++17's std::any. That choice turns out to have far-reaching repercussions, as we're about to see.
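
For reference, the do_visit function in this version looks roughly like this (a slightly abridged sketch, like the C++ version above; the exact source is in the linked repository):

import std.variant : Algebraic, visit;

struct T0 {}
// ...
struct T9 {}

alias ExampleAlgebraic = Algebraic!(T0, T1, T2, T3, T4, T5, T6, T7, T8, T9);

int do_visit(ExampleAlgebraic v)
{
    return v.visit!(
        (T0 val) => 3,
        (T1 val) => 5,
        (T2 val) => 8,
        // ...
        (T9 val) => 233,
    );
}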

0000000000000000 <int dalgebraic.do_visit(ExampleAlgebraic)>:
   0:    jmpq   5 # int std.variant.visitImpl!(true, ExampleAlgebraic, ...)

0000000000000000 <int std.variant.visitImpl!(true, ExampleAlgebraic, ...)
   0:    push   %rbx
   1:    sub    $0x10,%rsp
   5:    cmp    0x0(%rip),%rdi    # check for uninitialized Variant
   c:    je     217               # if so, go to error path
  12:    mov    %rdi,%rbx

  # This part is repeated 10 times, once for each type
  15:    movq   $0x0,0x8(%rsp)    # prepare arguments for VariantN.handler
  1e:    lea    0x8(%rsp),%rdi
  23:    xor    %esi,%esi
  25:    xor    %edx,%edx
  27:    callq  *%rbx             # VariantN.handler: get TypeInfo for current type
  29:    mov    0x8(%rsp),%rsi
  2e:    mov    0x0(%rip),%rdi    # get TypeInfo for T0
  35:    callq  3a                # Object.opEquals: compare TypeInfos
  3a:    mov    %eax,%ecx
  3c:    mov    $0x3,%eax         # load return value for T0 into %eax
  41:    test   %cl,%cl           # check if Object.opEquals returned true
  43:    jne    211               # if so, go to return
 ...:    ...    ...

 # After checking for T9
 20f:    je     283               # if none of the types matched, assert(false)
 211:    add    $0x10,%rsp
 215:    pop    %rbx
 216:    retq   

 # Error path
 217:    mov    0x0(%rip),%rdi    # get ClassInfo for VariantException
 21e:    callq  223               # _d_allocclass: allocate VariantException
 223:    mov    %rax,%rbx
 226:    mov    0x0(%rip),%rax    # initialize VariantException vtable
 22d:    mov    %rax,(%rbx)
 230:    movq   $0x0,0x8(%rbx)
 238:    mov    0x0(%rip),%rax    # initialize VariantException
 23f:    movups 0x50(%rax),%xmm0
 243:    movups %xmm0,0x50(%rbx)
 247:    movups 0x10(%rax),%xmm0
 24b:    movups 0x20(%rax),%xmm1
 24f:    movups 0x30(%rax),%xmm2
 253:    movups 0x40(%rax),%xmm3
 257:    movups %xmm3,0x40(%rbx)
 25b:    movups %xmm2,0x30(%rbx)
 25f:    movups %xmm1,0x20(%rbx)
 263:    movups %xmm0,0x10(%rbx)
 267:    lea    0x0(%rip),%rdx    # get message for VariantException
 26e:    mov    $0x2f,%esi
 273:    mov    %rbx,%rdi
 276:    callq  27b               # VariantException.__ctor
 27b:    mov    %rbx,%rdi
 27e:    callq  283               # _d_throw_exception
 283:    lea    0x0(%rip),%rsi    # get error message for assert
 28a:    mov    $0x4b,%edi
 28f:    mov    $0xa1c,%edx
 294:    callq  299               # call D runtime assert function

The good part is that ldc is able to inline the lambdas into the body of visitImpl. The bad part is, well, everything else.

Because Algebraic is implemented using Variant, it has to rely on runtime type information (TypeInfo) to check the type of the current value. And because it uses TypeInfo, rather than an integer index like the C and C++ versions, these checks can't be condensed down into a single array lookup, or even a jump table. Finally, as if that wasn't enough, each TypeInfo comparison requires a virtual method call to Object.opEquals.

Beyond these obvious pessimizations, the other thing that stands out is the sheer volume of the generated code. It's an order of magnitude larger than both the C and C++ versions, with significant bloat in both the normal and error paths. Not only is this bad in isolation, since it puts unnecessary pressure on the instruction cache, it also means that potential optimization opportunities exposed by inlining the C and C++ versions of do_visit won't be available to D code using Algebraic.

D, with SumType

This version uses my own implementation of a sum type in D. SumType does not rely at all on runtime type information, and instead uses D's powerful compile-time reflection and metaprogramming capabilities to try to generate code as close to the C version as possible. Let's see how well it does.
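
For reference, here's roughly what do_visit looks like in this version (again slightly abridged; the exact source is in the linked repository):

import sumtype;

struct T0 {}
// ...
struct T9 {}

alias ExampleSumType = SumType!(T0, T1, T2, T3, T4, T5, T6, T7, T8, T9);

int do_visit(ExampleSumType v)
{
    return v.match!(
        (T0 val) => 3,
        (T1 val) => 5,
        (T2 val) => 8,
        // ...
        (T9 val) => 233,
    );
}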

0000000000000000 <int dsumtype.do_visit(ExampleSumType)>:
   0:    push   %rax
   1:    movsbq 0x10(%rsp),%rax    # get type index from stack
   7:    cmp    $0xa,%rax          # check if type index is in-bounds
   b:    jae    19
   d:    lea    0x0(%rip),%rcx     # load address of global array
  14:    mov    (%rcx,%rax,4),%eax # get result from global array
  17:    pop    %rcx
  18:    retq   

  # Error path
  19:    lea    0x0(%rip),%rdx     # get error message
  20:    mov    $0x4ca,%edi
  25:    mov    $0x9,%esi
  2a:    callq  2f                 # __switch_error

As it turns out, it's almost exactly the same as the C version: a single array lookup, plus some code for error handling. In fact, as far as I can tell, the only difference has to do with the way the function argument is passed in registers—the C version is able to avoid spilling anything to the stack, whereas this version has a push and a pop at the beginning and end, respectively.

Still, I think it's fair to say that this is a true zero-cost abstraction, and that the claim of "zero overhead compared to hand-written C" is justified.

[UPDATE: Reddit user ais52 explains that the purpose of the push/pop pair is to adjust the alignment of the stack pointer.]

Conclusions

If creating a zero-overhead generic sum type is so easy that even I can do it, why doesn't C++ have one? Are the libc++ developers a bunch of dummies?

No, of course not—but they are working with a significant handicap: C++.

The source code for variant is terrifyingly complex. In order to get the results shown above, the authors have had to use every template metaprogramming trick in the book, and then some. It's clearly the result of immense effort by some very skilled programmers.

By contrast, the source code of SumType is almost embarrassingly simple. The reason match, SumType's equivalent of std::visit, is able to optimize down to the same code as a C switch statement is that it literally is a switch statement, with a static foreach loop inside to generate the cases.
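
To give a feel for the technique, here's the whole trick in a self-contained miniature (my own illustration, not SumType's actual source):

import std.meta : AliasSeq;

struct A {}
struct B {}
struct C {}

struct MiniSum
{
    alias Types = AliasSeq!(A, B, C);
    size_t tag;
    // (Storage for the current value is omitted; the members are empty structs.)

    auto match(handlers...)()
    {
        switch (tag)
        {
            // static foreach generates one case per member type at compile time.
            static foreach (i, T; Types)
            {
                case i:
                    return handlers[i](T.init);
            }
            default:
                assert(false, "invalid tag");
        }
    }
}

// Usage: the compiler sees an ordinary switch and optimizes it accordingly.
int visit(MiniSum s)
{
    return s.match!(
        (A _) => 3,
        (B _) => 5,
        (C _) => 8,
    );
}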

To be honest, I wasn't even thinking about optimization when I coded SumType. I wanted to make sure the interface was nice, and figured I could worry about performance later. But D makes it so easy to write simple, straightforward code that the compiler understands, I ended up falling into the pit of success and beating std::visit entirely by accident.