Building a Rusty, Lock-free Dynamically Resizable Array

Dedicated to RS, whom I wouldn't compare-and-swap with anyone else.

You can read the code we'll write and the code for the book here. The main files for the Rust code are leaky.rs and sealed.rs.

The Goal

This book has a few goals.

Inspired by Learn Rust With Entirely Too Many Linked Lists, the main goal of this book is to teach you some Rust while implementing a useful container. We'll be implementing the lock-free vector described in the paper Lock-free Dynamically Resizable Arrays by Dechev et al., 2006

I hope that this book will inspire other new Rustaceans like myself to push their capabilities. I also hope that non-Rustaceans will see how awesome Rust is. Whether you code or not, I hope that this book will show you an interesting area of computer science and a beautiful language!

Topics We'll Cover

  • Concurrency
    • Cache
    • Exponential Backoff
  • Atomics
    • Memory Orderings
    • Compare-and-Swap
  • Memory Management
    • Allocations in Rust
    • Hazard Pointers
  • Using Rust
    • Box
    • Drop
  • Using unsafe Rust
    • Raw Pointers
    • How to write unsafe code
  • Anything else I find interesting!

Necessary Experience

tl;dr it's good to know some Rust

It will be helpful to be familiar with Rust or another language like C or C++, as we will be dealing with low-level constructs like pointers, atomics, and memory management. However, even if you are only familiar with Some(_) or None of these things, I believe you will be able to learn an interesting thing or two. I should say, though, there is a lot of code in the later portions of the book.

Of course, the code will be in Rust, so prior knowledge will be helpful. I'm not going to spend time explaining syntax, but I will comment the code well and explain what is going on. If you're comfortable with the first 15 chapters of The Book, you should be fine; even if not, as long as you understand most of Rust's syntax and don't mind looking something up every once in a while, you'll manage. Chapter 16 is very helpful as well, as it's the chapter on concurrency.

A Message

Although I want you to come away from this book having learned something, and hopefully feeling inspired to write more exciting code, this book is also a learning experience for me.

I've never used Markdown before except for writing GitHub READMEs that only I'm going to read, so simply using mdBook is super exciting! I think Markdown is cool because it's like HTML for lazy people (and I am very lazy).

Throughout the book, I will have many questions. I started teaching myself Rust about half a year ago, so perhaps I'm not really qualified to write this book. I'd never touched an atomic variable before I started this project (and I still technically haven't), so I really mean it when I say this is a learning experience for me. I'll try to document the answers to the questions I have so you can learn from them as well. There will also be a whole section of reflections, as this is also for a school project. You might enjoy that section more than the technical sections.

Before we get started, I want to clarify the structure of the book. There are three main "sections": Theory/Algorithm, Code, and Reflections. Feel free to bounce around if one section becomes too much.

And without further ado . . . cargo run!

Concurrency

Concurrent (Merriam-Webster): operating or occurring at the same time

Concurrent programming is simply programming in which more than one event happens at a time, in the sense that we think of events happening in a program. In a non-concurrent program, if we wanted to perform two calculations, we would perform one and then the other. In a concurrent approach, we might spawn two threads and assign each of them a calculation to perform. A big idea in concurrent programming is having multiple processes running at the same time. You can think of it like your computer running Firefox and Spotify at the same time.1

On a hardware level, one way to implement concurrency is to have multiple CPU cores (processors, the chips that do the math). Then we can add two numbers on one core while dividing two numbers on another core.

1 Your computer might actually just be switching between the applications really fast if you only have one CPU core, giving the illusion of multiple processes happening at the same time. Even if you have many cores, it's possible that the applications could be running on the same core. It's all up to the task scheduler.
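To make the idea of spawning threads concrete, here's a minimal Rust example of spawning two threads, giving each a calculation to perform, and then joining them back into the main thread (the calculations themselves are just placeholders):

use std::thread;

fn main() {
    // Spawn two threads, each performing its own calculation
    let adder = thread::spawn(|| 2 + 2);
    let divider = thread::spawn(|| 10 / 2);

    // Join the threads back into the main thread and collect their results
    let sum = adder.join().unwrap();
    let quotient = divider.join().unwrap();
    println!("{sum} {quotient}");
}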

Keywords

Note: don't read all of this if it's boring, just come back if there is a word you don't know later on.

Shared-state

A concurrency model where multiple threads operate on the same memory.

Lock-free

A property of a system guaranteeing that after some finite number of time steps, at least one thread will make progress.

Ok, but what does lock-free actually mean? Suppose we have a thread holding a mutex. If that thread gets scheduled off by the OS, and never gets scheduled on again, no other threads can get the mutex and make progress. With a lock-free algorithm, we are guaranteed that after some amount of time, at least one thread will make progress; one thread cannot block all of the others indefinitely.

Buffer

Some block of memory that we use to hold data. The vector's buffers are where we hold the actual elements.

Cache

A type of memory that is faster to access but has less capacity than main memory (RAM). If we need a value frequently, we can cache it so that accesses are faster.

Null Pointer

The pointer with address 0x0. Never safe to dereference.

Heap

A region of memory for storing long-lived or large values.

Stack

A region of memory for storing local variables and other small, fixed-size values.

Data race

When multiple threads access a value without synchronization and one of them is writing. Picture a shared Google document. If people are trying to read it as someone writes on it, they'll be reading gibberish until the writer is done.

Thread

Like another program running within the main program. Threads can run at the same time as each other, and the first thread (the one we have at program start) is called the main thread. Eventually, all threads are joined back into the main thread.

Mutex ({mut}ual {ex}clusion)

A data structure that gives a thread exclusive access to some data. When you call lock on a Mutex, you block until the Mutex is unlocked and you get the lock. Then, you can do whatever you want with the data inside. Once you're done, you unlock the Mutex to relinquish it back to the other threads.
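Here's a minimal sketch of that flow using the Standard Library's Mutex (in Rust, the unlock happens automatically when the lock guard goes out of scope):

use std::sync::Mutex;

fn main() {
    let counter = Mutex::new(0);

    {
        // lock() blocks until we have exclusive access to the data
        let mut guard = counter.lock().unwrap();
        *guard += 1;
    } // the guard is dropped here, unlocking the Mutex for other threads

    println!("{}", counter.lock().unwrap());
}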

Atomics

Besides having a cool name, atomics are crucial for writing concurrent code.

We first need to think about how computers perform operations. On a 32-bit machine, loading (reading) a 64-bit value would require two CPU operations, one for the first 32 bits and one for the second 32 bits.

Suppose each box represents a byte (8 bits):


v    Load 1             v
+-----+-----+-----+-----+-----+-----+-----+-----+
|     |     |     |     |     |     |     |     |
+-----+-----+-----+-----+-----+-----+-----+-----+
                        ^         Load 2        ^

This shows how a load of a variable can take multiple steps.

Atomic operations take only one step. They have no intermediate observable state: other threads can only ever observe them as having happened or not.

This is very important in multithreaded scenarios because if threads use non-atomic operations, loads and stores might end up overlapping, resulting in torn reads and writes.

For example, on our hypothetical 32-bit machine, one core might finish the first of the two stores to a 64-bit value, another core might then perform the two loads needed to read the value, and then the first core might finish storing the last 32 bits. Now, one core has a value that is half gibberish!

This is a data race, which is undefined behavior.

What are Memory Orderings?

In a concurrent environment, each variable has a modification history: all the values it has held. Say we have a variable A. We could store 1 into it, then 2, then 3.

The problem comes from the fact that another thread reading A can "read" any of those values, even after the last store is executed, "in real time". For example, it might have an older (stale) copy of the variable cached.

To ensure that our programs run the way we want, we need to specify more explicitly which values in the modification history the CPU is allowed to use.

Another problem is the compiler reordering instructions. The Golden Rule of instruction reordering is do not modify the behavior of a single-threaded program. The compiler might not think it's doing anything wrong moving some instructions around in one thread. And from the perspective of the thread that's being modified, everything will seem alright. Other threads might start receiving crazy results though.

An ordering is a parameter you provide to operations with atomic variables that specifies which reorderings can happen and which values in the modification history the CPU can use.

I'm not going to go super in-depth into the intricacies of each ordering, but I will explain the important parts of each. If you're curious, Jon Gjengset has a great YouTube video on atomics, with sections on each ordering: Crust of Rust: Atomics and Memory Ordering.

Going into the orderings, I find it helpful to separate their effects into two categories: those that have to do with compiler reordering, and those that have to do with the CPU. The compiler deals with the synchronization within the operation's thread (well, actually the CPU does too, but that's a different story), and the CPU handles the synchronization across the other threads.

Relaxed

The first ordering is Relaxed. When it comes to the CPU, there are no guarantees imposed by this ordering. The compiler can reorder Relaxed operations as long as it follows the Golden Rule; it does not need to consider other threads. The classic use case (I think this use case is classic at least, I always see it used in examples) of the Relaxed ordering is incrementing/decrementing a counter. We don't really care about observing the state of the counter; we just want to make sure our updates happen correctly. When we finally load the counter, we can use an ordering with stronger guarantees.
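Here's a sketch of that counter use case: the increments are Relaxed, and the final load (after all the threads have been joined) uses a stronger ordering:

use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

static COUNTER: AtomicUsize = AtomicUsize::new(0);

fn main() {
    let handles: Vec<_> = (0..4)
        .map(|_| {
            thread::spawn(|| {
                for _ in 0..1000 {
                    // We only care that the increment happens atomically,
                    // not about ordering relative to other memory operations
                    COUNTER.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    // Load with a stronger ordering once we actually read the result
    println!("{}", COUNTER.load(Ordering::Acquire));
}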

Release

Release is used with stores. You can think of Release as Releasing a lock. We want any changes that happened while we had the lock to become visible to other threads. When you store with Release, it's like saying "I'm done with this, use these changes." Thus, the compiler cannot reorder operations that come before a Release store to after it.

STORE (Relaxed) ─┐
STORE (Release) -+-// "Release the lock"
    X          <─┘ // nope, this happened while we "held the lock"

There is also a CPU property to this ordering, which I'll go over with Acquire.

Acquire

Acquire is used with loads. You can think of Acquire like Acquiring a lock. This means that no memory operations in the current thread that come after the Acquire load can get reordered to before it. Anything that happens after "taking the lock" stays after the "lock was taken".

    X                              <─┐ // nope, this happened "after taking the lock"
    X               <─┐              │ // nope, this happened "after taking the lock"
LOAD (Acquire) -------+--------------+-// "Take the lock"
STORE (Relaxed)      ─┘              │
LOAD a different variable (Relaxed) ─┘

Acquire also has an important interaction with Release at the CPU level. Any Acquire (or stronger) load must see the changes published by a Release store to the same variable.

How does this achieve proper synchronization? You see, when two Orderings love each other very much . . . we get Acquire-Release semantics. Watch what happens when we use Acquire and Release together (diagram inspired by this blog post):

┌───┐ Acquire load
Memory operations cannot go above


Memory operations cannot go below
└───┘ Release store

  | Read most recent data because the load is Acquire and the store is Release
  V

┌───┐ Acquire load
Memory operations cannot go above


Memory operations cannot go below
└───┘ Release store

All operations are trapped in their own sections, and each section gets the most recent modifications because of the way Acquire loads synchronize with the Release stores.

Note: Although the lock metaphor is helpful for understanding Acquire and Release, remember there are no actual locks involved.
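Here's a small sketch of the pattern in action: one thread writes some data and then does a Release store to a flag; another thread Acquire-loads the flag, and once it sees the flag set, it is guaranteed to also see the data written before the Release store:

use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::thread;

static DATA: AtomicU64 = AtomicU64::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let producer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed); // happens before the Release store
        READY.store(true, Ordering::Release); // "release the lock"
    });

    let consumer = thread::spawn(|| {
        // Spin until the Acquire load sees the flag set by the Release store
        while !READY.load(Ordering::Acquire) {}
        // Guaranteed to see 42: the Acquire load synchronized with the Release store
        assert_eq!(DATA.load(Ordering::Relaxed), 42);
    });

    producer.join().unwrap();
    consumer.join().unwrap();
}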

AcqRel (Acquire and Release)

An AcqRel operation acts as Release for its store half and Acquire for its load half. When used with an operation that both loads and stores, it is both Acquire and Release. AcqRel's main use case is Read-Modify-Write operations, like loading a variable, adding one, and storing it back. We want the load to be Acquire and the store to be Release, so we use AcqRel to achieve this. Foreshadowing: this ordering will play a prominent part later on!

SeqCst (Sequentially Consistent)

The SeqCst ordering has the same reordering effects as AcqRel, and also establishes a consistent modification order across all threads. Two stores tagged Relaxed might show up in different orders to different threads. However, if they are both tagged SeqCst, they will show up in the same order to all threads. SeqCst is the strongest ordering, and thus also the safest (see Jon Gjengset's video for weird things that can happen with weaker orderings). Safety comes at a price though, with the CPU often having to emit memory fences1 to guarantee sequential consistency. This can affect performance.

1 A memory fence prevents the CPU from reordering operations in certain ways. This is a great article which describes many different types of fences, kind of like the different Atomic orderings, which restrict the compiler instead of the CPU.

Compare-and-Swap

(also known as CAS and compare_exchange)

Definition: swap a value with a new value only if the current value is what we think it is.

The reason this is important is that, loosely, we can say the state during the swap is the same as the state we observed when preparing for the swap.

Here's a code example:

LOAD A
CAS old = A, new = 2 * A // Only swap in the double if the number hasn't changed

A more realistic example with a linked list would be this:

LOAD LIST_NODE

CAS old = LIST_NODE.pointer, new = new_pointer

In this case, we switch the LIST_NODE pointer only if it hasn't changed.

Here's what we did:

  1. We loaded the node
  2. We read the pointer
  3. We called CAS with the new pointer

At this point, there are two things that can happen:

  1. The pointer changed, and the CAS fails. This means that someone else changed the pointer first, and it's good that the CAS failed, because it's possible that the change that succeeded invalidates the change we just tried to make.
  2. The pointer is the same, and CAS succeeds. Because the pointer was the same, our assumptions about the state of the vector held, and our change was valid.

At first, this might seem contrived and confusing (as it did to me). I would focus on this intuition: if CAS succeeds, loosely, we can say the state during the swap was the same as the state we observed when preparing for the swap. Our assumptions were consistent throughout the whole process.

The compare_exchange function in the Rust Standard Library returns a Result<T, T>, where T is the type being exchanged. The Result contains the value that the variable actually was. If compare_exchange fails, it returns Err(actual_value); on success, it returns Ok(expected_value) (if it succeeded, that means actual_value == expected_value).
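To tie this to the doubling example from above, here's a sketch of a retry loop using the Standard Library's compare_exchange:

use std::sync::atomic::{AtomicU64, Ordering};

// Atomically double `a`, retrying if another thread changes it under us
fn double(a: &AtomicU64) {
    let mut current = a.load(Ordering::Acquire);
    loop {
        match a.compare_exchange(current, current * 2, Ordering::AcqRel, Ordering::Acquire) {
            // Ok(_) means the value was still `current`, so our doubled value went in
            Ok(_) => return,
            // Err(actual) means someone beat us to it; retry with the value we saw
            Err(actual) => current = actual,
        }
    }
}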

Note: for the rest of the book, I'm going to refer to compare-and-swap as compare_exchange, as that is what the Rust Standard Library uses. I used compare-and-swap on this page because the name is very explicit about what the operation does.

Introduction to the Paper

Firstly, a link to the paper can be found here.

In their 2006 paper, Dechev et al. describe an implementation of a vector that can safely be shared across threads.

Structure of the vector

This is where we begin working on the vector.

When thinking about the structure of the vector, I find it helpful to think about it in two parts: memory and synchronization. By memory I mean allocation of space and by synchronization I mean synchronizing reads and writes. Let's start with memory.

Memory

The vector is, broadly, a two-level array.

+---+---+---+---+---+
| 1 | 2 | 3 | 4 | 5 | Top-level
+---+---+---+---+---+
  |   |   |   |   |
  v   v   v   v   v
+---+---+---+---+---+
| 1 | 2 | 3 | 4 | 5 | Lower-level, notice: these arrays are represented vertically
+---+---+---+---+---+
    | 2 | 3 | 4 | 5 |
    +---+---+---+---+
        | 3 | 4 | 5 |
        +---+---+---+
            | 4 | 5 |
            +---+---+
                | 5 |
                +---+

The vector stores a pointer to a first array (the top-level one). The elements in this array are also pointers, to more arrays (the lower-level ones). This is why this organization is called a two-level array.

The reason for this is resizing. Suppose we have 4 elements of capacity, and they are all filled. We need to allocate more memory. We allocate more memory using something like malloc(), and the allocator returns a pointer to the new allocation.

For a normal vector, we would simply copy our vector's elements over to the new allocation. We can't do this for a lockless vector though because copying isn't atomic, and we can't lock down the vector, copy, and unlock. Therefore, we need a different system.

A little tangent on allocations: When allocating memory for a normal vector, we generally make larger and larger allocations. For example, the first allocation could be 4 elements, the next 8, then 16, 32 . . . This reduces the total number of allocations we need to perform, which is good for performance. We're going to use this same idea for the vector.

Returning to the idea of a two-level array, the first level is going to hold pointers to blocks of memory we can call buckets. The amount of memory a bucket holds is determined by its index: the bucket at index i holds FIRST_BUCKET_SIZE * 2^i elements, where FIRST_BUCKET_SIZE is some constant. Here are some sample calculations for the first few buckets to show the principle, using FIRST_BUCKET_SIZE=8:

# Bucket 1
CAP = FIRST_BUCKET_SIZE * 2 ^ INDEX
    = 8 * 2 ^ 0
    = 8

# Bucket 2
CAP = FIRST_BUCKET_SIZE * 2 ^ INDEX
    = 8 * 2 ^ 1
    = 16

3: 32
4: 64
To infinity and beyond . . .
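In code, a bucket's capacity is just a function of its (0-based) index; here's a tiny sketch with FIRST_BUCKET_SIZE = 8:

const FIRST_BUCKET_SIZE: usize = 8;

// Capacity of the bucket at `index`: FIRST_BUCKET_SIZE * 2^index
fn bucket_capacity(index: usize) -> usize {
    FIRST_BUCKET_SIZE * (1 << index) // 8, 16, 32, 64, ...
}

The shift-left by the index is the same trick the real allocate_bucket uses later to compute 2 to the power of the bucket index.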

The next part of the vector's structure is the synchronization aspect, which goes hand in hand with explaining the algorithm. I'll cover them together.

Synchronization

Synchronization, that is, coordinating concurrent operations on the vector, is achieved through two little data structures: the Descriptor and the WriteDescriptor. As you might expect, the Descriptor describes the vector and the WriteDescriptor describes a write operation.

The Descriptor

The descriptor holds two values: a pointer to a WriteDescriptor, and a value indicating the size of the vector.

The WriteDescriptor

The WriteDescriptor holds three values: the location of the write (a pointer-like object), an old value, and a new value. If your spidey-sense is tingling, it might be because this new/old business is hinting at a compare_exchange in the future.

Now that we've seen the Descriptor and WriteDescriptor, here's a quick summary of the vector's structure:

# Data Organization
Vector: 
    [Pointer -> Memory],
    Pointer -> Descriptor

Descriptor: 
    Pointer -> Possible WriteDescriptor, 
    Size

WriteDescriptor: 
    Pointer -> Element location, 
    New Element, 
    Old Element

How does this actually help with synchronization?

The major challenges of providing lock-free vector implementation stem from the fact that key operations need to atomically modify two or more non-colocated words (Dechev et al., 2006)

This translates to, "We need to change two things (without locking the vector down) to ensure the vector is in the right state." For a push operation, say, we would need to change the length of the vector and write the new data.

The descriptor system gets around this by saying, "If you want to change the Descriptor, you need to complete a pending write if there is one." Why does this ensure the correct semantics? Consider this example from the paper:

The semantics of the pop_back[pop] and push_back[push] operations are guaranteed by the Descriptor object. Consider the case when a pop_back is interrupted by any matching number of push_back and pop_back operations. In a naive implementation, the size of the vector would appear unchanged when the original pop_back resumes and the operation could produce an erroneous result. (Dechev et al., 2006)

Under the "naive implementation", in this scenario, the vector might look like [1, 2, 3]. Someone calls pop, and the vector should return 3. However, the thread gets preempted (the OS says another thread can run, and the current thread is paused), and the running thread executes a bunch of pops and pushes. The vector becomes [4, 5, 6]. When the original pop finally runs, it incorrectly returns 6.

Let's consider when the first push happens after the original pop under the correct implementation. When the push happens, it swaps in a new Descriptor, which says that the size is now one bigger and points to a new WriteDescriptor representing a push operation. Because it swapped in a Descriptor, it has to complete the operation specified in the current WriteDescriptor, and the original pop returns 3, as it should.

Let me clarify all this swapping business!

The Algorithm

As I’ve said before, I think of the vector as two connected systems: memory and synchronization. By “The Algorithm”, I mean the synchronization aspect. To recap, synchronization is controlled by two little data structures, the Descriptor and the WriteDescriptor. These data structures describe the vector itself and a write operation, respectively.

I think the best way to explain the algorithm is to dive right in.

complete_write()

First, I want to explain a little routine called complete_write. This function is true to its name and completes a write.

Write means "write operation", which in this context is a push or pop. In my experience, "write" is a colloquial CS term for any modification we make to something. Technically anything can be a "write", but the term usually refers to the more significant modifications: incrementing a loop variable is pretty insignificant in the grand scheme of things, while increasing the size of the vector is an important "write". The usage is especially common in concurrent programming, where balancing reads and writes is an important consideration when designing a data structure (concurrent data structures are often designed for infrequent writes and frequent reads). Modifications to databases (which are heavily concurrent) can also be called writes. tl;dr a "write" in this case is a modification to the vector: a push or a pop.

complete_write takes two arguments, a WriteDescriptor, and the vector itself. complete_write applies the write operation described in the WriteDescriptor on the vector. Recall that a WriteDescriptor contains three things: a reference/pointer to the location where the write will take place, a new value to write, and an old value that we loaded in from the location.

First we perform a compare_exchange using the data in the WriteDescriptor. We only swap in the new data if the data at the location of the swap matches the old data we have. If the compare_exchange succeeds, this means that we swapped in the value we want to write. If it fails, it means someone else beat us to it and performed the write. Remember, many threads can access the vector's Descriptor and WriteDescriptor at once, so many threads will be trying to complete the same write. Only one of them can succeed. It's a fight to the death! Arrghhh!!!

I'm kidding. After performing the compare_exchange, successful or not, we modify the vector to indicate that there is no pending write operation. If all threads do this, at least one will succeed, and all will indicate that there is no pending write operation. Though some of the threads may be sad because their compare_exchange failed, the vector is happy because it's in a consistent and correct state.

push()

Now that we know writes are actually performed, let’s get into how a push operation works. Here are the steps:

  1. Load in the current Descriptor.

  2. If the Descriptor contains a write operation, complete it. This is important because it ensures that before any new write operation happens, the previous one is completed. We cannot do anything before completing the previous write operation, so all operations will eventually get executed.

  3. Calculate which bucket our new element will go into.

  4. If that bucket has not been allocated memory yet, do so.

  5. Make a new WriteDescriptor. The new value in the WriteDescriptor will be the data passed into the push function.

  6. Make a new Descriptor which contains the following data: the size held in the current Descriptor + 1, and the new WriteDescriptor.

  7. Now, here comes the part that makes this a compare-and-swap or compare_exchange algorithm. We compare_exchange the new Descriptor we made with the old one. If the Descriptor held in the vector didn't change, our new Descriptor will replace it. If it did change, we will fail to swap in our new Descriptor, and we go back to Step 1.

    Note: I think it's important to consider why this routine (particularly step 7) ensures correctness. If the compare_exchange succeeds, this means that the vector did not change in the time it took us to prepare a new Descriptor. Why is this important? It means our assumptions about the vector's state did not change. In our new Descriptor, we used the size from the Descriptor we loaded in, and incremented that. So, if the size we loaded in was 4, our new Descriptor would say the size of the vector is 5. Now, imagine that we could just swap in our fresh Descriptor without comparing it with the current one. If someone else was also trying to push, their Descriptor might get swapped in before ours. It would say the size of the vector is 5, because it made the same assumptions we did. Then, when we swapped in our Descriptor, ours would still maintain that the size of the vector is 5, even though it should be 6 because there were two push operations. Furthermore, we would overwrite the element that was pushed on by the first call to push, because both our WriteDescriptors would be referencing the same location in memory. This is terrible! compare_exchange is our friend.

  8. Now that we have swapped in our Descriptor, we execute the WriteDescriptor we made using complete_write, finalizing the changes we want to make to the vector.

And that's a push!

Pop pretty much works the same except for some small variations, so we'll get into that when we implement push/pop. However, the way we make sure changes are valid using compare_exchange is identical for both operations.

I think it's finally time to start looking at some code. When I was writing code, it felt very different from reasoning about the theory. I really felt like I had to consider every line I wrote and every decision I made. I'll walk you through what I came up with now.

Note: we're going to first write a version of the vector that doesn't reclaim the memory it uses; it leaks.

Starting Code

Pseudocode

This "pythonesque" pseudocode with some pointer operations thrown in shows the general API and implementation details of the vector. The pseudocode is a conversion of the paper's pseudocode into a more (in my opinion) understandable form. It completely ignores memory reclamation.

You don't need to read this entire thing, it's just here as a reference.

# Calculate the index of the correct bucket
# Return a pointer
def at(vector, i):
    pos = i + FIRST_BUCKET_SIZE
    hibit = highest_bit(pos)
    index = pos ^ 2 ** hibit
    return &vector.memory[hibit - highest_bit(FIRST_BUCKET_SIZE)][index]
# Perform an atomic load at the correct index
def read(vector, i):
    return *at(vector, i).load(Ordering)
# Perform an atomic store at the correct index
def write(vector, i, elem):
    return *at(vector, i).store(elem, Ordering)
# Calculate the number of allocations needed
# Then perform each allocation
def reserve(vector, size):
    i = highest_bit(vector.descriptor.size + FIRST_BUCKET_SIZE - 1)
        - highest_bit(FIRST_BUCKET_SIZE)
    if i < 0:
        i = 0
    while i < highest_bit(size + FIRST_BUCKET_SIZE - 1)
        - highest_bit(FIRST_BUCKET_SIZE):
        i += 1
        allocate_bucket(vector, i)
# Calculate the amount of memory needed
# Allocate that much memory
# Try to CAS it in
# If CAS fails, the bucket is already initialized, so free the memory
def allocate_bucket(vector, bucket):
    bucket_size = FIRST_BUCKET_SIZE * (2 ** bucket)
    mem = allocate(bucket_size)
    if not CAS(&vector.memory[bucket], nullptr, mem):
        free(mem)
# Get the size of the current descriptor
# If there is a pending write operation, subtract one from the size
def size(vector):
    size = vector.descriptor.size
    if vector.descriptor.writeop.pending:
        size -= 1
    return size
# Get the current descriptor
# Complete a pending write operation
# Allocate memory if needed
# Make a new WriteDescriptor
# Try to CAS it in
# If CAS failed go back to first step
# Complete a pending write operation
def push(vector, elem):
    while True:
        current_desc = vector.descriptor
        complete_write(vector, current_desc.pending)
        bucket = highest_bit(current_desc.size + FIRST_BUCKET_SIZE)
            - highest_bit(FIRST_BUCKET_SIZE)
        if vector.memory[bucket] == nullptr:
            allocate_bucket(vector, bucket)
        writeop = WriteDescriptor(
            *at(vector, current_desc.size),
            elem,
            current_desc.size
        )
        next_desc = Descriptor(1 + current_desc.size, writeop)
        if CAS(&vector.descriptor, current_desc, next_desc):
            break
    complete_write(vector, next_desc.pending)
# Get the current descriptor
# Complete a pending write operation
# Read the last element of the vector
# Make a new WriteDescriptor
# Try to CAS it in
# If CAS failed go back to first step
# Return the last element
def pop(vector):
    while True:
        current_desc = vector.descriptor
        complete_write(vector, current_desc.pending)
        elem = *at(vector, current_desc.size - 1).load(Ordering)
        next_desc = Descriptor(current_desc.size - 1, null)
        if CAS(&vector.descriptor, current_desc, next_desc):
            break
    return elem

Memory Allocation

The first thing I did when writing the code out was think about the pieces that would make up the vector. In Rust, an extremely common building block for any type is the struct. A struct just sticks its members' data next to each other in memory. Here is the vector itself, as a struct:

#![allow(unused)]
fn main() {
pub struct SecVec<'a, T: Sized + Copy> {
    buffers: CachePadded<Box<[AtomicPtr<AtomicU64>; 60]>>,
    descriptor: CachePadded<AtomicPtr<Descriptor<'a, T>>>,
    _boo: PhantomData<T>, // Data is stored as transmuted T's
}
}

Boo! 👻

I bet the PhantomData scared you. We have a generic parameter T, but no struct member of SecVec (or of either of the descriptors) actually contains a T (because we transmute T into u64s). Therefore, to let the compiler know we really are carrying T's, we add a little ghost that tells it, "We're carrying this Phantom T wink "

Sharing is caring

There is a lot to unpack here. Firstly, CachePadded is a struct provided by the crossbeam_utils crate.

A note on cache: you may have heard of CPU cache, a small buffer of memory that sits between the CPU and main RAM to allow for fast access (there are actually several levels of cache, each larger and slower than the last). The cache is split into contiguous blocks of memory called cache lines. This is the most granular level at which cache coherency is maintained. When multiple threads have values in the same cache line, one thread modifying the value it owns marks the entire cache line as "dirty". Even though the other thread's value hasn't changed, the cache coherency protocol might cause that thread to reload the entire line when it uses its value, incurring some overhead. This is called false sharing, and it can cause severe performance degradation. Cache is an extremely important consideration when designing data structures. It's why linked lists are algorithmically fine but terribly slow in practice. As the saying goes, cache is king.

The CachePadded struct aligns its contents to the beginning of the cache line to prevent false sharing. If all CachePadded objects are at the beginning of a cache line (assuming they do not cross a cache line), there can't be false sharing between them. Preventing false sharing can lead to a huge speedup, but it also does increase the size of the type. If you're wondering how CachePadded is implemented, check out #[repr(align(n))] in the Nomicon.

Here's how I picture cache padding:

|-----Cache line-----|-----Cache Line-----|
v                    v                    v
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|69|xx|xx|xx|xx|xx|xx|42|xx|xx|xx|xx|xx|xx|
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
^                    ^
 \                    \
  \                    \
   \                    \
    Different cache lines -> no false sharing
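If you want to see the padding for yourself, crossbeam_utils exposes CachePadded directly; wrapping a value pads and aligns it out to its own cache line (the exact printed sizes depend on the target architecture):

use crossbeam_utils::CachePadded;
use std::sync::atomic::AtomicU64;

fn main() {
    let padded = CachePadded::new(AtomicU64::new(0));

    // An AtomicU64 is 8 bytes, but the padded version occupies a full
    // cache line (128 bytes on most x86_64 targets)
    println!("{}", std::mem::size_of::<AtomicU64>());
    println!("{}", std::mem::size_of_val(&padded));
}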

Two-level array

The first member of SecVec<T> is a cache-padded array of 60 pointers allocated on the heap (notice the Box). These pointers will each point into another array. The pointers start off as null pointers (0x0), and will get swapped out for valid pointers once they need to point to an actual array.

The AtomicPtrs point to AtomicU64s because each element is going to get transmuted into a u64 so that we can atomically perform writes on the vector. When returning an element, we'll transmute it back into a T. Transmuting means interpreting the bits of one type as the bits of another.

For example, 0b10100001 means -95 when interpreted as a signed integer but 161 when interpreted as an unsigned integer. Transmuting one to the other would just change how we interpret the bits, not the actual bits themselves.
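Here's the signed/unsigned example written out as a toy demonstration (the vector itself uses transmute_copy between T and u64, but the idea is the same):

fn main() {
    let signed: i8 = -95;

    // Same bits (0b10100001), different interpretation
    let unsigned: u8 = unsafe { std::mem::transmute::<i8, u8>(signed) };
    assert_eq!(unsigned, 161);

    // For two integer types of the same size, an `as` cast reinterprets
    // the bits in the same way, without any unsafe
    assert_eq!(signed as u8, 161);
}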

Descriptors galore

The second member of SecVec<T> is a cache-padded AtomicPtr to a Descriptor. As you've probably noticed, there are a bunch of AtomicPtrs here. That's because we can modify the pointer atomically, specify which Ordering to use, and compare_exchange the pointer. A common way of writing data in concurrent programming is to change a pointer instead of actually modifying a buffer. Since a buffer can't necessarily be modified atomically or without locking, what we can do is prepare a buffer and then change a pointer so that it points to our new buffer. All new readers will see the new data when they dereference the pointer.

                 Pointer
                 /     \
                /       \
           +---+        +----+
          /                   \
         /         ->          \
        v                       v
       Old                      New
+---+---+---+---+        +---+---+---+---+
| 9 | 9 | 9 | 9 |        | 6 | 6 | 6 | 6 |
+---+---+---+---+        +---+---+---+---+

What do we do with the old pointer you might ask? Worry not, we will get into that 😅

The Descriptor and WriteDescriptor

#![allow(unused)]
fn main() {
pub struct Descriptor<'a, T: Sized> {
    pending: AtomicPtr<Option<WriteDescriptor<'a, T>>>,
    size: usize,
}

pub struct WriteDescriptor<'a, T: Sized> {
    new: u64,
    old: u64,
    location: &'a AtomicU64,
    _boo: PhantomData<T>, // New and old are transmuted T's
}
}

The trait bounds

Notice how T is Sized, this means that its size is always known at compile-time. We need to ensure this because our values need to be transmutable. Part of the safety contract of transmute_copy is making sure our types are of compatible sizes.

The Copy bound is necessary because the data in the vector is copied in and out of the buffers, with transmute_copy.

OK, enough talk about structs, let's get to the first function: get()

get()

The first, and simplest function to write is vector.get(i), which returns a pointer to the element at index i.

Here is the code that implements get.

#![allow(unused)]
fn main() {
/// Return a *const T to the index specified
///
/// # Safety
/// The index this is called on **must** be a valid index, meaning:
/// there must already be a bucket allocated which would hold that index
/// **and** the index must already have been initialized with push/set
unsafe fn get(&self, i: usize) -> *const AtomicU64 {
    // Check for overflow
    let pos = i
        .checked_add(FIRST_BUCKET_SIZE)
        .expect("index too large, integer overflow");

    let hibit = highest_bit(pos);

    let offset = pos ^ (1 << hibit);

    // Select the correct buffer to index into
    // # Safety
    // Since hibit = highest_bit(pos), and pos >= FIRST_BUCKET_SIZE
    // The subtraction hibit - highest_bit(FIRST_BUCKET_SIZE) cannot underflow
    let buffer = &self.buffers[(hibit - highest_bit(FIRST_BUCKET_SIZE)) as usize];

    // Check that the offset doesn't exceed isize::MAX
    assert!(
        offset
            .checked_mul(mem::size_of::<T>())
            .map_or(false, |val| val < isize::MAX as usize),
        "pointer offset exceeds isize::MAX bytes"
    );

    // Offset the pointer to return a pointer to the correct element
    unsafe {
        // # Safety
        // We know that we can offset the pointer because we will have allocated a
        // bucket to store the value. Since we only call values that are
        // `self.descriptor.size` or smaller, we know the offset will not go out of
        // bounds because of the assert.
        buffer.load(Ordering::Acquire).add(offset)
    }
}
}

A few points to note

Notice how the function is marked as unsafe. This is because there is a safety contract the compiler can't enforce: the index must be valid. This is automatically guaranteed through the usage of the function in the algorithm, but it's worth marking it unsafe just to be explicit.

Summarizing what the function does: we calculate which buffer the item is in, load the pointer to the start of the buffer, and offset it to the correct element. There are two other things I want to point out. First, notice all the checks we make to avoid overflow. Second, notice the use of Acquire for loading in the pointer to the buffer. Since the store part of the compare_exchange(AcqRel) we use to set the pointer to the buffer is Release, we are guaranteed to get the most recent pointer, because an Acquire load sees the contents Released by a Release store! I find it very satisfying how Acquire and Release work together. It's like two puzzle pieces fitting nicely into each other.

What are all these bitwise operations?

I'm honestly not sure. That's why they wrote the paper and I didn't :).
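For what it's worth, here's one plausible way to write the highest_bit helper using leading_zeros (a sketch; I'm not claiming it's exactly what ends up in the final code). With it, pos ^ (1 << hibit) just clears the highest set bit of pos, leaving the offset within the bucket, and hibit - highest_bit(FIRST_BUCKET_SIZE) selects which bucket.

// Position (0-indexed) of the highest set bit of a non-zero number
// e.g. highest_bit(8) == 3, highest_bit(7) == 2
fn highest_bit(num: usize) -> u32 {
    debug_assert!(num != 0);
    usize::BITS - 1 - num.leading_zeros()
}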

Next up is allocate_bucket.

allocate_bucket()

Remember that whole "two-level array" thingy? This is where it starts coming into play. allocate_bucket does just what it sounds like: allocating a bucket. Recall that a bucket is one of the arrays in the second level of the two-level array.

+---+---+---+---+---+
| 1 | 2 | 3 | 4 | 5 |
+---+---+---+---+---+
  |   |   |   |   |
  v   v   v   v   v
+---+---+---+---+---+
| 1 | 2 | 3 | 4 | 5 |
+---+---+---+---+---+
    | 2 | 3 | 4 | 5 |
    +---+---+---+---+
        | 3 | 4 | 5 |
        +---+---+---+
          ^ | 4 | 5 |
          | +---+---+
          |     | 5 |
          |     +---+
          |
        we're allocating one of these little guys

There are two parts to allocate_bucket: allocating the memory and setting the pointer. We start off by tapping into the alloc crate's API. First, we create a Layout, which describes the allocation we want. Layout::array::<AtomicU64>() indicates that we want a bunch of AtomicU64s right next to each other in memory. If creating the layout fails (due to overflow), we call capacity_overflow, which just panics.

You might ask, why not just directly call panic!? Apparently, it reduces the generated code size if we just have panic in one function, which we then call from multiple places. I found this trick in the source code for std::vec::Vec. You can learn a lot from reading the Standard Library code. That's how I've learned a lot of the low level stuff I know. It's also a good way to see what good, idiomatic Rust looks like.

#![allow(unused)]
fn main() {
const FIRST_BUCKET_SIZE: usize = 8;

fn allocate_bucket(&self, bucket: usize) {
    // The shift-left is equivalent to raising 2 to the power of bucket
    let size = FIRST_BUCKET_SIZE * (1 << bucket);
    let layout = match Layout::array::<AtomicU64>(size) {
        Ok(layout) => layout,
        Err(_) => capacity_overflow(),
    };

}

The next thing we do is just another check. The Standard Library does both checks and I trust their strategy.

#![allow(unused)]
fn main() {
// Make sure allocation is ok
match alloc_guard(layout.size()) {
    Ok(_) => {}
    Err(_) => capacity_overflow(),
}

}
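For reference, here's roughly what those two helpers might look like (a sketch modeled on the helpers in the Standard Library's RawVec; the exact signatures in the book's code may differ):

// A single cold, non-inlined panic site keeps the generated code smaller
#[inline(never)]
#[cold]
fn capacity_overflow() -> ! {
    panic!("capacity overflow");
}

// Allocations larger than isize::MAX bytes are never allowed
fn alloc_guard(alloc_size: usize) -> Result<(), ()> {
    if alloc_size > isize::MAX as usize {
        Err(())
    } else {
        Ok(())
    }
}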

Miri is about to make its debut! Miri is a tool that runs your code in a special environment and detects undefined behavior (or UB, as the cool kids call it).

Now that our layout is all good, we can perform the actual allocation. We instantiate the Global struct, which is the allocator we're using. The allocator returns a pointer to our new allocation once it's finished allocating. Why are we using allocate_zeroed, you might ask? Why not just allocate normally? The answer: Its Utmost Holiness: Miri. In all seriousness though, Miri has been an invaluable tool in catching memory and concurrency bugs. When we just allocate normally, Miri throws an error when we start actually using the memory later on, saying that intrinsics::atomic_cxchg_acqrel_failrelaxed(dst, old, new) requires initialized data. Thus, we just zero the memory for now. Later, it might be worth doing some MaybeUninit magic, but honestly, I don't know if there'd be much, if any, performance gain.

Once again, we have more checks, and we'll just panic! if the allocation fails. handle_alloc_error is from the alloc crate:

#![allow(unused)]
fn main() {
let allocator = Global;

let allocation = allocator.allocate_zeroed(layout);
let ptr = match allocation {
    Ok(ptr) => ptr.as_ptr() as *mut AtomicU64,
    Err(_) => handle_alloc_error(layout),
};

}

The final part is to swap the new pointer into our array of buffers (the first level of the two-level array). We use the compare_exchange function, with a null pointer as the expected value and our new pointer from the allocation as the new value. If compare_exchange fails, that means the pointer is no longer null, and someone else compare_exchanged in a pointer. Therefore, the bucket is already allocated. In this case, we deallocate the freshly allocated memory. Notice how we assess the result of compare_exchange with Result::is_err(); we don't care about the value compare_exchange returns.

#![allow(unused)]
fn main() {
    if self.buffers[bucket] // <- this is an AtomicPtr<AtomicU64>
        .compare_exchange(
            ptr::null_mut::<AtomicU64>(), // old value
            ptr, // new value
            Ordering::AcqRel, // ordering on success
            Ordering::Relaxed, // ordering on fail
        )
        .is_err()
    {
        unsafe {
            // # Safety
            // We know that the pointer returned from the allocation is NonNull so
            // we can call unwrap() on NonNull::new(). We also know that the pointer
            // is pointing to the correct memory because we just got it from the
            // allocation. We know the layout is valid, as it is the same layout we
            // used to allocate.
            allocator.deallocate(NonNull::new(ptr as *mut u8).unwrap(), layout);
        }
    }
}

}

Success and Fail Orderings

Like all atomic operations, compare_exchange takes memory orderings. Most operations take one, but this bad boy takes two. Since compare_exchange reads and writes a memory location, we're using AcqRel. Since we always use AcqRel for the buckets, the load part (Acquire) of the compare_exchange will always see the most recent value, because the store part is Release. If we just used Acquire, the store part of the compare_exchange would be Relaxed, which doesn't guarantee that the modification to the AtomicPtr<AtomicU64> is published to other threads by any certain point. Under a Relaxed situation, another thread might load a null pointer in its compare_exchange, even though our thread swapped in a pointer to memory!

That's the success ordering. The fail ordering is Relaxed because we don't need to establish any synchronization if the operation fails. It failed; we're not doing any stores. When I first saw this, I had the question, "Why do we provide different success and fail orderings if the compare_exchange doesn't know if it will fail or not?" The answer, thanks to Alice on the Rust User Forums, is that the compiler picks an ordering that will always satisfy the stronger ordering. Thus, compare_exchange(success: AcqRel, fail: Release) executes as compare_exchange(success: AcqRel, fail: Acquire) to ensure that the initial load is Acquire for both cases.

There's a little more to it; if you're still curious, see this thread on the Rust User Forums.

The last function in the "memory" section is reserve(), which I've "reserved" for last.


Complete source for allocate_bucket()

#![allow(unused)]
fn main() {
fn allocate_bucket(&self, bucket: usize) {
    // The shift-left is equivalent to raising 2 to the power of bucket
    let size = FIRST_BUCKET_SIZE * (1 << bucket);
    let layout = match Layout::array::<AtomicU64>(size) {
        Ok(layout) => layout,
        Err(_) => capacity_overflow(),
    };

    // Make sure allocation is ok
    match alloc_guard(layout.size()) {
        Ok(_) => {}
        Err(_) => capacity_overflow(),
    }

    let allocator = Global;
    // allocate_zeroed because miri complains about accessing uninitialized memory
    // TODO: Maybe use MaybeUninit?
    let allocation = allocator.allocate_zeroed(layout);
    let ptr = match allocation {
        Ok(ptr) => ptr.as_ptr() as *mut AtomicU64,
        Err(_) => handle_alloc_error(layout),
    };

    // If the CAS fails, then the bucket has already been initialized with memory
    // and we free the memory we just allocated
    if self.buffers[bucket]
        .compare_exchange(
            ptr::null_mut::<AtomicU64>(),
            ptr,
            Ordering::AcqRel,
            Ordering::Relaxed,
        )
        .is_err()
    {
        unsafe {
            // # Safety
            // We know that the pointer returned from the allocation is NonNull so
            // we can call unwrap() on NonNull::new(). We also know that the pointer
            // is pointing to the correct memory because we just got it from the
            // allocation. We know the layout is valid, as it is the same layout we
            // used to allocate.
            allocator.deallocate(NonNull::new(ptr as *mut u8).unwrap(), layout);
        }
    }
}

}

reserve()

The goal of reserve(n) is simple: allocate enough memory to perform n pushes without allocating more memory.

This is a useful function, because, as we've seen, allocate_bucket requires some heavy atomics with compare_exchange. If we can do our allocations in a calmer scenario with less contention, we'll experience some performance gains.

We start by calculating the number of allocations we'll need to perform to reserve enough space. The calculation is a little funky, and there's an edge case where it can't distinguish between 0 and sizes between 1 and FIRST_BUCKET_SIZE. That's why we need to explicitly allocate the first bucket. We'll see the implementation of size() later, but it does use some atomic synchronization, so we just cache the result so we can keep using it later without calling size again.

#![allow(unused)]
fn main() {
pub fn reserve(&self, size: usize) {
    // Cache the size to avoid doing another atomic op by calling `size()` again
    let current_size = self.size();
    if current_size == 0 {
        self.allocate_bucket(0);
    }

}

Now, we calculate the number of allocations we've made.

highest_bit returns the position (0-indexed) of the highest set bit in a number. A bit is set if it's equal to one. The highest set bit of 7 (0b111), for example, is 2. Since the buckets are increasing by a factor of two each time, the highest set bit of the indices in each bucket is one greater than the highest set bit of the indices in the previous bucket. Therefore, by using the highest bit of a number in conjunction with FIRST_BUCKET_SIZE, we can figure out how many allocations are needed for a certain capacity. I know I'm waving my hands a little; I haven't taken the time to rigorously understand the arithmetic, as it's not that interesting to me, and in practice it works.

#![allow(unused)]
fn main() {
let mut num_current_allocs =
    highest_bit(current_size.saturating_add(FIRST_BUCKET_SIZE) - 1)
        .saturating_sub(highest_bit(FIRST_BUCKET_SIZE));

}

Then we calculate the number of allocations we need to reserve the space, and for each allocation missing, we allocate.

#![allow(unused)]
fn main() {
    // Compare with the number of allocations needed for size `new`
    while num_current_allocs
        < highest_bit(size.saturating_add(FIRST_BUCKET_SIZE) - 1)
            .saturating_sub(highest_bit(FIRST_BUCKET_SIZE))
    {
        num_current_allocs += 1;
        self.allocate_bucket(num_current_allocs as usize);
    }
}

}
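As a quick sanity check on the formula, here's the target allocation count worked out for a couple of sizes, assuming FIRST_BUCKET_SIZE = 8 (so highest_bit(8) = 3) and an empty vector:

# reserve(8)
highest_bit(8 + 8 - 1) - 3 = highest_bit(15) - 3 = 3 - 3 = 0
# -> no extra buckets needed; bucket 0 (capacity 8) was allocated explicitly

# reserve(9)
highest_bit(9 + 8 - 1) - 3 = highest_bit(16) - 3 = 4 - 3 = 1
# -> also allocate bucket 1, for a capacity of 8 + 16 = 24 >= 9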

And that's it for memory. We can now do everything we need to do to access and top up the vector's memory. Now it's time for the really hard part: actually implementing the vector's functions.


Complete source for reserve()

#![allow(unused)]
fn main() {
pub fn reserve(&self, size: usize) {
    // Cache the size to avoid doing another atomic op by calling `size()` again
    let current_size = self.size();
    if current_size == 0 {
        self.allocate_bucket(0);
    }

    // Number of allocations needed for current size
    let mut num_current_allocs =
        highest_bit(current_size.saturating_add(FIRST_BUCKET_SIZE) - 1)
            .saturating_sub(highest_bit(FIRST_BUCKET_SIZE));

    // Compare with the number of allocations needed for size `new`
    while num_current_allocs
        < highest_bit(size.saturating_add(FIRST_BUCKET_SIZE) - 1)
            .saturating_sub(highest_bit(FIRST_BUCKET_SIZE))
    {
        num_current_allocs += 1;
        self.allocate_bucket(num_current_allocs as usize);
    }
}

}

Operations

Now that we have a solid memory backbone, we're going to implement the public API of the vector: new, push, pop, and size, as well as complete_write and some helper functions.

Since it's been a little bit, here's what the vector looks like in code form again:

#![allow(unused)]
fn main() {
pub struct SecVec<'a, T: Sized + Copy> {
    buffers: CachePadded<Box<[AtomicPtr<AtomicU64>; 60]>>,
    descriptor: CachePadded<AtomicPtr<Descriptor<'a, T>>>,
    _boo: PhantomData<T>, // Data is stored as transmuted T's
}

struct Descriptor<'a, T: Sized> {
    pending: AtomicPtr<Option<WriteDescriptor<'a, T>>>,
    size: usize,
}

struct WriteDescriptor<'a, T: Sized> {
    new: u64,
    old: u64,
    location: &'a AtomicU64,
    _boo: PhantomData<T>, // New and old are transmuted T's
}

}

new()

We've got to have some way of making a vector (or at least for an outside user to make one).

What are the ingredients we need to make the vector? Buffers, Descriptor, and WriteDescriptor. The WriteDescriptor is going to be None, as we don't have any pending writes yet.

Here's the code:

#![allow(unused)]
fn main() {
// We actually do want this to be copied
#[allow(clippy::declare_interior_mutable_const)]
const ATOMIC_NULLPTR: AtomicPtr<AtomicU64>
    = AtomicPtr::new(ptr::null_mut::<AtomicU64>());

pub fn new() -> Self {
    // Make an array of 60 AtomicPtr<Atomicu64> set to the null pointer
    let buffers = Box::new([ATOMIC_NULLPTR; 60]);

    // Make a new WriteDescriptor
    let pending = WriteDescriptor::<T>::new_none_as_ptr();

    // Make a new descriptor
    let descriptor = Descriptor::<T>::new_as_ptr(pending, 0, 0);

    // Return self
    Self {
        descriptor: CachePadded::new(AtomicPtr::new(descriptor)),
        buffers: CachePadded::new(buffers),
        _boo: PhantomData,
    }
}

}

Firstly, we declare the constant ATOMIC_NULLPTR. This is just an AtomicPtr containing a null pointer. The reason the const declaration is necessary is that when we make an array of something [SOMETHING; 60], that SOMETHING needs to be Copy or evaluable at compile time. Since AtomicPtr<AtomicU64> is not Copy, we resort to creating ATOMIC_NULLPTR at compile time. Once we have our array of null pointers, we put it on the heap to reduce the size of the vector. If we were carrying it all directly, the vector would be over 480 bytes large! With a Box, we only store 8 bytes pointing to the first level in our two-level array.

Then, we make a WriteDescriptor using new_none_as_ptr(), which returns an Option<WriteDescriptor<T>>. We pass this into the constructor (new_as_ptr) for Descriptor<T>, and then assemble the Descriptor and the Boxed array together to make the vector.

The constructors for the descriptor types end in as_ptr because they actually return a raw pointer pointing to a heap allocation containing the value. We achieve this by making a Box and then extracting the inner raw pointer.

let b = Box::new(5);
let b_ptr = Box::into_raw(b); // <- That's a raw pointer to heap memory!

My first UB mistake

I introduced the heap and the stack earlier in the keywords section, but I didn't explain why the distinction is important.

When a function is called, a stack frame is pushed onto the stack. This stack frame contains all the function's local variables. When the function returns, the stack frame is popped off the stack, and all local variables are destroyed. This invalidates all references to local variables that were just popped off.

The heap is different. You allocate on the heap, and you deallocate on the heap. Nothing happens automatically. This is the legendary malloc/free combo from C.

Understanding the distinction between the stack and the heap is important because we are using raw pointers, which don't have the guarantees of references.

Here is my first mistake, summarized a little:

use core::sync::atomic::{Ordering, AtomicPtr};

fn main() {
    let ptr = new_descriptor();
    // Use the pointer to the Descriptor
    let d = unsafe { &*ptr.load(Ordering::Acquire) };
}

// Return a pointer to a Descriptor
fn new_descriptor() -> AtomicPtr<Descriptor> {
    let d = Descriptor { size: 0, write: None };
    AtomicPtr::new(&d as *const _ as *mut _)
}

struct Descriptor {
    size: usize,
    write: Option<bool>
}
$ cargo miri run
error: Undefined Behavior: pointer to alloc1184 was dereferenced after this allocation got freed
  --> src\main.rs:46:22
   |
46 |     let d = unsafe { &*ptr.load(Ordering::Acquire) };
   |                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ pointer to alloc1184 was dereferenced after this allocation got freed
   |
   = help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior

Miri says pointer to alloc1184 was dereferenced after this allocation got freed. Translation: use-after-free; classic UB.

So why is the Descriptor's allocation being freed? Because it's allocated on the stack. When new_descriptor returns, the local variable d: Descriptor gets destroyed, and the pointer we made from the reference is invalidated. Thus, we use-after-free when we dereference a freed allocation.

This is the danger of using raw pointers. If we had just passed around a reference to the Descriptor, the borrow checker would have caught the problem and refused to compile (or promoted the value to a 'static lifetime where possible). With raw pointers, Rust doesn't manage lifetimes, so we have to ensure that our pointers are valid ourselves.

This is why only dereferencing a raw pointer is unsafe. It's perfectly safe to make one, but we have no guarantees about what it's pointing to, and that's why the dereference is unsafe.

Thank you Miri!

complete_write()

I think complete_write is the first function I wrote for the vector's core operations. We execute the WriteDescriptor passed in and set the one stored in the vector to None.

Here's what the function signature looks like:

#![allow(unused)]
fn main() {
fn complete_write(&self, pending: &Option<WriteDescriptor<T>>) {

}

The first thing we do is execute the WriteDescriptor, if there is one. We can use if let syntax to express this concisely. The result of the compare_exchange doesn't matter: if it succeeds, we performed the write; if it fails, someone else performed it. Also, notice how we are compare_exchangeing an AtomicU64. The data is transmuted into those bytes, allowing us to make atomic modifications to the contents of the vector. Because the data needs to be transmuted into an atomic type, the vector can't support types larger than 8 bytes. Finally, because we are using AcqRel as the success ordering, any subsequent Acquire loads will see that there is no pending write.

#![allow(unused)]
fn main() {
    #[allow(unused_must_use)]
    if let Some(writedesc) = pending {
        AtomicU64::compare_exchange(
            writedesc.location,
            writedesc.old,
            writedesc.new,
            Ordering::AcqRel,
            Ordering::Relaxed,
        );

}

Now that we've done the store to the contents, we change the WriteDescriptor status of the vector to indicate that there is no pending write (we just took care of it!).

#![allow(unused)]
fn main() {
        let new_writedesc = WriteDescriptor::<T>::new_none_as_ptr();
        // # Safety
        // The pointer is valid to dereference because it started off valid
        // and only pointers made from WriteDescriptor::new_*_as_ptr()
        // (which are valid because of Box) are CAS'd in
        unsafe { &*self.descriptor.load(Ordering::Acquire) } // Loading with `Acquire`
            .pending
            .store(new_writedesc, Ordering::Release); // Storing with `Release`

        // Memory leak alert!
        // What happens to the old pointer stored in the
        // `AtomicPtr<Option<WriteDescriptor<T>>>`?
        // We never reclaim it.
    }
}

}

This is standard. We make a new WriteDescriptor and store it with Release so that all subsequent Acquire loads will see it.

Leaking memory

Leaking memory is when you use memory (allocating) but never free it. This is the first chunk of code that leaks. Our Descriptor has a pointer to an Option<WriteDescriptor>. When we store a different pointer, we lose the old pointer forever. Since we never do anything to deallocate the memory pointed to by the old pointer, like Box::from_raw, that memory will stay allocated until the end of the program.
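Here's the leak in miniature (a standalone sketch, not the vector's actual code): after the swap, nothing ever hands the old Box-backed allocation back to the allocator.

use std::sync::atomic::{AtomicPtr, Ordering};

fn main() {
    let slot = AtomicPtr::new(Box::into_raw(Box::new(1u64)));
    // Swap in a new allocation; `old` is now the only handle to the first one
    let old = slot.swap(Box::into_raw(Box::new(2u64)), Ordering::AcqRel);
    // If we forget `old` here, the first allocation leaks for the rest of the
    // program. Reclaiming it would look like the line below, but it is only
    // sound if no other thread can still be reading it:
    // unsafe { drop(Box::from_raw(old)) };
    let _ = old;

    // Clean up the allocation currently in `slot` (single-threaded here, so this is fine)
    unsafe { drop(Box::from_raw(slot.load(Ordering::Acquire))) };
}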

We can't just directly free the memory right away though, as there could be another thread reading it. Later on, I'm going to show you how we can use a technique called hazard pointers to safely reclaim (deallocate) objects.

For now, the vector will stay leaky, and we'll move on to push.


Complete source for complete_write()

#![allow(unused)]
fn main() {
fn complete_write(&self, pending: &Option<WriteDescriptor<T>>) {
    #[allow(unused_must_use)]
    if let Some(writedesc) = pending {
        AtomicU64::compare_exchange(
            writedesc.location,
            writedesc.old,
            writedesc.new,
            Ordering::AcqRel,
            Ordering::Relaxed,
        );
        let new_writedesc = WriteDescriptor::<T>::new_none_as_ptr();
        // # Safety
        // The pointer is valid to dereference because it started off valid
        // and only pointers made from WriteDescriptor::new_*_as_ptr()
        // (which are valid because of Box) are CAS'd in
        unsafe { &*self.descriptor.load(Ordering::Acquire) }
            .pending
            .store(new_writedesc, Ordering::Release);
    }
}

}

push()

You made it! We're going to implement half of the main functionality of the vector. The code is going to get a little complex, but I'm confident in you. I eventually understood what was going on, so you can too.

We're going to track the steps described in The Algorithm closely. We don't want to mess up the concurrent semantics of the vector during implementation. The first thing we do is load in the Descriptor and WriteDescriptor. This is actually harder than it might seem, as we're working with unsafe things like raw pointers. We need to be very careful. But wait, there's one more thing we should cover, and that's exponential backoff!

Exponential Backoff

Exponential backoff is another one of those techniques that's unique to concurrent programming. compare_exchange algorithms like the one we're implementing can produce a lot of contention over a couple of specific memory locations. For example, many threads may be trying to compare_exchange the AtomicPtr<Descriptor> stored in the vector. That spot in memory is constantly bombarded with heavy atomic operations. One way we can alleviate this is by waiting a little bit after failing to compare_exchange. The first time we fail, we back off for 1 tick. If we fail again, we back off for 2 ticks, then 4, 8 . . . this is why the backoff is exponential. By backing off, we give another thread some room to successfully perform their compare_exchange. In some microbenchmarks I did, introducing exponential backoff greatly sped up the vector. It's cool that going slower at a micro level allows us to go faster at a macro level. crossbeam_utils has a useful little struct called Backoff that we're going to use.
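crossbeam's Backoff is more sophisticated than this (it eventually suggests yielding to the OS scheduler), but the core idea is a counter that roughly doubles the spin length on every call. A minimal sketch of that idea, not crossbeam's actual code:

use std::cell::Cell;

// Each call to `spin` busy-waits about twice as long as the previous one.
// Interior mutability (Cell) lets `spin` take &self, like crossbeam's Backoff.
struct Backoff {
    step: Cell<u32>,
}

impl Backoff {
    fn new() -> Self {
        Backoff { step: Cell::new(0) }
    }

    fn spin(&self) {
        // Busy-wait for 2^step iterations
        for _ in 0..(1u32 << self.step.get()) {
            std::hint::spin_loop();
        }
        // Double the wait next time, up to a cap
        if self.step.get() < 10 {
            self.step.set(self.step.get() + 1);
        }
    }
}

fn main() {
    let backoff = Backoff::new();
    for _ in 0..5 {
        backoff.spin(); // waits roughly 1, 2, 4, 8, 16 "ticks"
    }
}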

#![allow(unused)]
fn main() {
pub fn push(&self, elem: T) {
    let backoff = Backoff::new(); // Backoff causes significant speedup
    loop {
        // # Safety
        // It is safe to dereference the raw pointers because they started off valid
        // and can only be CAS'd with pointers from `Box::into_raw`
        let current_desc = unsafe { &*self.descriptor.load(Ordering::Acquire) };

        // Complete a pending write op if there is any
        let pending = unsafe { &*current_desc.pending.load(Ordering::Acquire) };

}

There is already a lot going on here, in just these 10ish lines of code. Firstly, we've instantiated a Backoff. At the bottom of the loop, if we failed to compare_exchange in our new Descriptor, we'll call Backoff::spin() to wait a little bit, then we'll come back up to the top of the loop and try again.

This code also contains a very unsafe operation: dereferencing a raw pointer. The more I read about the dangers of raw pointers, the more scared I got. Paraphrasing from The Book, raw pointers aren't guaranteed to point to valid memory, aren't guaranteed to be non-null, don't implement cleanup (like Box), and ignore all the aliasing rules (&/&mut semantics).

After watching Demystifying unsafe code I felt better. unsafe code isn't intrinsically bad, it's just code that comes with an extra contract that we must uphold and document.

In the case of these first raw pointer dereferences, we know the dereference is safe because the pointers to the Descriptor and WriteDescriptor come from Box::into_raw, which returns a non-null and aligned pointer. unsafe is scary, but not necessarily bad. Obviously, we should try to limit its uses as much as possible though, as we can slip up and violate contracts.

Mitigating unsafe code: there are ways we can construct API's that need unsafe code to work without exposing users to danger. For example, we could make a type AtomicBox<T> that's mostly a wrapper around AtomicPtr<T>. It might look a little something like this:

#![allow(unused)]
fn main() {
#[repr(transparent)]
struct AtomicBox<T> {
    ptr: AtomicPtr<T>
}

impl<T> AtomicBox<T> {
    // We can only make a `Self` from a `Box`'s pointer!
    // (`box` is a keyword, so the parameter is named `boxed`)
    pub fn new(boxed: Box<T>) -> Self {
        Self {
            ptr: AtomicPtr::new(Box::into_raw(boxed)),
        }
    }

    // Caller knows they are receiving a pointer from `Box`
    pub fn load(&self, ordering: Ordering) -> *mut T {
        self.ptr.load(ordering)
    }

    // -- snip --
}

}

There's nothing super crazy going on here, it's just that we've configured the API so that we know the pointer inside the AtomicBox<T> is valid because it could only have come from Box. Now, instead of manually ensuring the invariant that we use Box::into_raw pointers, the compiler/type system does so for us.
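Continuing that sketch, usage might look like the fragment below (hypothetical, building on the AtomicBox above): because the only constructor takes a Box, we know the loaded pointer is sound to dereference.

let ab = AtomicBox::new(Box::new(42u64));
let ptr = ab.load(Ordering::Acquire);
// Sound: `ptr` can only have come from `Box::into_raw`
assert_eq!(unsafe { *ptr }, 42);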

After loading in the WriteDescriptor, we execute it if need be.

#![allow(unused)]
fn main() {
    self.complete_write(pending);

}

Since we're pushing onto the vector, we might need more memory:

#![allow(unused)]
fn main() {
    // Calculate which bucket this element is going into
    let bucket = (highest_bit(current_desc.size + FIRST_BUCKET_SIZE)
        - highest_bit(FIRST_BUCKET_SIZE)) as usize;

    // If the bucket is null, allocate the memory
    if self.buffers[bucket].load(Ordering::Acquire).is_null() {
        self.allocate_bucket(bucket)
    }

}
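To make the bucket math concrete, here's a small sketch. It assumes FIRST_BUCKET_SIZE is 8 and that highest_bit returns the index of the most significant set bit (both assumptions on my part; check the real constants and helper in the source):

const FIRST_BUCKET_SIZE: usize = 8;

// Index of the most significant set bit, e.g. highest_bit(8) == 3
fn highest_bit(n: usize) -> u32 {
    usize::BITS - 1 - n.leading_zeros()
}

fn bucket_for(index: usize) -> usize {
    (highest_bit(index + FIRST_BUCKET_SIZE) - highest_bit(FIRST_BUCKET_SIZE)) as usize
}

fn main() {
    assert_eq!(bucket_for(0), 0);  // bucket 0 holds indices 0..8   (8 slots)
    assert_eq!(bucket_for(8), 1);  // bucket 1 holds indices 8..24  (16 slots)
    assert_eq!(bucket_for(24), 2); // bucket 2 holds indices 24..56 (32 slots)
}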

Let's make our new WriteDescriptor now:

#![allow(unused)]
fn main() {
    // # Safety
    // It is safe to call `self.get()` because if the vector has reached
    // `current_desc.size`, so there is a bucket allocated for element `size`.
    // Therefore, the pointer is also valid to dereference because it points
    // into properly allocated memory.
    let last_elem = unsafe { &*self.get(current_desc.size) };
    let write_desc = WriteDescriptor::<T>::new_some_as_ptr(
        unsafe { mem::transmute_copy::<T, u64>(&elem) },
        // Load from the AtomicU64, which really contains the bytes for T
        last_elem.load(Ordering::Acquire),
        last_elem,
    );

}

For now we are assuming that the vector is only storing values 8 bytes big, therefore it is safe to transmute_copy to an AtomicU64. I plan on writing a macro that produces different implementations of the vector with different atomic types when storing types of different sizes. For example, SecVec<(i8, i8)> would store the data in AtomicU16. This would save on space. I don't think the vector would work for zero-sized types because of how we transmute. It would also be very inefficient because of all the unnecessary allocations!
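As a hypothetical example of what that size guard could look like (my own sketch, not something in the vector's source), the constructor could assert up front that T fits in the 8-byte slots:

use core::mem;

// Hypothetical guard: only allow non-zero-sized types that fit in the
// AtomicU64 slots the vector stores its elements in.
fn assert_t_fits<T>() {
    assert!(
        mem::size_of::<T>() > 0 && mem::size_of::<T>() <= mem::size_of::<u64>(),
        "SecVec only supports types between 1 and 8 bytes"
    );
}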

Note that last_elem's type is &AtomicU64; it's the location of the write. When we load from last_elem, we are getting the old element. We now have the three pieces of data necessary for compare_exchange: a memory location (the reference), an old element, and a new element (the T passed to this function).

Let's package everything up in a Descriptor.

#![allow(unused)]
fn main() {
    let next_desc = Descriptor::<T>::new_as_ptr(write_desc, current_desc.size + 1);

}

Since we are adding one more element onto the vector, the new Descriptor's size is one more than the old one's.

Here comes the crucial compare_exchange, in the AcqRel/Relaxed flavor:

#![allow(unused)]
fn main() {
    if AtomicPtr::compare_exchange_weak(
        &self.descriptor,
        current_desc as *const _ as *mut _,
        next_desc,
        Ordering::AcqRel,
        Ordering::Relaxed,
    )
    .is_ok()
    {
        // We know the current write_desc is the one we just sent in
        // with the compare_exchange so avoid loading it atomically
        self.complete_write(unsafe { &*write_desc });
        break;
    }

}

If the compare_exchange succeeds, we call complete_write on the descriptor we just made to finalize the changes, then we break out of the loop.

If compare_exchange fails, we'll simply start over again.

Either way, we have a memory leak. If the compare_exchange succeeded, we never deal with the old Descriptor's pointer. We can never safely deallocate it because we don't know if anyone is still reading it. It would be terribly rude to pull the rug out from under them! Also, the deallocation would probably cause a use-after-free, which would cause the OS to terminate the program, which would rip a hole in the space-time continuum, which would . . . wait, what? Uhh, moving on . . .

If the compare_exchange failed, the new Descriptor and WriteDescriptor leak. Once we reach the end of the loop, all local variables in that scope are lost. So, we never get back the pointers to our new describe-y objects, and their memory is lost to the void, never to be seen again (unless we do some wildly dumb stuff and read a random address or something). In any case, within the code for the vector, I try not to tempt the segfault gods. My other projects, maybe a little bit.

At this point, we've failed the compare_exchange. Let's Backoff::spin() and then retry:

#![allow(unused)]
fn main() {
        backoff.spin();
    } // Closing brace for the loop
} // Closing brace for the function

}

Once we finish looping and finally succeed with the compare_exchange, we're done! That's a push. The pseudocode is so simple, and the code is so . . . not simple. Props to you for getting this far, concurrent programming is not for the weak of spirit.

I'll cover the minor differences in pop, and then we'll cap off the leaky code with size.


Complete source for push

#![allow(unused)]
fn main() {
pub fn push(&self, elem: T) {
    let backoff = Backoff::new(); // Backoff causes significant speedup
    loop {
        // # Safety
        // It is safe to dereference the raw pointers because they started off valid
        // and can only be CAS'd with pointers from `Box::into_raw`
        let current_desc = unsafe { &*self.descriptor.load(Ordering::Acquire) };
        let pending = unsafe { &*current_desc.pending.load(Ordering::Acquire) };

        // Complete a pending write op if there is any
        self.complete_write(pending);

        // Allocate memory if need be
        let bucket = (highest_bit(current_desc.size + FIRST_BUCKET_SIZE)
            - highest_bit(FIRST_BUCKET_SIZE)) as usize;
        if self.buffers[bucket].load(Ordering::Acquire).is_null() {
            self.allocate_bucket(bucket)
        }
        // # Safety
        // It is safe to call `self.get()` because if the vector has reached
        // `current_desc.size`, so there is a bucket allocated for element `size`.
        // Therefore, the pointer is also valid to dereference because it points
        // into properly allocated memory.
        let last_elem = unsafe { &*self.get(current_desc.size) };

        let write_desc = WriteDescriptor::<T>::new_some_as_ptr(
            unsafe { mem::transmute_copy::<T, u64>(&elem) },
            last_elem.load(Ordering::Acquire),
            last_elem,
        );

        let next_desc = Descriptor::<T>::new_as_ptr(write_desc, current_desc.size + 1);

        // Handle result of compare_exchange
        if AtomicPtr::compare_exchange_weak(
            &self.descriptor,
            current_desc as *const _ as *mut _,
            next_desc,
            Ordering::AcqRel,
            Ordering::Relaxed,
        )
        .is_ok()
        {
            // We know the current write_desc is the one we just sent in
            // with the compare_exchange so avoid loading it atomically
            self.complete_write(unsafe { &*write_desc });
            break;
        }

        backoff.spin();
    }
}

}

pop

There are three main differences between pop and push. Firstly, pop never needs to allocate. Secondly, pop swaps in a slightly different descriptor, with None as the WriteDescriptor and current_desc.size - 1 as the new size.

#![allow(unused)]
fn main() {
    let new_pending = WriteDescriptor::<T>::new_none_as_ptr();
    let next_desc = Descriptor::<T>::new_as_ptr(new_pending, current_desc.size - 1);

}

The final difference is that after we succeed with the compare_exchange, we read the last element and return it.

#![allow(unused)]
fn main() {
    if AtomicPtr::compare_exchange_weak(
        &self.descriptor,
        current_desc as *const _ as *mut _,
        next_desc,
        Ordering::AcqRel,
        Ordering::Relaxed,
    )
    .is_ok()
    {
        // # Safety
        // This is ok because only 64-bit values can be stored in the vector
        // We also know that elem is a valid T because it was transmuted into a usize
        // from a valid T, therefore we are only transmuting it back
        return Some(unsafe { mem::transmute_copy::<u64, T>(&elem) });
    }

}

The rest of the function (loading the Descriptors, the compare_exchange, the Backoff) is identical.

Like push, pop also leaks memory profusely. Luckily, this means that when we implement memory reclamation, it'll be the same solution for push and pop.


Complete source for pop()

#![allow(unused)]
fn main() {
pub fn pop(&self) -> Option<T> {
    let backoff = Backoff::new(); // Backoff causes significant speedup
    loop {
        let current_desc = unsafe { &*self.descriptor.load(Ordering::Acquire) };
        let pending = unsafe { &*current_desc.pending.load(Ordering::Acquire) };

        self.complete_write(pending);
        if current_desc.size == 0 {
            return None;
        }

        // # Safety
        // Do not need to worry about underflow for the sub because we would have
        // already returned
        let elem = unsafe { &*self.get(current_desc.size - 1) }
            .load(Ordering::Acquire);

        let write_desc = WriteDescriptor::<T>::new_none_as_ptr();
        let next_desc = Descriptor::<T>::new_as_ptr(write_desc, current_desc.size - 1);

        if AtomicPtr::compare_exchange_weak(
            &self.descriptor,
            current_desc as *const _ as *mut _,
            next_desc,
            Ordering::AcqRel,
            Ordering::Relaxed,
        )
        .is_ok()
        {
            // # Safety
            // This is ok because only 64-bit values can be stored in the vector
            // We also know that elem is a valid T because it was transmuted into a
            // usize from a valid T, therefore we are only transmuting it back
            return Some(unsafe { mem::transmute_copy::<u64, T>(&elem) });
        }
        backoff.spin();
    }
}

}

size()

The procedure for size is simple. Load the vector's size from the Descriptor, then subtract one if there is a pending write.

It seems like so long ago that we went over The Algorithm, but recall that when we perform a push, we swap in a new Descriptor and then call complete_write. This means that the increase in size is reflected in the vector's state before the write actually happens. If there is still a WriteDescriptor contained in the Descriptor, the size stored in the Descriptor is one greater than the actual size of the vector, because complete_write replaces the WriteDescriptor with None once it executes the write. Concretely: if the vector holds 4 elements and a push has swapped in a Descriptor with size 5 but its write hasn't completed yet, size() loads 5, sees the pending WriteDescriptor, and returns 4.

Here is the code:

#![allow(unused)]
fn main() {
pub fn size(&self) -> usize {
    // # Safety
    // The pointers are safe to dereference because they all came from `Box::into_raw`
    // and point to valid objects

    let desc = unsafe { &*self.descriptor.load(Ordering::Acquire) };
    let size = desc.size;

    // If there is a pending descriptor, we subtract one from the size because
    // `push` increments the size, swaps the new descriptor in, and _then_ writes
    // the value. Therefore the size is one greater because the write hasn't happened
    // yet
    match unsafe { &*desc.pending.load(Ordering::Acquire) } {
        Some(_) => size - 1,
        None => size,
    }
}
}

There have been many momentous moments throughout the book: understanding the algorithm, finishing push, and finally, completing the vector's public API. When I was writing the code, this moment felt huge, and I jumped up and down after pushing 10 elements onto the vector, popping 10 times, and running assert_eq!(sv.size(), 0); without crashing.

Let's run some tests (more fun than you might think)!

Tests

Just for fun, I wrote some tests, and we get to satisfyingly see them pass.

#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn size_starts_at_0() {
        let sv = SecVec::<usize>::new();
        assert_eq!(0, sv.size());
    }

    #[test]
    fn pop_empty_returns_none() {
        let sv = SecVec::<usize>::new();
        assert_eq!(sv.pop(), None);
    }

    #[test]
    fn ten_push_ten_pop() {
        let sv = SecVec::<isize>::new();
        for i in 0..10 {
            sv.push(i);
        }
        for i in (0..10).rev() {
            assert_eq!(sv.pop(), Some(i));
        }
    }

    #[test]
    fn does_not_allocate_buffers_on_new() {
        let sv = SecVec::<isize>::new();
        for buffer in &**sv.buffers {
            assert!(buffer.load(Ordering::Relaxed).is_null())
        }
    }
}

}

Cargo is super nice and we can use it to test. Running cargo test produces the following output:

~/C/r/unlocked (main) > cargo test -- leaky::tests
    Finished test [unoptimized + debuginfo] target(s) in 0.01s
     Running unittests (target/debug/deps/unlocked-e6f64e7ba9c7e004)

running 4 tests
test leaky::tests::size_starts_at_0 ... ok
test leaky::tests::pop_empty_returns_none ... ok
test leaky::tests::does_not_allocate_buffers_on_new ... ok
test leaky::tests::ten_push_ten_pop ... ok

Although you can't see it, the green on those "ok"s warms my heart.

We know the vector is leaky, but otherwise it shouldn't be doing any other funky things or UB. Let's see if Miri finds anything with MIRIFLAGS=-Zmiri-ignore-leaks cargo miri test -- leaky::tests:

~/C/r/unlocked (main) > MIRIFLAGS=-Zmiri-ignore-leaks cargo miri test -- leaky::tests
    Finished test [unoptimized + debuginfo] target(s) in 0.01s
     Running unittests (target/miri/x86_64-apple-darwin/debug/deps/unlocked-4269)

running 4 tests
test leaky::tests::does_not_allocate_buffers_on_new ... ok
test leaky::tests::pop_empty_returns_none ... ok
test leaky::tests::size_starts_at_0 ... ok
test leaky::tests::ten_push_ten_pop ... ok

Nothing? Awesome! Just because Miri doesn't find anything doesn't mean nothing fishy is happening, but Miri combined with the rigorous analysis we did of the code is a very good sign.

Memory Reclamation

Allocation isn't the hard part when it comes to concurrency, deallocation/reclamation is. When multiple threads/entities are concurrently accessing an object, it is never safe to deallocate it without verifying that no one has a reference/pointer to it. What if they were to use that pointer after the deallocation?

This problem arises in the vector when reclaiming the Descriptors and WriteDescriptors. Multiple threads can hold a reference to them at once, so we never know when it is safe to deallocate them.

To solve this problem, we'll use a technique called hazard pointers, via the haphazard crate.

What is meant by reclamation/deallocation? When we allocate memory, the allocator returns a pointer to an allocation on the heap. Internally, the allocator also notes down that the space is being used. When we deallocate, or reclaim, memory, we return the pointer back to the allocator. The allocator goes back to the books and notes that no one is using the memory anymore, freeing the memory. It can now hand that memory back out again if it's needed.

I'm not sure if this is the case, but I think the term "reclaim" might be used specifically in concurrent contexts. In single-threaded Rust, the compiler knows (through ownership and borrow checking) when a value is no longer used and can free it automatically. In multithreaded, lock-free code, we have to do everything manually.

Hazard Pointers

The idea of hazard pointers is to protect memory addresses from reclamation. At any moment in time, we have a list of addresses that are not safe to reclaim. We can store the addresses in a data structure like a concurrent linked list; I think this is what haphazard uses.

Whenever we want to access a pointer, we access it through a hazard pointer. Accessing through a hazard pointer adds the address we are accessing to the list of addresses to protect. When the hazard pointer gets dropped, or we explicitly disassociate the hazard pointer from the underlying raw pointer, the protection ends.

So why is the Protected list important? When we are done with an object, we retire it, marking it for eventual reclamation. By retiring the pointer, we agree to not use it anymore. Any thread that is already accessing it can continue to do so, but there can be no new readers/writers.

Every once in a while, the Domain, which holds the hazard pointers, will go through the Retired list. For each pointer on the list, it checks whether that pointer appears in the Protected list. If the pointer isn't protected, the Domain reclaims the object it points to (deallocating the memory). If it is protected, the Domain leaves it alone, because someone is still using it. In this way, pointers that are in use are never deallocated, while those out of use eventually are.
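I'm not sure exactly how haphazard implements this internally, but conceptually the scan might look like this sketch (hypothetical types and names, not the crate's API):

use std::collections::HashSet;

// Hypothetical bookkeeping: addresses guarded by hazard pointers, and retired
// addresses paired with a routine that knows how to deallocate them.
struct Domain {
    protected: HashSet<usize>,
    retired: Vec<(usize, Box<dyn FnOnce()>)>,
}

impl Domain {
    fn eager_reclaim(&mut self) {
        let protected = &self.protected;
        // Split the retired list: keep what is still protected, reclaim the rest
        let (keep, reclaim): (Vec<_>, Vec<_>) = self
            .retired
            .drain(..)
            .partition(|(addr, _)| protected.contains(addr));
        self.retired = keep;
        for (_, dealloc) in reclaim {
            dealloc(); // deallocate objects no hazard pointer is guarding
        }
    }
}

fn main() {
    let mut domain = Domain {
        protected: HashSet::new(),
        retired: Vec::new(),
    };
    // Retire a Box-backed allocation: record its address and how to free it
    let ptr = Box::into_raw(Box::new(5u64));
    domain
        .retired
        .push((ptr as usize, Box::new(move || unsafe { drop(Box::from_raw(ptr)) })));
    // Nothing protects `ptr`, so this scan deallocates it
    domain.eager_reclaim();
}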

An example

Hazard pointers are pretty complicated, so here's a visual example that I hope helps:

Protected: [1<0x22>]
Retired: []
              0x20   0x22   0x23   0x24
            +------+------+------+------+
Thread 1    |      |  <>  |      |      |
Thread 2    |      |      |      |      |
            +------+------+------+------+

Right now Thread 1 is accessing 0x22 via a hazard pointer, so the Protected list contains the pointer 0x22, annotated with a 1 to indicate that Thread 1 is protecting it. I'm not sure if an actual implementation would keep track of which thread is protecting a pointer. I think if another thread tries to protect an already protected pointer, nothing happens.

Ok, now, Thread 2 accesses 0x22 and protects the pointer.

Protected: [1<0x22>, 2<0x22>]
Retired: []
              0x20   0x22   0x23   0x24
            +------+------+------+------+
Thread 1    |      |  <>  |      |      |
Thread 2    |      |  <>  |      |      |
            +------+------+------+------+

Thread 1 finishes with its access and retires 0x22. Thread 1 is saying, "No one new will use this pointer, deallocate it when it's safe to do so!" 0x22 is added to the Retired list. The Domain can't reclaim the pointer yet because Thread 2 is still accessing it.

Protected: [2<0x22>]
Retired: [0x22]
              0x20   0x22   0x23   0x24
            +------+------+------+------+
Thread 1    |      |      |      |      |
Thread 2    |      |  <>  |      |      |
            +------+------+------+------+

Finally, Thread 2 finishes using the pointer, removing 0x22 from the Protected list.

Protected: []
Retired: [0x22]
              0x20   0x22   0x23   0x24
            +------+------+------+------+
Thread 1    |      |      |      |      |
Thread 2    |      |      |      |      |
            +------+------+------+------+

The Domain sees that 0x22 is retired and no one is protecting it, so it deallocates the allocation at 0x22. We have reclaimed memory, and 0x22 will not leak!

Code changes

To use the hazard pointers, we're going to need to make a small change in the vector's structure.

The hardest part was getting started.

Following the documentation on Domain, I wrote a bunch of type aliases using the type keyword:

#![allow(unused)]
fn main() {
// Setting up hazard pointers
// This makes sure they all use the same Domain, guaranteeing the protection is valid.
#[non_exhaustive]
struct Family;
type Domain = haphazard::Domain<Family>;
type HazardPointer<'domain> = haphazard::HazardPointer<'domain, Family>;
type HazAtomicPtr<T> = haphazard::AtomicPtr<T, Family>;
}

We only use Domains produced from struct Family. This prevents us from retiring a pointer in the Global domain while it is being guarded in a different domain. The Global domain can't see the other Domain's protected list, so it might reclaim the pointer prematurely.

Secondly, all the HazardPointers and HazAtomicPtrs we construct will be in the same family as our Domains. This ensures the same protection against overlapping with the Global domain.

The difference between HazAtomicPtr, which is an alias for haphazard::AtomicPtr, and std::sync::atomic::AtomicPtr is that HazAtomicPtr uses hazard pointers to guard loads. Additionally, all atomic operations on HazAtomicPtr have Acquire-Release semantics built in. Nifty!

To ensure that we always retire and protect in the same domain, we will also carry a Domain in the struct itself. Then, it's pretty easy to just always use &self.domain whenever we need a Domain. All we have to do is add one more struct field to SecVec:

#![allow(unused)]
fn main() {
pub struct SecVec<'a, T: Sized + Copy> {
    buffers: CachePadded<Box<[AtomicPtr<AtomicU64>; 60]>>,
    descriptor: CachePadded<HazAtomicPtr<Descriptor<'a, T>>>,
    domain: Domain, // Hi there :)
    _boo: PhantomData<T>,
}

struct Descriptor<'a, T: Sized> {
    pending: HazAtomicPtr<Option<WriteDescriptor<'a, T>>>,
    size: usize,
}

struct WriteDescriptor<'a, T: Sized> {
    new: u64,
    old: u64,
    location: &'a AtomicU64,
    _boo: PhantomData<T>,
}
}

And with that out of the way, we can now plug some leaks!

Fixing complete_write

complete_write was the easiest leak to seal. When we swap out the WriteDescriptor, we get back the old one. All we have to do is retire it, and its memory will eventually get reclaimed.

We execute the WriteDescriptor and make a new one (None) like normal:

#![allow(unused)]
fn main() {
fn complete_write(&self, pending: *mut Option<WriteDescriptor<T>>) {
    // If cas of actual value fails, someone else did the write
    // Result of compare_exchange doesn't matter
    if let Some(writedesc) = unsafe { &*pending } {
        let _ = AtomicU64::compare_exchange(
            writedesc.location,
            writedesc.old,
            writedesc.new,
            Ordering::AcqRel,
            Ordering::Relaxed,
        );

        let new_writedesc = WriteDescriptor::<T>::new_none_as_ptr();

}

Here comes the part where the hazard pointers kick in. We make a hazard pointer in &self.domain, then load in the Descriptor. Now, the current Descriptor cannot get reclaimed as long as our hazard pointer is alive. Then we swap in a new pointer to the None WriteDescriptor.

Here comes the big change: instead of doing nothing with the pointer that got swapped out, we retire it in &self.domain. According to the documentation for retire_in, there is a safety contract we need to follow (hence the function being marked as an unsafe fn).

Let's look at that:

  1. The pointed-to object will never again be returned by any [Haz]AtomicPtr::load.
  2. The pointed-to object has not already been retired.
  3. All calls to load that can have seen the pointed-to object were using hazard pointers from domain.

Alright, let's make sure we're fulfilling the contract.

Number one, we swapped out the pointer, so all new calls to HazAtomicPtr::load will use the new pointer. This is Acquire-Release semantics in action under the hood. Since the swap_ptr uses Release, all HazAtomicPtr::loads (which use Acquire) will see the new value. Thus, the old value is safe from being loaded again.

Number two, only one thread can get a pointer as the result of a swap. If I'm holding a marble in my hand and I give it away, no one else can take that marble from my hand. The person who took it can do whatever they want with it without worrying about others interfering. Since we got the pointer as the result of swap_ptr, we took the marble; no other thread could have gotten it too. Therefore, we know that no other thread has already retired, or will retire, the pointer, because they never had the marble in the first place.

Finally, number 3, all operations (creating hazard pointers, retiring pointers) happen through &self.domain!

After writing a 1000-word essay, we can confirm that retire_in is safe to call. This is the same argument we'll use for retiring the results of compare_exchange in push/pop.

#![allow(unused)]
fn main() {
        let mut hp = HazardPointer::new_in_domain(&self.domain);

        let old = unsafe {
            self.descriptor
                .load(&mut hp)
                .expect("ptr is null")
                .pending // This is a HazAtomicPtr<WriteDescriptor>
                // # Safety
                // new_writedesc conforms to the requirements of HazAtomicPtr::new()
                // because it comes from Box::into_raw and is a valid WriteDescriptor
                .swap_ptr(new_writedesc)
        };

        // # Safety
        // We are the only thread that will retire this pointer because
        // only one thread can get the result of the swap (this one).
        // Two threads couldn't have performed a swap and both got this pointer.
        unsafe { old.unwrap().retire_in(&self.domain) };

        // hp gets dropped, protection ends
    }
}

}

That's the only change to complete_write. push/pop aren't much worse.


Complete source for complete_write (not leaky)

#![allow(unused)]
fn main() {
fn complete_write(&self, pending: *mut Option<WriteDescriptor<T>>) {
    // If cas of actual value fails, someone else did the write
    // Result of compare_exchange doesn't matter
    if let Some(writedesc) = unsafe { &*pending } {
        let _ = AtomicU64::compare_exchange(
            writedesc.location,
            writedesc.old,
            writedesc.new,
            Ordering::AcqRel,
            Ordering::Relaxed,
        );

        let new_writedesc = WriteDescriptor::<T>::new_none_as_ptr();

        let mut hp = HazardPointer::new_in_domain(&self.domain);

        let old = unsafe {
            self.descriptor
                .load(&mut hp)
                .unwrap()
                .pending
                // # Safety
                // new_writedesc conforms to the requirements of HazAtomicPtr::new()
                // because it comes from Box::into_raw and is a valid WriteDescriptor
                .swap_ptr(new_writedesc)
        };

        // # Safety
        // We are the only thread that will retire this pointer because
        // only one thread can get the result of the swap (this one).
        // Two threads couldn't have performed a swap and both got this pointer.
        unsafe { old.unwrap().retire_in(&self.domain) };
    }
}

}

Fixing push & pop

I know I said that the changes to push and pop aren't that bad, which is true. Getting to those changes however, took a while. I'm going to explain what I did with pseudocode first, and then show the final code.

The first thing I tried was just retiring the old Descriptor after a successful compare_exchange; however, this didn't reduce the leakage at all for some reason. I figured it might be because the Descriptor was pointing to a live WriteDescriptor. So then, I also retired the WriteDescriptor. However, this produced use-after-frees and data races according to Miri, so I knew I was doing something wrong.

I decided to review the safety contract of retire_in again, and that is when I found the bug. Retiring the Descriptor is safe for the same reason retiring the WriteDescriptor after complete_write is: since the Descriptor is the result of a swap, we are the only thread that will retire it. The problem is that if we also retire the WriteDescriptor, a thread that is already accessing the Descriptor could make a new load of the just-retired WriteDescriptor, violating the safety contract of retire_in and causing UB.

The problem in picture form

We, Thread 1, have the Descriptor as the result of a successful compare_exchange. Thread 2 is also reading the Descriptor (but not the inner WriteDescriptor).

               Thread 2
               /
Thread 1 (us) /
   |         /
   |        /
   V       v
  Descriptor
     \
      \
       \
        v
        WriteDescriptor

Because the compare_exchange was successful, we retire the Descriptor and WriteDescriptor. The Descriptor is protected from reclamation because Thread 2 is reading it, but the WriteDescriptor has no readers so it gets deallocated.

               Thread 2
               /
Thread 1 (us) /
   |         /
   |        /
   V       v
  Descriptor
     \
   ---+----------------
       \
        v
        WriteDescriptor <Deallocated>

Now, Thread 2 goes to read the (now reclaimed!!) WriteDescriptor by loading the pointer contained in the Descriptor (which is still protected, and safe to access).

               Thread 2
                  |
Thread 1 (us)     |
   |              |
   |              |
   V              |
  Descriptor      |
     \            |
   ---+-----------+----
       \          |
        v         V
        WriteDescriptor <Deallocated>

And here we have it, Thread 2 accessing deallocated memory!

The solution

The solution I came up with is to make sure a reference to a WriteDescriptor never outlives the reference to its parent Descriptor. Visually, this looks like:

-- Descriptor Reference Start

    -- WriteDescriptor Reference Start


    -- WriteDescriptor Reference End



-- Descriptor Reference End

This means that when no one is accessing a Descriptor, no one is accessing the inner WriteDescriptor either. Therefore, when a Descriptor is retired, the WriteDescriptor is also safe to retire because there are no references to it. Since no one can get a new reference to a retired Descriptor, no one can access the inner WriteDescriptor.

Why is this important? Whenever we reclaim a Descriptor, we also reclaim the inner WriteDescriptor, fixing our leaks without causing any UB.

To implement this custom behavior for Descriptor, we implement the Drop trait. A type that implements Drop executes some custom behavior when it goes out of scope and is reclaimed.

The Drop implementation looks like this:

#![allow(unused)]
fn main() {
impl<T> Drop for Descriptor<'_, T>
{
    fn drop(&mut self) {
        // # Safety
        // The pointer is valid because it's from Box::into_raw
        // We must also ensure ref to wdesc never outlasts ref to desc
        unsafe {
            Box::from_raw(
                self.pending
                    .swap_ptr(ptr::null_mut())
                    .unwrap()
                    .into_inner() // This is a NonNull<T>
                    .as_ptr() // Turn it into a raw pointer
            );
        }
    }
}
}

All we're doing is extracting the pointer to the WriteDescriptor and calling Box::from_raw on it so that its memory will be reclaimed by Box when it goes out of scope.

Reclaiming the Descriptors

It's time to finally go over the code changes to push. All accesses to the Descriptor and WriteDescriptor are guarded with a hazard pointer. The access returns a reference to the Descriptor/WriteDescriptor, which is valid as long as the hazard pointer guarding the access is alive. Access to the inner WriteDescriptor is explicitly scoped within its own block to make clear that access to the WriteDescriptor cannot outlive the access to the parent Descriptor.

#![allow(unused)]
fn main() {
pub fn push(&self, elem: T) {
    let backoff = Backoff::new(); // Backoff causes significant speedup
    loop {
        let mut dhp = HazardPointer::new_in_domain(&self.domain);
        let current_desc = unsafe { self.descriptor.load(&mut dhp) }
            .expect("invalid ptr for descriptor in push");

        // Use a block to make explicit that the use of the wdesc does not outlive
        // the use of the desc.
        // This means that when the desc is dropped, there will be no references
        // to the wdesc inside.
        // And we can deallocate the wdesc with `Box::from_raw`
        {
            let mut wdhp = HazardPointer::new_in_domain(&self.domain);
            let pending = unsafe { current_desc.pending.load(&mut wdhp) }
                .expect("invalid ptr from write-desc in push");

            self.complete_write(pending as *const _ as *mut _);
            // Hazard pointer is dropped, protection ends
        }
}

This stuff is all the same as before.

#![allow(unused)]
fn main() {
        // If we need more memory, calculate the bucket
        let bucket = (highest_bit(current_desc.size + FIRST_BUCKET_SIZE)
            - highest_bit(FIRST_BUCKET_SIZE)) as usize;
        // Allocate it
        if self.buffers[bucket].load(Ordering::Acquire).is_null() {
            self.allocate_bucket(bucket)
        }

        let last_elem = unsafe { &*self.get(current_desc.size) };

        let next_write_desc = WriteDescriptor::<T>::new_some_as_ptr(
            // TODO: address this in macro
            // # Safety
            // The `transmute_copy` is safe because we have ensured that T is the
            // correct size at compile time
            unsafe { mem::transmute_copy::<T, u64>(&elem) },
            // Load from the AtomicU64, which really contains the bytes for T
            last_elem.load(Ordering::Acquire),
            last_elem,
        );

        let next_desc = Descriptor::<T>::new_as_ptr(next_write_desc,
            current_desc.size + 1);

}

The compare_exchange syntax is slightly different, but it's doing the exact same thing. We don't have to specify orderings because they're built in by haphazard. On a successful compare_exchange, we retire the pointer to the old Descriptor. When it is finally reclaimed, its Drop implementation will run and its inner WriteDescriptor will also get reclaimed safely.

If the compare_exchange fails, we deallocate our local Descriptor normally by calling Box::from_raw. Since the local Descriptor was never shared across threads, we don't have to worry about synchronizing the deallocation. Then, we spin using the Backoff and go back to the top of the loop.

#![allow(unused)]
fn main() {
        if let Ok(replaced) = unsafe {
            HazAtomicPtr::compare_exchange_weak_ptr(
                // # Safety
                // Safe because the pointer we swap in points to a valid object that
                // is !null
                &self.descriptor,
                current_desc as *const _ as *mut _,
                next_desc,
            )
        } {
            self.complete_write(next_write_desc);

            // # Safety
            // Since we only retire when swapping out a pointer, this is the only
            // thread that will retire, since only one thread receives the result of
            // the swap (this one)
            //
            // There will never be another load call to the old ptr because all calls
            // will go to the new one. Since all uses of the inner wdesc are contained
            // within the lifetime of the reference to the desc, there will also be no
            // new loads on the inner wdesc.
            unsafe {
                replaced.unwrap().retire_in(&self.domain);
            }
            break;
        }

        // Deallocate the write_desc and desc that we failed to swap in
        // # Safety
        // Box the write_desc and desc ptrs were made from Box::into_raw, so it
        // is safe to Box::from_raw
        unsafe {
            // Note: the inner wdesc also gets dropped as part of the desc's drop impl
            Box::from_raw(next_desc);
        }

        backoff.spin();
    }
}

}

The changes for pop are identical. We are so close to being done with code. Our Descriptors and WriteDescriptors are eventually reclaimed, which is a big step forward. The last thing is to deallocate the buckets and the final Descriptor when the vector itself is dropped.


Complete source for push() and pop()

#![allow(unused)]
fn main() {
pub fn push(&self, elem: T) {
    let backoff = Backoff::new(); // Backoff causes significant speedup
    loop {
        let mut dhp = HazardPointer::new_in_domain(&self.domain);
        let current_desc = unsafe { self.descriptor.load(&mut dhp) }
            .expect("invalid ptr for descriptor in push");

        // Use a block to make explicit that the use of the wdesc does not
        // outlive the use of the desc. This means that when the desc is dropped,
        // there will be no references to the wdesc inside. And we can deallocate
        // the wdesc with `Box::from_raw`
        {
            let mut wdhp = HazardPointer::new_in_domain(&self.domain);
            let pending = unsafe { current_desc.pending.load(&mut wdhp) }
                .expect("invalid ptr from write-desc in push");

            self.complete_write(pending as *const _ as *mut _);
            // Hazard pointer is dropped, protection ends
        }

        // If we need more memory, calculate the bucket
        let bucket = (highest_bit(current_desc.size + FIRST_BUCKET_SIZE)
            - highest_bit(FIRST_BUCKET_SIZE)) as usize;
        // Allocate it
        if self.buffers[bucket].load(Ordering::Acquire).is_null() {
            self.allocate_bucket(bucket)
        }

        let last_elem = unsafe { &*self.get(current_desc.size) };

        let next_write_desc = WriteDescriptor::<T>::new_some_as_ptr(
            // TODO: address this in macro
            // # Safety
            // The `transmute_copy` is safe because we have ensured that T is
            // the correct size at compile time
            unsafe { mem::transmute_copy::<T, u64>(&elem) },
            // Load from the AtomicU64, which really contains the bytes for T
            last_elem.load(Ordering::Acquire),
            last_elem,
        );

        let next_desc = Descriptor::<T>::new_as_ptr(next_write_desc,
            current_desc.size + 1);

        if let Ok(replaced) = unsafe {
            HazAtomicPtr::compare_exchange_weak_ptr(
                // # Safety
                // Safe because the pointer we swap in points to a valid object that
                // is !null
                &self.descriptor,
                current_desc as *const _ as *mut _,
                next_desc,
            )
        } {
            self.complete_write(next_write_desc);

            // # Safety
            // Since we only retire when swapping out a pointer, this is the only
            // thread that will retire, since only one thread receives the result of
            // the swap (this one)
            //
            // There will never be another load call to the old ptr because all calls
            // will go to the new one. Since all uses of the inner wdesc are contained
            // within the lifetime of the reference to the desc, there will also be no
            // new loads on the inner wdesc.
            unsafe {
                replaced.unwrap().retire_in(&self.domain);
            }
            break;
        }

        // Deallocate the write_desc and desc that we failed to swap in
        // # Safety
        // Box the write_desc and desc ptrs were made from Box::into_raw, so it is
        // safe to Box::from_raw
        unsafe {
            // Note: the inner wdesc also gets dropped as part of the desc's drop impl
            Box::from_raw(next_desc);
        }

        backoff.spin();
    }
}

pub fn pop(&self) -> Option<T> {
    let backoff = Backoff::new(); // Backoff causes significant speedup
    loop {
        let mut dhp = HazardPointer::new_in_domain(&self.domain);
        let current_desc = unsafe { self.descriptor.load(&mut dhp) }
            .expect("invalid ptr for descriptor in pop");

        // Use a block to make explicit that the use of the wdesc does not
        // outlive the use of the desc. This means that when the desc is
        //  dropped, there will be no references to the wdesc inside.
        // And we can deallocate the wdesc with `Box::from_raw`
        {
            let mut wdhp = HazardPointer::new_in_domain(&self.domain);
            let pending = unsafe { current_desc.pending.load(&mut wdhp) }
                .expect("invalid ptr for write-descriptor in pop");

            self.complete_write(pending as *const _ as *mut _);
            // Hazard pointer is dropped, protection ends
        }

        if current_desc.size == 0 {
            return None;
        }

        // TODO: add safety comment
        // Consider if new desc is swapped in, can we read deallocated memory?
        // Do not need to worry about underflow for the sub because we would
        // have already returned
        let elem = unsafe { &*self.get(current_desc.size - 1) }
            .load(Ordering::Acquire);

        let new_pending = WriteDescriptor::<T>::new_none_as_ptr();

        let next_desc = Descriptor::<T>::new_as_ptr(new_pending,
            current_desc.size - 1);

        if let Ok(replaced) = unsafe {
            HazAtomicPtr::compare_exchange_weak_ptr(
                // # Safety
                // Safe because the pointer we swap in points to a valid object that
                // is !null
                &self.descriptor,
                current_desc as *const _ as *mut _,
                next_desc,
            )
        } {
            // # Safety
            // Since we only retire when swapping out a pointer, this is the only
            // thread that will retire, since only one thread receives the result of
            // the swap (this one)
            //
            // There will never be another load call to the old ptr because all calls
            // will go to the new one. Since all uses of the inner wdesc are contained
            // within the lifetime of the reference to the desc, there will also be no
            // new loads on the inner wdesc.
            unsafe {
                replaced.unwrap().retire_in(&self.domain);
            }

            // # Safety
            // TODO: address this in macro
            // This is ok because we ensure T is the correct size at compile time
            // We also know that elem is a valid T because it was transmuted into a
            // usize from a valid T, therefore we are only transmuting it back
            return Some(unsafe { mem::transmute_copy::<u64, T>(&elem) });
        }

        // Deallocate the write_desc and desc that we failed to swap in
        // # Safety
        // Box the write_desc and desc ptrs were made from Box::into_raw, so
        // it is safe to Box::from_raw
        unsafe {
            // Note: the inner wdesc also gets dropped as part of the desc's drop impl
            Box::from_raw(next_desc);
        }

        backoff.spin();
    }
}
}

Dropping the vector

We approach the end! As its last action, the vector will free the memory allocated in its buckets and the Descriptor it holds. Once again, we achieve this by implementing the Drop trait.

Using a bunch of chained function calls on &self.buffers, we can get all of the buffers that aren't null. Then, we recreate the Layout they were allocated with and deallocate them.

Dropping the current Descriptor is simple: we just Box::from_raw it! Its destructor runs and the inner WriteDescriptor is also deallocated.

#![allow(unused)]
fn main() {
impl<T> Drop for SecVec<'_, T>
where
    T: Copy,
{
    fn drop(&mut self) {
        // Drop buffers
        let allocator = Global;
        for (bucket, ptr) in self
            .buffers
            .iter()
            .filter(|ptr| !ptr.load(Ordering::Relaxed).is_null())
            .enumerate()
        // Getting all non-null buckets
        {
            let size = FIRST_BUCKET_SIZE * (1 << bucket);
            let layout = match Layout::array::<AtomicU64>(size) {
                Ok(layout) => layout,
                Err(_) => capacity_overflow(),
            };
            unsafe {
                // # Safety
                // We have recreated the exact same layout used to alloc the ptr in
                // `allocate_bucket`. We know the ptr isn't null because of the filter
                allocator.deallocate(
                    NonNull::new(ptr.load(Ordering::Relaxed) as *mut u8).unwrap(),
                    layout,
                )
            };
        }

        // Retiring the current desc and wdesc
        // # Safety
        // Since we have &mut self, we have exclusive access, so we can retire the
        // desc and wdesc ptrs.
        //
        // It is safe to dereference the ptr to the desc because it is valid because
        // it was created with Descriptor::new_as_ptr.
        let desc = self.descriptor.load_ptr();
        unsafe {
            Box::from_raw(desc);
        };
    }
}

}

That's it. That's all the leaks plugged, and all the code I'm going to show you. I hope that was a satisfying journey from learning about the algorithm to fully implementing it! Let's run the tests one more time with Miri :)

More tests

Here are the tests. I added a new one up at the top that spawns a bunch of threads which push and pop. I just want to make sure Miri does not detect any UB in a complex scenario like that.

#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
    use super::*;
    extern crate std;
    use std::sync::atomic::{AtomicIsize, Ordering};
    use std::sync::Arc;
    use std::thread::{self, JoinHandle};
    use std::vec::Vec;

    #[test]
    fn the_big_multithread() {
        static FIVE: isize = 5;
        let data = Arc::new(SecVec::<isize>::new());
        data.reserve(100 * 5);
        let sum = Arc::new(AtomicIsize::new(0));
        #[allow(clippy::needless_collect)]
        let handles = (0..5)
            .map(|_| {
                let data = Arc::clone(&data);
                thread::spawn(move || {
                    for _ in 0..100 {
                        data.push(FIVE);
                    }
                })
            })
            .into_iter()
            .collect::<Vec<JoinHandle<_>>>();
        handles.into_iter().for_each(|h| h.join().unwrap());
        #[allow(clippy::needless_collect)]
        let handles = (0..5)
            .map(|_| {
                let data = Arc::clone(&data);
                let sum = Arc::clone(&sum);
                thread::spawn(move || {
                    for _ in 0..100 {
                        sum.fetch_add(data.pop().unwrap_or(0), Ordering::Relaxed);
                    }
                })
            })
            .into_iter()
            .collect::<Vec<JoinHandle<_>>>();
        handles.into_iter().for_each(|h| h.join().unwrap());
    }

    #[test]
    fn size_starts_at_0() {
        let sv = SecVec::<usize>::new();
        assert_eq!(0, sv.size());
    }

    #[test]
    fn pop_empty_returns_none() {
        let sv = SecVec::<usize>::new();
        assert_eq!(sv.pop(), None);
    }

    #[test]
    fn ten_push_ten_pop() {
        let sv = SecVec::<isize>::new();
        for i in 0..10 {
            sv.push(i);
        }
        for i in (0..10).rev() {
            assert_eq!(sv.pop(), Some(i));
        }
    }

    #[test]
    fn does_not_allocate_buffers_on_new() {
        let sv = SecVec::<isize>::new();
        for buffer in &**sv.buffers {
            assert!(buffer.load(Ordering::Relaxed).is_null())
        }
    }
}

}

And here is the result, shuffled around a little so it all fits:

~/C/r/unlocked (main) [1] > cargo miri test -- sealed::tests
    Finished test [unoptimized + debuginfo] target(s) in 0.01s
     Running unittests (target/miri/x86_64-apple-darwin/debug/deps/unlocked-666)

running 5 tests
test sealed::tests::does_not_allocate_buffers_on_new ... ok
test sealed::tests::pop_empty_returns_none ... ok
test sealed::tests::size_starts_at_0 ... ok
test sealed::tests::ten_push_ten_pop ... ok
test sealed::tests::the_big_multithread ... ok
    warning: thread support is experimental and incomplete:
        weak memory effects are not emulated.

Mmm . . . I love those greens! <3

Reflections

Here are some reflections on the process of writing the vector and learning about concurrent programming.

Potential Optimizations

Each time we make a new Descriptor or WriteDescriptor, we allocate it on the heap. This means we will make many heap allocations for only one Descriptor to succeed at being compare_exchange'd in. What if we instead made one heap allocation at the beginning of push and pop, and just overwrote the contents on every failed iteration of the compare-exchange loop?

#![allow(unused)]
fn main() {
// Normal flow
fn push() {
    loop {
        // New allocation every iteration, expensive :(
        <allocate Descriptor> 
        <compare-exchange Descriptor>
        <if compare-exchange failed, reloop>
    }
}

// Efficient flow
fn push() {
    <allocate Descriptor> // One time cost :)
    loop {
        <write to allocation> // Cheap :)
        <compare-exchange Descriptor>
        <if compare-exchange failed, reloop>
    }
}
}

I tried it, and the results range from worse for one microbenchmark to being somewhat better on other microbenchmarks.

Here's the results of the vector we implemented:

test sealed::bench::pop                ... bench:     169,980 ns/iter (+/- 21,594)
test sealed::bench::push               ... bench:   1,025,550 ns/iter (+/- 43,945)
test sealed::bench::push_and_pop       ... bench:     829,768 ns/iter (+/- 63,895)
test sealed::bench::push_then_pop      ... bench:   1,732,666 ns/iter (+/- 113,670)

Here's the results for the modified vector:

test sealed::bench::pop                ... bench:     269,311 ns/iter (+/- 11,669)
test sealed::bench::push               ... bench:     962,469 ns/iter (+/- 23,620)
test sealed::bench::push_and_pop       ... bench:     786,135 ns/iter (+/- 32,104)
test sealed::bench::push_then_pop      ... bench:   1,611,816 ns/iter (+/- 68,167)

As you can see, pop (which is just a bunch of threads popping an empty vector) is worse for the modified vector. At the beginning of pop, we make an allocation to hold the Descriptors that we'll try to swap in. However, in this test, we are always popping off an empty vector, so we never even need to write to the allocation; we just return None when we see the length of the vector is 0. So, we make an unnecessary allocation when popping off an empty vector, but save many allocations when there is actual contention.

The other microbenchmarks look better, but the intervals for the modified and original overlap, so I doubt the change is significant (#AP Stats knowledge).

unsafe code

As I said before, unsafe code isn't inherently bad; it's just code that comes with a contract. Keeping this in mind helped me get over my initial apprehension about using unsafe code.

If you write concurrent code, I think unsafe code is inevitable. There's just too much to do with raw pointers and memory. Additionally, there are many more contracts the compiler can't enforce in multithreaded scenarios.

Philosophically, concurrent code has to deal with shared mutable state, otherwise it wouldn't do anything useful. Shared mutable state is inherently unsafe! That's why Rust only allows one mutable reference (&mut) at a time: it prevents memory bugs like data races. Thus, there is some danger intrinsically associated with writing low-level concurrent code.

"Shared mutable state is, among other things, the root of all evil" - Me, 2022

Although it seems scary at first, I'm really glad Rust has the concept of unsafe. Whenever I had any memory bugs, I knew that the root cause must have been in an unsafe block. Systematically checking over those blocks allowed me to fix my code quickly.

It's good that we have to make explicit where we are doing potentially unsafe things. Not just because of debugging, but because it makes us pause and check everything over one more time. If nothing was unsafe, or everything was unsafe, reasoning about our code would be much harder in my opinion.

A note on debugging: always read the safety contract and document why what you're doing is safe! I caught so many bugs just by going over the safety contract again and realizing I wasn't following it.
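For instance, a documented contract over a dereference might look something like this; the helper and the invariants it cites are invented purely for illustration:

// Hypothetical helper, just to show the shape of a documented contract
fn read_initialized_slot(slot: *const u64) -> u64 {
    // SAFETY: `slot` points into a bucket that was allocated and initialized
    // before this call, and no other thread writes to this index once it has
    // been published, so the dereference cannot race or read freed memory.
    unsafe { *slot }
}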

Atomic Intuition

It's pretty safe to say that atomics are confusing. Even the context they live in is confusing: CPUs and compilers can reorder instructions. I found that as I developed an intuition for atomics, it became easier to reason about my code and its correctness.

If you're getting started with concurrent programming, my advice would be to get a solid grasp on atomics. You don't need to know every detail and all the ins and outs. When you're writing code and you think "This Acquire load synchronizes with that Release store", you gain confidence and it becomes easier to get going.

The biggest moment for me was when I stopped having to look at the Standard Library Documentation every time I used an atomic. I had developed an intuitive sense of the orderings, and I could see why each one was useful in my code. At first, I thought the orderings seemed a little random. As I started to use atomics more and more, I saw how the orderings fit in nicely with actual use cases, from using Acquire to load a bucket to AcqRel in compare_exchange.
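As a toy illustration of that kind of reasoning (a standalone program, not code from the vector): the Release store to READY guarantees that a thread which Acquire-loads READY and sees true also sees the earlier write to DATA.

use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::thread;

static DATA: AtomicU64 = AtomicU64::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let writer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);
        // Release: all writes before this store are visible to any thread
        // that Acquire-loads READY and observes `true`
        READY.store(true, Ordering::Release);
    });
    let reader = thread::spawn(|| {
        // Acquire: once we see `true`, the write to DATA is visible too
        while !READY.load(Ordering::Acquire) {}
        assert_eq!(DATA.load(Ordering::Relaxed), 42);
    });
    writer.join().unwrap();
    reader.join().unwrap();
}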

Building an intuition for atomics is both satisfying and extremely useful.

Debugging

Here are a couple of debugging tricks I stumbled upon as I wrote the vector.

Pointers

One time, I had a problem where something was reading a null pointer. I thought that after a swap, no thread could read the swapped-in value, so I just swapped in a null pointer. Sadly, I was mistaken. Looking at the error message, which said pointer 0x0 is invalid to dereference, I got an idea. There were three places swapping in a null pointer; I changed the first to swap in 0x1, the second 0x2, and the third 0x3.

After running the program through Miri, I got the error message pointer 0x2 is invalid to dereference, and I knew where the bug was originating from.
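Roughly, the trick looks like this; the function names and the u64 payload are made up for illustration, but the idea is that every site that used to store a null now stores its own recognizable sentinel:

use std::sync::atomic::{AtomicPtr, Ordering};

// Hypothetical sites: instead of ptr::null_mut(), each one stores a
// distinct (still invalid) sentinel address
fn clear_site_one(slot: &AtomicPtr<u64>) {
    slot.store(0x1 as *mut u64, Ordering::Release);
}

fn clear_site_two(slot: &AtomicPtr<u64>) {
    slot.store(0x2 as *mut u64, Ordering::Release);
}

fn clear_site_three(slot: &AtomicPtr<u64>) {
    slot.store(0x3 as *mut u64, Ordering::Release);
}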

unwrap vs. expect

Whenever you unwrap an Option or Result that is None or Err, Rust will print out a little diagnostic saying where the panic! happened. I found it helpful to use expect instead of unwrap because of the ability to provide some extra context.

For example, there is a method in the haphazard crate called AtomicPtr::load which returns an Option<&T>. It only returns None if the underlying AtomicPtr contains a null pointer. Instead of unwrapping the return value of load, I called expect("read null pointer"). When I inevitably messed up and unwrapped a None, I knew there was a null pointer floating around because of the error message.
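In miniature, the idea looks something like this made-up helper around a plain std AtomicPtr (not the haphazard API):

use std::sync::atomic::{AtomicPtr, Ordering};

// Hypothetical helper: turn a raw load into a reference, panicking with a
// useful message instead of a bare `unwrap` if the pointer turns out null
fn load_ref<T>(ptr: &AtomicPtr<T>) -> &T {
    let raw = ptr.load(Ordering::Acquire);
    // SAFETY: in real code we'd justify that a non-null `raw` points to a
    // live T; here the point is just the error message on failure
    unsafe { raw.as_ref() }.expect("read null pointer")
}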

Although these tricks seem small, they actually saved me a lot of time.

Acknowledgements

Thank you to my advisor and my friends for your feedback and support.

Helpful Resources

Reading source code is a great way to learn!