Building a Rusty, Lock-free Dynamically Resizable Array
Dedicated to RS, whom I wouldn't compare-and-swap with anyone else.
You can read the code we'll write and the code for the book here. The main files for the Rust code are `leaky.rs` and `sealed.rs`.
The Goal
This book has a few goals.
Inspired by Learn Rust With Entirely Too Many Linked Lists, the main goal of this book is to teach you some Rust while implementing a useful container. We'll be implementing the lock-free vector described in the paper Lock-free Dynamically Resizable Arrays by Dechev et al., 2006.
I hope that this book will inspire other new Rustaceans like myself to push their capabilities. I also hope that non-Rustaceans will see how awesome Rust is as well. No matter whether you code or not, I hope that this book will show you an interesting area of computer science and a beautiful language!
Topics We'll Cover
- Concurrency
- Cache
- Exponential Backoff
- Atomics
- Memory Orderings
- Compare-and-Swap
- Memory Management
- Allocations in Rust
- Hazard Pointers
- Using Rust's `Box` and `Drop`
- Using `unsafe` Rust
- Raw Pointers
- How to write `unsafe` code
- Anything else I find interesting!
Necessary Experience
tl;dr it's good to know some Rust
It will be helpful to be familiar with Rust or another language like C or C++, as we will be dealing with low-level constructs like pointers, atomics, and memory management. However, even if you are only familiar with `Some(_)` or `None` of these things, I believe you will be able to learn an interesting thing or two. I should say though, there is a lot of code in the later portions of the book.
Of course, the code will be in Rust, so prior knowledge will be helpful. I'm not going to spend time explaining syntax. However, I will comment the code well and explain what is going on. I think if you're comfortable with the first 15 chapters of The Book, you should be fine. Even if not, as long as you understand most of Rust's syntax and are fine with looking something up every once in a while, you'll be fine. Chapter 16 is very helpful as well, as it's the chapter on concurrency.
A Message
Although I want you to come away from this book having learned something, and hopefully feeling inspired to write more exciting code, this book is also a learning experience for me.
I've never used Markdown before except for writing GitHub README's that only I'm going to read, so simply using Mdbook is super exciting! I think Markdown is cool because it's like HTML for lazy people (and I am very lazy).
Throughout the book, I will have many questions. I started teaching myself Rust around ~half a year ago, so perhaps I'm not really qualified to write this book. I'd never touched an atomic variable before I started this project (and I still technically haven't), so I really mean it when I say this is a learning experience for me. I'll try to document the answers to the questions I have so you can learn from them as well. There will also be a whole section of reflections, as this is also for a school project. You might enjoy that section more than the technical sections.
Before we get started, I want to clarify the structure of the book. There are three main "sections": Theory/Algorithm, Code, and Reflections. Feel free to bounce around if one section becomes too much.
And without further ado . . . cargo run!
Concurrency
Concurrent (Merriam-Webster): operating or occurring at the same time
Concurrent programming is simply programming that involves more than one event happening at a time, in the sense that we think of events happening in a program. In a non-concurrent program, if we wanted to perform two calculations, we would perform one, and then the other. In a concurrent approach, we might spawn two threads and assign each of them a calculation to perform. A big idea in concurrent programming is having multiple processes running at the same time. You can think of it like your computer running Firefox and Spotify at the same time.1
On a hardware level, one way to implement concurrency is to have multiple CPU cores (processors, the chips that do the math). Thus, we can add two numbers on one core while dividing two numbers on another core.
1 Your computer might actually just be switching between the applications really fast if you only have one CPU core, giving the illusion of multiple processes happening at the same time. Even if you have many cores, it's possible that the applications could be running on the same core. It's all up to the task scheduler.
Keywords
Note: don't read all of this if it's boring, just come back if there is a word you don't know later on.
Shared-state
A concurrency model where multiple threads operate on the same memory.
Lock-free
A property of a system where after some finite number of time steps, a thread will make progress.
Ok, but what does lock-free actually mean? Suppose we have a thread holding a mutex. If that thread gets scheduled off by the OS, and never gets scheduled on again, no other threads can get the mutex and make progress. With a lock-free algorithm, we are guaranteed that after some amount of time, at least one thread will make progress; one thread cannot block all of the others indefinitely.
Buffer
Some block of memory that we use to hold data. The vector's buffers are where we hold the actual elements.
Cache
A type of memory that is faster to access but has less capacity than main memory (RAM). If we need a value frequently, we can cache it so that accesses are faster.
Null Pointer
The pointer with address `0x0`. Never safe to dereference.
Heap
A region of memory for storing long-lived or large values.
Stack
A region of memory for storing local variables and other small, short-lived values.
Data race
When multiple threads access a value without synchronization and one of them is writing. Picture a shared Google document. If people are trying to read it as someone writes on it, they'll be reading gibberish until the writer is done.
Thread
Like another program running within the main program. Threads can run at the same time as each other, and the first thread (the one we have at program start) is called the main thread. Eventually, all threads are joined back into the main thread.
Mutex (mutual exclusion)
A data structure that gives a thread exclusive access to some data. When you call `lock` on a `Mutex`, you block until the `Mutex` is unlocked and you get the lock. Then, you can do whatever you want with the data inside. Once you're done, you unlock the `Mutex` to relinquish it back to the other threads.
Atomics
Besides having a cool name, atomics are crucial for writing concurrent code.
We first need to think about how computers perform operations. On a 32-bit machine, loading (reading) a 64-bit value would require two CPU operations, one for the first 32 bits and one for the second 32 bits.
Suppose each box represents a byte (8 bits):
v Load 1 v
+-----+-----+-----+-----+-----+-----+-----+-----+
| | | | | | | | |
+-----+-----+-----+-----+-----+-----+-----+-----+
^ Load 2 ^
This shows how a load of a variable can take multiple steps.
Atomic operations take only one step. They have no intermediate observable state, which means the CPU only observes them as having happened or not.
This is very important in multithreaded scenarios because if threads use non-atomic operations, loads and stores might end up overlapping, resulting in torn reads and writes.
For example, on our hypothetical 32-bit machine, one core might finish the first half of a write to the 64-bit value, another core might then perform the two loads needed to read the value, and then the first core might finish storing the last 32 bits. Now, one core has a value that is half gibberish!
This is an example of a data race, which is undefined behavior.
What are Memory Orderings?
In a concurrent environment, each variable has a modification history: all the values it has ever held. Say we have a variable A. We could store 1 into it, then 2, then 3.
The problem comes from the fact that another thread reading A can "read" any of those values, even after the last store is executed, "in real time". For example, it might have an older (stale) copy of the variable cached.
To ensure that our programs run the way we want, we need to specify more explicitly which values in the modification history the CPU is allowed to use.
Another problem is the compiler reordering instructions. The Golden Rule of instruction reordering is do not modify the behavior of a single-threaded program. The compiler might not think it's doing anything wrong moving some instructions around in one thread. And from the perspective of the thread that's being modified, everything will seem alright. Other threads might start receiving crazy results though.
An ordering is a parameter you provide to operations with atomic variables that specifies which reorderings can happen and which values in the modification history the CPU can use.
I'm not going to go super in-depth into the intricacies of each ordering, but I will explain the important parts of each. If you're curious, Jon Gjengset has a great YouTube video on atomics, with sections on each ordering: Crust of Rust: Atomics and Memory Ordering.
Going into the orderings, I find it helpful to separate their effects into two categories: those that have to do with compiler reordering, and those that have to do with the CPU. The compiler deals with the synchronization in the operation's thread (well, actually the CPU does too, but that's a different story), and the CPU handles the synchronization across the other threads.
Relaxed
The first ordering is `Relaxed`. When it comes to the CPU, there are no guarantees imposed by this ordering. The compiler can reorder `Relaxed` operations as long as it follows the Golden Rule; it does not need to consider other threads. The classic use case (I think this use case is classic at least, I always see it used in examples) of the `Relaxed` ordering is incrementing/decrementing a counter. We don't really care about observing the state of the counter; we just want to make sure our updates happen correctly. When we finally load the counter, we can use an ordering with stronger guarantees.
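To make the counter use case concrete, here's a tiny sketch I put together (it's not code from the vector, just an illustration of the pattern): a few threads bump a shared counter with `Relaxed` increments, and we read the total once they're all joined.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

static COUNTER: AtomicUsize = AtomicUsize::new(0);

fn main() {
    let handles: Vec<_> = (0..4)
        .map(|_| {
            thread::spawn(|| {
                for _ in 0..1000 {
                    // We only need the increment itself to be atomic;
                    // we don't care how it's ordered with other memory operations
                    COUNTER.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    // The joins already synchronize with the finished threads,
    // so a Relaxed load here still sees all 4000 increments
    assert_eq!(COUNTER.load(Ordering::Relaxed), 4000);
}
```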
Release
`Release` is used with stores. You can think of `Release` as *releasing* a lock. We want any changes that happened while we had the lock to become visible to other threads. When you store with `Release`, it's like saying "I'm done with this, use these changes." Thus, the compiler cannot reorder earlier operations to after a `Release` store.
STORE (Relaxed) ─┐
STORE (Release) -+-// "Release the lock"
X <─┘ // nope, this happened while we "held the lock"
There is also a CPU property to this ordering, which I'll go over with `Acquire`.
Acquire
`Acquire` is used with loads. You can think of `Acquire` like *acquiring* a lock. This means that no memory operations in the current thread can get reordered to before the load, i.e., before "taking the lock". Anything that happens after "taking the lock" stays after the "lock was taken".
X <─┐ // nope, this happened "after taking the lock"
X <─┐ │ // nope, this happened "after taking the lock"
LOAD (Acquire) -------+--------------+-// "Take the lock"
STORE (Relaxed) ─┘ │
LOAD a different variable (Relaxed) ─┘
`Acquire` also has an important interaction with `Release` at the CPU level: any `Acquire` (or stronger) load must see the changes published by the `Release` store of the same variable.
How does this achieve proper synchronization? You see, when two `Ordering`s love each other very much . . . we get Acquire-Release semantics. Watch what happens when we use `Acquire` and `Release` together (diagram inspired by this blog post):

└───┘ Release store
  │
  │   Read most recent data because the load
  │   is Acquire and the store is Release
  V
┌───┐ Acquire load
│   │   Memory operations cannot go above
│   │
│   │   Memory operations cannot go below
└───┘ Release store
  │
  │   Read most recent data because the load
  │   is Acquire and the store is Release
  V
┌───┐ Acquire load

All operations are trapped in their own sections, and each section gets the most recent modifications because of the way `Acquire` loads synchronize with the `Release` stores.
Note: Although the lock metaphor is helpful for understanding `Acquire` and `Release`, remember there are no actual locks involved.
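Here's a minimal sketch of that Acquire/Release handshake (again, not the vector's code, just the classic publish-a-flag pattern): one thread writes some data and then `Release`-stores a flag; the other `Acquire`-loads the flag and is guaranteed to see the data.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::thread;

static DATA: AtomicUsize = AtomicUsize::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let producer = thread::spawn(|| {
        // This write happens before the Release store below...
        DATA.store(42, Ordering::Relaxed);
        // ...so it is published to anyone who Acquire-loads READY as true
        READY.store(true, Ordering::Release);
    });
    let consumer = thread::spawn(|| {
        // Spin until we see the flag
        while !READY.load(Ordering::Acquire) {}
        // The Acquire load synchronized with the Release store,
        // so the data write is guaranteed to be visible here
        assert_eq!(DATA.load(Ordering::Relaxed), 42);
    });
    producer.join().unwrap();
    consumer.join().unwrap();
}
```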
AcqRel (Acquire and Release)
An `AcqRel` load/store is just `Release` for stores and `Acquire` for loads. When used with an operation that both loads and stores, it is both `Acquire` and `Release`. `AcqRel`'s main use case is read-modify-write operations, like loading a variable, adding one, and storing it back. We want the load to be `Acquire` and the store to be `Release`, so we use `AcqRel` to achieve this.
Foreshadowing: this ordering will play a prominent part later on!
SeqCst (Sequentially Consistent)
The `SeqCst` ordering has the same reordering effects as `AcqRel`, and also establishes a consistent modification order across all threads. Two stores tagged `Relaxed` might show up in different orders to different threads. However, if they are both tagged `SeqCst`, they will show up in the same order to all threads. `SeqCst` is the strongest ordering, and thus also the safest (see Jon Gjengset's video for weird things that can happen with weaker orderings). Safety comes at a price though, with the CPU often having to emit memory fences1 to guarantee sequential consistency. This can affect performance.
1 A memory fence prevents the CPU from reordering operations in certain ways. This is a great article which describes many different types of fences, kind of like the different Atomic orderings, which restrict the compiler instead of the CPU.
Compare-and-Swap
(also known as CAS and `compare_exchange`)
Definition: swap a value with a new value only if the current value is what we think it is.
The reason this is important is that, loosely, we can say the state during the swap is the same as the state we observed when preparing for the swap.
Here's a code example:
LOAD A
CAS old = A, new = 2 * A // Only swap in the double if the number hasn't changed
A more realistic example with a linked list would be this:
LOAD LIST_NODE
CAS old = LIST_NODE.pointer, new = new_pointer
In this case, we switch the `LIST_NODE` pointer only if it hasn't changed.
Here's what we did:
- We loaded the node
- We read the pointer
- We called CAS with the new pointer
At this point, there are two things that can happen:
- The pointer changed, and the CAS fails. This means that someone else changed the pointer first, and it's good that the CAS failed, because it's possible that the change that succeeded invalidates the change we just tried to make.
- The pointer is the same, and CAS succeeds. Because the pointer was the same, our assumptions about the state of the vector held, and our change was valid.
At first, this might seem contrived and confusing (as it did to me). I would focus on this intuition: if CAS succeeds, loosely, we can say the state during the swap was the same as the state we observed when preparing for the swap. Our assumptions were consistent throughout the whole process.
The `compare_exchange` function in the Rust Standard Library returns a `Result<T, T>`, where `T` is the type being exchanged. The `Result` contains the value that the variable actually was. If `compare_exchange` fails, it returns `Err(actual_value)`; on success, it returns `Ok(expected_value)` (if it succeeded, that means `actual_value == expected_value`).
Note: for the rest of the book, I'm going to refer to compare-and-swap as `compare_exchange`, as that is what the Rust Standard Library uses. I used compare-and-swap on this page because the name is very explicit about what the operation does.
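To tie the pseudocode above to the real API, here's a sketch (my own, not code from the vector) of the "double the number only if it hasn't changed" example written with `compare_exchange`, handling the `Result<T, T>` it returns:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Atomically double `a`, retrying if another thread changes it first
fn double(a: &AtomicU64) -> u64 {
    let mut current = a.load(Ordering::Acquire);
    loop {
        match a.compare_exchange(
            current,     // old: what we think the value is
            current * 2, // new: what we want to swap in
            Ordering::AcqRel,
            Ordering::Acquire,
        ) {
            // Success: the value was still `current`, and now holds the double
            Ok(old) => return old * 2,
            // Failure: someone beat us to it; retry with the value we actually saw
            Err(actual) => current = actual,
        }
    }
}

fn main() {
    let a = AtomicU64::new(21);
    assert_eq!(double(&a), 42);
}
```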
Introduction to the Paper
Firstly, a link to the paper can be found here.
In their 2006 paper, Dechev et al. describe an implementation of a vector that can safely be shared across threads.
Structure of the vector
This is where we begin working on the vector.
When thinking about the structure of the vector, I find it helpful to think about it in two parts: memory and synchronization. By memory I mean allocation of space and by synchronization I mean synchronizing reads and writes. Let's start with memory.
Memory
The vector is, broadly, a two-level array.
+---+---+---+---+---+
| 1 | 2 | 3 | 4 | 5 | Top-level
+---+---+---+---+---+
| | | | |
v v v v v
+---+---+---+---+---+
| 1 | 2 | 3 | 4 | 5 | Lower-level, notice: these arrays are represented vertically
+---+---+---+---+---+
| 2 | 3 | 4 | 5 |
+---+---+---+---+
| 3 | 4 | 5 |
+---+---+---+
| 4 | 5 |
+---+---+
| 5 |
+---+
The vector stores a pointer to a first array (the top-level one). The elements in this array are also pointers, to more arrays (the lower-level ones). This is why this organization is called a two-level array.
The reason for this is resizing. Suppose we have 4 elements of capacity, and
they are all filled. We need to allocate more memory. We allocate more memory
using something like malloc()
, and the allocator returns a pointer to the new
allocation.
For a normal vector, we would simply copy our vector's elements over to the new allocation. We can't do this for a lockless vector though because copying isn't atomic, and we can't lock down the vector, copy, and unlock. Therefore, we need a different system.
A little tangent on allocations: When allocating memory for a normal vector, we generally make larger and larger allocations. For example, the first allocation could be 4 elements, the next 8, then 16, 32 . . . This reduces the total number of allocations we need to perform, which is good for performance. We're going to use this same idea for the vector.
Returning to the idea of a two-level array, the first level is going to hold pointers to blocks of memory we can call buckets. The amount of memory a bucket holds is related to its index: a bucket will hold some constant (which we'll call `FIRST_BUCKET_SIZE`) times 2 to the power of its index elements. Here are some sample calculations for the first few buckets to show the principle, using `FIRST_BUCKET_SIZE = 8`:
# Bucket 1 (index 0)
CAP = FIRST_BUCKET_SIZE * 2 ^ INDEX
    = 8 * 2 ^ 0
    = 8

# Bucket 2 (index 1)
CAP = FIRST_BUCKET_SIZE * 2 ^ INDEX
    = 8 * 2 ^ 1
    = 16

# Bucket 3 (index 2): 32
# Bucket 4 (index 3): 64
To infinity and beyond . . .
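Here's that capacity formula as a quick Rust sketch (using the example constant `FIRST_BUCKET_SIZE = 8`; the shift-left is just 2 raised to the power of the index):

```rust
const FIRST_BUCKET_SIZE: usize = 8;

/// Capacity of the bucket at `index` (0-indexed): FIRST_BUCKET_SIZE * 2^index
fn bucket_capacity(index: usize) -> usize {
    FIRST_BUCKET_SIZE * (1 << index)
}

fn main() {
    assert_eq!(bucket_capacity(0), 8);
    assert_eq!(bucket_capacity(1), 16);
    assert_eq!(bucket_capacity(2), 32);
    assert_eq!(bucket_capacity(3), 64);
    // The first four buckets together hold 8 + 16 + 32 + 64 = 120 elements
    assert_eq!((0..4).map(bucket_capacity).sum::<usize>(), 120);
}
```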
The next part of the vector's structure is the synchronization aspect, which goes hand in hand with explaining the algorithm. I'll cover them together.
Synchronization
Synchronization, that is, coordinating concurrent operations on the vector, is achieved through two little data structures: the `Descriptor` and the `WriteDescriptor`. As you might expect, the `Descriptor` describes the vector and the `WriteDescriptor` describes a write operation.
The Descriptor
The `Descriptor` holds two values: a pointer to a `WriteDescriptor`, and a value indicating the size of the vector.
The WriteDescriptor
The `WriteDescriptor` holds three values: the location of the write (a pointer-like object), an old value, and a new value. If your spidey-sense is tingling, it might be because this new/old business is hinting at a `compare_exchange` in the future.
Now that we've seen the `Descriptor` and `WriteDescriptor`, here's a quick summary of the vector's structure:
# Data Organization
Vector:
    [Pointer -> Memory],
    Pointer -> Descriptor

Descriptor:
    Pointer -> Possible WriteDescriptor,
    Size

WriteDescriptor:
    Pointer -> Element location,
    New Element,
    Old Element
How does this actually help with synchronization?
The major challenges of providing lock-free vector implementation stem from the fact that key operations need to atomically modify two or more non-colocated words (Dechev et. al., 2006)
This translates to, "We need to change two things (without locking the vector down) to ensure the vector is in the right state." For a `push` operation, say, we would need to change the length of the vector and write the new data.
The descriptor system gets around this by saying, "If you want to change the `Descriptor`, you need to complete a pending write if there is one." Why does this ensure the correct semantics? Consider this example from the paper:
The semantics of the `pop_back` [`pop`] and `push_back` [`push`] operations are guaranteed by the `Descriptor` object. Consider the case when a `pop_back` is interrupted by any matching number of `push_back` and `pop_back` operations. In a naive implementation, the size of the vector would appear unchanged when the original `pop_back` resumes and the operation could produce an erroneous result. (Dechev et. al., 2006)
Under the "naive implementation", in this scenario, the vector might look like `[1, 2, 3]`. Someone calls `pop`, and the vector should return `3`. However, the thread gets preempted (the OS says another thread can run, and the current thread is paused), and the running thread executes a bunch of `pop`s and `push`es. The vector becomes `[4, 5, 6]`. When the original `pop` finally runs, it incorrectly returns `6`.
Let's consider what happens when the first `push` occurs after the original `pop` under the correct implementation. When the `push` happens, it swaps in a new `Descriptor`, which says that the size is now one bigger and points to a new `WriteDescriptor` representing a `push` operation. Because it swapped in a `Descriptor`, it has to complete the operation specified in the current `WriteDescriptor`, and the original `pop` returns `3`, as it should.
Let me clarify all this swapping business!
The Algorithm
As I've said before, I think of the vector as two connected systems: memory and synchronization. By "The Algorithm", I mean the synchronization aspect. To recap, synchronization is controlled by two little data structures, the `Descriptor` and the `WriteDescriptor`. These data structures describe the vector itself and a write operation, respectively.
I think the best way to explain the algorithm is to dive right in.
complete_write()
First, I want to explain a little routine called `complete_write`. This function is true to its name and completes a write.

"Write" means "write operation"; in this context, a `push` or `pop`. In my experience, "write" has been a more colloquial term used in CS for whenever we make a modification to something. Really anything can technically be a "write", but I would say things that are more final are "writes". For example, incrementing a loop variable is pretty insignificant in the grand scheme of things, so it's not really a "write", but increasing the size of the vector is an important "write". This usage might also be particular to concurrent programming, where balancing reads/writes is an important consideration when designing a data structure. Concurrent data structures are often designed for infrequent writes and frequent reads. Modifications to databases (which are heavily concurrent) can also be called writes. tl;dr a "write" in this case means the details describing a particular instance of "writing".
`complete_write` takes two arguments: a `WriteDescriptor` and the vector itself. `complete_write` applies the write operation described in the `WriteDescriptor` to the vector. Recall that a `WriteDescriptor` contains three things: a reference/pointer to the location where the write will take place, a new value to write, and an old value that we loaded in from the location.
First we perform a `compare_exchange` using the data in the `WriteDescriptor`. We only swap in the new data if the data at the location of the swap matches the old data we have. If the `compare_exchange` succeeds, this means that we swapped in the value we want to write. If it fails, it means someone else beat us to it and performed the write. Remember, many threads can access the vector's `Descriptor` and `WriteDescriptor` at once, so many threads will be trying to complete the same write. Only one of them can succeed. It's a fight to the death! Arrghhh!!!
I'm kidding. After performing the `compare_exchange`, successful or not, we modify the vector to indicate that there is no pending write operation. If all threads do this, at least one will succeed, and all will indicate that there is no pending write operation. Though some of the threads may be sad because their `compare_exchange` failed, the vector is happy because it's in a consistent and correct state.
push()
Now that we know how writes are actually performed, let's get into how a `push` operation works. Here are the steps:
1. Load in the current `Descriptor`.
2. If the `Descriptor` contains a write operation, complete it. This is important because it ensures that before any new write operation happens, the previous one is completed. We cannot do anything before completing the previous write operation, so all operations will eventually get executed.
3. Calculate which bucket our new element will go into.
4. If that bucket has not been allocated memory yet, do so.
5. Make a new `WriteDescriptor`. The `new` value in the `WriteDescriptor` will be the data passed into the `push` function.
6. Make a new `Descriptor` which contains the following data: the size held in the current `Descriptor` + 1, and the new `WriteDescriptor`.
7. Now, here comes the part that makes this a compare-and-swap or `compare_exchange` algorithm. We `compare_exchange` the new `Descriptor` we made with the old one. If the `Descriptor` held in the vector didn't change, our new `Descriptor` will replace it. If it did change, we will fail to swap in our new `Descriptor`, and we go back to Step 1.

   Note: I think it's important to consider why this routine (particularly steps 6 and 7) ensures correctness. If the `compare_exchange` succeeds, this means that the vector did not change in the time it took us to prepare a new `Descriptor`. Why is this important? It means our assumptions about the vector's state did not change. In our new `Descriptor`, we used the size from the `Descriptor` we loaded in, and incremented that. So, if the size we loaded in was `4`, our new `Descriptor` would say the size of the vector is `5`. Now, imagine that we could just swap in our fresh `Descriptor` without comparing it with the current one. If someone else was also trying to `push`, their `Descriptor` might get swapped in before ours. It would say the size of the vector is `5`, because it made the same assumptions we did. Then, when we swap in our `Descriptor`, our `Descriptor` would maintain that the size of the vector is `5`, even though it should be `6` because there were two `push` operations. Furthermore, we would overwrite the element that was `push`ed on by the first call to `push`, because both our `WriteDescriptor`s would be referencing the same location in memory. This is terrible! `compare_exchange` is our friend.

8. Now that we have swapped in our `Descriptor`, we execute the `WriteDescriptor` we made using `complete_write`, finalizing the changes we want to make to the vector.
And that's a `push`!
`Pop` pretty much works the same except for some small variations, so we'll get into those when we implement `push`/`pop`. However, the way we make sure changes are valid using `compare_exchange` is identical for both operations.
I think it's finally time to start looking at some code. When I was writing code, it felt very different from reasoning about the theory. I really felt like I had to consider every line I wrote and every decision I made. I'll walk you through what I came up with now.
Note: we're going to first write a version of the vector that doesn't reclaim the memory it uses; it leaks.
Starting Code
Pseudocode
This "pythonesque" pseudocode with some pointer operations thrown in shows the general API and implementation details of the vector. The pseudocode is a conversion of the paper's pseudocode into a more (in my opinion) understandable form. It completely ignores memory reclamation.
You don't need to read this entire thing, it's just here as a reference.
```
# Calculate the index of the correct bucket
# Return a pointer
def at(vector, i):
    pos = i + FIRST_BUCKET_SIZE
    hibit = highest_bit(pos)
    index = pos ^ 2 ** hibit
    return &vector.memory[hibit - highest_bit(FIRST_BUCKET_SIZE)][index]

# Perform an atomic load at the correct index
def read(vector, i):
    return *at(vector, i).load(Ordering)

# Perform an atomic store at the correct index
def write(vector, i, elem):
    return *at(vector, i).store(elem, Ordering)

# Calculate the number of allocations needed
# Then perform each allocation
def reserve(vector, size):
    i = highest_bit(vector.descriptor.size + FIRST_BUCKET_SIZE - 1)
        - highest_bit(FIRST_BUCKET_SIZE)
    if i < 0:
        i = 0
    while i < highest_bit(size + FIRST_BUCKET_SIZE - 1)
              - highest_bit(FIRST_BUCKET_SIZE):
        i += 1
        allocate_bucket(vector, i)

# Calculate the amount of memory needed
# Allocate that much memory
# Try to CAS it in
# If CAS fails, the bucket is already initialized, so free the memory
def allocate_bucket(vector, bucket):
    bucket_size = FIRST_BUCKET_SIZE * (2 ** bucket)
    mem = allocate(bucket_size)
    if not CAS(&vector.memory[bucket], nullptr, mem):
        free(mem)

# Get the size of the current descriptor
# If there is a pending write operation, subtract one from the size
def size(vector):
    size = vector.descriptor.size
    if vector.descriptor.writeop.pending:
        size -= 1
    return size

# Get the current descriptor
# Complete a pending write operation
# Allocate memory if needed
# Make a new WriteDescriptor
# Try to CAS it in
# If CAS failed go back to first step
# Complete a pending write operation
def push(vector, elem):
    while True:
        current_desc = vector.descriptor
        complete_write(vector, current_desc.pending)
        bucket = highest_bit(current_desc.size + FIRST_BUCKET_SIZE)
                 - highest_bit(FIRST_BUCKET_SIZE)
        if vector.memory[bucket] == nullptr:
            allocate_bucket(vector, bucket)
        writeop = WriteDescriptor(
            *at(vector, current_desc.size),
            elem,
            current_desc.size
        )
        next_desc = Descriptor(1 + current_desc.size, writeop)
        if CAS(&vector.descriptor, current_desc, next_desc):
            break
    complete_write(vector, next_desc.pending)

# Get the current descriptor
# Complete a pending write operation
# Read the last element of the vector
# Make a new WriteDescriptor
# Try to CAS it in
# If CAS failed go back to first step
# Return the last element
def pop(vector):
    while True:
        current_desc = vector.descriptor
        complete_write(vector, current_desc.pending)
        elem = *at(vector, current_desc.size - 1).load(Ordering)
        next_desc = Descriptor(current_desc.size - 1, null)
        if CAS(&vector.descriptor, current_desc, next_desc):
            break
    return elem
```
Memory Allocation
The first thing I did when writing the code out was think about the pieces that would make up the vector. In Rust, an extremely common building block for any type is the `struct`. A `struct` just sticks its members' data next to each other in memory. Here is the vector itself, as a `struct`:
```rust
pub struct SecVec<'a, T: Sized + Copy> {
    buffers: CachePadded<Box<[AtomicPtr<AtomicU64>; 60]>>,
    descriptor: CachePadded<AtomicPtr<Descriptor<'a, T>>>,
    _boo: PhantomData<T>, // Data is stored as transmuted T's
}
```
Boo! 👻
I bet the `PhantomData` scared you. We have a generic parameter `T`, but no `struct` members of `SecVec` or either of the descriptors actually contain a `T` (because we transmute `T` into `u64`s). Therefore, to let the compiler know we really are carrying `T`'s, we add a little ghost that tells it, "We're carrying this Phantom `T` *wink*".
Sharing is caring
There is a lot to unpack here. Firstly, `CachePadded` is a `struct` provided by the `crossbeam_utils` crate.
A note on cache: you may have heard of CPU cache, a small buffer of memory stored on the CPU to allow for fast access. The cache in `CachePadded` actually refers to a buffer between main RAM and the CPU: a larger and slower cache compared to the CPU's own cache. The cache is split into contiguous blocks of memory called cache lines. This is the most granular level at which cache coherency is maintained. When multiple threads each have a value in the same cache line, one thread modifying the value it owns marks the entire cache line as "dirty". Even though the other thread's value hasn't been changed, the cache coherency protocol might cause that thread to reload the entire line when it uses its value, incurring some overhead. This is called false sharing, and it can cause severe performance degradation. Cache is an extremely important consideration when designing data structures. It's why linked lists are algorithmically fine but terribly slow in practice. As the saying goes, cache is king.
The `CachePadded` `struct` aligns its contents to the beginning of a cache line to prevent false sharing. If all `CachePadded` objects are at the beginning of a cache line (assuming they do not cross a cache line), there can't be false sharing between them. Preventing false sharing can lead to a huge speedup, but it also increases the size of the type. If you're wondering how `CachePadded` is implemented, check out `#[repr(align(n))]` in the Nomicon.
Here's how I picture cache padding:
|-----Cache line-----|-----Cache Line-----|
v v v
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|69|xx|xx|xx|xx|xx|xx|42|xx|xx|xx|xx|xx|xx|
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
^ ^
\ \
\ \
\ \
Different cache lines -> no false sharing
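If you want to see `CachePadded` in action, here's a little sketch (not from the vector's code) of wrapping two counters so they land on separate cache lines; `CachePadded` derefs to the inner value, so it's transparent to use:

```rust
use crossbeam_utils::CachePadded;
use std::sync::atomic::{AtomicUsize, Ordering};

fn main() {
    // Two counters that different threads would hammer on.
    // Padding keeps them on separate cache lines, so bumping one
    // doesn't mark the other's line as dirty (no false sharing).
    let a: CachePadded<AtomicUsize> = CachePadded::new(AtomicUsize::new(0));
    let b: CachePadded<AtomicUsize> = CachePadded::new(AtomicUsize::new(0));

    a.fetch_add(1, Ordering::Relaxed); // deref lets us use the inner AtomicUsize directly
    b.fetch_add(1, Ordering::Relaxed);

    // The padding makes the type at least one cache line wide
    // (the exact size is architecture-dependent)
    println!("{} bytes", std::mem::size_of::<CachePadded<AtomicUsize>>());
}
```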
Two-level array
The first member of `SecVec<T>` is a cache-padded array of 60 pointers allocated on the heap (notice the `Box`). These pointers will each point into another array. The pointers start off as null pointers (`0x0`), and will get swapped out for valid pointers once they need to point to an actual array.
The `AtomicPtr`s point to `AtomicU64`s because each element is going to get transmuted into a `u64` so that we can atomically perform writes on the vector. When returning an element, we'll transmute it back into a `T`. Transmuting means interpreting the bits of one type as the bits of another.
For example, `0b10100001` means -95 when interpreted as a signed integer but 161 when interpreted as an unsigned integer. Transmuting one to the other would just change how we interpret the bits, not the actual bits themselves.
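Here's a two-line sketch of that reinterpretation (using a plain `as` cast; the vector itself will use `mem::transmute_copy`, which we'll get to):

```rust
fn main() {
    let unsigned: u8 = 0b1010_0001; // 161
    let signed = unsigned as i8;    // same bits, now read as -95
    assert_eq!(unsigned, 161);
    assert_eq!(signed, -95);
    // The bit pattern never changed, only our interpretation of it
    assert_eq!(signed as u8, unsigned);
}
```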
Descriptors galore
The second member of `SecVec<T>` is a cache-padded `AtomicPtr` to a `Descriptor`. As you've probably noticed, there are a bunch of `AtomicPtr`s here. That's because we can modify the pointer atomically, specify which `Ordering` to use, and `compare_exchange` the pointer. A common way of writing data in concurrent programming is to change a pointer instead of actually modifying a buffer. Since a buffer can't necessarily be modified atomically or without locking, what we can do is prepare a buffer and then change a pointer so that it points to our new buffer. All new readers will see the new data when they dereference the pointer.
Pointer
/ \
/ \
+---+ +----+
/ \
/ -> \
v v
Old New
+---+---+---+---+ +---+---+---+---+
| 9 | 9 | 9 | 9 | | 6 | 6 | 6 | 6 |
+---+---+---+---+ +---+---+---+---+
What do we do with the old pointer you might ask? Worry not, we will get into that 😅
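Here's a minimal sketch of that publish-by-pointer-swap pattern (my own illustration; like our leaky vector, it conveniently ignores what happens to the old buffer):

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

fn main() {
    // Publish an initial buffer behind an atomic pointer
    let old = Box::into_raw(Box::new([9, 9, 9, 9]));
    let shared = AtomicPtr::new(old);

    // Prepare a whole new buffer off to the side...
    let new = Box::into_raw(Box::new([6, 6, 6, 6]));
    // ...then publish it with a single atomic store.
    // Readers see either the old buffer or the new one, never a mix.
    shared.store(new, Ordering::Release);

    let current = unsafe { &*shared.load(Ordering::Acquire) };
    assert_eq!(current, &[6, 6, 6, 6]);
    // The old allocation is leaked here; reclaiming it safely is
    // exactly the problem we'll deal with later
}
```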
The Descriptor and WriteDescriptor
```rust
pub struct Descriptor<'a, T: Sized> {
    pending: AtomicPtr<Option<WriteDescriptor<'a, T>>>,
    size: usize,
}

pub struct WriteDescriptor<'a, T: Sized> {
    new: u64,
    old: u64,
    location: &'a AtomicU64,
    _boo: PhantomData<T>, // New and old are transmuted T's
}
```
The trait bounds
Notice how `T` is `Sized`; this means that its size is always known at compile time. We need to ensure this because our values need to be transmutable. Part of the safety contract of `transmute_copy` is making sure our types are of compatible sizes.
The `Copy` bound is necessary because the data in the vector is copied in and out of the buffers with `transmute_copy`.
OK, enough talk about `struct`s, let's get to the first function: `get()`
get()
The first, and simplest, function to write is `vector.get(i)`, which returns a pointer to the element at index `i`.
Here is the code that implements `get`.
```rust
/// Return a *const T to the index specified
///
/// # Safety
/// The index this is called on **must** be a valid index, meaning:
/// there must already be a bucket allocated which would hold that index
/// **and** the index must already have been initialized with push/set
unsafe fn get(&self, i: usize) -> *const AtomicU64 {
    // Check for overflow
    let pos = i
        .checked_add(FIRST_BUCKET_SIZE)
        .expect("index too large, integer overflow");
    let hibit = highest_bit(pos);
    let offset = pos ^ (1 << hibit);
    // Select the correct buffer to index into
    // # Safety
    // Since hibit = highest_bit(pos), and pos >= FIRST_BUCKET_SIZE
    // The subtraction hibit - highest_bit(FIRST_BUCKET_SIZE) cannot underflow
    let buffer = &self.buffers[(hibit - highest_bit(FIRST_BUCKET_SIZE)) as usize];
    // Check that the offset doesn't exceed isize::MAX
    assert!(
        offset
            .checked_mul(mem::size_of::<T>())
            .map(|val| val < isize::MAX as usize)
            .is_some(),
        "pointer offset exceed isize::MAX bytes"
    );
    // Offset the pointer to return a pointer to the correct element
    unsafe {
        // # Safety
        // We know that we can offset the pointer because we will have allocated a
        // bucket to store the value. Since we only call values that are
        // `self.descriptor.size` or smaller, we know the offset will not go out of
        // bounds because of the assert.
        buffer.load(Ordering::Acquire).add(offset)
    }
}
```
A few points to note
Notice how the function is marked as `unsafe`. This is because there is a safety contract the compiler can't enforce: the index must be valid. This is automatically guaranteed through the usage of the function in the algorithm, but it's worth marking it `unsafe` just to be explicit.
Summarizing what the function does: we calculate which buffer the item is in, load the pointer to the start of the buffer, and offset it to the correct element. There are two other things I want to point out. First, notice all the checks we make to avoid overflow. Secondly, notice the use of `Acquire` for loading in the pointer to the buffer. Since the store part of the `compare_exchange(AcqRel)` we use to set the pointer to the buffer is `Release`, we are guaranteed to get the most recent pointer, because an `Acquire` load sees the contents `Release`d by a `Release` store! I find it very satisfying how `Acquire` and `Release` work together. It's like two puzzle pieces fitting nicely into each other.
What are all these bitwise operations?
I'm honestly not sure. That's why they wrote the paper and I didn't :)
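That said, we can at least trace the math by hand. Here's a little sketch (mine, with `FIRST_BUCKET_SIZE = 8` and a plausible `highest_bit`; the real helper may differ) following index 9 through the calculation:

```rust
const FIRST_BUCKET_SIZE: usize = 8;

// Index (0-based) of the highest set bit
fn highest_bit(num: usize) -> u32 {
    (usize::BITS - 1) - num.leading_zeros()
}

fn main() {
    let i = 9;
    let pos = i + FIRST_BUCKET_SIZE;                     // 17 = 0b10001
    let hibit = highest_bit(pos);                        // 4
    let offset = pos ^ (1 << hibit);                     // 17 ^ 16 = 1
    let bucket = hibit - highest_bit(FIRST_BUCKET_SIZE); // 4 - 3 = 1
    // Bucket 0 holds indices 0..8 and bucket 1 holds indices 8..24,
    // so index 9 lands in bucket 1 at offset 1. Checks out!
    assert_eq!((bucket, offset), (1, 1));
}
```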
Next up is `allocate_bucket`.
allocate_bucket()
Remember that whole "two-level array" thingy? This is where it starts coming into play. `allocate_bucket` does just what it sounds like: allocating a bucket. Recall that a bucket is one of the arrays in the second level of the two-level array.
+---+---+---+---+---+
| 1 | 2 | 3 | 4 | 5 |
+---+---+---+---+---+
| | | | |
v v v v v
+---+---+---+---+---+
| 1 | 2 | 3 | 4 | 5 |
+---+---+---+---+---+
| 2 | 3 | 4 | 5 |
+---+---+---+---+
| 3 | 4 | 5 |
+---+---+---+
^ | 4 | 5 |
| +---+---+
| | 5 |
| +---+
|
we're allocating one of these little guys
There are two parts to `allocate_bucket`: allocating the memory and setting the pointer. We start off by tapping into the `alloc` crate's API. First, we create a `Layout`, which describes the allocation we want. `Layout::array::<AtomicU64>()` indicates that we want a bunch of `AtomicU64`s right next to each other in memory. If creating the layout fails (due to overflow), we call `capacity_overflow`, which just panics.
You might ask, why not just directly call `panic!`? Apparently, it reduces the generated code size if we just have the panic in one function, which we then call from multiple places. I found this trick in the source code for `std::vec::Vec`. You can learn a lot from reading the Standard Library code. That's how I've learned a lot of the low-level stuff I know. It's also a good way to see what good, idiomatic Rust looks like.
```rust
const FIRST_BUCKET_SIZE: usize = 8;

fn allocate_bucket(&self, bucket: usize) {
    // The shift-left is equivalent to raising 2 to the power of bucket
    let size = FIRST_BUCKET_SIZE * (1 << bucket);
    let layout = match Layout::array::<AtomicU64>(size) {
        Ok(layout) => layout,
        Err(_) => capacity_overflow(),
    };
```
The next thing we do is just another check. The Standard Library does both checks and I trust their strategy.
```rust
    // Make sure allocation is ok
    match alloc_guard(layout.size()) {
        Ok(_) => {}
        Err(_) => capacity_overflow(),
    }
```
`Miri` is about to make its debut! `Miri` is a tool that runs your code in a special environment and detects undefined behavior (or UB, as the cool kids call it).
Now that our layout is all good, we can perform the actual allocation. We instantiate the `Global` `struct`, which is the allocator we're using. The allocator returns a pointer to our new allocation once it's finished allocating. Why are we using `allocate_zeroed`, you might ask? Why not just allocate normally? The answer: Its Utmost Holiness: `Miri`. In all seriousness though, `Miri` has been an invaluable tool in catching memory and concurrency bugs. When we just allocate normally, `Miri` throws an error when we start actually using the memory later on, saying that `intrinsics::atomic_cxchg_acqrel_failrelaxed(dst, old, new)` requires initialized data. Thus, we just zero the memory for now. Later, it might be worth it to do some `MaybeUninit` magic, but honestly, I don't know if there'll be much, if any, performance gain.
Once again, we have more checks, and we'll just `panic!` if the allocation fails. `handle_alloc_error` is from the `alloc` crate:
```rust
    let allocator = Global;
    let allocation = allocator.allocate_zeroed(layout);
    let ptr = match allocation {
        Ok(ptr) => ptr.as_ptr() as *mut AtomicU64,
        Err(_) => handle_alloc_error(layout),
    };
```
The final part is to swap the pointer into our array of buffers (the first level of the two-level array). We use the `compare_exchange` function, with a null pointer as the expected value and our new pointer from the allocation as the new value. If `compare_exchange` fails, that means the pointer is no longer null, and someone else `compare_exchange`d in a pointer. Therefore, the bucket is already allocated. In this case, we deallocate the freshly allocated memory. Notice how we assess the result of `compare_exchange` with `Result::is_err()`; we don't care about the value `compare_exchange` returns.
```rust
    if self.buffers[bucket] // <- this is an AtomicPtr<AtomicU64>
        .compare_exchange(
            ptr::null_mut::<AtomicU64>(), // old value
            ptr,                          // new value
            Ordering::AcqRel,             // ordering on success
            Ordering::Relaxed,            // ordering on fail
        )
        .is_err()
    {
        unsafe {
            // # Safety
            // We know that the pointer returned from the allocation is NonNull so
            // we can call unwrap() on NonNull::new(). We also know that the pointer
            // is pointing to the correct memory because we just got it from the
            // allocation. We know the layout is valid, as it is the same layout we
            // used to allocate.
            allocator.deallocate(NonNull::new(ptr as *mut u8).unwrap(), layout);
        }
    }
}
```
Success and Fail Orderings
Like all atomic operations, `compare_exchange` uses the orderings. Most operations take one, but this bad boy takes two. Since `compare_exchange` reads and writes a memory location, we're using `AcqRel`. Since we always use `AcqRel` for the buckets, the load part (`Acquire`) of the `compare_exchange` will always see the most recent value, because the store part is `Release`. If we just used `Acquire`, the store part of the `compare_exchange` would be `Relaxed`, which doesn't guarantee that the modification to the `AtomicPtr<AtomicU64>` is published to other threads by any certain point. Under a `Relaxed` situation, another thread might load a null pointer in its `compare_exchange`, even though our thread swapped in a pointer to memory!
That's the success ordering. The fail ordering is `Relaxed` because we don't need to establish any synchronization if the operation fails. It failed; we're not doing any stores. When I first saw this, I had the question, "Why do we provide different success and fail orderings if the `compare_exchange` doesn't know whether it will fail or not?" The answer, thanks to Alice on the Rust User Forums, is that the compiler picks an ordering that will always satisfy the stronger ordering. Thus, `compare_exchange(success: AcqRel, fail: Release)` executes as `compare_exchange(success: AcqRel, fail: Acquire)` to ensure that the initial load is `Acquire` for both cases.
There's a little more to it; if you're still curious, see this thread on the Rust User Forums.
The last function in the "memory" section is `reserve()`, which I've "reserved" for last.
Complete source for allocate_bucket()
```rust
fn allocate_bucket(&self, bucket: usize) {
    // The shift-left is equivalent to raising 2 to the power of bucket
    let size = FIRST_BUCKET_SIZE * (1 << bucket);
    let layout = match Layout::array::<AtomicU64>(size) {
        Ok(layout) => layout,
        Err(_) => capacity_overflow(),
    };
    // Make sure allocation is ok
    match alloc_guard(layout.size()) {
        Ok(_) => {}
        Err(_) => capacity_overflow(),
    }
    let allocator = Global;
    // allocate_zeroed because miri complains about accessing uninitialized memory
    // TODO: Maybe use MaybeUninit?
    let allocation = allocator.allocate_zeroed(layout);
    let ptr = match allocation {
        Ok(ptr) => ptr.as_ptr() as *mut AtomicU64,
        Err(_) => handle_alloc_error(layout),
    };
    // If the CAS fails, then the bucket has already been initialized with memory
    // and we free the memory we just allocated
    if self.buffers[bucket]
        .compare_exchange(
            ptr::null_mut::<AtomicU64>(),
            ptr,
            Ordering::AcqRel,
            Ordering::Relaxed,
        )
        .is_err()
    {
        unsafe {
            // # Safety
            // We know that the pointer returned from the allocation is NonNull so
            // we can call unwrap() on NonNull::new(). We also know that the pointer
            // is pointing to the correct memory because we just got it from the
            // allocation. We know the layout is valid, as it is the same layout we
            // used to allocate.
            allocator.deallocate(NonNull::new(ptr as *mut u8).unwrap(), layout);
        }
    }
}
```
reserve()
The goal of `reserve(n)` is simple: allocate enough memory to perform `n` pushes without allocating more memory.
This is a useful function because, as we've seen, `allocate_bucket` requires some heavy atomics with `compare_exchange`. If we can do our allocations in a calmer scenario with less contention, we'll experience some performance gains.
We start by calculating the number of allocations we'll need to perform to reserve enough space. The calculation is a little funky, and there's an edge case where it can't distinguish between 0 and sizes between 1 and `FIRST_BUCKET_SIZE`. That's why we need to explicitly allocate the first bucket. We'll see the implementation of `size()` later, but it does use some atomic synchronization, so we cache the result so we can keep using it later without calling `size()` again.
```rust
pub fn reserve(&self, size: usize) {
    // Cache the size to prevent another atomic op due to calling `size()` again
    let current_size = self.size();
    if current_size == 0 {
        self.allocate_bucket(0);
    }
```
Now, we calculate the number of allocations we've already made. `highest_bit` returns the highest set bit in a number. A bit is set if it's equal to one. The highest set bit of 7 (`0b111`), for example, is 2 (0-indexed). Since the buckets are increasing by a factor of two each time, the highest set bit of the indices in each bucket is one greater than the highest set bit of the indices in the previous bucket. Therefore, by using the highest bit of a number in conjunction with `FIRST_BUCKET_SIZE`, we can figure out how many allocations are needed for a certain capacity. I know I'm waving my hands a little; I haven't taken the time to rigorously understand the arithmetic, as it's not that interesting to me, and in practice it works.
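To convince myself the formula does what the comments claim, here's a sketch (my own, again assuming `FIRST_BUCKET_SIZE = 8` and a `leading_zeros`-based `highest_bit`) of the "which bucket does this size need" calculation that `reserve` uses:

```rust
const FIRST_BUCKET_SIZE: usize = 8;

fn highest_bit(num: usize) -> u32 {
    (usize::BITS - 1) - num.leading_zeros()
}

// Index of the last bucket needed to hold `size` elements
// (the same expression reserve() uses)
fn last_bucket_needed(size: usize) -> u32 {
    highest_bit(size.saturating_add(FIRST_BUCKET_SIZE) - 1)
        .saturating_sub(highest_bit(FIRST_BUCKET_SIZE))
}

fn main() {
    // Sizes 0 through 8 all give 0: this is the edge case that forces
    // reserve() to allocate the first bucket explicitly
    assert_eq!(last_bucket_needed(0), 0);
    assert_eq!(last_bucket_needed(8), 0);
    // 9..=24 need bucket 1 (16 more elements), 25 needs bucket 2, and so on
    assert_eq!(last_bucket_needed(9), 1);
    assert_eq!(last_bucket_needed(24), 1);
    assert_eq!(last_bucket_needed(25), 2);
}
```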
```rust
    let mut num_current_allocs =
        highest_bit(current_size.saturating_add(FIRST_BUCKET_SIZE) - 1)
            .saturating_sub(highest_bit(FIRST_BUCKET_SIZE));
```
Then we calculate the number of allocations we need to reserve the space, and for each allocation missing, we allocate.
```rust
    // Compare with the number of allocations needed for size `new`
    while num_current_allocs
        < highest_bit(size.saturating_add(FIRST_BUCKET_SIZE) - 1)
            .saturating_sub(highest_bit(FIRST_BUCKET_SIZE))
    {
        num_current_allocs += 1;
        self.allocate_bucket(num_current_allocs as usize);
    }
}
```
And that's it for memory. We can now do everything we need to do to access and top up the vector's memory. Now it's time for the really hard part: actually implementing the vector's functions.
Complete source for reserve()
```rust
pub fn reserve(&self, size: usize) {
    // Cache the size to prevent another atomic op due to calling `size()` again
    let current_size = self.size();
    if current_size == 0 {
        self.allocate_bucket(0);
    }
    // Number of allocations needed for current size
    let mut num_current_allocs =
        highest_bit(current_size.saturating_add(FIRST_BUCKET_SIZE) - 1)
            .saturating_sub(highest_bit(FIRST_BUCKET_SIZE));
    // Compare with the number of allocations needed for size `new`
    while num_current_allocs
        < highest_bit(size.saturating_add(FIRST_BUCKET_SIZE) - 1)
            .saturating_sub(highest_bit(FIRST_BUCKET_SIZE))
    {
        num_current_allocs += 1;
        self.allocate_bucket(num_current_allocs as usize);
    }
}
```
Operations
Now that we have a solid memory backbone, we're going to implement the public API of the vector: `new`, `push`, `pop`, and `size`, as well as `complete_write` and some helper functions.
Since it's been a little bit, here's what the vector looks like in code form again:
```rust
pub struct SecVec<'a, T: Sized + Copy> {
    buffers: CachePadded<Box<[AtomicPtr<AtomicU64>; 60]>>,
    descriptor: CachePadded<AtomicPtr<Descriptor<'a, T>>>,
    _boo: PhantomData<T>, // Data is stored as transmuted T's
}

struct Descriptor<'a, T: Sized> {
    pending: AtomicPtr<Option<WriteDescriptor<'a, T>>>,
    size: usize,
}

struct WriteDescriptor<'a, T: Sized> {
    new: u64,
    old: u64,
    location: &'a AtomicU64,
    _boo: PhantomData<T>, // New and old are transmuted T's
}
```
new()
We've got to have some way of making a vector (or at least for an outside user to make one).
What are the ingredients we need to make the vector? Buffers, a `Descriptor`, and a `WriteDescriptor`. The `WriteDescriptor` is going to be `None`, as we don't have any pending writes yet.
Here's the code:
```rust
// We actually do want this to be copied
#[allow(clippy::declare_interior_mutable_const)]
const ATOMIC_NULLPTR: AtomicPtr<AtomicU64> = AtomicPtr::new(ptr::null_mut::<AtomicU64>());

pub fn new() -> Self {
    // Make an array of 60 AtomicPtr<AtomicU64> set to the null pointer
    let buffers = Box::new([ATOMIC_NULLPTR; 60]);
    // Make a new WriteDescriptor
    let pending = WriteDescriptor::<T>::new_none_as_ptr();
    // Make a new Descriptor
    let descriptor = Descriptor::<T>::new_as_ptr(pending, 0, 0);
    // Return self
    Self {
        descriptor: CachePadded::new(AtomicPtr::new(descriptor)),
        buffers: CachePadded::new(buffers),
        _boo: PhantomData,
    }
}
```
Firstly, we declare the constant `ATOMIC_NULLPTR`. This is just an `AtomicPtr` containing a null pointer. The reason the `const` declaration is necessary is that when we make an array of something (`[SOMETHING; 60]`), that `SOMETHING` needs to be `Copy` or evaluable at compile time. Since `AtomicPtr<AtomicU64>` is not `Copy`, we resort to creating `ATOMIC_NULLPTR` at compile time. Once we have our array of null pointers, we put it on the heap to reduce the size of the vector. If we were carrying it all directly, the vector would be over 480 bytes large! With a `Box`, we only store 8 bytes pointing to the first level of our two-level array.
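A quick sketch of that size difference (the numbers assume a 64-bit target):

```rust
use std::mem::size_of;
use std::sync::atomic::{AtomicPtr, AtomicU64};

fn main() {
    // Carrying the array inline: 60 pointers at 8 bytes each
    assert_eq!(size_of::<[AtomicPtr<AtomicU64>; 60]>(), 480);
    // Carrying it behind a Box: a single 8-byte pointer
    assert_eq!(size_of::<Box<[AtomicPtr<AtomicU64>; 60]>>(), 8);
}
```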
Then, we make a `WriteDescriptor` using `new_none_as_ptr()`, which returns an `Option<WriteDescriptor<T>>`. We pass this into the constructor (`new_as_ptr`) for `Descriptor<T>`, and then assemble the `Descriptor` and the `Box`ed array together to make the vector.
The constructors for the descriptor types end in `as_ptr` because they actually return a raw pointer pointing to a heap allocation containing the value. We achieve this by making a `Box` and then extracting the inner raw pointer:

```rust
let b = Box::new(5);
let b_ptr = Box::into_raw(b); // <- That's a raw pointer to heap memory!
```
My first UB mistake
I introduced the heap and the stack earlier in the keywords section, but I didn't explain why the distinction is important.
When a function is called, a stack frame is pushed onto the stack. This stack frame contains all the function's local variables. When the function returns, the stack frame is popped off the stack, and all local variables are destroyed. This invalidates all references to local variables that were just popped off.
The heap is different. You allocate on the heap, and you deallocate on the heap. Nothing happens automatically. This is the legendary `malloc`/`free` combo from C.
Understanding the distinction between the stack and the heap is important because we are using raw pointers, which don't have the guarantees of references.
Here is my first mistake, summarized a little:
```rust
use core::sync::atomic::{AtomicPtr, Ordering};

fn main() {
    let ptr = new_descriptor();
    // Use the pointer to the Descriptor
    let d = unsafe { &*ptr.load(Ordering::Acquire) };
}

// Return a pointer to a Descriptor
fn new_descriptor() -> AtomicPtr<Descriptor> {
    let d = Descriptor { size: 0, write: None };
    AtomicPtr::new(&d as *const _ as *mut _)
}

struct Descriptor {
    size: usize,
    write: Option<bool>,
}
```
$ cargo miri run
error: Undefined Behavior: pointer to alloc1184 was dereferenced after this allocation got freed
--> src\main.rs:46:22
|
46 | let d = unsafe { &*ptr.load(Ordering::Acquire) };
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ pointer to alloc1184 was dereferenced after this allocation got freed
|
= help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior
`Miri` says `pointer to alloc1184 was dereferenced after this allocation got freed`.
Translation: use-after-free; classic UB.
So why is the `Descriptor`'s allocation being freed? Because it's allocated on the stack. When `new_descriptor` returns, the local variable `d: Descriptor` gets destroyed, and the pointer we made from the reference is invalidated. Thus, we use-after-free when we dereference a freed allocation.
This is the danger of using raw pointers. If we had just passed on a reference to the `Descriptor`, Rust would promote that value to have a `'static` lifetime if possible, or return an error if not. With raw pointers, Rust doesn't manage lifetimes, so we have to ensure that our pointers are valid.
This is why only dereferencing a raw pointer is `unsafe`. It's perfectly safe to make one, but we have no guarantees about what it's pointing to, and that's why the dereference is `unsafe`.
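For contrast, here's the same toy program with the `Descriptor` moved to the heap via `Box::into_raw`, which is the same trick the real `new_*_as_ptr` constructors use. `Miri` is happy with this version, though note it now leaks the allocation (just like `leaky.rs` will):

```rust
use core::sync::atomic::{AtomicPtr, Ordering};

fn main() {
    let ptr = new_descriptor();
    // The Descriptor lives on the heap now, so it outlives new_descriptor()
    let d = unsafe { &*ptr.load(Ordering::Acquire) };
    assert_eq!(d.size, 0);
    // We never call Box::from_raw to free it, so this sketch leaks
}

// Return a pointer to a heap-allocated Descriptor
fn new_descriptor() -> AtomicPtr<Descriptor> {
    let d = Box::new(Descriptor { size: 0, write: None });
    AtomicPtr::new(Box::into_raw(d))
}

struct Descriptor {
    size: usize,
    write: Option<bool>,
}
```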
Thank you, `Miri`!
complete_write()
I think `complete_write` is the first function I wrote for the vector's core operations. We execute the `WriteDescriptor` passed in and set the one stored in the vector to `None`.
Here's what the function signature looks like:
```rust
fn complete_write(&self, pending: &Option<WriteDescriptor<T>>) {
```
The first thing we do is execute the `WriteDescriptor`, if there is one. We can use `if let` syntax to concisely express this. The result of the `compare_exchange` doesn't matter. If it succeeds, we performed the write. If it doesn't, someone else performed it. Also, notice how we are `compare_exchange`ing an `AtomicU64`. The data is transmuted into those bytes, allowing us to make atomic modifications to the contents of the vector. Because the data needs to be transmuted into an atomic type, the vector can't support types larger than 8 bytes. Finally, because we are using `AcqRel` as the success ordering, any subsequent `Acquire` loads will see that there is no pending write.
```rust
    #[allow(unused_must_use)]
    if let Some(writedesc) = pending {
        AtomicU64::compare_exchange(
            writedesc.location,
            writedesc.old,
            writedesc.new,
            Ordering::AcqRel,
            Ordering::Relaxed,
        );
```
Now that we've done the store to the contents, we change the `WriteDescriptor` status of the vector to indicate that there is no pending write (we just took care of it!).
```rust
        let new_writedesc = WriteDescriptor::<T>::new_none_as_ptr();
        // # Safety
        // The pointer is valid to dereference because it started off valid
        // and only pointers made from WriteDescriptor::new_*_as_ptr()
        // (which are valid because of Box) are CAS'd in
        unsafe { &*self.descriptor.load(Ordering::Acquire) } // Loading with `Acquire`
            .pending
            .store(new_writedesc, Ordering::Release); // Storing with `Release`

        // Memory leak alert!
        // What happens to the old pointer stored in the
        // `AtomicPtr<Option<WriteDescriptor<T>>>`?
        // We never reclaim it.
    }
}
```
This is standard. We make a new `WriteDescriptor` and store it with `Release` so that all subsequent `Acquire` loads will see it.
Leaking memory
Leaking memory is when you use memory (allocating) but never free it. This is the first chunk of code that leaks. Our `Descriptor` has a pointer to an `Option<WriteDescriptor>`. When we store a different pointer, we lose the old pointer forever. Since we never do anything to deallocate the memory pointed to by the old pointer, like `Box::from_raw`, that memory will stay allocated until the end of the program.
We can't just directly free the memory right away though, as there could be another thread reading it. Later on, I'm going to show you how we can use a technique called hazard pointers to safely reclaim (deallocate) objects.
For now, the vector will stay leaky, and we'll move on to push
.
Complete source for complete_write()
#![allow(unused)] fn main() { fn complete_write(&self, pending: &Option<WriteDescriptor<T>>) { #[allow(unused_must_use)] if let Some(writedesc) = pending { AtomicU64::compare_exchange( writedesc.location, writedesc.old, writedesc.new, Ordering::AcqRel, Ordering::Relaxed, ); let new_writedesc = WriteDescriptor::<T>::new_none_as_ptr(); // # Safety // The pointer is valid to dereference because it started off valid // and only pointers made from WriteDescriptor::new_*_as_ptr() // (which are valid because of Box) are CAS'd in unsafe { &*self.descriptor.load(Ordering::Acquire) } .pending .store(new_writedesc, Ordering::Release); } } }
push()
You made it! We're going to implement half of the main functionality of the vector. The code is going to get a little complex, but I'm confident in you. I eventually understood what was going on, so you can too.
We're going to track the steps described in
The Algorithm closely. We don't want to mess up
the concurrent semantics of the vector during implementation. The first thing we
do is load in the Descriptor
and WriteDescriptor
. This is actually harder
than it might seem, as we're working with unsafe
things like raw pointers. We
need to be very careful. But wait, there's one more thing we should cover, and
that's exponential backoff!
Exponential Backoff
Exponential backoff is another one of those techniques that's unique to
concurrent programming. compare_exchange
algorithms like the one we're
implementing can produce a lot of contention over a couple specific memory
locations. For example, many threads are trying to compare_exchange
the
AtomicPtr<Descriptor>
stored in the vector. That spot in memory is constantly
bombarded with heavy atomic operations. One way we can alleviate this is by
waiting a little bit after failing to compare_exchange
. The first time we
fail, we back off for 1
tick. If we fail again, we back off for 2
ticks,
then 4
, 8
. . . this is why the backoff is exponential. By backing off,
we give another thread some room to successfully perform their
compare_exchange
. In some microbenchmarks I did, introducing exponential
backoff greatly sped up the vector. It's cool that going slower at a micro
level allows us to go faster on a macro level. crossbeam_utils
has a useful
little struct
called Backoff
that we're going to use.
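Before we use it, here's roughly the idea behind it in toy form (this is not crossbeam's actual implementation or API, just a sketch):

```rust
use std::hint;

// A toy stand-in for crossbeam's `Backoff`: spin 1, 2, 4, 8, ... times.
struct ToyBackoff {
    step: u32,
}

impl ToyBackoff {
    fn new() -> Self {
        ToyBackoff { step: 0 }
    }

    fn spin(&mut self) {
        for _ in 0..(1u32 << self.step) {
            hint::spin_loop(); // tell the CPU we're busy-waiting
        }
        if self.step < 6 {
            self.step += 1; // cap the exponent so waits stay reasonable
        }
    }
}

fn main() {
    let mut backoff = ToyBackoff::new();
    for attempt in 0..4 {
        // Pretend a compare_exchange just failed; wait a bit before retrying
        println!("attempt {attempt} failed, spinning ~{} times", 1u32 << backoff.step);
        backoff.spin();
    }
}
```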
#![allow(unused)] fn main() { pub fn push(&self, elem: T) { let backoff = Backoff::new(); // Backoff causes significant speedup loop { // # Safety // It is safe to dereference the raw pointers because they started off valid // and can only be CAS'd with pointers from `Box::into_raw` let current_desc = unsafe { &*self.descriptor.load(Ordering::Acquire) }; // Complete a pending write op if there is any let pending = unsafe { &*current_desc.pending.load(Ordering::Acquire) }; }
There is already a lot going on here, in just these 10ish lines of code.
Firstly, we've instantiated a Backoff
. At the bottom of the loop, if we failed
to compare_exchange
in our new Descriptor
, we'll call Backoff::spin()
to
wait a little bit, then we'll come back up to the top of the loop and try again.
This code also contains a very unsafe
operation: dereferencing a raw pointer.
The more I read about the dangers of raw pointers, the more scared I got.
Paraphrasing from
The Book,
raw pointers aren't guaranteed to point to valid memory, aren't guaranteed to be
non-null, don't implement cleanup (like Box
), and ignore all the aliasing
rules (&/&mut
semantics).
After watching
Demystifying unsafe
code I felt
better. unsafe
code isn't intrinsically bad, it's just code that comes with an
extra contract that we must uphold and document.
In the case of these first raw pointer dereferences, we know the dereference is
safe because the pointers to the Descriptor
and WriteDescriptor
come from
Box::into_raw
, which returns a non-null and aligned pointer. unsafe
is
scary, but not necessarily bad. Obviously, we should try to limit its uses as
much as possible though, as we can slip up and violate contracts.
Mitigating
unsafe
code: there are ways we can construct APIs that need unsafe
code to work without exposing users to danger. For example, we could make a type AtomicBox<T>
that's mostly a wrapper around AtomicPtr<T>
. It might look a little something like this:

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

#[repr(transparent)]
struct AtomicBox<T> {
    ptr: AtomicPtr<T>,
}

impl<T> AtomicBox<T> {
    // We can only make a `Self` from a `Box`'s pointer!
    pub fn new(value: Box<T>) -> Self {
        Self { ptr: AtomicPtr::new(Box::into_raw(value)) }
    }

    // Caller knows they are receiving a pointer from `Box`
    pub fn load(&self, ordering: Ordering) -> *mut T {
        self.ptr.load(ordering)
    }

    // -- snip --
}
```
There's nothing super crazy going on here, it's just that we've configured the API so that we know the pointer inside the
AtomicBox<T>
is valid because it could only have come fromBox
. Now, instead of manually ensuring the invariant that we useBox::into_raw
pointers, the compiler/type system does so for us.
After loading in the WriteDescriptor
, we execute it if need be.
#![allow(unused)] fn main() { self.complete_write(pending); }
Since we're push
ing onto the vector, we might need more memory:
#![allow(unused)] fn main() { // Calculate which bucket this element is going into let bucket = (highest_bit(current_desc.size + FIRST_BUCKET_SIZE) - highest_bit(FIRST_BUCKET_SIZE)) as usize; // If the bucket is null, allocate the memory if self.buffers[bucket].load(Ordering::Acquire).is_null() { self.allocate_bucket(bucket) } }
Let's make our new WriteDescriptor
now:
#![allow(unused)] fn main() { // # Safety // It is safe to call `self.get()` because if the vector has reached // `current_desc.size`, so there is a bucket allocated for element `size`. // Therefore, the pointer is also valid to dereference because it points // into properly allocated memory. let last_elem = unsafe { &*self.get(current_desc.size) }; let write_desc = WriteDescriptor::<T>::new_some_as_ptr( unsafe { mem::transmute_copy::<T, u64>(&elem) }, // Load from the AtomicU64, which really contains the bytes for T last_elem.load(Ordering::Acquire), last_elem, ); }
For now, we are assuming that the vector is only storing values 8 bytes big,
so it is safe to transmute_copy
to an AtomicU64
. I plan on writing a
macro that produces different implementations of the vector with different
atomic types when storing types of different sizes. For example,
SecVec<(i8, i8)>
would store the data in AtomicU16
. This would save on
space. I don't think the vector would work for zero-sized types because of how
we transmute
. It would also be very inefficient because of all the unnecessary
allocations!
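As a standalone illustration of what that transmute_copy is doing, assuming an 8-byte T like (i32, i32):

```rust
use std::mem;

fn main() {
    let elem: (i32, i32) = (1, -1);
    assert_eq!(mem::size_of::<(i32, i32)>(), 8); // the assumption that makes this ok

    // Reinterpret the 8 bytes of `elem` as a u64 so they can live in an AtomicU64
    let bits = unsafe { mem::transmute_copy::<(i32, i32), u64>(&elem) };

    // ...and back again when we read the element out of the vector
    let back = unsafe { mem::transmute_copy::<u64, (i32, i32)>(&bits) };
    assert_eq!(back, (1, -1));
}
```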
Note that last_elem
's type is &AtomicU64
; it's the location of the write.
When we load from last_elem
, we are getting the old
element. We now have the
three pieces of data necessary for compare_exchange
: a memory location (the
reference), an old element, and a new element (the T
passed to this function).
Let's package everything up in a Descriptor
.
#![allow(unused)] fn main() { let next_desc = Descriptor::<T>::new_as_ptr(write_desc, current_desc.size + 1); }
Since we are adding one more element onto the vector, the new Descriptor
's
size is one more than the old one's.
Here comes the crucial compare_exchange
, in the AcqRel/Relaxed
flavor:
#![allow(unused)] fn main() { if AtomicPtr::compare_exchange_weak( &self.descriptor, current_desc as *const _ as *mut _, next_desc, Ordering::AcqRel, Ordering::Relaxed, ) .is_ok() { // We know the current write_desc is the one we just sent in // with the compare_exchange so avoid loading it atomically self.complete_write(unsafe { &*write_desc }); break; } }
If the compare_exchange
succeeds, we call complete_write
on the descriptor
we just made to finalize the changes, then we break
out of the loop.
If compare_exchange
fails, we'll simply start over again.
Either way, we have a memory leak. If the compare_exchange
succeeded, we
never deal with the old Descriptor
's pointer. We can never safely deallocate
it because we don't know if anyone is reading it. It would be terribly rude to
pull the rug out from under them! Also the deallocation would probably cause a
use-after-free
which would cause the OS to terminate the program which would
rip a hole in the space-time continuum which would. Wait what? Uhh, moving on .
. .
If the compare_exchange
failed, the new Descriptor
and WriteDescriptor
leak. Once we reach the end of the loop, all local variables in that scope are
lost. So, we never get back the pointers to our new describe-y objects, and
their memory is lost to the void, never to be seen again (unless we do some
wildly dumb stuff and read a random address or something). In any case, within
the code for the vector, I try not to tempt the segfault gods. My other
projects, maybe a little bit.
At this point, we've failed the compare_exchange
. Let's Backoff::spin()
and
then retry:
#![allow(unused)] fn main() { backoff.spin(); } // Closing brace for the loop } // Closing brace for the function }
Once we finish looping and finally succeed with the compare_exchange
, we're
done! That's a push
. The pseudocode is so simple, and the code is so . . . not
simple. Props to you for getting this far, concurrent programming is not for the
weak of spirit.
I'll cover the minor differences in pop
, and then we'll cap off the leaky code
with size
.
Complete source for push
#![allow(unused)] fn main() { pub fn push(&self, elem: T) { let backoff = Backoff::new(); // Backoff causes significant speedup loop { // # Safety // It is safe to dereference the raw pointers because they started off valid // and can only be CAS'd with pointers from `Box::into_raw` let current_desc = unsafe { &*self.descriptor.load(Ordering::Acquire) }; let pending = unsafe { &*current_desc.pending.load(Ordering::Acquire) }; // Complete a pending write op if there is any self.complete_write(pending); // Allocate memory if need be let bucket = (highest_bit(current_desc.size + FIRST_BUCKET_SIZE) - highest_bit(FIRST_BUCKET_SIZE)) as usize; if self.buffers[bucket].load(Ordering::Acquire).is_null() { self.allocate_bucket(bucket) } // # Safety // It is safe to call `self.get()` because if the vector has reached // `current_desc.size`, so there is a bucket allocated for element `size`. // Therefore, the pointer is also valid to dereference because it points // into properly allocated memory. let last_elem = unsafe { &*self.get(current_desc.size) }; let write_desc = WriteDescriptor::<T>::new_some_as_ptr( unsafe { mem::transmute_copy::<T, u64>(&elem) }, last_elem.load(Ordering::Acquire), last_elem, ); let next_desc = Descriptor::<T>::new_as_ptr(write_desc, current_desc.size + 1); // Handle result of compare_exchange if AtomicPtr::compare_exchange_weak( &self.descriptor, current_desc as *const _ as *mut _, next_desc, Ordering::AcqRel, Ordering::Relaxed, ) .is_ok() { // We know the current write_desc is the one we just sent in // with the compare_exchange so avoid loading it atomically self.complete_write(unsafe { &*write_desc }); break; } backoff.spin(); } } }
pop
There are three main differences between pop
and push
. Firstly, pop
never
needs to allocate. Secondly, pop
swaps in a slightly different descriptor,
with None
as the WriteDescriptor
and current_desc.size - 1
as the new
size.
#![allow(unused)] fn main() { let new_pending = WriteDescriptor::<T>::new_none_as_ptr(); let next_desc = Descriptor::<T>::new_as_ptr(new_pending, current_desc.size - 1); }
The final difference is that after we succeed with the compare_exchange
, we
read the last element and return it.
#![allow(unused)] fn main() { if AtomicPtr::compare_exchange_weak( &self.descriptor, current_desc as *const _ as *mut _, next_desc, Ordering::AcqRel, Ordering::Relaxed, ) .is_ok() { // # Safety // This is ok because only 64-bit values can be stored in the vector // We also know that elem is a valid T because it was transmuted into a usize // from a valid T, therefore we are only transmuting it back return Some(unsafe { mem::transmute_copy::<u64, T>(&elem) }); } }
The rest of the function: loading the Descriptors
, compare_exchange
,
Backoff
, is identical.
Like push
, pop
also leaks memory profusely. Luckily, this means that when we
implement memory reclamation, it'll be the same solution for push
and pop
.
Complete source for pop()
#![allow(unused)] fn main() { pub fn pop(&self) -> Option<T> { let backoff = Backoff::new(); // Backoff causes significant speedup loop { let current_desc = unsafe { &*self.descriptor.load(Ordering::Acquire) }; let pending = unsafe { &*current_desc.pending.load(Ordering::Acquire) }; self.complete_write(pending); if current_desc.size == 0 { return None; } // # Safety // Do not need to worry about underflow for the sub because we would have // already returned let elem = unsafe { &*self.get(current_desc.size - 1) } .load(Ordering::Acquire); let write_desc = WriteDescriptor::<T>::new_none_as_ptr(); let next_desc = Descriptor::<T>::new_as_ptr(write_desc, current_desc.size - 1); if AtomicPtr::compare_exchange_weak( &self.descriptor, current_desc as *const _ as *mut _, next_desc, Ordering::AcqRel, Ordering::Relaxed, ) .is_ok() { // # Safety // This is ok because only 64-bit values can be stored in the vector // We also know that elem is a valid T because it was transmuted into a // usize from a valid T, therefore we are only transmuting it back return Some(unsafe { mem::transmute_copy::<u64, T>(&elem) }); } backoff.spin(); } } }
size()
The procedure for size
is simple. Load the vector's size from the
Descriptor
, then subtract one if there is a pending write.
It seems like so long ago that we went over
The Algorithm, but recall that when we perform a
push
, we swap in a Descriptor
, then call complete_write
. This means that
when we do a write, the increase in size is reflected in the vector's state
before the write actually happens. If there is still a WriteDescriptor
contained in the Descriptor
, that means the size stored in the Descriptor
is
one greater than the actual size of the vector, because complete_write
replaces the WriteDescriptor
with None
when it executes the write.
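Stripped of all the atomics, the rule size follows is just this (a standalone sketch, not the vector's code):

```rust
// A pending write means the stored size is one ahead of reality.
fn observed_size(stored_size: usize, pending_write: bool) -> usize {
    if pending_write {
        stored_size - 1 // the swapped-in Descriptor counted a write that hasn't landed yet
    } else {
        stored_size
    }
}

fn main() {
    assert_eq!(observed_size(5, true), 4);
    assert_eq!(observed_size(5, false), 5);
}
```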
Here is the code:
#![allow(unused)] fn main() { pub fn size(&self) -> usize { // # Safety // The pointers are safe to dereference because they all came from `Box::into_raw` // and point to valid objects let desc = unsafe { &*self.descriptor.load(Ordering::Acquire) }; let size = desc.size; // If there is a pending descriptor, we subtract one from the size because // `push` increments the size, swaps the new descriptor in, and _then_ writes // the value. Therefore the size is one greater because the write hasn't happened // yet match unsafe { &*desc.pending.load(Ordering::Acquire) } { Some(_) => size - 1, None => size, } } }
There have been many momentous moments throughout the book: understanding the
algorithm, finishing push
, and finally, completing the vector's public API.
When I was writing the code, this moment felt huge, and I jumped up and down
after push
ing 10 elements onto the vector, pop
ing 10 times, and running
assert_eq!(sv.size(), 0);
without crashing.
Let's run some tests (more fun than you might think)!
Tests
Just for fun, I wrote some tests, and we get to satisfyingly see them pass.
#![allow(unused)] fn main() { #[cfg(test)] mod tests { use super::*; #[test] fn size_starts_at_0() { let sv = SecVec::<usize>::new(); assert_eq!(0, sv.size()); } #[test] fn pop_empty_returns_none() { let sv = SecVec::<usize>::new(); assert_eq!(sv.pop(), None); } #[test] fn ten_push_ten_pop() { let sv = SecVec::<isize>::new(); for i in 0..10 { sv.push(i); } for i in (0..10).rev() { assert_eq!(sv.pop(), Some(i)); } } #[test] fn does_not_allocate_buffers_on_new() { let sv = SecVec::<isize>::new(); for buffer in &**sv.buffers { assert!(buffer.load(Ordering::Relaxed).is_null()) } } } }
Cargo
is super nice and we can use it to test. Running cargo test
produces
the following output:
~/C/r/unlocked (main) > cargo test -- leaky::tests
Finished test [unoptimized + debuginfo] target(s) in 0.01s
Running unittests (target/debug/deps/unlocked-e6f64e7ba9c7e004)
running 4 tests
test leaky::tests::size_starts_at_0 ... ok
test leaky::tests::pop_empty_returns_none ... ok
test leaky::tests::does_not_allocate_buffers_on_new ... ok
test leaky::tests::ten_push_ten_pop ... ok
Although you can't see it, the green on those "ok"s warms my heart.
We know the vector is leaky, but otherwise it shouldn't be doing anything else
funky or causing UB. Let's see if Miri
finds anything with
MIRIFLAGS=-Zmiri-ignore-leaks cargo miri test -- leaky::tests
:
~/C/r/unlocked (main) > MIRIFLAGS=-Zmiri-ignore-leaks cargo miri test -- leaky::tests
Finished test [unoptimized + debuginfo] target(s) in 0.01s
Running unittests (target/miri/x86_64-apple-darwin/debug/deps/unlocked-4269)
running 4 tests
test leaky::tests::does_not_allocate_buffers_on_new ... ok
test leaky::tests::pop_empty_returns_none ... ok
test leaky::tests::size_starts_at_0 ... ok
test leaky::tests::ten_push_ten_pop ... ok
Nothing? Awesome! Just because Miri
doesn't find anything doesn't mean nothing
fishy is happening. Miri
combined with the rigorous analysis of the code we
did, though, is a very good sign.
Memory Reclamation
Allocation isn't the hard part when it comes to concurrency; deallocation/reclamation is. When multiple threads/entities are concurrently accessing an object, it is never safe to deallocate it without verifying that no one has a reference/pointer to it. What if they were to use that pointer after the deallocation?
This problem arises in the vector when reclaiming the Descriptor
s and
WriteDescriptors
. Multiple threads can hold a reference to them at once, so we
never know when it is safe to deallocate.
To solve this problem, we'll use a technique called hazard pointers via the
haphazard
crate.
What is meant by reclamation/deallocation? When we allocate memory, the allocator returns a pointer to an allocation on the heap. Internally, the allocator also notes down that the space is being used. When we deallocate, or reclaim, memory, we return the pointer back to the allocator. The allocator goes back to the books and notes that no one is using the memory anymore, freeing the memory. It can now hand that memory back out again if it's needed.
I'm not sure if this is the case, but I think the term "reclaim" might be used specifically in concurrent contexts. Rust automatically deallocates memory in single-threaded contexts using the borrow checker. We have to do everything manually in multithreaded contexts.
Hazard Pointers
The idea of hazard pointers is to protect memory addresses from reclamation.
At any moment in time, we have a list of addresses that are not safe to reclaim.
We can store the addresses in a data structure like a concurrent linked list; I
think this is what haphazard
uses.
Whenever we want to access a pointer, we access it through a hazard pointer. Accessing through a hazard pointer adds the address we are accessing to the list of addresses to protect. When the hazard pointer gets dropped, or we explicitly disassociate the hazard pointer from the underlying raw pointer, the protection ends.
So why is the Protected
list important? When we are done with an object, we
retire it, marking it for eventual reclamation. By retiring the pointer, we
agree to not use it anymore. Any thread that is already accessing it can
continue to do so, but there can be no new readers/writers.
Every once in a while, the Domain
, which holds the hazard pointers, will go
through the Retired
list. For each pointer on this list, the Domain
checks
whether the pointer is protected by reading the Protected
list. If the pointer
isn't protected, the Domain
reclaims the object it points to (deallocating the
pointer). If it is protected, the Domain
does not reclaim it, because someone
is using it. In this way, we prevent pointers in use from being deallocated, but
those out of use are deallocated.
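If it helps, here's a toy, single-threaded sketch of that bookkeeping. Real implementations (like haphazard) use lock-free lists and per-thread protection slots; all the names here are made up:

```rust
use std::collections::HashSet;

struct ToyDomain {
    protected: HashSet<usize>, // addresses currently protected by hazard pointers
    retired: Vec<*mut u64>,    // addresses retired and waiting to be reclaimed
}

impl ToyDomain {
    fn protect(&mut self, ptr: *mut u64) {
        self.protected.insert(ptr as usize);
    }

    fn unprotect(&mut self, ptr: *mut u64) {
        self.protected.remove(&(ptr as usize));
    }

    fn retire(&mut self, ptr: *mut u64) {
        self.retired.push(ptr);
    }

    // Reclaim every retired pointer that no one is protecting.
    fn eager_reclaim(&mut self) {
        let protected = &self.protected;
        self.retired.retain(|&ptr| {
            if protected.contains(&(ptr as usize)) {
                true // still protected: keep it on the retired list
            } else {
                unsafe { drop(Box::from_raw(ptr)) }; // no readers left: free it
                false
            }
        });
    }
}

fn main() {
    let mut domain = ToyDomain { protected: HashSet::new(), retired: Vec::new() };
    let ptr = Box::into_raw(Box::new(42u64));

    domain.protect(ptr);    // a reader starts accessing through `ptr`
    domain.retire(ptr);     // the writer swaps it out and retires it
    domain.eager_reclaim(); // still protected: nothing is freed yet
    domain.unprotect(ptr);  // the reader is done
    domain.eager_reclaim(); // now the allocation is actually freed
}
```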
An example
Hazard pointers are pretty complicated, so here's a visual example that I hope helps:
Protected: [1<0x22>]
Retired: []
0x20 0x22 0x23 0x24
+------+------+------+------+
Thread 1 | | <> | | |
Thread 2 | | | | |
+------+------+------+------+
Right now Thread 1 is accessing 0x22
via a hazard pointer, so the Protected
list contains the pointer 0x22
, annotated with 1
to indicate Thread 1 is
protecting it. I'm not sure if you would actually keep track of which thread is
protecting a pointer in an actual implementation. I think if another thread
tries to protect an already protected pointer, nothing will happen.
Ok, now, Thread 2 accesses 0x22
and protects the pointer.
Protected: [1<0x22>, 2<0x22>]
Retired: []
0x20 0x22 0x23 0x24
+------+------+------+------+
Thread 1 | | <> | | |
Thread 2 | | <> | | |
+------+------+------+------+
Thread 1 finishes with its access, and retires 0x22
. Thread 1 is saying, "No
one new will use this pointer, deallocate it when it's safe to do so!" 0x22
is
added to the Retired
list. The Domain
can't reclaim the pointer yet because
Thread 2 is still accessing it.
Protected: [2<0x22>]
Retired: [0x22]
0x20 0x22 0x23 0x24
+------+------+------+------+
Thread 1 | | | | |
Thread 2 | | <> | | |
+------+------+------+------+
Finally, Thread 2 finishes using the pointer, removing 0x22
from the
Protected
list.
Protected: []
Retired: [0x22]
0x20 0x22 0x23 0x24
+------+------+------+------+
Thread 1 | | | | |
Thread 2 | | | | |
+------+------+------+------+
The Domain
sees that 0x22
is retired and no one is protecting it, so it
deallocates the allocation at 0x22
. We have reclaimed memory, and 0x22
will
not leak!
Code changes
To use the hazard pointers, we're going to need to make a small change in the vector's structure.
The hardest part was getting started.
Following the documentation on
Domain
, I
wrote a bunch of type aliases using the type keyword:
#![allow(unused)] fn main() { // Setting up hazard pointers // This makes sure they all use the same Domain, guaranteeing the protection is valid. #[non_exhaustive] struct Family; type Domain = haphazard::Domain<Family>; type HazardPointer<'domain> = haphazard::HazardPointer<'domain, Family>; type HazAtomicPtr<T> = haphazard::AtomicPtr<T, Family>; }
We only use Domain
s produced from struct Family
. This prevents us from
retiring a pointer in the Global
domain that is being guarded in a different
domain. The Global
domain can't see the other Domain
's protected list, so it might prematurely reclaim the pointer.
Secondly, all the HazardPointer
s and HazAtomicPtr
s we construct will be in the
same family as our Domain
s. This ensures the same protection against
overlapping with the Global
domain.
The difference between HazAtomicPtr, which is an alias for haphazard::AtomicPtr, and std::sync::atomic::AtomicPtr, is that HazAtomicPtr uses hazard pointers to guard loads. Additionally, all atomic operations with HazAtomicPtr have Acquire-Release semantics built in. Nifty!
To ensure that we always retire and protect in the same domain, we will also
carry a Domain
in the struct
itself. Then, it's pretty easy to just always
use &self.domain
whenever we need a Domain
. All we have to do is add one more
struct
field to SecVec
:
#![allow(unused)] fn main() { pub struct SecVec<'a, T: Sized + Copy> { buffers: CachePadded<Box<[AtomicPtr<AtomicU64>; 60]>>, descriptor: CachePadded<HazAtomicPtr<Descriptor<'a, T>>>, domain: Domain, // Hi there :) _boo: PhantomData<T>, } struct Descriptor<'a, T: Sized> { pending: HazAtomicPtr<Option<WriteDescriptor<'a, T>>>, size: usize, } struct WriteDescriptor<'a, T: Sized> { new: u64, old: u64, location: &'a AtomicU64, _boo: PhantomData<T>, } }
And with that out of the way, we can now plug some leaks!
Fixing complete_write
complete_write
was the easiest leak to seal. When we swap out the
WriteDescriptor
, we get back the old one. All we have to do is retire it, and
its memory will eventually get reclaimed.
We execute the WriteDescriptor
and make a new one (None
) like normal:
#![allow(unused)] fn main() { fn complete_write(&self, pending: *mut Option<WriteDescriptor<T>>) { // If cas of actual value fails, someone else did the write // Result of compare_exchange doesn't matter if let Some(writedesc) = unsafe { &*pending } { let _ = AtomicU64::compare_exchange( writedesc.location, writedesc.old, writedesc.new, Ordering::AcqRel, Ordering::Relaxed, ); let new_writedesc = WriteDescriptor::<T>::new_none_as_ptr(); }
Here comes the part where the hazard pointers kick in. We make a hazard pointer
in &self.domain
, then load in the Descriptor
. Now, the current Descriptor
cannot get reclaimed as long as our hazard pointer is alive. Then we swap in a
new pointer to the None
WriteDescriptor
.
Here comes the big change: instead of just doing nothing with the pointer we
swapped out, we retire
it in &self.domain
. According to the documentation
for retire_in
, there is a safety contract we need to follow (hence it being marked
unsafe fn
).
Let's look at that:
- The pointed-to object will never again be returned by any
[Haz]AtomicPtr::load
.- The pointed-to object has not already been retired.
- All calls to load that can have seen the pointed-to object were using hazard pointers from domain.
Alright, let's make sure we're fulfilling the contract.
Number one, we swapped out the pointer, so all new calls to HazAtomicPtr::load
will use the new pointer. This is Acquire-Release
semantics in action under
the hood. Since the swap_ptr
uses Release
, all HazAtomicPtr::load
s (which
use Acquire
) will see the new value. Thus, the old value is safe from being
load
ed again.
Number two, only one thread can get a pointer as the result of a swap. If I'm
holding a marble in my hand and I give it away, no one else can take that marble
from my hand. The person who took it can do whatever they want with it without
worrying about others interfering. Since we got the pointer as the result of
swap_ptr
, no other thread has exclusive access like we do. We took the marble.
Therefore, we know that no other thread has already retired or will retire the
pointer. They can't access the marble anymore, and if we have the marble, it
means they never had it.
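Here's a small standalone demonstration of the marble argument (just an illustration, and it leaks on purpose): two threads race to swap the same AtomicPtr, and each swap hands back a different old pointer, so no pointer is ever received twice.

```rust
use std::sync::atomic::{AtomicPtr, Ordering};
use std::sync::Arc;
use std::thread;

fn main() {
    let slot = Arc::new(AtomicPtr::new(Box::into_raw(Box::new(0u64))));

    let handles: Vec<_> = (1..=2u64)
        .map(|i| {
            let slot = Arc::clone(&slot);
            // Return the old pointer as a usize so the thread result is `Send`
            thread::spawn(move || slot.swap(Box::into_raw(Box::new(i)), Ordering::AcqRel) as usize)
        })
        .collect();

    let mut seen: Vec<usize> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    seen.push(slot.load(Ordering::Acquire) as usize);
    seen.sort_unstable();
    seen.dedup();

    // Three distinct pointers: the original plus the two swapped in.
    // `swap` never handed the same pointer to more than one thread.
    assert_eq!(seen.len(), 3);
}
```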
Finally, number three, all operations (creating hazard pointers, retiring pointers)
happen through &self.domain
!
After writing a 1000-word essay, we can confirm that retire_in
is safe to
call. This is the argument we'll use for retire
ing the results of
compare_exchange
in push
/pop
.
#![allow(unused)] fn main() { let mut hp = HazardPointer::new_in_domain(&self.domain); let old = unsafe { self.descriptor .load(&mut hp) .expect("ptr is null") .pending // This is a HazAtomicPtr<WriteDescriptor> // # Safety // new_writedesc conforms to the requirements of HazAtomicPtr::new() // because it comes from Box::into_raw and is a valid WriteDescriptor .swap_ptr(new_writedesc) }; // # Safety // We are the only thread that will retire this pointer because // only one thread can get the result of the swap (this one). // Two threads couldn't have performed a swap and both got this pointer. unsafe { old.unwrap().retire_in(&self.domain) }; // hp gets dropped, protection ends } } }
That's the only change to complete_write
. push
/pop
aren't much worse.
Complete source for complete_write
(not leaky)
#![allow(unused)] fn main() { fn complete_write(&self, pending: *mut Option<WriteDescriptor<T>>) { // If cas of actual value fails, someone else did the write // Result of cmpxchng doesn matter if let Some(writedesc) = unsafe { &*pending } { let _ = AtomicU64::compare_exchange( writedesc.location, writedesc.old, writedesc.new, Ordering::AcqRel, Ordering::Relaxed, ); let new_writedesc = WriteDescriptor::<T>::new_none_as_ptr(); let mut hp = HazardPointer::new_in_domain(&self.domain); let old = unsafe { self.descriptor .load(&mut hp) .unwrap() .pending // # Safety // new_writedesc conforms to the requirements of HazAtomicPtr::new() // because it comes from Box::into_raw and is a valid WriteDescriptor .swap_ptr(new_writedesc) }; // # Safety // We are the only thread that will retire this pointer because // only one thread can get the result of the swap (this one). // Two threads couldn't have performed a swap and both got this pointer. unsafe { old.unwrap().retire_in(&self.domain) }; } } }
Fixing push & pop
I know I said that the changes to push
and pop
aren't that bad, which is
true. Getting to those changes, however, took a while. I'm going to explain what
I did with pseudocode first, and then show the final code.
The first thing I tried was just retiring the old Descriptor
after a
successful compare_exchange
; however, this didn't reduce the leakage at all
for some reason. I figured it might be because the Descriptor
was pointing to a
live WriteDescriptor
. So then, I also retired the WriteDescriptor
. However,
this produced use-after-frees and data races according to Miri
, so I knew I
was doing something wrong.
I decided to review the safety contract of retire_in
again, and that is when I
found the bug. Retiring the Descriptor
is safe for the same reason retiring
the WriteDescriptor
after complete_write
is. Since the Descriptor
is the
result of a swap, we are the only thread who will retire it. The thing is, if we
also retire the WriteDescriptor
, a thread who is already accessing the
Descriptor
could make a new load to the just retired WriteDescriptor
,
violating the safety contract of retire_in
, and causing UB.
The problem in picture form
We, Thread 1, have the Descriptor
as the result of a successful
compare_exchange
. Thread 2 is also reading the Descriptor
(but not the
inner WriteDescriptor
)
Thread 2
/
Thread 1 (us) /
| /
| /
V v
Descriptor
\
\
\
v
WriteDescriptor
Because the compare_exchange
was successful, we retire
the Descriptor
and
WriteDescriptor
. The Descriptor
is protected from reclamation because Thread
2 is reading it, but the WriteDescriptor
has no readers so it gets
deallocated.
Thread 2
/
Thread 1 (us) /
| /
| /
V v
Descriptor
\
---+----------------
\
v
WriteDescriptor <Deallocated>
Now, Thread 2 goes to read the (now reclaimed!!) WriteDescriptor
by loading
the pointer contained in the Descriptor
(which is still protected, and safe to
access).
Thread 2
|
Thread 1 (us) |
| |
| |
V |
Descriptor |
\ |
---+-----------+----
\ |
v V
WriteDescriptor <Deallocated>
And here we have it, Thread 2 accessing deallocated memory!
The solution
The solution I came up with is to make sure a reference to a WriteDescriptor
never outlives the reference to its parent Descriptor
. Visually this looks
like:
-- Descriptor Reference Start
-- WriteDescriptor Reference Start
-- WriteDescriptor Reference End
-- Descriptor Reference End
This means that when there are no people accessing a Descriptor
, there are
also no people accessing the inner WriteDescriptor
. Therefore, when a
Descriptor
is retired
ed, the WriteDescriptor
is also safe to retire
because there are no references to it. Since no one can get a new reference to a
retire
ed Descriptor
, no one can access the inner WriteDescriptor
.
Why is this important? Whenever we reclaim a Descriptor
, we also reclaim the
inner WriteDescriptor
, fixing our leaks without causing any UB.
To implement this custom behavior for Descriptor
, we implement the Drop
trait. A type that implements Drop
executes some custom behavior when it goes
out of scope and is reclaimed.
The Drop
implementation looks like this:
#![allow(unused)] fn main() { impl<T> Drop for Descriptor<'_, T> { fn drop(&mut self) { // # Safety // The pointer is valid because it's from Box::into_raw // We must also ensure ref to wdesc never outlasts ref to desc unsafe { Box::from_raw( self.pending .swap_ptr(ptr::null_mut()) .unwrap() .into_inner() // This is a NonNull<T> .as_ptr() // Turn it into a raw pointer ); } } } }
All we're doing is extracting the pointer to the WriteDescriptor
and calling
Box::from_raw
on it so that its memory will be reclaimed by Box
when it
goes out of scope.
Reclaiming the Descriptor
s
It's time to finally go over the code changes to push
. All accesses to the
Descriptor
and WriteDescriptor
are guarded with a hazard pointer. The access
returns a reference to the Descriptor
/WriteDescriptor
, which is valid as
long as the hazard pointer guarding the access is alive. Access to the inner
WriteDescriptor
is explicitly scoped within its own block to make clear that
access to the WriteDescriptor
cannot outlive the access to the parent
Descriptor
.
#![allow(unused)] fn main() { pub fn push(&self, elem: T) { let backoff = Backoff::new(); // Backoff causes significant speedup loop { let mut dhp = HazardPointer::new_in_domain(&self.domain); let current_desc = unsafe { self.descriptor.load(&mut dhp) } .expect("invalid ptr for descriptor in push"); // Use a block to make explicit that the use of the wdesc does not outlive // the use of the desc. // This means that when the desc is dropped, there will be no references // to the wdesc inside. // And we can deallocate the wdesc with `Box::from_raw` { let mut wdhp = HazardPointer::new_in_domain(&self.domain); let pending = unsafe { current_desc.pending.load(&mut wdhp) } .expect("invalid ptr from write-desc in push"); self.complete_write(pending as *const _ as *mut _); // Hazard pointer is dropped, protection ends } }
This stuff is all the same as before.
#![allow(unused)] fn main() { // If we need more memory, calculate the bucket let bucket = (highest_bit(current_desc.size + FIRST_BUCKET_SIZE) - highest_bit(FIRST_BUCKET_SIZE)) as usize; // Allocate it if self.buffers[bucket].load(Ordering::Acquire).is_null() { self.allocate_bucket(bucket) } let last_elem = unsafe { &*self.get(current_desc.size) }; let next_write_desc = WriteDescriptor::<T>::new_some_as_ptr( // TODO: address this in macro // # Safety // The `transmute_copy` is safe because we have ensured that T is the // correct size at compile time unsafe { mem::transmute_copy::<T, u64>(&elem) }, // Load from the AtomicU64, which really contains the bytes for T last_elem.load(Ordering::Acquire), last_elem, ); let next_desc = Descriptor::<T>::new_as_ptr(next_write_desc, current_desc.size + 1); }
The compare_exchange
syntax is slightly different, but it's doing the exact
same thing. We don't have to specify orderings because they're built in by
haphazard
. On a successful compare_exchange
, we retire
the pointer
to the old Descriptor
. When it is finally reclaimed, its Drop
implementation will run and its inner WriteDescriptor
will also get reclaimed
safely.
If the compare_exchange
fails, we deallocate our local Descriptor
normally
by calling Box::from_raw
. Since the local Descriptor
was never shared across
threads, we don't have to worry about synchronizing the deallocation. Then, we
spin using the Backoff
and go back to the top of the loop.
#![allow(unused)] fn main() { if let Ok(replaced) = unsafe { HazAtomicPtr::compare_exchange_weak_ptr( // # Safety // Safe because the pointer we swap in points to a valid object that // is !null &self.descriptor, current_desc as *const _ as *mut _, next_desc, ) } { self.complete_write(next_write_desc); // # Safety // Since the we only retire when swapping out a pointer, this is the only // thread that will retire, since only one thread receives the result of // the swap (this one) // // There will never be another load call to the ptr because all calls will // go the new one. Since all uses of the inner wdesc are contained within // the lifetime of the reference to the desc, there will also be no new // loads on the inner wdesc. unsafe { replaced.unwrap().retire_in(&self.domain); } break; } // Deallocate the write_desc and desc that we failed to swap in // # Safety // Box the write_desc and desc ptrs were made from Box::into_raw, so it // is safe to Box::from_raw unsafe { // Note: the inner wdesc also get's dropped as part of the desc's drop impl Box::from_raw(next_desc); } backoff.spin(); } } }
The changes for pop
are identical. We are so close to being done with code.
Our Descriptor
s and WriteDescriptors
are eventually reclaimed, which is a
big step forward. The last thing is to deallocate the buckets and the final
Descriptor
when the vector itself is dropped.
Complete source for push()
and pop()
#![allow(unused)] fn main() { pub fn push(&self, elem: T) { let backoff = Backoff::new(); // Backoff causes significant speedup loop { let mut dhp = HazardPointer::new_in_domain(&self.domain); let current_desc = unsafe { self.descriptor.load(&mut dhp) } .expect("invalid ptr for descriptor in push"); // Use a block to make explicit that the use of the wdesc does not // outlive the use of the desc. This means that when the desc is dropped, // there will be no references to the wdesc inside.And we can deallocate // the wdesc with `Box::from_raw` { let mut wdhp = HazardPointer::new_in_domain(&self.domain); let pending = unsafe { current_desc.pending.load(&mut wdhp) } .expect("invalid ptr from write-desc in push"); self.complete_write(pending as *const _ as *mut _); // Hazard pointer is dropped, protection ends } // If we need more memory, calculate the bucket let bucket = (highest_bit(current_desc.size + FIRST_BUCKET_SIZE) - highest_bit(FIRST_BUCKET_SIZE)) as usize; // Allocate it if self.buffers[bucket].load(Ordering::Acquire).is_null() { self.allocate_bucket(bucket) } let last_elem = unsafe { &*self.get(current_desc.size) }; let next_write_desc = WriteDescriptor::<T>::new_some_as_ptr( // TODO: address this in macro // # Safety // The `transmute_copy` is safe because we have ensured that T is // the correct size at compile time unsafe { mem::transmute_copy::<T, u64>(&elem) }, // Load from the AtomicU64, which really contains the bytes for T last_elem.load(Ordering::Acquire), last_elem, ); let next_desc = Descriptor::<T>::new_as_ptr(next_write_desc, current_desc.size + 1); if let Ok(replaced) = unsafe { HazAtomicPtr::compare_exchange_weak_ptr( // # Safety // Safe because the pointer we swap in points to a valid object that // is !null &self.descriptor, current_desc as *const _ as *mut _, next_desc, ) } { self.complete_write(next_write_desc); // # Safety // Since the we only retire when swapping out a pointer, this is the only // thread that will retire, since only one thread receives the result of // the swap (this one) // // There will never be another load call to the ptr because all calls will // go the new one. Since all uses of the inner wdesc are contained within // the lifetime of the reference to the desc, there will also be no new // loads on the inner wdesc. unsafe { replaced.unwrap().retire_in(&self.domain); } break; } // Deallocate the write_desc and desc that we failed to swap in // # Safety // Box the write_desc and desc ptrs were made from Box::into_raw, so it is // safe to Box::from_raw unsafe { // Note: the inner wdesc also get's dropped as part of the desc's drop impl Box::from_raw(next_desc); } backoff.spin(); } } pub fn pop(&self) -> Option<T> { let backoff = Backoff::new(); // Backoff causes significant speedup loop { let mut dhp = HazardPointer::new_in_domain(&self.domain); let current_desc = unsafe { self.descriptor.load(&mut dhp) } .expect("invalid ptr for descriptor in pop"); // Use a block to make explicit that the use of the wdesc does not // outlive the use of the desc. This means that when the desc is // dropped, there will be no references to the wdesc inside. 
// And we can deallocate the wdesc with `Box::from_raw` { let mut wdhp = HazardPointer::new_in_domain(&self.domain); let pending = unsafe { current_desc.pending.load(&mut wdhp) } .expect("invalid ptr for write-descriptor in pop"); self.complete_write(pending as *const _ as *mut _); // Hazard pointer is dropped, protection ends } if current_desc.size == 0 { return None; } // TODO: add safety comment // Consider if new desc is swapped in, can we read deallocated memory? // Do not need to worry about underflow for the sub because we would // have already returned let elem = unsafe { &*self.get(current_desc.size - 1) } .load(Ordering::Acquire); let new_pending = WriteDescriptor::<T>::new_none_as_ptr(); let next_desc = Descriptor::<T>::new_as_ptr(new_pending, current_desc.size - 1); if let Ok(replaced) = unsafe { HazAtomicPtr::compare_exchange_weak_ptr( // # Safety // Safe because the pointer we swap in points to a valid object that // is !null &self.descriptor, current_desc as *const _ as *mut _, next_desc, ) } { // # Safety // Since the we only retire when swapping out a pointer, this is the only // thread that will retire, since only one thread receives the result of // the swap (this one) // // There will never be another load call to the ptr because all calls will // go the new one. Since all uses of the inner wdesc are contained within // the lifetime of the reference to the desc, there will also be no new // loads on the inner wdesc. unsafe { replaced.unwrap().retire_in(&self.domain); } // # Safety // TODO: address this in macro // This is ok because we ensure T is the correct size at compile time // We also know that elem is a valid T because it was transmuted into a // usize from a valid T, therefore we are only transmuting it back return Some(unsafe { mem::transmute_copy::<u64, T>(&elem) }); } // Deallocate the write_desc and desc that we failed to swap in // # Safety // Box the write_desc and desc ptrs were made from Box::into_raw, so // it is safe to Box::from_raw unsafe { // Note: the inner wdesc also get's dropped as part of the desc's drop impl Box::from_raw(next_desc); } backoff.spin(); } } }
Dropping the vector
We approach the end! As its last action, the vector will free the memory
allocated in its buckets and the Descriptor
it holds. Once again, we achieve
this by implementing the Drop
trait.
Using a bunch of chained function calls on &self.buffers
, we can get all of
the buffers that aren't null. Then, we recreate the Layout
they hold and
deallocate them.
Dropping the current Descriptor
is simple: we just Box::from_raw
it! Its
destructor runs and the inner WriteDescriptor
is also deallocated.
#![allow(unused)] fn main() { impl<T> Drop for SecVec<'_, T> where T: Copy, { fn drop(&mut self) { // Drop buffers let allocator = Global; for (bucket, ptr) in self .buffers .iter() .filter(|ptr| !ptr.load(Ordering::Relaxed).is_null()) .enumerate() // Getting all non-null buckets { let size = FIRST_BUCKET_SIZE * (1 << bucket); let layout = match Layout::array::<AtomicU64>(size) { Ok(layout) => layout, Err(_) => capacity_overflow(), }; unsafe { // # Safety // We have recreated the exact same layout used to alloc the ptr in // `allocate_bucket`. We know the ptr isn't null because of the filter allocator.deallocate( NonNull::new(ptr.load(Ordering::Relaxed) as *mut u8).unwrap(), layout, ) }; } // Retiring the current desc and wdesc // # Safety // Since we have &mut self, we have exclusive access, so we can retire the // desc and wdesc ptrs. // // It is safe to dereference the ptr to the desc because it is valid because // it was created with Descriptor::new_as_ptr. let desc = self.descriptor.load_ptr(); unsafe { Box::from_raw(desc); }; } } }
That's it. All the leaks. All the code I'm going to show you. I hope that was a
satisfying journey from learning about the algorithm to fully implementing it!
Let's run the tests one more time with Miri :)
More tests
Here are the tests. I added a new one up at the top that spawns a bunch of
threads which push
and pop
. I just want to make sure Miri
does not detect
any UB in a complex scenario like that.
#![allow(unused)] fn main() { #[cfg(test)] mod tests { use super::*; extern crate std; use std::sync::atomic::{AtomicIsize, Ordering}; use std::sync::Arc; use std::thread::{self, JoinHandle}; use std::vec::Vec; #[test] fn the_big_multithread() { static FIVE: isize = 5; let data = Arc::new(SecVec::<isize>::new()); data.reserve(100 * 5); let sum = Arc::new(AtomicIsize::new(0)); #[allow(clippy::needless_collect)] let handles = (0..5) .map(|_| { let data = Arc::clone(&data); thread::spawn(move || { for _ in 0..100 { data.push(FIVE); } }) }) .into_iter() .collect::<Vec<JoinHandle<_>>>(); handles.into_iter().for_each(|h| h.join().unwrap()); #[allow(clippy::needless_collect)] let handles = (0..5) .map(|_| { let data = Arc::clone(&data); let sum = Arc::clone(&sum); thread::spawn(move || { for _ in 0..100 { sum.fetch_add(data.pop().unwrap_or(0), Ordering::Relaxed); } }) }) .into_iter() .collect::<Vec<JoinHandle<_>>>(); handles.into_iter().for_each(|h| h.join().unwrap()); } #[test] fn size_starts_at_0() { let sv = SecVec::<usize>::new(); assert_eq!(0, sv.size()); } #[test] fn pop_empty_returns_none() { let sv = SecVec::<usize>::new(); assert_eq!(sv.pop(), None); } #[test] fn ten_push_ten_pop() { let sv = SecVec::<isize>::new(); for i in 0..10 { sv.push(i); } for i in (0..10).rev() { assert_eq!(sv.pop(), Some(i)); } } #[test] fn does_not_allocate_buffers_on_new() { let sv = SecVec::<isize>::new(); for buffer in &**sv.buffers { assert!(buffer.load(Ordering::Relaxed).is_null()) } } } }
And here is the result, shuffled around a little so it all fits:
~/C/r/unlocked (main) [1] > cargo miri test -- sealed::tests
Finished test [unoptimized + debuginfo] target(s) in 0.01s
Running unittests (target/miri/x86_64-apple-darwin/debug/deps/unlocked-666)
running 5 tests
test sealed::tests::does_not_allocate_buffers_on_new ... ok
test sealed::tests::pop_empty_returns_none ... ok
test sealed::tests::size_starts_at_0 ... ok
test sealed::tests::ten_push_ten_pop ... ok
test sealed::tests::the_big_multithread ... ok
warning: thread support is experimental and incomplete:
weak memory effects are not emulated.
Mmm . . . I love those greens! <3
Reflections
Here are some reflections on the process of writing the vector and learning about concurrent programming.
Potential Optimizations
Each time we make a new Descriptor
or WriteDescriptor
, we allocate it on the
heap. This means we will make many heap allocations for only one Descriptor
to
succeed at being compare_exchange
'd in. What if we instead made one heap
allocation at the beginning of push
and pop
, and just overwrote the contents
on every failed iteration of the compare-exchange
loop?
#![allow(unused)] fn main() { // Normal flow fn push() { loop { // New allocation every iteration, expensive :( <allocate Descriptor> <compare-exchange Descriptor> <if compare-exchange failed, reloop> } } // Efficient flow fn push() { <allocate Descriptor> // One time cost :) loop { <write to allocation> // Cheap :) <compare-exchange Descriptor> <if compare-exchange failed, reloop> } } }
I tried it, and the results ranged from worse on one microbenchmark to somewhat better on the others.
Here's the results of the vector we implemented:
test sealed::bench::pop ... bench: 169,980 ns/iter (+/- 21,594)
test sealed::bench::push ... bench: 1,025,550 ns/iter (+/- 43,945)
test sealed::bench::push_and_pop ... bench: 829,768 ns/iter (+/- 63,895)
test sealed::bench::push_then_pop ... bench: 1,732,666 ns/iter (+/- 113,670)
Here's the results for the modified vector:
test sealed::bench::pop ... bench: 269,311 ns/iter (+/- 11,669)
test sealed::bench::push ... bench: 962,469 ns/iter (+/- 23,620)
test sealed::bench::push_and_pop ... bench: 786,135 ns/iter (+/- 32,104)
test sealed::bench::push_then_pop ... bench: 1,611,816 ns/iter (+/- 68,167)
As you can see, pop
(which is just a bunch of threads pop
ing an empty
vector) is worse for the modified vector. At the beginning of pop
, we make an
allocation to hold the Descriptor
s that we'll try to swap in. However, in this
test, we are always pop
ing off an empty vector, so we never even need to write
to the allocation because we just return None
when we see the length of the
vector is 0. So, we make an unnecessary allocation when popping off an empty
vector, but save many allocations when there is actual contention.
The other microbenchmarks look better, but the intervals for the modified and original overlap, so I doubt the change is significant (#AP Stats knowledge).
unsafe
code
As I said before, unsafe
code isn't inherently bad, it's just code that comes
with a contract. Keeping this in mind helped me get over my initial apprehension
about using unsafe
code.
If you write concurrent code, I think unsafe
code is inevitable. There's just
too much to do with raw pointers and memory. Additionally, there are many more
contracts the compiler can't enforce in multithreaded scenarios.
Philosophically, concurrent code has to deal with shared mutable state,
otherwise it wouldn't do anything useful. Shared mutable state is inherently
unsafe! That's why Rust
only allows one mutable reference (&mut
) at a time:
it prevents memory bugs like data races. Thus, there is some danger
intrinsically associated with writing low-level concurrent code.
"Shared mutable state is, among other things, the root of all evil" - Me, 2022
Although it seems scary at first, I'm really glad Rust
has the concept of
unsafe
. Whenever I had any memory bugs, I knew that the root cause must have
been in an unsafe
block. Systematically checking over those blocks allowed me
to fix my code quickly.
It's good that we have to make explicit where we are doing potentially unsafe
things. Not just because of debugging, but because it makes us pause and check
everything over one more time. If nothing was unsafe
, or everything was
unsafe
, reasoning about our code would be much harder in my opinion.
A note on debugging: always read the safety contract and document why what you're doing is safe! I caught so many bugs just by going over the safety contract again and realizing I wasn't following it.
Atomic Intuition
It's pretty safe to say that atomics are confusing. Just the context itself is confusing: that CPUs and compilers can reorder instructions. I found that as I developed an intuition for atomics, it became easier to reason about my code and its correctness.
If you're getting started with concurrent programming, my advice would be to get
a solid grasp on atomics. You don't need to know every detail and all the ins
and outs. When you're writing code and you think "This Acquire
load
synchronizes with that Release
store", you gain confidence and it becomes
easier to get going.
The biggest moment for me was when I stopped having to look at the Standard
Library Documentation every time I used an atomic. I had developed an intuitive
sense of the orderings, and I could see why each one was useful in my code. At
first, I thought the orderings seemed a little random. As I started to
use atomics more and more, I saw how the orderings fit in nicely with actual use
cases, from using Acquire
to load a bucket to AcqRel
in compare_exchange
.
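If you want a tiny, self-contained example to hang that intuition on, the classic message-passing pattern is a good one: the Release store publishes the data, and any Acquire load that sees the flag set is guaranteed to see the data too.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::thread;

static DATA: AtomicU64 = AtomicU64::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn main() {
    let producer = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);
        READY.store(true, Ordering::Release); // everything before this store...
    });
    let consumer = thread::spawn(|| {
        while !READY.load(Ordering::Acquire) {} // ...is visible once this load sees `true`
        assert_eq!(DATA.load(Ordering::Relaxed), 42);
    });
    producer.join().unwrap();
    consumer.join().unwrap();
}
```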
Building an intuition for atomics is both satisfying and extremely useful.
Debugging
Here are a couple of debugging tricks I stumbled upon as I wrote the vector.
Pointers
One time, I had a problem where something was reading a null pointer. I thought
that after a swap, no thread could read the swapped in value, so I just swapped
in a null pointer. Sadly, I was mistaken. Looking at the error message, which
said pointer 0x0 is invalid to dereference
, I got an idea. There were three
places swapping in a null pointer; the first I changed to swap in 0x1
, the
second 0x2
, and the third 0x3
.
After running the program through Miri
, I got the error message
pointer 0x2 is invalid to dereference
, and I knew where the bug was
originating from.
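In code, the trick looks something like this (illustrative only; the function names and the slot are made up):

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

// Each call site swaps in a distinct (invalid) sentinel, so Miri's
// "0x2 is invalid to dereference" points straight at the guilty site.
fn clear_site_one(slot: &AtomicPtr<u64>) -> *mut u64 {
    slot.swap(0x1 as *mut u64, Ordering::AcqRel)
}

fn clear_site_two(slot: &AtomicPtr<u64>) -> *mut u64 {
    slot.swap(0x2 as *mut u64, Ordering::AcqRel)
}

fn main() {
    let slot = AtomicPtr::new(Box::into_raw(Box::new(7u64)));
    let old = clear_site_two(&slot); // returns the real allocation
    unsafe { drop(Box::from_raw(old)) };
    let _ = clear_site_one(&slot); // returns the 0x2 sentinel; don't dereference it!
    // If some code later dereferenced `slot`'s pointer, Miri would report 0x1,
    // pointing us straight at `clear_site_one`.
}
```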
unwrap
vs. expect
Whenever you unwrap
an Option
or Result
that is None
or Err
, Rust will
print out a little diagnostic saying where the panic!
happened. I found it
helpful to use expect
instead of unwrap
because of the ability to provide
some extra context.
For example, there is a method in the haphazard
crate called AtomicPtr::load
which returns an Option<&T>
. It only returns a None
value if the underlying
AtomicPtr
contains a null pointer. Instead of unwrap
ing the return value of
load
, I called expect("read null pointer")
. When I inevitably messed up and
unwrap
ed a None
, I knew there was a null pointer floating around because of
the error message.
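A tiny illustration (the message here is made up, but it mirrors what I did):

```rust
fn main() {
    // Imagine this Option came from loading a null AtomicPtr
    let desc: Option<&u64> = None;

    // With unwrap, the panic only says:
    //   called `Option::unwrap()` on a `None` value
    // let _d = desc.unwrap();

    // With expect, the panic carries the context we chose:
    //   read null pointer for descriptor in push
    let _d = desc.expect("read null pointer for descriptor in push");
}
```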
Although these tricks seem small, they actually saved me a lot of time.
Acknowledgements
Thank you to my advisor and my friends for your feedback and support.
Helpful Resources
- Rust Standard Library Documentation
- Jon Gjenset's Youtube Channel
- The Book
- The Rust Reference
- The Rustonomicon
- Preshing's Blog (especially his content on atomics and lock-free programming!)
- Learn Rust With Entirely Too Many Linked Lists
Reading source code is a great way to learn!
crossbeam_queue
crossbeam_utils
atomic
haphazard
std::vec::Vec
(also check out RawVec
)