Memory Allocation
The first thing I did when writing the code out was think about the pieces that
would make up the vector. In Rust, an extremely common building block for any
type is the struct. A struct just sticks its members' data next to each
other in memory. Here is the vector itself, as a struct:
```rust
pub struct SecVec<'a, T: Sized + Copy> {
    buffers: CachePadded<Box<[AtomicPtr<AtomicU64>; 60]>>,
    descriptor: CachePadded<AtomicPtr<Descriptor<'a, T>>>,
    _boo: PhantomData<T>, // Data is stored as transmuted T's
}
```
Boo! 👻
I bet the PhantomData scared you. We have a generic parameter T, but no
struct member of SecVec (or of either descriptor) actually contains a T,
because we transmute our T's into u64s. So, to let the compiler know we
really are carrying T's, we add a little ghost that tells it,
"We're carrying this Phantom T" 😉
Sharing is caring
There is a lot to unpack here. Firstly, CachePadded is a struct provided by
the crossbeam_utils crate.

A note on cache: you may have heard of CPU cache, a small buffer of memory stored on the CPU to allow for fast access. The cache in CachePadded actually refers to a buffer between main RAM and the CPUs; it's just a larger and slower cache than the ones on the CPU itself. The cache is split into contiguous blocks of memory called cache lines. This is the most granular level at which cache coherency is maintained.

When two threads each have a value in the same cache line, one thread modifying the value it owns marks the entire cache line as "dirty". Even though the other thread's value hasn't changed, the cache coherency protocol may cause that thread to reload the entire line when it uses its value, incurring some overhead. This is called false sharing, and it can cause severe performance degradation.

Cache is an extremely important consideration when designing data structures. It's why linked lists are algorithmically fine but terribly slow in practice. As the saying goes, cache is king.
The CachePadded struct aligns its contents to the beginning of a cache
line to prevent false sharing. If all CachePadded objects sit at the beginning
of a cache line (assuming they do not cross a cache line), there can't be false
sharing between them. Preventing false sharing can lead to a huge speedup, but
it also increases the size of the type. If you're wondering how
CachePadded is implemented, check out
#[repr(align(n))] in the
Nomicon.
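If you just want the flavor of it, here's a rough sketch of the idea (not crossbeam's actual code; I'm assuming a 64-byte cache line, which is common but architecture-dependent):

```rust
use std::sync::atomic::AtomicU64;

// Rough sketch of the idea behind `CachePadded`: over-align the value so
// it starts at a cache-line boundary. 64 bytes is a common cache-line
// size; the real crate picks the alignment per architecture.
#[repr(align(64))]
struct Padded<T>(T);

struct Counters {
    hits: Padded<AtomicU64>,   // starts on its own cache line...
    misses: Padded<AtomicU64>, // ...and so does this -> no false sharing
}
```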
Here's how I picture cache padding:
|-----Cache line-----|-----Cache Line-----|
v                    v                    v
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|69|xx|xx|xx|xx|xx|xx|42|xx|xx|xx|xx|xx|xx|
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
 ^                    ^
  \                    \
   \                    \
    Different cache lines -> no false sharing
Two-level array
The first member of SecVec<T> is a cache-padded array of 60 pointers allocated
on the heap (notice the Box). These pointers will each point into another
array. The pointers start off as null pointers (0x0), and will get swapped out
for valid pointers once they need to point to an actual array.
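Here's a hedged sketch of how one of those null pointers might get swapped out (a hypothetical helper, not the crate's actual code): prepare the bucket first, then compare_exchange it in, so that if two threads race, exactly one allocation wins:

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, AtomicU64, Ordering};

// Hypothetical helper: allocate a bucket of `len` slots and try to
// install it into a currently-null slot of the top-level array.
fn install_bucket(slot: &AtomicPtr<AtomicU64>, len: usize) {
    // Prepare the new bucket in private...
    let bucket: Box<[AtomicU64]> = (0..len).map(|_| AtomicU64::new(0)).collect();
    let new = Box::into_raw(bucket) as *mut AtomicU64;
    // ...then try to swap it in for the null pointer. If another thread
    // beat us to it, we free our allocation and use theirs instead.
    if slot
        .compare_exchange(ptr::null_mut(), new, Ordering::AcqRel, Ordering::Acquire)
        .is_err()
    {
        unsafe { drop(Box::from_raw(ptr::slice_from_raw_parts_mut(new, len))) }
    }
}
```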
The AtomicPtrs point to AtomicU64s because each element is going to get
transmuted into a u64 so that we can atomically perform writes on the vector.
When returning an element, we'll transmute it back into a T. Transmuting means
interpreting the bits of one type as the bits of another.
For example, 0b10100001 means -95 when interpreted as a signed integer but
161 when interpreted as an unsigned integer. Transmuting one to the other
just changes how we interpret the bits, not the actual bits themselves.
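That example in code, using a plain transmute just to show the reinterpretation:

```rust
fn main() {
    let x: i8 = -95;
    // Same bits, different interpretation.
    let y: u8 = unsafe { std::mem::transmute::<i8, u8>(x) };
    assert_eq!(y, 0b10100001); // 161
}
```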
Descriptors galore
The second member of SecVec<T> is a cache-padded AtomicPtr to a
Descriptor. As you've probably noticed, there are a bunch of AtomicPtrs
here. That's because we can modify the pointer atomically, specify which
Ordering to use, and compare_exchange the pointer. A common way of writing
data in concurrent programming is to change a pointer instead of actually
modifying a buffer. Since a buffer can't necessarily be modified atomically or
without locking, what we can do is prepare a buffer and then change a pointer so
that it points to our new buffer. All new readers will see the new data when
they dereference the pointer.
               Pointer
           /             \
          /      ->       \
         /                 \
        v                   v
       Old                 New
+---+---+---+---+   +---+---+---+---+
| 9 | 9 | 9 | 9 |   | 6 | 6 | 6 | 6 |
+---+---+---+---+   +---+---+---+---+
What do we do with the old pointer, you might ask? Worry not, we will get into that 😅
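In the meantime, here's a hedged sketch of the publish-by-pointer-swap pattern itself (hypothetical names; the old buffer is deliberately leaked, since reclaiming it safely is exactly the part we'll get into later):

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

// Sketch: build the new buffer in private, then make it visible to all
// readers with a single atomic pointer swap.
fn publish(current: &AtomicPtr<Vec<u64>>, new_data: Vec<u64>) {
    let new = Box::into_raw(Box::new(new_data));
    // Release pairs with the Acquire load below: a reader that sees `new`
    // also sees the fully initialized buffer behind it.
    let _old = current.swap(new, Ordering::Release);
    // `_old` is invisible to new readers now, but other threads may still
    // be using it -- freeing it safely is the hard part (leaked for now).
}

fn read_first(current: &AtomicPtr<Vec<u64>>) -> u64 {
    let ptr = current.load(Ordering::Acquire);
    // Assumes `ptr` is non-null and the buffer is non-empty.
    unsafe { (*ptr)[0] }
}
```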
The Descriptor and WriteDescriptor
```rust
pub struct Descriptor<'a, T: Sized> {
    pending: AtomicPtr<Option<WriteDescriptor<'a, T>>>,
    size: usize,
}

pub struct WriteDescriptor<'a, T: Sized> {
    new: u64,
    old: u64,
    location: &'a AtomicU64,
    _boo: PhantomData<T>, // New and old are transmuted T's
}
```
The trait bounds
Notice how T is Sized: this means its size is always known at
compile time. We need to ensure this because our values need to be
transmutable, and part of the safety contract of transmute_copy is making
sure our types are of compatible sizes.
The Copy bound is necessary because the data in the vector is copied in and
out of the buffers, with transmute_copy.
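As a hedged sketch of that contract in action (not the crate's code; I'm asserting the sizes match exactly, which is the simplest way to stay within transmute_copy's rules):

```rust
use std::mem;

// Hypothetical round-trip in the spirit of the vector's reads and writes:
// copy a `T` into a `u64` slot, then copy it back out.
fn roundtrip<T: Sized + Copy>(value: T) -> T {
    // `transmute_copy` reads `size_of::<Dst>()` bytes, so the destination
    // must not be larger than the source -- insist the sizes match.
    assert_eq!(mem::size_of::<T>(), mem::size_of::<u64>());
    let stored: u64 = unsafe { mem::transmute_copy(&value) };
    unsafe { mem::transmute_copy(&stored) }
}

fn main() {
    assert_eq!(roundtrip(-1.5f64), -1.5);
}
```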
OK, enough talk about structs, let's get to the first function: get()