Potential Optimizations

Each time we make a new Descriptor or WriteDescriptor, we allocate it on the heap. This means we will make many heap allocations for only one Descriptor to succeed at being compare_exchange'd in. What if we instead made one heap allocation at the beginning of push and pop, and just overwrote the contents on every failed iteration of the compare-exchange loop?

#![allow(unused)]
fn main() {
// Normal flow
fn push() {
    loop {
        // New allocation every iteration, expensive :(
        <allocate Descriptor> 
        <compare-exchange Descriptor>
        <if compare-exchange failed, reloop>
    }
}

// Efficient flow
fn push() {
    <allocate Descriptor> // One time cost :)
    loop {
        <write to allocation> // Cheap :)
        <compare-exchange Descriptor>
        <if compare-exchange failed, reloop>
    }
}
}

I tried it, and the results range from worse for one microbenchmark to being somewhat better on other microbenchmarks.

Here's the results of the vector we implemented:

test sealed::bench::pop                ... bench:     169,980 ns/iter (+/- 21,594)
test sealed::bench::push               ... bench:   1,025,550 ns/iter (+/- 43,945)
test sealed::bench::push_and_pop       ... bench:     829,768 ns/iter (+/- 63,895)
test sealed::bench::push_then_pop      ... bench:   1,732,666 ns/iter (+/- 113,670)

Here's the results for the modified vector:

test sealed::bench::pop                ... bench:     269,311 ns/iter (+/- 11,669)
test sealed::bench::push               ... bench:     962,469 ns/iter (+/- 23,620)
test sealed::bench::push_and_pop       ... bench:     786,135 ns/iter (+/- 32,104)
test sealed::bench::push_then_pop      ... bench:   1,611,816 ns/iter (+/- 68,167)

As you can see, pop (which is just a bunch of threads poping an empty vector) is worse for the modified vector. At the beginning of pop, we make an allocation to hold the Descriptors that we'll try to swap in. However, in this test, we are always poping off an empty vector, so we never even need to write to the allocation because we just return None when we see the length of the vector is 0. So, we make an unnecessary allocation when popping off an empty vector, but save many allocations when there is actual contention.

The other microbenchmarks look better, but the intervals for the modified and original overlap, so I doubt the change is significant (#AP Stats knowledge).