
Faster Windows Events

[15th December 2012]

Windows Events have one capability not offered by any other synchronization primitive that I've encountered: when configured for 'manual reset', they have the ability to awaken multiple threads with a single kernel call. A condition variable broadcast can also do this, but each thread has to acquire and release a shared mutex once awoken, creating an unnecessary bottleneck.

Used in this way, events are useful for implementing things like futures that cater for multiple simultaneous waiters, such as std::shared_future<>. Here, the event is used to keep threads at bay until the future's value materializes.
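To make the idea concrete, here's a minimal sketch of such a future-like object built directly on a manual-reset event. The class and all its names are my own invention for illustration, with error handling and copy-prevention omitted:

#include <windows.h>

// Hypothetical one-shot value: one thread calls set() exactly once; any
// number of threads can block in get() until the value materializes.
class one_shot_value
{
    public:
        one_shot_value() :
            ready(CreateEventW(0, TRUE, FALSE, 0)) // manual-reset, initially not signalled
        {
        }

        ~one_shot_value() { CloseHandle(ready); }

        void set(int v) // call at most once
        {
            value = v;
            SetEvent(ready); // releases every waiter with a single kernel call
        }

        int get() const
        {
            WaitForSingleObject(ready, INFINITE); // park until set() has run
            return value;
        }

    private:
        int value;
        HANDLE ready;
};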

However, because events are represented by Windows HANDLEs in user-space code, any operation on an event has to transition to the kernel, even if it's only asking whether or not the event is set (or signalled). Here I'll describe how we can add fast paths to all event operations using atomic operations and provide some simple benchmarks to illustrate the improvements.

First I should mention that I'm only going to consider 'manual reset' events here. I'm also going to call my class 'barricade' rather than 'event', because I think this new name is more descriptive. Perhaps at some point in the future, I'll write another post about how to implement the same interface for non-Windows systems.

Here's a synopsis:

class barricade : public uncopyable
{
    public:
        enum state 
        { 
            lowered, // Akin to a 'signalled' event; let threads through.
            raised   // Not 'signalled'; block threads until further notice.
        };

        explicit barricade(state initial);
        ~barricade();

        // Block until the barricade is lowered. If the barricade is already 
        // lowered, the call will return immediately.
        // c.f. WaitForSingleObject(h, INFINITE)
        void wait();

        // Returns the instantaneous state of the barricade. Use with care
        // as the state can be changed at any moment in another thread.
        // c.f. WaitForSingleObject(h, 0) == WAIT_OBJECT_0
        bool is_lowered() const;

        // Lower the barricade, releasing any wait()ing threads and allowing
        // any subsequent waiters to pass through until the next call to raise().
        // Calls to lower() on a barricade that is already lowered are benign.
        // c.f. SetEvent(h)
        void lower();

        // Raise the barricade, causing any subsequent calls of wait() to block
        // until it is lower()ed again.
        // Calls to raise() on a barricade that is already raised are benign.
        // c.f. ResetEvent(h)
        void raise();

    private:
        long state_now; // fast-path cache of the event's state
        HANDLE event;   // the underlying manual-reset Windows event
};

Any thread can call any member function at any time.

I'm not going to bother with a PulseEvent() replacement as its use is now strongly discouraged by Microsoft[1].
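Here's a quick sketch of typical use before we get into the implementation; everything apart from the barricade itself is illustrative:

barricade gate(barricade::raised);

// In each worker thread; do_work() is a hypothetical stand-in.
void worker()
{
    gate.wait(); // park here until the main thread lowers the gate
    do_work();
}

// In the main thread, once setup is complete:
gate.lower(); // releases all waiting workers in one go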

Let's quickly get the constructor and destructor out of the way so that we can get on to the meat. As you can see above, we've got two member variables. In addition to the HANDLE for the underlying event, we have state_now, which is going to enable the lock-free fast paths. These will need to be initialized according to the 'initial' state specified by the constructor's caller. The destructor simply closes the event handle.

barricade::barricade(state initial) :
    state_now(initial),
    event(CreateEventW(0, TRUE /* manual reset */, initial == lowered ? TRUE : FALSE, 0))
{
    if (!event)
        throw std::bad_alloc(); // or whatever you prefer
}

barricade::~barricade()
{
    CloseHandle(event);
}

The need for a third state

In the state enum we have two constants, but internally we'll need a third, which I'll call transitioning:

const long transitioning = 2;
STATIC_ASSERT(transitioning != barricade::lowered);
STATIC_ASSERT(transitioning != barricade::raised);

Why do we need this third state?

As you might be able to tell already, we're going to use the state_now member as a kind of cache for the real state of the underlying event. The event itself will merely provide a way to have threads wait() efficiently in the kernel when they encounter a raised barricade.

Let's consider how we might attempt to implement raise() and lower() if we only had the two public states. I'll be using Windows' own atomic functions and intrinsics, but playing along with the standard C++ atomics will also work.

// WRONG.
// Problematic implementations of raise() and lower()

void barricade::raise()
{
    (void)InterlockedExchange(&state_now, barricade::raised);
    ResetEvent(event);
}

void barricade::lower()
{
    (void)InterlockedExchange(&state_now, barricade::lowered);
    SetEvent(event);
}

To illustrate why this naive implementation is incorrect, suppose we have two threads. Thread 1 is calling raise() and thread 2 is calling lower(). The following interleaving of calls puts our barricade into an inconsistent state:

    (void)InterlockedExchange(&state_now, barricade::raised);  // Thread 1
    (void)InterlockedExchange(&state_now, barricade::lowered); // Thread 2
    SetEvent(event);                                           // Thread 2
    ResetEvent(event);                                         // Thread 1

Here, the state_now variable ends up with a value indicating that the barricade is lowered, whereas the event state in the kernel says it's raised.

Using the third state for a correct implementation

So, one solution (admittedly just the first that came to mind) was to introduce this third state, transitioning. Here's how it is used in the implementation of raise():

void barricade::raise()
{
    if (InterlockedCompareExchange(
            &state_now, 
            transitioning,     // new value
            barricade::lowered // comparand
        ) != barricade::lowered)
    {
        return;
    }

    // We've atomically changed state_now from lowered to transitioning.
    ResetEvent(event);

    InterlockedExchange(&state_now, barricade::raised);
}

While changing the state of the event from lowered (signalled) to raised (not signalled), we set state_now to transitioning. Once the event has been reset (un-signalled) by ResetEvent(), state_now is updated to match.

Note that if, upon entry to this function, state_now is raised or transitioning, no state modifications are made, as the InterlockedCompareExchange() call will not perform the exchange.

In the case where state_now is already barricade::raised, it's easy to see that this early-return is fine. Calling raise() on a raised barricade should indeed do nothing.

But what if state_now is equal to transitioning at the point of the InterlockedCompareExchange() call? Well, in this case another thread is also modifying the state of the barricade. Encountering a transitioning state here quite possibly indicates a bug in the calling code, but it's not something that should trouble us, beyond ensuring that our internal state is not rendered inconsistent in this scenario.

In this kind of race, returning from the function upon discovering the transitioning state is to deliberately let the other thread win. Whether the other thread is moving the barricade in the same or the opposite 'direction' doesn't matter: in either case we haven't modified any state, so we can be sure that we haven't corrupted our data.

If you understood the implementation of raise(), you can probably guess what lower() looks like. It's just the same algorithm in reverse:

void barricade::lower()
{
    if (InterlockedCompareExchange(
            &state_now, 
            transitioning,    // new value
            barricade::raised // comparand
        ) != barricade::raised)
    {
        return;
    }

    // We've atomically changed state_now from raised to transitioning.
    SetEvent(event);

    InterlockedExchange(&state_now, barricade::lowered);
}
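With both halves in hand, we can replay the earlier problematic interleaving, this time starting from a lowered barricade:

    InterlockedCompareExchange(&state_now, transitioning, lowered); // Thread 1 (raise): succeeds
    InterlockedCompareExchange(&state_now, transitioning, raised);  // Thread 2 (lower): fails; lower() returns
    ResetEvent(event);                                              // Thread 1
    InterlockedExchange(&state_now, barricade::raised);             // Thread 1

Both state_now and the kernel's event now finish as raised; thread 2's lower() simply lost the race without touching anything. However the calls interleave, only one thread at a time can move state_now out of a stable state.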

Now that we're aware of our three states, the is_lowered() function is simple:

bool barricade::is_lowered() const
{
    COMPILER_LOAD_STORE_BARRIER;
    const bool ret = (state_now == barricade::lowered);
    return ret;
}

COMPILER_LOAD_STORE_BARRIER is defined as follows, which I believe is suitable for x86:

#if defined(_MSC_VER)
#   define COMPILER_LOAD_STORE_BARRIER _ReadWriteBarrier()
#elif defined(__GNUC__)
#   define COMPILER_LOAD_STORE_BARRIER __asm__ __volatile__ ("":::"memory")
#else
#   error "unsupported platform :("
#endif

Again, if state_now is equal to transitioning when entering the is_lowered() function, we're potentially in a race with another thread that's changing the state from raised to lowered or vice versa. But returning false in this case provides a sequentially consistent outcome.
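I mentioned earlier that the standard C++ atomics would work just as well as the Windows intrinsics. For the curious, here's an untested sketch of raise() and is_lowered() in that style, assuming state_now is redeclared as a std::atomic<long> (and <atomic> is included); lower() is the mirror image. The default sequentially-consistent orderings are stronger than strictly needed, but they keep the sketch simple:

// Sketch only: assumes 'std::atomic<long> state_now;' replaces 'long state_now;'.
void barricade::raise()
{
    long expected = barricade::lowered;
    if (!state_now.compare_exchange_strong(expected, transitioning))
        return; // already raised, or another thread is mid-transition

    ResetEvent(event);
    state_now.store(barricade::raised);
}

bool barricade::is_lowered() const
{
    // The atomic load stands in for the hand-rolled compiler barrier.
    return state_now.load() == barricade::lowered;
}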

Finally, we need a wait() method. Here's the simplest way of implementing it:

void barricade::wait()
{
    // Fast path for a lowered barricade. Doesn't touch the kernel.
    if (is_lowered())
        return; 

    // Slow path: have the kernel put us to sleep until the barricade is lowered.
    WaitForSingleObject(event, INFINITE);
}

Depending on how the barricade is used, it might be beneficial to add an initial spin-loop:

void barricade::wait()
{
    for (unsigned spins = 0; spins != 2000; ++spins)
    {
        if (is_lowered())
            return; 

        Sleep(1);
    }

    // Slow path: have the kernel put us to sleep until the barricade is lowered.
    WaitForSingleObject(event, INFINITE);
}

Sleep(1) comes recommended by the Duffmeister, but improvements might be possible by combining it with YieldProcessor() on hyper-threaded machines. I won't bother exploring that option in depth at this time, but I will note that in the benchmarks, using Sleep(1) was noticeably better than both Sleep(0) and SwitchToThread() (for what it's worth).
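That said, if you fancied experimenting, the loop might take a shape like the one below. The iteration counts are plucked from the air and would need tuning against a real workload:

void barricade::wait()
{
    // Briefly busy-wait first. YieldProcessor() emits a pause hint, which is
    // kind to the sibling logical core on a hyper-threaded chip.
    for (unsigned spins = 0; spins != 100; ++spins)
    {
        if (is_lowered())
            return;

        YieldProcessor();
    }

    // Then the Sleep(1) loop from before...
    for (unsigned spins = 0; spins != 2000; ++spins)
    {
        if (is_lowered())
            return;

        Sleep(1);
    }

    // ...before finally sleeping in the kernel until the barricade is lowered.
    WaitForSingleObject(event, INFINITE);
}

Again, I haven't benchmarked this variant.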

Speaking of which…

Benchmark 1: polling

In this benchmark, a number of threads are fired up and the state of a barricade or event is polled 100,000,000 times in each. The state of the barricade does not change at all during the entire run.

The code was built with Microsoft Visual C++ 2010, with /O2 and global optimizations enabled.

AMD Athlon X2 4200+. Dual Core.
Windows XP Home Edition +SP3.
Best of 5 runs.

threads    barricade     event
----------------------------------
   1        0.250s      43.703s
   2        0.265s      98.312s
   4        0.422s     265.672s 


Intel Core i7 870. Quad Core + HyperThreading (8 logical cores).
Windows 7 Professional +SP1.
Best of 5 runs.

threads    barricade     event
----------------------------------
   1        0.177s      30.641s
   2        0.184s      40.141s
   4        0.191s      75.123s
   8        0.293s     213.231s
   16       0.550s     529.012s

Admittedly, this benchmark isn't very realistic, as polling is typically avoided. However, one good use of polling is to see if something is ready before going off and doing something else for a while — just as you might use a try-lock. This situation might arise when you have a graph of tasks to churn through; if the current node is still pending, you might try to lend a hand somewhere else.
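In code, that pattern might look something like this; the task graph and its helpers are hypothetical:

// 'inputs_ready' is a barricade belonging to the node we want to process;
// run_any_pending_task() is an imaginary work-stealing helper.
while (!node.inputs_ready.is_lowered()) // one atomic read; no kernel transition
    run_any_pending_task();             // lend a hand elsewhere in the graph

process(node);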

Given that this poll can be done with a single mov instruction on x86, transitioning to the kernel, which is what an event must do to answer the same question (via WaitForSingleObject()), seems extremely wasteful.

We can also observe that polling a single Windows event from multiple threads appears to create contention internally, since the running time grows roughly in proportion to the number of threads, which is a little surprising.

The times for the barricade on the other hand are pretty much flat until you start using more threads than cores, which is what we'd like to see in the absence of shared writes.

Benchmark 2: hurdles

In this benchmark, a number of threads are fired up. All approach a barricade or event which blocks all but the last thread to arrive. That last thread lowers the barricade to let the waiting threads and itself through to the next 'hurdle' where the process repeats. A total of 20,000,000 hurdles are encountered by each thread.
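A single hurdle might be crossed with something like the sketch below, where an interlocked countdown picks out the releasing thread. Re-arming everything for the next hurdle (which I've omitted here) needs care, e.g. alternating between two barricades:

// 'remaining' starts at the number of participating threads and
// 'hurdle' starts raised.
void cross_hurdle(barricade & hurdle, long & remaining)
{
    if (InterlockedDecrement(&remaining) == 0)
        hurdle.lower(); // last to arrive: release everyone, ourselves included
    else
        hurdle.wait();
}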

Again the code was built with Microsoft Visual C++ 2010, with /O2 and global optimizations enabled.

AMD Athlon X2 4200+. Dual Core.
Windows XP Home Edition +SP3.
Best of 5 runs.

threads    barricade     event
----------------------------------
   1        12.375s     20.969s
   2        12.860s     38.953s
   4        12.687s     79.703s


Intel Core i7 870. Quad Core + HyperThreading (8 logical cores).
Windows 7 Professional +SP1.
Best of 5 runs.

threads    barricade     event
----------------------------------
   1        8.839s      14.513s
   2        9.087s      20.878s
   4        9.285s      29.691s
   8        9.780s      51.323s
   16      10.947s      88.787s

This benchmark exercises all of the functionality of the barricade. Again, you'd find this kind of code in a situation where you are processing a graph of tasks.

While the event slows down as more threads are added, the running times for the barricade appear to be almost independent of the number of threads used (though we'd of course expect this to change if we really cranked up the thread count). Even the single-threaded runs against the event are slower than any of the barricade runs, likely because the barricade can spin-poll efficiently in wait() for a short while before falling back to a kernel call.


Footnotes
  1. aka: it's broken

Comments

Boris

[05/01/2013 at 05:10:22]

It all sounds a bit like what critical sections do, strange you make no mention of them.

Edd

[05/01/2013 at 15:53:27]

CRITICAL_SECTIONs do indeed have a similar spin-loop. But that's probably true of any well-written synchronization primitive, or at least for those designed for purely intraprocess use.

I'm certainly not trying to hide anything or claim to have invented the technique.
