Mechanical Sympathy

Memory Barriers/Fences

In this article I'll discuss the most fundamental technique in concurrent programming, known as memory barriers or fences, which make the memory state within a processor visible to other processors.

CPUs have employed many techniques to try and accommodate the fact that CPU execution unit performance has greatly outpaced main memory performance.  In my “Write Combining” article I touched on just one of these techniques.  The most common technique employed by CPUs to hide memory latency is to pipeline instructions and then spend significant effort, and resource, on trying to re-order these pipelines to minimise stalls related to cache misses.

When a program is executed it does not matter if its instructions are re-ordered provided the same end result is achieved.  For example, within a loop it does not matter when the loop counter is updated if no operation within the loop uses it.  The compiler and CPU are free to re-order the instructions to best utilise the CPU provided it is updated by the time the next iteration is about to commence.  Also, over the execution of a loop this variable may be stored in a register and never pushed out to cache or main memory, thus it is never visible to another CPU.
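
To make this concrete, consider the following Java sketch (my own illustration for this article, not code from any library).  Without a memory barrier, the reader thread may hoist the read of the running flag into a register and spin forever, never observing the write made by the main thread; declaring the field volatile would insert the necessary barriers.

public class VisibilityExample
{
    // Plain field: the JIT is free to hoist the read of 'running' into a
    // register inside the loop, so the update from main() may never be seen.
    // Declaring it 'volatile' would force a fresh read on each iteration.
    private static boolean running = true;

    public static void main(final String[] args) throws InterruptedException
    {
        final Thread reader = new Thread(() ->
        {
            long iterations = 0;
            while (running) // may be optimised to a register read
            {
                iterations++;
            }
            System.out.println("stopped after " + iterations + " iterations");
        });

        reader.start();
        Thread.sleep(1000);
        running = false; // without a barrier this store may never become visible
        reader.join();   // on some JVMs this join may never return
    }
}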

CPU cores contain multiple execution units.  For example, a modern Intel CPU contains 6 execution units for arithmetic, conditional logic, and memory manipulation, each capable of some combination of these tasks.  These execution units operate in parallel, allowing instructions to be executed in parallel.  This introduces another level of non-determinism to program order when it is observed from another CPU.
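
The classic demonstration of this effect is the following sketch (again my own illustration, not from any library).  Two threads each store to one plain field and then load the other; under a strictly interleaved execution at least one of them must observe the other's store, yet on real hardware both loads can return zero because the stores sit in store buffers while the loads are re-ordered ahead of them.  Reproducing this with raw threads can take many runs; a harness such as jcstress is the practical tool for it.

public class ReorderingExample
{
    // Plain (non-volatile) fields: loads and stores on them may be
    // re-ordered by both the compiler and the CPU.
    private static int x, y, r1, r2;

    public static void main(final String[] args) throws InterruptedException
    {
        for (long run = 0; ; run++)
        {
            x = 0; y = 0; r1 = 0; r2 = 0;

            final Thread t1 = new Thread(() -> { x = 1; r1 = y; });
            final Thread t2 = new Thread(() -> { y = 1; r2 = x; });

            t1.start(); t2.start();
            t1.join(); t2.join();

            if (r1 == 0 && r2 == 0) // impossible under sequential consistency
            {
                System.out.println("re-ordering observed on run " + run);
                break;
            }
        }
    }
}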

Finally, when a cache-miss occurs, a modern CPU can make an assumption on the results of a memory load and continue executing based on this assumption until the load returns the actual data.

Provided “program order” is preserved, the CPU and compiler are free to do whatever they see fit to improve performance.

Figure 1.
private volatile long sequence = RingBuffer.INITIAL_CURSOR_VALUE;

// from inside the run() method

T event = null;
long nextSequence = sequence + 1L; // volatile read inserts a load barrier
while (running)
{
    try
    {
        final long availableSequence = barrier.waitFor(nextSequence);

        while (nextSequence <= availableSequence)
        {
            event = ringBuffer.get(nextSequence);
            boolean endOfBatch = nextSequence == availableSequence;
            eventHandler.onEvent(event, nextSequence, endOfBatch);
            nextSequence++;
        }

        sequence = nextSequence - 1L;
        // store barrier inserted here !!!
    }
    catch (final Exception ex)
    {
        exceptionHandler.handle(ex, nextSequence, event);
        sequence = nextSequence;
        // store barrier inserted here !!!
        nextSequence++;
    }
}