Mechanical Sympathy
Memory Barriers/Fences
In this article I'll discuss the most fundamental technique in concurrent programming, known as memory barriers or fences, which make the memory state within a processor visible to other processors.
CPUs have employed many techniques to try to accommodate the fact that
CPU execution unit performance has greatly outpaced main memory
performance. In my “Write Combining”
article I touched on just one of these techniques. The most common
technique employed by CPUs to hide memory latency is to pipeline
instructions and then spend significant effort, and resources, trying
to re-order these pipelines to minimise stalls related to cache misses.
When a program is executed it does not matter if its instructions are
re-ordered, provided the same end result is achieved. For example,
within a loop it does not matter when the loop counter is updated if no
operation within the loop uses it. The compiler and CPU are free to
re-order the instructions to best utilise the CPU, provided the counter is
updated by the time the next iteration is about to commence. Also, over the
execution of a loop, this variable may be held in a register and never
pushed out to cache or main memory, and so it is never visible to another
CPU.
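To make the loop-counter point concrete, here is a minimal sketch (class and method names are illustrative, not from the Disruptor):

```java
public class LoopExample
{
    // The JIT compiler is free to keep both `i` and `total` in registers
    // for the duration of the loop; neither value needs to reach cache or
    // main memory until the method returns, so another CPU would never
    // observe the intermediate counter values.
    static long sum(final int n)
    {
        long total = 0;
        for (int i = 0; i < n; i++)
        {
            total += i; // exactly when `i` is incremented is invisible here
        }
        return total;
    }

    public static void main(final String[] args)
    {
        System.out.println(sum(10)); // prints 45
    }
}
```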
CPU cores contain multiple execution units. For example, a modern Intel
CPU contains 6 execution units, each of which can do some combination of
arithmetic, conditional logic, and memory manipulation. These execution
units operate in parallel, allowing instructions to be executed in
parallel. This introduces another level of non-determinism to program
order as observed from another CPU.
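Java offers no direct view of the execution units, but modern Java (9+) does expose the barriers used to constrain this re-ordering via the static fence methods on java.lang.invoke.VarHandle. A hedged sketch follows; class and field names are illustrative, and a real multi-threaded reader would also need at least an opaque read of `ready`:

```java
import java.lang.invoke.VarHandle;

public class FencedPublish
{
    int payload;   // plain field
    boolean ready; // plain field

    void publish(final int value)
    {
        payload = value;
        VarHandle.releaseFence(); // store barrier: the payload store cannot
                                  // be re-ordered past this point
        ready = true;
    }

    int consume()
    {
        while (!ready)
        {
            Thread.onSpinWait();
        }
        VarHandle.acquireFence(); // load barrier: the payload load cannot
                                  // be re-ordered before this point
        return payload;
    }

    public static void main(final String[] args)
    {
        final FencedPublish p = new FencedPublish();
        p.publish(42);
        System.out.println(p.consume()); // prints 42
    }
}
```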
Finally, when a cache-miss occurs, a modern CPU can make an assumption
about the result of a memory load and continue executing speculatively,
based on this assumption, until the load returns the actual data.
Provided “program order” is preserved, the CPU and compiler are free to do whatever they see fit to improve performance.
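In Java, the simplest way to ask for these barriers is a volatile field: a volatile store acts as a store barrier and a volatile load as a load barrier. A minimal two-thread sketch (field and class names are illustrative):

```java
public class VolatilePublish
{
    static int data;               // plain field
    static volatile boolean ready; // volatile field

    public static void main(final String[] args) throws InterruptedException
    {
        final Thread writer = new Thread(() ->
        {
            data = 42;    // plain store
            ready = true; // volatile store: the store barrier ensures `data`
                          // is visible before `ready`
        });
        writer.start();

        while (!ready) // volatile load: load barrier
        {
            Thread.onSpinWait();
        }
        System.out.println(data); // guaranteed to print 42
        writer.join();
    }
}
```

Without the volatile qualifier the reader could spin forever, or observe `ready` as true while still seeing a stale `data`.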
Figure 1.
private volatile long sequence = RingBuffer.INITIAL_CURSOR_VALUE;

// from inside the run() method
T event = null;
long nextSequence = sequence + 1L;
while (running)
{
    try
    {
        final long availableSequence = barrier.waitFor(nextSequence);
        while (nextSequence <= availableSequence)
        {
            event = ringBuffer.get(nextSequence);
            boolean endOfBatch = nextSequence == availableSequence;
            eventHandler.onEvent(event, nextSequence, endOfBatch);
            nextSequence++;
        }
        sequence = nextSequence - 1L;
        // store barrier inserted here !!!
    }
    catch (final Exception ex)
    {
        exceptionHandler.handle(ex, nextSequence, event);
        sequence = nextSequence;
        // store barrier inserted here !!!
        nextSequence++;
    }
}
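The volatile writes to sequence above are what insert the store barriers. The same publication idea can be sketched with java.util.concurrent.atomic.AtomicLong, whose set() is a volatile store and get() a volatile load; this is a rough stand-in, not the Disruptor's actual Sequence class, and the initial value of -1 assumes the usual cursor convention:

```java
import java.util.concurrent.atomic.AtomicLong;

public class SequencePublish
{
    // Stand-in for the volatile sequence field in Figure 1; -1 is assumed
    // as the initial cursor value.
    static final AtomicLong sequence = new AtomicLong(-1L);

    public static void main(final String[] args)
    {
        long nextSequence = sequence.get() + 1L; // volatile load: load barrier
        // ... process events up to nextSequence ...
        sequence.set(nextSequence);              // volatile store: store barrier
        System.out.println(sequence.get());      // prints 0
    }
}
```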