Monday, November 18, 2013

Memory models for Atomic Operations and Optimizations


Atomic Memory Models

The memory model is an important topic for understanding synchronization techniques built on atomic operations.
Each memory model specifies how memory accesses are to be ordered around an atomic operation.
In a typical multi-threaded environment, there are concurrent reads and writes to several variables, and the order in which they become visible can differ among threads.
Understanding the memory models helps in choosing how to constrain memory ordering to achieve low-latency synchronization.

Each of the memory models has some effect on compiler optimizations, but none of them affects optimizations on thread-local data.

Every atomic operation accepts an additional argument: the memory order for that operation.
There are mainly three models:

Sequentially Consistent


This is the default mode when none is specified for an atomic operation, and it is the most restrictive.
It means that all threads observe the atomic operations in a single total order, consistent with each thread's program order.
It can also be explicitly specified using std::memory_order_seq_cst.
Consider the following example:

int a;
atomic_int b, c;
--Thread 1--
a=10;
b.store(5);
c.store(4);

--Thread 2--
if(c.load() == 4)
{
 assert(a==10);
 assert(b.load()==5);
}

Neither of the above assertions can fail.
All stores that precede the store to c in Thread 1, dependent or not, atomic or not, are visible to Thread 2 once its load observes the value 4.
This is the most expensive of the orderings and the safest from a synchronization perspective.
It is required when a single, system-wide order of operations must be visible to all threads.
The compiler cannot move instructions in either direction across the atomic operation (neither above a load nor below a store), and hence it has the largest impact on optimizations.

Relaxed

This doesn't impose any happens-before relationship. Operations surrounding a relaxed operation can become visible to other threads in any order.
In the example above, either of the assertions can fail with relaxed memory ordering.
Relaxed ordering can be specified using std::memory_order_relaxed.
The only ordering imposed is modification-order coherence: once a value written by Thread 1 has been observed in Thread 2, Thread 2 cannot subsequently see an earlier value of that same variable.
For example:

atomic_int a = 0;
int b, c;
--Thread1--
a.store(1, memory_order_relaxed);
a.store(2, memory_order_relaxed);
--Thread2--
b = a.load(memory_order_relaxed);
c = a.load(memory_order_relaxed);
assert(b <= c);

The above assert can never fail.
Once the store of 1 is visible in Thread2, a load there can never return 0 (the initial value).
Similarly, once the store of 2 is visible in Thread2, a later load in the same thread cannot return 1 or 0.

Relaxed is not commonly used for synchronization, but merely to make an operation atomic.
Relaxed mode has no synchronization side effects, so the compiler is free to perform most optimizations.
But it still cannot be treated like an ordinary shared-memory access: the compiler may not split, duplicate, or eliminate the atomic access itself.

Acquire-Release

In acquire-release, load operations use acquire mode (std::memory_order_acquire) and store operations use release mode (std::memory_order_release).
When an acquire load reads the value written by a release store, the two operations synchronize: every store the releasing thread performed before the release, atomic or non-atomic, becomes visible to the acquiring thread after the load.

For example:
int a;
std::atomic_int b, c, d;
--Thread 1--
a=20;
b.store(10, std::memory_order_relaxed);
c.store(30, std::memory_order_relaxed);
d.store(b.load(std::memory_order_relaxed), std::memory_order_release);

--Thread 2--
if(d.load(std::memory_order_acquire) == 10)
 assert(b.load(std::memory_order_relaxed) == 10);//succeeds: the store to b is sequenced before the release store to d
if(d.load(std::memory_order_acquire) == 10)
 assert(a == 20);//succeeds: non-atomic stores before the release are also made visible
if(d.load(std::memory_order_acquire) == 10)
 assert(c.load(std::memory_order_relaxed) == 30);//succeeds: every store sequenced before the release is visible; no dependency is required

In the example above, all three asserts succeed once the acquire load observes the release store on d: everything Thread 1 wrote before that store, the atomic b and c as well as the non-atomic a, is visible in Thread 2.
Restricting visibility to dependent data only is the business of consume ordering, not acquire-release.

Acquire-release does not allow the compiler to move code that precedes a release store to after it, nor code that follows an acquire load to before it.
So there is an optimization impact, but it is smaller than with sequential consistency.

Acquire-Consume

It is similar to acquire-release, but the happens-before ordering applies only to data that carries a dependency from the loaded value; non-dependent variables (both atomic and non-atomic) are not ordered.
Only stores to data the loaded value depends on are guaranteed to be visible after the operation.
In the previous example, if the load of d is invoked with std::memory_order_consume, it is no longer guaranteed that a is seen as 20.

int a;
std::atomic_int b, c, d;
--Thread 1--
a=20;
b.store(10, std::memory_order_relaxed);
c.store(30, std::memory_order_relaxed);
d.store(b.load(std::memory_order_relaxed), std::memory_order_release);

--Thread 2--
if(d.load(std::memory_order_consume) == 10)
 assert(a == 20);//can fail: a carries no dependency from the loaded value
if(d.load(std::memory_order_consume) == 10)
 assert(c.load(std::memory_order_relaxed) == 30);//can fail: c carries no dependency from the loaded value

Note that with plain integer values none of Thread 2's other loads carries a data dependency from d, so consume provides essentially no ordering here; dependency ordering is typically exploited when the loaded value is a pointer that is then dereferenced.

Consume can be faster than acquire at the hardware level on weakly ordered architectures, as it doesn't need a fence to order non-dependent data.
As far as compiler optimizations are concerned, most compilers simply treat consume as an acquire.

All the above models are defined in an enum in the header <atomic> (<cstdatomic> for gcc < 4.7)

enum memory_order {
    memory_order_relaxed,
    memory_order_consume,
    memory_order_acquire,
    memory_order_release,
    memory_order_acq_rel,
    memory_order_seq_cst
};

memory_order_acq_rel is meant for read-modify-write operations (such as exchange or fetch_add): the load part of the operation behaves as an acquire and the store part as a release.
