Monday, November 18, 2013

Performance of Atomics with GCC


Atomic Builtins

Atomic builtins were first introduced in GCC 4.1.
Unlike the usual builtin functions, these functions do not use the __builtin_ prefix.
They use the __sync_ prefix instead.
A C++ wrapper around these calls is available in <cstdatomic> (renamed to <atomic> in later GCC releases).
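
As a rough illustration of what the __sync_ family looks like in use, here is a minimal sketch. The counter variable and the three helper functions are hypothetical names made up for this example; only the __sync_ calls themselves are GCC builtins.

```cpp
#include <cassert>

// Hypothetical shared counter, used only for this illustration.
static long counter = 0;

// Atomically add n to the counter and return the value it held before.
long add_to_counter(long n) {
    return __sync_fetch_and_add(&counter, n);
}

// Atomically replace the counter with `desired`, but only if it
// currently holds `expected`. Returns true when the swap happened.
bool replace_counter(long expected, long desired) {
    return __sync_bool_compare_and_swap(&counter, expected, desired);
}

// The __sync_ family has no plain load, so adding zero is a common
// way to read a value atomically.
long read_counter() {
    return __sync_fetch_and_add(&counter, 0);
}
```

Note that none of these calls takes a memory order argument: every __sync_ operation is a full barrier.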

This API guarantees atomicity at the cost of performance.
There was no memory ordering support until GCC 4.7 (refer to "Data-dependency ordering: atomics and memory model" in the GCC 4.6 status).
This caused a major performance hit for the atomic implementation in GCC < 4.7.
One of the major culprits is the load() call.
The following is the source code of the std::atomic::load() implementation from GCC 4.6:

__int_type load(memory_order __m = memory_order_seq_cst) const volatile
{
  __glibcxx_assert(__m != memory_order_release);
  __glibcxx_assert(__m != memory_order_acq_rel);

  __sync_synchronize();
  __int_type __ret = _M_i;
  __sync_synchronize();
  return __ret;
}

__sync_synchronize() performs a full memory barrier.
On x86 it inserts an MFENCE instruction, which is both a store and a load barrier.
So every load operation generates two MFENCE instructions, one before and one after the actual load.
MFENCE is the most expensive of the x86 fence instructions (the others being SFENCE and LFENCE).
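
To see the cost on a given machine, the double-fenced load can be pulled out into a free function and timed. The sketch below is only an assumption of how one might do that: fenced_load and time_loads are hypothetical helpers written for this post, not part of GCC or libstdc++, and the timing uses std::chrono, so it needs C++11.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical stand-in for the GCC 4.6 load() body shown above:
// a full barrier on both sides of an ordinary read.
int fenced_load(const volatile int* p) {
    __sync_synchronize();   // full barrier before the read
    int value = *p;
    __sync_synchronize();   // full barrier after the read
    return value;
}

// Call `load` on `p` the given number of times and return the elapsed
// time in nanoseconds. The loaded values are summed into *sum so the
// loop has an observable result.
template <typename Load>
long long time_loads(Load load, const int* p, long iterations, long long* sum) {
    auto start = std::chrono::steady_clock::now();
    long long total = 0;
    for (long i = 0; i < iterations; ++i) {
        total += load(p);
    }
    auto stop = std::chrono::steady_clock::now();
    *sum = total;
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}
```

Calling time_loads(fenced_load, &x, 1000000, &sum) gives a number that can be compared against any other load function with the same signature. The absolute figures depend on the CPU and compiler flags, so treat them as indicative only.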

Implementation in GCC 4.7

GCC 4.7 introduced a new set of atomic builtin functions, with __atomic_ as the prefix.
These builtins take the memory order as an argument and implement it properly.
With GCC 4.7, atomic operations are about twice as fast as with GCC < 4.7.
This can be understood from the implementation of load() itself.

__int_type load(memory_order __m = memory_order_seq_cst) const noexcept
{
  __glibcxx_assert(__m != memory_order_release);
  __glibcxx_assert(__m != memory_order_acq_rel);

  return __atomic_load_n(&_M_i, __m);
}
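
Because the memory order is now passed through to the builtin, the caller can ask for exactly the ordering that is needed. The following is a minimal sketch of a release/acquire hand-off written directly against the __atomic_ builtins. The payload and ready variables and the three functions are hypothetical names for this illustration; the __atomic_ calls and the __ATOMIC_* constants are the GCC 4.7 builtins.

```cpp
#include <cassert>

// Hypothetical shared state, used only for this illustration.
static int payload = 0;
static int ready = 0;

// Writer: fill in the payload, then publish it with a release store.
void publish(int value) {
    payload = value;
    __atomic_store_n(&ready, 1, __ATOMIC_RELEASE);
}

// Reader: the acquire load pairs with the release store above, so once
// the flag is seen as set, the write to payload is visible as well.
bool try_consume(int* out) {
    if (__atomic_load_n(&ready, __ATOMIC_ACQUIRE) == 0) {
        return false;
    }
    *out = payload;
    return true;
}

// A relaxed load only guarantees that the read itself is atomic.
int peek_ready() {
    return __atomic_load_n(&ready, __ATOMIC_RELAXED);
}
```

On x86 an acquire load like the one in try_consume() compiles to an ordinary MOV with no fence, which is where most of the gain over the double-MFENCE version comes from.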
