Monday, November 18, 2013

Performance of Atomics with GCC


Atomic Builtins

Atomic builtins were first introduced in GCC 4.1.
Unlike the usual builtin functions, these do not use the __builtin_ prefix; they use the __sync_ prefix instead.
A C++ wrapper for these calls is available in the header <cstdatomic>.
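
For illustration, a minimal sketch of the __sync_ builtins in use (the variable and function names here are our own):

#include <cstdio>

long counter = 0;

void update()
{
    __sync_fetch_and_add(&counter, 1); //atomic increment, implies a full barrier
    long old = __sync_val_compare_and_swap(&counter, 1, 5); //atomic compare-and-swap
    __sync_synchronize(); //standalone full memory barrier
    std::printf("old=%ld counter=%ld\n", old, counter);
}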

This API guarantees atomicity at the cost of performance.
There was no memory ordering support until GCC 4.7 (refer to "Data-dependency ordering: atomics and memory model" in the GCC 4.6 status).
This resulted in a major performance impact for the atomic implementation in GCC < 4.7.
One of the major culprits is the load() call.
Following is the source code of the std::atomic::load() implementation from GCC 4.6:

__int_type load(memory_order __m = memory_order_seq_cst) const volatile
{
  __glibcxx_assert(__m != memory_order_release);
  __glibcxx_assert(__m != memory_order_acq_rel);

  __sync_synchronize();
  __int_type __ret = _M_i;
  __sync_synchronize();
  return __ret;
}

__sync_synchronize() performs a full memory barrier.
On x86 it emits an MFENCE instruction, which is both a store and a load barrier.
So every load operation generates two MFENCE instructions, one before and one after the actual load.
MFENCE is the most expensive of the fence instructions (compared to SFENCE and LFENCE).

Implementation in GCC 4.7

GCC 4.7 introduced a new set of atomic builtin functions, prefixed with __atomic_.
These builtins implement memory ordering properly.
As a result, atomic operations in GCC 4.7 can be roughly twice as fast as in GCC < 4.7.
This can be understood from the implementation of load() itself:

__int_type load(memory_order __m = memory_order_seq_cst) const noexcept
{
  __glibcxx_assert(__m != memory_order_release);
  __glibcxx_assert(__m != memory_order_acq_rel);

  return __atomic_load_n(&_M_i, __m);
}
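
For comparison, a minimal sketch of the __atomic_ builtins (names are again our own); the memory order is now an explicit argument, so fences are emitted only where the requested ordering demands them:

long value = 0;

void update()
{
    __atomic_store_n(&value, 42, __ATOMIC_RELEASE); //release store
    long v = __atomic_load_n(&value, __ATOMIC_ACQUIRE); //acquire load; a plain MOV on x86, no MFENCE
    __atomic_add_fetch(&value, v, __ATOMIC_RELAXED); //atomic add with relaxed ordering
}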

Memory models for Atomic Operations and Optimizations


Atomic Memory Models

The memory model is an important topic for understanding synchronization techniques built on atomic operations.
Each memory model specifies how memory accesses are to be ordered around an atomic operation.
In a typical multi-threaded environment there are concurrent reads and writes to several variables, and the order in which they happen can differ between threads.
Understanding the memory models helps in knowing how to constrain memory ordering to attain low-latency synchronization.

Each of the memory models has some effect on compiler optimizations, but none of them affects thread-local optimizations.

Every atomic operation allows an additional argument to be specified, which is the memory model for that operation.
There are mainly three models, plus a consume variant of the third:

Sequentially Consistent


This is the default mode when none is specified for an atomic operation, and it is the most restrictive.
It means that all threads observe a single total order over all sequentially consistent operations, matching the program order within each thread.
It can also be specified explicitly using std::memory_order_seq_cst.
Consider the following example:

int a;
atomic_int b, c;
--Thread 1--
a=10;
b.store(5);
c.store(4);

--Thread 2--
if(c.load() == 4)
{
 assert(a==10);
 assert(b.load()==5);
}

Neither of the above assertions can fail.
All the stores, dependent and non-dependent alike, happen before (and are therefore visible after) the current operation.
This is the most expensive of all the orderings and the safest from a synchronization perspective.
It is required when we need to enforce system-wide consistency between all threads.
The compiler cannot move instructions in either direction (before a store or after a load), and hence there is an impact on optimizations.
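
A classic illustration of where full sequential consistency is actually needed is the store/load pattern below (a sketch of the Dekker-style handshake; the variable names are our own):

#include <atomic>

std::atomic<int> x(0), y(0);
int r1, r2;

--Thread 1--
x.store(1); //memory_order_seq_cst by default
r1 = y.load();

--Thread 2--
y.store(1);
r2 = x.load();

With seq_cst, r1 == 0 && r2 == 0 is impossible once both threads finish, because all four operations fall into one total order. With acquire-release or relaxed ordering, both loads may observe 0.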

Relaxed

This doesn't impose any happens-before restriction; operations around a relaxed access can be observed in any order across threads.
In the example above, either of the assertions can fail with relaxed memory ordering.
Relaxed ordering is specified using std::memory_order_relaxed.
The only ordering imposed is that once a value written by Thread 1 to a variable has been observed in Thread 2, Thread 2 cannot see an earlier value of that variable from Thread 1.
For example:

atomic_int a(0);
int b, c;
--Thread1--
a.store(1, memory_order_relaxed);
a.store(2, memory_order_relaxed);
--Thread2--
b = a.load(memory_order_relaxed);
c = a.load(memory_order_relaxed);
assert(b <= c);

The above assert can never fail.
Once the store of 1 is visible in Thread 2, a can no longer be read there as 0 (the initial value).
Likewise, once the store of 2 is visible in Thread 2, later loads in that thread cannot return 1 or 0.

Relaxed ordering is not commonly used for synchronization, but merely to make an operation atomic.
Relaxed mode has no synchronization side effects, so the compiler can still optimize around it.
But it cannot be treated exactly like a normal shared memory access, as the compiler cannot reorder or eliminate the atomic operation itself. The typical use case is a statistics counter, as sketched below.
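
A minimal sketch of such a counter (the names are our own):

#include <atomic>

std::atomic<unsigned long> events(0);

void on_event()
{
    //Atomic increment, but imposes no ordering on surrounding code.
    events.fetch_add(1, std::memory_order_relaxed);
}

unsigned long snapshot()
{
    //The count may be slightly stale, which is fine for statistics.
    return events.load(std::memory_order_relaxed);
}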

Acquire-Release

In acquire-release, we specify store operations to use release mode (std::memory_order_release) and load operations to use acquire mode (std::memory_order_acquire).
When an acquire load reads the value written by a release store, the two threads synchronize: every store the writing thread issued before the release, atomic and non-atomic alike, becomes visible to the reading thread after the acquire.

For example:
int a;
std::atomic_int b, c, d;
--Thread 1--
a = 20;
b.store(10, std::memory_order_release);
c.store(30, std::memory_order_release);
d.store(b.load(std::memory_order_relaxed), std::memory_order_release);

--Thread 2--
if(d.load(std::memory_order_acquire) == 10)
 assert(b.load(std::memory_order_acquire) == 10); //will succeed: b was stored before the release store to d
if(d.load(std::memory_order_acquire) == 10)
 assert(a == 20); //will succeed: non-atomic stores before the release are also visible
if(d.load(std::memory_order_acquire) == 10)
 assert(c.load(std::memory_order_acquire) == 30); //will also succeed: c too was stored before the release store to d

In the example above, all three asserts succeed: once the acquire load of d observes 10, everything Thread 1 wrote before its release store to d is visible, whether or not it is "dependent" on d.
(With only two threads this looks just like sequential consistency; the difference shows up when more threads need to agree on the order of unrelated stores.)

Acquire-release does not allow the compiler to move memory operations from before a release store to after it, nor from after an acquire load to before it.
So there is still an optimization impact, but a smaller one than with sequential consistency.
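
Acquire-release is exactly the contract a lock needs; a minimal spinlock sketch (our own example):

#include <atomic>

class Spinlock
{
    std::atomic<bool> locked;
public:
    Spinlock() : locked(false) {}
    void lock()
    {
        //The acquire keeps everything after lock() from moving before it.
        while (locked.exchange(true, std::memory_order_acquire))
            ; //spin
    }
    void unlock()
    {
        //The release keeps everything before unlock() from moving after it.
        locked.store(false, std::memory_order_release);
    }
};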

Acquire-Consume

It is similar to acquire-release, but there is no happens-before ordering for non-dependent variables (atomic or non-atomic).
Only stores to data on which the loaded value carries a dependency are guaranteed to be visible after the operation, where "carries a dependency" means the reading thread computes something from the loaded value.
In the previous example, if the load of d is invoked with std::memory_order_consume, it is no longer guaranteed that a will be observed as 20:

int a;
std::atomic_int b, c, d;
--Thread 1--
a = 20;
b.store(10, std::memory_order_release);
c.store(30, std::memory_order_release);
d.store(b.load(std::memory_order_relaxed), std::memory_order_release);

--Thread 2--
if(d.load(std::memory_order_consume) == 10)
 assert(b.load(std::memory_order_relaxed) == 10); //can fail: this read of b carries no dependency on the value loaded from d
if(d.load(std::memory_order_consume) == 10)
 assert(a == 20); //can fail: non-dependent variables don't get happens-before ordering
if(d.load(std::memory_order_consume) == 10)
 assert(c.load(std::memory_order_relaxed) == 30); //can fail: c is not dependent on the loaded value

Consume can be faster than acquire at the hardware level, as the processor only has to order accesses that depend on the loaded value; on most architectures the data dependency itself enforces that ordering.
As far as compiler optimizations are concerned, consume is treated like an acquire (most compilers in fact promote it to one). The pattern consume is designed for is pointer publication, where the reader dereferences the loaded pointer.
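
A sketch of that pattern (the Payload type and the values are our own):

#include <atomic>
#include <cassert>

struct Payload { int value; };
std::atomic<Payload*> ptr(nullptr);

--Thread 1--
Payload* p = new Payload{20};
ptr.store(p, std::memory_order_release);

--Thread 2--
Payload* p = ptr.load(std::memory_order_consume);
if (p)
 assert(p->value == 20); //ordered: p->value carries a data dependency on p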

All the above memory orders are defined in an enum in the header <atomic> (<cstdatomic> for GCC < 4.7):

enum memory_order {
    memory_order_relaxed,
    memory_order_consume,
    memory_order_acquire,
    memory_order_release,
    memory_order_acq_rel,
    memory_order_seq_cst
};

memory_order_acq_rel combines the two: a read-modify-write operation (such as exchange or fetch_add) performs an acquire for its load half and a release for its store half.
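
A typical use is a reference count, where the decrement must both observe the other threads' updates and publish this thread's writes; a minimal sketch (names are our own):

#include <atomic>

std::atomic<int> refcount(1);

void release_ref()
{
    //fetch_sub is a read-modify-write: acquire for the load half,
    //release for the store half.
    if (refcount.fetch_sub(1, std::memory_order_acq_rel) == 1)
    {
        //last reference dropped; safe to destroy the shared object here
    }
}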

Changing Shared Memory Size in Linux

View Shared Memory limits

Use the ipcs command to view the limits.

ipcs -ml
Output of the above command will look like:

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 67108864
max total shared memory (kbytes) = 17179869184
min seg size (bytes) = 1

Changing shared memory size

There are three parameters we should be aware of:

SHMMAX

This defines the maximum size, in bytes, of a single shared memory segment that a Linux process can allocate. To set this value, write the desired value to /proc/sys/kernel/shmmax:

echo 2147483648 > /proc/sys/kernel/shmmax
We can also use sysctl to change it.
sysctl -w kernel.shmmax=1073741824

SHMMNI

This defines the system-wide maximum number of shared memory segments.
To change:

sysctl -w kernel.shmmni=4096

SHMALL

This parameter sets the total amount of shared memory, in pages, that can be used system-wide.
Hence, SHMALL should always be at least ceil(shmmax / PAGE_SIZE). For example, with a 4096-byte page size, a shmmax of 2147483648 bytes requires a shmall of at least 524288 pages. To change:

sysctl -w kernel.shmall=2097152
If not sure what the default PAGE_SIZE is, run the following command:
getconf PAGE_SIZE

To make the change permanent, edit the /etc/sysctl.conf file.

echo "kernel.shmmax=2147483648" >> /etc/sysctl.conf
To load sysctl.conf after the change:
sysctl -p
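
To see these limits from a program's point of view, here is a minimal sketch (our own example) using the System V shmget() call; if the requested size exceeds SHMMAX, shmget() fails with EINVAL:

#include <sys/ipc.h>
#include <sys/shm.h>
#include <cstdio>

int main()
{
    size_t size = 1UL << 31; //2 GB; must be <= SHMMAX to succeed
    int shmid = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (shmid == -1)
    {
        std::perror("shmget"); //EINVAL if size > SHMMAX
        return 1;
    }
    shmctl(shmid, IPC_RMID, NULL); //mark the segment for removal
    return 0;
}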

Fast Callback using variadic templates


C++11 has a powerful callback mechanism included in the header <functional>.
It is a strongly typed callback mechanism, and as such it must know the types of the callee object and of the arguments.
Many times we come across situations where we need to invoke a callback (like an event handler) without knowing the exact type of the callee.
We could impose a restriction on the callee's type, requiring it to implement a particular interface.
That is the traditional approach to this problem, using dynamic polymorphism.
Dynamic polymorphism is not recommended for low-latency applications, as it involves a vtable lookup every time a function is invoked.
We can instead create a Functor class which achieves the same result as dynamic polymorphism, with performance comparable to std::function.
Following is an implementation of such a functor class.

Functor using variadic templates

Our functor should hold a pointer to the callee object and a member function pointer.
It then has the function call operator overloaded to take any number of arguments and simply pass them on to the member method.
We use template parameters for both types.
A simple implementation is shown below:

template<typename Client, typename Memfunc>
class Functor
{
private:
    Client* m_client;  //callee object
    Memfunc m_func;    //pointer to member function
public:
    Functor(Client* client, Memfunc memfunc) : m_client(client), m_func(memfunc)
    {
    }

    //Forward any number of arguments to the stored member function.
    template<typename... T>
    void operator()(T... args)
    {
        (m_client->*m_func)(args...);
    }
};

Usage of this functor is shown below:

#include <iostream>

class Receiver
{
public:
    void singleArg(int i)
    {
        std::cout << "Single arg: " << i << "\n";
    }
    void doubleArg(int i, double j)
    {
        std::cout << "double arg: " << i << ":" << j << "\n";
    }
};

int main()
{
    Receiver receiver;
    Functor<Receiver, decltype(&Receiver::singleArg)> f1(&receiver, &Receiver::singleArg);
    f1(10);
    Functor<Receiver, decltype(&Receiver::doubleArg)> f2(&receiver, &Receiver::doubleArg);
    f2(10, 3.14);
}

This works well for invoking member methods with any arguments.
But what if we want to save an instance of Functor and invoke it later?
The problem arises because the Functor type depends on Client and Memfunc:
anywhere we want to use a Functor, we have to mention Client and Memfunc.
To get rid of this dependency, we can introduce a base class that abstracts away both types.

class FunctorBase
{
public:
    void* _client;
    typedef void (FunctorBase::*memfunc)();
    char _ppMemfunc[sizeof(memfunc)];

    FunctorBase(void* cl, void* f) : _client(cl)
    {
        memcpy(_ppMemfunc, f, sizeof(memfunc));
    }

    template<typename... T>
    virtual void operator()(T... args) = 0; //a virtual template: does not compile
};

Alas! We get a compilation error, with gcc complaining "templates may not be 'virtual'".
We can instead forward the base class call to the derived class by means of a static method.
Since static methods are not really associated with an object, we can use one to forward the call.
Here is the complete implementation:

#include <cstring>

//Encapsulates the callee and the member method.
//The member method is copied into a byte array, which is used later to invoke it.
class CallbackData
{
public:
    typedef void (CallbackData::*memfunc)();
    CallbackData(const void* pClient_, const void* ppMemfunc_) : _pClient(pClient_)
    {
        //Copies sizeof(memfunc) bytes, so ppMemfunc_ must point to a buffer
        //at least that large.
        memcpy(_ppMemfunc, ppMemfunc_, sizeof(memfunc));
    }
    const void* pClient() const
    {
        return _pClient;
    }
    const void* ppMemfunc() const
    {
        return _ppMemfunc;
    }
protected:
    char _ppMemfunc[sizeof(memfunc)];
    const void* _pClient;
};

//Base class of the functor, which just invokes the saved member method
//through the hook function pointer supplied by the derived class.
template<typename ReturnType, typename... Param>
class FunctorBase
{
public:
    CallbackData data;
    typedef ReturnType (*Fptr)(CallbackData, Param...);
    Fptr func;

    FunctorBase(const void* client, const void* memfunc, Fptr f) : data(client, memfunc), func(f)
    {
    }

    template<typename... T>
    ReturnType operator()(T... args)
    {
        return (func)(data, args...);
    }
};


template<typename Client, typename Memfunc, typename ReturnType, typename... Param>
class Functor : public FunctorBase<ReturnType, Param...>
{
public:
    //Hooks a public static method to the base class, so that the base has no
    //dependency on Client or Memfunc. Inside the static method, the callee and
    //member method are recovered and the method is invoked.
    Functor(Client* client, Memfunc* memfunc) : FunctorBase<ReturnType, Param...>(client, memfunc, hook)
    {
    }

    template<typename... T>
    static ReturnType hook(CallbackData data, T... args)
    {
        Client* pClient = (Client*)data.pClient();
        Memfunc& pmemfunc = (*(Memfunc*)(data.ppMemfunc()));
        return (pClient->*pmemfunc)(args...);
    }
};

Now the base class only requires the return and parameter types, not the client object or the member method.
We can use the functor in the following way:

typedef void (Receiver::*SFP)(int);
SFP temp = &Receiver::singleArg;
FunctorBase<void, int>* f = new Functor<Receiver, SFP, void, int>(&receiver, &temp);
(*f)(10);

The flexibility of the above code is that FunctorBase<ReturnType, Param...> does not depend on the actual functor implementation.
We can assign any functor object to such a base pointer at runtime and get it executed.
So a class which needs to invoke a callback can hold a FunctorBase<ReturnType, Param...>* and invoke any callback without knowing the exact type of the callee object.
Currently we only have support for member methods.
To add support for static methods or non-member functions, we just need to provide a template specialization of our functor:

template<typename Memfunc, typename ReturnType, typename... Param>
class Functor<void*, Memfunc, ReturnType, Param...> : public FunctorBase<ReturnType, Param...>
{
public:
    Functor(Memfunc* memfunc) : FunctorBase<ReturnType, Param...>(NULL, memfunc, hook)
    {
    }

    template<typename... T>
    static ReturnType hook(CallbackData data, T... args)
    {
        Memfunc& pmemfunc = (*(Memfunc*)(data.ppMemfunc()));
        return (*pmemfunc)(args...);
    }
};
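
We can exercise the specialization as sketched below (freeFunc is our own example). One caveat: CallbackData always copies sizeof(CallbackData::memfunc) bytes, which is typically larger than a plain function pointer, so the sketch parks the pointer inside a buffer of that size:

#include <cstring>
#include <iostream>

void freeFunc(int i)
{
    std::cout << "free function: " << i << "\n";
}

int main()
{
    typedef void (*FFP)(int);
    char buf[sizeof(CallbackData::memfunc)] = {0};
    FFP fp = &freeFunc;
    std::memcpy(buf, &fp, sizeof fp); //pad the function pointer to the copied size

    Functor<void*, FFP, void, int> fn((FFP*)buf);
    FunctorBase<void, int>* f = &fn;
    (*f)(10);
}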

The performance of this functor implementation is almost the same as std::function: invoking the callback costs nothing more than a call through a function pointer. A rough way to check this is a micro-benchmark like the sketch below.
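
A rough micro-benchmark sketch (our own; absolute numbers will vary with machine and compiler flags, and the optimizer may partially simplify the loops):

#include <chrono>
#include <functional>
#include <iostream>

class Counter
{
public:
    long n;
    Counter() : n(0) {}
    void bump(int i) { n += i; }
};

int main()
{
    const long N = 100000000;
    Counter counter;

    typedef void (Counter::*MFP)(int);
    MFP mf = &Counter::bump;
    Functor<Counter, MFP, void, int> fn(&counter, &mf);
    FunctorBase<void, int>* f = &fn;

    std::function<void(int)> sf = std::bind(&Counter::bump, &counter, std::placeholders::_1);

    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < N; ++i) (*f)(1);
    auto t1 = std::chrono::steady_clock::now();
    for (long i = 0; i < N; ++i) sf(1);
    auto t2 = std::chrono::steady_clock::now();

    std::cout << "Functor:       " << std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count() << " ms\n";
    std::cout << "std::function: " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() << " ms\n";
    std::cout << "checksum: " << counter.n << "\n"; //keeps the loops from being optimized away entirely
}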