Monday, December 9, 2013

Force inline functions in C++ (GCC)


Inline functions

In C++, calls to inline functions are replaced by the complete body of the function, similar to macro expansion. This step, performed by the compiler, is called inline expansion. The major advantage of an inline function is that there is no function call overhead, which can give a performance boost.

The compiler considers all functions whose definitions appear in a header for inline expansion. For member functions that are not defined in header files, we can ask the compiler to consider them for inlining by using the inline keyword in the function declaration:

inline int calculate(int x, int y);

The above method only requests the compiler to consider a function for inline expansion; the compiler is free to inline it or not. So when do we need to force the compiler to always inline? The compiler decides heuristically which functions are worth inlining, and the size of a function plays a major role. If a function is large enough that the compiler may decide not to inline it, yet it is invoked frequently, it is a good candidate for forced inlining.

Force Inline

In GCC, you can use the always_inline function attribute to force a function to always be inlined. The attribute is applied to a declaration that is also marked inline:

inline int calculate(int x, int y) __attribute__((always_inline));

GCC Performance consideration for inline

Using forced inlining is not always a good idea; the compiler usually makes good choices. GCC provides a set of performance options related to inlining:
  • -fno-inline
    Do not expand any function inline, apart from those marked with the always_inline attribute.
    This is the default when no optimizations are performed.
  • -finline-small-functions
    Integrate functions into their callers when their body is smaller than the expected function call code (so the overall size of the program gets smaller).
    The compiler heuristically decides which functions are simple enough to be worth integrating in this way.
    This inlining applies to all functions, even those not declared inline.
    Enabled at level -O2.
  • -findirect-inlining
    Also inline indirect calls that are discovered to be known at compile time thanks to previous inlining.
    This option has an effect only when inlining itself is turned on by the -finline-functions or -finline-small-functions options.
    Enabled at level -O2.
  • -finline-functions
    Consider all functions for inlining, even if they are not declared inline.
    The compiler heuristically decides which functions are worth integrating in this way.
    If all calls to a given function are integrated, and the function is declared static, then the function is normally not output as assembler code in its own right.
    Enabled at level -O3.
  • -finline-functions-called-once
    Consider all static functions called once for inlining into their caller, even if they are not marked inline.
    If a call to a given function is integrated, then the function is not output as assembler code in its own right.
    Enabled at levels -O1, -O2, -O3 and -Os.
  • -fearly-inlining
    Inline functions marked always_inline (forced inline), and functions whose body seems smaller than the function call overhead, early, before doing -fprofile-generate instrumentation and the real inlining pass.
    Doing so makes profiling significantly cheaper, and usually makes inlining faster on programs with large chains of nested wrapper functions.
    Enabled by default.
  • -finline-limit=n
    By default, GCC limits the size of functions that can be inlined.
    This flag allows coarse control of this limit. n is the size of functions that can be inlined, in number of pseudo instructions.
    A pseudo instruction represents, in this particular context, an abstract measurement of a function's size.
    In no way does it represent a count of assembly instructions, and as such its exact meaning might change from one release to another.

Monday, November 18, 2013

Performance of Atomics with GCC


Atomic Builtins

Atomic builtins were first introduced in GCC 4.1.
Unlike the usual builtin functions, these do not use the __builtin_ prefix.
They use a __sync_ prefix instead.
A C++ wrapper for these calls is available under <cstdatomic>.

This API guarantees atomicity at the cost of performance.
There was no memory ordering support until GCC 4.7 (refer to "Data-dependency ordering: atomics and memory model" in the GCC 4.6 status).
This resulted in a major performance impact for the atomic implementation in GCC < 4.7.
One of the major culprits is the load() call.
The following is the source code of the std::atomic::load() implementation from GCC 4.6:

__int_type load(memory_order __m = memory_order_seq_cst) const volatile
{
  __glibcxx_assert(__m != memory_order_release);
  __glibcxx_assert(__m != memory_order_acq_rel);

  __sync_synchronize();
  __int_type __ret = _M_i;
  __sync_synchronize();
  return __ret;
}

__sync_synchronize() performs a full memory barrier.
On x86, it emits an MFENCE instruction, which is both a store and a load barrier.
So every load operation generates two MFENCE instructions, one before and one after the actual load.
MFENCE is the most expensive of the fence instructions (compared to SFENCE and LFENCE).

Implementation in GCC 4.7

GCC 4.7 introduced a new set of atomic builtin functions, with __atomic_ as the prefix.
These builtins have properly implemented memory ordering.
The GCC 4.7 implementation of the atomic builtins is roughly twice as fast as that of GCC < 4.7.
This can be understood from the implementation of load() itself:

__int_type load(memory_order __m = memory_order_seq_cst) const noexcept
{
  __glibcxx_assert(__m != memory_order_release);
  __glibcxx_assert(__m != memory_order_acq_rel);

  return __atomic_load_n(&_M_i, __m);
}

Memory models for Atomic Operations and Optimizations


Atomic Memory Models

The memory model is an important topic for understanding synchronization techniques built on atomic operations.
Each memory model specifies how memory accesses are to be ordered around an atomic operation.
In a typical multi-threaded environment, there are concurrent reads and writes of several variables, and the order in which they become visible can differ between threads.
Understanding the memory models helps in knowing how to constrain the memory ordering to attain low latency synchronization.

Each of the memory models has some effect on compiler optimizations.
But none of them has any effect on thread-local optimizations.

Every atomic operation accepts an additional argument: the memory model for that operation.
The main models are:

Sequentially Consistent

This is the default mode when none is specified for an atomic operation, and it is the most restrictive.
It means that the same program order is visible to all threads.
It can also be specified explicitly using std::memory_order_seq_cst.
Consider the following example:

int a;
atomic_int b, c;
--Thread 1--
a=10;
b.store(5);
c.store(4);

--Thread 2--
if(c.load() == 4)
{
 assert(a==10);
 assert(b.load()==5);
}

Neither of the above assertions can fail.
All stores, both dependent and non-dependent, happen before (and so are visible after) the current operation.
This is the most expensive mode of all, and the safest from a synchronization perspective.
It is required to enforce a system-wide consistency between all threads.
The compiler cannot move instructions in either direction (before a store or after a load), and hence this has an impact on optimizations.

Relaxed

This mode imposes no happens-before restrictions: operations before a relaxed operation can become visible in any order across threads.
In the example above, either of the assertions can fail with relaxed memory ordering.
Relaxed ordering can be specified using std::memory_order_relaxed.
The only ordering imposed is that once a value written to a variable by thread 1 is observed in thread 2, thread 2 cannot subsequently see an earlier value of that variable from thread 1.
For eg:

atomic_int a=0, b, c;
--Thread1--
a.store(1, memory_order_relaxed)
a.store(2, memory_order_relaxed)
--Thread2--
b=a.load(memory_order_relaxed)
c=a.load(memory_order_relaxed)
assert(b<=c)

The above assert can never fail.
Once the store of 1 is visible in Thread2, a later load can never return 0 (the initial value).
Likewise, once the store of 2 is visible in Thread2, a later load in the same thread cannot return 1 or 0.

Relaxed is not commonly used for synchronization, but merely to make an operation atomic.
Relaxed mode has no synchronization side effects, and the compiler can perform optimizations around it.
But it cannot be treated exactly like a normal shared memory access, as the compiler cannot freely reorder atomic operations.

Acquire-Release

In acquire-release, store operations use release mode (std::memory_order_release) and load operations use acquire mode (std::memory_order_acquire).
When an acquire load reads the value written by a release store, every store that the writing thread performed before the release store, atomic and non-atomic alike, becomes visible to the loading thread.

For eg:
int a;
std::atomic_int b, c, d;
--Thread 1--
a=20;
b.store(10, std::memory_order_release);
c.store(30, std::memory_order_release);
d.store(b.load(std::memory_order_acquire), std::memory_order_release);

--Thread 2--
if(d.load(std::memory_order_acquire) == 10)
 assert(b.load(std::memory_order_acquire) == 10);//will succeed: b is stored before the release store of d
if(d.load(std::memory_order_acquire) == 10)
 assert(a == 20);//will succeed: non-atomic stores before the release also follow happens-before
if(d.load(std::memory_order_acquire) == 10)
 assert(c.load(std::memory_order_acquire) == 30);//will succeed: every store before the release is visible

In the example above, all three asserts succeed: once the acquire load of d observes the value written by the release store, every store the writing thread performed earlier is visible, whether atomic (b, c) or non-atomic (a).

Acquire-release does not allow the compiler to move code that precedes a release store after it, or code that follows an acquire load before it.
So there is an optimization impact, but it is smaller than with sequential consistency.

Acquire-Consume

It is similar to acquire-release, but there is no happens-before ordering for non-dependent variables (both atomic and non-atomic).
Only reads that depend on the consumed value are ordered after the operation.
In the previous example, if the load on d is invoked with std::memory_order_consume, then a is no longer guaranteed to be 20.

int a;
std::atomic_int b, c, d;
--Thread 1--
a=20;
b.store(10, std::memory_order_release);
c.store(30, std::memory_order_release);
d.store(b.load(std::memory_order_acquire), std::memory_order_release);

--Thread 2--
if(d.load(std::memory_order_consume) == 10)
 assert(b.load(std::memory_order_consume) == 10);//will succeed as d's value is dependent on b
if(d.load(std::memory_order_consume) == 10)
 assert(a == 20);//can fail: non-dependent variables don't follow happens-before
if(d.load(std::memory_order_consume) == 10)
 assert(c.load(std::memory_order_consume) == 30);//can fail as c is not dependent on d

Consume can be faster than acquire at the hardware level, as it doesn't need to make the non-dependent stores visible.
As far as compiler optimizations are concerned, it is treated as an acquire; in fact, current compilers typically promote consume to acquire.

All the above models are defined in an enum in the header <atomic> (<cstdatomic> for gcc < 4.7)

enum memory_order {
    memory_order_relaxed,
    memory_order_consume,
    memory_order_acquire,
    memory_order_release,
    memory_order_acq_rel,
    memory_order_seq_cst
};

memory_order_acq_rel is used with read-modify-write operations: the load part of the operation acts as an acquire and the store part acts as a release.

Changing Shared Memory size in Linux

View Shared Memory limits

Use the ipcs command to view the limits.

ipcs -ml
Output of the above command will look like:

------ Shared Memory Limits --------                                                                                                                                                          
max number of segments = 4096                                                                                                                                                                 
max seg size (kbytes) = 67108864                                                                                                                                                              
max total shared memory (kbytes) = 17179869184                                                                                                                                                
min seg size (bytes) = 1                                                                                                                                                                      

Changing shared memory size

There are 3 parameters which we should be aware of:

SHMMAX

This defines the maximum size in bytes of a single shared memory segment that a Linux process can allocate. To set this value, write the desired value to /proc/sys/kernel/shmmax:

echo 2147483648 > /proc/sys/kernel/shmmax
We can also use sysctl to change it.
sysctl -w kernel.shmmax=1073741824

SHMMNI

This defines the system wide maximum number of shared memory segments.
To change:

sysctl -w kernel.shmmni=4096

SHMALL

This parameter sets the total amount of shared memory, in pages, that can be used system wide.
Hence, SHMALL should always be at least ceil(shmmax/PAGE_SIZE). To change:

sysctl -w kernel.shmall=2097152
If you are not sure what the PAGE_SIZE is, run the following command:
getconf PAGE_SIZE
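For example, to compute a suitable minimum SHMALL (in pages) for a 2 GB SHMMAX, using the page size reported by getconf:

```shell
SHMMAX_BYTES=2147483648                 # desired SHMMAX: 2 GB
PAGE_SIZE=$(getconf PAGE_SIZE)          # typically 4096
echo $((SHMMAX_BYTES / PAGE_SIZE))      # minimum SHMALL, in pages
```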

To make a permanent change, edit /etc/sysctl.conf file.

echo "kernel.shmmax=2147483648" >> /etc/sysctl.conf
To load sysctl.conf after the change:
sysctl -p

Fast Callback using variadic templates


C++11 includes a powerful callback mechanism in the header functional.
It is a strongly typed callback mechanism, and as such it must know the types of the callee object and the arguments.
Many times we come across situations where we need to invoke a callback (like an event handler) without knowing the exact type of the callee.
We could impose a restriction on the type of the callee, requiring it to implement a particular interface.
This is the traditional approach to solving this problem, using dynamic polymorphism.
Dynamic polymorphism is not recommended for low latency applications, as it involves a vtable lookup every time a function is invoked.
We can create a Functor class which achieves the same results as dynamic polymorphism, but with the same performance as std::function.
The following is an implementation of such a functor class.

Functor using variadic templates

Our functor should hold a pointer to the callee object and a member function object.
It overloads the function call operator to take any number of arguments and simply passes them on to the member method.
We will have template parameters for both types.
A simple implementation is shown below:

template<typename Client, typename Memfunc>
class Functor 
{
private:
    Client* m_client;
    Memfunc m_func;
public:
    
    Functor(Client* client, Memfunc memfunc):m_client(client), m_func(memfunc)
    {        
    }

    template<typename ...T>                       
    void operator()(T... args)
    {
        (m_client->*m_func)(args...);
    }
};

Usage of this functor is shown below:

class Receiver
{
    public:
        void singleArg(int i)
        {
            std::cout<<"Single arg: "<<i<<"\n";
        }
        void doubleArg(int i, double j)
        {
            std::cout<<"double arg: "<<i<<":"<<j<<"\n";
        }

};
Receiver receiver;
Functor<Receiver, decltype(&Receiver::singleArg)> f1(&receiver, &Receiver::singleArg);
f1(10);
Functor<Receiver, decltype(&Receiver::doubleArg)> f2(&receiver, &Receiver::doubleArg);
f2(10, 3.14);

This works well for invoking member methods with any arguments.
But what if we want to save an instance of Functor and invoke it later?
The problem arises because the Functor type depends on Client and Memfunc:
anywhere we use Functor, we have to mention Client and Memfunc.
To get rid of this dependency, we can have a base class which abstracts out both types.

class FunctorBase
{
public:
    void *_client;    
    typedef void ( FunctorBase::*memfunc ) ();
    char _ppMemfunc[sizeof ( memfunc )];
    
    FunctorBase(void *cl, void *f):_client(cl)
    {
         memcpy ( _ppMemfunc, f, sizeof ( memfunc ) );
    }
    
    template<typename... T>
    virtual void operator()(T... args)=0;
};

Alas! We get a compilation error, with GCC complaining: "templates may not be 'virtual'".
We can instead forward the base class call to the derived class by means of a static method.
Since static methods are not really associated with an object, we can use one for forwarding the call.
Here is the complete implementation:

//Encapsulates the callee and member method.
//The member method is copied into a byte array, which will be used later to invoke it.
class CallbackData
{
public:
 typedef void ( CallbackData::*memfunc ) ();
 CallbackData ( const void* pClient_, const void* ppMemfunc_ ) : _pClient ( pClient_ )
 {
  memcpy ( _ppMemfunc,ppMemfunc_,sizeof ( memfunc ) );
 }
 const void* pClient() const
 {
  return _pClient;
 }
 const void* ppMemfunc() const
 {
  return _ppMemfunc;
 }
protected:
 char _ppMemfunc[sizeof ( memfunc )];
 const void* _pClient;
};

//Base class of functor, which just invokes the saved member method.
template<typename ReturnType, typename... Param>                
class FunctorBase
{
public:
    CallbackData data;
    typedef ReturnType (*Fptr)(CallbackData, Param...);
    Fptr func;
    
    FunctorBase(const void* client, const void* memfunc, Fptr f):data(client, memfunc), func(f)
    {
    }
    
    template<typename... T>
    ReturnType operator()(T... args)
    {
        return (func)(data, args...);
    }
};


template<typename Client, typename Memfunc, typename ReturnType, typename... Param>
class Functor : public FunctorBase<ReturnType, Param...>
{
public:
    //Hooks a public static method into the base class, so that there is no type dependency.
    //Inside the static method, the callee and member method are recovered and the call is made.
    Functor(Client* client, Memfunc* memfunc):FunctorBase<ReturnType, Param...>(client, memfunc, hook)
    {        
    }
       
    template<typename... T>
    static ReturnType hook(CallbackData data, T... args)
    {
        Client* pClient = ( Client* ) data.pClient();
        Memfunc& pmemfunc = ( * ( Memfunc* ) ( data.ppMemfunc() ) );
        return ( pClient->*pmemfunc ) (args...);
    }
};

Now the base class only requires the return and parameter types, not the client object or member method.
We can use the functor in the following way:

typedef void (Receiver::*SFP)(int);
SFP temp = &Receiver::singleArg;
FunctorBase<void, int> *f = new Functor<Receiver, SFP, void, int>
                   (&receiver, &temp);
(*f)(10);

The flexibility of the above code is that FunctorBase<type> does not depend on the actual functor implementation.
We can assign any functor object to this base class at runtime and get it executed.
So a class which needs to invoke a callback can hold a FunctorBase<type>* and invoke any callback without knowing the exact type of the callee object.
Currently we only have support for member methods.
To add support for static methods or non-member functions, we just need to provide a template specialization of our functor.

template<typename Memfunc, typename ReturnType, typename... Param>
class Functor<void*, Memfunc, ReturnType, Param...> : public FunctorBase<ReturnType, Param...>
{
public:
    Functor(Memfunc* memfunc):FunctorBase<ReturnType, Param...>(NULL, memfunc, hook)
    {        
    }
   
    
    template<typename... T>
    static ReturnType hook(CallbackData data, T... args)
    {
        Memfunc& pmemfunc = ( * ( Memfunc* ) ( data.ppMemfunc() ) );
        return ( *pmemfunc ) (args...);
    }
};

The performance of this functor implementation is almost the same as std::function, as there is nothing but a plain function call when the callback is invoked.

Tuesday, May 7, 2013

Callback Mechanism in C++

Callbacks

A callback is a routine which is passed to some other code and which is expected to get called at some point in time. This is very useful for sending notifications. It can also help in decoupling components. A framework for event notifications using callbacks is very useful when designing complex systems.

To make a generic implementation, a callback should be able to take any number of inputs of varying types. It could be a simple non-member function, a static function or a class member function.
C++11 provides an easy way of using callbacks via std::function.


std::function

std::function was introduced in C++11 and can be used after including the header file functional.
It can store any callable target, including static or non-member functions, member functions and lambda functions.
Let's see an example:

#include <functional>
#include <iostream>

class Handler
{
public:
    void invoke(int i, int j)
    {
        std::cout<<"Handler invoked with "<<i<<":"<<j<<"\n";
    }
};


int main(int argc, char* argv[])
{
    std::function<void(Handler*, int, int)> 
        callback(&Handler::invoke);
    Handler h;
    callback(&h, 10, 20);
    
    return 0;
}

std::bind

Consider a scenario where some of the arguments of the callback need to be constant.
For eg:
Consider a function display which takes three parameters: a Polygon, a width and a height.
Suppose we want to create a callback for this function such that width and height are always 100 and 200.
The client should only be able to invoke the callback with a Polygon object; width and height should be supplied automatically.
This can be achieved by binding the caller object and its parameters while initializing the callback.

Polygon p;
auto disp = std::bind(&display, std::placeholders::_1, 100, 200);
disp(p);

libstdc++'s functional header defines 29 such placeholders (_1 to _29); the exact count is implementation defined.

std::bind creates a callback by binding the object and method (in the case of a member method) together with the arguments passed. If we do not know an argument in advance, we can specify a placeholder.
While invoking the callback, we just need to pass the values for the placeholders.

Consider the first example again.
There we passed a pointer to the object as the first argument.
What if we only want to make a callback to a member method of the same object (we don't want to invoke the callback with a different object)?
How do we get rid of passing the object pointer every time we invoke the same callback?
We can use the same std::bind approach to achieve it.

auto callback2 = std::bind(&Handler::invoke, &h, std::placeholders::_1, std::placeholders::_2);
callback2(20, 30);

This will, of course, bind callback2 to the same object.

std::function along with std::bind provides an elegant way of invoking callbacks in C++.