Thursday, June 12, 2014

Linux Kernel Time Calculation


Time

Linux provides various methods for calculating time. The most frequently used one is time(), defined in time.h.
But time() returns a time_t, which represents the number of seconds since the epoch.
What if we want microsecond resolution? Use gettimeofday() instead (a short example follows below). It fills in a timeval structure, which provides time with microsecond resolution.
What if we want nanosecond resolution? What if we want a time interval rather than absolute time?
Linux provides methods to get time with nanosecond resolution, provided your hardware supports it. Linux also allows you to choose the clock source used for time calculation.
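
A minimal sketch of reading a microsecond timestamp with gettimeofday():

#include <sys/time.h>
#include <cstdio>

int main()
{
    struct timeval tv;
    gettimeofday(&tv, nullptr);                        // second argument is the obsolete timezone pointer
    long long usec = tv.tv_sec * 1000000LL + tv.tv_usec;
    printf("microseconds since epoch: %lld\n", usec);
    return 0;
}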

Clock Source

Linux maintains multiple clock sources from which time is calculated. The available sources are listed in /sys/devices/system/clocksource/clocksource0/available_clocksource
The clock source currently selected by the kernel can be found in /sys/devices/system/clocksource/clocksource0/current_clocksource

root@jijith-M17xR4:/home/jijith# cat /sys/devices/system/clocksource/clocksource0/available_clocksource 
tsc hpet acpi_pm 
On my machine the following clocks are available:
  • TSC
  • TSC stands for Time Stamp Counter. It is a running counter provided by the hardware, which usually counts CPU clock cycles (a sketch of reading it directly follows this list).
  • HPET
  • HPET is the High Precision Event Timer. It is also a hardware counter that provides high-resolution timers, but it is slower to read than the TSC, so the Linux kernel prefers the TSC when the hardware has one.
  • ACPI_PM
  • The ACPI Power Management Timer (or ACPI PMT) is yet another clock device, included in almost all ACPI-based motherboards. Its clock signal has a fixed frequency. It is slower to read than the HPET.
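
As an aside, the TSC can be read directly from user space; a minimal sketch using GCC's __rdtsc() intrinsic (x86-specific, and the code being timed is left as a placeholder):

#include <x86intrin.h>
#include <cstdio>

int main()
{
    unsigned long long start = __rdtsc();              // raw CPU cycle counter
    // ... code to be measured ...
    unsigned long long end = __rdtsc();
    printf("elapsed cycles: %llu\n", end - start);
    return 0;
}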

The Red Hat Customer Portal has a good comparison of the performance of the various clock sources. Here is a summary:
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
# time ./clock_timing

	real	0m0.601s
	user	0m0.592s
	sys	0m0.002s

# echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet
# time ./clock_timing

	real	0m12.263s
	user	0m12.197s
	sys	0m0.001s

# echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
acpi_pm
# time ./clock_timing

	real	0m24.461s
	user	0m0.504s
	sys	0m23.776s

From the above results the efficiency of generating timestamps, in descending order, is: TSC, HPET, ACPI_PM.
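
A program like clock_timing presumably just reads the clock in a tight loop, so its runtime mostly reflects how expensive the selected clock source is to read. A rough sketch of such a benchmark (the iteration count is arbitrary):

#include <time.h>
#include <cstdio>

int main()
{
    struct timespec ts;
    // Each iteration reads the clock once: cheap with tsc, much slower with hpet or acpi_pm.
    for (long i = 0; i < 10000000; ++i)
        clock_gettime(CLOCK_MONOTONIC, &ts);
    printf("last reading: %ld.%09ld\n", (long)ts.tv_sec, ts.tv_nsec);
    return 0;
}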

Time Calculation

Time in Nanoseconds

clock_gettime() can be used to get time with nanosecond resolution. It returns its result in a struct timespec, which has the following members:

struct timespec {
    time_t   tv_sec;        /* seconds */
    long     tv_nsec;       /* nanoseconds */
};
clock_gettime() also lets you specify the clock source, identified by a clockid_t.
Possible values include:
  • CLOCK_REALTIME
  • The system-wide real-time (wall-clock) clock.
  • CLOCK_MONOTONIC
  • A clock that cannot be set and that represents monotonic time since some unspecified starting point; more like a running counter.
  • CLOCK_MONOTONIC_RAW
  • Similar to CLOCK_MONOTONIC, but provides access to a raw hardware-based time that is not subject to NTP adjustments.
To check whether your hardware supports nanosecond resolution, use clock_getres() and inspect the returned tv_nsec: if the reported resolution is 1 nanosecond (tv_sec is 0 and tv_nsec is 1), the clock can deliver nanosecond timestamps.
Note: you have to link your application with -lrt to use the above functions.
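
A minimal sketch that checks the reported resolution and then takes a nanosecond reading:

#include <time.h>
#include <cstdio>

int main()
{
    struct timespec res, now;

    clock_getres(CLOCK_MONOTONIC, &res);               // resolution of the clock
    printf("resolution: %ld s %ld ns\n", (long)res.tv_sec, res.tv_nsec);

    clock_gettime(CLOCK_MONOTONIC, &now);              // current value of the clock
    printf("now: %ld.%09ld\n", (long)now.tv_sec, now.tv_nsec);
    return 0;
}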

Time interval

If we use CLOCK_REALTIME we get absolute (wall-clock) time. It can also be used to calculate a time interval by taking the difference of two readings. Any issues?
CLOCK_REALTIME is the system-wide real-time clock, and it can jump or drift if you have an NTP daemon running (or run ntpdate periodically). This could even result in a negative value for the interval between two readings.
CLOCK_MONOTONIC_RAW solves this problem: it is a monotonically increasing counter that is not subject to NTP adjustments.
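
A minimal sketch of measuring an interval with CLOCK_MONOTONIC_RAW; because the clock only moves forward, the difference can never be negative:

#include <time.h>
#include <cstdio>
#include <cstdint>

// Convert a timespec to nanoseconds so the two readings can be subtracted directly.
static inline std::int64_t to_ns(const struct timespec& ts)
{
    return static_cast<std::int64_t>(ts.tv_sec) * 1000000000LL + ts.tv_nsec;
}

int main()
{
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC_RAW, &start);
    // ... work to be measured ...
    clock_gettime(CLOCK_MONOTONIC_RAW, &end);
    printf("elapsed: %lld ns\n", (long long)(to_ns(end) - to_ns(start)));
    return 0;
}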

Performance

Usually clock_gettime() is a system call, and system calls have overhead.
Depending on the kernel configuration, Linux can expose these functions through the vDSO (virtual dynamic shared object), so they can run entirely in user space. If CONFIG_GENERIC_TIME_VSYSCALL is set to y in the kernel config, clock_gettime() is available through the vDSO.
How can system time be read from user space?
The kernel's timer code calculates the time and stores it in a shared memory area; the user-space clock_gettime() simply reads the contents of this memory. The read has to be guarded against concurrent updates, which the kernel does with a sequence counter rather than a conventional lock: readers never block, they just retry if an update happened mid-read.
So basically, clock_gettime() can run in user space (if it is available through the vDSO) and read the time from this shared memory, which is kept up to date by the kernel's timer interrupt handler.
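
A simplified, illustrative sketch of that sequence-counter idea in C++ (not the actual kernel code; all names are made up):

#include <atomic>
#include <cstdint>

// Shared data: written by the "kernel" side, read lock-free from user space.
struct TimeData {
    std::atomic<std::uint32_t> seq{0};    // odd while an update is in progress
    std::atomic<std::int64_t>  sec{0};
    std::atomic<std::int64_t>  nsec{0};
};

// Writer side (conceptually, the kernel's timekeeping code).
void publish(TimeData& td, std::int64_t s, std::int64_t ns)
{
    td.seq.fetch_add(1, std::memory_order_relaxed);        // counter becomes odd
    std::atomic_thread_fence(std::memory_order_release);   // keep the data writes after the bump
    td.sec.store(s, std::memory_order_relaxed);
    td.nsec.store(ns, std::memory_order_relaxed);
    td.seq.fetch_add(1, std::memory_order_release);        // counter becomes even again
}

// Reader side (conceptually, the vDSO clock_gettime() running in user space).
void read_time(const TimeData& td, std::int64_t& s, std::int64_t& ns)
{
    std::uint32_t before, after;
    do {
        before = td.seq.load(std::memory_order_acquire);
        s  = td.sec.load(std::memory_order_relaxed);
        ns = td.nsec.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire); // keep the data reads before the re-check
        after = td.seq.load(std::memory_order_relaxed);
    } while (before != after || (before & 1));              // retry on a torn or in-progress read
}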

Wednesday, May 14, 2014

Linux Network Packet Drops


NIC Level Packet Drops

Packet drops hinder network performance. We can use ethtool to check NIC-level packet drops on Linux. Use the -S option to get the statistics.

Sample Usage:

ethtool -S em1  | egrep "(nodesc)|(bad)"
     tx_bad_bytes: 0
     tx_bad: 0
     rx_bad_bytes: 65824
     rx_bad: 0
     rx_bad_lt64: 0
     rx_bad_64_to_15xx: 0
     rx_bad_15xx_to_jumbo: 0
     rx_bad_gtjumbo: 0
     rx_nodesc_drop_cnt: 1039

The field rx_nodesc_drop_cnt increasing over time is an indication that packets are being dropped by the adapter.

You can also view the packet drops using ifconfig

ifconfig em1 | grep drop
          RX packets:4154237628 errors:4 dropped:1429 overruns:0 frame:4
          TX packets:4148105177 errors:0 dropped:0 overruns:0 carrier:0

Fixing NIC Level Packet Drops

One common cause of packet drops is that the NIC ring buffer is too small. The solution is to increase the buffer size.
To view the existing buffer sizes:

ethtool -g em1
     Ring parameters for em1:
     Pre-set maximums:
     RX:             1020
     RX Mini:        0
     RX Jumbo:       4080
     TX:             255
     Current hardware settings:
     RX:             255
     RX Mini:        0
     RX Jumbo:       0
     TX:             255

Pre-set maximums are the upper limits for the buffers. In the above example, we can increase the receive buffer (RX) up to 1020. To increase the buffer size, use the following command:

ethtool -G em1 rx 1020

Socket Level Packet Drops

Packets can also be dropped at the UDP socket level when datagrams arrive while the socket receive buffer is full, i.e. traffic is arriving faster than the application can consume it. The solution is to increase the receive buffer.

Increasing socket receive buffer size

int size = 10 * 1024 * 1024; // 10 MB; SO_RCVBUF expects an int
setsockopt(m_socketFd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size));

SO_RCVBUF sets or gets the maximum socket receive buffer in bytes. When it is set using setsockopt(), the kernel doubles the value to allow space for bookkeeping overhead.
The maximum allowed value for this option is set by /proc/sys/net/core/rmem_max; you may have to raise it to allow larger values for SO_RCVBUF.
The maximum can also be set using sysctl -w net.core.rmem_max=262144
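
A slightly fuller sketch: create a UDP socket, request a 10 MB receive buffer, and read it back with getsockopt() to see what the kernel actually granted (the doubled value, capped by rmem_max):

#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    int requested = 10 * 1024 * 1024;                  // 10 MB
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &requested, sizeof(requested)) < 0)
        perror("setsockopt");

    int granted = 0;
    socklen_t len = sizeof(granted);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len);
    printf("requested %d bytes, kernel granted %d bytes\n", requested, granted);

    close(fd);
    return 0;
}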

Solarflare multicast and kernel bypass


Kernel ByPass

Kernel bypass is the latest technique for improving the performance of network applications. Whenever a network packet is received by the NIC, the NIC writes the packet to a ring buffer and raises an interrupt; the kernel then performs protocol processing in a soft IRQ and finally copies the data to the user-space buffer. This kernel intervention adds latency and can lead to application jitter.
With kernel bypass, the network packet is delivered directly to the user-space application. This could be the raw network packet, in which case the application has to take care of protocol processing itself. Alternatively, some NIC vendors (like Mellanox or Solarflare) provide NICs that do the packet processing on the card, often in an FPGA, and then transfer the data directly to user space. Solarflare, with its Zero-Copy API, can even provide direct access to the data in the ring buffer instead of copying it to the application buffer.

Kernel ByPass in Solarflare

Solarflare provides an application called onload for bypassing the kernel. It is very easy to use: you don't need to change your existing applications if they use BSD sockets. Just start your application under onload and you get the benefit of kernel bypass.
Using Solarflare onload typically saves around 2 microseconds and removes most of the jitter.
Internally, each onloaded process creates an onload stack which receives packets directly from the NIC. Packets are copied from the onload stack into the application's buffers. To view all the onload stacks, issue the command onload_stackdump; it shows the ID of each stack along with the owning PID.

Multicast and Kernel ByPass in Solarflare

Multicast poses a new challenge for kernel bypass, because more than one application on the same host can subscribe to the same multicast channel. onload performs kernel bypass when a single application subscribes to a multicast stream. But if two applications subscribe to the same multicast stream using two onload stacks, the traffic goes through the kernel: the onload stacks are still used, but the user-space acceleration is lost. The reason is that the Solarflare NIC will not deliver a single packet multiple times to multiple stacks; it is delivered only once. The packet is handed to the kernel, and the kernel network stack copies it to each of the onload stacks.

How to kernel-bypass multiple subscribers to the same multicast stream on the same host

To achieve kernel bypass in this case, all the subscribers should share the same onload stack. The obstacle to kernel bypass was that Solarflare cannot copy packets to multiple stacks; if all subscribers share one stack, no copying is needed and the kernel can be bypassed.
To share the stack, all processes should use the same value for the environment variable EF_NAME.

Fast double to string conversion


Double to String Conversion

Converting numbers to strings is a frequent operation when we have to transmit data as character strings to another system. In ultra-low-latency systems this conversion can contribute almost 20% of the system latency when the precision requirement is high.
Let us consider the popular ways of converting double to string in C++.

stringstream

std::stringstream provides a simple and easy way of converting double to string.
Here is a code snippet.
std::stringstream strval;
double dval = 1729.314;
strval << dval;
std::string finalValue = strval.str(); // keep the std::string; calling c_str() on the temporary would leave a dangling pointer

to_string()

The C++11 strings library provides a convenient function, std::to_string(), to convert numbers to strings.
double dval = 1729.314;
std::string finalValue = std::to_string(dval);

sprintf()

This is the C way of converting double to string.
char finalValue[1024];
double dval = 1729.314;
sprintf(finalValue, "%f", dval);

Which one would you prefer in a latency-sensitive project?
Let us analyze the performance of each of the above methods.
But before we proceed, let us think of a custom way of doing it ourselves for better performance.

Custom Implementation

Integer-to-string conversions can be cached if we know the range of integers we are going to use. For example, we can cache a million integers, which costs hardly 10 MB.
We can split a double-precision number into two integers: the digits before and after the decimal separator. Then it is a matter of converting these two integers and concatenating them with the decimal separator.
The only edge case we need to handle is when the fractional part has leading zeros, as in 1.02.
One idea is to multiply the fractional part by a power of 10 (preferably 10^maximum precision). If the resulting number has fewer digits than the maximum precision, we prepend zeros.
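
A minimal sketch of this idea, assuming non-negative values with an integer part below one million and a fixed precision of four decimal digits (the class and member names are illustrative):

#include <cmath>
#include <cstdint>
#include <string>
#include <vector>

class CachedDoubleToString {
public:
    static constexpr int kPrecision = 4;
    static constexpr int kScale = 10000;          // 10^kPrecision
    static constexpr int kCacheSize = 1000000;    // cache the strings for 0..999999

    CachedDoubleToString() : cache_(kCacheSize) {
        for (int i = 0; i < kCacheSize; ++i)
            cache_[i] = std::to_string(i);        // paid once at startup, costs a few MB of RAM
    }

    std::string convert(double value) const {
        std::uint64_t whole = static_cast<std::uint64_t>(value);
        std::uint64_t frac =
            static_cast<std::uint64_t>(std::llround((value - whole) * kScale));
        if (frac >= kScale) {                     // rounding overflowed, e.g. 1.99999 -> 2.0000
            ++whole;
            frac = 0;
        }
        std::string out = cache_[whole];
        out += '.';
        const std::string& f = cache_[frac];
        out.append(kPrecision - f.size(), '0');   // leading zeros: 1.02 -> frac 200 -> "0200"
        out += f;
        return out;
    }

private:
    std::vector<std::string> cache_;
};

Each conversion is then two table look-ups plus a couple of appends; the price is the memory spent on the cache, which is exactly the trade-off discussed in the conclusion below.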

Let's now see the performance comparison of all the above approaches:

The benchmark executes the same example using all four of the approaches discussed above.
It converts random double-precision numbers with 4 digits of decimal precision, 1000 times.

From the above graph you can see that stringstream is the worst: its median latency is ~5.8 microseconds.
std::to_string() is the next best, at ~4.2 microseconds.
sprintf() does even better, at ~3 microseconds. So the powerful C function beats its C++ counterparts.
Finally, our custom implementation outperforms all the standard functions by miles. Its median latency is ~500 nanoseconds, which is more than 6 times better than sprintf!

Conclusion

For ultra-low-latency applications it is better to have handcrafted functions for double conversion.
The standard library functions are not fast enough, especially when the required decimal precision is high.
A custom implementation can gain a lot depending on the use case. For example, in our case we spent more memory to reduce latency; that is a good trade-off if you have enough RAM and strict latency requirements.

Monday, December 9, 2013

Force inline functions in C++(GCC)


Inline functions

In C++, an inline function invocation is replaced by the complete body of the function, similar to a macro. This step, performed by the compiler, is called inline expansion. The major advantage of an inline function is that there is no function-call overhead, which gives a performance boost.

The compiler considers all functions whose definitions are visible in the header for inline expansion. For member functions that are not defined in header files, we can request that the compiler consider them for inlining by using the inline keyword in the function declaration:

inline int calculate(int x, int y);

The above only requests that the compiler consider a function for inline expansion; the compiler can still choose not to inline it. When do we need to force the compiler to always inline? The compiler heuristically decides which functions are worth inlining, and the size of a function plays a major role. If you have a function that the compiler may decide not to inline because of its size, and it is invoked frequently, then it is a good candidate for forced inlining.

Force Inline

In GCC, you can use the __attribute__((always_inline)) attribute to always inline a function. GCC documents it for functions that are also declared inline:

__attribute__((always_inline)) inline int calculate(int x, int y);
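
For a function defined in a header, it typically looks like this (the function itself is just an illustrative example):

// A tiny, frequently called helper that we always want expanded at the call site.
__attribute__((always_inline)) inline int calculate(int x, int y)
{
    return x * y + x;
}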

GCC Performance consideration for inline

Using forced inlining is not always a good idea; the compiler usually makes good choices.
GCC provides a set of performance options related to inlining:
  • -fno-inline
  • Do not expand any function except the ones which are forced inline.
    This is the default when no optimizations are performed.
  • -finline-small-functions
  • Integrate functions into their callers when their body is smaller than expected function call code (so overall size of program gets smaller).
    The compiler heuristically decides which functions are simple enough to be worth integrating in this way.
    This inlining applies to all functions, even those not declared inline.
    Enabled at level -O2.
  • -findirect-inlining
  • Inline also indirect calls that are discovered to be known at compile time thanks to previous inlining.
    This option has any effect only when inlining itself is turned on by the -finline-functions or -finline-small-functions options.
    Enabled at level -O2.
  • -finline-functions
  • Consider all functions for inlining, even if they are not declared inline.
    The compiler heuristically decides which functions are worth integrating in this way.
    If all calls to a given function are integrated, and the function is declared static, then the function is normally not output as assembler code in its own right.
    Enabled at level -O3.
  • -finline-functions-called-once
  • Consider all static functions called once for inlining into their caller even if they are not marked inline.
    If a call to a given function is integrated, then the function is not output as assembler code in its own right.
    Enabled at levels -O1, -O2, -O3 and -Os.
  • -fearly-inlining
  • Inline functions marked by always_inline(force inline) and functions whose body seems smaller than the function call overhead early before doing -fprofile-generate instrumentation and real inlining pass.
    Doing so makes profiling significantly cheaper and usually inlining faster on programs having large chains of nested wrapper functions.
    Enabled by default.
  • -finline-limit=n
  • By default, GCC limits the size of functions that can be inlined.
    This flag allows coarse control of this limit. n is the size of functions that can be inlined in number of pseudo instructions.
    Pseudo instruction represents, in this particular context, an abstract measurement of function's size.
    In no way does it represent a count of assembly instructions and as such its exact meaning might change from one release to an another.

Monday, November 18, 2013

Performance of Atomics with GCC


Atomic Builtins

Atomic builtins were first introduced in GCC 4.1.
Unlike the usual builtin functions, these functions do not use the __builtin_ prefix;
they use the __sync_ prefix instead.
A C++ wrapper for these calls is available in the <cstdatomic> header.

This API guarantees atomicity at the cost of performance.
There was no memory-ordering support until GCC 4.7 (see "Data-dependency ordering: atomics and memory model" in the GCC 4.6 status).
This resulted in a major performance penalty for the atomic implementation in GCC < 4.7.
One of the main culprits is the load() call.
Following is the std::atomic::load() implementation from GCC 4.6:

__int_type load(memory_order __m = memory_order_seq_cst) const volatile
{
  __glibcxx_assert(__m != memory_order_release);
  __glibcxx_assert(__m != memory_order_acq_rel);

  __sync_synchronize();
  __int_type __ret = _M_i;
  __sync_synchronize();
  return __ret;
}

__sync_synchronize() performs a full memory barrier.
In short, it inserts an MFENCE instruction, which is both a store and a load barrier.
So every load operation generates two MFENCE instructions, one before and one after the actual load.
MFENCE is the most expensive of the fence instructions, compared to SFENCE and LFENCE.

Implementation in GCC 4.7

GCC 4.7 introduced a new set of atomic builtin functions, prefixed with __atomic_.
These builtins implement memory ordering properly.
The GCC 4.7 implementation of the atomic builtins is about twice as fast as that of GCC < 4.7.
This can be seen from the implementation of load() itself.

__int_type load(memory_order __m = memory_order_seq_cst) const noexcept
{
  __glibcxx_assert(__m != memory_order_release);
  __glibcxx_assert(__m != memory_order_acq_rel);

  return __atomic_load_n(&_M_i, __m);
}
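
For reference, the new builtins can also be used directly on plain integral variables; a minimal sketch (variable names are illustrative):

#include <cstdio>

int main()
{
    int shared = 0;

    // The GCC 4.7+ __atomic builtins take an explicit memory-order argument.
    __atomic_store_n(&shared, 42, __ATOMIC_RELEASE);
    int seen = __atomic_load_n(&shared, __ATOMIC_ACQUIRE);

    // The older __sync builtins have no memory-order argument and imply a full barrier.
    __sync_fetch_and_add(&shared, 1);

    printf("seen=%d shared=%d\n", seen, shared);
    return 0;
}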

Memory models for Atomic Operations and Optimizations


Atomic Memory Models

The memory model is an important topic for understanding synchronization techniques that use atomic operations.
Each memory order specifies how memory accesses are ordered around an atomic operation.
In a typical multi-threaded environment there are concurrent reads and writes to several variables, and the order in which they become visible can differ between threads.
Understanding the memory orders helps you constrain the ordering just enough to achieve low-latency synchronization.

Each of the memory orders has some effect on compiler optimizations, but none of them affects optimizations of purely thread-local code.

Every atomic operation lets you specify an additional argument: the memory order for that operation.
The main models are:

Sequentially Consistent

This is the default mode when no memory order is specified for an atomic operation, and it is the most restrictive.
It means that all threads observe the atomic operations in a single total order that is consistent with program order.
It can also be specified explicitly using std::memory_order_seq_cst.
Consider the following example:

int a;
atomic_int b, c;
--Thread 1--
a=10;
b.store(5);
c.store(4);

--Thread 2--
if(c.load() == 4)
{
 assert(a==10);
 assert(b.load()==5);
}

Neither of the above assertions can fail.
All stores made before the store to c, whether or not they are related to c, are visible to Thread 2 once it observes c == 4.
This is the most expensive of the memory orders and the safest from a synchronization perspective.
It is required when you need a single, system-wide order agreed on by all threads.
The compiler cannot move instructions across the operation in either direction, so it has the biggest impact on optimizations.

Relaxed

This imposes no happens-before relationship: operations surrounding a relaxed operation can become visible to other threads in any order.
In the example above, either of the assertions can fail with relaxed memory ordering.
Relaxed ordering is specified using std::memory_order_relaxed.
The only ordering imposed is on the variable itself: once thread 2 has observed a value written by thread 1 to a variable, thread 2 cannot subsequently see an earlier value of that variable.
For example:

atomic_int a{0};
int b, c;
--Thread1--
a.store(1, memory_order_relaxed);
a.store(2, memory_order_relaxed);
--Thread2--
b = a.load(memory_order_relaxed);
c = a.load(memory_order_relaxed);
assert(b <= c);

The above assert can never fail.
Once the store of 1 is visible in Thread2, a later load there can never return 0 (the initial value).
Likewise, once the store of 2 is visible in Thread2, a later load in the same thread cannot return 1 or 0.

Relaxed is not commonly used for synchronization between threads, but merely to make an individual operation atomic; a sketch of its typical use follows below.
Relaxed mode has no synchronization side effects, so the compiler can still perform most optimizations around it.
But it cannot be treated exactly like a normal shared-memory access: the compiler must keep the operation atomic and respect the coherence order of that variable.
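
A typical use of relaxed ordering is a statistics counter that several threads increment, where the counter itself is the only shared state; a minimal sketch:

#include <atomic>
#include <thread>
#include <cstdio>

std::atomic<long> packets_seen{0};

void worker()
{
    for (int i = 0; i < 100000; ++i)
        packets_seen.fetch_add(1, std::memory_order_relaxed);  // atomic, but no ordering needed
}

int main()
{
    std::thread t1(worker), t2(worker);
    t1.join();
    t2.join();
    // The count is exact even though every increment used relaxed ordering.
    printf("packets_seen = %ld\n", packets_seen.load(std::memory_order_relaxed));
    return 0;
}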

Acquire-Release

In acquire-release, the store operation uses release mode (std::memory_order_release) and the load operation uses acquire mode (std::memory_order_acquire).
When an acquire load reads the value written by a release store, every store made before that release store in the storing thread, atomic or non-atomic, becomes visible to the loading thread.

For example:
int a;
std::atomic_int b, c, d;
--Thread 1--
a=20;
b.store(10, std::memory_order_relaxed);
c.store(30, std::memory_order_relaxed);
d.store(b.load(std::memory_order_relaxed), std::memory_order_release);

--Thread 2--
if(d.load(std::memory_order_acquire) == 10)
 assert(b.load(std::memory_order_relaxed) == 10);//succeeds: b was stored before the release store to d
if(d.load(std::memory_order_acquire) == 10)
 assert(a == 20);//succeeds: non-atomic stores before the release store are also visible
if(d.load(std::memory_order_acquire) == 10)
 assert(c.load(std::memory_order_relaxed) == 30);//succeeds: acquire/release does not distinguish dependent from non-dependent stores

In the example above, all three asserts succeed: once the acquire load observes the value written by the release store to d, every store Thread 1 made before that release store (to a, b and c) is visible, whether or not it is related to d.

Acquire-release does not allow the compiler to move code from before a release store to after it, or from after an acquire load to before it.
So there is still an optimization impact, but it is smaller than with sequential consistency. The typical usage pattern is sketched below.
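
The typical acquire-release pattern is message passing: the producer writes its data and then sets a flag with a release store; once the consumer observes the flag with an acquire load, the data is guaranteed to be visible. A minimal sketch:

#include <atomic>
#include <thread>
#include <cassert>

int payload = 0;                                       // ordinary, non-atomic data
std::atomic<bool> ready{false};

void producer()
{
    payload = 42;                                      // write the data first
    ready.store(true, std::memory_order_release);      // then publish the flag
}

void consumer()
{
    while (!ready.load(std::memory_order_acquire))     // spin until the flag is seen
        ;
    assert(payload == 42);                             // guaranteed: the data write happens-before this read
}

int main()
{
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return 0;
}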

Acquire-Consume

It is similar to acquire-release, but the happens-before guarantee is limited to data that carries a dependency from the value returned by the consume load; stores to unrelated variables, atomic or non-atomic, are not guaranteed to be visible.
In the previous example, if the load of d is performed with std::memory_order_consume, it is no longer guaranteed that a is 20.

int a;
std::atomic_int b, c, d;
--Thread 1--
a=20;
b.store(10, std::memory_order_relaxed);
c.store(30, std::memory_order_relaxed);
d.store(b.load(std::memory_order_relaxed), std::memory_order_release);

--Thread 2--
if(d.load(std::memory_order_consume) == 10)
 assert(b.load(std::memory_order_relaxed) == 10);//not guaranteed: the if() is only a control dependency, which consume does not order
if(d.load(std::memory_order_consume) == 10)
 assert(a == 20);//can fail: a carries no dependency from the value loaded from d
if(d.load(std::memory_order_consume) == 10)
 assert(c.load(std::memory_order_relaxed) == 30);//can fail: c carries no dependency from d either

Consume can be cheaper than acquire at the hardware level, because on weakly ordered architectures no barrier is needed: the hardware already honours data dependencies.
As far as compiler optimizations are concerned, it is treated like an acquire; in practice, current compilers simply promote memory_order_consume to memory_order_acquire.
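
The case where consume genuinely helps is publishing a pointer: anything the consumer reaches through the loaded pointer carries a dependency on it and is therefore ordered. A minimal sketch (keeping in mind that current compilers promote consume to acquire anyway):

#include <atomic>
#include <thread>
#include <cassert>

struct Message { int value; };

std::atomic<Message*> published{nullptr};
int unrelated = 0;

void producer()
{
    Message* m = new Message{42};
    unrelated = 7;                                     // not reachable through the pointer
    published.store(m, std::memory_order_release);
}

void consumer()
{
    Message* m;
    while ((m = published.load(std::memory_order_consume)) == nullptr)
        ;                                              // spin until the pointer is published
    assert(m->value == 42);                            // guaranteed: the access carries a dependency on m
    // assert(unrelated == 7) would NOT be guaranteed under consume.
    delete m;
}

int main()
{
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return 0;
}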

All the above models are defined in an enum in the header <atomic> (<cstdatomic> for gcc < 4.7)

enum memory_order {
    memory_order_relaxed,
    memory_order_consume,
    memory_order_acquire,
    memory_order_release,
    memory_order_acq_rel,
    memory_order_seq_cst
};

memory_order_acq_rel combines both: it is used with read-modify-write operations and acts as an acquire for the load part and a release for the store part.