Thursday, June 12, 2014

Linux Kernel Time Calculation


Time

Linux provides various methods for calculating time. The most frequently used is time(), defined in time.h. But time() returns time_t, which represents the number of seconds since the epoch.
What if we want microsecond resolution? Use gettimeofday() instead. It fills a timeval structure, which provides time with microsecond resolution.
What if we want nanosecond resolution? What if we want a time interval rather than absolute time?
Linux provides methods to get time with nanosecond resolution, provided your hardware supports it. Linux also allows you to choose the clock source used for time calculation.

Clock Source

Linux provides multiple sources based on which time is calculated. The available sources can be found in /sys/devices/system/clocksource/clocksource0/available_clocksource
The clock source currently selected by the kernel can be found in /sys/devices/system/clocksource/clocksource0/current_clocksource

root@jijith-M17xR4:/home/jijith# cat /sys/devices/system/clocksource/clocksource0/available_clocksource 
tsc hpet acpi_pm 
On my machine the following clocks are available:
  • TSC
  • TSC stands for Time Stamp Counter. It is a running counter provided by the hardware, which usually counts CPU clock cycles.
  • HPET
  • HPET is the High Precision Event Timer. It is also a hardware-based counter which provides high resolution timers, but it is slower to read than the TSC. So most Linux kernels prefer the TSC when the hardware provides it.
  • ACPI_PM
  • The ACPI Power Management Timer (or ACPI PMT) is yet another clock device, included in almost all ACPI-based motherboards. Its clock signal has a fixed frequency. It is slower to read than the HPET.

The Red Hat Customer Portal has a good comparison of the performance of the available clocks. Here is a summary:
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
tsc
# time ./clock_timing

	real	0m0.601s
	user	0m0.592s
	sys	0m0.002s

# echo hpet > /sys/devices/system/clocksource/clocksource0/current_clocksource
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet
# time ./clock_timing

	real	0m12.263s
	user	0m12.197s
	sys	0m0.001s

# echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource
# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
acpi_pm
# time ./clock_timing

	real	0m24.461s
	user	0m0.504s
	sys	0m23.776s

From the above results the efficiency of generating timestamps, in descending order, is: TSC, HPET, ACPI_PM.

Time Calculation

Time in Nanoseconds

clock_gettime() can be used to get the time in nanoseconds (since the epoch). It returns the result in a struct timespec, which has the following members:

struct timespec {
               time_t   tv_sec;        /* seconds */
               long     tv_nsec;       /* nanoseconds */
           };
clock_gettime() also lets you specify the clock source, identified by a clockid_t.
Possible values are:
  • CLOCK_REALTIME
  • System-wide real-time clock.
  • CLOCK_MONOTONIC
  • Clock that cannot be set and represents monotonic time since some unspecified starting point. More like a running counter.
  • CLOCK_MONOTONIC_RAW
  • Similar to CLOCK_MONOTONIC, but provides access to a raw hardware-based time that is not subject to NTP adjustments.
To check whether your system supports nanosecond resolution, use clock_getres() and check the value in tv_nsec. If the reported resolution is 1 nanosecond (tv_sec == 0 and tv_nsec == 1), you can get time in nanoseconds.
Note: On glibc versions before 2.17 you have to link your application with -lrt to use these functions.

Time interval

If we use CLOCK_REALTIME we get the absolute time. It can also be used to calculate a time interval, by taking the difference of two readings. Any issues?
CLOCK_REALTIME is the system-wide real-time clock, and it can jump or drift if you have an NTP daemon running (or run ntpdate periodically). This could even result in a negative time interval when you take the difference of two readings.
CLOCK_MONOTONIC_RAW solves this problem. It is a monotonically increasing counter which is not subject to NTP adjustments.

Performance

Usually clock_gettime() is a system call, and there is overhead in making system calls.
Depending on the kernel configuration, the Linux kernel can implement these methods through the vDSO (virtual dynamic shared object), so they can run in user space. If CONFIG_GENERIC_TIME_VSYSCALL is set to y in your kernel config, clock_gettime() will be available through the vDSO.
How can user space access the system time without a system call?
The kernel's timekeeping code calculates the time and stores it in a memory page shared with user space. The user-space implementation of clock_gettime() reads the contents of this page. The data has to be guarded against concurrent updates; the kernel uses a sequence counter, so readers retry if an update races with the read.
So clock_gettime() can run in user space (when it is available through the vDSO) and read the time from shared memory that has already been updated by the kernel.

Wednesday, May 14, 2014

Linux Network Packet Drops


NIC Level Packet Drops

Packet drops hinder the performance of networks. We can use ethtool to check NIC-level packet drops in Linux. Use the -S option to get the statistics.

Sample Usage:

ethtool -S em1  | egrep "(nodesc)|(bad)"
     tx_bad_bytes: 0
     tx_bad: 0
     rx_bad_bytes: 65824
     rx_bad: 0
     rx_bad_lt64: 0
     rx_bad_64_to_15xx: 0
     rx_bad_15xx_to_jumbo: 0
     rx_bad_gtjumbo: 0
     rx_nodesc_drop_cnt: 1039

The field rx_nodesc_drop_cnt increasing over time is an indication that packets are being dropped by the adapter.

You can also view the packet drops using ifconfig

ifconfig em1 | grep drop
          RX packets:4154237628 errors:4 dropped:1429 overruns:0 frame:4
          TX packets:4148105177 errors:0 dropped:0 overruns:0 carrier:0

Fixing NIC Level Packet Drops

One common cause of packet drops is that the NIC ring buffer is too small to absorb bursts of traffic. The solution is to increase the buffer size.
To view the existing buffer size:

ethtool -g em1
     Ring parameters for em1:
     Pre-set maximums:
     RX:             1020
     RX Mini:        0
     RX Jumbo:       4080
     TX:             255
     Current hardware settings:
     RX:             255
     RX Mini:        0
     RX Jumbo:       0
     TX:             255

Pre-set maximums are the upper limits for the buffer sizes. In the above example, we can increase the receive buffer (RX) up to 1020. To increase the buffer size, use the following command:

ethtool -G em1 rx 1020

Socket Level Packet Drops

Packets can also be dropped with UDP when datagrams arrive while the socket receive buffer is full, i.e. traffic is arriving faster than the application can consume it. The solution is to increase the receive buffer.

Increasing socket receive buffer size

int size = 10*1024*1024; // 10 MB; SO_RCVBUF expects an int, not a uint64_t
setsockopt(m_socketFd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size));

SO_RCVBUF sets or gets the maximum socket receive buffer in bytes. The kernel doubles this value to allow space for bookkeeping overhead when it is set using setsockopt.
The maximum allowed value for this option is set by /proc/sys/net/core/rmem_max. You might have to raise it to allow larger SO_RCVBUF values.
The maximum can also be set using sysctl -w net.core.rmem_max=262144

Solarflare multicast and kernel bypass


Kernel Bypass

Kernel bypass is a recent technique for improving the performance of network applications. Whenever a network packet is received, the NIC writes the packet to a ring buffer and raises an interrupt to the kernel. The kernel then performs the protocol processing and finally copies the payload to a user-space buffer. This kernel intervention can lead to application jitter.
With kernel bypass, the network packet is delivered directly to the user-space application. This could be the raw network packet, in which case the application has to take care of the protocol processing itself. Alternatively, some NIC vendors (like Mellanox or Solarflare) provide NICs that do the packet processing on the card, often in an FPGA containing the packet-processing logic. The NIC can then transfer the data directly to user space. Solarflare, using its zero-copy API, can even provide direct access to the data in the ring buffer, rather than copying it to an application buffer.

Kernel Bypass in Solarflare

Solarflare provides an application called onload for bypassing the kernel. It is very easy to use: you don't need to change your existing application if it uses BSD sockets. Just start your application under onload and you get the benefit of kernel bypass.
Using Solarflare onload will typically save you ~2 microseconds and remove most of the jitter.
Internally, each onload instance creates an onload stack which receives packets directly from the NIC. Packets are copied from the onload stack into the application's address space. To view all the onload stacks, issue the command onload_stackdump; it shows the id of each stack along with the owning PID.

Multicast and Kernel Bypass in Solarflare

Multicast presents a new challenge for kernel bypass, as more than one application on the same host can subscribe to the same multicast channel. onload performs kernel bypass when we subscribe to a multicast stream. But if two applications subscribe to the same multicast stream using two onload stacks, the traffic goes through the kernel. The onload stacks are still used, but the user-space acceleration is lost. The reason is that the Solarflare NIC will not deliver a single packet multiple times to multiple stacks; it is delivered only once. The packet is handed to the kernel, and the kernel network stack copies it to each of the onload stacks.

How to bypass the kernel with multiple subscribers to the same multicast stream on the same host

To achieve kernel bypass in this case, all the subscribers should share the same onload stack. The reason Solarflare could not bypass the kernel was that the NIC cannot copy a packet to multiple stacks. If the subscribers share one stack, no copy is needed and the kernel can be bypassed.
To share a stack, all processes should use the same value for the environment variable EF_NAME, e.g. start each subscriber with EF_NAME=feed1 onload ./subscriber (the stack name feed1 is arbitrary; any common value works).

Fast double to string conversion


Double to String Conversion

Converting numbers to strings is a frequent operation when we have to transmit data as character strings to another system. In ultra-low-latency systems this conversion can contribute almost 20% of the system latency when the precision requirement is high.
Let us consider the popular ways of converting double to string in C++.

stringstream

std::stringstream provides a simple and easy way of converting a double to a string.
Here is a code snippet.
std::stringstream strval;
double dval = 1729.314;
strval << dval;
std::string finalValue = strval.str(); // keep the string alive; taking .c_str() of the temporary would leave a dangling pointer

to_string()

The C++11 strings library provides a convenient function, std::to_string(), to convert numbers to strings.
double dval = 1729.314;
std::string finalValue = std::to_string(dval);

sprintf()

This is the C way of converting double to string.
char finalValue[1024];
double dval = 1729.314;
sprintf(finalValue, "%f", dval);

Which one do you prefer in a latency sensitive project?
Let us analyze the performance of each of the above methods.
But before we proceed, let us think of a custom way of doing it ourselves, for better performance.

Custom Implementation

Integer-to-string conversion can be cached if we know the range of integers we are going to use. For example, we can cache a million integer strings, which costs hardly 10 MB.
We can split a double-precision number into two integers: the parts before and after the decimal separator. Then it is a matter of converting these two integers and concatenating them with the decimal separator.
The one edge case we need to handle is when the fractional part has leading 0s, as in 1.02.
One approach is to multiply the fractional part by a power of 10 (preferably 10^maximum precision). If the resulting number has fewer digits than the maximum precision, we prepend 0s.

Let's now see the performance comparison of all the above approaches:

The above benchmark was done by executing the same example using all 4 approaches discussed above.
The benchmark converts random double-precision numbers with 4 digits of decimal precision, 1000 times.

From the graph you can see that stringstream is the worst. Its median latency is ~5.8 microseconds.
to_string() is the next best, at ~4.2 microseconds.
sprintf() does even better, at ~3 microseconds. So the venerable C function beats its C++ counterparts.
Finally, our custom implementation outperforms all the standard functions by miles. Its median latency is ~500 nanoseconds, more than 6 times faster than sprintf!

Conclusion

For ultra-low-latency applications it is better to have handcrafted functions for double-to-string conversion.
The standard library functions are not fast enough, especially when the required decimal precision is high.
A custom implementation can gain a lot depending on the use case. For example, here we spent extra memory to reduce latency. That is a good trade if you have enough RAM and strict latency requirements.