Wednesday, May 14, 2014

Linux Network Packet Drops


NIC Level Packet Drops

Packet drops hinder network performance. We can use ethtool to check NIC-level packet drops in Linux. Use the -S option to get the statistics.

Sample Usage:

ethtool -S em1  | egrep "(nodesc)|(bad)"
     tx_bad_bytes: 0
     tx_bad: 0
     rx_bad_bytes: 65824
     rx_bad: 0
     rx_bad_lt64: 0
     rx_bad_64_to_15xx: 0
     rx_bad_15xx_to_jumbo: 0
     rx_bad_gtjumbo: 0
     rx_nodesc_drop_cnt: 1039

The field rx_nodesc_drop_cnt increasing over time is an indication that packets are being dropped by the adapter.
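For example, assuming the interface is em1, you can watch the counter for changes:

watch -d 'ethtool -S em1 | grep rx_nodesc_drop_cnt'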

You can also view packet drops using ifconfig:

ifconfig em1 | grep drop
          RX packets:4154237628 errors:4 dropped:1429 overruns:0 frame:4
          TX packets:4148105177 errors:0 dropped:0 overruns:0 carrier:0

Fixing NIC Level Packet Drops

One common cause of packet drops is that the NIC ring buffer is too small to absorb bursts of traffic. The solution is to increase the ring buffer size.
To view the existing buffer size:

ethtool -g em1
     Ring parameters for em1:
     Pre-set maximums:
     RX:             1020
     RX Mini:        0
     RX Jumbo:       4080
     TX:             255
     Current hardware settings:
     RX:             255
     RX Mini:        0
     RX Jumbo:       0
     TX:             255

Pre-set maximums are the upper limits the hardware supports for each buffer. In the above example, we can increase the receive buffer (RX) up to 1020. To increase the buffer size, use the following command:

ethtool -G em1 rx 1020

Socket Level Packet Drops

Packets can also be dropped with UDP when datagrams arrive while the socket buffer is full, i.e. traffic is arriving faster than the application can consume it. The solution is to increase the socket receive buffer.

Increasing socket receive buffer size

int size = 10 * 1024 * 1024; // 10 MB (SO_RCVBUF expects an int, not a 64-bit value)
setsockopt(m_socketFd, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size));

SO_RCVBUF sets or gets the maximum socket receive buffer in bytes. The kernel doubles this value to allow space for bookkeeping overhead when it is set using setsockopt.
The maximum allowed value for this option is set by /proc/sys/net/core/rmem_max; you might have to raise it to allow higher SO_RCVBUF values.
The maximum can also be set using sysctl -w net.core.rmem_max=262144
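A small sketch (the function name and the descriptor name fd are illustrative) to read the effective buffer size back from the kernel and see the doubling in effect:

#include <cstdio>
#include <sys/socket.h>

// Print the effective receive buffer size of an open socket; the kernel
// reports back the doubled value described above.
void printRcvBuf(int fd)
{
    int actual = 0;
    socklen_t len = sizeof(actual);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len) == 0)
        std::printf("effective SO_RCVBUF: %d bytes\n", actual);
}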

Solarflare multicast and kernel bypass


Kernel ByPass

Kernel ByPass is the latest technology for improving the performance of network applications. Whenever a network packet is received by the NIC, the NIC writes the packet to a ring buffer and raises a soft IRQ to the kernel. The kernel then performs the packet processing and finally writes the data to the user-space buffer. This kernel intervention can lead to application jitter.
With Kernel ByPass, the network packet can be delivered directly to the user-space application. This could be the raw network packet, in which case the application has to take care of the packet processing itself. Alternatively, some NIC vendors (like Mellanox or Solarflare) provide NICs which do the packet processing on the card, usually an FPGA with the processing logic built in, and can then transfer the data directly to user space. Solarflare, using its zero-copy API, can also provide direct access to the data in the ring buffer rather than copying it to the application buffer.

Kernel ByPass in Solarflare

Solarflare provides an application called onload for bypassing the kernel. It's very easy to use: you don't need to change your existing applications if they use BSD sockets. Just start your application under onload and you get the benefit of Kernel ByPass.
Using Solarflare onload typically saves you ~2 microseconds and removes most of the jitter.
Internally, each onloaded process creates an onload stack which receives packets directly from the NIC. Packets are copied from the onload stack to the application space. To view all the onload stacks, issue the command onload_stackdump; it shows the ID of each stack along with the owning PID.
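For example (the application name is a placeholder), an unmodified BSD-sockets application can be launched under onload and the resulting stacks listed afterwards:

onload ./my_app
onload_stackdump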

Multicast and Kernel ByPass in Solarflare

Multicast poses new challenges for Kernel ByPass, because more than one application on the same host can subscribe to the same multicast channel. onload performs Kernel ByPass when we subscribe to a multicast stream. But if two applications subscribe to the same multicast stream using two onload stacks, the traffic goes through the kernel. They still use onload stacks, but the user-space acceleration is lost. The reason is that the Solarflare NIC will not deliver a single packet multiple times to multiple stacks; it is delivered only once. After handing the packet over to kernel space, the kernel TCP/IP stack copies it to each of the onload stacks.

How to kernel bypass multiple subscribers to the same multicast stream on the same host

To achieve kernel bypass in this case, all the subscribers should share the same onload stack. The reason Solarflare could not bypass the kernel above was that it cannot copy packets to multiple stacks; if the subscribers share a single stack, no copy is needed and the kernel can be bypassed.
To share a stack, all processes should set the same value for the environment variable EF_NAME, as shown below.
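For example (the stack name and application names here are placeholders), both subscribers can be started with the same EF_NAME so that they share one onload stack:

EF_NAME=md_stack onload ./subscriber_a
EF_NAME=md_stack onload ./subscriber_b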

Fast double to string conversion


Double to String Conversion

Converting numbers to strings is a frequent operation when we have to transmit data as character strings to another system. In ultra low latency systems this conversion can contribute almost 20% of the system latency when the required precision is high.
Let us consider the popular ways of converting a double to a string in C++.

stringstream

std::stringstream provides a simple and easy way of converting a double to a string.
Here is a code snippet.
std::stringstream strval;
double dval = 1729.314;
strval << dval;
std::string finalValue = strval.str(); // keep the std::string alive; taking c_str() of the temporary would leave a dangling pointer

to_string()

The C++11 strings library provides a convenient function, std::to_string(), to convert numbers to strings.
double dval = 1729.314;
std::string finalValue = std::to_string(dval);

sprintf()

This is the C way of converting double to string.
char finalValue[1024];
double dval = 1729.314;
sprintf(finalValue, "%f", dval);

Which one do you prefer in a latency sensitive project?
Let us analyze the performance of each of the above methods.
But before we proceed, let us think of a custom way of doing it on our own for better performance.

Custom Implementation

Integer to string conversion can be cached if we know the range of integers we are going to use. For example, we can cache a million integers, which costs hardly 10 MB.
We can split a double precision number into two integers: the digits before and after the decimal separator. Then it's a matter of converting these two integers and concatenating them with the decimal separator in between.
The only edge case we need to handle is when the part after the decimal separator has leading 0s, like 1.02.
One idea is to multiply the fractional part by a power of 10 (ideally 10^maximum precision). If the resulting integer has fewer digits than the maximum precision, we prepend 0s to it.
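Here is a minimal sketch of that idea, assuming non-negative inputs, an integer part below one million and a fixed precision of 4 decimal digits; the names (kCacheSize, kScale, buildCache, doubleToString) are illustrative, not the exact implementation benchmarked below.

#include <cstdint>
#include <string>
#include <vector>

static const uint32_t kCacheSize = 1000000; // cache strings for 0 .. 999999 (roughly 10 MB)
static const uint32_t kScale     = 10000;   // 10^4 -> 4 decimal digits of precision

static std::vector<std::string> g_intCache; // pre-computed integer-to-string table

static void buildCache()
{
    g_intCache.resize(kCacheSize);
    for (uint32_t i = 0; i < kCacheSize; ++i)
        g_intCache[i] = std::to_string(i);
}

// Convert a non-negative double to a string with 4 decimal digits by
// splitting it into an integer part and a scaled fractional part.
static std::string doubleToString(double value)
{
    const uint64_t scaled  = static_cast<uint64_t>(value * kScale + 0.5); // round to 4 digits
    const uint32_t intPart = static_cast<uint32_t>(scaled / kScale);
    const uint32_t frac    = static_cast<uint32_t>(scaled % kScale);

    std::string out = g_intCache[intPart];
    out += '.';

    // Prepend leading zeros to the fractional digits (e.g. 1.02 -> "1.0200").
    const std::string& fracStr = g_intCache[frac];
    out.append(4 - fracStr.size(), '0');
    out += fracStr;
    return out;
}

buildCache() is called once at start-up; after that, doubleToString(1729.314) produces "1729.3140".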

Let's now see the performance comparison of all the above approaches:

The benchmark was done by executing the same example using all 4 of the approaches discussed before.
It converts random double precision numbers with 4 digits of decimal precision, 1000 times.
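As a rough illustration of how such a benchmark can be timed (a sketch only, not the exact harness used for the numbers below; it assumes conversion routines that return std::string):

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <functional>
#include <string>
#include <vector>

// Return the median per-call latency in nanoseconds of a conversion routine.
long long medianLatencyNs(const std::function<std::string(double)>& convert,
                          const std::vector<double>& inputs)
{
    std::vector<long long> samples;
    samples.reserve(inputs.size());
    size_t sink = 0; // consume the results so the calls are not optimized away

    for (double d : inputs)
    {
        auto start = std::chrono::steady_clock::now();
        sink += convert(d).size();
        auto end = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count());
    }

    std::printf("total characters produced: %zu\n", sink);
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2]; // median
}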

From the benchmark results you can see that stringstream is the worst. Median latency for stringstream is ~5.8 microseconds.
to_string() is the next best, at ~4.2 microseconds.
sprintf() does even better, at ~3 microseconds, so the powerful C function beats its C++ counterparts.
Finally, our custom implementation outperforms all the standard functions by miles. Its median latency is ~500 nanoseconds, which is more than 6 times better than sprintf!

Conclusion

For ultra low latency applications it's better to have handcrafted functions for double conversion.
The standard library functions are not fast enough, especially if the required decimal precision is high.
A custom implementation can gain a lot of advantage depending on the use case. For example, in our case we used more memory to reduce latency. It's a good bet if you have enough RAM and stricter latency requirements.