Demystifying the core technology behind Kafka’s high performance: Zero-Copy

1. Foreword

Some time ago, while studying high-throughput parallel storage for large-scale log streams, we took an in-depth look at Kafka’s underlying storage mechanism. We found that Kafka’s zero-copy path is built on Java’s FileChannel.transferTo() method. We then benchmarked transferTo(), both serially and in parallel, and later implemented a parallel transferTo() path and applied it to the Apache Kafka system.

2. Message storage mechanism

Kafka is a distributed publish-subscribe messaging system; whether publishing or subscribing, a Topic must be specified. A Topic is only a logical concept. Each Topic contains one or more Partitions, and different Partitions can be located on different nodes. Physically, each Partition corresponds to a local folder; each Partition contains one or more Segments, and each Segment consists of a data file and a corresponding index file. Logically, a Partition can be regarded as a very long array whose data is accessed through an index (the offset) into this “array”.
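For illustration, the on-disk layout of one Partition might look like the listing below (the topic name and base offsets are hypothetical; each Segment’s data and index files are named after the offset of the first message the Segment holds):

logs-0/
    00000000000000000000.index
    00000000000000000000.log
    00000000000000368769.index
    00000000000000368769.log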

3. The zero-copy technology used by Kafka

In Kafka’s storage model, the data lives in files in the underlying file system. When a Consumer subscribes to a Topic, the data has to be read from disk and then written out to the socket. This action seems to require little CPU activity, but it is quite inefficient: first the kernel reads the entire disk data, then pushes it across the kernel-user boundary to the application, then the application pushes the data back across the kernel-user boundary and writes it out to the socket. The application actually acts as an inefficient intermediary here, shuttling the data from the disk file to the socket.

Data is copied every time it crosses the user-kernel boundary, consuming CPU cycles and memory bandwidth. Fortunately, these copies can be eliminated with a technique aptly called zero-copy. With zero-copy, the application asks the kernel to copy data directly from the disk file to the socket without passing it through the application. Zero-copy not only greatly improves application performance, it also reduces context switching between kernel mode and user mode.

The Java class library supports zero-copy on Linux and UNIX systems through the transferTo() method of java.nio.channels.FileChannel. The transferTo() method transfers bytes directly from the channel it is called on to another writable byte channel, without the data flowing through the application. This article first shows the overhead incurred by a simple file transfer using traditional copy semantics, and then shows how the transferTo() zero-copy technique improves performance.

3.1 Four copies and four context switches in traditional mode

Consider the scenario of reading data from a file and transferring it to another program over the network:

File.read(fileDesc, buf, len);
Socket.send(socket, buf, len);
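In Java, this traditional path might look roughly like the minimal sketch below (the file name, host, and port are placeholders): the file is read into a user-space buffer and then written back out to the socket.

import java.io.FileInputStream;
import java.io.OutputStream;
import java.net.Socket;

public class TraditionalCopy {
    public static void main(String[] args) throws Exception {
        byte[] buf = new byte[8192];                       // user-space buffer
        try (FileInputStream in = new FileInputStream("data.log");
             Socket socket = new Socket("localhost", 9000);
             OutputStream out = socket.getOutputStream()) {
            int n;
            while ((n = in.read(buf)) != -1) {             // copy: kernel read buffer -> user buffer
                out.write(buf, 0, n);                      // copy: user buffer -> kernel socket buffer
            }
        }
    }
}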

The code logic is very simple, but in fact this copy operation requires four context switches between user mode and kernel mode, and the data is copied four times before the operation completes. The steps involved in moving the data from the file to the socket are:

  1. The read() call triggers a context switch from user mode to kernel mode. Internally, sys_read() (or its equivalent) is issued to read the data from the file. The first copy is performed by the direct memory access (DMA) engine, which reads the file contents from disk and stores them in a kernel address space buffer.
  2. The requested data is copied from the read buffer into the user buffer, and the read() call returns. The return of the call triggers another context switch, this time from kernel mode back to user mode. The data is now stored in a buffer in user address space.
  3. The send() socket call triggers a context switch from user mode to kernel mode. The data is copied a third time, again into a kernel address space buffer. This time, however, it is placed in a different buffer, one associated with the destination socket.
  4. The send() system call returns, causing the fourth context switch. The fourth copy happens independently and asynchronously, as the DMA engine passes the data from the kernel buffer to the protocol engine.

It might seem inefficient to use an intermediate kernel buffer instead of transferring data directly into the user buffer, but the intermediate kernel buffer was introduced precisely to improve performance. On the read side, it allows the kernel buffer to act as a “readahead cache” when the application has not asked for as much data as the kernel buffer holds, which significantly improves performance when the requested amount is smaller than the kernel buffer size. On the write side, the intermediate buffer allows the write to complete asynchronously.

Unfortunately, this approach itself becomes a performance bottleneck when the amount of data requested is much larger than the kernel buffer size: the data is copied multiple times among the disk, the kernel buffer, and the user buffer before it is finally delivered to the application.

3.2 Zero-copy with transferTo() and sendfile()

Examining the traditional scenario again, we notice that the second and third copies are simply redundant. The application does nothing more than cache the data and hand it back to the socket. The data could instead be moved directly from the read buffer to the socket buffer, and the transferTo() method lets you do exactly that.

transferTo() method call

public long transferTo(long position, long count, WritableByteChannel target) throws IOException;
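For reference, a minimal usage sketch might look like the following (the file name, host, and port are placeholders): a FileChannel is opened on the source file and its contents are transferred straight to a connected SocketChannel.

import java.io.FileInputStream;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;

public class ZeroCopySend {
    public static void main(String[] args) throws Exception {
        try (FileChannel file = new FileInputStream("data.log").getChannel();
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9000))) {
            long position = 0;
            long size = file.size();
            while (position < size) {
                // transferTo() may transfer fewer bytes than requested, so it is
                // called in a loop until the whole file has been sent.
                position += file.transferTo(position, size - position, socket);
            }
        }
    }
}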

The transferTo() method transfers data from the file channel to the given writable byte channel. Internally, it relies on the underlying operating system’s support for zero-copy; on UNIX and various flavors of Linux, the call is routed to the sendfile() system call shown below, which transfers data from one file descriptor to another:

sendfile() system call

#include <sys/socket.h>
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

When the transferTo() method is used, the data takes the following path:

  1. The transferTo() method causes the DMA engine to copy the file contents into a kernel read buffer; the kernel then copies the data into the kernel buffer associated with the output socket.
  2. The third copy happens when the DMA engine passes the data from the kernel socket buffer to the protocol engine.

This is an improvement: the number of context switches has been reduced from four to two, and the number of data copies from four to three (only one of which involves the CPU). But this still does not reach the goal of zero copy. If the underlying network interface card supports gather operations, the remaining CPU copy inside the kernel can be eliminated as well. In Linux kernel 2.4 and later, the socket buffer descriptor was adjusted to accommodate this requirement. This approach not only reduces the number of context switches, it also eliminates the duplicated data copies that require CPU involvement. Usage on the user side is unchanged, but the internal operations change:

  1. The transferTo() method causes the DMA engine to copy the file contents to the kernel buffer.
  2. The data is not copied into the socket buffer. Instead, only descriptors with information about the location and length of the data are appended to the socket buffer. The DMA engine then passes the data directly from the kernel buffer to the protocol engine, eliminating the last remaining CPU copy.

4. Parallel FileChannel.transferTo() performance test

Next, we use multiple threads to drive FileChannel.transferTo() in parallel, hoping that parallel I/O will improve the performance of reading from the underlying file system.
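The sketch below illustrates one way such a parallel test might be driven; the file name, chunk partitioning, and per-thread target files are assumptions made for illustration. The source file is divided into disjoint regions, and each thread opens its own FileChannel and calls transferTo() for its own region.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.nio.channels.FileChannel;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelTransferTo {
    public static void main(String[] args) throws Exception {
        final String source = "test-1.2GB.dat";   // hypothetical test file
        final int threads = 8;                    // degree of parallelism

        long fileSize;
        try (FileChannel ch = new FileInputStream(source).getChannel()) {
            fileSize = ch.size();
        }
        final long chunk = (fileSize + threads - 1) / threads;

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Long>> results = new ArrayList<>();
        long start = System.nanoTime();

        for (int i = 0; i < threads; i++) {
            final long position = i * chunk;
            final long count = Math.min(chunk, fileSize - position);
            if (count <= 0) break;
            final int id = i;
            results.add(pool.submit(() -> {
                // Each task opens its own channels and transfers its own
                // disjoint region of the source file via transferTo().
                try (FileChannel in = new FileInputStream(source).getChannel();
                     FileChannel out = new FileOutputStream("part-" + id + ".dat").getChannel()) {
                    long done = 0;
                    while (done < count) {
                        done += in.transferTo(position + done, count - done, out);
                    }
                    return done;
                }
            }));
        }

        long total = 0;
        for (Future<Long> f : results) {
            total += f.get();
        }
        pool.shutdown();

        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("Transferred %d bytes in %.2f s (%.1f MB/s)%n",
                total, seconds, total / seconds / (1024 * 1024));
    }
}

Whether parallel transferTo() actually improves throughput depends on the storage hardware and how well the file system handles concurrent sequential reads, which is what the test below is meant to measure.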

The test conditions are as follows:

CentOS release 5.10

Intel® Xeon® CPU E7420 @ 2.13GHz

Number of logical CPUs: 16

16GB RAM

Test file size: 1.2GB