I/O multiplexing epoll

1. Blocking I/O

while ((n = read(STDIN_FILENO, buf, BUFSIZ)) > 0) {
    if (write(STDOUT_FILENO, buf, n) != n) {
        perror("write error");
    }
}

The above code snippet uses blocking I/O, which can be seen in many scenarios. For example, copying files, reading one file and writing to another file. For example, some scripts simply process files, read from a file, replace the text, and then write to the file to complete the text processing. Another example is some interactive scenes, which read from the input of the terminal and then output. For the processing of network sockets, when read receives the message, write replies.

You need to understand why read and write here are called blocking I/O. In the default mode, the system call read will block the process if there is no data to read. If the file is read to the end (EOF), read will not block but return 0. But for terminals and network sockets, when there is currently no data to read, read will block the process until data is input (or an abort character is received, an abort signal is received, the connection is closed, etc.). Blocking I/O handles reading from one descriptor just fine, but what about two file descriptors at the same time?

2. Concurrency model

Consider a chat room program that reads from the user’s terminal and immediately writes to a network socket. Read from the network socket and immediately write to the terminal. At this time, a blocked read is used and two reads are called in a loop. When a read operation is blocked, the program cannot execute another read operation in time. This may lead to a phenomenon where you send a message and the other party sends three messages. You can only see one message sent by the other party. If you want to see the other two messages, you must send an additional message (because the program is blocking and waiting for you. Sending a message but unable to process the message sent by the other party).

One way to solve this problem is to create multiple processes or threads. Each process or thread handles one file descriptor, so that blocking I/O can be used to handle multiple file descriptors. However, this solution brings many problems, including memory problems (each process needs to allocate independent memory space), and CPU consumption (when there are many process threads, your CPU spends most of its time doing context switching). Concurrency control brings huge difficulties to development, testing, and maintenance. If you have ever written a program that supports multi-threading, you will understand. For programs where the number of file descriptors might scale to very large sizes (such as a chat room server), we have to find a better solution.

3. Non-blocking I/O

Non-blocking I/O is an I/O operation mode in which system calls do not block the current thread or process while waiting for some operation. In non-blocking I/O mode. If the data is ready, the call returns immediately and processes the data. If the data is not ready, the call immediately returns an error (usually EWOULDBLOCK or EAGAIN) indicating that there is no data to read.

There are two ways to specify non-blocking I/O for a given descriptor. Call open to obtain the descriptor and specify the O_NONBLOCK flag. For an already open file descriptor, call function fcntl to open the O_NONBLOCK file status flag.

3.1 Polling

You can use polling plus non-blocking I/O to process multiple descriptors at the same time.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
?
#defineBUF_SIZE 1024
?
void set_nonblock(int fd) {
    int flags = fcntl(fd, F_GETFL, 0);
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}
?
void handle_io(int fd_read, int fd_write) {
    char buf[BUF_SIZE];
    ssize_t n;
    
    n = read(fd_read, buf, BUF_SIZE);
    if (n < 0) {
        if (errno != EAGAIN & amp; & amp; errno != EWOULDBLOCK) {
            perror("read error");
            exit(EXIT_FAILURE);
        }
        return; // No data read, try again later
    }
    
    if (n > 0) {
        // Actual handling code for read data, e.g., write to another fd
        if (write(fd_write, buf, n) != n) {
            perror("write error");
            exit(EXIT_FAILURE);
        }
    }
}
?
int main() {
    int fd1, fd2, fd1_out, fd2_out;
    // Assume fd1, fd2, fd1_out, fd2_out are valid file descriptors
    
    set_nonblock(fd1);
    set_nonblock(fd2);
?
    while (1) {
        // Continuously poll the file descriptors
        handle_io(fd1, fd1_out);
        handle_io(fd2, fd2_out);
        // Potentially add sleep or yield here to reduce CPU usage
        usleep(100); // Example: sleep for 100 microseconds
    }
?
    return 0;
}

The code above shows how to set a descriptor to be non-blocking and shows how to use polling to handle multiple descriptors. The advantage of polling is that it is relatively simple to implement and easy to understand. It has sufficient performance in simple scenarios and does not need to consider thread synchronization.

The polling scheme will lead to CPU inefficiency and higher latency. You can imagine such a scenario. The read operation cannot read the data and it takes 10ms to return, and the read operation takes 100ms to read the data. Only 5 of the 100 descriptors polled each time have readable data, the CPU efficiency is only 34.5%, and the average delay reaches 190ms. This data will perform worse when I/O is more inactive. (100 * 5 / (5 * 100 + 95 * 10) = 0.345, (100 – 5) / 5 * 10 = 190)

3.2 I/O multiplexing

Polling unprepared descriptors will bring low CPU efficiency and high latency. I/O multiplexing technology can solve this problem. It can be imagined that if we use a signal-like mechanism, when a descriptor is ready, a signal is sent to the process to wake up the process, and the process then reads the file descriptor. This way we don’t need to poll for unready descriptors! But we know that the number of signals is limited (usually 64), and there is no way for the signal to tell us which descriptors are ready when the number of descriptors is greater than 64. The Linux kernel provides a kernel event notification mechanism to efficiently manage and monitor events such as I/O operations. When a descriptor is ready, the kernel event notification mechanism can notify us which descriptors are ready. The functions select, poll, and epoll provided by the C library all use this mechanism.

3.2.1 select

select() is a classic I/O multiplexing function, used to monitor the status changes of multiple file descriptors. It allows a program to determine whether a file descriptor in a set of file descriptors is ready for a read operation, a write operation, or an exception status check.

#include <sys/select.h>
int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds, struct timeval *timeout);

Parameters:

  • nfds: Add 1 to the maximum number of file descriptors to monitor.

  • readfds: The set of file descriptors to be checked for readability.

  • writefds: The set of file descriptors to be checked for writability.

  • exceptfds: A collection of file descriptors to check for exception status.

  • timeout: Specifies the maximum time to wait. If NULL, select() waits until a file descriptor is ready.

Return value:

  • Greater than 0: Indicates the number of prepared file descriptors.

  • Equal to 0: indicates timeout.

  • Less than 0: indicates an error.

fd_set is a file descriptor set, use the following macros to operate it:

  • FD_ZERO(fd_set *set): Clear the set.

  • FD_SET(int fd, fd_set *set): Add file descriptor fd to the set.

  • FD_CLR(int fd, fd_set *set): Remove the file descriptor fd from the set.

  • FD_ISSET(int fd, fd_set *set): Check whether the file descriptor fd is in the set.

Usage examples:

Example of creating multiple descriptors and then selecting, and finding which descriptor is ready

3.2.2 poll

poll() is similar to select(), but the interface is different.

#include <poll.h>
int poll(struct pollfd *fds, nfds_t nfds, int timeout);

Parameters:

  • fds: An array containing multiple file descriptors, each file descriptor is described by a pollfd structure.

  • nfds: The number of file descriptors in the fds array.

  • timeout: The maximum number of milliseconds to wait, -1 means waiting infinitely, 0 means returning immediately.

struct pollfd {
    int fd; /* file descriptor */
    short events; /* Requested events */
    short revents; /*Returned events */
};
  • fd: The file descriptor to be detected or monitored.

  • events: The events you are interested in, such as POLLIN (data can be read), POLLOUT (data can be written).

  • revents: The actual events that occurred, set by the poll function when it returns.

Return value:

  • -1: Error, errno will be set

  • 0: Timeout, no file descriptor ready

  • >0: Number of ready file descriptors

Simple example:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <poll.h>

int main(void) {
    struct pollfd fds[1];
    fds[0].fd = STDIN_FILENO;
    fds[0].events = POLLIN;
    int timeout = 5000;

    printf("Waiting for input...\\
");
    int ret = poll(fds, 1, timeout);

    if (ret == -1) {
        perror("poll");
        exit(EXIT_FAILURE);
    } else if (ret == 0) {
        printf("No data within five seconds.\\
");
    } else {
        if (fds[0].revents & amp; POLLIN) {
            char buffer[100];
            ssize_t n = read(STDIN_FILENO, buffer, sizeof(buffer) - 1);
            if (n > 0) {
                buffer[n] = '\0';
                printf("Read: %s", buffer);
            }
        }
    }
    return 0;
}
3.2.3 epoll

epoll provides a more modern multiplexing technology for Linux that can monitor events on multiple descriptors more efficiently. To call select to get the fd_set returned by the kernel, we still need to traverse all file descriptors to find the ready part (the complexity is O(n), but compared to polling, there is already optimization and no need to call read). The kernel implementation of epoll uses a red-black tree to maintain descriptors and a linked list to maintain ready descriptors. Select returns fd_set and needs to copy all descriptor status. epoll_wait only returns the linked list without copying the entire descriptor table from kernel space to user space. epoll uses a red-black tree to maintain the descriptor table, and the complexity of addition, deletion and query is O(logn). The complexity of select using linear table addition, deletion and query is O(n).

int epoll_create1(int flags);

Creates an epoll instance and returns a file descriptor that uniquely identifies the instance. The flags parameter is usually set to 0. To exceed the descriptor upper limit setting.

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

Controls the registration, modification, and deletion of file descriptors in epoll instances. epfd is the file descriptor of the epoll instance returned by epoll_create1. op specifies the operation type (such as EPOLL_CTL_ADD, EPOLL_CTL_MOD, EPOLL_CTL_DEL, etc.). fd is the target file descriptor to be operated on, and event is the event to be associated with fd. Add, delete, and modify descriptors to the kernel’s red-black tree.

int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);

Wait for epoll events on instance epfd. events is used to return ready events, maxevents indicates the maximum number of events that the events array can accommodate, timeout is the number of milliseconds to wait, and -1 indicates infinite waiting.

struct epoll_event {
    uint32_t events; /* Epoll events */
    epoll_data_t data; /* User data variable */
};

events: Indicates the event type, such as EPOLLIN, EPOLLOUT, EPOLLERR, etc. data: User data, usually used to store information related to epoll events.