63 ALTERNATIVE I/O MODELS

1 Overview

Replenish

1.1 Level-Triggered and Edge-Triggered Notification

Before discussing the various alternative I/O mechanisms in detail, we need to distinguish two models of readiness notification for a file descriptor:

Level-triggered notification: A file descriptor is considered to be ready if it is possible to perform an I/O system call without blocking.
Edge-triggered notification: Notification is provided if there is I/O activity (e.g., new input) on a file descriptor since it was last monitored.

Table 63-1 summarizes the notification models employed by I/O multiplexing, signal-driven I/O, and epoll. The epoll API differs from the other two I/O models in that it can employ both level-triggered notification (the default) and edge-triggered
notification.

Details of the differences between these two notification models will become clearer during the course of the chapter. For now, we describe how the choice of notification model affects the way we design a program.

When we employ level-triggered notification, we can check the readiness of a file descriptor at any time. This means that when we determine that a file descriptor is ready (e.g., it has input available), we can perform some I/O on the descriptor, and then repeat the monitoring operation to check if the descriptor is still ready (e.g., it still has more input available), in which case we can perform more I/O, and so on. In other words, because the level-triggered model allows us to repeat the I/O monitoring operation at any time, it is not necessary to perform as much I/O as possible (e.g., read as many bytes as possible) on the file descriptor (or even perform any I/O at all) each time we are notified that a file descriptor is ready.
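As a rough illustration of this point, the sketch below reads only one byte per readiness check and simply asks again. It uses poll() with a zero timeout as the level-triggered monitoring call; the helper name read_in_small_steps() is hypothetical and chosen only for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Consume input one byte at a time, re-checking readiness before each read.
   Returns the number of bytes consumed, or -1 if poll() fails.
   (Hypothetical helper, for illustration only.) */
static int
read_in_small_steps(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int steps = 0;
    char c;

    assert(fd >= 0);

    for (;;) {
        int ready = poll(&pfd, 1, 0);       /* Zero timeout: just check */
        if (ready == -1)
            return -1;
        if (ready == 0 || !(pfd.revents & POLLIN))
            break;                          /* No input right now */

        if (read(fd, &c, 1) != 1)           /* Deliberately tiny read */
            break;
        steps++;
    }
    return steps;
}
```

Because the check can be repeated at any time, nothing is lost by reading so little on each pass; the remaining input is simply reported again on the next call.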

By contrast, when we employ edge-triggered notification, we receive notification only when an I/O event occurs. We don’t receive any further notification until another I/O event occurs. Furthermore, when an I/O event is notified for a file descriptor, we usually don’t know how much I/O is possible (e.g., how many bytes
are available for reading). Therefore, programs that employ edge-triggered notification are usually designed according to the following rules:

After notification of an I/O event, the program should, at some point, perform as much I/O as possible (e.g., read as many bytes as possible) on the corresponding file descriptor. If the program fails to do this, then it might miss the opportunity to perform some I/O, because it would not be aware of the need to operate on the file descriptor until another I/O event occurred. This could lead to spurious data loss or blockages in a program. We said “at some point,” because sometimes it may not be desirable to perform all of the I/O immediately after we determine that the file descriptor is ready. The problem is that we may starve other file descriptors of attention if we perform a large amount of I/O on one file descriptor. We consider this point in more detail when we describe the edge-triggered notification model for epoll in Section 63.4.6.
If the program employs a loop to perform as much I/O as possible on the file descriptor, and the descriptor is marked as blocking, then eventually an I/O system call will block when no more I/O is possible. For this reason, each monitored file descriptor is normally placed in nonblocking mode, and after notification of an I/O event, I/O operations are performed repeatedly until the relevant system call (e.g., read() or write()) fails with the error EAGAIN or EWOULDBLOCK.
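The second rule can be sketched as a small drain loop. This is only a minimal sketch: it assumes the descriptor has already been placed in nonblocking mode, it discards the data it reads, and the helper name drain_fd() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Read from a nonblocking descriptor until no more input is available.
   Returns the total number of bytes read, or -1 on a genuine error.
   *eof is set to 1 if end-of-file was seen, otherwise 0.
   (Hypothetical helper, for illustration only.) */
static ssize_t
drain_fd(int fd, int *eof)
{
    char buf[4096];
    ssize_t total = 0;

    assert(eof != NULL);
    *eof = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));

        if (n > 0) {
            total += n;             /* A real program would process buf here */
        } else if (n == 0) {
            *eof = 1;               /* No more data will ever arrive */
            return total;
        } else if (errno == EINTR) {
            continue;               /* Interrupted by a signal; retry */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            return total;           /* Drained; wait for next notification */
        } else {
            return -1;              /* Some other error */
        }
    }
}
```

A program concerned about starving other descriptors could cap the number of loop iterations and remember that the descriptor still needs attention, rather than draining it in one go.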

1.2 Employing Nonblocking I/O with Alternative I/O Models

Nonblocking I/O (the O_NONBLOCK flag) is often used in conjunction with the I/O models described in this chapter. Some examples of why this can be useful are the following:

As explained in the previous section, nonblocking I/O is usually employed in conjunction with I/O models that provide edge-triggered notification of I/O events.
If multiple processes (or threads) are performing I/O on the same open file descriptions, then, from a particular process’s point of view, a descriptor’s readiness may change between the time the descriptor was notified as being ready and the time of the subsequent I/O call. Consequently, a blocking I/O call could block, thus preventing the process from monitoring other file descriptors. (This can occur for all of the I/O models that we describe in this chapter, regardless of whether they employ level-triggered or edge-triggered notification.)
Even after a level-triggered API such as select() or poll() informs us that a file descriptor for a stream socket is ready for writing, if we write a large enough block of data in a single write() or send(), then the call will nevertheless block.
In rare cases, level-triggered APIs such as select() and poll() can return spurious readiness notifications; that is, they can falsely inform us that a file descriptor is ready. This could be caused by a kernel bug or be expected behavior in an uncommon scenario.

Section 16.6 of [Stevens et al., 2004] describes one example of spurious readiness notifications on BSD systems for a listening socket. If a client connects to a server’s listening socket and then resets the connection, a select() performed by the server between these two events will indicate the listening socket as being readable, but a subsequent accept() that is performed after the client’s reset will block.
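In all of these cases, the usual defence is to set O_NONBLOCK on the descriptor before monitoring it. One common way of doing so is sketched here with fcntl(); the helper name set_nonblocking() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Add O_NONBLOCK to the open file status flags of fd, preserving the other
   flags. Returns 0 on success, or -1 on error.
   (Hypothetical helper, for illustration only.) */
static int
set_nonblocking(int fd)
{
    int flags;

    assert(fd >= 0);

    flags = fcntl(fd, F_GETFL);             /* Fetch current flags */
    if (flags == -1)
        return -1;

    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return -1;
    return 0;
}
```

After this call, a read() on a descriptor with no input available fails immediately (with EAGAIN or EWOULDBLOCK) instead of blocking, so a stale or spurious readiness notification costs only one failed system call.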

2 I/O Multiplexing

I/O multiplexing allows us to simultaneously monitor multiple file descriptors to
see if I/O is possible on any of them. We can perform I/O multiplexing using
either of two system calls with essentially the same functionality. The first of these,
select(), appeared along with the sockets API in BSD. This was historically the more
widespread of the two system calls. The other system call, poll(), appeared in System V.
Both select() and poll() are nowadays required by SUSv3.

We can use select() and poll() to monitor file descriptors for regular files, terminals, pseudoterminals, pipes, FIFOs, sockets, and some types of character devices. Both system calls allow a process either to block indefinitely waiting for file descriptors to become ready or to specify a timeout on the call.

2.1 The select() System Call

The select() system call blocks until one or more of a set of file descriptors becomes ready.

#include <sys/time.h>           /* For portability */
#include <sys/select.h>

int select(int nfds, fd_set *readfds, fd_set *writefds, fd_set *exceptfds,
           struct timeval *timeout);

// Returns number of ready file descriptors, 0 on timeout, or -1 on error

The nfds, readfds, writefds, and exceptfds arguments specify the file descriptors that
select() is to monitor. The timeout argument can be used to set an upper limit on the
time for which select() will block. We describe each of these arguments in detail below.

In the prototype for select() shown above, we include <sys/time.h> because that was the header specified in SUSv2, and some UNIX implementations require this header. (The <sys/time.h> header is present on Linux, and including it does no harm.)
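As a minimal sketch of how these arguments fit together, the following helper waits for a single descriptor to become readable. It relies on the standard FD_ZERO(), FD_SET(), and FD_ISSET() macros and on the convention that nfds is one greater than the highest descriptor number in any set; the helper name wait_readable() and its millisecond timeout parameter are assumptions made for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/time.h>           /* For portability */
#include <sys/select.h>
#include <unistd.h>             /* pipe(), read(), write() for callers */

/* Wait up to timeout_ms milliseconds for fd to become readable.
   Returns 1 if fd is readable, 0 on timeout, or -1 on error.
   (Hypothetical helper, for illustration only.) */
static int
wait_readable(int fd, long timeout_ms)
{
    fd_set readfds;
    struct timeval tv;
    int ready;

    assert(fd >= 0 && fd < FD_SETSIZE);     /* Limit of an fd_set */

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);

    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* Only readfds is of interest; the other two sets are NULL */
    ready = select(fd + 1, &readfds, NULL, NULL, &tv);
    if (ready == -1)
        return -1;

    return (ready > 0 && FD_ISSET(fd, &readfds)) ? 1 : 0;
}
```

Passing a timeout of 0 makes the call a pure poll of the descriptor's current state. Passing NULL as the final argument of select() would instead block indefinitely.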

File descriptor sets

Replenish

The timeout argument

Replenish

Return value from select()

Replenish

Example program

Replenish

2.2 The poll() System Call

The poll() system call performs a similar task to select(). The major difference between the two system calls lies in how we specify the file descriptors to be monitored. With select(), we provide three sets, each marked to indicate the file descriptors of interest. With poll(), we provide a list of file descriptors, each marked with the set of events of interest.
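The "list of file descriptors" takes the form of an array of pollfd structures, as the following sketch suggests. It monitors two descriptors for input and reports which of them is readable; the helper name poll_pair() and its bit-mask return convention are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Check two descriptors for input, waiting at most timeout_ms milliseconds.
   Returns a bit mask (bit 0: fd_a readable, bit 1: fd_b readable),
   or -1 on error. (Hypothetical helper, for illustration only.) */
static int
poll_pair(int fd_a, int fd_b, int timeout_ms)
{
    struct pollfd fds[2];
    int mask = 0;

    assert(fd_a >= 0 && fd_b >= 0);

    fds[0].fd = fd_a;
    fds[0].events = POLLIN;         /* Events of interest for fd_a */
    fds[0].revents = 0;

    fds[1].fd = fd_b;
    fds[1].events = POLLIN;         /* Events of interest for fd_b */
    fds[1].revents = 0;

    if (poll(fds, 2, timeout_ms) == -1)
        return -1;

    /* The kernel reports what actually occurred in revents */
    if (fds[0].revents & POLLIN)
        mask |= 1;
    if (fds[1].revents & POLLIN)
        mask |= 2;
    return mask;
}
```

Unlike the fd_set arguments of select(), the events fields are not modified by the call, so the same array can be passed to poll() repeatedly without being rebuilt.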

Replenish

2.3 When Is a File Descriptor Ready?

Correctly using select() and poll() requires an understanding of the conditions under which a file descriptor is indicated as being ready. SUSv3 says that a file descriptor (with O_NONBLOCK clear) is considered to be ready if a call to an I/O function would not block, regardless of whether the function would actually transfer data. The key point is the final clause: select() and poll() tell us whether an I/O operation would not block, rather than whether it would successfully transfer data. In this light, let us consider how these system calls operate for different types of file descriptors. We show this information in tables containing two columns:

The select() column indicates whether a file descriptor is marked as readable (r), writable (w), or having an exceptional condition (x).
The poll() column indicates the bit(s) returned in the revents field. In these tables, we omit mention of POLLRDNORM, POLLWRNORM, POLLRDBAND, and POLLWRBAND. Although some of these flags may be returned in revents in various circumstances (if they are specified in events), they convey no useful information beyond that provided by POLLIN, POLLOUT, POLLHUP, and POLLERR.
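One way to see the distinction between "would not block" and "would transfer data" is to monitor a pipe that contains no data and whose write end has been closed. On Linux, select() reports such a descriptor as readable, because a read() would return immediately with 0 (end-of-file). The sketch below checks this; the function name eof_counts_as_readable() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/time.h>
#include <sys/select.h>
#include <unistd.h>

/* Returns 1 if an empty pipe with no writers is reported as readable by
   select() and the following read() returns 0 bytes; 0 if not; -1 on error.
   (Hypothetical function, for illustration only.) */
static int
eof_counts_as_readable(void)
{
    int p[2];
    fd_set readfds;
    struct timeval tv = { 0, 0 };   /* Zero timeout: do not wait */
    int ready;
    ssize_t n;
    char c;

    if (pipe(p) == -1)
        return -1;
    assert(p[0] < FD_SETSIZE);

    close(p[1]);                    /* No writers remain, and no data */

    FD_ZERO(&readfds);
    FD_SET(p[0], &readfds);
    ready = select(p[0] + 1, &readfds, NULL, NULL, &tv);

    n = read(p[0], &c, 1);          /* Does not block; transfers nothing */
    close(p[0]);

    if (ready == -1 || n == -1)
        return -1;
    return ready == 1 && n == 0;
}
```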

Regular files

File descriptors that refer to regular files are always marked as readable and writable by select(), and returned with POLLIN and POLLOUT set in revents for poll(), for the following reasons:

Replenish

2.5 Problems with select() and poll()

The select() and poll() system calls are the portable, long-standing, and widely used
methods of monitoring multiple file descriptors for readiness. However, these
APIs suffer some problems when monitoring a large number of file descriptors:

3 Signal-Driven I/O

With I/O multiplexing, a process makes a system call (select() or poll()) in order to check whether I/O is possible on a file descriptor. With signal-driven I/O, a process requests that the kernel send it a signal when I/O is possible on a file descriptor. The process can then perform any other activity until I/O is possible, at which time the signal is delivered to the process. To use signal-driven I/O, a program performs the following steps:

Establish a handler for the signal delivered by the signal-driven I/O mechanism. By default, this notification signal is SIGIO.
Set the owner of the file descriptor, that is, the process or process group that is to receive signals when I/O is possible on the file descriptor. Typically, we make the calling process the owner. The owner is set using an fcntl() F_SETOWN operation of the following form:
fcntl(fd, F_SETOWN, pid);
Enable nonblocking I/O by setting the O_NONBLOCK open file status flag.
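The steps listed above might be coded roughly as follows. This is a sketch under stated assumptions: the handler merely sets a flag, the helper name enable_sigio() is hypothetical, and the code also sets the O_ASYNC flag, which is not among the steps listed above but which, according to the Linux fcntl(2) documentation, is what actually switches on signal generation for a descriptor.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t got_sigio = 0;

/* Handler for SIGIO: only record that I/O may be possible */
static void
sigio_handler(int sig)
{
    (void) sig;
    got_sigio = 1;
}

/* Prepare fd for signal-driven I/O. Returns 0 on success, -1 on error.
   (Hypothetical helper, for illustration only.) */
static int
enable_sigio(int fd)
{
    struct sigaction sa;
    int flags;

    assert(fd >= 0);

    /* Establish a handler for the notification signal */
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    sa.sa_handler = sigio_handler;
    if (sigaction(SIGIO, &sa, NULL) == -1)
        return -1;

    /* Make the calling process the owner of the descriptor */
    if (fcntl(fd, F_SETOWN, getpid()) == -1)
        return -1;

    /* Enable nonblocking I/O; O_ASYNC is an addition to the listed steps */
    flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    if (fcntl(fd, F_SETFL, flags | O_NONBLOCK | O_ASYNC) == -1)
        return -1;
    return 0;
}
```

The main program would then carry on with other work and test got_sigio at convenient points, performing nonblocking I/O on the descriptor when the flag is set.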

4 The epoll API

Like the I/O multiplexing system calls and signal-driven I/O, the Linux epoll (event poll) API is used to monitor multiple file descriptors to see if they are ready for I/O.
The primary advantages of the epoll API are the following:

The performance of epoll scales much better than select() and poll() when monitoring large numbers of file descriptors.
The epoll API permits either level-triggered or edge-triggered notification. By contrast, select() and poll() provide only level-triggered notification, and signal-driven I/O provides only edge-triggered notification.

The performance of epoll and signal-driven I/O is similar. However, epoll has some advantages over signal-driven I/O:

We avoid the complexities of signal handling (e.g., signal-queue overflow).
We have greater flexibility in specifying what kind of monitoring we want to perform (e.g., checking to see if a file descriptor for a socket is ready for reading, writing, or both).
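The difference between the two notification modes that epoll supports can be observed with a short experiment. The sketch below registers a descriptor with an epoll instance and then calls epoll_wait() twice with a zero timeout, without reading any input in between. The function name count_notifications() is hypothetical; epoll_create1(), epoll_ctl(), epoll_wait(), and the EPOLLET flag are part of the Linux epoll API.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Register fd for input events (plus extra_flags, e.g., EPOLLET), then
   check readiness twice without consuming any input. Returns how many of
   the two checks reported fd as ready, or -1 on error.
   (Hypothetical helper, for illustration only.) */
static int
count_notifications(int fd, unsigned int extra_flags)
{
    struct epoll_event ev, out;
    int epfd, hits = 0;

    assert(fd >= 0);

    epfd = epoll_create1(0);
    if (epfd == -1)
        return -1;

    ev.events = EPOLLIN | extra_flags;  /* Level-triggered unless EPOLLET */
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
        close(epfd);
        return -1;
    }

    for (int i = 0; i < 2; i++) {
        int n = epoll_wait(epfd, &out, 1, 0);   /* Zero timeout */
        if (n == -1) {
            close(epfd);
            return -1;
        }
        hits += n;
    }

    close(epfd);
    return hits;
}
```

When input is already pending on fd, the default level-triggered mode reports the descriptor on both checks, whereas with EPOLLET only the first check reports it, since no new I/O event occurs between the two calls.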

Replenish