pselect() Pitfalls

When dealing with multiple network connections or timeouts, the select() Unix system call is still the workhorse for many applications. Its well-known and frequently used interface spares developers the learning curve of the more scalable poll(), epoll(), or /dev/poll interfaces, especially if only a few file descriptors have to be monitored. select()’s younger sibling, pselect(), adds improved signal handling while retaining interface simplicity. However, unless extra care is taken, applications that switch to pselect() can ignore network messages for many minutes, as we had to learn the hard way on a medium-to-well loaded large-scale mail server.

select() revisited

int select(int nfds, fd_set *readfds, fd_set *writefds,
           fd_set *exceptfds, struct timeval *timeout);

The purpose of select() is to

  1. wait for activity on three sets of file descriptors (Reading possible? Writing possible? Special events waiting, e.g. urgent messages?), return value > 0;
  2. optionally time out after a number of seconds and microseconds (when timeout != NULL), return value == 0; or
  3. terminate on a signal (as most system calls which can sleep), return value < 0 && errno == EINTR.
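The three outcomes can be told apart from the return value and errno alone. The following is a minimal sketch; the enum and the helper classify_select_result() are hypothetical names chosen for illustration, not part of any API:

```c
#include <errno.h>

/* Hypothetical names, for illustration only. */
enum wait_outcome { WAIT_READY, WAIT_TIMEOUT, WAIT_SIGNAL, WAIT_ERROR };

/* Map the return value of select() and the errno value observed right
   after the call to one of the three outcomes listed above, or to a
   genuine error such as EBADF. */
enum wait_outcome classify_select_result(int retval, int saved_errno)
{
  if (retval > 0) return WAIT_READY;            /* case 1: activity */
  if (retval == 0) return WAIT_TIMEOUT;         /* case 2: timed out */
  if (saved_errno == EINTR) return WAIT_SIGNAL; /* case 3: signal */
  return WAIT_ERROR;
}
```

A caller would pass errno immediately after select() returns, before any other library call can overwrite it.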

In many server processes, signals such as SIGHUP to reload configuration, SIGTERM to shut down cleanly, or SIGCHLD to respawn worker processes require interaction with the main loop. In this case, it is often easier to perform the operation not in the signal handler itself but in the main loop, as this avoids problems with the asynchronous nature of signals.

For example, the signal handlers in the Cyrus IMAP mail server consist of a single assignment to a “have seen a signal” variable, which is then checked in the main event loop. However, if a signal is delivered outside of the select() system call, select() is obviously not woken up and the signal is only handled when the next event on one of the file descriptors arrives or the system call times out.
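The flag pattern described here might look roughly like the sketch below. It is not taken from the Cyrus sources; the names got_sighup, on_sighup(), install_sighup_handler(), and take_sighup() are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag: written by the handler, read by the main loop. */
static volatile sig_atomic_t got_sighup = 0;

static void on_sighup(int signo)
{
  (void)signo;
  got_sighup = 1;   /* nothing but a single assignment */
}

/* Install the handler with sa_flags == 0 (no SA_RESTART).
   Returns 0 on success, -1 on error. */
int install_sighup_handler(void)
{
  struct sigaction sa;
  memset(&sa, 0, sizeof sa);
  sa.sa_handler = on_sighup;
  sigemptyset(&sa.sa_mask);
  return sigaction(SIGHUP, &sa, NULL);
}

/* Called from the main loop after select() returns: reports whether a
   SIGHUP arrived since the last call and resets the flag. */
int take_sighup(void)
{
  if (!got_sighup) return 0;
  got_sighup = 0;
  return 1;
}
```

If the signal arrives while the main loop is busy elsewhere, the flag is set immediately, but take_sighup() is only reached after the next select() has returned.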

Problem: Signal handling can be delayed arbitrarily, especially on lightly or unloaded servers.

The solution would be to have signals delivered only during the select() syscall. Blocking (masking) all signals before entering the main loop and unblocking them only briefly around the call to select() does not work: a pending signal is delivered between the unmasking and the call to select(), so select() remains unaware of the signal just delivered. Additional probing can be done, but race conditions will remain.
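To make the race window concrete, here is a sketch of the unblock-around-select() pattern. The helper racy_wait() and its parameter handled (the set of signals the main loop cares about) are hypothetical; the code illustrates the flaw and is not a recommendation.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/select.h>

/* Hypothetical helper showing the racy pattern. The signals in
   `handled` are assumed to be blocked by the caller. */
int racy_wait(int nfds, fd_set *readfds, struct timeval *timeout,
              const sigset_t *handled)
{
  int retval, saved_errno;

  sigprocmask(SIG_UNBLOCK, handled, NULL);
  /* RACE: a pending signal is delivered right here, before select()
     starts waiting. Its handler runs, but select() below still sleeps
     until a descriptor becomes ready or the timeout expires. */
  retval = select(nfds, readfds, NULL, NULL, timeout);
  saved_errno = errno;
  sigprocmask(SIG_BLOCK, handled, NULL);
  errno = saved_errno;   /* keep select()'s errno for the caller */
  return retval;
}
```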

Introducing pselect(): Toward a message-like handling of signals

This is what pselect() was designed to remedy: in a main loop built around pselect(), signals stay masked during normal execution and are unblocked only for the duration of the pselect() call, with the mask switched atomically on entry and restored atomically on exit. In effect, pselect() allows the programmer to treat signals as an additional kind of message.

int pselect(int nfds, fd_set *readfds, fd_set *writefds,
            fd_set *exceptfds, const struct timespec *timeout,
            const sigset_t *sigmask);

The main differences to select() are:

  1. Atomic un-/re-masking of signals according to sigmask akin to using pthread_sigmask().
  2. timeout is now a struct timespec (seconds and nanoseconds) instead of a struct timeval (seconds and microseconds).
  3. timeout is not updated to reflect (an estimate of) the unused waiting time.

However, this signals-as-a-message treatment is not as well integrated as it seems at first. On servers with a high rate of signals, and with a main event loop taking enough time to always have a signal pending when pselect() is called next, the presence of active file descriptors is never recognized or handled.

On our 32-core Cyrus IMAP mail server, under certain load patterns, the thousands of IMAP daemons kept the Cyrus master scheduling process so busy respawning dying processes (notified through SIGCHLD) that it was unable to recognize when living processes became busy (notified through a socket). Thus, the IMAP server stopped accepting new connections even though the system was not heavily loaded. This happened especially on Monday and Thursday mornings, when the rise in IMAP connections was steepest. (Note that brute-force bulk load tests do not easily reproduce the problem.)

Problem: File descriptor handling can be delayed arbitrarily, especially on heavily loaded servers.

The solution would be to have signals delivered with the same priority as file descriptor events. However, with the current overloading of return values, this is not possible.

Introducing pselect7()

In the tradition of naming extended versions of system calls after the number of arguments they take (see wait3() and wait4() for prominent representatives), I propose pselect7(), an extension of pselect() that treats signals and file descriptors that have accumulated before the call with the same priority. Signals or file descriptors becoming ready during the wait are returned on a first-come, first-served basis.

To achieve this, pselect7() retries pselect() with a zero timeout whenever pselect() returns EINTR. The zero timeout ensures that only file descriptors that are ready immediately are returned. Whether signals have occurred is reported to the caller in the int pointed to by the signals_received argument, which may be passed as NULL if this information is not desired.

#include <errno.h>
#include <signal.h>
#include <sys/select.h>
#include <time.h>

int pselect7(int nfds, fd_set *readfds, fd_set *writefds,
             fd_set *exceptfds, const struct timespec *timeout,
             const sigset_t *sigmask, int *signals_received)
{
  static const struct timespec zeroto = {0, 0};
  if (signals_received != NULL) *signals_received = 0;
  while (1) {
    int retval = pselect(nfds, readfds, writefds, exceptfds,
                         timeout, sigmask);
    if (retval != -1 || errno != EINTR) return retval;
    /* If a signal has been received,
       retry with zero timeout to obtain the pending fdsets */
    if (signals_received != NULL) *signals_received = 1;
    timeout = &zeroto;
  }
}

The easiest approach is to copy/paste the above into your existing code and call pselect7(…, NULL) instead of pselect(…); your program will then keep handling file descriptors under high signal load. Of course, you are free to use the signals_received argument to learn whether signals have occurred during pselect7().

The following is a simple usage example:

int retval, signals_received;
retval = pselect7(nfds, readfds, writefds, exceptfds, timeout, sigmask, &signals_received);
/* Both retval and signals_received can be > 0 at the same time */
if (retval > 0) handle_fds(readfds, writefds, exceptfds);
if (signals_received > 0) handle_signal_results();
/* A timeout means that neither of the above fired; errors other
   than EINTR are reported through retval < 0 */
if (retval == 0 && signals_received == 0) handle_timeout();
if (retval < 0) perror("pselect7");
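The case pselect7() is meant for, a signal already pending and a descriptor already readable at the time of the call, can be set up in a few lines. The following is a sketch under stated assumptions: pselect7() is repeated from above so that the block compiles on its own, while ignore_signal() and demo_pending_signal_and_fd() are hypothetical names. Whether the first pselect() call reports EINTR or the ready descriptor in this situation depends on the platform, so the sketch relies only on the return value and not on a particular value of signals_received.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/select.h>
#include <time.h>
#include <unistd.h>

/* pselect7() as defined in the article, repeated for self-containment. */
static int pselect7(int nfds, fd_set *readfds, fd_set *writefds,
                    fd_set *exceptfds, const struct timespec *timeout,
                    const sigset_t *sigmask, int *signals_received)
{
  static const struct timespec zeroto = {0, 0};
  if (signals_received != NULL) *signals_received = 0;
  while (1) {
    int retval = pselect(nfds, readfds, writefds, exceptfds,
                         timeout, sigmask);
    if (retval != -1 || errno != EINTR) return retval;
    if (signals_received != NULL) *signals_received = 1;
    timeout = &zeroto;
  }
}

static void ignore_signal(int signo) { (void)signo; }

/* Make SIGUSR1 pending and a pipe readable, then call pselect7().
   Returns the number of ready descriptors, or -1 on any failure. */
int demo_pending_signal_and_fd(int *signals_received)
{
  int fds[2], retval = -1;
  struct sigaction sa;
  sigset_t block, wait_mask;
  struct timespec timeout = {5, 0};
  fd_set readfds;

  memset(&sa, 0, sizeof sa);
  sa.sa_handler = ignore_signal;
  sigemptyset(&sa.sa_mask);
  if (sigaction(SIGUSR1, &sa, NULL) == -1) return -1;

  sigemptyset(&block);
  sigaddset(&block, SIGUSR1);
  if (sigprocmask(SIG_BLOCK, &block, &wait_mask) == -1) return -1;
  sigdelset(&wait_mask, SIGUSR1);   /* unblocked only inside pselect() */

  if (pipe(fds) == -1) return -1;
  if (write(fds[1], "x", 1) == 1) {  /* the descriptor is ready ... */
    raise(SIGUSR1);                  /* ... and the signal is pending */
    FD_ZERO(&readfds);
    FD_SET(fds[0], &readfds);
    retval = pselect7(fds[0] + 1, &readfds, NULL, NULL, &timeout,
                      &wait_mask, signals_received);
    if (retval > 0 && !FD_ISSET(fds[0], &readfds)) retval = -1;
  }
  close(fds[0]);
  close(fds[1]);
  return retval;
}
```

Despite the 5-second timeout, the call returns without waiting, because either the first pselect() already reports the pipe or the zero-timeout retry does.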

The code is also available on GitHub for ease of use and modification.

Alternatives

Linux provides the signalfd() system call, which allows a process to read its pending signals from a file descriptor that can also be passed to select(), poll(), or epoll(). This allows for a clean integration of the two message types, but is not an option for applications that must also run on other (POSIX-compatible) operating systems.
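For Linux-only code, the signalfd() approach could be sketched as follows. signalfd(), struct signalfd_siginfo, and SFD_CLOEXEC come from <sys/signalfd.h>; the helper names open_signal_fd(), wait_for_signal_fd(), and read_signal() are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/signalfd.h>   /* Linux-specific */
#include <unistd.h>

/* Block signo and return a descriptor that becomes readable whenever
   signo is pending. Returns -1 on error. */
int open_signal_fd(int signo)
{
  sigset_t mask;
  sigemptyset(&mask);
  sigaddset(&mask, signo);
  if (sigprocmask(SIG_BLOCK, &mask, NULL) == -1) return -1;
  return signalfd(-1, &mask, SFD_CLOEXEC);
}

/* Wait until sfd is readable; returns select()'s result. In a real
   main loop, sfd would simply be one more member of readfds. */
int wait_for_signal_fd(int sfd, struct timeval *timeout)
{
  fd_set readfds;
  FD_ZERO(&readfds);
  FD_SET(sfd, &readfds);
  return select(sfd + 1, &readfds, NULL, NULL, timeout);
}

/* Consume one pending signal; returns its number or -1 on error. */
int read_signal(int sfd)
{
  struct signalfd_siginfo info;
  if (read(sfd, &info, sizeof info) != (ssize_t)sizeof info) return -1;
  return (int)info.ssi_signo;
}
```

Because the signal stays blocked and no handler runs, it is reported through the same readiness mechanism as the other descriptors and never as EINTR.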

Acknowledgments

This article is based on information gained in part through intense interactions, technical discussions, and experiments with and by Jacob Becker, Jens Erat, and Christian Mack over the past few months.
