KoblentsBlog Photography
Contact About
Anatomy of a bug in select
In the previous article we discussed problems with multiplexing Linux facilities informally.
Here we are going to discuss
select
using more precise language and with more technical details.
Preliminary Notes:
It is assumed that readers can run commands
man
and
info
and read
C
header files on a Linux machine, or can read POSIX docs, manpages, infopages, and header files on Internet. It lets us refer to the documentation without extensive quoting.
What file descriptors are assigned to sockets
When files, sockets and other file-like objects are opened or created, Linux usually assigns them the smallest
fd
not occupied by another object. After the objects close, their
fd
are recycled. If there was no «holes» in the range of active
fd
, or after all the «holes» were reused,
fd
are assigned sequentially in a sense that a freshly assigned
fd
is a number directly following the biggest active
fd
.
It follows that a socket with
fd
equal to, for example, 1234, was created when all
fd
in a range 0 – 1233 were in use, but
fd
slot #1234 was free to grab. At least 1233 files, sockets, or similar objects where open at the moment. More if slot #1234 was inside a «hole» between active
fd
.
Let us emphasize that the thousands active
fd
may belong not only to sockets, but also to «usual» files and objects of other types, like pipes, devices, or events.
Let us notice also that one or multiple smaller
fd
may be closed and released after a socket with a high
fd
was created. The latter still will retain its big
fd
.
Anatomy of
select

To represent what
fd
to watch,
select
uses data type
fd_set
. The type is opaque, but documentation provides a general guidance on its internal structure.
Opaque means that internal details of the structure are not defined nor described in POSIX. System programmers who develop operation systems and libraries have a leeway in implementation. Application developers should not expect the structure to be the same on different operation systems. It even may change in future versions of the same operation system. It follows that the application programmers should not access and manipulate the content of
fd_set
directly. They should use functions, or rather macros, that POSIX provides for the purpose:
1234
  void FD_ZERO (fd_set *set) /* erases all elements from a set of fd */
  void FD_SET (int fd, fd_set *set) /* adds a specific fd to a set */
  void FD_CLR (int fd, fd_set *set) /* removes a specific fd from a set */
  int FD_ISSET (int fd, const fd_set *set) /* queries, if a specific fd is an element of a set */

However, the freedom of system developers in designing the internal structure is limited by POSIX. According to the standard,
fd_set
must be «file descriptor masks». POSIX tells also that
FD_SET
and
FD_CLR
have undefined behavior when their argument
fd
is equal or higher than
FD_SETSIZE
- even if the
fd
is a valid file descriptor.
Linux implementation follows the guidance. According to Linux infopage,
fd_set
«is actually a bit array». Linux manpage says that
fd_set
is «a fixed size buffer».
To get more detailed info on Linux implementation, i.e. to see the internal structure of Linux
fd_set
and how the macros are defined, we can peek into the two header files:
12
    /usr/include/sys/select.h
    /usr/include/bits/select.h
Strictly speaking, the headers are for a specific version of Linux, but the data structure should be reasonably stable, because changing it would break a binary compatibility. People would not happy if a compiled code stop working after an upgrade to a next version of operation system.
As it was expected,
fd_set
is a bit arrays of a fixed size. With 8 bits per byte, the size of the array is
FD_SETSIZE
/ 8 bytes. On a typical Linux system it is 128 bytes. Each
fd
in the set is represented by a binary 1.
fd
not included in a set are represented by binary 0.
We can see also that in a current Linux implementation
FD_SET
and
FD_CLR
macros do not check if argument
fd
is out of the array bounds. It follows that, if called with an argument
fd >= FD_SETSIZE
, the macros would modify bytes located out of the array bounds.
Function
1234567
   int select (int stop, /* can be used for optimization, see below */
               /* pointers to three monitored sets of fd: */
               fd_set *rd, /* Ready to read */
               fd_set *wrt, /* Ready to write */
               fd_set *excpt, /* Exceptional conditions */

               struct timeval *timeout) /* timeout structure, irrelevant for our discussion */

has five arguments. Generally, it monitors the three sets of
fd
:
  1. Ready to read:
    • Data sent from a remote side appeared on a regular socket.
    • A request to connect was received by a listening socket.
  2. Ready to write: A socket connection is established.
  3. Exceptional conditions: Out of Band message («urgent message») is on the socket.

Let us consider, for simplicity, a case when only «Ready to read» condition is monitored. We do not need anything more elaborate to show how problems occur with
select
. Also, it is a very typical usage case, particularly for TCP/IP sockets.
The function returns the total number of
fd
on which monitored conditions occurred. The value is zero when the function returns because of a timeout. It is equal to −1 when the function returns because of errors in data. In a typical case, when an event occurs on a single socket, it returns 1. However, it may return bigger values - when data arrives simultaneously on several sockets.
The function also returns, in calling arguments passed by pointers, information on what
fd
a monitored condition occurred. For this,
select
overwrites the content of
fd_set
arguments.
select
supports two optimization:
  1. A calling code can supply pointers to empty
    fd_set
    when not interested in monitoring some of conditions, but it is less than optimal. The empty
    fd_set
    will eat memory, they must be initialized, and
    select
    must scan their content. Alternatively, NULL pointers may be passed to
    select
    , meaning that the respective events should not be monitored. In the most usual case, when
    fd
    are monitored only for read,
    select
    is called as following:
    1
              select (FD_SETSIZE, &rd_set, NULL, NULL, &the_timeout);
    

  2. It happens often that all
    fd
    to watch have small values. For example, our codes watch for requests to connect on a single listener socket that was created during startup. It is very likely that operation system will assign the socket a small
    fd
    , something like 5 or 10 or 20. It would be inefficient for
    select
    to scan all
    FD_SETSIZE
    , i.e. 1024 bits in fd_set, if we know already that it should stop much earlier. The first argument of
    select
    tells the function where to stop the scan:
    1
              select (fd + 1, &read_set, NULL, NULL, &the_timeout);
    
    Yes, the first argument is not
    fd
    , but
    fd+1
    .
Another example: we know that data are going to be sent to our server by dozens of slow clients. We can multiplex the input to process all the data in a single thread. For this we will watch all respective sockets. An optimization is still possible:
1234
    /* maximal fd over a watched set of sockets */
    int max_fd = ...
          ...
    select (max_fd + 1, &read_set, NULL, NULL, &the_timeout);
By the way, when the first argument is equal to zero,
select
does not watch for conditions on any
fd
, but still returns on a timeout. The behavior is often used to implement timers.
What is wrong with
select

In short,
select
does not scale well with a number of
fd
, even when the
fd
are not monitored and/or are not sockets:
  1. Initially,
    select
    becomes inefficient.
  2. After the number of
    fd
    exceeds
    FD_SETSIZE
    , the behavior of
    select
    becomes undefined. In a practice, the behavior is catastrophic:
    select
    corrupts stack.

Inefficiency
The optimization using a first argument of
select
stops to work as an optimization when a maximum value of a
fd
set is big. The function has to scan a long sub‑array of
fd_set
looking for binary 1. It would be OK when most of
fd
are sockets and you really need to monitor most of them. However, more typically most of the scanned
fd
never will be watched. Many of them belong to «usual» files, not sockets. Or the
fd
may be closed already. In more general terms, the
fd_set
bit array is sparse - and, as usual, it is inefficient to represent a sparse array the same way as a dense one.
Another inefficiency is caused by returning results in the same structures in which input data are passed to function
select
: the data structures must be filled again and again before every call to the function.
Catastrophic behavior
Let us consider, step by step, what happens with a typical program code involving
select
when we attempt to watch for a
socket_fd
bigger than
FD_SETSIZE - 1
.
123456
fd_set read_set; /* fixed size data structure in a stack */
FD_ZERO (read_set); /* clears all fd smaller than FD_SETSIZE */
FD_SET (socket_fd, read_set); /* modifies bit socket_fd % 8 of a byte at
                                 address &read_set + (socket_fd / 8). 
                                 The address is higher than the upper bound of read_set. */
int ret_value = select (socket_fd + 1, &read_set, NULL, NULL, &the_timeout);

Even before
select
is called,
FD_SET
may already corrupt one bit located out of the bounds of a bit array
read_set
. The bit may be a part of local data. Or, if a stack grows down, as it is standard for x86, x86-64, and many other architectures, the bit may be located in a stack frame. For example, it can be a part of a return address used by a calling function.
A set of
fd
passed to the function to monitor is represented by binary 1 stored between 0-th bit of a byte at address
&read_set
and
socket_fd
inclusive. All bits corresponding to
fd
smaller than
FD_SETSIZE
were cleared by
FD_ZERO
, i.e. they are equal zero. However, bits starting from
FD_SETSIZE
and till
socket_fd - 1
were neither cleaned nor set by macros. They may store arbitrary values. A programmer likely would expect that the bits also were cleared by
FD_ZERO
.
Function
select
does the following:
  1. Should and may verify the validity of its first argument,
    socket_fd+1
    . According to POSIX, it is not an options: when the argument is bigger than
    FD_SETSIZE-1
    , function
    select
    should return immediately with
    ret_value
    equal −1 and
    errno
    equal
    EINVAL
    . However, a Linux implementation of
    select
    does not conform to POSIX in the respect.

  2. Verifies that a set of
    fd
    passed in a second argument are valid, i.e. all "1" bits correspond to open files, sockets, and similar objects. If even one of these bits does not correspond to an open file, function
    select
    returns with
    ret_value
    equal −1 and
    errno
    equal
    EBADF
    .

    Please, notice that the set of
    fd
    spills out the upper border of
    read_set
    up to a bit with index
    socket_fd
    .

  3. Monitors the above set of
    fd
    for «Ready to read» conditions. As we mentioned above, in addition to
    socket_fd
    that we intended to monitor, the function may also monitor bits between
    FD_SETSIZE
    inclusive and
    socket_fd
    , that accidentally happen to have value 1.

  4. Returns with
    ret_value
    equal to a number of
    fd
    that have a «Ready to read» condition occured. Before returning, the function builds a set of
    fd
    on which the condition was encountered. For this it sets the corresponding bits to 1 and clears all other bits to 0.

    As we can see,
    select
    never modifies bits that had value zero when passed to the function in a second argument, because it does not monitor the respective
    fd
    . Bits 1 on which read condition was encountered retain their values. All other bits 1 are cleared to zero.

    The bits 1 are out of
    read_set
    bounds. It follows, that on the step
    select
    changes, i.e. corrupts, at least one bit of local data or a stack frame. It may corrupt multiple bits.
Summary
When used with
fd
exceeding
FD_SETSIZE-1
,
select
and its helper macro
FD_SET
tend to corrupt local data or stack frames. Given the function normally runs in a loop, it has multiple opportunities to corrupt a stack.
Yuriy Koblents-Mishke
December 24, 2013
 
Older »
© Copyright Koblents.com, 2012-2024