Anatomy of a bug in select
Here we are going to discuss
using more precise language and with more technical details.
Preliminary Notes:
It is assumed that readers can run commands
and
and read
header files on a Linux machine, or can read POSIX docs, manpages, infopages, and header files on Internet. It lets us refer to the documentation without extensive quoting.
What file descriptors are assigned to sockets
When files, sockets and other file-like objects are opened or created,
Linux usually assigns them the smallest
not occupied by another object. After the objects close, their
are recycled. If there was no «holes» in the range of active
, or after all the «holes» were reused,
are assigned sequentially in a sense that a freshly assigned
is a number directly following the biggest active
.
It follows that a socket with
equal to, for example, 1234, was created when all
in a range 0 – 1233
were in use, but
slot #1234 was free to grab.
At least 1233 files, sockets, or similar objects where open at the moment. More if slot #1234 was inside a «hole» between active
.
Let us emphasize that the thousands active
may belong not only to sockets, but also to «usual» files and objects of other types, like pipes, devices, or events.
Let us notice also that one or multiple smaller
may be closed and released after a socket with a high
was created. The latter still will retain its big
.
Anatomy of
To represent what
to watch,
uses data type
. The type is opaque, but documentation provides a general guidance on its internal structure.
Opaque means that internal details of the structure are not defined nor described in POSIX. System programmers who develop operation systems and libraries have a leeway in implementation. Application developers should not expect the structure to be the same on different operation systems. It even may change in future versions of the same operation system. It follows that the application programmers should not access and manipulate the content of
directly. They should use functions, or rather macros, that POSIX provides for the purpose:
1234 | void FD_ZERO (fd_set *set)
void FD_SET (int fd, fd_set *set)
void FD_CLR (int fd, fd_set *set)
int FD_ISSET (int fd, const fd_set *set)
|
However, the freedom of system developers in designing the internal structure is limited by POSIX. According to the standard,
must be «file descriptor masks». POSIX tells also that
and
have undefined behavior when their argument
is equal or higher than
- even if the
is a valid file descriptor.
Linux implementation follows the guidance. According to Linux infopage,
«is actually a bit array». Linux manpage says that
is «a fixed size buffer».
To get more detailed info on Linux implementation, i.e. to see the internal structure of Linux
and how the macros are defined, we can peek into the two header files:
12 | /usr/include/sys/select.h
/usr/include/bits/select.h
|
Strictly speaking, the headers are for a specific version of Linux, but the data structure should be reasonably stable, because changing it would break a binary compatibility. People would not happy if a compiled code stop working after an upgrade to a next version of operation system.
As it was expected,
is a bit arrays of a fixed size.
With 8 bits per byte, the size of the array is
/ 8 bytes. On a typical Linux system it is 128 bytes. Each
in the set is represented by a binary 1.
not included in a set are represented by binary 0.
We can see also that in a current Linux implementation
and
macros do not check if argument
is out of the array bounds. It follows that, if called with an argument
, the macros would modify bytes located out of the array bounds.
Function
1234567 | int select (int stop,
fd_set *rd,
fd_set *wrt,
fd_set *excpt,
struct timeval *timeout)
|
has five arguments. Generally, it monitors the three sets of
:
- Ready to read:
- Data sent from a remote side appeared on a regular socket.
- A request to connect was received by a listening socket.
- Ready to write: A socket connection is established.
- Exceptional conditions: Out of Band message («urgent message») is on the socket.
Let us consider, for simplicity, a case when only «Ready to read» condition is monitored. We do not need anything more elaborate to show how problems occur with
. Also, it is a very typical usage case, particularly for TCP/IP sockets.
The function returns the total number of
on which monitored conditions occurred. The value is zero when the function returns because of a timeout. It is equal to −1 when the function returns because of errors in data. In a typical case, when an event occurs on a single socket, it returns 1. However, it may return bigger values - when data arrives simultaneously on several sockets.
The function also returns, in calling arguments passed by pointers, information on what
a monitored condition occurred. For this,
overwrites the content of
arguments.
supports two optimization:
-
A calling code can supply pointers to empty when not interested in monitoring some of conditions, but it is less than optimal.
The empty will eat memory, they must be initialized, and must scan their content. Alternatively, NULL pointers may be passed to , meaning that the respective events should not be monitored. In the most usual case, when are monitored only for read, is called as following:
1 | select (FD_SETSIZE, &rd_set, NULL, NULL, &the_timeout);
|
-
It happens often that all to watch have small values.
For example, our codes watch for requests to connect
on a single listener socket that was created during startup.
It is very likely that operation system will assign the socket
a small , something like 5 or 10 or 20.
It would be inefficient for to scan
all , i.e. 1024 bits in fd_set,
if we know already that it should stop much earlier.
The first argument of tells the function
where to stop the scan:
1 | select (fd + 1, &read_set, NULL, NULL, &the_timeout);
|
Yes, the first argument is not , but .
Another example: we know that data are going to be sent to our server by dozens of slow clients. We can multiplex the input to process all the data in a single thread. For this we will watch all respective sockets. An optimization is still possible:
1234 |
int max_fd = ...
...
select (max_fd + 1, &read_set, NULL, NULL, &the_timeout);
|
By the way, when the first argument is equal to zero,
does not watch for conditions on any
, but still returns on a timeout. The behavior is often used to implement timers.
What is wrong with
In short,
does not scale well with a number of
, even when the
are not monitored
and/or are not sockets:
- Initially, becomes inefficient.
- After the number of exceeds , the behavior of becomes undefined. In a practice, the behavior is catastrophic: corrupts stack.
Inefficiency
The optimization using a first argument of
stops to work as an optimization when a maximum value of a
set is big. The function has to scan a long sub‑array of
looking for binary 1. It would be OK when most of
are sockets and you really need to monitor most of them. However, more typically most of the scanned
never will be watched. Many of them belong to «usual» files, not sockets. Or the
may be closed already. In more general terms, the
bit array is sparse - and, as usual, it is inefficient to represent a sparse array the same way as a dense one.
Another inefficiency is caused by returning results in the same structures in which input data are passed to function
: the data structures must be filled again and again before every call to the function.
Catastrophic behavior
Let us consider, step by step, what happens with a typical program code
involving
when we attempt to watch for a
bigger than
.
123456 | fd_set read_set;
FD_ZERO (read_set);
FD_SET (socket_fd, read_set);
int ret_value = select (socket_fd + 1, &read_set, NULL, NULL, &the_timeout);
|
Even before
is called,
may already corrupt one bit located out of the bounds of a bit array
. The bit may be a part of local data. Or, if a stack grows down, as it is standard for x86, x86-64, and many other architectures, the bit may be located in a stack frame. For example, it can be a part of a return address used by a calling function.
A set of
passed to the function to monitor is represented by binary 1 stored between 0-th bit of a byte at address
and
inclusive. All bits corresponding to
smaller than
were cleared by
, i.e. they are equal zero. However, bits starting from
and till
were neither cleaned nor set by macros. They may store arbitrary values. A programmer likely would expect that the bits also were cleared by
.
Function
does the following:
-
Should and may verify the validity of its first argument, . According to POSIX, it is not an options: when the argument is bigger than , function should return immediately with equal −1 and equal . However, a Linux implementation of does not conform to POSIX in the respect.
-
Verifies that a set of passed in a second argument are valid, i.e. all "1" bits correspond to open files, sockets, and similar objects. If even one of these bits does not correspond to an open file, function returns with equal −1 and equal .
Please, notice that the set of spills out the upper border of up to a bit with index .
-
Monitors the above set of for «Ready to read» conditions. As we mentioned above, in addition to that we intended to monitor, the function may also monitor bits between inclusive and , that accidentally happen to have value 1.
-
Returns with equal to a number of that have a «Ready to read» condition occured. Before returning, the function builds a set of on which the condition was encountered. For this it sets the corresponding bits to 1 and clears all other bits to 0.
As we can see, never modifies bits that had value zero when passed to the function in a second argument, because it does not monitor the respective . Bits 1 on which read condition was encountered retain their values. All other bits 1 are cleared to zero.
The bits 1 are out of bounds. It follows, that on the step changes, i.e. corrupts, at least one bit of local data or a stack frame. It may corrupt multiple bits.
Summary
When used with
exceeding
,
and its helper macro
tend to corrupt local data or stack frames. Given the function normally runs in a loop, it has multiple opportunities to corrupt a stack.
Yuriy Koblents-Mishke
December 24, 2013