Async IO fundamentals

Async programming in Rust is built on top of the operating system's async IO facilities. While it is possible to use async/await purely for control flow, most people use it for async IO. The runtime and libraries abstract away many of the details of IO (which is good, because working directly with the OS is fiddly and error-prone), but it is still useful to understand how async IO works at a high level, so that you understand the performance characteristics and can choose the best systems and libraries for your application.

Note that I'm not an expert on this stuff and this blog post is introductory.

What is async anyway

When talking about IO (and async programming in general, most of the time), async means that we don't block the thread while waiting for IO to complete. So if the user requests some IO from the OS, and the OS performs the IO before returning, handing back the result when it's done, that is synchronous (aka blocking) IO. With async IO, the user requests the IO, can then get on with other things, and when the IO is finished, gets the result.

In pseudo code (which looks nothing like real life):

// Synchronous
let result = do_some_io(); // This call could take a while.

// Async
// Do some other work ...
// When the IO is done:
let result = get_result_of_io(); // This call is quick because the IO was already finished.

Note that we are only talking about a single user thread. We can always make things asynchronous by using multiple threads, but that is not what most people mean when they are talking about async IO or async programming.

An alternative terminology calls this kind of IO non-blocking IO, and reserves async for Linux's async IO syscalls. That usage seems less common in the Rust community, at least.
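To make the blocking/non-blocking distinction concrete, here is a small runnable sketch. It uses Python rather than Rust because Python's standard os module wraps the underlying OS facilities directly; the pipe is just a stand-in for a real IO resource.

```python
import os

# A pipe stands in for any IO resource (socket, file, etc.).
r_fd, w_fd = os.pipe()

# Put the read end into non-blocking mode (sets O_NONBLOCK under the hood).
os.set_blocking(r_fd, False)

# A synchronous read would block here until data arrived. In non-blocking
# mode, the call returns immediately with an error instead.
try:
    os.read(r_fd, 1024)
except BlockingIOError:
    pass  # No data yet; the thread is free to do other work.

# ... do some other work, then the data arrives ...
os.write(w_fd, b"done")

# Now that the IO is ready, the read completes quickly.
result = os.read(r_fd, 1024)

os.close(r_fd)
os.close(w_fd)
```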

Readiness and completion models

There are two high-level models for async IO: readiness and completion (I'm not sure how widely these terms are used to define the model of asynchronicity, but they are common terms). Epoll and kqueue are examples of the readiness model; IOCP and io_uring are examples of the completion model. The basic difference is that in the readiness model, the OS notifies the user when the resource is ready to read or write; in the completion model, the OS notifies the user when reading or writing to/from the resource is complete.

More pseudo code:

// Readiness
when io_is_ready {
    let mut buf = ...;
    read(&mut buf);
    // Do something with the data we read into buf.
}

// Completion
let mut buf = ...;
start_some_io(&mut buf);
when io_is_complete {
    // Do something with the data we read into buf.
}

Note that we've still hand-waved quite a bit about how the OS notifies the user that IO is ready or complete. There are lots of ways to do that: at the highest level, either the user has to check with the OS (polling), or the OS has to interrupt the user. The interrupt approach doesn't seem to be widely used; the Linux async IO syscalls (AIO, not to be confused with async IO in general) are the best-known example, I think.

An essential observation is that it is extremely inefficient to poll the OS individually for each IO in progress. Instead, the user should ask the OS about all or many IOs in progress at once and find out which are ready/complete (called multiplexing).

I'll describe some of the common async IO mechanisms and how notification works for each.

select, poll, and epoll

The select and poll syscalls, and the epoll family of syscalls, do basically the same thing: they give the OS a set of resources to watch, and wait (block with a timeout) until at least one of them is ready to read/write. The difference between them is how that set of resources is specified.

With select or poll, the set of resources is passed to the OS each time the syscall is made. In other words, it is maintained by the user. They are both POSIX calls, so are portable and available on all Unix systems.
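As a sketch of that user-maintained watch set, here is poll through Python's standard select module, which wraps the syscall directly (again, a pipe stands in for a real IO resource):

```python
import os
import select

r_fd, w_fd = os.pipe()

# With poll, the user builds and maintains the set of watched resources;
# the whole set is handed to the OS on every poll call.
poller = select.poll()
poller.register(r_fd, select.POLLIN)

# Nothing has been written yet, so a zero-timeout poll reports nothing ready.
assert poller.poll(0) == []

# Once data is written, the read end is reported as ready to read.
os.write(w_fd, b"hello")
events = poller.poll(0)
ready = [fd for fd, _ in events]
assert ready == [r_fd]
```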

With epoll, the set of resources is maintained inside the OS, and the user modifies the set with a separate syscall (epoll_ctl, cf. epoll_wait, which waits for the resources to be ready). That is convenient, but more importantly, epoll is wildly more performant than poll or select when dealing with large numbers of resources. It is, however, Linux-specific.

With these approaches, when the IO resource is ready, the user must then call a read or write syscall to transfer data.
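The same scenario with epoll, again via Python's select module: the watch set lives in the kernel (registering is a separate operation from waiting), and after the readiness notification the user still has to make a read call to transfer the data.

```python
import os
import select

r_fd, w_fd = os.pipe()

# The epoll set is maintained inside the kernel; register() issues an
# epoll_ctl syscall, separate from the epoll_wait performed by poll().
ep = select.epoll()
ep.register(r_fd, select.EPOLLIN)

os.write(w_fd, b"ping")

# Wait for the kernel to report the resource as ready to read...
events = ep.poll(timeout=1)
assert any(fd == r_fd for fd, _ in events)

# ...then the user must still issue a read to actually transfer the data.
data = os.read(r_fd, 1024)

ep.close()
```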

Edge and level triggering

One detail is what to do if a resource is ready to read/write but has not yet been read/written. Either the OS can keep reporting the resource as ready (called level-triggered IO), or the OS can report the resource as ready only once and then stop (called edge-triggered IO). You can think of these alternatives as reporting based on the current state vs reporting based on changes.

Select and poll are always level-triggered. Epoll can be configured to be either level- or edge-triggered; level-triggered is the default. Epoll also supports a one-shot mode, which is like a more extreme version of edge-triggering where the user is notified only once that a resource is ready, even if multiple events occur.
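The difference is easy to observe with Python's epoll wrapper: a level-triggered fd keeps being reported while unread data remains (state), whereas an edge-triggered fd is reported once per readiness change.

```python
import os
import select

# Level-triggered (the default): readiness is reported on every wait
# for as long as unread data remains.
r1, w1 = os.pipe()
level = select.epoll()
level.register(r1, select.EPOLLIN)
os.write(w1, b"x")
assert len(level.poll(timeout=0)) == 1
assert len(level.poll(timeout=0)) == 1  # still ready: reporting by state
os.read(r1, 1024)
assert level.poll(timeout=0) == []      # data consumed, no longer ready
level.close()

# Edge-triggered (EPOLLET): readiness is reported once per change.
r2, w2 = os.pipe()
edge = select.epoll()
edge.register(r2, select.EPOLLIN | select.EPOLLET)
os.write(w2, b"x")
assert len(edge.poll(timeout=0)) == 1   # reported for the change...
assert edge.poll(timeout=0) == []       # ...but not again, though unread
edge.close()
```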


IOCP

IOCP is completion-based and Windows-only (note that async IO on Windows is often called overlapped IO). A completion port (the CP in IOCP) is similar to an epoll object in that it is a set of IO resources managed inside the OS which the user accesses indirectly. The major difference is that after epoll informs the user that the resource is ready, the user must then read or write the resource. When IOCP informs the user that the IO is complete, the data has already been read into the user's memory, or written from it (i.e., the IO is complete).

For this to work, the user must allocate a buffer to read into or write out of when starting the IO and keep it alive (and not overwrite it) until the IO completes.


io_uring

io_uring is similar to IOCP in that it is completion-based and requires the user to maintain a buffer while the IO takes place asynchronously. The difference is in the details of how the user sets things up and checks for notifications (and in the implementation, of course). In particular, io_uring uses ring buffers shared between the user and the kernel to minimise the number of syscalls (I need to understand this part better!).

Which is better?

The major advantage of readiness-based IO is that buffers don't need to be allocated ahead of time. That means there is no memory that must be kept alive from initiating the IO until the data can be copied to/from the OS. That reduces memory usage, makes code simpler (because it simplifies buffer management), and permits multiple IOs on the same thread to share a buffer.

Completion-based IO does require eager allocation of buffers, but it permits a zero-copy approach where data can be written directly to/from user memory without being copied by the OS.