I read a paper: 'In search of an understandable consensus algorithm' (the Raft paper)

Raft is an algorithm for distributed consensus. It was introduced in this paper, 'In search of an understandable consensus algorithm' by Ongaro and Ousterhout, in 2014. It is a very readable paper and I recommend you read it. Raft has been implemented in Rust and is used in TiKV.

I thought it would be a good idea to re-read the paper, so here are my notes. They are kind of stream-of-consciousness.

For more info, Tang Liu explained Raft and the Rust implementation at RustConf 2018. There is also a visualisation.

1. Introduction

Makes the case for Raft existing: that although Paxos is fine, it is notoriously difficult to understand. That makes it hard to implement, verify, optimise, and extend. They point out Raft's novel features, but they're covered in more detail later.

2. Replicated state machines

This is background, but important. The high level idea is that the consensus problem is getting separate servers to agree on stuff, specifically a log. This is framed in terms of state machines where the log is a log of commands, but I don't understand why - afaict, these states are different to the Raft states described later, so the log is just data and the state machines are just machines?

They don't go into detail of why you need such replication, but I'm pretty sure that it is the usual distributed systems motivation of wanting robustness (and performance/scalability/geographical coverage) in the face of hardware and network failure.

"agree on stuff" is described semi-formally at the end of this section. Note that 'availability' here is not quite the same as in the CAP theorem, so there is no free lunch in that respect. Also important is "under non-Byzantine" conditions, i.e., that the servers and networks are secure. IOW, they're solving a robustness problem, but there might be security issues which need to be addressed (which is the same for other consensus algorithms, aiui).

3. What's wrong with Paxos?

Pre-Raft, Paxos was the de facto standard consensus algorithm. It works fine, but it is really, really complicated. The authors make a good case, although it is not particularly rigorous. OTOH, I don't think anyone would ever disagree.

4. Designing for understandability

OK, we get it, Raft is easier to understand than Paxos.

5. The Raft consensus algorithm

Right, this is where it gets interesting!

There's a big colourful summary in figure 2, but I'll come back to that once I've read the prose.

The high level summary is that rather than getting every server to agree on every value, the servers agree on a leader and then the leader is always right. If the leader disappears, then the cluster elects a new one.

There's a detailed description of what safety means for Raft in figure 3. It's a pretty roundabout way of describing safety; what isn't guaranteed is probably more important than what is. In particular, the guarantee is not that the logs on any two servers will match. The weaker guarantee given is that if two logs agree at some point, then they agree at every previous point; and that if an entry in the log has been "applied", then no server will have a different entry in its log at the same point (but note that it may have no entry there at all).

The paper then gets into the details of the algorithm. Mostly when and how a leader is elected.

Elections

The authors introduce the concept of terms. A term starts with an election and lasts until the elected leader disappears (which might be observable to different servers at different times). Terms are numbered and monotonically increasing. They are consecutively numbered, but elections can fail and so a term might not have any non-election time. Also, from the point of view of any one server, there might be gaps in the sequence of term numbers, but those terms will exist somewhere. Term numbers are a logical clock (they give a useful approximation of time). There can be at most one leader per term.

Every message includes the sender's term number. If a server receives a message with a term lower than its own current term, it ignores it. If a message's term number is higher, then the server updates its current term to the higher one, and if it thought it was a leader or candidate, it reverts to being a follower. This behaviour lets a server that has crashed or been cut off rejoin the cluster without causing errors, whether it was a leader or a follower.
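Here's roughly how I picture that rule in code. This is my own sketch in Rust (the `Server` and `observe_term` names are made up), not anything from the paper or the TiKV implementation:

```rust
// A minimal sketch of the term-handling rule (my own code, not from the
// paper or from any real Raft implementation).

#[derive(Debug, PartialEq)]
enum Role {
    Follower,
    Candidate,
    Leader,
}

struct Server {
    current_term: u64,
    role: Role,
}

impl Server {
    /// Called for every incoming message. Returns false if the message is
    /// stale and should be ignored.
    fn observe_term(&mut self, message_term: u64) -> bool {
        if message_term < self.current_term {
            // Stale message from an old term: ignore it.
            return false;
        }
        if message_term > self.current_term {
            // We are behind: adopt the newer term and, if we thought we were
            // a leader or candidate, step down to follower.
            self.current_term = message_term;
            self.role = Role::Follower;
        }
        true
    }
}

fn main() {
    let mut s = Server { current_term: 3, role: Role::Leader };
    assert!(!s.observe_term(2)); // stale, ignored
    assert!(s.observe_term(5));  // newer term seen: step down
    assert_eq!(s.role, Role::Follower);
    assert_eq!(s.current_term, 5);
}
```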

A leader periodically sends a heartbeat message. If a server does not see a heartbeat or other message from the leader within a given time, then it believes there is no leader and it should call an election. It marks itself as a candidate, increments its term, votes for itself and sends out a message to every other server requesting a vote. If it gets back a majority of votes, it confirms itself as leader. If it gets a message confirming another server as leader in the same or a later term (i.e., it lost the election), it reverts to being a follower. If it times out waiting for votes, it starts a new election. Timeouts are randomised so that elections should not repeatedly be split.

A server votes at most once in each term, and votes for the first candidate it receives a request from (in a given term). Note how terms are really important for ensuring the safety of elections!
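To make that concrete for myself, here is a toy sketch of the voter's side of an election (my own Rust, with made-up names; it ignores the log comparison that gets added in the safety section below):

```rust
// A toy sketch of the voting rule: vote at most once per term, for the first
// candidate that asks. My own code, not the paper's.

struct Follower {
    current_term: u64,
    voted_for: Option<u64>, // candidate id we voted for in current_term
}

impl Follower {
    /// Handle a RequestVote message; returns (term, vote_granted).
    fn handle_request_vote(&mut self, candidate_term: u64, candidate_id: u64) -> (u64, bool) {
        if candidate_term < self.current_term {
            return (self.current_term, false); // stale candidate
        }
        if candidate_term > self.current_term {
            // New term: forget any vote from the old term.
            self.current_term = candidate_term;
            self.voted_for = None;
        }
        let grant = match self.voted_for {
            None => {
                self.voted_for = Some(candidate_id);
                true
            }
            // Re-granting to the same candidate (e.g. on a retry) is harmless.
            Some(id) => id == candidate_id,
        };
        (self.current_term, grant)
    }
}

fn main() {
    let mut f = Follower { current_term: 1, voted_for: None };
    assert_eq!(f.handle_request_vote(2, 7), (2, true));  // first request in term 2: granted
    assert_eq!(f.handle_request_vote(2, 9), (2, false)); // already voted in term 2

    // The candidate wins when votes (including its own) reach a majority.
    let cluster_size = 5;
    let votes_received = 3;
    assert!(votes_received > cluster_size / 2);
}
```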

Normal operation

Clients send data to the leader - only the leader can drive the process of writing data to the log. When it receives the data, the leader appends it to its own log and sends it to all the other servers. At this stage it is uncommitted. Once a majority of servers have accepted the data, the leader considers it committed and applies the data to its state machine (which, again, I'm not really sure what this means, but I'm guessing it is an abstraction over any durable action in response to the data). Which log entries are committed is communicated in the heartbeat message. Log entries can only be committed in order. Once data is committed it can't be removed or changed. Once a follower observes that data has been committed, then it can apply it to its state machine too.
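The way I picture the leader deciding what is committed: the highest log index that a majority of servers have replicated. A sketch of just that calculation (my own code, ignoring the current-term restriction that comes up in the safety section):

```rust
// A sketch of how a leader might decide what is committed: the highest log
// index that a majority of servers (leader included) have replicated.

/// `match_indexes` holds, for every server in the cluster (leader included),
/// the highest log index known to be replicated on that server.
fn majority_committed_index(match_indexes: &[u64]) -> u64 {
    let mut sorted = match_indexes.to_vec();
    sorted.sort_unstable();
    // With the indexes sorted ascending, the entry at position
    // (len - majority) is replicated on at least a majority of servers.
    let majority = sorted.len() / 2 + 1;
    sorted[sorted.len() - majority]
}

fn main() {
    // 5 servers: the leader has index 7, followers have 7, 5, 4, and 2.
    let match_indexes = [7, 7, 5, 4, 2];
    // Indexes 1..=5 are on at least 3 of 5 servers, so 5 is committed.
    assert_eq!(majority_committed_index(&match_indexes), 5);
}
```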

This is like two-phase commit. IIUC, the difference is that in Raft there is no explicit 'no' vote: either a follower accepts the data to commit, or it times out (which happens because the leader cannot reach the follower, and so that follower calls an election).

The leader must keep trying to replicate uncommitted data until all servers have the data or it is deposed by the election of a new leader. That last caveat means that data can be lost if it is not confirmed to be committed.

The paper then covers a consistency check for ensuring that followers' logs match the leader's log, even after changes in leadership.
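My understanding of that check, as a sketch (my own Rust; the paper describes it in terms of the append message carrying the index and term of the entry just before the new ones):

```rust
// A sketch of the consistency check: the leader sends the index and term of
// the entry immediately preceding the new ones, and the follower rejects the
// request if its own log doesn't match at that point. My own simplification.

struct Entry {
    term: u64,
    // command payload omitted
}

/// Returns true if the follower's log contains an entry at `prev_index`
/// with term `prev_term` (index 0 means "before the first entry").
fn log_matches(log: &[Entry], prev_index: usize, prev_term: u64) -> bool {
    if prev_index == 0 {
        return true; // appending from the very start always matches
    }
    match log.get(prev_index - 1) {
        Some(entry) => entry.term == prev_term,
        None => false, // follower's log is too short
    }
}

fn main() {
    let follower_log = vec![Entry { term: 1 }, Entry { term: 1 }, Entry { term: 2 }];
    assert!(log_matches(&follower_log, 3, 2));  // matches: entry 3 has term 2
    assert!(!log_matches(&follower_log, 3, 3)); // conflicting term at index 3
    assert!(!log_matches(&follower_log, 5, 2)); // follower is missing entries
}
```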

Safety

This section introduces a restriction on which servers can be elected leader, in order to ensure that a leader always has all the committed data from previous terms. When a candidate requests a vote, the voting server will only vote for the candidate if that candidate's log is at least as up-to-date as the voter's. This works because data will only have been committed if it is in the logs of a majority of servers; therefore, if a candidate's log is missing committed data, it will be impossible for it to get a majority of votes.
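The "at least as up-to-date" comparison is spelled out in section 5.4.1: compare the terms of the last log entries, and if they are equal, the longer log wins. A sketch of that comparison (my own code):

```rust
// The "at least as up-to-date" comparison from section 5.4.1: later last
// term wins; equal terms fall back to log length. My own code.

/// Returns true if a candidate with (last_term, last_index) is at least as
/// up-to-date as a voter with (voter_last_term, voter_last_index).
fn candidate_up_to_date(
    candidate_last_term: u64,
    candidate_last_index: u64,
    voter_last_term: u64,
    voter_last_index: u64,
) -> bool {
    if candidate_last_term != voter_last_term {
        candidate_last_term > voter_last_term
    } else {
        candidate_last_index >= voter_last_index
    }
}

fn main() {
    assert!(candidate_up_to_date(3, 5, 2, 9));  // later term wins, even with a shorter log
    assert!(!candidate_up_to_date(2, 9, 3, 5)); // older term loses
    assert!(candidate_up_to_date(3, 7, 3, 5));  // same term: longer log wins
}
```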

Hmm, it seems a little more subtle than that. In figure 8, the authors present an example which would be unsafe: with the right timing of crashes and leader elections, a value that is replicated on a majority (and so looks committed) could still be overwritten. To prevent this they introduce another restriction: a leader only commits data by counting replicas from its current term. Data from a previous term can only be committed 'implicitly', by committing more recent data.
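A sketch of that extra rule, as I understand it (my own code, not the paper's):

```rust
// Section 5.4.2's extra commitment rule: the leader only advances the commit
// index for entries from its own current term; older entries become committed
// implicitly when a current-term entry after them is committed.

struct Entry {
    term: u64,
}

fn can_commit(log: &[Entry], index: usize, current_term: u64, replicas: usize, cluster: usize) -> bool {
    let majority = cluster / 2 + 1;
    replicas >= majority && log[index - 1].term == current_term
}

fn main() {
    let log = vec![Entry { term: 2 }, Entry { term: 2 }, Entry { term: 4 }];
    // Entry 2 is from an old term: even on a majority, it is not committed directly.
    assert!(!can_commit(&log, 2, 4, 3, 5));
    // Entry 3 is from the current term and on a majority: committing it also
    // implicitly commits everything before it.
    assert!(can_commit(&log, 3, 4, 3, 5));
}
```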

In section 5.4.3, the authors make a more formal argument for the safety of Raft. The proof is detailed, but there's nothing surprising given the previous sections.

6. Cluster membership changes

This section describes how the configuration of a cluster can be changed via joint consensus. A configuration change means adding or removing servers, which might include the leader. You can't simply swap from the old configuration to the new one because it's a distributed system and so nothing is atomic. Instead there is a joint consensus phase where the cluster consists of both old and new servers and the leader can come from either set. However, rather than requiring a single majority of the combined cluster, agreement (for elections and for committing entries) requires separate majorities in both the old and the new configurations.
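The majority counting under joint consensus, as a sketch (my own Rust, with made-up server ids):

```rust
// Counting votes under joint consensus: a candidate needs separate majorities
// in both the old and the new configuration. My own sketch, not the paper's.

use std::collections::HashSet;

fn has_joint_majority(votes: &HashSet<u64>, old_cfg: &HashSet<u64>, new_cfg: &HashSet<u64>) -> bool {
    let majority = |cfg: &HashSet<u64>| votes.intersection(cfg).count() > cfg.len() / 2;
    majority(old_cfg) && majority(new_cfg)
}

fn main() {
    let old_cfg: HashSet<u64> = [1, 2, 3].into_iter().collect();
    let new_cfg: HashSet<u64> = [3, 4, 5, 6, 7].into_iter().collect();

    // Votes from 1, 2, and 4: a majority of the old configuration,
    // but only 1 of 5 in the new one, so not enough.
    let votes: HashSet<u64> = [1, 2, 4].into_iter().collect();
    assert!(!has_joint_majority(&votes, &old_cfg, &new_cfg));

    // Adding 3 and 5 gives majorities in both configurations.
    let votes: HashSet<u64> = [1, 2, 3, 4, 5].into_iter().collect();
    assert!(has_joint_majority(&votes, &old_cfg, &new_cfg));
}
```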

The authors then spell out the mechanics of the changeovers and a few edge cases. One thing to note is that the log is used here for cluster metadata, not just data from the client, which feels like a significant change. It also adds a new state for servers - non-voting member of a cluster, in which state they are catching up their logs but not voting (I believe that in the Rust implementation these are called learners, but that is not a term from the paper).

7. Log compaction

Again, I was a bit lost here because I'm not sure what exactly is kept in the log. This section seems to make the assumption that the log itself is not important, only the current state of the cluster (and since I don't really understand what the states would be, I can't 100% make sense of that assumption). However, just treating the state machine as its own abstraction and not thinking about what that would mean in an application, this section is fairly straightforward.

The idea is that you take a snapshot of the state of a single server, and replace the log up to that point with the snapshot. The key observation, I think, is that log compaction is treated as an implementation detail of a single node, and is not visible to the rest of the cluster.

Snapshots are also sent from the leader to followers when they need to catch up, for example when adding a server to a cluster.
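A toy sketch of the snapshotting idea, as I understand it (my own code; a real snapshot would also carry the latest cluster configuration, and a real log would keep an index offset rather than physically shifting entries):

```rust
// Snapshot-based compaction: keep the applied state plus the index and term
// of the last entry the snapshot covers, then drop that prefix of the log.

struct Entry {
    term: u64,
    command: String,
}

struct Snapshot {
    last_included_index: u64,
    last_included_term: u64,
    state: Vec<String>, // stand-in for the state machine's state
}

fn compact(log: &mut Vec<Entry>, applied_up_to: u64, state: Vec<String>) -> Snapshot {
    let last_term = log[applied_up_to as usize - 1].term;
    // Discard the compacted prefix (a real implementation would keep an index
    // offset so later entries retain their logical indexes).
    log.drain(..applied_up_to as usize);
    Snapshot {
        last_included_index: applied_up_to,
        last_included_term: last_term,
        state,
    }
}

fn main() {
    let mut log = vec![
        Entry { term: 1, command: "x=1".into() },
        Entry { term: 1, command: "y=2".into() },
        Entry { term: 2, command: "x=3".into() },
    ];
    let snap = compact(&mut log, 2, vec!["x=1".into(), "y=2".into()]);
    assert_eq!(snap.last_included_index, 2);
    assert_eq!(snap.last_included_term, 1);
    assert_eq!(snap.state.len(), 2);
    assert_eq!(log.len(), 1); // only the uncompacted suffix remains
    assert_eq!(log[0].command, "x=3");
}
```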

8. Client interaction

Clients only interact with the leader. Finding the leader is an interesting detail: a client contacts an arbitrary server, and if that server is not the leader, it rejects the request and tells the client about the most recent leader it has heard of.

This section also describes further details required to make Raft implement linearisable semantics. Each write from a client must have a sequence number (ensuring a write happens exactly once). Before responding to a read-only request from a client, the leader must verify that it is still really the leader (to prevent stale reads).
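The write-deduplication part I picture roughly like this (my own sketch; the paper just says the state machine tracks the latest serial number processed for each client):

```rust
// Deduplicating client writes: the state machine remembers the latest serial
// number it has applied for each client and refuses to re-apply a duplicate.

use std::collections::HashMap;

struct StateMachine {
    applied: HashMap<u64, u64>, // client id -> latest serial number applied
    value: i64,
}

impl StateMachine {
    fn apply(&mut self, client_id: u64, serial: u64, delta: i64) {
        let last = self.applied.get(&client_id).copied().unwrap_or(0);
        if serial <= last {
            return; // duplicate (e.g. a retried request): already applied
        }
        self.value += delta;
        self.applied.insert(client_id, serial);
    }
}

fn main() {
    let mut sm = StateMachine { applied: HashMap::new(), value: 0 };
    sm.apply(42, 1, 10);
    sm.apply(42, 1, 10); // retry of the same request: ignored
    sm.apply(42, 2, 5);
    assert_eq!(sm.value, 15);
}
```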

9. Implementation and evaluation

The authors did a user study for understandability. I can't really judge this. It seems kind of obvious and has certainly stood the test of time. It feels like the user study is box-ticking.

The authors have formalised Raft using TLA+ and proved it correct. There is also a hand-written proof. Neither is included in the paper, so there's not much to say about them. Nice to know they exist, I guess.

The authors hint at some performance optimisations, but not in detail. They show experimentally that randomisation of the timeouts prevents repeated split votes. There is a performance trade-off in setting the election timeout: too short and you get unnecessary elections; too long and recovering from a failed leader takes longer. They recommend 150-300ms timeouts with 5-50ms of randomness. I imagine the optimal numbers depend on the parameters around heartbeat timing for any particular cluster (e.g., load, network latency, etc.).
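For what it's worth, the kind of thing I have in mind for the randomised timeout (my own toy code; it uses the system clock as a stand-in for a proper random number generator so it doesn't need any extra crates):

```rust
// Each server picks an election timeout uniformly from a fixed range
// (150-300ms here), so two followers rarely time out at the same instant.

use std::time::{Duration, SystemTime, UNIX_EPOCH};

fn election_timeout() -> Duration {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .subsec_nanos() as u64;
    // Map the nanosecond component onto the 150-300ms range.
    Duration::from_millis(150 + nanos % 151)
}

fn main() {
    let t = election_timeout();
    assert!(t >= Duration::from_millis(150) && t <= Duration::from_millis(300));
    println!("this server would wait {:?} before starting an election", t);
}
```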

10. Related work

Paxos exists, there are implementations (Chubby, ZooKeeper, Spanner). Viewstamped Replication is another consensus algorithm, more similar to Raft.

11. Conclusion

In which they conclude.

My conclusion is that this is a great paper - it's a very readable explanation of a very interesting algorithm. There are no big "aha" moments; each step of the algorithm makes sense and is pretty simple, but put them together and you get this magic result!

I wish they gave a more practical example. As a relative outsider to distributed systems it would have helped my understanding. But that is a small niggle.

My capsule description of Raft: a leader/follower system where only the leader can handle reads and writes to the cluster. The leader commits writes using two-phase commit (-ish). If a leader disappears, a new leader is elected by a majority of followers.