Skip to main content

Raft Stage2

··4 mins

Last time, we finished the first stage of Raft, sucessfully electing a leader without logs. In stage 2, which is the most difficult stage I think, we need to implement the log replication. Leader will receive the command from the client, append it to its log, and then send AppendEntries RPC to the followers. Followers will append the entries to their logs and reply with success or failure.

Vote with logs #

In stage 2, we need to consider whether to vote for a candidate based on the logs. A server will vote for a candidate if the candidate’s log is at least as up-to-date as its own log, no matter what the term is. Since we have assumed only leader can receive the command from the client, this is a valid strategy. We need to prevent an isolated server with stale logs keep increasing its term, when it finally gets connected to the cluster, it will become the leader because it has the highest term, but its logs are stale, which leads to wrong results.

Log replication #

Whenever the leader receives a command from the client, it will append the command to its log and then send AppendEntries RPC to the followers.
The AppendEntries RPC contains

  • term: leader’s term
  • leaderId: so followers can redirect clients
  • prevLogIndex and prevLogTerm: to make sure followers have the same log as the leader
  • entries: the log entries to store (empty for heartbeat; may send more than one for efficiency)
  • leaderCommit: leader’s commitIndex, used for apply logs to state machine

When a follower receives an AppendEntries RPC, it will first check the term as we did in stage 1, if the term is smaller than the current term, it will reply with false and current term, which will tell the stale leader to step down. If prevLogIndex and prevLogTerm don’t match the follower’s log, which means there’s mismatch between the leader and the follower, the leader will decrease the nextIndex for that follower and retry until it finds the match.

Leader #

For leader, it needs to maintain two arrays, nextIndex and matchIndex. nextIndex[i] is the index of the next log entry to send to follower i, which is initialized to leader’s last log index + 1. matchIndex[i] is the index of the highest log entry known to be replicated on follower i, which is initialized to 0. Whenever the leader receives a successful reply from a follower, it will update nextIndex and matchIndex for that follower.

Once over half of the followers have replicated a log entry, which can be calculated by matchIndex, the leader can commit that log entry and apply it to the state machine. All followers will receive the commitIndex in the AppendEntries RPC, and they will apply the log entries up to commitIndex to their state machines.

Follower #

What follower should do is quite simple, it just needs to append the entries to its log and reply with success or failure according to the rules we mentioned above.

Improvements beyond paper #

A common issue is when a new follower has no log entries and the leader already has many log entries, the leader decreases the nextIndex for that follower one by one until it finds the match, which is very inefficient. To solve this problem, we have different strategies, one is to use binary search to find the match, which can reduce the time complexity from O(n) to O(log n). Another strategy is to use a hint from the follower’s reply, which can tell the leader where the mismatch is, using conflictIndex and conflictTerm in the reply, we will check conflictTerm in the leader’s log first, if exists, we will set nextIndex to the last index of that term + 1, if not exists, we will take a look at conflictIndex, which is the index of the first log entry with a term larger than conflictTerm, and set nextIndex to conflictIndex. This can significantly reduce the time to find the match.