# Raft implementation

SwarmKit uses the Raft consensus protocol to synchronize state between manager
nodes and support high availability. The lowest-level portions of this are
provided by the `github.com/coreos/etcd/raft` package. SwarmKit's
`github.com/docker/swarmkit/manager/state/raft` package builds a complete
solution on top of this, adding things like saving and loading state on disk,
an RPC layer so nodes can pass Raft messages over a network, and dynamic cluster
membership.

## A quick review of Raft

The details of the Raft protocol are outside the scope of this document, but
it's well worth reviewing the [raft paper](https://raft.github.io/raft.pdf).

Essentially, Raft gives us two things. It provides the mechanism to elect a
leader, which serves as the arbiter of all consensus decisions. It also provides
a distributed log that we can append entries to, subject to the leader's
approval. The distributed log is the basic building block for agreeing on and
distributing state. Once an entry in the log becomes *committed*, it becomes an
immutable part of the log that will survive any future leader elections and
changes to the cluster. We can think of a committed log entry as a piece of state
that the cluster has reached agreement on.

## Role of the leader

The leader has special responsibilities in the Raft protocol, but we also assign
it special functions in SwarmKit outside the context of Raft. For example, the
scheduler, orchestrators, dispatcher, and CA run on the leader node. This is not
a design requirement, but it simplifies things somewhat. If these components ran
in a distributed fashion, we would need some mechanism to resolve conflicts
between writes made by different nodes. Limiting decision-making to the leader
avoids the need for this, since we can be certain that there is at most one
leader at any time. The leader is also guaranteed to have the most up-to-date
data in its store, so it is best positioned to make decisions.

The basic rule is that anything which writes to the Raft-backed data store needs
to run on the leader. If a follower node tries to write to the data store, the
write will fail. Writes will also fail on a node that starts out as the leader
but loses its leadership position before the write finishes.

## Raft IDs vs. node IDs

Nodes in SwarmKit are identified by alphanumeric strings, but `etcd/raft` uses
integers to identify Raft nodes. Thus, managers have two distinct IDs. The Raft
IDs are assigned dynamically when a node joins the Raft consensus group. A node
could potentially leave the Raft consensus group (through demotion), then later
get promoted and rejoin under a different Raft ID. In this case, the node ID
would stay the same, because it's a cryptographically-verifiable property of the
node's certificate, but the Raft ID is assigned arbitrarily and would change.

It's important to note that a Raft ID can't be reused after a node that was
using the ID leaves the consensus group. The Raft IDs of nodes that are no
longer part of the cluster are persisted on disk in a list (a blacklist, if you
will) to make sure they aren't reused. If a node with a Raft ID on this list
tries to use Raft RPCs, other nodes won't honor these requests. `etcd/raft`
itself doesn't allow reuse of Raft IDs, which is likely done to avoid ambiguity.
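
To make the distinction between the two ID spaces concrete, here is a small,
hypothetical sketch of a member record and the removal blacklist. The types and
names below are illustrative only; they are not SwarmKit's actual data
structures.

```go
// Illustrative only: a manager carries both identifiers.
type member struct {
	nodeID string // derived from the node's certificate; never changes
	raftID uint64 // assigned when the node joins the consensus group; never reused
}

// removedRaftIDs plays the role of the persisted blacklist: Raft RPCs from
// any ID in this set are not honored.
var removedRaftIDs = map[uint64]struct{}{}

// acceptRaftMessage returns false for senders that have left the consensus group.
func acceptRaftMessage(senderRaftID uint64) bool {
	_, removed := removedRaftIDs[senderRaftID]
	return !removed
}
```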
The blacklist of demoted/removed nodes is used to restrict these nodes from
communicating and affecting cluster state. A membership list is also persisted;
however, it does not restrict communication between nodes. This is done to favor
stability (and availability, by enabling a faster return to a non-degraded
state) over consistency: newly added nodes, whose addition may not yet have
propagated to all members of the Raft group, can join and communicate with the
group even though the membership list may not be consistent at that point in
time (it eventually will be). In the case of node demotion or removal from the
group, the affected node may be able to communicate with the other members until
the change is fully propagated.

## Logs and snapshots

There are two sets of files on disk that provide persistent state for Raft.
There is a set of WAL (write-ahead log) files. These store a series of log
entries and Raft metadata, such as the current term, index, and committed index.
WAL files are automatically rotated when they reach a certain size.

To avoid having to retain every entry in the history of the log, snapshots
serialize a view of the state at a particular point in time. After a snapshot
gets taken, logs that predate the snapshot are no longer necessary, because the
snapshot captures all the information that's needed from the log up to that
point. The number of old snapshots and WALs to retain is configurable.

In SwarmKit's usage, WALs mostly contain protobuf-serialized data store
modifications. A log entry can contain a batch of creations, updates, and
deletions of objects from the data store. Some log entries contain other kinds
of metadata, like node additions or removals. Snapshots contain a complete dump
of the store, as well as any metadata from the log entries that needs to be
preserved. The saved metadata includes the Raft term and index, a list of nodes
in the cluster, and a list of nodes that have been removed from the cluster.

WALs and snapshots are both stored encrypted, even if the autolock feature is
disabled. With autolock turned off, the data encryption key is stored on disk in
plaintext, in a header inside the TLS key. When autolock is turned on, the data
encryption key is encrypted with a key encryption key.

## Initializing a Raft cluster

The first manager of a cluster (`swarm init`) assigns itself a random Raft ID.
It creates a new WAL with its own Raft identity stored in the metadata field.
The metadata field is the only part of the WAL that differs between nodes. By
storing information such as the local Raft ID, it's easy to restore this
node-specific information after a restart. In principle it could be stored in a
separate file, but embedding it inside the WAL is most convenient.

The node then starts the Raft state machine. From this point, it's a fully
functional single-node Raft instance. Writes to the data store actually go
through Raft, though this is a trivial case because reaching consensus doesn't
involve communicating with any other nodes. The `Run` loop sees these writes and
serializes them to disk as requested by the `etcd/raft` package.
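
For reference, bootstrapping a single-node Raft instance with `etcd/raft` looks
roughly like the sketch below. This is a minimal illustration using the public
`etcd/raft` API, not SwarmKit's actual initialization code, which also sets up
the (encrypted) WAL and snapshotter and records the node's Raft identity in the
WAL metadata.

```go
package main

import "github.com/coreos/etcd/raft"

// startSingleNode bootstraps a fully functional single-node Raft instance.
// With itself as the only peer, reaching consensus doesn't require
// communicating with any other node.
func startSingleNode(raftID uint64) raft.Node {
	storage := raft.NewMemoryStorage() // SwarmKit backs this with its WAL and snapshots
	cfg := &raft.Config{
		ID:              raftID, // the randomly assigned Raft ID
		ElectionTick:    10,
		HeartbeatTick:   1,
		Storage:         storage,
		MaxSizePerMsg:   1 << 20,
		MaxInflightMsgs: 256,
	}
	return raft.StartNode(cfg, []raft.Peer{{ID: raftID}})
}
```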
## Adding and removing nodes

New nodes can join an existing Raft consensus group by invoking the `Join` RPC
on the leader node. This corresponds to joining a swarm with a manager-level
token, or promoting a worker node to a manager. If successful, `Join` returns a
Raft ID for the new node and a list of other members of the consensus group.

On the leader side, `Join` tries to append a configuration change entry to the
Raft log, and waits until that entry becomes committed.

A new node creates an empty Raft log with its own node information in the
metadata field. Then it starts the state machine. By running the Raft consensus
protocol, the leader will discover that the new node doesn't have any entries in
its log, and will synchronize these entries to the new node through some
combination of sending snapshots and log entries. It can take a little while for
a new node to become a functional member of the consensus group, because it
needs to receive this data first.

On the node receiving the log, code watching changes to the store will see log
entries replayed as if the changes to the store were happening at that moment.
This doesn't just apply when nodes receive logs for the first time; in general,
when followers receive log entries with changes to the store, those are replayed
in the follower's data store.

Removing a node through demotion is a bit different. This requires two
coordinated changes: the node must renew its certificate to get a worker
certificate, and it should also be cleanly removed from the Raft consensus
group. To avoid inconsistent states, particularly in cases like demoting the
leader, there is a reconciliation loop that handles this in
`manager/role_manager.go`. To initiate demotion, the user changes a node's
`DesiredRole` to `Worker`. The role manager detects any nodes that have been
demoted but are still acting as managers, and first removes them from the
consensus group by calling `RemoveMember`. Only once this has happened is it
safe to change the `Role` field to get a new certificate issued, because issuing
a worker certificate to a node still participating in the Raft group could cause
loss of quorum.

`RemoveMember` works similarly to `Join`. It appends an entry to the Raft log
removing the member from the consensus group, and waits until this entry becomes
committed. Once a member is removed, its Raft ID can never be reused.

There is a special case when the leader is being demoted. It cannot reliably
remove itself, because this involves informing the other nodes that the removal
log entry has been committed, and if any of those messages are lost in transit,
the leader won't have an opportunity to retry sending them, since demotion
causes the Raft state machine to shut down. To solve this problem, the leader
demotes itself simply by transferring leadership to a different manager node.
When another node becomes the leader, the role manager will start up on that
node, and it will be able to demote the former leader without this complication.
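
As an illustration of that ordering, here is a deliberately simplified,
hypothetical sketch of the reconciliation described above. The helpers
(`transferLeadership`, `removeMember`, `issueWorkerCert`) are placeholders, not
SwarmKit's API; the real logic lives in `manager/role_manager.go` and handles
retries and more edge cases.

```go
// Hypothetical sketch of demoting a manager; not SwarmKit's actual code.
func reconcileDemotion(ctx context.Context, isLeader bool, raftID uint64, nodeID string) error {
	if isLeader {
		// The leader can't reliably remove itself, so it hands off
		// leadership and lets the role manager on the new leader finish.
		return transferLeadership(ctx)
	}
	// Remove the node from the consensus group first...
	if err := removeMember(ctx, raftID); err != nil {
		return err // retried on the next reconciliation pass
	}
	// ...and only then change the Role so a worker certificate is issued.
	// Doing this in the opposite order could cost the cluster quorum.
	return issueWorkerCert(ctx, nodeID)
}
```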
## The main Raft loop

The `Run` method acts as a main loop. It receives ticks from a ticker, and
forwards these to the `etcd/raft` state machine, which relies on external code
for timekeeping. It also receives `Ready` structures from the `etcd/raft` state
machine on a channel.

A `Ready` message conveys the current state of the system, provides a set of
messages to send to peers, and includes any items that need to be acted on or
written to disk. It is basically `etcd/raft`'s mechanism for communicating with
the outside world and expressing its state to higher-level code.

There are five basic functions the `Run` method performs when it receives a
`Ready` message:

1. Write new entries or a new snapshot to disk.
2. Forward any messages for other peers to the right destinations over gRPC.
3. Update the data store based on new snapshots or newly-committed log entries.
4. Evaluate the current leadership status, and signal to other code if it
   changes (for example, so that components like the orchestrator can be started
   or stopped).
5. If enough entries have accumulated between snapshots, create a new snapshot
   to compact the WALs. The snapshot is written asynchronously and notifies the
   `Run` method on completion.
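
The shape of this loop follows the standard `etcd/raft` usage pattern. A
condensed sketch is shown below; `saveToStorage`, `send`, and `apply` are
placeholder helpers standing in for SwarmKit's persistence, transport, and
store-apply code, and the real `Run` method also handles snapshotting,
leadership changes, membership changes, and shutdown.

```go
// Condensed sketch of the main loop, not SwarmKit's actual Run method.
func run(ctx context.Context, n raft.Node, ticker *time.Ticker) {
	for {
		select {
		case <-ticker.C:
			// etcd/raft relies on external code for timekeeping.
			n.Tick()
		case rd := <-n.Ready():
			// Persist hard state, new entries, and any snapshot first.
			saveToStorage(rd.HardState, rd.Entries, rd.Snapshot)
			// Forward outgoing messages to the right peers over gRPC.
			send(rd.Messages)
			// Apply the snapshot and committed entries to the data store.
			for _, entry := range rd.CommittedEntries {
				apply(entry)
			}
			// Tell etcd/raft this Ready batch has been fully processed.
			n.Advance()
		case <-ctx.Done():
			return
		}
	}
}
```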
## Communication between nodes

The `etcd/raft` package does not implement communication over a network. It
references nodes by IDs, and it is up to higher-level code to convey messages to
the correct places.

SwarmKit uses gRPC to transfer these messages. The interface for this is very
simple. Messages are only conveyed through a single RPC named
`ProcessRaftMessage`.

There is an additional RPC called `ResolveAddress` that deals with a corner case
that can happen when nodes are added to a cluster dynamically. If a node was
down while the current cluster leader was added, or didn't mark the log entry
that added the leader as committed (which is done lazily), this node won't have
the leader's address. It would receive RPCs from the leader, but not be able to
invoke RPCs on the leader, so the communication would only happen in one
direction. It would normally be impossible for the node to catch up. With
`ResolveAddress`, it can query other cluster members for the leader's address,
and restore two-way communication. See
https://github.com/docker/swarmkit/issues/436 for more details on this
situation.

SwarmKit's `raft/transport` package abstracts the mechanism for keeping track of
peers and sending messages to them over gRPC in a specific message order.

## Integration between Raft and the data store

The Raft `Node` object implements the `Proposer` interface, which the data store
uses to propagate changes across the cluster. The key method is `ProposeValue`,
which appends information to the distributed log.

The guts of `ProposeValue` are inside `processInternalRaftRequest`. This method
appends the message to the log, and then waits for it to become committed. There
is only one way `ProposeValue` can fail, which is for the node where it's
running to lose its position as the leader. If the node remains the leader,
there is no way a proposal can fail, since the leader controls which new entries
are added to the log, and can't retract an entry once it has been appended. It
can, however, take an indefinitely long time for a quorum of members to
acknowledge the new entry. There is no timeout on `ProposeValue` because a
timeout wouldn't retract the log entry, so having a timeout could put us in a
state where a write timed out but ends up going through later on. This would
make the data store inconsistent with what's actually in the Raft log, which
would be very bad.

When the log entry successfully becomes committed, `processEntry` triggers the
wait associated with this entry, which allows `processInternalRaftRequest` to
return. On a leadership change, all outstanding waits get cancelled.

## The Raft RPC proxy

As mentioned above, writes to the data store are only allowed on the leader
node. But any manager node can receive gRPC requests, and workers don't even
attempt to route those requests to the leader. Somehow, requests that involve
writing to the data store or seeing a consistent view of it need to be
redirected to the leader.

We generate wrappers around RPC handlers using the code in
`protobuf/plugin/raftproxy`. These wrappers check if the current node is the
leader, and serve the RPC locally in that case. In the case where some other
node is the leader, the wrapper invokes the same RPC on the leader instead,
acting as a proxy. The proxy inserts identity information for the client node in
the gRPC headers of the request, so that clients can't achieve privilege
escalation by going through the proxy.

If one of these wrappers is registered with gRPC instead of the generated server
code itself, the server in question will automatically proxy its requests to the
leader. We use this for most APIs such as the dispatcher, control API, and CA.
However, there are some cases where RPCs need to be invoked directly instead of
being proxied to the leader, and in these cases, we don't use the wrappers. Raft
itself is a good example of this: if `ProcessRaftMessage` were always forwarded
to the leader, it would be impossible for the leader to communicate with other
nodes. Incidentally, this is why the Raft RPCs are split between a `Raft`
service and a `RaftMembership` service. The membership RPCs `Join` and `Leave`
need to run on the leader, but RPCs such as `ProcessRaftMessage` must not be
forwarded to the leader.
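
Conceptually, each generated wrapper reduces to a leader check followed by
either local dispatch or a proxied call. The sketch below is hypothetical:
`SomeRPC`, `isLeader`, `leaderConn`, and `withClientIdentity` are illustrative
placeholders, not the code actually generated by `protobuf/plugin/raftproxy`.

```go
// Hypothetical shape of a raftproxy wrapper for a single RPC.
func (p *raftProxyServer) SomeRPC(ctx context.Context, req *SomeRequest) (*SomeResponse, error) {
	if p.isLeader() {
		// This node is the leader: serve the request locally.
		return p.local.SomeRPC(ctx, req)
	}
	// Otherwise forward the call to the leader, embedding the caller's
	// identity in the gRPC headers so the extra hop can't be used for
	// privilege escalation.
	conn, err := p.leaderConn(ctx)
	if err != nil {
		return nil, err
	}
	return NewSomeServiceClient(conn).SomeRPC(withClientIdentity(ctx), req)
}
```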