
     1  # Raft implementation
     2  
     3  SwarmKit uses the Raft consensus protocol to synchronize state between manager
     4  nodes and support high availability. The lowest level portions of this are
     5  provided by the `github.com/coreos/etcd/raft` package. SwarmKit's
     6  `github.com/docker/swarmkit/manager/state/raft` package builds a complete
     7  solution on top of this, adding things like saving and loading state on disk,
     8  an RPC layer so nodes can pass Raft messages over a network, and dynamic cluster
     9  membership.
    10  
    11  ## A quick review of Raft
    12  
    13  The details of the Raft protocol are outside the scope of this document, but
    14  it's well worth reviewing the [raft paper](https://raft.github.io/raft.pdf).
    15  
    16  Essentially, Raft gives us two things. It provides the mechanism to elect a
leader, which serves as the arbiter of all consensus decisions. It also provides
    18  a distributed log that we can append entries to, subject to the leader's
    19  approval. The distributed log is the basic building block for agreeing on and
    20  distributing state. Once an entry in the log becomes *committed*, it becomes an
    21  immutable part of the log that will survive any future leader elections and
changes to the cluster. We can think of a committed log entry as a piece of state
    23  that the cluster has reached agreement on.
    24  
    25  ## Role of the leader
    26  
    27  The leader has special responsibilities in the Raft protocol, but we also assign
    28  it special functions in SwarmKit outside the context of Raft. For example, the
    29  scheduler, orchestrators, dispatcher, and CA run on the leader node. This is not
    30  a design requirement, but simplifies things somewhat. If these components ran in
    31  a distributed fashion, we would need some mechanism to resolve conflicts between
    32  writes made by different nodes. Limiting decision-making to the leader avoids
    33  the need for this, since we can be certain that there is at most one leader at
    34  any time. The leader is also guaranteed to have the most up-to-date data in its
    35  store, so it is best positioned to make decisions.
    36  
    37  The basic rule is that anything which writes to the Raft-backed data store needs
    38  to run on the leader. If a follower node tries to write to the data store, the
    39  write will fail. Writes will also fail on a node that starts out as the leader
    40  but loses its leadership position before the write finishes.
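
To make the rule concrete, a write path can be sketched like this; the types and
helpers below are hypothetical stand-ins, not SwarmKit's actual API:

```go
package raftexample

import (
	"context"
	"errors"
)

// ErrLostLeadership mirrors the failure mode described above: a write
// attempted on a follower, or on a node that loses leadership before
// the write finishes.
var ErrLostLeadership = errors.New("node is not (or is no longer) the raft leader")

// store and node are hypothetical stand-ins for SwarmKit's data store
// and raft Node; the real types are considerably richer.
type store struct{ objects map[string][]byte }

type node struct {
	isLeader func() bool
	st       *store
}

// update applies fn to the store only on the leader. The real code also
// fails the write if leadership is lost before the raft entry commits.
func (n *node) update(ctx context.Context, fn func(*store) error) error {
	if !n.isLeader() {
		return ErrLostLeadership
	}
	return fn(n.st)
}
```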
    41  
    42  ## Raft IDs vs. node IDs
    43  
    44  Nodes in SwarmKit are identified by alphanumeric strings, but `etcd/raft` uses
    45  integers to identify Raft nodes. Thus, managers have two distinct IDs. The Raft
    46  IDs are assigned dynamically when a node joins the Raft consensus group. A node
    47  could potentially leave the Raft consensus group (through demotion), then later
    48  get promoted and rejoin under a different Raft ID. In this case, the node ID
    49  would stay the same, because it's a cryptographically-verifiable property of the
    50  node's certificate, but the Raft ID is assigned arbitrarily and would change.
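
Conceptually, a manager's membership record carries both identifiers; the struct
below is only an illustration, not SwarmKit's actual type:

```go
// Illustrative only: every manager is known by two identifiers.
type member struct {
	// NodeID is the alphanumeric SwarmKit node ID. It is derived from the
	// node's certificate and survives demotion and re-promotion.
	NodeID string

	// RaftID is the integer ID used by etcd/raft. It is assigned when the
	// node joins the consensus group; rejoining after a demotion yields a
	// brand-new RaftID.
	RaftID uint64

	// Addr is where other members reach this node for raft RPCs.
	Addr string
}
```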
    51  
It's important to note that a Raft ID can't be reused after the node that was
using it leaves the consensus group. The Raft IDs of nodes that are no longer
part of the cluster are persisted on disk in a list (a blacklist, if you will)
to make sure they aren't reused. If a node with a Raft ID on this list tries to
use Raft RPCs, other nodes won't honor these requests. `etcd/raft` doesn't allow
Raft IDs to be reused; this is most likely done to avoid ambiguity.
    58  
The blacklist of demoted/removed nodes is used to prevent those nodes from
communicating with the group and affecting cluster state. A membership list is
also persisted, but it does not restrict communication between nodes. This
favors stability (and availability, by enabling a faster return to a
non-degraded state) over consistency: a newly added node, whose addition may
not yet have propagated to every member of the Raft group, can still join and
communicate with the group even though the membership list is momentarily
inconsistent (it will eventually converge). Conversely, a node that has been
demoted or removed from the group may still be able to communicate with the
other members until the change is fully propagated.
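
A minimal sketch of how such a blacklist can be enforced, assuming hypothetical
names (the real implementation also persists the list as described above):

```go
package raftexample

import (
	"fmt"
	"sync"
)

// removedNodes tracks raft IDs that have left the consensus group and
// may never be reused.
type removedNodes struct {
	mu  sync.Mutex
	ids map[uint64]struct{}
}

func (r *removedNodes) markRemoved(raftID uint64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.ids[raftID] = struct{}{} // the real code also persists this list to disk
}

// check rejects raft traffic claiming to come from a removed member.
func (r *removedNodes) check(from uint64) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.ids[from]; ok {
		return fmt.Errorf("raft ID %x was removed from the cluster", from)
	}
	return nil
}
```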
    68  
    69  ## Logs and snapshots
    70  
    71  There are two sets of files on disk that provide persistent state for Raft.
There is a set of WAL (write-ahead log) files. These store a series of log
    73  entries and Raft metadata, such as the current term, index, and committed index.
    74  WAL files are automatically rotated when they reach a certain size.
    75  
    76  To avoid having to retain every entry in the history of the log, snapshots
    77  serialize a view of the state at a particular point in time. After a snapshot
    78  gets taken, logs that predate the snapshot are no longer necessary, because the
    79  snapshot captures all the information that's needed from the log up to that
    80  point. The number of old snapshots and WALs to retain is configurable.
    81  
    82  In SwarmKit's usage, WALs mostly contain protobuf-serialized data store
    83  modifications. A log entry can contain a batch of creations, updates, and
    84  deletions of objects from the data store. Some log entries contain other kinds
    85  of metadata, like node additions or removals. Snapshots contain a complete dump
    86  of the store, as well as any metadata from the log entries that needs to be
    87  preserved. The saved metadata includes the Raft term and index, a list of nodes
    88  in the cluster, and a list of nodes that have been removed from the cluster.
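
Putting that together, a snapshot's contents can be pictured roughly as follows;
the real message is a protobuf in SwarmKit's `api` package and its exact fields
differ:

```go
// Approximate, illustrative shape of a SwarmKit snapshot.
type snapshotPayload struct {
	// Raft bookkeeping captured when the snapshot was taken.
	Term  uint64
	Index uint64

	// Cluster membership at snapshot time: current consensus-group
	// members, plus raft IDs that were removed and must not be reused.
	Members []uint64
	Removed []uint64

	// A complete serialized dump of the data store (nodes, services,
	// tasks, networks, and so on).
	Store []byte
}
```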
    89  
    90  WALs and snapshots are both stored encrypted, even if the autolock feature is
    91  disabled. With autolock turned off, the data encryption key is stored on disk in
    92  plaintext, in a header inside the TLS key. When autolock is turned on, the data
    93  encryption key is encrypted with a key encryption key.
    94  
    95  ## Initializing a Raft cluster
    96  
    97  The first manager of a cluster (`swarm init`) assigns itself a random Raft ID.
    98  It creates a new WAL with its own Raft identity stored in the metadata field.
    99  The metadata field is the only part of the WAL that differs between nodes. By
   100  storing information such as the local Raft ID, it's easy to restore this
   101  node-specific information after a restart. In principle it could be stored in a
   102  separate file, but embedding it inside the WAL is most convenient.
   103  
   104  The node then starts the Raft state machine. From this point, it's a fully
   105  functional single-node Raft instance. Writes to the data store actually go
   106  through Raft, though this is a trivial case because reaching consensus doesn't
   107  involve communicating with any other nodes. The `Run` loop sees these writes and
   108  serializes them to disk as requested by the `etcd/raft` package.
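
A minimal sketch of this bootstrap step using the `coreos/etcd` packages;
SwarmKit's real code additionally wraps the storage with encryption and
snapshot handling:

```go
package raftexample

import (
	"github.com/coreos/etcd/raft"
	"github.com/coreos/etcd/wal"
)

// bootstrap creates a WAL with this node's identity in the metadata
// field and starts a single-node raft state machine. raftID is the
// randomly chosen Raft ID; metadata is a serialized description of the
// local node (illustrative here).
func bootstrap(walDir string, raftID uint64, metadata []byte) (raft.Node, *wal.WAL, error) {
	w, err := wal.Create(walDir, metadata)
	if err != nil {
		return nil, nil, err
	}

	cfg := &raft.Config{
		ID:              raftID,
		ElectionTick:    10, // illustrative tuning values
		HeartbeatTick:   1,
		Storage:         raft.NewMemoryStorage(),
		MaxSizePerMsg:   1024 * 1024,
		MaxInflightMsgs: 256,
	}

	// With itself as the only peer, this is a fully functional one-node
	// cluster: proposals commit without talking to anyone else.
	n := raft.StartNode(cfg, []raft.Peer{{ID: raftID}})
	return n, w, nil
}
```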
   109  
   110  ## Adding and removing nodes
   111  
   112  New nodes can join an existing Raft consensus group by invoking the `Join` RPC
   113  on the leader node. This corresponds to joining a swarm with a manager-level
   114  token, or promoting a worker node to a manager. If successful, `Join` returns a
   115  Raft ID for the new node and a list of other members of the consensus group.
   116  
   117  On the leader side, `Join` tries to append a configuration change entry to the
   118  Raft log, and waits until that entry becomes committed.
   119  
   120  A new node creates an empty Raft log with its own node information in the
   121  metadata field. Then it starts the state machine. By running the Raft consensus
   122  protocol, the leader will discover that the new node doesn't have any entries in
   123  its log, and will synchronize these entries to the new node through some
   124  combination of sending snapshots and log entries. It can take a little while for
   125  a new node to become a functional member of the consensus group, because it
   126  needs to receive this data first.
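
On the leader, the membership change can be sketched as follows; the `applied`
channel is a hypothetical stand-in for SwarmKit's wait machinery:

```go
package raftexample

import (
	"context"

	"github.com/coreos/etcd/raft"
	"github.com/coreos/etcd/raft/raftpb"
)

// join proposes a configuration change adding the new member and blocks
// until the Ready loop reports that the entry has been committed and
// applied (via Node.ApplyConfChange).
func join(ctx context.Context, n raft.Node, newRaftID uint64, addr string, applied <-chan struct{}) error {
	cc := raftpb.ConfChange{
		Type:    raftpb.ConfChangeAddNode,
		NodeID:  newRaftID,
		Context: []byte(addr), // the real code serializes richer member info here
	}
	if err := n.ProposeConfChange(ctx, cc); err != nil {
		return err
	}
	select {
	case <-applied: // closed once the conf change entry is committed and applied
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```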
   127  
   128  On the node receiving the log, code watching changes to the store will see log
   129  entries replayed as if the changes to the store were happening at that moment.
   130  This doesn't just apply when nodes receive logs for the first time - in
   131  general, when followers receive log entries with changes to the store, those
   132  are replayed in the follower's data store.
   133  
   134  Removing a node through demotion is a bit different. This requires two
   135  coordinated changes: the node must renew its certificate to get a worker
   136  certificate, and it should also be cleanly removed from the Raft consensus
   137  group. To avoid inconsistent states, particularly in cases like demoting the
   138  leader, there is a reconciliation loop that handles this in
   139  `manager/role_manager.go`. To initiate demotion, the user changes a node's
   140  `DesiredRole` to `Worker`. The role manager detects any nodes that have been
   141  demoted but are still acting as managers, and first removes them from the
   142  consensus group by calling `RemoveMember`. Only once this has happened is it
   143  safe to change the `Role` field to get a new certificate issued, because issuing
   144  a worker certificate to a node participating in the Raft group could cause loss
   145  of quorum.
   146  
   147  `RemoveMember` works similarly to `Join`. It appends an entry to the Raft log
   148  removing the member from the consensus group, and waits until this entry becomes
   149  committed. Once a member is removed, its Raft ID can never be reused.
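
Continuing the same sketch, `RemoveMember` differs only in the conf change type
and in the blacklist bookkeeping afterwards:

```go
// removeMember mirrors join: propose the removal, wait for commit, and
// then record the raft ID in the persisted blacklist so it is never
// reused (removedNodes is the hypothetical helper from the earlier sketch).
func removeMember(ctx context.Context, n raft.Node, removed *removedNodes, raftID uint64, applied <-chan struct{}) error {
	cc := raftpb.ConfChange{Type: raftpb.ConfChangeRemoveNode, NodeID: raftID}
	if err := n.ProposeConfChange(ctx, cc); err != nil {
		return err
	}
	select {
	case <-applied:
		removed.markRemoved(raftID)
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```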
   150  
   151  There is a special case when the leader is being demoted. It cannot reliably
   152  remove itself, because this involves informing the other nodes that the removal
   153  log entry has been committed, and if any of those messages are lost in transit,
   154  the leader won't have an opportunity to retry sending them, since demotion
   155  causes the Raft state machine to shut down. To solve this problem, the leader
   156  demotes itself simply by transferring leadership to a different manager node.
   157  When another node becomes the leader, the role manager will start up on that
   158  node, and it will be able to demote the former leader without this complication.
   159  
   160  ## The main Raft loop
   161  
   162  The `Run` method acts as a main loop. It receives ticks from a ticker, and
   163  forwards these to the `etcd/raft` state machine, which relies on external code
   164  for timekeeping. It also receives `Ready` structures from the `etcd/raft` state
   165  machine on a channel.
   166  
   167  A `Ready` message conveys the current state of the system, provides a set of
   168  messages to send to peers, and includes any items that need to be acted on or
   169  written to disk. It is basically `etcd/raft`'s mechanism for communicating with
   170  the outside world and expressing its state to higher-level code.
   171  
There are five basic functions the `Run` function performs when it receives a
`Ready` message (a condensed sketch of this loop in code follows the list):
   174  
   175  1. Write new entries or a new snapshot to disk.
   176  2. Forward any messages for other peers to the right destinations over gRPC.
   177  3. Update the data store based on new snapshots or newly-committed log entries.
   178  4. Evaluate the current leadership status, and signal to other code if it
   179     changes (for example, so that components like the orchestrator can be started
   180     or stopped).
   181  5. If enough entries have accumulated between snapshots, create a new snapshot
   182     to compact the WALs. The snapshot is written asynchronously and notifies the
   183     `Run` method on completion.
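
These five steps correspond to the canonical `etcd/raft` driver loop. Below is
a condensed sketch; `saveToDisk`, `sendMessages`, and `apply` are placeholders
for the real persistence, transport, and store-update code:

```go
package raftexample

import (
	"time"

	"github.com/coreos/etcd/raft"
	"github.com/coreos/etcd/raft/raftpb"
)

// Placeholder helpers; the real code writes the WAL and snapshot files,
// uses the raft/transport package, and updates the in-memory data store.
func saveToDisk(hs raftpb.HardState, ents []raftpb.Entry, snap raftpb.Snapshot) {}
func sendMessages(msgs []raftpb.Message)                                        {}
func apply(n raft.Node, entry raftpb.Entry) {
	if entry.Type == raftpb.EntryConfChange {
		var cc raftpb.ConfChange
		cc.Unmarshal(entry.Data)
		n.ApplyConfChange(cc)
	}
	// EntryNormal entries carry serialized data store changes and are
	// replayed into the local store.
}

// run is a condensed version of the canonical etcd/raft driver loop;
// SwarmKit's Run method follows this shape with extra bookkeeping for
// leadership changes and snapshot scheduling.
func run(n raft.Node, ticker *time.Ticker, stop <-chan struct{}) {
	for {
		select {
		case <-ticker.C:
			// etcd/raft does no timekeeping of its own; ticks drive
			// elections and heartbeats.
			n.Tick()
		case rd := <-n.Ready():
			// 1. Persist hard state, new entries, and any snapshot
			//    before sending messages.
			saveToDisk(rd.HardState, rd.Entries, rd.Snapshot)
			// 2. Forward outgoing messages to peers over gRPC.
			sendMessages(rd.Messages)
			// 3. Apply the snapshot and committed entries to the data
			//    store; conf changes go through ApplyConfChange.
			for _, entry := range rd.CommittedEntries {
				apply(n, entry)
			}
			// 4. rd.SoftState, when non-nil, reports leadership changes
			//    that other components react to.
			// 5. Periodically a new snapshot is taken to compact the WAL.
			n.Advance()
		case <-stop:
			return
		}
	}
}
```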
   184  
   185  ## Communication between nodes
   186  
   187  The `etcd/raft` package does not implement communication over a network. It
   188  references nodes by IDs, and it is up to higher-level code to convey messages to
   189  the correct places.
   190  
   191  SwarmKit uses gRPC to transfer these messages. The interface for this is very
   192  simple. Messages are only conveyed through a single RPC named
   193  `ProcessRaftMessage`.
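
On the receiving side, the handler essentially hands the wrapped message to the
local Raft state machine. The types below only approximate SwarmKit's `api`
messages, and the real handler also authenticates the sender and rejects
removed members:

```go
package raftexample

import (
	"context"
	"errors"

	"github.com/coreos/etcd/raft"
	"github.com/coreos/etcd/raft/raftpb"
)

// Approximations of SwarmKit's ProcessRaftMessage request/response types.
type processRaftMessageRequest struct{ Message *raftpb.Message }
type processRaftMessageResponse struct{}

type raftServer struct{ node raft.Node }

// ProcessRaftMessage feeds a peer's raft message into the local state
// machine; etcd/raft takes care of the rest.
func (s *raftServer) ProcessRaftMessage(ctx context.Context, req *processRaftMessageRequest) (*processRaftMessageResponse, error) {
	if req == nil || req.Message == nil {
		return nil, errors.New("no raft message in request")
	}
	if err := s.node.Step(ctx, *req.Message); err != nil {
		return nil, err
	}
	return &processRaftMessageResponse{}, nil
}
```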
   194  
   195  There is an additional RPC called `ResolveAddress` that deals with a corner case
   196  that can happen when nodes are added to a cluster dynamically. If a node was
   197  down while the current cluster leader was added, or didn't mark the log entry
   198  that added the leader as committed (which is done lazily), this node won't have
   199  the leader's address. It would receive RPCs from the leader, but not be able to
   200  invoke RPCs on the leader, so the communication would only happen in one
   201  direction. It would normally be impossible for the node to catch up. With
   202  `ResolveAddress`, it can query other cluster members for the leader's address,
   203  and restore two-way communication. See
https://github.com/docker/swarmkit/issues/436 for more details on this situation.
   205  
   206  SwarmKit's `raft/transport` package abstracts the mechanism for keeping track of
   207  peers, and sending messages to them over gRPC in a specific message order.
   208  
   209  ## Integration between Raft and the data store
   210  
   211  The Raft `Node` object implements the `Proposer` interface which the data store
   212  uses to propagate changes across the cluster. The key method is `ProposeValue`,
   213  which appends information to the distributed log.
   214  
   215  The guts of `ProposeValue` are inside `processInternalRaftRequest`. This method
   216  appends the message to the log, and then waits for it to become committed. There
is only one way `ProposeValue` can fail: the node where it's running loses its
position as the leader. As long as the node remains the leader, a proposal
cannot fail, since the leader controls which new entries are added to the log
and can't retract an entry once it has been appended. It can,
   221  however, take an indefinitely long time for a quorum of members to acknowledge
   222  the new entry. There is no timeout on `ProposeValue` because a timeout wouldn't
retract the log entry, so a timeout could put us in a state where a write
times out but ends up going through later on. This would make the data
   225  store inconsistent with what's actually in the Raft log, which would be very
   226  bad.
   227  
   228  When the log entry successfully becomes committed, `processEntry` triggers the
   229  wait associated with this entry, which allows `processInternalRaftRequest` to
   230  return. On a leadership change, all outstanding waits get cancelled.
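
A stripped-down sketch of this propose-and-wait pattern, with hypothetical
names; the real code embeds a request ID in the serialized entry so that
`processEntry` can find the matching wait:

```go
package raftexample

import (
	"context"
	"sync"

	"github.com/coreos/etcd/raft"
)

// waitList pairs proposals with the committed entries that resolve them.
type waitList struct {
	mu    sync.Mutex
	waits map[uint64]chan error // keyed by a request ID embedded in the entry
}

func (w *waitList) register(id uint64) chan error {
	w.mu.Lock()
	defer w.mu.Unlock()
	ch := make(chan error, 1)
	w.waits[id] = ch
	return ch
}

// trigger is called when the committed entry carrying this request ID is
// applied (err == nil), or for every outstanding wait when leadership is
// lost (err != nil).
func (w *waitList) trigger(id uint64, err error) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if ch, ok := w.waits[id]; ok {
		ch <- err
		delete(w.waits, id)
	}
}

// proposeValue appends data to the raft log and blocks until the entry
// commits or leadership is lost. Deliberately, there is no timeout.
func proposeValue(ctx context.Context, n raft.Node, w *waitList, requestID uint64, data []byte) error {
	ch := w.register(requestID)
	if err := n.Propose(ctx, data); err != nil {
		return err
	}
	return <-ch
}
```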
   231  
   232  ## The Raft RPC proxy
   233  
   234  As mentioned above, writes to the data store are only allowed on the leader
   235  node. But any manager node can receive gRPC requests, and workers don't even
attempt to route those requests to the leader. Somehow, requests that involve
   237  writing to the data store or seeing a consistent view of it need to be
   238  redirected to the leader.
   239  
   240  We generate wrappers around RPC handlers using the code in
   241  `protobuf/plugin/raftproxy`. These wrappers check if the current node is the
   242  leader, and serve the RPC locally in that case. In the case where some other
   243  node is the leader, the wrapper invokes the same RPC on the leader instead,
   244  acting as a proxy. The proxy inserts identity information for the client node in
   245  the gRPC headers of the request, so that clients can't achieve privilege
   246  escalation by going through the proxy.
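
A simplified picture of what such a generated wrapper does; all names below are
illustrative rather than the actual generated code:

```go
// Illustrative shape of a raftproxy-style wrapper around one RPC.
// isLeader, leaderConn, withCallerIdentity, and NewControlClient stand in
// for the real leadership check, connection tracking, identity-forwarding,
// and generated client code.
func (p *proxyControlServer) UpdateService(ctx context.Context, req *UpdateServiceRequest) (*UpdateServiceResponse, error) {
	if p.isLeader() {
		// This node is the leader: serve the request locally.
		return p.local.UpdateService(ctx, req)
	}

	// Some other node is the leader: forward the request there, first
	// attaching the caller's identity to the outgoing gRPC metadata so
	// the proxy can't be used for privilege escalation.
	conn, err := p.leaderConn(ctx)
	if err != nil {
		return nil, err
	}
	return NewControlClient(conn).UpdateService(withCallerIdentity(ctx), req)
}
```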
   247  
   248  If one of these wrappers is registered with gRPC instead of the generated server
   249  code itself, the server in question will automatically proxy its requests to the
   250  leader. We use this for most APIs such as the dispatcher, control API, and CA.
   251  However, there are some cases where RPCs need to be invoked directly instead of
   252  being proxied to the leader, and in these cases, we don't use the wrappers. Raft
itself is a good example of this - if `ProcessRaftMessage` were always forwarded
   254  to the leader, it would be impossible for the leader to communicate with other
   255  nodes. Incidentally, this is why the Raft RPCs are split between a `Raft`
   256  service and a `RaftMembership` service. The membership RPCs `Join` and `Leave`
   257  need to run on the leader, but RPCs such as `ProcessRaftMessage` must not be
   258  forwarded to the leader.