
     1  # Raft implementation
     2  
     3  SwarmKit uses the Raft consensus protocol to synchronize state between manager
     4  nodes and support high availability. The lowest level portions of this are
     5  provided by the `github.com/coreos/etcd/raft` package. SwarmKit's
     6  `github.com/docker/swarmkit/manager/state/raft` package builds a complete
     7  solution on top of this, adding things like saving and loading state on disk,
     8  an RPC layer so nodes can pass Raft messages over a network, and dynamic cluster
     9  membership.
    10  
    11  ## A quick review of Raft
    12  
    13  The details of the Raft protocol are outside the scope of this document, but
    14  it's well worth reviewing the [raft paper](https://raft.github.io/raft.pdf).
    15  
    16  Essentially, Raft gives us two things. It provides the mechanism to elect a
leader, which serves as the arbiter of all consensus decisions. It also provides
    18  a distributed log that we can append entries to, subject to the leader's
    19  approval. The distributed log is the basic building block for agreeing on and
    20  distributing state. Once an entry in the log becomes *committed*, it becomes an
    21  immutable part of the log that will survive any future leader elections and
changes to the cluster. We can think of a committed log entry as a piece of state
    23  that the cluster has reached agreement on.
    24  
    25  ## Role of the leader
    26  
    27  The leader has special responsibilities in the Raft protocol, but we also assign
    28  it special functions in SwarmKit outside the context of Raft. For example, the
    29  scheduler, orchestrators, dispatcher, and CA run on the leader node. This is not
    30  a design requirement, but simplifies things somewhat. If these components ran in
    31  a distributed fashion, we would need some mechanism to resolve conflicts between
    32  writes made by different nodes. Limiting decision-making to the leader avoids
    33  the need for this, since we can be certain that there is at most one leader at
    34  any time. The leader is also guaranteed to have the most up-to-date data in its
    35  store, so it is best positioned to make decisions.
    36  
    37  The basic rule is that anything which writes to the Raft-backed data store needs
    38  to run on the leader. If a follower node tries to write to the data store, the
    39  write will fail. Writes will also fail on a node that starts out as the leader
    40  but loses its leadership position before the write finishes.
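
To make the rule concrete, a write path can be sketched like this; the types and
helpers below are hypothetical stand-ins, not SwarmKit's actual API:

```go
package raftexample

import (
	"context"
	"errors"
)

// ErrLostLeadership mirrors the failure mode described above: a write
// attempted on a follower, or on a node that loses leadership before
// the write finishes.
var ErrLostLeadership = errors.New("node is not (or is no longer) the raft leader")

// store and node are hypothetical stand-ins for SwarmKit's data store
// and raft Node; the real types are considerably richer.
type store struct{ objects map[string][]byte }

type node struct {
	isLeader func() bool
	st       *store
}

// update applies fn to the store only on the leader. The real code also
// fails the write if leadership is lost before the raft entry commits.
func (n *node) update(ctx context.Context, fn func(*store) error) error {
	if !n.isLeader() {
		return ErrLostLeadership
	}
	return fn(n.st)
}
```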
    41  
    42  ## Raft IDs vs. node IDs
    43  
    44  Nodes in SwarmKit are identified by alphanumeric strings, but `etcd/raft` uses
    45  integers to identify Raft nodes. Thus, managers have two distinct IDs. The Raft
    46  IDs are assigned dynamically when a node joins the Raft consensus group. A node
    47  could potentially leave the Raft consensus group (through demotion), then later
    48  get promoted and rejoin under a different Raft ID. In this case, the node ID
    49  would stay the same, because it's a cryptographically-verifiable property of the
    50  node's certificate, but the Raft ID is assigned arbitrarily and would change.
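
Conceptually, a manager's membership record carries both identifiers; the struct
below is only an illustration, not SwarmKit's actual type:

```go
// Illustrative only: every manager is known by two identifiers.
type member struct {
	// NodeID is the alphanumeric SwarmKit node ID. It is derived from the
	// node's certificate and survives demotion and re-promotion.
	NodeID string

	// RaftID is the integer ID used by etcd/raft. It is assigned when the
	// node joins the consensus group; rejoining after a demotion yields a
	// brand-new RaftID.
	RaftID uint64

	// Addr is where other members reach this node for raft RPCs.
	Addr string
}
```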
    51  
It's important to note that a Raft ID can't be reused after the node that was
using it leaves the consensus group. The Raft IDs of nodes that are no longer
part of the cluster are persisted on disk in a list (a blacklist, if you will)
to make sure they aren't reused. If a node with a Raft ID on this list tries to
use Raft RPCs, other nodes won't honor these requests. `etcd/raft` doesn't allow
Raft IDs to be reused; this is most likely done to avoid ambiguity.
    58  
The blacklist of demoted/removed nodes is used to prevent those nodes from
communicating with the group and affecting cluster state. A membership list is
also persisted, but it does not restrict communication between nodes. This
favors stability (and availability, by enabling a faster return to a
non-degraded state) over consistency: a newly added node, whose addition may
not yet have propagated to every member of the Raft group, can still join and
communicate with the group even though the membership list is momentarily
inconsistent (it will eventually converge). Conversely, a node that has been
demoted or removed from the group may still be able to communicate with the
other members until the change is fully propagated.
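
A minimal sketch of how such a blacklist can be enforced, assuming hypothetical
names (the real implementation also persists the list as described above):

```go
package raftexample

import (
	"fmt"
	"sync"
)

// removedNodes tracks raft IDs that have left the consensus group and
// may never be reused.
type removedNodes struct {
	mu  sync.Mutex
	ids map[uint64]struct{}
}

func (r *removedNodes) markRemoved(raftID uint64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.ids[raftID] = struct{}{} // the real code also persists this list to disk
}

// check rejects raft traffic claiming to come from a removed member.
func (r *removedNodes) check(from uint64) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.ids[from]; ok {
		return fmt.Errorf("raft ID %x was removed from the cluster", from)
	}
	return nil
}
```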
    68  
    69  ## Logs and snapshots
    70  
    71  There are two sets of files on disk that provide persistent state for Raft.
There is a set of WAL (write-ahead log) files. These store a series of log
    73  entries and Raft metadata, such as the current term, index, and committed index.
    74  WAL files are automatically rotated when they reach a certain size.
    75  
    76  To avoid having to retain every entry in the history of the log, snapshots
    77  serialize a view of the state at a particular point in time. After a snapshot
    78  gets taken, logs that predate the snapshot are no longer necessary, because the
    79  snapshot captures all the information that's needed from the log up to that
    80  point. The number of old snapshots and WALs to retain is configurable.
    81  
    82  In SwarmKit's usage, WALs mostly contain protobuf-serialized data store
    83  modifications. A log entry can contain a batch of creations, updates, and
    84  deletions of objects from the data store. Some log entries contain other kinds
    85  of metadata, like node additions or removals. Snapshots contain a complete dump
    86  of the store, as well as any metadata from the log entries that needs to be
    87  preserved. The saved metadata includes the Raft term and index, a list of nodes
    88  in the cluster, and a list of nodes that have been removed from the cluster.
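
Putting that together, a snapshot's contents can be pictured roughly as follows;
the real message is a protobuf in SwarmKit's `api` package and its exact fields
differ:

```go
// Approximate, illustrative shape of a SwarmKit snapshot.
type snapshotPayload struct {
	// Raft bookkeeping captured when the snapshot was taken.
	Term  uint64
	Index uint64

	// Cluster membership at snapshot time: current consensus-group
	// members, plus raft IDs that were removed and must not be reused.
	Members []uint64
	Removed []uint64

	// A complete serialized dump of the data store (nodes, services,
	// tasks, networks, and so on).
	Store []byte
}
```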
    89  
    90  WALs and snapshots are both stored encrypted, even if the autolock feature is
    91  disabled. With autolock turned off, the data encryption key is stored on disk in
    92  plaintext, in a header inside the TLS key. When autolock is turned on, the data
    93  encryption key is encrypted with a key encryption key.
    94  
    95  ## Initializing a Raft cluster
    96  
    97  The first manager of a cluster (`swarm init`) assigns itself a random Raft ID.
    98  It creates a new WAL with its own Raft identity stored in the metadata field.
    99  The metadata field is the only part of the WAL that differs between nodes. By
   100  storing information such as the local Raft ID, it's easy to restore this
   101  node-specific information after a restart. In principle it could be stored in a
   102  separate file, but embedding it inside the WAL is most convenient.
   103  
   104  The node then starts the Raft state machine. From this point, it's a fully
   105  functional single-node Raft instance. Writes to the data store actually go
   106  through Raft, though this is a trivial case because reaching consensus doesn't
   107  involve communicating with any other nodes. The `Run` loop sees these writes and
   108  serializes them to disk as requested by the `etcd/raft` package.
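
A minimal sketch of this bootstrap step using the `coreos/etcd` packages;
SwarmKit's real code additionally wraps the storage with encryption and
snapshot handling:

```go
package raftexample

import (
	"github.com/coreos/etcd/raft"
	"github.com/coreos/etcd/wal"
)

// bootstrap creates a WAL with this node's identity in the metadata
// field and starts a single-node raft state machine. raftID is the
// randomly chosen Raft ID; metadata is a serialized description of the
// local node (illustrative here).
func bootstrap(walDir string, raftID uint64, metadata []byte) (raft.Node, *wal.WAL, error) {
	w, err := wal.Create(walDir, metadata)
	if err != nil {
		return nil, nil, err
	}

	cfg := &raft.Config{
		ID:              raftID,
		ElectionTick:    10, // illustrative tuning values
		HeartbeatTick:   1,
		Storage:         raft.NewMemoryStorage(),
		MaxSizePerMsg:   1024 * 1024,
		MaxInflightMsgs: 256,
	}

	// With itself as the only peer, this is a fully functional one-node
	// cluster: proposals commit without talking to anyone else.
	n := raft.StartNode(cfg, []raft.Peer{{ID: raftID}})
	return n, w, nil
}
```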
   109  
   110  ## Adding and removing nodes
   111  
   112  New nodes can join an existing Raft consensus group by invoking the `Join` RPC
   113  on the leader node. This corresponds to joining a swarm with a manager-level
   114  token, or promoting a worker node to a manager. If successful, `Join` returns a
   115  Raft ID for the new node and a list of other members of the consensus group.
   116  
   117  On the leader side, `Join` tries to append a configuration change entry to the
   118  Raft log, and waits until that entry becomes committed.
   119  
   120  A new node creates an empty Raft log with its own node information in the
   121  metadata field. Then it starts the state machine. By running the Raft consensus
   122  protocol, the leader will discover that the new node doesn't have any entries in
   123  its log, and will synchronize these entries to the new node through some
   124  combination of sending snapshots and log entries. It can take a little while for
   125  a new node to become a functional member of the consensus group, because it
   126  needs to receive this data first.
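
On the leader, the membership change can be sketched as follows; the `applied`
channel is a hypothetical stand-in for SwarmKit's wait machinery:

```go
package raftexample

import (
	"context"

	"github.com/coreos/etcd/raft"
	"github.com/coreos/etcd/raft/raftpb"
)

// join proposes a configuration change adding the new member and blocks
// until the Ready loop reports that the entry has been committed and
// applied (via Node.ApplyConfChange).
func join(ctx context.Context, n raft.Node, newRaftID uint64, addr string, applied <-chan struct{}) error {
	cc := raftpb.ConfChange{
		Type:    raftpb.ConfChangeAddNode,
		NodeID:  newRaftID,
		Context: []byte(addr), // the real code serializes richer member info here
	}
	if err := n.ProposeConfChange(ctx, cc); err != nil {
		return err
	}
	select {
	case <-applied: // closed once the conf change entry is committed and applied
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```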
   127  
   128  On the node receiving the log, code watching changes to the store will see log
   129  entries replayed as if the changes to the store were happening at that moment.
   130  This doesn't just apply when nodes receive logs for the first time - in
   131  general, when followers receive log entries with changes to the store, those
   132  are replayed in the follower's data store.
   133  
   134  Removing a node through demotion is a bit different. This requires two
   135  coordinated changes: the node must renew its certificate to get a worker
   136  certificate, and it should also be cleanly removed from the Raft consensus
   137  group. To avoid inconsistent states, particularly in cases like demoting the
   138  leader, there is a reconciliation loop that handles this in
   139  `manager/role_manager.go`. To initiate demotion, the user changes a node's
   140  `DesiredRole` to `Worker`. The role manager detects any nodes that have been
   141  demoted but are still acting as managers, and first removes them from the
   142  consensus group by calling `RemoveMember`. Only once this has happened is it
   143  safe to change the `Role` field to get a new certificate issued, because issuing
   144  a worker certificate to a node participating in the Raft group could cause loss
   145  of quorum.
   146  
   147  `RemoveMember` works similarly to `Join`. It appends an entry to the Raft log
   148  removing the member from the consensus group, and waits until this entry becomes
   149  committed. Once a member is removed, its Raft ID can never be reused.
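
Continuing the same sketch, `RemoveMember` differs only in the conf change type
and in the blacklist bookkeeping afterwards:

```go
// removeMember mirrors join: propose the removal, wait for commit, and
// then record the raft ID in the persisted blacklist so it is never
// reused (removedNodes is the hypothetical helper from the earlier sketch).
func removeMember(ctx context.Context, n raft.Node, removed *removedNodes, raftID uint64, applied <-chan struct{}) error {
	cc := raftpb.ConfChange{Type: raftpb.ConfChangeRemoveNode, NodeID: raftID}
	if err := n.ProposeConfChange(ctx, cc); err != nil {
		return err
	}
	select {
	case <-applied:
		removed.markRemoved(raftID)
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```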
   150  
   151  There is a special case when the leader is being demoted. It cannot reliably
   152  remove itself, because this involves informing the other nodes that the removal
   153  log entry has been committed, and if any of those messages are lost in transit,
   154  the leader won't have an opportunity to retry sending them, since demotion
   155  causes the Raft state machine to shut down. To solve this problem, the leader
   156  demotes itself simply by transferring leadership to a different manager node.
   157  When another node becomes the leader, the role manager will start up on that
   158  node, and it will be able to demote the former leader without this complication.
   159  
   160  ## The main Raft loop
   161  
   162  The `Run` method acts as a main loop. It receives ticks from a ticker, and
   163  forwards these to the `etcd/raft` state machine, which relies on external code
   164  for timekeeping. It also receives `Ready` structures from the `etcd/raft` state
   165  machine on a channel.
   166  
   167  A `Ready` message conveys the current state of the system, provides a set of
   168  messages to send to peers, and includes any items that need to be acted on or
   169  written to disk. It is basically `etcd/raft`'s mechanism for communicating with
   170  the outside world and expressing its state to higher-level code.
   171  
There are five basic functions the `Run` function performs when it receives a
`Ready` message (a condensed sketch of this loop in code follows the list):
   174  
   175  1. Write new entries or a new snapshot to disk.
   176  2. Forward any messages for other peers to the right destinations over gRPC.
   177  3. Update the data store based on new snapshots or newly-committed log entries.
   178  4. Evaluate the current leadership status, and signal to other code if it
   179     changes (for example, so that components like the orchestrator can be started
   180     or stopped).
   181  5. If enough entries have accumulated between snapshots, create a new snapshot
   182     to compact the WALs. The snapshot is written asynchronously and notifies the
   183     `Run` method on completion.
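
These five steps correspond to the canonical `etcd/raft` driver loop. Below is
a condensed sketch; `saveToDisk`, `sendMessages`, and `apply` are placeholders
for the real persistence, transport, and store-update code:

```go
package raftexample

import (
	"time"

	"github.com/coreos/etcd/raft"
	"github.com/coreos/etcd/raft/raftpb"
)

// Placeholder helpers; the real code writes the WAL and snapshot files,
// uses the raft/transport package, and updates the in-memory data store.
func saveToDisk(hs raftpb.HardState, ents []raftpb.Entry, snap raftpb.Snapshot) {}
func sendMessages(msgs []raftpb.Message)                                        {}
func apply(n raft.Node, entry raftpb.Entry) {
	if entry.Type == raftpb.EntryConfChange {
		var cc raftpb.ConfChange
		cc.Unmarshal(entry.Data)
		n.ApplyConfChange(cc)
	}
	// EntryNormal entries carry serialized data store changes and are
	// replayed into the local store.
}

// run is a condensed version of the canonical etcd/raft driver loop;
// SwarmKit's Run method follows this shape with extra bookkeeping for
// leadership changes and snapshot scheduling.
func run(n raft.Node, ticker *time.Ticker, stop <-chan struct{}) {
	for {
		select {
		case <-ticker.C:
			// etcd/raft does no timekeeping of its own; ticks drive
			// elections and heartbeats.
			n.Tick()
		case rd := <-n.Ready():
			// 1. Persist hard state, new entries, and any snapshot
			//    before sending messages.
			saveToDisk(rd.HardState, rd.Entries, rd.Snapshot)
			// 2. Forward outgoing messages to peers over gRPC.
			sendMessages(rd.Messages)
			// 3. Apply the snapshot and committed entries to the data
			//    store; conf changes go through ApplyConfChange.
			for _, entry := range rd.CommittedEntries {
				apply(n, entry)
			}
			// 4. rd.SoftState, when non-nil, reports leadership changes
			//    that other components react to.
			// 5. Periodically a new snapshot is taken to compact the WAL.
			n.Advance()
		case <-stop:
			return
		}
	}
}
```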
   184  
   185  ## Communication between nodes
   186  
   187  The `etcd/raft` package does not implement communication over a network. It
   188  references nodes by IDs, and it is up to higher-level code to convey messages to
   189  the correct places.
   190  
   191  SwarmKit uses gRPC to transfer these messages. The interface for this is very
   192  simple. Messages are only conveyed through a single RPC named
   193  `ProcessRaftMessage`.
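
On the receiving side, the handler essentially hands the wrapped message to the
local Raft state machine. The types below only approximate SwarmKit's `api`
messages, and the real handler also authenticates the sender and rejects
removed members:

```go
package raftexample

import (
	"context"
	"errors"

	"github.com/coreos/etcd/raft"
	"github.com/coreos/etcd/raft/raftpb"
)

// Approximations of SwarmKit's ProcessRaftMessage request/response types.
type processRaftMessageRequest struct{ Message *raftpb.Message }
type processRaftMessageResponse struct{}

type raftServer struct{ node raft.Node }

// ProcessRaftMessage feeds a peer's raft message into the local state
// machine; etcd/raft takes care of the rest.
func (s *raftServer) ProcessRaftMessage(ctx context.Context, req *processRaftMessageRequest) (*processRaftMessageResponse, error) {
	if req == nil || req.Message == nil {
		return nil, errors.New("no raft message in request")
	}
	if err := s.node.Step(ctx, *req.Message); err != nil {
		return nil, err
	}
	return &processRaftMessageResponse{}, nil
}
```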
   194  
   195  There is an additional RPC called `ResolveAddress` that deals with a corner case
   196  that can happen when nodes are added to a cluster dynamically. If a node was
   197  down while the current cluster leader was added, or didn't mark the log entry
   198  that added the leader as committed (which is done lazily), this node won't have
   199  the leader's address. It would receive RPCs from the leader, but not be able to
   200  invoke RPCs on the leader, so the communication would only happen in one
   201  direction. It would normally be impossible for the node to catch up. With
   202  `ResolveAddress`, it can query other cluster members for the leader's address,
   203  and restore two-way communication. See
https://github.com/docker/swarmkit/issues/436 for more details on this situation.
   205  
   206  SwarmKit's `raft/transport` package abstracts the mechanism for keeping track of
   207  peers, and sending messages to them over gRPC in a specific message order.
   208  
   209  ## Integration between Raft and the data store
   210  
   211  The Raft `Node` object implements the `Proposer` interface which the data store
   212  uses to propagate changes across the cluster. The key method is `ProposeValue`,
   213  which appends information to the distributed log.
   214  
   215  The guts of `ProposeValue` are inside `processInternalRaftRequest`. This method
   216  appends the message to the log, and then waits for it to become committed. There
is only one way `ProposeValue` can fail: the node where it's running loses its
position as the leader. As long as the node remains the leader, a proposal
cannot fail, since the leader controls which new entries are added to the log
and can't retract an entry once it has been appended. It can,
   221  however, take an indefinitely long time for a quorum of members to acknowledge
   222  the new entry. There is no timeout on `ProposeValue` because a timeout wouldn't
retract the log entry, so a timeout could put us in a state where a write
times out but ends up going through later on. This would make the data
   225  store inconsistent with what's actually in the Raft log, which would be very
   226  bad.
   227  
   228  When the log entry successfully becomes committed, `processEntry` triggers the
   229  wait associated with this entry, which allows `processInternalRaftRequest` to
   230  return. On a leadership change, all outstanding waits get cancelled.
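
A stripped-down sketch of this propose-and-wait pattern, with hypothetical
names; the real code embeds a request ID in the serialized entry so that
`processEntry` can find the matching wait:

```go
package raftexample

import (
	"context"
	"sync"

	"github.com/coreos/etcd/raft"
)

// waitList pairs proposals with the committed entries that resolve them.
type waitList struct {
	mu    sync.Mutex
	waits map[uint64]chan error // keyed by a request ID embedded in the entry
}

func (w *waitList) register(id uint64) chan error {
	w.mu.Lock()
	defer w.mu.Unlock()
	ch := make(chan error, 1)
	w.waits[id] = ch
	return ch
}

// trigger is called when the committed entry carrying this request ID is
// applied (err == nil), or for every outstanding wait when leadership is
// lost (err != nil).
func (w *waitList) trigger(id uint64, err error) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if ch, ok := w.waits[id]; ok {
		ch <- err
		delete(w.waits, id)
	}
}

// proposeValue appends data to the raft log and blocks until the entry
// commits or leadership is lost. Deliberately, there is no timeout.
func proposeValue(ctx context.Context, n raft.Node, w *waitList, requestID uint64, data []byte) error {
	ch := w.register(requestID)
	if err := n.Propose(ctx, data); err != nil {
		return err
	}
	return <-ch
}
```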
   231  
   232  ## The Raft RPC proxy
   233  
   234  As mentioned above, writes to the data store are only allowed on the leader
   235  node. But any manager node can receive gRPC requests, and workers don't even
attempt to route those requests to the leader. Somehow, requests that involve
   237  writing to the data store or seeing a consistent view of it need to be
   238  redirected to the leader.
   239  
   240  We generate wrappers around RPC handlers using the code in
   241  `protobuf/plugin/raftproxy`. These wrappers check if the current node is the
   242  leader, and serve the RPC locally in that case. In the case where some other
   243  node is the leader, the wrapper invokes the same RPC on the leader instead,
   244  acting as a proxy. The proxy inserts identity information for the client node in
   245  the gRPC headers of the request, so that clients can't achieve privilege
   246  escalation by going through the proxy.
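
A simplified picture of what such a generated wrapper does; all names below are
illustrative rather than the actual generated code:

```go
// Illustrative shape of a raftproxy-style wrapper around one RPC.
// isLeader, leaderConn, withCallerIdentity, and NewControlClient stand in
// for the real leadership check, connection tracking, identity-forwarding,
// and generated client code.
func (p *proxyControlServer) UpdateService(ctx context.Context, req *UpdateServiceRequest) (*UpdateServiceResponse, error) {
	if p.isLeader() {
		// This node is the leader: serve the request locally.
		return p.local.UpdateService(ctx, req)
	}

	// Some other node is the leader: forward the request there, first
	// attaching the caller's identity to the outgoing gRPC metadata so
	// the proxy can't be used for privilege escalation.
	conn, err := p.leaderConn(ctx)
	if err != nil {
		return nil, err
	}
	return NewControlClient(conn).UpdateService(withCallerIdentity(ctx), req)
}
```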
   247  
   248  If one of these wrappers is registered with gRPC instead of the generated server
   249  code itself, the server in question will automatically proxy its requests to the
   250  leader. We use this for most APIs such as the dispatcher, control API, and CA.
   251  However, there are some cases where RPCs need to be invoked directly instead of
   252  being proxied to the leader, and in these cases, we don't use the wrappers. Raft
itself is a good example of this - if `ProcessRaftMessage` were always forwarded
   254  to the leader, it would be impossible for the leader to communicate with other
   255  nodes. Incidentally, this is why the Raft RPCs are split between a `Raft`
   256  service and a `RaftMembership` service. The membership RPCs `Join` and `Leave`
   257  need to run on the leader, but RPCs such as `ProcessRaftMessage` must not be
   258  forwarded to the leader.