---
layout: "docs"
page_title: "Consensus Protocol"
sidebar_current: "docs-internals-consensus"
description: |-
  Consul uses a consensus protocol to provide Consistency as defined by CAP. The consensus protocol is based on Raft: In Search of an Understandable Consensus Algorithm. For a visual explanation of Raft, see The Secret Lives of Data.
---

# Consensus Protocol

Consul uses a [consensus protocol](https://en.wikipedia.org/wiki/Consensus_(computer_science))
to provide [Consistency (as defined by CAP)](https://en.wikipedia.org/wiki/CAP_theorem).
The consensus protocol is based on
["Raft: In Search of an Understandable Consensus Algorithm"](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf).
For a visual explanation of Raft, see [The Secret Lives of Data](http://thesecretlivesofdata.com/raft).

~> **Advanced Topic!** This page covers technical details of
the internals of Consul. You don't need to know these details to effectively
operate and use Consul. These details are documented here for those who wish
to learn about them without having to go spelunking through the source code.

## Raft Protocol Overview

Raft is a consensus algorithm that is based on
[Paxos](https://en.wikipedia.org/wiki/Paxos_%28computer_science%29). Compared
to Paxos, Raft is designed to have fewer states and a simpler, more
understandable algorithm.

There are a few key terms to know when discussing Raft:

* Log - The primary unit of work in a Raft system is a log entry. The problem
of consistency can be decomposed into a *replicated log*. A log is an ordered
sequence of entries. We consider the log consistent if all members agree on
the entries and their order.

* FSM - [Finite State Machine](https://en.wikipedia.org/wiki/Finite-state_machine).
An FSM is a collection of finite states with transitions between them. As new logs
are applied, the FSM is allowed to transition between states. Application of the
same sequence of logs must result in the same state, meaning behavior must be deterministic.

* Peer set - The peer set is the set of all members participating in log replication.
For Consul's purposes, all server nodes are in the peer set of the local datacenter.

* Quorum - A quorum is a majority of members from a peer set: for a set of size `n`,
quorum requires at least `(n/2)+1` members (using integer division; see the sketch
after this list).
For example, if there are 5 members in the peer set, we would need 3 nodes
to form a quorum. If a quorum of nodes is unavailable for any reason, the
cluster becomes *unavailable* and no new logs can be committed.

* Committed Entry - An entry is considered *committed* when it is durably stored
on a quorum of nodes. Once an entry is committed it can be applied.

* Leader - At any given time, the peer set elects a single node to be the leader.
The leader is responsible for ingesting new log entries, replicating to followers,
and managing when an entry is considered committed.

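
To make these terms concrete, here is a minimal Go sketch. The names (`LogEntry`,
`FSM`, `quorumSize`) are illustrative only and are not Consul's internal types;
Consul's real implementation builds on the
[hashicorp/raft](https://github.com/hashicorp/raft) library.

```go
package main

import "fmt"

// LogEntry is an illustrative stand-in for a Raft log entry: an opaque
// payload plus the index and term at which it was written.
type LogEntry struct {
    Index   uint64
    Term    uint64
    Payload []byte
}

// FSM captures the property described above: applying the same ordered
// sequence of committed entries must always produce the same state.
type FSM interface {
    Apply(entry LogEntry) // must be deterministic
}

// quorumSize returns the majority needed for a peer set of size n,
// i.e. (n/2)+1 with integer division.
func quorumSize(n int) int {
    return n/2 + 1
}

func main() {
    for _, n := range []int{1, 3, 5, 7} {
        fmt.Printf("peers=%d quorum=%d tolerated failures=%d\n",
            n, quorumSize(n), n-quorumSize(n))
    }
}
```
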
Raft is a complex protocol and will not be covered here in detail (for those who
desire a more comprehensive treatment, the full specification is available in this
[paper](https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf)).
We will, however, attempt to provide a high-level description which may be useful
for building a mental model.

Raft nodes are always in one of three states: follower, candidate, or leader. All
nodes initially start out as followers. In this state, nodes can accept log entries
from a leader and cast votes. If no entries are received for some time, nodes
self-promote to the candidate state. In the candidate state, nodes request votes from
their peers. If a candidate receives a quorum of votes, then it is promoted to a leader.
The leader must accept new log entries and replicate to all the other followers.
In addition, if stale reads are not acceptable, all queries must also be performed on
the leader.

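
These role transitions can be sketched as a tiny state machine in Go. This is a
purely illustrative sketch: a real implementation also tracks terms, election
timers, and log comparisons, none of which appear here.

```go
package main

import "fmt"

type role int

const (
    follower role = iota
    candidate
    leader
)

// nextRole sketches the transitions described above: a follower whose
// election timeout fires becomes a candidate, a candidate that gathers a
// quorum of votes becomes the leader, and discovering a current leader
// demotes a node back to follower.
func nextRole(r role, timedOut, wonElection, sawLeader bool) role {
    switch {
    case sawLeader:
        return follower
    case r == follower && timedOut:
        return candidate
    case r == candidate && wonElection:
        return leader
    default:
        return r
    }
}

func main() {
    r := follower
    r = nextRole(r, true, false, false) // election timeout -> candidate
    r = nextRole(r, false, true, false) // quorum of votes  -> leader
    fmt.Println(r == leader)            // true
}
```
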
Once a cluster has a leader, it is able to accept new log entries. A client can
request that a leader append a new log entry (from Raft's perspective, a log entry
is an opaque binary blob). The leader then writes the entry to durable storage and
attempts to replicate to a quorum of followers. Once the log entry is considered
*committed*, it can be *applied* to a finite state machine. The finite state machine
is application-specific; in Consul's case, we use
[MemDB](https://github.com/hashicorp/go-memdb) to maintain cluster state. Consul's writes
block until the entry is both _committed_ and _applied_. This achieves read-after-write
semantics when used with the [consistent](/api/index.html#consistent) mode for queries.

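
The commit-then-apply flow can be modeled with a toy in-memory sketch in Go. None
of these names correspond to Consul's actual code; real replication happens over
the network with durable storage on every peer.

```go
package main

import (
    "errors"
    "fmt"
)

// cluster is an illustrative stand-in for a leader and its followers.
type cluster struct {
    peers int      // total servers, including the leader
    acks  int      // servers that have durably stored the entry
    state []string // the "FSM": applied entries, in order
}

// replicate pretends every follower stores the entry and acknowledges it.
func (c *cluster) replicate(entry string) { c.acks = c.peers }

// write models the flow described above: store and replicate the entry,
// check for a quorum (committed), apply it to the FSM, and only then
// acknowledge the client.
func (c *cluster) write(entry string) error {
    c.acks = 1 // durable on the leader
    c.replicate(entry)
    if c.acks < c.peers/2+1 {
        return errors.New("not committed: no quorum")
    }
    c.state = append(c.state, entry) // committed -> applied
    return nil                       // only now does the client unblock
}

func main() {
    c := &cluster{peers: 3}
    fmt.Println(c.write("set foo=bar"), c.state) // <nil> [set foo=bar]
}
```
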
Obviously, it would be undesirable to allow a replicated log to grow in an unbounded
fashion. Raft provides a mechanism by which the current state is snapshotted and the
log is compacted. Because of the FSM abstraction, restoring the state of the FSM must
result in the same state as a replay of old logs. This allows Raft to capture the FSM
state at a point in time and then remove all the logs that were used to reach that
state. This is performed automatically without user intervention and prevents unbounded
disk usage while also minimizing time spent replaying logs. One of the advantages of
using MemDB is that it allows Consul to continue accepting new transactions even while
old state is being snapshotted, preventing any availability issues.

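
A toy sketch of compaction follows, again with illustrative names only; Consul's
real snapshots are written to disk and managed by the underlying Raft library.

```go
package main

import "fmt"

// store is an illustrative replicated log plus FSM state.
type store struct {
    log      []string          // entries not yet compacted
    applied  map[string]string // live FSM state
    snapshot map[string]string // FSM state captured at compaction time
}

// compact captures the current FSM state and discards the log entries that
// were used to reach it. Restoring from the snapshot must yield the same
// state as replaying the discarded entries.
func (s *store) compact() {
    s.snapshot = map[string]string{}
    for k, v := range s.applied {
        s.snapshot[k] = v
    }
    s.log = nil // the old entries are no longer needed
}

func main() {
    s := &store{
        log:     []string{"set foo=bar"},
        applied: map[string]string{"foo": "bar"},
    }
    s.compact()
    fmt.Println(len(s.log), s.snapshot["foo"]) // 0 bar
}
```
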
Consensus is fault-tolerant up to the point where quorum is available.
If a quorum of nodes is unavailable, it is impossible to process log entries or reason
about peer membership. For example, suppose there are only 2 peers: A and B. The quorum
size is also 2, meaning both nodes must agree to commit a log entry. If either A or B
fails, it is now impossible to reach quorum. This means the cluster is unable to add
or remove a node or to commit any additional log entries. This results in
*unavailability*. At this point, manual intervention would be required to remove
either A or B and to restart the remaining node in bootstrap mode.

A Raft cluster of 3 nodes can tolerate a single node failure, while a cluster
of 5 can tolerate 2 node failures. The recommended configuration is to run
either 3 or 5 Consul servers per datacenter. This maximizes availability without
greatly sacrificing performance. The [deployment table](#deployment_table) below
summarizes the potential cluster size options and the fault tolerance of each.

In terms of performance, Raft is comparable to Paxos. Assuming stable leadership,
committing a log entry requires a single round trip to a quorum of the cluster.
Thus, performance is bound by disk I/O and network latency. Although Consul is
not designed to be a high-throughput write system, it should handle on the order
of hundreds to thousands of transactions per second depending on network and
hardware configuration.

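
As a rough, purely illustrative calculation: if a quorum round trip costs about
1 ms on a local network and an fsync of the log adds a few milliseconds more, a
single commit takes on the order of 5 ms, or roughly 200 strictly sequential
commits per second; batching many pending writes into one replication round, as
Raft implementations commonly do, is what pushes sustained throughput into the
range above.
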
## Raft in Consul

Only Consul server nodes participate in Raft and are part of the peer set. All
client nodes forward requests to servers. Part of the reason for this design is
that, as more members are added to the peer set, the size of the quorum also increases.
This introduces performance problems as you may be waiting for hundreds of machines
to agree on an entry instead of a handful.

When getting started, a single Consul server is put into "bootstrap" mode. This mode
allows it to self-elect as a leader. Once a leader is elected, other servers can be
added to the peer set in a way that preserves consistency and safety. Eventually,
once the first few servers are added, bootstrap mode can be disabled. See [this
guide](/docs/guides/bootstrapping.html) for more details.

Since all servers participate as part of the peer set, they all know the current
leader. When an RPC request arrives at a non-leader server, the request is
forwarded to the leader. If the RPC is a *query* type, meaning it is read-only,
the leader generates the result based on the current state of the FSM. If
the RPC is a *transaction* type, meaning it modifies state, the leader
generates a new log entry and applies it using Raft. Once the log entry is committed
and applied to the FSM, the transaction is complete.

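
A toy dispatch in Go makes the query/transaction distinction concrete. The
`server` type and its `handle` method are illustrative stand-ins, not Consul's
RPC layer, and the transaction path skips the Raft commit it would normally
go through.

```go
package main

import "fmt"

// server is an illustrative model of the dispatch described above.
type server struct {
    isLeader bool
    leader   *server
    fsm      map[string]string // state produced by applied log entries
}

// handle forwards to the leader if needed, answers queries from the FSM,
// and treats everything else as a state-changing transaction.
func (s *server) handle(op, key, value string) string {
    if !s.isLeader {
        return s.leader.handle(op, key, value) // non-leaders forward
    }
    if op == "query" {
        return s.fsm[key] // read-only: answer from current FSM state
    }
    // Transaction: in Consul this becomes a Raft log entry that must be
    // committed by a quorum before being applied; here we just apply it.
    s.fsm[key] = value
    return "ok"
}

func main() {
    leader := &server{isLeader: true, fsm: map[string]string{}}
    follower := &server{leader: leader}
    fmt.Println(follower.handle("txn", "foo", "bar")) // ok
    fmt.Println(follower.handle("query", "foo", ""))  // bar
}
```
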
Because of the nature of Raft's replication, performance is sensitive to network
latency. For this reason, each datacenter elects an independent leader and maintains
a disjoint peer set. Data is partitioned by datacenter, so each leader is responsible
only for data in its datacenter. When a request is received for a remote datacenter,
the request is forwarded to the correct leader. This design allows for lower latency
transactions and higher availability without sacrificing consistency.

## Consistency Modes

Although all writes to the replicated log go through Raft, reads are more
flexible. To support various trade-offs that developers may want, Consul
supports 3 different consistency modes for reads.

The three read modes are:

* `default` - Raft makes use of leader leasing, providing a time window
  in which the leader assumes its role is stable. However, if a leader
  is partitioned from the remaining peers, a new leader may be elected
  while the old leader is holding the lease. This means there are 2 leader
  nodes. There is no risk of a split-brain since the old leader will be
  unable to commit new logs. However, if the old leader services any reads,
  the values are potentially stale. The default consistency mode relies only
  on leader leasing, exposing clients to potentially stale values. We make
  this trade-off because reads are fast, usually strongly consistent, and
  only stale in a hard-to-trigger situation. The time window of stale reads
  is also bounded since the leader will step down due to the partition.

* `consistent` - This mode is strongly consistent without caveats. It requires
  that a leader verify with a quorum of peers that it is still leader. This
  introduces an additional round-trip to all server nodes. The trade-off is
  always consistent reads but increased latency due to the extra round trip.

* `stale` - This mode allows any server to service the read regardless of whether
  it is the leader. This means reads can be arbitrarily stale but are generally
  within 50 milliseconds of the leader. The trade-off is very fast and scalable
  reads but with stale values. This mode allows reads without a leader, meaning
  a cluster that is otherwise unavailable will still be able to respond.

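
The mode is selected per request with a query parameter on the
[HTTP API](/api/index.html). The sketch below issues the same read in each mode
using Go's standard library against a local agent on the default port; the key
`foo` is just a placeholder.

```go
package main

import (
    "fmt"
    "io"
    "net/http"
)

func main() {
    // The same KV read in each consistency mode. Default mode needs no
    // query parameter; ?consistent and ?stale select the other two.
    urls := map[string]string{
        "default":    "http://127.0.0.1:8500/v1/kv/foo",
        "consistent": "http://127.0.0.1:8500/v1/kv/foo?consistent",
        "stale":      "http://127.0.0.1:8500/v1/kv/foo?stale",
    }
    for mode, url := range urls {
        resp, err := http.Get(url)
        if err != nil {
            fmt.Println(mode, "error:", err)
            continue
        }
        body, _ := io.ReadAll(resp.Body)
        resp.Body.Close()
        fmt.Println(mode, resp.Status, string(body))
    }
}
```
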
For more documentation about using these various modes, see the
[HTTP API](/api/index.html).

## <a name="deployment_table"></a>Deployment Table

Below is a table that shows quorum size and failure tolerance for various
cluster sizes. The recommended deployment is either 3 or 5 servers. A single
server deployment is _**highly**_ discouraged as data loss is inevitable in a
failure scenario.

<table class="table table-bordered table-striped">
  <tr>
    <th>Servers</th>
    <th>Quorum Size</th>
    <th>Failure Tolerance</th>
  </tr>
  <tr>
    <td>1</td>
    <td>1</td>
    <td>0</td>
  </tr>
  <tr>
    <td>2</td>
    <td>2</td>
    <td>0</td>
  </tr>
  <tr class="warning">
    <td>3</td>
    <td>2</td>
    <td>1</td>
  </tr>
  <tr>
    <td>4</td>
    <td>3</td>
    <td>1</td>
  </tr>
  <tr class="warning">
    <td>5</td>
    <td>3</td>
    <td>2</td>
  </tr>
  <tr>
    <td>6</td>
    <td>4</td>
    <td>2</td>
  </tr>
  <tr>
    <td>7</td>
    <td>4</td>
    <td>3</td>
  </tr>
</table>