---
layout: docs
page_title: Consensus Protocol
sidebar_title: Consensus Protocol
description: |-
  Nomad uses a consensus protocol to provide Consistency as defined by CAP.
  The consensus protocol is based on Raft: In search of an Understandable
  Consensus Algorithm. For a visual explanation of Raft, see The Secret Lives of
  Data.
---

# Consensus Protocol

Nomad uses a [consensus protocol](<https://en.wikipedia.org/wiki/Consensus_(computer_science)>)
to provide [Consistency (as defined by CAP)](https://en.wikipedia.org/wiki/CAP_theorem).
The consensus protocol is based on
["Raft: In search of an Understandable Consensus Algorithm"](https://raft.github.io/raft.pdf).
For a visual explanation of Raft, see [The Secret Lives of Data](http://thesecretlivesofdata.com/raft).

~> **Advanced Topic!** This page covers technical details of
the internals of Nomad. You do not need to know these details to effectively
operate and use Nomad. These details are documented here for those who wish
to learn about them without having to go spelunking through the source code.

## Raft Protocol Overview

Raft is a consensus algorithm that is based on
[Paxos](https://en.wikipedia.org/wiki/Paxos_%28computer_science%29). Compared
to Paxos, Raft is designed to have fewer states and a simpler, more
understandable algorithm.

There are a few key terms to know when discussing Raft:

- **Log** - The primary unit of work in a Raft system is a log entry. The problem
  of consistency can be decomposed into a _replicated log_. A log is an ordered
  sequence of entries. We consider the log consistent if all members agree on
  the entries and their order.

- **FSM** - [Finite State Machine](https://en.wikipedia.org/wiki/Finite-state_machine).
  An FSM is a collection of finite states with transitions between them. As new logs
  are applied, the FSM is allowed to transition between states. Application of the
  same sequence of logs must result in the same state, meaning behavior must be deterministic.

- **Peer set** - The peer set is the set of all members participating in log replication.
  For Nomad's purposes, all server nodes are in the peer set of the local region.

- **Quorum** - A quorum is a majority of members from a peer set: for a set of size `n`,
  quorum requires at least `⌊(n/2)+1⌋` members.
  For example, if there are 5 members in the peer set, we would need 3 nodes
  to form a quorum. If a quorum of nodes is unavailable for any reason, the
  cluster becomes _unavailable_ and no new logs can be committed.

- **Committed Entry** - An entry is considered _committed_ when it is durably stored
  on a quorum of nodes. Once an entry is committed it can be applied.

- **Leader** - At any given time, the peer set elects a single node to be the leader.
  The leader is responsible for ingesting new log entries, replicating to followers,
  and managing when an entry is considered committed.

Raft is a complex protocol and will not be covered here in detail (for those who
desire a more comprehensive treatment, the full specification is available in this
[paper](https://raft.github.io/raft.pdf)).
We will, however, attempt to provide a high level description which may be useful
for building a mental model.

Raft nodes are always in one of three states: follower, candidate, or leader. All
nodes initially start out as a follower. In this state, nodes can accept log entries
from a leader and cast votes. If no entries are received for some time, nodes
self-promote to the candidate state. In the candidate state, nodes request votes from
their peers. If a candidate receives a quorum of votes, then it is promoted to a leader.
The leader must accept new log entries and replicate to all the other followers.
In addition, if stale reads are not acceptable, all queries must also be performed on
the leader.

Once a cluster has a leader, it is able to accept new log entries. A client can
request that a leader append a new log entry (from Raft's perspective, a log entry
is an opaque binary blob). The leader then writes the entry to durable storage and
attempts to replicate to a quorum of followers. Once the log entry is considered
_committed_, it can be _applied_ to a finite state machine. The finite state machine
is application specific; in Nomad's case, we use
[MemDB](https://github.com/hashicorp/go-memdb) to maintain cluster state.

Obviously, it would be undesirable to allow a replicated log to grow in an unbounded
fashion. Raft provides a mechanism by which the current state is snapshotted and the
log is compacted. Because of the FSM abstraction, restoring the state of the FSM must
result in the same state as a replay of old logs. This allows Raft to capture the FSM
state at a point in time and then remove all the logs that were used to reach that
state. This is performed automatically without user intervention and prevents unbounded
disk usage while also minimizing time spent replaying logs. One of the advantages of
using MemDB is that it allows Nomad to continue accepting new transactions even while
old state is being snapshotted, preventing any availability issues.

Consensus is fault-tolerant up to the point where quorum is available.
If a quorum of nodes is unavailable, it is impossible to process log entries or reason
about peer membership. For example, suppose there are only 2 peers: A and B. The quorum
size is also 2, meaning both nodes must agree to commit a log entry. If either A or B
fails, it is now impossible to reach quorum.
This means the cluster is unable to add
or remove a node or to commit any additional log entries. This results in
_unavailability_. At this point, manual intervention would be required to remove
either A or B and to restart the remaining node in bootstrap mode.

A Raft cluster of 3 nodes can tolerate a single node failure while a cluster
of 5 can tolerate 2 node failures. The recommended configuration is to run
either 3 or 5 Nomad servers per region. This maximizes availability without
greatly sacrificing performance. The [deployment table](#deployment_table) below
summarizes the potential cluster size options and the fault tolerance of each.

In terms of performance, Raft is comparable to Paxos. Assuming stable leadership,
committing a log entry requires a single round trip to half of the cluster.
Thus, performance is bound by disk I/O and network latency.

## Raft in Nomad

Only Nomad server nodes participate in Raft and are part of the peer set. All
client nodes forward requests to servers. The clients in Nomad only need to know
about their allocations and query that information from the servers, while the
servers need to maintain the global state of the cluster.

Since all servers participate as part of the peer set, they all know the current
leader. When an RPC request arrives at a non-leader server, the request is
forwarded to the leader. If the RPC is a _query_ type, meaning it is read-only,
the leader generates the result based on the current state of the FSM. If
the RPC is a _transaction_ type, meaning it modifies state, the leader
generates a new log entry and applies it using Raft. Once the log entry is committed
and applied to the FSM, the transaction is complete.

Because of the nature of Raft's replication, performance is sensitive to network
latency.
For this reason, each region elects an independent leader and maintains
a disjoint peer set. Data is partitioned by region, so each leader is responsible
only for data in its region. When a request is received for a remote region,
the request is forwarded to the correct leader. This design allows for lower latency
transactions and higher availability without sacrificing consistency.

## Consistency Modes

Although all writes to the replicated log go through Raft, reads are more
flexible. To support various trade-offs that developers may want, Nomad
supports 2 different consistency modes for reads.

The two read modes are:

- `default` - Raft makes use of leader leasing, providing a time window
  in which the leader assumes its role is stable. However, if a leader
  is partitioned from the remaining peers, a new leader may be elected
  while the old leader is holding the lease. This means there are 2 leader
  nodes. There is no risk of a split-brain since the old leader will be
  unable to commit new logs. However, if the old leader services any reads,
  the values are potentially stale. The default consistency mode relies only
  on leader leasing, exposing clients to potentially stale values. We make
  this trade-off because reads are fast, usually strongly consistent, and
  only stale in a hard-to-trigger situation. The time window of stale reads
  is also bounded since the leader will step down due to the partition.

- `stale` - This mode allows any server to service the read regardless of
  whether it is the leader. This means reads can be arbitrarily stale but are generally
  within 50 milliseconds of the leader. The trade-off is very fast and scalable
  reads but with stale values. This mode allows reads without a leader, meaning
  a cluster that is unavailable will still be able to respond.
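
As a rough illustration of how a client might opt into the `stale` mode, the sketch below builds the URL for a read request. It assumes the read mode is selected with a `stale` query parameter on the HTTP API and that a server listens on `http://127.0.0.1:4646`; the helper name `readURL` and the `/v1/jobs` path are illustrative choices, not something this page defines.

```go
package main

import (
	"fmt"
	"net/url"
)

// readURL is a hypothetical helper that builds the URL for a read request
// against a Nomad server. When allowStale is true it adds the `stale`
// query parameter (an assumption about the HTTP API), asking for the read
// to be served by any server instead of being forwarded to the leader.
func readURL(addr, path string, allowStale bool) (string, error) {
	u, err := url.Parse(addr)
	if err != nil {
		return "", err
	}
	u.Path = path
	if allowStale {
		q := u.Query()
		q.Set("stale", "true")
		u.RawQuery = q.Encode()
	}
	return u.String(), nil
}

func main() {
	// The address and path are assumptions made for this example.
	for _, stale := range []bool{false, true} {
		u, err := readURL("http://127.0.0.1:4646", "/v1/jobs", stale)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("stale=%v -> %s\n", stale, u)
	}
}
```

With `allowStale` set to `false` the URL carries no extra parameter, which corresponds to the `default` mode described above.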

## Deployment Table ((#deployment_table))

Below is a table that shows quorum size and failure tolerance for various
cluster sizes. The recommended deployment is either 3 or 5 servers. A single
server deployment is _**highly**_ discouraged as data loss is inevitable in a
failure scenario.

<table>
  <thead>
    <tr>
      <th>Servers</th>
      <th>Quorum Size</th>
      <th>Failure Tolerance</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>1</td>
      <td>1</td>
      <td>0</td>
    </tr>
    <tr>
      <td>2</td>
      <td>2</td>
      <td>0</td>
    </tr>
    <tr class="warning">
      <td>3</td>
      <td>2</td>
      <td>1</td>
    </tr>
    <tr>
      <td>4</td>
      <td>3</td>
      <td>1</td>
    </tr>
    <tr class="warning">
      <td>5</td>
      <td>3</td>
      <td>2</td>
    </tr>
    <tr>
      <td>6</td>
      <td>4</td>
      <td>2</td>
    </tr>
    <tr>
      <td>7</td>
      <td>4</td>
      <td>3</td>
    </tr>
  </tbody>
</table>
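
The quorum and failure tolerance columns follow directly from the majority rule: a quorum is `⌊n/2⌋ + 1` servers, and the cluster tolerates losing whatever remains. A minimal sketch in Go that recomputes the rows above (the function names `quorumSize` and `failureTolerance` are illustrative and are not part of Nomad):

```go
package main

import "fmt"

// quorumSize returns the smallest majority of a peer set with the given
// number of servers: floor(servers/2) + 1 (integer division floors here).
func quorumSize(servers int) int {
	return servers/2 + 1
}

// failureTolerance returns how many servers can be lost while the
// remaining servers can still form a quorum.
func failureTolerance(servers int) int {
	return servers - quorumSize(servers)
}

func main() {
	fmt.Println("Servers  Quorum  Tolerance")
	for n := 1; n <= 7; n++ {
		fmt.Printf("%7d  %6d  %9d\n", n, quorumSize(n), failureTolerance(n))
	}
}
```

Note that going from 3 to 4 servers (or from 5 to 6) raises the quorum size without raising the failure tolerance, which is why odd cluster sizes are the recommended ones.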