---
layout: "docs"
page_title: "Outage Recovery"
sidebar_current: "docs-guides-outage"
description: |-
  Don't panic! This is a critical first step. Depending on your deployment configuration, it may take only a single server failure for cluster unavailability. Recovery requires an operator to intervene, but recovery is straightforward.
---

# Outage Recovery

Don't panic! This is a critical first step.

Depending on your
[deployment configuration](/docs/internals/consensus.html#deployment_table), it
may take only a single server failure for cluster unavailability. Recovery
requires an operator to intervene, but the process is straightforward.

This guide covers recovery from a Consul outage caused by the loss of a
majority of server nodes in a datacenter. There are several types of outages,
depending on the number of server nodes and the number of failed server nodes.
We will outline how to recover from:

* Failure of a Single Server Cluster. This is when you have a single Consul
  server and it fails.
* Failure of a Server in a Multi-Server Cluster. This is when one server fails
  in a Consul cluster of 3 or more servers.
* Failure of Multiple Servers in a Multi-Server Cluster. This is when more than
  one Consul server fails in a cluster of 3 or more servers. This scenario is
  potentially the most serious, because it can result in data loss.

## Failure of a Single Server Cluster

If you had only a single server and it has failed, simply restart it. A
single-server configuration requires the
[`-bootstrap`](/docs/agent/options.html#_bootstrap) or
[`-bootstrap-expect=1`](/docs/agent/options.html#_bootstrap_expect)
flag.

```sh
consul agent -bootstrap-expect=1
```

If the server cannot be recovered, you need to bring up a new
server using the [deployment guide](https://www.consul.io/docs/guides/deployment-guide.html).

In the case of an unrecoverable server failure in a single-server cluster with
no backup procedure, data loss is inevitable since data was not replicated
to any other servers. This is why a single-server deployment is **never**
recommended.

Any services registered with agents will be re-populated when the new server
comes online as agents perform [anti-entropy](/docs/internals/anti-entropy.html).
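If you do have a backup, for example a Raft snapshot taken earlier with
`consul snapshot save` (available in Consul 0.7.1 and later), you can restore
the saved state once the replacement server is up. A minimal sketch, with a
placeholder filename:

```sh
# Restore previously saved cluster state onto the new server
# (requires Consul >= 0.7.1; "backup.snap" is a placeholder filename).
consul snapshot restore backup.snap
```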
## Failure of a Server in a Multi-Server Cluster

If you think the failed server is recoverable, the easiest option is to bring
it back online and have it rejoin the cluster with the same IP address, returning
the cluster to a fully healthy state. Similarly, even if you need to rebuild a
new Consul server to replace the failed node, you may wish to do that immediately.
Keep in mind that the rebuilt server needs to have the same IP address as the failed
server. Again, once this server is online and has rejoined, the cluster will return
to a fully healthy state.

```sh
consul agent -bootstrap-expect=3 -bind=192.172.2.4 -retry-join=192.172.2.3
```

Both of these strategies involve a potentially lengthy time to reboot or rebuild
a failed server. If this is impractical or if building a new server with the same
IP isn't an option, you need to remove the failed server. Usually, you can issue
a [`consul force-leave`](/docs/commands/force-leave.html) command to remove the failed
server if it's still a member of the cluster.

```sh
consul force-leave <node.name.consul>
```

If [`consul force-leave`](/docs/commands/force-leave.html) isn't able to remove the
server, you have two methods available to remove it, depending on your version of
Consul:

* In Consul 0.7 and later, you can use the
  [`consul operator`](/docs/commands/operator.html#raft-remove-peer) command to
  remove the stale peer server on the fly with no downtime if the cluster has a
  leader, as shown in the example below.

* In versions of Consul prior to 0.7, you can manually remove the stale peer
  server using the `raft/peers.json` recovery file on all remaining servers. See
  the [section below](#peers.json) for details on this procedure. This process
  requires Consul downtime to complete.
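For example, with a leader still available, a stale peer at a placeholder
address of `10.0.1.8:8300` could be removed like this:

```sh
# Remove the stale peer by its Raft address (requires Consul >= 0.7
# and a functioning cluster leader; the address shown is a placeholder).
consul operator raft remove-peer -address="10.0.1.8:8300"
```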
In Consul 0.7 and later, you can use the [`consul operator`](/docs/commands/operator.html#raft-list-peers)
command to inspect the Raft configuration:

```
$ consul operator raft list-peers
Node     ID              Address         State     Voter  RaftProtocol
alice    10.0.1.8:8300   10.0.1.8:8300   follower  true   3
bob      10.0.1.6:8300   10.0.1.6:8300   leader    true   3
carol    10.0.1.7:8300   10.0.1.7:8300   follower  true   3
```

## Failure of Multiple Servers in a Multi-Server Cluster

In the event that multiple servers are lost, causing a loss of quorum and a
complete outage, partial recovery is possible using data on the remaining
servers in the cluster. There may be data loss in this situation because multiple
servers were lost, so information about what's committed could be incomplete.
The recovery process implicitly commits all outstanding Raft log entries, so
it's also possible to commit data that was uncommitted before the failure.

See the [section below](#peers.json) on manual recovery using peers.json for
details of the recovery procedure. You simply include only the remaining servers
in the `raft/peers.json` recovery file. The cluster should be able to elect a
leader once the remaining servers are all restarted with an identical
`raft/peers.json` configuration.

Any new servers you introduce later can be fresh with totally clean data directories
and joined using Consul's `join` command.

```sh
consul agent -join=192.172.2.3
```

In extreme cases, it should be possible to recover with just a single remaining
server by starting that single server with itself as the only peer in the
`raft/peers.json` recovery file.

Prior to Consul 0.7 it wasn't always possible to recover from certain
types of outages with `raft/peers.json` because the file was ingested before any
Raft log entries were played back. In Consul 0.7 and later, the `raft/peers.json`
recovery file is final, and a snapshot is taken after it is ingested, so you are
guaranteed to start with your recovered configuration. This does implicitly commit
all Raft log entries, so it should only be used to recover from an outage, but it
should allow recovery from any situation where there's some cluster data available.

<a name="peers.json"></a>
### Manual Recovery Using peers.json

To begin, stop all remaining servers. You can attempt a graceful leave,
but it will not work in most cases. Do not worry if the leave exits with an
error. The cluster is in an unhealthy state, so this is expected.

In Consul 0.7 and later, the `peers.json` file is no longer present
by default and is only used when performing recovery. It is deleted after
Consul starts and ingests it. Consul 0.7 also uses a new, automatically-created
`raft/peers.info` file to avoid ingesting the `raft/peers.json` file on the
first start after upgrading. Be sure to leave `raft/peers.info` in place for
proper operation.

Using `raft/peers.json` for recovery can cause uncommitted Raft log entries to be
implicitly committed, so this should only be used after an outage where no
other option is available to recover a lost server. Make sure you don't have
any automated processes that will put the peers file in place on a
periodic basis.

The next step is to go to the [`-data-dir`](/docs/agent/options.html#_data_dir)
of each Consul server. Inside that directory, there will be a `raft/`
sub-directory. We need to create a `raft/peers.json` file. The format of this file
depends on what the server has configured for its
[Raft protocol](/docs/agent/options.html#_raft_protocol) version.

For Raft protocol version 2 and earlier, this should be formatted as a JSON
array containing the address and port of each Consul server in the cluster, like
this:

```json
[
  "10.1.0.1:8300",
  "10.1.0.2:8300",
  "10.1.0.3:8300"
]
```

For Raft protocol version 3 and later, this should be formatted as a JSON
array containing the node ID, address:port, and suffrage information of each
Consul server in the cluster, like this:

```json
[
  {
    "id": "adf4238a-882b-9ddc-4a9d-5b6758e4159e",
    "address": "10.1.0.1:8300",
    "non_voter": false
  },
  {
    "id": "8b6dda82-3103-11e7-93ae-92361f002671",
    "address": "10.1.0.2:8300",
    "non_voter": false
  },
  {
    "id": "97e17742-3103-11e7-93ae-92361f002671",
    "address": "10.1.0.3:8300",
    "non_voter": false
  }
]
```

- `id` `(string: <required>)` - Specifies the [node ID](/docs/agent/options.html#_node_id)
  of the server. This can be found in the logs when the server starts up if it was auto-generated,
  and it can also be found inside the `node-id` file in the server's data directory.

- `address` `(string: <required>)` - Specifies the IP and port of the server. The port is the
  server's RPC port used for cluster communications.

- `non_voter` `(bool: <false>)` - This controls whether the server is a non-voter, which is used
  in some advanced [Autopilot](/docs/guides/autopilot.html) configurations. If omitted, it will
  default to false, which is typical for most clusters.

Simply create entries for all servers. You must confirm that servers you do not include here have
indeed failed and will not later rejoin the cluster. Ensure that this file is the same across all
remaining server nodes.
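As a concrete sketch of these steps on one server, assuming a systemd unit
named `consul` and a data directory of `/opt/consul` (both assumptions; the ID
and address are the placeholders from above):

```sh
# Stop the agent before touching the Raft data (unit name is an assumption).
systemctl stop consul

# For Raft protocol 3, each server's own node ID can be read from its data directory:
cat /opt/consul/node-id

# Write an identical recovery file on every remaining server
# (placeholder ID and address; list all surviving servers here).
cat > /opt/consul/raft/peers.json <<'EOF'
[
  {
    "id": "adf4238a-882b-9ddc-4a9d-5b6758e4159e",
    "address": "10.1.0.1:8300",
    "non_voter": false
  }
]
EOF
```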
At this point, you can restart all the remaining servers. In Consul 0.7 and
later you will see them ingest the recovery file:

```text
...
2016/08/16 14:39:20 [INFO] consul: found peers.json file, recovering Raft configuration...
2016/08/16 14:39:20 [INFO] consul.fsm: snapshot created in 12.484µs
2016/08/16 14:39:20 [INFO] snapshot: Creating new snapshot at /tmp/peers/raft/snapshots/2-5-1471383560779.tmp
2016/08/16 14:39:20 [INFO] consul: deleted peers.json file after successful recovery
2016/08/16 14:39:20 [INFO] raft: Restored from snapshot 2-5-1471383560779
2016/08/16 14:39:20 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:10.212.15.121:8300 Address:10.212.15.121:8300}]
...
```

If any servers managed to perform a graceful leave, you may need to have them
rejoin the cluster using the [`join`](/docs/commands/join.html) command:

```text
$ consul join <Node Address>
Successfully joined cluster by contacting 1 nodes.
```

It should be noted that any existing member can be used to rejoin the cluster
as the gossip protocol will take care of discovering the server nodes.

At this point, the cluster should be in an operable state again. One of the
nodes should claim leadership and emit a log like:

```text
[INFO] consul: cluster leadership acquired
```

In Consul 0.7 and later, you can use the [`consul operator`](/docs/commands/operator.html#raft-list-peers)
command to inspect the Raft configuration:

```
$ consul operator raft list-peers
Node     ID              Address         State     Voter  RaftProtocol
alice    10.0.1.8:8300   10.0.1.8:8300   follower  true   3
bob      10.0.1.6:8300   10.0.1.6:8300   leader    true   3
carol    10.0.1.7:8300   10.0.1.7:8300   follower  true   3
```

## Summary

In this guide we reviewed how to recover from a Consul server outage. Depending on the
quorum size and number of failed servers, the recovery process will vary. In the event of
complete failure it is beneficial to have a
[backup process](https://www.consul.io/docs/guides/deployment-guide.html#backups).
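A minimal sketch of such a backup process, assuming a healthy cluster and a
placeholder destination directory, is a scheduled `consul snapshot save`
(available in Consul 0.7.1 and later):

```sh
# Save a timestamped snapshot of the cluster state; run this on a
# schedule, e.g. from cron (destination path is a placeholder).
consul snapshot save /var/backups/consul/backup-$(date +%Y%m%d%H%M).snap
```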