---
layout: "docs"
page_title: "Outage Recovery"
sidebar_current: "docs-guides-outage"
description: |-
  Don't panic! This is a critical first step. Depending on your deployment configuration, it may take only a single server failure for cluster unavailability. Recovery requires an operator to intervene, but recovery is straightforward.
---

# Outage Recovery

Don't panic! This is a critical first step.

Depending on your
[deployment configuration](/docs/internals/consensus.html#deployment_table), it
may take only a single server failure for cluster unavailability. Recovery
requires an operator to intervene, but the process is straightforward.

This guide covers recovery from a Consul outage caused by the loss of a
majority of server nodes in a datacenter. There are several types of outages,
depending on the number of server nodes and the number of failed server nodes.
We will outline how to recover from:

* Failure of a Single Server Cluster. This is when you have a single Consul
  server and it fails.
* Failure of a Server in a Multi-Server Cluster. This is when one server fails
  in a Consul cluster of 3 or more servers.
* Failure of Multiple Servers in a Multi-Server Cluster. This is when more than
  one Consul server fails in a cluster of 3 or more servers. This scenario is
  potentially the most serious, because it can result in data loss.

## Failure of a Single Server Cluster

If you had only a single server and it has failed, simply restart it. A
single-server configuration requires the
[`-bootstrap`](/docs/agent/options.html#_bootstrap) or
[`-bootstrap-expect=1`](/docs/agent/options.html#_bootstrap_expect)
flag.

```sh
consul agent -bootstrap-expect=1
```

If the server cannot be recovered, you need to bring up a new
server using the [deployment guide](https://www.consul.io/docs/guides/deployment-guide.html).

In the case of an unrecoverable server failure in a single-server cluster with
no backup procedure, data loss is inevitable since data was not replicated
to any other servers. This is why a single-server deployment is **never**
recommended.

Any services registered with agents will be re-populated when the new server
comes online as agents perform [anti-entropy](/docs/internals/anti-entropy.html).
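If you do have a backup, for example a Raft snapshot taken earlier with
`consul snapshot save` (available in Consul 0.7.1 and later), you can restore
the saved state once the replacement server is up. A minimal sketch, with a
placeholder filename:

```sh
# Restore previously saved cluster state onto the new server
# (requires Consul >= 0.7.1; "backup.snap" is a placeholder filename).
consul snapshot restore backup.snap
```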
## Failure of a Server in a Multi-Server Cluster

If you think the failed server is recoverable, the easiest option is to bring
it back online and have it rejoin the cluster with the same IP address, returning
the cluster to a fully healthy state. Similarly, even if you need to rebuild a
new Consul server to replace the failed node, you may wish to do that immediately.
Keep in mind that the rebuilt server needs to have the same IP address as the failed
server. Again, once this server is online and has rejoined, the cluster will return
to a fully healthy state.

```sh
consul agent -bootstrap-expect=3 -bind=192.172.2.4 -retry-join=192.172.2.3
```

Both of these strategies involve a potentially lengthy time to reboot or rebuild
a failed server. If this is impractical or if building a new server with the same
IP isn't an option, you need to remove the failed server. Usually, you can issue
a [`consul force-leave`](/docs/commands/force-leave.html) command to remove the failed
server if it's still a member of the cluster.

```sh
consul force-leave <node.name.consul>
```

If [`consul force-leave`](/docs/commands/force-leave.html) isn't able to remove the
server, you have two methods available to remove it, depending on your version of
Consul:

* In Consul 0.7 and later, you can use the
  [`consul operator`](/docs/commands/operator.html#raft-remove-peer) command to
  remove the stale peer server on the fly with no downtime if the cluster has a
  leader, as shown in the example below.

* In versions of Consul prior to 0.7, you can manually remove the stale peer
  server using the `raft/peers.json` recovery file on all remaining servers. See
  the [section below](#peers.json) for details on this procedure. This process
  requires Consul downtime to complete.
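For example, with a leader still available, a stale peer at a placeholder
address of `10.0.1.8:8300` could be removed like this:

```sh
# Remove the stale peer by its Raft address (requires Consul >= 0.7
# and a functioning cluster leader; the address shown is a placeholder).
consul operator raft remove-peer -address="10.0.1.8:8300"
```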
In Consul 0.7 and later, you can use the [`consul operator`](/docs/commands/operator.html#raft-list-peers)
command to inspect the Raft configuration:

```
$ consul operator raft list-peers
Node     ID              Address         State     Voter  RaftProtocol
alice    10.0.1.8:8300   10.0.1.8:8300   follower  true   3
bob      10.0.1.6:8300   10.0.1.6:8300   leader    true   3
carol    10.0.1.7:8300   10.0.1.7:8300   follower  true   3
```

## Failure of Multiple Servers in a Multi-Server Cluster

In the event that multiple servers are lost, causing a loss of quorum and a
complete outage, partial recovery is possible using data on the remaining
servers in the cluster. There may be data loss in this situation because multiple
servers were lost, so information about what's committed could be incomplete.
The recovery process implicitly commits all outstanding Raft log entries, so
it's also possible to commit data that was uncommitted before the failure.

See the [section below](#peers.json) on manual recovery using peers.json for
details of the recovery procedure. You simply include only the remaining servers
in the `raft/peers.json` recovery file. The cluster should be able to elect a
leader once the remaining servers are all restarted with an identical
`raft/peers.json` configuration.

Any new servers you introduce later can be fresh with totally clean data directories
and joined using Consul's `join` command.

```sh
consul agent -join=192.172.2.3
```

In extreme cases, it should be possible to recover with just a single remaining
server by starting that single server with itself as the only peer in the
`raft/peers.json` recovery file.

Prior to Consul 0.7 it wasn't always possible to recover from certain
types of outages with `raft/peers.json` because the file was ingested before any
Raft log entries were played back. In Consul 0.7 and later, the `raft/peers.json`
recovery file is final, and a snapshot is taken after it is ingested, so you are
guaranteed to start with your recovered configuration. This does implicitly commit
all Raft log entries, so it should only be used to recover from an outage, but it
should allow recovery from any situation where there's some cluster data available.

<a name="peers.json"></a>
### Manual Recovery Using peers.json

To begin, stop all remaining servers. You can attempt a graceful leave,
but it will not work in most cases. Do not worry if the leave exits with an
error. The cluster is in an unhealthy state, so this is expected.

In Consul 0.7 and later, the `peers.json` file is no longer present
by default and is only used when performing recovery. It is deleted after
Consul starts and ingests it. Consul 0.7 also uses a new, automatically-created
`raft/peers.info` file to avoid ingesting the `raft/peers.json` file on the
first start after upgrading. Be sure to leave `raft/peers.info` in place for
proper operation.

Using `raft/peers.json` for recovery can cause uncommitted Raft log entries to be
implicitly committed, so this should only be used after an outage where no
other option is available to recover a lost server. Make sure you don't have
any automated processes that will put the peers file in place on a
periodic basis.

The next step is to go to the [`-data-dir`](/docs/agent/options.html#_data_dir)
of each Consul server. Inside that directory, there will be a `raft/`
sub-directory. We need to create a `raft/peers.json` file. The format of this file
depends on what the server has configured for its
[Raft protocol](/docs/agent/options.html#_raft_protocol) version.

For Raft protocol version 2 and earlier, this should be formatted as a JSON
array containing the address and port of each Consul server in the cluster, like
this:

```json
[
  "10.1.0.1:8300",
  "10.1.0.2:8300",
  "10.1.0.3:8300"
]
```

For Raft protocol version 3 and later, this should be formatted as a JSON
array containing the node ID, address:port, and suffrage information of each
Consul server in the cluster, like this:

```json
[
  {
    "id": "adf4238a-882b-9ddc-4a9d-5b6758e4159e",
    "address": "10.1.0.1:8300",
    "non_voter": false
  },
  {
    "id": "8b6dda82-3103-11e7-93ae-92361f002671",
    "address": "10.1.0.2:8300",
    "non_voter": false
  },
  {
    "id": "97e17742-3103-11e7-93ae-92361f002671",
    "address": "10.1.0.3:8300",
    "non_voter": false
  }
]
```

- `id` `(string: <required>)` - Specifies the [node ID](/docs/agent/options.html#_node_id)
  of the server. This can be found in the logs when the server starts up if it was auto-generated,
  and it can also be found inside the `node-id` file in the server's data directory.

- `address` `(string: <required>)` - Specifies the IP and port of the server. The port is the
  server's RPC port used for cluster communications.

- `non_voter` `(bool: <false>)` - This controls whether the server is a non-voter, which is used
  in some advanced [Autopilot](/docs/guides/autopilot.html) configurations. If omitted, it will
  default to false, which is typical for most clusters.

Simply create entries for all servers. You must confirm that servers you do not include here have
indeed failed and will not later rejoin the cluster. Ensure that this file is the same across all
remaining server nodes.
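As a concrete sketch of these steps on one server, assuming a systemd unit
named `consul` and a data directory of `/opt/consul` (both assumptions; the ID
and address are the placeholders from above):

```sh
# Stop the agent before touching the Raft data (unit name is an assumption).
systemctl stop consul

# For Raft protocol 3, each server's own node ID can be read from its data directory:
cat /opt/consul/node-id

# Write an identical recovery file on every remaining server
# (placeholder ID and address; list all surviving servers here).
cat > /opt/consul/raft/peers.json <<'EOF'
[
  {
    "id": "adf4238a-882b-9ddc-4a9d-5b6758e4159e",
    "address": "10.1.0.1:8300",
    "non_voter": false
  }
]
EOF
```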
At this point, you can restart all the remaining servers. In Consul 0.7 and
later you will see them ingest the recovery file:

```text
...
2016/08/16 14:39:20 [INFO] consul: found peers.json file, recovering Raft configuration...
2016/08/16 14:39:20 [INFO] consul.fsm: snapshot created in 12.484µs
2016/08/16 14:39:20 [INFO] snapshot: Creating new snapshot at /tmp/peers/raft/snapshots/2-5-1471383560779.tmp
2016/08/16 14:39:20 [INFO] consul: deleted peers.json file after successful recovery
2016/08/16 14:39:20 [INFO] raft: Restored from snapshot 2-5-1471383560779
2016/08/16 14:39:20 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:10.212.15.121:8300 Address:10.212.15.121:8300}]
...
```

If any servers managed to perform a graceful leave, you may need to have them
rejoin the cluster using the [`join`](/docs/commands/join.html) command:

```text
$ consul join <Node Address>
Successfully joined cluster by contacting 1 nodes.
```

It should be noted that any existing member can be used to rejoin the cluster
as the gossip protocol will take care of discovering the server nodes.

At this point, the cluster should be in an operable state again. One of the
nodes should claim leadership and emit a log like:

```text
[INFO] consul: cluster leadership acquired
```

In Consul 0.7 and later, you can use the [`consul operator`](/docs/commands/operator.html#raft-list-peers)
command to inspect the Raft configuration:

```
$ consul operator raft list-peers
Node     ID              Address         State     Voter  RaftProtocol
alice    10.0.1.8:8300   10.0.1.8:8300   follower  true   3
bob      10.0.1.6:8300   10.0.1.6:8300   leader    true   3
carol    10.0.1.7:8300   10.0.1.7:8300   follower  true   3
```

## Summary

In this guide we reviewed how to recover from a Consul server outage. Depending on the
quorum size and number of failed servers, the recovery process will vary. In the event of
complete failure it is beneficial to have a
[backup process](https://www.consul.io/docs/guides/deployment-guide.html#backups).
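A minimal sketch of such a backup process, assuming a healthy cluster and a
placeholder destination directory, is a scheduled `consul snapshot save`
(available in Consul 0.7.1 and later):

```sh
# Save a timestamped snapshot of the cluster state; run this on a
# schedule, e.g. from cron (destination path is a placeholder).
consul snapshot save /var/backups/consul/backup-$(date +%Y%m%d%H%M).snap
```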