---
layout: "guides"
page_title: "Outage Recovery"
sidebar_current: "guides-outage-recovery"
description: |-
  Don't panic! This is a critical first step. Depending on your deployment
  configuration, it may take only a single server failure for cluster
  unavailability. Recovery requires an operator to intervene, but the process
  is straightforward.
---

# Outage Recovery

Don't panic! This is a critical first step.

Depending on your
[deployment configuration](/docs/internals/consensus.html#deployment_table), it
may take only a single server failure for cluster unavailability. Recovery
requires an operator to intervene, but the process is straightforward.

~> This guide is for recovery from a Nomad outage due to a majority of server
nodes in a datacenter being lost. If you are looking to add or remove servers,
see the [bootstrapping guide](/guides/cluster/bootstrapping.html).

## Failure of a Single Server Cluster

If you had only a single server and it has failed, simply restart it. A
single server configuration requires the
[`-bootstrap-expect=1`](/docs/agent/configuration/server.html#bootstrap_expect)
flag. If the server cannot be recovered, you need to bring up a new
server. See the [bootstrapping guide](/guides/cluster/bootstrapping.html)
for more detail.

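If it helps to see the commands involved, here is a minimal sketch of
restarting a single-server agent; the data directory path below is an
assumption, so substitute the path from your own configuration:

```text
# /opt/nomad/data is an assumed path; use your configured data_dir.
$ nomad agent -server -bootstrap-expect=1 -data-dir=/opt/nomad/data
```
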
In the case of an unrecoverable server failure in a single server cluster, data
loss is inevitable since data was not replicated to any other servers. This is
why a single server deployment is **never** recommended.

## Failure of a Server in a Multi-Server Cluster

If you think the failed server is recoverable, the easiest option is to bring
it back online and have it rejoin the cluster with the same IP address, returning
the cluster to a fully healthy state. Similarly, even if you need to build a
new Nomad server to replace the failed node, you may wish to do that immediately.
Keep in mind that the replacement server needs to have the same IP address as the failed
server. Again, once this server is online and has rejoined, the cluster will return
to a fully healthy state.

Both of these strategies involve a potentially lengthy time to reboot or rebuild
a failed server. If this is impractical or if building a new server with the same
IP isn't an option, you need to remove the failed server. Usually, you can issue
a [`nomad server force-leave`](/docs/commands/server/force-leave.html) command
to remove the failed server if it's still a member of the cluster.

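For example, if the failed server were named `nomad-server03.global` (a
hypothetical name, as it would appear in `nomad server members` output), a
sketch of forcing it out might look like:

```text
$ nomad server force-leave nomad-server03.global
```
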
If [`nomad server force-leave`](/docs/commands/server/force-leave.html) isn't
able to remove the server, you have two methods available to remove it,
depending on your version of Nomad:

* In Nomad 0.5.5 and later, you can use the [`nomad operator raft
  remove-peer`](/docs/commands/operator/raft-remove-peer.html) command to remove
  the stale peer server on the fly with no downtime (see the example below).

* In versions of Nomad prior to 0.5.5, you can manually remove the stale peer
  server using the `raft/peers.json` recovery file on all remaining servers. See
  the [section below](#manual-recovery-using-peers-json) for details on this
  procedure. This process requires Nomad downtime to complete.

In Nomad 0.5.5 and later, you can use the [`nomad operator raft
list-peers`](/docs/commands/operator/raft-list-peers.html) command to inspect
the Raft configuration:

```
$ nomad operator raft list-peers
Node                   ID               Address          State     Voter
nomad-server01.global  10.10.11.5:4647  10.10.11.5:4647  follower  true
nomad-server02.global  10.10.11.6:4647  10.10.11.6:4647  leader    true
nomad-server03.global  10.10.11.7:4647  10.10.11.7:4647  follower  true
```

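If `nomad-server03.global` in the output above were the failed server, a sketch
of removing its stale peer entry might look like the following; check
`nomad operator raft remove-peer -h` on your version for the exact flags:

```text
$ nomad operator raft remove-peer -peer-address=10.10.11.7:4647
```
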
## Failure of Multiple Servers in a Multi-Server Cluster

In the event that multiple servers are lost, causing a loss of quorum and a
complete outage, partial recovery is possible using data on the remaining
servers in the cluster. There may be data loss in this situation because multiple
servers were lost, so information about what's committed could be incomplete.
The recovery process implicitly commits all outstanding Raft log entries, so
it's also possible to commit data that was uncommitted before the failure.

See the [section below](#manual-recovery-using-peers-json) for details of the
recovery procedure. You simply include the remaining servers in the
`raft/peers.json` recovery file. The cluster should be able to elect a leader
once the remaining servers are all restarted with an identical `raft/peers.json`
configuration.

Any new servers you introduce later can be fresh with totally clean data directories
and joined using Nomad's `server join` command.

In extreme cases, it should be possible to recover with just a single remaining
server by starting that single server with itself as the only peer in the
`raft/peers.json` recovery file.

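In that single-survivor case, the recovery file contains only the surviving
server's own address, for example (the address below is illustrative):

```javascript
[
  "10.0.1.8:4647"
]
```
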
Prior to Nomad 0.5.5, it wasn't always possible to recover from certain
types of outages with `raft/peers.json` because this was ingested before any Raft
log entries were played back. In Nomad 0.5.5 and later, the `raft/peers.json`
recovery file is final, and a snapshot is taken after it is ingested, so you are
guaranteed to start with your recovered configuration. This does implicitly commit
all Raft log entries, so it should only be used to recover from an outage, but it
should allow recovery from any situation where there's some cluster data available.

## Manual Recovery Using peers.json

To begin, stop all remaining servers. You can attempt a graceful leave,
but it will not work in most cases. Do not worry if the leave exits with an
error. The cluster is in an unhealthy state, so this is expected.

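As a sketch, assuming each agent runs under systemd as a `nomad` unit (an
assumption; use whatever supervises your agents), stopping the remaining
servers might look like:

```text
# Run on every remaining server.
$ sudo systemctl stop nomad
```
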
In Nomad 0.5.5 and later, the `peers.json` file is no longer present
by default and is only used when performing recovery. Nomad deletes the file
after it starts and ingests it. Nomad 0.5.5 also uses a new, automatically-
created `raft/peers.info` file to avoid ingesting the `raft/peers.json` file on the
first start after upgrading. Be sure to leave `raft/peers.info` in place for proper
operation.

Using `raft/peers.json` for recovery can cause uncommitted Raft log entries to be
implicitly committed, so this should only be used after an outage where no
other option is available to recover a lost server. Make sure you don't have
any automated processes that will put the peers file in place on a
periodic basis.

The next step is to go to the
[`-data-dir`](/docs/agent/configuration/index.html#data_dir) of each Nomad
server. Inside that directory, there will be a `raft/` sub-directory. We need to
create a `raft/peers.json` file. It should look something like:
```javascript
[
  "10.0.1.8:4647",
  "10.0.1.6:4647",
  "10.0.1.7:4647"
]
```

Simply create entries for all remaining servers. You must confirm
that servers you do not include here have indeed failed and will not later
rejoin the cluster. Ensure that this file is the same across all remaining
server nodes.

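As a sketch, assuming a `data_dir` of `/opt/nomad/data` (an assumption; use the
path from your own server configuration), writing the file on each remaining
server might look like:

```text
$ cat > /opt/nomad/data/raft/peers.json <<EOF
[
  "10.0.1.8:4647",
  "10.0.1.6:4647",
  "10.0.1.7:4647"
]
EOF
```
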
At this point, you can restart all the remaining servers. In Nomad 0.5.5 and
later, you will see them ingest the recovery file:

```text
...
2016/08/16 14:39:20 [INFO] nomad: found peers.json file, recovering Raft configuration...
2016/08/16 14:39:20 [INFO] nomad.fsm: snapshot created in 12.484µs
2016/08/16 14:39:20 [INFO] snapshot: Creating new snapshot at /tmp/peers/raft/snapshots/2-5-1471383560779.tmp
2016/08/16 14:39:20 [INFO] nomad: deleted peers.json file after successful recovery
2016/08/16 14:39:20 [INFO] raft: Restored from snapshot 2-5-1471383560779
2016/08/16 14:39:20 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:10.212.15.121:4647 Address:10.212.15.121:4647}]
...
```

If any servers managed to perform a graceful leave, you may need to have them
rejoin the cluster using the [`server join`](/docs/commands/server/join.html) command:

```text
$ nomad server join <Node Address>
Successfully joined cluster by contacting 1 nodes.
```

It should be noted that any existing member can be used to rejoin the cluster,
as the gossip protocol will take care of discovering the server nodes.

At this point, the cluster should be in an operable state again. One of the
nodes should claim leadership and emit a log like:

```text
[INFO] nomad: cluster leadership acquired
```

In Nomad 0.5.5 and later, you can use the [`nomad operator raft
list-peers`](/docs/commands/operator/raft-list-peers.html) command to inspect
the Raft configuration:

```
$ nomad operator raft list-peers
Node                   ID               Address          State     Voter
nomad-server01.global  10.10.11.5:4647  10.10.11.5:4647  follower  true
nomad-server02.global  10.10.11.6:4647  10.10.11.6:4647  leader    true
nomad-server03.global  10.10.11.7:4647  10.10.11.7:4647  follower  true
```