---
layout: "guides"
page_title: "Outage Recovery"
sidebar_current: "guides-outage-recovery"
description: |-
  Don't panic! This is a critical first step. Depending on your deployment
  configuration, it may take only a single server failure for cluster
  unavailability. Recovery requires an operator to intervene, but the process
  is straightforward.
---

# Outage Recovery

Don't panic! This is a critical first step.

Depending on your
[deployment configuration](/docs/internals/consensus.html#deployment_table), it
may take only a single server failure for cluster unavailability. Recovery
requires an operator to intervene, but the process is straightforward.

~> This guide is for recovery from a Nomad outage due to a majority of server
nodes in a datacenter being lost. If you are looking to add or remove servers,
see the [bootstrapping guide](/guides/cluster/bootstrapping.html).

## Failure of a Single Server Cluster

If you had only a single server and it has failed, simply restart it. A
single server configuration requires the
[`-bootstrap-expect=1`](/docs/agent/configuration/server.html#bootstrap_expect)
flag. If the server cannot be recovered, you need to bring up a new
server. See the [bootstrapping guide](/guides/cluster/bootstrapping.html)
for more detail.

In the case of an unrecoverable server failure in a single server cluster, data
loss is inevitable since data was not replicated to any other servers. This is
why a single-server deployment is **never** recommended.

## Failure of a Server in a Multi-Server Cluster

If you think the failed server is recoverable, the easiest option is to bring
it back online and have it rejoin the cluster with the same IP address, returning
the cluster to a fully healthy state. Similarly, even if you need to rebuild a
new Nomad server to replace the failed node, you may wish to do that immediately.
Keep in mind that the rebuilt server needs to have the same IP address as the failed
server. Again, once this server is online and has rejoined, the cluster will return
to a fully healthy state.

Both of these strategies involve a potentially lengthy time to reboot or rebuild
a failed server. If this is impractical or if building a new server with the same
IP isn't an option, you need to remove the failed server. Usually, you can issue
a [`nomad server force-leave`](/docs/commands/server/force-leave.html) command
to remove the failed server if it's still a member of the cluster.

If [`nomad server force-leave`](/docs/commands/server/force-leave.html) isn't
able to remove the server, you have two methods available to remove it,
depending on your version of Nomad:

* In Nomad 0.5.5 and later, you can use the [`nomad operator raft
  remove-peer`](/docs/commands/operator/raft-remove-peer.html) command to remove
  the stale peer server on the fly with no downtime, as shown in the sketch
  below.

* In versions of Nomad prior to 0.5.5, you can manually remove the stale peer
  server using the `raft/peers.json` recovery file on all remaining servers. See
  the [section below](#manual-recovery-using-peers-json) for details on this
  procedure. This process requires Nomad downtime to complete.
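As a rough sketch of the first option, assume the failed server was named
`nomad-server03.global` and listened on `10.10.11.7:4647` (both are
placeholders). Removal could look like the following; check the command
documentation linked above for the exact flags supported by your Nomad version:

```text
# Ask the remaining servers to mark the failed member as having left the
# gossip pool:
$ nomad server force-leave nomad-server03.global

# On Nomad 0.5.5 and later, remove the stale Raft peer by its address if
# force-leave alone is not enough:
$ nomad operator raft remove-peer -peer-address=10.10.11.7:4647
```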
In Nomad 0.5.5 and later, you can use the [`nomad operator raft
list-peers`](/docs/commands/operator/raft-list-peers.html) command to inspect
the Raft configuration:

```
$ nomad operator raft list-peers
Node                   ID               Address          State     Voter
nomad-server01.global  10.10.11.5:4647  10.10.11.5:4647  follower  true
nomad-server02.global  10.10.11.6:4647  10.10.11.6:4647  leader    true
nomad-server03.global  10.10.11.7:4647  10.10.11.7:4647  follower  true
```

## Failure of Multiple Servers in a Multi-Server Cluster

In the event that multiple servers are lost, causing a loss of quorum and a
complete outage, partial recovery is possible using data on the remaining
servers in the cluster. There may be data loss in this situation because multiple
servers were lost, so information about what's committed could be incomplete.
The recovery process implicitly commits all outstanding Raft log entries, so
it's also possible to commit data that was uncommitted before the failure.

See the [section below](#manual-recovery-using-peers-json) for details of the
recovery procedure. Simply include just the remaining servers in the
`raft/peers.json` recovery file. The cluster should be able to elect a leader
once the remaining servers are all restarted with an identical `raft/peers.json`
configuration.

Any new servers you introduce later can be fresh with totally clean data directories
and joined using Nomad's `server join` command.

In extreme cases, it should be possible to recover with just a single remaining
server by starting that single server with itself as the only peer in the
`raft/peers.json` recovery file.

Prior to Nomad 0.5.5, it wasn't always possible to recover from certain
types of outages with `raft/peers.json` because this file was ingested before any
Raft log entries were played back. In Nomad 0.5.5 and later, the `raft/peers.json`
recovery file is final, and a snapshot is taken after it is ingested, so you are
guaranteed to start with your recovered configuration. This does implicitly commit
all Raft log entries, so it should only be used to recover from an outage, but it
should allow recovery from any situation where there's some cluster data available.

## Manual Recovery Using peers.json

To begin, stop all remaining servers. You can attempt a graceful leave,
but it will not work in most cases. Do not worry if the leave exits with an
error. The cluster is in an unhealthy state, so this is expected.
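How you stop the agents depends on how they are supervised. A minimal sketch,
assuming each remaining server runs Nomad under systemd with a unit named
`nomad` (the unit name is an assumption):

```text
# Run on every remaining server; adjust to however Nomad is supervised in
# your environment.
$ sudo systemctl stop nomad

# A local query should now fail, confirming the agent is down:
$ nomad agent-info
```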
In Nomad 0.5.5 and later, the `peers.json` file is no longer present
by default and is only used when performing recovery. This file will be deleted
after Nomad starts and ingests it. Nomad 0.5.5 also uses a new, automatically-
created `raft/peers.info` file to avoid ingesting the `raft/peers.json` file on the
first start after upgrading. Be sure to leave `raft/peers.info` in place for proper
operation.

Using `raft/peers.json` for recovery can cause uncommitted Raft log entries to be
implicitly committed, so this should only be used after an outage where no
other option is available to recover a lost server. Make sure you don't have
any automated processes that will put the peers file in place on a
periodic basis.

The next step is to go to the
[`-data-dir`](/docs/agent/configuration/index.html#data_dir) of each Nomad
server. Inside that directory, there will be a `raft/` sub-directory. We need to
create a `raft/peers.json` file. It should look something like:

```javascript
[
  "10.0.1.8:4647",
  "10.0.1.6:4647",
  "10.0.1.7:4647"
]
```

Simply create entries for all remaining servers. You must confirm
that servers you do not include here have indeed failed and will not later
rejoin the cluster. Ensure that this file is the same across all remaining
server nodes.

At this point, you can restart all the remaining servers. In Nomad 0.5.5 and
later you will see them ingest the recovery file:

```text
...
2016/08/16 14:39:20 [INFO] nomad: found peers.json file, recovering Raft configuration...
2016/08/16 14:39:20 [INFO] nomad.fsm: snapshot created in 12.484µs
2016/08/16 14:39:20 [INFO] snapshot: Creating new snapshot at /tmp/peers/raft/snapshots/2-5-1471383560779.tmp
2016/08/16 14:39:20 [INFO] nomad: deleted peers.json file after successful recovery
2016/08/16 14:39:20 [INFO] raft: Restored from snapshot 2-5-1471383560779
2016/08/16 14:39:20 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:10.212.15.121:4647 Address:10.212.15.121:4647}]
...
```

If any servers managed to perform a graceful leave, you may need to have them
rejoin the cluster using the [`server join`](/docs/commands/server/join.html) command:

```text
$ nomad server join <Node Address>
Successfully joined cluster by contacting 1 nodes.
```

It should be noted that any existing member can be used to rejoin the cluster
as the gossip protocol will take care of discovering the server nodes.

At this point, the cluster should be in an operable state again. One of the
nodes should claim leadership and emit a log like:

```text
[INFO] nomad: cluster leadership acquired
```

In Nomad 0.5.5 and later, you can use the [`nomad operator raft
list-peers`](/docs/commands/operator/raft-list-peers.html) command to inspect
the Raft configuration:

```
$ nomad operator raft list-peers
Node                   ID               Address          State     Voter
nomad-server01.global  10.10.11.5:4647  10.10.11.5:4647  follower  true
nomad-server02.global  10.10.11.6:4647  10.10.11.6:4647  leader    true
nomad-server03.global  10.10.11.7:4647  10.10.11.7:4647  follower  true
```
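As a final sanity check, you can also look at the gossip view of the servers;
one of them should be marked as the leader. The output below is only
illustrative, reusing the example servers from above, and its exact columns
vary by Nomad version:

```text
$ nomad server members
Name                   Address     Port  Status  Leader  Protocol  Build  Datacenter  Region
nomad-server01.global  10.10.11.5  4648  alive   false   2         0.8.3  dc1         global
nomad-server02.global  10.10.11.6  4648  alive   true    2         0.8.3  dc1         global
nomad-server03.global  10.10.11.7  4648  alive   false   2         0.8.3  dc1         global
```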