---
layout: "guides"
page_title: "Outage Recovery"
sidebar_current: "guides-outage-recovery"
description: |-
  Don't panic! This is a critical first step. Depending on your deployment
  configuration, it may take only a single server failure for cluster
  unavailability. Recovery requires an operator to intervene, but the process
  is straightforward.
---

# Outage Recovery

Don't panic! This is a critical first step.

Depending on your
[deployment configuration](/docs/internals/consensus.html#deployment_table), it
may take only a single server failure for cluster unavailability. Recovery
requires an operator to intervene, but the process is straightforward.

~> This guide is for recovery from a Nomad outage due to a majority of server
nodes in a datacenter being lost. If you are looking to add or remove servers,
see the [bootstrapping guide](/guides/cluster/bootstrapping.html).

## Failure of a Single Server Cluster

If you had only a single server and it has failed, simply restart it. A
single server configuration requires the
[`-bootstrap-expect=1`](/docs/agent/configuration/server.html#bootstrap_expect)
flag. If the server cannot be recovered, you need to bring up a new
server. See the [bootstrapping guide](/guides/cluster/bootstrapping.html)
for more detail.

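If it helps to see the commands involved, here is a minimal sketch of
restarting a single-server agent; the data directory path below is an
assumption, so substitute the path from your own configuration:

```text
# /opt/nomad/data is an assumed path; use your configured data_dir.
$ nomad agent -server -bootstrap-expect=1 -data-dir=/opt/nomad/data
```
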
In the case of an unrecoverable server failure in a single server cluster, data
loss is inevitable since data was not replicated to any other servers. This is
why a single server deployment is **never** recommended.

## Failure of a Server in a Multi-Server Cluster

If you think the failed server is recoverable, the easiest option is to bring
it back online and have it rejoin the cluster with the same IP address, returning
the cluster to a fully healthy state. Similarly, even if you need to build a
new Nomad server to replace the failed node, you may wish to do that immediately.
Keep in mind that the replacement server needs to have the same IP address as the failed
server. Again, once this server is online and has rejoined, the cluster will return
to a fully healthy state.

Both of these strategies involve a potentially lengthy time to reboot or rebuild
a failed server. If this is impractical or if building a new server with the same
IP isn't an option, you need to remove the failed server. Usually, you can issue
a [`nomad server force-leave`](/docs/commands/server/force-leave.html) command
to remove the failed server if it's still a member of the cluster.

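For example, if the failed server were named `nomad-server03.global` (a
hypothetical name, as it would appear in `nomad server members` output), a
sketch of forcing it out might look like:

```text
$ nomad server force-leave nomad-server03.global
```
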
If [`nomad server force-leave`](/docs/commands/server/force-leave.html) isn't
able to remove the server, you have two methods available to remove it,
depending on your version of Nomad:

* In Nomad 0.5.5 and later, you can use the [`nomad operator raft
  remove-peer`](/docs/commands/operator/raft-remove-peer.html) command to remove
  the stale peer server on the fly with no downtime (see the example below).

* In versions of Nomad prior to 0.5.5, you can manually remove the stale peer
  server using the `raft/peers.json` recovery file on all remaining servers. See
  the [section below](#manual-recovery-using-peers-json) for details on this
  procedure. This process requires Nomad downtime to complete.

In Nomad 0.5.5 and later, you can use the [`nomad operator raft
list-peers`](/docs/commands/operator/raft-list-peers.html) command to inspect
the Raft configuration:

```
$ nomad operator raft list-peers
Node                   ID               Address          State     Voter
nomad-server01.global  10.10.11.5:4647  10.10.11.5:4647  follower  true
nomad-server02.global  10.10.11.6:4647  10.10.11.6:4647  leader    true
nomad-server03.global  10.10.11.7:4647  10.10.11.7:4647  follower  true
```

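If `nomad-server03.global` in the output above were the failed server, a sketch
of removing its stale peer entry might look like the following; check
`nomad operator raft remove-peer -h` on your version for the exact flags:

```text
$ nomad operator raft remove-peer -peer-address=10.10.11.7:4647
```
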
## Failure of Multiple Servers in a Multi-Server Cluster

In the event that multiple servers are lost, causing a loss of quorum and a
complete outage, partial recovery is possible using data on the remaining
servers in the cluster. There may be data loss in this situation because multiple
servers were lost, so information about what's committed could be incomplete.
The recovery process implicitly commits all outstanding Raft log entries, so
it's also possible to commit data that was uncommitted before the failure.

See the [section below](#manual-recovery-using-peers-json) for details of the
recovery procedure. You simply include the remaining servers in the
`raft/peers.json` recovery file. The cluster should be able to elect a leader
once the remaining servers are all restarted with an identical `raft/peers.json`
configuration.

Any new servers you introduce later can be fresh with totally clean data directories
and joined using Nomad's `server join` command.

In extreme cases, it should be possible to recover with just a single remaining
server by starting that single server with itself as the only peer in the
`raft/peers.json` recovery file.

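In that single-survivor case, the recovery file contains only the surviving
server's own address, for example (the address below is illustrative):

```javascript
[
  "10.0.1.8:4647"
]
```
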
Prior to Nomad 0.5.5, it wasn't always possible to recover from certain
types of outages with `raft/peers.json` because this was ingested before any Raft
log entries were played back. In Nomad 0.5.5 and later, the `raft/peers.json`
recovery file is final, and a snapshot is taken after it is ingested, so you are
guaranteed to start with your recovered configuration. This does implicitly commit
all Raft log entries, so it should only be used to recover from an outage, but it
should allow recovery from any situation where there's some cluster data available.

## Manual Recovery Using peers.json

To begin, stop all remaining servers. You can attempt a graceful leave,
but it will not work in most cases. Do not worry if the leave exits with an
error. The cluster is in an unhealthy state, so this is expected.

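As a sketch, assuming each agent runs under systemd as a `nomad` unit (an
assumption; use whatever supervises your agents), stopping the remaining
servers might look like:

```text
# Run on every remaining server.
$ sudo systemctl stop nomad
```
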
In Nomad 0.5.5 and later, the `peers.json` file is no longer present
by default and is only used when performing recovery. Nomad deletes the file
after it starts and ingests it. Nomad 0.5.5 also uses a new, automatically-
created `raft/peers.info` file to avoid ingesting the `raft/peers.json` file on the
first start after upgrading. Be sure to leave `raft/peers.info` in place for proper
operation.

Using `raft/peers.json` for recovery can cause uncommitted Raft log entries to be
implicitly committed, so this should only be used after an outage where no
other option is available to recover a lost server. Make sure you don't have
any automated processes that will put the peers file in place on a
periodic basis.

The next step is to go to the
[`-data-dir`](/docs/agent/configuration/index.html#data_dir) of each Nomad
server. Inside that directory, there will be a `raft/` sub-directory. We need to
create a `raft/peers.json` file. It should look something like:
```javascript
[
  "10.0.1.8:4647",
  "10.0.1.6:4647",
  "10.0.1.7:4647"
]
```

Simply create entries for all remaining servers. You must confirm
that servers you do not include here have indeed failed and will not later
rejoin the cluster. Ensure that this file is the same across all remaining
server nodes.

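As a sketch, assuming a `data_dir` of `/opt/nomad/data` (an assumption; use the
path from your own server configuration), writing the file on each remaining
server might look like:

```text
$ cat > /opt/nomad/data/raft/peers.json <<EOF
[
  "10.0.1.8:4647",
  "10.0.1.6:4647",
  "10.0.1.7:4647"
]
EOF
```
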
At this point, you can restart all the remaining servers. In Nomad 0.5.5 and
later, you will see them ingest the recovery file:

```text
...
2016/08/16 14:39:20 [INFO] nomad: found peers.json file, recovering Raft configuration...
2016/08/16 14:39:20 [INFO] nomad.fsm: snapshot created in 12.484µs
2016/08/16 14:39:20 [INFO] snapshot: Creating new snapshot at /tmp/peers/raft/snapshots/2-5-1471383560779.tmp
2016/08/16 14:39:20 [INFO] nomad: deleted peers.json file after successful recovery
2016/08/16 14:39:20 [INFO] raft: Restored from snapshot 2-5-1471383560779
2016/08/16 14:39:20 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:10.212.15.121:4647 Address:10.212.15.121:4647}]
...
```

If any servers managed to perform a graceful leave, you may need to have them
rejoin the cluster using the [`server join`](/docs/commands/server/join.html) command:

```text
$ nomad server join <Node Address>
Successfully joined cluster by contacting 1 nodes.
```

It should be noted that any existing member can be used to rejoin the cluster,
as the gossip protocol will take care of discovering the server nodes.

At this point, the cluster should be in an operable state again. One of the
nodes should claim leadership and emit a log like:

```text
[INFO] nomad: cluster leadership acquired
```

In Nomad 0.5.5 and later, you can use the [`nomad operator raft
list-peers`](/docs/commands/operator/raft-list-peers.html) command to inspect
the Raft configuration:

```
$ nomad operator raft list-peers
Node                   ID               Address          State     Voter
nomad-server01.global  10.10.11.5:4647  10.10.11.5:4647  follower  true
nomad-server02.global  10.10.11.6:4647  10.10.11.6:4647  leader    true
nomad-server03.global  10.10.11.7:4647  10.10.11.7:4647  follower  true
```