---
layout: "docs"
page_title: "Outage Recovery"
sidebar_current: "docs-guides-outage"
description: |-
  Don't panic! This is a critical first step. Depending on your deployment configuration, it may take only a single server failure for cluster unavailability. Recovery requires an operator to intervene, but recovery is straightforward.
---

# Outage Recovery

Don't panic! This is a critical first step.

Depending on your
[deployment configuration](/docs/internals/consensus.html#deployment_table), it
may take only a single server failure for cluster unavailability. Recovery
requires an operator to intervene, but the process is straightforward.

This guide is for recovery from a Consul outage due to a majority
of server nodes in a datacenter being lost. There are several types
of outages, depending on the number of server nodes and number of failed
server nodes. We will outline how to recover from:

* Failure of a Single Server Cluster. This is when you have a single Consul
server and it fails.
* Failure of a Server in a Multi-Server Cluster. This is when one server fails
in a Consul cluster of 3 or more servers.
* Failure of Multiple Servers in a Multi-Server Cluster. This is when more than one
Consul server fails in a cluster of 3 or more servers. This scenario is potentially
the most serious, because it can result in data loss.

## Failure of a Single Server Cluster

If you have only a single server and it fails, simply restart it. A
single server configuration requires the
[`-bootstrap`](/docs/agent/options.html#_bootstrap) or
[`-bootstrap-expect=1`](/docs/agent/options.html#_bootstrap_expect)
flag.

```sh
consul agent -bootstrap-expect=1
```

If the server cannot be recovered, you need to bring up a new
server using the [deployment guide](https://www.consul.io/docs/guides/deployment-guide.html).

In the case of an unrecoverable server failure in a single server cluster and
no backup procedure, data loss is inevitable since data was not replicated
to any other servers. This is why a single server deployment is **never** recommended.

Any services registered with agents will be re-populated when the new server
comes online as agents perform [anti-entropy](/docs/internals/anti-entropy.html).
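
Once the replacement server is online, you can verify that registrations have been re-populated by querying the catalog. A minimal check, assuming the default HTTP API address on a local agent:

```sh
# List services known to the catalog; entries re-appear as agents sync
consul catalog services

# Equivalent query against the HTTP API
curl http://127.0.0.1:8500/v1/catalog/services
```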

## Failure of a Server in a Multi-Server Cluster

If you think the failed server is recoverable, the easiest option is to bring
it back online and have it rejoin the cluster with the same IP address, returning
the cluster to a fully healthy state. Similarly, even if you need to rebuild a
new Consul server to replace the failed node, you may wish to do that immediately.
Keep in mind that the rebuilt server needs to have the same IP address as the failed
server. Again, once this server is online and has rejoined, the cluster will return
to a fully healthy state.

```sh
consul agent -bootstrap-expect=3 -bind=192.172.2.4 -retry-join=192.172.2.3
```

Both of these strategies involve a potentially lengthy time to reboot or rebuild
a failed server. If this is impractical or if building a new server with the same
IP isn't an option, you need to remove the failed server. Usually, you can issue
a [`consul force-leave`](/docs/commands/force-leave.html) command to remove the failed
server if it's still a member of the cluster.

```sh
consul force-leave <node.name.consul>
```
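
If you are unsure of the failed server's node name, you can list the members first; a lost server typically shows a `failed` status. The node name below is illustrative:

```sh
# Inspect membership as seen by the local agent and note the failed server's name
consul members

# Then remove it from the cluster
consul force-leave consul-server-2
```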

If [`consul force-leave`](/docs/commands/force-leave.html) isn't able to remove the
server, you have two methods available to remove it, depending on your version of Consul:

* In Consul 0.7 and later, you can use the [`consul operator`](/docs/commands/operator.html#raft-remove-peer) command to remove the stale peer server on the fly with no downtime if the cluster has a leader, as shown below.

* In versions of Consul prior to 0.7, you can manually remove the stale peer
server using the `raft/peers.json` recovery file on all remaining servers. See
the [section below](#peers.json) for details on this procedure. This process
requires Consul downtime to complete.
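
For the first method, a minimal sketch of removing the stale peer with the operator command; the address is illustrative, and on servers running Raft protocol version 3 you would pass the failed server's node ID with `-id` instead:

```sh
# Remove the failed server from the Raft configuration by its address
consul operator raft remove-peer -address="10.0.1.8:8300"
```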

In Consul 0.7 and later, you can use the [`consul operator`](/docs/commands/operator.html#raft-list-peers)
command to inspect the Raft configuration:

```text
$ consul operator raft list-peers
Node     ID              Address         State     Voter RaftProtocol
alice    10.0.1.8:8300   10.0.1.8:8300   follower  true  3
bob      10.0.1.6:8300   10.0.1.6:8300   leader    true  3
carol    10.0.1.7:8300   10.0.1.7:8300   follower  true  3
```

## Failure of Multiple Servers in a Multi-Server Cluster

In the event that multiple servers are lost, causing a loss of quorum and a
complete outage, partial recovery is possible using data on the remaining
servers in the cluster. There may be data loss in this situation because multiple
servers were lost, so information about what's committed could be incomplete.
The recovery process implicitly commits all outstanding Raft log entries, so
it's also possible to commit data that was uncommitted before the failure.

See the [section below](#peers.json) on manual recovery using peers.json for details of the recovery procedure.
Simply include only the remaining servers in the `raft/peers.json` recovery file.
The cluster should be able to elect a leader once the remaining servers are all
restarted with an identical `raft/peers.json` configuration.

Any new servers you introduce later can be fresh with totally clean data directories
and joined using Consul's `join` command.

```sh
consul agent -join=192.172.2.3
```

In extreme cases, it should be possible to recover with just a single remaining
server by starting that single server with itself as the only peer in the
`raft/peers.json` recovery file.
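
As an illustration, a single-server recovery file for a server running Raft protocol version 2 could be written like this (the data directory path and address are assumptions; the protocol version 3 object format described below applies to newer servers):

```sh
# Write a peers.json that lists only the surviving server (Raft protocol 2 format)
cat > /opt/consul/data/raft/peers.json <<'EOF'
[
  "10.1.0.1:8300"
]
EOF
```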

Prior to Consul 0.7 it wasn't always possible to recover from certain
types of outages with `raft/peers.json` because this was ingested before any Raft
log entries were played back. In Consul 0.7 and later, the `raft/peers.json`
recovery file is final, and a snapshot is taken after it is ingested, so you are
guaranteed to start with your recovered configuration. This does implicitly commit
all Raft log entries, so it should only be used to recover from an outage, but it
should allow recovery from any situation where there's some cluster data available.

<a name="peers.json"></a>
### Manual Recovery Using peers.json

To begin, stop all remaining servers. You can attempt a graceful leave,
but it will not work in most cases. Do not worry if the leave exits with an
error. The cluster is in an unhealthy state, so this is expected.
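
For example, on each remaining server you might attempt a graceful leave and then stop the process; the systemd unit name is an assumption about how Consul is managed on your hosts:

```sh
# Attempt a graceful leave; an error here is expected while the cluster is unhealthy
consul leave

# Stop the Consul process directly if needed (illustrative for systemd-managed hosts)
sudo systemctl stop consul
```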

In Consul 0.7 and later, the `peers.json` file is no longer present
by default and is only used when performing recovery. This file will be deleted
after Consul starts and ingests it. Consul 0.7 also uses a new, automatically-created
`raft/peers.info` file to avoid ingesting the `raft/peers.json` file on the
first start after upgrading. Be sure to leave `raft/peers.info` in place for proper
operation.
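
As a quick sanity check (the data directory path is illustrative), you can confirm that `raft/peers.info` is present and leave it untouched:

```sh
# Confirm peers.info exists alongside the Raft data; do not delete it
ls /opt/consul/data/raft/
# typically contains: peers.info  raft.db  snapshots/
```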

Using `raft/peers.json` for recovery can cause uncommitted Raft log entries to be
implicitly committed, so this should only be used after an outage where no
other option is available to recover a lost server. Make sure you don't have
any automated processes that will put the peers file in place on a
periodic basis.

The next step is to go to the [`-data-dir`](/docs/agent/options.html#_data_dir)
of each Consul server. Inside that directory, there will be a `raft/`
sub-directory. We need to create a `raft/peers.json` file. The format of this file
depends on what the server has configured for its
[Raft protocol](/docs/agent/options.html#_raft_protocol) version.

For Raft protocol version 2 and earlier, this should be formatted as a JSON
array containing the address and port of each Consul server in the cluster, like
this:

```json
[
  "10.1.0.1:8300",
  "10.1.0.2:8300",
  "10.1.0.3:8300"
]
```

For Raft protocol version 3 and later, this should be formatted as a JSON
array containing the node ID, address:port, and suffrage information of each
Consul server in the cluster, like this:

```json
[
  {
    "id": "adf4238a-882b-9ddc-4a9d-5b6758e4159e",
    "address": "10.1.0.1:8300",
    "non_voter": false
  },
  {
    "id": "8b6dda82-3103-11e7-93ae-92361f002671",
    "address": "10.1.0.2:8300",
    "non_voter": false
  },
  {
    "id": "97e17742-3103-11e7-93ae-92361f002671",
    "address": "10.1.0.3:8300",
    "non_voter": false
  }
]
```

- `id` `(string: <required>)` - Specifies the [node ID](/docs/agent/options.html#_node_id)
  of the server. This can be found in the logs when the server starts up if it was auto-generated,
  and it can also be found inside the `node-id` file in the server's data directory.

- `address` `(string: <required>)` - Specifies the IP and port of the server. The port is the
  server's RPC port used for cluster communications.

- `non_voter` `(bool: <false>)` - This controls whether the server is a non-voter, which is used
  in some advanced [Autopilot](/docs/guides/autopilot.html) configurations. If omitted, it will
  default to false, which is typical for most clusters.

Simply create entries for all servers. You must confirm that servers you do not include here have
indeed failed and will not later rejoin the cluster. Ensure that this file is the same across all
remaining server nodes.
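
A minimal sketch of preparing the recovery file on the remaining servers, assuming Raft protocol version 3 and an illustrative data directory of `/opt/consul/data`; the node IDs and addresses are taken from the example above, and the identical file must be written on every remaining server:

```sh
# With the server stopped, read its node ID from the data directory
cat /opt/consul/data/node-id

# Write the same peers.json on every remaining server
cat > /opt/consul/data/raft/peers.json <<'EOF'
[
  {
    "id": "adf4238a-882b-9ddc-4a9d-5b6758e4159e",
    "address": "10.1.0.1:8300",
    "non_voter": false
  },
  {
    "id": "8b6dda82-3103-11e7-93ae-92361f002671",
    "address": "10.1.0.2:8300",
    "non_voter": false
  }
]
EOF
```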

At this point, you can restart all the remaining servers. In Consul 0.7 and
later you will see them ingest the recovery file:

```text
...
2016/08/16 14:39:20 [INFO] consul: found peers.json file, recovering Raft configuration...
2016/08/16 14:39:20 [INFO] consul.fsm: snapshot created in 12.484µs
2016/08/16 14:39:20 [INFO] snapshot: Creating new snapshot at /tmp/peers/raft/snapshots/2-5-1471383560779.tmp
2016/08/16 14:39:20 [INFO] consul: deleted peers.json file after successful recovery
2016/08/16 14:39:20 [INFO] raft: Restored from snapshot 2-5-1471383560779
2016/08/16 14:39:20 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:10.212.15.121:8300 Address:10.212.15.121:8300}]
...
```

If any servers managed to perform a graceful leave, you may need to have them
rejoin the cluster using the [`join`](/docs/commands/join.html) command:

```text
$ consul join <Node Address>
Successfully joined cluster by contacting 1 nodes.
```

It should be noted that any existing member can be used to rejoin the cluster
as the gossip protocol will take care of discovering the server nodes.

At this point, the cluster should be in an operable state again. One of the
nodes should claim leadership and emit a log like:

```text
[INFO] consul: cluster leadership acquired
```

In Consul 0.7 and later, you can use the [`consul operator`](/docs/commands/operator.html#raft-list-peers)
command to inspect the Raft configuration:

```text
$ consul operator raft list-peers
Node     ID              Address         State     Voter  RaftProtocol
alice    10.0.1.8:8300   10.0.1.8:8300   follower  true   3
bob      10.0.1.6:8300   10.0.1.6:8300   leader    true   3
carol    10.0.1.7:8300   10.0.1.7:8300   follower  true   3
```

## Summary

In this guide we reviewed how to recover from a Consul server outage. Depending on the
quorum size and number of failed servers, the recovery process will vary. In the event of
complete failure it is beneficial to have a
[backup process](https://www.consul.io/docs/guides/deployment-guide.html#backups).
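
For example, a simple backup can be taken with Consul's built-in snapshot command; the filename is illustrative:

```sh
# Save a point-in-time snapshot of the cluster state
consul snapshot save backup.snap

# A snapshot can later be restored into a running cluster
consul snapshot restore backup.snap
```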