---
layout: "docs"
page_title: "Adding & Removing Servers"
sidebar_current: "docs-guides-servers"
description: |-
  Consul is designed to require minimal operator involvement; however, any changes to the set of Consul servers must be handled carefully. To better understand why, reading about the consensus protocol will be useful. In short, the Consul servers perform leader election and replication. For changes to be processed, a minimum quorum of servers (N/2)+1 must be available. That means if there are 3 server nodes, at least 2 must be available.
---

# Adding & Removing Servers

Consul is designed to require minimal operator involvement; however, any changes
to the set of Consul servers must be handled carefully. To better understand
why, reading about the [consensus protocol](/docs/internals/consensus.html) will
be useful. In short, the Consul servers perform leader election and replication.
For changes to be processed, a minimum quorum of servers (N/2)+1 must be available.
That means if there are 3 server nodes, at least 2 must be available.

In general, if you are ever adding and removing nodes simultaneously, it is better
to first add the new nodes and then remove the old nodes.

In this guide, we will cover the different methods for adding and removing servers.

## Manually Add a New Server

Manually adding new servers is generally straightforward: start the new
agent with the `-server` flag. At this point the server will not be a member of
any cluster, and should emit something like:

```sh
$ consul agent -server
[WARN] raft: EnableSingleNode disabled, and no known peers. Aborting election.
```

This means that it does not know about any peers and is not configured to elect itself.
This is expected, and we can now add this node to the existing cluster using `join`.
From the new server, we can join any member of the existing cluster:

```sh
$ consul join <Existing Node Address>
Successfully joined cluster by contacting 1 nodes.
```

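To confirm the join succeeded, you can list the cluster members from any agent. The output below is only illustrative; the node names, addresses, and versions are placeholders and will differ in your cluster:

```sh
$ consul members
Node      Address        Status  Type    Build  Protocol  DC   Segment
server-1  10.0.1.5:8301  alive   server  1.4.5  2         dc1  <all>
server-2  10.0.1.6:8301  alive   server  1.4.5  2         dc1  <all>
server-3  10.0.1.7:8301  alive   server  1.4.5  2         dc1  <all>
server-4  10.0.1.8:8301  alive   server  1.4.5  2         dc1  <all>
```
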
It is important to note that any node, including a non-server, may be specified for
the join. Generally, this method is good for testing purposes but not recommended for production
deployments. For production clusters, you will likely want to use the agent configuration
option to add additional servers.

## Add a Server with Agent Configuration

In production environments, you should use the [agent configuration](https://www.consul.io/docs/agent/options.html) option, `retry_join`. `retry_join` can be used as a command-line flag or in the agent configuration file.

With the Consul CLI, pass the flag once per address:

```sh
$ consul agent -retry-join=52.10.110.11 -retry-join=52.10.110.12 -retry-join=52.10.100.13
```

In the agent configuration file:

```json
{
  "bootstrap": false,
  "bootstrap_expect": 3,
  "server": true,
  "retry_join": ["52.10.110.11", "52.10.110.12", "52.10.100.13"]
}
```

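The agent picks the file up at startup. As a minimal sketch, assuming the file is saved in `/etc/consul.d` (the path is an example, not a requirement):

```sh
$ consul agent -config-dir=/etc/consul.d
```
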
[`retry_join`](https://www.consul.io/docs/agent/options.html#retry-join)
will ensure that if any server loses connection
with the cluster for any reason, including the node restarting, it can
rejoin when it comes back. In addition to working with static IPs, it
can also be useful with other discovery mechanisms, such as auto-joining
based on cloud metadata. Both servers and clients can use this method.

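For example, with cloud auto-join the `retry_join` list contains provider strings instead of static addresses. The sketch below assumes AWS, and the tag key and value are placeholders you would replace with your own:

```json
{
  "server": true,
  "retry_join": ["provider=aws tag_key=consul-role tag_value=server"]
}
```
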
### Server Coordination

To ensure Consul servers are joining the cluster properly, you should monitor
the server coordination. The gossip protocol is used to properly discover all
the nodes in the cluster. Once the node has joined, the existing cluster
leader should log something like:

```text
[INFO] raft: Added peer 127.0.0.2:8300, starting replication
```

This means that raft, the underlying consensus protocol, has added the peer and begun
replicating state. Since the existing cluster may be very far ahead, it can take some
time for the new node to catch up. To check on this, run `info` on the leader:

```text
$ consul info
...
raft:
    applied_index = 47244
    commit_index = 47244
    fsm_pending = 0
    last_log_index = 47244
    last_log_term = 21
    last_snapshot_index = 40966
    last_snapshot_term = 20
    num_peers = 4
    state = Leader
    term = 21
...
```

This will provide various information about the state of Raft. In particular,
`last_log_index` shows the index of the last log entry on disk. The same `info` command
can be run on the new server to see how far behind it is. Eventually the server
will be caught up, and the values should match.

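A quick way to compare is to filter the `info` output on both nodes. The values below are illustrative only:

```sh
# On the leader:
$ consul info | grep -E 'commit_index|last_log_index'
    commit_index = 47244
    last_log_index = 47244

# On the new server, while it is still catching up:
$ consul info | grep -E 'commit_index|last_log_index'
    commit_index = 43201
    last_log_index = 43201
```
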
It is best to add servers one at a time, allowing them to catch up. This avoids
the possibility of data loss in case the existing servers fail while bringing
the new servers up-to-date.

## Manually Remove a Server

Removing servers must be done carefully to avoid causing an availability outage.
For a cluster of N servers, at least (N/2)+1 must be available for the cluster
to function. See this [deployment table](/docs/internals/consensus.html#toc_4).
If you have 3 servers and 1 of them is currently failing, removing any other servers
will cause the cluster to become unavailable.

To avoid this, it may be necessary to first add new servers to the cluster,
increasing the failure tolerance of the cluster, and then to remove old servers.
Even if all 3 nodes are functioning, removing one leaves the cluster in a state
that cannot tolerate the failure of any node.

Once you have verified the existing servers are healthy, and that the cluster
can handle a node leaving, the actual process is simple: issue a
`leave` command on the server you want to remove.

```sh
consul leave
```

The server that is leaving should log something like:

```text
...
[INFO] consul: server starting leave
...
[INFO] raft: Removed ourself, transitioning to follower
...
```

The leader should also emit various logs including:

```text
...
[INFO] consul: member 'node-10-0-1-8' left, deregistering
[INFO] raft: Removed peer 10.0.1.8:8300, stopping replication
...
```

At this point the node has been gracefully removed from the cluster, and
will shut down.

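If you want to double-check the result, listing the Raft peers should show that the departed server is gone. The node names, IDs, and addresses below are placeholders:

```sh
$ consul operator raft list-peers
Node      ID                                    Address        State     Voter  RaftProtocol
server-1  0c77b2e1-65a0-4be4-9a77-dba09e12f5a1  10.0.1.5:8300  leader    true   3
server-2  3d9f21a4-72c8-4d0a-8a9e-5f2b7c4e6d88  10.0.1.6:8300  follower  true   3
```
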
~> Running `consul leave` on a server explicitly will reduce the quorum size. Even if the cluster used `bootstrap_expect` to set a quorum size initially, issuing `consul leave` on a server will reconfigure the cluster to have fewer servers. This means you could end up with just one server that is still able to commit writes because quorum is only 1, but those writes might be lost if that server fails before more are added.

To remove all agents that accidentally joined the wrong set of servers, clear out the contents of the data directory (`-data-dir`) on both client and server nodes.

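A hedged sketch of that cleanup, assuming the agent runs under systemd and the data directory is `/opt/consul` (adjust both to your environment, and heed the warning below about server data):

```sh
# Stop the agent before touching its data directory.
$ sudo systemctl stop consul

# Remove the stale cluster state; this wipes all local Consul state on the node.
$ sudo rm -rf /opt/consul/*

# Start the agent again so it can join the correct servers.
$ sudo systemctl start consul
```
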
These graceful methods to remove servers assume you have a healthy cluster.
If the cluster has no leader due to loss of quorum or data corruption, you should
plan for [outage recovery](/docs/guides/outage.html#manual-recovery-using-peers-json).

!> **WARNING** Removing data on server nodes will destroy all state in the cluster.

## Manual Forced Removal

In some cases, it may not be possible to gracefully remove a server. For example,
if the server simply fails, then there is no way to issue a leave. Instead,
the cluster will detect the failure and replication will continuously retry.

If the server can be recovered, it is best to bring it back online and then gracefully
leave the cluster. However, if this is not a possibility, then the `force-leave` command
can be used to force removal of a server.

```sh
consul force-leave <node>
```

This is done by invoking that command with the name of the failed node. At this point,
the cluster leader will mark the node as having left the cluster and it will stop attempting to replicate.

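For example, using a node name like the one shown in the earlier leader logs:

```sh
$ consul force-leave node-10-0-1-8
```
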
## Summary

In this guide we learned the straightforward process of adding and removing servers, including
manually adding servers, adding servers through the agent configuration, gracefully removing
servers, and forcing removal of servers. Finally, we should restate that manually adding servers
is good for testing purposes; however, for production it is recommended to add servers with
the agent configuration.