---
layout: "docs"
page_title: "Adding & Removing Servers"
sidebar_current: "docs-guides-servers"
description: |-
  Consul is designed to require minimal operator involvement; however, any changes to the set of Consul servers must be handled carefully. To better understand why, reading about the consensus protocol will be useful. In short, the Consul servers perform leader election and replication. For changes to be processed, a minimum quorum of servers (N/2)+1 must be available. That means if there are 3 server nodes, at least 2 must be available.
---

# Adding & Removing Servers

Consul is designed to require minimal operator involvement; however, any changes
to the set of Consul servers must be handled carefully. To better understand
why, reading about the [consensus protocol](/docs/internals/consensus.html) will
be useful. In short, the Consul servers perform leader election and replication.
For changes to be processed, a minimum quorum of servers (N/2)+1 must be available.
That means if there are 3 server nodes, at least 2 must be available.

In general, if you are ever adding and removing nodes simultaneously, it is better
to first add the new nodes and then remove the old nodes.

In this guide, we will cover the different methods for adding and removing servers.

## Manually Add a New Server

Manually adding a new server is generally straightforward: start the new
agent with the `-server` flag. At this point the server will not be a member of
any cluster, and should emit something like:

```sh
consul agent -server
[WARN] raft: EnableSingleNode disabled, and no known peers. Aborting election.
```

This means that it does not know about any peers and is not configured to elect itself.
This is expected, and we can now add this node to the existing cluster using `join`.
From the new server, we can join any member of the existing cluster:

```sh
$ consul join <Existing Node Address>
Successfully joined cluster by contacting 1 nodes.
```

It is important to note that any node, including a non-server, may be specified for
the join. Generally, this method is good for testing purposes but not recommended for production
deployments. For production clusters, you will likely want to use the agent configuration
option to add additional servers.

## Add a Server with Agent Configuration

In production environments, you should use the [agent configuration](https://www.consul.io/docs/agent/options.html) option `retry_join`. `retry_join` can be used as a command line flag or in the agent configuration file.

With the Consul CLI:

```sh
$ consul agent -retry-join=52.10.110.11 -retry-join=52.10.110.12 -retry-join=52.10.100.13
```

In the agent configuration file:

```json
{
  "bootstrap": false,
  "bootstrap_expect": 3,
  "server": true,
  "retry_join": ["52.10.110.11", "52.10.110.12", "52.10.100.13"]
}
```

[`retry_join`](https://www.consul.io/docs/agent/options.html#retry-join)
will ensure that if any server loses connection
with the cluster for any reason, including the node restarting, it can
rejoin when it comes back. In addition to working with static IPs, it
can also be useful for other discovery mechanisms, such as auto-joining
based on cloud metadata and discovery. Both servers and clients can use this method.
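
As a sketch of the cloud auto-join mentioned above, `retry_join` also accepts a provider configuration string instead of static addresses. The provider, tag key, and tag value below are assumptions chosen for illustration; substitute values appropriate for your environment:

```json
{
  "server": true,
  "bootstrap_expect": 3,
  "retry_join": ["provider=aws tag_key=consul-cluster tag_value=production"]
}
```

On startup, each agent queries the cloud provider for instances carrying the given tag and keeps retrying the join until it succeeds, which avoids hard-coding server addresses.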

### Server Coordination

To ensure Consul servers are joining the cluster properly, you should monitor
the server coordination. The gossip protocol is used to properly discover all
the nodes in the cluster. Once the node has joined, the existing cluster
leader should log something like:

```text
[INFO] raft: Added peer 127.0.0.2:8300, starting replication
```

This means that Raft, the underlying consensus protocol, has added the peer and begun
replicating state. Since the existing cluster may be very far ahead, it can take some
time for the new node to catch up. To check on this, run `info` on the leader:

```text
$ consul info
...
raft:
    applied_index = 47244
    commit_index = 47244
    fsm_pending = 0
    last_log_index = 47244
    last_log_term = 21
    last_snapshot_index = 40966
    last_snapshot_term = 20
    num_peers = 4
    state = Leader
    term = 21
...
```

This will provide various information about the state of Raft. In particular,
`last_log_index` shows the last log entry that is on disk. The same `info` command
can be run on the new server to see how far behind it is. Eventually the server
will catch up, and the values should match.

It is best to add servers one at a time, allowing them to catch up. This avoids
the possibility of data loss in case the existing servers fail while bringing
the new servers up-to-date.

## Manually Remove a Server

Removing servers must be done carefully to avoid causing an availability outage.
For a cluster of N servers, at least (N/2)+1 must be available for the cluster
to function. See this [deployment table](/docs/internals/consensus.html#toc_4).
If you have 3 servers and 1 of them is currently failing, removing any other server
will cause the cluster to become unavailable.

To avoid this, it may be necessary to first add new servers to the cluster,
increasing the failure tolerance of the cluster, and then to remove old servers.
Even if all 3 nodes are functioning, removing one leaves the cluster in a state
that cannot tolerate the failure of any node.

Once you have verified the existing servers are healthy, and that the cluster
can handle a node leaving, the actual process is simple. You simply issue the
`leave` command on the server:

```sh
consul leave
```

The server that is leaving should log something like:

```text
...
[INFO] consul: server starting leave
...
[INFO] raft: Removed ourself, transitioning to follower
...
```

The leader should also emit various logs including:

```text
...
[INFO] consul: member 'node-10-0-1-8' left, deregistering
[INFO] raft: Removed peer 10.0.1.8:8300, stopping replication
...
```

At this point the node has been gracefully removed from the cluster and
will shut down.

~> Running `consul leave` on a server explicitly will reduce the quorum size. Even if the cluster used `bootstrap_expect` to set a quorum size initially, issuing `consul leave` on a server will reconfigure the cluster to have fewer servers. This means you could end up with just one server that is still able to commit writes because quorum is only 1, but those writes might be lost if that server fails before more are added.
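
To double-check the resulting peer set after a server leaves, you can inspect the Raft configuration from any remaining server with `consul operator raft list-peers`. The output below is only a sketch; the node names, IDs, and addresses are made up for illustration:

```sh
$ consul operator raft list-peers
Node     ID                                    Address        State     Voter  RaftProtocol
node-a   11111111-2222-3333-4444-555555555555  10.0.1.5:8300  leader    true   3
node-b   66666666-7777-8888-9999-000000000000  10.0.1.6:8300  follower  true   3
```

The departed server should no longer appear in this list; if it does, it was not removed from the Raft configuration and still counts toward quorum.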

To remove all agents that accidentally joined the wrong set of servers, clear out the contents of the data directory (`-data-dir`) on both client and server nodes.

These graceful methods to remove servers assume you have a healthy cluster.
If the cluster has no leader due to loss of quorum or data corruption, you should
plan for [outage recovery](/docs/guides/outage.html#manual-recovery-using-peers-json).

!> **WARNING** Removing data on server nodes will destroy all state in the cluster.

## Manual Forced Removal

In some cases, it may not be possible to gracefully remove a server. For example,
if the server simply fails, then there is no ability to issue a leave. Instead,
the cluster will detect the failure and replication will continuously retry.

If the server can be recovered, it is best to bring it back online and then gracefully
leave the cluster. However, if this is not a possibility, then the `force-leave` command
can be used to force removal of a server:

```sh
consul force-leave <node>
```

Invoke the command with the name of the failed node. At this point,
the cluster leader will mark the node as having left the cluster and it will stop attempting to replicate.

## Summary

In this guide we learned the straightforward process of adding and removing servers, including
manually adding servers, adding servers through the agent configuration, gracefully removing
servers, and forcing removal of servers. Finally, we should restate that manually adding servers
is good for testing purposes; however, for production it is recommended to add servers with
the agent configuration.