- Feature Name: Decommissioning a node
- Status: completed
- Start Date: 2017-06-05
- Authors: Neeral Dodhia
- RFC PR: [#16447]
- Cockroach Issue: [#6198]

# Summary

When a node will be removed from a cluster, mark the node as decommissioned. At
all costs, no data can be lost. To this end:

- Drain data from the node if data availability (replica count) would be reduced.
- Prevent the node from participating in the cluster.

# Motivation

Cluster operations are common in production deployments, such as: taking a node
down for maintenance; later bringing it back up; adding a new node; and
(permanently) removing a node. All of these should be simple and **safe** to
perform.

Currently, the way to remove a node from a cluster is to shut it down. There is
a period before the cluster rebalances its replicas. If another node were to go
down during that window, for example due to power loss, then some replica sets
would become unavailable: those replica sets which had a replica on both the
shut-down node and the failed node are left with only a single replica. This
demonstrates why safe removal—without risk of data loss—is required.

A typical operation in the field is to replace a set of nodes. This involves
removing the old nodes and adding their replacements. Decommissioning one node
at a time is inefficient: decommissioning the first node causes data to be moved
onto nodes that are about to be decommissioned too, and decommissioning those
nodes then leads to more data movement. Instead, it is more efficient to mark
all of these nodes as decommissioned at once. The draining mechanism can then
move data to nodes that will remain. The steps to replace multiple nodes are to
add the new nodes and then decommission the old ones.

# Detailed design

The following scenarios are considered:

1. A node will (permanently) be removed from the cluster. The node is currently
   1. alive, or
   1. dead.
1. A node will temporarily be down (e.g. for maintenance).

## Permanent removal

1. On any node, the user executes
   ```shell
   cockroach node decommission <nodeID>...
   ```
   The node that receives the CLI command is referred to as node A. The nodes
   specified in the CLI command are referred to as target nodes. This command is
   asynchronous and returns a list of node IDs, their statuses, replica counts,
   and their `Draining` and `Decommissioning` flags.
1. Node A sets the `Decommissioning` flag to `true` for all target nodes in the
   node liveness table.
1. After approximately 10 seconds, each target node discovers that its
   `Decommissioning` flag has been set. The mechanism for discovery is the
   heartbeat process, which periodically updates the node's own entry in the
   node liveness table.
1. Each target node:
   1. sets the `Draining` flag in the node liveness table to `true`, and
   1. waits until draining has completed.
1. At this point, every target node
   1. is not holding any leases,
   1. is not accepting any new SQL connections, and
   1. is not accepting new replicas.
1. Leaseholders (necessarily non-target nodes) for ranges where a target node
   is a member of the replica set have their replicate queue treat
   decommissioning nodes like dead replicas. This does the right thing: it does
   not down-replicate if that puts us in more danger than we are already in, and
   it does not up-replicate to more dangerous states. Specifically, it will
   1. up-replicate to a node not in decommissioning state.
      - If no nodes are available for up-replication, the process stalls. This
        prevents data loss and ensures availability of all ranges.
   1. down-replicate from target nodes.
1. Wait for the replica count on each target node to reach 0. To be able to do
   this for dead nodes, use the meta ranges.
1. The user idempotently executes the command from step 1. To measure progress,
   track the replica count.
1. The user executes
   ```shell
   cockroach quit --decommission --host=<hostname> --port=<port>
   ```
   for each target node. This explicitly shuts down the node. Otherwise, the
   node remains up but is unable to participate in usual operations such as
   holding replicas. The `--decommission` flag causes the command to wait until
   the replica count on the specified node reaches 0 before initiating shutdown.
   - This command also initiates decommission. Setting decommission is
     idempotent, which is convenient here. The difference with this command is
     that it blocks until the node shuts down. Users who do not want
     asynchronous external polling can choose this command if they want to
     decommission a single node.
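
To make the coordinator's side of these steps concrete, here is a minimal Go
sketch, not the actual CockroachDB implementation: the `LivenessRecord` and
`Cluster` types, and the way replica counts are obtained, are hypothetical
stand-ins for the node liveness table and the meta-range scan described above.

```go
package main

import (
	"fmt"
	"time"
)

// LivenessRecord is a hypothetical stand-in for a row in the node
// liveness table, reduced to the two flags used by decommissioning.
type LivenessRecord struct {
	NodeID          int
	Draining        bool
	Decommissioning bool
}

// Cluster is a toy model of cluster state: a liveness row per node plus
// a per-node replica count (in reality derived from the meta ranges).
type Cluster struct {
	Liveness map[int]*LivenessRecord
	Replicas map[int]int
}

// Decommission is the coordinator-side step: mark every target node as
// decommissioning in the liveness table and return immediately, mirroring
// the asynchronous `cockroach node decommission` command.
func (c *Cluster) Decommission(targets ...int) {
	for _, id := range targets {
		c.Liveness[id].Decommissioning = true
	}
}

// WaitForDrained polls until every target node reports zero replicas,
// which is the condition the user checks by re-running the CLI command.
func (c *Cluster) WaitForDrained(targets []int, poll time.Duration) {
	for {
		drained := true
		for _, id := range targets {
			if c.Replicas[id] > 0 {
				drained = false
				break
			}
		}
		if drained {
			return
		}
		time.Sleep(poll)
	}
}

func main() {
	c := &Cluster{
		Liveness: map[int]*LivenessRecord{4: {NodeID: 4}, 5: {NodeID: 5}},
		// Pretend the replicate queues have already moved everything away.
		Replicas: map[int]int{4: 0, 5: 0},
	}
	c.Decommission(4, 5)
	c.WaitForDrained([]int{4, 5}, 10*time.Millisecond)
	fmt.Println("targets drained; safe to run `cockroach quit --decommission`")
}
```

In this toy model the replica counts are already zero, so the wait returns
immediately; in practice they would shrink gradually as the replicate queues on
the leaseholders move replicas off the target nodes.
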
## Handling restarts

When a node is restarted, it may reset its `Draining` flag but its
`Decommissioning` flag remains. The above process resumes from the third step.
Hence, if a node is restarted at any point after the `Decommissioning` flag is
set, the decommissioning process will resume.

A decommissioned node can be restarted and would rejoin the cluster. However,
it would not participate in any activity. This is safe and, hence, there is no
need to prevent a decommissioned node from restarting.

When a node restarts, there is a small window during which it can accept new
replicas before reading the node liveness table. This would be rare and
short-lived: nodes will not send new replicas to a decommissioned node, so this
requires both nodes to be unaware of the decommissioning state. A decommissioned
node could have reached a replica count of 0, then be restarted and accept new
replicas. In this case, availability could be compromised if the node were shut
down immediately. However, the `--decommission` flag to the `quit` command
ensures that shutdown is only initiated once the replica count is 0.

## Undo

```shell
cockroach node recommission <nodeID>...
```
sets the `Decommissioning` flag to `false` for the target nodes. Then, the user
must restart each target node. When a node restarts, it resets its `Draining`
flag, which allows the node to participate as normal. A node restart is required
because we cannot determine whether a node is in `Draining` because it was
previously in `Decommissioning` or for another reason.
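
The restart and recommission behaviour above follows from the target node
deriving `Draining` from `Decommissioning` on every heartbeat. The following is
a minimal sketch of that idea under assumed, simplified types; `liveness` and
`heartbeatOnce` are hypothetical names, not CockroachDB's actual liveness code.

```go
package main

import "fmt"

// liveness is a hypothetical view of a node's own row in the node
// liveness table, reduced to the fields relevant here.
type liveness struct {
	Draining        bool
	Decommissioning bool
}

// heartbeatOnce models a single heartbeat cycle on a target node: it
// re-reads its liveness row and, if the node has been marked as
// decommissioning, (re)enters draining. Because this runs on every
// heartbeat, a restart that cleared Draining simply resumes from here.
func heartbeatOnce(l *liveness) {
	if l.Decommissioning && !l.Draining {
		l.Draining = true
		fmt.Println("decommissioning flag seen: draining leases, SQL connections, replicas")
	}
}

func main() {
	// A node that was marked as decommissioning and then restarted:
	// the restart reset Draining but not Decommissioning.
	l := &liveness{Decommissioning: true}
	heartbeatOnce(l) // draining resumes after the restart

	// Recommission clears Decommissioning, but the node cannot tell
	// whether Draining was set because of decommissioning or for some
	// other reason, so in this design only a restart clears Draining.
	l.Decommissioning = false
	heartbeatOnce(l) // no-op: the node keeps draining until restarted
	fmt.Printf("draining=%v decommissioning=%v\n", l.Draining, l.Decommissioning)
}
```
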
## Dead nodes

If a target node is dead (i.e. unreachable), it can be in one of the following
states:

1. It holds unexpired leases.
1. It holds no leases; the replicas of the ranges it has on disk have not been
   rebalanced to other nodes.
1. It holds no leases; the replicas of the ranges it has on disk have been
   down-replicated from it and up-replicated to other nodes.

Regardless of which of these states the dead node is in, it cannot set its
`Draining` flag because it is dead. Instead, its leases will expire and will be
taken over by other nodes. After `server.time_until_store_dead` elapses, the
replicas of its ranges will actively be rebalanced. Wait for the replica count
to reach 0, as for live nodes.

It is possible that a dead node becomes available and rejoins the cluster. It
would discover that the `Decommissioning` flag has been set and follow the
decommissioning process.

## Temporary removal

No changes are required. The existing CockroachDB process, described below, is
sufficient.

During a temporary removal, `server.time_until_store_dead` could be updated to
the length of the downtime to avoid unnecessary movement of ranges. However, it
is difficult to predict the length of downtime. This is an optimization which
could be implemented later.

After a node is detected as unavailable for more than
`server.time_until_store_dead` (an environment variable with default: 5
minutes), the node is marked as incommunicado; the cluster decides that the node
may not be coming back and moves the data elsewhere. Its leases would have been
transferred to other nodes as soon as they expired without being renewed. Ranges
are up-replicated to other nodes and down-replicated (removed) from the target
node. Ranges are not down-replicated when no target exists for up-replication,
because this causes more work if that node rejoins the cluster; when permanently
removing a node, however, we still want to down-replicate. Although a node can
have multiple stores, there can be at most one replica of each replica set
(e.g. range) per node.
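
The rebalancing described for dead and decommissioning nodes comes down to the
replicate queue refusing to place new replicas on either kind of node. The
sketch below illustrates only that selection rule; the `store` type,
`pickUpreplicationTarget`, and the stalling behaviour are simplified assumptions
rather than the real allocator.

```go
package main

import (
	"errors"
	"fmt"
)

// store is a hypothetical store descriptor carrying only the state the
// rebalancer cares about here: whether the node is dead and whether it
// is marked as decommissioning in the liveness table.
type store struct {
	NodeID          int
	Dead            bool
	Decommissioning bool
}

// errStalled mirrors the behaviour described above: if no live,
// non-decommissioning node is available, up-replication stalls rather
// than risking the availability of the range.
var errStalled = errors.New("no live, non-decommissioning target: up-replication stalls")

// pickUpreplicationTarget treats decommissioning nodes like dead ones:
// neither is an acceptable destination for a new replica. existing is
// the set of node IDs that already hold a replica of the range.
func pickUpreplicationTarget(stores []store, existing map[int]bool) (int, error) {
	for _, s := range stores {
		if s.Dead || s.Decommissioning || existing[s.NodeID] {
			continue
		}
		return s.NodeID, nil
	}
	return 0, errStalled
}

func main() {
	stores := []store{
		{NodeID: 1}, {NodeID: 2},
		{NodeID: 3, Decommissioning: true},
		{NodeID: 4, Dead: true},
		{NodeID: 5},
	}
	// A range replicated on nodes 1, 2 and the decommissioning node 3:
	// the replica on node 3 has to be moved somewhere else first.
	existing := map[int]bool{1: true, 2: true, 3: true}
	target, err := pickUpreplicationTarget(stores, existing)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("up-replicate to node %d, then remove the replica on node %d\n", target, 3)
}
```

Once a target is found, the range is first up-replicated to it and only then
down-replicated from the decommissioning or dead node, matching the ordering in
the permanent-removal steps.
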
## CLI

Two new commands and one option will be added. These have been described above.
The first two are asynchronous commands and `quit` is synchronous.

- `node decommission <nodeID>...` prompts the user for confirmation. Passing
  `--yes` as a command-line flag will skip this prompt. This returns a list of
  all nodes, their statuses, replica counts, decommissioning flags and draining
  flags.
  - *Nice to have*: satisfying safety constraints as a prerequisite. As an
    example: `number_of_nodes < max(zone.number_of_replicas_desired)`. It would
    be nice to check that the number of nodes remaining after decommissioning is
    large enough for a quorum to be reached. However, this is not as easy as it
    sounds to achieve due to e.g. ZoneConfig.
- `node recommission <nodeID>...` has similar semantics to `decommission`. It
  prints a message asking the user to restart the node for the change to take
  effect.
- `quit --decommission`. This is synchronous to guarantee that the availability
  of a range's replica set is not compromised.

It is possible to decommission several nodes by passing multiple node IDs on the
command line.

## UI

- The UI hides nodes which are marked as dead and are also in decommissioning
  state (as per the liveness table, or rather its gossiped information). That
  should have the desired effect and is straightforward while keeping stats, etc.
- If a node is dead, it cannot remove itself from the node table in the admin
  UI. Other nodes are responsible for hiding decommissioned nodes from the admin
  UI.
- The action of decommissioning a node creates an event that is displayed in the
  UI.

# Drawbacks

There is no atomic way to check that decommissioning a set of nodes will leave
enough nodes for the cluster to be available. A race can occur: in a five-node
cluster, two users can simultaneously request to decommission two different
nodes each. The resulting state leaves two nodes in decommissioning state and
only one live node. Decommissioning nodes can still participate in quorums, but
the replication changes cannot make progress because there are not enough
non-decommissioned nodes. The user can discover that too many nodes are in
decommissioning state and choose which nodes to recommission. It would be nice
to proactively detect this, but the effort required is disproportionate compared
to the gain.

Recently dead nodes may cause decommissioning to hang until
`server.time_until_store_dead` elapses.

Recommissioning, i.e. undoing a decommission, requires a node restart. We could
avoid this: if the node is live, the coordinating process could directly tell it
to stop draining. Punting on this for now; this could be future work.

# Alternatives

The decommission operation could have been made blocking. However, if a blocking
operation took a long time to complete, the connection might time out if no
response was sent to the client; hence the command is asynchronous.

# Unresolved questions

None.

[#16447]: https://github.com/cockroachdb/cockroach/pull/16447
[#6198]: https://github.com/cockroachdb/cockroach/issues/6198