- Feature Name: Decommissioning a node
- Status: completed
- Start Date: 2017-06-05
- Authors: Neeral Dodhia
- RFC PR: [#16447]
- Cockroach Issue: [#6198]

# Summary

When a node will be removed from a cluster, mark the node as decommissioned. At
all costs, no data can be lost. To this end:

- Drain data from the node if data availability (replica count) would be reduced.
- Prevent the node from participating in the cluster.

# Motivation

Cluster operations are common in production deployments, such as: taking a node
down for maintenance; later bringing it back up; adding a new node; and
(permanently) removing a node. All of these should be simple and **safe** to
perform.

Currently, the way to remove a node from a cluster is to shut it down. There is
a period before the cluster rebalances its replicas. If another node were to go
down during that window, for example due to power loss, then some replica sets
would become unavailable: those replica sets which had a replica on both the
shut-down node and the failed node are left with only a single replica. This
demonstrates why safe removal—without risk of data loss—is required.

A typical operation in the field is to replace a set of nodes. This involves
removing the old nodes and adding their replacements. Decommissioning one node
at a time is inefficient: decommissioning the first node causes data to be moved
onto nodes that are about to be decommissioned too, and decommissioning those
nodes then leads to more data movement. Instead, it is more efficient to mark
all of these nodes as decommissioned at once. The draining mechanism can then
move data to nodes that will remain. The steps to replace multiple nodes are to
add the new nodes and then decommission the old ones.

# Detailed design

The following scenarios are considered:

1. A node will (permanently) be removed from the cluster. The node is currently
   1. alive, or
   1. dead.
1. A node will temporarily be down (e.g. for maintenance).

## Permanent removal

1. On any node, the user executes
   ```shell
   cockroach node decommission <nodeID>...
   ```
   The node that receives the CLI command is referred to as node A. The nodes
   specified in the CLI command are referred to as target nodes. This command is
   asynchronous and returns a list of node IDs, their statuses, replica counts,
   and their `Draining` and `Decommissioning` flags.
1. Node A sets the `Decommissioning` flag to `true` for all target nodes in the
   node liveness table.
1. After approximately 10 seconds, each target node discovers that its
   `Decommissioning` flag has been set. The mechanism for discovery is the
   heartbeat process, which periodically updates the node's own entry in the
   node liveness table.
1. Each target node:
   1. sets the `Draining` flag in the node liveness table to `true`, and
   1. waits until draining has completed.
1. At this point, every target node
   1. is not holding any leases,
   1. is not accepting any new SQL connections, and
   1. is not accepting new replicas.
1. Leaseholders (necessarily non-target nodes) for ranges where a target node
   is a member of the replica set have their replicate queue treat
   decommissioning nodes like dead replicas. This does the right thing: it does
   not down-replicate if that puts us in more danger than we are already in, and
   it does not up-replicate to more dangerous states. Specifically, it will
   1. up-replicate to a node not in decommissioning state.
      - If no nodes are available for up-replication, the process stalls. This
        prevents data loss and ensures availability of all ranges.
   1. down-replicate from target nodes.
1. Wait for the replica count on each target node to reach 0. To be able to do
   this for dead nodes, use the meta ranges.
1. The user idempotently executes the command from step 1. To measure progress,
   track the replica count.
1. The user executes
   ```shell
   cockroach quit --decommission --host=<hostname> --port=<port>
   ```
   for each target node. This explicitly shuts down the node. Otherwise, the
   node remains up but is unable to participate in usual operations such as
   holding replicas. The `--decommission` flag causes the command to wait until
   the replica count on the specified node reaches 0 before initiating shutdown.
   - This command also initiates decommission. Setting decommission is
     idempotent, which is convenient here. The difference with this command is
     that it blocks until the node shuts down. Users who do not want
     asynchronous external polling can choose this command if they want to
     decommission a single node.
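
To make the coordinator's side of these steps concrete, here is a minimal Go
sketch, not the actual CockroachDB implementation: the `LivenessRecord` and
`Cluster` types, and the way replica counts are obtained, are hypothetical
stand-ins for the node liveness table and the meta-range scan described above.

```go
package main

import (
	"fmt"
	"time"
)

// LivenessRecord is a hypothetical stand-in for a row in the node
// liveness table, reduced to the two flags used by decommissioning.
type LivenessRecord struct {
	NodeID          int
	Draining        bool
	Decommissioning bool
}

// Cluster is a toy model of cluster state: a liveness row per node plus
// a per-node replica count (in reality derived from the meta ranges).
type Cluster struct {
	Liveness map[int]*LivenessRecord
	Replicas map[int]int
}

// Decommission is the coordinator-side step: mark every target node as
// decommissioning in the liveness table and return immediately, mirroring
// the asynchronous `cockroach node decommission` command.
func (c *Cluster) Decommission(targets ...int) {
	for _, id := range targets {
		c.Liveness[id].Decommissioning = true
	}
}

// WaitForDrained polls until every target node reports zero replicas,
// which is the condition the user checks by re-running the CLI command.
func (c *Cluster) WaitForDrained(targets []int, poll time.Duration) {
	for {
		drained := true
		for _, id := range targets {
			if c.Replicas[id] > 0 {
				drained = false
				break
			}
		}
		if drained {
			return
		}
		time.Sleep(poll)
	}
}

func main() {
	c := &Cluster{
		Liveness: map[int]*LivenessRecord{4: {NodeID: 4}, 5: {NodeID: 5}},
		// Pretend the replicate queues have already moved everything away.
		Replicas: map[int]int{4: 0, 5: 0},
	}
	c.Decommission(4, 5)
	c.WaitForDrained([]int{4, 5}, 10*time.Millisecond)
	fmt.Println("targets drained; safe to run `cockroach quit --decommission`")
}
```

In this toy model the replica counts are already zero, so the wait returns
immediately; in practice they would shrink gradually as the replicate queues on
the leaseholders move replicas off the target nodes.
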
## Handling restarts

When a node is restarted, it may reset its `Draining` flag but its
`Decommissioning` flag remains. The above process resumes from the third step.
Hence, if a node is restarted at any point after the `Decommissioning` flag is
set, the decommissioning process will resume.

A decommissioned node can be restarted and would rejoin the cluster. However,
it would not participate in any activity. This is safe and, hence, there is no
need to prevent a decommissioned node from restarting.

When a node restarts, there is a small window during which it can accept new
replicas before reading the node liveness table. This would be rare and
short-lived: nodes will not send new replicas to a decommissioned node, so this
requires both nodes to be unaware of the decommissioning state. A decommissioned
node could have reached a replica count of 0, then be restarted and accept new
replicas. In this case, availability could be compromised if the node were shut
down immediately. However, the `--decommission` flag to the `quit` command
ensures that shutdown is only initiated once the replica count is 0.

## Undo

```shell
cockroach node recommission <nodeID>...
```
sets the `Decommissioning` flag to `false` for the target nodes. Then, the user
must restart each target node. When a node restarts, it resets its `Draining`
flag, which allows the node to participate as normal. A node restart is required
because we cannot determine whether a node is in `Draining` because it was
previously in `Decommissioning` or for another reason.
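
The restart and recommission behaviour above follows from the target node
deriving `Draining` from `Decommissioning` on every heartbeat. The following is
a minimal sketch of that idea under assumed, simplified types; `liveness` and
`heartbeatOnce` are hypothetical names, not CockroachDB's actual liveness code.

```go
package main

import "fmt"

// liveness is a hypothetical view of a node's own row in the node
// liveness table, reduced to the fields relevant here.
type liveness struct {
	Draining        bool
	Decommissioning bool
}

// heartbeatOnce models a single heartbeat cycle on a target node: it
// re-reads its liveness row and, if the node has been marked as
// decommissioning, (re)enters draining. Because this runs on every
// heartbeat, a restart that cleared Draining simply resumes from here.
func heartbeatOnce(l *liveness) {
	if l.Decommissioning && !l.Draining {
		l.Draining = true
		fmt.Println("decommissioning flag seen: draining leases, SQL connections, replicas")
	}
}

func main() {
	// A node that was marked as decommissioning and then restarted:
	// the restart reset Draining but not Decommissioning.
	l := &liveness{Decommissioning: true}
	heartbeatOnce(l) // draining resumes after the restart

	// Recommission clears Decommissioning, but the node cannot tell
	// whether Draining was set because of decommissioning or for some
	// other reason, so in this design only a restart clears Draining.
	l.Decommissioning = false
	heartbeatOnce(l) // no-op: the node keeps draining until restarted
	fmt.Printf("draining=%v decommissioning=%v\n", l.Draining, l.Decommissioning)
}
```
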
## Dead nodes

If a target node is dead (i.e. unreachable), it can be in one of the following
states:

1. It holds unexpired leases.
1. It holds no leases; the replicas of the ranges it has on disk have not been
   rebalanced to other nodes.
1. It holds no leases; the replicas of the ranges it has on disk have been
   down-replicated from it and up-replicated to other nodes.

Regardless of which of these states the dead node is in, it cannot set its
`Draining` flag because it is dead. Instead, its leases will expire and will be
taken over by other nodes. After `server.time_until_store_dead` elapses, the
replicas of its ranges will actively be rebalanced. Wait for the replica count
to reach 0, as for live nodes.

It is possible that a dead node becomes available and rejoins the cluster. It
would discover that the `Decommissioning` flag has been set and follow the
decommissioning process.

## Temporary removal

No changes are required. The existing CockroachDB process, described below, is
sufficient.

During a temporary removal, `server.time_until_store_dead` could be updated to
the length of the downtime to avoid unnecessary movement of ranges. However, it
is difficult to predict the length of downtime. This is an optimization which
could be implemented later.

After a node is detected as unavailable for more than
`server.time_until_store_dead` (an environment variable with default: 5
minutes), the node is marked as incommunicado; the cluster decides that the node
may not be coming back and moves the data elsewhere. Its leases would have been
transferred to other nodes as soon as they expired without being renewed. Ranges
are up-replicated to other nodes and down-replicated (removed) from the target
node. Ranges are not down-replicated when no target exists for up-replication,
because this causes more work if that node rejoins the cluster; when permanently
removing a node, however, we still want to down-replicate. Although a node can
have multiple stores, there can be at most one replica of each replica set
(e.g. range) per node.
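
The rebalancing described for dead and decommissioning nodes comes down to the
replicate queue refusing to place new replicas on either kind of node. The
sketch below illustrates only that selection rule; the `store` type,
`pickUpreplicationTarget`, and the stalling behaviour are simplified assumptions
rather than the real allocator.

```go
package main

import (
	"errors"
	"fmt"
)

// store is a hypothetical store descriptor carrying only the state the
// rebalancer cares about here: whether the node is dead and whether it
// is marked as decommissioning in the liveness table.
type store struct {
	NodeID          int
	Dead            bool
	Decommissioning bool
}

// errStalled mirrors the behaviour described above: if no live,
// non-decommissioning node is available, up-replication stalls rather
// than risking the availability of the range.
var errStalled = errors.New("no live, non-decommissioning target: up-replication stalls")

// pickUpreplicationTarget treats decommissioning nodes like dead ones:
// neither is an acceptable destination for a new replica. existing is
// the set of node IDs that already hold a replica of the range.
func pickUpreplicationTarget(stores []store, existing map[int]bool) (int, error) {
	for _, s := range stores {
		if s.Dead || s.Decommissioning || existing[s.NodeID] {
			continue
		}
		return s.NodeID, nil
	}
	return 0, errStalled
}

func main() {
	stores := []store{
		{NodeID: 1}, {NodeID: 2},
		{NodeID: 3, Decommissioning: true},
		{NodeID: 4, Dead: true},
		{NodeID: 5},
	}
	// A range replicated on nodes 1, 2 and the decommissioning node 3:
	// the replica on node 3 has to be moved somewhere else first.
	existing := map[int]bool{1: true, 2: true, 3: true}
	target, err := pickUpreplicationTarget(stores, existing)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("up-replicate to node %d, then remove the replica on node %d\n", target, 3)
}
```

Once a target is found, the range is first up-replicated to it and only then
down-replicated from the decommissioning or dead node, matching the ordering in
the permanent-removal steps.
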
## CLI

Two new commands and one option will be added. These have been described above.
The first two are asynchronous commands and `quit` is synchronous.

- `node decommission <nodeID>...` prompts the user for confirmation. Passing
  `--yes` as a command-line flag will skip this prompt. This returns a list of
  all nodes, their statuses, replica counts, decommissioning flags and draining
  flags.
  - *Nice to have*: satisfying safety constraints as a prerequisite. As an
    example: `number_of_nodes < max(zone.number_of_replicas_desired)`. It would
    be nice to check that the number of nodes remaining after decommissioning is
    large enough for a quorum to be reached. However, this is not as easy as it
    sounds to achieve due to e.g. ZoneConfig.
- `node recommission <nodeID>...` has similar semantics to `decommission`. It
  prints a message asking the user to restart the node for the change to take
  effect.
- `quit --decommission`. This is synchronous to guarantee that the availability
  of a range's replica set is not compromised.

It is possible to decommission several nodes by passing multiple node IDs on the
command line.

## UI

- The UI hides nodes which are marked as dead and are also in decommissioning
  state (as per the liveness table, or rather its gossiped information). That
  should have the desired effect and is straightforward while keeping stats, etc.
- If a node is dead, it cannot remove itself from the node table in the admin
  UI. Other nodes are responsible for hiding decommissioned nodes from the admin
  UI.
- The action of decommissioning a node creates an event that is displayed in the
  UI.

# Drawbacks

There is no atomic way to check that decommissioning a set of nodes will leave
enough nodes for the cluster to be available. A race can occur: in a five-node
cluster, two users can simultaneously request to decommission two different
nodes each. The resulting state leaves two nodes in decommissioning state and
only one live node. Decommissioning nodes can still participate in quorums, but
the replication changes cannot make progress because there are not enough
non-decommissioned nodes. The user can discover that too many nodes are in
decommissioning state and choose which nodes to recommission. It would be nice
to proactively detect this, but the effort required is disproportionate compared
to the gain.

Recently dead nodes may cause decommissioning to hang until
`server.time_until_store_dead` elapses.

Recommissioning, i.e. undoing a decommission, requires a node restart. We could
avoid this: if the node is live, the coordinating process could directly tell it
to stop draining. Punting on this for now; this could be future work.

# Alternatives

The decommission operation could have been made blocking. However, if a blocking
operation took a long time to complete, the connection might time out if no
response was sent to the client; hence the command is asynchronous.

# Unresolved questions

None.

[#16447]: https://github.com/cockroachdb/cockroach/pull/16447
[#6198]: https://github.com/cockroachdb/cockroach/issues/6198