
- Feature Name: Decommissioning a node
- Status: completed
- Start Date: 2017-06-05
- Authors: Neeral Dodhia
- RFC PR: [#16447]
- Cockroach Issue: [#6198]

# Summary

When a node will be removed from a cluster, mark the node as decommissioned. No
data may be lost in the process. To this end:

- Drain data from the node if data availability (replica count) would otherwise
  be reduced.
- Prevent the node from participating in the cluster.

# Motivation

Cluster operations are common in production deployments: taking a node down for
maintenance, later bringing it back up, adding a new node, and (permanently)
removing a node. All of these should be simple and **safe** to perform.

Currently, the way to remove a node from a cluster is to shut it down. There is
a period before the cluster rebalances its replicas. If another node were to go
down during that period, for example due to power loss, then some replica sets
would become unavailable: those that had a replica on both the shut-down node
and the failed node are left with only a single replica. This demonstrates why
safe removal—without risk of data loss—is required.

A typical operation in the field is to replace a set of nodes. This involves
removing the old nodes and adding their replacements. Decommissioning one node
at a time is inefficient: decommissioning the first node moves data onto nodes
that are about to be decommissioned too, and decommissioning those nodes then
causes further data movement. Instead, it is more efficient to mark all of
these nodes as decommissioning at once, so that the draining mechanism moves
data only to nodes that will remain. The steps to replace multiple nodes are to
add the new nodes and then decommission the old ones.

# Detailed design

The following scenarios are considered:

1. A node will (permanently) be removed from the cluster. The node is currently
   1. alive, or
   1. dead.
1. A node will temporarily be down (e.g. for maintenance).

## Permanent removal

1. On any node, the user executes
   ```shell
   cockroach node decommission <nodeID>...
   ```
   The node that receives the CLI command is referred to as node A. The nodes
   specified in the CLI command are referred to as target nodes. This command is
   asynchronous and returns a list of node IDs, their status, their replica
   count, and their `Draining` and `Decommissioning` flags.
1. Node A sets the `Decommissioning` flag to `true` for all target nodes in the
   node liveness table.
1. After approximately 10 seconds, each target node discovers that its
   `Decommissioning` flag has been set. The mechanism for discovery is the
   heartbeat process, which periodically updates the node's own entry in the
   node liveness table.
1. Each target node:
   1. sets the `Draining` flag in the node liveness table to `true`, and
   1. waits until draining has completed.
1. At this point, every target node
   1. is not holding any leases,
   1. is not accepting any new SQL connections, and
   1. is not accepting new replicas.
1. Leaseholders, necessarily non-target nodes, for ranges where a target node
   is a member of the replica set have their replicate queue treat
   decommissioning nodes like dead replicas. This does the right thing: it does
   not down-replicate if that puts us in more danger than we are already in,
   and does not up-replicate to more dangerous states. Specifically, it will
   1. up-replicate to a node not in decommissioning state, and
      - If no nodes are available for up-replication, the process stalls. This
        prevents data loss and ensures availability of all ranges.
   1. down-replicate from target nodes.
1. Wait for the replica count on each target node to reach 0. To be able to do
   this for dead nodes, use the meta ranges.
1. The user idempotently executes the command from step 1. To measure progress,
   track the replica count.
1. The user executes
   ```shell
   cockroach quit --decommission --host=<hostname> --port=<port>
   ```
   for each target node. This explicitly shuts down the node. Otherwise, the
   node remains up but is unable to participate in usual operations such as
   holding replicas. The `--decommission` flag causes it to wait until the
   replica count on the specified node reaches 0 before initiating shutdown.
   - This command also initiates decommission. Setting decommission is
     idempotent, which is convenient here. The difference with this command is
     that it blocks until the node shuts down. Users who do not want
     asynchronous external polling can choose this command if they want to
     decommission a single node.

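The wait in the last few steps amounts to an external polling loop. A minimal
sketch in Go, assuming a hypothetical `counts` callback that stands in for
re-running `cockroach node decommission` and reading the replica counts it
reports:

```go
package main

import (
	"fmt"
	"time"
)

// clampZero floors a count at zero.
func clampZero(n int) int {
	if n < 0 {
		return 0
	}
	return n
}

// waitForDrained polls a replica-count source until every target node
// reports zero replicas, returning the number of polls performed. The
// counts callback is a hypothetical stand-in for re-running
// `cockroach node decommission <nodeID>...` and parsing its output.
func waitForDrained(counts func() map[int]int, interval time.Duration) int {
	polls := 0
	for {
		polls++
		remaining := 0
		for _, n := range counts() {
			remaining += n
		}
		if remaining == 0 {
			return polls
		}
		time.Sleep(interval)
	}
}

func main() {
	// Simulate two target nodes whose replica counts fall as the
	// replicate queues move data away.
	step := 0
	counts := func() map[int]int {
		step++
		return map[int]int{4: clampZero(2 - step), 5: clampZero(3 - step)}
	}
	polls := waitForDrained(counts, time.Millisecond)
	fmt.Println("drained after", polls, "polls")
}
```

In a real deployment the loop would sleep far longer between polls; draining can
take as long as the rebalancer needs to move every replica off the targets.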
## Handling restarts

When a node is restarted, it resets its `Draining` flag but its
`Decommissioning` flag remains set. The process above resumes from the third
step. Hence, if a node is restarted at any point after the `Decommissioning`
flag is set, the decommissioning process will resume.

A decommissioned node can be restarted and would rejoin the cluster. However,
it would not participate in any activity. This is safe and, hence, there is no
need to prevent a decommissioned node from restarting.

When a node restarts, there is a small period when it can accept new replicas
before reading the node liveness table. This would be rare and short-lived:
nodes will not send new replicas to a decommissioned node, so this requires
both nodes to be unaware of the decommissioning state. A decommissioned node
could have reached a replica count of 0, then be restarted and accept new
replicas. In this case, availability could be compromised if the node were
shut down immediately. However, the `--decommission` flag to the `quit` command
ensures that shutdown is only initiated once the replica count is 0.

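The restart behaviour can be sketched as two small updates to the node's
liveness record. The `Liveness` struct and function names below are
illustrative, not the actual CockroachDB types:

```go
package main

import "fmt"

// Liveness is an illustrative subset of a node's liveness record.
type Liveness struct {
	Draining        bool
	Decommissioning bool
}

// onRestart models a node coming back up: Draining is reset, but
// Decommissioning persists in the liveness table.
func onRestart(l Liveness) Liveness {
	l.Draining = false
	return l
}

// onHeartbeat models the periodic liveness update through which a node
// discovers its own Decommissioning flag and re-enters draining.
func onHeartbeat(l Liveness) Liveness {
	if l.Decommissioning {
		l.Draining = true
	}
	return l
}

func main() {
	l := Liveness{Draining: true, Decommissioning: true}
	l = onRestart(l)
	fmt.Printf("after restart: %+v\n", l)
	l = onHeartbeat(l)
	fmt.Printf("after heartbeat: %+v\n", l) // draining resumes
}
```

This is why no extra persistence is needed: the `Decommissioning` flag in the
liveness table alone is enough to make the process resume after any restart.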
## Undo

```shell
cockroach node recommission <nodeID>...
```
sets the `Decommissioning` flag to `false` for target nodes. Then, the user
must restart each target node. When a node restarts, it resets its `Draining`
flag. This allows the node to participate as normal. A restart is required
because we cannot determine whether a node is in `Draining` because it was
previously in `Decommissioning` or for some other reason.

## Dead nodes

If a target node is dead (i.e. unreachable), it can be in one of the following
states:

1. It holds unexpired leases.
1. It holds no leases; the replicas of ranges it has on-disk have not been
   rebalanced to other nodes.
1. It holds no leases; the replicas of ranges it has on-disk have been
   down-replicated from it and up-replicated to other nodes.

Regardless of which state the dead node is in, it cannot set its `Draining`
flag because it is dead. Instead, its leases will expire and will be taken over
by other nodes. After `server.time_until_store_dead` elapses, the replicas of
its ranges will actively be rebalanced. Wait for the replica count to reach 0,
as for live nodes.

It is possible that a dead node becomes available again and rejoins the
cluster. It would discover that its `Decommissioning` flag has been set and
follow the decommissioning process.

## Temporary removal

No changes are required. The existing CockroachDB process, described below, is
sufficient.

During a temporary removal, `server.time_until_store_dead` could be updated to
the expected length of the downtime to avoid unnecessary movement of ranges.
However, it is difficult to predict the length of downtime. This is an
optimization which could be implemented later.

After a node is detected as unavailable for more than
`server.time_until_store_dead` (a cluster setting with a default of 5 minutes),
the node is marked as incommunicado: the cluster decides that the node may not
be coming back and moves its data elsewhere. Its leases would have been
transferred to other nodes as soon as they expired without being renewed.
Ranges are up-replicated to other nodes and down-replicated (removed) from the
target node. Ranges are not down-replicated when no target for up-replication
exists, because this causes more work if the node rejoins the cluster; however,
when permanently removing a node, we still want to do that. Although a node can
have multiple stores, there can be at most one replica of each replica set
(i.e. range) per node.

## CLI

Two new commands and one option will be added. These have been described above.
The first two are asynchronous commands; `quit` is synchronous.

- `node decommission <nodeID>...` prompts the user for confirmation. Passing
  `--yes` as a command-line flag skips this prompt. This returns a list of all
  nodes, their statuses, their replica count, and their decommissioning and
  draining flags.
  - *Nice to have*: satisfying safety constraints as a pre-requisite. As an
    example: `number_of_nodes < max(zone.number_of_replicas_desired)`. It would
    be nice to check that the number of nodes remaining after decommissioning
    is large enough for a quorum to be reached. However, this is not as easy to
    achieve as it sounds due to e.g. ZoneConfig.
- `node recommission <nodeID>...` has similar semantics to `decommission`. It
  prints a message asking the user to restart the node for the change to take
  effect.
- `quit --decommission`. This is synchronous to guarantee that the availability
  of a range's replica set is not compromised.

It is possible to decommission several nodes by passing multiple node IDs on
the command-line.

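The "nice to have" pre-requisite could be sketched as below, assuming the
maximum desired replication factor across all zone configs is already known;
the function is illustrative, not an existing CockroachDB API:

```go
package main

import "fmt"

// safeToDecommission is an illustrative version of the proposed
// pre-requisite check: after removing the target nodes, enough nodes must
// remain to host the largest replication factor requested by any zone
// config (and thus to keep a quorum reachable for every range).
func safeToDecommission(liveNodes, targetNodes, maxDesiredReplicas int) bool {
	return liveNodes-targetNodes >= maxDesiredReplicas
}

func main() {
	// Five-node cluster, default replication factor 3.
	fmt.Println(safeToDecommission(5, 2, 3)) // true: 3 nodes remain
	fmt.Println(safeToDecommission(5, 3, 3)) // false: only 2 remain
}
```

As the text notes, per-zone replication constraints make the real check harder
than this single inequality suggests.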
## UI

- The UI hides nodes which are marked as dead and are also in decommissioning
  state (as per the liveness table, or rather its gossiped information). That
  has the desired effect and is straightforward to implement while keeping
  stats, etc.
- If a node is dead, it cannot remove itself from the node table in the admin
  UI. Other nodes are responsible for hiding decommissioned nodes from the
  admin UI.
- The action of decommissioning a node creates an event that is displayed in
  the UI.

# Drawbacks

There is no atomic way to check that decommissioning a set of nodes will leave
enough nodes for the cluster to remain available. A race can occur: in a
five-node cluster, two users can simultaneously request to decommission two
different nodes each. The resulting state leaves four nodes in decommissioning
state and only one live node. Decommissioning nodes can still participate in
quorums, but the replication changes cannot make progress because there are not
enough non-decommissioned nodes. The user can discover that too many nodes are
in decommissioning state and choose which nodes to recommission. It would be
nice to proactively detect this, but the effort required is disproportionate to
the gain.

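The race follows from the check being non-atomic: each request validates
against the cluster state it observed, before either request's flags are
written to the liveness table. An illustrative sketch (node counts and the
check itself are assumptions for the example):

```go
package main

import "fmt"

// nonAtomicCheck models a naive pre-check that each decommission request
// runs against the state it observed, before any flags are written.
func nonAtomicCheck(liveNodes, requested, replicationFactor int) bool {
	return liveNodes-requested >= replicationFactor
}

func main() {
	const liveNodes, replicationFactor = 5, 3
	// Two concurrent requests, each decommissioning 2 different nodes.
	// Each check passes in isolation...
	fmt.Println(nonAtomicCheck(liveNodes, 2, replicationFactor)) // true
	fmt.Println(nonAtomicCheck(liveNodes, 2, replicationFactor)) // true
	// ...but the combined effect leaves only one non-decommissioning node.
	fmt.Println(nonAtomicCheck(liveNodes, 2+2, replicationFactor)) // false
}
```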
Recently dead nodes may cause decommissioning to hang until
`server.time_until_store_dead` elapses.

Recommissioning, i.e. undoing a decommission, requires a node restart. We could
avoid this: if the node is live, the coordinating process could directly tell
it to stop draining. We are punting on this for now; it could be future work.

# Alternatives

Making the decommission operation synchronous was considered. If the operation
were blocking and took a long time to complete, the connection might time out
before a response was sent to the client.

# Unresolved questions

None.

[#16447]: https://github.com/cockroachdb/cockroach/pull/16447
[#6198]: https://github.com/cockroachdb/cockroach/issues/6198