- Feature Name: drain_modes
- Status: completed
- Start Date: 2016-04-25
- Authors: Tobias Schottdorf (tobias.schottdorf@gmail.com), Alfonso Subiotto Marqués
- RFC PRs: [#6283](https://github.com/cockroachdb/cockroach/pull/6283), [#10765](https://github.com/cockroachdb/cockroach/pull/10765)
- Cockroach Issue: [#9541](https://github.com/cockroachdb/cockroach/issues/9541), [#9493](https://github.com/cockroachdb/cockroach/issues/9493), [#6295](https://github.com/cockroachdb/cockroach/issues/6295)

# Summary
Propose a draining process for a CockroachDB node to perform a graceful
shutdown.

This draining process will be composed of two reduced modes of operation that
will be run in parallel:

* `drain-clients` mode: the server lets ongoing SQL clients finish their work up
  to some deadline, at which point their contexts are canceled and cleaned up,
  and then politely refuses new work at the gateway in the manner that popular
  load-balancing SQL clients can handle best.
* `drain-leases` mode: all range leases are transferred away from the node, new
  ones are not granted (in turn disabling most queues and active gossiping), and
  the draining node will not be a target for lease or range transfers. The
  draining node will decline any preemptive snapshots.

These modes are not related to the existing `Stopper`-based functionality and
can be run independently (i.e. a server can temporarily run in `drain-clients`
mode).

# Motivation
In our Stopper usage, we've taken to fairly ruthlessly shutting down service of
many components once `(*Stopper).Quiesce()` is called. In code, this manifests
itself through copious use of the `(*Stopper).ShouldQuiesce()` channel even when
not running inside of a task (or running long-running operations inside tasks).

This was motivated mainly by endless amounts of test failures around leaked
goroutines or deadlocking operations and has served us reasonably well, keeping
our moving parts in check.

However, now that we're looking at clusters and their maintenance, this simple
"drop-all" approach isn't enough. See, for example, #6198, #6197 or #5279. This
RFC outlines proposed changes to accommodate use cases such as:

* clean shutdown: instead of dropping everything on the floor, politely block
  new clients and finish open work up to some deadline, transfer/release all
  leases to avoid a period of unavailability, and only then initiate shutdown
  via `stopper.Stop()`. This clean shutdown avoids:
    * Blocking schema changes on the order of minutes because table descriptor
      leases are not properly released [#9493](https://github.com/cockroachdb/cockroach/issues/9493)
    * Increased per-range activity caused by nodes trying to pick up expired
      epoch-based range leases.
    * Leaving around intents from ongoing transactions.
* draining data off a node (i.e. have all or some subset of ranges migrate
  away, and wait until that has happened).
  Used for decommissioning, but could also be used to change hard-drives
  cleanly (i.e. drain data off a store, clean shutdown, switch hdd, start).
* decommission (permanently remove), which is "most of the above" with the
  addition of announcing the downtime as forever to the cluster.
* drain for update/migration (see the `{F,Unf}reeze` proposal in #6166):
  Essentially here we'll want to have the cluster in drain-clients mode and then
  propose a `Freeze` command on all Ranges.

This work will also result in:
* Auditing and improving our cancellation mechanism for individual queries.
* Improving compatibility with server health-checking mechanisms used by
  third-party proxies.

# Detailed design
Implement the following:

```go
// Put the SQL server into drain-clients mode within deadline, cancelling
// connections as necessary. A zero deadline undoes the effects of any
// prior drain operation.
(*pgwire.Server).DrainClients(deadline time.Time) error

// Put the Node into drain-lease mode if requested, or undo any previous
// mode change.
(*server.Node).DrainLeases(bool) error
```

## SQL/drain-clients mode:
The `v3conn` struct will be extended to point to the server. Before and after
reading messages from the client, a `v3conn` will check the draining status on
the server. If both `draining` is set to `true` and `v3conn.session.TxnState.State`
is `NoTxn`, the `v3conn` will exit the read loop, send an
[appropriate error code](#note-on-closing-sql-connections), and close the
connection, thereby denying the reading and execution of statements that aren't
part of an ongoing transaction. The `v3conn` will also be made aware of
cancellation of its session's context, which will be handled in the same way.
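
For illustration, a minimal sketch of this check (the stub types, the
`serverDraining` callback and `errAdminShutdown` are assumptions, not the actual
pgwire code):

```go
import (
	"context"
	"errors"
)

// Illustrative stand-ins for the relevant session state.
type txnStateKind int

const (
	noTxn txnStateKind = iota
	openTxn
)

// errAdminShutdown stands in for the error described in the note below.
var errAdminShutdown = errors.New("pq: 57P01 admin_shutdown")

// checkDrain is the check performed before and after reading a client
// message: if the server is draining and no transaction is open, or if the
// session's context has been canceled, the connection is terminated.
func checkDrain(ctx context.Context, serverDraining func() bool, state txnStateKind) error {
	if serverDraining() && state == noTxn {
		return errAdminShutdown
	}
	select {
	case <-ctx.Done():
		// The session's context was canceled by the drain process.
		return errAdminShutdown
	default:
		return nil
	}
}
```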

All active sessions will be found through the session registry
([#10317](https://github.com/cockroachdb/cockroach/pull/10317/)). Sessions will
be extended with a `done` channel which will be used by `pgwire.Server` to
listen for the completion of sessions. Sessions will be created with a context
derived from a cancellable parent context, thus offering a single point from
which to cancel all sessions.

Additionally, `pgwire.Server` will not create new sessions once in `draining`
mode.
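
A minimal sketch of this bookkeeping, under the assumption of an illustrative
`registry` type (the real session registry and `pgwire.Server` differ):

```go
import (
	"context"
	"sync"
)

// session is an illustrative stand-in for a SQL session.
type session struct {
	ctx    context.Context
	cancel context.CancelFunc
	done   chan struct{} // closed when the session finishes
}

// registry tracks active sessions and refuses new ones while draining.
type registry struct {
	mu        sync.Mutex
	draining  bool
	parent    context.Context    // parent of all session contexts
	cancelAll context.CancelFunc // cancels every session at once
	sessions  map[*session]struct{}
}

func newRegistry() *registry {
	ctx, cancel := context.WithCancel(context.Background())
	return &registry{
		parent:    ctx,
		cancelAll: cancel,
		sessions:  make(map[*session]struct{}),
	}
}

// newSession derives the session's context from the cancellable parent,
// giving the server a single point from which to cancel all sessions.
// It refuses to create sessions once the server is draining.
func (r *registry) newSession() (*session, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.draining {
		return nil, false
	}
	ctx, cancel := context.WithCancel(r.parent)
	s := &session{ctx: ctx, cancel: cancel, done: make(chan struct{})}
	r.sessions[s] = struct{}{}
	return s, true
}
```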

After `draining` behavior is enabled, we have three classes of clients
characterized by the behavior of the server:
* Blocked in the read loop with no active transaction. No execution of any
  statements is ongoing.
* Blocked in the read loop with an active transaction. No execution of any
  statements is ongoing but more are expected.
* Not blocked in the read loop. The execution of (possibly a batch of)
  statements is ongoing.

A first pass through the registry will collect all channels that belong to
sessions of the second and third class of clients. These clients will be given a
deadline of `drainMaxWait` (default of 10s) to finish any work. Note that once
work is complete, the `v3conn` will not block in the read loop and will instead
exit. Once `drainMaxWait` has elapsed or there are no more sessions with active
transactions, the parent context of all sessions will be canceled. Since a
derived context is used to send RPCs, these RPCs will be canceled, resulting in
the propagation of errors back to the Executor and the client. Additionally,
plumbing will have to be added so that this context cancellation also interrupts
important local work done on the nodes.

`pgwire.Server` will listen for the completion of the remaining sessions up to
a timeout of 1s. Some sessions might keep going indefinitely despite a canceled
context.
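
Put together, the client-drain sequence could look roughly like the following
sketch (`drainMaxWait`, the channel slice and the `cancelAll` parameter are
illustrative; the real implementation lives in `pgwire.Server`):

```go
import (
	"context"
	"time"
)

const (
	drainMaxWait     = 10 * time.Second // grace period for in-flight work
	finalSessionWait = 1 * time.Second  // wait after canceling all sessions
)

// drainClients gives sessions with in-flight work up to drainMaxWait to
// finish, cancels the shared parent context, and then waits up to one more
// second for the remaining sessions to wind down.
func drainClients(cancelAll context.CancelFunc, busy []<-chan struct{}) {
	grace := time.After(drainMaxWait)
	remaining := busy
	for len(remaining) > 0 {
		select {
		case <-remaining[0]: // this session finished on its own
			remaining = remaining[1:]
		case <-grace:
			remaining = nil // grace period exhausted
		}
	}

	// Cancel every session through the shared parent context; contexts
	// derived for RPCs are canceled too, propagating errors to clients.
	cancelAll()

	// Some sessions may keep going indefinitely despite the canceled
	// context, so bound the final wait.
	final := time.After(finalSessionWait)
	for _, done := range busy {
		select {
		case <-done:
		case <-final:
			return
		}
	}
}
```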

The final stage is to delete table descriptor leases from the `system.lease`
table to avoid blocking future schema changes. `LeaseManager` will be extended
to delete all leases from `system.lease` with a `refcount` of 0. Since there
might still be ongoing sessions, `LeaseManager` will enter a `draining` mode in
which any lease whose `refcount` is decremented to 0 will be deleted
from `system.lease`. If sessions are still ongoing after this point, a log
message will warn of this fact.
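
A rough sketch of that refcount bookkeeping (the `leaseTracker` type and its
`release` callback are illustrative stand-ins for `LeaseManager` and the
`system.lease` deletion):

```go
import "sync"

// leaseTracker is an illustrative stand-in for the LeaseManager's
// per-descriptor lease bookkeeping.
type leaseTracker struct {
	mu       sync.Mutex
	draining bool
	refcount map[int64]int        // descriptor ID -> active users of the lease
	release  func(id int64) error // deletes the lease row from system.lease
}

// setDraining deletes all leases that are no longer referenced and arranges
// for the remaining ones to be deleted as soon as they become unused.
func (t *leaseTracker) setDraining() error {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.draining = true
	for id, rc := range t.refcount {
		if rc == 0 {
			if err := t.release(id); err != nil {
				return err
			}
			delete(t.refcount, id)
		}
	}
	return nil
}

// decRef is called when a session stops using a lease; while draining,
// the lease is deleted from system.lease once its refcount drops to 0.
func (t *leaseTracker) decRef(id int64) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.refcount[id]--
	if t.draining && t.refcount[id] == 0 {
		delete(t.refcount, id)
		return t.release(id)
	}
	return nil
}
```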

### Note on closing SQL connections
The load balancing solutions for Postgres that seem to be the most popular are
[PGPool](http://www.pgpool.net/docs/latest/en/html/) and
[HAProxy](https://www.haproxy.com/). Both systems have health check mechanisms
that try to establish a SQL
connection[[1]](http://www.pgpool.net/docs/latest/en/html/#HEALTH_CHECK_USER)
[[2]](https://www.haproxy.com/doc/aloha/7.0/haproxy/healthchecks.html#checking-a-pgsql-service).
HAProxy also supports doing only generic TCP checks. Because of how these health
checks are performed, sending back a SQL error during the establishment of a
connection will result in the draining node being correctly marked as down.

For failures during an existing or new session, PGPool has a
`fail_over_on_backend_error` option which is triggered when

> 57P01 ADMIN SHUTDOWN
> admin_shutdown

is received from the backend. PGPool will retain session information under some
conditions[[3]](http://pgsqlpgpool.blogspot.com/2016/07/avoiding-session-disconnection-while.html)
and can reconnect transparently to a different backend.

HAProxy does not have similar functionality and both SQL/TCP errors are
forwarded to the client.

We should therefore return

> 57P01 ADMIN SHUTDOWN
> admin_shutdown

to reject new connections and close existing ones for compatibility with
PGPool's automatic failover and both PGPool's and HAProxy's health checks.
However, when HAProxy is being used, clients that want transparent failover
should retry when the establishment of a connection errors out.
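
For reference, a sketch of the error a draining gateway would send before
closing the connection (the struct is illustrative; the real server encodes
these fields through its existing pgwire ErrorResponse path):

```go
// adminShutdownError carries the fields of interest of a pgwire
// ErrorResponse message; 57P01 is the SQLSTATE for admin_shutdown.
type adminShutdownError struct {
	Severity string
	Code     string
	Message  string
}

func newAdminShutdownError() adminShutdownError {
	return adminShutdownError{
		Severity: "FATAL",
		Code:     "57P01",
		Message:  "server is shutting down",
	}
}
```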

## Node/drain-leases mode:
`(*server.Node).DrainLeases(bool)` iterates over its store list and delegates
to all contained stores. `(*Replica).redirectOnOrAcquireLease` checks
with its store before requesting a new or extending an existing lease.
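
Schematically (a sketch; the `store` interface and `drainLeases` helper stand in
for the real Node/Store plumbing):

```go
// store captures the single method this step needs from a Store.
type store interface {
	SetDraining(bool)
}

// drainLeases fans the draining flag out to every store; replicas then
// consult their store's flag in redirectOnOrAcquireLease before acquiring
// or extending a lease.
func drainLeases(stores []store, drain bool) error {
	for _, s := range stores {
		s.SetDraining(drain)
	}
	return nil
}
```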

The `Liveness` proto will be extended with a `draining` field which will be
taken into account in `Allocator.TransferLeaseTarget` and
`Allocator.AllocateTarget`, resulting in no leases or ranges being transferred
to a node that is known to be draining. Updating the node liveness record will
trigger a gossip of the node's draining mode.
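
A sketch of the allocator-side filtering (the `liveness` struct and
`filterDraining` helper are illustrative stand-ins for the extended Liveness
record and the allocator's candidate selection):

```go
// liveness is an illustrative stand-in for the Liveness proto with the
// proposed draining field.
type liveness struct {
	NodeID   int32
	Draining bool
}

// filterDraining removes draining nodes from a list of candidate targets,
// as Allocator.TransferLeaseTarget and Allocator.AllocateTarget would.
func filterDraining(candidates []int32, byNode map[int32]liveness) []int32 {
	var out []int32
	for _, nodeID := range candidates {
		if l, ok := byNode[nodeID]; ok && l.Draining {
			continue // never pick a node known to be draining
		}
		out = append(out, nodeID)
	}
	return out
}
```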

Transfers to a draining node could still happen before the gossiped `Liveness`
has reached the source of the transfer. This will only be handled in the case
of range transfers.

A draining node will decline a snapshot in `HandleSnapshot` if its store is
draining. In the case of lease transfers, since they are proposed and applied as
raft commands, there is no way for the recipient to reject a lease.
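
Sketch of the receiving-store check (illustrative; the real `HandleSnapshot`
signature and error handling differ):

```go
import "errors"

// errStoreDraining is returned to the sender so it can pick another target
// for the preemptive snapshot.
var errStoreDraining = errors.New("store is draining; declining snapshot")

// handleSnapshot declines incoming snapshots while the store is draining.
// Incoming lease transfers, by contrast, cannot be rejected here because
// they are applied as Raft commands.
func handleSnapshot(storeDraining bool, apply func() error) error {
	if storeDraining {
		return errStoreDraining
	}
	return apply()
}
```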

[NB: previously this RFC proposed modifying the lease transfer mechanism so that
a draining node could immediately send back any leases transferred to it while
draining. Since the draining process is a best effort to avoid unavailability
and it is not clear that this addition is necessary or would produce significant
benefits, the modification of the lease transfer mechanism has been left out
but could be implemented in the future if necessary.]

To decrease the probability of unavailability, an optional 10s timeout will be
introduced to wait for the gossip of the draining node's `Liveness` to
propagate. This option will be off by default and operators will have the
option of turning it on to prioritize availability over shutdown speed.

After this step, leases that a node's replicas currently hold will be
transferred away using `AdminTransferLease(target)` where `target` will be found
using `Allocator.TransferLeaseTarget`.
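
A sketch of that step, with the two function parameters standing in for
`Allocator.TransferLeaseTarget` and `AdminTransferLease`:

```go
// transferAllLeases hands every lease currently held by the node's replicas
// to a target chosen by the allocator.
func transferAllLeases(
	leaseholderRangeIDs []int64,
	transferLeaseTarget func(rangeID int64) (target int32, ok bool),
	adminTransferLease func(rangeID int64, target int32) error,
) error {
	for _, rangeID := range leaseholderRangeIDs {
		target, ok := transferLeaseTarget(rangeID)
		if !ok {
			continue // no suitable target; let the lease expire instead
		}
		if err := adminTransferLease(rangeID, target); err != nil {
			return err
		}
	}
	return nil
}
```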

To allow commands that were sent to the node's replicas to complete, the
draining node will wait for its replicas' command queues to drain up to a
timeout of 1s. This timeout is necessary in the case that new leases are
transferred to the node and it receives new commands.
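
Sketch of that bounded wait (illustrative; the per-replica signal is assumed to
be a channel that closes once its command queue is empty):

```go
import "time"

// waitForCommandQueues waits for each replica's command queue to drain, but
// gives up after one second in case new leases (and hence new commands)
// keep arriving at the node.
func waitForCommandQueues(queuesDrained []<-chan struct{}) {
	timeout := time.After(1 * time.Second)
	for _, drained := range queuesDrained {
		select {
		case <-drained:
		case <-timeout:
			return
		}
	}
}
```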

## Server/adminServer:

```go
type DrainMode int

const (
	DrainClient DrainMode = 1 << iota
	DrainLeases
)

// For example, `s.Drain(DrainClient | DrainLeases)`
(*server.Server).Drain(mode DrainMode) error
```

and hook it up from `(*adminServer).handleQuit` (which currently has no way of
accessing `*Server`, only `*Node`, so a shortcut may be taken for the time
being if that seems opportune).
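
For completeness, a sketch of how `(*server.Server).Drain` could dispatch to the
two modes (the field names, the interface shapes and the reuse of `drainMaxWait`
as the client deadline are assumptions):

```go
import "time"

const drainMaxWait = 10 * time.Second

// Server is an illustrative stand-in; the real *server.Server holds many
// more fields.
type Server struct {
	pgServer interface{ DrainClients(deadline time.Time) error }
	node     interface{ DrainLeases(drain bool) error }
}

type DrainMode int

const (
	DrainClient DrainMode = 1 << iota
	DrainLeases
)

// Drain delegates each requested mode to the corresponding subsystem.
func (s *Server) Drain(mode DrainMode) error {
	if mode&DrainClient != 0 {
		if err := s.pgServer.DrainClients(time.Now().Add(drainMaxWait)); err != nil {
			return err
		}
	}
	if mode&DrainLeases != 0 {
		if err := s.node.DrainLeases(true); err != nil {
			return err
		}
	}
	return nil
}
```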

# Drawbacks

# Alternatives
* Don't offer a `drainMaxWait` timeout to ongoing transactions. The idea of the
  timeout is to allow clients a grace period in which to complete work. The
  issue is that this timeout is arbitrary and it might make more sense to
  forcibly close these connections. It might also make sense to let operators
  set this timeout via an environment variable.
* Reacquire table leases when the node goes back up. This was suggested in
  [#9493](https://github.com/cockroachdb/cockroach/issues/9493) but does not fix
  the issue of having to wait for the lease to expire if the node does not come
  back up.

# Future work
* Move ranges to another node if we're draining the node for decommissioning or
  reject a shutdown if not doing so would cause Raft groups to drop below
  quorum.
* Change the lease transfer mechanism so a transferrer can transfer its
  timestamp cache's high water mark, which would act as the low water mark of
  the recipient's timestamp cache. This is conditioned on not [inserting reads in
  the command queue](https://forum.cockroachlabs.com/t/why-do-we-keep-read-commands-in-the-command-queue/360).

# Unresolved questions