- Feature Name: drain_modes
- Status: completed
- Start Date: 2016-04-25
- Authors: Tobias Schottdorf (tobias.schottdorf@gmail.com), Alfonso Subiotto Marqués
- RFC PRs: [#6283](https://github.com/cockroachdb/cockroach/pull/6283), [#10765](https://github.com/cockroachdb/cockroach/pull/10765)
- Cockroach Issue: [#9541](https://github.com/cockroachdb/cockroach/issues/9541), [#9493](https://github.com/cockroachdb/cockroach/issues/9493), [#6295](https://github.com/cockroachdb/cockroach/issues/6295)

# Summary
Propose a draining process for a CockroachDB node to perform a graceful
shutdown.

This draining process will be composed of two reduced modes of operation that
will be run in parallel:

* `drain-clients` mode: the server lets ongoing SQL clients finish their work
  up to some deadline, at which point contexts are canceled and cleanup is
  performed, and then politely refuses new work at the gateway in the manner
  that popular load-balancing SQL clients can handle best.
* `drain-leases` mode: all range leases are transferred away from the node, new
  ones are not granted (in turn disabling most queues and active gossiping), and
  the draining node will not be a target for lease or range transfers. The
  draining node will decline any preemptive snapshots.

These modes are not related to the existing `Stopper`-based functionality and
can be run independently (i.e. a server can temporarily run in `drain-clients`
mode).

# Motivation
In our Stopper usage, we've taken to fairly ruthlessly shutting down service of
many components once `(*Stopper).Quiesce()` is called. In code, this manifests
itself through copious use of the `(*Stopper).ShouldQuiesce()` channel, even
when not running inside of a task (or running long-running operations inside
tasks).

This was motivated mainly by endless amounts of test failures around leaked
goroutines or deadlocking operations, and it has served us reasonably well,
keeping our moving parts in check.

However, now that we're looking at clusters and their maintenance, this simple
"drop-all" approach isn't enough. See, for example, #6198, #6197 or #5279. This
RFC outlines proposed changes to accommodate use cases such as

* clean shutdown: instead of dropping everything on the floor, politely block
  new clients and finish open work up to some deadline, transfer/release all
  leases to avoid a period of unavailability, and only then initiate shutdown
  via `stopper.Stop()`. This clean shutdown avoids:
  * Blocking schema changes on the order of minutes because table descriptor
    leases are not properly released
    [#9493](https://github.com/cockroachdb/cockroach/issues/9493).
  * Increased per-range activity caused by nodes trying to pick up expired
    epoch-based range leases.
  * Leaving around intents from ongoing transactions.
* draining data off a node (i.e. have all or some subset of ranges migrate
  away, and wait until that has happened). Used for decommissioning, but could
  also be used to change hard drives cleanly (i.e. drain data off a store,
  clean shutdown, swap the drive, start).
* decommission (permanently remove), which is "most of the above" with the
  addition of announcing the downtime as forever to the cluster.
* drain for update/migration (see the `{F,Unf}reeze` proposal in #6166):
  essentially here we'll want to have the cluster in drain-clients mode and
  then propose a `Freeze` command on all Ranges.

This work will also result in
* Auditing and improving our cancellation mechanism for individual queries.
* Improving compatibility with server health-checking mechanisms used by
  third-party proxies.

# Detailed design
Implement the following:

```go
// Put the SQL server into drain-clients mode within deadline, cancelling
// connections as necessary. A zero deadline undoes the effects of any
// prior drain operation.
(*pgwire.Server).DrainClients(deadline time.Time) error

// Put the Node into drain-leases mode if requested, or undo any previous
// mode change.
(*server.Node).DrainLeases(bool) error
```

## SQL/drain-clients mode:
The `v3conn` struct will be extended to point to the server. Before and after
reading messages from the client, a `v3conn` will check the draining status on
the server. If `draining` is set to `true` and `v3conn.session.TxnState.State`
is `NoTxn`, the `v3conn` will exit the read loop, send an
[appropriate error code](#note-on-closing-sql-connections), and close the
connection, thereby denying the reading and execution of statements that aren't
part of an ongoing transaction. The `v3conn` will also be made aware of
cancellation of its session's context, which will be handled in the same way.

All active sessions will be found through the session registry
([#10317](https://github.com/cockroachdb/cockroach/pull/10317/)). Sessions will
be extended with a `done` channel which will be used by `pgwire.Server` to
listen for the completion of sessions. Sessions will be created with a context
derived from a cancellable parent context, thus offering a single point from
which to cancel all sessions.

Additionally, `pgwire.Server` will not create new sessions once in `draining`
mode.

After `draining` behavior is enabled, we have three classes of clients,
characterized by the behavior of the server:
* Blocked in the read loop with no active transaction. No execution of any
  statements is ongoing.
* Blocked in the read loop with an active transaction. No execution of any
  statements is ongoing, but more are expected.
* Not blocked in the read loop. The execution of (possibly a batch of)
  statements is ongoing.

A first pass through the registry will collect all channels that belong to
sessions of the second and third classes of clients. These clients will be
given a deadline of `drainMaxWait` (default of 10s) to finish any work. Note
that once work is complete, the `v3conn` will not block in the read loop and
will instead exit. Once `drainMaxWait` has elapsed or there are no more
sessions with active transactions, the parent context of all sessions will be
canceled. Since a derived context is used to send RPCs, these RPCs will be
canceled, resulting in the propagation of errors back to the Executor and the
client. Additionally, plumbing will have to be added to important local work
done on nodes so that this context cancellation leads to the interruption of
this work.

`pgwire.Server` will listen for the completion of the remaining sessions up to
a timeout of 1s. Some sessions might keep going indefinitely despite a canceled
context.
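The overall drain-clients flow can be sketched roughly as follows. This is an
illustration only, not the actual `pgwire` implementation: the names
(`server`, `session`, `inTxn`, `waitForAll`) are assumptions made for the
sketch, with the `done` channels and `drainMaxWait` constant taken from the
design above.

```go
package drainclients

import (
	"context"
	"sync"
	"time"
)

const drainMaxWait = 10 * time.Second

// session stands in for a pgwire session; done is closed when its read loop exits.
type session struct {
	done  chan struct{}
	inTxn bool // whether the session currently has an open transaction
}

// server stands in for pgwire.Server.
type server struct {
	mu        sync.Mutex
	draining  bool
	sessions  map[*session]struct{}
	cancelAll context.CancelFunc // cancels the shared parent context of all sessions
}

// drainClients refuses new sessions, gives sessions with open transactions up
// to drainMaxWait to finish, cancels the shared parent context, and then waits
// briefly for the remaining sessions to exit.
func (s *server) drainClients() {
	s.mu.Lock()
	s.draining = true // new sessions check this flag and are refused
	var busy []chan struct{}
	for sess := range s.sessions {
		if sess.inTxn {
			busy = append(busy, sess.done)
		}
	}
	s.mu.Unlock()

	// First pass: wait up to drainMaxWait for sessions of the second and
	// third class (open transaction or mid-execution).
	waitForAll(busy, drainMaxWait)

	// Cancel the parent context; RPCs derived from it are canceled and the
	// resulting errors propagate back through the Executor to the client.
	s.cancelAll()

	// Final pass: wait up to 1s for whatever is left; some sessions may keep
	// going indefinitely despite the canceled context.
	s.mu.Lock()
	remaining := make([]chan struct{}, 0, len(s.sessions))
	for sess := range s.sessions {
		remaining = append(remaining, sess.done)
	}
	s.mu.Unlock()
	waitForAll(remaining, time.Second)
}

// waitForAll blocks until every channel is closed or the timeout elapses.
func waitForAll(chans []chan struct{}, timeout time.Duration) {
	deadline := time.After(timeout)
	for _, ch := range chans {
		select {
		case <-ch:
		case <-deadline:
			return
		}
	}
}
```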
The final stage is to delete table descriptor leases from the `system.lease`
table to avoid blocking future schema changes. `LeaseManager` will be extended
to delete all leases from `system.lease` with a `refcount` of 0. Since there
might still be ongoing sessions, `LeaseManager` will enter a `draining` mode in
which any lease whose `refcount` is decremented to 0 is deleted from
`system.lease`. If sessions are still ongoing after this point, a log message
will warn of this fact.

### Note on closing SQL connections
The load-balancing solutions for Postgres that seem to be the most popular are
[PGPool](http://www.pgpool.net/docs/latest/en/html/) and
[HAProxy](https://www.haproxy.com/). Both systems have health check mechanisms
that try to establish a SQL
connection[[1]](http://www.pgpool.net/docs/latest/en/html/#HEALTH_CHECK_USER)
[[2]](https://www.haproxy.com/doc/aloha/7.0/haproxy/healthchecks.html#checking-a-pgsql-service).
HAProxy also supports doing only generic TCP checks. Because of how these
health checks are performed, sending back a SQL error during the establishment
of a connection will result in the draining node being correctly marked as
down.

For failures during an existing or new session, PGPool has a
`fail_over_on_backend_error` option which is triggered when

> 57P01 ADMIN SHUTDOWN
> admin_shutdown

is received from the backend. PGPool will retain session information under some
conditions[[3]](http://pgsqlpgpool.blogspot.com/2016/07/avoiding-session-disconnection-while.html)
and can reconnect transparently to a different backend.

HAProxy does not have similar functionality; both SQL and TCP errors are
forwarded to the client.

We should therefore return

> 57P01 ADMIN SHUTDOWN
> admin_shutdown

to reject new connections and close existing ones, for compatibility with
PGPool's automatic failover and with both PGPool's and HAProxy's health checks.
However, when HAProxy is used, clients that want transparent failover should
retry if establishing a connection errors out.

## Node/drain-leases mode:
`(*server.Node).DrainLeases(bool)` iterates over its store list and delegates
to all contained stores. `(*Replica).redirectOnOrAcquireLease` checks with its
store before requesting a new lease or extending an existing one.

The `Liveness` proto will be extended with a `draining` field which will be
taken into account in `Allocator.TransferLeaseTarget` and
`Allocator.AllocateTarget`, resulting in no leases or ranges being transferred
to a node that is known to be draining. Updating the node liveness record will
trigger a gossip of the node's draining mode.

Transfers to a draining node could still happen before the gossiped `Liveness`
has reached the source of the transfer. This will only be handled in the case
of range transfers.

A draining node will decline a snapshot in `HandleSnapshot` if its store is
draining. In the case of lease transfers, since they are proposed and applied
as Raft commands, there is no way for the recipient to reject a lease.
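A minimal sketch of how the `draining` signal could feed into transfer-target
selection and snapshot handling follows. The types and helpers here
(`livenessRecord`, `store`, `filterTransferTargets`, `handleSnapshot`) are
assumptions standing in for the real `Liveness` proto, allocator, and store
code, not the actual APIs.

```go
package drainleases

import "errors"

// livenessRecord stands in for the Liveness proto with its new draining field.
type livenessRecord struct {
	nodeID   int
	draining bool
}

// filterTransferTargets drops draining nodes from a candidate list so that
// neither leases nor ranges are transferred onto them.
func filterTransferTargets(candidates []livenessRecord) []livenessRecord {
	var out []livenessRecord
	for _, l := range candidates {
		if !l.draining {
			out = append(out, l)
		}
	}
	return out
}

// store stands in for a store that knows whether it is draining.
type store struct {
	draining bool
}

var errStoreDraining = errors.New("store is draining; declining preemptive snapshot")

// handleSnapshot declines incoming preemptive snapshots while the store is
// draining. Lease transfers, by contrast, arrive as Raft commands and cannot
// be rejected by the recipient.
func (s *store) handleSnapshot() error {
	if s.draining {
		return errStoreDraining
	}
	// ... normal snapshot handling would follow here ...
	return nil
}
```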
[NB: previously this RFC proposed modifying the lease transfer mechanism so
that a draining node could immediately send back any leases transferred to it
while draining. Since the draining process is a best effort to avoid
unavailability, and it is not clear that this addition would be necessary or
produce significant benefits, the modification of the lease transfer mechanism
has been left out, but it could be implemented in the future if necessary.]

To decrease the probability of unavailability, an optional 10s timeout will be
introduced to wait for the gossip of the draining node's `Liveness` to
propagate. This option will be off by default, and operators will have the
option of turning it on to prioritize availability over shutdown speed.

After this step, leases that the node's replicas currently hold will be
transferred away using `AdminTransferLease(target)`, where `target` will be
found using `Allocator.TransferLeaseTarget`.

To allow commands that were sent to the node's replicas to complete, the
draining node will wait for its replicas' command queues to drain, up to a
timeout of 1s. This timeout is necessary in case new leases are transferred to
the node and it receives new commands.

## Server/adminServer:

```go
type DrainMode int

const (
	DrainClient DrainMode = 1 << iota
	DrainLeases
)

// For example, `s.Drain(DrainClient | DrainLeases)`
(*server.Server).Drain(mode DrainMode) error
```

and hook it up from `(*adminServer).handleQuit` (which currently has no way of
accessing `*Server`, only `*Node`), so a shortcut may be taken for the time
being if that seems opportune.

# Drawbacks

# Alternatives
* Don't offer a `drainMaxWait` timeout to ongoing transactions. The idea of the
  timeout is to allow clients a grace period in which to complete work. The
  issue is that this timeout is arbitrary, and it might make more sense to
  forcibly close these connections. It might also make sense to let operators
  set this timeout via an environment variable.
* Reacquire table leases when the node goes back up. This was suggested in
  [#9493](https://github.com/cockroachdb/cockroach/issues/9493) but does not
  fix the issue of having to wait for the lease to expire if the node does not
  come back up.

# Future work
* Move ranges to another node if we're draining the node for decommissioning,
  or reject a shutdown if not doing so would cause Raft groups to drop below
  quorum.
* Change the lease transfer mechanism so that the transferring node can hand
  over its timestamp cache's high-water mark, which would act as the low-water
  mark of the recipient's timestamp cache. This is conditioned on not
  [inserting reads in the command queue](https://forum.cockroachlabs.com/t/why-do-we-keep-read-commands-in-the-command-queue/360).

# Unresolved questions