- Feature Name: drain_modes
- Status: completed
- Start Date: 2016-04-25
- Authors: Tobias Schottdorf (tobias.schottdorf@gmail.com), Alfonso Subiotto Marqués
- RFC PRs: [#6283](https://github.com/cockroachdb/cockroach/pull/6283), [#10765](https://github.com/cockroachdb/cockroach/pull/10765)
- Cockroach Issue: [#9541](https://github.com/cockroachdb/cockroach/issues/9541), [#9493](https://github.com/cockroachdb/cockroach/issues/9493), [#6295](https://github.com/cockroachdb/cockroach/issues/6295)

# Summary
Propose a draining process for a CockroachDB node to perform a graceful
shutdown.

This draining process will be composed of two reduced modes of operation that
will be run in parallel:

* `drain-clients` mode: the server lets ongoing SQL clients finish their work up
  to some deadline, at which point their contexts are canceled and cleaned up,
  and then politely refuses new work at the gateway in the manner that popular
  load-balancing SQL clients can handle best.
* `drain-leases` mode: all range leases are transferred away from the node, new
  ones are not granted (in turn disabling most queues and active gossiping), and
  the draining node will not be a target for lease or range transfers. The
  draining node will decline any preemptive snapshots.

These modes are not related to the existing `Stopper`-based functionality and
can be run independently (i.e. a server can temporarily run in `drain-clients`
mode).

# Motivation
In our Stopper usage, we've taken to fairly ruthlessly shutting down service of
many components once `(*Stopper).Quiesce()` is called. In code, this manifests
itself through copious use of the `(*Stopper).ShouldQuiesce()` channel even when
not running inside of a task (or running long-running operations inside tasks).

This was motivated mainly by endless amounts of test failures around leaked
goroutines or deadlocking operations and has served us reasonably well, keeping
our moving parts in check.

However, now that we're looking at clusters and their maintenance, this simple
"drop-all" approach isn't enough. See, for example, #6198, #6197 or #5279. This
RFC outlines proposed changes to accommodate use cases such as:

* clean shutdown: instead of dropping everything on the floor, politely block
  new clients and finish open work up to some deadline, transfer/release all
  leases to avoid a period of unavailability, and only then initiate shutdown
  via `stopper.Stop()`. This clean shutdown avoids:
    * Blocking schema changes on the order of minutes because table descriptor
      leases are not properly released [#9493](https://github.com/cockroachdb/cockroach/issues/9493)
    * Increased per-range activity caused by nodes trying to pick up expired
      epoch-based range leases.
    * Leaving around intents from ongoing transactions.
* draining data off a node (i.e. have all or some subset of ranges migrate
  away, and wait until that has happened).
  Used for decommissioning, but could also be used to change hard-drives
  cleanly (i.e. drain data off a store, clean shutdown, switch hdd, start).
* decommission (permanently remove), which is "most of the above" with the
  addition of announcing the downtime as forever to the cluster.
* drain for update/migration (see the `{F,Unf}reeze` proposal in #6166):
  Essentially here we'll want to have the cluster in drain-clients mode and then
  propose a `Freeze` command on all Ranges.

This work will also result in:
* Auditing and improving our cancellation mechanism for individual queries.
* Improving compatibility with server health-checking mechanisms used by
  third-party proxies.

# Detailed design
Implement the following:

```go
// Put the SQL server into drain-clients mode within deadline, cancelling
// connections as necessary. A zero deadline undoes the effects of any
// prior drain operation.
(*pgwire.Server).DrainClients(deadline time.Time) error

// Put the Node into drain-lease mode if requested, or undo any previous
// mode change.
(*server.Node).DrainLeases(bool) error
```

## SQL/drain-clients mode:
The `v3conn` struct will be extended to point to the server. Before and after
reading messages from the client, a `v3conn` will check the draining status on
the server. If both `draining` is set to `true` and `v3conn.session.TxnState.State`
is `NoTxn`, the `v3conn` will exit the read loop, send an
[appropriate error code](#note-on-closing-sql-connections), and close the
connection, thereby denying the reading and execution of statements that aren't
part of an ongoing transaction. The `v3conn` will also be made aware of
cancellation of its session's context, which will be handled in the same way.
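
For illustration, a minimal sketch of this check (the stub types, the
`serverDraining` callback and `errAdminShutdown` are assumptions, not the actual
pgwire code):

```go
import (
	"context"
	"errors"
)

// Illustrative stand-ins for the relevant session state.
type txnStateKind int

const (
	noTxn txnStateKind = iota
	openTxn
)

// errAdminShutdown stands in for the error described in the note below.
var errAdminShutdown = errors.New("pq: 57P01 admin_shutdown")

// checkDrain is the check performed before and after reading a client
// message: if the server is draining and no transaction is open, or if the
// session's context has been canceled, the connection is terminated.
func checkDrain(ctx context.Context, serverDraining func() bool, state txnStateKind) error {
	if serverDraining() && state == noTxn {
		return errAdminShutdown
	}
	select {
	case <-ctx.Done():
		// The session's context was canceled by the drain process.
		return errAdminShutdown
	default:
		return nil
	}
}
```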

All active sessions will be found through the session registry
([#10317](https://github.com/cockroachdb/cockroach/pull/10317/)). Sessions will
be extended with a `done` channel which will be used by `pgwire.Server` to
listen for the completion of sessions. Sessions will be created with a context
derived from a cancellable parent context, thus offering a single point from
which to cancel all sessions.

Additionally, `pgwire.Server` will not create new sessions once in `draining`
mode.
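
A minimal sketch of this bookkeeping, under the assumption of an illustrative
`registry` type (the real session registry and `pgwire.Server` differ):

```go
import (
	"context"
	"sync"
)

// session is an illustrative stand-in for a SQL session.
type session struct {
	ctx    context.Context
	cancel context.CancelFunc
	done   chan struct{} // closed when the session finishes
}

// registry tracks active sessions and refuses new ones while draining.
type registry struct {
	mu        sync.Mutex
	draining  bool
	parent    context.Context    // parent of all session contexts
	cancelAll context.CancelFunc // cancels every session at once
	sessions  map[*session]struct{}
}

func newRegistry() *registry {
	ctx, cancel := context.WithCancel(context.Background())
	return &registry{
		parent:    ctx,
		cancelAll: cancel,
		sessions:  make(map[*session]struct{}),
	}
}

// newSession derives the session's context from the cancellable parent,
// giving the server a single point from which to cancel all sessions.
// It refuses to create sessions once the server is draining.
func (r *registry) newSession() (*session, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.draining {
		return nil, false
	}
	ctx, cancel := context.WithCancel(r.parent)
	s := &session{ctx: ctx, cancel: cancel, done: make(chan struct{})}
	r.sessions[s] = struct{}{}
	return s, true
}
```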

After `draining` behavior is enabled, we have three classes of clients
characterized by the behavior of the server:
* Blocked in the read loop with no active transaction. No execution of any
  statements is ongoing.
* Blocked in the read loop with an active transaction. No execution of any
  statements is ongoing but more are expected.
* Not blocked in the read loop. The execution of (possibly a batch of)
  statements is ongoing.

A first pass through the registry will collect all channels that belong to
sessions of the second and third class of clients. These clients will be given a
deadline of `drainMaxWait` (default of 10s) to finish any work. Note that once
work is complete, the `v3conn` will not block in the read loop and will instead
exit. Once `drainMaxWait` has elapsed or there are no more sessions with active
transactions, the parent context of all sessions will be canceled. Since a
derived context is used to send RPCs, these RPCs will be canceled, resulting in
the propagation of errors back to the Executor and the client. Additionally,
plumbing will have to be added so that this context cancellation also interrupts
important local work done on the nodes.

`pgwire.Server` will listen for the completion of the remaining sessions up to
a timeout of 1s. Some sessions might keep going indefinitely despite a canceled
context.
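
Put together, the client-drain sequence could look roughly like the following
sketch (`drainMaxWait`, the channel slice and the `cancelAll` parameter are
illustrative; the real implementation lives in `pgwire.Server`):

```go
import (
	"context"
	"time"
)

const (
	drainMaxWait     = 10 * time.Second // grace period for in-flight work
	finalSessionWait = 1 * time.Second  // wait after canceling all sessions
)

// drainClients gives sessions with in-flight work up to drainMaxWait to
// finish, cancels the shared parent context, and then waits up to one more
// second for the remaining sessions to wind down.
func drainClients(cancelAll context.CancelFunc, busy []<-chan struct{}) {
	grace := time.After(drainMaxWait)
	remaining := busy
	for len(remaining) > 0 {
		select {
		case <-remaining[0]: // this session finished on its own
			remaining = remaining[1:]
		case <-grace:
			remaining = nil // grace period exhausted
		}
	}

	// Cancel every session through the shared parent context; contexts
	// derived for RPCs are canceled too, propagating errors to clients.
	cancelAll()

	// Some sessions may keep going indefinitely despite the canceled
	// context, so bound the final wait.
	final := time.After(finalSessionWait)
	for _, done := range busy {
		select {
		case <-done:
		case <-final:
			return
		}
	}
}
```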

The final stage is to delete table descriptor leases from the `system.lease`
table to avoid blocking future schema changes. `LeaseManager` will be extended
to delete all leases from `system.lease` with a `refcount` of 0. Since there
might still be ongoing sessions, `LeaseManager` will enter a `draining` mode in
which any lease whose `refcount` is decremented to 0 will be deleted
from `system.lease`. If sessions are still ongoing after this point, a log
message will warn of this fact.
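
A rough sketch of that refcount bookkeeping (the `leaseTracker` type and its
`release` callback are illustrative stand-ins for `LeaseManager` and the
`system.lease` deletion):

```go
import "sync"

// leaseTracker is an illustrative stand-in for the LeaseManager's
// per-descriptor lease bookkeeping.
type leaseTracker struct {
	mu       sync.Mutex
	draining bool
	refcount map[int64]int        // descriptor ID -> active users of the lease
	release  func(id int64) error // deletes the lease row from system.lease
}

// setDraining deletes all leases that are no longer referenced and arranges
// for the remaining ones to be deleted as soon as they become unused.
func (t *leaseTracker) setDraining() error {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.draining = true
	for id, rc := range t.refcount {
		if rc == 0 {
			if err := t.release(id); err != nil {
				return err
			}
			delete(t.refcount, id)
		}
	}
	return nil
}

// decRef is called when a session stops using a lease; while draining,
// the lease is deleted from system.lease once its refcount drops to 0.
func (t *leaseTracker) decRef(id int64) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.refcount[id]--
	if t.draining && t.refcount[id] == 0 {
		delete(t.refcount, id)
		return t.release(id)
	}
	return nil
}
```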

### Note on closing SQL connections
The load balancing solutions for Postgres that seem to be the most popular are
[PGPool](http://www.pgpool.net/docs/latest/en/html/) and
[HAProxy](https://www.haproxy.com/). Both systems have health check mechanisms
that try to establish a SQL
connection[[1]](http://www.pgpool.net/docs/latest/en/html/#HEALTH_CHECK_USER)
[[2]](https://www.haproxy.com/doc/aloha/7.0/haproxy/healthchecks.html#checking-a-pgsql-service).
HAProxy also supports doing only generic TCP checks. Because of how these health
checks are performed, sending back a SQL error during the establishment of a
connection will result in the draining node being correctly marked as down.

For failures during an existing or new session, PGPool has a
`fail_over_on_backend_error` option which is triggered when

> 57P01 ADMIN SHUTDOWN
> admin_shutdown

is received from the backend. PGPool will retain session information under some
conditions[[3]](http://pgsqlpgpool.blogspot.com/2016/07/avoiding-session-disconnection-while.html)
and can reconnect transparently to a different backend.

HAProxy does not have similar functionality and both SQL/TCP errors are
forwarded to the client.

We should therefore return

> 57P01 ADMIN SHUTDOWN
> admin_shutdown

to reject new connections and close existing ones for compatibility with
PGPool's automatic failover and both PGPool's and HAProxy's health checks.
However, when HAProxy is being used, clients that want transparent failover
should retry when the establishment of a connection errors out.
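
For reference, a sketch of the error a draining gateway would send before
closing the connection (the struct is illustrative; the real server encodes
these fields through its existing pgwire ErrorResponse path):

```go
// adminShutdownError carries the fields of interest of a pgwire
// ErrorResponse message; 57P01 is the SQLSTATE for admin_shutdown.
type adminShutdownError struct {
	Severity string
	Code     string
	Message  string
}

func newAdminShutdownError() adminShutdownError {
	return adminShutdownError{
		Severity: "FATAL",
		Code:     "57P01",
		Message:  "server is shutting down",
	}
}
```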

## Node/drain-leases mode:
`(*server.Node).DrainLeases(bool)` iterates over its store list and delegates
to all contained stores. `(*Replica).redirectOnOrAcquireLease` checks
with its store before requesting a new or extending an existing lease.
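
Schematically (a sketch; the `store` interface and `drainLeases` helper stand in
for the real Node/Store plumbing):

```go
// store captures the single method this step needs from a Store.
type store interface {
	SetDraining(bool)
}

// drainLeases fans the draining flag out to every store; replicas then
// consult their store's flag in redirectOnOrAcquireLease before acquiring
// or extending a lease.
func drainLeases(stores []store, drain bool) error {
	for _, s := range stores {
		s.SetDraining(drain)
	}
	return nil
}
```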

The `Liveness` proto will be extended with a `draining` field which will be
taken into account in `Allocator.TransferLeaseTarget` and
`Allocator.AllocateTarget`, resulting in no leases or ranges being transferred
to a node that is known to be draining. Updating the node liveness record will
trigger a gossip of the node's draining mode.
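
A sketch of the allocator-side filtering (the `liveness` struct and
`filterDraining` helper are illustrative stand-ins for the extended Liveness
record and the allocator's candidate selection):

```go
// liveness is an illustrative stand-in for the Liveness proto with the
// proposed draining field.
type liveness struct {
	NodeID   int32
	Draining bool
}

// filterDraining removes draining nodes from a list of candidate targets,
// as Allocator.TransferLeaseTarget and Allocator.AllocateTarget would.
func filterDraining(candidates []int32, byNode map[int32]liveness) []int32 {
	var out []int32
	for _, nodeID := range candidates {
		if l, ok := byNode[nodeID]; ok && l.Draining {
			continue // never pick a node known to be draining
		}
		out = append(out, nodeID)
	}
	return out
}
```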

Transfers to a draining node could still happen before the gossiped `Liveness`
has reached the source of the transfer. This will only be handled in the case
of range transfers.

A draining node will decline a snapshot in `HandleSnapshot` if its store is
draining. In the case of lease transfers, since they are proposed and applied as
raft commands, there is no way for the recipient to reject a lease.
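
Sketch of the receiving-store check (illustrative; the real `HandleSnapshot`
signature and error handling differ):

```go
import "errors"

// errStoreDraining is returned to the sender so it can pick another target
// for the preemptive snapshot.
var errStoreDraining = errors.New("store is draining; declining snapshot")

// handleSnapshot declines incoming snapshots while the store is draining.
// Incoming lease transfers, by contrast, cannot be rejected here because
// they are applied as Raft commands.
func handleSnapshot(storeDraining bool, apply func() error) error {
	if storeDraining {
		return errStoreDraining
	}
	return apply()
}
```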

[NB: previously this RFC proposed modifying the lease transfer mechanism so that
a draining node could immediately send back any leases transferred to it while
draining. Since the draining process is a best effort to avoid unavailability
and it is not clear that this addition is necessary or would produce significant
benefits, the modification of the lease transfer mechanism has been left out
but could be implemented in the future if necessary.]

To decrease the probability of unavailability, an optional 10s timeout will be
introduced to wait for the gossip of the draining node's `Liveness` to
propagate. This option will be off by default and operators will have the
option of turning it on to prioritize availability over shutdown speed.

After this step, leases that a node's replicas currently hold will be
transferred away using `AdminTransferLease(target)` where `target` will be found
using `Allocator.TransferLeaseTarget`.
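
A sketch of that step, with the two function parameters standing in for
`Allocator.TransferLeaseTarget` and `AdminTransferLease`:

```go
// transferAllLeases hands every lease currently held by the node's replicas
// to a target chosen by the allocator.
func transferAllLeases(
	leaseholderRangeIDs []int64,
	transferLeaseTarget func(rangeID int64) (target int32, ok bool),
	adminTransferLease func(rangeID int64, target int32) error,
) error {
	for _, rangeID := range leaseholderRangeIDs {
		target, ok := transferLeaseTarget(rangeID)
		if !ok {
			continue // no suitable target; let the lease expire instead
		}
		if err := adminTransferLease(rangeID, target); err != nil {
			return err
		}
	}
	return nil
}
```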

To allow commands that were sent to the node's replicas to complete, the
draining node will wait for its replicas' command queues to drain up to a
timeout of 1s. This timeout is necessary in the case that new leases are
transferred to the node and it receives new commands.
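
Sketch of that bounded wait (illustrative; the per-replica signal is assumed to
be a channel that closes once its command queue is empty):

```go
import "time"

// waitForCommandQueues waits for each replica's command queue to drain, but
// gives up after one second in case new leases (and hence new commands)
// keep arriving at the node.
func waitForCommandQueues(queuesDrained []<-chan struct{}) {
	timeout := time.After(1 * time.Second)
	for _, drained := range queuesDrained {
		select {
		case <-drained:
		case <-timeout:
			return
		}
	}
}
```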

## Server/adminServer:

```go
type DrainMode int

const (
	DrainClient DrainMode = 1 << iota
	DrainLeases
)

// For example, `s.Drain(DrainClient | DrainLeases)`
(*server.Server).Drain(mode DrainMode) error
```

and hook it up from `(*adminServer).handleQuit` (which currently has no way of
accessing `*Server`, only `*Node`, so a shortcut may be taken for the time
being if that seems opportune).
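
For completeness, a sketch of how `(*server.Server).Drain` could dispatch to the
two modes (the field names, the interface shapes and the reuse of `drainMaxWait`
as the client deadline are assumptions):

```go
import "time"

const drainMaxWait = 10 * time.Second

// Server is an illustrative stand-in; the real *server.Server holds many
// more fields.
type Server struct {
	pgServer interface{ DrainClients(deadline time.Time) error }
	node     interface{ DrainLeases(drain bool) error }
}

type DrainMode int

const (
	DrainClient DrainMode = 1 << iota
	DrainLeases
)

// Drain delegates each requested mode to the corresponding subsystem.
func (s *Server) Drain(mode DrainMode) error {
	if mode&DrainClient != 0 {
		if err := s.pgServer.DrainClients(time.Now().Add(drainMaxWait)); err != nil {
			return err
		}
	}
	if mode&DrainLeases != 0 {
		if err := s.node.DrainLeases(true); err != nil {
			return err
		}
	}
	return nil
}
```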

# Drawbacks

# Alternatives
* Don't offer a `drainMaxWait` timeout to ongoing transactions. The idea of the
  timeout is to allow clients a grace period in which to complete work. The
  issue is that this timeout is arbitrary and it might make more sense to
  forcibly close these connections. It might also make sense to let operators
  set this timeout via an environment variable.
* Reacquire table leases when the node goes back up. This was suggested in
  [#9493](https://github.com/cockroachdb/cockroach/issues/9493) but does not fix
  the issue of having to wait for the lease to expire if the node does not come
  back up.

# Future work
* Move ranges to another node if we're draining the node for decommissioning or
  reject a shutdown if not doing so would cause Raft groups to drop below
  quorum.
* Change the lease transfer mechanism so a transferrer can transfer its
  timestamp cache's high water mark, which would act as the low water mark of
  the recipient's timestamp cache. This is conditioned on not [inserting reads in
  the command queue](https://forum.cockroachlabs.com/t/why-do-we-keep-read-commands-in-the-command-queue/360).

# Unresolved questions