
     1  - Feature Name: Version Migration
     2  - Status: completed
     3  - Start Date: 2017-07-10
     4  - Authors: Spencer Kimball, Tobias Schottdorf
     5  - RFC PR: [#16977](https://github.com/cockroachdb/cockroach/pull/16977), [#17216](https://github.com/cockroachdb/cockroach/pull/17216), [#17411](https://github.com/cockroachdb/cockroach/pull/17411), [#17694](https://github.com/cockroachdb/cockroach/pull/17694)
     6  - Cockroach Issue(s): [#17389](https://github.com/cockroachdb/cockroach/issues/17389)
     7  
     8  
     9  # Summary
    10  
    11  This RFC proposes a mechanism which allows point release upgrades (i.e. 1.1 to
    12  1.2) of CockroachDB clusters using a rolling restart followed by an
    13  operator-issued cluster-wide version bump that represents a promise that the old
    14  version is no longer running on any of the nodes in the cluster, and that
    15  backwards-incompatible features can be enabled.
    16  
    17  This is achieved through a version which can be queried at server runtime, and
    18  which is populated from
    19  
    20  1. a new cluster setting `version`, and
    21  1. a persisted version on the Store's engines.
    22  
The main concern in designing the mechanism here is operator friendliness. We
don't want to make the process more complicated than necessary; it should be
scriptable, and it should be hard to get wrong (and if you do get it wrong, it
shouldn't matter).
    26  
    27  # Motivation
    28  
    29  We have committed to supporting rolling restarts for version upgrades, but we do
    30  not have a mechanism in place to safely enable those when backwards-incompatible
    31  features are introduced.
    32  
The core problem is that incompatible features must not be active in a
mixed-version cluster, yet a mixed-version cluster is precisely what a rolling
restart produces while it is in progress. At the same time, we expect a handful
of backwards-incompatible changes in each release.
    37  
    38  What's needed is thus a mechanism that allows a rolling restart into the new
    39  binary to be performed while holding back on using the new, incompatible,
    40  features, and this is what's introduced in this RFC.
    41  
    42  # Guide-level explanation
    43  
We'll identify the "moving parts" in this machinery first and then walk through
an example migration from 1.1 to 1.2, which contains exactly one incompatible
change (it actually lands in `v1.1`, but let's forget that), namely:
    47  
    48  On splits,
    49  
    50  - v1.1 was creating a Raft `HardState` while *proposing* the split. By the time
    51    it was applied, it was potentially stale, which could lead to anomalies.
    52  - v1.2 does not write a `HardState` during the split, but it writes a "correct"
    53    `HardState` immediately before the right-hand side's Raft group is booted up.
    54  
This is an incompatible change because in a mixed cluster in which v1.2 issues
the split and v1.1 applies it, the node at v1.1 would end up without a
`HardState` and immediately crash. We need a proper migration -- `v1.2` must know
when it is safe to use the new behavior; until then it must use the previous
(incorrect) behavior and hope that the anomaly doesn't strike in the meantime.
    60  
    61  You can forget about the precise change now, but it's useful to see that this
    62  one is incompatible but absolutely necessary - we can't fix it in a way that
    63  works around the incompatibility, and we can't keep the status quo.
    64  
    65  Now let's do the actual update. For that, the operator
    66  
1. verifies the output of `SHOW CLUSTER SETTING version` on each node to assert
   that they are running `v1.1` (it could be that the operator previously
   updated the *binary* from `v1.0` to `v1.1` but forgot to actually bump the
   version in use!),
1. performs a rolling restart into the `v1.2` binary,
1. checks that the nodes are *really* running a `v1.2` binary. This includes
   making sure that auto-scalers, failover, etc., would spawn nodes at `v1.2`,
    74  1. in the meantime the nodes are running `v1.2`, but with `v1.1`-compatible
    75     features, until the operator
    76  1. runs `SET CLUSTER SETTING version = '1.2'`.
    77  1. The cluster now operates at `v1.2`.
    78  
We now walk through what the last step does under the hood in detail.
    80  
    81  ## Recap: how cluster settings work
    82  
    83  Each node has a number of "settings variables" which have hard-coded default
    84  values. To change a variable from that default value, one runs `SET CLUSTER
    85  SETTING settingname = 'settingval'`.
    86  
    87  Under the hood, this roughly translates to `INSERT INTO system.settings
    88  VALUES(settingname, settingval)`, and we have special triggers in place which
    89  gossip the contents of this table whenever they change.
    90  
When a node receives such a gossip update, it goes through its (hard-coded)
list of setting variables, populates them with the values from the table, and
resets all unmentioned variables to their default values.
    94  
To show the values of the setting variables, one can run `SHOW ALL CLUSTER
SETTINGS`. Note that its output can vary between nodes, e.g. if a node hasn't
received the most recent updates yet. To read from the actual table, use
`SELECT ... FROM system.settings`; this shows only those settings for which an
explicit `SET CLUSTER SETTING` has been issued.
   100  
   101  Note that the `version` variable has additional logic to be detailed in the next
   102  section.
   103  
   104  ## Running `SET CLUSTER SETTING version = '1.2'`
   105  
   106  What actually happens when the cluster version is bumped? There are a few things
   107  that we want.
   108  
   109  1. The operator should not be able to perform "illegal" version bumps. A legal
   110     bump is from one version to the next: 1.0 to 1.1, 1.1 to 1.2, perhaps 1.6 to
   111     2.0 (if 1.6 is the last release in the 1.x series), but *not* 1.0 to 1.3 and
   112     definitely not `v1.0` to `v5.0`.
   113  
   114     This immediately tells us that the `version` setting needs to take into
   115     account its existing value before updating to the new one, and may need
   116     to return an error when its validation logic fails.
   117  
   118     This also suggests that it should use the existing value from the
   119     `system.settings` table, and not from `SHOW` (which may not be
   120     authoritative).
   121  2. If we use the `system.settings` table naively, we may have a problem: settings
   122     don't persist their default value, and in particular, a new cluster (or one
   123     started during 1.0) does not have a `version` setting persisted. It is hard
   124     to persist a setting during bootstrap since the `settings` table is only
   125     created later. It's tricky to correctly populate that table once the cluster
   126     is running because all you know about is your current node, but what if
   127     you're accidentally running 3.0 while the real cluster is at 1.0?
   128  
   129  We defer the problem of populating the settings table to the detailed design
   130  section and now assume that there *is* a `version` entry in the settings table.
   131  
   132  Then what happens on a version bump is clear: the existing version is read, it
   133  is checked whether the new version is a valid successor version, and if so, the
   134  entry is transactionally updated. On error, the operator running the `SET
   135  CLUSTER SETTING` command will receive a descriptive message such as:
   136  
   137  ```
   138  cannot upgrade to 2.0: node running 1.1
   139  cannot upgrade directly from 1.0 to 1.3
   140  cannot downgrade from 1.0 to 0.9
   141  ```
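
The exact check lives in the `version` setting's update logic described in the
reference-level section below. The following is a minimal, self-contained
sketch of the successor rule only; the names (`version`, `isValidBump`) are
hypothetical and the "node running" error above additionally involves the
node's binary version, which this sketch omits:

```go
package main

import "fmt"

// version mirrors the Major/Minor fields of roachpb.Version for illustration.
type version struct{ Major, Minor int32 }

func (v version) less(o version) bool {
	return v.Major < o.Major || (v.Major == o.Major && v.Minor < o.Minor)
}

// isValidBump is a hypothetical stand-in for the successor check performed
// when `SET CLUSTER SETTING version = ...` is run.
func isValidBump(cur, proposed version) error {
	switch {
	case proposed.less(cur):
		return fmt.Errorf("cannot downgrade from %d.%d to %d.%d",
			cur.Major, cur.Minor, proposed.Major, proposed.Minor)
	case proposed == cur:
		return nil // re-setting the current version is a no-op
	case proposed.Major == cur.Major && proposed.Minor == cur.Minor+1:
		return nil // next minor version, e.g. 1.1 -> 1.2
	case proposed.Major == cur.Major+1 && proposed.Minor == 0:
		return nil // next major version, e.g. 1.6 -> 2.0
	default:
		return fmt.Errorf("cannot upgrade directly from %d.%d to %d.%d",
			cur.Major, cur.Minor, proposed.Major, proposed.Minor)
	}
}

func main() {
	fmt.Println(isValidBump(version{1, 0}, version{1, 3})) // rejected
	fmt.Println(isValidBump(version{1, 1}, version{1, 2})) // <nil>
}
```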
   142  
   143  Assuming success, the settings machinery picks up the new version, and populates
   144  the in-memory version setting, from which it can be inspected by the running
   145  node.
   146  
   147  To probe the running version, essentially, all moving parts in the node hold on
   148  to a variable that implements the following interface (and in the background is
   149  hooked up to the version cluster setting appropriately):
   150  
   151  ```go
   152  type Versioner interface {
   153    IsActive(version roachpb.Version) bool
   154    Version() cluster.ClusterVersion // ignored for now
   155  }
   156  ```
   157  
   158  For instance, before having run `SET CLUSTER SETTING version = '1.1'`, we have
   159  
   160  ```go
   161  IsActive(roachpb.Version{Major: 1, Minor: 0}) == true
   162  IsActive(roachpb.Version{Major: 1, Minor: 1}) == false
   163  ```
   164  
   165  even though the binary is capable of running `v1.1`.
   166  
   167  After having bumped the cluster version, we instead get
   168  
   169  ```go
   170  IsActive(roachpb.Version{Major: 1, Minor: 0}) == true
   171  IsActive(roachpb.Version{Major: 1, Minor: 1}) == true
   172  ```
   173  
   174  but (still)
   175  
   176  ```go
   177  IsActive(roachpb.Version{Major: 1, Minor: 2}) == false
   178  ```
   179  
   180  To return back to our incompatible feature example, we
   181  would have code like this:
   182  
   183  ```go
   184  func proposeSplitCommand() {
   185    p := makeSplitProposal()
   186    if !v.IsActive(roachpb.Version{Major: 1, Minor: 1}) {
   187      // Preserve old behavior at v1.0.
   188      p.hardState = makePotentiallyDangerousHardState()
   189    }
   190    propose(p)
   191  }
   192  
   193  func applySplit(p splitProposal) {
   194    raftGroup := apply(p)
   195    if v.IsActive(roachpb.Version{Major: 1, Minor: 1}) {
   196      // Enable new behavior only if at v1.1 or later.
   197      hardState := makeHardState()
   198      writeHardState(hardState)
   199    }
   200    raftGroup.GoLive()
   201  }
   202  ```
   203  
   204  Some features may require an explicit "ping" when the version gets bumped. Such
   205  a mechanism is easy to add once it's required; we won't talk about it any more
   206  here.
   207  
Note that a server always accepts features that require the new version when
their use is prompted by another node, even if it hasn't yet been told that they
are safe to use. For instance, if a new inter-node RPC is introduced, nodes
should always respond to it if they can (i.e. they know about the RPC). A bump
in the cluster version propagates through the cluster asynchronously, so a node
may start using a new feature before others realize it is safe.
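
A compressed illustration of this asymmetry, using stand-in types rather than
the real RPC plumbing: the sender gates use of a hypothetical new request on
`IsActive`, while the receiver serves it unconditionally because a peer may
have observed the version bump first.

```go
package main

import "fmt"

// Stand-ins for roachpb.Version and the Versioner interface, for illustration.
type version struct{ Major, Minor int32 }

type node struct{ active version }

func (n node) IsActive(q version) bool {
	return q.Major < n.active.Major ||
		(q.Major == n.active.Major && q.Minor <= n.active.Minor)
}

// The sender only issues the hypothetical new request once the cluster
// version permits it.
func send(n node) string {
	if n.IsActive(version{Major: 1, Minor: 1}) {
		return "NewRequest"
	}
	return "LegacyRequest"
}

// The receiver handles either request regardless of its own view of the
// cluster version, since the bump propagates asynchronously.
func serve(req string) string {
	return "handled " + req
}

func main() {
	fmt.Println(serve(send(node{active: version{Major: 1, Minor: 0}}))) // handled LegacyRequest
	fmt.Println(serve(send(node{active: version{Major: 1, Minor: 1}}))) // handled NewRequest
}
```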
   214  
   215  ## Development versions
   216  
   217  During the development cycle, new backwards-incompatible migrations may need to
   218  be introduced. For this, we use "unstable" versions, which are written
   219  `<major>.<minor>-<unstable>`; while stable releases will always have `<unstable>
   220  == 0`, each unstable change gets a unique, strictly incrementing unstable
   221  version component. For instance, at the time of writing (`v1.1-alpha`), we have
   222  the following:
   223  
   224  ```go
   225  var (
   226  	// VersionSplitHardStateBelowRaft is https://github.com/cockroachdb/cockroach/pull/17051.
   227  	VersionSplitHardStateBelowRaft = roachpb.Version{Major: 1, Minor: 0, Unstable: 2}
   228  
   229  	// VersionRaftLogTruncationBelowRaft is https://github.com/cockroachdb/cockroach/pull/16993.
   230  	VersionRaftLogTruncationBelowRaft = roachpb.Version{Major: 1, Minor: 0, Unstable: 1}
   231  
   232  	// VersionBase corresponds to any binary older than 1.0-1,
   233  	// though these binaries won't know anything about the mechanism in which
   234  	// this version is used.
   235  	VersionBase = roachpb.Version{Major: 1}
   236  )
   237  ```
   238  
   239  Note that there is no `v1.1` yet. This version will only exist with the stable
   240  `v1.1.0` release.
   241  
   242  Tagging the unstable versions individually has the advantage that we can
   243  properly migrate our test clusters simply through a rolling restart, and then a
   244  version bump to `<major>.<minor>-<latest_unstable>` (it's allowed to enable
   245  multiple unstable versions at once).
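
To make that ordering concrete, here is a small, self-contained sketch (the
`version` struct and `less` helper are stand-ins for `roachpb.Version` and its
comparison): every unstable version sorts after its base release and before the
next stable release, which is why bumping to the latest unstable version
activates all earlier unstable migrations as well.

```go
package main

import "fmt"

// version mirrors the Major/Minor/Unstable fields of roachpb.Version.
type version struct{ Major, Minor, Unstable int32 }

func (v version) less(o version) bool {
	if v.Major != o.Major {
		return v.Major < o.Major
	}
	if v.Minor != o.Minor {
		return v.Minor < o.Minor
	}
	return v.Unstable < o.Unstable
}

func main() {
	base := version{Major: 1}                    // 1.0
	truncation := version{Major: 1, Unstable: 1} // 1.0-1
	hardState := version{Major: 1, Unstable: 2}  // 1.0-2
	stable := version{Major: 1, Minor: 1}        // the eventual stable 1.1

	fmt.Println(base.less(truncation), truncation.less(hardState), hardState.less(stable)) // true true true
}
```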
   246  
   247  ## Upgrade process (close to documentation)
   248  
   249  The upgrade process as we document it publicly will have to advise operators to
   250  create appropriate backups. They should roughly follow this checklist:
   251  
   252  ### Optional prelude: staging dry run
   253  - Start a **staging** cluster with the new version (e.g. 1.1).
   254  - Restore data from most recent backup(s) to staging cluster.
   255  - Tee production traffic or run load generators to simulate live
   256    traffic and verify cluster stability.
- Proceed to the upgrade process described above.
   258  
   259  ### Steps in production
   260  - Disable auto-scaling or other systems that could add a node with a conflicting
   261    version at an inopportune time.
   262  - Ensure that all nodes are either running or guaranteed to not rejoin the
   263    cluster after it has been updated.
   264      - We intend to protect the cluster from mismatched nodes, but the exact
   265        mechanism is TBD. Check back when writing the docs.
   266  - Create a (full or incremental) backup of the cluster.
   267  - Rolling upgrade of nodes to next version.
   268  - At this point, all nodes will be running the new binary, albeit with
   269    compatibility for the old one.
   270  - Verify no node running the old version remains in the cluster (and no new one
   271    will accidentally be added).
   272  - Verify basic cluster stability. If problems occur, a rolling downgrade is
   273    still an option.
   274  - Depending on how much time has passed, another incremental backup could be
   275    advisable.
   276  - `SET CLUSTER SETTING version = '<newversion>'`
   277  - In the event of a catastrophic failure or corruption due to usage of new
   278    features requiring 1.1, the only option is to restore from backup. This is a
   279    two step process: start a new cluster using the old binary, and then restore
   280    from the backup(s).
- Restore any orchestration settings (auto-scaling, etc.) back to their normal
  production values.
   283  
   284  ![Version migrations with rolling upgrades](images/version_migration.png?raw=true "Version migrations with rolling upgrades")
   285  
   286  # Reference-level explanation
   287  
   288  This section runs through the moving parts involved in the implementation.
   289  
   290  ## Detailed design
   291  
   292  ### Structures and nomenclature
   293  
The fundamental structure is the straightforward `roachpb.Version`:
   295  
```proto
message Version {
  optional int32 major = 1;    // the "2" in `v2.1`
  optional int32 minor = 2;    // the "1" in `v2.1`
  optional int32 patch = 3;    // placeholder; always zero
  optional int32 unstable = 4; // dev version; all stable versions have zero
}
```
   304  
   305  The situation gets complicated by the fact that our usage of the version as the `version` cluster setting is really a "cluster-wide minimum version", which informs the use of the name `MinimumVersion` in the following `ClusterVersion` proto:
   306  
```proto
message ClusterVersion {
  // The minimum version that any node in the cluster must support.
  // This value must increase monotonically.
  roachpb.Version minimum_version = 1 [(gogoproto.nullable) = false];
}
```
   314  
   315  This should make sense so far. However, discussion in this RFC has mandated that
   316  we include an ominous `UseVersion` as well. This emerged as a compromise after
giving up on allowing "rollbacks" of upgrades. Briefly put, `UseVersion` can be
smaller than `MinimumVersion`; it advises the server not to use new features
that it has the discretion to avoid. For example, assume `v1.1` contains a
   320  performance optimization (that isn't supported in a cluster running nodes at
   321  `v1.0`). After bumping the cluster version to `v1.1`, it turns out that the
   322  optimization is a horrible pessimization for the cluster's workload, and things
   323  start to break. The operator can then set `UseVersion` back to `v1.0` to advise
   324  the cluster to not use that performance optimization (even if it could). On the
   325  other hand, some other migrations it has performed (perhaps it rewrote some of
   326  its on-disk state to a new format) may not support being "deactivated", so they
   327  would continue to be in effect.
   328  
   329  This feature has not been implemented in the initial version, though it has been
   330  "plumbed". It will be ignored in this design from this point on, though no
   331  effort will be made to scrub it from code samples.
   332  
   333  ```
   334  message ClusterVersion {
   335    [...]
   336    // The version of functionality in use in the cluster. Unlike
   337    // minimum_version, use_version may be downgraded, which will
   338    // disable functionality requiring a higher version. However,
   339    // some functionality, once in use, can not be discontinued.
   340    // Support for that functionality is guaranteed by the ratchet
   341    // of minimum_version.
   342    roachpb.Version use_version = 2 [(gogoproto.nullable) = false];
   343  }
   344  ```
   345  
   346  The `system.settings` table entry for `version` is in fact a marshalled
   347  `ClusterVersion` (for which `MinimumVersion == UseVersion`).
   348  
   349  ### Server Configuration
   350  
The `ServerVersion` (type `roachpb.Version`, sometimes referred to as "binary
version") is baked into the binaries we release (in which case it equals
`cluster.BinaryServerVersion`, so our `v1.1` release will have
`ServerVersion == roachpb.Version{Major: 1, Minor: 1}`). However, internally,
`ServerVersion` is part of the configuration of a `*Server` and can be set
freely (which we do in tests).
   357  
Similarly, a `Server` is configured with `MinimumSupportedVersion` (which in
   359  release builds typically trails `cluster.BinaryServerVersion` by a minor
   360  version, reflecting the fact that it can run in a compatible way with its
   361  predecessor). If a server starts up with a store that has a persisted version
   362  smaller than this or larger than its `ServerVersion`, it exits with an error.
   363  We'll talk about store persistence in the corresponding subsection.
   364  
   365  ### Gossip
   366  
   367  The `NodeDescriptor` gossips the server's configured `ServerVersion`. This isn't
   368  used at the time of writing; see the unresolved section for discussion.
   369  
   370  ### Storage (persistence)
   371  
Once a node has received a cluster-wide minimum version from the settings table
via gossip, that version is used as the authoritative version at which the
server operates (unless the binary can't support it, in which case the node
exits with a fatal error).
   375  
   376  Typically, the server will go through the following transitions:
   377  
   378  - `ServerVersion` is (say) `v1.1`, runs `v1.1` (`MinimumVersion == UseVersion == v1.1`),
   379    stores have the above `MinimumVersion` and `UseVersion` persisted
   380  - rolling restart
   381  - `ServerVersion` is `v1.2`, runs `v1.1`-compatible (`MinimumVersion == UseVersion == v1.1`)
   382  - operator issues `SET CLUSTER SETTING version = '1.2'`
   383  - gossip received: stores updated to `MinimumVersion == UseVersion == v1.2`
   384  - new `MinimumVersion` (and equal `UseVersion`) exposed to running process:
   385    `ServerVersion` is (still) `v1.2`, but now `MinimumVersion == UseVersion == v1.2`.
   386  
   387  We need to close the gap between starting the server and receiving the above
   388  information, and we also want to prevent restarting into a too-recent version
   389  in the first place (for example restarting straight into `v1.3` from `v1.1`).
   390  
   391  To this end, whenever any `Node` receives a `version` from gossip, it writes it
   392  to a store local key (`keys.StoreClusterVersionKey()`) on *all* of its stores
   393  (as a `ClusterVersion`).
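
A rough sketch of that write path, assuming the historical `engine.MVCCPutProto`
helper and the package layout at the time of writing; the function and variable
names here are illustrative rather than the actual implementation:

```go
import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/keys"
	"github.com/cockroachdb/cockroach/pkg/settings/cluster"
	"github.com/cockroachdb/cockroach/pkg/storage/engine"
	"github.com/cockroachdb/cockroach/pkg/util/hlc"
)

// persistClusterVersion writes the gossiped ClusterVersion to the store-local
// key on every engine so that the node can synthesize it again at next start.
func persistClusterVersion(
	ctx context.Context, engines []engine.Engine, cv cluster.ClusterVersion,
) error {
	for _, eng := range engines {
		if err := engine.MVCCPutProto(
			ctx, eng, nil /* ms */, keys.StoreClusterVersionKey(),
			hlc.Timestamp{}, nil /* txn */, &cv,
		); err != nil {
			return err
		}
	}
	return nil
}
```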
   394  
   395  Additionally, when a cluster is bootstrapped, the store is populated with
   396  the running server's version (this will not happen for `v1.0` binaries as
   397  these don't know about this RFC).
   398  
   399  When a node starts up with new stores to bootstrap, it takes precautions to
   400  propagate the cluster version to these stores as well. See the unresolved
   401  questions for some discussion of how an illegal version joining an existing
   402  cluster can be prevented.
   403  
This seems simple enough, but the logic that reads from the stores has to deal
with the case in which the various stores have either no information (as can
happen when we boot into them from 1.0) or conflicting information. Roughly speaking,
   407  bootstrapping stores can afford to wait for an authoritative version from gossip
   408  and use that, and whenever we ask a store about its persisted `MinimumVersion`
   409  and it has none persisted, it counts as a store at
   410  `MinimumVersion=UseVersion=v1.0`. We make sure to write the version atomically
   411  with the bootstrap information (to make sure we don't bootstrap a store, crash,
   412  and then misidentify as `v1.0`). We also write all versions again after
   413  bootstrap to immunize against the case in which the cluster version was bumped
   414  mid-bootstrap (this could cause trouble if we add one store and then remove the
   415  original one between restarts).
   416  
   417  The cluster version we synthesize at node start is then the one with the largest
   418  `MinimumVersion` (and the smallest `UseVersion`).
   419  
   420  Examples:
   421  
   422  - one store at `<empty>`, another at `v1.1` results in `v1.1`.
   423  - two stores, both `<empty>` results in `v1.0`.
- three stores at `v1.1`, `v1.2` and `v1.3` result in `v1.3` (but this likely
  runs into an error anyway because the stores span more than one version)
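
A sketch of that synthesis rule, using the `cluster` and `roachpb` types
referenced throughout this document; the helper is hypothetical and glosses
over the validation against the binary's supported range, which also happens
at this point:

```go
// synthesizeClusterVersion combines the versions persisted on the node's
// stores; a store with nothing persisted counts as v1.0.
func synthesizeClusterVersion(persisted []*cluster.ClusterVersion) cluster.ClusterVersion {
	v10 := roachpb.Version{Major: 1} // v1.0
	less := func(a, b roachpb.Version) bool {
		if a.Major != b.Major {
			return a.Major < b.Major
		}
		if a.Minor != b.Minor {
			return a.Minor < b.Minor
		}
		return a.Unstable < b.Unstable
	}
	if len(persisted) == 0 {
		return cluster.ClusterVersion{MinimumVersion: v10, UseVersion: v10}
	}
	var result cluster.ClusterVersion
	for i, cv := range persisted {
		cur := cluster.ClusterVersion{MinimumVersion: v10, UseVersion: v10}
		if cv != nil {
			cur = *cv // this store had a version persisted
		}
		if i == 0 {
			result = cur
			continue
		}
		if less(result.MinimumVersion, cur.MinimumVersion) {
			result.MinimumVersion = cur.MinimumVersion // largest MinimumVersion wins
		}
		if less(cur.UseVersion, result.UseVersion) {
			result.UseVersion = cur.UseVersion // smallest UseVersion wins
		}
	}
	return result
}
```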
   426  
   427  ### The implementer of `Versioner`: `ExposedClusterVersion`
   428  
   429  `ExposedClusterVersion` is the central object that manages the information
   430  bootstrapped from the stores at node start and the gossiped central version and
   431  flips between the two at the appropriate moment. It also encapsulates the logic
   432  that dictates which version upgrades are admissible, and for that reason
   433  integrates fairly tightly with the `settings` subsystem. This is fairly complex
   434  and so appropriate detail is supplied below.
   435  
   436  We start out with the struct itself.
   437  
   438  ```go
type ExposedClusterVersion struct {
	MinSupportedVersion roachpb.Version // server configuration, does not change
	ServerVersion       roachpb.Version // server configuration, does not change

	// baseVersion stores a *ClusterVersion. It is initially zero (at which
	// point any calls to check the version are fatal) and is initialized with
	// the version from the stores early in the boot sequence. Later, it gets
	// updated with gossiped updates, but only *after* each update has been
	// written back to the disks (so that we don't expose anything to callers
	// that we may not see again if the node restarted).
	baseVersion atomic.Value

	// version is the cluster setting administering the `version` setting. On
	// change, it invokes logic that calls `cb` and then bumps `baseVersion`.
	version *settings.StateMachineSetting

	// cb is a callback into the node to persist a new gossiped MinimumVersion
	// (invoked before `baseVersion` is updated).
	cb func(ClusterVersion)
}
   456  
   457  // Version returns the minimum cluster version the caller may assume is in
   458  // effect. It must not be called until the setting has been initialized.
   459  func (ecv *ExposedClusterVersion) Version() ClusterVersion
   460  
   461  // BootstrapVersion returns the version a newly initialized cluster should have.
   462  func (ecv *ExposedClusterVersion) BootstrapVersion() ClusterVersion
   463  
   464  // IsActive returns true if the features of the supplied version are active at
   465  // the running version.
   466  func (ecv *ExposedClusterVersion) IsActive(v roachpb.Version) bool
   467  ```
   468  
The remaining complexity lies in the transformer function for
`version *settings.StateMachineSetting`. It contains all of the update logic
(modulo reading from the table: we've updated the settings framework to use the
table for all `StateMachineSettings`, of which this is the only instance at the
time of writing); `versionTransformer` takes
   474  
   475  - the previous encoded value (i.e. a marshalled `ClusterVersion`)
   476  - the desired transition, if any (for example "1.2").
   477  
   478  and returns
   479  
   480  - the new encoded value (i.e. the new marshalled `ClusterVersion`)
   481  - an interface backed by a "user-friendly" representation of the new state
   482    (i.e. something that can be printed)
   483  - an error if the input was illegal.
   484  
   485  The most complicated bits happen inside of this function for the following
   486  special cases:
   487  
   488  - when no previous encoded value is given, the transformer provides the "default
   489    value". In this case, it's `baseVersion`. In particular, the default value
   490    changes! This behaviour is required because when the initial gossip update
   491    comes in, it needs to be validated, and we also validate what users do during
   492    `SET CLUSTER SETTING version = ...`, which they could do through multiple
   493    versions.
- the transformer validates the new state and fails if either the node's own
  `ServerVersion` is below the new `MinimumVersion` or its `MinSupportedVersion`
  is newer than the new `MinimumVersion`.
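
Putting this together, the transformer has roughly the following shape (a
sketch based on the description above; the parameter names and the commented
body are illustrative only):

```go
// versionTransformer validates a proposed bump and produces the new encoded
// state. With no previous value it returns the (changing) default, i.e. the
// state corresponding to baseVersion.
func versionTransformer(
	prevEncoded []byte, // marshalled ClusterVersion, or nil if nothing is stored yet
	desired *string,    // e.g. "1.2"; nil means "just give me the current state"
) (newEncoded []byte, userFriendly interface{}, err error) {
	// 1. Decode prevEncoded, falling back to baseVersion if it is nil.
	// 2. If desired == nil, return the decoded state unchanged.
	// 3. Otherwise parse *desired and check that it is a valid successor of the
	//    current MinimumVersion and within [MinSupportedVersion, ServerVersion].
	// 4. Marshal the new ClusterVersion and return it.
	panic("sketch only")
}
```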
   497  
   498  ## Initially populating the settings table version entry
   499  
   500  ### Populating the settings table
   501  
   502  As outlined in the guide-level explanation, we'd like the settings table to hold
   503  the "current" cluster version for new clusters, but we have no good way of
   504  populating it at bootstrap time and don't have sufficient information to
   505  populate it in a foolproof manner later. The solution is presented at the end of
   506  this section. We first detail the "obvious" approaches and their shortcomings.
   507  
   508  #### Approaches that don't work
   509  
We could use the node's own "suspected" version as the cluster version when no
`version` is persisted in the table, but:
   512  
   513  - this is incorrect when, say, adding a v3.0 node to a v1.0 cluster and
   514    running `SET CLUSTER SETTING version = '3.1'` on that node:
   515    - in the absence of any other information (and assume that we don't try any
   516      "polling" of everyone's version which can quickly get out of hand), the node
   517      assumes the cluster version is `v3.1` (or `v3.0`).
   518    - it manages to write `version = v3.1` to the settings table.
   519    - all other nodes in the cluster die when they learn about that.
   520  - it would "work" (with `v1.1`) on all nodes running the "actual" version; the
   521    `v3.0` node would die instead, learning that the cluster is now at `v1.1`.
   522  - the classic case to worry about is that of an operator "skipping a version"
   523    during the rolling restart. We would catch this as the new binary can't run
   524    from the old storage directory (i.e. `v1.7` can't be started on a `v1.5`
   525    store).
   526  - it would equally be impossible to roll from `v1.5` into `v1.6` into `v1.7`
   527    (the storage markers would remain at `v1.5`).
- in effect, to realize the above problem in practice, an operator would have
  to add a brand new node running a version that is too new to a preexisting
  cluster and run the version bump on that node.
   531  - This might be an acceptable solution in practice, but it's hard to reason
   532    about and explain.
   533  
   534  An alternative (but apparently problematic) approach is adding a sql migration.
   535  The problem with those is that it's not clear which value the migration should
   536  insert into the table -- is it that of the running binary? That would do the
   537  wrong thing if a `v1.0` cluster is restarted into `v1.1` (which now has the
   538  migration); we need to insert `v1.0` in that case. On the other hand, after
   539  bootstrapping a `v1.x` cluster for `x > 0`, we want to insert `v1.x`.
   540  
And of course there is the third approach, which is writing the settings table
during actual bootstrapping. This seems too much work to be realistic at this
point in the cycle, and it may come with its own migration concerns.
   544  
   545  #### The combination that does work
   546  
   547  All of the previous approaches combined suggest a more workable combination:
   548  
   549  1. instead of populating the settings table at bootstrap, populate a new key
   550     `BootstrapVersion` (similar to the `ClusterIdent`). In effect, for the
   551     lifetime of this cluster, we can get an authoritative answer about the
   552     version at which it was bootstrapped.
   553  1. change the semantics of `SET CLUSTER SETTING version = x` so that when it
   554     doesn't find an entry in the `system.settings` table, it fails.
   555  1. add a sql migration that
   556      - reads `BootstrapVersion`
   557      - runs
   558          ```sql
   559          -- need this explicitly or the `SET` below would fail!
   560          UPSERT INTO system.settings VALUES(
   561            'version', marshal_appropriately('<bootstrap_version>')
   562          );
   563          -- Trigger proper gossip, etc, by doing "no-op upgrade".
   564          SET CLUSTER SETTING version = '<bootstrap_version>';
   565          ```
   566      - the cluster version is now set and operators can use `SET CLUSTER
   567        SETTING`.
   568  
This obviously works if the migration runs while the cluster is still running
the bootstrapped version, and it also works if the operator has already set the
version explicitly (`SET CLUSTER SETTING version = x` is idempotent).
   572  
   573  
   574  ### Bootstrapping new stores
   575  
   576  When an existing node restarts, it has on-disk markers that should reflect a
   577  reasonable version configuration to assume until gossip updates are in effect.
   578  
The situation is slightly different when a new node joins a cluster for the
first time. In this case, it'll bootstrap its stores using its binary's
`MinimumSupportedVersion` (for that is all it knows), which is usually one
minor version behind the cluster's active version.
   583  
This is not an issue since the node's binary can still participate in newer
features, and it will bump its version once it receives the first gossip
update, typically within seconds.
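
A compressed sketch of what gets persisted on such a brand-new store before the
first gossip update arrives (the helper name is hypothetical):

```go
// bootstrapStoreVersion is what a newly bootstrapped store on a joining node
// gets persisted at: all we know is that the cluster admitted us, so it runs
// at least the oldest version this binary can cooperate with.
func bootstrapStoreVersion(minSupported roachpb.Version) cluster.ClusterVersion {
	return cluster.ClusterVersion{
		MinimumVersion: minSupported,
		UseVersion:     minSupported,
	}
}
```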
   587  
   588  We could conceivably be smarter about finding out the active cluster version
   589  proactively (we're connected to gossip when we bootstrap), but this is not
   590  deemed worth the extra complexity.
   591  
   592  ## Drawbacks
   593  
   594  - relying on the operator to promise that no old version is around opens cluster
   595    health up to user error.
   596  - we can't roll back upgrades, which will make many users nervous
    - in particular, this disadvantages the OSS version of CockroachDB
   598  
   599  ## Rationale and Alternatives
   600  
The main concern in designing the mechanism here is operator friendliness. We
don't want to make the process more complicated than necessary; it should be
scriptable, and it should be hard to get wrong (and if you do get it wrong, it
shouldn't matter).
   604  
   605  The main concessions we make in this design are
   606  
1. no support for downgrades once the version setting has been bumped. This was
   discussed early in the life of this RFC but was discarded due to its inherent
   complexity and the large search space that would have to be tested.
   610  
   611     It will be difficult to retrofit this, though a change in the upgrade process
   612     can itself be migrated through the upgrade process presented here (though it
   613     would take one release to go through the transition).
   614  
   615     We can at least downgrade before bumping the cluster version setting though,
   616     which allows all backwards-compatible changes (ideally the bulk) to be
   617     tested.
   618  
   Additionally, we could implement the originally envisioned `UseVersion`
   functionality, so that a restore from backup only becomes necessary if a
   problem is a) severe and b) can't be "deactivated" (via `UseVersion`).
   623  1. relying on the operator to guarantee that a cluster is not running mixed
   624     versions when the explicit version bump is issued.
   625  
   626     There are ways in which this could be approximately inferred, but again it
   627     was deemed too complex given the time frame. Besides, operators may prefer to
   628     have some level of control over the migration process, and it is difficult to
   629     make an autonomous upgrade workflow foolproof.
   630  
   631     If desired in the future, this can be retrofitted.
   632  
As a result, we get a design that's ergonomic but limited. The complexity
inherent in it even as written indicates that not adding further complexity at
this point is a good choice. We are not locked into the process in the long
term.
   637  
   638  ## Unresolved questions
   639  
   640  ### Naming
   641  
   642  `MinimumVersion` and `MinimumSupportedVersion` are similar but also different.
   643  Perhaps the latter should be renamed, though no better name comes to mind.
   644  
   645  ### What to gossip
   646  
   647  We make no use of the gossiped `ServerVersion`. The node's git commit hash is
already available, so this is only mildly interesting. Its `MinimumVersion`
(plus its `UseVersion`, should that ever differ) would be more relevant. Likely
these should be added, even if the information isn't used today.