- Feature Name: Version Migration
- Status: completed
- Start Date: 2017-07-10
- Authors: Spencer Kimball, Tobias Schottdorf
- RFC PR: [#16977](https://github.com/cockroachdb/cockroach/pull/16977), [#17216](https://github.com/cockroachdb/cockroach/pull/17216), [#17411](https://github.com/cockroachdb/cockroach/pull/17411), [#17694](https://github.com/cockroachdb/cockroach/pull/17694)
- Cockroach Issue(s): [#17389](https://github.com/cockroachdb/cockroach/issues/17389)


# Summary

This RFC proposes a mechanism which allows point release upgrades (i.e. 1.1 to
1.2) of CockroachDB clusters using a rolling restart followed by an
operator-issued cluster-wide version bump that represents a promise that the old
version is no longer running on any of the nodes in the cluster, and that
backwards-incompatible features can be enabled.

This is achieved through a version which can be queried at server runtime, and
which is populated from

1. a new cluster setting `version`, and
1. a persisted version on the Store's engines.

The main concern in designing the mechanism here is operator friendliness. We
don't want to make the process more complicated; it should be scriptable; it
should be hard to get wrong (and if you do, it shouldn't matter).

# Motivation

We have committed to supporting rolling restarts for version upgrades, but we do
not have a mechanism in place to safely enable those when backwards-incompatible
features are introduced.

The core problem is that incompatible features must not run in a mixed-version
cluster, yet this is precisely what happens if a rolling restart is used to
upgrade the versions. At the same time, we expect a handful of
backwards-incompatible changes in each release.

What's needed is thus a mechanism that allows a rolling restart into the new
binary to be performed while holding back on using the new, incompatible
features, and this is what's introduced in this RFC.

# Guide-level explanation

We'll identify the "moving parts" in this machinery first and then walk through
an example migration from 1.1 to 1.2, which contains exactly one incompatible
upgrade (which really lands in `v1.1`, but let's forget that), namely:

On splits,

- v1.1 was creating a Raft `HardState` while *proposing* the split. By the time
  it was applied, it was potentially stale, which could lead to anomalies.
- v1.2 does not write a `HardState` during the split, but it writes a "correct"
  `HardState` immediately before the right-hand side's Raft group is booted up.

This is an incompatible change because in a mixed cluster in which v1.2 issues
the split and v1.1 applies it, the node at v1.1 would end up without a
`HardState` and immediately crash. We need a proper migration -- `v1.2` must know
when it is safe to use the new behavior; it must use the previous (incorrect) one
until it knows that, and hope that the anomaly doesn't strike in the meantime.

You can forget about the precise change now, but it's useful to see that this
one is incompatible but absolutely necessary - we can't fix it in a way that
works around the incompatibility, and we can't keep the status quo.

Now let's do the actual update. For that, the operator
1. verifies the output of `SHOW CLUSTER SETTING version` on each node, to assert
   they are using `v1.1` (it could be that the operator previously updated
   the *binary* from `v1.0` to `v1.1` but forgot to actually bump the version
   in use!),
1. performs a rolling restart into the `v1.2` binary,
1. checks that the nodes are *really* running a `v1.2` binary. This includes
   making sure auto-scalers, failover, etc, would spawn nodes at `v1.2`.
1. in the meantime the nodes are running `v1.2`, but with `v1.1`-compatible
   features, until the operator
1. runs `SET CLUSTER SETTING version = '1.2'`.
1. The cluster now operates at `v1.2`.

We now run through what the last step does under the hood, in detail.

## Recap: how cluster settings work

Each node has a number of "settings variables" which have hard-coded default
values. To change a variable from that default value, one runs `SET CLUSTER
SETTING settingname = 'settingval'`.

Under the hood, this roughly translates to `INSERT INTO system.settings
VALUES(settingname, settingval)`, and we have special triggers in place which
gossip the contents of this table whenever they change.

When such a gossip update is received by a node, it goes through the (hard-coded
list of) setting variables, populates them with the values from the table, and
resets all unmentioned variables to their default value.

To show the values of the setting variables, one can run `SHOW ALL CLUSTER
SETTINGS`. Note that the output of that can vary based on the node, i.e. if one
hasn't gotten the most recent updates yet. The command to read from the actual
table is `SELECT ... FROM system.settings`; this shows only those settings for
which an explicit `SET CLUSTER SETTING` has been issued.

Note that the `version` variable has additional logic to be detailed in the next
section.

## Running `SET CLUSTER SETTING version = '1.2'`

What actually happens when the cluster version is bumped? There are a few things
that we want.

1. The operator should not be able to perform "illegal" version bumps. A legal
   bump is from one version to the next: 1.0 to 1.1, 1.1 to 1.2, perhaps 1.6 to
   2.0 (if 1.6 is the last release in the 1.x series), but *not* 1.0 to 1.3 and
   definitely not `v1.0` to `v5.0`.

   This immediately tells us that the `version` setting needs to take into
   account its existing value before updating to the new one, and may need
   to return an error when its validation logic fails.

   This also suggests that it should use the existing value from the
   `system.settings` table, and not from `SHOW` (which may not be
   authoritative).
2. If we use the `system.settings` table naively, we may have a problem: settings
   don't persist their default value, and in particular, a new cluster (or one
   started during 1.0) does not have a `version` setting persisted. It is hard
   to persist a setting during bootstrap since the `settings` table is only
   created later. It's tricky to correctly populate that table once the cluster
   is running because all you know about is your current node, but what if
   you're accidentally running 3.0 while the real cluster is at 1.0?

We defer the problem of populating the settings table to the detailed design
section and now assume that there *is* a `version` entry in the settings table.
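To make the "legal bump" rule from item 1 concrete, here is a minimal sketch of
the kind of check the validation logic would perform. The helper name is made
up, the major-version rule is only as firm as the "perhaps" above, and the real
logic additionally rejects a bump that the node's own binary cannot support (as
one of the error messages below shows):

```go
// isValidVersionBump is a hypothetical helper illustrating the rule: a bump
// may stay at the current version (the operation is idempotent), move to the
// next minor version, or move to the first release of the next major version.
// Anything else -- a downgrade or a version skip -- is rejected.
func isValidVersionBump(cur, proposed roachpb.Version) error {
	if proposed.Major < cur.Major ||
		(proposed.Major == cur.Major && proposed.Minor < cur.Minor) {
		return fmt.Errorf("cannot downgrade from %d.%d to %d.%d",
			cur.Major, cur.Minor, proposed.Major, proposed.Minor)
	}
	switch {
	case proposed.Major == cur.Major && proposed.Minor == cur.Minor:
		return nil // no-op bump
	case proposed.Major == cur.Major && proposed.Minor == cur.Minor+1:
		return nil // next minor version
	case proposed.Major == cur.Major+1 && proposed.Minor == 0:
		return nil // first release of the next major version
	default:
		return fmt.Errorf("cannot upgrade directly from %d.%d to %d.%d",
			cur.Major, cur.Minor, proposed.Major, proposed.Minor)
	}
}
```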
Then what happens on a version bump is clear: the existing version is read, it
is checked whether the new version is a valid successor version, and if so, the
entry is transactionally updated. On error, the operator running the `SET
CLUSTER SETTING` command will receive a descriptive message such as:

```
cannot upgrade to 2.0: node running 1.1
cannot upgrade directly from 1.0 to 1.3
cannot downgrade from 1.0 to 0.9
```

Assuming success, the settings machinery picks up the new version and populates
the in-memory version setting, from which it can be inspected by the running
node.

To probe the running version, essentially all moving parts in the node hold on
to a variable that implements the following interface (and in the background is
hooked up to the version cluster setting appropriately):

```go
type Versioner interface {
	IsActive(version roachpb.Version) bool
	Version() cluster.ClusterVersion // ignored for now
}
```

For instance, before having run `SET CLUSTER SETTING version = '1.1'`, we have

```go
IsActive(roachpb.Version{Major: 1, Minor: 0}) == true
IsActive(roachpb.Version{Major: 1, Minor: 1}) == false
```

even though the binary is capable of running `v1.1`.

After having bumped the cluster version, we instead get

```go
IsActive(roachpb.Version{Major: 1, Minor: 0}) == true
IsActive(roachpb.Version{Major: 1, Minor: 1}) == true
```

but (still)

```go
IsActive(roachpb.Version{Major: 1, Minor: 2}) == false
```

To return to our incompatible feature example, we would have code like this:

```go
func proposeSplitCommand() {
	p := makeSplitProposal()
	if !v.IsActive(roachpb.Version{Major: 1, Minor: 1}) {
		// Preserve old behavior at v1.0.
		p.hardState = makePotentiallyDangerousHardState()
	}
	propose(p)
}

func applySplit(p splitProposal) {
	raftGroup := apply(p)
	if v.IsActive(roachpb.Version{Major: 1, Minor: 1}) {
		// Enable new behavior only if at v1.1 or later.
		hardState := makeHardState()
		writeHardState(hardState)
	}
	raftGroup.GoLive()
}
```

Some features may require an explicit "ping" when the version gets bumped. Such
a mechanism is easy to add once it's required; we won't talk about it any more
here.

Note that a server always accepts features that require the new version, even if
it hasn't received notice that they are safe to use, when their use is prompted
by another node. For instance, if a new inter-node RPC is introduced, nodes
should always respond to it if they can (i.e. know about the RPC). A bump in the
cluster version propagates through the cluster asynchronously, so a node may
start using a new feature before others realize it is safe.

## Development versions

During the development cycle, new backwards-incompatible migrations may need to
be introduced. For this, we use "unstable" versions, which are written
`<major>.<minor>-<unstable>`; while stable releases will always have `<unstable>
== 0`, each unstable change gets a unique, strictly incrementing unstable
version component. For instance, at the time of writing (`v1.1-alpha`), we have
the following:
```go
var (
	// VersionSplitHardStateBelowRaft is https://github.com/cockroachdb/cockroach/pull/17051.
	VersionSplitHardStateBelowRaft = roachpb.Version{Major: 1, Minor: 0, Unstable: 2}

	// VersionRaftLogTruncationBelowRaft is https://github.com/cockroachdb/cockroach/pull/16993.
	VersionRaftLogTruncationBelowRaft = roachpb.Version{Major: 1, Minor: 0, Unstable: 1}

	// VersionBase corresponds to any binary older than 1.0-1,
	// though these binaries won't know anything about the mechanism in which
	// this version is used.
	VersionBase = roachpb.Version{Major: 1}
)
```

Note that there is no `v1.1` yet. This version will only exist with the stable
`v1.1.0` release.

Tagging the unstable versions individually has the advantage that we can
properly migrate our test clusters simply through a rolling restart, and then a
version bump to `<major>.<minor>-<latest_unstable>` (it's allowed to enable
multiple unstable versions at once).

## Upgrade process (close to documentation)

The upgrade process as we document it publicly will have to advise operators to
create appropriate backups. They should roughly follow this checklist:

### Optional prelude: staging dry run

- Start a **staging** cluster with the new version (e.g. 1.1).
- Restore data from most recent backup(s) to staging cluster.
- Tee production traffic or run load generators to simulate live
  traffic and verify cluster stability.
- Proceed to the upgrade process described above.

### Steps in production

- Disable auto-scaling or other systems that could add a node with a conflicting
  version at an inopportune time.
- Ensure that all nodes are either running or guaranteed to not rejoin the
  cluster after it has been updated.
  - We intend to protect the cluster from mismatched nodes, but the exact
    mechanism is TBD. Check back when writing the docs.
- Create a (full or incremental) backup of the cluster.
- Rolling upgrade of nodes to next version.
- At this point, all nodes will be running the new binary, albeit with
  compatibility for the old one.
- Verify no node running the old version remains in the cluster (and no new one
  will accidentally be added).
- Verify basic cluster stability. If problems occur, a rolling downgrade is
  still an option.
- Depending on how much time has passed, another incremental backup could be
  advisable.
- `SET CLUSTER SETTING version = '<newversion>'`
- In the event of a catastrophic failure or corruption due to usage of new
  features requiring 1.1, the only option is to restore from backup. This is a
  two step process: start a new cluster using the old binary, and then restore
  from the backup(s).
- Restore any orchestration settings (auto-scaling, etc) back to their normal
  production values.
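Since the final steps are meant to be scriptable, here is a minimal sketch (not
part of this RFC's implementation) of the verification-and-bump steps in Go,
using the standard `database/sql` package with a Postgres-compatible driver.
The node addresses, connection strings, and expected version strings are
placeholders; checking the *binary* version of each node is deployment specific
and not covered here:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks the same protocol.
)

func main() {
	// Hypothetical SQL addresses, one per node in the cluster.
	nodes := []string{"node1:26257", "node2:26257", "node3:26257"}

	// Before bumping, check that every node still reports the old cluster
	// version; anything else indicates the rolling restart or a previous
	// bump did not go as planned.
	for _, addr := range nodes {
		db, err := sql.Open("postgres", fmt.Sprintf("postgresql://root@%s/?sslmode=disable", addr))
		if err != nil {
			log.Fatal(err)
		}
		var version string
		if err := db.QueryRow("SHOW CLUSTER SETTING version").Scan(&version); err != nil {
			log.Fatal(err)
		}
		db.Close()
		if version != "1.1" {
			log.Fatalf("node %s reports cluster version %q; aborting", addr, version)
		}
	}

	// All nodes agree; issue the bump against any single node.
	db, err := sql.Open("postgres", "postgresql://root@node1:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
	if _, err := db.Exec("SET CLUSTER SETTING version = '1.2'"); err != nil {
		log.Fatal(err)
	}
	log.Println("cluster version bumped to 1.2")
}
```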
# Reference-level explanation

This section runs through the moving parts involved in the implementation.

## Detailed design

### Structures and nomenclature

The fundamental structure is the straightforward `roachpb.Version`:

```proto
message Version {
  optional int32 major = 1;    // the "2" in `v2.1`
  optional int32 minor = 2;    // the "1" in `v2.1`
  optional int32 patch = 3;    // placeholder; always zero
  optional int32 unstable = 4; // dev version; all stable versions have zero
}
```

The situation gets complicated by the fact that our usage of the version as the
`version` cluster setting is really a "cluster-wide minimum version", which
informs the use of the name `MinimumVersion` in the following `ClusterVersion`
proto:

```
message ClusterVersion {
  // The minimum_version required for any node to support. This
  // value must monotonically increase.
  roachpb.Version minimum_version = 1 [(gogoproto.nullable) = false];
}
```

This should make sense so far. However, discussion in this RFC has mandated that
we include an ominous `UseVersion` as well. This emerged as a compromise after
giving up on allowing "rollbacks" of upgrades. Briefly put, `UseVersion` can be
smaller than `MinimumVersion` and advises the server to not use new features
that it has the discretion to not use. For example, assume `v1.1` contains a
performance optimization (that isn't supported in a cluster running nodes at
`v1.0`). After bumping the cluster version to `v1.1`, it turns out that the
optimization is a horrible pessimization for the cluster's workload, and things
start to break. The operator can then set `UseVersion` back to `v1.0` to advise
the cluster to not use that performance optimization (even if it could). On the
other hand, some other migrations it has performed (perhaps it rewrote some of
its on-disk state to a new format) may not support being "deactivated", so they
would continue to be in effect.

This feature has not been implemented in the initial version, though it has been
"plumbed". It will be ignored in this design from this point on, though no
effort will be made to scrub it from code samples.

```
message ClusterVersion {
  [...]
  // The version of functionality in use in the cluster. Unlike
  // minimum_version, use_version may be downgraded, which will
  // disable functionality requiring a higher version. However,
  // some functionality, once in use, can not be discontinued.
  // Support for that functionality is guaranteed by the ratchet
  // of minimum_version.
  roachpb.Version use_version = 2 [(gogoproto.nullable) = false];
}
```

The `system.settings` table entry for `version` is in fact a marshalled
`ClusterVersion` (for which `MinimumVersion == UseVersion`).

### Server Configuration

The `ServerVersion` (type `roachpb.Version`, sometimes referred to as "binary
version") is baked into the binaries we release (in which case it equals
`cluster.BinaryServerVersion`, so our `v1.1` release will have `ServerVersion ==
roachpb.Version{Major: 1, Minor: 1}`). However, internally, `ServerVersion` is
part of the configuration of a `*Server` and can be set freely (which we do in
tests).

Similarly, a `Server` is configured with `MinimumSupportedVersion` (which in
release builds typically trails `cluster.BinaryServerVersion` by a minor
version, reflecting the fact that it can run in a compatible way with its
predecessor). If a server starts up with a store that has a persisted version
smaller than this or larger than its `ServerVersion`, it exits with an error.
We'll talk about store persistence in the corresponding subsection.
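As a rough sketch (the helper names are made up, not the actual implementation),
the startup check amounts to requiring that the persisted store version falls in
the window `[MinimumSupportedVersion, ServerVersion]`, using an ordering in
which the `Unstable` component breaks ties within a `<major>.<minor>` release:

```go
// versionLess is an illustrative ordering over roachpb.Version. Patch is a
// placeholder (always zero), and Unstable orders development versions within
// a given <major>.<minor> release.
func versionLess(a, b roachpb.Version) bool {
	if a.Major != b.Major {
		return a.Major < b.Major
	}
	if a.Minor != b.Minor {
		return a.Minor < b.Minor
	}
	return a.Unstable < b.Unstable
}

// checkStoreVersion sketches the startup check: a persisted store version
// below MinimumSupportedVersion or above ServerVersion means this binary
// cannot safely run against the store, so the server refuses to start.
func checkStoreVersion(persisted, minSupported, serverVersion roachpb.Version) error {
	if versionLess(persisted, minSupported) {
		return fmt.Errorf("store version %v too old for this binary (minimum supported %v)",
			persisted, minSupported)
	}
	if versionLess(serverVersion, persisted) {
		return fmt.Errorf("store version %v too new for this binary (server version %v)",
			persisted, serverVersion)
	}
	return nil
}
```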
### Gossip

The `NodeDescriptor` gossips the server's configured `ServerVersion`. This isn't
used at the time of writing; see the unresolved questions for discussion.

### Storage (persistence)

Once a node has received a cluster-wide minimum version from the settings table
via Gossip, it is used as the authoritative version the server is operating at
(unless the binary can't support it, in which case it commits suicide).

Typically, the server will go through the following transitions:

- `ServerVersion` is (say) `v1.1`, runs `v1.1` (`MinimumVersion == UseVersion == v1.1`),
  stores have the above `MinimumVersion` and `UseVersion` persisted
- rolling restart
- `ServerVersion` is `v1.2`, runs `v1.1`-compatible (`MinimumVersion == UseVersion == v1.1`)
- operator issues `SET CLUSTER SETTING version = '1.2'`
- gossip received: stores updated to `MinimumVersion == UseVersion == v1.2`
- new `MinimumVersion` (and equal `UseVersion`) exposed to running process:
  `ServerVersion` is (still) `v1.2`, but now `MinimumVersion == UseVersion == v1.2`.

We need to close the gap between starting the server and receiving the above
information, and we also want to prevent restarting into a too-recent version
in the first place (for example restarting straight into `v1.3` from `v1.1`).

To this end, whenever any `Node` receives a `version` from gossip, it writes it
to a store local key (`keys.StoreClusterVersionKey()`) on *all* of its stores
(as a `ClusterVersion`).

Additionally, when a cluster is bootstrapped, the store is populated with
the running server's version (this will not happen for `v1.0` binaries as
these don't know about this RFC).

When a node starts up with new stores to bootstrap, it takes precautions to
propagate the cluster version to these stores as well. See the unresolved
questions for some discussion of how an illegal version joining an existing
cluster can be prevented.

This seems simple enough, but the logic that reads from the stores has to deal
with the case in which the various stores either have no information (as could
happen as we boot into them from 1.0) or conflicting information. Roughly
speaking, bootstrapping stores can afford to wait for an authoritative version
from gossip and use that, and whenever we ask a store about its persisted
`MinimumVersion` and it has none persisted, it counts as a store at
`MinimumVersion=UseVersion=v1.0`. We make sure to write the version atomically
with the bootstrap information (to make sure we don't bootstrap a store, crash,
and then misidentify as `v1.0`). We also write all versions again after
bootstrap to immunize against the case in which the cluster version was bumped
mid-bootstrap (this could cause trouble if we add one store and then remove the
original one between restarts).

The cluster version we synthesize at node start is then the one with the largest
`MinimumVersion` (and the smallest `UseVersion`).

Examples:

- one store at `<empty>`, another at `v1.1`, results in `v1.1`.
- two stores, both `<empty>`, results in `v1.0`.
- three stores at `v1.1`, `v1.2` and `v1.3` results in `v1.3` (but this likely
  trips an error anyway because it spans too many versions).
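A minimal sketch of that synthesis (names are illustrative, and it reuses the
`versionLess` helper sketched earlier), assuming each store hands back either
its persisted `ClusterVersion` or nothing:

```go
// synthesizeClusterVersion sketches how the startup version is derived from
// what the individual stores have persisted: a store without a persisted
// version counts as v1.0, and the result carries the largest MinimumVersion
// seen across all stores.
func synthesizeClusterVersion(persisted []*ClusterVersion) ClusterVersion {
	// Start from v1.0, the version implied by a store that predates this
	// mechanism.
	result := ClusterVersion{
		MinimumVersion: roachpb.Version{Major: 1},
		UseVersion:     roachpb.Version{Major: 1},
	}
	for _, cv := range persisted {
		if cv == nil {
			continue // nothing persisted; counts as v1.0
		}
		if versionLess(result.MinimumVersion, cv.MinimumVersion) {
			result.MinimumVersion = cv.MinimumVersion
		}
	}
	// UseVersion is plumbed but unused in this design; keep it equal to
	// MinimumVersion.
	result.UseVersion = result.MinimumVersion
	return result
}
```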
### The implementer of `Versioner`: `ExposedClusterVersion`

`ExposedClusterVersion` is the central object that manages the information
bootstrapped from the stores at node start and the gossiped central version, and
flips between the two at the appropriate moment. It also encapsulates the logic
that dictates which version upgrades are admissible, and for that reason
integrates fairly tightly with the `settings` subsystem. This is fairly complex,
so appropriate detail is supplied below.

We start out with the struct itself.

```go
type ExposedClusterVersion struct {
	MinSupportedVersion roachpb.Version // Server configuration, does not change
	ServerVersion       roachpb.Version // Server configuration, does not change
	// baseVersion stores a *ClusterVersion. It is initially zero (at which
	// point any calls to check the version are fatal) and is initialized with
	// the version from the stores early in the boot sequence. Later, it gets
	// updated with gossiped updates, but only *after* each update has been
	// written back to the disks (so that we don't expose anything to callers
	// that we may not see again if the node restarted).
	baseVersion atomic.Value
	// The cluster setting administering the `version` setting. On change,
	// invokes logic that calls `cb` and then bumps `baseVersion`.
	version *settings.StateMachineSetting
	// Callback into the node to persist a new gossiped MinimumVersion (invoked
	// before `baseVersion` is updated).
	cb func(ClusterVersion)
}

// Version returns the minimum cluster version the caller may assume is in
// effect. It must not be called until the setting has been initialized.
func (ecv *ExposedClusterVersion) Version() ClusterVersion

// BootstrapVersion returns the version a newly initialized cluster should have.
func (ecv *ExposedClusterVersion) BootstrapVersion() ClusterVersion

// IsActive returns true if the features of the supplied version are active at
// the running version.
func (ecv *ExposedClusterVersion) IsActive(v roachpb.Version) bool
```

The remaining complexity lies in the transformer function for
`version *settings.StateMachineSetting`. It contains all of the update logic
(mod reading from the table: we've updated the settings framework to use the
table for all `StateMachineSettings`, of which this is the only instance at the
time of writing); `versionTransformer` takes

- the previous encoded value (i.e. a marshalled `ClusterVersion`), and
- the desired transition, if any (for example "1.2"),

and returns

- the new encoded value (i.e. the new marshalled `ClusterVersion`),
- an interface backed by a "user-friendly" representation of the new state
  (i.e. something that can be printed), and
- an error if the input was illegal.
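A rough sketch of the shape of that transformer follows. The signature, the
`parseVersion` helper, and the `baseVersion()` accessor are illustrative only,
and the validation defers to the kind of legal-bump check sketched in the
guide-level section:

```go
// versionTransformer sketches the transformer's contract: decode the previous
// state, apply the requested transition (if any), validate it, and re-encode.
func versionTransformer(prevEncoded []byte, transitionTo *string) (
	newEncoded []byte, userFriendly interface{}, _ error,
) {
	var cur ClusterVersion
	if prevEncoded == nil {
		// No previous value: the "default" is baseVersion, which was
		// bootstrapped from the stores (see the special cases below).
		cur = baseVersion()
	} else if err := cur.Unmarshal(prevEncoded); err != nil {
		return nil, nil, err
	}
	if transitionTo == nil {
		// No transition requested; expose the current state unchanged.
		return prevEncoded, cur.MinimumVersion, nil
	}
	target, err := parseVersion(*transitionTo) // e.g. "1.2" -> {Major: 1, Minor: 2}
	if err != nil {
		return nil, nil, err
	}
	if err := isValidVersionBump(cur.MinimumVersion, target); err != nil {
		return nil, nil, err
	}
	next := ClusterVersion{MinimumVersion: target, UseVersion: target}
	enc, err := next.Marshal()
	if err != nil {
		return nil, nil, err
	}
	return enc, next.MinimumVersion, nil
}
```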
The most complicated bits happen inside of this function for the following
special cases:

- when no previous encoded value is given, the transformer provides the "default
  value". In this case, it's `baseVersion`. In particular, the default value
  changes! This behaviour is required because when the initial gossip update
  comes in, it needs to be validated, and we also validate what users do during
  `SET CLUSTER SETTING version = ...`, which they may run repeatedly across
  multiple versions.
- it validates the new state and fails if either the node itself has a
  `ServerVersion` below the new `MinimumVersion`, or its `MinSupportedVersion`
  is newer than the new `MinimumVersion`.

## Initially populating the settings table version entry

### Populating the settings table

As outlined in the guide-level explanation, we'd like the settings table to hold
the "current" cluster version for new clusters, but we have no good way of
populating it at bootstrap time and don't have sufficient information to
populate it in a foolproof manner later. The solution is presented at the end of
this section. We first detail the "obvious" approaches and their shortcomings.

#### Approaches that don't work

We could fall back to the node's "suspected" local version whenever no `version`
is persisted in the table, but:

- this is incorrect when, say, adding a v3.0 node to a v1.0 cluster and
  running `SET CLUSTER SETTING version = '3.1'` on that node:
  - in the absence of any other information (and assuming that we don't try any
    "polling" of everyone's version, which can quickly get out of hand), the
    node assumes the cluster version is `v3.1` (or `v3.0`).
  - it manages to write `version = v3.1` to the settings table.
  - all other nodes in the cluster die when they learn about that.
  - it would "work" (with `v1.1`) on all nodes running the "actual" version; the
    `v3.0` node would die instead, learning that the cluster is now at `v1.1`.
- the classic case to worry about is that of an operator "skipping a version"
  during the rolling restart. We would catch this as the new binary can't run
  from the old storage directory (i.e. `v1.7` can't be started on a `v1.5`
  store).
  - it would equally be impossible to roll from `v1.5` into `v1.6` into `v1.7`
    (the storage markers would remain at `v1.5`).
- in effect, to realize the above problem in practice, an operator would
  have to add a brand new node running a too new version to a preexisting
  cluster and run the version bump on that node.
- this might be an acceptable solution in practice, but it's hard to reason
  about and explain.

An alternative (but apparently problematic) approach is adding a SQL migration.
The problem with those is that it's not clear which value the migration should
insert into the table -- is it that of the running binary? That would do the
wrong thing if a `v1.0` cluster is restarted into `v1.1` (which now has the
migration); we need to insert `v1.0` in that case. On the other hand, after
bootstrapping a `v1.x` cluster for `x > 0`, we want to insert `v1.x`.

And of course there is the third approach, which is writing the settings table
during actual bootstrapping. This seems too much work to be realistic at this
point in the cycle, and it may come with its own migration concerns.
#### The combination that does work

All of the previous approaches combined suggest a more workable combination:

1. instead of populating the settings table at bootstrap, populate a new key
   `BootstrapVersion` (similar to the `ClusterIdent`). In effect, for the
   lifetime of this cluster, we can get an authoritative answer about the
   version at which it was bootstrapped.
1. change the semantics of `SET CLUSTER SETTING version = x` so that when it
   doesn't find an entry in the `system.settings` table, it fails.
1. add a SQL migration that
   - reads `BootstrapVersion`
   - runs

     ```sql
     -- need this explicitly or the `SET` below would fail!
     UPSERT INTO system.settings VALUES(
       'version', marshal_appropriately('<bootstrap_version>')
     );
     -- Trigger proper gossip, etc, by doing a "no-op upgrade".
     SET CLUSTER SETTING version = '<bootstrap_version>';
     ```
   - the cluster version is now set and operators can use `SET CLUSTER
     SETTING`.

This obviously works if the migration runs while the cluster is still running
the bootstrapped version, even if the operator has already set the version
explicitly (`SET CLUSTER SETTING version = x` is idempotent).


### Bootstrapping new stores

When an existing node restarts, it has on-disk markers that should reflect a
reasonable version configuration to assume until gossip updates are in effect.

The situation is slightly different when a new node joins a cluster for the
first time. In this case, it'll bootstrap its stores using its binary's
`MinimumSupportedVersion` (for that is all that it knows), which is usually
one minor version behind the cluster's active version.

This is not an issue since the node's binary can still participate in newer
features, and it will bump its version once it receives the first gossip update,
typically after a few seconds.

We could conceivably be smarter about finding out the active cluster version
proactively (we're connected to gossip when we bootstrap), but this is not
deemed worth the extra complexity.

## Drawbacks

- relying on the operator to promise that no old version is around opens cluster
  health up to user error.
- we can't roll back upgrades, which will make many users nervous
  - in particular, it discounts the OSS version of CockroachDB

## Rationale and Alternatives

The main concern in designing the mechanism here is operator friendliness. We
don't want to make the process more complicated; it should be scriptable; it
should be hard to get wrong (and if you do, it shouldn't matter).

The main concessions we make in this design are

1. no support for downgrades once the version setting has been bumped. This was
   discussed early in the life of this RFC but was discarded for its inherent
   complexity and the large search space that would have to be tested.

   It will be difficult to retrofit this, though a change in the upgrade process
   can itself be migrated through the upgrade process presented here (though it
   would take one release to go through the transition).

   We can at least downgrade before bumping the cluster version setting, which
   allows all backwards-compatible changes (ideally the bulk) to be tested.

   Additionally, we could implement the originally envisioned `UseVersion`
   functionality, so that restoring from a backup only becomes necessary if a
   problem is a) severe and b) can't be "deactivated" (via `UseVersion`).
1. relying on the operator to guarantee that a cluster is not running mixed
   versions when the explicit version bump is issued.
   There are ways in which this could be approximately inferred, but again it
   was deemed too complex given the time frame. Besides, operators may prefer to
   have some level of control over the migration process, and it is difficult to
   make an autonomous upgrade workflow foolproof.

   If desired in the future, this can be retrofitted.

As a result, we get a design that's ergonomic but limited. The complexity
inherent in it as written indicates that it is a good choice not to add
additional complexity at this point. We are not locked into the process in the
long term.

## Unresolved questions

### Naming

`MinimumVersion` and `MinimumSupportedVersion` are similar but also different.
Perhaps the latter should be renamed, though no better name comes to mind.

### What to gossip

We make no use of the gossiped `ServerVersion`. The node's git commit hash is
already available, so this is only mildly interesting. Its `MinimumVersion`
(plus its `UseVersion`, if that should ever differ) are more relevant. Likely
these should be added, even if the information isn't used today.