
     1  - Feature Name: Version Migration
     2  - Status: completed
     3  - Start Date: 2017-07-10
     4  - Authors: Spencer Kimball, Tobias Schottdorf
     5  - RFC PR: [#16977](https://github.com/cockroachdb/cockroach/pull/16977), [#17216](https://github.com/cockroachdb/cockroach/pull/17216), [#17411](https://github.com/cockroachdb/cockroach/pull/17411), [#17694](https://github.com/cockroachdb/cockroach/pull/17694)
     6  - Cockroach Issue(s): [#17389](https://github.com/cockroachdb/cockroach/issues/17389)
     7  
     8  
     9  # Summary
    10  
    11  This RFC proposes a mechanism which allows point release upgrades (i.e. 1.1 to
    12  1.2) of CockroachDB clusters using a rolling restart followed by an
    13  operator-issued cluster-wide version bump that represents a promise that the old
    14  version is no longer running on any of the nodes in the cluster, and that
    15  backwards-incompatible features can be enabled.
    16  
    17  This is achieved through a version which can be queried at server runtime, and
    18  which is populated from
    19  
    20  1. a new cluster setting `version`, and
    21  1. a persisted version on the Store's engines.
    22  
The main concern in designing the mechanism here is operator friendliness. We
don't want to make the process more complicated than necessary; it should be
scriptable, and it should be hard to get wrong (and if you do get it wrong, it
shouldn't matter).
    26  
    27  # Motivation
    28  
    29  We have committed to supporting rolling restarts for version upgrades, but we do
    30  not have a mechanism in place to safely enable those when backwards-incompatible
    31  features are introduced.
    32  
The core problem is that incompatible features must not be active in a
mixed-version cluster, yet a mixed-version cluster is precisely what a rolling
restart produces while it is in progress. At the same time, we expect a handful
of backwards-incompatible changes in each release.
    37  
    38  What's needed is thus a mechanism that allows a rolling restart into the new
    39  binary to be performed while holding back on using the new, incompatible,
    40  features, and this is what's introduced in this RFC.
    41  
    42  # Guide-level explanation
    43  
We'll identify the "moving parts" in this machinery first and then walk through
an example migration from 1.1 to 1.2, which contains exactly one incompatible
change (it actually lands in `v1.1`, but let's forget that), namely:
    47  
    48  On splits,
    49  
    50  - v1.1 was creating a Raft `HardState` while *proposing* the split. By the time
    51    it was applied, it was potentially stale, which could lead to anomalies.
    52  - v1.2 does not write a `HardState` during the split, but it writes a "correct"
    53    `HardState` immediately before the right-hand side's Raft group is booted up.
    54  
This is an incompatible change because in a mixed cluster in which v1.2 issues
the split and v1.1 applies it, the node at v1.1 would end up without a
`HardState` and immediately crash. We need a proper migration -- `v1.2` must know
when it is safe to use the new behavior; until then it must use the previous
(incorrect) behavior and hope that the anomaly doesn't strike in the meantime.
    60  
    61  You can forget about the precise change now, but it's useful to see that this
    62  one is incompatible but absolutely necessary - we can't fix it in a way that
    63  works around the incompatibility, and we can't keep the status quo.
    64  
    65  Now let's do the actual update. For that, the operator
    66  
1. verifies the output of `SHOW CLUSTER SETTING version` on each node to assert
   that they are running `v1.1` (it could be that the operator previously
   updated the *binary* from `v1.0` to `v1.1` but forgot to actually bump the
   version in use!),
1. performs a rolling restart into the `v1.2` binary,
1. checks that the nodes are *really* running a `v1.2` binary. This includes
   making sure that auto-scalers, failover, etc., would spawn nodes at `v1.2`,
    74  1. in the meantime the nodes are running `v1.2`, but with `v1.1`-compatible
    75     features, until the operator
    76  1. runs `SET CLUSTER SETTING version = '1.2'`.
    77  1. The cluster now operates at `v1.2`.
    78  
We now walk through what the last step does under the hood in detail.
    80  
    81  ## Recap: how cluster settings work
    82  
    83  Each node has a number of "settings variables" which have hard-coded default
    84  values. To change a variable from that default value, one runs `SET CLUSTER
    85  SETTING settingname = 'settingval'`.
    86  
    87  Under the hood, this roughly translates to `INSERT INTO system.settings
    88  VALUES(settingname, settingval)`, and we have special triggers in place which
    89  gossip the contents of this table whenever they change.
    90  
When a node receives such a gossip update, it goes through its (hard-coded)
list of setting variables, populates them with the values from the table, and
resets all unmentioned variables to their default values.
    94  
To show the values of the setting variables, one can run `SHOW ALL CLUSTER
SETTINGS`. Note that its output can vary between nodes, e.g. if a node hasn't
received the most recent updates yet. To read from the actual table, use
`SELECT ... FROM system.settings`; this shows only those settings for which an
explicit `SET CLUSTER SETTING` has been issued.
   100  
   101  Note that the `version` variable has additional logic to be detailed in the next
   102  section.
   103  
   104  ## Running `SET CLUSTER SETTING version = '1.2'`
   105  
   106  What actually happens when the cluster version is bumped? There are a few things
   107  that we want.
   108  
   109  1. The operator should not be able to perform "illegal" version bumps. A legal
   110     bump is from one version to the next: 1.0 to 1.1, 1.1 to 1.2, perhaps 1.6 to
   111     2.0 (if 1.6 is the last release in the 1.x series), but *not* 1.0 to 1.3 and
   112     definitely not `v1.0` to `v5.0`.
   113  
   114     This immediately tells us that the `version` setting needs to take into
   115     account its existing value before updating to the new one, and may need
   116     to return an error when its validation logic fails.
   117  
   118     This also suggests that it should use the existing value from the
   119     `system.settings` table, and not from `SHOW` (which may not be
   120     authoritative).
   121  2. If we use the `system.settings` table naively, we may have a problem: settings
   122     don't persist their default value, and in particular, a new cluster (or one
   123     started during 1.0) does not have a `version` setting persisted. It is hard
   124     to persist a setting during bootstrap since the `settings` table is only
   125     created later. It's tricky to correctly populate that table once the cluster
   126     is running because all you know about is your current node, but what if
   127     you're accidentally running 3.0 while the real cluster is at 1.0?
   128  
   129  We defer the problem of populating the settings table to the detailed design
   130  section and now assume that there *is* a `version` entry in the settings table.
   131  
   132  Then what happens on a version bump is clear: the existing version is read, it
   133  is checked whether the new version is a valid successor version, and if so, the
   134  entry is transactionally updated. On error, the operator running the `SET
   135  CLUSTER SETTING` command will receive a descriptive message such as:
   136  
   137  ```
   138  cannot upgrade to 2.0: node running 1.1
   139  cannot upgrade directly from 1.0 to 1.3
   140  cannot downgrade from 1.0 to 0.9
   141  ```
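
The exact check lives in the `version` setting's update logic described in the
reference-level section below. The following is a minimal, self-contained
sketch of the successor rule only; the names (`version`, `isValidBump`) are
hypothetical and the "node running" error above additionally involves the
node's binary version, which this sketch omits:

```go
package main

import "fmt"

// version mirrors the Major/Minor fields of roachpb.Version for illustration.
type version struct{ Major, Minor int32 }

func (v version) less(o version) bool {
	return v.Major < o.Major || (v.Major == o.Major && v.Minor < o.Minor)
}

// isValidBump is a hypothetical stand-in for the successor check performed
// when `SET CLUSTER SETTING version = ...` is run.
func isValidBump(cur, proposed version) error {
	switch {
	case proposed.less(cur):
		return fmt.Errorf("cannot downgrade from %d.%d to %d.%d",
			cur.Major, cur.Minor, proposed.Major, proposed.Minor)
	case proposed == cur:
		return nil // re-setting the current version is a no-op
	case proposed.Major == cur.Major && proposed.Minor == cur.Minor+1:
		return nil // next minor version, e.g. 1.1 -> 1.2
	case proposed.Major == cur.Major+1 && proposed.Minor == 0:
		return nil // next major version, e.g. 1.6 -> 2.0
	default:
		return fmt.Errorf("cannot upgrade directly from %d.%d to %d.%d",
			cur.Major, cur.Minor, proposed.Major, proposed.Minor)
	}
}

func main() {
	fmt.Println(isValidBump(version{1, 0}, version{1, 3})) // rejected
	fmt.Println(isValidBump(version{1, 1}, version{1, 2})) // <nil>
}
```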
   142  
   143  Assuming success, the settings machinery picks up the new version, and populates
   144  the in-memory version setting, from which it can be inspected by the running
   145  node.
   146  
   147  To probe the running version, essentially, all moving parts in the node hold on
   148  to a variable that implements the following interface (and in the background is
   149  hooked up to the version cluster setting appropriately):
   150  
   151  ```go
   152  type Versioner interface {
   153    IsActive(version roachpb.Version) bool
   154    Version() cluster.ClusterVersion // ignored for now
   155  }
   156  ```
   157  
   158  For instance, before having run `SET CLUSTER SETTING version = '1.1'`, we have
   159  
   160  ```go
   161  IsActive(roachpb.Version{Major: 1, Minor: 0}) == true
   162  IsActive(roachpb.Version{Major: 1, Minor: 1}) == false
   163  ```
   164  
   165  even though the binary is capable of running `v1.1`.
   166  
   167  After having bumped the cluster version, we instead get
   168  
   169  ```go
   170  IsActive(roachpb.Version{Major: 1, Minor: 0}) == true
   171  IsActive(roachpb.Version{Major: 1, Minor: 1}) == true
   172  ```
   173  
   174  but (still)
   175  
   176  ```go
   177  IsActive(roachpb.Version{Major: 1, Minor: 2}) == false
   178  ```
   179  
   180  To return back to our incompatible feature example, we
   181  would have code like this:
   182  
   183  ```go
   184  func proposeSplitCommand() {
   185    p := makeSplitProposal()
   186    if !v.IsActive(roachpb.Version{Major: 1, Minor: 1}) {
   187      // Preserve old behavior at v1.0.
   188      p.hardState = makePotentiallyDangerousHardState()
   189    }
   190    propose(p)
   191  }
   192  
   193  func applySplit(p splitProposal) {
   194    raftGroup := apply(p)
   195    if v.IsActive(roachpb.Version{Major: 1, Minor: 1}) {
   196      // Enable new behavior only if at v1.1 or later.
   197      hardState := makeHardState()
   198      writeHardState(hardState)
   199    }
   200    raftGroup.GoLive()
   201  }
   202  ```
   203  
   204  Some features may require an explicit "ping" when the version gets bumped. Such
   205  a mechanism is easy to add once it's required; we won't talk about it any more
   206  here.
   207  
Note that a server always accepts features that require the new version when
their use is prompted by another node, even if it hasn't yet been told that they
are safe to use. For instance, if a new inter-node RPC is introduced, nodes
should always respond to it if they can (i.e. they know about the RPC). A bump
in the cluster version propagates through the cluster asynchronously, so a node
may start using a new feature before others realize it is safe.
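
A compressed illustration of this asymmetry, using stand-in types rather than
the real RPC plumbing: the sender gates use of a hypothetical new request on
`IsActive`, while the receiver serves it unconditionally because a peer may
have observed the version bump first.

```go
package main

import "fmt"

// Stand-ins for roachpb.Version and the Versioner interface, for illustration.
type version struct{ Major, Minor int32 }

type node struct{ active version }

func (n node) IsActive(q version) bool {
	return q.Major < n.active.Major ||
		(q.Major == n.active.Major && q.Minor <= n.active.Minor)
}

// The sender only issues the hypothetical new request once the cluster
// version permits it.
func send(n node) string {
	if n.IsActive(version{Major: 1, Minor: 1}) {
		return "NewRequest"
	}
	return "LegacyRequest"
}

// The receiver handles either request regardless of its own view of the
// cluster version, since the bump propagates asynchronously.
func serve(req string) string {
	return "handled " + req
}

func main() {
	fmt.Println(serve(send(node{active: version{Major: 1, Minor: 0}}))) // handled LegacyRequest
	fmt.Println(serve(send(node{active: version{Major: 1, Minor: 1}}))) // handled NewRequest
}
```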
   214  
   215  ## Development versions
   216  
   217  During the development cycle, new backwards-incompatible migrations may need to
   218  be introduced. For this, we use "unstable" versions, which are written
   219  `<major>.<minor>-<unstable>`; while stable releases will always have `<unstable>
   220  == 0`, each unstable change gets a unique, strictly incrementing unstable
   221  version component. For instance, at the time of writing (`v1.1-alpha`), we have
   222  the following:
   223  
   224  ```go
   225  var (
   226  	// VersionSplitHardStateBelowRaft is https://github.com/cockroachdb/cockroach/pull/17051.
   227  	VersionSplitHardStateBelowRaft = roachpb.Version{Major: 1, Minor: 0, Unstable: 2}
   228  
   229  	// VersionRaftLogTruncationBelowRaft is https://github.com/cockroachdb/cockroach/pull/16993.
   230  	VersionRaftLogTruncationBelowRaft = roachpb.Version{Major: 1, Minor: 0, Unstable: 1}
   231  
   232  	// VersionBase corresponds to any binary older than 1.0-1,
   233  	// though these binaries won't know anything about the mechanism in which
   234  	// this version is used.
   235  	VersionBase = roachpb.Version{Major: 1}
   236  )
   237  ```
   238  
   239  Note that there is no `v1.1` yet. This version will only exist with the stable
   240  `v1.1.0` release.
   241  
   242  Tagging the unstable versions individually has the advantage that we can
   243  properly migrate our test clusters simply through a rolling restart, and then a
   244  version bump to `<major>.<minor>-<latest_unstable>` (it's allowed to enable
   245  multiple unstable versions at once).
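
To make that ordering concrete, here is a small, self-contained sketch (the
`version` struct and `less` helper are stand-ins for `roachpb.Version` and its
comparison): every unstable version sorts after its base release and before the
next stable release, which is why bumping to the latest unstable version
activates all earlier unstable migrations as well.

```go
package main

import "fmt"

// version mirrors the Major/Minor/Unstable fields of roachpb.Version.
type version struct{ Major, Minor, Unstable int32 }

func (v version) less(o version) bool {
	if v.Major != o.Major {
		return v.Major < o.Major
	}
	if v.Minor != o.Minor {
		return v.Minor < o.Minor
	}
	return v.Unstable < o.Unstable
}

func main() {
	base := version{Major: 1}                    // 1.0
	truncation := version{Major: 1, Unstable: 1} // 1.0-1
	hardState := version{Major: 1, Unstable: 2}  // 1.0-2
	stable := version{Major: 1, Minor: 1}        // the eventual stable 1.1

	fmt.Println(base.less(truncation), truncation.less(hardState), hardState.less(stable)) // true true true
}
```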
   246  
   247  ## Upgrade process (close to documentation)
   248  
   249  The upgrade process as we document it publicly will have to advise operators to
   250  create appropriate backups. They should roughly follow this checklist:
   251  
   252  ### Optional prelude: staging dry run
   253  - Start a **staging** cluster with the new version (e.g. 1.1).
   254  - Restore data from most recent backup(s) to staging cluster.
   255  - Tee production traffic or run load generators to simulate live
   256    traffic and verify cluster stability.
- Proceed to the upgrade process described above.
   258  
   259  ### Steps in production
   260  - Disable auto-scaling or other systems that could add a node with a conflicting
   261    version at an inopportune time.
   262  - Ensure that all nodes are either running or guaranteed to not rejoin the
   263    cluster after it has been updated.
   264      - We intend to protect the cluster from mismatched nodes, but the exact
   265        mechanism is TBD. Check back when writing the docs.
   266  - Create a (full or incremental) backup of the cluster.
   267  - Rolling upgrade of nodes to next version.
   268  - At this point, all nodes will be running the new binary, albeit with
   269    compatibility for the old one.
   270  - Verify no node running the old version remains in the cluster (and no new one
   271    will accidentally be added).
   272  - Verify basic cluster stability. If problems occur, a rolling downgrade is
   273    still an option.
   274  - Depending on how much time has passed, another incremental backup could be
   275    advisable.
   276  - `SET CLUSTER SETTING version = '<newversion>'`
   277  - In the event of a catastrophic failure or corruption due to usage of new
   278    features requiring 1.1, the only option is to restore from backup. This is a
   279    two step process: start a new cluster using the old binary, and then restore
   280    from the backup(s).
- Restore any orchestration settings (auto-scaling, etc.) back to their normal
  production values.
   283  
   284  ![Version migrations with rolling upgrades](images/version_migration.png?raw=true "Version migrations with rolling upgrades")
   285  
   286  # Reference-level explanation
   287  
   288  This section runs through the moving parts involved in the implementation.
   289  
   290  ## Detailed design
   291  
   292  ### Structures and nomenclature
   293  
The fundamental structure is the straightforward `roachpb.Version`:
   295  
```proto
message Version {
  optional int32 major = 1;    // the "2" in `v2.1`
  optional int32 minor = 2;    // the "1" in `v2.1`
  optional int32 patch = 3;    // placeholder; always zero
  optional int32 unstable = 4; // dev version; all stable versions have zero
}
```
   304  
   305  The situation gets complicated by the fact that our usage of the version as the `version` cluster setting is really a "cluster-wide minimum version", which informs the use of the name `MinimumVersion` in the following `ClusterVersion` proto:
   306  
```proto
message ClusterVersion {
  // The minimum version that any node in the cluster must support.
  // This value must increase monotonically.
  roachpb.Version minimum_version = 1 [(gogoproto.nullable) = false];
}
```
   314  
   315  This should make sense so far. However, discussion in this RFC has mandated that
   316  we include an ominous `UseVersion` as well. This emerged as a compromise after
giving up on allowing "rollbacks" of upgrades. Briefly put, `UseVersion` can be
smaller than `MinimumVersion`; it advises the server not to use new features
that it has the discretion to avoid. For example, assume `v1.1` contains a
   320  performance optimization (that isn't supported in a cluster running nodes at
   321  `v1.0`). After bumping the cluster version to `v1.1`, it turns out that the
   322  optimization is a horrible pessimization for the cluster's workload, and things
   323  start to break. The operator can then set `UseVersion` back to `v1.0` to advise
   324  the cluster to not use that performance optimization (even if it could). On the
   325  other hand, some other migrations it has performed (perhaps it rewrote some of
   326  its on-disk state to a new format) may not support being "deactivated", so they
   327  would continue to be in effect.
   328  
   329  This feature has not been implemented in the initial version, though it has been
   330  "plumbed". It will be ignored in this design from this point on, though no
   331  effort will be made to scrub it from code samples.
   332  
   333  ```
   334  message ClusterVersion {
   335    [...]
   336    // The version of functionality in use in the cluster. Unlike
   337    // minimum_version, use_version may be downgraded, which will
   338    // disable functionality requiring a higher version. However,
   339    // some functionality, once in use, can not be discontinued.
   340    // Support for that functionality is guaranteed by the ratchet
   341    // of minimum_version.
   342    roachpb.Version use_version = 2 [(gogoproto.nullable) = false];
   343  }
   344  ```
   345  
   346  The `system.settings` table entry for `version` is in fact a marshalled
   347  `ClusterVersion` (for which `MinimumVersion == UseVersion`).
   348  
   349  ### Server Configuration
   350  
The `ServerVersion` (type `roachpb.Version`, sometimes referred to as "binary
version") is baked into the binaries we release (in which case it equals
`cluster.BinaryServerVersion`, so our `v1.1` release will have
`ServerVersion == roachpb.Version{Major: 1, Minor: 1}`). However, internally,
`ServerVersion` is part of the configuration of a `*Server` and can be set
freely (which we do in tests).
   357  
Similarly, a `Server` is configured with `MinimumSupportedVersion` (which in
   359  release builds typically trails `cluster.BinaryServerVersion` by a minor
   360  version, reflecting the fact that it can run in a compatible way with its
   361  predecessor). If a server starts up with a store that has a persisted version
   362  smaller than this or larger than its `ServerVersion`, it exits with an error.
   363  We'll talk about store persistence in the corresponding subsection.
   364  
   365  ### Gossip
   366  
   367  The `NodeDescriptor` gossips the server's configured `ServerVersion`. This isn't
   368  used at the time of writing; see the unresolved section for discussion.
   369  
   370  ### Storage (persistence)
   371  
Once a node has received a cluster-wide minimum version from the settings table
via gossip, that version is used as the authoritative version at which the
server operates (unless the binary can't support it, in which case the node
exits with a fatal error).
   375  
   376  Typically, the server will go through the following transitions:
   377  
   378  - `ServerVersion` is (say) `v1.1`, runs `v1.1` (`MinimumVersion == UseVersion == v1.1`),
   379    stores have the above `MinimumVersion` and `UseVersion` persisted
   380  - rolling restart
   381  - `ServerVersion` is `v1.2`, runs `v1.1`-compatible (`MinimumVersion == UseVersion == v1.1`)
   382  - operator issues `SET CLUSTER SETTING version = '1.2'`
   383  - gossip received: stores updated to `MinimumVersion == UseVersion == v1.2`
   384  - new `MinimumVersion` (and equal `UseVersion`) exposed to running process:
   385    `ServerVersion` is (still) `v1.2`, but now `MinimumVersion == UseVersion == v1.2`.
   386  
   387  We need to close the gap between starting the server and receiving the above
   388  information, and we also want to prevent restarting into a too-recent version
   389  in the first place (for example restarting straight into `v1.3` from `v1.1`).
   390  
   391  To this end, whenever any `Node` receives a `version` from gossip, it writes it
   392  to a store local key (`keys.StoreClusterVersionKey()`) on *all* of its stores
   393  (as a `ClusterVersion`).
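
A rough sketch of that write path, assuming the historical `engine.MVCCPutProto`
helper and the package layout at the time of writing; the function and variable
names here are illustrative rather than the actual implementation:

```go
import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/keys"
	"github.com/cockroachdb/cockroach/pkg/settings/cluster"
	"github.com/cockroachdb/cockroach/pkg/storage/engine"
	"github.com/cockroachdb/cockroach/pkg/util/hlc"
)

// persistClusterVersion writes the gossiped ClusterVersion to the store-local
// key on every engine so that the node can synthesize it again at next start.
func persistClusterVersion(
	ctx context.Context, engines []engine.Engine, cv cluster.ClusterVersion,
) error {
	for _, eng := range engines {
		if err := engine.MVCCPutProto(
			ctx, eng, nil /* ms */, keys.StoreClusterVersionKey(),
			hlc.Timestamp{}, nil /* txn */, &cv,
		); err != nil {
			return err
		}
	}
	return nil
}
```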
   394  
   395  Additionally, when a cluster is bootstrapped, the store is populated with
   396  the running server's version (this will not happen for `v1.0` binaries as
   397  these don't know about this RFC).
   398  
   399  When a node starts up with new stores to bootstrap, it takes precautions to
   400  propagate the cluster version to these stores as well. See the unresolved
   401  questions for some discussion of how an illegal version joining an existing
   402  cluster can be prevented.
   403  
This seems simple enough, but the logic that reads from the stores has to deal
with the case in which the various stores have either no information (as can
happen when we boot into them from 1.0) or conflicting information. Roughly speaking,
   407  bootstrapping stores can afford to wait for an authoritative version from gossip
   408  and use that, and whenever we ask a store about its persisted `MinimumVersion`
   409  and it has none persisted, it counts as a store at
   410  `MinimumVersion=UseVersion=v1.0`. We make sure to write the version atomically
   411  with the bootstrap information (to make sure we don't bootstrap a store, crash,
   412  and then misidentify as `v1.0`). We also write all versions again after
   413  bootstrap to immunize against the case in which the cluster version was bumped
   414  mid-bootstrap (this could cause trouble if we add one store and then remove the
   415  original one between restarts).
   416  
   417  The cluster version we synthesize at node start is then the one with the largest
   418  `MinimumVersion` (and the smallest `UseVersion`).
   419  
   420  Examples:
   421  
   422  - one store at `<empty>`, another at `v1.1` results in `v1.1`.
   423  - two stores, both `<empty>` results in `v1.0`.
- three stores at `v1.1`, `v1.2` and `v1.3` result in `v1.3` (but this likely
  runs into an error anyway because the stores span more than one version)
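
A sketch of that synthesis rule, using the `cluster` and `roachpb` types
referenced throughout this document; the helper is hypothetical and glosses
over the validation against the binary's supported range, which also happens
at this point:

```go
// synthesizeClusterVersion combines the versions persisted on the node's
// stores; a store with nothing persisted counts as v1.0.
func synthesizeClusterVersion(persisted []*cluster.ClusterVersion) cluster.ClusterVersion {
	v10 := roachpb.Version{Major: 1} // v1.0
	less := func(a, b roachpb.Version) bool {
		if a.Major != b.Major {
			return a.Major < b.Major
		}
		if a.Minor != b.Minor {
			return a.Minor < b.Minor
		}
		return a.Unstable < b.Unstable
	}
	if len(persisted) == 0 {
		return cluster.ClusterVersion{MinimumVersion: v10, UseVersion: v10}
	}
	var result cluster.ClusterVersion
	for i, cv := range persisted {
		cur := cluster.ClusterVersion{MinimumVersion: v10, UseVersion: v10}
		if cv != nil {
			cur = *cv // this store had a version persisted
		}
		if i == 0 {
			result = cur
			continue
		}
		if less(result.MinimumVersion, cur.MinimumVersion) {
			result.MinimumVersion = cur.MinimumVersion // largest MinimumVersion wins
		}
		if less(cur.UseVersion, result.UseVersion) {
			result.UseVersion = cur.UseVersion // smallest UseVersion wins
		}
	}
	return result
}
```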
   426  
   427  ### The implementer of `Versioner`: `ExposedClusterVersion`
   428  
   429  `ExposedClusterVersion` is the central object that manages the information
   430  bootstrapped from the stores at node start and the gossiped central version and
   431  flips between the two at the appropriate moment. It also encapsulates the logic
   432  that dictates which version upgrades are admissible, and for that reason
   433  integrates fairly tightly with the `settings` subsystem. This is fairly complex
   434  and so appropriate detail is supplied below.
   435  
   436  We start out with the struct itself.
   437  
   438  ```go
type ExposedClusterVersion struct {
	MinSupportedVersion roachpb.Version // server configuration, does not change
	ServerVersion       roachpb.Version // server configuration, does not change

	// baseVersion stores a *ClusterVersion. It is initially zero (at which
	// point any calls to check the version are fatal) and is initialized with
	// the version from the stores early in the boot sequence. Later, it gets
	// updated with gossiped updates, but only *after* each update has been
	// written back to the disks (so that we don't expose anything to callers
	// that we may not see again if the node restarted).
	baseVersion atomic.Value

	// version is the cluster setting administering the `version` setting. On
	// change, it invokes logic that calls `cb` and then bumps `baseVersion`.
	version *settings.StateMachineSetting

	// cb is a callback into the node to persist a new gossiped MinimumVersion
	// (invoked before `baseVersion` is updated).
	cb func(ClusterVersion)
}
   456  
   457  // Version returns the minimum cluster version the caller may assume is in
   458  // effect. It must not be called until the setting has been initialized.
   459  func (ecv *ExposedClusterVersion) Version() ClusterVersion
   460  
   461  // BootstrapVersion returns the version a newly initialized cluster should have.
   462  func (ecv *ExposedClusterVersion) BootstrapVersion() ClusterVersion
   463  
   464  // IsActive returns true if the features of the supplied version are active at
   465  // the running version.
   466  func (ecv *ExposedClusterVersion) IsActive(v roachpb.Version) bool
   467  ```
   468  
The remaining complexity lies in the transformer function for
`version *settings.StateMachineSetting`. It contains all of the update logic
(modulo reading from the table: we've updated the settings framework to use the
table for all `StateMachineSettings`, of which this is the only instance at the
time of writing); `versionTransformer` takes
   474  
   475  - the previous encoded value (i.e. a marshalled `ClusterVersion`)
   476  - the desired transition, if any (for example "1.2").
   477  
   478  and returns
   479  
   480  - the new encoded value (i.e. the new marshalled `ClusterVersion`)
   481  - an interface backed by a "user-friendly" representation of the new state
   482    (i.e. something that can be printed)
   483  - an error if the input was illegal.
   484  
   485  The most complicated bits happen inside of this function for the following
   486  special cases:
   487  
   488  - when no previous encoded value is given, the transformer provides the "default
   489    value". In this case, it's `baseVersion`. In particular, the default value
   490    changes! This behaviour is required because when the initial gossip update
   491    comes in, it needs to be validated, and we also validate what users do during
   492    `SET CLUSTER SETTING version = ...`, which they could do through multiple
   493    versions.
- the transformer validates the new state and fails if either the node's own
  `ServerVersion` is below the new `MinimumVersion` or its `MinSupportedVersion`
  is newer than the new `MinimumVersion`.
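
Putting this together, the transformer has roughly the following shape (a
sketch based on the description above; the parameter names and the commented
body are illustrative only):

```go
// versionTransformer validates a proposed bump and produces the new encoded
// state. With no previous value it returns the (changing) default, i.e. the
// state corresponding to baseVersion.
func versionTransformer(
	prevEncoded []byte, // marshalled ClusterVersion, or nil if nothing is stored yet
	desired *string,    // e.g. "1.2"; nil means "just give me the current state"
) (newEncoded []byte, userFriendly interface{}, err error) {
	// 1. Decode prevEncoded, falling back to baseVersion if it is nil.
	// 2. If desired == nil, return the decoded state unchanged.
	// 3. Otherwise parse *desired and check that it is a valid successor of the
	//    current MinimumVersion and within [MinSupportedVersion, ServerVersion].
	// 4. Marshal the new ClusterVersion and return it.
	panic("sketch only")
}
```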
   497  
   498  ## Initially populating the settings table version entry
   499  
   500  ### Populating the settings table
   501  
   502  As outlined in the guide-level explanation, we'd like the settings table to hold
   503  the "current" cluster version for new clusters, but we have no good way of
   504  populating it at bootstrap time and don't have sufficient information to
   505  populate it in a foolproof manner later. The solution is presented at the end of
   506  this section. We first detail the "obvious" approaches and their shortcomings.
   507  
   508  #### Approaches that don't work
   509  
We could use the node's own "suspected" version as the cluster version when no
`version` is persisted in the table, but:
   512  
   513  - this is incorrect when, say, adding a v3.0 node to a v1.0 cluster and
   514    running `SET CLUSTER SETTING version = '3.1'` on that node:
   515    - in the absence of any other information (and assume that we don't try any
   516      "polling" of everyone's version which can quickly get out of hand), the node
   517      assumes the cluster version is `v3.1` (or `v3.0`).
   518    - it manages to write `version = v3.1` to the settings table.
   519    - all other nodes in the cluster die when they learn about that.
   520  - it would "work" (with `v1.1`) on all nodes running the "actual" version; the
   521    `v3.0` node would die instead, learning that the cluster is now at `v1.1`.
   522  - the classic case to worry about is that of an operator "skipping a version"
   523    during the rolling restart. We would catch this as the new binary can't run
   524    from the old storage directory (i.e. `v1.7` can't be started on a `v1.5`
   525    store).
   526  - it would equally be impossible to roll from `v1.5` into `v1.6` into `v1.7`
   527    (the storage markers would remain at `v1.5`).
- in effect, to realize the above problem in practice, an operator would have
  to add a brand new node running a version that is too new to a preexisting
  cluster and run the version bump on that node.
   531  - This might be an acceptable solution in practice, but it's hard to reason
   532    about and explain.
   533  
   534  An alternative (but apparently problematic) approach is adding a sql migration.
   535  The problem with those is that it's not clear which value the migration should
   536  insert into the table -- is it that of the running binary? That would do the
   537  wrong thing if a `v1.0` cluster is restarted into `v1.1` (which now has the
   538  migration); we need to insert `v1.0` in that case. On the other hand, after
   539  bootstrapping a `v1.x` cluster for `x > 0`, we want to insert `v1.x`.
   540  
And of course there is the third approach, which is writing the settings table
during actual bootstrapping. This seems too much work to be realistic at this
point in the cycle, and it may come with its own migration concerns.
   544  
   545  #### The combination that does work
   546  
   547  All of the previous approaches combined suggest a more workable combination:
   548  
   549  1. instead of populating the settings table at bootstrap, populate a new key
   550     `BootstrapVersion` (similar to the `ClusterIdent`). In effect, for the
   551     lifetime of this cluster, we can get an authoritative answer about the
   552     version at which it was bootstrapped.
   553  1. change the semantics of `SET CLUSTER SETTING version = x` so that when it
   554     doesn't find an entry in the `system.settings` table, it fails.
   555  1. add a sql migration that
   556      - reads `BootstrapVersion`
   557      - runs
   558          ```sql
   559          -- need this explicitly or the `SET` below would fail!
   560          UPSERT INTO system.settings VALUES(
   561            'version', marshal_appropriately('<bootstrap_version>')
   562          );
   563          -- Trigger proper gossip, etc, by doing "no-op upgrade".
   564          SET CLUSTER SETTING version = '<bootstrap_version>';
   565          ```
   566      - the cluster version is now set and operators can use `SET CLUSTER
   567        SETTING`.
   568  
This obviously works if the migration runs while the cluster is still running
the bootstrapped version, and it also works if the operator has already set the
version explicitly (`SET CLUSTER SETTING version = x` is idempotent).
   572  
   573  
   574  ### Bootstrapping new stores
   575  
   576  When an existing node restarts, it has on-disk markers that should reflect a
   577  reasonable version configuration to assume until gossip updates are in effect.
   578  
The situation is slightly different when a new node joins a cluster for the
first time. In this case, it'll bootstrap its stores using its binary's
`MinimumSupportedVersion` (for that is all it knows), which is usually one
minor version behind the cluster's active version.
   583  
This is not an issue since the node's binary can still participate in newer
features, and it will bump its version once it receives the first gossip
update, typically within seconds.
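
A compressed sketch of what gets persisted on such a brand-new store before the
first gossip update arrives (the helper name is hypothetical):

```go
// bootstrapStoreVersion is what a newly bootstrapped store on a joining node
// gets persisted at: all we know is that the cluster admitted us, so it runs
// at least the oldest version this binary can cooperate with.
func bootstrapStoreVersion(minSupported roachpb.Version) cluster.ClusterVersion {
	return cluster.ClusterVersion{
		MinimumVersion: minSupported,
		UseVersion:     minSupported,
	}
}
```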
   587  
   588  We could conceivably be smarter about finding out the active cluster version
   589  proactively (we're connected to gossip when we bootstrap), but this is not
   590  deemed worth the extra complexity.
   591  
   592  ## Drawbacks
   593  
   594  - relying on the operator to promise that no old version is around opens cluster
   595    health up to user error.
   596  - we can't roll back upgrades, which will make many users nervous
    - in particular, this disadvantages the OSS version of CockroachDB
   598  
   599  ## Rationale and Alternatives
   600  
The main concern in designing the mechanism here is operator friendliness. We
don't want to make the process more complicated than necessary; it should be
scriptable, and it should be hard to get wrong (and if you do get it wrong, it
shouldn't matter).
   604  
   605  The main concessions we make in this design are
   606  
1. no support for downgrades once the version setting has been bumped. This was
   discussed early in the life of this RFC but was discarded due to its inherent
   complexity and the large search space that would have to be tested.
   610  
   611     It will be difficult to retrofit this, though a change in the upgrade process
   612     can itself be migrated through the upgrade process presented here (though it
   613     would take one release to go through the transition).
   614  
   615     We can at least downgrade before bumping the cluster version setting though,
   616     which allows all backwards-compatible changes (ideally the bulk) to be
   617     tested.
   618  
   Additionally, we could implement the originally envisioned `UseVersion`
   functionality, so that a restore from backup only becomes necessary if a
   problem is a) severe and b) can't be "deactivated" (via `UseVersion`).
   623  1. relying on the operator to guarantee that a cluster is not running mixed
   624     versions when the explicit version bump is issued.
   625  
   626     There are ways in which this could be approximately inferred, but again it
   627     was deemed too complex given the time frame. Besides, operators may prefer to
   628     have some level of control over the migration process, and it is difficult to
   629     make an autonomous upgrade workflow foolproof.
   630  
   631     If desired in the future, this can be retrofitted.
   632  
As a result, we get a design that's ergonomic but limited. The complexity
inherent in it even as written indicates that not adding further complexity at
this point is a good choice. We are not locked into the process in the long
term.
   637  
   638  ## Unresolved questions
   639  
   640  ### Naming
   641  
   642  `MinimumVersion` and `MinimumSupportedVersion` are similar but also different.
   643  Perhaps the latter should be renamed, though no better name comes to mind.
   644  
   645  ### What to gossip
   646  
   647  We make no use of the gossiped `ServerVersion`. The node's git commit hash is
already available, so this is only mildly interesting. Its `MinimumVersion`
(plus its `UseVersion`, should that ever differ) would be more relevant. Likely
these should be added, even if the information isn't used today.