
- Feature Name: Version upgrades
- Status: rejected
- Start Date: 2016-04-10
- Authors: Tobias Schottdorf
- RFC PR: [#5985](https://github.com/cockroachdb/cockroach/pull/5985)
- Cockroach Issue:

# Rejection notes

This RFC led us to reconsider Raft proposals and led us to consider (and
decide on) leaseholder-evaluated Raft (#6166) instead, which makes Raft
migrations much rarer. We will likely eventually need some of the ideas
brought forth in this RFC, but in a less all-encompassing setting best
considered then.

# Summary

**This is a draft but certainly not the solution. It serves mostly to inspire
discussion and to hopefully iterate on. See the "Drawbacks" section and model
cases below.**

Come up with a basic framework for dealing with migrations. We require a
migration path from the earliest beta version (or, at least, from an early
beta version) into 1.0 and beyond. There are two components: the big picture
and a potentially less powerful version which we can quickly implement to
keep development (relatively) seamless.

# Motivation

Almost every small change in Raft requires a proper migration story. Even
seemingly harmless changes (making a previously nullable field non-nullable)
do, since they can let Replicas which operate at different versions diverge.
It's a no-brainer that we need a proper migration process, with the holy grail
being rolling updates (though stop-the-world is acceptable at least during
beta).

Of course, versioning doesn't stop at Raft and, as we'll see, changes in Raft
quickly spider out of control. This is a first stab at a solution and a
collection of possible issues.

# What (some) others do

## VoltDB

https://docs.voltdb.com/AdminGuide/MaintainUpgradeVoltdb.php

The primary mode is: shut down everything, replace, restart. Otherwise, set
up two clusters and replicate between them; upgrade the actual cluster only
after having promoted the other cluster. This seems really expensive and
involves copying all of the production data. My guess is that folks just use
the first option in practice.

## RethinkDB

As usual, they seem to be on the right track (but they might also be in a less
complicated spot than we are):

http://www.rethinkdb.com/docs/migration/

> 1.16 or higher: Migration is handled automatically. (This is also true for
> upgrading from 1.14 onward to versions earlier than 2.2.) After migration,
> follow the “Rebuild indexes” directions.
>
> 1.13–1.15: Upgrade to RethinkDB 2.0.5 first, rebuild the secondary indexes
> by following the “Rebuild indexes” directions, then upgrade to 2.1 or
> higher. (Migration from 2.0.5 to 2.1+ will be handled automatically.)
>
> 1.7–1.12: Follow the “Migrating old data” directions.
>
> 1.6 or earlier: Read the “Deprecated versions” section.

It looks like, for recent versions, you just stop the process, replace the
binary and run the new thing. I did not see anything indicating online
migrations, so this is similar to VoltDB's first option.

## Percona XtraDB Cluster

Seems relatively complicated, but it is an online process.

https://www.percona.com/doc/percona-xtradb-cluster/5.6/upgrading_guide_55_56.html

## Cassandra

Rolling restarts, relatively straightforward it seems: essentially, drain the
node, stop it, update the binary, then start the new version. It appears that
downgrading is more involved (there's no clear path except through a data
dump and starting from scratch with the old version). They also have it
easier than us.

http://docs.datastax.com/en/archived/cassandra/1.2/cassandra/upgrade/upgradeChangesC_c.html

# High level design

Bundle all execution methods (everything multiplexed to by `executeCmd`) in a
collection `ReplicaMethods` which is to be specified (but semantically
speaking, it is a map of command to implementation of the command).
It is this entity which is being versioned. Add `var MaxReplicaVersion int64`
to the `storage` package. The version stored there is the version of the
current code, and marks that and all previous versions as supported (at some
point, we'll likely want a lower bound as well).

In a nutshell, each `*Replica` keeps track of its active version (persisted
to a key and cached in-memory) and uses an appropriately synthesized
`ReplicaMethods` to execute (Raft and read) commands. A migration occurs as a
new `ChangeVersion` Raft command is applied (it is to be determined how
`ChangeVersion` itself is versioned) and replaces the synthesized version of
`ReplicaMethods` with one corresponding to the new version.
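
To make the shape of this concrete, here is a minimal sketch of what the
versioned command table and the synthesis step could look like.
`ReplicaMethods` and `MaxReplicaVersion` are the names proposed above;
everything else (the `CommandFunc` signature, keying commands by name, the
per-version delta table) is illustrative only, not an existing API.

```go
package storage

// MaxReplicaVersion is the replica version implemented by the current code;
// it and all previous versions are supported.
var MaxReplicaVersion int64 = 126

// CommandFunc stands in for the implementation of a single command; the real
// signature would mirror whatever executeCmd dispatches to.
type CommandFunc func(args interface{}) (interface{}, error)

// ReplicaMethods maps a command name to its implementation.
type ReplicaMethods map[string]CommandFunc

// versionedCommands records, per version, only the commands whose semantics
// changed at that version; synthesizeMethods fills in the rest.
var versionedCommands = map[int64]ReplicaMethods{}

// synthesizeMethods builds the ReplicaMethods a Replica at the given active
// version would use: for each command, the newest implementation at or below
// that version wins.
func synthesizeMethods(version int64) ReplicaMethods {
	methods := ReplicaMethods{}
	for v := int64(0); v <= version; v++ {
		for name, impl := range versionedCommands[v] {
			methods[name] = impl // later versions shadow earlier ones
		}
	}
	return methods
}
```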

## User-facing

### Upgrades

The most obvious transition is a version upgrade. In the happy case, all
replicas run a binary whose supported version is at least as high as that
requested by `ChangeVersion`; they can seamlessly process all following
commands with the new semantics. It is this case which is easily achieved:
stop all the nodes, update all the binaries, restart the cluster (it operates
at the old version), and then trigger an online update. Or, for a rolling
upgrade, stop and update the nodes in turn, and then trigger the upgrade -
two methods, same outcome.

The unhappy upgrade case corresponds to not having upgraded all binaries
before sending `ChangeVersion`. The only correct option is to die - nothing
can be processed any more since the semantics are unknown. This case should
be avoided by user-friendly tooling - an "upgrade trigger" could periodically
check until all of the Replicas can be upgraded, with appropriate warnings
logged. It should be possible to upgrade automatically after a rolling
cluster restart in that way, but we may prefer to keep it explicit
(`./cockroach update latest`?).
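
As a sketch of what such an upgrade trigger might do: everything below is
hypothetical, in particular the `supportedVersion` callback, which stands in
for however the trigger would learn each node's supported version.

```go
package main

import (
	"log"
	"time"
)

// waitForUpgradable blocks until every node reports support for the target
// version, logging the stragglers, so that the caller only sends
// ChangeVersion once it cannot fail for version reasons.
func waitForUpgradable(target int64, nodes []string,
	supportedVersion func(node string) int64) {
	for {
		var laggards []string
		for _, node := range nodes {
			if supportedVersion(node) < target {
				laggards = append(laggards, node)
			}
		}
		if len(laggards) == 0 {
			return // safe to trigger the upgrade
		}
		log.Printf("cannot upgrade to version %d yet; waiting on %v",
			target, laggards)
		time.Sleep(10 * time.Second)
	}
}
```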

### Downgrades

Users will want to be able to undo an upgrade in case they encountered a
newly introduced bug or degradation. There are two types of downgrades:
either only the `ReplicaVersion` must decrease (i.e. keep the new binary, but
run old `ReplicaMethods`), or the binary itself needs to be reverted to an
older build (likely because of a bug unrelated to this versioning).

The first case is trivial: send an appropriate `ChangeVersion`. In the second
case, we do as in the first case, and then we can do a (full or rolling)
cluster restart with the desired (older) binary.

## Developer-facing

The developer-facing side of this system should be easy to understand and
maintain. We need to keep old versions of functionality, but they should be
clearly recognizable as such and easy to "ignore". Likewise, we need lints
to make sure that a relevant code change results in the proper migration, and
need to be able to test migrations. For this, we

* augment test helpers and configurations so that servers/replicas can be
  spun up at any version (defaulting to `MaxReplicaVersion`) to make it easy
  to test specific versions, or even to run tests against ranges of versions
  (see the test sketch after this list).
* keep the "latest" version of the commands in a distinguished position
  (close to where they are now, with mild refactoring to establish some
  independence from `(*Replica)`).
* when changing the semantics of a command, copy the "old" code to its
  designated new location keyed by the (now old) version ID (for example,
  `./storage/replica_command.125.RequestLease.go`):
  ```go
  versionedCommands[125][RequestLease] = func(...) (...) { ... }
  ```
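
The following is a minimal sketch of the kind of test the first bullet
enables, reusing the hypothetical `synthesizeMethods` and `MaxReplicaVersion`
from the earlier sketch (and assuming the standard `testing` import); a real
helper would instead spin up a test server pinned to each version.

```go
// TestRequestLeaseAcrossVersions exercises a range of supported versions;
// 120 is an arbitrary illustrative lower bound.
func TestRequestLeaseAcrossVersions(t *testing.T) {
	for version := int64(120); version <= MaxReplicaVersion; version++ {
		methods := synthesizeMethods(version)
		if _, ok := methods["RequestLease"]; !ok {
			t.Errorf("version %d: no RequestLease implementation", version)
		}
	}
}
```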

Let's look at some model cases for this naive approach.

### Model case: Simple behavioral change

Assume we change `BeginTransaction` so that it fails unless the supplied
transaction proto has `Heartbeat.Equal(OrigTimestamp)`. It's tempting to
achieve that by wrapping around the previous version, but doing that a couple
of times would render the code unreadable. Instead, copy the existing code
out as described above for the previous version and update the new code
in-place. This is the case in which the system actually works.
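
Schematically, the copy-then-edit pattern could look like the sketch below;
`txnArgs`, the function names, and the version numbers are all illustrative
stand-ins, not existing code.

```go
package storage

import "errors"

// txnArgs is an illustrative stand-in for the transaction proto.
type txnArgs struct {
	Heartbeat, OrigTimestamp int64
}

// beginTransaction125 is the old implementation, copied verbatim to
// storage/replica_command.125.BeginTransaction.go and frozen there.
func beginTransaction125(a txnArgs) error {
	// Old semantics: no heartbeat check.
	return nil
}

// beginTransaction is the latest implementation, edited in place rather than
// wrapping the 125 copy: version 126 adds the new check up front.
func beginTransaction(a txnArgs) error {
	if a.Heartbeat != a.OrigTimestamp {
		return errors.New("BeginTransaction: heartbeat differs from orig timestamp")
	}
	// ... the rest of the command logic continues to live here.
	return nil
}
```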

### Model case: Unintended behavioral change

Assume someone decides that our heartbeat interval needs to be smaller
(motivated by something in the coordinator) and changes
`base.DefaultHeartbeatInterval` without thinking it through. That variable
is used in `PushTxn` to decide whether the transaction record is abandoned,
so during an update, replicas running different versions might disagree on
whether a transaction is aborted. Many such examples exist and I'm afraid
even Sisyphus could get tired of finding and building lints around them all.

### Model case: Simple proto change

Assume we change `(*Transaction).LastHeartbeat` from a nullable to a
non-nullable field (as in #5753). This is still very tricky because the logic
hides in the protobuf generated code - encoding a zero timestamp is different
from omitting the field completely (as would happen with a nullable one), but
using the new generated code instantly switches on this changed behavior.
Thus, we must make a copy of all commands which serialize the Transaction
proto and replace the marshalling with one that (somehow) makes sure the
timestamp is skipped when zero, in all previous versions. That's correct if
we're sure that we never encountered (and never will) a non-nil zero
`LastHeartbeat`.
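
The crux, sketched with illustrative types: with gogoproto, `nullable=false`
turns the pointer field into a value field, and the generated marshalling
code encodes a value field unconditionally, whereas a nil pointer is skipped.

```go
// Timestamp stands in for the real timestamp proto.
type Timestamp struct {
	WallTime int64
}

// TransactionOld is the shape of the generated struct before the change:
// a nil pointer makes the generated Marshal omit the field entirely.
type TransactionOld struct {
	LastHeartbeat *Timestamp
}

// TransactionNew is the shape after gogoproto's nullable=false: the field is
// a value, and the generated Marshal encodes it even when it is zero.
type TransactionNew struct {
	LastHeartbeat Timestamp
}
```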

Proto changes are quite horrible. See the next example (#5845).

### Model case: adding a proto field

Same as the nullability change. All old versions of the code must not use the
new marshalling code. The old version of the struct and its generated code
must be kept in a separate location. For example, when adding a field
`MaxOffset` to `RequestLease`, the previous version of `RequestLease` and its
code are kept at `./storage/replica_command.<vers>.go` and the old version
converts `roachpb.RequestLease` to `OldRequestLease` (dropping the new field)
and then marshals that to disk.
Of course the lease is also accessed from other locations (loading the
lease), so that requires versioning as well, leading to, essentially, a mess.
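
A hedged sketch of that down-conversion follows. `MaxOffset` is the RFC's
example; every other field and function name here is made up, and the marshal
functions are stubs standing in for the generated proto code.

```go
// OldRequestLease mirrors the pre-change wire format: no MaxOffset.
type OldRequestLease struct {
	Start, Expiration int64
}

// RequestLease is the current proto with the newly added field.
type RequestLease struct {
	Start, Expiration int64
	MaxOffset         int64
}

// marshalOld and marshalNew stand in for the generated proto marshalling of
// the respective structs.
func marshalOld(l OldRequestLease) ([]byte, error) { return nil, nil }
func marshalNew(l RequestLease) ([]byte, error)    { return nil, nil }

// marshalForVersion encodes a lease the way the given replica version
// expects to find it on disk.
func marshalForVersion(l RequestLease, version int64) ([]byte, error) {
	if version < 126 { // illustrative version that introduced MaxOffset
		// Drop the new field so the bytes match the old wire format.
		return marshalOld(OldRequestLease{Start: l.Start, Expiration: l.Expiration})
	}
	return marshalNew(l)
}
```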

See #5817 for a more involved change which requires additions to a proto and
associated changes in `ConditionalPut`.

### Model case: adding a new Raft command

We've talked repeatedly about changing the KV API to support more powerful
primitives. This would be a sweeping change, but a basic building block is
introducing a new Raft command. Theoretically, both up- and downgrading these
should be possible in the same ways as discussed before, but there's likely
additional complexity.

# Drawbacks

See above: there is a lot of code duplication, and it's difficult to enforce
that every change which needs a migration is actually flagged as such. Some
of that might be remedied through refactoring (moving all replica commands
into their own package and linting against access across that package
boundary).

The versioning scheme here will work well only if the changes are confined to
within the Raft commands and don't affect serialization. Proto changes are
going to be somewhat more involved, even in simple cases.

# Alternatives

One issue with the above is the code complexity involved in migrations. This
could potentially be attenuated by the approaches below.

## Embracing adjacent versions more

Supporting upgrades only between adjacent versions could enable us to
automate a lot of the versioning effort (by automatically keeping two
versions of the protobufs, Raft commands, etc.) and would require less
hand-crafting. The obvious downside is that users need to upgrade versions
one at a time, which can be very annoying. This could be attenuated partially
by supplying an external binary which runs the individual steps
(`cockroach-update latest`) one after another.
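
The core of such an external updater could be as simple as the sketch below,
where `step` stands in for whatever performs a single adjacent-version
upgrade (a hypothetical helper, not an existing CLI, and assuming the
standard `fmt` import):

```go
// updateToLatest drives a cluster from its current version to the latest
// one, one adjacent step at a time, stopping at the first failure.
func updateToLatest(current, latest int64, step func(target int64) error) error {
	for v := current + 1; v <= latest; v++ {
		if err := step(v); err != nil {
			return fmt.Errorf("upgrade to version %d failed: %v", v, err)
		}
	}
	return nil
}
```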

## Embrace stop-the-world more

The complexity in the current migration is due to a new binary having to act
exactly like an old binary until the version trigger is pulled. That almost
translates to having all previous versions of the code base embedded in it,
and being able to switch atomically (actually even worse, atomically on each
Replica). Allowing each binary to just be itself early on sounds much less
involved.

However, not having the version switch orchestrated through Raft presents
other difficulties: we won't be able to do rolling updates, and even for
offline updates we must make sure that all nodes are stopped with the same
part of the logs applied, and no unapplied log entries present (as it's not
clear what version they've been proposed with). That naturally leads again to
a Raft command:

The update sends a `ChangeVersion` (which is more of a `StopForUpgrade`
command in this section) to all Replicas, with the following semantics:
* It's the highest committed command when it applies (i.e. the Replica won't
  acknowledge receipt of additional log entries after the one containing
  `ChangeVersion`) and,
* upon application, the Replica stalls (but not the node, to give all
  replicas a chance to stall). Only then
* the process exits (potentially after preparing the data for the desired
  version if it's a downgrade).
* When the process restarts, it checks that it's running the desired version
  of the binary, performs any data upgrades it has to do, and then resumes
  operation (a minimal sketch of this startup check follows the list).
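
The restart-time check could look roughly like this fragment; it assumes the
standard `fmt` import and the `MaxReplicaVersion` from the earlier sketch,
and `runDataUpgrades` is a hypothetical stand-in for the actual migrations.

```go
// checkVersionAtStartup compares the version the on-disk data was prepared
// for with the version this binary implements.
func checkVersionAtStartup(persistedVersion int64) error {
	switch {
	case persistedVersion == MaxReplicaVersion:
		return nil // nothing to do; resume operation
	case persistedVersion < MaxReplicaVersion:
		// Run the forward data upgrades before serving traffic.
		return runDataUpgrades(persistedVersion, MaxReplicaVersion)
	default:
		return fmt.Errorf("data at version %d, binary supports up to %d; "+
			"restart with a newer binary", persistedVersion, MaxReplicaVersion)
	}
}

// runDataUpgrades is a stub standing in for the actual (reversible) data
// transformations.
func runDataUpgrades(from, to int64) error { return nil }
```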

As long as the data changes are reversible, this should do the trick.
There's a chance of leaving the cluster in an undefined state should the
migration crash (we likely won't be able to do it atomically, only on each
replica), but that could be remedied with tooling (to manually drive the
process forward). In effect, this would version everything by binary version.

For example,
* when adding a (non-nullable) proto field to `Lease`, the forward and
  backward migrations would simply delete all `Lease` entries (since that is
  the simplest option and works fine: the only enemy is setting range leases
  with a zero value where there shouldn't be one)
* the `LastHeartbeat` nullability change would not have to migrate anything.
* more complicated proto changes could still require taking some of the
  actual old marshalling code to transcode parts of the keyspace in
  preparation for running under a different version.

## Embrace inconsistency more

Maybe some inconsistencies can be acceptable (encoding a zero timestamp vs. a
nil timestamp, etc.) and trying to maintain consistency seamlessly over
migrations isn't useful while running a mixed cluster. Instead, disable the
consistency checks (which panic on failure) while the cluster is
mixed-version, and re-enable them after a migration queue has processed the
replica set.

# Unresolved questions

Many, see above.