- Feature Name: Version upgrades
- Status: rejected
- Start Date: 2016-04-10
- Authors: Tobias Schottdorf
- RFC PR: [#5985](https://github.com/cockroachdb/cockroach/pull/5985)
- Cockroach Issue:

# Rejection notes

This RFC led us to reconsider Raft proposals and to consider (and decide for)
leaseholder-evaluated Raft (#6166) instead, which makes Raft migrations much
rarer. We will likely eventually need some of the ideas brought forth in this
RFC, but in a less all-encompassing setting best considered then.

# Summary

**This is a draft and certainly not the solution. It serves mostly to inspire
discussion and to be iterated on. See the "Drawbacks" section and the model
cases below.**

Come up with a basic framework for dealing with migrations. We require a
migration path from the earliest beta version (or, at least, from an early beta
version) into 1.0 and beyond. There are two components: the big picture and a
potentially less powerful version which we can quickly implement to keep
development (relatively) seamless.

# Motivation

Almost every small change in Raft requires a proper migration story. Even
seemingly harmless changes (making a previously nullable field non-nullable)
do, since they can let Replicas which operate at different versions diverge.
It's a no-brainer that we need a proper migration process, with the holy grail
being rolling updates (though stop-the-world is acceptable at least during
beta).

Of course, versioning doesn't stop at Raft and, as we'll see, changes in Raft
quickly spider out of control. This is a first stab at the problem and a
collection of possible issues.

# What (some) others do

## VoltDB

https://docs.voltdb.com/AdminGuide/MaintainUpgradeVoltdb.php

The primary mode is: shut down everything, replace, restart. Otherwise, set up
two clusters and replicate between them; upgrade the actual cluster only after
having promoted the other cluster. This seems really expensive and involves
copying all of the production data. My guess is that folks just use the first
option in practice.

## RethinkDB

As usual, they seem to be on the right track (but they might also be in a less
complicated spot than we are):

http://www.rethinkdb.com/docs/migration/

> 1.16 or higher: Migration is handled automatically. (This is also true for
> upgrading from 1.14 onward to versions earlier than 2.2.) After migration,
> follow the “Rebuild indexes” directions.
> 1.13–1.15: Upgrade to RethinkDB 2.0.5 first, rebuild the secondary indexes by
> following the “Rebuild indexes” directions, then upgrade to 2.1 or higher.
> (Migration from 2.0.5 to 2.1+ will be handled automatically.)
> 1.7–1.12: Follow the “Migrating old data” directions.
> 1.6 or earlier: Read the “Deprecated versions” section.

It looks like, for recent versions, you just stop the process, replace the
binary and run the new thing. I did not see anything indicating online
migrations, so this is similar to VoltDB's first option.

## Percona XtraDB Cluster

Seems relatively complicated, but it is an online process.

https://www.percona.com/doc/percona-xtradb-cluster/5.6/upgrading_guide_55_56.html

## Cassandra

Rolling restarts, relatively straightforward it seems. Essentially: drain the
node, stop it, update the binary, start the new version. It appears that
downgrading is more involved (there's no clear path except through a data dump
and starting from scratch with the old version). They also have it easier than
us.

http://docs.datastax.com/en/archived/cassandra/1.2/cassandra/upgrade/upgradeChangesC_c.html

# High level design

Bundle all execution methods (everything multiplexed to by `executeCmd`) in a
collection `ReplicaMethods`, to be specified (but semantically speaking, it is
a map from command to implementation of the command). It is this entity which
is being versioned. Add `var MaxReplicaVersion int64` to the `storage` package.
The version stored there is the version of the current code, and it marks that
and all previous versions as supported (at some point, we'll likely want a
lower bound as well).

In a nutshell, each `*Replica` keeps track of its active version (persisted to
a key and cached in-memory) and uses an appropriately synthesized
`ReplicaMethods` to execute (Raft and read) commands. A migration occurs when a
new `ChangeVersion` Raft command is applied (it is to be determined how
`ChangeVersion` itself is versioned): it replaces the synthesized
`ReplicaMethods` with one corresponding to the new version.
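To make the shape of this concrete, here is a minimal sketch of how the pieces
could fit together. Only `ReplicaMethods`, `MaxReplicaVersion` and
`ChangeVersion` are taken from the description above; everything else
(`CommandFunc`, `versionedCommands`, `synthesizeMethods`, the fields on
`Replica`, the version numbers) is made up for illustration, not a proposed
implementation.

```go
package storage

import "errors"

// CommandFunc stands in for the signature shared by all command
// implementations (the real one would carry request, response and engine
// state); it is hypothetical, as are all other names not mentioned above.
type CommandFunc func(args interface{}) (interface{}, error)

// MaxReplicaVersion is the version of the current code; it and all previous
// versions are supported.
var MaxReplicaVersion int64 = 126 // made-up number

// ReplicaMethods maps each command (keyed by name here for simplicity) to the
// implementation a Replica at a given version must use.
type ReplicaMethods map[string]CommandFunc

// versionedCommands holds, per version, only the commands whose semantics
// changed at that version; everything else falls through to the most recent
// earlier implementation.
var versionedCommands = map[int64]ReplicaMethods{}

// synthesizeMethods builds the ReplicaMethods for a Replica operating at
// `version` by picking, for every command, the newest implementation whose
// version does not exceed it.
func synthesizeMethods(version int64) ReplicaMethods {
	out := ReplicaMethods{}
	for v := int64(0); v <= version; v++ {
		for name, impl := range versionedCommands[v] {
			out[name] = impl // later versions overwrite earlier ones
		}
	}
	return out
}

// Replica is trimmed down to the two fields this sketch needs.
type Replica struct {
	activeVersion int64 // also persisted to a replica-local key
	methods       ReplicaMethods
}

var errUnsupportedVersion = errors.New("binary does not support the requested replica version")

// applyChangeVersion is roughly what applying a ChangeVersion Raft command
// would boil down to: record the new version and swap the method table.
func (r *Replica) applyChangeVersion(newVersion int64) error {
	if newVersion > MaxReplicaVersion {
		// The unhappy case discussed below: unknown semantics, so the only
		// correct option is to stop processing.
		return errUnsupportedVersion
	}
	r.activeVersion = newVersion
	r.methods = synthesizeMethods(newVersion)
	return nil
}
```

`executeCmd` would then dispatch through the replica's current method table
instead of calling the implementations directly.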
## User-facing

### Upgrades

The most obvious transition is a version upgrade. In the happy case, all
replicas run a binary which supports a version at least as high as that
requested by `ChangeVersion`; they can seamlessly process all following
commands with the new semantics. It is this case which is easily achieved:
stop all the nodes, update all the binaries, restart the cluster (it operates
at the old version), and then trigger an online update. Or, for a rolling
upgrade, stop and update the nodes in turn, and then trigger the upgrade - two
methods, same outcome.

The unhappy upgrade case corresponds to not having upgraded all binaries before
sending `ChangeVersion`. The only correct option is to die - nothing can be
processed any more since the semantics are unknown. This case should be avoided
by user-friendly tooling - an "upgrade trigger" could check periodically until
all of the Replicas can be upgraded, with appropriate warnings logged. It
should be possible to update automatically after a rolling cluster restart in
that way, but we may prefer to keep it explicit (`./cockroach update
latest`?).
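For illustration, a rough sketch of what such an upgrade trigger could look
like; the `cluster` interface, its methods and the polling interval are all
invented here, since no such introspection API exists in this proposal yet.

```go
package upgradetrigger

import (
	"log"
	"time"
)

// cluster is a hypothetical handle for the two things the trigger needs:
// asking each node which replica version its binary supports, and proposing
// the ChangeVersion once everybody is ready.
type cluster interface {
	NodeVersions() map[string]int64 // node ID -> supported version of its binary
	TriggerChangeVersion(target int64) error
}

// waitAndUpgrade polls until every node's binary supports the target version,
// logging stragglers along the way, and only then pulls the trigger.
func waitAndUpgrade(c cluster, target int64) error {
	for {
		stragglers := 0
		for node, v := range c.NodeVersions() {
			if v < target {
				log.Printf("node %s only supports version %d < %d; waiting for it to be updated",
					node, v, target)
				stragglers++
			}
		}
		if stragglers == 0 {
			break
		}
		time.Sleep(10 * time.Second)
	}
	return c.TriggerChangeVersion(target)
}
```

A loop of this kind is presumably what a command like `./cockroach update
latest` would wrap.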
### Downgrades

Users will want to be able to undo an upgrade in case they encounter a newly
introduced bug or degradation. There are two types of downgrades: either it's
only the `ReplicaVersion` which must decrease (i.e. keep the new binary, but
run old `ReplicaMethods`), or it is the binary itself which needs to be
reverted to an older build (likely because of a bug unrelated to this
versioning).

The first case is trivial: send an appropriate `ChangeVersion`. In the second
case, we do as in the first case, and then we can do a (full or rolling)
cluster restart with the desired (older) binary.

## Developer-facing

The developer-facing side of this system should be easy to understand and
maintain. We need to keep old versions of functionality, but they should be
clearly recognizable as such and easy to "ignore". Likewise, we need lints
to make sure that a relevant code change results in the proper migration, and
we need to be able to test migrations. For this, we

* augment test helpers and configurations so that servers/replicas can be spun
  up at any version (defaulting to `MaxReplicaVersion`) to make it easy to test
  specific versions (or even to run tests against ranges of versions),
* keep the "latest" version of the commands in a distinguished position (close
  to where they are now, with mild refactoring to establish some independence
  from `(*Replica)`), and
* when changing the semantics of a command, copy the "old" code to its
  designated new location keyed by the (now old) version ID (for example,
  `./storage/replica_command.125.RequestLease.go`):

```go
versionedCommands[125][RequestLease] = func(...) (...) { ... }
```

Let's look at some model cases for this naive approach.

### Model case: Simple behavioral change

Assume we change `BeginTransaction` so that it fails unless the supplied
transaction proto has `Heartbeat.Equal(OrigTimestamp)`. It's tempting to
achieve that by wrapping around the previous version, but doing that a couple
of times would render the code unreadable. Instead, copy the existing code out
as described above under the previous version and update the new code in
place. This is the case in which the system actually works.

### Model case: Unintended behavioral change

Assume someone decides that our heartbeat interval needs to be smaller
(motivated by something in the coordinator) and changes
`base.DefaultHeartbeatInterval` without thinking it through. That variable is
used in `PushTxn` to decide whether a transaction record is abandoned, so
during an update replicas running different binaries might disagree on whether
a transaction is abandoned, i.e. on whether it gets aborted. Many such examples
exist and I'm afraid even Sisyphus could get tired of finding and building
lints around them all.

### Model case: Simple proto change

Assume we change `(*Transaction).LastHeartbeat` from a nullable to a
non-nullable field (as in #5753). This is still very tricky because the logic
hides in the protobuf generated code - encoding a zero timestamp is different
from omitting the field completely (as would happen with a nullable one), but
using the new generated code instantly switches on this changed behavior. Thus,
we must make a copy of all commands which serialize the Transaction proto and
replace the marshalling with one that (somehow) makes sure the timestamp is
skipped when zero, in all previous versions. That's correct if we're sure that
we never encountered (and never will) a non-nil zero LastHeartbeat.

Proto changes are quite horrible. See the next example (#5845).

### Model case: adding a proto field

Same as the nullability change. None of the old versions of the code may use
the new marshalling code. The old version of the struct and its generated code
must be kept in a separate location. For example, when adding a field
`MaxOffset` to `RequestLease`, the previous version of `RequestLease` and its
code are kept at `./storage/replica_command.<vers>.go` and the old version
converts `roachpb.RequestLease` to `OldRequestLease` (dropping the new field)
and then marshals that to disk.
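A rough sketch of that down-conversion follows, with hand-written stand-ins in
place of the generated protobuf code: the field set, the version number and
`encode` (a placeholder for the real marshalling) are invented for
illustration; only the names `RequestLease`, `OldRequestLease` and the new
`MaxOffset` field come from the example above.

```go
package storage

import (
	"bytes"
	"encoding/gob"
)

// RequestLease stands in for the current generated proto, which gained the
// new MaxOffset field.
type RequestLease struct {
	Start, Expiration int64
	MaxOffset         int64 // the newly added field
}

// OldRequestLease is a frozen copy of the previous generated type, kept next
// to the old command code (e.g. ./storage/replica_command.<vers>.go).
type OldRequestLease struct {
	Start, Expiration int64
}

// downgradeRequestLease drops the new field so that a replica still running
// the old version marshals exactly the bytes the old binary would have
// produced.
func downgradeRequestLease(req RequestLease) OldRequestLease {
	return OldRequestLease{Start: req.Start, Expiration: req.Expiration}
}

// marshalLeaseAtVersion picks the wire representation according to the
// replica's active version; 126 is a made-up version that introduced
// MaxOffset.
func marshalLeaseAtVersion(req RequestLease, version int64) ([]byte, error) {
	if version < 126 {
		return encode(downgradeRequestLease(req))
	}
	return encode(req)
}

// encode is a placeholder for the protobuf marshalling in the real code.
func encode(v interface{}) ([]byte, error) {
	var buf bytes.Buffer
	err := gob.NewEncoder(&buf).Encode(v)
	return buf.Bytes(), err
}
```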
Of course the lease is also accessed from other locations (loading the lease),
so that requires versioning as well, leading to, essentially, a mess.

See #5817 for a more involved change which requires additions to a proto and
associated changes in `ConditionalPut`.

### Model case: adding a new Raft command

We've talked repeatedly about changing the KV API to support more powerful
primitives. This would be a sweeping change, but a basic building block is
introducing a new Raft command. Theoretically, both up- and downgrading these
should be possible in the same ways as discussed before, but there's likely
additional complexity.

# Drawbacks

See above. There is a lot of code duplication, and it's difficult to enforce
that anything which needs a migration actually gets one. Some of that might be
remedied through refactoring (moving all replica commands into their own
package and linting against access across that package boundary).

The versioning scheme here will work well only if the changes are confined to
the Raft commands and don't affect serialization. Proto changes are going to be
somewhat more involved, even in simple cases.

# Alternatives

One issue with the above is the code complexity involved in migrations. This
could potentially be attenuated by the following.

## Embracing adjacent versions more

Supporting upgrades only between adjacent versions could enable us to automate
a lot of the versioning effort (by automatically keeping two versions of the
protobufs, Raft commands, etc.) and would require less hand-crafting. The
obvious downside is that users need to upgrade versions one at a time, which
can be very annoying. This could be attenuated partially by supplying an
external binary which runs the individual steps (`cockroach-update latest`)
one after another.

## Embrace stop-the-world more

The complexity in the current migration is due to a new binary having to act
exactly like an old binary until the version trigger is pulled. That almost
translates to having all previous versions of the code base embedded in it, and
being able to switch atomically (actually even worse, atomically on each
Replica). Allowing each binary to just be itself early on sounds much less
involved.

However, not having the version switch orchestrated through Raft presents other
difficulties: we won't be able to do rolling updates, and even for offline
updates we must make sure that all nodes are stopped with the same part of the
logs applied, and no unapplied log entries present (as it's not clear what
version they've been proposed with). That naturally leads again to a Raft
command:

The update sends a `ChangeVersion` (which is more of a `StopForUpgrade` command
in this section) to all Replicas, with the following semantics:

* it's the highest committed command when it applies (i.e. the Replica won't
  acknowledge receipt of additional log entries after the one containing
  `ChangeVersion`),
* upon application, the Replica stalls (but not the node, to give all replicas
  a chance to stall),
* only then does the process exit (potentially after preparing the data for the
  desired version if it's a downgrade), and
* when the process restarts, it checks that it's running the desired version of
  the binary, performs any data upgrades it has to do, and then resumes
  operation (see the sketch below).
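A sketch of the restart-time half of the last bullet, assuming a hypothetical
`binaryVersion` constant baked into the build and a `targetVersion` persisted
by the `ChangeVersion`/`StopForUpgrade` application before the process exited;
none of these names exist today.

```go
package storage

import "fmt"

// binaryVersion would be baked into the build at compile time.
const binaryVersion int64 = 126 // made-up

// checkVersionOnStartup refuses to serve unless the running binary is the one
// the cluster asked for, then runs whatever data migration (forward or
// backward) is still pending before the node resumes operation.
func checkVersionOnStartup(
	previousVersion, targetVersion int64,
	migrateData func(from, to int64) error,
) error {
	if binaryVersion != targetVersion {
		return fmt.Errorf(
			"this binary runs version %d, but the cluster was stopped for a switch to %d; "+
				"install the requested binary and restart",
			binaryVersion, targetVersion)
	}
	if previousVersion == targetVersion {
		return nil // nothing to migrate
	}
	// E.g. deleting Lease entries or transcoding parts of the keyspace.
	return migrateData(previousVersion, targetVersion)
}
```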
As long as the data changes are reversible, this should do the trick. There's a
chance of leaving the cluster in an undefined state should the migration crash
(we likely won't be able to perform it atomically, only per replica), but that
could be remedied with tooling to manually drive the process forward. In
effect, this would version everything by binary version.

For example,

* when adding a (non-nullable) proto field to `Lease`, the forward and backward
  migrations would simply delete all `Lease` entries (since that is the
  simplest option and works fine: the only enemy is setting range leases with a
  zero value where there shouldn't be one),
* the `LastHeartbeat` nullability change would not have to migrate anything,
  and
* more complicated proto changes could still require taking some of the actual
  old marshalling code to transcode parts of the keyspace in preparation for
  running under a different version.

## Embrace inconsistency more

Maybe some inconsistencies are acceptable (encoding a zero timestamp vs. a nil
timestamp, etc.) and trying to maintain consistency seamlessly over migrations
isn't useful while running a mixed cluster. Instead, disable (panicking on
failed) consistency checks while the cluster is mixed-version, and re-enable
them after a migration queue has processed the replica set.

# Unresolved questions

Many; see above.