
- Feature Name: Version upgrades
- Status: rejected
- Start Date: 2016-04-10
- Authors: Tobias Schottdorf
- RFC PR: [#5985](https://github.com/cockroachdb/cockroach/pull/5985)
- Cockroach Issue:

# Rejection notes

This RFC led us to reconsider Raft proposals and led us to consider (and
decide on) leaseholder-evaluated Raft (#6166) instead, which makes Raft
migrations much rarer. We will likely eventually need some of the ideas
brought forth in this RFC, but in a less all-encompassing setting best
considered then.

# Summary

**This is a draft but certainly not the solution. It serves mostly to inspire
discussion and to hopefully iterate on. See the "Drawbacks" section and model
cases below.**

Come up with a basic framework for dealing with migrations. We require a
migration path from the earliest beta version (or, at least, from an early
beta version) into 1.0 and beyond. There are two components: the big picture
and a potentially less powerful version which we can quickly implement to
keep development (relatively) seamless.

# Motivation

Almost every small change in Raft requires a proper migration story. Even
seemingly harmless changes (making a previously nullable field non-nullable)
do, since they can let Replicas which operate at different versions diverge.
It's a no-brainer that we need a proper migration process, with the holy grail
being rolling updates (though stop-the-world is acceptable at least during
beta).

Of course, versioning doesn't stop at Raft and, as we'll see, changes in Raft
quickly spider out of control. This is a first stab at a solution and a
collection of possible issues.

# What (some) others do

## VoltDB

https://docs.voltdb.com/AdminGuide/MaintainUpgradeVoltdb.php

The primary mode is: shut down everything, replace, restart. Otherwise, set
up two clusters and replicate between them; upgrade the actual cluster only
after having promoted the other cluster. This seems really expensive and
involves copying all of the production data. My guess is that folks just use
the first option in practice.

## RethinkDB

As usual, they seem to be on the right track (but they might also be in a less
complicated spot than we are):

http://www.rethinkdb.com/docs/migration/

> 1.16 or higher: Migration is handled automatically. (This is also true for
> upgrading from 1.14 onward to versions earlier than 2.2.) After migration,
> follow the “Rebuild indexes” directions.
>
> 1.13–1.15: Upgrade to RethinkDB 2.0.5 first, rebuild the secondary indexes
> by following the “Rebuild indexes” directions, then upgrade to 2.1 or
> higher. (Migration from 2.0.5 to 2.1+ will be handled automatically.)
>
> 1.7–1.12: Follow the “Migrating old data” directions.
>
> 1.6 or earlier: Read the “Deprecated versions” section.

It looks like, for recent versions, you just stop the process, replace the
binary and run the new thing. I did not see anything indicating online
migrations, so this is similar to VoltDB's first option.

## Percona XtraDB Cluster

Seems relatively complicated, but it is an online process.

https://www.percona.com/doc/percona-xtradb-cluster/5.6/upgrading_guide_55_56.html

## Cassandra

Rolling restarts, relatively straightforward it seems: essentially, drain the
node, stop it, update the binary, then start the new version. It appears that
downgrading is more involved (there's no clear path except through a data
dump and starting from scratch with the old version). They also have it
easier than us.

http://docs.datastax.com/en/archived/cassandra/1.2/cassandra/upgrade/upgradeChangesC_c.html

# High level design

Bundle all execution methods (everything multiplexed to by `executeCmd`) in a
collection `ReplicaMethods` which is to be specified (but semantically
speaking, it is a map of command to implementation of the command).
It is this entity which is being versioned. Add `var MaxReplicaVersion int64`
to the `storage` package. The version stored there is the version of the
current code, and marks that and all previous versions as supported (at some
point, we'll likely want a lower bound as well).

In a nutshell, each `*Replica` keeps track of its active version (persisted
to a key and cached in-memory) and uses an appropriately synthesized
`ReplicaMethods` to execute (Raft and read) commands. A migration occurs as a
new `ChangeVersion` Raft command is applied (it is to be determined how
`ChangeVersion` itself is versioned) and replaces the synthesized version of
`ReplicaMethods` with one corresponding to the new version.
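
To make the shape of this concrete, here is a minimal sketch of what the
versioned command table and the synthesis step could look like.
`ReplicaMethods` and `MaxReplicaVersion` are the names proposed above;
everything else (the `CommandFunc` signature, keying commands by name, the
per-version delta table) is illustrative only, not an existing API.

```go
package storage

// MaxReplicaVersion is the replica version implemented by the current code;
// it and all previous versions are supported.
var MaxReplicaVersion int64 = 126

// CommandFunc stands in for the implementation of a single command; the real
// signature would mirror whatever executeCmd dispatches to.
type CommandFunc func(args interface{}) (interface{}, error)

// ReplicaMethods maps a command name to its implementation.
type ReplicaMethods map[string]CommandFunc

// versionedCommands records, per version, only the commands whose semantics
// changed at that version; synthesizeMethods fills in the rest.
var versionedCommands = map[int64]ReplicaMethods{}

// synthesizeMethods builds the ReplicaMethods a Replica at the given active
// version would use: for each command, the newest implementation at or below
// that version wins.
func synthesizeMethods(version int64) ReplicaMethods {
	methods := ReplicaMethods{}
	for v := int64(0); v <= version; v++ {
		for name, impl := range versionedCommands[v] {
			methods[name] = impl // later versions shadow earlier ones
		}
	}
	return methods
}
```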

## User-facing

### Upgrades

The most obvious transition is a version upgrade. In the happy case, all
replicas run a binary whose supported version is at least as high as that
requested by `ChangeVersion`; they can seamlessly process all following
commands with the new semantics. It is this case which is easily achieved:
stop all the nodes, update all the binaries, restart the cluster (it operates
at the old version), and then trigger an online update. Or, for a rolling
upgrade, stop and update the nodes in turn, and then trigger the upgrade -
two methods, same outcome.

The unhappy upgrade case corresponds to not having upgraded all binaries
before sending `ChangeVersion`. The only correct option is to die - nothing
can be processed any more since the semantics are unknown. This case should
be avoided by user-friendly tooling - an "upgrade trigger" could periodically
check until all of the Replicas can be upgraded, with appropriate warnings
logged. It should be possible to upgrade automatically after a rolling
cluster restart in that way, but we may prefer to keep it explicit
(`./cockroach update latest`?).
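
As a sketch of what such an upgrade trigger might do: everything below is
hypothetical, in particular the `supportedVersion` callback, which stands in
for however the trigger would learn each node's supported version.

```go
package main

import (
	"log"
	"time"
)

// waitForUpgradable blocks until every node reports support for the target
// version, logging the stragglers, so that the caller only sends
// ChangeVersion once it cannot fail for version reasons.
func waitForUpgradable(target int64, nodes []string,
	supportedVersion func(node string) int64) {
	for {
		var laggards []string
		for _, node := range nodes {
			if supportedVersion(node) < target {
				laggards = append(laggards, node)
			}
		}
		if len(laggards) == 0 {
			return // safe to trigger the upgrade
		}
		log.Printf("cannot upgrade to version %d yet; waiting on %v",
			target, laggards)
		time.Sleep(10 * time.Second)
	}
}
```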

### Downgrades

Users will want to be able to undo an upgrade in case they encountered a
newly introduced bug or degradation. There are two types of downgrades:
either only the `ReplicaVersion` must decrease (i.e. keep the new binary, but
run old `ReplicaMethods`), or the binary itself needs to be reverted to an
older build (likely because of a bug unrelated to this versioning).

The first case is trivial: send an appropriate `ChangeVersion`. In the second
case, we do as in the first case, and then we can do a (full or rolling)
cluster restart with the desired (older) binary.

## Developer-facing

The developer-facing side of this system should be easy to understand and
maintain. We need to keep old versions of functionality, but they should be
clearly recognizable as such and easy to "ignore". Likewise, we need lints
to make sure that a relevant code change results in the proper migration, and
need to be able to test migrations. For this, we

* augment test helpers and configurations so that servers/replicas can be
  spun up at any version (defaulting to `MaxReplicaVersion`) to make it easy
  to test specific versions, or even to run tests against ranges of versions
  (see the test sketch after this list).
* keep the "latest" version of the commands in a distinguished position
  (close to where they are now, with mild refactoring to establish some
  independence from `(*Replica)`).
* when changing the semantics of a command, copy the "old" code to its
  designated new location keyed by the (now old) version ID (for example,
  `./storage/replica_command.125.RequestLease.go`):
  ```go
  versionedCommands[125][RequestLease] = func(...) (...) { ... }
  ```
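
The following is a minimal sketch of the kind of test the first bullet
enables, reusing the hypothetical `synthesizeMethods` and `MaxReplicaVersion`
from the earlier sketch (and assuming the standard `testing` import); a real
helper would instead spin up a test server pinned to each version.

```go
// TestRequestLeaseAcrossVersions exercises a range of supported versions;
// 120 is an arbitrary illustrative lower bound.
func TestRequestLeaseAcrossVersions(t *testing.T) {
	for version := int64(120); version <= MaxReplicaVersion; version++ {
		methods := synthesizeMethods(version)
		if _, ok := methods["RequestLease"]; !ok {
			t.Errorf("version %d: no RequestLease implementation", version)
		}
	}
}
```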

Let's look at some model cases for this naive approach.

### Model case: Simple behavioral change

Assume we change `BeginTransaction` so that it fails unless the supplied
transaction proto has `Heartbeat.Equal(OrigTimestamp)`. It's tempting to
achieve that by wrapping around the previous version, but doing that a couple
of times would render the code unreadable. Instead, copy the existing code
out as described above for the previous version and update the new code
in-place. This is the case in which the system actually works.
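
Schematically, the copy-then-edit pattern could look like the sketch below;
`txnArgs`, the function names, and the version numbers are all illustrative
stand-ins, not existing code.

```go
package storage

import "errors"

// txnArgs is an illustrative stand-in for the transaction proto.
type txnArgs struct {
	Heartbeat, OrigTimestamp int64
}

// beginTransaction125 is the old implementation, copied verbatim to
// storage/replica_command.125.BeginTransaction.go and frozen there.
func beginTransaction125(a txnArgs) error {
	// Old semantics: no heartbeat check.
	return nil
}

// beginTransaction is the latest implementation, edited in place rather than
// wrapping the 125 copy: version 126 adds the new check up front.
func beginTransaction(a txnArgs) error {
	if a.Heartbeat != a.OrigTimestamp {
		return errors.New("BeginTransaction: heartbeat differs from orig timestamp")
	}
	// ... the rest of the command logic continues to live here.
	return nil
}
```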

### Model case: Unintended behavioral change

Assume someone decides that our heartbeat interval needs to be smaller
(motivated by something in the coordinator) and changes
`base.DefaultHeartbeatInterval` without thinking it through. That variable
is used in `PushTxn` to decide whether the transaction record is abandoned,
so during an update, replicas running different versions might disagree on
whether a transaction is aborted. Many such examples exist and I'm afraid
even Sisyphus could get tired of finding and building lints around them all.

### Model case: Simple proto change

Assume we change `(*Transaction).LastHeartbeat` from a nullable to a
non-nullable field (as in #5753). This is still very tricky because the logic
hides in the protobuf generated code - encoding a zero timestamp is different
from omitting the field completely (as would happen with a nullable one), but
using the new generated code instantly switches on this changed behavior.
Thus, we must make a copy of all commands which serialize the Transaction
proto and replace the marshalling with one that (somehow) makes sure the
timestamp is skipped when zero, in all previous versions. That's correct if
we're sure that we never encountered (and never will) a non-nil zero
`LastHeartbeat`.
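
The crux, sketched with illustrative types: with gogoproto, `nullable=false`
turns the pointer field into a value field, and the generated marshalling
code encodes a value field unconditionally, whereas a nil pointer is skipped.

```go
// Timestamp stands in for the real timestamp proto.
type Timestamp struct {
	WallTime int64
}

// TransactionOld is the shape of the generated struct before the change:
// a nil pointer makes the generated Marshal omit the field entirely.
type TransactionOld struct {
	LastHeartbeat *Timestamp
}

// TransactionNew is the shape after gogoproto's nullable=false: the field is
// a value, and the generated Marshal encodes it even when it is zero.
type TransactionNew struct {
	LastHeartbeat Timestamp
}
```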

Proto changes are quite horrible. See the next example (#5845).

### Model case: adding a proto field

Same as the nullability change. All old versions of the code must not use the
new marshalling code. The old version of the struct and its generated code
must be kept in a separate location. For example, when adding a field
`MaxOffset` to `RequestLease`, the previous version of `RequestLease` and its
code are kept at `./storage/replica_command.<vers>.go` and the old version
converts `roachpb.RequestLease` to `OldRequestLease` (dropping the new field)
and then marshals that to disk.
Of course the lease is also accessed from other locations (loading the
lease), so that requires versioning as well, leading to, essentially, a mess.
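
A hedged sketch of that down-conversion follows. `MaxOffset` is the RFC's
example; every other field and function name here is made up, and the marshal
functions are stubs standing in for the generated proto code.

```go
// OldRequestLease mirrors the pre-change wire format: no MaxOffset.
type OldRequestLease struct {
	Start, Expiration int64
}

// RequestLease is the current proto with the newly added field.
type RequestLease struct {
	Start, Expiration int64
	MaxOffset         int64
}

// marshalOld and marshalNew stand in for the generated proto marshalling of
// the respective structs.
func marshalOld(l OldRequestLease) ([]byte, error) { return nil, nil }
func marshalNew(l RequestLease) ([]byte, error)    { return nil, nil }

// marshalForVersion encodes a lease the way the given replica version
// expects to find it on disk.
func marshalForVersion(l RequestLease, version int64) ([]byte, error) {
	if version < 126 { // illustrative version that introduced MaxOffset
		// Drop the new field so the bytes match the old wire format.
		return marshalOld(OldRequestLease{Start: l.Start, Expiration: l.Expiration})
	}
	return marshalNew(l)
}
```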

See #5817 for a more involved change which requires additions to a proto and
associated changes in `ConditionalPut`.

### Model case: adding a new Raft command

We've talked repeatedly about changing the KV API to support more powerful
primitives. This would be a sweeping change, but a basic building block is
introducing a new Raft command. Theoretically, both up- and downgrading these
should be possible in the same ways as discussed before, but there's likely
additional complexity.

# Drawbacks

See above: there is a lot of code duplication, and it's difficult to enforce
that every change which needs a migration is actually flagged as such. Some
of that might be remedied through refactoring (moving all replica commands
into their own package and linting against access across that package
boundary).

The versioning scheme here will work well only if the changes are confined to
within the Raft commands and don't affect serialization. Proto changes are
going to be somewhat more involved, even in simple cases.

# Alternatives

One issue with the above is the code complexity involved in migrations. This
could potentially be attenuated by the approaches below.

## Embracing adjacent versions more

Supporting upgrades only between adjacent versions could enable us to
automate a lot of the versioning effort (by automatically keeping two
versions of the protobufs, Raft commands, etc.) and would require less
hand-crafting. The obvious downside is that users need to upgrade versions
one at a time, which can be very annoying. This could be attenuated partially
by supplying an external binary which runs the individual steps
(`cockroach-update latest`) one after another.
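
The core of such an external updater could be as simple as the sketch below,
where `step` stands in for whatever performs a single adjacent-version
upgrade (a hypothetical helper, not an existing CLI, and assuming the
standard `fmt` import):

```go
// updateToLatest drives a cluster from its current version to the latest
// one, one adjacent step at a time, stopping at the first failure.
func updateToLatest(current, latest int64, step func(target int64) error) error {
	for v := current + 1; v <= latest; v++ {
		if err := step(v); err != nil {
			return fmt.Errorf("upgrade to version %d failed: %v", v, err)
		}
	}
	return nil
}
```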

## Embrace stop-the-world more

The complexity in the current migration is due to a new binary having to act
exactly like an old binary until the version trigger is pulled. That almost
translates to having all previous versions of the code base embedded in it,
and being able to switch atomically (actually even worse, atomically on each
Replica). Allowing each binary to just be itself early on sounds much less
involved.

However, not having the version switch orchestrated through Raft presents
other difficulties: we won't be able to do rolling updates, and even for
offline updates we must make sure that all nodes are stopped with the same
part of the logs applied, and no unapplied log entries present (as it's not
clear what version they've been proposed with). That naturally leads again to
a Raft command:

The update sends a `ChangeVersion` (which is more of a `StopForUpgrade`
command in this section) to all Replicas, with the following semantics:
* It's the highest committed command when it applies (i.e. the Replica won't
  acknowledge receipt of additional log entries after the one containing
  `ChangeVersion`) and,
* upon application, the Replica stalls (but not the node, to give all
  replicas a chance to stall). Only then
* the process exits (potentially after preparing the data for the desired
  version if it's a downgrade).
* When the process restarts, it checks that it's running the desired version
  of the binary, performs any data upgrades it has to do, and then resumes
  operation (a minimal sketch of this startup check follows the list).
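
The restart-time check could look roughly like this fragment; it assumes the
standard `fmt` import and the `MaxReplicaVersion` from the earlier sketch,
and `runDataUpgrades` is a hypothetical stand-in for the actual migrations.

```go
// checkVersionAtStartup compares the version the on-disk data was prepared
// for with the version this binary implements.
func checkVersionAtStartup(persistedVersion int64) error {
	switch {
	case persistedVersion == MaxReplicaVersion:
		return nil // nothing to do; resume operation
	case persistedVersion < MaxReplicaVersion:
		// Run the forward data upgrades before serving traffic.
		return runDataUpgrades(persistedVersion, MaxReplicaVersion)
	default:
		return fmt.Errorf("data at version %d, binary supports up to %d; "+
			"restart with a newer binary", persistedVersion, MaxReplicaVersion)
	}
}

// runDataUpgrades is a stub standing in for the actual (reversible) data
// transformations.
func runDataUpgrades(from, to int64) error { return nil }
```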

As long as the data changes are reversible, this should do the trick.
There's a chance of leaving the cluster in an undefined state should the
migration crash (we likely won't be able to do it atomically, only on each
replica), but that could be remedied with tooling (to manually drive the
process forward). In effect, this would version everything by binary version.

For example,
* when adding a (non-nullable) proto field to `Lease`, the forward and
  backward migrations would simply delete all `Lease` entries (since that is
  the simplest option and works fine: the only enemy is setting range leases
  with a zero value where there shouldn't be one)
* the `LastHeartbeat` nullability change would not have to migrate anything.
* more complicated proto changes could still require taking some of the
  actual old marshalling code to transcode parts of the keyspace in
  preparation for running under a different version.

## Embrace inconsistency more

Maybe some inconsistencies can be acceptable (encoding a zero timestamp vs. a
nil timestamp, etc.) and trying to maintain consistency seamlessly over
migrations isn't useful while running a mixed cluster. Instead, disable the
consistency checks (which panic on failure) while the cluster is
mixed-version, and re-enable them after a migration queue has processed the
replica set.

# Unresolved questions

Many, see above.