- Feature Name: proposer_evaluated_kv
- Status: completed
- Start Date: 2016-04-19
- Authors: Tobias Schottdorf
- RFC PR: [#6166](https://github.com/cockroachdb/cockroach/pull/6166)
- Cockroach Issue: [#6290](https://github.com/cockroachdb/cockroach/pull/6290)

# Summary

Pursue an idea by @spencerkimball brought up in the context of
[#5985](https://github.com/cockroachdb/cockroach/pull/5985): move the bulk of
the logic involved in applying commands to a Replica upstream of Raft.

# Motivation

The [first exploration of
migrations](https://github.com/cockroachdb/cockroach/pull/5985) highlighted
the complexity concomitant with migrations on top of the current Raft execution
model.

As of writing, the leading replica is the only replica that proposes commands
to Raft. However, each replica applies commands individually, which leads to
large swaths of code living "downstream" of Raft. Since all of this code needs
to produce identical output even during migrations, a satisfactory migration
story is almost unachievable, a point made by several reviewers (and the
author).

By having the proposing replica compute the effects on the key-value state
before proposing the command to Raft, much of this downstream code moves
upstream of Raft; naively, what's left are simple writes to key-value pairs
(i.e. all commands move upstream except for a simplified MVCCPut operation).
While some complexity remains, this promises a much simpler migration story,
as well as de-triplicating work previously done by each individual replica.

# Detailed design

## Execution path

### Proposing commands

The main change in the execution path will be in `proposeRaftCommand`, which
currently takes a `BatchRequest` and creates a `pendingCmd` of the form

```go
type pendingCmd struct {
    ctx     context.Context
    idKey   storagebase.CmdIDKey
    done    chan roachpb.ResponseWithError // Used to signal waiting RPC handler
    raftCmd struct {
        RangeID       RangeID
        OriginReplica ReplicaDescriptor
        Cmd           BatchRequest
    }
}
```

The main change within this struct is to change the type of `Cmd` from
`BatchRequest` to a new type which carries the effects of the application of
the batch, as computed (by the lease holder) earlier in `proposeRaftCommand`.

For simplicity, in this draft we will assume that the `raftCmd` struct will
be modified according to these requirements.
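
For concreteness, a sketch of the modified struct could look as follows (field
naming hypothetical; the `RaftCmd` type carrying the effects is developed
below):

```go
// Sketch only: `Cmd BatchRequest` is replaced by the pre-evaluated effects,
// so followers no longer re-execute the request.
type pendingCmd struct {
    ctx     context.Context
    idKey   storagebase.CmdIDKey
    done    chan roachpb.ResponseWithError // Used to signal waiting RPC handler
    raftCmd struct {
        RangeID       RangeID
        OriginReplica ReplicaDescriptor
        Effects       RaftCmd // replaces `Cmd BatchRequest`; see below
    }
}
```
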
The "effects" contain

* any key-value writes: for example the outcome of a `Put`, `DeleteRange`,
  `CPut`, `BeginTransaction`, or `RequestLease`.
  Some alternatives exist: writes could be encoded as `[]MVCCKeyValue`, or even
  as post-MVCC raw (i.e. opaque) key-value pairs. We'll assume the former for
  now.
* parameters for any "triggers" which must run with the *application* of the
  command. In particular, this is necessary on `EndTransaction` with a commit
  trigger, or when the lease holder changes (to apply the new lease).
  There are more of them, but this already shows that these triggers (which are
  downstream of Raft) will require various pieces of data to be passed around
  with them.

Note that for the "triggers" above, a lot of the work they currently do can
move pre-Raft as well. For example, for `RequestLease` below we must change
some in-memory state on `*Replica` (lease proto and timestamp cache) plus call
out to Gossip. Still, the bulk of the code in `(*Replica).RequestLease` (i.e.
the actual logic) will move pre-Raft, and the remaining trigger is conditional
on being on the lease holder (so that one could even remove it completely from
the "effects" and hack its execution into the lease holder role itself).

Thus, we arrive at the following (proto-backed) candidate struct:

```proto
# Pseudo-Go-Protobuf, all naming ad-hoc and TBD.

message RaftCmd {
  optional SomeMetadata # tbd (proposing replica, enginepb.MVCCStats diff, ...)
  repeated oneof effect {
    # Applied in field and slice order.
    optional Writes writes = ...; # see MVCC section for Writes type
    # Carries out the following:
    # * sanity checks
    # * compute stats for new left hand side (LHS) [move pre-raft]
    # * copy some data to new range (abort cache) [move pre-raft]
    # * make right-hand side `*Replica`
    # * copy tsCache (in-mem) [lease holder only]
    # * r.store.SplitRange
    # * maybe trigger Raft election
    optional Split ...;
    # Carries out the following:
    # * sanity checks
    # * stats computations [move pre-raft]
    # * copy some data to new range (abort cache) [move pre-raft]
    # * remove some metadata [move pre-raft]
    #   careful: this is a DeleteRange; could benefit from a ClearRange
    #   Raft effect (all inline values; no versioning needed)
    # * clear tsCache (not really necessary)
    optional Merge ...;
    # Carries out the following:
    # * `(*Replica).setDesc`
    # * maybe add to GC queue [lease holder only]
    optional ChangeReplicas ...;
    # Carries out the following:
    # * maybe gossips system config [lease holder only]
    optional ModifiedSpan ...;
    # Carries out the following:
    # * sanity checks (in particular FindReplica)
    # * update in-memory lease
    # * on Raft(!) leader: maybe try to elect new leader
    # * on (new) lease holder: update tsCache low water mark
    #   (might be easier to do this unconditionally on all nodes)
    # * on (new) lease holder: maybe gossip system configs
    optional Lease ...;
    # Those below may not be required, but something to think about
    optional GCRange ...;    # allow optimized GC (some logic below Raft?)
    optional ClearRange ...; # allow erasing key range without huge proposal
  }
}
```

Since this simple structure is easy to reason about, it seems adequate to
strive for the minimum amount of interaction between the various effects
(though the data required for the triggers may benefit from more interaction
between them in some cases).

### Computing the (ordered) write set

Before a lease holder proposes a command, it has the job of translating a
`BatchRequest` into a `RaftCmd`. Note that we assume that writes wait for
both reads and writes in the command queue (so that at the point at which
the `RaftCmd` is constructed, all prior mutations have been carried out and
the timestamp cache updated).

This largely takes over the work which was done in

```
applyRaftCommandInBatch -> executeWriteBatch -> executeBatch
                        -> (ReplicaCommands) -> (MVCC)
```

This is hopefully largely mechanical, with the bulk of work in the MVCC layer,
outlined in the next section.
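
As a rough illustration of where this translation could live, here is a hedged
sketch (names and signatures are invented for this RFC, not the actual API;
`GetWrites` refers to the batch facility proposed in the MVCC & Engine section
below, and `RaftCmd` is treated as a Go struct for brevity):

```go
// Hypothetical sketch: the lease holder evaluates the batch against a
// write-tracking engine batch and packages the result into a RaftCmd before
// proposing. Nothing is committed to the local engine at this point.
func evaluateProposal(
    ctx context.Context, eng engine.Engine, ba roachpb.BatchRequest,
) (*RaftCmd, *roachpb.BatchResponse, error) {
    batch := eng.NewBatch() // write-tracking batch; see the MVCC & Engine section
    defer batch.Close()

    // Run the existing execution path (executeBatch -> ReplicaCommands -> MVCC)
    // against the batch, additionally collecting any trigger parameters
    // (commit trigger, lease change, ...) instead of running them downstream.
    br, triggers, err := executeBatch(ctx, batch, ba)
    if err != nil {
        return nil, nil, err
    }

    // Extract the accumulated write set; this becomes the opaque payload
    // that followers apply verbatim.
    writes, err := batch.GetWrites()
    if err != nil {
        return nil, nil, err
    }

    return &RaftCmd{Writes: writes, Triggers: triggers}, br, nil
}
```
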
The work to be carried out in `ReplicaCommands` is mostly isolating the various
side effects (most prominently the commit and range lease triggers). In the
process, it should be possible to remove these methods from `(*Replica)`
(though that is not a stated goal of this RFC).

### MVCC & Engine Layer

Operations in the MVCC layer (`MVCC{Get,Put,Scan,DeleteRange,...}`) currently
return some variation of `([]roachpb.KeyValue, []roachpb.Intent, error)` and
operate on the underlying `engine.Engine` (which in practice is a RocksDB
batch) with heavy use of iterators.

With the new code, we want mostly the same, but with two important changes:

* we do not actually intend to commit the batch (though we could do this on the
  proposing node to avoid overhead)
* we need to construct the `Writes` field, which (in some way or
  another) tracks the key-value pairs written (i.e. created, changed or
  deleted) over the lifetime of the batch.

It's also desirable to avoid invasive changes to the MVCC layer. The following
extension to the `Engine` interface seems adequate (modulo changes to improve
buffer reuse):

```go
type Engine interface {
    // ...
    NewBatch() Batch
}

var ErrNotTracking = errors.New("batch is not tracking writes")

type Writes []MVCCKeyValue // for the actual type, see below

type Batch interface {
    Engine
    // GetWrites returns a slice of Write updates which were carried out during
    // the batch, or ErrNotTracking if the batch does not support this feature.
    // The ordering of the writes is not guaranteed to be stable and may not
    // correspond to the actual order in which the writes were carried out.
    // However, the result of applying the updates in the returned order *is*
    // stable, i.e. non-commutative (overlapping) writes can't rearrange
    // themselves freely. [i.e. can use a map under the hood]
    // The returned Writes are read-only and must not be used after their Batch
    // is finalized. [i.e. can reclaim the memory at that point, or we add an
    // explicit method to finalize the writes if that is more convenient].
    GetWrites() (Writes, error)
}
```

With this strategy, the MVCC changes are fairly locally scoped and not too
invasive.

Naively, this can be achieved by wrapping a "regular" batch with an appropriate
in-memory map which is populated on writes, as sketched below.
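
A minimal sketch of that naive approach (hypothetical type and method names;
error handling and buffer reuse elided; an empty `Value` is used here to stand
in for a deletion):

```go
// trackingBatch forwards all operations to a wrapped Batch and additionally
// records the most recent write for each key, so that GetWrites can return a
// replayable write set.
type trackingBatch struct {
    Batch                          // the underlying "regular" batch
    writes map[string]MVCCKeyValue // last write per key (empty Value = deletion)
}

func newTrackingBatch(b Batch) *trackingBatch {
    return &trackingBatch{Batch: b, writes: make(map[string]MVCCKeyValue)}
}

func (tb *trackingBatch) Put(key MVCCKey, value []byte) error {
    if err := tb.Batch.Put(key, value); err != nil {
        return err
    }
    tb.writes[string(key.Key)] = MVCCKeyValue{Key: key, Value: value}
    return nil
}

func (tb *trackingBatch) Clear(key MVCCKey) error {
    if err := tb.Batch.Clear(key); err != nil {
        return err
    }
    tb.writes[string(key.Key)] = MVCCKeyValue{Key: key} // empty Value = deletion
    return nil
}

// GetWrites returns the accumulated writes. Map iteration order is arbitrary,
// which is fine per the contract above: only the *result* of applying the
// writes needs to be stable, and we keep a single (last) write per key.
func (tb *trackingBatch) GetWrites() (Writes, error) {
    ws := make(Writes, 0, len(tb.writes))
    for _, kv := range tb.writes {
        ws = append(ws, kv)
    }
    return ws, nil
}
```
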
However, using [RocksDB's `WriteBatch`
format](https://github.com/facebook/rocksdb/blob/f38540b12ada4fe06598c42d4c084f9d920289ff/db/write_batch.cc#L10)
seems opportune:

```
// WriteBatch::rep_ :=
//    sequence: fixed64
//    count: fixed32
//    data: record[count]
// record :=
//    kTypeValue varstring varstring
//    kTypeDeletion varstring
//    kTypeSingleDeletion varstring
//    kTypeMerge varstring varstring
//    kTypeColumnFamilyValue varint32 varstring varstring
//    kTypeColumnFamilyDeletion varint32 varstring varstring
//    kTypeColumnFamilySingleDeletion varint32 varstring varstring
//    kTypeColumnFamilyMerge varint32 varstring varstring
// varstring :=
//    len: varint32
//    data: uint8[len]
```

* The format is straightforward to implement; the writes need to be serialized
  anyway for the Raft proposal, and this appears to be an efficient format to
  use, even when the underlying storage engine isn't based on RocksDB.
* We currently use RocksDB and this is already the internal format used by a
  RocksDB batch. We can completely avoid an additional copy and serialization
  step in this case (i.e. construct the batch, borrow out the contained
  representation to Raft until the command commits, and apply the batch).
  Followers can create a `WriteBatch` directly from this representation.
  Serialization is a key point in performance[1].
* Again for RocksDB, we might also be able to piggy-back on the existing
  implementation to operate on the superposition of multiple `WriteBatch`es
  (which would be needed to remove write-write blocking).

The format should be relatively stable (likely only additions), but there might
be future migrations coming up as RocksDB changes their underlying assumptions.
We should consider checking in with them about this idea.

Using `WriteBatch` also means that the writes are completely opaque to the
stack (at least unless deserialized early, loaded, and then accessed through
MVCC operations), so the followers essentially can't act based on them. That
should not be a problem (but could become one if we want the side effects to
use information from the write set in a performant way).

## Raft command application

This will be a new, straightforward facility which applies a `RaftCmd` against
the underlying `Engine` in a batch. The point of this whole change is that this
part of the stack is very straightforward and has no side effects.

`(*Replica).executeCmd` is likely a good place where most of this code would
live.

## Replay protection

Replay protection is naively an issue because applying a logicless batch of
writes is at odds with our current method, which largely relies on
write/timestamp conflicts.
However, as @bdarnell pointed out, as of recently (#5973) we (mostly) colocate
the Raft leader and the range lease holder on the same node and have more
options for dealing with this (i.e. making the application of a command
conditional on the Raft log position at which it was originally proposed).

More concretely (via @bdarnell):

> The RaftCmd would also include a term and log index, to be populated with the
> lease holder's latest log information when it is proposed. The writes would
> only be applied if it committed at that index; they would be dropped (and
> reproposed) if they committed at any other index. This would only be a
> possibility when the lease holder is not the raft leader (and even then it's
> unlikely in stable lease situations). We may be able to be even stricter
> about colocating the two roles (in which case we could perhaps remove
> raft-level replays completely), but I'm wary of forcing too many raft
> elections.
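
A hedged sketch of what that conditional application could look like (all field
and function names are invented for illustration; how the term and index are
plumbed into the proposal is left open above):

```go
// Hypothetical sketch: the lease holder stamps each proposal with the log
// position it expects the command to commit at, and application drops anything
// that lands elsewhere so the proposer can repropose it.
type RaftCmdHeader struct {
    ProposedTerm  uint64 // Raft term at proposal time
    ProposedIndex uint64 // log index the proposer expects the command to commit at
}

var errCommandMoved = errors.New("command committed at a different log position")

// applyRaftCmd is invoked once the entry has committed at (term, index).
func applyRaftCmd(cmd RaftCmd, hdr RaftCmdHeader, term, index uint64) error {
    if hdr.ProposedTerm != term || hdr.ProposedIndex != index {
        // Possible when the lease holder is not the Raft leader; the write set
        // is not applied and the waiting proposer reproposes the command.
        return errCommandMoved
    }
    // Position matches: apply the opaque write set and run any triggers
    // carried by the command.
    return applyWriteBatch(cmd) // hypothetical helper
}
```
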
### Test changes

These should be limited to those tests that exercise triggers and/or use
lower-level Raft internals inside of the `storage` package (plus some surprises
which are certain to occur with a change of this magnitude).

## Migration strategy

We implement the `Freeze`/`Unfreeze` Raft commands suggested by @bdarnell in
\#5985: everything proposed between `Freeze` and `Unfreeze` applies as a
non-retryable error (and in particular, does not change state). Assuming a
working implementation of this (which will need to be hashed out separately and
may require its own migration considerations), a migration entails the
following:

* **apply** `Freeze` on all Replicas
* stop all nodes (possibly through a special shutdown mode which terminates
  a node when all Replicas are in fact frozen)
* replace the binary
* start all nodes
* apply `Unfreeze` on all Replicas.

Since the replaced binary will not have to apply Raft log entries proposed by
the previous binary, all is well.

This work is nicely isolated from the remainder of this RFC, and can (and
should) be tackled separately as soon as consensus is reached.

This migration strategy also has other upshots:

* it can migrate Raft changes while this RFC is being implemented
* it can be used for future stop-the-world upgrades even when a Raft log freeze
  isn't necessary (since it also introduces a "stop for upgrade" mode with user
  tooling, etc.).
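
To make the freeze semantics above concrete, here is a minimal, hedged sketch
(hypothetical names throughout; the real design of `Freeze`/`Unfreeze` is
deferred to the work around \#5985) of the gate a Replica could apply when
executing commands:

```go
// Hypothetical sketch of the freeze gate: while a Replica is frozen, every
// command other than Unfreeze fails with a non-retryable error and leaves the
// state machine untouched, so a replaced binary never has to apply Raft log
// entries proposed by its predecessor.
func (r *Replica) maybeRejectFrozen(cmd RaftCmd) error {
    if !r.frozen { // hypothetical flag, set by Freeze / cleared by Unfreeze
        return nil // not frozen: apply as usual
    }
    if cmd.IsUnfreeze() { // hypothetical accessor for the Unfreeze effect
        return nil // Unfreeze itself must still go through
    }
    // Deliberately non-retryable so callers don't spin while the cluster is
    // being upgraded.
    return errors.New("replica is frozen for upgrade; command not applied")
}
```
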
# Drawbacks

* The amount of data sent over the wire is expected to increase. In particular,
  `DeleteRange` could be problematic.
  It seems worthwhile to get a sense of expectations by means of a prototype.
* While the amount of code downstream of Raft is drastically reduced, it is not
  zero, so we will still need online Raft migrations (#5985) eventually.

There were concerns that this change would make it harder/impossible to get
rid of write-write blocking in the future. However, that concern was based on
the possibility of Raft reordering, which we can eliminate at this point (see
[Replay protection](#replay-protection)), and the problem boils down to having
all intermediate states which would result from applying the prefixes of the
existing overlapping writes (including side effects) available to execute on
top of, which happens at a layer above the changes in this RFC.

# Implementation strategy

Tackle the following in parallel (and roughly in this order):

* flesh out and implement `{F,Unf}reeze`; these are immediately useful for
  all upcoming migrations.
* prototype and evaluate the expected inflation of Raft proposal sizes.
* separate the side effects of Raft applications from `*Replica` and clearly
  separate the parts which can be moved pre-proposal as well as the parameters
  required for the post-proposal part. This informs the actual design of the
  Raft "effects" in this proposal.
  Most of these should come up naturally when trying to remove the `(*Replica)`
  receiver on most of the `ReplicaCommands` (`RequestLease`, `Put`, ...) and
  manually going through the rest of the call stack up to `processRaftCommand`.
* implement protection against Raft replays/reordering, ideally such that
  the lease holder knows at proposal time the log index at which commands
  commit (unless they have to be reproposed).
* flesh out the `WriteBatch` encoding. Ideally, this includes a parser in Go
  (to avoid an obstruction to switching storage engines), but the critical part
  here for the purpose of this RFC is hooking up the `RocksDB`-specific
  implementation and guarding against upstream changes.
  We can trust RocksDB to be backwards-compatible in any changes they make, but
  we require bidirectional compatibility. We create a batch on the lease holder
  and ship it off to the follower to be applied, and the follower might be
  running either an older or a newer version of the code. If they add a new
  record type, we either need to be able to guarantee that the new record types
  won't be used in our case, or translate them back to something the old
  version can understand.
* introduce and implement the `Batch` interface. This could be done after the
  `WriteBatch` encoding is done or, if required to unblock the implementation
  of this RFC, start with a less performant version based on an auxiliary map
  and protobuf serialization.

The goal here should be to identify all the necessary refactors to keep them
out of the remaining "forklift" portion of the change.

## Is this a feasible change?

It's scary

* if you try to migrate it properly (but see above for an option which seems
  viable)
* if you try to forklift it all
* that with a change of this magnitude we could stumble upon surprises
  in late-ish stages which call the whole change into question
* that it requires a large concerted effort to be carried out, with lots of
  moving parts touched
* that the performance implications aren't 100% clear (they need prototyping).

As I understand it, the change is very ambitious, but not as vast in scope as,
for example, the plans we have with distributed SQL (which, of course, are not
on such a tight deadline).

# Alternatives

Other than leaving the current system in place and pushing the complexity onto
the migration story, none.

# Unresolved questions

None currently.

# Footnotes

[1]: via @petermattis:

> There is another nice effect from using the `WriteBatch`: instead of
> marshaling and unmarshaling individual keys and values, we'll be passing
> around a blob. While investigating batch insert performance I added some
> instrumentation around `RaftCommand` marshaling and noticed:

```
raft command marshal:   20002: 6.2ms
raft command unmarshal: 20002: 29.1ms
raft command marshal:   20002: 6.0ms
raft command unmarshal: 20002: 12.4ms
raft command marshal:   20002: 6.0ms
raft command unmarshal: 20002: 42.6ms
```

The first number (20002) is the number of requests in the `BatchRequest` and
the second is the time it took to marshal/unmarshal. The marshaled data size
is almost exactly 1MB.