- Feature Name: proposer_evaluated_kv
- Status: completed
- Start Date: 2016-04-19
- Authors: Tobias Schottdorf
- RFC PR: [#6166](https://github.com/cockroachdb/cockroach/pull/6166)
- Cockroach Issue: [#6290](https://github.com/cockroachdb/cockroach/pull/6290)

# Summary

Pursue an idea by @spencerkimball brought up in the context of
[#5985](https://github.com/cockroachdb/cockroach/pull/5985): move the bulk of
the logic involved in applying commands to a Replica upstream of Raft.

# Motivation

The [first exploration of
migrations](https://github.com/cockroachdb/cockroach/pull/5985) highlighted
the complexity concomitant with migrations on top of the current Raft execution
model.

As of writing, the leading replica is the only replica that proposes commands
to Raft. However, each replica applies commands individually, which leads to
large swaths of code living "downstream" of Raft. Since all of this code needs
to produce identical output even during migrations, a satisfactory migration
story is almost unachievable, a point made by several reviewers (and the
author).

By having the proposing replica compute the effects on the key-value state
before proposing the command to Raft, much of this downstream code moves
upstream of Raft; naively, what's left are simple writes to key-value pairs
(i.e. all commands move upstream except for a simplified MVCCPut operation).
While some complexity remains, this promises a much simpler migration story,
as well as de-triplicating work previously done by each individual replica.

# Detailed design

## Execution path

### Proposing commands

The main change in the execution path will be in `proposeRaftCommand`, which
currently takes a `BatchRequest` and creates a `pendingCmd` of the form

```go
type pendingCmd struct {
    ctx     context.Context
    idKey   storagebase.CmdIDKey
    done    chan roachpb.ResponseWithError // Used to signal waiting RPC handler
    raftCmd struct {
        RangeID       RangeID
        OriginReplica ReplicaDescriptor
        Cmd           BatchRequest
    }
}
```

The main change within this struct is to change the type of `Cmd` from
`BatchRequest` to a new type which carries the effects of the application of
the batch, as computed (by the lease holder) earlier in `proposeRaftCommand`.

For simplicity, in this draft we will assume that the `raftCmd` struct will
be modified according to these requirements.
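
For concreteness, a sketch of the modified struct could look as follows (field
naming hypothetical; the `RaftCmd` type carrying the effects is developed
below):

```go
// Sketch only: `Cmd BatchRequest` is replaced by the pre-evaluated effects,
// so followers no longer re-execute the request.
type pendingCmd struct {
    ctx     context.Context
    idKey   storagebase.CmdIDKey
    done    chan roachpb.ResponseWithError // Used to signal waiting RPC handler
    raftCmd struct {
        RangeID       RangeID
        OriginReplica ReplicaDescriptor
        Effects       RaftCmd // replaces `Cmd BatchRequest`; see below
    }
}
```
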
The "effects" contain

* any key-value writes: for example the outcome of a `Put`, `DeleteRange`,
  `CPut`, `BeginTransaction`, or `RequestLease`.
  Some alternatives exist: writes could be encoded as `[]MVCCKeyValue`, or even
  as post-MVCC raw (i.e. opaque) key-value pairs. We'll assume the former for
  now.
* parameters for any "triggers" which must run with the *application* of the
  command. In particular, this is necessary on `EndTransaction` with a commit
  trigger, or when the lease holder changes (to apply the new lease).
  There are more of them, but this already shows that these triggers (which are
  downstream of Raft) will require various pieces of data to be passed around
  with them.

Note that for the "triggers" above, a lot of the work they currently do can
move pre-Raft as well. For example, for `RequestLease` below we must change
some in-memory state on `*Replica` (lease proto and timestamp cache) plus call
out to Gossip. Still, the bulk of the code in `(*Replica).RequestLease` (i.e.
the actual logic) will move pre-Raft, and the remaining trigger is conditional
on being on the lease holder (so that one could even remove it completely from
the "effects" and hack its execution into the lease holder role itself).

Thus, we arrive at the following (proto-backed) candidate struct:

```proto
# Pseudo-Go-Protobuf, all naming ad-hoc and TBD.

message RaftCmd {
  optional SomeMetadata # tbd (proposing replica, enginepb.MVCCStats diff, ...)
  repeated oneof effect {
    # Applied in field and slice order.
    optional Writes writes = ...; # see MVCC section for Writes type
    # Carries out the following:
    # * sanity checks
    # * compute stats for new left hand side (LHS) [move pre-raft]
    # * copy some data to new range (abort cache) [move pre-raft]
    # * make right-hand side `*Replica`
    # * copy tsCache (in-mem) [lease holder only]
    # * r.store.SplitRange
    # * maybe trigger Raft election
    optional Split ...;
    # Carries out the following:
    # * sanity checks
    # * stats computations [move pre-raft]
    # * copy some data to new range (abort cache) [move pre-raft]
    # * remove some metadata [move pre-raft]
    #   careful: this is a DeleteRange; could benefit from a ClearRange
    #   Raft effect (all inline values; no versioning needed)
    # * clear tsCache (not really necessary)
    optional Merge ...;
    # Carries out the following:
    # * `(*Replica).setDesc`
    # * maybe add to GC queue [lease holder only]
    optional ChangeReplicas ...;
    # Carries out the following:
    # * maybe gossips system config [lease holder only]
    optional ModifiedSpan ...;
    # Carries out the following:
    # * sanity checks (in particular FindReplica)
    # * update in-memory lease
    # * on Raft(!) leader: maybe try to elect new leader
    # * on (new) lease holder: update tsCache low water mark
    #   (might be easier to do this unconditionally on all nodes)
    # * on (new) lease holder: maybe gossip system configs
    optional Lease ...;
    # Those below may not be required, but something to think about
    optional GCRange ...;    # allow optimized GC (some logic below Raft?)
    optional ClearRange ...; # allow erasing key range without huge proposal
  }
}
```

Since this simple structure is easy to reason about, it seems adequate to
strive for the minimum amount of interaction between the various effects
(though the data required for the triggers may benefit from more interaction
between them in some cases).

### Computing the (ordered) write set

Before a lease holder proposes a command, it has the job of translating a
`BatchRequest` into a `RaftCmd`. Note that we assume that writes wait for
both reads and writes in the command queue (so that at the point at which
the `RaftCmd` is constructed, all prior mutations have been carried out and
the timestamp cache updated).

This largely takes over the work which was done in

```
applyRaftCommandInBatch -> executeWriteBatch -> executeBatch
                        -> (ReplicaCommands) -> (MVCC)
```

This is hopefully largely mechanical, with the bulk of work in the MVCC layer,
outlined in the next section.
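
As a rough illustration of where this translation could live, here is a hedged
sketch (names and signatures are invented for this RFC, not the actual API;
`GetWrites` refers to the batch facility proposed in the MVCC & Engine section
below, and `RaftCmd` is treated as a Go struct for brevity):

```go
// Hypothetical sketch: the lease holder evaluates the batch against a
// write-tracking engine batch and packages the result into a RaftCmd before
// proposing. Nothing is committed to the local engine at this point.
func evaluateProposal(
    ctx context.Context, eng engine.Engine, ba roachpb.BatchRequest,
) (*RaftCmd, *roachpb.BatchResponse, error) {
    batch := eng.NewBatch() // write-tracking batch; see the MVCC & Engine section
    defer batch.Close()

    // Run the existing execution path (executeBatch -> ReplicaCommands -> MVCC)
    // against the batch, additionally collecting any trigger parameters
    // (commit trigger, lease change, ...) instead of running them downstream.
    br, triggers, err := executeBatch(ctx, batch, ba)
    if err != nil {
        return nil, nil, err
    }

    // Extract the accumulated write set; this becomes the opaque payload
    // that followers apply verbatim.
    writes, err := batch.GetWrites()
    if err != nil {
        return nil, nil, err
    }

    return &RaftCmd{Writes: writes, Triggers: triggers}, br, nil
}
```
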
The work to be carried out in `ReplicaCommands` is mostly isolating the various
side effects (most prominently the commit and range lease triggers). In the
process, it should be possible to remove these methods from `(*Replica)`
(though that is not a stated goal of this RFC).

### MVCC & Engine Layer

Operations in the MVCC layer (`MVCC{Get,Put,Scan,DeleteRange,...}`) currently
return some variation of `([]roachpb.KeyValue, []roachpb.Intent, error)` and
operate on the underlying `engine.Engine` (which in practice is a RocksDB
batch) with heavy use of iterators.

With the new code, we want mostly the same, but with two important changes:

* we do not actually intend to commit the batch (though we could do this on the
  proposing node to avoid overhead)
* we need to construct the `Writes` field, which (in some way or
  another) tracks the key-value pairs written (i.e. created, changed or
  deleted) over the lifetime of the batch.

It's also desirable to avoid invasive changes to the MVCC layer. The following
extension to the `Engine` interface seems adequate (modulo changes to improve
buffer reuse):

```go
type Engine interface {
    // ...
    NewBatch() Batch
}

var ErrNotTracking = errors.New("batch is not tracking writes")

type Writes []MVCCKeyValue // for the actual type, see below

type Batch interface {
    Engine
    // GetWrites returns a slice of Write updates which were carried out during
    // the batch, or ErrNotTracking if the batch does not support this feature.
    // The ordering of the writes is not guaranteed to be stable and may not
    // correspond to the actual order in which the writes were carried out.
    // However, the result of applying the updates in the returned order *is*
    // stable, i.e. non-commutative (overlapping) writes can't rearrange
    // themselves freely. [i.e. can use a map under the hood]
    // The returned Writes are read-only and must not be used after their Batch
    // is finalized. [i.e. can reclaim the memory at that point, or we add an
    // explicit method to finalize the writes if that is more convenient].
    GetWrites() (Writes, error)
}
```

With this strategy, the MVCC changes are fairly locally scoped and not too
invasive.

Naively, this can be achieved by wrapping a "regular" batch with an appropriate
in-memory map which is populated on writes, as sketched below.
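
A minimal sketch of that naive approach (hypothetical type and method names;
error handling and buffer reuse elided; an empty `Value` is used here to stand
in for a deletion):

```go
// trackingBatch forwards all operations to a wrapped Batch and additionally
// records the most recent write for each key, so that GetWrites can return a
// replayable write set.
type trackingBatch struct {
    Batch                          // the underlying "regular" batch
    writes map[string]MVCCKeyValue // last write per key (empty Value = deletion)
}

func newTrackingBatch(b Batch) *trackingBatch {
    return &trackingBatch{Batch: b, writes: make(map[string]MVCCKeyValue)}
}

func (tb *trackingBatch) Put(key MVCCKey, value []byte) error {
    if err := tb.Batch.Put(key, value); err != nil {
        return err
    }
    tb.writes[string(key.Key)] = MVCCKeyValue{Key: key, Value: value}
    return nil
}

func (tb *trackingBatch) Clear(key MVCCKey) error {
    if err := tb.Batch.Clear(key); err != nil {
        return err
    }
    tb.writes[string(key.Key)] = MVCCKeyValue{Key: key} // empty Value = deletion
    return nil
}

// GetWrites returns the accumulated writes. Map iteration order is arbitrary,
// which is fine per the contract above: only the *result* of applying the
// writes needs to be stable, and we keep a single (last) write per key.
func (tb *trackingBatch) GetWrites() (Writes, error) {
    ws := make(Writes, 0, len(tb.writes))
    for _, kv := range tb.writes {
        ws = append(ws, kv)
    }
    return ws, nil
}
```
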
However, using [RocksDB's `WriteBatch`
format](https://github.com/facebook/rocksdb/blob/f38540b12ada4fe06598c42d4c084f9d920289ff/db/write_batch.cc#L10)
seems opportune:

```
// WriteBatch::rep_ :=
//    sequence: fixed64
//    count: fixed32
//    data: record[count]
// record :=
//    kTypeValue varstring varstring
//    kTypeDeletion varstring
//    kTypeSingleDeletion varstring
//    kTypeMerge varstring varstring
//    kTypeColumnFamilyValue varint32 varstring varstring
//    kTypeColumnFamilyDeletion varint32 varstring varstring
//    kTypeColumnFamilySingleDeletion varint32 varstring varstring
//    kTypeColumnFamilyMerge varint32 varstring varstring
// varstring :=
//    len: varint32
//    data: uint8[len]
```

* The format is straightforward to implement; the writes need to be serialized
  anyway for the Raft proposal, and this appears to be an efficient format to
  use, even when the underlying storage engine isn't based on RocksDB.
* We currently use RocksDB and this is already the internal format used by a
  RocksDB batch. We can completely avoid an additional copy and serialization
  step in this case (i.e. construct the batch, borrow out the contained
  representation to Raft until the command commits, and apply the batch).
  Followers can create a `WriteBatch` directly from this representation.
  Serialization is a key point in performance[1].
* Again for RocksDB, we might also be able to piggy-back on the existing
  implementation to operate on the superposition of multiple `WriteBatch`es
  (which would be needed to remove write-write blocking).

The format should be relatively stable (likely only additions), but there might
be future migrations coming up as RocksDB changes their underlying assumptions.
We should consider checking in with them about this idea.

Using `WriteBatch` also means that the writes are completely opaque to the
stack (at least unless deserialized early, loaded, and then accessed through
MVCC operations), so the followers essentially can't act based on them. That
should not be a problem (but could become one if we want the side effects to
use information from the write set in a performant way).

## Raft command application

This will be a new, straightforward facility which applies a `RaftCmd` against
the underlying `Engine` in a batch. The point of this whole change is that this
part of the stack is very straightforward and has no side effects.

`(*Replica).executeCmd` is likely a good place where most of this code would
live.

## Replay protection

Replay protection is naively an issue because applying a logicless batch of
writes is at odds with our current method, which largely relies on
write/timestamp conflicts.
However, as @bdarnell pointed out, as of recently (#5973) we (mostly) colocate
the Raft leader and the range lease holder on the same node and have more
options for dealing with this (i.e. making the application of a command
conditional on the Raft log position at which it was originally proposed).

More concretely (via @bdarnell):

> The RaftCmd would also include a term and log index, to be populated with the
> lease holder's latest log information when it is proposed. The writes would
> only be applied if it committed at that index; they would be dropped (and
> reproposed) if they committed at any other index. This would only be a
> possibility when the lease holder is not the raft leader (and even then it's
> unlikely in stable lease situations). We may be able to be even stricter
> about colocating the two roles (in which case we could perhaps remove
> raft-level replays completely), but I'm wary of forcing too many raft
> elections.
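
A hedged sketch of what that conditional application could look like (all field
and function names are invented for illustration; how the term and index are
plumbed into the proposal is left open above):

```go
// Hypothetical sketch: the lease holder stamps each proposal with the log
// position it expects the command to commit at, and application drops anything
// that lands elsewhere so the proposer can repropose it.
type RaftCmdHeader struct {
    ProposedTerm  uint64 // Raft term at proposal time
    ProposedIndex uint64 // log index the proposer expects the command to commit at
}

var errCommandMoved = errors.New("command committed at a different log position")

// applyRaftCmd is invoked once the entry has committed at (term, index).
func applyRaftCmd(cmd RaftCmd, hdr RaftCmdHeader, term, index uint64) error {
    if hdr.ProposedTerm != term || hdr.ProposedIndex != index {
        // Possible when the lease holder is not the Raft leader; the write set
        // is not applied and the waiting proposer reproposes the command.
        return errCommandMoved
    }
    // Position matches: apply the opaque write set and run any triggers
    // carried by the command.
    return applyWriteBatch(cmd) // hypothetical helper
}
```
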
### Test changes

These should be limited to those tests that exercise triggers and/or use
lower-level Raft internals inside of the `storage` package (plus some surprises
which are certain to occur with a change of this magnitude).

## Migration strategy

We implement the `Freeze`/`Unfreeze` Raft commands suggested by @bdarnell in
\#5985: everything proposed between `Freeze` and `Unfreeze` applies as a
non-retryable error (and in particular, does not change state). Assuming a
working implementation of this (which will need to be hashed out separately and
may require its own migration considerations), a migration entails the
following:

* **apply** `Freeze` on all Replicas
* stop all nodes (possibly through a special shutdown mode which terminates
  a node when all Replicas are in fact frozen)
* replace the binary
* start all nodes
* apply `Unfreeze` on all Replicas.

Since the replaced binary will not have to apply Raft log entries proposed by
the previous binary, all is well.

This work is nicely isolated from the remainder of this RFC, and can (and
should) be tackled separately as soon as consensus is reached.

This migration strategy also has other upshots:

* it can migrate Raft changes while this RFC is being implemented
* it can be used for future stop-the-world upgrades even when a Raft log freeze
  isn't necessary (since it also introduces a "stop for upgrade" mode with user
  tooling, etc.).
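
To make the freeze semantics above concrete, here is a minimal, hedged sketch
(hypothetical names throughout; the real design of `Freeze`/`Unfreeze` is
deferred to the work around \#5985) of the gate a Replica could apply when
executing commands:

```go
// Hypothetical sketch of the freeze gate: while a Replica is frozen, every
// command other than Unfreeze fails with a non-retryable error and leaves the
// state machine untouched, so a replaced binary never has to apply Raft log
// entries proposed by its predecessor.
func (r *Replica) maybeRejectFrozen(cmd RaftCmd) error {
    if !r.frozen { // hypothetical flag, set by Freeze / cleared by Unfreeze
        return nil // not frozen: apply as usual
    }
    if cmd.IsUnfreeze() { // hypothetical accessor for the Unfreeze effect
        return nil // Unfreeze itself must still go through
    }
    // Deliberately non-retryable so callers don't spin while the cluster is
    // being upgraded.
    return errors.New("replica is frozen for upgrade; command not applied")
}
```
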
# Drawbacks

* The amount of data sent over the wire is expected to increase. In particular,
  `DeleteRange` could be problematic.
  It seems worthwhile to get a sense of expectations by means of a prototype.
* While the amount of code downstream of Raft is drastically reduced, it is not
  zero, so we will still need online Raft migrations (#5985) eventually.

There were concerns that this change would make it harder/impossible to get
rid of write-write blocking in the future. However, that concern was based on
the possibility of Raft reordering, which we can eliminate at this point (see
[Replay protection](#replay-protection)), and the problem boils down to having
all intermediate states which would result from applying the prefixes of the
existing overlapping writes (including side effects) available to execute on
top of, which happens at a layer above the changes in this RFC.

# Implementation strategy

Tackle the following in parallel (and roughly in this order):

* flesh out and implement `{F,Unf}reeze`; these are immediately useful for
  all upcoming migrations.
* prototype and evaluate the expected inflation of Raft proposal sizes.
* separate the side effects of Raft applications from `*Replica` and clearly
  separate the parts which can be moved pre-proposal as well as the parameters
  required for the post-proposal part. This informs the actual design of the
  Raft "effects" in this proposal.
  Most of these should come up naturally when trying to remove the `(*Replica)`
  receiver on most of the `ReplicaCommands` (`RequestLease`, `Put`, ...) and
  manually going through the rest of the call stack up to `processRaftCommand`.
* implement protection against Raft replays/reordering, ideally such that
  the lease holder knows at proposal time the log index at which commands
  commit (unless they have to be reproposed).
* flesh out the `WriteBatch` encoding. Ideally, this includes a parser in Go
  (to avoid an obstruction to switching storage engines), but the critical part
  here for the purpose of this RFC is hooking up the `RocksDB`-specific
  implementation and guarding against upstream changes.
  We can trust RocksDB to be backwards-compatible in any changes they make, but
  we require bidirectional compatibility. We create a batch on the lease holder
  and ship it off to the follower to be applied, and the follower might be
  running either an older or a newer version of the code. If they add a new
  record type, we either need to be able to guarantee that the new record types
  won't be used in our case, or translate them back to something the old
  version can understand.
* introduce and implement the `Batch` interface. This could be done after the
  `WriteBatch` encoding is done or, if required to unblock the implementation
  of this RFC, start with a less performant version based on an auxiliary map
  and protobuf serialization.

The goal here should be to identify all the necessary refactors to keep them
out of the remaining "forklift" portion of the change.

## Is this a feasible change?

It's scary

* if you try to migrate it properly (but see above for an option which seems
  viable)
* if you try to forklift it all
* that with a change of this magnitude we could stumble upon surprises
  in late-ish stages which call the whole change into question
* that it requires a large concerted effort to be carried out, with lots of
  moving parts touched
* that the performance implications aren't 100% clear (they need prototyping).

As I understand it, the change is very ambitious, but not as vast in scope as,
for example, the plans we have with distributed SQL (which, of course, are not
on such a tight deadline).

# Alternatives

Other than leaving the current system in place and pushing the complexity onto
the migration story, none.

# Unresolved questions

None currently.

# Footnotes

[1]: via @petermattis:

> There is another nice effect from using the `WriteBatch`: instead of
> marshaling and unmarshaling individual keys and values, we'll be passing
> around a blob. While investigating batch insert performance I added some
> instrumentation around `RaftCommand` marshaling and noticed:

```
raft command marshal:   20002: 6.2ms
raft command unmarshal: 20002: 29.1ms
raft command marshal:   20002: 6.0ms
raft command unmarshal: 20002: 12.4ms
raft command marshal:   20002: 6.0ms
raft command unmarshal: 20002: 42.6ms
```

The first number (20002) is the number of requests in the `BatchRequest` and
the second is the time it took to marshal/unmarshal. The marshaled data size
is almost exactly 1MB.