github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/pkg/kv/kvserver/apply/doc.go

// Copyright 2018 The Cockroach Authors.
//
// Use of this software is governed by the Business Source License
// included in the file licenses/BSL.txt.
//
// As of the Change Date specified in that file, in accordance with
// the Business Source License, use of this software will be governed
// by the Apache License, Version 2.0, included in the file
// licenses/APL.txt.

/*
Package apply provides abstractions and routines associated with the application
of committed raft entries to a replicated state machine.

State Machine Replication

Raft entry application is the process of taking entries that have been committed
to a raft group's "raft log" through raft consensus and using them to drive the
state machines of each member of the raft group (i.e. each replica). Committed
entries are decoded into commands in the same order that they are arranged in
the raft log (i.e. in order of increasing log index). This ordering of decoded
commands is then treated as the input to state transitions on each replica.

The key to this general approach, known as "state machine replication", is that
all state transitions are fully deterministic given only the current state of
the machine and the command to apply as input. This ensures that if all
instances are driven from the same consistent shared log (same entries, same
order), they will all stay in sync. In other words, if we ensure that all
replicas start as identical copies of each other and we ensure that all replicas
perform the same state transitions, in the same order, deterministically, then
through induction we know that all replicas will remain identical copies of each
other when compared at the same log index.
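
As a minimal sketch of this induction argument (not code from this package; the
command and replica types below are hypothetical), two toy replicas that start
empty and apply the same log in the same order hold identical state at the same
log index:

```go
package main

import "fmt"

// command is a hypothetical deterministic transition: add delta to key.
type command struct {
	key   string
	delta int
}

// replica is a toy state machine: a map plus a count of applied entries.
type replica struct {
	state   map[string]int
	applied int
}

func newReplica() *replica { return &replica{state: map[string]int{}} }

// apply performs one transition. Its result depends only on the current
// state and the command, never on time, randomness, or local conditions.
func (r *replica) apply(cmd command) {
	r.state[cmd.key] += cmd.delta
	r.applied++
}

// equal reports whether two replicas hold identical state at the same index.
func equal(a, b *replica) bool {
	if a.applied != b.applied || len(a.state) != len(b.state) {
		return false
	}
	for k, v := range a.state {
		if w, ok := b.state[k]; !ok || w != v {
			return false
		}
	}
	return true
}

func main() {
	raftLog := []command{{"a", 1}, {"b", 2}, {"a", -3}}
	r1, r2 := newReplica(), newReplica()
	for _, cmd := range raftLog {
		r1.apply(cmd)
		r2.apply(cmd)
	}
	fmt.Println(equal(r1, r2))
}
```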

This poses a problem for replicas that fail for any reason to apply an entry. If
the failure wasn't deterministic across all replicas then the failing replicas
can't carry on applying entries, as their state may have diverged from their
peers. The only reasonable recourse is to signal that the replica has become
corrupted. This demonstrates why it is necessary to separate deterministic
command failures from non-deterministic state transition failures. The former,
which we call "command rejection", is permissible as long as all replicas come
to the same decision to reject the command and handle the rejection in the same
way (e.g. decide not to make any state transition). The latter, on the other
hand, is not permissible, and is typically handled by crashing the node.
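
The distinction can be sketched as follows. This is an illustration only: the
machine type, the leaseSeq precondition, and the write hook are hypothetical
and are not this package's interfaces. A rejection is derived purely from
replicated state and is reported as a normal outcome, while a failure of the
local write is returned as an error that the caller is expected to treat as
fatal:

```go
package main

import (
	"errors"
	"fmt"
)

// command carries a hypothetical precondition (leaseSeq) that is checked
// against replicated state, so every replica reaches the same verdict.
type command struct {
	index    int
	leaseSeq int
	key      string
	value    string
}

type machine struct {
	leaseSeq int
	data     map[string]string
	applied  int
}

// errDisk stands in for a failure that only one replica might observe.
var errDisk = errors.New("disk failure")

// apply returns (rejected, err). A rejection is deterministic and makes no
// data transition. Assumption: a rejected command still advances the applied
// index, as if it had been replaced by an empty command. A non-nil error
// means this replica may have diverged from its peers.
func (m *machine) apply(c command, write func(k, v string) error) (bool, error) {
	if c.leaseSeq != m.leaseSeq {
		m.applied = c.index
		return true, nil
	}
	if err := write(c.key, c.value); err != nil {
		return false, fmt.Errorf("replica corrupted at index %d: %w", c.index, err)
	}
	m.data[c.key] = c.value
	m.applied = c.index
	return false, nil
}

func main() {
	m := &machine{leaseSeq: 2, data: map[string]string{}}
	healthy := func(k, v string) error { return nil }
	broken := func(k, v string) error { return errDisk }

	rejected, err := m.apply(command{index: 1, leaseSeq: 1, key: "a", value: "x"}, healthy)
	fmt.Println("rejected:", rejected, "err:", err)

	_, err = m.apply(command{index: 2, leaseSeq: 2, key: "a", value: "y"}, broken)
	fmt.Println("fatal:", err != nil)
}
```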

Performance Concerns

The state machine replication approach also poses complications that affect
performance.

A first challenge falls out from the requirement that all replicated commands be
sequentially applied on each replica to enforce determinism. This requirement
must hold even as the concurrency of the systems processing requests and driving
replication grows. If this concurrency imbalance becomes so great that the
sequential processing of updates to the replicated state machine can no longer
keep up with the concurrent processing feeding inputs into the replicated state
machine, replication itself becomes a throughput bottleneck for the system,
manifesting as replication lag. This problem, sometimes referred to as the
"parallelism gap", is fundamentally due to the loss of context on the
interaction between commands after replication and a resulting inability to
determine whether concurrent application of commands would be possible without
compromising determinism. Put another way, above the level of state machine
replication, it is easy to determine which commands conflict with one another,
and those that do not conflict can be run concurrently. However, below the level
of replication, it is unclear which commands conflict, so to ensure determinism
during state machine transitions, no concurrency is possible.

Although it makes no attempt to explicitly introduce concurrency into command
application, this package does attempt to improve replication throughput and
reduce this parallelism gap through the use of batching. A notion of command
triviality is exposed to clients of this package, and those commands that are
trivial are considered able to have their application batched with other
adjacent trivial commands. This batching, while still preserving a strict
ordering of commands, allows multiple commands to achieve some concurrency in
their interaction with the state machine. For instance, writes to a storage
engine from different commands are able to be batched together using this
interface. For more, see Batch.
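
One way to picture this grouping is sketched below. The command type and the
groupBatches helper are hypothetical, and the sketch assumes that a non-trivial
command is placed in a batch by itself; it is not the package's implementation.
Adjacent trivial commands share a batch and log order is preserved throughout:

```go
package main

import "fmt"

// command is a hypothetical decoded entry with a triviality flag.
type command struct {
	index   int
	trivial bool
}

// groupBatches splits an ordered slice of commands into batches. Each maximal
// run of adjacent trivial commands becomes one batch, and every non-trivial
// command gets a batch of its own. Order is never changed.
func groupBatches(cmds []command) [][]command {
	var batches [][]command
	for _, c := range cmds {
		n := len(batches)
		// A batch whose first command is trivial contains only trivial
		// commands, so it can accept another one.
		if c.trivial && n > 0 && batches[n-1][0].trivial {
			batches[n-1] = append(batches[n-1], c)
			continue
		}
		batches = append(batches, []command{c})
	}
	return batches
}

func main() {
	cmds := []command{{1, true}, {2, true}, {3, false}, {4, true}}
	var sizes []int
	for _, b := range groupBatches(cmds) {
		sizes = append(sizes, len(b))
	}
	fmt.Println(sizes) // prints [2 1 1]
}
```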

A second challenge arising from the technique of state machine replication is
its interaction with client responses and acknowledgment. We saw before that a
command is guaranteed to eventually apply if its corresponding raft entry is
committed in the raft log; individual replicas have no other choice but to
apply it. However, depending on the replicated state, the fact that a command
will apply may not be sufficient to return a response to a client. In some
cases, the command may still be rejected (deterministically) and the client
should be alerted to that. In more extreme cases, the result of the command may
not even be known until it is applied to the state machine. In CockroachDB, this
was the case until the major rework that took place in 2016 called "proposer
evaluated KV" (see docs/RFCS/20160420_proposer_evaluated_kv.md). With the
completion of that change, client responses are determined before replication
begins. The only remaining work to be done after replication of a command
succeeds is to determine whether it will be rejected and replaced by an empty
command. To facilitate this acknowledgment as early as possible, this package
provides the ability to acknowledge a series of commands before applying them to
the state machine. Outcomes are determined before performing any durable work by
stepping commands through an in-memory "ephemeral" copy of the state machine.
For more, see Task.AckCommittedEntriesBeforeApplication.
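
The idea can be sketched roughly as follows. The types, the leaseSeq rejection
rule, and the ackBeforeApplication helper are hypothetical stand-ins rather
than this package's API, and the sketch simplifies by reporting every outcome
to a waiting client. Commands are stepped through a copy of the in-memory
state, clients learn the outcome, and the real state machine is left untouched
for the later, durable application pass:

```go
package main

import "fmt"

// command is a hypothetical committed command. ack is nil when no local
// client is waiting on the result.
type command struct {
	index    int
	leaseSeq int
	ack      func(rejected bool)
}

// machine holds only the in-memory state needed to decide rejection.
type machine struct {
	leaseSeq int
	applied  int
}

// rejects is deterministic: it looks only at replicated state and the command.
func (m *machine) rejects(c command) bool { return c.leaseSeq != m.leaseSeq }

// ackBeforeApplication steps cmds through an ephemeral copy of m and
// acknowledges waiting clients with the outcome. m itself is not modified.
func ackBeforeApplication(m *machine, cmds []command) {
	ephemeral := *m // value copy; changes below never reach m
	for _, c := range cmds {
		rejected := ephemeral.rejects(c)
		ephemeral.applied = c.index
		if c.ack != nil {
			c.ack(rejected)
		}
	}
}

func main() {
	m := &machine{leaseSeq: 2, applied: 10}
	report := func(index int) func(bool) {
		return func(rejected bool) {
			fmt.Printf("index %d rejected=%t\n", index, rejected)
		}
	}
	cmds := []command{
		{index: 11, leaseSeq: 2, ack: report(11)},
		{index: 12, leaseSeq: 1, ack: report(12)},
		{index: 13, leaseSeq: 2}, // no local client waiting
	}
	ackBeforeApplication(m, cmds)
	fmt.Println("applied index is still", m.applied)
}
```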

A final challenge comes from the desire to properly prioritize the application
of commands across multiple state machines in systems like CockroachDB where
each machine hosts hundreds or thousands of replicas. This is a complicated
concern that must take into consideration the need for each replica's state
machine to stay up-to-date (is it a leaseholder? is it serving reads?), the need
to acknowledge clients in a timely manner (are clients waiting for command
application?), the desire to delay application to accumulate larger application
batches (will batching improve system throughput?), and a number of other
factors. This package has not begun to answer these questions, but it serves to
provide the abstractions necessary to perform such prioritization in the future.

Usage

The package exports a set of interfaces that users must provide implementations
for. Notably, users of the package must provide a StateMachine that encapsulates
the logic behind performing individual state transitions and a Decoder that is
capable of decoding raft entries and providing iteration over corresponding
Command objects.

These two structures can be used to create an application Task, which is capable
of applying raft entries to the StateMachine (see Task.ApplyCommittedEntries).
To do so, the Commands that were decoded using the Decoder (see Task.Decode) are
passed through a pipeline of stages. First, the Commands are checked for
rejection while being staged in an application Batch, which produces a set of
CheckedCommands. Next, the application Batch is committed to the StateMachine.
Following this, the in-memory side-effects of the CheckedCommands are applied to
the StateMachine, producing AppliedCommands. Finally, these AppliedCommands are
finalized and their clients are acknowledged.
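
The four stages can be illustrated with a self-contained toy. Every name below
(command, checkedCmd, appliedCmd, machine, applyEntries) is hypothetical and
much simpler than the interfaces this package exports; the sketch only mirrors
the order of the stages described above:

```go
package main

import "fmt"

// command is a hypothetical decoded entry. valid stands in for whatever
// deterministic check decides rejection.
type command struct {
	index    int
	valid    bool
	key, val string
}

// checkedCmd is a command whose rejection status has been decided.
type checkedCmd struct {
	command
	rejected bool
}

// appliedCmd is a checked command whose side effects have been applied.
type appliedCmd struct{ checkedCmd }

type machine struct {
	store        map[string]string // stands in for durable state
	appliedIndex int               // in-memory state
}

func applyEntries(m *machine, cmds []command, ack func(index int, rejected bool)) {
	// Stage 1: check each command and stage accepted writes in a batch.
	batch := map[string]string{}
	var staged []checkedCmd
	for _, c := range cmds {
		cc := checkedCmd{command: c, rejected: !c.valid}
		if !cc.rejected {
			batch[c.key] = c.val // later writes to a key overwrite earlier ones
		}
		staged = append(staged, cc)
	}
	// Stage 2: commit the whole batch to the state machine at once.
	for k, v := range batch {
		m.store[k] = v
	}
	// Stage 3: apply in-memory side effects, in log order.
	var done []appliedCmd
	for _, cc := range staged {
		m.appliedIndex = cc.index
		done = append(done, appliedCmd{cc})
	}
	// Stage 4: finalize and acknowledge clients.
	for _, ac := range done {
		ack(ac.index, ac.rejected)
	}
}

func main() {
	m := &machine{store: map[string]string{}}
	cmds := []command{
		{index: 1, valid: true, key: "a", val: "x"},
		{index: 2, valid: false, key: "a", val: "ignored"},
	}
	applyEntries(m, cmds, func(index int, rejected bool) {
		fmt.Printf("index %d rejected=%t\n", index, rejected)
	})
	fmt.Println(m.store["a"], m.appliedIndex)
}
```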
*/
package apply