github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/pkg/kv/kvserver/apply/doc.go

// Copyright 2018 The Cockroach Authors.
//
// Use of this software is governed by the Business Source License
// included in the file licenses/BSL.txt.
//
// As of the Change Date specified in that file, in accordance with
// the Business Source License, use of this software will be governed
// by the Apache License, Version 2.0, included in the file
// licenses/APL.txt.

/*
Package apply provides abstractions and routines associated with the application
of committed raft entries to a replicated state machine.

State Machine Replication

Raft entry application is the process of taking entries that have been committed
to a raft group's "raft log" through raft consensus and using them to drive the
state machines of each member of the raft group (i.e. each replica). Committed
entries are decoded into commands in the same order that they are arranged in
the raft log (i.e. in order of increasing log index). This ordering of decoded
commands is then treated as the input to state transitions on each replica.

The key to this general approach, known as "state machine replication", is that
all state transitions are fully deterministic given only the current state of
the machine and the command to apply as input. This ensures that if each
instance is driven from the same consistent shared log (same entries, same
order), they will all stay in sync. In other words, if we ensure that all
replicas start as identical copies of each other and we ensure that all replicas
perform the same state transitions, in the same order, deterministically, then
by induction we know that all replicas will remain identical copies of each
other when compared at the same log index.

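The induction argument can be illustrated with a minimal sketch. The Command
and KV types and the Replay and Equal functions below are hypothetical and are
not part of this package. They only show that two replicas that start from the
same empty state and consume the same log in the same order end in the same
state, and that a different order may not.

```go
package main

import "fmt"

// Command is a hypothetical replicated command that sets Key to Value.
type Command struct {
	Key, Value string
}

// KV is a toy replica state (an assumption of this sketch).
type KV map[string]string

// Apply performs one transition. Its result depends only on the current
// state and the command, so it is deterministic.
func (s KV) Apply(c Command) { s[c.Key] = c.Value }

// Replay applies the log, in order of increasing index, to an empty state.
func Replay(log []Command) KV {
	s := KV{}
	for _, c := range log {
		s.Apply(c)
	}
	return s
}

// Equal reports whether two states hold the same key/value pairs.
func Equal(a, b KV) bool {
	if len(a) != len(b) {
		return false
	}
	for k, v := range a {
		if w, ok := b[k]; !ok || w != v {
			return false
		}
	}
	return true
}

func main() {
	log := []Command{{"a", "1"}, {"b", "2"}, {"a", "3"}}
	// Two replicas driven from the same log compare equal.
	fmt.Println(Equal(Replay(log), Replay(log)))
}
```
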
This poses a problem for replicas that fail for any reason to apply an entry. If
the failure wasn't deterministic across all replicas then they can't carry on
applying entries, as their state may have diverged from their peers. The only
reasonable recourse is to signal that the replica has become corrupted. This
demonstrates why it is necessary to separate deterministic command failures from
non-deterministic state transition failures. The former, which we call "command
rejection", is permissible as long as all replicas come to the same decision to
reject the command and handle the rejection in the same way (e.g. decide not to
make any state transition). The latter, on the other hand, is not permissible,
and is typically handled by crashing the node.

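The distinction between the two kinds of failure can be sketched as follows.
Everything here is hypothetical: the Replica and Command types, the LeaseSeq
precondition, and the choice to advance the applied index for a rejected
command are assumptions made for illustration, not this package's behavior.
Rejection is computed purely from the command and replicated state, while a
storage error is local to one replica and is surfaced as fatal.

```go
package main

import (
	"errors"
	"fmt"
)

// Command is a hypothetical decoded command. LeaseSeq stands in for any
// precondition that every replica can check against replicated state.
type Command struct {
	Index    uint64
	LeaseSeq int
	Key, Val string
}

// Replica is a toy state machine (not this package's StateMachine).
type Replica struct {
	LeaseSeq int
	Data     map[string]string
	Applied  uint64
}

// Apply applies c using write as the storage layer. A precondition mismatch
// is a deterministic rejection: every replica reaches the same verdict and
// makes no change to Data. An error from write models a non-deterministic
// failure (e.g. a local disk error) and is returned as fatal.
func (r *Replica) Apply(c Command, write func(k, v string) error) (rejected bool, err error) {
	if c.LeaseSeq != r.LeaseSeq {
		r.Applied = c.Index // assumption: rejected commands still advance the index
		return true, nil
	}
	if err := write(c.Key, c.Val); err != nil {
		return false, fmt.Errorf("replica corrupted at index %d: %w", c.Index, err)
	}
	r.Applied = c.Index
	return false, nil
}

// FailingWrite simulates a storage failure local to one replica.
func FailingWrite(k, v string) error { return errors.New("disk error") }

func main() {
	r := &Replica{LeaseSeq: 2, Data: map[string]string{}}
	store := func(k, v string) error { r.Data[k] = v; return nil }

	rejected, err := r.Apply(Command{Index: 1, LeaseSeq: 1, Key: "a", Val: "x"}, store)
	fmt.Println(rejected, err)

	if _, err := r.Apply(Command{Index: 2, LeaseSeq: 2, Key: "a", Val: "x"}, FailingWrite); err != nil {
		// A real node would typically crash here rather than continue.
		fmt.Println("fatal:", err)
	}
}
```
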
Performance Concerns

The state machine replication approach also poses complications that affect
performance.

A first challenge falls out from the requirement that all replicated commands be
sequentially applied on each replica to enforce determinism. This requirement
must hold even as the concurrency of the systems processing requests and driving
replication grows. If this concurrency imbalance becomes so great that the
sequential processing of updates to the replicated state machine can no longer
keep up with the concurrent processing feeding inputs into the replicated state
machine, replication itself becomes a throughput bottleneck for the system,
manifesting as replication lag. This problem, sometimes referred to as the
"parallelism gap", is fundamentally due to the loss of context on the
interaction between commands after replication and a resulting inability to
determine whether concurrent application of commands would be possible without
compromising determinism. Put another way, above the level of state machine
replication, it is easy to determine which commands conflict with one another,
and those that do not conflict can be run concurrently. However, below the level
of replication, it is unclear which commands conflict, so to ensure determinism
during state machine transitions, no concurrency is possible.

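As a rough illustration of why conflicts are easy to detect above replication,
consider requests that still carry the set of keys they touch. The Request type
and Conflicts function below are hypothetical and assume a simple model in which
two requests conflict exactly when they share a key. Once a request has been
flattened into an opaque log entry, this information is no longer at hand.

```go
package main

import "fmt"

// Request is a hypothetical pre-replication request that declares its keys.
type Request struct {
	ID   int
	Keys []string
}

// Conflicts reports whether a and b declare at least one key in common.
func Conflicts(a, b Request) bool {
	seen := make(map[string]struct{}, len(a.Keys))
	for _, k := range a.Keys {
		seen[k] = struct{}{}
	}
	for _, k := range b.Keys {
		if _, ok := seen[k]; ok {
			return true
		}
	}
	return false
}

func main() {
	r1 := Request{ID: 1, Keys: []string{"a", "b"}}
	r2 := Request{ID: 2, Keys: []string{"c"}}
	r3 := Request{ID: 3, Keys: []string{"b"}}
	// r1 and r2 could run concurrently; r1 and r3 must be ordered.
	fmt.Println(Conflicts(r1, r2), Conflicts(r1, r3))
}
```
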
Although it makes no attempt to explicitly introduce concurrency into command
application, this package does attempt to improve replication throughput and
reduce this parallelism gap through the use of batching. A notion of command
triviality is exposed to clients of this package, and trivial commands may have
their application batched with other adjacent trivial commands. This batching,
while still preserving a strict ordering of commands, allows multiple commands
to achieve some concurrency in their interaction with the state machine. For
instance, writes to a storage engine from different commands are able to be
batched together using this interface. For more, see Batch.

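The grouping of adjacent trivial commands can be sketched as below. The Command
type and GroupIntoBatches function are hypothetical, and the rule that each
non-trivial command gets a batch of its own is an assumption of this sketch
rather than something stated above. Log order is preserved both within and
across batches.

```go
package main

import "fmt"

// Command is a hypothetical decoded command with a triviality flag.
type Command struct {
	Index   uint64
	Trivial bool
}

// GroupIntoBatches splits cmds, in log order, into runs that may be applied
// together. Consecutive trivial commands share a batch; each non-trivial
// command is placed in a batch by itself.
func GroupIntoBatches(cmds []Command) [][]Command {
	var batches [][]Command
	for _, c := range cmds {
		n := len(batches)
		// Extend the last batch only if both it and c are trivial.
		if c.Trivial && n > 0 && batches[n-1][0].Trivial {
			batches[n-1] = append(batches[n-1], c)
			continue
		}
		batches = append(batches, []Command{c})
	}
	return batches
}

func main() {
	cmds := []Command{{1, true}, {2, true}, {3, false}, {4, true}}
	for _, b := range GroupIntoBatches(cmds) {
		fmt.Println(len(b), b[0].Index)
	}
}
```
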
A second challenge arising from the technique of state machine replication is
its interaction with client responses and acknowledgment. We saw before that a
command is guaranteed to eventually apply if its corresponding raft entry is
committed in the raft log; individual replicas have no other choice but to
apply it. However, depending on the replicated state, the fact that a command
will apply may not be sufficient to return a response to a client. In some
cases, the command may still be rejected (deterministically) and the client
should be alerted to that. In more extreme cases, the result of the command may
not even be known until it is applied to the state machine. In CockroachDB, this
was the case until the major rework that took place in 2016 called "proposer
evaluated KV" (see docs/RFCS/20160420_proposer_evaluated_kv.md). With the
completion of that change, client responses are determined before replication
begins. The only remaining work to be done after replication of a command
succeeds is to determine whether it will be rejected and replaced by an empty
command. To facilitate this acknowledgment as early as possible, this package
provides the ability to acknowledge a series of commands before applying them to
the state machine. Outcomes are determined before performing any durable work by
stepping commands through an in-memory "ephemeral" copy of the state machine.
For more, see Task.AckCommittedEntriesBeforeApplication.

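The idea of deciding outcomes on an ephemeral copy can be sketched as follows.
The State and Command types, the Version precondition, and DecideOutcomes are
all hypothetical and do not reflect the signature or behavior of
Task.AckCommittedEntriesBeforeApplication. The sketch only shows that rejection
verdicts, including ones that depend on earlier commands in the same series,
can be computed without modifying the durable state.

```go
package main

import "fmt"

// State is a toy state machine whose Version gates command acceptance.
type State struct{ Version int }

// Command is accepted only if Expect matches the current Version. An
// accepted command with Bump set increments Version.
type Command struct {
	Expect int
	Bump   bool
}

// Step applies c to s and reports whether c was rejected.
func (s *State) Step(c Command) (rejected bool) {
	if c.Expect != s.Version {
		return true
	}
	if c.Bump {
		s.Version++
	}
	return false
}

// DecideOutcomes steps cmds through an in-memory copy of durable and returns
// the rejection verdict for each one. durable itself is left untouched.
func DecideOutcomes(durable *State, cmds []Command) []bool {
	ephemeral := *durable
	outcomes := make([]bool, len(cmds))
	for i, c := range cmds {
		outcomes[i] = ephemeral.Step(c)
	}
	return outcomes
}

func main() {
	durable := &State{}
	cmds := []Command{
		{Expect: 0, Bump: true},
		{Expect: 0}, // stale after the first command bumps Version
		{Expect: 1},
	}
	// Clients could be acknowledged from these verdicts before any durable work.
	fmt.Println(DecideOutcomes(durable, cmds), durable.Version)
}
```
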
A final challenge comes from the desire to properly prioritize the application
of commands across multiple state machines in systems like CockroachDB where
each machine hosts hundreds or thousands of replicas. This is a complicated
concern that must take into consideration the need for each replica's state
machine to stay up to date (is it a leaseholder? is it serving reads?), the need
to acknowledge clients in a timely manner (are clients waiting for command
application?), the desire to delay application to accumulate larger application
batches (will batching improve system throughput?), and a number of other
factors. This package has not begun to answer these questions, but it serves to
provide the abstractions necessary to perform such prioritization in the future.

Usage

The package exports a set of interfaces that users must provide implementations
for. Notably, users of the package must provide a StateMachine that encapsulates
the logic behind performing individual state transitions and a Decoder that is
capable of decoding raft entries and providing iteration over corresponding
Command objects.

These two structures can be used to create an application Task, which is capable
of applying raft entries to the StateMachine (see Task.ApplyCommittedEntries).
To do so, the Commands that were decoded using the Decoder (see Task.Decode) are
passed through a pipeline of stages. First, the Commands are checked for
rejection while being staged in an application Batch, which produces a set of
CheckedCommands. Next, the application Batch is committed to the StateMachine.
Following this, the in-memory side-effects of the CheckedCommands are applied to
the StateMachine, producing AppliedCommands. Finally, these AppliedCommands are
finalized and their clients are acknowledged.
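
The stages of this pipeline can be sketched with toy types. The sketch borrows
the stage names from the description above, but every type, field, and method
signature in it (Machine, Batch, Stage, Commit, ApplySideEffects, Ack, ApplyAll)
is a simplified placeholder chosen for illustration and does not mirror the
interfaces exported by this package.

```go
package main

import "fmt"

// Command is a hypothetical decoded command; Valid stands in for whatever
// deterministic check decides rejection.
type Command struct {
	Index    uint64
	Valid    bool
	Key, Val string
}

// CheckedCommand is a Command that has been checked for rejection.
type CheckedCommand struct {
	Command
	Rejected bool
}

// AppliedCommand is a CheckedCommand whose side effects have been applied.
type AppliedCommand struct{ CheckedCommand }

// Machine is a toy state machine. Store stands in for durable storage.
type Machine struct {
	Store   map[string]string
	Applied uint64
	Acked   []uint64
}

// Batch accumulates the writes of staged, non-rejected commands.
type Batch struct {
	m      *Machine
	writes [][2]string
}

// Stage checks c for rejection and, if accepted, adds its write to the batch.
func (b *Batch) Stage(c Command) CheckedCommand {
	cc := CheckedCommand{Command: c, Rejected: !c.Valid}
	if !cc.Rejected {
		b.writes = append(b.writes, [2]string{c.Key, c.Val})
	}
	return cc
}

// Commit applies all staged writes, in order, to the machine's store.
func (b *Batch) Commit() {
	for _, w := range b.writes {
		b.m.Store[w[0]] = w[1]
	}
}

// ApplySideEffects updates in-memory state for one checked command.
func (m *Machine) ApplySideEffects(cc CheckedCommand) AppliedCommand {
	m.Applied = cc.Index
	return AppliedCommand{cc}
}

// Ack records that the client of ac has been acknowledged.
func (m *Machine) Ack(ac AppliedCommand) { m.Acked = append(m.Acked, ac.Index) }

// ApplyAll runs cmds through the stages: stage, commit, side effects, ack.
func ApplyAll(m *Machine, cmds []Command) {
	b := &Batch{m: m}
	checked := make([]CheckedCommand, 0, len(cmds))
	for _, c := range cmds {
		checked = append(checked, b.Stage(c))
	}
	b.Commit()
	for _, cc := range checked {
		m.Ack(m.ApplySideEffects(cc))
	}
}

func main() {
	m := &Machine{Store: map[string]string{}}
	ApplyAll(m, []Command{{1, true, "a", "1"}, {2, false, "b", "2"}, {3, true, "a", "3"}})
	fmt.Println(m.Store, m.Applied, m.Acked)
}
```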
*/
package apply