github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/pkg/kv/kvserver/apply/doc.go

// Copyright 2018 The Cockroach Authors.
//
// Use of this software is governed by the Business Source License
// included in the file licenses/BSL.txt.
//
// As of the Change Date specified in that file, in accordance with
// the Business Source License, use of this software will be governed
// by the Apache License, Version 2.0, included in the file
// licenses/APL.txt.

/*
Package apply provides abstractions and routines associated with the application
of committed raft entries to a replicated state machine.

State Machine Replication

Raft entry application is the process of taking entries that have been committed
to a raft group's "raft log" through raft consensus and using them to drive the
state machines of each member of the raft group (i.e. each replica). Committed
entries are decoded into commands in the same order that they are arranged in
the raft log (i.e. in order of increasing log index). This ordering of decoded
commands is then treated as the input to state transitions on each replica.

The key to this general approach, known as "state machine replication", is that
all state transitions are fully deterministic given only the current state of
the machine and the command to apply as input. This ensures that if each
instance is driven from the same consistent shared log (same entries, same
order), they will all stay in sync. In other words, if we ensure that all
replicas start as identical copies of each other and we ensure that all replicas
perform the same state transitions, in the same order, deterministically, then
through induction we know that all replicas will remain identical copies of each
other when compared at the same log index.

This poses a problem for replicas that fail for any reason to apply an entry.
If the failure wasn't deterministic across all replicas then they can't carry on
applying entries, as their state may have diverged from their peers. The only
reasonable recourse is to signal that the replica has become corrupted. This
demonstrates why it is necessary to separate deterministic command failures from
non-deterministic state transition failures. The former, which we call "command
rejection", is permissible as long as all replicas come to the same decision to
reject the command and handle the rejection in the same way (e.g. decide not to
make any state transition). The latter, on the other hand, is not permissible,
and is typically handled by crashing the node.

Performance Concerns

The state machine replication approach also poses complications that affect
performance.

A first challenge falls out from the requirement that all replicated commands be
sequentially applied on each replica to enforce determinism. This requirement
must hold even as the concurrency of the systems processing requests and driving
replication grows. If this concurrency imbalance becomes so great that the
sequential processing of updates to the replicated state machine can no longer
keep up with the concurrent processing feeding inputs into the replicated state
machine, replication itself becomes a throughput bottleneck for the system,
manifesting as replication lag. This problem, sometimes referred to as the
"parallelism gap", is fundamentally due to the loss of context on the
interaction between commands after replication and a resulting inability to
determine whether concurrent application of commands would be possible without
compromising determinism. Put another way, above the level of state machine
replication, it is easy to determine which commands conflict with one another,
and those that do not conflict can be run concurrently.
However, below the level of replication, it is unclear which commands conflict,
so to ensure determinism during state machine transitions, no concurrency is
possible.

Although it makes no attempt to explicitly introduce concurrency into command
application, this package does attempt to improve replication throughput and
reduce this parallelism gap through the use of batching. A notion of command
triviality is exposed to clients of this package, and those commands that are
trivial are considered able to have their application batched with other
adjacent trivial commands. This batching, while still preserving a strict
ordering of commands, allows multiple commands to achieve some concurrency in
their interaction with the state machine. For instance, writes to a storage
engine from different commands are able to be batched together using this
interface. For more, see Batch.

A second challenge arising from the technique of state machine replication is
its interaction with client responses and acknowledgment. We saw before that a
command is guaranteed to eventually apply if its corresponding raft entry is
committed in the raft log; individual replicas have no other choice but to
apply it. However, depending on the replicated state, the fact that a command
will apply may not be sufficient to return a response to a client. In some
cases, the command may still be rejected (deterministically) and the client
should be alerted to that. In more extreme cases, the result of the command may
not even be known until it is applied to the state machine. In CockroachDB, this
was the case until the major rework that took place in 2016 called "proposer
evaluated KV" (see docs/RFCS/20160420_proposer_evaluated_kv.md). With the
completion of that change, client responses are determined before replication
begins.
The only remaining work to be done after replication of a command
succeeds is to determine whether it will be rejected and replaced by an empty
command. To facilitate this acknowledgment as early as possible, this package
provides the ability to acknowledge a series of commands before applying them to
the state machine. Outcomes are determined before performing any durable work by
stepping commands through an in-memory "ephemeral" copy of the state machine.
For more, see Task.AckCommittedEntriesBeforeApplication.

A final challenge comes from the desire to properly prioritize the application
of commands across multiple state machines in systems like CockroachDB where
each machine hosts hundreds or thousands of replicas. This is a complicated
concern that must take into consideration the need for each replica's state
machine to stay up-to-date (is it a leaseholder? is it serving reads?), the need
to acknowledge clients in a timely manner (are clients waiting for command
application?), the desire to delay application to accumulate larger application
batches (will batching improve system throughput?), and a number of other
factors. This package has not begun to answer these questions, but it serves to
provide the abstractions necessary to perform such prioritization in the future.

Usage

The package exports a set of interfaces that users must provide implementations
for. Notably, users of the package must provide a StateMachine that encapsulates
the logic behind performing individual state transitions and a Decoder that is
capable of decoding raft entries and providing iteration over corresponding
Command objects.

These two structures can be used to create an application Task, which is capable
of applying raft entries to the StateMachine (see Task.ApplyCommittedEntries).
To do so, the Commands that were decoded using the Decoder (see Task.Decode) are
passed through a pipeline of stages. First, the Commands are checked for
rejection while being staged in an application Batch, which produces a set of
CheckedCommands. Next, the application Batch is committed to the StateMachine.
Following this, the in-memory side-effects of the CheckedCommands are applied to
the StateMachine, producing AppliedCommands. Finally, these AppliedCommands are
finalized and their clients are acknowledged.
*/
package apply
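
The staged pipeline described under Usage (stage and check, commit the batch, apply side-effects, acknowledge) can be illustrated with a small standalone sketch. Everything here is hypothetical and is not the package's actual API: the lower-case types, their methods, and the rule that a command with an empty key is deterministically rejected are assumptions made for illustration. For simplicity the sketch stages every command in one batch, whereas the package only batches adjacent trivial commands.

```go
package main

import "fmt"

// command is a hypothetical decoded raft entry: a single key/value write.
type command struct {
	index uint64
	key   string
	value string
}

// checkedCommand is a command whose rejection status has been decided.
type checkedCommand struct {
	command
	rejected bool
}

// appliedCommand is a checkedCommand whose side-effects have been applied.
type appliedCommand struct {
	checkedCommand
}

// stateMachine is a toy replicated state machine (hypothetical).
type stateMachine struct {
	data         map[string]string // stands in for durable state
	appliedIndex uint64            // in-memory bookkeeping
}

// batch accumulates the writes of staged commands until commit.
type batch struct {
	sm     *stateMachine
	writes []command
}

func newStateMachine() *stateMachine {
	return &stateMachine{data: make(map[string]string)}
}

func (sm *stateMachine) newBatch() *batch { return &batch{sm: sm} }

// stage checks the command for rejection and, if accepted, adds its write to
// the batch. The rule depends only on the command, so every replica reaches
// the same decision. A rejected command contributes no write.
func (b *batch) stage(c command) checkedCommand {
	cc := checkedCommand{command: c, rejected: c.key == ""}
	if !cc.rejected {
		b.writes = append(b.writes, c)
	}
	return cc
}

// commit applies all staged writes to the state machine, in log order.
func (b *batch) commit() {
	for _, w := range b.writes {
		b.sm.data[w.key] = w.value
	}
	b.writes = nil
}

// applySideEffects performs the in-memory work for one checked command.
// Rejected commands still advance the applied index.
func (sm *stateMachine) applySideEffects(cc checkedCommand) appliedCommand {
	sm.appliedIndex = cc.index
	return appliedCommand{checkedCommand: cc}
}

// ack reports the command's outcome, standing in for a client response.
func (ac appliedCommand) ack() string {
	outcome := "applied"
	if ac.rejected {
		outcome = "rejected"
	}
	return fmt.Sprintf("index %d: %s", ac.index, outcome)
}

// applyCommittedEntries runs the four stages over cmds in log order.
func applyCommittedEntries(sm *stateMachine, cmds []command) []string {
	b := sm.newBatch()
	checked := make([]checkedCommand, 0, len(cmds))
	for _, c := range cmds {
		checked = append(checked, b.stage(c)) // stage 1: check and stage
	}
	b.commit() // stage 2: commit the batch
	acks := make([]string, 0, len(checked))
	for _, cc := range checked {
		ac := sm.applySideEffects(cc) // stage 3: in-memory side-effects
		acks = append(acks, ac.ack()) // stage 4: acknowledge
	}
	return acks
}

func main() {
	sm := newStateMachine()
	cmds := []command{
		{index: 1, key: "a", value: "1"},
		{index: 2, key: "", value: "ignored"}, // rejected on every replica
		{index: 3, key: "a", value: "2"},
	}
	for _, a := range applyCommittedEntries(sm, cmds) {
		fmt.Println(a)
	}
	fmt.Println("a =", sm.data["a"], "applied index =", sm.appliedIndex)
}
```

Because the rejection decision depends only on the command itself, two instances of this toy machine fed the same commands in the same order end in the same state, which is the property the State Machine Replication section relies on.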