# Flow's HotStuff

We use a BFT consensus algorithm with deterministic finality in Flow for
* Consensus Nodes: decide on the content of blocks by including collections of transactions,
* Clusters of Collector Nodes: batch transactions into collections.

Conceptually, Flow uses a derivative of [HotStuff](https://arxiv.org/abs/1803.05069) called Jolteon.
It was originally described in the paper ['Jolteon and Ditto: Network-Adaptive Efficient Consensus with Asynchronous Fallback'](https://arxiv.org/abs/2106.10362),
published June 2021 by Meta's blockchain research team Novi and academic collaborators.
Meta's team (then called 'Diem') [implemented](https://github.com/diem/diem/blob/latest/consensus/README.md) Jolteon with marginal modifications and named it
[DiemBFT v4](https://developers.diem.com/papers/diem-consensus-state-machine-replication-in-the-diem-blockchain/2021-08-17.pdf),
which was subsequently rebranded as [AptosBFT](https://github.com/aptos-labs/aptos-core/tree/main/developer-docs-site/static/papers/aptos-consensus-state-machine-replication-in-the-aptos-blockchain).
Conceptually, Jolteon (DiemBFT v4, AptosBFT) belongs to the family of HotStuff consensus protocols,
but adds two significant improvements over the original HotStuff: (i) Jolteon incorporates a PaceMaker with active message exchange for view synchronization
and (ii) utilizes the most efficient 2-chain commit rule.

The foundational innovation in the original HotStuff was its pipelining of block production and finalization.
It utilizes leaders for information collection and to drive consensus, which makes it highly message-efficient.
In HotStuff, the consensus mechanics are cleverly arranged such that the protocol runs as fast as network conditions permit,
without being limited by fixed, minimal wait times for messages.
This property is called "responsiveness" and is very important in practice.
HotStuff is a round-based consensus algorithm, which requires a supermajority of nodes to be in the same view to make progress. It is the role
of the pacemaker to guarantee that eventually a supermajority of nodes will be in the same view. In the original HotStuff, the pacemaker was
essentially left as a black box. The only requirement was that the pacemaker had to get the nodes eventually into the same view.
Vanilla HotStuff requires 3 subsequent children to finalize a block on the happy path (aka the '3-chain rule').
In the original HotStuff paper, the authors discuss the more efficient 2-chain rule. They explain a timing-related edge case, where the protocol
could theoretically get stuck in a timeout loop without progress. To guarantee liveness despite this edge case, the event-driven HotStuff variant in the original paper employs
the 3-chain rule.

As this discussion illustrates, HotStuff is more a family of algorithms: the pacemaker is conceptually separated and can be implemented
in many ways. The finality rule is easily swapped out. In addition, there are various other degrees of freedom left open in the family
of HotStuff protocols.
The Jolteon protocol extends HotStuff by specifying one particular pacemaker, which utilizes dedicated messages to synchronize views and provides very strong guarantees.
Thereby, the Jolteon pacemaker closes the timing edge case that forces the original HotStuff to use the 3-chain rule for liveness.
As a consequence, the Jolteon protocol can utilize the most efficient 2-chain rule.
While Jolteon's close integration of the pacemaker into the consensus mechanics changes the correctness and liveness proofs significantly,
the protocol's runtime behaviour matches the HotStuff framework. Therefore, we categorize Jolteon, DiemBFT v4, and AptosBFT as members of the HotStuff family.

Flow's consensus is largely an implementation of Jolteon with some elements from DiemBFT v4. While these consensus protocols are identical on a conceptual level,
they subtly differ in the information included in the different consensus messages. For Flow's implementation, we combined nuances
from Jolteon and DiemBFT v4 as follows to improve runtime efficiency, reduce code complexity, and minimize the surface for byzantine attacks:

* Flow's `TimeoutObject` implements the `timeout` message in Jolteon.
  * In the Jolteon protocol,
    the `timeout` message contains the view `V`, which the sender wishes to abandon, and the latest
    Quorum Certificate [QC] known to the sender. Due to successive leader failures, the QC might
    not be from the previous view, i.e. `QC.View + 1 < V` is possible. When receiving a `timeout` message,
    it is possible for the recipient to advance to round `QC.View + 1`,
    but not necessarily to view `V`, as a malicious sender might set `V` to an erroneously large value.
    On the one hand, a recipient that has fallen behind cannot catch up to view `V` immediately. On the other hand, the recipient must
    cache the timeout to guarantee liveness, making it vulnerable to memory exhaustion attacks.
  * [DiemBFT v4](https://developers.diem.com/papers/diem-consensus-state-machine-replication-in-the-diem-blockchain/2021-08-17.pdf) introduced the additional rule that the `timeout`
    must additionally include the Timeout Certificate [TC] for the previous view,
    if and only if the contained QC is _not_ from the previous round (i.e. `QC.View + 1 < V`). Conceptually,
    this means that the sender of the timeout message must prove that they entered round `V` according to protocol rules.
    In other words, malicious nodes cannot send timeouts for future views that they should not have entered.

  For Flow, we follow the convention from DiemBFT v4 (see the sketch following this list). This modification simplifies byzantine-resilient processing of `TimeoutObject`s,
  avoiding subtle spamming and memory exhaustion attacks. Furthermore, it speeds up the recovery of crashed nodes.

* For consensus votes, we stick with the original Jolteon format, i.e. we do _not_ include the highest QC known to the voter, as is the case in DiemBFT v4.
  The QC is useful only on the unhappy path, where a node has missed some recent blocks. However, including a QC in every vote adds consistent overhead to the happy
  path. In Jolteon as well as DiemBFT v4, the timeout messages already contain the highest known QCs. Therefore, the highest QC is already
  shared among the network on the unhappy path even without including it in the votes.
* In Jolteon, the TC contains the full QCs from a supermajority of nodes, which have some overhead in size. DiemBFT v4 improves this by only including
  the QCs' respective views in the TC. Flow utilizes this optimization from DiemBFT v4.
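
The following sketch makes the timeout-validity rule above concrete. It is illustrative only: the type and field names (`TimeoutObject`, `NewestQC`, `LastViewTC`) are simplified stand-ins, not the exact definitions in this repository.

```go
// Simplified illustration of the DiemBFT-v4-style rule adopted by Flow:
// a timeout for view V must prove that its sender legitimately entered V.
type QuorumCertificate struct{ View uint64 }
type TimeoutCertificate struct{ View uint64 }

type TimeoutObject struct {
	View       uint64              // view V that the sender wishes to abandon
	NewestQC   *QuorumCertificate  // newest QC known to the sender (always required)
	LastViewTC *TimeoutCertificate // TC for view V-1; required iff NewestQC is older than view V-1
}

func validTimeout(t TimeoutObject) bool {
	if t.NewestQC == nil || t.NewestQC.View >= t.View {
		return false // a QC must be present and stem from a view below V
	}
	if t.NewestQC.View+1 == t.View {
		// The QC is from the previous view and by itself proves entry into view V.
		// Per the if-and-only-if rule, no TC should be attached in this case.
		return t.LastViewTC == nil
	}
	// The QC is older than the previous view: the sender must additionally
	// prove entry into view V with a TC for view V-1.
	return t.LastViewTC != nil && t.LastViewTC.View+1 == t.View
}
```
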
In the following, we will use the terms Jolteon and HotStuff interchangeably to refer to Flow's consensus implementation.
Beyond the realm of HotStuff and Jolteon, we have added the following advancement to Flow's consensus system:

* Flow contains a **decentralized random beacon** (based on [Dfinity's proposal](https://dfinity.org/pdf-viewer/library/dfinity-consensus.pdf)).
  The random beacon is run by Flow's consensus nodes and integrated into the consensus voting process. The random beacon provides a nearly
  unbiasable source of entropy natively within the protocol that is verifiable and deterministic. The random beacon can be used to generate pseudo-random
  numbers, which we use within the Flow protocol in various places. We plan to also use the random beacon to implement secure pseudo-random number generators in Cadence.


## Architecture

_Concepts and Terminology_

In Flow, there are multiple HotStuff instances running in parallel. Specifically, the consensus nodes form a HotStuff committee and each collector cluster is its own committee.
In the following, we refer to an authorized set of nodes, who run a particular HotStuff instance, as a (HotStuff) `committee`.

* Flow allows nodes to have different weights, reflecting how much they are trusted by the protocol.
  The weight of a node can change over time due to stake changes or discovering protocol violations.
  A `super-majority` of nodes is defined as a subset of the consensus committee,
  where the nodes have _more_ than 2/3 of the entire committee's accumulated weight.
* Conceptually, Flow allows the random beacon to be run by only a subset of the consensus nodes, aka the "random beacon committee".
* The messages from zero-weighted nodes are ignored by all committee members.


### Determining block validity

In addition to Jolteon's requirements on block validity, the Flow protocol adds further requirements.
For example, it is illegal to repeatedly include the same payload entities (e.g. collections, challenges, etc.) in the same fork.
Generally, payload entities expire. However, within the expiry horizon, all ancestors of a block need to be known
to verify that payload entities are not repeated.

We exclude the entire logic for determining payload validity from the HotStuff core implementation.
This functionality is encapsulated in the Chain Compliance Layer (CCL), which precedes HotStuff.
The CCL is designed to forward only fully validated blocks to the HotStuff core logic.
The CCL forwards a block to the HotStuff core logic only if
* the block's header is valid (including QC and optional TC),
* the block's payload is valid,
* the block is connected to the most recently finalized block, and
* all ancestors have previously been forwarded to HotStuff.

If ancestors of a block are missing, the CCL caches the respective block and (iteratively) requests missing ancestors.

### Payload generation
Payloads are generated outside the HotStuff core logic. HotStuff only incorporates the payload root hash into the block header.

### Structure of votes
In Flow's HotStuff implementation, votes are used for two purposes:
1. Prove that a super-majority of committee nodes consider the respective block a valid extension of the chain.
   Therefore, nodes include a `StakingSignature` (BLS with curve BLS12-381) in their vote.
2. Construct a Source of Randomness as described in [Dfinity's Random Beacon](https://dfinity.org/pdf-viewer/library/dfinity-consensus.pdf).
   Therefore, consensus nodes include a `RandomBeaconSignature` (also BLS with curve BLS12-381, used in a threshold signature scheme) in their vote.
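
The sketch below illustrates this two-signature composition of a vote. The field names are hypothetical and chosen for illustration; the actual message definitions live in `/consensus/hotstuff/model` and `/model/flow/`.

```go
// Hypothetical composition of a vote and its signature data (illustrative only).
type SigData struct {
	// BLS signature (curve BLS12-381); must be present in every vote.
	StakingSignature []byte
	// Threshold-signature share (also BLS12-381); empty when the signer only
	// contributes a staking signature.
	RandomBeaconSignature []byte
}

type Vote struct {
	View     uint64  // view of the block this vote is for
	BlockID  []byte  // identifier (hash) of the voted-for block
	SignerID []byte  // identifier of the voting node
	SigData  SigData // signature(s) proving the signer's support
}
```
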
When the primary collects the votes, it verifies the content of `SigData`, which can contain either only a `StakingSignature`, or a pair `StakingSignature` + `RandomBeaconSignature`.
A `StakingSignature` must be present in all votes. (There is an optimization already implemented in the code, making the `StakingSignature` optional, but it is not enabled.)
If either signature is invalid, the entire vote is discarded. From all valid votes, the
`StakingSignatures` and the `RandomBeaconSignatures` are aggregated separately.

For purely consensus-theoretical purposes, it would be sufficient to use a threshold signature scheme. However, threshold signatures have the following two
important limitations, which is why Flow uses aggregated signatures in addition:
* The threshold signature carries no information about who signed. With the threshold signature alone, we have no way to distinguish
  the nodes that are contributing from the ones that are offline. The mature Flow protocol will reward nodes based on their contributions to QCs,
  which requires a conventional aggregated signature.
* Furthermore, the distributed key generation [DKG] for threshold keys currently limits the number of nodes. By including a signature aggregate,
  we can scale the consensus committee somewhat beyond the limitations of the [DKG]. Then, the nodes contributing to the random beacon would only
  be a subset of the entire consensus committee.

### Communication topology

* Following [version 6 of the HotStuff paper](https://arxiv.org/abs/1803.05069v6),
  replicas forward their votes for block `b` to the leader of the _next_ view, i.e. the primary for view `b.View + 1`.
* A proposer will attach its own vote for its proposal to the block proposal message
  (instead of signing the block proposal for authenticity and separately sending a vote).


### Primary selection
For primary selection, we use a randomized, weight-proportional selection, as sketched below.
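
A minimal sketch of such a weight-proportional draw is shown below, assuming a committee of weighted nodes. It is purely illustrative: in the protocol the selection must be deterministic and agreed upon by all replicas for every view (the actual algorithm lives in `/consensus/hotstuff/committees`), whereas this toy version just samples from a local pseudo-random source.

```go
package main

import (
	"fmt"
	"math/rand"
)

type node struct {
	id     string
	weight uint64
}

// pickPrimary selects a node with probability proportional to its weight.
// Assumes a non-empty committee whose total weight is positive.
func pickPrimary(committee []node, rng *rand.Rand) node {
	var total uint64
	for _, n := range committee {
		total += n.weight
	}
	draw := rng.Uint64() % total // illustrative; a real implementation avoids modulo bias
	for _, n := range committee {
		if draw < n.weight {
			return n
		}
		draw -= n.weight
	}
	return committee[len(committee)-1] // unreachable if weights are consistent
}

func main() {
	committee := []node{{"A", 100}, {"B", 300}, {"C", 600}}
	rng := rand.New(rand.NewSource(42)) // in practice, the seed would be fixed by the protocol
	fmt.Println("primary:", pickPrimary(committee, rng).id)
}
```
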
## Implementation Components
HotStuff's core logic is broken down into multiple components.
The figure below illustrates the dependencies of the core components and the information flow between them.

![](/docs/ComponentInteraction.png)
<!--- source: https://drive.google.com/file/d/1rZsYta0F9Uz5_HM84MlMmMbiR62YALX-/ -->

* `MessageHub` is responsible for relaying HotStuff messages. Incoming messages are relayed to the respective modules depending on their message type.
  Outgoing messages are relayed to the committee through the networking layer via epidemic gossip ('broadcast') or one-to-one communication ('unicast').
* `compliance.Engine` is responsible for processing incoming blocks: caching them if needed, validating them, extending the state, and forwarding them to HotStuff for further processing.
  Note: the embedded `compliance.Core` component is responsible for business logic and maintaining state; `compliance.Engine` schedules work and manages worker threads for the `Core`.
* `EventLoop` buffers all incoming events. It manages a single worker routine executing the `EventHandler`'s logic.
* `EventHandler` orchestrates all HotStuff components and implements [HotStuff's state machine](/docs/StateMachine.png).
  The event handler is designed to be executed single-threaded.
* `SafetyRules` tracks the latest vote and the latest timeout, and determines whether to vote for a block and whether it is safe to time out the current round.
* `Pacemaker` implements Jolteon's PaceMaker. It manages and updates a replica's local view and synchronizes it with other replicas.
  The `Pacemaker` ensures liveness by keeping a supermajority of the committee in the same view.
* `Forks` maintains an in-memory representation of all blocks `b`, whose view is larger than or equal to the view of the latest finalized block (known to this specific replica).
  As blocks with missing ancestors are cached outside of HotStuff (by the Chain Compliance Layer),
  all blocks stored in `Forks` are guaranteed to be connected to the genesis block
  (or the trusted checkpoint from which the replica started). `Forks` tracks the finalized blocks and triggers finalization events whenever it
  observes a valid extension to the chain of finalized blocks (a sketch of the finalization rule follows this list). `Forks` is implemented using `LevelledForest`:
  - Conceptually, a blockchain constructs a tree of blocks. When removing all blocks
    with views strictly smaller than the last finalized block, this graph decomposes into multiple disconnected trees (referred to as a forest in graph theory).
    `LevelledForest` is an in-memory data structure to store and maintain a levelled forest. It provides functions to add vertices, query vertices by their ID
    (block's hash), query vertices by level (block's view), query the children of a vertex, and prune vertices by level (remove them from memory).
    To separate general graph-theoretical concepts from the concrete blockchain application, `LevelledForest` refers to blocks as graph `vertices`
    and to a block's view number as `level`.
* `Validator` validates the HotStuff-relevant aspects of
  - QC: total weight of all signers is more than 2/3 of the committee weight, validity of signatures, view number is strictly monotonically increasing;
  - TC: total weight of all signers is more than 2/3 of the committee weight, validity of signatures, proof for entering the view;
  - block proposal: from the designated primary for the block's respective view, contains the proposer's vote for its own block, the QC in the block is valid,
    a valid TC for the previous view is included if and only if the QC is not for the previous view;
  - vote: validity of signature, voter has positive weight.
* `VoteAggregator` caches votes on a per-block basis and builds a QC if enough votes have been accumulated.
* `TimeoutAggregator` caches timeouts on a per-view basis and builds a TC if enough timeouts have been accumulated. It also performs validation and verification of timeouts.
* `Replicas` maintains the list of all authorized network members and their respective weights, queryable by view.
  It maintains a static list, which changes only between epochs. Furthermore, `Replicas` knows the primary for each view.
* `DynamicCommittee` maintains the list of all authorized network members and their respective weights on a per-block basis.
  It extends `Replicas`, allowing for committee changes mid-epoch, e.g. due to slashing or node ejection.
* `BlockProducer` constructs the payload of a block, after the HotStuff core logic has decided which fork to extend.
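
The `Forks` component above applies the 2-chain finalization rule (also described in the folder overview further below): block `B` is finalized once a certified child produced in view `B.View + 1` is known. Below is a minimal sketch of that check, with simplified types that do not match the ones used by `Forks`.

```go
// Simplified 2-chain finalization check (illustrative types only).
type Block struct {
	View       uint64
	ParentView uint64 // view of the parent block, as certified by this block's QC
}

// finalizesParent reports whether observing the certified block `child`
// (i.e. a block for which a QC exists) finalizes its parent: the child must
// have been produced in the view directly following its parent's view.
func finalizesParent(child Block) bool {
	return child.View == child.ParentView+1
}
```
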
# Implementation

We have translated the HotStuff protocol into the state machine shown below. The state machine is implemented in `EventHandler`.

![](/docs/StateMachine.png)
<!--- source: https://drive.google.com/file/d/1la4jxyaEJJfip7NCWBM9YBTz6PK4-N9e/ -->


#### PaceMaker
The HotStuff state machine interacts with the PaceMaker, which triggers view changes. The PaceMaker keeps track of
liveness data (newest QC, current view, TC for the last view), and updates it when supplied with new data from the `EventHandler`.
Conceptually, the PaceMaker interfaces with the `EventHandler` in two different modes:
* [asynchronous] On timeouts, the PaceMaker will emit a timeout event, which is processed like any other event (such as incoming blocks or votes) through the `EventLoop`.
* [synchronous] When progress is made following the core business logic, the `EventHandler` will inform the PaceMaker about discovering new QCs or TCs
  via a direct method call (see `PaceMaker interface`). If the PaceMaker changed the view in response, it returns
  a `NewViewEvent`, which will be synchronously processed by the `EventHandler`.

Flow's PaceMaker utilizes dedicated messages for synchronizing the consensus participants' local views.
It broadcasts a `TimeoutObject` whenever no progress is made during the current round. After collecting timeouts from a supermajority of participants,
the replica constructs a TC which can be used to enter the next round `V = TC.View + 1`. For calculating round timeouts we use a truncated exponential backoff:
round durations increase exponentially if no progress is made and decrease exponentially on the happy path.
During normal operation with some benign crash failures, a small number `k` of subsequent leader failures is expected.
Therefore, our PaceMaker tolerates a few failures (`k=6`) before starting to increase timeouts, which is valuable for quickly
skipping over offline replicas. However, the probability of `k` subsequent leader failures decreases exponentially with `k` (due to Flow's randomized leader selection).
Therefore, beyond `k=6`, we start increasing timeouts.
The timeout values are limited by lower and upper bounds to ensure that the PaceMaker can change from large to small timeouts in a reasonable number of views.
The specific values for the lower and upper timeout bounds are protocol-specified; we envision the bounds to be on the order of 1 second (lower bound) and one minute (upper bound).

**Progress**, from the perspective of the PaceMaker, is defined as entering view `V`
for which the replica knows a QC or a TC with `V = QC.view + 1` or `V = TC.view + 1`, respectively.
In other words, we transition into the next view when observing a quorum from the last view.
In contrast to the original HotStuff, Jolteon only allows a transition into view `V+1` after observing a valid quorum for view `V`. There is no other, passive method for honest nodes to change views.

A central, non-trivial functionality of the PaceMaker is to _skip views_.
Specifically, given a QC or TC with view `V`, the PaceMaker will skip ahead to view `V + 1` if `currentView ≤ V`.
A sketch of both the skip-ahead rule and the timeout backoff follows the figure below.

![](/docs/PaceMaker.png)
<!--- source: https://drive.google.com/file/d/1la4jxyaEJJfip7NCWBM9YBTz6PK4-N9e/ -->
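
Below is a minimal, illustrative sketch of the two PaceMaker rules described in this section: the skip-ahead rule for view advancement and the truncated exponential backoff for round timeouts. It is not the code from `/consensus/hotstuff/pacemaker`; the constants are placeholders and the happy-path timeout decrease is omitted.

```go
package pacemaker

import (
	"math"
	"time"
)

// advanceView implements the skip-ahead rule: given a QC or TC for view v,
// a replica whose current view is at most v jumps directly to view v+1.
func advanceView(currentView, certifiedView uint64) uint64 {
	if certifiedView >= currentView {
		return certifiedView + 1
	}
	return currentView // stale certificate, no view change
}

const (
	minTimeout      = 1 * time.Second  // lower bound (placeholder value)
	maxTimeout      = 60 * time.Second // upper bound (placeholder value)
	happyPathRounds = 6                // number of tolerated successive failures (k)
	backoffBase     = 1.5              // growth factor per failure beyond k (placeholder)
)

// roundTimeout returns the timeout for the current round, given the number of
// successive rounds without progress: it stays at the minimum for the first k
// failures, then grows exponentially, truncated at the upper bound.
func roundTimeout(successiveFailures uint64) time.Duration {
	if successiveFailures <= happyPathRounds {
		return minTimeout
	}
	exp := float64(successiveFailures - happyPathRounds)
	timeout := time.Duration(float64(minTimeout) * math.Pow(backoffBase, exp))
	if timeout > maxTimeout {
		return maxTimeout
	}
	return timeout
}
```
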
## Code structure

All relevant code implementing the core HotStuff business logic is contained in `/consensus/hotstuff/` (the folder containing this README).
When starting to look into the code, we suggest starting with `/consensus/hotstuff/event_loop.go` and `/consensus/hotstuff/event_handler.go`.


### Folder structure

All files in the `/consensus/hotstuff/` folder, except for `follower_loop.go`, are interfaces for HotStuff-related components.
The concrete implementations for all HotStuff-relevant components are in the corresponding sub-folders.
For completeness, we list the components implemented in each sub-folder below:

* `/consensus/hotstuff/blockproducer` builds a block proposal for a specified QC, interfaces with the logic for assembling a block payload, and combines all relevant fields into a new block proposal.
* `/consensus/hotstuff/committees` maintains the list of all authorized network members and their respective weights on a per-block and per-view basis, depending on the implementation; contains the primary selection algorithm.
* `/consensus/hotstuff/eventloop` buffers all incoming events, so the `EventHandler` can process one event at a time in a single thread.
* `/consensus/hotstuff/eventhandler` orchestrates all HotStuff components and implements the HotStuff state machine. The event handler is designed to be executed single-threaded.
* `/consensus/hotstuff/follower` is only used by nodes that are _not_ participating in the HotStuff committee. As Flow has dedicated node roles with specialized network functions, only a subset of nodes run the full HotStuff protocol. Nevertheless, all nodes need to be able to act on blocks being finalized. The approach we have taken for Flow is that block proposals are broadcast to all nodes (including non-committee nodes). Non-committee nodes locally determine block finality by applying HotStuff's finality rules. The HotStuff Follower contains the functionality to consume block proposals and trigger downstream processing of finalized blocks. The Follower does not _actively_ participate in HotStuff.
* `/consensus/hotstuff/forks` maintains an in-memory representation of blocks, whose view is larger than or equal to the view of the latest finalized block (known to this specific HotStuff replica). Per convention, all blocks stored in `forks` have passed validation and their ancestry is fully known. `forks` tracks the last finalized block and implements the 2-chain finalization rule. Specifically, we finalize block `B` if a _certified_ child `B'` is known that was produced in the view `B.View + 1`.
* `/consensus/hotstuff/helper` contains broadly-used helper functions for testing.
* `/consensus/hotstuff/integration` contains integration tests verifying the correct interaction of multiple HotStuff replicas.
* `/consensus/hotstuff/model` contains the HotStuff data models, including block proposal, vote, timeout, etc.
  Many HotStuff data models are built on top of basic data models defined in `/model/flow/`.
* `/consensus/hotstuff/notifications`: all relevant events within the HotStuff logic are exported through a notification system. Notifications are used by _some_ HotStuff components internally to drive core logic (e.g. events from `VoteAggregator` and `TimeoutAggregator` can trigger progress in the `EventHandler`). Furthermore, notifications inform other components within the same node of relevant progress and are used for collecting HotStuff metrics. Per convention, notifications are idempotent.
* `/consensus/hotstuff/pacemaker` contains the implementation of Flow's Active PaceMaker, as described above. It is responsible for protocol liveness.
* `/consensus/hotstuff/persister` stores the latest safety and liveness data _synchronously_ on disk.
  The `persister` only covers the minimal amount of data that is absolutely necessary to avoid equivocation after a crash.
  This data must be stored on disk immediately whenever it is updated, before the node can progress with its consensus logic.
  In comparison, the majority of the consensus state is held in memory for performance reasons and updated in an eventually consistent manner.
  After a crash, some of this data might be lost (but can be re-requested) without risk of protocol violations.
* `/consensus/hotstuff/safetyrules` tracks the latest vote and the latest timeout.
  It determines whether to vote for a block and whether it is safe to construct a timeout for the current round.
* `/consensus/hotstuff/signature` contains the implementation of threshold signature aggregation for all types of signatures that are used in the HotStuff protocol.
* `/consensus/hotstuff/timeoutcollector` encapsulates the logic for validating timeouts for one particular view and aggregating them into a TC.
* `/consensus/hotstuff/timeoutaggregator` orchestrates the `TimeoutCollector`s for different views. It distributes timeouts to the respective `TimeoutCollector` and prunes collectors that are no longer needed.
* `/consensus/hotstuff/tracker` implements utility code for tracking the newest QC and TC in a multithreaded environment.
* `/consensus/hotstuff/validator` holds the logic for validating the HotStuff-relevant aspects of blocks, QCs, TCs, and votes.
* `/consensus/hotstuff/verification` contains the integration of Flow's cryptographic primitives (signing and signature verification).
* `/consensus/hotstuff/votecollector` encapsulates the logic for caching, validating, and aggregating votes for one particular view.
  It tracks whether a valid proposal for the view is known and, once enough votes have been collected, builds a QC.
* `/consensus/hotstuff/voteaggregator` orchestrates the `VoteCollector`s for different views. It distributes votes to the respective `VoteCollector`, notifies the `VoteCollector` about the arrival of its respective block, and prunes collectors that are no longer needed.


## Telemetry

The HotStuff state machine exposes some details about its internal progress as notifications through the `hotstuff.Consumer` interface.
The following figure depicts at which points notifications are emitted.

![](/docs/StateMachine_with_notifications.png)
<!--- source: https://drive.google.com/file/d/1la4jxyaEJJfip7NCWBM9YBTz6PK4-N9e/ -->

We have implemented a telemetry system (`hotstuff.notifications.TelemetryConsumer`) which implements the `Consumer` interface.
The `TelemetryConsumer` tracks all events that were emitted during one path through the state machine as belonging together, as well as events from components that perform asynchronous processing (`VoteAggregator`, `TimeoutAggregator`).
Each path through the state machine is identified by a unique id.
Generally, the `TelemetryConsumer` could export the collected data to a variety of backends.
For now, we export the data to a logger.
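
As a toy illustration of the path-id idea (not the real `Consumer`/`TelemetryConsumer` API; the type and method names below are hypothetical), a consumer could stamp every notification emitted along one path through the state machine with the same identifier before handing it to a logger:

```go
package telemetry

import "log"

// pathTracker is a toy stand-in for the telemetry idea: all notifications
// emitted while traversing a single path through the state machine share one id.
type pathTracker struct {
	pathID uint64
}

// startPath begins a new path (e.g. when the state machine starts processing a
// new event) and returns its id.
func (t *pathTracker) startPath() uint64 {
	t.pathID++
	return t.pathID
}

// onNotification exports a single event; a real backend would attach structured
// fields such as view, block id, and timing information.
func (t *pathTracker) onNotification(event string) {
	log.Printf("path=%d event=%s", t.pathID, event)
}
```
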