# ADR 040: Blockchain Reactor Refactor

## Changelog

19-03-2019: Initial draft

## Context

The Blockchain Reactor's high-level responsibility is to enable peers that are far behind the current state of the
blockchain to quickly catch up by downloading many blocks in parallel from their peers, verifying block correctness, and
executing them against the ABCI application. We call the protocol executed by the Blockchain Reactor `fast-sync`.
The current architecture diagram of the blockchain reactor can be found here:

![Blockchain Reactor Architecture Diagram](img/bc-reactor.png)

The current architecture consists of dozens of routines and depends tightly on the `Switch`, which makes writing
unit tests almost impossible. Current tests require setting up complex dependency graphs and dealing with concurrency.
Note that dozens of routines are overkill in this case, as most of the time the routines sit idle waiting for
something to happen (a message to arrive or a timeout to expire). Due to the dependency on the `Switch`, testing relatively
complex network scenarios and failures (for example, adding and removing peers) is a very complex task and frequently leads
to complex tests with non-deterministic behavior ([#3400]). The inability to write proper tests lowers confidence in
the code, and this has resulted in several issues (some have been fixed in the meantime and some are still open):
[#3400], [#2897], [#2896], [#2699], [#2888], [#2457], [#2622], [#2026].

## Decision

To remedy these issues we plan a major refactor of the blockchain reactor.
The proposed architecture is largely inspired
by ADR-30 and is presented in the following diagram:
![Blockchain Reactor Refactor Diagram](img/bc-reactor-refactor.png)

We suggest a concurrency architecture where the core algorithm (we call it `Controller`) is extracted into a finite
state machine. The active routine of the reactor is called `Executor` and is responsible for receiving and sending
messages from/to peers and triggering timeouts. Which messages are sent and which timeouts are triggered is determined mostly
by the `Controller`. The exception is the `Peer Heartbeat` mechanism, which is the `Executor`'s responsibility. The heartbeat
mechanism is used to remove slow and unresponsive peers from the peer list. Writing unit tests is simpler with
this architecture, as most of the critical logic is part of the `Controller` function. We expect that the simpler concurrency
architecture will not have a significant negative effect on the performance of this reactor (to be confirmed by
experimental evaluation).


### Implementation changes

We assume the following system model for the "fast sync" protocol:

* A node is connected to a random subset of all nodes that represents its peer set. Some nodes are correct and some
  might be faulty. We don't make assumptions about the ratio of faulty nodes, i.e., it is possible that all nodes in some
  peer set are faulty.
* We assume that communication between correct nodes is synchronous, i.e., if a correct node `p` sends a message `m` to
  a correct node `q` at time `t`, then `q` will receive the message at the latest at time `t+Delta`, where `Delta` is a system
  parameter that is known by network participants. `Delta` is normally chosen to be an order of magnitude higher than
  the real (maximum) communication delay between correct nodes.
  Therefore, if a correct node `p` sends a request message
  to a correct node `q` at time `t` and there is no corresponding reply by time `t + 2*Delta`, then `p` can assume
  that `q` is faulty. Note that the network assumptions for the consensus reactor are different (we assume a partially
  synchronous model there).

The requirements for the "fast sync" protocol are formally specified as follows:

- `Correctness`: If a correct node `p` is connected to a correct node `q` for a long enough period of time, then `p`
  will eventually download all requested blocks from `q`.
- `Termination`: If the set of peers of a correct node `p` is stable (no new nodes are added to the peer set of `p`) for
  a long enough period of time, then the protocol eventually terminates.
- `Fairness`: A correct node `p` sends requests for blocks to all peers from its peer set.

As explained above, the `Executor` is responsible for sending and receiving messages that are part of the `fast-sync`
protocol. The following messages are exchanged as part of the `fast-sync` protocol:

``` go
type Message int
const (
  MessageUnknown Message = iota
  MessageStatusRequest
  MessageStatusResponse
  MessageBlockRequest
  MessageBlockResponse
)
```
`MessageStatusRequest` is sent periodically to all peers as a request for a peer to provide its current height. It is
part of the `Peer Heartbeat` mechanism, and a failure to respond to this message in time results in the peer being removed
from the peer set. Note that the `Peer Heartbeat` mechanism is used only while a peer is in `fast-sync` mode. We assume
here the existence of a mechanism that lets a node inform its peers that it is in `fast-sync` mode.

``` go
type MessageStatusRequest struct {
  SeqNum int64     // sequence number of the request
}
```
`MessageStatusResponse` is sent as a response to `MessageStatusRequest` to inform the requester about the peer's current
height.

``` go
type MessageStatusResponse struct {
  SeqNum int64     // sequence number of the corresponding request
  Height int64     // current peer height
}
```

`MessageBlockRequest` is used to request a block and the corresponding commit certificate at a given height.

``` go
type MessageBlockRequest struct {
  Height int64
}
```

`MessageBlockResponse` is a response to the corresponding block request. In addition to the block and the
corresponding commit certificate, it also contains the peer's current height.

``` go
type MessageBlockResponse struct {
  Height         int64
  Block          Block
  Commit         Commit
  PeerHeight     int64
}
```

In addition to sending and receiving messages and running the `Peer Heartbeat` mechanism, the `Executor` also manages timeouts
that are triggered upon `Controller` request. The `Controller` is then informed once a timeout expires.

``` go
type TimeoutTrigger int
const (
  TimeoutUnknown TimeoutTrigger = iota
  TimeoutResponseTrigger
  TimeoutTerminationTrigger
)
```

The `Controller` can be modelled as a function with clearly defined inputs:

* `State` - current state of the node. Contains data about connected peers and their behavior, pending requests,
  received blocks, etc.
* `Event` - significant events in the network.

producing clear outputs:

* `State` - updated state of the node,
* `MessageToSend` - signal what message to send and to which peer,
* `TimeoutTrigger` - signal that a timeout should be triggered.
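
To make the function shape described above concrete, here is a minimal, self-contained Go sketch of a controller step that takes a state and an event and returns a new state, an optional message, and an optional timeout. All names in it (`Handle`, `State`, `StatusReport`, `BlockRequest`, `Timeout`) are hypothetical and far simpler than the types defined in this ADR. It handles a single event kind and is only meant to show why such a function is easy to test without a `Switch`, goroutines, or timers.

```go
package main

import "fmt"

// PeerID is a hypothetical stand-in for the peer identifier type.
type PeerID string

// State is a trimmed-down, hypothetical stand-in for the controller state.
type State struct {
	Height      int64            // next height that has not been requested yet
	PeerHeights map[PeerID]int64 // highest height reported by each peer
}

// StatusReport is a hypothetical event: a peer reported its height.
type StatusReport struct {
	Peer   PeerID
	Height int64
}

// BlockRequest is a hypothetical output: ask Peer for the block at Height.
type BlockRequest struct {
	Peer   PeerID
	Height int64
}

// Timeout is a hypothetical output: start a response timer for this request.
type Timeout struct {
	Peer   PeerID
	Height int64
}

// Handle is one controller step. It performs no I/O and starts no timers;
// it only computes the next state and the actions the caller should take.
func Handle(s State, ev StatusReport) (State, *BlockRequest, *Timeout) {
	// Copy the map so the caller's state is left untouched.
	peers := make(map[PeerID]int64, len(s.PeerHeights)+1)
	for id, h := range s.PeerHeights {
		peers[id] = h
	}
	if ev.Height > peers[ev.Peer] {
		peers[ev.Peer] = ev.Height
	}
	next := State{Height: s.Height, PeerHeights: peers}

	// The peer cannot serve the next block: no message, no timeout.
	if peers[ev.Peer] < next.Height {
		return next, nil, nil
	}

	// Otherwise request the next block and ask for a response timeout.
	req := &BlockRequest{Peer: ev.Peer, Height: next.Height}
	timeout := &Timeout{Peer: ev.Peer, Height: next.Height}
	next.Height++
	return next, req, timeout
}

func main() {
	s := State{Height: 5, PeerHeights: map[PeerID]int64{}}
	s, msg, timeout := Handle(s, StatusReport{Peer: "p1", Height: 10})
	fmt.Println(s.Height, *msg, *timeout)
}
```

Because `Handle` is deterministic, a unit test can feed it a sequence of events and compare the returned values directly, with no network setup.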


We consider the following `Event` types:

``` go
type Event int
const (
  EventUnknown Event = iota
  EventStatusResponse
  EventBlockRequest
  EventBlockResponse
  EventRemovePeer
  EventTimeoutResponse
  EventTimeoutTermination
)
```

`EventStatusResponse` is generated once `MessageStatusResponse` is received by the `Executor`.

``` go
type EventStatusResponse struct {
  PeerID ID
  Height int64
}
```

`EventBlockRequest` is generated once `MessageBlockRequest` is received by the `Executor`.

``` go
type EventBlockRequest struct {
  Height int64
  PeerID ID
}
```
`EventBlockResponse` is generated once `MessageBlockResponse` is received by the `Executor`.

``` go
type EventBlockResponse struct {
  Height     int64
  Block      Block
  Commit     Commit
  PeerID     ID
  PeerHeight int64
}
```
`EventRemovePeer` is generated by the `Executor` to signal that the connection to a peer was closed due to peer misbehavior.

``` go
type EventRemovePeer struct {
  PeerID ID
}
```
`EventTimeoutResponse` is generated by the `Executor` to signal that a timeout triggered by `TimeoutResponseTrigger` has
expired.

``` go
type EventTimeoutResponse struct {
  PeerID ID
  Height int64
}
```
`EventTimeoutTermination` is generated by the `Executor` to signal that a timeout triggered by `TimeoutTerminationTrigger`
has expired.

``` go
type EventTimeoutTermination struct {
  Height int64
}
```

`MessageToSend` is just a wrapper around the `Message` type that contains the ID of the peer to which the message should be sent.

``` go
type MessageToSend struct {
  PeerID  ID
  Message Message
}
```

The `Controller` state machine can be in two modes: `ModeFastSync`, when
a node is trying to catch up with the network by downloading committed blocks,
and `ModeConsensus`, in which it executes the Tendermint consensus protocol. We
consider that `fast sync` mode terminates once the `Controller` switches to
`ModeConsensus`.

``` go
type Mode int
const (
  ModeUnknown Mode = iota
  ModeFastSync
  ModeConsensus
)
```
The `Controller` manages the following state:

``` go
type ControllerState struct {
  Height             int64            // the first block that is not committed
  Mode               Mode             // mode of operation
  PeerMap            map[ID]PeerStats // map of peer IDs to peer statistics
  MaxRequestPending  int64            // maximum height of the pending requests
  FailedRequests     []int64          // list of failed block requests
  PendingRequestsNum int              // total number of pending requests
  Store              []BlockInfo      // contains list of downloaded blocks
  Executor           BlockExecutor    // stores, verifies and executes blocks
}
```

The `PeerStats` data structure keeps, for every peer, its current height and the height of the block request that is
pending for that peer, if any.

``` go
type PeerStats struct {
  Height         int64
  PendingRequest int64 // height of the request sent to this peer, -1 if there is none
}
```

The `BlockInfo` data structure stores information (as part of the block store) about downloaded blocks: the peer from which
a block and the corresponding commit certificate were received.
``` go
type BlockInfo struct {
  Block  Block
  Commit Commit
  PeerID ID // the peer from which we received the corresponding Block and Commit
}
```

The `Controller` is initialized by providing an initial height (`startHeight`) from which it will start downloading
blocks from peers and the current state of the `BlockExecutor`.

``` go
func NewControllerState(startHeight int64, executor BlockExecutor) ControllerState {
  state := ControllerState{}
  state.Height = startHeight
  state.Mode = ModeFastSync
  state.MaxRequestPending = startHeight - 1
  state.PendingRequestsNum = 0
  state.Executor = executor
  // initialize state.PeerMap, state.FailedRequests and state.Store to empty data structures
  return state
}
```

The core protocol logic is given by the following function:

``` go
func handleEvent(state ControllerState, event Event) (ControllerState, MessageToSend, TimeoutTrigger, Error) {
  msg = nil
  timeout = nil
  error = nil

  switch state.Mode {
  case ModeConsensus:
    switch event := event.(type) {
    case EventBlockRequest:
      msg = createBlockResponseMessage(state, event)
      return state, msg, timeout, error
    default:
      error = "Only respond to BlockRequests while in ModeConsensus!"
      return state, msg, timeout, error
    }

  case ModeFastSync:
    switch event := event.(type) {
    case EventBlockRequest:
      msg = createBlockResponseMessage(state, event)
      return state, msg, timeout, error

    case EventStatusResponse:
      return handleEventStatusResponse(event, state)

    case EventRemovePeer:
      return handleEventRemovePeer(event, state)

    case EventBlockResponse:
      return handleEventBlockResponse(event, state)

    case EventTimeoutResponse:
      return handleEventResponseTimeout(event, state)

    case EventTimeoutTermination:
      // Termination timeout is triggered in case of empty peer set and in case there are no pending requests.
      // If this timeout expires and in the meantime no new peers are added or new pending requests are made
      // then `fast-sync` mode terminates by switching to `ModeConsensus`.
      // Note that termination timeout should be higher than the response timeout.
      if state.Height == event.Height && state.PendingRequestsNum == 0 { state.Mode = ModeConsensus }
      return state, msg, timeout, error

    default:
      error = "Received unknown event type!"
      return state, msg, timeout, error
    }
  }
}
```

``` go
func createBlockResponseMessage(state ControllerState, event EventBlockRequest) MessageToSend {
  msgToSend = nil
  // a peer that is not yet known is tracked with unknown height and no pending request
  peerStats, ok := state.PeerMap[event.PeerID]
  if !ok { peerStats = PeerStats{ -1, -1 } }
  if state.Executor.ContainsBlockWithHeight(event.Height) && event.Height > peerStats.Height {
    peerStats.Height = event.Height
    msg = MessageBlockResponse{
      Height: event.Height,
      Block: state.Executor.getBlock(event.Height),
      Commit: state.Executor.getCommit(event.Height),
      PeerHeight: state.Height - 1,
    }
    msgToSend = MessageToSend { event.PeerID, msg }
  }
  state.PeerMap[event.PeerID] = peerStats
  return msgToSend
}
```

``` go
func handleEventStatusResponse(event EventStatusResponse, state ControllerState) (ControllerState, MessageToSend, TimeoutTrigger, Error) {
  msg = nil
  timeout = nil
  error = nil

  peerStats, ok := state.PeerMap[event.PeerID]
  if !ok { peerStats = PeerStats{ -1, -1 } }

  if event.Height > peerStats.Height { peerStats.Height = event.Height }
  // if there are no pending requests for this peer, try to send it a request for a block
  if peerStats.PendingRequest == -1 {
    state, msg = createBlockRequestMessage(state, event.PeerID, peerStats.Height)
    // msg is nil if no request for a block can be made to the peer at this point in time
    if msg != nil {
      peerStats.PendingRequest = msg.Height
      state.PendingRequestsNum++
      // when a request for a block is sent to a peer, a response timeout is triggered. If no corresponding block is sent by the peer
      // during the response timeout period, then the peer is considered faulty and is removed from the peer set.
      timeout = TimeoutResponseTrigger{ msg.PeerID, msg.Height, PeerTimeout }
    } else if state.PendingRequestsNum == 0 {
      // if there are no pending requests and no new request can be placed to the peer, termination timeout is triggered.
      // If termination timeout expires and we are still at the same height and there are no pending requests, the "fast-sync"
      // mode is finished and we switch to `ModeConsensus`.
      timeout = TimeoutTerminationTrigger{ state.Height, TerminationTimeout }
    }
  }
  state.PeerMap[event.PeerID] = peerStats
  return state, msg, timeout, error
}
```

``` go
func handleEventRemovePeer(event EventRemovePeer, state ControllerState) (ControllerState, MessageToSend, TimeoutTrigger, Error) {
  msg = nil
  timeout = nil
  error = nil

  if peerStats, ok := state.PeerMap[event.PeerID]; ok {
    // if a peer is removed from the peer set, its pending request is declared failed and added to the `FailedRequests` list
    // so it can be retried.
    if peerStats.PendingRequest != -1 {
      add(state.FailedRequests, peerStats.PendingRequest)
      state.PendingRequestsNum--
    }
    delete(state.PeerMap, event.PeerID)
    // if the peer set is empty after removal of this peer then termination timeout is triggered.
    if state.PeerMap.isEmpty() {
      timeout = TimeoutTerminationTrigger{ state.Height, TerminationTimeout }
    }
  } else { error = "Removing unknown peer!" }
  return state, msg, timeout, error
}
```

``` go
func handleEventBlockResponse(event EventBlockResponse, state ControllerState) (ControllerState, MessageToSend, TimeoutTrigger, Error) {
  msg = nil
  timeout = nil
  error = nil

  peerStats, ok := state.PeerMap[event.PeerID]
  if !ok {
    error = "Received Block from unknown peer!"
    return state, msg, timeout, error
  }
  if peerStats.PendingRequest != event.Height {
    error = "Received Block from wrong peer!"
    return state, msg, timeout, error
  }

  // when the expected block arrives from a peer, it is added to the store so it can be verified and, if correct, executed afterwards.
  peerStats.PendingRequest = -1
  state.PendingRequestsNum--
  if event.PeerHeight > peerStats.Height { peerStats.Height = event.PeerHeight }
  state.PeerMap[event.PeerID] = peerStats
  state.Store[event.Height] = BlockInfo{ event.Block, event.Commit, event.PeerID }
  // blocks are verified sequentially so adding a block to the store does not mean that it will be immediately verified
  // as some of the previous blocks might be missing.
  state = verifyBlocks(state) // it can lead to event.PeerID being removed from the peer map
  if _, ok := state.PeerMap[event.PeerID]; ok {
    // we try to identify a new request for a block that can be sent to the peer
    state, msg = createBlockRequestMessage(state, event.PeerID, peerStats.Height)
    if msg != nil {
      peerStats.PendingRequest = msg.Height
      state.PendingRequestsNum++
      state.PeerMap[event.PeerID] = peerStats
      // if a request for a block is made, response timeout is triggered
      timeout = TimeoutResponseTrigger{ msg.PeerID, msg.Height, PeerTimeout }
    }
  }
  if msg == nil && (state.PeerMap.isEmpty() || state.PendingRequestsNum == 0) {
    // if the peer map is empty (the peer can be removed as block verification failed) or there are no pending requests
    // termination timeout is triggered.
    timeout = TimeoutTerminationTrigger{ state.Height, TerminationTimeout }
  }
  return state, msg, timeout, error
}
```

``` go
func handleEventResponseTimeout(event EventTimeoutResponse, state ControllerState) (ControllerState, MessageToSend, TimeoutTrigger, Error) {
  msg = nil
  timeout = nil
  error = nil

  if peerStats, ok := state.PeerMap[event.PeerID]; ok {
    // if a response timeout expires and the peer hasn't delivered the block, the peer is removed from the peer list and
    // the request is added to `FailedRequests` so the block can be downloaded from another peer
    if peerStats.PendingRequest == event.Height {
      add(state.FailedRequests, peerStats.PendingRequest)
      delete(state.PeerMap, event.PeerID)
      state.PendingRequestsNum--
      // if the peer set is empty, then termination timeout is triggered
      if state.PeerMap.isEmpty() {
        timeout = TimeoutTerminationTrigger{ state.Height, TerminationTimeout }
      }
    }
  }
  return state, msg, timeout, error
}
```

``` go
// the updated state is returned together with the message, as `FailedRequests` and `MaxRequestPending` can change
func createBlockRequestMessage(state ControllerState, peerID ID, peerHeight int64) (ControllerState, MessageToSend) {
  msg = nil
  blockHeight = -1
  r = find request in state.FailedRequests such that r <= peerHeight // returns `nil` if there is no such request
  // if there is a height in failed requests that can be downloaded from the peer, send the request to it
  if r != nil {
    blockHeight = r
    delete(state.FailedRequests, r)
  } else if state.MaxRequestPending < peerHeight {
    // if the height of the maximum pending request is smaller than the peer height, then ask the peer for the next block
    state.MaxRequestPending++
    blockHeight = state.MaxRequestPending
  }

  if blockHeight > -1 { msg = MessageToSend { peerID, MessageBlockRequest { blockHeight } } }
  return state, msg
}
```

``` go
func verifyBlocks(state ControllerState) ControllerState {
  done := false
  for !done {
    block := state.Store[state.Height]
    if block != nil {
      verified = verify block.Block using block.Commit // `true` if verification succeeds, `false` otherwise

      if verified {
        block.Execute()   // executing a block is a costly operation, so it might make sense to execute it asynchronously
        state.Height++
      } else {
        // if block verification fails, the height is added to `FailedRequests` and the peer is removed from the peer set
        add(state.FailedRequests, state.Height)
        state.Store[state.Height] = nil
        if peerStats, ok := state.PeerMap[block.PeerID]; ok {
          // if there is a pending request sent to the peer that is about to be removed from the peer set, add it to `FailedRequests`
          if peerStats.PendingRequest != -1 {
            add(state.FailedRequests, peerStats.PendingRequest)
            state.PendingRequestsNum--
          }
          delete(state.PeerMap, block.PeerID)
        }
        done = true
      }
    } else { done = true }
  }
  return state
}
```

In the proposed architecture the `Controller` is not an active task, i.e., it is called by the `Executor`. Depending on
the values returned by the `Controller`, the `Executor` will send a message to some peer (`msg` != nil), trigger a
timeout (`timeout` != nil) or deal with errors (`error` != nil).
If a timeout is triggered, the `Executor` will provide the corresponding timeout event as an input to the `Controller` once
the timeout expires.


## Status

Draft.

## Consequences

### Positive

- isolated implementation of the algorithm
- improved testability - simpler to prove correctness
- clearer separation of concerns - easier to reason

### Negative

### Neutral