github.com/pokt-network/tendermint@v0.32.11-0.20230426215212-59310158d3e9/docs/architecture/adr-040-blockchain-reactor-refactor.md

     1  # ADR 040: Blockchain Reactor Refactor
     2  
     3  ## Changelog
     4  
     5  19-03-2019: Initial draft
     6  
     7  ## Context
     8  
The Blockchain Reactor's high-level responsibility is to enable peers that are far behind the current state of the
blockchain to catch up quickly by downloading many blocks in parallel from their peers, verifying block correctness, and
executing the blocks against the ABCI application. We call the protocol executed by the Blockchain Reactor `fast-sync`.
    12  The current architecture diagram of the blockchain reactor can be found here:
    13  
    14  ![Blockchain Reactor Architecture Diagram](img/bc-reactor.png)
    15  
The current architecture consists of dozens of routines and depends tightly on the `Switch`, which makes writing
unit tests almost impossible. Current tests require setting up complex dependency graphs and dealing with concurrency.
Having dozens of routines is overkill in this case, as most of the time the routines sit idle waiting for
something to happen (a message to arrive or a timeout to expire). Because of the dependency on the `Switch`, testing
relatively complex network scenarios and failures (for example, adding and removing peers) is a very complex task and
frequently leads to complex tests with non-deterministic behavior ([#3400]). The inability to write proper tests lowers
confidence in the code and has resulted in several issues (some have been fixed in the meantime and some are still open):
[#3400], [#2897], [#2896], [#2699], [#2888], [#2457], [#2622], [#2026].
    24  
    25  ## Decision
    26  
To remedy these issues we plan a major refactor of the blockchain reactor. The proposed architecture is largely inspired
by ADR-30 and is presented in the following diagram:

![Blockchain Reactor Refactor Diagram](img/bc-reactor-refactor.png)
    30  
We suggest a concurrency architecture where the core algorithm (we call it `Controller`) is extracted into a finite
state machine. The active routine of the reactor is called `Executor` and is responsible for receiving and sending
messages from/to peers and triggering timeouts. Which messages are sent and which timeouts are triggered is determined
mostly by the `Controller`. The exception is the `Peer Heartbeat` mechanism, which is the `Executor`'s responsibility.
The heartbeat mechanism is used to remove slow and unresponsive peers from the peer list. Writing unit tests is simpler
with this architecture, as most of the critical logic is part of the `Controller` function. We expect that the simpler
concurrency architecture will not have a significant negative effect on the performance of this reactor (to be confirmed
by experimental evaluation).
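
The split between the two components can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the proposed implementation: `event`, `output`, `controllerState`, `handle` and `runExecutor` are hypothetical names, and the controller logic is reduced to "request the next height from any peer that has it".

```go
package main

import "fmt"

// event and output are simplified stand-ins for the ADR's Event and
// MessageToSend types; they are assumptions made for this sketch.
type event struct {
	peerID string
	height int64 // height reported by the peer
}

type output struct {
	peerID  string
	request int64 // height to request from the peer
}

// controllerState holds the only data this reduced controller needs.
type controllerState struct {
	nextHeight int64
}

// handle is a pure function: no goroutines, no I/O and no timers,
// so it can be tested by calling it directly.
func handle(s controllerState, e event) (controllerState, *output) {
	if e.height < s.nextHeight {
		return s, nil // the peer has nothing we need
	}
	out := &output{peerID: e.peerID, request: s.nextHeight}
	s.nextHeight++
	return s, out
}

// runExecutor is the single active routine: it reads events, calls the
// controller and collects whatever the controller asks it to send.
func runExecutor(events <-chan event, s controllerState) []output {
	var sent []output
	for e := range events {
		var out *output
		s, out = handle(s, e)
		if out != nil {
			sent = append(sent, *out)
		}
	}
	return sent
}

func main() {
	events := make(chan event, 3)
	events <- event{"p1", 5}
	events <- event{"p2", 0}
	events <- event{"p1", 5}
	close(events)
	for _, o := range runExecutor(events, controllerState{nextHeight: 1}) {
		fmt.Printf("request block %d from %s\n", o.request, o.peerID)
	}
}
```

Because `handle` owns all decisions and `runExecutor` owns all concurrency, a test only needs to call `handle` with a state and an event and compare the returned values.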
    39  
    40  
    41  ### Implementation changes
    42  
We assume the following system model for the "fast sync" protocol:

* a node is connected to a random subset of all nodes that represents its peer set. Some nodes are correct and some
  might be faulty. We don't make assumptions about the ratio of faulty nodes, i.e., it is possible that all nodes in
  some peer set are faulty.
* we assume that communication between correct nodes is synchronous, i.e., if a correct node `p` sends a message `m` to
  a correct node `q` at time `t`, then `q` will receive the message at time `t+Delta` at the latest, where `Delta` is a
  system parameter that is known by network participants. `Delta` is normally chosen to be an order of magnitude higher
  than the real (maximum) communication delay between correct nodes. Therefore, if a correct node `p` sends a request
  message to a correct node `q` at time `t` and there is no corresponding reply at time `t + 2*Delta`, then `p` can
  assume that `q` is faulty. Note that the network assumptions for the consensus reactor are different (we assume a
  partially synchronous model there).
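
The `2*Delta` rule from the synchrony assumption can be written as a one-line check. The sketch below is illustrative only: `isSuspected` is a hypothetical helper, and the 500 ms value used for `Delta` is an arbitrary assumption, not a value proposed by this ADR.

```go
package main

import (
	"fmt"
	"time"
)

// isSuspected reports whether a peer that has not replied `elapsed` after a
// request was sent may be treated as faulty: a request and its reply each
// take at most delta between correct nodes, so 2*delta bounds the round trip.
func isSuspected(elapsed, delta time.Duration) bool {
	return elapsed >= 2*delta
}

func main() {
	delta := 500 * time.Millisecond // assumed value for illustration
	for _, elapsed := range []time.Duration{
		400 * time.Millisecond,
		time.Second,
		3 * time.Second,
	} {
		fmt.Printf("no reply after %v: suspected=%v\n", elapsed, isSuspected(elapsed, delta))
	}
}
```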
    55  
The requirements for the "fast sync" protocol are formally specified as follows:

- `Correctness`: If a correct node `p` is connected to a correct node `q` for a long enough period of time, then `p`
  will eventually download all requested blocks from `q`.
- `Termination`: If the set of peers of a correct node `p` is stable (no new nodes are added to the peer set of `p`) for
  a long enough period of time, then the protocol eventually terminates.
- `Fairness`: A correct node `p` sends requests for blocks to all peers from its peer set.
    63  
    64  As explained above, the `Executor` is responsible for sending and receiving messages that are part of the `fast-sync`
    65  protocol. The following messages are exchanged as part of `fast-sync` protocol:
    66  
    67  ``` go
    68  type Message int
    69  const (
    70    MessageUnknown Message = iota
    71    MessageStatusRequest
    72    MessageStatusResponse
    73    MessageBlockRequest
    74    MessageBlockResponse
    75  )
    76  ```
`MessageStatusRequest` is sent periodically to all peers as a request for a peer to provide its current height. It is
part of the `Peer Heartbeat` mechanism, and a failure to respond to this message in time results in the peer being
removed from the peer set. Note that the `Peer Heartbeat` mechanism is used only while a peer is in `fast-sync` mode. We
assume here the existence of a mechanism that allows a node to inform its peers that it is in `fast-sync` mode.
    81  
    82  ``` go
    83  type MessageStatusRequest struct {
    84    SeqNum int64     // sequence number of the request
    85  }
    86  ```
`MessageStatusResponse` is sent as a response to `MessageStatusRequest` to inform the requester about the peer's
current height.
    89  
    90  ``` go
    91  type MessageStatusResponse struct {
    92    SeqNum int64     // sequence number of the corresponding request
    93    Height int64     // current peer height
    94  }
    95  ```
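
One way the `SeqNum` field could be used to match status responses to requests is sketched below. The ADR does not prescribe this bookkeeping: the `heartbeat` type and its methods are hypothetical, and peer IDs are plain strings for brevity.

```go
package main

import "fmt"

// heartbeat tracks the last status request sent to each peer.
// The type and its methods are assumptions made for this sketch.
type heartbeat struct {
	nextSeq int64
	pending map[string]int64 // peer ID -> SeqNum of the outstanding request
}

func newHeartbeat() *heartbeat {
	return &heartbeat{pending: make(map[string]int64)}
}

// request records a new status request for the peer and returns its SeqNum.
func (h *heartbeat) request(peer string) int64 {
	h.nextSeq++
	h.pending[peer] = h.nextSeq
	return h.nextSeq
}

// response returns true if seq answers the outstanding request of the peer;
// stale or unsolicited responses are rejected.
func (h *heartbeat) response(peer string, seq int64) bool {
	want, ok := h.pending[peer]
	if !ok || want != seq {
		return false
	}
	delete(h.pending, peer)
	return true
}

// expired lists peers that still have an unanswered request. The Executor
// could call it when the heartbeat period ends and remove those peers.
func (h *heartbeat) expired() []string {
	var peers []string
	for p := range h.pending {
		peers = append(peers, p)
	}
	return peers
}

func main() {
	h := newHeartbeat()
	seq := h.request("p1")
	h.request("p2")
	fmt.Println("p1 answered:", h.response("p1", seq))
	fmt.Println("peers to remove:", h.expired())
}
```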
    96  
    97  `MessageBlockRequest` is used to make a request for a block and the corresponding commit certificate at a given height.
    98  
    99  ``` go
   100  type MessageBlockRequest struct {
   101    Height int64
   102  }
   103  ```
   104  
`MessageBlockResponse` is a response to the corresponding block request. In addition to providing the block and the
corresponding commit certificate, it also contains the peer's current height.
   107  
   108  ``` go
   109  type MessageBlockResponse struct {
   110    Height         int64
   111    Block          Block
   112    Commit         Commit
   113    PeerHeight     int64
   114  }
   115  ```
   116  
In addition to sending and receiving messages and running the `Peer Heartbeat` mechanism, the `Executor` also manages
timeouts that are triggered upon `Controller` request. The `Controller` is then informed once a timeout expires.
   119  
   120  ``` go
   121  type TimeoutTrigger int
   122  const (
   123    TimeoutUnknown TimeoutTrigger = iota
   124    TimeoutResponseTrigger
   125    TimeoutTerminationTrigger
   126  )
   127  ```
   128  
The `Controller` can be modelled as a function with clearly defined inputs:

* `State` - current state of the node. Contains data about connected peers and their behavior, pending requests,
  received blocks, etc.
* `Event` - significant events in the network.

producing clear outputs:

* `State` - updated state of the node,
* `MessageToSend` - signal what message to send and to which peer,
* `TimeoutTrigger` - signal that a timeout should be triggered.
   140  
   141  
   142  We consider the following `Event` types:
   143  
``` go
type Event int
const (
  EventUnknown Event = iota
  EventStatusResponse
  EventBlockRequest
  EventBlockResponse
  EventRemovePeer
  EventTimeoutResponse
  EventTimeoutTermination
)
```

The `EventStatusResponse` event is generated once `MessageStatusResponse` is received by the `Executor`.

``` go
type EventStatusResponse struct {
  PeerID ID
  Height int64
}
```
   165  
   166  `EventBlockRequest` event is generated once `MessageBlockRequest` is received by the `Executor`.
   167  
``` go
type EventBlockRequest struct {
  Height int64
  PeerID ID
}
```
   174  `EventBlockResponse` event is generated upon reception of `MessageBlockResponse` message by the `Executor`.
   175  
   176  ``` go
   177  type EventBlockResponse struct {
   178    Height             int64
   179    Block              Block
   180    Commit             Commit
   181    PeerID             ID
   182    PeerHeight         int64
   183  }
   184  ```
   185  `EventRemovePeer` is generated by `Executor` to signal that the connection to a peer is closed due to peer misbehavior.
   186  
   187  ``` go
   188  type EventRemovePeer struct {
   189    PeerID ID
   190  }
   191  ```
   192  `EventTimeoutResponse` is generated by `Executor` to signal that a timeout triggered by `TimeoutResponseTrigger` has
   193  expired.
   194  
   195  ``` go
   196  type EventTimeoutResponse struct {
   197    PeerID ID
   198    Height int64
   199  }
   200  ```
   201  `EventTimeoutTermination` is generated by `Executor` to signal that a timeout triggered by `TimeoutTerminationTrigger`
   202  has expired.
   203  
   204  ``` go
   205  type EventTimeoutTermination struct {
   206    Height int64
   207  }
   208  ```
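
The listings in this section declare each event both as an enum constant and as a struct, which does not compile as written. One possible way to make the type switch in the `Controller` work in real Go is to model `Event` as an interface. The sketch below assumes that approach: the `isEvent` marker method and the `describe` function are hypothetical, and `PeerID` is a plain string for brevity.

```go
package main

import "fmt"

// Event is modelled as an interface so that a type switch compiles.
// isEvent is an unexported marker method (an assumption of this sketch).
type Event interface{ isEvent() }

type EventStatusResponse struct {
	PeerID string
	Height int64
}

type EventRemovePeer struct {
	PeerID string
}

func (EventStatusResponse) isEvent() {}
func (EventRemovePeer) isEvent()     {}

// describe dispatches on the concrete event type, in the same way the
// Controller's event handler would.
func describe(e Event) string {
	switch ev := e.(type) {
	case EventStatusResponse:
		return fmt.Sprintf("status from %s at height %d", ev.PeerID, ev.Height)
	case EventRemovePeer:
		return "remove " + ev.PeerID
	default:
		return "unknown event"
	}
}

func main() {
	fmt.Println(describe(EventStatusResponse{PeerID: "p1", Height: 7}))
	fmt.Println(describe(EventRemovePeer{PeerID: "p1"}))
}
```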
   209  
`MessageToSend` is a wrapper around the `Message` type that also contains the ID of the peer to which the message should be sent.
   211  
   212  ``` go
   213  type MessageToSend struct {
   214    PeerID  ID
   215    Message Message
   216  }
   217  ```
   218  
The `Controller` state machine can be in two modes: `ModeFastSync`, when
a node is trying to catch up with the network by downloading committed blocks,
and `ModeConsensus`, in which it executes the Tendermint consensus protocol. We
consider that `fast-sync` mode terminates once the `Controller` switches to
`ModeConsensus`.
   224  
   225  ``` go
   226  type Mode int
   227  const (
   228    ModeUnknown Mode = iota
   229    ModeFastSync
   230    ModeConsensus
   231  )
   232  ```
   233  `Controller` is managing the following state:
   234  
   235  ``` go
   236  type ControllerState struct {
   237    Height             int64            // the first block that is not committed
   238    Mode               Mode             // mode of operation
   239    PeerMap            map[ID]PeerStats // map of peer IDs to peer statistics
   240    MaxRequestPending  int64            // maximum height of the pending requests
   241    FailedRequests     []int64          // list of failed block requests
   242    PendingRequestsNum int              // total number of pending requests
   243    Store              []BlockInfo      // contains list of downloaded blocks
   244    Executor           BlockExecutor    // store, verify and executes blocks
   245  }
   246  ```
   247  
The `PeerStats` data structure keeps, for every peer, its current height and its pending block request (at most one request is pending per peer; `-1` means none).
   249  
   250  ``` go
   251  type PeerStats struct {
   252    Height             int64
   253    PendingRequest     int64             // a request sent to this peer
   254  }
   255  ```
   256  
The `BlockInfo` data structure stores information (as part of the block store) about downloaded blocks: the peer from
which a block and the corresponding commit certificate were received.
   259  ``` go
   260  type BlockInfo struct {
   261    Block  Block
   262    Commit Commit
   263    PeerID ID                // a peer from which we received the corresponding Block and Commit
   264  }
   265  ```
   266  
   267  The `Controller` is initialized by providing an initial height (`startHeight`) from which it will start downloading
   268  blocks from peers and the current state of the `BlockExecutor`.
   269  
``` go
func NewControllerState(startHeight int64, executor BlockExecutor) ControllerState {
  state := ControllerState{}
  state.Height = startHeight
  state.Mode = ModeFastSync
  state.MaxRequestPending = startHeight - 1
  state.PendingRequestsNum = 0
  state.Executor = executor
  // PeerMap, FailedRequests and Store start out empty
  state.PeerMap = make(map[ID]PeerStats)
  state.FailedRequests = []int64{}
  state.Store = []BlockInfo{}
  return state
}
```
   282  
The core protocol logic is given by the following function:

``` go
func handleEvent(state ControllerState, event Event) (ControllerState, MessageToSend, TimeoutTrigger, Error) {
  msg = nil
  timeout = nil
  error = nil

  switch state.Mode {
  case ModeConsensus:
    switch event := event.(type) {
    case EventBlockRequest:
      msg = createBlockResponseMessage(state, event)
      return state, msg, timeout, error
    default:
      error = "Only respond to BlockRequests while in ModeConsensus!"
      return state, msg, timeout, error
    }

  case ModeFastSync:
    switch event := event.(type) {
    case EventBlockRequest:
      msg = createBlockResponseMessage(state, event)
      return state, msg, timeout, error

    case EventStatusResponse:
      return handleEventStatusResponse(event, state)

    case EventRemovePeer:
      return handleEventRemovePeer(event, state)

    case EventBlockResponse:
      return handleEventBlockResponse(event, state)

    case EventTimeoutResponse:
      return handleEventResponseTimeout(event, state)

    case EventTimeoutTermination:
      // Termination timeout is triggered in case of an empty peer set and in case there are no pending requests.
      // If this timeout expires and in the meantime no new peers are added or new pending requests are made,
      // then `fast-sync` mode terminates by switching to `ModeConsensus`.
      // Note that the termination timeout should be higher than the response timeout.
      if state.Height == event.Height && state.PendingRequestsNum == 0 { state.Mode = ModeConsensus }
      return state, msg, timeout, error

    default:
      error = "Received unknown event type!"
      return state, msg, timeout, error
    }
  }
}
```
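
Because the termination-timeout case of `handleEvent` depends only on the state and the event, it can be checked with a plain table of inputs and no network setup. The following sketch illustrates this under simplifying assumptions: `syncState`, `syncMode` and `onTerminationTimeout` are hypothetical, reduced copies of the relevant fields and logic.

```go
package main

import "fmt"

type syncMode int

const (
	modeFastSync syncMode = iota
	modeConsensus
)

// syncState mirrors the three ControllerState fields that the termination
// rule reads; it is a reduced copy, not the real type.
type syncState struct {
	height             int64
	pendingRequestsNum int
	mode               syncMode
}

// onTerminationTimeout switches to consensus only if the height has not
// changed since the timeout was scheduled and nothing is pending.
func onTerminationTimeout(s syncState, timeoutHeight int64) syncState {
	if s.height == timeoutHeight && s.pendingRequestsNum == 0 {
		s.mode = modeConsensus
	}
	return s
}

func main() {
	cases := []struct {
		name          string
		in            syncState
		timeoutHeight int64
		want          syncMode
	}{
		{"idle at same height", syncState{10, 0, modeFastSync}, 10, modeConsensus},
		{"progress was made", syncState{12, 0, modeFastSync}, 10, modeFastSync},
		{"request still pending", syncState{10, 1, modeFastSync}, 10, modeFastSync},
	}
	for _, c := range cases {
		got := onTerminationTimeout(c.in, c.timeoutHeight).mode
		fmt.Printf("%s: ok=%v\n", c.name, got == c.want)
	}
}
```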
   335  
``` go
func createBlockResponseMessage(state ControllerState, event EventBlockRequest) MessageToSend {
  msgToSend = nil
  // look up what we know about the requesting peer; an unknown peer starts with no recorded height
  peerStats, ok := state.PeerMap[event.PeerID]
  if !ok { peerStats = PeerStats{-1, -1} }
  if state.Executor.ContainsBlockWithHeight(event.Height) && event.Height > peerStats.Height {
    peerStats.Height = event.Height
    msg = MessageBlockResponse{
     Height:     event.Height,
     Block:      state.Executor.getBlock(event.Height),
     Commit:     state.Executor.getCommit(event.Height),
     PeerHeight: state.Height - 1,
    }
    msgToSend = MessageToSend { event.PeerID, msg }
  }
  state.PeerMap[event.PeerID] = peerStats
  return msgToSend
}
```
   355  
``` go
func handleEventStatusResponse(event EventStatusResponse, state ControllerState) (ControllerState, MessageToSend, TimeoutTrigger, Error) {
  msg, timeout, error = nil, nil, nil
  if _, ok := state.PeerMap[event.PeerID]; !ok {
    peerStats = PeerStats{ -1, -1 }
  } else {
    peerStats = state.PeerMap[event.PeerID]
  }

  if event.Height > peerStats.Height { peerStats.Height = event.Height }
  // if there is no pending request for this peer, try to send it a request for a block
  if peerStats.PendingRequest == -1 {
    msg = createBlockRequestMessage(state, event.PeerID, peerStats.Height)
    // msg is nil if no request for a block can be made to the peer at this point in time
    if msg != nil {
      peerStats.PendingRequest = msg.Height
      state.PendingRequestsNum++
      // when a request for a block is sent to a peer, a response timeout is triggered. If no corresponding block is sent by the peer
      // during the response timeout period, then the peer is considered faulty and is removed from the peer set.
      timeout = TimeoutResponseTrigger{ msg.PeerID, msg.Height, PeerTimeout }
    } else if state.PendingRequestsNum == 0 {
      // if there are no pending requests and no new request can be placed to the peer, the termination timeout is triggered.
      // If the termination timeout expires and we are still at the same height and there are no pending requests, the "fast-sync"
      // mode is finished and we switch to `ModeConsensus`.
      timeout = TimeoutTerminationTrigger{ state.Height, TerminationTimeout }
    }
  }
  state.PeerMap[event.PeerID] = peerStats
  return state, msg, timeout, error
}
```
   386  
``` go
func handleEventRemovePeer(event EventRemovePeer, state ControllerState) (ControllerState, MessageToSend, TimeoutTrigger, Error) {
  msg, timeout, error = nil, nil, nil
  if _, ok := state.PeerMap[event.PeerID]; ok {
    pendingRequest = state.PeerMap[event.PeerID].PendingRequest
    // if a peer is removed from the peer set, its pending request is declared failed and added to the `FailedRequests` list
    // so it can be retried. The pending counter is decremented only if the peer actually had a pending request.
    if pendingRequest != -1 {
      add(state.FailedRequests, pendingRequest)
      state.PendingRequestsNum--
    }
    delete(state.PeerMap, event.PeerID)
    // if the peer set is empty after removal of this peer, then the termination timeout is triggered.
    if state.PeerMap.isEmpty() {
      timeout = TimeoutTerminationTrigger{ state.Height, TerminationTimeout }
    }
  } else { error = "Removing unknown peer!" }
  return state, msg, timeout, error
}
```
   405  
``` go
func handleEventBlockResponse(event EventBlockResponse, state ControllerState) (ControllerState, MessageToSend, TimeoutTrigger, Error) {
  msg, timeout, error = nil, nil, nil
  if peerStats, ok := state.PeerMap[event.PeerID]; ok {
    // when the expected block arrives from a peer, it is added to the store so it can be verified and, if correct, executed afterwards.
    if peerStats.PendingRequest == event.Height {
      peerStats.PendingRequest = -1
      state.PendingRequestsNum--
      if event.PeerHeight > peerStats.Height { peerStats.Height = event.PeerHeight }
      // record the update before verification so that verifyBlocks sees no pending request for this peer
      state.PeerMap[event.PeerID] = peerStats
      state.Store[event.Height] = BlockInfo{ event.Block, event.Commit, event.PeerID }
      // blocks are verified sequentially, so adding a block to the store does not mean that it will be verified immediately,
      // as some of the previous blocks might be missing.
      state = verifyBlocks(state) // it can lead to event.PeerID being removed from the peer list
      if _, ok := state.PeerMap[event.PeerID]; ok {
        // we try to identify a new request for a block that can be sent to the peer
        msg = createBlockRequestMessage(state, event.PeerID, peerStats.Height)
        if msg != nil {
          peerStats.PendingRequest = msg.Height
          state.PeerMap[event.PeerID] = peerStats
          state.PendingRequestsNum++
          // if a request for a block is made, the response timeout is triggered
          timeout = TimeoutResponseTrigger{ msg.PeerID, msg.Height, PeerTimeout }
        } else if state.PeerMap.isEmpty() || state.PendingRequestsNum == 0 {
          // if the peer map is empty (the peer can be removed as block verification failed) or there are no pending requests,
          // the termination timeout is triggered.
          timeout = TimeoutTerminationTrigger{ state.Height, TerminationTimeout }
        }
      }
    } else { error = "Received Block from wrong peer!" }
  } else { error = "Received Block from unknown peer!" }

  return state, msg, timeout, error
}
```
   440  
``` go
func handleEventResponseTimeout(event EventTimeoutResponse, state ControllerState) (ControllerState, MessageToSend, TimeoutTrigger, Error) {
  msg, timeout, error = nil, nil, nil
  if _, ok := state.PeerMap[event.PeerID]; ok {
    peerStats = state.PeerMap[event.PeerID]
    // if a response timeout expires and the peer hasn't delivered the block, the peer is removed from the peer list and
    // the request is added to `FailedRequests` so the block can be downloaded from another peer
    if peerStats.PendingRequest == event.Height {
      add(state.FailedRequests, peerStats.PendingRequest)
      delete(state.PeerMap, event.PeerID)
      state.PendingRequestsNum--
      // if the peer set is empty, then the termination timeout is triggered
      if state.PeerMap.isEmpty() {
        timeout = TimeoutTerminationTrigger{ state.Height, TerminationTimeout }
      }
    }
  }
  return state, msg, timeout, error
}
```
   460  
``` go
func createBlockRequestMessage(state ControllerState, peerID ID, peerHeight int64) MessageToSend {
  msg = nil
  blockHeight = -1
  r = find request in state.FailedRequests such that r <= peerHeight // returns `nil` if there is no such request
  // if there is a height in failed requests that can be downloaded from the peer, send the request to it
  if r != nil {
    blockHeight = r
    delete(state.FailedRequests, r)
  } else if state.MaxRequestPending < peerHeight {
    // if the height of the maximum pending request is smaller than the peer height, then ask the peer for the next block.
    // note: in real Go, state must be passed by pointer (or returned) for this update to reach the caller
    state.MaxRequestPending++
    blockHeight = state.MaxRequestPending
  }

  if blockHeight > -1 { msg = MessageToSend { peerID, MessageBlockRequest { blockHeight } } }
  return msg
}
```
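
The height-selection rule used when creating a block request (retry a failed height first, otherwise advance past the highest pending request) can be expressed in compilable Go roughly as follows. This is a sketch under assumptions: the `scheduler` type and `nextHeight` method are hypothetical, and a pointer receiver is used so that updates are visible to the caller.

```go
package main

import "fmt"

// scheduler holds only the fields that block selection needs.
type scheduler struct {
	maxRequestPending int64   // highest height requested so far
	failedRequests    []int64 // heights that must be requested again
}

// nextHeight returns the height to request from a peer at peerHeight,
// or -1 if nothing can be requested from that peer right now.
func (s *scheduler) nextHeight(peerHeight int64) int64 {
	// Retry a failed height first, if the peer is high enough to serve it.
	for i, h := range s.failedRequests {
		if h <= peerHeight {
			s.failedRequests = append(s.failedRequests[:i], s.failedRequests[i+1:]...)
			return h
		}
	}
	// Otherwise ask for the next height nobody has been asked for yet.
	if s.maxRequestPending < peerHeight {
		s.maxRequestPending++
		return s.maxRequestPending
	}
	return -1
}

func main() {
	s := &scheduler{maxRequestPending: 4, failedRequests: []int64{9, 3}}
	fmt.Println(s.nextHeight(6)) // failed height 3 is retried first
	fmt.Println(s.nextHeight(6)) // then the next new height, 5
	fmt.Println(s.nextHeight(5)) // nothing left that this peer can serve
}
```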
   480  
``` go
func verifyBlocks(state ControllerState) ControllerState {
  done = false
  for !done {
    block = state.Store[state.Height]
    if block != nil {
      verified = verify block.Block using block.Commit // returns `true` if verification succeeds, `false` otherwise

      if verified {
        block.Execute()   // executing a block is a costly operation, so it might make sense to execute it asynchronously
        state.Height++
      } else {
        // if block verification failed, then the height is added to `FailedRequests` and the peer is removed from the peer set
        add(state.FailedRequests, state.Height)
        state.Store[state.Height] = nil
        if _, ok := state.PeerMap[block.PeerID]; ok {
          pendingRequest = state.PeerMap[block.PeerID].PendingRequest
          // if there is a pending request sent to the peer that is about to be removed from the peer set, add it to `FailedRequests`
          if pendingRequest != -1 {
            add(state.FailedRequests, pendingRequest)
            state.PendingRequestsNum--
          }
          delete(state.PeerMap, block.PeerID)
        }
        done = true
      }
    } else { done = true }
  }
  return state
}
```
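
The sequential nature of verification (stop at the first gap or the first invalid block) is sketched below in compilable form. It is an illustration under assumptions: `blockInfo` and `verifyFrom` are hypothetical, the store is a map keyed by height, and a boolean `valid` flag stands in for real commit verification.

```go
package main

import "fmt"

// blockInfo is a reduced stand-in for the ADR's BlockInfo; valid replaces
// real commit verification, which is out of scope for this sketch.
type blockInfo struct {
	peerID string
	valid  bool
}

// verifyFrom walks the downloaded blocks in height order starting at height.
// It stops at the first gap or invalid block and returns the new height plus
// the ID of the peer that supplied an invalid block ("" if none).
func verifyFrom(height int64, store map[int64]blockInfo) (int64, string) {
	for {
		b, ok := store[height]
		if !ok {
			return height, "" // gap: wait for the missing block
		}
		if !b.valid {
			delete(store, height) // the height must be downloaded again
			return height, b.peerID
		}
		height++
	}
}

func main() {
	store := map[int64]blockInfo{
		1: {"p1", true},
		2: {"p2", true},
		4: {"p1", true}, // cannot be verified until block 3 arrives
	}
	h, bad := verifyFrom(1, store)
	fmt.Printf("height=%d faultyPeer=%q\n", h, bad)

	store[3] = blockInfo{"p9", false}
	h, bad = verifyFrom(h, store)
	fmt.Printf("height=%d faultyPeer=%q\n", h, bad)
}
```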
   512  
In the proposed architecture the `Controller` is not an active task, i.e., it is called by the `Executor`. Depending on
the values returned by the `Controller`, the `Executor` will send a message to some peer (`msg != nil`), trigger a
timeout (`timeout != nil`) or deal with errors (`error != nil`).
When a timeout is triggered, the `Executor` provides the corresponding timeout event as an input to the `Controller`
once the timeout expires.
   518  
   519  
   520  ## Status
   521  
   522  Draft.
   523  
   524  ## Consequences
   525  
   526  ### Positive
   527  
   528  - isolated implementation of the algorithm
   529  - improved testability - simpler to prove correctness
   530  - clearer separation of concerns - easier to reason
   531  
   532  ### Negative
   533  
   534  ### Neutral