     1  # ADR 067: Mempool Refactor
     2  
     3  - [ADR 067: Mempool Refactor](#adr-067-mempool-refactor)
     4    - [Changelog](#changelog)
     5    - [Status](#status)
     6    - [Context](#context)
     7      - [Current Design](#current-design)
     8    - [Alternative Approaches](#alternative-approaches)
     9    - [Prior Art](#prior-art)
    10      - [Ethereum](#ethereum)
    11      - [Diem](#diem)
    12    - [Decision](#decision)
    13    - [Detailed Design](#detailed-design)
    14      - [CheckTx](#checktx)
    15      - [Mempool](#mempool)
    16      - [Eviction](#eviction)
    17      - [Gossiping](#gossiping)
    18      - [Performance](#performance)
    19    - [Future Improvements](#future-improvements)
    20    - [Consequences](#consequences)
    21      - [Positive](#positive)
    22      - [Negative](#negative)
    23      - [Neutral](#neutral)
    24    - [References](#references)
    25  
    26  ## Changelog
    27  
    28  - April 19, 2021: Initial Draft (@alexanderbez)
    29  
    30  ## Status
    31  
    32  Accepted
    33  
    34  ## Context
    35  
Tendermint Core has a reactor and data structure, the mempool, that facilitates the
    37  ephemeral storage of uncommitted transactions. Honest nodes participating in a
    38  Tendermint network gossip these uncommitted transactions to each other if they
    39  pass the application's `CheckTx`. In addition, block proposers select from the
    40  mempool a subset of uncommitted transactions to include in the next block.
    41  
Currently, the mempool in Tendermint Core is designed as a FIFO queue. In other
words, transactions are included in blocks in the order in which they are received
by a node. There is currently no explicit, prioritized ordering of these
uncommitted transactions. This presents a few technical and UX challenges for
operators and applications.
    46  
Namely, validators are not able to prioritize transactions by their fees or any
other incentive-aligned mechanism. In addition, the lack of prioritization leads
to cascading effects in terms of DoS and various attack vectors on networks,
    50  e.g. [cosmos/cosmos-sdk#8224](https://github.com/cosmos/cosmos-sdk/discussions/8224).
    51  
    52  Thus, Tendermint Core needs the ability for an application and its users to
    53  prioritize transactions in a flexible and performant manner. Specifically, we're
    54  aiming to either improve, maintain or add the following properties in the
    55  Tendermint mempool:
    56  
    57  - Allow application-determined transaction priority.
    58  - Allow efficient concurrent reads and writes.
    59  - Allow block proposers to reap transactions efficiently by priority.
    60  - Maintain a fixed mempool capacity by transaction size and evict lower priority
    61    transactions to make room for higher priority transactions.
    62  - Allow transactions to be gossiped by priority efficiently.
    63  - Allow operators to specify a maximum TTL for transactions in the mempool before
    64    they're automatically evicted if not selected for a block proposal in time.
    65  - Ensure the design allows for future extensions, such as replace-by-priority and
    66    allowing multiple pending transactions per sender, to be incorporated easily.
    67  
    68  Note, not all of these properties will be addressed by the proposed changes in
    69  this ADR. However, this proposal will ensure that any unaddressed properties
    70  can be addressed in an easy and extensible manner in the future.
    71  
    72  ### Current Design
    73  
    74  ![mempool](./img/mempool-v0.jpeg)
    75  
    76  At the core of the `v0` mempool reactor is a concurrent linked-list. This is the
    77  primary data structure that contains `Tx` objects that have passed `CheckTx`.
    78  When a node receives a transaction from another peer, it executes `CheckTx`, which
    79  obtains a read-lock on the `*CListMempool`. If the transaction passes `CheckTx`
    80  locally on the node, it is added to the `*CList` by obtaining a write-lock. It
    81  is also added to the `cache` and `txsMap`, both of which obtain their own respective
    82  write-locks and map a reference from the transaction hash to the `Tx` itself.
    83  
Transactions are continuously gossiped to peers whenever a new transaction is added
to a local node's `*CList`, where the element at the front of the `*CList` is selected.
    86  Another transaction will not be gossiped until the `*CList` notifies the reader
    87  that there are more transactions to gossip.
    88  
    89  When a proposer attempts to propose a block, they will execute `ReapMaxBytesMaxGas`
    90  on the reactor's `*CListMempool`. This call obtains a read-lock on the `*CListMempool`
    91  and selects as many transactions as possible starting from the front of the `*CList`
    92  moving to the back of the list.
    93  
    94  When a block is finally committed, a caller invokes `Update` on the reactor's
    95  `*CListMempool` with all the selected transactions. Note, the caller must also
    96  explicitly obtain a write-lock on the reactor's `*CListMempool`. This call
    97  will remove all the supplied transactions from the `txsMap` and the `*CList`, both
of which obtain their own respective write-locks. In addition, the transaction
may also be removed from the `cache`, which obtains its own write-lock.
   100  
   101  ## Alternative Approaches
   102  
   103  When considering which approach to take for a priority-based flexible and
   104  performant mempool, there are two core candidates. The first candidate is less
invasive in the required set of protocol and implementation changes, which
   106  simply extends the existing `CheckTx` ABCI method. The second candidate essentially
   107  involves the introduction of new ABCI method(s) and would require a higher degree
   108  of complexity in protocol and implementation changes, some of which may either
   109  overlap or conflict with the upcoming introduction of [ABCI++](https://github.com/tendermint/tendermint/blob/v0.37.x/docs/rfc/rfc-013-abci%2B%2B.md).
   110  
   111  For more information on the various approaches and proposals, please see the
   112  [mempool discussion](https://github.com/tendermint/tendermint/discussions/6295).
   113  
   114  ## Prior Art
   115  
   116  ### Ethereum
   117  
The Ethereum mempool, specifically the one in [Geth](https://github.com/ethereum/go-ethereum),
is a `*TxPool` that contains various mappings indexed by account, such as
`pending`, which holds all currently processable transactions for each account,
ordered by nonce. It also contains a `queue`, which is the exact same kind of
mapping except that it holds transactions that are not yet processable. The
mempool also contains a `priced` index of type `*txPricedList`, a priority queue
based on transaction price.
   125  
   126  ### Diem
   127  
The [Diem mempool](https://github.com/diem/diem/blob/master/mempool/README.md#implementation-details)
takes a similar approach to the one we propose. Specifically, the Diem mempool
contains a mapping from `Account:[]Tx`. On top of this primary mapping from account
to a list of transactions are various indexes used to perform certain actions.
   132  
The main index, `PriorityIndex`, is an ordered queue of transactions that are
   134  “consensus-ready” (i.e., they have a sequence number which is sequential to the
   135  current sequence number for the account). This queue is ordered by gas price so
   136  that if a client is willing to pay more (than other clients) per unit of
   137  execution, then they can enter consensus earlier.
   138  
   139  ## Decision
   140  
   141  To incorporate a priority-based flexible and performant mempool in Tendermint Core,
   142  we will introduce new fields, `priority` and `sender`, into the `ResponseCheckTx`
   143  type.
   144  
We will introduce a new versioned mempool reactor, `v1`, and assume an implicit
   146  version of the current mempool reactor as `v0`. In the new `v1` mempool reactor,
   147  we largely keep the functionality the same as `v0` except we augment the underlying
   148  data structures. Specifically, we keep a mapping of senders to transaction objects.
   149  On top of this mapping, we index transactions to provide the ability to efficiently
   150  gossip and reap transactions by priority.
   151  
   152  ## Detailed Design
   153  
   154  ### CheckTx
   155  
   156  We introduce the following new fields into the `ResponseCheckTx` type:
   157  
   158  ```diff
   159  message ResponseCheckTx {
   160    uint32         code       = 1;
   161    bytes          data       = 2;
   162    string         log        = 3;  // nondeterministic
   163    string         info       = 4;  // nondeterministic
   164    int64          gas_wanted = 5 [json_name = "gas_wanted"];
   165    int64          gas_used   = 6 [json_name = "gas_used"];
   166    repeated Event events     = 7 [(gogoproto.nullable) = false, (gogoproto.jsontag) = "events,omitempty"];
   167    string         codespace  = 8;
   168  + int64          priority   = 9;
   169  + string         sender     = 10;
   170  }
   171  ```
   172  
It is entirely up to the application to determine how these fields are populated
and with what values, e.g. the `sender` could be the signer and fee payer
of the transaction, and the `priority` could be the cumulative sum of the fee(s).
   176  
Only `sender` is required, while `priority` can be omitted, in which case it
defaults to zero.
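
As a rough illustration only, an ABCI application might populate these fields in
its `CheckTx` handler along the lines of the sketch below. The application type
and the `decodeTx`, `TotalFee`, and `FeePayer` helpers are hypothetical and not
part of this proposal; only the `priority` and `sender` fields themselves are.

```go
// Sketch of an application's CheckTx handler populating the proposed fields.
// decodeTx, TotalFee, and FeePayer are hypothetical application helpers, and
// abci refers to the ABCI types package.
func (app *Application) CheckTx(req abci.RequestCheckTx) abci.ResponseCheckTx {
	tx, err := decodeTx(req.Tx)
	if err != nil {
		return abci.ResponseCheckTx{Code: 1, Log: err.Error()}
	}

	return abci.ResponseCheckTx{
		Code: abci.CodeTypeOK,
		// Higher values are reaped first when building a block proposal; here
		// the priority is the transaction's cumulative fee.
		Priority: tx.TotalFee(),
		// A stable, application-defined identity, e.g. the fee payer/signer.
		Sender: tx.FeePayer(),
	}
}
```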
   179  
   180  ### Mempool
   181  
   182  The existing concurrent-safe linked-list will be replaced by a thread-safe map
of `<sender:*Tx>`, i.e. a mapping from `sender` to a single `*Tx` object, where
   184  each `*Tx` is the next valid and processable transaction from the given `sender`.
   185  
   186  On top of this mapping, we index all transactions by priority using a thread-safe
   187  priority queue, i.e. a [max heap](https://en.wikipedia.org/wiki/Min-max_heap).
When a proposer is ready to select transactions for the next block proposal,
transactions are selected from this priority index in order of highest priority.
   190  When a transaction is selected and reaped, it is removed from this index and
   191  from the `<sender:*Tx>` mapping.
   192  
   193  We define `Tx` as the following data structure:
   194  
   195  ```go
   196  type Tx struct {
   197    // Tx represents the raw binary transaction data.
   198    Tx []byte
   199  
   200    // Priority defines the transaction's priority as specified by the application
   201    // in the ResponseCheckTx response.
   202    Priority int64
   203  
   204    // Sender defines the transaction's sender as specified by the application in
   205    // the ResponseCheckTx response.
   206    Sender string
   207  
   208    // Index defines the current index in the priority queue index. Note, if
   209    // multiple Tx indexes are needed, this field will be removed and each Tx
   210    // index will have its own wrapped Tx type.
   211    Index int
   212  }
   213  ```
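
As an illustration of the priority index, the following is a minimal sketch built
on Go's standard `container/heap` package. It deliberately omits locking, the
`<sender:*Tx>` mapping, and the gossip index; the `txPriorityQueue` name is
illustrative, not part of this proposal.

```go
// txPriorityQueue is a max-heap of *Tx ordered by Priority, used through
// heap.Push, heap.Pop, and heap.Remove from container/heap.
type txPriorityQueue []*Tx

func (pq txPriorityQueue) Len() int { return len(pq) }

// Less inverts the usual comparison so the highest priority sits at the root.
func (pq txPriorityQueue) Less(i, j int) bool { return pq[i].Priority > pq[j].Priority }

func (pq txPriorityQueue) Swap(i, j int) {
	pq[i], pq[j] = pq[j], pq[i]
	// Keep each Tx's heap position current so it can be removed in O(log n)
	// during eviction via heap.Remove.
	pq[i].Index, pq[j].Index = i, j
}

func (pq *txPriorityQueue) Push(x interface{}) {
	tx := x.(*Tx)
	tx.Index = len(*pq)
	*pq = append(*pq, tx)
}

func (pq *txPriorityQueue) Pop() interface{} {
	old := *pq
	n := len(old)
	tx := old[n-1]
	old[n-1] = nil // avoid holding a reference to the removed Tx
	*pq = old[:n-1]
	return tx
}
```

With this shape, reaping for a block proposal amounts to repeatedly calling
`heap.Pop` until the proposal's byte/gas limits are reached, removing each popped
`Tx` from the `<sender:*Tx>` mapping as it is selected.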
   214  
   215  ### Eviction
   216  
If the mempool is full when a new `Tx` successfully passes `CheckTx`, we must
check whether there exists a `Tx` of lower priority that can be evicted to make
room for the new, higher-priority `Tx` while leaving sufficient size capacity
for it.
   221  
If such a `Tx` exists, we find it by obtaining a read lock and sorting the
priority queue index. Once sorted, we find the first `Tx` with lower priority
whose size is such that evicting it would let the new `Tx` fit within the
mempool's size limit. We then remove this `Tx` from the priority queue index as
well as from the `<sender:*Tx>` mapping.
   227  
This will require `O(n)` additional space and `O(n*log(n))` runtime complexity.
Note that the additional space does not depend on the size of the transactions
themselves.
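
The sketch below illustrates selecting an eviction victim as described above. The
function name, the `curBytes`/`maxBytes` parameters, and the use of the standard
`sort` package are illustrative assumptions, not part of this proposal.

```go
// findVictim returns the lowest-priority Tx whose eviction would make room
// for newTx, or nil if no strictly lower-priority Tx frees enough space.
// txs is a snapshot of the priority index; curBytes and maxBytes are the
// mempool's current and maximum total transaction sizes in bytes.
func findVictim(newTx *Tx, txs []*Tx, curBytes, maxBytes int64) *Tx {
	// Sort a copy in ascending priority order so the cheapest candidates are
	// considered first: O(n) extra space and O(n*log(n)) time.
	sorted := make([]*Tx, len(txs))
	copy(sorted, txs)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Priority < sorted[j].Priority })

	// Bytes that must be freed for newTx to fit within maxBytes.
	need := curBytes + int64(len(newTx.Tx)) - maxBytes

	for _, tx := range sorted {
		if tx.Priority >= newTx.Priority {
			return nil // only strictly lower-priority transactions may be evicted
		}
		if int64(len(tx.Tx)) >= need {
			return tx // evicting this Tx makes enough room for newTx
		}
	}
	return nil
}
```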
   229  
   230  ### Gossiping
   231  
   232  We keep the existing thread-safe linked list as an additional index. Using this
   233  index, we can efficiently gossip transactions in the same manner as they are
   234  gossiped now (FIFO).
   235  
   236  Gossiping transactions will not require locking any other indexes.
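
To make the locking claim concrete, the sketch below shows one way the indexes
could sit side by side so that only the FIFO gossip list is touched on the gossip
path. The struct and field names and the use of `container/list`, `container/heap`,
and `sync` are illustrative assumptions; the actual reactor would reuse its
existing thread-safe `*CList`.

```go
// TxMempool sketches the three indexes side by side. Only gossipMtx guards
// the FIFO gossip list, so the gossip loop never contends with reaping or
// eviction on the other indexes.
type TxMempool struct {
	mtx           sync.Mutex
	txBySender    map[string]*Tx   // the <sender:*Tx> mapping
	priorityIndex *txPriorityQueue // max-heap from the sketch above

	gossipMtx   sync.Mutex
	gossipIndex *list.List // FIFO order of arrival, consumed by the gossip loop
}

// addTx inserts a Tx that passed CheckTx into every index.
func (mem *TxMempool) addTx(tx *Tx) {
	mem.mtx.Lock()
	mem.txBySender[tx.Sender] = tx
	heap.Push(mem.priorityIndex, tx)
	mem.mtx.Unlock()

	mem.gossipMtx.Lock()
	mem.gossipIndex.PushBack(tx)
	mem.gossipMtx.Unlock()
}
```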
   237  
   238  ### Performance
   239  
Performance should largely remain unaffected apart from the space overhead of
keeping an additional priority queue index and the case where we need to evict
transactions from the priority queue index. There should be no reads that block
writes on any index.
   244  
   245  ## Future Improvements
   246  
There are a few considerable ways in which the proposed design can be improved or
expanded upon, namely transaction gossiping and the ability to support multiple
transactions from the same `sender`.
   250  
With regards to transaction gossiping, we need to empirically validate whether
we need to gossip by priority. In addition, the current method of gossiping,
broadcasting all the transactions a node has in its mempool to its peers, may
not be the most efficient. Rather, we should explore the ability to gossip
transactions on a request/response basis, similar to Ethereum and other
protocols. Not only would this reduce bandwidth and complexity, but it would
also allow us to explore gossiping by priority or other dimensions more
efficiently.
   258  
   259  Allowing for multiple transactions from the same `sender` is important and will
   260  most likely be a needed feature in the future development of the mempool, but for
   261  now it suffices to have the preliminary design agreed upon. Having the ability
   262  to support multiple transactions per `sender` will require careful thought with
   263  regards to the interplay of the corresponding ABCI application. Regardless, the
   264  proposed design should allow for adaptations to support this feature in a
   265  non-contentious and backwards compatible manner.
   266  
   267  ## Consequences
   268  
   269  ### Positive
   270  
   271  - Transactions are allowed to be prioritized by the application.
   272  
   273  ### Negative
   274  
   275  - Increased size of the `ResponseCheckTx` Protocol Buffer type.
   276  - Causal ordering is NOT maintained.
   277    - It is possible that certain transactions broadcasted in a particular order may
   278    pass `CheckTx` but not end up being committed in a block because they fail
   279    `CheckTx` later. e.g. Consider Tx<sub>1</sub> that sends funds from existing
   280    account Alice to a _new_ account Bob with priority P<sub>1</sub> and then later
   281    Bob's _new_ account sends funds back to Alice in Tx<sub>2</sub> with P<sub>2</sub>,
   282    such that P<sub>2</sub> > P<sub>1</sub>. If executed in this order, both
   283    transactions will pass `CheckTx`. However, when a proposer is ready to select
   284    transactions for the next block proposal, they will select Tx<sub>2</sub> before
   285    Tx<sub>1</sub> and thus Tx<sub>2</sub> will _fail_ because Tx<sub>1</sub> must
   286    be executed first. This is because there is a _causal ordering_,
   287    Tx<sub>1</sub> ➝ Tx<sub>2</sub>. These types of situations should be rare as
  most transactions are not causally ordered. They can also be circumvented by
  simply trying again at a later point in time or by ensuring the "child" priority
  is lower than the "parent" priority. In other words, if parents always have
  priorities that are higher than their children's, then the new mempool design
  will maintain causal ordering.
   293  
   294  ### Neutral
   295  
- A transaction that passed `CheckTx` and entered the mempool can later be evicted
  if a higher-priority transaction enters while the mempool is full.
   299  
   300  ## References
   301  
   302  - [ABCI++](https://github.com/tendermint/tendermint/blob/v0.37.x/docs/rfc/rfc-013-abci%2B%2B.md)
   303  - [Mempool Discussion](https://github.com/tendermint/tendermint/discussions/6295)