# ADR 053: State Sync Prototype

This ADR outlines the plan for an initial state sync prototype, and is subject to change as we gain feedback and experience. It builds on the discussions and findings in [ADR-042](./adr-042-state-sync.md); see that ADR for background information.

## Changelog

* 2020-01-28: Initial draft (Erik Grinaker)

* 2020-02-18: Updates after initial prototype (Erik Grinaker)
    * ABCI: added missing `reason` fields.
    * ABCI: used 32-bit 1-based chunk indexes (was 64-bit 0-based).
    * ABCI: moved `RequestApplySnapshotChunk.chain_hash` to `RequestOfferSnapshot.app_hash`.
    * Gaia: snapshots must include node versions as well, both for inner and leaf nodes.
    * Added experimental prototype info.
    * Added open questions and implementation plan.

* 2020-03-29: Strengthened and simplified ABCI interface (Erik Grinaker)
    * ABCI: replaced `chunks` with `chunk_hashes` in `Snapshot`.
    * ABCI: removed `SnapshotChunk` message.
    * ABCI: renamed `GetSnapshotChunk` to `LoadSnapshotChunk`.
    * ABCI: chunks are now exchanged simply as `bytes`.
    * ABCI: chunks are now 0-indexed, for parity with the `chunk_hashes` array.
    * Reduced maximum chunk size to 16 MB, and increased snapshot message size to 4 MB.

## Context

State sync will allow a new node to receive a snapshot of the application state without downloading blocks or going through consensus. This bootstraps the node significantly faster than the current fast sync system, which replays all historical blocks.

Background discussions and justifications are detailed in [ADR-042](./adr-042-state-sync.md). Its recommendations can be summarized as:

* The application periodically takes full state snapshots (i.e. eager snapshots).

* The application splits snapshots into smaller chunks that can be individually verified against a chain app hash.

* Tendermint uses the light client to obtain a trusted chain app hash for verification.

* Tendermint discovers and downloads snapshot chunks in parallel from multiple peers, and passes them to the application via ABCI to be applied and verified against the chain app hash.

* Historical blocks are not backfilled, so state synced nodes will have a truncated block history.

## Tendermint Proposal

This describes the snapshot/restore process seen from Tendermint. The interface is kept as small and general as possible to give applications maximum flexibility.

### Snapshot Data Structure

A node can have multiple snapshots taken at various heights. Snapshots can be taken in different application-specified formats (e.g. MessagePack as format `1` and Protobuf as format `2`, or similarly for schema versioning). Each snapshot consists of multiple chunks containing the actual state data, for parallel downloads and reduced memory usage.

```proto
message Snapshot {
    uint64         height       = 1;  // The height at which the snapshot was taken
    uint32         format       = 2;  // The application-specific snapshot format
    repeated bytes chunk_hashes = 3;  // SHA-256 checksums of all chunks, in order
    bytes          metadata     = 4;  // Arbitrary application metadata
}
```

Chunks are exchanged simply as `bytes`, and cannot be larger than 16 MB. `Snapshot` messages should be less than 4 MB.
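
As an illustration (not part of the ADR), a minimal Go sketch of how an application might populate `chunk_hashes` when advertising a snapshot:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

const maxChunkSize = 16 << 20 // chunks cannot be larger than 16 MB

// chunkHashes computes the SHA-256 checksum of each chunk, in order,
// matching the semantics of the Snapshot.chunk_hashes field above.
func chunkHashes(chunks [][]byte) ([][]byte, error) {
	hashes := make([][]byte, 0, len(chunks))
	for i, chunk := range chunks {
		if len(chunk) > maxChunkSize {
			return nil, fmt.Errorf("chunk %d exceeds 16 MB", i)
		}
		sum := sha256.Sum256(chunk)
		hashes = append(hashes, sum[:])
	}
	return hashes, nil
}

func main() {
	hashes, err := chunkHashes([][]byte{[]byte("first chunk"), []byte("second chunk")})
	if err != nil {
		panic(err)
	}
	for i, h := range hashes {
		fmt.Printf("chunk %d: %x\n", i, h)
	}
}
```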

### ABCI Interface

```proto
// Lists available snapshots
message RequestListSnapshots {}

message ResponseListSnapshots {
    repeated Snapshot snapshots = 1;
}

// Offers a snapshot to the application
message RequestOfferSnapshot {
    Snapshot snapshot = 1;
    bytes    app_hash = 2;
}

message ResponseOfferSnapshot {
    bool   accepted = 1;
    Reason reason   = 2;

    enum Reason {            // Reason why snapshot was rejected
        unknown        = 0;  // Unknown or generic reason
        invalid_height = 1;  // Height is rejected: avoid this height
        invalid_format = 2;  // Format is rejected: avoid this format
    }
}

// Loads a snapshot chunk
message RequestLoadSnapshotChunk {
    uint64 height = 1;
    uint32 format = 2;
    uint32 chunk  = 3; // Zero-indexed
}

message ResponseLoadSnapshotChunk {
    bytes chunk = 1;
}

// Applies a snapshot chunk
message RequestApplySnapshotChunk {
    bytes chunk = 1;
}

message ResponseApplySnapshotChunk {
    bool   applied = 1;
    Reason reason  = 2;      // Reason why chunk failed

    enum Reason {            // Reason why chunk failed
        unknown        = 0;  // Unknown or generic reason
        verify_failed  = 1;  // Snapshot verification failed
    }
}
```
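
To make the application side concrete, here is a hedged Go sketch of how an app might answer `RequestOfferSnapshot`, using simplified stand-ins for the generated protobuf types; `supportedFormat` is an assumed application constant, not part of the interface:

```go
package main

import "fmt"

// Snapshot is a simplified stand-in for the generated Snapshot type.
type Snapshot struct {
	Height      uint64
	Format      uint32
	ChunkHashes [][]byte
	Metadata    []byte
}

// Reason values mirror the ResponseOfferSnapshot.Reason enum.
const (
	ReasonUnknown       = 0
	ReasonInvalidHeight = 1
	ReasonInvalidFormat = 2
)

// offerSnapshot accepts only formats the app understands and heights it
// can restore, so Tendermint can move on to another candidate snapshot
// when rejected.
func offerSnapshot(s Snapshot, appHash []byte) (accepted bool, reason int) {
	const supportedFormat = 1 // assumed application constant
	if s.Format != supportedFormat {
		return false, ReasonInvalidFormat
	}
	if s.Height == 0 {
		return false, ReasonInvalidHeight
	}
	// The app would remember appHash here to verify the restored state.
	_ = appHash
	return true, ReasonUnknown
}

func main() {
	ok, reason := offerSnapshot(Snapshot{Height: 3000, Format: 2}, nil)
	fmt.Println(ok, reason) // false 2: format 2 is rejected
}
```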

### Taking Snapshots

Tendermint is not aware of the snapshotting process at all; it is entirely an application concern. The following guarantees must be provided (a brief scheduling sketch follows the list):

* **Periodic:** snapshots must be taken periodically, not on-demand, for faster restores, lower load, and less DoS risk.

* **Deterministic:** snapshots must be deterministic, and identical across all nodes - typically by taking a snapshot at given height intervals.

* **Consistent:** snapshots must be consistent, i.e. not affected by concurrent writes - typically by using a data store that supports versioning and/or snapshot isolation.

* **Asynchronous:** snapshots must be asynchronous, i.e. not halt block processing and state transitions.

* **Chunked:** snapshots must be split into chunks of reasonable size (on the order of megabytes), and each chunk must be verifiable against the chain app hash.

* **Garbage collected:** snapshots must be garbage collected periodically.
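
A minimal Go sketch of a scheduler satisfying the periodic, deterministic, and asynchronous guarantees, assuming a hypothetical `takeSnapshotAt` hook that reads from a versioned (and therefore consistent) view of the store:

```go
package main

import (
	"fmt"
	"time"
)

// maybeSnapshot triggers a snapshot at fixed height intervals, so all
// nodes with the same interval snapshot the same heights (deterministic),
// and runs it in a goroutine so block processing is not halted.
func maybeSnapshot(height, interval uint64, takeSnapshotAt func(uint64)) {
	if interval == 0 || height%interval != 0 {
		return
	}
	go takeSnapshotAt(height) // must read an isolated version of state
}

func main() {
	takeSnapshotAt := func(height uint64) {
		fmt.Printf("snapshotting state at height %d\n", height)
	}
	for h := uint64(1); h <= 3000; h++ {
		maybeSnapshot(h, 1000, takeSnapshotAt)
	}
	time.Sleep(100 * time.Millisecond) // let the example goroutines finish
}
```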

### Restoring Snapshots

Nodes should have options for enabling state sync and/or fast sync, and be provided a trusted header hash for the light client.

When starting an empty node with state sync and fast sync enabled, snapshots are restored as follows:

1. The node checks that it is empty, i.e. that it has neither state nor blocks.

2. The node contacts the given seeds to discover peers.

3. The node contacts a set of full nodes, and verifies the trusted block header using the given hash via the light client.

4. The node requests available snapshots via P2P from peers, via `RequestListSnapshots`. Peers will return the 10 most recent snapshots, one message per snapshot.

5. The node aggregates snapshots from multiple peers, ordered by height and format (in reverse). If there are `chunk_hashes` mismatches between different snapshots, the one hosted by the largest number of peers is chosen. The node iterates over all snapshots in reverse order by height and format until it finds one that satisfies all of the following conditions:

    * The snapshot height's block is considered trustworthy by the light client (i.e. the snapshot height is greater than the trusted header's and within the unbonding period of the latest trustworthy block).

    * The snapshot's height or format hasn't been explicitly rejected by an earlier `RequestOfferSnapshot` call (via `invalid_height` or `invalid_format`).

    * The application accepts the `RequestOfferSnapshot` call.

6. The node downloads chunks in parallel from multiple peers, via `RequestLoadSnapshotChunk`, and both the sender and receiver verify their checksums (see the verification sketch after this list). Chunk messages cannot exceed 16 MB.

7. The node passes chunks sequentially to the app via `RequestApplySnapshotChunk`.

8. Once all chunks have been applied, the node compares the app hash to the chain app hash, and if they do not match it either errors or discards the state and starts over.

9. The node switches to fast sync to catch up blocks that were committed while restoring the snapshot.

10. The node switches to normal consensus mode.
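
The checksum verification in step 6 could look like the following Go sketch (illustrative only; the function and its signature are assumptions, not Tendermint's actual API):

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
)

const maxChunkSize = 16 << 20 // chunk messages cannot exceed 16 MB

// verifyChunk checks a downloaded chunk against the SHA-256 checksum
// advertised in the snapshot's chunk_hashes. Both the sender and the
// receiver perform this check before the chunk is applied.
func verifyChunk(index int, chunk []byte, chunkHashes [][]byte) error {
	if len(chunk) > maxChunkSize {
		return fmt.Errorf("chunk %d is %d bytes, over the 16 MB limit", index, len(chunk))
	}
	if index < 0 || index >= len(chunkHashes) {
		return fmt.Errorf("chunk index %d out of range", index)
	}
	sum := sha256.Sum256(chunk)
	if !bytes.Equal(sum[:], chunkHashes[index]) {
		return fmt.Errorf("chunk %d checksum mismatch", index)
	}
	return nil
}

func main() {
	chunk := []byte("snapshot chunk data")
	sum := sha256.Sum256(chunk)
	fmt.Println(verifyChunk(0, chunk, [][]byte{sum[:]})) // <nil>
}
```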

## Gaia Proposal

This describes the snapshot process seen from Gaia, using format version `1`. The serialization format is unspecified, but likely to be compressed Amino or Protobuf.

### Snapshot Metadata

In the initial version there is no snapshot metadata, so it is set to an empty byte buffer.

Once all chunks have been successfully built, snapshot metadata should be stored in a database and served via `RequestListSnapshots`.

### Snapshot Chunk Format

The Gaia data structure consists of a set of named IAVL trees. A root hash is constructed by taking the root hashes of each of the IAVL trees, then constructing a Merkle tree of the sorted name/hash map.
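
The following simplified Go sketch illustrates that construction. The real implementation uses Tendermint's simple Merkle tree with its own leaf encoding, so the hashes below are not byte-for-byte compatible; this only shows the shape of the computation:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// multistoreHash sorts the named IAVL root hashes by store name and
// Merkle-izes the resulting list, as described above.
func multistoreHash(storeHashes map[string][]byte) []byte {
	names := make([]string, 0, len(storeHashes))
	for name := range storeHashes {
		names = append(names, name)
	}
	sort.Strings(names)

	leaves := make([][]byte, len(names))
	for i, name := range names {
		leaf := sha256.Sum256(append([]byte(name), storeHashes[name]...))
		leaves[i] = leaf[:]
	}
	return merkleRoot(leaves)
}

// merkleRoot pairwise-hashes a level until a single root remains; an
// odd node is promoted to the next level unchanged (a simplification).
func merkleRoot(hashes [][]byte) []byte {
	switch len(hashes) {
	case 0:
		return nil
	case 1:
		return hashes[0]
	}
	var next [][]byte
	for i := 0; i < len(hashes); i += 2 {
		if i+1 == len(hashes) {
			next = append(next, hashes[i])
			continue
		}
		h := sha256.Sum256(append(hashes[i], hashes[i+1]...))
		next = append(next, h[:])
	}
	return merkleRoot(next)
}

func main() {
	root := multistoreHash(map[string][]byte{
		"acc":     []byte("iavl-root-1"),
		"staking": []byte("iavl-root-2"),
	})
	fmt.Printf("%x\n", root)
}
```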

IAVL trees are versioned, but a snapshot only contains the version relevant for the snapshot height. All historical versions are ignored.

IAVL trees are insertion-order dependent, so key/value pairs must be set in an appropriate insertion order to produce the same tree branching structure. This insertion order can be found by doing a breadth-first scan of all nodes (including inner nodes) and collecting unique keys in order. However, the node hash also depends on the node's version, so snapshots must contain the inner nodes' version numbers as well.
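
A sketch of that breadth-first scan, using a hypothetical `node` type in place of IAVL's real node structure:

```go
package main

import "fmt"

// node is a hypothetical stand-in for an IAVL node; real nodes carry
// more fields. Inner nodes have children, leaf nodes carry values.
type node struct {
	key, value  []byte
	version     int64
	left, right *node
}

// exportOrder scans the tree breadth-first (inner nodes included) and
// collects each unique key once, in the order described above, along
// with the value and version needed to replay the insertions.
func exportOrder(root *node) []*node {
	var out []*node
	seen := map[string]bool{}
	queue := []*node{root}
	for len(queue) > 0 {
		n := queue[0]
		queue = queue[1:]
		if n == nil {
			continue
		}
		if !seen[string(n.key)] {
			seen[string(n.key)] = true
			out = append(out, n)
		}
		queue = append(queue, n.left, n.right)
	}
	return out
}

func main() {
	root := &node{key: []byte("m"), version: 10,
		left:  &node{key: []byte("a"), value: []byte("1"), version: 9},
		right: &node{key: []byte("z"), value: []byte("2"), version: 10},
	}
	for _, n := range exportOrder(root) {
		fmt.Printf("key=%s version=%d\n", n.key, n.version)
	}
}
```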

For the initial prototype, each chunk consists of a complete dump of all node data for all nodes in an entire IAVL tree. Thus the number of chunks equals the number of persistent stores in Gaia. No incremental verification of chunks is done, only a final app hash comparison at the end of the snapshot restoration.

For a production version, it should be sufficient to store key/value/version for all nodes (leaf and inner) in insertion order, chunked in some appropriate way. If per-chunk verification is required, the chunk must also contain enough information to reconstruct the Merkle proofs all the way up to the root of the multistore, e.g. by storing a complete subtree's key/value/version data plus Merkle hashes of all other branches up to the multistore root. The exact approach will depend on tradeoffs between size, time, and verification. IAVL RangeProofs are not recommended, since these include redundant data such as proofs for intermediate and leaf nodes that can be derived from the above data.

Chunks should be built greedily by collecting node data up to some size limit (e.g. 10 MB) and serializing it. Chunk data is stored in the file system as `snapshots/<height>/<format>/<chunk>`, and a SHA-256 checksum is stored along with the snapshot metadata.
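
A possible shape for that greedy chunk builder, assuming the node data has already been serialized into records; the paths follow the layout above, while the helper names are illustrative:

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"os"
	"path/filepath"
)

const chunkSizeLimit = 10 << 20 // e.g. 10 MB, per the text above

// writeChunk stores one chunk at snapshots/<height>/<format>/<chunk>
// and returns its SHA-256 checksum for the snapshot metadata.
func writeChunk(root string, height uint64, format, index uint32, data []byte) ([]byte, error) {
	path := filepath.Join(root, "snapshots",
		fmt.Sprint(height), fmt.Sprint(format), fmt.Sprint(index))
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return nil, err
	}
	if err := os.WriteFile(path, data, 0o644); err != nil {
		return nil, err
	}
	sum := sha256.Sum256(data)
	return sum[:], nil
}

// buildChunks greedily packs serialized records into chunks up to the
// size limit, flushing each chunk to disk as it fills.
func buildChunks(root string, height uint64, format uint32, records [][]byte) ([][]byte, error) {
	var hashes [][]byte
	var buf []byte
	var index uint32
	flush := func() error {
		if len(buf) == 0 {
			return nil
		}
		hash, err := writeChunk(root, height, format, index, buf)
		if err != nil {
			return err
		}
		hashes = append(hashes, hash)
		index++
		buf = nil
		return nil
	}
	for _, rec := range records {
		if len(buf) > 0 && len(buf)+len(rec) > chunkSizeLimit {
			if err := flush(); err != nil {
				return nil, err
			}
		}
		buf = append(buf, rec...)
	}
	if err := flush(); err != nil {
		return nil, err
	}
	return hashes, nil
}

func main() {
	hashes, err := buildChunks(os.TempDir(), 3000, 1,
		[][]byte{[]byte("record-1"), []byte("record-2")})
	if err != nil {
		panic(err)
	}
	fmt.Printf("built %d chunk(s)\n", len(hashes))
}
```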

### Snapshot Scheduling

Snapshots should be taken at some configurable height interval, e.g. every 1000 blocks. All nodes should preferably have the same snapshot schedule, such that all nodes can serve chunks for a given snapshot.

Taking consistent snapshots of IAVL trees is greatly simplified by them being versioned: simply snapshot the version that corresponds to the snapshot height, while concurrent writes create new versions. IAVL pruning must not prune a version that is being snapshotted.

Snapshots must also be garbage collected after some configurable time, e.g. by keeping the latest `n` snapshots.
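
The retention policy can then be a simple function over the stored snapshot heights, sketched below (a hypothetical helper, not an SDK API):

```go
package main

import "fmt"

// pruneSnapshots takes snapshot heights in ascending order, keeps the
// latest n, and returns the heights whose metadata and chunk files
// should be deleted.
func pruneSnapshots(heights []uint64, keep int) []uint64 {
	if len(heights) <= keep {
		return nil
	}
	return heights[:len(heights)-keep]
}

func main() {
	fmt.Println(pruneSnapshots([]uint64{1000, 2000, 3000, 4000}, 2)) // [1000 2000]
}
```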

## Experimental Prototype

An experimental but functional state sync prototype is available in the `erik/statesync-prototype` branches of the Tendermint, IAVL, Cosmos SDK, and Gaia repositories. To fetch the necessary branches:

```sh
$ mkdir statesync
$ cd statesync
$ git clone git@github.com:tendermint/tendermint -b erik/statesync-prototype
$ git clone git@github.com:tendermint/iavl -b erik/statesync-prototype
$ git clone git@github.com:cosmos/cosmos-sdk -b erik/statesync-prototype
$ git clone git@github.com:cosmos/gaia -b erik/statesync-prototype
```

To spin up three nodes of a four-node testnet:

```sh
$ cd gaia
$ ./tools/start.sh
```

Wait for the first snapshot to be taken at height 3, then (in a separate terminal) start the fourth node with state sync enabled:

```sh
$ ./tools/sync.sh
```

To stop the testnet, run:

```sh
$ ./tools/stop.sh
```

## Resolved Questions

* Is it OK for state-synced nodes to not have historical blocks nor historical IAVL versions?

    > Yes, this is as intended. Maybe backfill blocks later.

* Do we need incremental chunk verification for the first version?

    > No, we'll start simple. Can add chunk verification via a new snapshot format without any breaking changes in Tendermint. For adversarial conditions, maybe consider support for whitelisting peers to download chunks from.

* Should the snapshot ABCI interface be a separate optional ABCI service, or mandatory?

    > Mandatory, to keep things simple for now. It will therefore be a breaking change and push the release. For apps using the Cosmos SDK, we can provide a default implementation that does not serve snapshots and errors when trying to apply them.

* How can we make sure `ListSnapshots` data is valid? An adversary can provide fake/invalid snapshots to DoS peers.

    > For now, just pick snapshots that are available on a large number of peers. Maybe support whitelisting. We may consider e.g. placing snapshot manifests on the blockchain later.

* Should we punish nodes that provide invalid snapshots? How?

    > No, these are full nodes, not validators, so we can't punish them. Just disconnect from them and ignore them.

* Should we call these snapshots? The SDK already uses the term "snapshot" for `PruningOptions.SnapshotEvery`, and state sync will introduce additional SDK options for snapshot scheduling and pruning that are not related to IAVL snapshotting or pruning.

    > Yes. Hopefully these concepts are distinct enough that we can refer to state sync snapshots and IAVL snapshots without too much confusion.

* Should we store snapshot and chunk metadata in a database? Can we use the database for chunks?

    > As a first approach, store metadata in a database and chunks in the filesystem.

* Should a snapshot at height H be taken before or after the block at H is processed? E.g. RPC `/commit` returns the app_hash after the _previous_ height, i.e. _before_ the current height.

    > After commit.

* Do we need to support all versions of the blockchain reactor (i.e. fast sync)?

    > We should remove the v1 reactor completely once v2 has stabilized.

* Should `ListSnapshots` be a streaming API instead of a request/response API?

    > No, just use a max message size.

## Implementation Plan

### Core Tasks

* **Tendermint:** light client P2P transport [#4456](https://github.com/tendermint/tendermint/issues/4456)

* **IAVL:** export/import API [#210](https://github.com/tendermint/iavl/issues/210)

* **Cosmos SDK:** snapshotting, scheduling, and pruning [#5689](https://github.com/cosmos/cosmos-sdk/issues/5689)

* **Tendermint:** support starting with a truncated block history

* **Tendermint:** state sync reactor and ABCI interface [#828](https://github.com/tendermint/tendermint/issues/828)

* **Cosmos SDK:** snapshot ABCI implementation [#5690](https://github.com/cosmos/cosmos-sdk/issues/5690)

### Nice-to-Haves

* **Tendermint:** staged reactor startup (state sync → fast sync → block replay → wal replay → consensus)

    > Let's do a time-boxed prototype (a few days) and see how much work it will be.

  * Notify P2P peers about channel changes [#4394](https://github.com/tendermint/tendermint/issues/4394)

  * Check peers have certain channels [#1148](https://github.com/tendermint/tendermint/issues/1148)

* **Tendermint:** prune blockchain history [#3652](https://github.com/tendermint/tendermint/issues/3652)

* **Tendermint:** allow genesis to start from non-zero height [#2543](https://github.com/tendermint/tendermint/issues/2543)

### Follow-up Tasks

* **Tendermint:** light client verification for fast sync [#4457](https://github.com/tendermint/tendermint/issues/4457)

* **Tendermint:** allow start with only blockstore [#3713](https://github.com/tendermint/tendermint/issues/3713)

* **Tendermint:** node should go back to fast-syncing when lagging significantly [#129](https://github.com/tendermint/tendermint/issues/129)

* **Tendermint:** backfill historical blocks [#4629](https://github.com/tendermint/tendermint/issues/4629)

## Status

Accepted

## References

* [ADR-042](./adr-042-state-sync.md) and its references