# ADR 053: State Sync Prototype

State sync is now [merged](https://github.com/tendermint/tendermint/pull/4705). Up-to-date ABCI documentation is [available](https://github.com/tendermint/spec/pull/90); refer to it rather than this ADR for details.

This ADR outlines the plan for an initial state sync prototype, and is subject to change as we gain feedback and experience. It builds on discussions and findings in [ADR-042](./adr-042-state-sync.md); see that ADR for background information.

## Changelog

* 2020-01-28: Initial draft (Erik Grinaker)

* 2020-02-18: Updates after initial prototype (Erik Grinaker)
    * ABCI: added missing `reason` fields.
    * ABCI: used 32-bit 1-based chunk indexes (was 64-bit 0-based).
    * ABCI: moved `RequestApplySnapshotChunk.chain_hash` to `RequestOfferSnapshot.app_hash`.
    * Gaia: snapshots must include node versions as well, both for inner and leaf nodes.
    * Added experimental prototype info.
    * Added open questions and implementation plan.

* 2020-03-29: Strengthened and simplified ABCI interface (Erik Grinaker)
    * ABCI: replaced `chunks` with `chunk_hashes` in `Snapshot`.
    * ABCI: removed `SnapshotChunk` message.
    * ABCI: renamed `GetSnapshotChunk` to `LoadSnapshotChunk`.
    * ABCI: chunks are now exchanged simply as `bytes`.
    * ABCI: chunks are now 0-indexed, for parity with `chunk_hashes` array.
    * Reduced maximum chunk size to 16 MB, and increased snapshot message size to 4 MB.

* 2020-04-29: Update with final released ABCI interface (Erik Grinaker)

## Context

State sync will allow a new node to receive a snapshot of the application state without downloading blocks or going through consensus. This bootstraps the node significantly faster than the current fast sync system, which replays all historical blocks.

Background discussions and justifications are detailed in [ADR-042](./adr-042-state-sync.md). Its recommendations can be summarized as:

* The application periodically takes full state snapshots (i.e. eager snapshots).

* The application splits snapshots into smaller chunks that can be individually verified against a chain app hash.

* Tendermint uses the light client to obtain a trusted chain app hash for verification.

* Tendermint discovers and downloads snapshot chunks in parallel from multiple peers, and passes them to the application via ABCI to be applied and verified against the chain app hash.

* Historical blocks are not backfilled, so state-synced nodes will have a truncated block history.

## Tendermint Proposal

This describes the snapshot/restore process as seen from Tendermint. The interface is kept as small and general as possible to give applications maximum flexibility.

### Snapshot Data Structure

A node can have multiple snapshots taken at various heights. Snapshots can be taken in different application-specified formats (e.g. MessagePack as format `1` and Protobuf as format `2`, or similarly for schema versioning). Each snapshot consists of multiple chunks containing the actual state data, which allows parallel downloads and reduces memory usage.
```proto
message Snapshot {
  uint64 height   = 1;  // The height at which the snapshot was taken
  uint32 format   = 2;  // The application-specific snapshot format
  uint32 chunks   = 3;  // Number of chunks in the snapshot
  bytes  hash     = 4;  // Arbitrary snapshot hash - should be equal only for identical snapshots
  bytes  metadata = 5;  // Arbitrary application metadata
}
```

Chunks are exchanged simply as `bytes`, and cannot be larger than 16 MB. `Snapshot` messages should be less than 4 MB.

### ABCI Interface

```proto
// Lists available snapshots
message RequestListSnapshots {}

message ResponseListSnapshots {
  repeated Snapshot snapshots = 1;
}

// Offers a snapshot to the application
message RequestOfferSnapshot {
  Snapshot snapshot = 1;  // snapshot offered by peers
  bytes    app_hash = 2;  // light client-verified app hash for snapshot height
}

message ResponseOfferSnapshot {
  Result result = 1;

  enum Result {
    accept        = 0;  // Snapshot accepted, apply chunks
    abort         = 1;  // Abort all snapshot restoration
    reject        = 2;  // Reject this specific snapshot, and try a different one
    reject_format = 3;  // Reject all snapshots of this format, and try a different one
    reject_sender = 4;  // Reject all snapshots from the sender(s), and try a different one
  }
}

// Loads a snapshot chunk
message RequestLoadSnapshotChunk {
  uint64 height = 1;
  uint32 format = 2;
  uint32 chunk  = 3;  // Zero-indexed
}

message ResponseLoadSnapshotChunk {
  bytes chunk = 1;
}

// Applies a snapshot chunk
message RequestApplySnapshotChunk {
  uint32 index  = 1;
  bytes  chunk  = 2;
  string sender = 3;
}

message ResponseApplySnapshotChunk {
  Result          result         = 1;
  repeated uint32 refetch_chunks = 2;  // Chunks to refetch and reapply (regardless of result)
  repeated string reject_senders = 3;  // Chunk senders to reject and ban (regardless of result)

  enum Result {
    accept          = 0;  // Chunk successfully accepted
    abort           = 1;  // Abort all snapshot restoration
    retry           = 2;  // Retry chunk, combine with refetch and reject as appropriate
    retry_snapshot  = 3;  // Retry snapshot, combine with refetch and reject as appropriate
    reject_snapshot = 4;  // Reject this snapshot, try a different one but keep sender rejections
  }
}
```

### Taking Snapshots

Tendermint is not aware of the snapshotting process at all; it is entirely an application concern. The following guarantees must be provided:

* **Periodic:** snapshots must be taken periodically, not on-demand, for faster restores, lower load, and less DoS risk.

* **Deterministic:** snapshots must be deterministic, and identical across all nodes - typically by taking a snapshot at given height intervals.

* **Consistent:** snapshots must be consistent, i.e. not affected by concurrent writes - typically by using a data store that supports versioning and/or snapshot isolation.

* **Asynchronous:** snapshots must be asynchronous, i.e. not halt block processing and state transitions.

* **Chunked:** snapshots must be split into chunks of reasonable size (on the order of megabytes), and each chunk must be verifiable against the chain app hash.

* **Garbage collected:** snapshots must be garbage collected periodically.
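To make these guarantees concrete, here is a minimal application-side sketch in Go. It is illustrative only: `Manager`, `VersionedStore`, `Export`, and `PruneSnapshots` are assumed stand-ins for whatever versioned state store the application actually uses (e.g. a versioned IAVL-backed multistore), not an existing Tendermint or SDK API.

```go
package snapshots

import "sync"

// VersionedStore is an assumed interface over a store with snapshot isolation,
// e.g. a versioned IAVL-backed multistore.
type VersionedStore interface {
	// Export returns a consistent, read-only view of the state at the given height,
	// split into chunks.
	Export(height uint64) (chunks [][]byte, err error)
	// PruneSnapshots removes all but the most recent `keep` snapshots.
	PruneSnapshots(keep uint32) error
}

// Manager triggers snapshot creation from the application's commit path.
type Manager struct {
	mtx      sync.Mutex
	busy     bool
	interval uint64 // take a snapshot every `interval` blocks (deterministic across nodes)
	keep     uint32 // number of recent snapshots to retain (garbage collection)
	store    VersionedStore
}

// OnCommit is called by the application after committing block `height`.
// Snapshots are taken at fixed height intervals (periodic and deterministic)
// and in a separate goroutine (asynchronous), so block processing is not halted.
func (m *Manager) OnCommit(height uint64) {
	if m.interval == 0 || height%m.interval != 0 {
		return
	}
	m.mtx.Lock()
	if m.busy { // don't start overlapping snapshots
		m.mtx.Unlock()
		return
	}
	m.busy = true
	m.mtx.Unlock()

	go func() {
		defer func() {
			m.mtx.Lock()
			m.busy = false
			m.mtx.Unlock()
		}()
		// The versioned store gives a consistent view of `height` even while
		// later blocks are being written (consistency via snapshot isolation).
		if _, err := m.store.Export(height); err != nil {
			return // log and carry on; snapshot failures must not affect consensus
		}
		_ = m.store.PruneSnapshots(m.keep) // garbage-collect old snapshots
	}()
}
```

The key point is that the snapshot is taken from the already-committed version at the interval height, so every node produces an identical snapshot without blocking consensus.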
### Restoring Snapshots

Nodes should have options for enabling state sync and/or fast sync, and be provided a trusted header hash for the light client.

When starting an empty node with state sync and fast sync enabled, snapshots are restored as follows:

1. The node checks that it is empty, i.e. that it has neither state nor blocks.

2. The node contacts the given seeds to discover peers.

3. The node contacts a set of full nodes, and verifies the trusted block header using the given hash via the light client.

4. The node requests available snapshots from peers over P2P, via `RequestListSnapshots`. Peers will return the 10 most recent snapshots, one message per snapshot.

5. The node aggregates snapshots from multiple peers, ordered by height and format (in reverse). If there are mismatches between different snapshots, the one hosted by the largest number of peers is chosen. The node iterates over all snapshots in reverse order by height and format until it finds one that satisfies all of the following conditions:

    * The snapshot height's block is considered trustworthy by the light client (i.e. snapshot height is greater than trusted header and within unbonding period of the latest trustworthy block).

    * The snapshot's height or format hasn't been explicitly rejected by an earlier `RequestOfferSnapshot` call.

    * The application accepts the `RequestOfferSnapshot` call.

6. The node downloads chunks in parallel from multiple peers, via `RequestLoadSnapshotChunk`. Chunk messages cannot exceed 16 MB.

7. The node passes chunks sequentially to the app via `RequestApplySnapshotChunk`.

8. Once all chunks have been applied, the node compares the app hash to the chain app hash, and if they do not match it either errors or discards the state and starts over.

9. The node switches to fast sync to catch up blocks that were committed while restoring the snapshot.

10. The node switches to normal consensus mode.

## Gaia Proposal

This describes the snapshot process as seen from Gaia, using format version `1`. The serialization format is unspecified, but likely to be compressed Amino or Protobuf.

### Snapshot Metadata

In the initial version there is no snapshot metadata, so it is set to an empty byte buffer.

Once all chunks have been successfully built, snapshot metadata should be stored in a database and served via `RequestListSnapshots`.

### Snapshot Chunk Format

The Gaia data structure consists of a set of named IAVL trees. A root hash is constructed by taking the root hashes of each of the IAVL trees, then constructing a Merkle tree of the sorted name/hash map.

IAVL trees are versioned, but a snapshot only contains the version relevant for the snapshot height. All historical versions are ignored.

IAVL trees are insertion-order dependent, so key/value pairs must be set in an appropriate insertion order to produce the same tree branching structure. This insertion order can be found by doing a breadth-first scan of all nodes (including inner nodes) and collecting unique keys in order. However, the node hash also depends on the node's version, so snapshots must contain the inner nodes' version numbers as well.

For the initial prototype, each chunk consists of a complete dump of all node data for all nodes in an entire IAVL tree. Thus the number of chunks equals the number of persistent stores in Gaia. No incremental verification of chunks is done, only a final app hash comparison at the end of the snapshot restoration.
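To make the insertion-order and per-node version requirements concrete, the following sketch shows roughly what a per-node export record and the one-tree-per-chunk serialization of the initial prototype could look like. `exportNode`, `encodeChunk`, `decodeChunk`, and the use of `gob` are illustrative assumptions, not the actual Gaia or IAVL API; as noted above, the real serialization format is unspecified.

```go
package snapshots

import (
	"bytes"
	"encoding/gob"
)

// exportNode is an illustrative record for a single IAVL node. Reproducing an
// identical tree requires the key, value, and version of every node, including
// inner nodes; the height distinguishes inner nodes (height > 0) from leaves
// (height == 0).
type exportNode struct {
	Key     []byte
	Value   []byte // nil for inner nodes
	Version int64
	Height  int8
}

// encodeChunk serializes the nodes of one store, in insertion (export) order,
// into a single chunk, as in the initial prototype (one chunk per IAVL tree).
func encodeChunk(nodes []exportNode) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(nodes); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodeChunk restores the node records; replaying them in order against an
// empty tree reproduces the same branching structure and thus the same hashes.
func decodeChunk(chunk []byte) ([]exportNode, error) {
	var nodes []exportNode
	err := gob.NewDecoder(bytes.NewReader(chunk)).Decode(&nodes)
	return nodes, err
}
```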
For a production version, it should be sufficient to store key/value/version for all nodes (leaf and inner) in insertion order, chunked in some appropriate way. If per-chunk verification is required, the chunk must also contain enough information to reconstruct the Merkle proofs all the way up to the root of the multistore, e.g. by storing a complete subtree's key/value/version data plus Merkle hashes of all other branches up to the multistore root. The exact approach will depend on tradeoffs between size, time, and verification. IAVL RangeProofs are not recommended, since these include redundant data such as proofs for intermediate and leaf nodes that can be derived from the above data.

Chunks should be built greedily by collecting node data up to some size limit (e.g. 10 MB) and serializing it. Chunk data is stored in the file system as `snapshots/<height>/<format>/<chunk>`, and a SHA-256 checksum is stored along with the snapshot metadata.
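As an illustration of the on-disk layout and checksumming just described, the sketch below writes and reads chunks under `snapshots/<height>/<format>/<chunk>`. The helpers `saveChunk` and `loadChunk` and the `baseDir` parameter are hypothetical; the actual storage and metadata handling are left to the application.

```go
package snapshots

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"os"
	"path/filepath"
)

// saveChunk writes one chunk to the snapshots/<height>/<format>/<chunk> layout
// and returns its SHA-256 checksum, which would be stored alongside the
// snapshot metadata.
func saveChunk(baseDir string, height uint64, format, index uint32, chunk []byte) ([]byte, error) {
	dir := filepath.Join(baseDir, "snapshots", fmt.Sprintf("%d", height), fmt.Sprintf("%d", format))
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return nil, err
	}
	if err := os.WriteFile(filepath.Join(dir, fmt.Sprintf("%d", index)), chunk, 0o644); err != nil {
		return nil, err
	}
	sum := sha256.Sum256(chunk)
	return sum[:], nil
}

// loadChunk reads a chunk back and verifies it against the stored checksum,
// e.g. when serving a ResponseLoadSnapshotChunk.
func loadChunk(baseDir string, height uint64, format, index uint32, checksum []byte) ([]byte, error) {
	path := filepath.Join(baseDir, "snapshots",
		fmt.Sprintf("%d", height), fmt.Sprintf("%d", format), fmt.Sprintf("%d", index))
	chunk, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	sum := sha256.Sum256(chunk)
	if !bytes.Equal(sum[:], checksum) {
		return nil, fmt.Errorf("chunk %d checksum mismatch", index)
	}
	return chunk, nil
}
```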
### Snapshot Scheduling

Snapshots should be taken at some configurable height interval, e.g. every 1000 blocks. All nodes should preferably have the same snapshot schedule, such that all nodes can serve chunks for a given snapshot.

Taking consistent snapshots of IAVL trees is greatly simplified by them being versioned: simply snapshot the version that corresponds to the snapshot height, while concurrent writes create new versions. IAVL pruning must not prune a version that is being snapshotted.

Snapshots must also be garbage collected after some configurable time, e.g. by keeping the latest `n` snapshots.

## Resolved Questions

* Is it OK for state-synced nodes to not have historical blocks nor historical IAVL versions?

    > Yes, this is as intended. Maybe backfill blocks later.

* Do we need incremental chunk verification for the first version?

    > No, we'll start simple. Can add chunk verification via a new snapshot format without any breaking changes in Tendermint. For adversarial conditions, maybe consider support for whitelisting peers to download chunks from.

* Should the snapshot ABCI interface be a separate optional ABCI service, or mandatory?

    > Mandatory, to keep things simple for now. It will therefore be a breaking change and push the release. For apps using the Cosmos SDK, we can provide a default implementation that does not serve snapshots and errors when trying to apply them.

* How can we make sure `ListSnapshots` data is valid? An adversary can provide fake/invalid snapshots to DoS peers.

    > For now, just pick snapshots that are available on a large number of peers. Maybe support whitelisting. We may consider e.g. placing snapshot manifests on the blockchain later.

* Should we punish nodes that provide invalid snapshots? How?

    > No, these are full nodes, not validators, so we can't punish them. Just disconnect from them and ignore them.

* Should we call these snapshots? The SDK already uses the term "snapshot" for `PruningOptions.SnapshotEvery`, and state sync will introduce additional SDK options for snapshot scheduling and pruning that are not related to IAVL snapshotting or pruning.

    > Yes. Hopefully these concepts are distinct enough that we can refer to state sync snapshots and IAVL snapshots without too much confusion.

* Should we store snapshot and chunk metadata in a database? Can we use the database for chunks?

    > As a first approach, store metadata in a database and chunks in the filesystem.

* Should a snapshot at height H be taken before or after the block at H is processed? E.g. RPC `/commit` returns app_hash after _previous_ height, i.e. _before_ current height.

    > After commit.

* Do we need to support all versions of the blockchain reactor (i.e. fast sync)?

    > We should remove the v1 reactor completely once v2 has stabilized.

* Should `ListSnapshots` be a streaming API instead of a request/response API?

    > No, just use a max message size.

## Status

Implemented

## References

* [ADR-042](./adr-042-state-sync.md) and its references