# ADR-065: Store V2

## Changelog

* Feb 14, 2023: Initial Draft (@alexanderbez)

## Status

DRAFT

## Abstract

The storage and state primitives that Cosmos SDK based applications have used have
by and large not changed since the launch of the inaugural Cosmos Hub. The demands
and needs of Cosmos SDK based applications, from both developer and client UX
perspectives, have evolved and outgrown the ecosystem since these primitives
were first introduced.

Over time, as these applications have gained significant adoption, many critical
shortcomings and flaws have been exposed in the state and storage primitives of
the Cosmos SDK.

In order to keep up with the evolving demands and needs of both clients and developers,
a major overhaul of these primitives is necessary.

## Context

The Cosmos SDK provides application developers with various storage primitives
for dealing with application state. Specifically, each module contains its own
merkle commitment data structure -- an IAVL tree. In this data structure, a module
can store and retrieve key-value pairs along with Merkle commitments, i.e. proofs,
to those key-value pairs indicating that they do or do not exist in the global
application state. This data structure is the base layer `KVStore`.

In addition, the SDK provides abstractions on top of this Merkle data structure.
Namely, a root multi-store (RMS) is a collection of each module's `KVStore`.
Through the RMS, the application can serve queries and provide proofs to clients,
in addition to providing each module access to its own unique `KVStore` through
the use of a `StoreKey`, which is an OCAP primitive.

There are further layers of abstraction that sit between the RMS and the underlying
IAVL `KVStore`. A `GasKVStore` is responsible for tracking gas IO consumption for
state machine reads and writes. A `CacheKVStore` is responsible for providing a
way to cache reads and buffer writes to make state transitions atomic, e.g.
transaction execution or governance proposal execution.

There are a few critical drawbacks to these layers of abstraction and the overall
design of storage in the Cosmos SDK:

* Since each module has its own IAVL `KVStore`, commitments are not [atomic](https://github.com/cosmos/cosmos-sdk/issues/14625)
    * Note, we can still allow modules to have their own IAVL `KVStore`, but the
    IAVL library will need to support the ability to pass a DB instance as an
    argument to various IAVL APIs.
* Since IAVL is responsible for both state storage and commitment, running an
  archive node becomes increasingly expensive as disk space grows exponentially.
* As the size of a network increases, various performance bottlenecks start to
  emerge in many areas such as query performance, network upgrades, state
  migrations, and general application performance.
* Developer UX is poor as it does not allow application developers to experiment
  with different types of approaches to storage and commitments, along with the
  complications of many layers of abstractions referenced above.

See the [Storage Discussion](https://github.com/cosmos/cosmos-sdk/discussions/13545) for more information.
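
To make the existing layering concrete, below is a minimal sketch of how a module
reads and writes its IAVL-backed `KVStore` today through its `StoreKey`, and how a
cached (branched) context is used to make a set of writes atomic. The module name
and helper functions are illustrative only.

```go
package example

import (
	storetypes "cosmossdk.io/store/types"
	sdk "github.com/cosmos/cosmos-sdk/types"
)

// Each module owns a StoreKey; only holders of the key may access the store.
var storeKey = storetypes.NewKVStoreKey("mymodule")

// setValue writes directly through the module's KVStore. Gas metering
// (GasKVStore) and any caching layers are applied transparently by the
// context and the RMS.
func setValue(ctx sdk.Context, key, value []byte) {
	store := ctx.KVStore(storeKey)
	store.Set(key, value)
}

// atomicUpdate branches state via a cached context so the write is only
// flushed to the parent store if writeCache is called, e.g. after a
// transaction or proposal executes successfully.
func atomicUpdate(ctx sdk.Context, key, value []byte) {
	cacheCtx, writeCache := ctx.CacheContext()
	cacheCtx.KVStore(storeKey).Set(key, value)
	writeCache()
}
```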
## Alternatives

There was a previous attempt to refactor the storage layer described in [ADR-040](./adr-040-storage-and-smt-state-commitments.md).
However, that approach mainly stemmed from the shortcomings of IAVL and various performance
issues around it. While there was a (partial) implementation of [ADR-040](./adr-040-storage-and-smt-state-commitments.md),
it was never adopted for a variety of reasons, such as the reliance on using an
SMT, which was still in a research phase, and some design choices that couldn't
be fully agreed upon, such as the snap-shotting mechanism that would result in
massive state bloat.

## Decision

We propose to build upon some of the great ideas introduced in [ADR-040](./adr-040-storage-and-smt-state-commitments.md),
while being a bit more flexible with the underlying implementations and overall
less intrusive. Specifically, we propose to:

* Separate the concerns of state commitment (**SC**), needed for consensus, and
  state storage (**SS**), needed for state machine and clients.
* Reduce layers of abstractions necessary between the RMS and underlying stores.
* Provide atomic module store commitments by providing a batch database object
  to core IAVL APIs.
* Reduce complexities in the `CacheKVStore` implementation while also improving
  performance<sup>[3]</sup>.

Furthermore, we will keep IAVL as the backing [commitment](https://cryptography.fandom.com/wiki/Commitment_scheme)
store for the time being. While we might not fully settle on the use of IAVL in
the long term, we do not have strong empirical evidence to suggest a better
alternative. Given that the SDK provides interfaces for stores, it should be sufficient
to change the backing commitment store in the future should evidence arise to
warrant a better alternative. However, there is promising work being done on IAVL
that should result in significant performance improvements<sup>[1,2]</sup>.

### Separating SS and SC

Separating SS and SC will allow us to optimize against primary use cases and
access patterns to state. Specifically, the SS layer will be responsible for
direct access to data in the form of (key, value) pairs, whereas the SC layer (IAVL)
will be responsible for committing to data and providing Merkle proofs.

Note, the underlying physical storage database will be the same between both the
SS and SC layers. So to avoid collisions between (key, value) pairs, both layers
will be namespaced.

#### State Commitment (SC)

Given that the existing solution today acts as both SS and SC, we can simply
repurpose it to act solely as the SC layer without any significant changes to
access patterns or behavior. In other words, the entire collection of existing
IAVL-backed module `KVStore`s will act as the SC layer.

However, in order for the SC layer to remain lightweight and not duplicate a
majority of the data held in the SS layer, we encourage node operators to keep
tight pruning strategies.

#### State Storage (SS)

In the RMS, we will expose a *single* `KVStore` backed by the same physical
database that backs the SC layer. This `KVStore` will be explicitly namespaced
to avoid collisions and will act as the primary storage for (key, value) pairs.

We will most likely continue to use `cosmos-db`, or some local interface, to
allow for flexibility and iteration over preferred physical storage backends
as research and benchmarking continues. However, we propose to hardcode the use
of RocksDB as the primary physical storage backend.

Since the SS layer will be implemented as a `KVStore`, it will support the
following functionality:

* Range queries
* CRUD operations
* Historical queries and versioning
* Pruning
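
The following is a minimal, hypothetical sketch of what such a versioned SS
interface could look like. None of these type or method names exist in the SDK
today; they only illustrate the functionality listed above.

```go
package storev2

// KVPair is a single buffered write; Delete marks a tombstone.
type KVPair struct {
	Key    []byte
	Value  []byte
	Delete bool
}

// Changeset holds the buffered writes of one block, grouped per StoreKey name.
type Changeset struct {
	Pairs map[string][]KVPair
}

// VersionedDatabase illustrates the shape of the SS layer: plain (key, value)
// access versioned by block height, with range queries and pruning.
type VersionedDatabase interface {
	// CRUD operations against a specific version (block height).
	Has(storeKey string, version uint64, key []byte) (bool, error)
	Get(storeKey string, version uint64, key []byte) ([]byte, error)

	// Range queries over a single version, bounded by [start, end).
	Iterate(storeKey string, version uint64, start, end []byte) ([]KVPair, error)

	// ApplyChangeset writes all buffered (key, value) pairs for a height.
	ApplyChangeset(version uint64, cs *Changeset) error

	// GetLatestVersion returns the latest committed height.
	GetLatestVersion() (uint64, error)

	// Prune removes historical versions up to and including the given height.
	Prune(version uint64) error
}
```

Under the proposed design, something like `ApplyChangeset` would be invoked on
`Commit` with the writes collected by the per-`StoreKey` listeners described
below.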
The RMS will keep track of all buffered writes using a dedicated and internal
`MemoryListener` for each `StoreKey`. For each block height, upon `Commit`, the
SS layer will write all buffered (key, value) pairs under a [RocksDB user-defined timestamp](https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-%28Experimental%29) column
family using the block height as the timestamp, which is an unsigned integer.
This will allow a client to fetch (key, value) pairs at historical and current
heights, and makes iteration and range queries relatively performant as the
timestamp is the key suffix.

Note, we choose not to take the more general approach of allowing any embedded
key/value database, such as LevelDB or PebbleDB, using height-prefixed keys to
effectively version state, because most of these databases use variable-length
keys, which would make actions like iteration and range queries less performant.

Since operators might want pruning strategies to differ in SS compared to SC,
e.g. having a very tight pruning strategy in SC while having a looser pruning
strategy for SS, we propose to introduce an additional pruning configuration,
with parameters that are identical to what exists in the SDK today, and allow
operators to control the pruning strategy of the SS layer independently of the
SC layer.

Note, the SC pruning strategy must be congruent with the operator's state sync
configuration. This is so as to allow state sync snapshots to execute successfully,
otherwise, a snapshot could be triggered at a height that is not available in SC.

#### State Sync

The state sync process should be largely unaffected by the separation of the SC
and SS layers. However, if a node syncs via state sync, the SS layer of the node
will not have the state synced height available, since the IAVL import process is
not set up in a way that easily allows direct key/value insertion. A modification
of the IAVL import process would be necessary to facilitate having the state sync
height available.

Note, this is not problematic for the state machine itself because when a query
is made, the RMS will automatically direct the query correctly (see [Queries](#queries)).

#### Queries

To consolidate the query routing between both the SC and SS layers, we propose to
have a notion of a "query router" that is constructed in the RMS. This query router
will be supplied to each `KVStore` implementation. The query router will route
queries to either the SC layer or the SS layer based on a few parameters. If
`prove: true`, then the query must be routed to the SC layer. Otherwise, if the
query height is available in the SS layer, the query will be served from the SS
layer. Otherwise, we fall back on the SC layer.

If no height is provided, the SS layer will assume the latest height. The SS
layer will store a reverse index to lookup `LatestVersion -> timestamp(version)`,
which is set on `Commit`.
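
The routing rule above could be sketched roughly as follows. The `RootStore`,
`stateStorage`, and `stateCommitment` types are purely illustrative and not part
of any existing API.

```go
package storev2

// Hypothetical views of the two layers, reduced to what routing needs.
type stateStorage interface {
	HasVersion(version uint64) bool
	GetLatestVersion() uint64
	Get(storeKey string, version uint64, key []byte) ([]byte, error)
}

type stateCommitment interface {
	// Query returns the value (and, when prove is true, a Merkle proof
	// alongside it in a real implementation).
	Query(storeKey string, version uint64, key []byte, prove bool) ([]byte, error)
}

// RootStore sketches the RMS holding both layers.
type RootStore struct {
	ss stateStorage
	sc stateCommitment
}

// Query applies the routing rule described above.
func (rs *RootStore) Query(storeKey string, version uint64, key []byte, prove bool) ([]byte, error) {
	// Proofs can only be produced by the SC (IAVL) layer.
	if prove {
		return rs.sc.Query(storeKey, version, key, true)
	}

	// If no height is provided, assume the latest committed height.
	if version == 0 {
		version = rs.ss.GetLatestVersion()
	}

	// Serve from SS when the height is still available there; otherwise
	// fall back to SC without a proof.
	if rs.ss.HasVersion(version) {
		return rs.ss.Get(storeKey, version, key)
	}
	return rs.sc.Query(storeKey, version, key, false)
}
```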
#### Proofs

Since the SS layer is naturally a storage layer only, without any commitments
to (key, value) pairs, it cannot provide Merkle proofs to clients during queries.

Since the pruning strategy against the SC layer is configured by the operator,
we can have the RMS route the query to the SC layer if the version exists and
`prove: true` is set. Otherwise, the query will fall back to the SS layer without
a proof.

We could explore the idea of using state snapshots to rebuild an in-memory IAVL
tree in real time against a version closest to the one provided in the query.
However, it is not clear what the performance implications of this approach
would be.

### Atomic Commitment

We propose to modify the existing IAVL APIs to accept a batch DB object instead
of relying on an internal batch object in `nodeDB`. Since each underlying IAVL
`KVStore` shares the same DB in the SC layer, this will allow commits to be
atomic.

Specifically, we propose to:

* Remove the `dbm.Batch` field from `nodeDB`
* Update the `SaveVersion` method of the `MutableTree` IAVL type to accept a batch object
* Update the `Commit` method of the `CommitKVStore` interface to accept a batch object
* Create a batch object in the RMS during `Commit` and pass this object to each
  `KVStore`
* Write the database batch after all stores have committed successfully

Note, this will require IAVL to be updated to not rely on or assume any batch
being present during `SaveVersion`.

## Consequences

As a result of a new store V2 package, we should expect to see improved performance
for queries and transactions due to the separation of concerns. We should also
expect to see improved developer UX around experimentation of commitment schemes
and storage backends for further performance, in addition to a reduced amount of
abstraction around KVStores making operations such as caching and state branching
more intuitive.

However, due to the proposed design, there are drawbacks around providing state
proofs for historical queries.

### Backwards Compatibility

This ADR proposes changes to the storage implementation in the Cosmos SDK through
an entirely new package. Interfaces may be borrowed and extended from existing
types that exist in `store`, but no existing implementations or interfaces will
be broken or modified.

### Positive

* Improved performance of independent SS and SC layers
* Reduced layers of abstraction making storage primitives easier to understand
* Atomic commitments for SC
* Redesign of storage types and interfaces will allow for greater experimentation
  such as different physical storage backends and different commitment schemes
  for different application modules

### Negative

* Providing proofs for historical state is challenging

### Neutral

* Keeping IAVL as the primary commitment data structure, although drastic
  performance improvements are being made

## Further Discussions

### Module Storage Control

Many modules store secondary indexes that are typically solely used to support
client queries, but are actually not needed for the state machine's state
transitions. What this means is that these indexes technically have no reason to
exist in the SC layer at all, as they take up unnecessary space. It is worth
exploring what an API would look like to allow modules to indicate which (key, value)
pairs they want to be persisted in the SC layer, implicitly indicating the SS
layer as well, as opposed to persisting the (key, value) pair only in the
SS layer.
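
As a purely illustrative starting point for that discussion, such an API could
take the shape of a per-write persistence hint. None of the names below exist in
the SDK; they only frame the idea.

```go
package storev2

// StoreOption declares how a (key, value) pair should be persisted.
type StoreOption uint8

const (
	// CommitAndStore persists the pair in both SC and SS (today's behavior).
	CommitAndStore StoreOption = iota
	// StoreOnly persists the pair in SS only, keeping secondary indexes out
	// of the commitment tree entirely.
	StoreOnly
)

// KVStoreWithOptions extends a plain KVStore with per-write persistence hints.
type KVStoreWithOptions interface {
	Get(key []byte) []byte
	Set(key, value []byte)

	// SetWithOption lets a module declare that a given write, e.g. a
	// client-only secondary index, does not need to be committed in SC.
	SetWithOption(key, value []byte, opt StoreOption)
}
```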
### Historical State Proofs

It is not clear what the importance or demand within the community is for providing
commitment proofs for historical state. While solutions can be devised, such as
rebuilding trees on the fly based on state snapshots, it is not clear what the
performance implications are for such solutions.

### Physical DB Backends

This ADR proposes usage of RocksDB in order to utilize user-defined timestamps as
a versioning mechanism. However, other physical DB backends are available that may
offer alternative ways to implement versioning while also providing performance
improvements over RocksDB. E.g. PebbleDB supports MVCC timestamps as well, but
we'll need to explore how PebbleDB handles compaction and state growth over time.

## References

* [1] https://github.com/cosmos/iavl/pull/676
* [2] https://github.com/cosmos/iavl/pull/664
* [3] https://github.com/cosmos/cosmos-sdk/issues/14990