# ADR-065: Store V2

## Changelog

* Feb 14, 2023: Initial Draft (@alexanderbez)

## Status

DRAFT

## Abstract

The storage and state primitives that Cosmos SDK-based applications rely on have
by and large not changed since the launch of the inaugural Cosmos Hub. The demands
and needs of Cosmos SDK-based applications, from both developer and client UX
perspectives, have evolved and outgrown these primitives since they were first
introduced.

Over time, as these applications have gained significant adoption, many critical
shortcomings and flaws have been exposed in the state and storage primitives of
the Cosmos SDK.

In order to keep up with the evolving demands and needs of both clients and
developers, a major overhaul of these primitives is necessary.

## Context

The Cosmos SDK provides application developers with various storage primitives
for dealing with application state. Specifically, each module contains its own
Merkle commitment data structure -- an IAVL tree. In this data structure, a module
can store and retrieve key-value pairs along with Merkle commitments, i.e. proofs,
to those key-value pairs indicating that they do or do not exist in the global
application state. This data structure is the base layer `KVStore`.

In addition, the SDK provides abstractions on top of this Merkle data structure.
Namely, a root multi-store (RMS) is a collection of each module's `KVStore`.
Through the RMS, the application can serve queries and provide proofs to clients,
in addition to providing a module access to its own unique `KVStore` through the
use of `StoreKey`, which is an OCAP primitive.
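
For reference, the base layer `KVStore` exposes roughly the following interface
(a trimmed-down sketch of the SDK's `store/types` definitions; the real interface
embeds additional store metadata and caching methods):

```go
// KVStore is a simplified sketch of the SDK's basic key-value store
// interface; each module's IAVL-backed store implements it.
type KVStore interface {
	// Get returns nil if the key does not exist.
	Get(key []byte) []byte

	// Has checks for the existence of the key.
	Has(key []byte) bool

	// Set inserts or updates the value for the given key.
	Set(key, value []byte)

	// Delete removes the key if it exists.
	Delete(key []byte)

	// Iterator iterates over the domain [start, end) in ascending key order.
	Iterator(start, end []byte) Iterator

	// ReverseIterator iterates over [start, end) in descending key order.
	ReverseIterator(start, end []byte) Iterator
}
```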

There are further layers of abstraction that sit between the RMS and the underlying
IAVL `KVStore`. A `GasKVStore` is responsible for tracking gas I/O consumption for
state machine reads and writes. A `CacheKVStore` is responsible for providing a
way to cache reads and buffer writes in order to make state transitions atomic,
e.g. transaction execution or governance proposal execution.
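
A minimal sketch of the write-buffering behavior these wrappers provide is shown
below. The type and field names are hypothetical and deliberately simplified; the
SDK's actual `CacheKVStore` additionally tracks sorted keys for deterministic
iteration:

```go
// cacheKVStore sketches write buffering: reads fall through to the parent
// on a cache miss, while writes are staged in memory until Write is called,
// which is what makes a state transition atomic. (Has and the iterator
// methods are omitted from this sketch.)
type cacheKVStore struct {
	parent KVStore
	cache  map[string][]byte // staged writes; a nil value marks a deletion
}

func (s *cacheKVStore) Get(key []byte) []byte {
	if v, ok := s.cache[string(key)]; ok {
		return v
	}
	return s.parent.Get(key)
}

func (s *cacheKVStore) Set(key, value []byte) { s.cache[string(key)] = value }
func (s *cacheKVStore) Delete(key []byte)     { s.cache[string(key)] = nil }

// Write flushes the staged writes to the parent store. If Write is never
// called, e.g. because a transaction failed, the buffered writes are
// simply discarded.
func (s *cacheKVStore) Write() {
	for k, v := range s.cache {
		if v == nil {
			s.parent.Delete([]byte(k))
		} else {
			s.parent.Set([]byte(k), v)
		}
	}
	s.cache = make(map[string][]byte)
}
```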

There are a few critical drawbacks to these layers of abstraction and the overall
design of storage in the Cosmos SDK:

* Since each module has its own IAVL `KVStore`, commitments are not [atomic](https://github.com/cosmos/cosmos-sdk/issues/14625)
    * Note, we can still allow modules to have their own IAVL `KVStore`, but the
      IAVL library will need to support the ability to pass a DB instance as an
      argument to various IAVL APIs.
* Since IAVL is responsible for both state storage and commitment, running an
  archive node becomes increasingly expensive as disk usage grows exponentially.
* As the size of a network increases, various performance bottlenecks start to
  emerge in many areas such as query performance, network upgrades, state
  migrations, and general application performance.
* Developer UX is poor as it does not allow application developers to experiment
  with different types of approaches to storage and commitment, along with the
  complications of the many layers of abstraction referenced above.

See the [Storage Discussion](https://github.com/cosmos/cosmos-sdk/discussions/13545) for more information.

## Alternatives

There was a previous attempt to refactor the storage layer described in [ADR-040](./adr-040-storage-and-smt-state-commitments.md).
However, that approach mainly stemmed from the shortcomings of IAVL and various
performance issues around it. While there was a (partial) implementation of [ADR-040](./adr-040-storage-and-smt-state-commitments.md),
it was never adopted for a variety of reasons, such as the reliance on using an
SMT, which was still largely in a research phase, and some design choices that
couldn't be fully agreed upon, such as the snapshotting mechanism that would
result in massive state bloat.

## Decision

We propose to build upon some of the great ideas introduced in [ADR-040](./adr-040-storage-and-smt-state-commitments.md),
while being a bit more flexible with the underlying implementations and overall
less intrusive. Specifically, we propose to:

* Separate the concerns of state commitment (**SC**), needed for consensus, and
  state storage (**SS**), needed for state machine and clients.
* Reduce layers of abstractions necessary between the RMS and underlying stores.
* Provide atomic module store commitments by providing a batch database object
  to core IAVL APIs.
* Reduce complexities in the `CacheKVStore` implementation while also improving
  performance<sup>[3]</sup>.

Furthermore, we will keep IAVL as the backing [commitment](https://cryptography.fandom.com/wiki/Commitment_scheme)
store for the time being. While we might not fully settle on the use of IAVL in
the long term, we do not have strong empirical evidence to suggest a better
alternative. Given that the SDK provides interfaces for stores, it should be
sufficient to change the backing commitment store in the future should evidence
arise to warrant a better alternative. However, there is promising work being
done on IAVL that should result in significant performance improvements <sup>[1,2]</sup>.

### Separating SS and SC

Separating SS and SC allows us to optimize for the primary use cases and access
patterns of state. Specifically, the SS layer will be responsible for direct
access to data in the form of (key, value) pairs, whereas the SC layer (IAVL)
will be responsible for committing to data and providing Merkle proofs.

Note, the underlying physical storage database will be the same for both the SS
and SC layers, so to avoid collisions between (key, value) pairs, both layers
will be namespaced.
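
As a sketch, the namespacing could be as simple as a distinct key prefix per
layer. The prefix values and helper below are illustrative only; the concrete
scheme is an implementation detail of store V2:

```go
// Illustrative namespace prefixes for the shared physical database.
var (
	ssPrefix = []byte{0x01} // state storage (SS) namespace
	scPrefix = []byte{0x02} // state commitment (SC) namespace
)

// ssKey maps a module's raw key into the SS namespace, producing
// 0x01 | storeKey | key.
func ssKey(storeKey string, key []byte) []byte {
	k := make([]byte, 0, len(ssPrefix)+len(storeKey)+len(key))
	k = append(k, ssPrefix...)
	k = append(k, storeKey...)
	k = append(k, key...)
	return k
}
```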

#### State Commitment (SC)

Given that the existing solution today acts as both SS and SC, we can simply
repurpose it to act solely as the SC layer without any significant changes to
access patterns or behavior. In other words, the entire collection of existing
IAVL-backed module `KVStore`s will act as the SC layer.

However, in order for the SC layer to remain lightweight and not duplicate a
majority of the data held in the SS layer, we encourage node operators to keep
tight pruning strategies.

#### State Storage (SS)

In the RMS, we will expose a *single* `KVStore` backed by the same physical
database that backs the SC layer. This `KVStore` will be explicitly namespaced
to avoid collisions and will act as the primary storage for (key, value) pairs.

We will most likely continue the use of `cosmos-db`, or some local interface,
to allow for flexibility and iteration over preferred physical storage backends
as research and benchmarking continue. However, we propose to hardcode the use
of RocksDB as the primary physical storage backend.

Since the SS layer will be implemented as a `KVStore`, it will support the
following functionality:

* Range queries
* CRUD operations
* Historical queries and versioning
* Pruning

The RMS will keep track of all buffered writes using a dedicated and internal
`MemoryListener` for each `StoreKey`. For each block height, upon `Commit`, the
SS layer will write all buffered (key, value) pairs under a [RocksDB user-defined timestamp](https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-%28Experimental%29) column
family using the block height as the timestamp, which is an unsigned integer.
This will allow a client to fetch (key, value) pairs at historical and current
heights, along with making iteration and range queries relatively performant,
as the timestamp is the key suffix.
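
The sketch below illustrates this commit path under stated assumptions: `tsDB`
is a hypothetical wrapper over a RocksDB column family configured with a
user-defined timestamp comparator, `kvPair` is a hypothetical representation of
a `MemoryListener` entry, and `ssKey` is the namespacing helper sketched earlier:

```go
import "encoding/binary"

// kvPair is a hypothetical buffered write captured by a MemoryListener.
type kvPair struct {
	storeKey string
	key      []byte
	value    []byte
	delete   bool
}

// tsDB is a hypothetical handle to a RocksDB column family that is
// configured with a user-defined timestamp comparator.
type tsDB interface {
	SetWithTS(key, ts, value []byte) error
	DeleteWithTS(key, ts []byte) error
}

// commitSS writes all buffered pairs for a block under the block height,
// encoded as a fixed-width 8-byte big-endian timestamp.
func commitSS(db tsDB, height uint64, writes []kvPair) error {
	// Fixed-width big-endian encoding keeps timestamps ordered
	// lexicographically, which is what keeps iteration and range
	// queries performant.
	ts := make([]byte, 8)
	binary.BigEndian.PutUint64(ts, height)

	for _, w := range writes {
		key := ssKey(w.storeKey, w.key)
		var err error
		if w.delete {
			err = db.DeleteWithTS(key, ts)
		} else {
			err = db.SetWithTS(key, ts, w.value)
		}
		if err != nil {
			return err
		}
	}
	return nil
}
```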

Note, we choose not to use a more general approach of allowing any embedded key/value
database, such as LevelDB or PebbleDB, with height-prefixed keys to effectively
version state, because most of these databases use variable-length keys, which
would make actions like iteration and range queries less performant.

Since operators might want pruning strategies to differ in SS compared to SC,
e.g. having a very tight pruning strategy in SC while having a looser pruning
strategy for SS, we propose to introduce an additional pruning configuration,
with parameters that are identical to what exists in the SDK today, and allow
operators to control the pruning strategy of the SS layer independently of the
SC layer.
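
A sketch of what that SS-specific configuration could look like, mirroring the
SDK's existing pruning parameters (the type name and field placement are
assumptions):

```go
// SSPruningOptions mirrors the SDK's existing pruning options, applied
// independently to the SS layer.
type SSPruningOptions struct {
	// KeepRecent is the number of recent heights to keep on disk.
	KeepRecent uint64

	// Interval is the block interval at which pruned heights are actually
	// removed from disk.
	Interval uint64
}
```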

Note, the SC pruning strategy must be congruent with the operator's state sync
configuration. This is to allow state sync snapshots to execute successfully;
otherwise, a snapshot could be triggered at a height that is not available in SC.

#### State Sync

The state sync process should be largely unaffected by the separation of the SC
and SS layers. However, if a node syncs via state sync, the SS layer of the node
will not have the state synced height available, since the IAVL import process is
not set up in a way that easily allows direct key/value insertion. A modification
of the IAVL import process would be necessary to facilitate having the state sync
height available.

Note, this is not problematic for the state machine itself because when a query
is made, the RMS will automatically direct the query correctly (see [Queries](#queries)).

#### Queries

To consolidate the query routing between both the SC and SS layers, we propose to
have a notion of a "query router" that is constructed in the RMS. This query router
will be supplied to each `KVStore` implementation. The query router will route
queries to either the SC layer or the SS layer based on a few parameters. If
`prove: true`, then the query must be routed to the SC layer. Otherwise, if the
query height is available in the SS layer, the query will be served from the SS
layer; if not, we fall back to the SC layer.

If no height is provided, the SS layer will assume the latest height. The SS
layer will store a reverse index to look up `LatestVersion -> timestamp(version)`,
which is set on `Commit`.
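
The routing logic could look roughly like the following. All type names and
method signatures here are illustrative assumptions, not final APIs:

```go
// QueryResult is a hypothetical response shape; Proof is nil when the
// query is served from SS.
type QueryResult struct {
	Value []byte
	Proof []byte
}

// versionedStore is a hypothetical view of the SS layer.
type versionedStore interface {
	LatestVersion() uint64
	HasVersion(v uint64) bool
	QueryAt(storeKey string, key []byte, v uint64) (QueryResult, error)
}

// commitStore is a hypothetical view of the SC (IAVL) layer.
type commitStore interface {
	Query(storeKey string, key []byte, v uint64, prove bool) (QueryResult, error)
}

type queryRouter struct {
	ss versionedStore
	sc commitStore
}

func (r *queryRouter) Query(storeKey string, key []byte, height uint64, prove bool) (QueryResult, error) {
	// Proof-carrying queries must hit the SC layer, since SS holds no
	// commitments to (key, value) pairs.
	if prove {
		return r.sc.Query(storeKey, key, height, true)
	}

	// height == 0 means "latest"; SS resolves it via the reverse index
	// (LatestVersion -> timestamp(version)) maintained on Commit.
	if height == 0 {
		height = r.ss.LatestVersion()
	}

	// Prefer SS when it has the requested height; otherwise fall back
	// to the SC layer.
	if r.ss.HasVersion(height) {
		return r.ss.QueryAt(storeKey, key, height)
	}
	return r.sc.Query(storeKey, key, height, false)
}
```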

#### Proofs

Since the SS layer is naturally a storage layer only, without any commitments
to (key, value) pairs, it cannot provide Merkle proofs to clients during queries.

Since the pruning strategy of the SC layer is configured by the operator, the
RMS can route a query to the SC layer if the requested version exists and
`prove: true`. Otherwise, the query will fall back to the SS layer without a proof.

We could explore the idea of using state snapshots to rebuild an in-memory IAVL
tree in real time against a version closest to the one provided in the query.
However, it is not clear what the performance implications of this approach
would be.

### Atomic Commitment

We propose to modify the existing IAVL APIs to accept a batch DB object instead
of relying on an internal batch object in `nodeDB`. Since each underlying IAVL
`KVStore` shares the same DB in the SC layer, this will allow commits to be
atomic.

Specifically, we propose to:

* Remove the `dbm.Batch` field from `nodeDB`
* Update the `SaveVersion` method of the `MutableTree` IAVL type to accept a batch object
* Update the `Commit` method of the `CommitKVStore` interface to accept a batch object
* Create a batch object in the RMS during `Commit` and pass this object to each
  `KVStore`
* Write the database batch after all stores have committed successfully

Note, this will require IAVL to be updated to not rely on, or assume, any batch
being present during `SaveVersion`. The resulting commit flow is sketched below.
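
A sketch of the proposed flow, assuming the updated `Commit` signature described
in the list above (the `batchCommitStore` interface and `RootMultiStore` fields
are illustrative):

```go
import dbm "github.com/cosmos/cosmos-db"

// batchCommitStore reflects the proposed CommitKVStore change: Commit
// stages its writes into a caller-provided batch rather than an internal
// nodeDB batch.
type batchCommitStore interface {
	Commit(batch dbm.Batch) (hash []byte, err error)
}

type RootMultiStore struct {
	db     dbm.DB                      // physical DB shared by all stores
	stores map[string]batchCommitStore // module stores keyed by StoreKey name
}

// Commit stages every module store's writes into one shared batch and
// writes it once, making the multi-store commit atomic: either all module
// stores land on disk or none do.
func (rms *RootMultiStore) Commit() error {
	batch := rms.db.NewBatch()
	defer batch.Close()

	for _, store := range rms.stores {
		if _, err := store.Commit(batch); err != nil {
			return err // abort: nothing has been flushed to disk yet
		}
	}

	return batch.Write()
}
```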

## Consequences

As a result of a new store V2 package, we should expect to see improved performance
for queries and transactions due to the separation of concerns. We should also
expect to see improved developer UX around experimentation with commitment schemes
and storage backends for further performance, in addition to a reduced amount of
abstraction around KVStores, making operations such as caching and state branching
more intuitive.

However, due to the proposed design, there are drawbacks around providing state
proofs for historical queries.

### Backwards Compatibility

This ADR proposes changes to the storage implementation in the Cosmos SDK through
an entirely new package. Interfaces may be borrowed and extended from existing
types that exist in `store`, but no existing implementations or interfaces will
be broken or modified.

### Positive

* Improved performance of independent SS and SC layers
* Reduced layers of abstraction making storage primitives easier to understand
* Atomic commitments for SC
* Redesign of storage types and interfaces will allow for greater experimentation
  such as different physical storage backends and different commitment schemes
  for different application modules

### Negative

* Providing proofs for historical state is challenging

### Neutral

* Keeping IAVL as the primary commitment data structure, although drastic
  performance improvements are being made

## Further Discussions

### Module Storage Control

Many modules store secondary indexes that are typically used solely to support
client queries and are not actually needed for the state machine's state
transitions. This means these indexes technically have no reason to exist in the
SC layer at all, as they take up unnecessary space. It is worth exploring what
an API would look like that allows modules to indicate which (key, value) pairs
they want to be persisted in the SC layer (implicitly including the SS layer as
well), as opposed to persisting the (key, value) pair only in the SS layer.
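
One speculative direction for such an API, purely as a sketch (the option type
and method below are hypothetical):

```go
// WriteOption configures how a (key, value) pair is persisted.
type WriteOption func(*writeConfig)

type writeConfig struct {
	skipCommitment bool
}

// SkipCommitment marks a pair as a query-only secondary index that should
// be persisted in the SS layer but excluded from the SC layer.
func SkipCommitment() WriteOption {
	return func(c *writeConfig) { c.skipCommitment = true }
}

// A module could then opt a secondary index out of state commitment:
//
//	store.SetWithOptions(indexKey, indexValue, SkipCommitment())
```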

### Historical State Proofs

It is not clear what the importance of, or demand within the community for,
providing commitment proofs for historical state is. While solutions can be
devised, such as rebuilding trees on the fly based on state snapshots, it is
not clear what the performance implications of such solutions are.

### Physical DB Backends

This ADR proposes usage of RocksDB to utilize user-defined timestamps as a
versioning mechanism. However, other physical DB backends are available that may
offer alternative ways to implement versioning while also providing performance
improvements over RocksDB. E.g. PebbleDB supports MVCC timestamps as well, but
we'll need to explore how PebbleDB handles compaction and state growth over time.

## References

* [1] https://github.com/cosmos/iavl/pull/676
* [2] https://github.com/cosmos/iavl/pull/664
* [3] https://github.com/cosmos/cosmos-sdk/issues/14990