# MerkleDB

## Structure

A _Merkle radix trie_ is a data structure that is both a [Merkle tree](https://en.wikipedia.org/wiki/Merkle_tree) and a [radix trie](https://en.wikipedia.org/wiki/Radix_tree). MerkleDB is an implementation of a persisted key-value store (sometimes just called "a store") using a Merkle radix trie. We sometimes use "Merkle radix trie" and "MerkleDB instance" interchangeably below, but the two are not the same. MerkleDB maintains data in a Merkle radix trie, but not all Merkle radix tries implement a key-value store.

Like all tries, a MerkleDB instance is composed of nodes. Conceptually, a node has:
* A unique _key_ which identifies its position in the trie. A node's key is a prefix of its children's keys.
* A unique _ID_, which is the hash of the node.
* A _children_ array, where each element is the ID of the child at that index. A child at a lower index is to the "left" of children at higher indices.
* An optional value. If a node has a value, then the node's key maps to its value in the key-value store. Otherwise the key isn't present in the store.

and looks like this:

```
Node
+--------------------------------------------+
| ID:                               32 bytes |
| Key:                               ? bytes |
| Value:                  Some(value) | None |
| Children:                                  |
|   0:                 Some(child0ID) | None |
|   1:                 Some(child1ID) | None |
|   ...                                      |
|   BranchFactor-1:   Some(child15ID) | None |
+--------------------------------------------+
```

This conceptual picture differs slightly from the implementation of the `node` in MerkleDB but is still useful for understanding how MerkleDB works.

## Root IDs and Revisions

The ID of the root node is called the _root ID_, or sometimes just the _root_ of the trie. If any node in a MerkleDB instance changes, the root ID will change. This follows from the fact that changing a node changes its ID, which changes its parent's reference to it, which changes the parent, which changes the parent's ID, and so on until the root.

The root ID also serves as a unique identifier of a given state; instances with the same key-value mappings always have the same root ID, and instances with different key-value mappings always have different root IDs. We call a state with a given root ID a _revision_, and we sometimes say that a MerkleDB instance is "at" a given revision or root ID. The two are equivalent.

## Views

A _view_ is a proposal to modify a MerkleDB. If a view is _committed_, its changes are written to the MerkleDB. It can be queried, and when it is, it returns the state that the MerkleDB will contain if the view is committed. A view is immutable after creation. Namely, none of its key-value pairs can be modified.

A view can be built atop the MerkleDB itself, or it can be built atop another view. Views can be chained together. For example, we might have:

```
   db
  /  \
view1  view2
  |
view3
```

where `view1` and `view2` are built atop MerkleDB instance `db` and `view3` is built atop `view1`. Equivalently, we say that `db` is the parent of `view1` and `view2`, and `view3` is a child of `view1`. `view1` and `view2` are _siblings_.

`view1` contains all the key-value pairs in `db`, except those modified by `view1`. That is, if `db` has key-value pair `(k,v)`, and `view1` doesn't modify that pair, then `view1` will return `v` when queried for the value of `k`. If `db` has `(k,v)` but `view1` modifies the pair to `(k, v')`, then `view1` will return `v'` when queried for the value of `k`. The same applies to `view2`.
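To make the lookup semantics concrete, here is a minimal, self-contained Go sketch of how a layered view resolves a key: it consults its own changes first and falls back to its parent otherwise. The `layer` and `change` types are hypothetical illustrations for this README, not MerkleDB's actual `view` implementation.

```go
package main

import "fmt"

// change records what a layer did to a key: either set it to a new value or
// deleted it. This is an illustrative type, not merkledb's implementation.
type change struct {
    value   []byte
    deleted bool
}

// layer is a simplified stand-in for a view (or for the database itself when
// parent is nil): it holds its own changes and defers to its parent for
// everything else.
type layer struct {
    parent  *layer
    changes map[string]change
}

// get resolves a key by checking this layer's changes first and then walking
// up the parent chain, mirroring how a view answers queries.
func (l *layer) get(key string) ([]byte, bool) {
    if c, ok := l.changes[key]; ok {
        if c.deleted {
            return nil, false
        }
        return c.value, true
    }
    if l.parent == nil {
        return nil, false
    }
    return l.parent.get(key)
}

func main() {
    db := &layer{changes: map[string]change{
        "k": {value: []byte("v")},
    }}
    view1 := &layer{parent: db, changes: map[string]change{
        "k": {value: []byte("v'")},
    }}
    view2 := &layer{parent: db, changes: map[string]change{}}
    view3 := &layer{parent: view1, changes: map[string]change{}}

    v1, _ := view1.get("k")
    v2, _ := view2.get("k")
    v3, _ := view3.get("k")
    fmt.Printf("view1: %s, view2: %s, view3: %s\n", v1, v2, v3) // v', v, v'
}
```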
`view3` has the same key-value pairs as `view1`, except those modified in `view3`. That is, it reflects the state after the changes in `view1` are applied to `db`, followed by those in `view3`.

A view can be committed only if its parent is the MerkleDB (and not another view). A view can only be committed once. In the above diagram, `view3` can't be committed until `view1` is committed.

When a view is created, we don't apply changes to the trie's structure or calculate the new IDs of nodes because this requires expensive hashing. Instead, we lazily apply changes and calculate node IDs (including the root ID) when necessary.

### Validity

When a view is committed, its siblings and all of their descendants are _invalidated_. An invalid view can't be read or committed. Method calls on it will return `ErrInvalid`.

In the diagram above, if `view1` were committed, `view2` would be invalidated. If `view2` were committed, `view1` and `view3` would be invalidated.

## Proofs

### Simple Proofs

MerkleDB instances can produce _merkle proofs_, sometimes just called "proofs." A merkle proof uses cryptography to prove that a given key-value pair is or isn't in the key-value store with a given root. That is, a MerkleDB instance with root ID `r` can create a proof that shows that it has a key-value pair `(k,v)`, or that `k` is not present.

Proofs can be useful to a client fetching data in a Byzantine environment. Suppose there are one or more servers, which may be Byzantine, serving a distributed key-value store using MerkleDB, and a client that wants to retrieve key-value pairs. Suppose also that the client can learn a "trusted" root ID, perhaps because it's posted on a blockchain. The client can request a key-value pair from a server, and use the returned proof to verify that the returned key-value pair is actually in the key-value store with root `r` (or isn't, as the case may be).

```mermaid
flowchart TD
    A[Client] -->|"ProofRequest(k,r)"| B(Server)
    B --> |"Proof(k,r)"| C(Client)
    C --> |Proof Valid| D(Client trusts key-value pair from proof)
    C --> |Proof Invalid| E(Client doesn't trust key-value pair from proof)
```

`ProofRequest(k,r)` is a request for the value that `k` maps to in the MerkleDB instance with root `r` and a proof for that data's correctness.

`Proof(k,r)` is a proof that purports to show either that key-value pair `(k,v)` exists in the revision at `r`, or that `k` isn't in the revision.

#### Verification

A proof is represented as:

```go
type Proof struct {
    // Nodes in the proof path from root --> target key
    // (or node that would be where key is if it doesn't exist).
    // Always contains at least the root.
    Path []ProofNode

    // This is a proof that [key] exists/doesn't exist.
    Key Key

    // Nothing if [Key] isn't in the trie.
    // Otherwise, the value corresponding to [Key].
    Value maybe.Maybe[[]byte]
}

type ProofNode struct {
    Key Key
    // Nothing if this is an intermediate node.
    // The value in this node if its length < [HashLen].
    // The hash of the value in this node otherwise.
    ValueOrHash maybe.Maybe[[]byte]
    Children    map[byte]ids.ID
}
```

For an inclusion proof, the last node in `Path` should be the one containing `Key`.
For an exclusion proof, the last node is either:
* The node that would be the parent of `Key`, if that node has no child at the index `Key` would be at.
* The node at the same child index `Key` would be at, otherwise.

In other words, the last node of a proof says either, "the key is in the trie, and this node contains it," or, "the key isn't in the trie, and this node's existence precludes the existence of the key."

The verifier can't simply trust that such a node exists, though. It has to verify this. The verifier creates an empty trie and inserts the nodes in `Path`. If the root ID of this trie matches `r`, the verifier can trust that the last node really does exist in the trie. If the last node _didn't_ really exist, the proof creator couldn't create `Path` such that its nodes both imply the existence of the ("fake") last node and also result in the correct root ID. This follows from the one-way property of hashing.

### Range Proofs

MerkleDB instances can also produce _range proofs_. A range proof proves that a contiguous set of key-value pairs is or isn't in the key-value store with a given root. This is similar to the merkle proofs described above, except for multiple key-value pairs.

```mermaid
flowchart TD
    A[Client] -->|"RangeProofRequest(start,end,r)"| B(Server)
    B --> |"RangeProof(start,end,r)"| C(Client)
    C --> |Proof Valid| D(Client trusts key-value pairs)
    C --> |Proof Invalid| E(Client doesn't trust key-value pairs)
```

`RangeProofRequest(start,end,r)` is a request for all of the key-value pairs, in order, between keys `start` and `end` at revision `r`.

`RangeProof(start,end,r)` contains a list of key-value pairs `kvs`, sorted by increasing key. It purports to show that, at revision `r`:
* Each element of `kvs` is a key-value pair in the store.
* There are no keys at/after `start` but before the first key in `kvs`.
* For adjacent key-value pairs `(k1,v1)` and `(k2,v2)` in `kvs`, there doesn't exist a key-value pair `(k3,v3)` in the store such that `k1 < k3 < k2`. In other words, `kvs` is a contiguous set of key-value pairs.

Clients can use range proofs to efficiently download many key-value pairs at a time from a MerkleDB instance, as opposed to getting a proof for each key-value pair individually.

#### Verification

Like simple proofs, range proofs can be verified without any additional context or knowledge of the contents of the key-value store.

A range proof is represented as:

```go
type RangeProof struct {
    // Invariant: At least one of [StartProof], [EndProof], [KeyValues] is non-empty.

    // A proof that the smallest key in the requested range does/doesn't exist.
    // Note that this may not be an entire proof -- nodes are omitted if
    // they are also in [EndProof].
    StartProof []ProofNode

    // If no upper range bound was given and [KeyValues] is empty, this is empty.
    //
    // If no upper range bound was given and [KeyValues] is non-empty, this is
    // a proof for the largest key in [KeyValues].
    //
    // Otherwise this is a proof for the upper range bound.
    EndProof []ProofNode

    // This proof proves that the key-value pairs in [KeyValues] are in the trie.
    // Sorted by increasing key.
    KeyValues []KeyValue
}
```

The verifier creates an empty trie and adds to it all of the key-value pairs in `KeyValues`.
Then, it inserts:
* The nodes in `StartProof`
* The nodes in `EndProof`

For each node in `StartProof`, the verifier only populates `Children` entries whose key is before `start`.
For each node in `EndProof`, it populates only `Children` entries whose key is after `end`, where `end` is the largest key proven by the range proof.

Then, it calculates the root ID of this trie and compares it to the expected one.

If the proof:
* Omits any key-values in the range
* Includes additional key-values that aren't really in the range
* Provides an incorrect value for a key in the range

then the actual root ID won't match the expected root ID.

Like simple proofs, range proof verification relies on the fact that the proof generator can't forge data such that it results in a trie with both incorrect data and the correct root ID.

### Change Proofs

Finally, MerkleDB instances can produce and verify _change proofs_. A change proof proves that a set of key-value changes were applied to a MerkleDB instance in the process of changing its root from `r` to `r'`. For example, suppose there's an instance whose root was `r` and, after some changes, is now `r'`:

```mermaid
flowchart TD
    A[Client] -->|"ChangeProofRequest(start,end,r,r')"| B(Server)
    B --> |"ChangeProof(start,end,r,r')"| C(Client)
    C --> |Proof Valid| D(Client trusts key-value pair changes)
    C --> |Proof Invalid| E(Client doesn't trust key-value changes)
```

`ChangeProofRequest(start,end,r,r')` is a request for all of the key-value changes, in order, with keys between `start` and `end`, that occurred after the root was `r` and before the root was `r'`.

`ChangeProof(start,end,r,r')` contains a set of key-value pairs `kvs`. It purports to show that:
* Each element of `kvs` is a key-value pair in the store at revision `r'` but not at revision `r`.
* There are no key-value changes between `r` and `r'` such that the key is at/after `start` but before the first key in `kvs`.
* For adjacent key-value changes `(k1,v1)` and `(k2,v2)` in `kvs`, there doesn't exist a key-value change `(k3,v3)` between `r` and `r'` such that `k1 < k3 < k2`. In other words, `kvs` is a contiguous set of key-value changes.

Change proofs are useful for applying changes between revisions. For example, suppose a client has a MerkleDB instance at revision `r`. The client learns that the state has been updated and that the new root is `r'`. The client can request a change proof from a server at revision `r'`, and apply the changes in the change proof to change its state from `r` to `r'`. Note that `r` and `r'` need not be "consecutive" revisions. For example, it's possible that the state goes from revision `r` to `r1` to `r2` to `r'`. The client can apply the changes to get directly from `r` to `r'`, without ever needing to be at revision `r1` or `r2`.
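Below is a minimal Go sketch of that client-side step: applying the key-value changes reported by a (already verified) change proof to local state in order to move from revision `r` to `r'`. The `keyChange` type and the "nil value means deleted" convention are assumptions made for illustration, not MerkleDB's actual change proof types.

```go
package main

import "fmt"

// keyChange is a hypothetical representation of one change between revisions
// r and r': a key and either its new value, or nil to indicate the key was
// deleted.
type keyChange struct {
    key   string
    value []byte
}

// applyChanges mutates state in place so that, after the call, it reflects
// revision r' instead of revision r.
func applyChanges(state map[string][]byte, changes []keyChange) {
    for _, c := range changes {
        if c.value == nil {
            delete(state, c.key)
            continue
        }
        state[c.key] = c.value
    }
}

func main() {
    // State at revision r.
    state := map[string][]byte{
        "a": []byte("1"),
        "b": []byte("2"),
    }

    // Changes carried by a change proof covering [r, r'].
    changes := []keyChange{
        {key: "a", value: []byte("10")}, // modified
        {key: "b", value: nil},          // deleted
        {key: "c", value: []byte("3")},  // added
    }

    applyChanges(state, changes)
    fmt.Println(len(state), string(state["a"]), string(state["c"])) // 2 10 3
}
```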
#### Verification

Unlike simple proofs and range proofs, change proofs require additional context to verify. Namely, the verifier must have the trie at the start root `r`.

The verification algorithm is similar to that for range proofs, except that instead of inserting the key-value changes, start proof, and end proof into an empty trie, they are added to the trie at revision `r`.

## Serialization

### Node

Nodes are persisted in an underlying database. In order to persist nodes, we must first serialize them. Serialization is done by the `encoder` interface defined in `codec.go`.

The node serialization format is:

```
+----------------------------------------------------+
| Value existence flag (1 byte)                      |
+----------------------------------------------------+
| Value length (varint) (optional)                   |
+----------------------------------------------------+
| Value (variable length bytes) (optional)           |
+----------------------------------------------------+
| Number of children (varint)                        |
+----------------------------------------------------+
| Child index (varint)                               |
+----------------------------------------------------+
| Child compressed key length (varint)               |
+----------------------------------------------------+
| Child compressed key (variable length bytes)       |
+----------------------------------------------------+
| Child ID (32 bytes)                                |
+----------------------------------------------------+
| Child has value (1 byte)                           |
+----------------------------------------------------+
| Child index (varint)                               |
+----------------------------------------------------+
| Child compressed key length (varint)               |
+----------------------------------------------------+
| Child compressed key (variable length bytes)       |
+----------------------------------------------------+
| Child ID (32 bytes)                                |
+----------------------------------------------------+
| Child has value (1 byte)                           |
+----------------------------------------------------+
| ...                                                |
+----------------------------------------------------+
```

Where:
* `Value existence flag` is `1` if this node has a value, otherwise `0`.
* `Value length` is the length of the value, if it exists (i.e. if `Value existence flag` is `1`). Otherwise not serialized.
* `Value` is the value, if it exists (i.e. if `Value existence flag` is `1`). Otherwise not serialized.
* `Number of children` is the number of children this node has.
* `Child index` is the index of a child node within the list of the node's children.
* `Child compressed key length` is the length of the child node's compressed key.
* `Child compressed key` is the child node's compressed key.
* `Child ID` is the child node's ID.
* `Child has value` indicates if that child has a value.

For each child of the node, we have an additional:

```
+----------------------------------------------------+
| Child index (varint)                               |
+----------------------------------------------------+
| Child compressed key length (varint)               |
+----------------------------------------------------+
| Child compressed key (variable length bytes)       |
+----------------------------------------------------+
| Child ID (32 bytes)                                |
+----------------------------------------------------+
| Child has value (1 byte)                           |
+----------------------------------------------------+
```

Note that the `Child index` values are not necessarily sequential. For example, if a node has 3 children, the `Child index` values could be `0`, `2`, and `15`.
However, the `Child index` values must be strictly increasing. For example, the `Child index` values cannot be `0`, `0`, and `1`, or `1`, `0`.

Since a node can have up to 16 children, there can be up to 16 such blocks of children data.
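As a rough illustration of how one such child block could be appended to a buffer using the standard library's varint encoding (`binary.AppendUvarint` is the append-style counterpart of `binary.PutUvarint` mentioned below), consider the sketch that follows. It is not the actual encoder in `codec.go`; in particular, it treats the compressed key as an opaque byte string and uses its byte length, and the names are made up for the example.

```go
package main

import (
    "encoding/binary"
    "fmt"
)

const idLen = 32 // length of a node ID in bytes

// appendChildEntry appends one child's block in the order described above:
// child index, compressed key length, compressed key bytes, child ID, and a
// one-byte has-value flag.
func appendChildEntry(
    buf []byte,
    index uint64,
    compressedKey []byte,
    childID [idLen]byte,
    hasValue bool,
) []byte {
    buf = binary.AppendUvarint(buf, index)
    buf = binary.AppendUvarint(buf, uint64(len(compressedKey)))
    buf = append(buf, compressedKey...)
    buf = append(buf, childID[:]...)
    if hasValue {
        return append(buf, 1)
    }
    return append(buf, 0)
}

func main() {
    var id [idLen]byte // all-zero ID, just for illustration
    buf := appendChildEntry(nil, 14, []byte{0xAB}, id, true)
    fmt.Printf("%x\n", buf) // 0e, 01, ab, 32 zero bytes, 01
}
```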
#### Example

Let's take a look at an example node.

Its byte representation (in hex) is: `0x01020204000210579EB3718A7E437D2DDCE931AC7CC05A0BC695A9C2084F5DF12FB96AD0FA32660E06FFF09845893C4F9D92C4E097FCF2589BC9D6882B1F18D1C2FC91D7DF1D3FCBDB4238`

The node's key is empty (it's the root) and it has value `0x02`.
It has two children.
The first is at child index `0`, has compressed key `0x01` and ID (in hex) `0x579eb3718a7e437d2ddce931ac7cc05a0bc695a9c2084f5df12fb96ad0fa3266`.
The second is at child index `14`, has compressed key `0x0F0F0F` and ID (in hex) `0x9845893c4f9d92c4e097fcf2589bc9d6882b1f18d1c2fc91d7df1d3fcbdb4238`.

```
+--------------------------------------------------------------------+
| Value existence flag (1 byte)                                      |
| 0x01                                                               |
+--------------------------------------------------------------------+
| Value length (varint) (optional)                                   |
| 0x02                                                               |
+--------------------------------------------------------------------+
| Value (variable length bytes) (optional)                           |
| 0x02                                                               |
+--------------------------------------------------------------------+
| Number of children (varint)                                        |
| 0x04                                                               |
+--------------------------------------------------------------------+
| Child index (varint)                                               |
| 0x00                                                               |
+--------------------------------------------------------------------+
| Child compressed key length (varint)                               |
| 0x02                                                               |
+--------------------------------------------------------------------+
| Child compressed key (variable length bytes)                       |
| 0x10                                                               |
+--------------------------------------------------------------------+
| Child ID (32 bytes)                                                |
| 0x579EB3718A7E437D2DDCE931AC7CC05A0BC695A9C2084F5DF12FB96AD0FA3266 |
+--------------------------------------------------------------------+
| Child index (varint)                                               |
| 0x0E                                                               |
+--------------------------------------------------------------------+
| Child compressed key length (varint)                               |
| 0x06                                                               |
+--------------------------------------------------------------------+
| Child compressed key (variable length bytes)                       |
| 0xFFF0                                                             |
+--------------------------------------------------------------------+
| Child ID (32 bytes)                                                |
| 0x9845893C4F9D92C4E097FCF2589BC9D6882B1F18D1C2FC91D7DF1D3FCBDB4238 |
+--------------------------------------------------------------------+
```

### Node Hashing

Each node must have a unique ID that identifies it. This ID is calculated by hashing the following values:
* The node's children
* The node's value digest
* The node's key

The node's value digest is:
* Nothing, if the node has no value
* The node's value, if it has a value < 32 bytes
* The hash of the node's value otherwise

We use the node's value digest rather than its value when hashing so that when we send proofs, each `ProofNode` doesn't need to contain the node's value, which could be very large. By using the value digest, we allow a proof verifier to calculate a node's ID while limiting the size of the data sent to the verifier.
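A minimal sketch of the value digest rule above, using `crypto/sha256` (the hash used for node IDs, per the encoding description below); the function name is just for illustration:

```go
package main

import (
    "crypto/sha256"
    "fmt"
)

const hashLen = 32 // sha256 output length in bytes

// valueDigest returns the digest used when hashing a node: nothing if the
// node has no value, the value itself if it is shorter than 32 bytes, and
// the sha256 hash of the value otherwise.
func valueDigest(value []byte, hasValue bool) ([]byte, bool) {
    if !hasValue {
        return nil, false
    }
    if len(value) < hashLen {
        return value, true
    }
    digest := sha256.Sum256(value)
    return digest[:], true
}

func main() {
    short, _ := valueDigest([]byte{0x02}, true)
    fmt.Printf("short value digest: %x\n", short) // the value itself: 02

    long, _ := valueDigest(make([]byte, 64), true)
    fmt.Printf("long value digest:  %x\n", long) // a 32-byte sha256 hash
}
```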
To compute a node's ID, we encode these values in the following way:

```
+----------------------------------------------------+
| Number of children (varint)                        |
+----------------------------------------------------+
| Child index (varint)                               |
+----------------------------------------------------+
| Child ID (32 bytes)                                |
+----------------------------------------------------+
| Child index (varint)                               |
+----------------------------------------------------+
| Child ID (32 bytes)                                |
+----------------------------------------------------+
| ...                                                |
+----------------------------------------------------+
| Value existence flag (1 byte)                      |
+----------------------------------------------------+
| Value length (varint) (optional)                   |
+----------------------------------------------------+
| Value (variable length bytes) (optional)           |
+----------------------------------------------------+
| Key bit length (varint)                            |
+----------------------------------------------------+
| Key (variable length bytes)                        |
+----------------------------------------------------+
```

Where:
* `Number of children` is the number of children this node has.
* `Child index` is the index of a child node within the list of the node's children.
* `Child ID` is the child node's ID.
* `Value existence flag` is `1` if this node has a value, otherwise `0`.
* `Value length` is the length of the value, if it exists (i.e. if `Value existence flag` is `1`). Otherwise not serialized.
* `Value` is the value, if it exists (i.e. if `Value existence flag` is `1`). Otherwise not serialized.
* `Key bit length` is the number of bits in this node's key.
* `Key` is the node's key.

Note that, as with the node serialization format, the `Child index` values aren't necessarily sequential, but they are unique and strictly increasing.
Also like the node serialization format, there can be up to 16 blocks of children data.
However, note that child compressed keys are not included in the node ID calculation.

Once this is encoded, we `sha256` hash the resulting bytes to get the node's ID.

### Encoding Varints and Bytes

Varints are encoded with `binary.PutUvarint` from the standard library's `encoding/binary` package.
Bytes are encoded by simply copying them onto the buffer.

## Design choices

### []byte copying

A node may contain a value, which is represented in Go as a `[]byte`. This slice is never edited, allowing it to be used without copying it first in many places. When a value leaves the library, for example when returned in `Get`, `GetValue`, `GetProof`, `GetRangeProof`, etc., the value is copied to prevent edits made outside the library from being reflected in the database.

### Split Node Storage

Nodes with values ("value nodes") are persisted under one database prefix, while nodes without values ("intermediate nodes") are persisted under another database prefix. This separation allows for easy iteration over all key-value pairs in the database, as this is simply iterating over the database prefix containing value nodes.

### Single Node Type

MerkleDB uses one type to represent nodes, rather than having multiple types (e.g. branch nodes, value nodes, extension nodes) as other Merkle trie implementations do.
Not using extension nodes results in worse storage efficiency (some nodes may have mostly empty children) but simpler code.

### Locking

`merkleDB` has a `RWMutex` named `lock`. Its read operations don't store data in a map, so a read lock suffices for read operations.
`merkleDB` has a `Mutex` named `commitLock`. It enforces that only a single view/batch is attempting to commit to the database at one time. `lock` is insufficient because there is a period of view preparation where read access should still be allowed, followed by a period where a full write lock is needed. The `commitLock` ensures that only a single goroutine makes the transition from read => write.

A `view` is built atop another trie, which may be the underlying `merkleDB` or another `view`.
We use locking to guarantee atomicity/consistency of trie operations.

`view` has a `RWMutex` named `commitLock` which ensures that we don't create a view atop the `view` while it's being committed.
It also has a `RWMutex` named `validityTrackingLock` that is held during methods that change the view's validity, the tracking of child views' validity, or the view's parent trie. This lock ensures that writing/reading from `view` or any of its descendants is safe.
The `CommitToDB` method grabs the `merkleDB`'s `commitLock`. This is the only `view` method that modifies the underlying `merkleDB`.

In some of `merkleDB`'s methods, we create a `view` and call unexported methods on it without locking it.
We do so because the exported counterpart of the method read locks the `merkleDB`, which is already locked.
This pattern is safe because the `merkleDB` is locked, so no data under the view is changing, and nobody else has a reference to the view, so there can't be any concurrent access.

To prevent deadlocks, `view` and `merkleDB` never acquire the `commitLock` of descendant views.
That is, locking is always done from a view toward the underlying `merkleDB`, never the other way around.
The `validityTrackingLock` goes the opposite way. A view can lock the `validityTrackingLock` of its children, but not its ancestors. Because of this, any function that takes the `validityTrackingLock` must not take the `commitLock`, as this may cause a deadlock. Keeping `commitLock` solely in the ancestor direction and `validityTrackingLock` solely in the descendant direction prevents deadlocks from occurring.

## TODOs

- [ ] Analyze performance of using database snapshots rather than in-memory history
- [ ] Improve intermediate node regeneration after ungraceful shutdown by reusing successfully written subtrees