# Storage Design Notes
## Author: Martin Smith

## Tree Node Storage

The node level of storage provides a fairly abstract tree model that is used to implement
verifiable logs and maps. Most users will not need this level but it is important to know the
concepts involved. The API for this is defined in `storage/tree_storage.go`. Related protos are in
the `storage/storagepb` package.

The model provides a versioned view of the tree. Each transaction that modifies the tree results
in a new revision. Revision numbers increase monotonically as the tree is mutated.

### NodeIDs

Nodes in storage are uniquely identified by a `NodeID`. This combines a tree path with a revision
number. The path is effectively used as a bitwise subtree prefix in the tree. In our subtree
storage optimization the path prefix identifies the subtree and the remaining path is the path to
the node within that subtree.

The same `NodeID` objects are used by both logs and maps but they are interpreted differently.
There are API functions that create them for each case. Mixing node ID types in API calls will
give incorrect results.

### Subtree Stratification

As an optimization, the tree is not stored as a set of raw nodes but as a collection of subtrees.

Currently, subtrees must be a multiple of 8 levels deep (referred to as `strataDepth` in the
code), so a depth of e.g. 7 is not allowed but 8 or 16 is fine. Only the bottom level nodes
(the "leaves") of each subtree are physically stored. Intermediate subtree nodes are rehashed
from the "leaves" when the subtree is loaded into memory. See `storage/cache/subtree_cache.go` for
more details.

Note some caveats to the above paragraph. If depth multiples other than 8 are used this might
require changes to the way node ID prefixes and suffixes are packed and unpacked from byte
slices. There are additional assumptions that all log subtrees are the same depth, though these
would be easier to remove.

For maps the internal nodes are always cleared when stored and then rebuilt from the subtree
"leaf" nodes when the subtree is reloaded. Logs use a modified strategy because it is possible
for internal nodes to depend on more than one subtree. For logs the internal nodes are only
cleared on storage if the subtree is fully populated. This prevents the loss of internal nodes
that depend on other subtrees as the tree grows in levels.

This storage arrangement was chosen because we have predictable access patterns to our data and do
not require classes of tree modification like re-parenting a subtree. It would probably not be
suitable for a general purpose tree.

Subtrees are keyed by the `NodeID` of the root (effectively a path prefix) and contain the
intermediate nodes for that subtree, identified by their suffix. These are actually stored in a
proto map where the key is the suffix path.

Subtrees are versioned to support the access model for nodes described above. Node addresses
within the model are distinct because the path to a subtree must be unique.

When node updates are applied they will affect one or more subtrees, and caching is used to
increase efficiency. After all updates have been applied in memory the cache is flushed to storage
so each affected subtree is only written once. All writes for a transaction will be at the same
revision number.
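
To make the depth-8 stratification described above more concrete, here is a minimal sketch of how
a node path might be split into the prefix that keys a subtree and the suffix that locates the
node inside it. The `splitPath` helper, the byte-aligned packing and the fixed depth of 8 are
illustrative assumptions, not the actual storage code.

```go
package main

import "fmt"

// strataDepth is the assumed subtree depth; the real code allows multiples of 8.
const strataDepth = 8

// splitPath is an illustrative helper: given a node path expressed as whole
// bytes (one bit per tree level, most significant bit first), it returns the
// prefix identifying the containing subtree and the suffix addressing the
// node within that subtree.
func splitPath(path []byte, pathBits int) (prefix, suffix []byte) {
	// Number of complete depth-8 strata above the node's subtree.
	prefixStrata := (pathBits - 1) / strataDepth
	prefixBytes := prefixStrata * strataDepth / 8
	return path[:prefixBytes], path[prefixBytes:]
}

func main() {
	// A 17-bit path traverses three depth-8 subtrees: the first two strata form
	// the prefix and the remaining bit addresses a node in the third subtree.
	path := []byte{0xA5, 0x3C, 0x80} // 17 significant bits, padded to whole bytes
	prefix, suffix := splitPath(path, 17)
	fmt.Printf("prefix=%x suffix=%x\n", prefix, suffix)
}
```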

Subtrees are helpful for reads because it is likely that many of the nodes traversed in Merkle
paths for proofs are part of the same subtree. The number of subtrees involved in a path through a
large tree from the leaves to the root is also bounded. For writes the subtree update batches what
would be many smaller writes into one larger but manageable one.

We gain space efficiency by not storing intermediate nodes (except as noted above for logs and
partially full subtrees). This is a big saving, especially for log storage. It avoids storing
entire tree levels, which get very large as the tree grows, and adds up to approximately a 50%
space saving. This is magnified further as we store many versions of the tree. For the map case
things aren't quite as good because multiple subtree revisions need to be stored with the same
prefix but only one "leaf" node differing. The efficiency of this needs to be determined for
large maps.

#### Subtree Diagrams

This diagram shows a tree as it might actually be stored by our code using subtrees of depth 8.

Each subtree does not include its "root" node, though this counts as part of the depth. There are
additional subtrees below and to the right of the child subtree shown; they can't easily be shown
in the diagram. Obviously, there could be fewer than 256 "leaf" nodes in the subtrees if they are
not yet fully populated. A node always belongs to exactly one subtree; there is no overlap.



As it's hard to visualize the structure at scale with stratum depth 8, some examples of smaller
depths might make things clearer. Though these depths are not supported by the current
implementation, the diagrams are much simpler.

This diagram shows a tree with stratum depth 2. It is a somewhat special case as all the levels
are stored. Note that the root node is never stored and is always recalculated.



This diagram shows a tree with stratum depth 3. Note that only the bottom level of each subtree is
stored and how the binary path is used as a subtree prefix to identify subtrees.



### Consistency and Other Requirements

Storage implementations must provide strongly consistent updates to the tree data. Some users may
see an earlier view than others if updates have not been fully propagated yet, but they must not
see partial updates or inconsistent views.

It is not a requirement that the underlying storage is relational. Our initial implementation uses
an RDBMS and has this [database schema diagram](database-diagram.pdf).

## Log Storage

### The Log Tree

The node tree built for a log is a representation of a Merkle Tree, which starts out empty and
grows as leaves are added. A Merkle Tree of a specific size has a fixed and well-defined shape.

Leaves are never removed, and a completely populated left subtree of the tree structure is never
further mutated.

The personality layer is responsible for deciding whether to accept duplicate leaves as it
controls the leaf identity hash value. For example it could add a timestamp to the data it hashes
so that duplicate leaf data always has a different leaf identity hash.

The log stores two hashes per leaf: a raw SHA256 hash of the leaf data used for deduplication (by
the personality layer) and the Merkle Leaf Hash of the data, which becomes part of the tree.
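
As a sketch of the distinction between the two hashes, assuming the RFC 6962 hashing strategy used
for the log tree (the helper names here are illustrative, not the actual Trillian APIs):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// identityHash is the raw SHA256 of the leaf data; the personality layer can
// use it to detect duplicate submissions.
func identityHash(leafData []byte) [32]byte {
	return sha256.Sum256(leafData)
}

// merkleLeafHash follows RFC 6962: the leaf data is prefixed with a 0x00
// domain-separation byte before hashing, so leaf hashes can never collide
// with interior node hashes (which use a 0x01 prefix).
func merkleLeafHash(leafData []byte) [32]byte {
	return sha256.Sum256(append([]byte{0x00}, leafData...))
}

func main() {
	data := []byte("example leaf")
	fmt.Printf("identity hash:    %x\n", identityHash(data))
	fmt.Printf("merkle leaf hash: %x\n", merkleLeafHash(data))
}
```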

### Log NodeIDs / Tree Coordinates

Log nodes are notionally addressed using a three-dimensional coordinate tuple (level in tree,
index in level, revision number).

Level zero is always the leaf level and additional intermediate levels are added above this as the
tree grows. Such growth does not affect nodes written at a previous revision. Levels are only
created when they are required. The level of a node is always its level in the overall tree; the
`NodeID` coordinates are independent of any subtree storage optimization.

Index is the horizontal position of the node in the level, with zero being the leftmost node in
each level.

For example, in a tree of size two the leaves are (level 0, index 0) and (level 0, index 1) and
the root is (level 1, index 0).

The storage implementation must be able to provide access to nodes using this coordinate scheme
but it is not required to store them this way. The current implementation compacts subtrees for
increased write efficiency so nodes are not distinct database entities. This is hidden by the node
API.

### Log Startup

When log storage is initialized and its tree is not empty the existing state is loaded into a
`compact_merkle_tree`. This can be done efficiently and only requires a few node accesses to
restore the tree state by reading intermediate hashes at each tree level.

As a crosscheck the root hash of the compact tree is compared against the current log root. If it
does not match then the log is corrupt and cannot be used.

### Writing Leaves and Sequencing

In the current RDBMS storage implementation log clients queue new leaves to the log, and a
`LeafData` record is created. Further writes to the Merkle Tree are coordinated by the sequencer,
which adds leaves to the tree. The sequencer is responsible for ordering the leaves and creating
the `SequencedLeafData` row linking the leaf and its sequence number. Queued submissions that have
not been sequenced are not accessible via the log APIs.

When leaves are added to the tree they are processed by a `merkle/compact_merkle_tree`, which
causes a batched set of tree node updates to be applied. Each update is given its own revision
number. The result is that a number of tree snapshots are directly available in storage. This
contrasts with
[previous implementations](https://github.com/google/certificate-transparency/blob/master/cpp/merkletree/)
for Certificate Transparency where the tree is in RAM and only the most recent snapshot is
directly available. Note that we may batch log updates so we don't necessarily have all
intermediate tree snapshots directly available from storage.

As an optimization, intermediate nodes with only a left child are not stored. There is more detail
on how this affects access to tree paths in the file `merkle/merkle_paths.go`. This differs from
the
[Certificate Transparency C++](https://github.com/google/certificate-transparency/blob/master/cpp/merkletree/merkle_tree.h)
in-memory tree implementation. In summary, the code must handle cases where there is no right
sibling of the rightmost node in a level.

Each batched write also produces an internal tree head, linking that write to the revision number.
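
To make the (level, index) coordinate scheme and the missing-right-sibling case above concrete,
here is a minimal sketch; the `coord` type and helper functions are illustrative only and are not
the storage or Merkle path API.

```go
package main

import "fmt"

// coord is an illustrative (level, index) address into the log tree; the
// revision dimension is omitted for brevity.
type coord struct {
	level uint
	index uint64
}

// parent returns the coordinate of the node one level up.
func parent(c coord) coord {
	return coord{level: c.level + 1, index: c.index / 2}
}

// sibling returns the coordinate of the node sharing the same parent.
func sibling(c coord) coord {
	return coord{level: c.level, index: c.index ^ 1}
}

// rightSiblingExists reports whether a right sibling is present in a tree of
// the given size: the rightmost node in a level may have none, which is the
// case the path-serving code has to handle.
func rightSiblingExists(c coord, treeSize uint64) bool {
	if c.index%2 == 1 {
		return true // this node is itself a right child
	}
	// Number of nodes at this level for the given tree size (rounding up).
	nodesAtLevel := (treeSize + (1<<c.level - 1)) >> c.level
	return c.index+1 < nodesAtLevel
}

func main() {
	// In a tree of size two the leaves are (0, 0) and (0, 1); the root is (1, 0).
	leaf := coord{level: 0, index: 1}
	fmt.Println(parent(leaf))  // {1 0}
	fmt.Println(sibling(leaf)) // {0 0}

	// The last leaf (level 0, index 6) of a size-7 tree has no right sibling.
	fmt.Println(rightSiblingExists(coord{0, 6}, 7)) // false
}
```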

### Reading Log Nodes

When nodes are accessed the typical pattern is to request the newest version of a node with a
specified level and index that is not greater than a revision number that represents the tree at a
specific size of interest.

If there are multiple tree snapshots at the same tree size it does not matter which one is picked,
as they include the same set of leaves. The Log API provides support for determining the correct
version to request when reading nodes.

#### Reading Leaves

API requests for leaf data involve a straightforward query by leaf data hash, leaf Merkle hash or
leaf index, followed by formatting and marshaling the data to be returned to the client.

#### Serving Proofs

API requests for proofs involve more work, but both inclusion and consistency proofs follow the
same pattern.

Path calculations in the tree need to be based on a particular tree revision. It is possible to
use any tree revision that corresponds to a tree at least as large as the tree that the proof is
for. We currently use the most recent tree revision number for all proof node calculations and
fetches. This is convenient as we already have it available from when the transaction was
initialized.

There is no guarantee that we have the exact tree size snapshot available for any particular
request, so we're already prepared to pay the cost of some hash recomputation, as described
further below. In practice the impact of this should be minor and will be amortized across all
requests. The proof path is also limited to the portion of the tree that existed at the requested
tree size, not the version we use to compute it.

An example may help here. Suppose we want an inclusion proof for index 10 from the tree as it was
at size 50. We use the latest tree revision, which corresponds to a size of 250,000,000. The path
calculation cannot reference or recompute an internal node that did not exist at tree size 50, so
the huge current tree size is irrelevant to serving this proof.

The tree path for the proof is calculated for a tree size using an algorithm based on the
reference implementation of RFC 6962. The output of this is an ordered slice of `NodeIDs` that
must be fetched from storage and a set of flags indicating required hash recomputations. After a
successful read the hashes are extracted from the nodes, rehashed if necessary and returned to
the client.

Recomputation is needed because we don't necessarily store snapshots on disk for every tree size.
To serve proofs at a version intermediate between two stored versions it can be necessary to
recompute hashes on the rightmost path. This requires extra nodes to be fetched but is bounded by
the depth of the tree, so it never becomes unmanageable.

Consider the state of the tree as it grows from size 7 to size 8 as shown in the following
diagrams:




Assume that only the size 8 tree is stored. When the tree of size eight is queried for an
inclusion proof of leaf 'e' to the older root at size 7, the proof cannot be directly constructed
from the node hashes as they are represented in storage at the later point.

The value of node 'z' differs from the prior state, which got overwritten when the internal node
't' was added at size 8. The hash value of 'z' at size 7 is needed to construct the proof so it
must be recalculated. The value of 's' is unchanged, so hashing 's' together with 'g' correctly
recreates the hash of the internal node 'z'.

The example is a simple case, but there may be several levels of nodes affected depending on the
size of the tree and therefore the shape of the right hand path at that size.
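
To illustrate the recomputation in the size 7 / size 8 example above, here is a minimal sketch
assuming RFC 6962 hashing (a 0x01 domain-separation prefix for interior nodes). The node names
mirror the diagrams, the placeholder values are made up, and the function is illustrative rather
than the actual proof-serving code.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// interiorHash combines two child hashes as RFC 6962 defines for interior
// nodes: SHA256(0x01 || left || right).
func interiorHash(left, right []byte) []byte {
	h := sha256.New()
	h.Write([]byte{0x01})
	h.Write(left)
	h.Write(right)
	return h.Sum(nil)
}

func main() {
	// Placeholder values standing in for the stored hashes of nodes 's' and
	// 'g' from the diagrams; both are unchanged between sizes 7 and 8.
	s := sha256.Sum256([]byte("node s"))
	g := sha256.Sum256([]byte("node g"))

	// The size-7 value of 'z' is no longer in storage once the size-8 tree has
	// been written, but it can be recomputed from its children on the
	// rightmost path of the size-7 tree.
	z := interiorHash(s[:], g[:])
	fmt.Printf("recomputed z (size 7) = %x\n", z)
}
```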

## Map Storage

NOTE: Initial outline documentation. More complete documentation for maps will follow later.

Maps are instances of a sparse Merkle tree where most of the nodes are not present. For more
details on how this assists an implementation see the
[Revocation Transparency](https://github.com/google/trillian/blob/master/docs/RevocationTransparency.pdf)
document.

There are two major differences in the way map storage uses the tree storage, compared to a log,
to represent its sparse Merkle tree.

Firstly, the strata depths are not uniform across the full depth of the map. In the log case the
tree depth (and hence the path length from the root to a leaf) expands up to 64 as the tree grows,
and later leaves will have longer paths than earlier ones. All log subtrees are stored with depth
8 (see the previous diagrams) from the root down to the leaf nodes, for as many subtrees as are
needed. For example a path of 17 bits will traverse three subtrees of depth 8.

In a map all the paths from root to leaf are effectively 256 bits long. When these are passed to
storage the top part of the tree (currently 80 bits) is stored as a set of depth 8 subtrees. Below
this point the remainder uses a single subtree stratum, where the nodes are expected to be
extremely sparse.

The second main difference is that the tree uses the HStar2 algorithm. This requires a different
method of rebuilding internal nodes when the map subtrees are read from disk.

Each mutation to the map is a set of key / value updates and results in a collated set of subtree
writes.

### Map NodeIDs

A `NodeID` for a map represents the bitwise path of the node from the root. These are usually
created directly from a hash by `types.NewNodeIDFromHash()`. Unlike the log, all the bits in the
ID are part of the 'prefix' because there is no suffix (the log suffix is the leaf index within
the subtree).
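
As a rough sketch of the map strata layout described above, the following shows how a 256-bit path
derived from a key hash might be divided into ten depth-8 subtree prefixes plus one deep, sparse
bottom stratum. The 80-bit boundary comes from the text above, but the helper itself is
illustrative and not the actual stratification code.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

const (
	topStrataBits = 80 // depth covered by the depth-8 subtrees at the top of the map
	strataDepth   = 8  // depth of each of those subtrees
)

// strataPrefixes returns, for a 256-bit map path, the successive subtree
// prefixes for the ten depth-8 strata at the top of the tree. Everything
// below topStrataBits sits in a single, very sparse bottom stratum.
func strataPrefixes(path [32]byte) [][]byte {
	var prefixes [][]byte
	for bits := strataDepth; bits <= topStrataBits; bits += strataDepth {
		prefixes = append(prefixes, path[:bits/8])
	}
	return prefixes
}

func main() {
	// A map leaf path is effectively 256 bits; here it is derived from a key hash.
	path := sha256.Sum256([]byte("some map key"))

	for i, p := range strataPrefixes(path) {
		fmt.Printf("stratum %2d prefix: %x\n", i, p)
	}
	fmt.Printf("bottom stratum suffix (%d bits): %x\n", 256-topStrataBits, path[topStrataBits/8:])
}
```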