# Storage Design Notes
## Author: Martin Smith

## Tree Node Storage

The node level of storage provides a fairly abstract tree model that is used to implement
verifiable logs and maps. Most users will not need this level but it is important to know the
concepts involved. The API for this is defined in `storage/tree_storage.go`. Related protos are in
the `storage/storagepb` package.

The model provides a versioned view of the tree. Each transaction that modifies the tree results
in a new revision. Revision numbers increase monotonically as the tree is mutated.

### NodeIDs

Nodes in storage are uniquely identified by a `NodeID`. This combines a tree path with a revision
number. The path is effectively used as a bitwise subtree prefix in the tree. In our subtree
storage optimization the path prefix identifies the subtree and the remaining path is the path to
the node within that subtree.

The same `NodeID` objects are used by both logs and maps but they are interpreted differently.
There are API functions that create them for each case. Mixing node ID types in API calls will
give incorrect results.

### Subtree Stratification

As an optimization, the tree is not stored as a set of raw nodes but as a collection of subtrees.

Currently, subtrees must be a multiple of 8 levels deep (referred to as `strataDepth` in the
code), so a depth of e.g. 7 is not allowed but 8 or 16 is fine. Only the bottom level nodes
(the "leaves") of each subtree are physically stored. Intermediate subtree nodes are rehashed
from the "leaves" when the subtree is loaded into memory. See `storage/cache/subtree_cache.go` for
more details.

Note some caveats to the above paragraph. If depth multiples other than 8 are used this might
require changes to the way node ID prefixes and suffixes are packed and unpacked from byte
slices. There are additional assumptions that all log subtrees are the same depth, though these
would be easier to remove.

For maps the internal nodes are always cleared when stored and then rebuilt from the subtree
"leaf" nodes when the subtree is reloaded. Logs use a modified strategy because it is possible
for internal nodes to depend on more than one subtree. For logs the internal nodes are only
cleared on storage if the subtree is fully populated. This prevents the loss of internal nodes
that depend on other subtrees as the tree grows in levels.

This storage arrangement was chosen because we have predictable access patterns to our data and do
not require classes of tree modification like re-parenting a subtree. It would probably not be
suitable for a general purpose tree.

Subtrees are keyed by the `NodeID` of the root (effectively a path prefix) and contain the
intermediate nodes for that subtree, identified by their suffix. These are actually stored in a
proto map where the key is the suffix path.

Subtrees are versioned to support the access model for nodes described above. Node addresses
within the model are distinct because the path to a subtree must be unique.

When node updates are applied they will affect one or more subtrees, and caching is used to
increase efficiency. After all updates have been applied in memory the cache is flushed to storage
so each affected subtree is only written once. All writes for a transaction will be at the same
revision number.
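
To make the depth-8 stratification described above more concrete, here is a minimal sketch of how
a node path might be split into the prefix that keys a subtree and the suffix that locates the
node inside it. The `splitPath` helper, the byte-aligned packing and the fixed depth of 8 are
illustrative assumptions, not the actual storage code.

```go
package main

import "fmt"

// strataDepth is the assumed subtree depth; the real code allows multiples of 8.
const strataDepth = 8

// splitPath is an illustrative helper: given a node path expressed as whole
// bytes (one bit per tree level, most significant bit first), it returns the
// prefix identifying the containing subtree and the suffix addressing the
// node within that subtree.
func splitPath(path []byte, pathBits int) (prefix, suffix []byte) {
	// Number of complete depth-8 strata above the node's subtree.
	prefixStrata := (pathBits - 1) / strataDepth
	prefixBytes := prefixStrata * strataDepth / 8
	return path[:prefixBytes], path[prefixBytes:]
}

func main() {
	// A 17-bit path traverses three depth-8 subtrees: the first two strata form
	// the prefix and the remaining bit addresses a node in the third subtree.
	path := []byte{0xA5, 0x3C, 0x80} // 17 significant bits, padded to whole bytes
	prefix, suffix := splitPath(path, 17)
	fmt.Printf("prefix=%x suffix=%x\n", prefix, suffix)
}
```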

Subtrees are helpful for reads because it is likely that many of the nodes traversed in Merkle
paths for proofs are part of the same subtree. The number of subtrees involved in a path through a
large tree from the leaves to the root is also bounded. For writes the subtree update batches what
would be many smaller writes into one larger but manageable one.

We gain space efficiency by not storing intermediate nodes (except as noted above for logs and
partially full subtrees). This is a big saving, especially for log storage. It avoids storing
entire tree levels, which get very large as the tree grows, and adds up to approximately a 50%
space saving. This is magnified further as we store many versions of the tree. For the map case
things aren't quite as good because multiple subtree revisions need to be stored with the same
prefix but only one "leaf" node differing. The efficiency of this needs to be determined for
large maps.

#### Subtree Diagrams

This diagram shows a tree as it might actually be stored by our code using subtrees of depth 8.

Each subtree does not include its "root" node, though this counts as part of the depth. There are
additional subtrees below and to the right of the child subtree shown; they can't easily be shown
in the diagram. Obviously, there could be fewer than 256 "leaf" nodes in the subtrees if they are
not yet fully populated. A node always belongs to exactly one subtree; there is no overlap.



As it's hard to visualize the structure at scale with stratum depth 8, some examples of smaller
depths might make things clearer. Though these depths are not supported by the current
implementation, the diagrams are much simpler.

This diagram shows a tree with stratum depth 2. It is a somewhat special case as all the levels
are stored. Note that the root node is never stored and is always recalculated.



This diagram shows a tree with stratum depth 3. Note that only the bottom level of each subtree is
stored and how the binary path is used as a subtree prefix to identify subtrees.



### Consistency and Other Requirements

Storage implementations must provide strongly consistent updates to the tree data. Some users may
see an earlier view than others if updates have not been fully propagated yet, but they must not
see partial updates or inconsistent views.

It is not a requirement that the underlying storage is relational. Our initial implementation uses
an RDBMS and has this [database schema diagram](database-diagram.pdf).

## Log Storage

### The Log Tree

The node tree built for a log is a representation of a Merkle Tree, which starts out empty and
grows as leaves are added. A Merkle Tree of a specific size has a fixed and well-defined shape.

Leaves are never removed, and a completely populated left subtree of the tree structure is never
further mutated.

The personality layer is responsible for deciding whether to accept duplicate leaves as it
controls the leaf identity hash value. For example it could add a timestamp to the data it hashes
so that duplicate leaf data always has a different leaf identity hash.

The log stores two hashes per leaf: a raw SHA256 hash of the leaf data used for deduplication (by
the personality layer) and the Merkle Leaf Hash of the data, which becomes part of the tree.
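
As a sketch of the distinction between the two hashes, assuming the RFC 6962 hashing strategy used
for the log tree (the helper names here are illustrative, not the actual Trillian APIs):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// identityHash is the raw SHA256 of the leaf data; the personality layer can
// use it to detect duplicate submissions.
func identityHash(leafData []byte) [32]byte {
	return sha256.Sum256(leafData)
}

// merkleLeafHash follows RFC 6962: the leaf data is prefixed with a 0x00
// domain-separation byte before hashing, so leaf hashes can never collide
// with interior node hashes (which use a 0x01 prefix).
func merkleLeafHash(leafData []byte) [32]byte {
	return sha256.Sum256(append([]byte{0x00}, leafData...))
}

func main() {
	data := []byte("example leaf")
	fmt.Printf("identity hash:    %x\n", identityHash(data))
	fmt.Printf("merkle leaf hash: %x\n", merkleLeafHash(data))
}
```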

### Log NodeIDs / Tree Coordinates

Log nodes are notionally addressed using a three-dimensional coordinate tuple (level in tree,
index in level, revision number).

Level zero is always the leaf level and additional intermediate levels are added above this as the
tree grows. Such growth does not affect nodes written at a previous revision. Levels are only
created when they are required. The level of a node is always its level in the overall tree; the
`NodeID` coordinates are independent of any subtree storage optimization.

Index is the horizontal position of the node in the level, with zero being the leftmost node in
each level.

For example, in a tree of size two the leaves are (level 0, index 0) and (level 0, index 1) and
the root is (level 1, index 0).

The storage implementation must be able to provide access to nodes using this coordinate scheme
but it is not required to store them this way. The current implementation compacts subtrees for
increased write efficiency so nodes are not distinct database entities. This is hidden by the node
API.

### Log Startup

When log storage is initialized and its tree is not empty the existing state is loaded into a
`compact_merkle_tree`. This can be done efficiently and only requires a few node accesses to
restore the tree state by reading intermediate hashes at each tree level.

As a crosscheck the root hash of the compact tree is compared against the current log root. If it
does not match then the log is corrupt and cannot be used.

### Writing Leaves and Sequencing

In the current RDBMS storage implementation log clients queue new leaves to the log, and a
`LeafData` record is created. Further writes to the Merkle Tree are coordinated by the sequencer,
which adds leaves to the tree. The sequencer is responsible for ordering the leaves and creating
the `SequencedLeafData` row linking the leaf and its sequence number. Queued submissions that have
not been sequenced are not accessible via the log APIs.

When leaves are added to the tree they are processed by a `merkle/compact_merkle_tree`, which
causes a batched set of tree node updates to be applied. Each update is given its own revision
number. The result is that a number of tree snapshots are directly available in storage. This
contrasts with
[previous implementations](https://github.com/google/certificate-transparency/blob/master/cpp/merkletree/)
for Certificate Transparency where the tree is in RAM and only the most recent snapshot is
directly available. Note that we may batch log updates so we don't necessarily have all
intermediate tree snapshots directly available from storage.

As an optimization, intermediate nodes with only a left child are not stored. There is more detail
on how this affects access to tree paths in the file `merkle/merkle_paths.go`. This differs from
the
[Certificate Transparency C++](https://github.com/google/certificate-transparency/blob/master/cpp/merkletree/merkle_tree.h)
in-memory tree implementation. In summary, the code must handle cases where there is no right
sibling of the rightmost node in a level.

Each batched write also produces an internal tree head, linking that write to the revision number.
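
To make the (level, index) coordinate scheme and the missing-right-sibling case above concrete,
here is a minimal sketch; the `coord` type and helper functions are illustrative only and are not
the storage or Merkle path API.

```go
package main

import "fmt"

// coord is an illustrative (level, index) address into the log tree; the
// revision dimension is omitted for brevity.
type coord struct {
	level uint
	index uint64
}

// parent returns the coordinate of the node one level up.
func parent(c coord) coord {
	return coord{level: c.level + 1, index: c.index / 2}
}

// sibling returns the coordinate of the node sharing the same parent.
func sibling(c coord) coord {
	return coord{level: c.level, index: c.index ^ 1}
}

// rightSiblingExists reports whether a right sibling is present in a tree of
// the given size: the rightmost node in a level may have none, which is the
// case the path-serving code has to handle.
func rightSiblingExists(c coord, treeSize uint64) bool {
	if c.index%2 == 1 {
		return true // this node is itself a right child
	}
	// Number of nodes at this level for the given tree size (rounding up).
	nodesAtLevel := (treeSize + (1<<c.level - 1)) >> c.level
	return c.index+1 < nodesAtLevel
}

func main() {
	// In a tree of size two the leaves are (0, 0) and (0, 1); the root is (1, 0).
	leaf := coord{level: 0, index: 1}
	fmt.Println(parent(leaf))  // {1 0}
	fmt.Println(sibling(leaf)) // {0 0}

	// The last leaf (level 0, index 6) of a size-7 tree has no right sibling.
	fmt.Println(rightSiblingExists(coord{0, 6}, 7)) // false
}
```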

### Reading Log Nodes

When nodes are accessed the typical pattern is to request the newest version of a node with a
specified level and index that is not greater than a revision number that represents the tree at a
specific size of interest.

If there are multiple tree snapshots at the same tree size it does not matter which one is picked,
as they include the same set of leaves. The Log API provides support for determining the correct
version to request when reading nodes.

#### Reading Leaves

API requests for leaf data involve a straightforward query by leaf data hash, leaf Merkle hash or
leaf index, followed by formatting and marshaling the data to be returned to the client.

#### Serving Proofs

API requests for proofs involve more work, but both inclusion and consistency proofs follow the
same pattern.

Path calculations in the tree need to be based on a particular tree revision. It is possible to
use any tree revision that corresponds to a tree at least as large as the tree that the proof is
for. We currently use the most recent tree revision number for all proof node calculations and
fetches. This is convenient as we already have it available from when the transaction was
initialized.

There is no guarantee that we have the exact tree size snapshot available for any particular
request, so we're already prepared to pay the cost of some hash recomputation, as described
further below. In practice the impact of this should be minor and will be amortized across all
requests. The proof path is also limited to the portion of the tree that existed at the requested
tree size, not the version we use to compute it.

An example may help here. Suppose we want an inclusion proof for index 10 from the tree as it was
at size 50. We use the latest tree revision, which corresponds to a size of 250,000,000. The path
calculation cannot reference or recompute an internal node that did not exist at tree size 50, so
the huge current tree size is irrelevant to serving this proof.

The tree path for the proof is calculated for a tree size using an algorithm based on the
reference implementation of RFC 6962. The output of this is an ordered slice of `NodeIDs` that
must be fetched from storage and a set of flags indicating required hash recomputations. After a
successful read the hashes are extracted from the nodes, rehashed if necessary and returned to
the client.

Recomputation is needed because we don't necessarily store snapshots on disk for every tree size.
To serve proofs at a version intermediate between two stored versions it can be necessary to
recompute hashes on the rightmost path. This requires extra nodes to be fetched but is bounded by
the depth of the tree, so it never becomes unmanageable.

Consider the state of the tree as it grows from size 7 to size 8 as shown in the following
diagrams:




Assume that only the size 8 tree is stored. When the tree of size eight is queried for an
inclusion proof of leaf 'e' to the older root at size 7, the proof cannot be directly constructed
from the node hashes as they are represented in storage at the later point.

The value of node 'z' differs from the prior state, which got overwritten when the internal node
't' was added at size 8. The hash value of 'z' at size 7 is needed to construct the proof so it
must be recalculated. The value of 's' is unchanged, so hashing 's' together with 'g' correctly
recreates the hash of the internal node 'z'.

The example is a simple case, but there may be several levels of nodes affected depending on the
size of the tree and therefore the shape of the right hand path at that size.
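
To illustrate the recomputation in the size 7 / size 8 example above, here is a minimal sketch
assuming RFC 6962 hashing (a 0x01 domain-separation prefix for interior nodes). The node names
mirror the diagrams, the placeholder values are made up, and the function is illustrative rather
than the actual proof-serving code.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// interiorHash combines two child hashes as RFC 6962 defines for interior
// nodes: SHA256(0x01 || left || right).
func interiorHash(left, right []byte) []byte {
	h := sha256.New()
	h.Write([]byte{0x01})
	h.Write(left)
	h.Write(right)
	return h.Sum(nil)
}

func main() {
	// Placeholder values standing in for the stored hashes of nodes 's' and
	// 'g' from the diagrams; both are unchanged between sizes 7 and 8.
	s := sha256.Sum256([]byte("node s"))
	g := sha256.Sum256([]byte("node g"))

	// The size-7 value of 'z' is no longer in storage once the size-8 tree has
	// been written, but it can be recomputed from its children on the
	// rightmost path of the size-7 tree.
	z := interiorHash(s[:], g[:])
	fmt.Printf("recomputed z (size 7) = %x\n", z)
}
```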

## Map Storage

NOTE: Initial outline documentation. More complete documentation for maps will follow later.

Maps are instances of a sparse Merkle tree where most of the nodes are not present. For more
details on how this assists an implementation see the
[Revocation Transparency](https://github.com/google/trillian/blob/master/docs/RevocationTransparency.pdf)
document.

There are two major differences in the way map storage uses the tree storage, compared to a log,
to represent its sparse Merkle tree.

Firstly, the strata depths are not uniform across the full depth of the map. In the log case the
tree depth (and hence the path length from the root to a leaf) expands up to 64 as the tree grows,
and later leaves will have longer paths than earlier ones. All log subtrees are stored with depth
8 (see the previous diagrams) from the root down to the leaf nodes, for as many subtrees as are
needed. For example a path of 17 bits will traverse three subtrees of depth 8.

In a map all the paths from root to leaf are effectively 256 bits long. When these are passed to
storage the top part of the tree (currently 80 bits) is stored as a set of depth 8 subtrees. Below
this point the remainder uses a single subtree stratum, where the nodes are expected to be
extremely sparse.

The second main difference is that the tree uses the HStar2 algorithm. This requires a different
method of rebuilding internal nodes when the map subtrees are read from disk.

Each mutation to the map is a set of key / value updates and results in a collated set of subtree
writes.

### Map NodeIDs

A `NodeID` for a map represents the bitwise path of the node from the root. These are usually
created directly from a hash by `types.NewNodeIDFromHash()`. Unlike the log, all the bits in the
ID are part of the 'prefix' because there is no suffix (the log suffix is the leaf index within
the subtree).
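
As a rough sketch of the map strata layout described above, the following shows how a 256-bit path
derived from a key hash might be divided into ten depth-8 subtree prefixes plus one deep, sparse
bottom stratum. The 80-bit boundary comes from the text above, but the helper itself is
illustrative and not the actual stratification code.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

const (
	topStrataBits = 80 // depth covered by the depth-8 subtrees at the top of the map
	strataDepth   = 8  // depth of each of those subtrees
)

// strataPrefixes returns, for a 256-bit map path, the successive subtree
// prefixes for the ten depth-8 strata at the top of the tree. Everything
// below topStrataBits sits in a single, very sparse bottom stratum.
func strataPrefixes(path [32]byte) [][]byte {
	var prefixes [][]byte
	for bits := strataDepth; bits <= topStrataBits; bits += strataDepth {
		prefixes = append(prefixes, path[:bits/8])
	}
	return prefixes
}

func main() {
	// A map leaf path is effectively 256 bits; here it is derived from a key hash.
	path := sha256.Sum256([]byte("some map key"))

	for i, p := range strataPrefixes(path) {
		fmt.Printf("stratum %2d prefix: %x\n", i, p)
	}
	fmt.Printf("bottom stratum suffix (%d bits): %x\n", 256-topStrataBits, path[topStrataBits/8:])
}
```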