# Storage Design Notes
## Author: Martin Smith

## Tree Node Storage

The node level of storage provides a fairly abstract tree model that is used to implement
verifiable logs and maps. Most users will not need this level, but it is important to know the
concepts involved. The API for this is defined in `storage/tree_storage.go`. Related protos are in
the `storage/storagepb` package.

The model provides a versioned view of the tree. Each transaction that modifies the tree results
in a new revision. Revision numbers increase monotonically as the tree is mutated.

### NodeIDs

Nodes in storage are uniquely identified by a `NodeID`. This combines a tree path with
a revision number. The path is effectively used as a bitwise subtree prefix in the tree.
In our subtree storage optimization the path prefix identifies the subtree and the remaining
path is the path to the node within that subtree.

The same `NodeID` objects are used by both logs and maps but they are interpreted differently.
There are API functions that create them for each case. Mixing node ID types in API calls will
give incorrect results.
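
As a rough illustration (the types and helpers here are invented for this sketch, not the real
`storage` API), a `NodeID` can be thought of as a bit path plus a count of significant bits,
built differently for logs and maps:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// nodeID is a hypothetical stand-in for the NodeID described above: a bitwise
// path from the root plus a count of how many bits of it are significant.
// The revision is carried alongside when nodes are read or written.
type nodeID struct {
	path     []byte // big-endian bit path from the root of the tree
	pathBits int    // number of significant bits in path
}

// logNodeID sketches how a log-style ID could be built from (level, index)
// coordinates in a tree of up to 64 levels.
func logNodeID(level uint, index uint64) nodeID {
	var path [8]byte
	binary.BigEndian.PutUint64(path[:], index<<level)
	return nodeID{path: path[:], pathBits: 64 - int(level)}
}

// mapNodeID sketches a map-style ID: the full 256-bit key hash is the path.
func mapNodeID(keyHash [32]byte) nodeID {
	return nodeID{path: keyHash[:], pathBits: 256}
}

func main() {
	fmt.Println(logNodeID(0, 1).pathBits, mapNodeID([32]byte{0xab}).pathBits)
}
```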

### Subtree Stratification

As an optimization, the tree is not stored as a set of raw nodes but as a collection of subtrees.

Currently, subtrees must be a multiple of 8 levels deep (referred to as `strataDepth` in the
code), so a depth of e.g. 7 levels is not allowed but 8 or 16 is fine. Only the bottom level
nodes (the "leaves") of each subtree are physically stored. Intermediate subtree nodes are rehashed
from the "leaves" when the subtree is loaded into memory. See `storage/cache/subtree_cache.go` for
more details.

Note some caveats to the above paragraph. If depth multiples other than 8 are used, this
might require changes to the way node ID prefixes and suffixes are packed and unpacked from
byte slices. There are additional assumptions that all log subtrees are the same depth, though
these would be easier to remove.
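
To make the prefix / suffix packing concrete, here is a hedged sketch (not the real packing code)
that splits a node path into its depth 8 subtree prefix and in-subtree suffix:

```go
package main

import "fmt"

const strataDepth = 8 // levels per subtree, as currently required

// splitPath divides a bitwise node path into the prefix identifying the
// subtree it lives in and the number of suffix bits addressing the node
// within that subtree. pathBits is the number of significant bits in path.
func splitPath(path []byte, pathBits int) (prefix []byte, suffixBits int) {
	prefixBits := (pathBits / strataDepth) * strataDepth
	if prefixBits == pathBits && pathBits > 0 {
		// A node whose depth is an exact multiple of the stratum depth is a
		// bottom-row node of the stratum above, so step the prefix back.
		prefixBits -= strataDepth
	}
	return path[:prefixBits/8], pathBits - prefixBits
}

func main() {
	// A 17-bit path crosses two full depth-8 subtrees and ends one level into
	// a third: 16 prefix bits identify the subtree, 1 suffix bit remains.
	prefix, suffixBits := splitPath([]byte{0xAB, 0xCD, 0x80}, 17)
	fmt.Printf("prefix=%x suffixBits=%d\n", prefix, suffixBits)
}
```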

For maps the internal nodes are always cleared when stored and then rebuilt from the subtree
"leaf" nodes when the subtree is reloaded. Logs use a modified strategy because it is possible
for internal nodes to depend on more than one subtree. For logs the internal nodes are cleared
on storage only if the subtree is fully populated. This prevents the loss of internal nodes that
depend on other subtrees as the tree is growing in levels.
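
The following sketch (invented types, not `subtree_cache.go` itself) captures that rule: maps
always strip internal nodes before writing, logs only once the subtree's bottom row is full:

```go
package main

import "fmt"

// subtree is a hypothetical in-memory form of a stored subtree.
type subtree struct {
	depth         int               // stratum depth, currently 8
	leaves        map[string][]byte // bottom-row hashes, keyed by suffix path
	internalNodes map[string][]byte // interior hashes that could be rebuilt
}

// prepareForWrite drops internal nodes that can safely be rebuilt on read:
// always for maps, and for logs only once the subtree's leaf row is full.
func prepareForWrite(s *subtree, isLog bool) {
	fullLeafCount := 1 << uint(s.depth)
	if !isLog || len(s.leaves) == fullLeafCount {
		s.internalNodes = nil
	}
}

func main() {
	s := &subtree{
		depth:         8,
		leaves:        map[string][]byte{"\x01": {0xaa}},
		internalNodes: map[string][]byte{"\x02": {0xbb}},
	}
	prepareForWrite(s, true) // partially populated log subtree: internals kept
	fmt.Println("internal nodes kept:", len(s.internalNodes))
}
```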

This storage arrangement was chosen because we have predictable access patterns to our data and do
not require classes of tree modification like re-parenting a subtree. It would probably not be
suitable for a general purpose tree.

Subtrees are keyed by the `NodeID` of the root (effectively a path prefix) and contain the
intermediate nodes for that subtree, identified by their suffix. These are actually stored in a
proto map where the key is the suffix path.
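
The stored record is roughly the following shape; this Go struct is an approximation for
illustration, not the actual `storagepb` proto definition:

```go
package main

import "fmt"

// storedSubtree approximates the information kept per subtree: the path
// prefix that identifies it, the bottom-row node hashes keyed by their
// suffix path, and (for partially populated log subtrees) any internal
// node hashes that cannot yet be rebuilt from the leaves.
type storedSubtree struct {
	Prefix        []byte            // path of the subtree's root from the tree root
	Depth         int               // stratum depth
	Leaves        map[string][]byte // suffix path -> node hash
	InternalNodes map[string][]byte // suffix path -> node hash, often empty
}

func main() {
	st := storedSubtree{
		Prefix: []byte{0xa7},
		Depth:  8,
		Leaves: map[string][]byte{"\x3c": {0xde, 0xad}},
	}
	fmt.Printf("subtree %x holds %d leaves\n", st.Prefix, len(st.Leaves))
}
```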

Subtrees are versioned to support the access model for nodes described above. Node addresses within
the model are distinct because the path to a subtree must be unique.

When node updates are applied they will affect one or more subtrees and caching is used to increase
efficiency. After all updates have been done in-memory the cache is flushed to storage so each
affected subtree is only written once. All writes for a transaction will be at the same revision
number.
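
A minimal sketch of that write pattern, with invented types: node updates land in cached subtrees
and each dirty subtree is flushed exactly once, at the transaction's revision:

```go
package main

import "fmt"

// cache accumulates in-memory node updates grouped by subtree prefix.
type cache struct {
	subtrees map[string]map[string][]byte // prefix -> (suffix -> hash)
	dirty    map[string]bool              // prefixes touched in this transaction
}

func (c *cache) setNode(prefix, suffix string, hash []byte) {
	if c.subtrees[prefix] == nil {
		c.subtrees[prefix] = map[string][]byte{}
	}
	c.subtrees[prefix][suffix] = hash
	c.dirty[prefix] = true
}

// flush writes each modified subtree once, all at the same revision.
func (c *cache) flush(rev int64, write func(prefix string, nodes map[string][]byte, rev int64)) {
	for prefix := range c.dirty {
		write(prefix, c.subtrees[prefix], rev)
	}
	c.dirty = map[string]bool{}
}

func main() {
	c := &cache{subtrees: map[string]map[string][]byte{}, dirty: map[string]bool{}}
	c.setNode("a", "x", []byte{1})
	c.setNode("a", "y", []byte{2}) // same subtree: still a single write below
	c.flush(42, func(p string, n map[string][]byte, rev int64) {
		fmt.Printf("write subtree %q with %d nodes at revision %d\n", p, len(n), rev)
	})
}
```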

Subtrees are helpful for reads because it is likely that many of the nodes traversed in
Merkle paths for proofs are part of the same subtree. The number of subtrees involved in a path
through a large tree from the leaves to the root is also bounded; with depth 8 subtrees a path
through a 64 level tree touches at most 8 of them. For writes the subtree update
batches what would be many smaller writes into one larger but manageable one.

We gain space efficiency by not storing intermediate nodes (except as noted above for logs
and partially full subtrees). This is a big saving, especially for log storage. It avoids storing
entire tree levels, which get very large as the tree grows. Since a full binary tree has roughly
as many internal nodes as leaves, this adds up to an approximately 50% space
saving. This is magnified further as we store many versions of the tree. For the map case
things aren't quite as good because multiple subtree revisions need to be stored with the same
prefix but only one "leaf" node differing. The efficiency of this needs to be determined for
large maps.

#### Subtree Diagrams

This diagram shows a tree as it might actually be stored by our code using subtrees of depth 8.

Each subtree does not include its "root" node, though this counts as part of the depth. There are
additional subtrees below and to the right of the child subtree shown; they can't easily be shown
in the diagram. Obviously, there could be fewer than 256 "leaf" nodes in the subtrees if they are not
yet fully populated. A node always belongs to exactly one subtree; there is no overlap.

![strata depth 8 tree](StratumDepth8.png "Stratum Depth 8")

As it's hard to visualize the structure at scale with stratum depth 8, some examples of smaller
depths might make things clearer. Though these are not supported by the current implementation,
the diagrams are much simpler.

This diagram shows a tree with stratum depth 2. It is a somewhat special case as all the levels are
stored. Note that the root node is never stored and is always recalculated.

![strata depth 2 tree diagram](StratumDepth2.png "Stratum Depth 2")

This diagram shows a tree with stratum depth 3. Note that only the bottom level of each subtree is
stored and how the binary path is used as a subtree prefix to identify subtrees.

![strata depth 3 tree diagram](StratumDepth3.png "Stratum Depth 3")

### Consistency and Other Requirements

Storage implementations must provide strongly consistent updates to the tree data. Some users may
see an earlier view than others if updates have not been fully propagated yet but they must not see
partial updates or inconsistent views.

It is not a requirement that the underlying storage is relational. Our initial implementation uses
an RDBMS and has this [database schema diagram](database-diagram.pdf).

## Log Storage

### The Log Tree

The node tree built for a log is a representation of a Merkle Tree, which starts out empty and grows
as leaves are added. A Merkle Tree of a specific size is a fixed and well-defined shape.

Leaves are never removed and a completely populated left subtree of the tree structure is never
further mutated.

The personality layer is responsible for deciding whether to accept duplicate leaves as it controls
the leaf identity hash value. For example, it could add a timestamp to the data it hashes so
that duplicate leaf data always has a different leaf identity hash.

The log stores two hashes per leaf: a raw SHA256 hash of the leaf data used for deduplication
(by the personality layer), and the Merkle Leaf Hash of the data, which becomes part of the tree.
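
For the default RFC 6962 / SHA-256 hashing this amounts to something like the sketch below (a
personality may substitute its own identity hash, as described above):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// leafIdentityHash is the raw hash of the leaf data, used by personalities
// for deduplication.
func leafIdentityHash(data []byte) [32]byte {
	return sha256.Sum256(data)
}

// merkleLeafHash is the RFC 6962 leaf hash: SHA-256 over a 0x00 prefix byte
// followed by the leaf data. This is the value that enters the Merkle tree.
func merkleLeafHash(data []byte) [32]byte {
	return sha256.Sum256(append([]byte{0x00}, data...))
}

func main() {
	data := []byte("leaf contents")
	fmt.Printf("identity: %x\n", leafIdentityHash(data))
	fmt.Printf("merkle:   %x\n", merkleLeafHash(data))
}
```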

### Log NodeIDs / Tree Coordinates

Log nodes are notionally addressed using a three dimensional coordinate tuple (level in tree, index
in level, revision number).

Level zero is always the leaf level and additional intermediate levels are added above this as the
tree grows. Such growth does not affect nodes written at a previous revision. Levels are only
created when they are required. The level of a node is always the level in the overall tree. The
`NodeID` coordinates are independent of any subtree storage optimization.

Index is the horizontal position of the node in the level, with zero being the leftmost node in
each level.

For example, in a tree of size two the leaves are (level 0, index 0) and (level 0, index 1) and the
root is (level 1, index 0).
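
A tiny illustration of that coordinate scheme (the struct is purely illustrative):

```go
package main

import "fmt"

// coord addresses a log node: level 0 is the leaf row, index counts from the
// left, and revision selects which snapshot of the node to read.
type coord struct {
	level    uint
	index    uint64
	revision int64
}

func main() {
	// The tree of size two from the text, as written at revision 1.
	nodes := []coord{
		{level: 0, index: 0, revision: 1}, // left leaf
		{level: 0, index: 1, revision: 1}, // right leaf
		{level: 1, index: 0, revision: 1}, // root of the size-2 tree
	}
	for _, n := range nodes {
		fmt.Printf("level=%d index=%d rev=%d\n", n.level, n.index, n.revision)
	}
}
```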

The storage implementation must be able to provide access to nodes using this coordinate scheme but
it is not required to store them this way. The current implementation compacts subtrees for
increased write efficiency so nodes are not distinct database entities. This is hidden by the node
API.

### Log Startup

When log storage is initialized and its tree is not empty, the existing state is loaded into a
`compact_merkle_tree`. This can be done efficiently and only requires a few node accesses to
restore the tree state by reading intermediate hashes at each tree level.
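
The number of hashes needed is one per perfect subtree on the tree's right-hand border, i.e. one
per set bit in the tree size. A sketch of that decomposition, assuming the usual compact tree
representation:

```go
package main

import (
	"fmt"
	"math/bits"
)

// borderSubtreeSizes returns the sizes of the perfect subtrees that make up a
// tree of the given size, largest first. Restoring the compact tree needs one
// stored node hash per entry, so the work is O(log treeSize).
func borderSubtreeSizes(treeSize uint64) []uint64 {
	var sizes []uint64
	for treeSize > 0 {
		high := uint64(1) << uint(bits.Len64(treeSize)-1)
		sizes = append(sizes, high)
		treeSize -= high
	}
	return sizes
}

func main() {
	// A tree of size 50 (binary 110010) decomposes into perfect subtrees of
	// sizes 32, 16 and 2, so three node hashes are enough to restore it.
	fmt.Println(borderSubtreeSizes(50))
}
```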

As a crosscheck the root hash of the compact tree is compared against the current log root. If it
does not match then the log is corrupt and cannot be used.

### Writing Leaves and Sequencing

In the current RDBMS storage implementation log clients queue new leaves to the log, and a
`LeafData` record is created. Further writes to the Merkle Tree are coordinated by the sequencer,
which adds leaves to the tree. The sequencer is responsible for ordering the leaves and creating
the `SequencedLeafData` row linking the leaf and its sequence number. Queued submissions that have
not been sequenced are not accessible via the log APIs.
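
In outline, and with invented types standing in for the real `LeafData` / `SequencedLeafData`
records, the flow looks something like this:

```go
package main

import "fmt"

// leafData mirrors a queued submission: identified by its hashes, not yet
// part of the tree.
type leafData struct {
	identityHash string
	merkleHash   string
	data         []byte
}

// sequencedLeaf links a queued leaf to its position in the tree.
type sequencedLeaf struct {
	leaf     leafData
	sequence int64
}

// sequenceBatch assigns consecutive leaf indices to queued submissions,
// starting from the current tree size. Only after this step are the leaves
// visible through the read APIs.
func sequenceBatch(treeSize int64, queued []leafData) []sequencedLeaf {
	out := make([]sequencedLeaf, 0, len(queued))
	for i, l := range queued {
		out = append(out, sequencedLeaf{leaf: l, sequence: treeSize + int64(i)})
	}
	return out
}

func main() {
	queued := []leafData{{identityHash: "a"}, {identityHash: "b"}}
	for _, s := range sequenceBatch(7, queued) {
		fmt.Println(s.leaf.identityHash, "->", s.sequence)
	}
}
```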

When leaves are added to the tree they are processed by a `merkle/compact_merkle_tree`, which causes a
batched set of tree node updates to be applied. Each update is given its own revision number. The
result is that a number of tree snapshots are directly available in storage. This contrasts with
[previous implementations](https://github.com/google/certificate-transparency/blob/master/cpp/merkletree/)
for Certificate Transparency where the tree is in RAM and only the most recent snapshot is directly
available. Note that we may batch log updates so we don't necessarily have all intermediate tree
snapshots directly available from storage.

As an optimization, intermediate nodes with only a left child are not stored. There is more detail
on how this affects access to tree paths in the file `merkle/merkle_paths.go`. This differs from
the
[Certificate Transparency C++](https://github.com/google/certificate-transparency/blob/master/cpp/merkletree/merkle_tree.h)
in-memory tree implementation. In summary, the code must handle cases where there is no right
sibling of the rightmost node in a level.

Each batched write also produces an internal tree head, linking that write to the revision number.

### Reading Log Nodes

When nodes are accessed the typical pattern is to request the newest version of a node with a
specified level and index that is not greater than a revision number that represents the tree at a
specific size of interest.

If there are multiple tree snapshots at the same tree size it does not matter which one is picked
for this as they include the same set of leaves. The Log API provides support for determining the
correct version to request when reading nodes.
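
A sketch of that read rule against an in-memory stand-in for storage: for a given (level, index)
we want the newest stored revision that does not exceed the revision of interest:

```go
package main

import (
	"fmt"
	"sort"
)

// versions maps stored revision -> node hash for one (level, index) position.
type versions map[int64][]byte

// readAtRevision returns the hash written at the highest revision that is
// <= maxRev, mirroring the "newest version not greater than" access pattern.
func readAtRevision(v versions, maxRev int64) ([]byte, bool) {
	revs := make([]int64, 0, len(v))
	for r := range v {
		revs = append(revs, r)
	}
	sort.Slice(revs, func(i, j int) bool { return revs[i] > revs[j] })
	for _, r := range revs {
		if r <= maxRev {
			return v[r], true
		}
	}
	return nil, false
}

func main() {
	v := versions{3: {0x01}, 7: {0x02}, 12: {0x03}}
	h, ok := readAtRevision(v, 10) // picks the hash written at revision 7
	fmt.Println(h, ok)
}
```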

#### Reading Leaves

API requests for leaf data involve a straightforward query by leaf data hash, leaf Merkle hash or
leaf index, followed by formatting and marshaling the data to be returned to the client.

#### Serving Proofs

API requests for proofs involve more work but both inclusion and consistency proofs follow the
same pattern.

Path calculations in the tree need to be based on a particular tree revision. It is possible to
use any tree revision that corresponds to a tree at least as large as the tree that the proof is
for. We currently use the most recent tree revision number for all proof node calculation /
fetches. This is useful as we already have it available from when the transaction was initialized.

There is no guarantee that we have the exact tree size snapshot available for any particular
request so we're already prepared to pay the cost of some hash recomputation, as described further
below. In practice the impact of this should be minor, and will be amortized across all requests.
The proof path is also limited to the portion of the tree that existed at the requested tree
size, not the version we use to compute it.

An example may help here. Suppose we want an inclusion proof for index 10 from the tree as it was
at size 50. We use the latest tree revision, which corresponds to a size of 250,000,000. The
path calculation cannot reference or recompute an internal node that did not exist at tree
size 50 so the huge current tree size is irrelevant to serving this proof.

The tree path for the proof is calculated for a tree size using an algorithm based on the
reference implementation of RFC 6962. The output of this is an ordered slice of `NodeIDs` that must
be fetched from storage and a set of flags indicating required hash recomputations. After a
successful read the hashes are extracted from the nodes, rehashed if necessary and returned to
the client.

Recomputation is needed because we don't necessarily store snapshots on disk for every tree size.
To serve proofs at a version intermediate between two stored versions it can be necessary to
recompute hashes on the rightmost path. This requires extra nodes to be fetched but is bounded
by the depth of the tree so this never becomes unmanageable.

Consider the state of the tree as it grows from size 7 to size 8 as shown in the following
diagrams:

![Merkle tree size 7 diagram](tree_7.png "Merkle Tree Size 7")

![Merkle tree size 8 diagram](tree_8.png "Merkle Tree Size 8")

Assume that only the size 8 tree is stored. When the tree of size eight is queried for an
inclusion proof of leaf 'e' to the older root at size 7, the proof cannot be directly constructed
from the node hashes as they are represented in storage at the later point.

The value of node 'z' differs from the prior state, which got overwritten when the internal node 't'
was added at size 8. The hash value of 'z' at size 7 is needed to construct the proof, so it must
be recalculated. The value of 's' is unchanged so hashing 's' together with 'g' correctly
recreates the hash of the internal node 'z'.
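
With the default RFC 6962 hashing the recomputation in this example is a single interior-node
hash. The sketch below uses placeholder values for 's' and 'g'; in practice they are read from
storage:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// hashChildren is the RFC 6962 interior node hash: SHA-256 over a 0x01 prefix
// byte followed by the left and right child hashes.
func hashChildren(left, right []byte) []byte {
	h := sha256.New()
	h.Write([]byte{0x01})
	h.Write(left)
	h.Write(right)
	return h.Sum(nil)
}

func main() {
	// Placeholder values; in the real case 's' and 'g' come from storage.
	s := sha256.Sum256([]byte("node s"))
	g := sha256.Sum256([]byte("leaf hash g"))

	// Recreate the size-7 value of 'z', which was overwritten at size 8.
	z := hashChildren(s[:], g[:])
	fmt.Printf("recomputed z: %x\n", z)
}
```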

The example is a simple case but there may be several levels of nodes affected depending on the
size of the tree and therefore the shape of the right hand path at that size.

## Map Storage

NOTE: Initial outline documentation. More complete documentation for maps will follow later.

Maps are instances of a sparse Merkle tree where most of the nodes are not present. For more
details on how this assists an implementation see the
[Revocation Transparency](https://github.com/google/trillian/blob/master/docs/RevocationTransparency.pdf) document.

There are two major differences in the way map storage uses the tree storage from a log to
represent its sparse Merkle tree.

Firstly, the strata depths are not uniform across the full depth of the map. In the log case the
tree depth (and hence path length from the root to a leaf) expands up to 64 as the tree grows
and later leaves will have longer paths than earlier ones. All log subtrees will be stored with
depth 8 (see previous diagrams) from the root down to the leaf nodes for as many subtrees as
are needed. For example, a path of 17 bits will traverse three subtrees of depth 8.

In a map all the paths from root to leaf are effectively 256 bits long. When these are passed to
storage the top part of the tree (currently 80 bits) is stored as a set of depth 8 subtrees.
Below this point the remainder uses a single subtree stratum, where the nodes are expected to be
extremely sparse.
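
Written out as a list of stratum depths, that layout is ten depth 8 strata for the top 80 bits
plus one 176 bit stratum for the rest (treat the exact numbers as the current defaults described
above):

```go
package main

import "fmt"

func main() {
	// Ten depth-8 strata for the top 80 bits of the 256-bit map path...
	var strata []int
	for i := 0; i < 10; i++ {
		strata = append(strata, 8)
	}
	// ...and a single sparse stratum for the remaining 176 bits.
	strata = append(strata, 256-80)

	total := 0
	for _, d := range strata {
		total += d
	}
	fmt.Println(strata, "total depth:", total) // total depth: 256
}
```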

The second main difference is that the tree uses the HStar2 algorithm. This requires a different
method of rebuilding internal nodes when the map subtrees are read from disk.

Each mutation to the map is a set of key / value updates and results in a collated set of subtree
writes.

### Map NodeIDs

A `NodeID` for a map represents the bitwise path of the node from the root. These are
usually created directly from a hash by `types.NewNodeIDFromHash()`. Unlike the log, all the bits
in the ID are part of the 'prefix' because there is no suffix (the log suffix is the leaf index
within the subtree).
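
A minimal sketch of that construction, hashing a key with SHA-256 and treating the whole digest as
the path; this is a stand-in for `types.NewNodeIDFromHash()`, not its actual implementation:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// mapNodeID is a hypothetical stand-in: the full 256-bit digest is the path
// from the root to the map leaf, so all bits are "prefix" and there is no
// in-subtree suffix to carry separately.
type mapNodeID struct {
	path     []byte
	pathBits int
}

func newMapNodeID(key []byte) mapNodeID {
	h := sha256.Sum256(key)
	return mapNodeID{path: h[:], pathBits: len(h) * 8}
}

func main() {
	id := newMapNodeID([]byte("some map key"))
	fmt.Printf("path=%x bits=%d\n", id.path, id.pathBits)
}
```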