## Commit-Log based Trillian storage

*Status: Draft*

*Authors: al@google.com, drysdale@google.com, filippo@cloudflare.com*

*Last Updated: 2017-05-12*

## Objective

A design for an alternative Trillian storage layer which uses a distributed and
immutable *commit log* as the source of truth for a Trillian Log's contents and
sequence information, and one or more independent *"readonly"* databases built
from the commit log to serve queries.

This design allows for:

*   flexibility in scaling Trillian deployments,
*   easier recovery from corrupt/failed database deployments since in many
    cases operators can simply delete the failed DB instance and allow it to be
    rebuilt from the commit log, while the remaining instances continue to
    serve.

Initially, this will be built using Apache Kafka for the commit log, with
datacentre-local Apache HBase instances for the serving databases, since this
is what Cloudflare has operational experience in running. Other distributed
commit-log and database engines may work too; the model should also be
compatible with instance-local database implementations such as RocksDB.

Having Trillian support a commit-log based storage system will also ensure
Trillian doesn't inadvertently tie itself exclusively to storage with strong
global consistency.

## Background

Trillian currently supports two storage technologies, MySQL and Spanner, which
provide strong global consistency.

The design presented here requires:

*   A durable, ordered, and immutable commit log.
*   A "local" storage mechanism which can support the operations required by
    the Trillian {tree,log}_storage API.

## Design Overview

![overview diagram](commit_log_based_storage_design_overview.png "Overview")

The `leaves` topic is the canonical source of truth for the ordering of leaves
in a log.

The `STHs` topic is a list of all STHs for a given log.

Kafka topics are configured never to expire entries (this is a supported mode),
and Kafka is known to scale to multiple terabytes within a single partition.

HBase instances are assumed to be one-per-cluster, built from the contents of
the Kafka topics, and, consequently, are essentially disposable.

Queued leaves are sent by the Trillian frontends to the Kafka `Leaves` topic.
Since Kafka topics are append-only and immutable, this effectively sequences
the entries in the queue.
The signer nodes track the leaves and STHs topics to bring their local database
instances up-to-date.  The current master signer will additionally incorporate
new entries in the leaves topic into its tree, ensuring the Kafka offset number
of each leaf matches its position in the Merkle tree, then generate a new
STH which it publishes to the STH topic before updating its local database.

Since the commit log forms the source of truth for the log entry ordering and
committed STHs, everything else can be derived from that. This means that
updates to the serving HBase DBs can be made idempotent, which in turn means
that the transactional requirements of Trillian's LogStorage APIs can be
relaxed: writes to local storage can be buffered and flushed at `Commit` time,
and the only constraint on the implementation is that the final new/updated STH
must be written to the local storage only if all other buffered writes have
been successfully flushed.

The addition of this style of storage implementation requires that Trillian
does not guarantee perfect deduplication of entries, even though it may be
possible to do so with some storage implementations. That is, personalities
MUST present LeafIdentityHashes, and Trillian MAY deduplicate.

## Detailed Design

#### Enqueuing leaves

RPC calls to the frontend's `QueueLeaves` result in the leaves being
individually added to the Kafka topic `Leaves`. They need to be added
individually to allow the Kafka topic sequencing to be the definitive source of
log sequence information.
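
For illustration, a minimal sketch of the enqueue path, in the same abstract
pseudocode style as the signer process below (`kafka.Append` is the same
hypothetical helper used there; error handling is omitted):

```golang
// QueueLeaves appends each submitted leaf to the Kafka `Leaves` topic, one
// message per leaf, so that each leaf's topic offset becomes its position in
// the log.
func QueueLeaves(leaves []*trillian.LogLeaf) {
  for _, leaf := range leaves {
    // One append per leaf: batching several leaves into a single Kafka
    // message would hide their individual offsets, which the design relies
    // on for sequencing.
    offset := kafka.Append("Leaves", leaf)
    glog.V(1).Infof("leaf %x assigned sequence number %d", leaf.LeafIdentityHash, offset)
  }
}
```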

Log frontends may attempt to de-duplicate incoming leaves by consulting the
local storage DB using the identity hash (and/or e.g. using a per-instance LRU
cache), but this will always be a "best effort" affair, so the Trillian APIs
must not assume that duplicates are impossible, even though in practice other
storage implementations may currently prevent them.
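
A best-effort dedup check might look like the following sketch (the `lruCache`
and `db.LeafByIdentityHash` helpers are hypothetical, named here only for
illustration):

```golang
// maybeDeduplicate returns the already-sequenced leaf if we happen to know
// about it, or nil if the leaf should be enqueued as new. A nil result does
// NOT guarantee the leaf is absent from the log.
func maybeDeduplicate(identityHash []byte) *trillian.LogLeaf {
  // Cheap check first: a per-instance LRU of recently seen identity hashes.
  if leaf, ok := lruCache.Get(string(identityHash)); ok {
    return leaf.(*trillian.LogLeaf)
  }
  // Fall back to the local (possibly stale) serving DB.
  if leaf, err := db.LeafByIdentityHash(identityHash); err == nil && leaf != nil {
    return leaf
  }
  // Not found locally; it may still already exist in the commit log.
  return nil
}
```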

#### Master election

Multiple sequencers may be running to provide resilience; if this is the case
there must be a mechanism for choosing a single master instance among the
running sequencers. The Trillian repo already provides an etcd-backed
implementation of this.

A sequencer must only participate in the election (or remain the master) if its
local database state is at least as new as the latest message in the Kafka
`STHs` topic.
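
A sketch of that eligibility check, in the same abstract style as the signer
pseudocode below (the `kafka.LatestSTHOffset` helper and the `LocalSTH` fields
are assumptions for illustration):

```golang
// eligibleForMastership reports whether this sequencer's local DB has caught
// up with everything already published to the STHs topic, and therefore may
// safely stand for (or retain) mastership.
func eligibleForMastership(dbSTH LocalSTH) bool {
  // Offset of the newest STH currently in the topic.
  latest := kafka.LatestSTHOffset("STHs/<treeID>")
  // dbSTH.sthOffset is the offset of the STH our local DB has committed to;
  // we are eligible only if nothing newer has been published.
  return dbSTH.sthOffset >= latest
}
```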

The current master sequencer will create new STHs and publish them to the
`STHs` topic, while the remaining sequencers run in a "mirror" mode to keep
their local database state up-to-date with the master.

#### Local DB storage

This does not *need* to be transactional, because writes should be idempotent,
but the implementation of the Trillian storage driver must buffer *all*
writes and only attempt to apply them to the local storage when `Commit` is
called.

The write of an updated STH to local storage needs slightly special attention,
in that it must be the last thing written by `Commit`, and must only be written
if all other buffered writes succeeded.
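
A minimal sketch of that buffering behaviour, assuming a hypothetical
`bufferedTX` wrapper around the local DB (the `LocalDB` and `STH` types are
also illustrative assumptions):

```golang
// put is a single buffered key/value write.
type put struct {
  key   string
  value []byte
}

// bufferedTX collects writes in memory and applies them to the local DB only
// when Commit is called. The new STH is held separately so it can be written
// last, and only if everything else succeeded.
type bufferedTX struct {
  db     LocalDB // handle to the local (e.g. HBase) store
  puts   []put
  newSTH *STH
}

func (tx *bufferedTX) Commit() error {
  // Apply all buffered writes first; they are idempotent, so a partial
  // failure here can simply be retried on the next sequencing cycle.
  for _, p := range tx.puts {
    if err := tx.db.Put(p.key, p.value); err != nil {
      return err // the new STH is deliberately NOT written
    }
  }
  // Only once everything else is durable do we commit to the new STH.
  if tx.newSTH != nil {
    return tx.db.PutSTH(tx.newSTH)
  }
  return nil
}
```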

In the case of a partial commit failure, or a crash of the signer, the next
sequencing cycle will simply re-attempt identical writes, as a consequence of
the signer process outlined below.

#### Sequencing

Assigning sequence numbers to queued leaves is implicitly performed by the
addition of entries to the Kafka `Leaves` topic (the sequence number is the
entry's *offset*, in Kafka terminology).
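
In other words, when a signer pulls a message from the `Leaves` topic, the
leaf's index in the Merkle tree is simply the message's offset; a one-line
sketch (the `ConsumerMessage` shape is illustrative, not a specific Kafka
client API):

```golang
// leafIndexFor maps a message consumed from the `Leaves` topic to the leaf's
// position in the log: no separate sequence-number allocation is needed.
func leafIndexFor(msg ConsumerMessage) int64 {
  // Offsets in a never-expiring, non-compacted topic are dense and 0-based,
  // matching leaf indices exactly.
  return msg.Offset
}
```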

##### Abstract Signer process

```golang
func SignerRun() {
  // if any of the below operations fail, just bail and retry

  // read `dbSTH` (containing `treeRevision` and `sthOffset`) from local DB
  dbSTH.treeRevision, dbSTH.sthOffset = tx.LatestSTH()

  // Sanity check that the STH table has what we already know.
  ourSTH := kafka.Read("STHs/<treeID>", dbSTH.sthOffset)
  if ourSTH == nil {
    glog.Errorf("should not happen - local DB has data ahead of STHs topic")
    return
  }
  if ourSTH.expectedOffset != dbSTH.sthOffset {
    glog.Errorf("should not happen - local DB committed to invalid STH from topic")
    return
  }
  if ourSTH.timestamp != dbSTH.timestamp || ourSTH.tree_size != dbSTH.tree_size {
    glog.Errorf("should not happen - local DB has different data than STHs topic")
    return
  }

  // Look to see if anyone else has already stored data just ahead of our STH.
  nextOffset := dbSTH.sthOffset
  nextSTH := nil
  for {
    nextOffset++
    nextSTH = kafka.Read("STHs/<treeID>", nextOffset)
    if nextSTH == nil {
      break
    }
    if nextSTH.expectedOffset != nextOffset {
      // Someone's been writing STHs when they weren't supposed to be, skip
      // this one until we find another which is in-sync.
      glog.Warning("skipping unexpected STH")
      continue
    }
    if nextSTH.timestamp < ourSTH.timestamp || nextSTH.tree_size < ourSTH.tree_size {
      glog.Fatal("should not happen - earlier STH with later offset")
      return
    }
    // Found a valid STH just ahead of ours; stop looking.
    break
  }

  if nextSTH == nil {
    // We're up-to-date with the STHs topic (as of a moment ago) ...
    if !IsMaster() {
      // ... but we're not allowed to create fresh STHs.
      return
    }
    // ... and we're the master. Move the STHs topic along to encompass any unincorporated leaves.
    offset := dbSTH.tree_size
    batch := kafka.Read("Leaves", offset, batchSize)
    for b := range batch {
      db.Put("/<treeID>/leaves/<b.offset>", b.contents)
    }

    root := UpdateMerkleTreeAndBufferNodes(batch, dbSTH.treeRevision+1)
    newSTH := STH{root, ...}
    newSTH.treeRevision = dbSTH.treeRevision + 1
    newSTH.expectedOffset = nextOffset
    actualOffset := kafka.Append("STHs/<treeID>", newSTH)
    if actualOffset != nextOffset {
      glog.Warning("someone else wrote an STH while we were master")
      tx.Abort()
      return
    }
    newSTH.sthOffset = actualOffset
    tx.BufferNewSTHForDB(newSTH)
    tx.Commit() // flush writes
  } else {
    // There is an STH one ahead of us that we're not caught up with yet.
    // Read the leaves between what we have in our DB, and that STH...
    leafRange := InclusiveExclusive(dbSTH.tree_size, nextSTH.tree_size)
    batch := kafka.Read("Leaves", leafRange)
    // ... and store them in our local DB
    for b := range batch {
      db.Put("<treeID>/leaves/<b.offset>", b.contents)
    }
    newRoot := tx.UpdateMerkleTreeAndBufferNodes(batch, dbSTH.treeRevision+1)
    if newRoot != nextSTH.root {
      glog.Warning("calculated root hash != expected root hash, corrupt DB?")
      tx.Abort()
      return
    }
    tx.BufferNewSTHForDB(nextSTH)
    tx.Commit() // flush writes
    // We may still not be caught up, but that's for the next time around.
  }
}
```

##### Fit with storage interfaces

LogStorage interfaces will need to be tweaked slightly; in particular:
 - `UpdateSequencedLeaves` should be pulled out of `LeafDequeuer` and moved
   into a `LeafSequencer` (or something) interface.
 - It would be nice to introduce a roll-up interface which describes the
   responsibilities of the "local DB" thing, so that we can compose
   `commit-queue+local` storage implementations using existing DB impls
   (or at least not tie this tightly to HBase); a possible shape is sketched
   below.
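
A possible shape for that roll-up interface, purely as a sketch (the
`LocalTreeStorage` name and the method signatures are illustrative assumptions,
not existing Trillian types):

```golang
// LocalTreeStorage is the roll-up interface describing what the "local DB"
// must provide, so that the commit-queue layer can be composed with any
// backend (HBase, RocksDB, ...) which satisfies it.
type LocalTreeStorage interface {
  // LatestSTH returns the STH the local DB most recently committed to,
  // together with its tree revision and its offset in the STHs topic.
  LatestSTH(ctx context.Context) (sth *trillian.SignedTreeHead, treeRevision, sthOffset int64, err error)
  // Put buffers a key/value write (leaf data, Merkle nodes, ...);
  // nothing is applied to the backend until Commit is called.
  Put(key string, value []byte)
  // BufferNewSTH records the STH to be written last by Commit.
  BufferNewSTH(sth *trillian.SignedTreeHead)
  // Commit flushes all buffered writes, writing the buffered STH only if
  // every other write succeeded.
  Commit(ctx context.Context) error
}
```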

###### TX
```golang
type splitTX struct {
   treeID       int64
   ...

   dbTX         *storage.LogTX // something something handwavy
   cqTX         *storage.???   // something something handwavy

   dbSTH        *trillian.SignedTreeHead
   nextSTH      *trillian.SignedTreeHead  // actually something which contains this plus some metadata
   treeRevision int64
   sthOffset    int64
}
```

###### `Storage.Begin()`

Starts a Trillian transaction. This will do:
   1. the read of `currentSTH`, `treeRevision`, and `sthOffset` from the DB
   1. verification of that against its corresponding entry in Kafka

and return a `LogTX` struct containing these values as unexported fields.
**The HBase LogTX struct will buffer all writes locally until `Commit` is
called**, whereupon it will attempt to apply the writes as HBase `PUT` requests
(presumably it can be smart about batching where appropriate).

```golang
// Begin starts a Trillian transaction.
// This will get the latest known STH from the "local" DB, and verify
// that the corresponding STH in Kafka matches.
func (ls *CQComboStorage) Begin() (LogTX, error) {
  // create db and cq "TX" objects

  tx := &splitTX{...}

  // read `dbSTH` (containing `treeRevision` and `sthOffset`) from local DB
  tx.dbSTH, tx.treeRevision, tx.sthOffset = dbTX.latestSTH()

  // Sanity check that the STH table has what we already know.
  ourSTH := cqTX.GetSTHAt(tx.sthOffset)

  if ourSTH == nil {
    return nil, fmt.Errorf("should not happen - local DB has data ahead of STHs topic")
  }
  if ourSTH.expectedOffset != tx.sthOffset {
    return nil, fmt.Errorf("should not happen - local DB committed to invalid STH from topic")
  }
  if ourSTH.timestamp != tx.dbSTH.timestamp || ourSTH.tree_size != tx.dbSTH.tree_size {
    return nil, fmt.Errorf("should not happen - local DB has different data than STHs topic")
  }

  ...

  return tx, nil
}
```

###### `DequeueLeaves()`

Calls to this method ignore `limit` and `cutoff` when there exist newer STHs in
the Kafka queue (because we're following in someone else's footsteps), and
return the `batch` of leaves outlined above.

*TODO(al): should this API be reworked?*

```golang
func (tx *splitTX) DequeueLeaves() (..., error) {
  // Look to see if anyone else has already stored data just ahead of our STH.
  nextOffset := tx.sthOffset
  nextSTH := nil
  for {
    nextOffset++
    nextSTH = tx.cqTX.GetSTHAt(nextOffset)
    if nextSTH == nil {
      break
    }
    if nextSTH.expectedOffset != nextOffset {
      // Someone's been writing STHs when they weren't supposed to be, skip
      // this one until we find another which is in-sync.
      glog.Warning("skipping invalid STH")
      continue
    }
    if nextSTH.timestamp < tx.dbSTH.timestamp || nextSTH.tree_size < tx.dbSTH.tree_size {
      return nil, fmt.Errorf("should not happen - earlier STH with later offset")
    }
    // Found a valid STH just ahead of ours; stop looking.
    break
  }
  tx.nextSTH = nextSTH

  if nextSTH == nil {
    // We're up-to-date with the STHs topic, so serve a fresh batch of
    // unincorporated leaves.
    offset := tx.dbSTH.tree_size
    batch := tx.cqTX.ReadLeaves(offset, limit)
    return batch, nil
  } else {
    // There is an STH one ahead of us that we're not caught up with yet.
    // Read the leaves between what we have in our DB, and that STH...
    leafRange := InclusiveExclusive(tx.dbSTH.tree_size, nextSTH.tree_size)
    batch := tx.cqTX.ReadLeaves(leafRange)
    return batch, nil
  }
}
```

###### `UpdateSequencedLeaves()`

This method should be moved out from `LeafDequeuer` and into a new interface
`LeafWriter` implemented by dbTX.
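
A minimal sketch of what that interface might look like (the name comes from
the paragraph above, but the method signature is an illustrative assumption):

```golang
// LeafWriter is implemented by the local-DB transaction (dbTX) and records
// leaves at the sequence numbers already assigned to them by the commit log.
type LeafWriter interface {
  // UpdateSequencedLeaves writes the given leaves, whose LeafIndex fields
  // are already populated from their Kafka offsets, to the local database.
  UpdateSequencedLeaves(ctx context.Context, leaves []*trillian.LogLeaf) error
}
```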

**TODO(al): keep writing!**