## Commit-Log based Trillian storage

*Status: Draft*

*Authors: al@google.com, drysdale@google.com, filippo@cloudflare.com*

*Last Updated: 2017-05-12*

## Objective

A design for an alternative Trillian storage layer which uses a distributed and
immutable *commit log* as the source of truth for a Trillian Log's contents and
sequence information, and one or more independent *"readonly"* databases built
from the commit log to serve queries.

This design allows for:

* flexibility in scaling Trillian deployments,
* easier recovery from corrupt/failed database deployments, since in many
  cases operators can simply delete the failed DB instance and allow it to be
  rebuilt from the commit log, while the remaining instances continue to
  serve.

Initially, this will be built using Apache Kafka for the commit log, with
datacentre-local Apache HBase instances for the serving databases, since this
is what Cloudflare has operational experience in running. Other distributed
commit-log and database engines could be substituted, and the model should also
work with instance-local database implementations such as RocksDB.

Having Trillian support a commit-log based storage system will also ensure
that Trillian doesn't inadvertently tie itself exclusively to strongly
consistent global storage.

## Background

Trillian currently supports two storage technologies, MySQL and Spanner, which
provide strong global consistency.

The design presented here requires:

* A durable, ordered, and immutable commit log.
* A "local" storage mechanism which can support the operations required by
  the Trillian {tree,log}_storage API.

## Design Overview

The `Leaves` topic is the canonical source of truth for the ordering of leaves
in a log.

The `STHs` topic is a list of all STHs for a given log.

Kafka topics are configured never to expire entries (this is a supported mode),
and Kafka is known to scale to multiple terabytes within a single partition.

HBase instances are assumed to be one-per-cluster, built from the contents of
the Kafka topics, and, consequently, are essentially disposable.

Queued leaves are sent by the Trillian frontends to the Kafka `Leaves` topic.
Since Kafka topics are append-only and immutable, this effectively sequences
the entries in the queue.
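The append-only topic is what turns queueing into sequencing: the offset Kafka
assigns to each appended entry becomes the leaf's sequence number. The sketch
below illustrates that property only; the `commitLog` interface and `queueLeaf`
helper are hypothetical stand-ins for whatever Kafka client is actually used,
not part of the design.

```golang
// commitLog is a hypothetical stand-in for a Kafka producer client; it exists
// only to show the sequencing property described above.
type commitLog interface {
  // Append adds a single entry to a topic and returns the offset the broker
  // assigned to it.
  Append(topic string, value []byte) (offset int64, err error)
}

// queueLeaf appends one leaf to the `Leaves` topic. Because the topic is
// append-only and immutable, the returned offset is, by construction, the
// leaf's sequence number in the log.
func queueLeaf(cl commitLog, leafData []byte) (int64, error) {
  return cl.Append("Leaves", leafData)
}
```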
The signer nodes track the `Leaves` and `STHs` topics to bring their local
database instances up-to-date. The current master signer will additionally
incorporate new entries from the `Leaves` topic into its tree, ensuring that
the Kafka offset of each leaf matches its position in the Merkle tree, and then
generate a new STH which it publishes to the `STHs` topic before updating its
local database.

Since the commit log forms the source of truth for the log entry ordering and
committed STHs, everything else can be derived from that. This means that
updates to the serving HBase DBs can be made idempotent, so the transactional
requirements of Trillian's LogStorage APIs can be relaxed: writes to local
storage can be buffered and flushed at `Commit` time, and the only constraint
on the implementation is that the final new/updated STH must be written to
local storage only if all other buffered writes have been successfully flushed.

The addition of this style of storage implementation requires that Trillian
does not guarantee perfect deduplication of entries, even though it may be
possible to do so with some storage implementations; i.e. personalities MUST
present LeafIdentityHashes, and Trillian MAY deduplicate.

## Detailed Design

#### Enqueuing leaves

RPC calls to the frontend `QueueLeaves` API result in the leaves being
individually added to the Kafka `Leaves` topic. They need to be added
individually so that the Kafka topic's sequencing can be the definitive source
of log sequence information.

Log frontends may attempt to de-duplicate incoming leaves by consulting the
local storage DB using the identity hash (and/or e.g. using a per-instance LRU
cache), but this will always be a "best effort" affair, so the Trillian APIs
must not assume that duplicates are impossible, even though in practice other
storage implementations may currently make them so.

#### Master election

Multiple sequencers may be running to provide resilience; if this is the case
there must be a mechanism for choosing a single master instance among the
running sequencers. The Trillian repo already provides an etcd-backed
implementation of this.

A sequencer must only participate in the election, or remain master, if its
local database state is at least as new as the latest message in the Kafka
`STHs` topic.

The current master sequencer will create new STHs and publish them to the
`STHs` topic, while the remaining sequencers run in a "mirror" mode to keep
their local database state up-to-date with the master.

#### Local DB storage

This does not *need* to be transactional, because writes should be idempotent,
but the implementation of the Trillian storage driver must buffer *all* writes
and only attempt to apply them to the local storage when `Commit` is called.

The write of an updated STH to local storage needs slightly special attention,
in that it must be the last thing written by `Commit`, and must only be written
if all other buffered writes succeeded.

In the case of a partial commit failure, or a crash of the signer, the next
sequencing cycle will simply re-attempt the identical (idempotent) writes, as
described in the signer process outlined below.
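To make the `Commit`-time flush concrete, here is a minimal sketch of the
buffering discipline described above. The `bufferedWrites` type and its `apply`
hook are illustrative assumptions rather than a proposed API; the only property
being demonstrated is that the STH write happens strictly after every other
buffered write has succeeded.

```golang
// bufferedWrites is an illustrative buffer for a single transaction's writes
// against the local (e.g. HBase) database.
type bufferedWrites struct {
  puts   map[string][]byte                    // everything except the STH
  sthKey string                               // where the STH row lives
  newSTH []byte                               // the updated STH, written last
  apply  func(key string, value []byte) error // idempotent write to the local DB
}

// Commit flushes all buffered writes. The STH is only written once every other
// write has succeeded; on any failure the caller simply retries the whole
// cycle, which is safe because the writes are idempotent.
func (b *bufferedWrites) Commit() error {
  for k, v := range b.puts {
    if err := b.apply(k, v); err != nil {
      return err
    }
  }
  return b.apply(b.sthKey, b.newSTH)
}
```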
#### Sequencing

Assigning sequence numbers to queued leaves is implicitly performed by the
addition of entries to the Kafka `Leaves` topic (an entry's position in the
topic is termed its *offset* in the Kafka documentation).

##### Abstract Signer process

```golang
func SignerRun() {
  // If any of the below operations fail, just bail and retry.

  // Read `dbSTH` (containing `treeRevision` and `sthOffset`) from the local DB.
  dbSTH := tx.LatestSTH()

  // Sanity check that the STHs topic has what we already know.
  ourSTH := kafka.Read("STHs/<treeID>", dbSTH.sthOffset)
  if ourSTH == nil {
    glog.Errorf("should not happen - local DB has data ahead of STHs topic")
    return
  }
  if ourSTH.expectedOffset != dbSTH.sthOffset {
    glog.Errorf("should not happen - local DB committed to invalid STH from topic")
    return
  }
  if ourSTH.timestamp != dbSTH.timestamp || ourSTH.tree_size != dbSTH.tree_size {
    glog.Errorf("should not happen - local DB has different data than STHs topic")
    return
  }

  // Look to see if anyone else has already stored data just ahead of our STH.
  nextOffset := dbSTH.sthOffset
  nextSTH := nil
  for {
    nextOffset++
    nextSTH = kafka.Read("STHs/<treeID>", nextOffset)
    if nextSTH == nil {
      break
    }
    if nextSTH.expectedOffset != nextOffset {
      // Someone's been writing STHs when they weren't supposed to be, skip
      // this one until we find another which is in-sync.
      glog.Warning("skipping unexpected STH")
      continue
    }
    if nextSTH.timestamp < ourSTH.timestamp || nextSTH.tree_size < ourSTH.tree_size {
      glog.Fatal("should not happen - earlier STH with later offset")
      return
    }
    // Found a valid STH just ahead of ours; catch up to it below.
    break
  }

  if nextSTH == nil {
    // We're up-to-date with the STHs topic (as of a moment ago) ...
    if !IsMaster() {
      // ... but we're not allowed to create fresh STHs.
      return
    }
    // ... and we're the master. Move the STHs topic along to encompass any
    // unincorporated leaves.
    offset := dbSTH.tree_size
    batch := kafka.Read("Leaves", offset, batchSize)
    for _, b := range batch {
      db.Put("<treeID>/leaves/<b.offset>", b.contents)
    }

    root := UpdateMerkleTreeAndBufferNodes(batch, dbSTH.treeRevision+1)
    newSTH := STH{root, ...}
    newSTH.treeRevision = dbSTH.treeRevision + 1
    newSTH.expectedOffset = nextOffset
    actualOffset := kafka.Append("STHs/<treeID>", newSTH)
    if actualOffset != nextOffset {
      glog.Warning("someone else wrote an STH while we were master")
      tx.Abort()
      return
    }
    newSTH.sthOffset = actualOffset
    tx.BufferNewSTHForDB(newSTH)
    tx.Commit() // flush writes
  } else {
    // There is an STH one ahead of us that we're not caught up with yet.
    // Read the leaves between what we have in our DB, and that STH...
    leafRange := InclusiveExclusive(dbSTH.tree_size, nextSTH.tree_size)
    batch := kafka.Read("Leaves", leafRange)
    // ... and store them in our local DB.
    for _, b := range batch {
      db.Put("<treeID>/leaves/<b.offset>", b.contents)
    }
    newRoot := tx.UpdateMerkleTreeAndBufferNodes(batch, dbSTH.treeRevision+1)
    if newRoot != nextSTH.root {
      glog.Warning("calculated root hash != expected root hash, corrupt DB?")
      tx.Abort()
      return
    }
    tx.BufferNewSTHForDB(nextSTH)
    tx.Commit() // flush writes
    // We may still not be caught up, but that's for the next time around.
  }
}
```

##### Fit with storage interfaces

The LogStorage interfaces will need to be tweaked slightly, in particular:

- `UpdateSequencedLeaves` should be pulled out of `LeafDequeuer` and moved
  into a `LeafSequencer` (or similar) interface.
- It would be nice to introduce a roll-up interface which describes the
  responsibilities of the "local DB" component, so that we can compose
  `commit-queue + local DB` storage implementations using existing DB
  implementations (or at least not tie this tightly to HBase); a sketch of
  what that might look like follows below.
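As a strawman for that roll-up interface, the sketch below simply names the
operations the pseudocode in this document already leans on (`LatestSTH`,
buffered leaf/node writes, an STH-last flush). All names and signatures here
are placeholders for discussion, not an agreed API.

```golang
// localDB is a hypothetical roll-up of what the "local DB" half of a
// commit-queue + local-DB storage implementation needs to provide.
type localDB interface {
  // LatestSTH returns the most recently committed STH together with the tree
  // revision and STHs-topic offset recorded alongside it.
  LatestSTH() (sth *trillian.SignedTreeHead, treeRevision, sthOffset int64, err error)

  // BufferLeaf and BufferNodes stage idempotent writes; nothing is applied to
  // the underlying store until Flush.
  BufferLeaf(offset int64, leaf *trillian.LogLeaf)
  BufferNodes(nodes []storage.Node)

  // BufferNewSTH stages the STH that Flush must write strictly last.
  BufferNewSTH(sth *trillian.SignedTreeHead, treeRevision, sthOffset int64)

  // Flush applies all buffered writes, committing the STH only if every other
  // write succeeded.
  Flush() error
}
```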
###### TX

```golang
type splitTX struct {
  treeID int64
  ...

  dbTX *storage.LogTX // something something handwavy
  cqTX *storage.???   // something something handwavy

  dbSTH   *trillian.SignedTreeHead
  nextSTH *trillian.SignedTreeHead // actually something which contains this plus some metadata

  treeRevision int64
  sthOffset    int64
}
```

###### `Storage.Begin()`

Starts a Trillian transaction. This will do:

1. the read of `currentSTH`, `treeRevision`, and `sthOffset` from the DB,
1. verification of that against its corresponding entry in Kafka,

and return a `LogTX` struct containing these values as unexported fields.
**The HBase LogTX struct will buffer all writes locally until `Commit` is
called**, whereupon it'll attempt to apply the writes as HBase `PUT` requests
(presumably it can be smart about batching where appropriate).

```golang
// Begin starts a Trillian transaction.
// This will get the latest known STH from the "local" DB, and verify
// that the corresponding STH in Kafka matches.
func (ls *CQComboStorage) Begin() (LogTX, error) {
  // Create db and cq "TX" objects.

  tx := &splitTX{...}

  // Read `dbSTH` (containing `treeRevision` and `sthOffset`) from the local DB.
  tx.dbSTH, tx.treeRevision, tx.sthOffset = tx.dbTX.latestSTH()

  // Sanity check that the STHs topic has what we already know.
  ourSTH := tx.cqTX.GetSTHAt(tx.sthOffset)

  if ourSTH == nil {
    return nil, fmt.Errorf("should not happen - local DB has data ahead of STHs topic")
  }
  if ourSTH.expectedOffset != tx.sthOffset {
    return nil, fmt.Errorf("should not happen - local DB committed to invalid STH from topic")
  }
  if ourSTH.timestamp != tx.dbSTH.timestamp || ourSTH.tree_size != tx.dbSTH.tree_size {
    return nil, fmt.Errorf("should not happen - local DB has different data than STHs topic")
  }

  ...

  return tx, nil
}
```

###### `DequeueLeaves()`

Calls to this method ignore `limit` and `cutoff` when there exist newer STHs in
the Kafka queue (because we're following in someone else's footsteps), and
return the `batch` of leaves outlined above.

*TODO(al): should this API be reworked?*

```golang
func (tx *splitTX) DequeueLeaves() (..., error) {
  // Look to see if anyone else has already stored data just ahead of our STH.
  nextOffset := tx.sthOffset
  for {
    nextOffset++
    tx.nextSTH = tx.cqTX.GetSTHAt(nextOffset)
    if tx.nextSTH == nil {
      break
    }
    if tx.nextSTH.expectedOffset != nextOffset {
      // Someone's been writing STHs when they weren't supposed to be, skip
      // this one until we find another which is in-sync.
      glog.Warning("skipping invalid STH")
      continue
    }
    if tx.nextSTH.timestamp < tx.dbSTH.timestamp || tx.nextSTH.tree_size < tx.dbSTH.tree_size {
      return nil, fmt.Errorf("should not happen - earlier STH with later offset")
    }
    // Found a valid STH just ahead of ours; catch up to it below.
    break
  }

  if tx.nextSTH == nil {
    // We're up-to-date with the STHs topic; dequeue up to `limit` new leaves.
    offset := tx.dbSTH.tree_size
    batch := tx.cqTX.ReadLeaves(offset, limit)
    return batch, nil
  }
  // There is an STH one ahead of us that we're not caught up with yet.
  // Read the leaves between what we have in our DB, and that STH...
  leafRange := InclusiveExclusive(tx.dbSTH.tree_size, tx.nextSTH.tree_size)
  batch := tx.cqTX.ReadLeaves(leafRange)
  return batch, nil
}
```
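Tying the pieces together, the sketch below shows how the follower ("mirror")
case of `SignerRun` might look when expressed against the `splitTX` methods
above. It is illustrative only: the field and method names simply reuse those
from the pseudocode in this document, and error handling is abbreviated.

```golang
// sequenceOnce sketches one follower-mode cycle: everything between Begin and
// Commit only buffers writes, and Commit flushes them with the STH strictly last.
func sequenceOnce(ls *CQComboStorage) error {
  tx, err := ls.Begin()
  if err != nil {
    return err
  }
  batch, err := tx.DequeueLeaves() // the leaves between our STH and the next one
  if err != nil {
    tx.Abort()
    return err
  }
  // Buffer the leaf and Merkle-node writes for this batch at the next revision.
  root := tx.UpdateMerkleTreeAndBufferNodes(batch, tx.treeRevision+1)
  if root != tx.nextSTH.root {
    tx.Abort()
    return fmt.Errorf("calculated root hash != expected root hash, corrupt DB?")
  }
  // Buffer the STH we just caught up to; Commit writes it only after every
  // other buffered write has been applied successfully.
  tx.BufferNewSTHForDB(tx.nextSTH)
  return tx.Commit()
}
```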
###### `UpdateSequencedLeaves()`

This method should be moved out of `LeafDequeuer` and into a new interface
`LeafWriter` implemented by the `dbTX`.

**TODO(al): keep writing!**
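As a placeholder for that refactor, one possible shape for the `LeafWriter`
interface is sketched below; the leaf type and signature are assumptions and
would follow whatever `UpdateSequencedLeaves` already uses.

```golang
// LeafWriter is a sketch of the interface UpdateSequencedLeaves could move
// into; the leaves it receives already carry their sequence numbers (their
// Kafka offsets), so implementations only need to buffer idempotent writes
// for Commit.
type LeafWriter interface {
  UpdateSequencedLeaves(leaves []*trillian.LogLeaf) error
}
```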