- Feature Name: Dedicated storage engine for Raft
- Status: postponed
- Start Date: 2017-05-25
- Authors: Irfan Sharif
- RFC PR: [#16361](https://github.com/cockroachdb/cockroach/pull/16361)
- Cockroach Issue(s):
  [#7807](https://github.com/cockroachdb/cockroach/issues/7807),
  [#15245](https://github.com/cockroachdb/cockroach/issues/15245)

# Summary

At the time of writing, each
[`Replica`](https://github.com/cockroachdb/cockroach/blob/ea3b2c499/pkg/storage/replica.go#L214)
is backed by a single instance of RocksDB
([`Store.engine`](https://github.com/cockroachdb/cockroach/blob/ea3b2c499/pkg/storage/store.go#L391))
which is used to store all modifications to the underlying state machine in
_addition_ to all consensus state. This RFC proposes separating the two,
outlines the motivations for doing so, and discusses the alternatives
considered.

# Motivation

Raft's RPCs typically require the recipient to persist information to stable
storage before responding. This 'persistent state' comprises the latest term
the server has seen, the candidate voted for in the current term (if any),
and the raft log entries themselves<sup>[1]</sup>. Modifications to any of the
above are [synchronously
updated](https://github.com/cockroachdb/cockroach/pull/15366) on stable storage
before responding to RPCs.

In our usage of RocksDB, data is only persisted when explicitly issuing a write
with [`sync =
true`](https://github.com/cockroachdb/cockroach/blob/ea3b2c499/pkg/storage/engine/db.cc#L1828).
Internally this also persists previously unsynchronized writes<sup>[2]</sup>.

Let's consider a sequential write-only workload on a single node cluster. The
internals of the Raft/RocksDB+Storage interface can be simplified to the
following:
- Convert the write command into a Raft proposal and
  [submit](https://github.com/cockroachdb/cockroach/blob/ea3b2c499/pkg/storage/replica.go#L2811)
  the proposal to the underlying raft group
- 'Downstream' of raft we
  [persist](https://github.com/cockroachdb/cockroach/blob/ea3b2c499/pkg/storage/replica.go#L3120)
  the newly generated log entry corresponding to the command
- We record the modifications to the underlying state machine but [_do
  not_](https://github.com/cockroachdb/cockroach/blob/ea3b2c499/pkg/storage/replica.go#L4208)
  persist this synchronously

One can see that for the `n+1-th` write, upon persisting the corresponding raft
log entry, we also end up persisting the state machine modifications from the
`n-th` write. It is worth mentioning here that asynchronous writes are often
more than a thousand times as fast as synchronous writes<sup>[3]</sup>. Given
our current usage of the same RocksDB instance for both the underlying state
machine _and_ the consensus state, we effectively forego (for this particular
workload at least) the performance gain to be had in not persisting state
machine modifications. For `n` writes we have `n` unsynchronized and `n`
synchronized writes where for `n-1` of them, we also flush `n-1` earlier
unsynchronized writes to disk.

By having a dedicated storage engine for Raft's persistent state we can address
this specific sub-optimality. By isolating the two workloads (synchronized and
unsynchronized writes) into separately running storage engines, such that
synchronized writes no longer flush previously unsynchronized ones to disk, for
`n` writes we can have `n` unsynchronized and `n` synchronized writes (each
with a smaller payload than in the alternative above).
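To make the write pattern above concrete, here is a minimal sketch of one
command's journey downstream of raft against the shared engine. The `Engine`
and `Batch` interfaces are simplified stand-ins mirroring the redacted
benchmarks below, not the actual `pkg/storage/engine` API:

```go
// Simplified stand-ins for the engine types used in the benchmarks below;
// hypothetical, not the actual pkg/storage/engine API.
type Batch interface {
  Put(key, value []byte) error
  // Commit(true) syncs the batch; on RocksDB this also flushes previously
  // unsynchronized writes to the same engine (see footnote [2]).
  Commit(sync bool) error
}

type Engine interface {
  NewWriteOnlyBatch() Batch
}

// applyWrite mirrors what happens downstream of raft for one command when a
// single engine holds both the raft log and the state machine.
func applyWrite(eng Engine, logKey, logEntry, key, value []byte) error {
  // 1. Persist the newly generated raft log entry (synchronized).
  logBatch := eng.NewWriteOnlyBatch()
  if err := logBatch.Put(logKey, logEntry); err != nil {
    return err
  }
  if err := logBatch.Commit(true); err != nil {
    return err
  }

  // 2. Record the state machine modification (unsynchronized). On the
  // shared engine this write is flushed by the *next* command's
  // synchronized log write, which is the sub-optimality described above.
  applyBatch := eng.NewWriteOnlyBatch()
  if err := applyBatch.Put(key, value); err != nil {
    return err
  }
  return applyBatch.Commit(false)
}
```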
# Benchmarks

As a sanity check we ran some initial benchmarks that give us a rough idea of
the performance gain to expect from this change. We benchmarked the sequential
write-only workload described in the section above and did so at the
`pkg/storage{,/engine}` layers. What follows is redacted, simplified code of
the original benchmarks and the results demonstrating the speedups.

To simulate our current implementation (the `n+1-th` synchronized write
persists the `n-th` unsynchronized write) we alternate between synchronized and
unsynchronized writes `b.N` times in `BenchmarkBatchCommitSharedEngine`.

```go
// pkg/storage/engine/bench_rocksdb_test.go

func BenchmarkBatchCommitSharedEngine(b *testing.B) {
  for _, valueSize := range []int{1 << 10, 1 << 12, ..., 1 << 20} {
    b.Run(fmt.Sprintf("vs=%d", valueSize), func(b *testing.B) {
      // ...
      for i := 0; i < b.N; i++ {
        {
          // ...
          batch := eng.NewWriteOnlyBatch()
          MVCCBlindPut(batch, key, value)

          // Representative of persisting a raft log entry.
          batch.Commit(true)
        }
        {
          // ...
          batch := eng.NewWriteOnlyBatch()
          MVCCBlindPut(batch, key, value)

          // Representative of an unsynchronized state machine write.
          batch.Commit(false)
        }
      }
    })
  }
}
```

To simulate the proposed workload (`n` synchronized and `n` unsynchronized
writes, independent of one another) we simply issue `b.N` synchronized and
unsynchronized writes against two separate RocksDB instances in
`BenchmarkBatchCommitDedicatedEngines`.

```go
func BenchmarkBatchCommitDedicatedEngines(b *testing.B) {
  for _, valueSize := range []int{1 << 10, 1 << 12, ..., 1 << 20} {
    b.Run(fmt.Sprintf("vs=%d", valueSize), func(b *testing.B) {
      // ...
      for i := 0; i < b.N; i++ {
        {
          // ...
          batch := engA.NewWriteOnlyBatch()
          MVCCBlindPut(batch, key, value)

          // Representative of persisting a raft log entry.
          batch.Commit(true)
        }
        {
          // ...
          batch := engB.NewWriteOnlyBatch()
          MVCCBlindPut(batch, key, value)

          // Representative of an unsynchronized state machine write.
          batch.Commit(false)
        }
      }
    })
  }
}
```

```sh
~ benchstat perf-shared-engine.txt perf-dedicated-engine.txt
name                      old time/op    new time/op    delta
BatchCommit/vs=1024-4       75.4µs ± 4%    70.2µs ± 2%   -6.87%  (p=0.000 n=19+17)
BatchCommit/vs=4096-4        117µs ± 5%     106µs ± 7%   -9.76%  (p=0.000 n=20+20)
BatchCommit/vs=16384-4       325µs ± 7%     209µs ± 5%  -35.55%  (p=0.000 n=20+18)
BatchCommit/vs=65536-4      1.05ms ±10%    1.08ms ±20%     ~     (p=0.718 n=20+20)
BatchCommit/vs=262144-4     3.52ms ± 6%    2.81ms ± 7%  -20.30%  (p=0.000 n=17+18)
BatchCommit/vs=1048576-4    11.2ms ±18%     7.8ms ± 5%  -30.56%  (p=0.000 n=20+20)

name                      old speed      new speed      delta
BatchCommit/vs=1024-4     13.6MB/s ± 4%   14.6MB/s ± 2%   +7.34%  (p=0.000 n=19+17)
BatchCommit/vs=4096-4     34.9MB/s ± 5%   38.7MB/s ± 7%  +10.88%  (p=0.000 n=20+20)
BatchCommit/vs=16384-4    50.5MB/s ± 8%   78.4MB/s ± 5%  +55.04%  (p=0.000 n=20+18)
BatchCommit/vs=65536-4    62.6MB/s ± 9%   61.1MB/s ±17%     ~     (p=0.718 n=20+20)
BatchCommit/vs=262144-4   74.5MB/s ± 5%   93.5MB/s ± 7%  +25.43%  (p=0.000 n=17+18)
BatchCommit/vs=1048576-4  94.8MB/s ±16%  135.2MB/s ± 5%  +42.57%  (p=0.000 n=20+20)
```

NOTE: 64 KiB workloads don't exhibit the same performance increase as the other
workloads; this is unexpected and needs to be investigated further. See
[drawbacks](#drawbacks) for more discussion.

Similarly, we ran the equivalent benchmarks at the `pkg/storage` layer:

```go
// pkg/storage/replica_raftstorage_test.go

func BenchmarkReplicaRaftStorageSameEngine(b *testing.B) {
  for _, valueSize := range []int{1 << 10, 1 << 12, ..., 1 << 20} {
    b.Run(fmt.Sprintf("vs=%d", valueSize), func(b *testing.B) {
      // ...
      rep := tc.store.GetReplica(rangeID)
      rep.redirectOnOrAcquireLease()

      defer settings.TestingSetBool(&syncRaftLog, true)()

      for i := 0; i < b.N; i++ {
        // ...
        client.SendWrappedWith(rep, putArgs(key, value))
      }
    })
  }
}
```

To simulate the proposed workload (`n` synchronized and `n` unsynchronized
writes, independent of one another) we simply issue `b.N/2` unsynchronized
writes followed by `b.N/2` synchronized ones in
`BenchmarkReplicaRaftStorageDedicatedEngine`. To see why this is equivalent,
consider a sequence of alternating/interleaved synchronized and unsynchronized
writes where synchronized writes do not persist the previous unsynchronized
writes. If `S` is the time taken for a synchronized write and `U` is the time
taken for an unsynchronized one,
`S + U + S + U + ... + S + U == S + S + ... + S + U + U + ... + U`.

```go
// NOTE: syncApplyCmd is set to true to synchronize command applications (state
// machine changes) to persistent storage. Changes to pkg/storage/replica.go
// not shown here.
func BenchmarkReplicaRaftStorageDedicatedEngine(b *testing.B) {
  for _, valueSize := range []int{1 << 10, 1 << 12, ..., 1 << 20} {
    b.Run(fmt.Sprintf("vs=%d", valueSize), func(b *testing.B) {
      // ...
      rep := tc.store.GetReplica(rangeID)
      rep.redirectOnOrAcquireLease()

      defer settings.TestingSetBool(&syncRaftLog, false)()
      defer settings.TestingSetBool(&syncApplyCmd, false)()

      for i := 0; i < b.N/2; i++ {
        // ...
        client.SendWrappedWith(rep, putArgs(key, value))
      }

      defer settings.TestingSetBool(&syncRaftLog, true)()
      defer settings.TestingSetBool(&syncApplyCmd, true)()

      for i := b.N/2; i < b.N; i++ {
        // ...
        client.SendWrappedWith(rep, putArgs(key, value))
      }
    })
  }
}
```

```sh
~ benchstat perf-storage-alternating.txt perf-storage-sequential.txt
name                             old time/op  new time/op  delta
ReplicaRaftStorage/vs=1024-4      297µs ± 9%   268µs ± 3%   -9.73%  (p=0.000 n=10+10)
ReplicaRaftStorage/vs=4096-4      511µs ±10%   402µs ± 1%  -21.29%  (p=0.000 n=9+10)
ReplicaRaftStorage/vs=16384-4    2.16ms ± 2%  1.39ms ± 4%  -35.70%  (p=0.000 n=10+10)
ReplicaRaftStorage/vs=65536-4    3.60ms ± 3%  3.49ms ± 4%   -3.17%  (p=0.003 n=10+9)
ReplicaRaftStorage/vs=262144-4   10.3ms ± 7%  10.2ms ± 3%     ~     (p=0.393 n=10+10)
ReplicaRaftStorage/vs=1048576-4  40.3ms ± 7%  40.8ms ± 3%     ~     (p=0.481 n=10+10)
```

# Detailed design

We propose introducing a second RocksDB instance to store all raft consensus
data. This RocksDB instance will be specific to a given store (similar to our
existing setup) and will be addressable via a new member variable on `type
Store`, namely `Store.raftEngine` (the existing `Store.engine` will stay as
is). This instance will consequently manage the raft log entries for all the
replicas that belong to that store, and its data will be stored in a
subdirectory `raft` under our existing RocksDB storage directory.
At the time of writing the keys that would need to be written to the new engine
are the log keys and the `HardState`<sup>[4](#column-families)</sup>.
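For illustration, here is a minimal sketch of how the store could hold the two
engines and route raft writes to the dedicated instance, reusing the simplified
`Engine` interface from the earlier sketch; the field and method names are
hypothetical, not the final API:

```go
// Hypothetical sketch only: simplified types, not the actual pkg/storage API.
type Store struct {
  engine     Engine // state machine data (the existing Store.engine)
  raftEngine Engine // raft log entries and HardState (proposed)
}

// appendRaftEntry persists a raft log entry to the dedicated raft engine.
// Synchronized commits here no longer flush unsynchronized state machine
// writes, since those now live in s.engine.
func (s *Store) appendRaftEntry(logKey, entry []byte) error {
  b := s.raftEngine.NewWriteOnlyBatch()
  if err := b.Put(logKey, entry); err != nil {
    return err
  }
  return b.Commit(true) // raft requires stable storage before responding
}

// applyCommand records a state machine modification on the primary engine
// without forcing a sync.
func (s *Store) applyCommand(key, value []byte) error {
  b := s.engine.NewWriteOnlyBatch()
  if err := b.Put(key, value); err != nil {
    return err
  }
  return b.Commit(false)
}
```

The actual change would thread the second engine through `Store` construction
rather than introduce new helpers; the sketch only shows which writes move to
which engine.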
## Implementation strategy

We will phase this in bottom-up by first instantiating the new RocksDB instance
with reasonable default configurations (see [unresolved
questions](#unresolved-questions) below) at the `pkg/storage` layer (as opposed
to using the user-level store specifications provided via the `--stores` flag
of the `cockroach start` command). At the points where raft data is written out
to our existing RocksDB instance, we will additionally write it out to our new
instance. Following this, at any point where raft data is read, we will _also_
read from the new instance and compare the two. At that point we can wean all
Raft-specific reads/writes and log truncations off the old instance and have
them serviced by the new one.

Until the [migration story](#migration-strategy) is hashed out it's worthwhile
structuring this transition behind an environment variable that determines
_which_ instance all Raft-specific reads, writes and log truncations are
serviced from. This same mechanism could also be used to collect 'before and
after' performance numbers (at the very least as a sanity check).
We expect individual writes going through the system to speed up (on average we
can assume that for every synchronized raft log entry write we are currently
flushing out a previously unsynchronized state machine transition). We expect
read performance to stay relatively unchanged.

**NB**: There's a subtle edge case to be wary of with respect to raft log
truncations: before truncating the raft log we need to ensure that the
application of the truncated entries has actually been persisted, i.e. for a
`Put(k,v)` the primary RocksDB engine must have synced `(k,v)` before
truncating that `Put` operation from the Raft log.
Given we expect the `ReplicaState` to be stored in the first engine, consider
the case where we've truncated a set of log entries and the corresponding
`TruncatedState`, stored on the first engine, is _not_ synchronized to disk.
If the node crashes at this point it will fail to load the `TruncatedState` and
has no way to bridge the gap between the last persisted `ReplicaState` and the
oldest entry in the truncated Raft log.<br>
To this end, whenever we truncate we need to _first_ sync the primary RocksDB
instance. Given that RocksDB periodically flushes in-memory writes to disk, if
we can detect that the application of the entries to be truncated has _already_
been persisted, we can avoid this step. See [future work](#future-work) for a
possible extension to this.
Note that this will be the only time we _explicitly_ sync the primary instance
to disk; the performance blips that arise from this will be interesting to
study.
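A minimal sketch of this truncation guard, assuming a hypothetical `Sync`
method on the primary engine and a hypothetical durability check:

```go
// Hypothetical sketch of the truncation guard; the helper names and the Sync
// method are illustrative only, not the actual pkg/storage API.
func (s *Store) truncateRaftLog(rangeID, truncatedIndex uint64) error {
  // appliedIsDurable is a hypothetical check: has the application of every
  // entry up to truncatedIndex already been flushed by the primary engine
  // (e.g. by one of RocksDB's periodic memtable flushes)?
  if !s.appliedIsDurable(rangeID, truncatedIndex) {
    // Force the primary (state machine) engine to disk first. Per the text
    // above, this is the only place we explicitly sync it.
    if err := s.engine.Sync(); err != nil {
      return err
    }
  }
  // Only now is it safe to drop the truncated entries and persist the new
  // TruncatedState without risking an unbridgeable gap after a crash.
  b := s.raftEngine.NewWriteOnlyBatch()
  if err := s.clearRaftLogEntries(b, rangeID, truncatedIndex); err != nil {
    return err
  }
  return b.Commit(true)
}
```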
Following this, the `storage.NewStore` API will be amended to take in two
storage engines (the new engine to be used as dedicated raft storage). This API
change propagates through any testing code that bootstraps/initializes a node
in some way, shape, or form. At the time of writing the tests affected are
spread across `pkg/{kv,storage,server,ts}`.

## Migration strategy

A general migration process, in addition to solving the problem for existing
clusters, could be used to move the raft RocksDB instance from one location to
another (such as to the non-volatile memory discussed below).
How do we actually start using this RocksDB instance? One thought is that only
new nodes would use a separate RocksDB instance for the Raft log.
An offline migration alternative that would work for existing clusters and
rolling restarts could be the following:
- We detect that we're at the new version with the changes that move the
  consensus state to a separate RocksDB instance, and that we have existing
  consensus data stored in the old one
- We copy over _all_ consensus data (for all replicas on that given store) from
  the old instance to the new one and delete it from the old (see the sketch at
  the end of this section)
- Once the node is up and running, all Raft-specific reads/writes are directed
  to the new instance

We should note that we don't have precedent for an offline, store-level
migration at this time.

An online approach that could enable live migrations, moving consensus state
from one RocksDB instance to another, would be the following:
- for a given replica we begin writing out all consensus state to _both_
  RocksDB instances, the instance the consensus state is being migrated to and
  the instance it's being migrated from (at this point we still exclusively
  read from the old instance)
- at the next log truncation point we set a flag such that
  - all subsequent Raft-specific reads are directed to the new instance
  - all subsequent Raft-specific writes are _only_ directed to the new instance
- we delete the existing consensus state (pertaining to the given replica) from
  the old instance. This already happens (to some degree) in normal operation
  given that we're truncating the log

The implementation of the latter strategy is out of scope for this RFC; the
offline store-level migration alternative should suffice.
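A rough sketch of the offline migration step under the assumptions above (the
span and copy/clear helpers are hypothetical; the real key layout lives in
`pkg/keys`):

```go
// Hypothetical sketch of the offline, store-level migration: copy all
// consensus state from the old (shared) engine into the new raft engine and
// then delete it from the old one, before the node starts serving traffic.
func migrateRaftState(old, raft Engine, rangeIDs []uint64) error {
  for _, rangeID := range rangeIDs {
    // raftStateSpan is a hypothetical helper returning the key span that
    // holds this range's raft log entries and HardState.
    start, end := raftStateSpan(rangeID)

    dst := raft.NewWriteOnlyBatch()
    if err := copyRange(old, dst, start, end); err != nil { // hypothetical
      return err
    }
    if err := dst.Commit(true); err != nil {
      return err
    }

    // Delete from the old engine only once the copy is durable in the new
    // one, so an interrupted migration can simply be rerun.
    src := old.NewWriteOnlyBatch()
    if err := clearRange(src, start, end); err != nil { // hypothetical
      return err
    }
    if err := src.Commit(true); err != nil {
      return err
    }
  }
  return nil
}
```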
## TODO
- Investigate 'Support for Multiple Embedded Databases in the same
  process'<sup>[5]</sup>

# Drawbacks

None that are immediately obvious. We'll have to pay close attention to how the
separate RocksDB instances interact with one another; the performance
implications can be non-obvious and subtle given the sharing of hardware
resources (disk and/or OS buffers).
To demonstrate this, consider the following two versions of a benchmark, the
only difference being that one has a single loop body issuing `b.N` synced and
unsynced writes (interleaved) while the other has two loop bodies issuing `b.N`
synced and `b.N` unsynced writes respectively:

```go
func BenchmarkBatchCommitInterleaved(b *testing.B) {
  for _, valueSize := range []int{1 << 10, 1 << 12, ..., 1 << 20} {
    b.Run(fmt.Sprintf("vs=%d", valueSize), func(b *testing.B) {
      // ...
      for i := 0; i < b.N; i++ {
        // ...
        batchA := engA.NewWriteOnlyBatch()
        MVCCBlindPut(batchA, key, value)
        batchA.Commit(true)

        // ...
        batchB := engB.NewWriteOnlyBatch()
        MVCCBlindPut(batchB, key, value)
        batchB.Commit(false)
      }
    })
  }
}

func BenchmarkBatchCommitSequential(b *testing.B) {
  for _, valueSize := range []int{1 << 10, 1 << 12, ..., 1 << 20} {
    b.Run(fmt.Sprintf("vs=%d", valueSize), func(b *testing.B) {
      // ...
      for i := 0; i < b.N; i++ {
        // ...
        batchA := engA.NewWriteOnlyBatch()
        MVCCBlindPut(batchA, key, value)
        batchA.Commit(true)
      }

      for i := 0; i < b.N; i++ {
        // ...
        batchB := engB.NewWriteOnlyBatch()
        MVCCBlindPut(batchB, key, value)
        batchB.Commit(false)
      }
    })
  }
}
```

Here are the performance differences, especially stark for the 64 KiB
workloads:
```sh
~ benchstat perf-interleaved.txt perf-sequential.txt
name                      old time/op    new time/op    delta
BatchCommit/vs=1024-4       70.1µs ± 2%    68.6µs ± 5%   -2.15%  (p=0.021 n=8+10)
BatchCommit/vs=4096-4        102µs ± 1%      97µs ± 7%   -4.10%  (p=0.013 n=9+10)
BatchCommit/vs=16384-4       207µs ± 5%     188µs ± 4%   -9.46%  (p=0.000 n=9+9)
BatchCommit/vs=65536-4      1.07ms ±12%    0.62ms ± 9%  -41.90%  (p=0.000 n=8+9)
BatchCommit/vs=262144-4     2.90ms ± 8%    2.70ms ± 4%   -6.68%  (p=0.000 n=9+10)
BatchCommit/vs=1048576-4    8.06ms ± 9%    7.90ms ± 5%     ~     (p=0.631 n=10+10)

name                      old speed      new speed      delta
BatchCommit/vs=1024-4     14.6MB/s ± 2%   14.9MB/s ± 4%   +2.22%  (p=0.021 n=8+10)
BatchCommit/vs=4096-4     40.3MB/s ± 1%   42.1MB/s ± 7%   +4.37%  (p=0.013 n=9+10)
BatchCommit/vs=16384-4    78.6MB/s ± 5%   87.4MB/s ± 3%  +11.09%  (p=0.000 n=10+9)
BatchCommit/vs=65536-4    61.6MB/s ±13%  105.5MB/s ± 8%  +71.32%  (p=0.000 n=8+9)
BatchCommit/vs=262144-4   90.6MB/s ± 7%   97.1MB/s ± 4%   +7.13%  (p=0.000 n=9+10)
BatchCommit/vs=1048576-4   130MB/s ± 8%    133MB/s ± 5%     ~     (p=0.631 n=10+10)
```

Clearly the separately running instances are not as isolated from one another
as expected.

# Alternatives

An alternative considered was rolling our own WAL implementation optimized for
the Raft log usage patterns. Possible reasons for doing so:
- A native implementation in Go would avoid the CGo overhead we incur crossing
  the Go/C++ boundary
- SanDisk published a paper<sup>[6]</sup> (a shorter slideshow can be found
  [here](https://www.usenix.org/sites/default/files/conference/protected-files/inflow14_slides_yang.pdf))
  discussing the downsides of layering log systems on one another.
  Summary:
  - Increased write pressure - each layer/log has its own metadata
  - Fragmented logs - the 'upper level' log writes sequentially but the 'lower
    level' log gets mixed workloads, most likely to be random, destroying
    sequentiality
  - Unaligned segment sizes - garbage collection in 'upper level' log segments
    can result in data invalidation across multiple 'lower level' log segments

Considering how any given store could have thousands of replicas, an approach
with each replica maintaining its own separate file for its WAL was a
non-starter. What we would really need is something resembling a multi-access,
shared WAL (by multi-access here we mean there are multiple logical append
points in the log and each accessor is able to operate on only its own logical
section).

Consider what would be the most common operations:
- Accessing a given replica's raft log sequentially
- Prefix truncation of a given replica's raft log

A good first approximation would be allocating contiguous chunks of disk space
in sequence, each chunk assigned to a given accessor. Should an accessor run
out of allocated space, it seeks the next available chunk and adds metadata
linking the two (think linked lists). Though this would enable fast sequential
log access, log prefix truncations are slightly trickier: do we truncate at
chunk-sized boundaries, or at user-specified points and thus cause
fragmentation?
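For concreteness, a minimal sketch of what such a chunk-based, multi-access
layout might look like (purely illustrative; this design was not pursued and
none of these types exist in the codebase):

```go
// chunk is a contiguous region of disk assigned to a single accessor.
type chunk struct {
  offset int64  // position of this contiguous region on disk
  length int64  // bytes allocated to the region
  used   int64  // bytes written so far
  next   *chunk // the accessor's next chunk, if it ran out of space here
}

// accessorLog is one replica's logical append point within the shared WAL.
type accessorLog struct {
  head *chunk // oldest chunk; prefix truncation advances this pointer
  tail *chunk // chunk currently being appended to
}

// sharedWAL hands out contiguous chunks to accessors in allocation order.
type sharedWAL struct {
  nextFree  int64 // next unallocated disk offset
  chunkSize int64
  logs      map[uint64]*accessorLog // keyed by range ID
}

// allocChunk reserves the next contiguous region for an accessor that has run
// out of space and links it to the accessor's previous chunk.
func (w *sharedWAL) allocChunk(prev *chunk) *chunk {
  c := &chunk{offset: w.nextFree, length: w.chunkSize}
  w.nextFree += w.chunkSize
  if prev != nil {
    prev.next = c
  }
  return c
}
```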
Perusing open-source implementations of WALs and some literature on the
subject, multi-access WALs tend to support _at most_ 10 accessors, let alone
thousands. Retrofitting this for our use case (a single store can have 1000s of
replicas), we'd have to opt for a 'sharded store' approach where appropriately
sized sets of replicas share an instance of a multi-access WAL.

Taking into account all of the above, it was deemed that the implementation
overhead and the additional complexity introduced (higher-level organization
with sharded stores) were not worth what could _possibly_ be a tiny performance
increase. We suspect a tuned RocksDB instance would be hard to beat unless we
GC aggressively, not to mention it's battle-tested. The internal knowledge base
for tuning and working with RocksDB is available at CockroachDB, so this
reduces future implementation risk as well.<br>
NOTE: At this time we have not explored potentially using
[dgraph-io/badger](https://github.com/dgraph-io/badger) instead.

Some Raft WAL implementations explored were the
[etcd/wal](https://github.com/coreos/etcd/tree/master/wal) implementation and
[hashicorp/raft](https://github.com/hashicorp/raft)'s LMDB
[implementation](https://github.com/hashicorp/raft-mdb). As stated above, the
complexity comes about in managing logs for 1000s of replicas on the same
store.

# Unresolved questions

- RocksDB parameters/tuning for the Raft-specific instance.
- We currently share a block cache across the multiple RocksDB instances
  running across stores in a node. Would a similar structure be beneficial
  here? Do we use the same block cache or have another dedicated one as well?

# Future work

Intel has demonstrated impressive performance increases by putting the raft log
in non-volatile memory instead of disk (for etcd/raft)<sup>[7]</sup>. Given
we're proposing a separate storage engine for the Raft log, in the presence of
a more suitable hardware medium it should be easy enough to configure the
Raft-specific RocksDB instance/multi-access WAL implementation to run on it.
Even without specialized hardware it might be desirable to configure the Raft
and regular RocksDB instances to use different disks.

RocksDB periodically flushes in-memory writes to disk; if we can detect which
writes have been persisted and use _that_ information to truncate the
corresponding raft log entries, we can avoid the (costly) explicit syncing of
the primary RocksDB instance. This is out of scope for this RFC.

As an aside, [@tschottdorf](https://github.com/tschottdorf):
> we should refactor the way `evalTruncateLog` works. It currently
> takes writes all the way through the proposer-evaluated KV machinery, and at
> least from the graphs it looks that that's enough traffic to impair Raft
> throughput alone. We could lower the actual ranged clear below Raft (after all,
> no migration concerns there). We would be relaxing, somewhat, the stats which
> are now authoritative and would then only become "real" once the Raft log had
> actually been purged all the way up to the TruncatedState. I think there's no
> problem with that.

# Footnotes

\[1\]: https://raft.github.io/raft.pdf <br>
\[2\]: https://github.com/facebook/rocksdb/wiki/RocksDB-FAQ <br>
\[3\]: https://github.com/facebook/rocksdb/wiki/Basic-Operations#asynchronous-writes <br>
<a name="column-families">\[4\]</a>: via
[@bdarnell](https://github.com/bdarnell) &
[@tschottdorf](https://github.com/tschottdorf):
> We may want to consider using two column families for this, since the log
> keys are _usually_ (log tail can be replaced after leadership change)
> write-once and short-lived, while the hard state is overwritten frequently
> but never goes away completely.

\[5\]: https://github.com/facebook/rocksdb/wiki/RocksDB-Basics#support-for-multiple-embedded-databases-in-the-same-process <br>
\[6\]: https://www.usenix.org/system/files/conference/inflow14/inflow14-yang.pdf <br>
\[7\]: http://thenewstack.io/intel-gives-the-etcd-key-value-store-a-needed-boost/ <br>

[1]: https://raft.github.io/raft.pdf
[2]: https://github.com/facebook/rocksdb/wiki/RocksDB-FAQ
[3]: https://github.com/facebook/rocksdb/wiki/Basic-Operations#asynchronous-writes
[5]: https://github.com/facebook/rocksdb/wiki/RocksDB-Basics#support-for-multiple-embedded-databases-in-the-same-process
[6]: https://www.usenix.org/system/files/conference/inflow14/inflow14-yang.pdf
[7]: http://thenewstack.io/intel-gives-the-etcd-key-value-store-a-needed-boost/