# Performance

After some discussion with Jae on the usability, it seems performance is a big concern. If every write takes around 1ms, that puts a serious upper limit on the speed of the consensus engine (especially since with the check/tx dichotomy, we need at least two writes (to cache, only one to disk) and likely two or more queries to handle any transaction).

As Jae notes: for CheckTx, a copy of IAVLTree doesn't need to be saved. During CheckTx it'll load inner nodes into the cache. The cache is shared w/ the AppendTx state IAVLTree, so during AppendTx we should save some time. There would only be 1 set of writes. Also, there's quite a bit of free time in between blocks as provided by Tendermint, during which CheckTx can run priming the cache, so hopefully this helps as well.

Jae: That said, I'm not sure exactly what the tx throughput would be during normal running times. I'm hoping that we can have e.g. 3 second blocks w/ say over a hundred txs per sec per block w/ 1 million items. That will get us through for some time, but that time is limited.

Ethan: I agree, and think this works now with goleveldb backing on most host machines. For public chains, maybe it is desired to push 1000 tx every 3 sec to a block, with a db size of 1 billion items. 10x the throughput with 1000x the data. That could be a long-term goal, and would scale to the cosmos and beyond.

## Plan

For any goal, we need some clear steps.

1) Clean up code, and write some more benchmark cases to capture "realistic" usage
2) Run tests on various hardware to see the best performing backing stores
3) Do profiling on the best performance to see if there are any easy performance gains
4) (Possibly) Write another implementation of merkle.Tree to improve all the memory overhead, consider CPU cache, etc....
5) (Possibly) Write another backend datastore to persist the tree in a more efficient way

The rest of this document is the planned or completed actions for the above-listed steps.

## Cleanup

Done in branch `cleanup_deps`:
* Fixed up dependency management (tmlibs/db etc in glide/vendor)
* Updated Makefile (test, bench, get_deps)
* Fixed broken code - `looper.go` and one benchmark didn't run

Benchmarks should be parameterized on:
1) storage implementation
2) initial data size
3) length of keys
4) length of data
5) block size (frequency of copy/hash...)

Thus, we would see the same benchmark run against memdb with 100K items, goleveldb with 100K, leveldb with 100K, memdb with 10K, goleveldb with 10K...

Scenarios to run after db is set up.
* Pure query time (known/hits, vs. random/misses)
* Write timing (known/updates, vs. random/inserts)
* Delete timing (existing keys only)
* TMSP Usage:
  * For each block size:
    * 2x copy "last commit" -> check and real
    * repeat for each tx:
      * (50% update + 50% insert?)
      * query + insert/update in check
      * query + insert/update in real
    * get hash
    * save real
    * real -> "last commit"


## Benchmarks

After writing the benchmarks, we can run them under various environments and store the results under the benchmarks directory. Some useful environments to test:

* Dev machines
* Digital ocean small/large machine
* Various AWS setups

Please run the benchmark on more machines and add the result. Just type: `make record` in the directory and wait a (long) while (with little other load on the machine).

This will also require a quick setup script to install go and run tests in these environments. Maybe even several scripts. Also, this will produce a lot of files and we may have to graph them to see something useful...

But for starting, my laptop, and one digital ocean and one aws server should be sufficient.
At least to find the winner, before profiling.


## Profiling

Once we figure out which current implementation looks fastest, let's profile it to make it even faster. It would be great to optimize the memdb code first, to really speed up the hashing and tree-building logic, and then focus on the backend implementation to optimize the disk storage, which will be the next major pain point.

Some guides:

* [Profiling benchmarks locally](https://medium.com/@hackintoshrao/daily-code-optimization-using-benchmarks-and-profiling-in-golang-gophercon-india-2016-talk-874c8b4dc3c5#.jmnd8w2qr)
* [On optimizing memory](https://signalfx.com/blog/a-pattern-for-optimizing-go-2/)
* [Profiling running programs](http://blog.ralch.com/tutorial/golang-performance-and-memory-analysis/)
* [Dave Cheney's profiler pkg](https://github.com/pkg/profile)

Some ideas for speedups:

* [Speedup SHA256 100x on ARM](https://blog.minio.io/accelerating-sha256-by-100x-in-golang-on-arm-1517225f5ff4#.pybt7bb3w)
* [Faster SHA256 golang implementation](https://github.com/minio/sha256-simd)
* [Data structure alignment](http://stackoverflow.com/questions/39063530/optimising-datastructure-word-alignment-padding-in-golang)
* [Slice alignment](http://blog.chewxy.com/2016/07/25/on-the-memory-alignment-of-go-slice-values/)
* [Tool to analyze your structs](https://github.com/dominikh/go-structlayout)

## Tree Re-implementation

If we want to copy lots of objects, it becomes better to think of using memcpy on large (e.g. 4-16KB) buffers than copying individual structs. We could also allocate arrays of structs and align them to remove a lot of memory management and gc overhead. That means going down to some C-level coding...
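
To make the array-of-structs idea concrete, here is a minimal sketch in plain Go, not the IAVL implementation. The names `flatNode` and `flatTree` and the node fields are hypothetical, chosen only for illustration. Children are stored as slice indices rather than pointers, so the whole tree lives in one contiguous buffer that can be duplicated with a single bulk `copy` instead of node-by-node allocation.

```go
package main

import "fmt"

// flatNode is a hypothetical fixed-size node. Children are indices into the
// backing slice instead of pointers, so the struct contains no pointers.
type flatNode struct {
	hash        [32]byte
	left, right int32 // -1 means "no child"
	height      int8
}

// flatTree keeps every node in one contiguous slice (an assumption of this
// sketch, not how the current tree is stored).
type flatTree struct {
	nodes []flatNode
	root  int32
}

// snapshot duplicates the whole tree with one bulk copy of the backing
// slice, rather than allocating and copying each node separately.
func (t *flatTree) snapshot() *flatTree {
	dup := make([]flatNode, len(t.nodes))
	copy(dup, t.nodes)
	return &flatTree{nodes: dup, root: t.root}
}

func main() {
	orig := &flatTree{root: 0, nodes: []flatNode{
		{left: 1, right: 2, height: 1},
		{left: -1, right: -1},
		{left: -1, right: -1},
	}}
	snap := orig.snapshot()
	snap.nodes[1].hash[0] = 0xff // mutate the copy only

	// The original is untouched because the backing arrays are distinct.
	fmt.Println(orig.nodes[1].hash[0], snap.nodes[1].hash[0])
}
```

Whether this actually beats the pointer-based tree would have to be measured with the benchmarks described earlier; the sketch only shows the shape of the data structure.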

Some links for thought:

* [Array representation of a binary tree](http://www.cse.hut.fi/en/research/SVG/TRAKLA2/tutorials/heap_tutorial/taulukkona.html)
* [Memcpy buffer size timing](http://stackoverflow.com/questions/21038965/why-does-the-speed-of-memcpy-drop-dramatically-every-4kb)
* [Calling memcpy from go](https://github.com/jsgilmore/shm/blob/master/memcpy.go)
* [Unsafe docs](https://godoc.org/unsafe)
* [...and how to use it](https://copyninja.info/blog/workaround-gotypesystems.html)
* [Or maybe just plain copy...](https://godoc.org/builtin#copy)

## Backend implementation

Storing each link in the tree in leveldb treats each node as an isolated item. Since we know some usage patterns (when a parent is hit, very likely one child will be hit), we could try to organize the memory and disk location of the nodes ourselves to make it more efficient. Of course, this could be a long, slippery slope.

Inspired by the [Array representation](http://www.cse.hut.fi/en/research/SVG/TRAKLA2/tutorials/heap_tutorial/taulukkona.html) link above, we could consider other layouts for the nodes. For example, rather than store them alone, or the entire tree in one big array, the nodes could be placed in groups of 15 based on the parent (parent and 3 generations of children). Then we have 4 levels before jumping to another location. Maybe we just store this larger chunk as one leveldb location, or really try to do the mmap ourselves...

In any case, assuming around 100 bytes for one non-leaf node (3 sha hashes, plus prefix, plus other data), 15 nodes would be a little less than 2K. We could maybe even go one more level, to 31 nodes and 3-4KB, where we could take best advantage of the memory/disk page size.

Some links for thought:

* [Memory mapped files](https://github.com/edsrzf/mmap-go)
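
The chunk arithmetic in the backend discussion above can be checked with a small sketch. It assumes the rough 100-byte node estimate from the text and the standard array (heap) layout for a complete binary subtree. The function names `nodesInChunk`, `chunkBytes` and `children` are hypothetical and exist only for this illustration.

```go
package main

import "fmt"

// approxNodeBytes is the rough per-node size estimate used in the text.
const approxNodeBytes = 100

// nodesInChunk returns the number of nodes in a complete binary subtree
// with the given number of levels (the chunk's parent is level 1).
func nodesInChunk(levels uint) int {
	return (1 << levels) - 1
}

// chunkBytes estimates the size of one chunk of grouped nodes.
func chunkBytes(levels uint) int {
	return nodesInChunk(levels) * approxNodeBytes
}

// children returns the in-chunk slots of the two children of slot i, using
// the array layout where slot 0 is the chunk's parent. ok is false when the
// children fall outside this chunk, i.e. a jump to another chunk is needed.
func children(i int, levels uint) (left, right int, ok bool) {
	left, right = 2*i+1, 2*i+2
	return left, right, right < nodesInChunk(levels)
}

func main() {
	for _, levels := range []uint{4, 5} {
		// 4 levels: 15 nodes, ~1500 bytes
		// 5 levels: 31 nodes, ~3100 bytes
		fmt.Printf("%d levels: %d nodes, ~%d bytes\n",
			levels, nodesInChunk(levels), chunkBytes(levels))
	}

	l, r, ok := children(6, 4)
	fmt.Println(l, r, ok) // 13 14 true: still inside the 15-node chunk

	l, r, ok = children(7, 4)
	fmt.Println(l, r, ok) // 15 16 false: slot 7 is on the chunk's last level
}
```

With four levels per chunk, a lookup touches one stored chunk for every four tree levels it descends, which is the locality gain the grouping is after.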