github.com/matrixorigin/matrixone@v1.2.0/pkg/common/arenaskl/README.md

# arenaskl

Fast, lock-free, arena-based Skiplist implementation in Go that supports iteration
in both directions.

Since this is an internal library of Pebble, MatrixOne copied and modified it to
support additional features and to make it easier to use in MatrixOne.

## Advantages

Arenaskl offers several advantages over other skiplist implementations:

* High performance that scales linearly with the number of cores. This is
  achieved by allocating from a fixed-size arena and by avoiding locks.
* Iterators that can be allocated on the stack and easily cloned by value.
* Simple-to-use, low-overhead model for detecting and handling race conditions
  with other threads.
* Support for iterating in reverse (i.e. previous links).

## Limitations

The advantages come at a cost that prevents arenaskl from being a general-purpose
skiplist implementation:

* The size of the arena sets a hard upper bound on the combined size of skiplist
  nodes, keys, and values. This limit includes even the size of deleted nodes,
  keys, and values.
* Deletion is not supported. Instead, higher-level code is expected to
  add deletion tombstones and needs to process those tombstones
  appropriately.

## Pedigree

This code is based on Andy Kimball's arenaskl code:

https://github.com/andy-kimball/arenaskl

The arenaskl code is based on the skiplist found in Badger, a Go-based
KV store:

https://github.com/dgraph-io/badger/tree/master/skl

The skiplist in Badger is itself based on a C++ skiplist built for
Facebook's RocksDB:

https://github.com/facebook/rocksdb/tree/master/memtable

## Benchmarks

The benchmarks consist of a mix of reads and writes executed in parallel. The
fraction of reads is indicated in the run name: "frac_X" indicates a run where
X percent of the operations are reads.
The results are much better than `skiplist` and `slist`.

```
name                    time/op
ReadWrite/frac_0-8       470ns ±11%
ReadWrite/frac_10-8      462ns ± 3%
ReadWrite/frac_20-8      436ns ± 2%
ReadWrite/frac_30-8      410ns ± 2%
ReadWrite/frac_40-8      385ns ± 2%
ReadWrite/frac_50-8      360ns ± 4%
ReadWrite/frac_60-8      386ns ± 1%
ReadWrite/frac_70-8      352ns ± 2%
ReadWrite/frac_80-8      306ns ± 3%
ReadWrite/frac_90-8      253ns ± 4%
ReadWrite/frac_100-8    28.1ns ± 2%
```

Note that the above numbers are for concurrent operations using 8x
parallelism. The same benchmarks run without concurrency produce the
following results (use these numbers when comparing against batchskl):

```
name                  time/op
ReadWrite/frac_0      1.53µs ± 1%
ReadWrite/frac_10     1.46µs ± 2%
ReadWrite/frac_20     1.39µs ± 3%
ReadWrite/frac_30     1.28µs ± 3%
ReadWrite/frac_40     1.21µs ± 2%
ReadWrite/frac_50     1.11µs ± 3%
ReadWrite/frac_60     1.23µs ±17%
ReadWrite/frac_70     1.16µs ± 4%
ReadWrite/frac_80      959ns ± 3%
ReadWrite/frac_90      738ns ± 5%
ReadWrite/frac_100    81.9ns ± 2%
```

Forward and backward iteration are also fast:

```
name        time/op
IterNext    3.97ns ± 5%
IterPrev    3.88ns ± 3%
```