roaring [![GoDoc](https://godoc.org/github.com/RoaringBitmap/roaring/roaring64?status.svg)](https://godoc.org/github.com/RoaringBitmap/roaring/roaring64) [![Go Report Card](https://goreportcard.com/badge/RoaringBitmap/roaring)](https://goreportcard.com/report/github.com/RoaringBitmap/roaring)
[![Build Status](https://cloud.drone.io/api/badges/RoaringBitmap/roaring/status.svg)](https://cloud.drone.io/RoaringBitmap/roaring)
![Go-CI](https://github.com/RoaringBitmap/roaring/workflows/Go-CI/badge.svg)
![Go-ARM-CI](https://github.com/RoaringBitmap/roaring/workflows/Go-ARM-CI/badge.svg)
![Go-Windows-CI](https://github.com/RoaringBitmap/roaring/workflows/Go-Windows-CI/badge.svg)
=============

This is a Go version of the Roaring bitmap data structure.

Roaring bitmaps are used by several major systems such as [Apache Lucene][lucene] and derivative systems such as [Solr][solr] and
[Elasticsearch][elasticsearch], [Apache Druid (Incubating)][druid], [LinkedIn Pinot][pinot], [Netflix Atlas][atlas], [Apache Spark][spark], [OpenSearchServer][opensearchserver], [anacrolix/torrent][anacrolix/torrent], [Whoosh][whoosh], [Pilosa][pilosa], [Microsoft Visual Studio Team Services (VSTS)][vsts], and eBay's [Apache Kylin][kylin]. The YouTube SQL Engine, [Google Procella](https://research.google/pubs/pub48388/), uses Roaring bitmaps for indexing.

[lucene]: https://lucene.apache.org/
[solr]: https://lucene.apache.org/solr/
[elasticsearch]: https://www.elastic.co/products/elasticsearch
[druid]: https://druid.apache.org/
[spark]: https://spark.apache.org/
[opensearchserver]: http://www.opensearchserver.com
[anacrolix/torrent]: https://github.com/anacrolix/torrent
[whoosh]: https://bitbucket.org/mchaput/whoosh/wiki/Home
[pilosa]: https://www.pilosa.com/
[kylin]: http://kylin.apache.org/
[pinot]: http://github.com/linkedin/pinot/wiki
[vsts]: https://www.visualstudio.com/team-services/
[atlas]: https://github.com/Netflix/atlas

Roaring bitmaps are found to work well in many important applications:

> Use Roaring for bitmap compression whenever possible. Do not use other bitmap compression methods ([Wang et al., SIGMOD 2017](http://db.ucsd.edu/wp-content/uploads/2017/03/sidm338-wangA.pdf))


The ``roaring`` Go library is used by
* [anacrolix/torrent]
* [runv](https://github.com/hyperhq/runv)
* [InfluxDB](https://www.influxdata.com)
* [Pilosa](https://www.pilosa.com/)
* [Bleve](http://www.blevesearch.com)
* [lindb](https://github.com/lindb/lindb)
* [Elasticell](https://github.com/deepfabric/elasticell)
* [SourceGraph](https://github.com/sourcegraph/sourcegraph)
* [M3](https://github.com/m3db/m3)
* [trident](https://github.com/NetApp/trident)
* [Husky](https://www.datadoghq.com/blog/engineering/introducing-husky/)


This library is used in production in several systems; it is part of the [Awesome Go collection](https://awesome-go.com).


There are also [Java](https://github.com/RoaringBitmap/RoaringBitmap) and [C/C++](https://github.com/RoaringBitmap/CRoaring) versions. The Java, C, C++ and Go versions are binary compatible: e.g., you can save bitmaps
from a Java program and load them back in Go, and vice versa. We have a [format specification](https://github.com/RoaringBitmap/RoaringFormatSpec).


This code is licensed under Apache License, Version 2.0 (ASL2.0).

Copyright 2016-... by the authors.

When should you use a bitmap?
===================================


Sets are a fundamental abstraction in
software. They can be implemented in various
ways, as hash sets, as trees, and so forth.
In databases and search engines, sets are often an integral
part of indexes. For example, we may need to maintain a set
of all documents or rows (represented by numerical identifier)
that satisfy some property. Besides adding or removing
elements from the set, we need fast functions
to compute the intersection, the union, the difference between sets, and so on.


To implement a set
of integers, a particularly appealing strategy is the
bitmap (also called bitset or bit vector). Using n bits,
we can represent any set made of the integers from the range
[0,n): the ith bit is set to one if integer i is present in the set.
Commodity processors use words of W=32 or W=64 bits. By combining many such words, we can
support large values of n. Intersections, unions and differences can then be implemented
as bitwise AND, OR and ANDNOT operations.
More complicated set functions can also be implemented as bitwise operations.
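
As a rough illustration of the paragraph above, here is a minimal uncompressed bitset built from 64-bit words. The `Bitset` type and its methods are hypothetical names made up for this sketch; they are not part of the roaring API, and the code assumes both operands have the same length.

```go
package main

import (
	"fmt"
	"math/bits"
)

// Bitset is a fixed-size uncompressed bitmap over the range [0, n).
// Hypothetical type for illustration only; not part of roaring.
type Bitset []uint64

// NewBitset allocates enough 64-bit words to hold n bits.
func NewBitset(n int) Bitset { return make(Bitset, (n+63)/64) }

// Set marks integer i as present.
func (b Bitset) Set(i int) { b[i/64] |= 1 << (uint(i) % 64) }

// Has reports whether integer i is present.
func (b Bitset) Has(i int) bool { return b[i/64]&(1<<(uint(i)%64)) != 0 }

// combine applies op word by word; both bitsets must have the same length.
func (b Bitset) combine(o Bitset, op func(x, y uint64) uint64) Bitset {
	r := make(Bitset, len(b))
	for k := range b {
		r[k] = op(b[k], o[k])
	}
	return r
}

// And is the intersection, Or the union, AndNot the difference.
func (b Bitset) And(o Bitset) Bitset {
	return b.combine(o, func(x, y uint64) uint64 { return x & y })
}
func (b Bitset) Or(o Bitset) Bitset {
	return b.combine(o, func(x, y uint64) uint64 { return x | y })
}
func (b Bitset) AndNot(o Bitset) Bitset {
	return b.combine(o, func(x, y uint64) uint64 { return x &^ y })
}

// Values lists the integers present, in increasing order.
func (b Bitset) Values() []int {
	var out []int
	for k, w := range b {
		for w != 0 {
			out = append(out, k*64+bits.TrailingZeros64(w))
			w &= w - 1 // clear the lowest set bit
		}
	}
	return out
}

func main() {
	a, b := NewBitset(128), NewBitset(128)
	for _, v := range []int{1, 2, 3, 100} {
		a.Set(v)
	}
	for _, v := range []int{2, 3, 64} {
		b.Set(v)
	}
	fmt.Println(a.And(b).Values())    // [2 3]
	fmt.Println(a.Or(b).Values())     // [1 2 3 64 100]
	fmt.Println(a.AndNot(b).Values()) // [1 100]
}
```

Each set operation touches every word exactly once, which is why these operations are fast when the universe is small enough to fit in memory.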

When the bitset approach is applicable, it can be orders of
magnitude faster than other possible implementations of a set (e.g., as a hash set)
while using several times less memory.

However, a bitset, even a compressed one, is not always applicable. For example, if
you have 1000 random-looking integers, then a simple array might be the best representation.
We refer to this case as the "sparse" scenario.

When should you use compressed bitmaps?
===================================

An uncompressed BitSet can use a lot of memory. For example, if you take a BitSet
and set the bit at position 1,000,000 to true, then you have just over 100kB. That is over 100kB
to store the position of one bit. This is wasteful even if you do not care about memory:
suppose that you need to compute the intersection between this BitSet and another one
that has the bit at position 1,000,001 set to true, then you need to go through all these zeroes,
whether you like it or not. That can become very wasteful.
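
The arithmetic behind that figure can be checked with a few lines of Go. This sketch assumes a bitset stored as an array of 64-bit words that grows just far enough to hold the highest set bit; `bitsetBytes` is a hypothetical helper, not a library function.

```go
package main

import "fmt"

// bitsetBytes returns the number of bytes used by an uncompressed bitset
// made of 64-bit words whose highest set bit is at position maxPos.
// Hypothetical helper for illustration only.
func bitsetBytes(maxPos uint64) uint64 {
	words := maxPos/64 + 1 // words needed to reach bit maxPos
	return words * 8       // 8 bytes per 64-bit word
}

func main() {
	fmt.Println(bitsetBytes(1000000)) // 125008
}
```

So a single bit at position 1,000,000 costs about 125 kB, and an intersection has to scan all of those words.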

This being said, there are definitely cases where attempting to use compressed bitmaps is wasteful.
For example, if you have a small universe size. E.g., your bitmaps represent sets of integers
from [0,n) where n is small (e.g., n=64 or n=128). If you are able to use an uncompressed BitSet and
it does not blow up your memory usage, then compressed bitmaps are probably not useful
to you. In fact, if you do not need compression, then a BitSet offers remarkable speed.

The sparse scenario is another use case where compressed bitmaps should not be used.
Keep in mind that random-looking data is usually not compressible. E.g., if you have a small set of
32-bit random integers, it is not mathematically possible to use far less than 32 bits per integer,
and attempts at compression can be counterproductive.

How does Roaring compare with the alternatives?
==================================================


Most alternatives to Roaring are part of a larger family of compressed bitmaps that are run-length-encoded
bitmaps. They identify long runs of 1s or 0s and they represent them with a marker word.
If you have a local mix of 1s and 0s, you use an uncompressed word.

There are many formats in this family:

* Oracle's BBC is an obsolete format at this point: though it may provide good compression,
it is likely much slower than more recent alternatives due to excessive branching.
* WAH is a patented variation on BBC that provides better performance.
* Concise is a variation on the patented WAH. In some specific instances, it can compress
much better than WAH (up to 2x better), but it is generally slower.
* EWAH is both free of patent, and it is faster than all the above. On the downside, it
does not compress quite as well. It is faster because it allows some form of "skipping"
over uncompressed words. So though none of these formats are great at random access, EWAH
is better than the alternatives.



There is a big problem with these formats, however, that can hurt you badly in some cases: there is no random access. If you want to check whether a given value is present in the set, you have to start from the beginning and "uncompress" the whole thing. This means that if you want to intersect a small set with a large set, you still have to uncompress the whole large set in the worst case...

Roaring solves this problem. It works in the following manner. It divides the data into chunks of 2<sup>16</sup> integers
(e.g., [0, 2<sup>16</sup>), [2<sup>16</sup>, 2 x 2<sup>16</sup>), ...). Within a chunk, it can use an uncompressed bitmap, a simple list of integers,
or a list of runs. Whatever format it uses, they all allow you to check for the presence of any one value quickly
(e.g., with a binary search). The net result is that Roaring can compute many operations much faster than run-length-encoded
formats like WAH, EWAH, Concise... Maybe surprisingly, Roaring also generally offers better compression ratios.
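
The chunking idea can be sketched in a few lines. This is a toy model, not the library's implementation: it only shows the "simple list of integers" layout, it uses a Go map to find chunks purely to keep the example short, and every name in it (`split`, `arrayContainer`, `toyRoaring`) is made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// split breaks a 32-bit value into its chunk key (high 16 bits) and its
// position inside the chunk (low 16 bits).
func split(v uint32) (key, low uint16) { return uint16(v >> 16), uint16(v) }

// arrayContainer is a sorted list of low 16-bit values: one of the three
// per-chunk layouts described above (the other two are not modeled here).
type arrayContainer []uint16

// contains checks membership with a binary search.
func (c arrayContainer) contains(low uint16) bool {
	i := sort.Search(len(c), func(i int) bool { return c[i] >= low })
	return i < len(c) && c[i] == low
}

// toyRoaring maps chunk keys to containers (assumption: a map is used only
// for brevity in this sketch).
type toyRoaring map[uint16]arrayContainer

// add inserts v into the container of its chunk, keeping it sorted.
func (r toyRoaring) add(v uint32) {
	key, low := split(v)
	c := r[key]
	i := sort.Search(len(c), func(i int) bool { return c[i] >= low })
	if i < len(c) && c[i] == low {
		return // already present
	}
	c = append(c, 0)
	copy(c[i+1:], c[i:])
	c[i] = low
	r[key] = c
}

// contains looks up the chunk first, then searches inside it, so no other
// chunk has to be read or decoded.
func (r toyRoaring) contains(v uint32) bool {
	key, low := split(v)
	return r[key].contains(low)
}

func main() {
	r := toyRoaring{}
	for _, v := range []uint32{3, 70000, 65536, 1 << 20} {
		r.add(v)
	}
	fmt.Println(r.contains(70000), r.contains(70001)) // true false
	fmt.Println(len(r))                               // 3 chunks: keys 0, 1 and 16
}
```

A membership test only touches one chunk, which is the random-access property that run-length-encoded formats lack.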
### References

- Daniel Lemire, Owen Kaser, Nathan Kurz, Luca Deri, Chris O'Hara, François Saint-Jacques, Gregory Ssi-Yan-Kai, Roaring Bitmaps: Implementation of an Optimized Software Library, Software: Practice and Experience 48 (4), 2018. [arXiv:1709.07821](https://arxiv.org/abs/1709.07821)
- Samy Chambi, Daniel Lemire, Owen Kaser, Robert Godin, Better bitmap performance with Roaring bitmaps, Software: Practice and Experience 46 (5), 2016. http://arxiv.org/abs/1402.6407 This paper used data from http://lemire.me/data/realroaring2014.html
- Daniel Lemire, Gregory Ssi-Yan-Kai, Owen Kaser, Consistently faster and smaller compressed bitmaps with Roaring, Software: Practice and Experience 46 (11), 2016. http://arxiv.org/abs/1603.06549


### Dependencies

Dependencies are fetched automatically by giving the `-t` flag to `go get`.

They include
  - github.com/bits-and-blooms/bitset
  - github.com/mschoch/smat
  - github.com/glycerine/go-unsnap-stream
  - github.com/philhofer/fwd
  - github.com/jtolds/gls

Note that the smat library requires Go 1.6 or later.

#### Installation

  - go get -t github.com/RoaringBitmap/roaring


### Example

Here is a simplified but complete example:

```go
package main

import (
    "bytes"
    "fmt"

    "github.com/RoaringBitmap/roaring"
)

func main() {
    // example inspired by https://github.com/fzandona/goroar
    fmt.Println("==roaring==")
    rb1 := roaring.BitmapOf(1, 2, 3, 4, 5, 100, 1000)
    fmt.Println(rb1.String())

    rb2 := roaring.BitmapOf(3, 4, 1000)
    fmt.Println(rb2.String())

    rb3 := roaring.New()
    fmt.Println(rb3.String())

    fmt.Println("Cardinality: ", rb1.GetCardinality())

    fmt.Println("Contains 3? ", rb1.Contains(3))

    // in-place intersection: rb1 now holds 3, 4, 1000
    rb1.And(rb2)

    rb3.Add(1)
    rb3.Add(5)

    // in-place union: rb3 now holds 1, 3, 4, 5, 1000
    rb3.Or(rb1)

    // computes union of the three bitmaps in parallel using 4 workers
    roaring.ParOr(4, rb1, rb2, rb3)
    // computes intersection of the three bitmaps in parallel using 4 workers
    roaring.ParAnd(4, rb1, rb2, rb3)

    // prints 1, 3, 4, 5, 1000
    i := rb3.Iterator()
    for i.HasNext() {
        fmt.Println(i.Next())
    }
    fmt.Println()

    // next we include an example of serialization
    buf := new(bytes.Buffer)
    rb1.WriteTo(buf) // we omit error handling
    newrb := roaring.New()
    newrb.ReadFrom(buf)
    if rb1.Equals(newrb) {
        fmt.Println("I wrote the content to a byte stream and read it back.")
    }
    // you can iterate over bitmaps using Iterator(), ReverseIterator() or ManyIterator()
}
```

If you wish to use serialization and handle errors, you might want to
consider the following sample of code:

```go
	// This fragment assumes it runs inside a test in the roaring package,
	// so BitmapOf and New are unqualified and t is a *testing.T.
	rb := BitmapOf(1, 2, 3, 4, 5, 100, 1000)
	buf := new(bytes.Buffer)
	_, err := rb.WriteTo(buf) // the returned byte count is not needed here
	if err != nil {
		t.Errorf("Failed writing")
	}
	newrb := New()
	_, err = newrb.ReadFrom(buf)
	if err != nil {
		t.Errorf("Failed reading")
	}
	if !rb.Equals(newrb) {
		t.Errorf("Cannot retrieve serialized version")
	}
```

Given N integers in [0,x), then the serialized size in bytes of
a Roaring bitmap should never exceed this bound:

`` 8 + 9 * ((long)x+65535)/65536 + 2 * N ``

That is, given a fixed overhead for the universe size (x), Roaring
bitmaps never use more than 2 bytes per integer. You can call
``BoundSerializedSizeInBytes`` for a more precise estimate.
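
To get a feel for the numbers, the bound can be evaluated directly. The sketch below transcribes the expression with plain integer arithmetic, evaluated left to right as C, Java and Go all would; `upperBoundBytes` is a hypothetical helper written for this example and is not the library's `BoundSerializedSizeInBytes`.

```go
package main

import "fmt"

// upperBoundBytes evaluates 8 + 9*(x+65535)/65536 + 2*N using integer
// arithmetic, where n is the number of integers and x the universe size.
// Hypothetical helper for illustration only.
func upperBoundBytes(n, x uint64) uint64 {
	return 8 + 9*(x+65535)/65536 + 2*n
}

func main() {
	// 1000 integers drawn from [0, 1000000)
	fmt.Println(upperBoundBytes(1000, 1000000)) // 2154
}
```

For 1000 integers in [0, 1,000,000) the bound is 2154 bytes: 2000 bytes for the values themselves plus a small overhead that depends only on the universe size.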

### 64-bit Roaring

By default, roaring is used to store unsigned 32-bit integers. However, we also offer
an extension dedicated to 64-bit integers. It supports roughly the same functions:

```go
package main

import (
    "bytes"
    "fmt"

    "github.com/RoaringBitmap/roaring/roaring64"
)

func main() {
    // example inspired by https://github.com/fzandona/goroar
    fmt.Println("==roaring64==")
    rb1 := roaring64.BitmapOf(1, 2, 3, 4, 5, 100, 1000)
    fmt.Println(rb1.String())

    rb2 := roaring64.BitmapOf(3, 4, 1000)
    fmt.Println(rb2.String())

    rb3 := roaring64.New()
    fmt.Println(rb3.String())

    fmt.Println("Cardinality: ", rb1.GetCardinality())

    fmt.Println("Contains 3? ", rb1.Contains(3))

    // in-place intersection: rb1 now holds 3, 4, 1000
    rb1.And(rb2)

    rb3.Add(1)
    rb3.Add(5)

    // in-place union: rb3 now holds 1, 3, 4, 5, 1000
    rb3.Or(rb1)

    // prints 1, 3, 4, 5, 1000
    i := rb3.Iterator()
    for i.HasNext() {
        fmt.Println(i.Next())
    }
    fmt.Println()

    // next we include an example of serialization
    buf := new(bytes.Buffer)
    rb1.WriteTo(buf) // we omit error handling
    newrb := roaring64.New()
    newrb.ReadFrom(buf)
    if rb1.Equals(newrb) {
        fmt.Println("I wrote the content to a byte stream and read it back.")
    }
    // you can iterate over bitmaps using Iterator(), ReverseIterator() or ManyIterator()
}
```

Only the 32-bit roaring format is standard and interoperable between Java, C++, C and Go. There is no guarantee that the 64-bit versions are compatible.

### Documentation

Current documentation is available at http://godoc.org/github.com/RoaringBitmap/roaring and http://godoc.org/github.com/RoaringBitmap/roaring/roaring64

### Goroutine safety

In general, it is not safe to access the same bitmap from
different goroutines: bitmaps are left
unsynchronized for performance. Should you want to access
a Bitmap from more than one goroutine, you should
provide synchronization. Typically this is done by using channels to pass
the *Bitmap around (in Go style, so there is only ever one owner),
or by using `sync.Mutex` to serialize operations on Bitmaps.

### Coverage

We test our software. For a report on our test coverage, see

https://coveralls.io/github/RoaringBitmap/roaring?branch=master

### Benchmark

Type

         go test -bench Benchmark -run -

To run benchmarks on [Real Roaring Datasets](https://github.com/RoaringBitmap/real-roaring-datasets)
run the following:

```sh
go get github.com/RoaringBitmap/real-roaring-datasets
BENCH_REAL_DATA=1 go test -bench BenchmarkRealData -run -
```

### Interactive use

You can use roaring with gore:

- go get -u github.com/motemen/gore
- Make sure that ``$GOPATH/bin`` is in your ``$PATH``.
- go get github.com/RoaringBitmap/roaring

```go
$ gore
gore version 0.2.6  :help for help
gore> :import github.com/RoaringBitmap/roaring
gore> x:=roaring.New()
gore> x.Add(1)
gore> x.String()
"{1}"
```


### Fuzz testing

You can help us further test the library with fuzz testing:

         go get github.com/dvyukov/go-fuzz/go-fuzz
         go get github.com/dvyukov/go-fuzz/go-fuzz-build
         go test -tags=gofuzz -run=TestGenerateSmatCorpus
         go-fuzz-build github.com/RoaringBitmap/roaring
         go-fuzz -bin=./roaring-fuzz.zip -workdir=workdir/ -timeout=200 -func FuzzSmat

Let it run, and if the # of crashers is > 0, check out the reports in
the workdir where you should be able to find the panic goroutine stack
traces.

You may also replace `-func FuzzSmat` by `-func FuzzSerializationBuffer` or `-func FuzzSerializationStream`.

### Alternative in Go

There is a Go version wrapping the C/C++ implementation: https://github.com/RoaringBitmap/gocroaring

For an alternative implementation in Go, see https://github.com/fzandona/goroar
The two versions were written independently.


### Mailing list/discussion group

https://groups.google.com/forum/#!forum/roaring-bitmaps