roaring [![GoDoc](https://godoc.org/github.com/RoaringBitmap/roaring/roaring64?status.svg)](https://godoc.org/github.com/RoaringBitmap/roaring/roaring64) [![Go Report Card](https://goreportcard.com/badge/RoaringBitmap/roaring)](https://goreportcard.com/report/github.com/RoaringBitmap/roaring)
[![Build Status](https://cloud.drone.io/api/badges/RoaringBitmap/roaring/status.svg)](https://cloud.drone.io/RoaringBitmap/roaring)
![Go-CI](https://github.com/RoaringBitmap/roaring/workflows/Go-CI/badge.svg)
![Go-ARM-CI](https://github.com/RoaringBitmap/roaring/workflows/Go-ARM-CI/badge.svg)
![Go-Windows-CI](https://github.com/RoaringBitmap/roaring/workflows/Go-Windows-CI/badge.svg)
=============

This is a Go version of the Roaring bitmap data structure.

Roaring bitmaps are used by several major systems such as [Apache Lucene][lucene] and derivative systems such as [Solr][solr] and
[Elasticsearch][elasticsearch], [Apache Druid (Incubating)][druid], [LinkedIn Pinot][pinot], [Netflix Atlas][atlas], [Apache Spark][spark], [OpenSearchServer][opensearchserver], [anacrolix/torrent][anacrolix/torrent], [Whoosh][whoosh], [Pilosa][pilosa], [Microsoft Visual Studio Team Services (VSTS)][vsts], and eBay's [Apache Kylin][kylin]. The YouTube SQL Engine, [Google Procella](https://research.google/pubs/pub48388/), uses Roaring bitmaps for indexing.


[lucene]: https://lucene.apache.org/
[solr]: https://lucene.apache.org/solr/
[elasticsearch]: https://www.elastic.co/products/elasticsearch
[druid]: https://druid.apache.org/
[spark]: https://spark.apache.org/
[opensearchserver]: http://www.opensearchserver.com
[anacrolix/torrent]: https://github.com/anacrolix/torrent
[whoosh]: https://bitbucket.org/mchaput/whoosh/wiki/Home
[pilosa]: https://www.pilosa.com/
[kylin]: http://kylin.apache.org/
[pinot]: http://github.com/linkedin/pinot/wiki
[vsts]: https://www.visualstudio.com/team-services/
[atlas]: https://github.com/Netflix/atlas

Roaring bitmaps are found to work well in many important applications:

> Use Roaring for bitmap compression whenever possible. Do not use other bitmap compression methods ([Wang et al., SIGMOD 2017](http://db.ucsd.edu/wp-content/uploads/2017/03/sidm338-wangA.pdf))


The ``roaring`` Go library is used by
* [anacrolix/torrent]
* [runv](https://github.com/hyperhq/runv)
* [InfluxDB](https://www.influxdata.com)
* [Pilosa](https://www.pilosa.com/)
* [Bleve](http://www.blevesearch.com)
* [lindb](https://github.com/lindb/lindb)
* [Elasticell](https://github.com/deepfabric/elasticell)
* [SourceGraph](https://github.com/sourcegraph/sourcegraph)
* [M3](https://github.com/m3db/m3)
* [trident](https://github.com/NetApp/trident)
* [Husky](https://www.datadoghq.com/blog/engineering/introducing-husky/)


This library is used in production in several systems, and it is part of the [Awesome Go collection](https://awesome-go.com).


There are also [Java](https://github.com/RoaringBitmap/RoaringBitmap) and [C/C++](https://github.com/RoaringBitmap/CRoaring) versions. The Java, C, C++ and Go versions are binary compatible: e.g., you can save bitmaps
from a Java program and load them back in Go, and vice versa. We have a [format specification](https://github.com/RoaringBitmap/RoaringFormatSpec).


This code is licensed under Apache License, Version 2.0 (ASL2.0).

Copyright 2016-... by the authors.

When should you use a bitmap?
===================================


Sets are a fundamental abstraction in
software. They can be implemented in various
ways, as hash sets, as trees, and so forth.
In databases and search engines, sets are often an integral
part of indexes. For example, we may need to maintain a set
of all documents or rows (represented by numerical identifier)
that satisfy some property. Besides adding or removing
elements from the set, we need fast functions
to compute the intersection, the union, the difference between sets, and so on.


To implement a set
of integers, a particularly appealing strategy is the
bitmap (also called bitset or bit vector). Using n bits,
we can represent any set made of the integers from the range
[0,n): the ith bit is set to one if integer i is present in the set.
Commodity processors use words of W=32 or W=64 bits. By combining many such words, we can
support large values of n. Intersections, unions and differences can then be implemented
as bitwise AND, OR and ANDNOT operations.
More complicated set functions can also be implemented as bitwise operations.

When the bitset approach is applicable, it can be orders of
magnitude faster than other possible implementations of a set (e.g., as a hash set)
while using several times less memory.

However, a bitset, even a compressed one, is not always applicable. For example, if
you have 1000 random-looking integers, then a simple array might be the best representation.
We refer to this case as the "sparse" scenario.

When should you use compressed bitmaps?
===================================

An uncompressed BitSet can use a lot of memory. For example, if you take a BitSet
and set the bit at position 1,000,000 to true, you have just over 100kB.
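
As a rough illustration of both points (set operations as bitwise operations, and the memory cost of a single high bit), here is a minimal sketch of an uncompressed bitset built on 64-bit words. The `bitset` type and its `set`, `contains` and `and` helpers are hypothetical names written for this example; they are not part of the roaring API.

```go
package main

import "fmt"

// bitset is a hypothetical uncompressed bitmap: bit i of the set lives in
// word i/64 at position i%64. It is not part of the roaring API.
type bitset []uint64

// set adds integer i to the set, growing the backing slice when needed.
func (b *bitset) set(i uint) {
	w := int(i / 64)
	if w >= len(*b) {
		grown := make(bitset, w+1)
		copy(grown, *b)
		*b = grown
	}
	(*b)[w] |= uint64(1) << (i % 64)
}

// contains reports whether integer i is in the set.
func (b bitset) contains(i uint) bool {
	w := int(i / 64)
	return w < len(b) && b[w]&(uint64(1)<<(i%64)) != 0
}

// and returns the intersection of a and b using one bitwise AND per word.
func and(a, b bitset) bitset {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	out := make(bitset, n)
	for i := 0; i < n; i++ {
		out[i] = a[i] & b[i]
	}
	return out
}

func main() {
	var a, b bitset
	a.set(1000000)
	b.set(1000001)

	// A single bit at position 1,000,000 forces 15626 words of storage.
	fmt.Println(len(a), "words,", 8*len(a), "bytes") // 15626 words, 125008 bytes

	// The intersection is empty, yet every word still had to be visited.
	c := and(a, b)
	fmt.Println(c.contains(1000000), c.contains(1000001), len(c)) // false false 15626
}
```
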
That is over 100kB
to store the position of one bit. This is wasteful even if you do not care about memory:
suppose that you need to compute the intersection between this BitSet and another one
that has the bit at position 1,000,001 set to true; you then need to go through all these zeroes,
whether you like it or not. That can become very wasteful.

This being said, there are definitely cases where attempting to use compressed bitmaps is wasteful.
For example, you may have a small universe size: your bitmaps represent sets of integers
from [0,n) where n is small (e.g., n=64 or n=128). If you are able to use an uncompressed BitSet and
it does not blow up your memory usage, then compressed bitmaps are probably not useful
to you. In fact, if you do not need compression, then a BitSet offers remarkable speed.

The sparse scenario is another use case where compressed bitmaps should not be used.
Keep in mind that random-looking data is usually not compressible. E.g., if you have a small set of
32-bit random integers, it is not mathematically possible to use far less than 32 bits per integer,
and attempts at compression can be counterproductive.

How does Roaring compare with the alternatives?
==================================================


Most alternatives to Roaring are part of a larger family of compressed bitmaps that are run-length-encoded
bitmaps. They identify long runs of 1s or 0s and they represent them with a marker word.
If you have a local mix of 1s and 0s, you use an uncompressed word.

There are many formats in this family:

* Oracle's BBC is an obsolete format at this point: though it may provide good compression,
it is likely much slower than more recent alternatives due to excessive branching.
* WAH is a patented variation on BBC that provides better performance.
* Concise is a variation on the patented WAH.
In some specific instances, it can compress
much better than WAH (up to 2x better), but it is generally slower.
* EWAH is free of patents and faster than all of the above. On the downside, it
does not compress quite as well. It is faster because it allows some form of "skipping"
over uncompressed words. So though none of these formats are great at random access, EWAH
is better than the alternatives.



There is a big problem with these formats, however, that can hurt you badly in some cases: there is no random access. If you want to check whether a given value is present in the set, you have to start from the beginning and "uncompress" the whole thing. This means that if you want to intersect a big set with a large set, you still have to uncompress the whole big set in the worst case...

Roaring solves this problem. It works in the following manner. It divides the data into chunks of 2<sup>16</sup> integers
(e.g., [0, 2<sup>16</sup>), [2<sup>16</sup>, 2 x 2<sup>16</sup>), ...). Within a chunk, it can use an uncompressed bitmap, a simple list of integers,
or a list of runs. Whatever format it uses, they all allow you to check for the presence of any one value quickly
(e.g., with a binary search). The net result is that Roaring can compute many operations much faster than run-length-encoded
formats like WAH, EWAH, Concise... Maybe surprisingly, Roaring also generally offers better compression ratios.





### References

-  Daniel Lemire, Owen Kaser, Nathan Kurz, Luca Deri, Chris O'Hara, François Saint-Jacques, Gregory Ssi-Yan-Kai, Roaring Bitmaps: Implementation of an Optimized Software Library, Software: Practice and Experience 48 (4), 2018 [arXiv:1709.07821](https://arxiv.org/abs/1709.07821)
-  Samy Chambi, Daniel Lemire, Owen Kaser, Robert Godin,
Better bitmap performance with Roaring bitmaps,
Software: Practice and Experience 46 (5), 2016.
http://arxiv.org/abs/1402.6407 This paper used data from http://lemire.me/data/realroaring2014.html
-  Daniel Lemire, Gregory Ssi-Yan-Kai, Owen Kaser, Consistently faster and smaller compressed bitmaps with Roaring, Software: Practice and Experience 46 (11), 2016. http://arxiv.org/abs/1603.06549


### Dependencies

Dependencies are fetched automatically by giving the `-t` flag to `go get`.

They include
  - github.com/bits-and-blooms/bitset
  - github.com/mschoch/smat
  - github.com/glycerine/go-unsnap-stream
  - github.com/philhofer/fwd
  - github.com/jtolds/gls

Note that the smat library requires Go 1.6 or better.

#### Installation

  - go get -t github.com/RoaringBitmap/roaring


### Example

Here is a simplified but complete example:

```go
package main

import (
    "bytes"
    "fmt"

    "github.com/RoaringBitmap/roaring"
)

func main() {
    // example inspired by https://github.com/fzandona/goroar
    fmt.Println("==roaring==")
    rb1 := roaring.BitmapOf(1, 2, 3, 4, 5, 100, 1000)
    fmt.Println(rb1.String())

    rb2 := roaring.BitmapOf(3, 4, 1000)
    fmt.Println(rb2.String())

    rb3 := roaring.New()
    fmt.Println(rb3.String())

    fmt.Println("Cardinality: ", rb1.GetCardinality())

    fmt.Println("Contains 3? ", rb1.Contains(3))

    // in-place intersection: rb1 becomes {3, 4, 1000}
    rb1.And(rb2)

    rb3.Add(1)
    rb3.Add(5)

    // in-place union: rb3 becomes {1, 3, 4, 5, 1000}
    rb3.Or(rb1)

    // computes union of the three bitmaps in parallel using 4 workers
    roaring.ParOr(4, rb1, rb2, rb3)
    // computes intersection of the three bitmaps in parallel using 4 workers
    roaring.ParAnd(4, rb1, rb2, rb3)

    // prints 1, 3, 4, 5, 1000
    i := rb3.Iterator()
    for i.HasNext() {
        fmt.Println(i.Next())
    }
    fmt.Println()

    // next we include an example of serialization
    buf := new(bytes.Buffer)
    rb1.WriteTo(buf) // we omit error handling
    newrb := roaring.New()
    newrb.ReadFrom(buf)
    if rb1.Equals(newrb) {
        fmt.Println("I wrote the content to a byte stream and read it back.")
    }
    // you can iterate over bitmaps using ReverseIterator(), Iterator, ManyIterator()
}
```

If you wish to use serialization and handle errors, you might want to
consider the following sample of code:

```go
    // excerpt from a test inside the roaring package, where t is a *testing.T
    rb := BitmapOf(1, 2, 3, 4, 5, 100, 1000)
    buf := new(bytes.Buffer)
    // the returned byte count is discarded; an unused variable would not compile
    _, err := rb.WriteTo(buf)
    if err != nil {
        t.Errorf("Failed writing")
    }
    newrb := New()
    _, err = newrb.ReadFrom(buf)
    if err != nil {
        t.Errorf("Failed reading")
    }
    if !rb.Equals(newrb) {
        t.Errorf("Cannot retrieve serialized version")
    }
```

Given N integers in [0,x), then the serialized size in bytes of
a Roaring bitmap should never exceed this bound:

`` 8 + 9 * ((long)x+65535)/65536 + 2 * N ``

That is, given a fixed overhead for the universe size (x), Roaring
bitmaps never use more than 2 bytes per integer. You can call
``BoundSerializedSizeInBytes`` for a more precise estimate.

### 64-bit Roaring

By default, roaring is used to store unsigned 32-bit integers. However, we also offer
an extension dedicated to 64-bit integers.
It supports roughly the same functions:

```go
package main

import (
    "bytes"
    "fmt"

    "github.com/RoaringBitmap/roaring/roaring64"
)

func main() {
    // example inspired by https://github.com/fzandona/goroar
    fmt.Println("==roaring64==")
    rb1 := roaring64.BitmapOf(1, 2, 3, 4, 5, 100, 1000)
    fmt.Println(rb1.String())

    rb2 := roaring64.BitmapOf(3, 4, 1000)
    fmt.Println(rb2.String())

    rb3 := roaring64.New()
    fmt.Println(rb3.String())

    fmt.Println("Cardinality: ", rb1.GetCardinality())

    fmt.Println("Contains 3? ", rb1.Contains(3))

    rb1.And(rb2)

    rb3.Add(1)
    rb3.Add(5)

    rb3.Or(rb1)

    // prints 1, 3, 4, 5, 1000
    i := rb3.Iterator()
    for i.HasNext() {
        fmt.Println(i.Next())
    }
    fmt.Println()

    // next we include an example of serialization
    buf := new(bytes.Buffer)
    rb1.WriteTo(buf) // we omit error handling
    newrb := roaring64.New()
    newrb.ReadFrom(buf)
    if rb1.Equals(newrb) {
        fmt.Println("I wrote the content to a byte stream and read it back.")
    }
    // you can iterate over bitmaps using ReverseIterator(), Iterator, ManyIterator()
}
```

Only the 32-bit roaring format is standard and cross-operable between Java, C++, C and Go. There is no guarantee that the 64-bit versions are compatible.

### Documentation

Current documentation is available at http://godoc.org/github.com/RoaringBitmap/roaring and http://godoc.org/github.com/RoaringBitmap/roaring/roaring64

### Goroutine safety

In general, it should not be considered safe to access
the same bitmaps using different goroutines--they are left
unsynchronized for performance. Should you want to access
a Bitmap from more than one goroutine, you should
provide synchronization.
Typically this is done by using channels to pass
the *Bitmap around (in Go style; so there is only ever one owner),
or by using `sync.Mutex` to serialize operations on Bitmaps.

### Coverage

We test our software. For a report on our test coverage, see

https://coveralls.io/github/RoaringBitmap/roaring?branch=master

### Benchmark

Type

         go test -bench Benchmark -run -

To run benchmarks on [Real Roaring Datasets](https://github.com/RoaringBitmap/real-roaring-datasets)
run the following:

```sh
go get github.com/RoaringBitmap/real-roaring-datasets
BENCH_REAL_DATA=1 go test -bench BenchmarkRealData -run -
```

### Iterative use

You can use roaring with gore:

- go get -u github.com/motemen/gore
- Make sure that ``$GOPATH/bin`` is in your ``$PATH``.
- go get github.com/RoaringBitmap/roaring

```go
$ gore
gore version 0.2.6  :help for help
gore> :import github.com/RoaringBitmap/roaring
gore> x:=roaring.New()
gore> x.Add(1)
gore> x.String()
"{1}"
```


### Fuzzy testing

You can help us further test the library with fuzz testing:

         go get github.com/dvyukov/go-fuzz/go-fuzz
         go get github.com/dvyukov/go-fuzz/go-fuzz-build
         go test -tags=gofuzz -run=TestGenerateSmatCorpus
         go-fuzz-build github.com/RoaringBitmap/roaring
         go-fuzz -bin=./roaring-fuzz.zip -workdir=workdir/ -timeout=200 -func FuzzSmat

Let it run, and if the number of crashers is greater than 0, check out the reports in
the workdir where you should be able to find the panic goroutine stack
traces.

You may also replace `-func FuzzSmat` by `-func FuzzSerializationBuffer` or `-func FuzzSerializationStream`.


### Alternative in Go

There is a Go version wrapping the C/C++ implementation: https://github.com/RoaringBitmap/gocroaring

For an alternative implementation in Go, see https://github.com/fzandona/goroar
The two versions were written independently.


### Mailing list/discussion group

https://groups.google.com/forum/#!forum/roaring-bitmaps