## bbloom: a bitset Bloom filter for go/golang
===

This package implements a fast Bloom filter with a real 'bitset' and JSONMarshal/JSONUnmarshal to store/reload the Bloom filter.

NOTE: the package uses unsafe.Pointer to set and read the bits from the bitset. If you're uncomfortable with using the unsafe package, please consider using my bloom filter package at github.com/AndreasBriese/bloom

===

changelog 11/2015: new thread safe methods AddTS(), HasTS(), AddIfNotHasTS() following a suggestion from Srdjan Marinovic (github @a-little-srdjan), who used this to code a bloomfilter cache.

This bloom filter was developed to strengthen a website-log database and was tested and optimized for this log-entry mask: "2014/%02i/%02i %02i:%02i:%02i /info.html".
Nonetheless bbloom should work with any other form of entries.

~~Hash function is a modified Berkeley DB sdbm hash (to optimize for smaller strings). sdbm http://www.cse.yorku.ca/~oz/hash.html~~

Found sipHash (SipHash-2-4, a fast short-input PRF created by Jean-Philippe Aumasson and Daniel J. Bernstein) to be about as fast. sipHash had been ported to Go by Dmitry Chestnykh (github.com/dchest/siphash).

Minimum hashset size is: 512 ([4]uint64; will be set automatically).

### install

```sh
go get github.com/AndreasBriese/bbloom
```

### test
+ change to folder ../bbloom
+ create wordlist in file "words.txt" (you might use `python permut.py`)
+ run 'go test -bench=.' within the folder

```sh
go test -bench=.
```

~~If you've installed the GOCONVEY TDD-framework http://goconvey.co/ you can run the tests automatically.~~

Using Go's testing framework now (keep in mind that the op timing relates to 65536 operations of Add, Has, and AddIfNotHas respectively).

### usage

After installation add

```go
import (
	...
	"github.com/AndreasBriese/bbloom"
	...
)
```

to your imports. In the program use

```go
// create a bloom filter for 65536 items and a 1% false-positive rate
bf := bbloom.New(float64(1<<16), float64(0.01))

// or
// create a bloom filter with a size of 650000 for 65536 items and 7 locs per hash explicitly
// bf = bbloom.New(float64(650000), float64(7))
// or
bf = bbloom.New(650000.0, 7.0)

// add one item
bf.Add([]byte("butter"))

// Number of elements added is exposed now
// Note: ElemNum will not be included in JSON export (for compatibility with older versions)
nOfElementsInFilter := bf.ElemNum

// check if item is in the filter
isIn := bf.Has([]byte("butter"))    // should be true
isNotIn := bf.Has([]byte("Butter")) // should be false

// 'add only if item is new' to the bloomfilter
added := bf.AddIfNotHas([]byte("butter")) // should be false because 'butter' is already in the set
added = bf.AddIfNotHas([]byte("buTTer"))  // should be true because 'buTTer' is new

// thread safe versions for concurrent use: AddTS, HasTS, AddIfNotHasTS
// add one item
bf.AddTS([]byte("peanutbutter"))
// check if item is in the filter
isIn = bf.HasTS([]byte("peanutbutter"))    // should be true
isNotIn = bf.HasTS([]byte("peanutButter")) // should be false
// 'add only if item is new' to the bloomfilter
added = bf.AddIfNotHasTS([]byte("peanutbutter")) // should be false because 'peanutbutter' is already in the set
added = bf.AddIfNotHasTS([]byte("peanutbuTTer")) // should be true because 'peanutbuTTer' is new

// convert to JSON ([]byte)
Json := bf.JSONMarshal()

// the bloom filter's Mutex is exposed for external un-/locking
// e.g. mutex lock while doing JSON conversion
bf.Mtx.Lock()
Json = bf.JSONMarshal()
bf.Mtx.Unlock()

// restore a bloom filter from storage
bfNew := bbloom.JSONUnmarshal(Json)

isInNew := bfNew.Has([]byte("butter"))    // should be true
isNotInNew := bfNew.Has([]byte("Butter")) // should be false

```

to work with the bloom filter.

### why 'fast'?

It's about 3 times faster than William Fitzgerald's bitset bloom filter https://github.com/willf/bloom . And it is about as fast as my []bool set variant for Bloom filters (see https://github.com/AndreasBriese/bloom ) but has an 8 times smaller memory footprint:


    Bloom filter (filter size 524288, 7 hashlocs)
    github.com/AndreasBriese/bbloom 'Add' 65536 items (10 repetitions): 6595800 ns (100 ns/op)
    github.com/AndreasBriese/bbloom 'Has' 65536 items (10 repetitions): 5986600 ns (91 ns/op)
    github.com/AndreasBriese/bloom 'Add' 65536 items (10 repetitions): 6304684 ns (96 ns/op)
    github.com/AndreasBriese/bloom 'Has' 65536 items (10 repetitions): 6568663 ns (100 ns/op)

    github.com/willf/bloom 'Add' 65536 items (10 repetitions): 24367224 ns (371 ns/op)
    github.com/willf/bloom 'Test' 65536 items (10 repetitions): 21881142 ns (333 ns/op)
    github.com/dataence/bloom/standard 'Add' 65536 items (10 repetitions): 23041644 ns (351 ns/op)
    github.com/dataence/bloom/standard 'Check' 65536 items (10 repetitions): 19153133 ns (292 ns/op)
    github.com/cabello/bloom 'Add' 65536 items (10 repetitions): 131921507 ns (2012 ns/op)
    github.com/cabello/bloom 'Contains' 65536 items (10 repetitions): 131108962 ns (2000 ns/op)

(on MBPro15 OSX10.8.5 i7 4Core 2.4Ghz)


With 32bit bloom filters (bloom32) using modified sdbm, bloom32 does hashing with only 2 bit shifts, one xor and one subtraction per byte. sdbm is about as fast as fnv64a but gives fewer collisions with the dataset (see mask above). `bloom.New(float64(10 * 1<<16),float64(7))` populated with 1<<16 random items from the dataset (see above) and tested against the rest results in less than 0.05% collisions.
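
To give a feel for the per-byte cost mentioned above, here is a minimal sketch of the classic sdbm hash in Go. It assumes the textbook recurrence from the sdbm page linked earlier (`h = c + (h << 6) + (h << 16) - h`). It is not the modified variant used by bloom32, whose exact form this README does not spell out, and the function name `sdbm` is chosen here purely for illustration.

```go
package main

import "fmt"

// sdbm is the classic (unmodified) sdbm hash, shown for illustration only.
// It is NOT the modified variant this README refers to.
func sdbm(data []byte) uint32 {
	var h uint32
	for _, c := range data {
		// Two shifts, two additions and one subtraction per byte.
		// Equivalent to h*65599 + c, wrapping around on uint32 overflow.
		h = uint32(c) + (h << 6) + (h << 16) - h
	}
	return h
}

func main() {
	// Hash two entries that differ only in case, as in the usage example.
	for _, s := range []string{"butter", "Butter"} {
		fmt.Printf("%q -> %d\n", s, sdbm([]byte(s)))
	}
}
```

The loop touches each input byte exactly once with a handful of cheap integer operations, which is why hashes of this family suit short log-entry strings like the mask above.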