## bbloom: a bitset Bloom filter for go/golang
===

This package implements a fast Bloom filter backed by a real bitset, with JSONMarshal/JSONUnmarshal to store and reload the filter.

NOTE: the package uses unsafe.Pointer to set and read the bits of the bitset. If you are uncomfortable with the unsafe package, please consider using my other Bloom filter package at github.com/AndreasBriese/bloom

===

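To illustrate what a "real bitset" means here, the following is a minimal sketch that packs one flag per bit into `uint64` words. The type `bitset` and its methods are hypothetical names chosen for this example; the sketch uses plain slice indexing rather than unsafe.Pointer, so it is not bbloom's actual implementation. It also shows where the memory saving over a `[]bool` comes from: 512 flags fit into 64 bytes, whereas a `[]bool` spends one byte per flag.

```go
package main

import "fmt"

// bitset packs one flag per bit into 64-bit words.
// Hypothetical type for illustration; not bbloom's internal layout.
type bitset []uint64

// newBitset allocates enough words to hold n bits.
func newBitset(n uint64) bitset {
	return make(bitset, (n+63)/64)
}

// set turns bit i on.
func (b bitset) set(i uint64) {
	b[i>>6] |= uint64(1) << (i & 63)
}

// isSet reports whether bit i is on.
func (b bitset) isSet(i uint64) bool {
	return b[i>>6]&(uint64(1)<<(i&63)) != 0
}

func main() {
	bs := newBitset(512)
	bs.set(3)
	bs.set(200)
	fmt.Println(bs.isSet(3), bs.isSet(4), bs.isSet(200)) // true false true
	fmt.Println(len(bs)*8, "bytes for 512 bits")         // 64 bytes for 512 bits
}
```
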
changelog 11/2015: new thread-safe methods AddTS(), HasTS() and AddIfNotHasTS(), following a suggestion from Srdjan Marinovic (github @a-little-srdjan), who used them to build a Bloom filter cache.

This Bloom filter was developed to strengthen a website-log database and was tested and optimized for this log-entry mask: "2014/%02i/%02i %02i:%02i:%02i /info.html".
Nonetheless, bbloom should work with any other form of entries.

~~Hash function is a modified Berkeley DB sdbm hash (to optimize for smaller strings). sdbm  http://www.cse.yorku.ca/~oz/hash.html~~

Found sipHash (SipHash-2-4, a fast short-input PRF created by Jean-Philippe Aumasson and Daniel J. Bernstein) to be about as fast. sipHash was ported to Go by Dmitry Chestnykh (github.com/dchest/siphash).

Minimum hashset size is 512 ([4]uint64; will be set automatically).

### install

```sh
go get github.com/AndreasBriese/bbloom
```

### test
+ change to folder ../bbloom
+ create a wordlist in file "words.txt" (you might use `python permut.py`)
+ run `go test -bench=.` within the folder

```sh
go test -bench=.
```

~~If you've installed the GOCONVEY TDD-framework http://goconvey.co/ you can run the tests automatically.~~

The tests now use Go's testing framework (keep in mind that the op timing relates to 65536 operations of Add, Has and AddIfNotHas respectively).

### usage

After installation, add

```go
import (
	...
	"github.com/AndreasBriese/bbloom"
	...
	)
```

to your imports. In the program use

```go
// create a bloom filter for 65536 items and a 1 % false-positive rate
bf := bbloom.New(float64(1<<16), float64(0.01))

// or
// create a bloom filter with size 650000 for 65536 items and 7 locs per hash explicitly
// bf = bbloom.New(float64(650000), float64(7))
// or
bf = bbloom.New(650000.0, 7.0)

// add one item
bf.Add([]byte("butter"))

// the number of elements added is exposed now
// Note: ElemNum will not be included in the JSON export (for compatibility with older versions)
nOfElementsInFilter := bf.ElemNum

// check if item is in the filter
isIn := bf.Has([]byte("butter"))    // should be true
isNotIn := bf.Has([]byte("Butter")) // should be false

// 'add only if item is new' to the bloom filter
added := bf.AddIfNotHas([]byte("butter")) // should be false because 'butter' is already in the set
added = bf.AddIfNotHas([]byte("buTTer"))  // should be true because 'buTTer' is new

// thread-safe versions for concurrent use: AddTS, HasTS, AddIfNotHasTS
// add one item
bf.AddTS([]byte("peanutbutter"))
// check if item is in the filter
isIn = bf.HasTS([]byte("peanutbutter"))    // should be true
isNotIn = bf.HasTS([]byte("peanutButter")) // should be false
// 'add only if item is new' to the bloom filter
added = bf.AddIfNotHasTS([]byte("butter"))       // should be false because 'butter' is already in the set
added = bf.AddIfNotHasTS([]byte("peanutbuTTer")) // should be true because 'peanutbuTTer' is new

// convert to JSON ([]byte)
Json := bf.JSONMarshal()

// the bloom filter's Mutex is exposed for external un-/locking
// e.g. mutex lock while doing the JSON conversion
bf.Mtx.Lock()
Json = bf.JSONMarshal()
bf.Mtx.Unlock()

// restore a bloom filter from storage
bfNew := bbloom.JSONUnmarshal(Json)

isInNew := bfNew.Has([]byte("butter"))    // should be true
isNotInNew := bfNew.Has([]byte("Butter")) // should be false

```

to work with the Bloom filter.

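The two ways of calling `New` in the usage example (item count plus false-positive rate, or filter size plus number of locs) are related by the standard Bloom filter sizing formulas. The sketch below computes both values from the textbook equations m = -n·ln(p)/(ln 2)² and k = (m/n)·ln 2. The helper name `optimalParams` is hypothetical, and bbloom itself may round or pad the result differently.

```go
package main

import (
	"fmt"
	"math"
)

// optimalParams returns the textbook Bloom filter size m (in bits) and the
// number of hash locations k for n items at false-positive rate p.
// Hypothetical helper for illustration; bbloom may round differently.
func optimalParams(n, p float64) (m, k uint64) {
	bits := -n * math.Log(p) / (math.Ln2 * math.Ln2)
	m = uint64(math.Ceil(bits))
	k = uint64(math.Ceil(math.Ln2 * bits / n))
	return m, k
}

func main() {
	// 65536 items at a 1 % false-positive rate.
	m, k := optimalParams(1<<16, 0.01)
	fmt.Println("bits:", m, "locs:", k)
}
```

For 65536 items at 1 % this yields roughly 628,000 bits and 7 locs, which is in line with the explicit `650000` and `7` used in the usage example.
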
### why 'fast'?

It's about 3 times faster than William Fitzgerald's bitset Bloom filter https://github.com/willf/bloom . And it is about as fast as my []bool set variant for Bloom filters (see https://github.com/AndreasBriese/bloom ) while having an 8 times smaller memory footprint:


	Bloom filter (filter size 524288, 7 hashlocs)
	github.com/AndreasBriese/bbloom 'Add' 65536 items (10 repetitions): 6595800 ns (100 ns/op)
	github.com/AndreasBriese/bbloom 'Has' 65536 items (10 repetitions): 5986600 ns (91 ns/op)
	github.com/AndreasBriese/bloom 'Add' 65536 items (10 repetitions): 6304684 ns (96 ns/op)
	github.com/AndreasBriese/bloom 'Has' 65536 items (10 repetitions): 6568663 ns (100 ns/op)

	github.com/willf/bloom 'Add' 65536 items (10 repetitions): 24367224 ns (371 ns/op)
	github.com/willf/bloom 'Test' 65536 items (10 repetitions): 21881142 ns (333 ns/op)
	github.com/dataence/bloom/standard 'Add' 65536 items (10 repetitions): 23041644 ns (351 ns/op)
	github.com/dataence/bloom/standard 'Check' 65536 items (10 repetitions): 19153133 ns (292 ns/op)
	github.com/cabello/bloom 'Add' 65536 items (10 repetitions): 131921507 ns (2012 ns/op)
	github.com/cabello/bloom 'Contains' 65536 items (10 repetitions): 131108962 ns (2000 ns/op)

(on MBPro15 OSX10.8.5 i7 4Core 2.4GHz)


With 32bit Bloom filters (bloom32) using a modified sdbm hash, bloom32 does its hashing with only 2 bit shifts, one xor and one subtraction per byte. sdbm is about as fast as fnv64a but gives fewer collisions with the dataset (see mask above). bloom.New(float64(10 * 1<<16),float64(7)) populated with 1<<16 random items from the dataset (see above) and tested against the rest results in less than 0.05% collisions.
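
For reference, here is a minimal sketch of the classic, unmodified sdbm hash on 32 bits. It uses two shifts, one addition and one subtraction per byte. The modified variant described above (with an xor) is not shown in this README, so this is only the textbook algorithm, and `sdbm32` is a hypothetical name.

```go
package main

import "fmt"

// sdbm32 is the textbook sdbm string hash on 32 bits:
// hash = c + (hash << 6) + (hash << 16) - hash for each byte c.
// This is the unmodified algorithm, not bloom32's tuned variant.
func sdbm32(data []byte) uint32 {
	var h uint32
	for _, c := range data {
		h = uint32(c) + (h << 6) + (h << 16) - h
	}
	return h
}

func main() {
	// Inputs that differ only in case hash to different values.
	fmt.Println(sdbm32([]byte("butter")) != sdbm32([]byte("Butter"))) // true
}
```
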