# :memo: StateDB

StateDB is an in-memory database for Go. The database is built on top of
[Persistent](https://en.wikipedia.org/wiki/Persistent_data_structure) [Adaptive Radix Trees](https://db.in.tum.de/~leis/papers/ART.pdf).

StateDB is/supports:

* In-memory. Objects and indexes are stored in main memory and not on disk.
  This makes it easy to store and index any Go data type.

* Multi-Version Concurrency Control (MVCC). Both objects and indexes are immutable
  and objects are versioned. A read transaction has access to an immutable snapshot
  of the data.

* Cross-table write transactions. Write transactions lock the requested tables and
  allow modifying objects in multiple tables as a single atomic action. Transactions
  can be aborted to throw away the changes.

* Multiple indexes. A table may have one or more indexers for objects, with each
  indexer returning zero or more keys. Indexes can be unique or non-unique.
  A key in a non-unique index is a concatenation of the primary and secondary keys.

* Watch channels. Changes to the database can be watched at fine granularity via
  Go channels that close when a relevant part of the database changes. This is
  implemented by having a Go channel at each of the radix tree nodes. This enables
  watching an individual object for changes, a key prefix, or the whole table.

## Warning! Immutable data! Read this!

To support lockless readers and transactionality, StateDB relies on both the indexes
and the objects themselves being immutable. Since Go has no way to declare fields
`const`, we cannot stop mutation of public fields in objects. This means that care
must be taken not to mutate objects that have been inserted into StateDB. This
applies both to the fields directly in the object and to everything
referenced from it, e.g.
a map field must not be modified, but must be cloned first!

StateDB has a check in `Insert()` to validate that if an object is a pointer then it
cannot be replaced with the same pointer, but that at least a shallow clone has been
made. This of course doesn't extend to references within the object.

For "very important objects", please consider storing an interface type instead that
contains getter methods and a safe way of mutating the object, e.g. via the builder
pattern or a constructor function.

Also prefer persistent/immutable data structures within the object to avoid expensive
copying on mutation. The `part` package comes with persistent `Map[K, V]` and `Set[T]`.

## Example

Here's a quick example to show what using StateDB looks like.

```go
// Define an object to store in the database.
type MyObject struct {
	ID  uint32
	Foo string
}

// Define how to index and query the object.
var IDIndex = statedb.Index[*MyObject, uint32]{
	Name: "id",
	FromObject: func(obj *MyObject) index.KeySet {
		// The ID is a uint32, so encode it with index.Uint32 to match FromKey.
		return index.NewKeySet(index.Uint32(obj.ID))
	},
	FromKey: func(id uint32) index.Key {
		return index.Uint32(id)
	},
	Unique: true,
}

// Create the database and the table.
func example() {
	db := statedb.New()
	myObjects, err := statedb.NewTable(
		"my-objects",
		IDIndex,
	)
	if err != nil { ... }

	if err := db.RegisterTable(myObjects); err != nil {
		...
	}

	wtxn := db.WriteTxn(myObjects)

	// Insert some objects
	myObjects.Insert(wtxn, &MyObject{1, "a"})
	myObjects.Insert(wtxn, &MyObject{2, "b"})
	myObjects.Insert(wtxn, &MyObject{3, "c"})

	// Modify an object
	if obj, _, found := myObjects.Get(wtxn, IDIndex.Query(1)); found {
		objCopy := *obj
		objCopy.Foo = "d"
		myObjects.Insert(wtxn, &objCopy)
	}

	// Delete an object
	if obj, _, found := myObjects.Get(wtxn, IDIndex.Query(2)); found {
		myObjects.Delete(wtxn, obj)
	}

	if feelingLucky {
		// Commit the changes.
		wtxn.Commit()
	} else {
		// Throw away the changes.
		wtxn.Abort()
	}

	// Query the objects with a snapshot of the database.
	txn := db.ReadTxn()

	if obj, _, found := myObjects.Get(txn, IDIndex.Query(1)); found {
		...
	}

	// Iterate over all objects
	for obj := range myObjects.All(txn) {
		...
	}

	// Iterate with revision
	for obj, revision := range myObjects.All(txn) {
		...
	}

	// Iterate all objects and then wait until something changes.
	objs, watch := myObjects.AllWatch(txn)
	for obj := range objs { ... }
	<-watch

	// Grab a new snapshot to read the new changes.
	txn = db.ReadTxn()

	// Iterate objects with ID >= 2
	objs, watch = myObjects.LowerBoundWatch(txn, IDIndex.Query(2))
	for obj := range objs { ... }

	// Iterate objects where ID is between 0x1000_0000 and 0x1fff_ffff
	objs, watch = myObjects.PrefixWatch(txn, IDIndex.Query(0x1000_0000))
	for obj := range objs { ... }
}
```

Read on for a more detailed guide or check out the [Go package docs](https://pkg.go.dev/github.com/cilium/statedb).

## Guide to StateDB

StateDB can be used directly as a normal library, or as a [Hive](https://github.com/cilium/hive) Cell.
For example usage as part of Hive, see `reconciler/example`. Here we show a standalone example.
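
The guide clones objects before modifying them, and the warning above notes that a shallow clone does not protect data referenced from the object. The following is a minimal sketch of that pitfall in plain Go, independent of StateDB. The `Config` type and its `ShallowClone` and `DeepClone` methods are hypothetical names made up for this illustration.

```go
package main

import (
	"fmt"
	"maps"
)

// Config is a hypothetical object with a reference-typed field.
type Config struct {
	ID     uint64
	Labels map[string]string
}

// ShallowClone copies the struct, but the copy still shares the
// Labels map with the original.
func (c *Config) ShallowClone() *Config {
	c2 := *c
	return &c2
}

// DeepClone copies the struct and also the map it references.
func (c *Config) DeepClone() *Config {
	c2 := *c
	c2.Labels = maps.Clone(c.Labels)
	return &c2
}

func main() {
	stored := &Config{ID: 1, Labels: map[string]string{"env": "prod"}}

	// Writing through the shallow clone's map also changes 'stored'.
	shallow := stored.ShallowClone()
	shallow.Labels["env"] = "dev"
	fmt.Println(stored.Labels["env"]) // prints "dev"

	// Reset, then repeat with a deep clone: 'stored' is left untouched.
	stored.Labels["env"] = "prod"
	deep := stored.DeepClone()
	deep.Labels["env"] = "dev"
	fmt.Println(stored.Labels["env"]) // prints "prod"
}
```

Copying a large map on every change is expensive, which is why the guide uses the persistent `part.Set` for the `Tags` field instead of a built-in map.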

We start by defining the data type we want to store in the database. There are
no constraints on the type and it may be a primitive type like an `int` or a
struct type, or a pointer. Since each index stores a copy of the object, one should
use a pointer if the object is large.

```go
import (
	"github.com/cilium/statedb"
	"github.com/cilium/statedb/index"
	"github.com/cilium/statedb/part"
)

type ID = uint64
type Tag = string
type MyObject struct {
	ID   ID            // Identifier
	Tags part.Set[Tag] // Set of tags
}
```

### Indexes

With the object defined, we can describe how it should be indexed. Indexes are
constant values and can be defined as global variables alongside the object type.
Indexes take two type parameters, your object type and the key type: `Index[MyObject, ID]`.
Additionally you define two operations: `FromObject` that takes your object and returns
a set of StateDB keys (zero or many), and `FromKey` that takes the key type of your choosing and
converts it to a StateDB key.

```go
// IDIndex is the primary index for MyObject indexing the 'ID' field.
var IDIndex = statedb.Index[*MyObject, ID]{
	Name: "id",

	FromObject: func(obj *MyObject) index.KeySet {
		return index.NewKeySet(index.Uint64(obj.ID))
	},

	FromKey: func(id ID) index.Key {
		return index.Uint64(id)
	},
	// Above is equal to just:
	// FromKey: index.Uint64,

	Unique: true, // IDs are unique.
}
```

The `index.Key` seen above is just a `[]byte`. The `index` package contains many functions
for converting into the `index.Key` type, for example `index.Uint64` and so on.

A single object can also map to multiple keys (multi-index). Let's construct an index
for tags.

```go
var TagsIndex = statedb.Index[*MyObject, Tag]{
	Name: "tags",

	FromObject: func(o *MyObject) index.KeySet {
		// index.Set turns the part.Set[string] into a set of keys
		// (set of byte slices)
		return index.Set(o.Tags)
	},

	FromKey: index.String,

	// Many objects may have the same tag, so we mark this as
	// non-unique.
	Unique: false,
}
```

With the indexes now defined, we can construct a table.

### Setting up a table

```go
func NewMyObjectTable() (statedb.RWTable[*MyObject], error) {
	return statedb.NewTable[*MyObject](
		"my-objects",

		IDIndex,   // IDIndex is the primary index
		TagsIndex, // TagsIndex is a secondary index
		// ... more secondary indexes can be passed in here
	)
}
```

The `NewTable` function takes the name of the table, a primary index and zero or
more secondary indexes. The table name must match the regular expression
"^[a-z][a-z0-9_\\-]{0,30}$".

`NewTable` returns a `RWTable`, which is an interface for both reading and
writing to a table. An `RWTable` is a superset of `Table`, an interface
that contains methods just for reading. This provides a simple form of
type-level access control to the table. `NewTable` may return an error if
the name or indexers are malformed, for example if `IDIndex` is not unique
(the primary index has to be), or if the indexers have overlapping names.

### Inserting

With the table defined, we can now create the database and start writing to and
reading from the table.

```go
db := statedb.New()

myObjects, err := NewMyObjectTable()
if err != nil { return err }

// Register the table with the database.
// 'err' was already declared above, so assign with '=' rather than ':='.
err = db.RegisterTable(myObjects)
if err != nil {
	// May fail if the table with the same name is already registered.
	return err
}
```

To insert objects into a table, we'll need to create a `WriteTxn`. This locks
the target table(s), allowing the changes to be made as a single atomic transaction.

```go
// Create a write transaction against the 'myObjects' table, locking
// it for writing.
// Note that the returned 'wtxn' holds internal state and it is not
// safe to use concurrently (e.g. you must not have multiple goroutines
// using the same WriteTxn in parallel).
wtxn := db.WriteTxn(myObjects)

// We can defer an Abort() of the transaction in case we encounter
// issues and want to forget our writes. This is a good practice
// to safeguard against a forgotten call to Commit(). Worry not though,
// StateDB has a finalizer on WriteTxn to catch forgotten Abort/Commit.
defer wtxn.Abort()

// Insert an object into the table. This will be visible to readers
// only when we commit.
obj := &MyObject{ID: 42, Tags: part.NewStringSet("hello")}
oldObj, hadOld, err := myObjects.Insert(wtxn, obj)
if err != nil {
	// Insert can fail only if 'wtxn' is not locking the table we're
	// writing to, or if 'wtxn' was already committed.
	return err
}
// hadOld is true and oldObj points to an old version of the object
// if it was replaced. Since the object type can be a non-pointer
// we need the separate 'hadOld' boolean and cannot just check for nil.

// Commit the changes to the database and notify readers by closing the
// relevant watch channels.
wtxn.Commit()
```


### Reading

Now that there's something in the table we can try out reading. We can
read either using a read-only `ReadTxn`, or we can read using a `WriteTxn`.
With a `ReadTxn` we'll be reading from a snapshot and nothing we do
will affect other readers or writers (unless you mutate the immutable object,
in which case bad things happen).

```go
txn := db.ReadTxn()
```

The `txn` is now a frozen snapshot of the database that we can use
to read the data.

```go
// Let's break out the types so you know what is going on.
var (
	obj      *MyObject
	revision statedb.Revision
	found    bool
	watch    <-chan struct{}
)
// Get returns the first matching object in the query.
obj, revision, found = myObjects.Get(txn, IDIndex.Query(42))
if found {
	// obj points to the object we inserted earlier.
	// revision is the "table revision" for the object. Revisions are
	// incremented for a table on every insertion or deletion.
}
// GetWatch is the same as Get, but also gives us a watch
// channel that we can use to wait on the object to appear or to
// change.
obj, revision, watch, found = myObjects.GetWatch(txn, IDIndex.Query(42))
<-watch // closes when object with ID '42' is inserted or deleted
```

### Iterating

`List` can be used to iterate over all objects that match the query.

```go
// List returns all matching objects as an iter.Seq2[Obj, Revision].
objs := myObjects.List(txn, TagsIndex.Query("hello"))
for obj, revision := range objs {
	...
}

// ListWatch is like List, but also returns a watch channel.
objs, watch = myObjects.ListWatch(txn, TagsIndex.Query("hello"))
for obj, revision := range objs { ... }

// closes when an object with tag "hello" is inserted or deleted.
<-watch
```

`Prefix` or `PrefixWatch` can be used to iterate over objects that match a given prefix.

```go
// PrefixWatch does a prefix search on an index and also returns a watch
// channel. Here it returns an iterator for all objects that have a tag
// that starts with "h".
objs, watch = myObjects.PrefixWatch(txn, TagsIndex.Query("h"))
for obj := range objs {
	...
}

// closes when an object with a tag starting with "h" is inserted or deleted
<-watch
```

`LowerBound` or `LowerBoundWatch` can be used to iterate over objects that
have a key equal to or higher than the given key.

```go
// LowerBoundWatch can be used to find all objects with a key equal to or higher
// than the specified key. The semantics depend on how the indexer works.
// For example index.Uint32 returns the big-endian (most significant byte
// first) form of the integer. In other words, the number 3 is the key
// []byte{0, 0, 0, 3}, which allows doing a meaningful LowerBound search on it.
objs, watch = myObjects.LowerBoundWatch(txn, IDIndex.Query(3))
for obj, revision := range objs {
	// obj.ID >= 3
}

// closes when anything happens to the table. This is because there isn't a
// clear notion of what part of the index to watch for, e.g. if the index
// stores 0x01, 0x11, 0x20, and we do LowerBound(0x10), then none of these nodes
// in the tree are what we should watch for since "0x01" is in the wrong subtree
// and we may insert "0x10" above "0x11", so cannot watch that either. LowerBound
// could return the watch channel of the node that shares a prefix with the search
// term, but instead StateDB currently does the conservative thing and returns the
// watch channel of the "root node".
<-watch
```

All objects stored in StateDB have an associated revision. The revision is unique
to the table and increments on every insert or delete. Revisions can be queried
with `ByRevision`.

```go
// StateDB also has a built-in index for revisions and that can be used to
// iterate over the objects in the order they have been changed. Furthermore
// we can use this to wait for new changes!
lastRevision := statedb.Revision(0)
for {
	objs, watch = myObjects.LowerBoundWatch(txn, statedb.ByRevision(lastRevision+1))
	for obj, revision := range objs {
		// do something with obj
		lastRevision = revision
	}

	// Wait until there are new changes. In real code we probably want to
	// do a 'select' here and check for 'ctx.Done()' etc.
	<-watch

	// We should rate limit so we can see a batch of changes in one go.
	// For the sake of example we just sleep here, but you likely want to use the
	// 'rate' package.
	time.Sleep(100*time.Millisecond)

	// Take a new snapshot so we can see the changes.
	txn = db.ReadTxn()
}
```

As it's really useful to know when an object has been deleted, StateDB has
a facility for storing deleted objects in a separate index until they have
been observed. Using `Changes` one can iterate over insertions and deletions.

```go
// Let's iterate over both inserts and deletes. We need to use
// a write transaction to create the change iterator as this needs to
// register with the table to track the deleted objects.

wtxn := db.WriteTxn(myObjects)
changeIter, err := myObjects.Changes(wtxn)
wtxn.Commit()
if err != nil {
	// This can fail for the same reasons as e.g. Insert():
	// the transaction is not locking the target table or it has
	// already been committed.
	return err
}

// Now very similar to the LowerBound revision iteration above, we will
// iterate over the changes.
for {
	changes, watch := changeIter.Next(db.ReadTxn())
	for change := range changes {
		if change.Deleted {
			fmt.Printf("Object %#v was deleted!\n", change.Object)
		} else {
			fmt.Printf("Object %#v was inserted!\n", change.Object)
		}
	}
	// Wait for more changes.
	select {
	case <-ctx.Done():
		return
	case <-watch:
	}
}
```

### Modifying objects

Modifying objects is basically just a query and an insert to override the object.
One must however take care to not modify the object returned by the query.

```go
// Let's make a write transaction to modify the table.
wtxn := db.WriteTxn(myObjects)

// Now that we hold the write lock on the table, we can retrieve an object and
// no one else will be able to modify it until we commit.
obj, revision, found := myObjects.Get(wtxn, IDIndex.Query(42))
if !found { panic("it should be there, I swear!") }

// We cannot just straight up modify 'obj' since someone might be reading it.
// It's supposed to be immutable after all. To make this easier, let's define
// a Clone() method (in real code this is declared at package level).
func (obj *MyObject) Clone() *MyObject {
	obj2 := *obj
	return &obj2
}

// Now we can do a "shallow clone" of the object and we can modify the fields
// without the readers getting upset. Of course we still cannot modify anything
// referenced by those fields without cloning the fields themselves. But that's
// why we're using persistent data structures like 'part.Set' and 'part.Map'.
//
// Let's add a new tag. But first we clone.
obj = obj.Clone()
obj.Tags = obj.Tags.Set("foo")

// Now we have a new object that has "foo" set. We can now write it to the table.
oldObj, hadOld, err := myObjects.Insert(wtxn, obj)
// err should be nil, since we're using the WriteTxn correctly
// oldObj is the original object, without the "foo" tag
// hadOld is true since we replaced the object

// Commit the transaction so everyone sees it.
wtxn.Commit()

// We can also do a "compare-and-swap" to insert an object. This is useful when
// computing the change we want to make is expensive. Here's how you do it.

// Start with a ReadTxn that is cheap and doesn't block anyone.
txn := db.ReadTxn()

// Look up the object we want to update and perform some slow calculation
// to produce the desired new object.
obj, revision, found = myObjects.Get(txn, IDIndex.Query(42))
obj = veryExpensiveCalculation(obj)

// Now that we're ready to insert we can grab a WriteTxn.
wtxn = db.WriteTxn(myObjects)

// Let's try and update the object with the revision of the object we used
// for that expensive calculation.
oldObj, hadOld, err = myObjects.CompareAndSwap(wtxn, obj, revision)
if errors.Is(err, statedb.ErrRevisionNotEqual) {
	// Oh no, someone updated the object while we were calculating.
	// I guess I need to calculate again...
	wtxn.Abort()
	return err
}
wtxn.Commit()
```

### Utilities

StateDB includes a few utility functions for operating on the iterators returned
by the query methods.

```go
// objs is an iterator for (object, revision) pairs. It can be consumed
// multiple times.
var objs iter.Seq2[*MyObject, statedb.Revision]
objs = myObjects.All(db.ReadTxn())

// The values can be collected into a slice.
var objsSlice []*MyObject
objsSlice = statedb.Collect(objs)

// The sequence can be mapped to another value.
var ids iter.Seq2[ID, statedb.Revision]
ids = statedb.Map(objs, func(o *MyObject) ID { return o.ID })

// The sequence can be filtered.
ids = statedb.Filter(ids, func(id ID) bool { return id > 0 })

// The revisions can be dropped by converting iter.Seq2 into iter.Seq
var onlyIds iter.Seq[ID]
onlyIds = statedb.ToSeq(ids)
```

### Performance considerations

Needless to say, one should keep the duration of the write transactions
as short as possible so that other writers are not starved (readers
are not affected as they're reading from a snapshot).
Writing in batches, or first using a `ReadTxn` to compute the changes and then
committing with `CompareAndSwap`, is a good way to accomplish this, as shown above
(optimistic concurrency control).

One should also avoid keeping the `ReadTxn` around when for example waiting
on a watch channel to close. The `ReadTxn` holds a pointer to the database
root and thus holding it will prevent old objects from being garbage collected
by the Go runtime. Consider grabbing the `ReadTxn` in a function and returning
the watch channel to the function doing the for-select loop.

It is fine to hold onto the `iter.Seq2[Obj, Revision]` returned by the queries
and since it may be iterated over multiple times, it may be preferable to pass
and hold the `iter.Seq2` instead of collecting the objects into a slice.

## Persistent Map and Set

The `part` package contains persistent `Map[K, V]` and `Set[T]` data structures.
These, like StateDB, are implemented with Persistent Adaptive Radix Trees.
They are meant to be used as replacements for the built-in mutable Go hashmap
in StateDB objects as they're persistent (operations return a copy) and thus
more efficient to copy and suitable to use in immutable objects.

Here's how to use `Map[K, V]`:

```go
import (
	"github.com/cilium/statedb/part"
)

// Create a new map with strings as keys
m := part.NewStringMap[int]()

// Set the key "one" to value 1. Returns a new map.
mNew := m.Set("one", 1)
v, ok := m.Get("one")
// ok == false since we didn't modify the original map.

v, ok = mNew.Get("one")
// v == 1, ok == true

// Let's reuse 'm' as our variable.
m = mNew
m = m.Set("two", 2)

// All key-value pairs can be iterated over.
kvs := m.All()
// Maps can be prefix and lowerbound searched, just like StateDB tables
kvs = m.Prefix("a")     // Iterator for anything starting with 'a'
kvs = m.LowerBound("b") // Iterator for anything equal to 'b' or larger, e.g. 'bb' or 'c'...

for k, v := range kvs {
	// ...
}

// m.Len() == 2
m = m.Delete("two")
// m.Len() == 1

// We can use arbitrary types as keys and values... provided
// we teach it how to create a byte slice key out of it.
type Obj struct {
	ID string
}
m2 := part.NewMap[*Obj, *Obj](
	func(o *Obj) []byte { return []byte(o.ID) },
	// The reverse conversion must return the key type (*Obj).
	func(b []byte) *Obj { return &Obj{ID: string(b)} },
)
o := &Obj{ID: "foo"}
m2 = m2.Set(o, o) // Set returns a new map, so keep the result.
```

And here's `Set[T]`:

```go
// 's' is now the empty string set
s := part.StringSet
s = s.Set("hello")
// s.Has("hello") == true
s2 := s.Delete("hello")
// s.Has("hello") == true
// s2.Has("hello") == false

// we can initialize a set with NewStringSet
s3 := part.NewStringSet("world", "foo")

// Sets can be combined.
s3 = s3.Union(s)
// s3 now contains "hello", "foo", "world"
// s3.Len() == 3

// Prints all three words
for word := range s3.All() {
	fmt.Println(word)
}

// We can remove a set from another set
s4 := s3.Difference(part.NewStringSet("foo"))
// s4.Has("foo") == false

// As with Map[K, V] we can define Set[T] for our own objects
type Obj struct {
	ID string
}
s5 := part.NewSet[*Obj](
	func(o *Obj) []byte { return []byte(o.ID) },
)
s5 = s5.Set(&Obj{"quux"}) // Set returns a new set, so keep the result.
// s5.Has(&Obj{"quux"}) == true
```

## Reconciler

This repository comes with a generic reconciliation utility that watches a table
for changes and performs a configurable Update or Delete operation on the change.
The status of the operation is written back into the object, which allows inspecting
or waiting for an object to be reconciled. On failures the reconciler will retry
the operation at a later time. Reconciler supports health reporting and metrics.

See the example application in `reconciler/example` for more information.
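
To make the idea concrete, here is a minimal sketch of the reconcile-and-retry pattern described above, written in plain Go. It does not use the reconciler package or its API: `Item`, `StatusKind`, `ReconcileOnce` and `FailFirst` are hypothetical names invented for this illustration, and a slice stands in for the table.

```go
package main

import (
	"errors"
	"fmt"
)

// StatusKind is a hypothetical reconciliation status for this sketch.
type StatusKind string

const (
	Pending StatusKind = "Pending"
	Done    StatusKind = "Done"
	Error   StatusKind = "Error"
)

// Item is a hypothetical desired-state object that carries its own status.
type Item struct {
	Name   string
	Status StatusKind
}

// ReconcileOnce runs update on every item that is not yet Done and returns
// new copies with the resulting status written back. Failed items are marked
// Error so that a later round picks them up again.
func ReconcileOnce(items []Item, update func(Item) error) (out []Item, failed int) {
	out = make([]Item, 0, len(items))
	for _, it := range items { // 'it' is a copy, the input is not mutated
		if it.Status != Done {
			if err := update(it); err != nil {
				it.Status = Error
				failed++
			} else {
				it.Status = Done
			}
		}
		out = append(out, it)
	}
	return out, failed
}

// FailFirst returns an update operation that fails the first time it is
// called for the named item and succeeds afterwards.
func FailFirst(name string) func(Item) error {
	failedOnce := false
	return func(it Item) error {
		if it.Name == name && !failedOnce {
			failedOnce = true
			return errors.New("transient failure")
		}
		return nil
	}
}

func main() {
	items := []Item{{Name: "a", Status: Pending}, {Name: "b", Status: Pending}}
	update := FailFirst("b")
	for round := 1; ; round++ {
		var failed int
		items, failed = ReconcileOnce(items, update)
		fmt.Printf("round %d: %v (failed: %d)\n", round, items, failed)
		if failed == 0 {
			break
		}
	}
}
```

In this sketch item "b" fails in the first round and is retried in the second. A real reconciler would wait between rounds and react to watch channels rather than loop immediately.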