# Data store design

SwarmKit has an embedded data store for configuration and state. This store is
usually backed by the Raft protocol, but is abstracted from the underlying
consensus protocol, and in principle could use other means to synchronize data
across the cluster. This document focuses on the design of the store itself, in
particular its programmer-facing APIs and consistency guarantees, and does not
cover distributed consensus.

## Structure of stored data

The SwarmKit data store is built on top of go-memdb, which stores data in radix
trees.

There are separate tables for each data type, for example nodes, tasks, and so
on. Each table has its own set of indices, which always includes an ID index,
but may include other indices as well. For example, tasks can be indexed by
their service ID and node ID, among several other things.

Under the hood, go-memdb implements an index by adding keys for that index to
the radix tree, prefixed with the index's name. A single object in the data
store may have several keys corresponding to it, because it will have a
different key (and possibly multiple keys) within each index.

There are several advantages to using radix trees in this way. The first is that
it makes prefix matching easy. A second powerful feature of this design is
copy-on-write snapshotting. Since the radix tree consists of a hierarchy of
pointers, the root pointer always refers to a fully consistent view of the state
at that moment in time. Making a change to the tree involves replacing a leaf
node with a new value, and "bubbling up" that change to the root through the
intermediate pointers. To make the change visible to other readers, all it takes
is a single atomic pointer swap that replaces the root of the tree with a new
root that incorporates the changed nodes. The text below will discuss how this
is used to support transactions.
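To make the table-and-index layout more concrete, here is a minimal,
self-contained sketch that uses go-memdb directly. The `Task` struct and the
index names are simplified stand-ins for illustration, not SwarmKit's actual
schema:

```
package main

import (
	"fmt"

	"github.com/hashicorp/go-memdb"
)

// Task is a simplified stand-in for SwarmKit's api.Task.
type Task struct {
	ID        string
	ServiceID string
	NodeID    string
}

func main() {
	// One table per type. Each table declares an "id" index plus secondary
	// indices; go-memdb stores each index as an extra set of prefixed keys
	// in the radix tree.
	schema := &memdb.DBSchema{
		Tables: map[string]*memdb.TableSchema{
			"tasks": {
				Name: "tasks",
				Indexes: map[string]*memdb.IndexSchema{
					"id": {
						Name:    "id",
						Unique:  true,
						Indexer: &memdb.StringFieldIndex{Field: "ID"},
					},
					"serviceid": {
						Name:    "serviceid",
						Indexer: &memdb.StringFieldIndex{Field: "ServiceID"},
					},
					"nodeid": {
						Name:    "nodeid",
						Indexer: &memdb.StringFieldIndex{Field: "NodeID"},
					},
				},
			},
		},
	}

	db, err := memdb.NewMemDB(schema)
	if err != nil {
		panic(err)
	}

	// Insert one object; it becomes reachable through all three indices.
	txn := db.Txn(true)
	if err := txn.Insert("tasks", &Task{ID: "t1", ServiceID: "svc1", NodeID: "n1"}); err != nil {
		panic(err)
	}
	txn.Commit()

	// Secondary-index lookup: iterate all tasks assigned to node "n1".
	read := db.Txn(false)
	defer read.Abort()
	it, err := read.Get("tasks", "nodeid", "n1")
	if err != nil {
		panic(err)
	}
	for obj := it.Next(); obj != nil; obj = it.Next() {
		fmt.Println(obj.(*Task).ID)
	}
}
```

go-memdb also supports prefix lookups on these same indices, which is the
prefix-matching advantage mentioned above.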
## Transactions

Code that uses the store can only use it inside a *transaction*. There are two
kinds of transactions: view transactions (read-only) and update transactions
(read/write).

A view transaction runs in a callback passed to the `View` method:

```
s.View(func(tx store.ReadTx) {
	nodes, err = store.FindNodes(tx, store.All)
})
```

This callback can call functions defined in the `store` package that retrieve
and list the various types of objects. `View` operates on an atomic snapshot of
the data store, so changes made while the callback is running won't be visible
to code inside the callback using the supplied `ReadTx`.

An update transaction works similarly, but provides the ability to create,
update, and delete objects:

```
s.Update(func(tx store.Tx) error {
	t2 := &api.Task{
		ID: "testTaskID2",
		Status: api.TaskStatus{
			State: api.TaskStateNew,
		},
		ServiceID:    "testServiceID2",
		DesiredState: api.TaskStateRunning,
	}
	return store.CreateTask(tx, t2)
})
```

If the callback returns `nil`, the changes made inside the callback function are
committed atomically. If it returns any other error value, the transaction gets
rolled back. The changes are never visible to any other readers before the
commit happens, but they are visible to code inside the callback using the
`Tx` argument.

There is an exclusive lock for updates, so only one update transaction can run
at a time. Take care not to do expensive or blocking operations inside an
`Update` callback.

## Batching

Sometimes it's necessary to create or update many objects in the store, but we
want to do this without holding the update lock for an arbitrarily long period
of time, or generating a huge set of changes from the transaction that would
need to be serialized in a Raft write. For this situation, the store provides
primitives to batch iterated operations that don't require atomicity into
transactions of an appropriate size.

Here is an example of a batch operation:

```
err = d.store.Batch(func(batch *store.Batch) error {
	for _, n := range nodes {
		err := batch.Update(func(tx store.Tx) error {
			// check if node is still here
			node := store.GetNode(tx, n.ID)
			if node == nil {
				return nil
			}

			// [...]

			node.Status.State = api.NodeStatus_UNKNOWN
			node.Status.Message = `Node moved to "unknown" state due to leadership change in cluster`

			if err := d.nodes.AddUnknown(node, expireFunc); err != nil {
				return errors.Wrap(err, `adding node in "unknown" state to node store failed`)
			}
			if err := store.UpdateNode(tx, node); err != nil {
				return errors.Wrap(err, "update failed")
			}
			return nil
		})
		if err != nil {
			log.WithField("node", n.ID).WithError(err).Error(`failed to move node to "unknown" state`)
		}
	}
	return nil
})
```

This is a slightly abbreviated version of code in the dispatcher that moves a
set of nodes to the "unknown" state. If there were many nodes in the system,
doing this inside a single `Update` transaction might block updates to the store
for a long time, or exceed the size limit of a serialized transaction. By using
`Batch`, the changes are automatically broken up into a set of transactions.

`Batch` takes a callback which generally contains a loop that iterates over a
set of objects. Every iteration can call `batch.Update` with another nested
callback that performs the actual changes. Changes performed inside a single
`batch.Update` call are guaranteed to land in the same transaction, and
therefore to be applied atomically. However, changes made in different calls to
`batch.Update` may end up in different transactions.

## Watches

The data store provides a real-time feed of insertions, deletions, and
modifications. Any number of listeners can subscribe to this feed, optionally
applying filters to the set of events. This is very useful for building control
loops. For example, the orchestrators watch changes to services to trigger
reconciliation.

To start a watch, use the `state.Watch` function. The first argument is the
watch queue, which can be obtained with the store instance's `WatchQueue`
method. Extra arguments are event specifiers that incoming events are matched
against when filtering. For example, this call returns only task creations,
updates, and deletions that involve a specific node ID:

```
nodeTasks, err := store.Watch(s.WatchQueue(),
	api.EventCreateTask{Task: &api.Task{NodeID: nodeID},
		Checks: []api.TaskCheckFunc{api.TaskCheckNodeID}},
	api.EventUpdateTask{Task: &api.Task{NodeID: nodeID},
		Checks: []api.TaskCheckFunc{api.TaskCheckNodeID}},
	api.EventDeleteTask{Task: &api.Task{NodeID: nodeID},
		Checks: []api.TaskCheckFunc{api.TaskCheckNodeID}},
)
```

There is also a `ViewAndWatch` method on the store that provides access to a
snapshot of the store taken just before the watch starts receiving events. It
guarantees that events following this snapshot won't be missed, and that events
already incorporated in the snapshot won't be received. `ViewAndWatch` involves
holding the store update lock while its callback runs, so it's preferable to use
`View` and `Watch` separately if the use case isn't sensitive to redundant
events. In that case, `Watch` should be called before `View` so that no events
are missed between viewing the snapshot and starting the event stream.
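As a rough sketch of that ordering (not code taken from SwarmKit itself), the
subscription is created before the snapshot is read, so nothing can fall into
the gap between the two. `handleTask` is a hypothetical reconciliation helper,
and error handling is trimmed:

```
// Subscribe first: events from this point on will be delivered on nodeTasks.
nodeTasks, err := store.Watch(s.WatchQueue(),
	api.EventUpdateTask{Task: &api.Task{NodeID: nodeID},
		Checks: []api.TaskCheckFunc{api.TaskCheckNodeID}})
if err != nil {
	return err
}

// Only then read the snapshot.
var tasks []*api.Task
s.View(func(tx store.ReadTx) {
	tasks, err = store.FindTasks(tx, store.ByNodeID(nodeID))
})
if err != nil {
	return err
}

// Reconcile from the snapshot first...
for _, t := range tasks {
	handleTask(t)
}

// ...then consume events from nodeTasks. Early events may repeat state that
// was already present in the snapshot, so handleTask must tolerate duplicates.
```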
## Distributed operation

Data written to the store is automatically replicated to the other managers in
the cluster through the underlying consensus protocol. All active managers have
local in-memory copies of all the data in the store, accessible through
go-memdb.

The current consensus implementation, based on Raft, only allows writes to
happen on the leader. This avoids potentially conflicting writes ending up in
the log, which would have to be reconciled later on. The leader's copy of the
data in the store is the most up-to-date. Other managers may lag behind this
copy if there are replication delays, but will never diverge from it.

## Sequencer

It's important not to overwrite current data with stale data. In some
situations, we might want to take data from the store, hand it to the user, and
then write it back to the store with the user's modifications. The store has
a safeguard to make sure this fails if the data has been updated since the copy
was retrieved.

Every top-level object has a `Meta` field which contains a `Version` object. The
`Version` is managed automatically by the store. When an object is updated, its
`Version` field is increased to distinguish the old version from the new
version. Trying to update an object will fail if the object passed into an
update function has a `Version` which doesn't match the current `Version` of
that object in the store.
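For illustration, here is a contrived sketch of that safeguard (not real
SwarmKit code), assuming reads return independent copies of the object; the
`Availability` changes are just an arbitrary modification:

```
var first, second *api.Node
s.View(func(tx store.ReadTx) {
	// Two reads of the same object; both copies carry the current
	// Meta.Version.
	first = store.GetNode(tx, nodeID)
	second = store.GetNode(tx, nodeID)
})

// The first write succeeds and bumps the object's Meta.Version in the store.
err := s.Update(func(tx store.Tx) error {
	first.Spec.Availability = api.NodeAvailabilityDrain
	return store.UpdateNode(tx, first)
})

// The second write still carries the old Meta.Version, so UpdateNode returns
// an error, and returning that error rolls the transaction back instead of
// silently overwriting the change above.
err = s.Update(func(tx store.Tx) error {
	second.Spec.Availability = api.NodeAvailabilityActive
	return store.UpdateNode(tx, second)
})
```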
`Meta` also contains timestamps that are automatically updated by the store.

To keep version numbers consistent across the cluster, they are provided by the
underlying consensus protocol through the `Proposer` interface. In the case of
the Raft consensus implementation, the version number is simply the current Raft
index at the time the object was last updated. Note that the index is queried
before the change is actually written to Raft, so an object created with
`Version.Index = 5` would most likely be appended to the Raft log at index 6.

The `Proposer` interface also provides the mechanism for the store code to
synchronize changes with the rest of the cluster. `ProposeValue` sends a set of
changes to the other managers in the cluster through the consensus protocol.

## RPC API

In addition to the Go API discussed above, the store exposes watches over gRPC.
There is a watch server that provides an interface very similar to the `Watch`
call. See `api/watch.proto` for the relevant protobuf definitions.

A full gRPC API for the store has been proposed, but had not yet been merged at
the time this document was written. See
https://github.com/docker/swarmkit/pull/1998 for draft code. In this proposal,
the gRPC store API did not support full transactions, but did allow creations
and updates to happen in atomic sets. Implementing full transactions over gRPC
presents some challenges because of the store update lock. If a streaming RPC
could hold the update lock, a misbehaving client or severed network connection
might cause the lock to be held for too long. Transactional APIs might need very
short timeouts or other safeguards.

The purpose of exposing an external gRPC API for the store would be to support
externally implemented control loops. This would make SwarmKit more extensible,
because code that works with store objects directly would no longer need to be
implemented inside the swarmkit repository.

## Generated code

For type safety, the store exposes type-safe helper functions such as
`DeleteNode` and `FindSecrets`. These functions wrap internal methods that are
not type-specific. However, providing these wrappers ended up involving a lot of
boilerplate code. There was also code that had to be duplicated for things like
saving and restoring snapshots of the store, defining events, and indexing
objects in the store.

To make this more manageable, a lot of store code is now automatically generated
by `protobuf/plugin/storeobject/storeobject.go`. It's now a lot easier to add a
new object type to the store, and there is scope for further improvements
through code generation.

The plugin uses the presence of the `docker.protobuf.plugin.store_object` option
to detect top-level objects that belong in the store. A `watch_selectors` field
inside this option specifies which functions should be generated for matching
against specific fields of an object in a `Watch` call.