# Data store design

SwarmKit has an embedded data store for configuration and state. This store is
usually backed by the Raft protocol, but is abstracted from the underlying
consensus protocol, and in principle could use other means to synchronize data
across the cluster. This document focuses on the design of the store itself, in
particular its programmer-facing APIs and consistency guarantees, and does not
cover distributed consensus.

## Structure of stored data

The SwarmKit data store is built on top of go-memdb, which stores data in radix
trees.

There are separate tables for each data type, for example nodes, tasks, and so
on. Each table has its own set of indices, which always includes an ID index,
but may include other indices as well. For example, tasks can be indexed by
their service ID and node ID, among several other things.

Under the hood, go-memdb implements an index by adding keys for that index to
the radix tree, prefixed with the index's name. A single object in the data
store may have several keys corresponding to it, because it will have a
different key (and possibly multiple keys) within each index.

There are several advantages to using radix trees in this way. The first is that
it makes prefix matching easy. A second powerful feature of this design is
copy-on-write snapshotting. Since the radix tree consists of a hierarchy of
pointers, the root pointer always refers to a fully consistent view of the state
at that moment in time. Making a change to the tree involves replacing a leaf
node with a new value, and "bubbling up" that change to the root through the
intermediate pointers. To make the change visible to other readers, all it takes
is a single atomic pointer swap that replaces the root of the tree with a new
root that incorporates the changed nodes. The text below will discuss how this
is used to support transactions.
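To make the table-and-index layout more concrete, here is a minimal,
self-contained sketch that uses go-memdb directly. The `Task` struct and the
index names are simplified stand-ins for illustration, not SwarmKit's actual
schema:

```
package main

import (
	"fmt"

	"github.com/hashicorp/go-memdb"
)

// Task is a simplified stand-in for SwarmKit's api.Task.
type Task struct {
	ID        string
	ServiceID string
	NodeID    string
}

func main() {
	// One table per type. Each table declares an "id" index plus secondary
	// indices; go-memdb stores each index as an extra set of prefixed keys
	// in the radix tree.
	schema := &memdb.DBSchema{
		Tables: map[string]*memdb.TableSchema{
			"tasks": {
				Name: "tasks",
				Indexes: map[string]*memdb.IndexSchema{
					"id": {
						Name:    "id",
						Unique:  true,
						Indexer: &memdb.StringFieldIndex{Field: "ID"},
					},
					"serviceid": {
						Name:    "serviceid",
						Indexer: &memdb.StringFieldIndex{Field: "ServiceID"},
					},
					"nodeid": {
						Name:    "nodeid",
						Indexer: &memdb.StringFieldIndex{Field: "NodeID"},
					},
				},
			},
		},
	}

	db, err := memdb.NewMemDB(schema)
	if err != nil {
		panic(err)
	}

	// Insert one object; it becomes reachable through all three indices.
	txn := db.Txn(true)
	if err := txn.Insert("tasks", &Task{ID: "t1", ServiceID: "svc1", NodeID: "n1"}); err != nil {
		panic(err)
	}
	txn.Commit()

	// Secondary-index lookup: iterate all tasks assigned to node "n1".
	read := db.Txn(false)
	defer read.Abort()
	it, err := read.Get("tasks", "nodeid", "n1")
	if err != nil {
		panic(err)
	}
	for obj := it.Next(); obj != nil; obj = it.Next() {
		fmt.Println(obj.(*Task).ID)
	}
}
```

go-memdb also supports prefix lookups on these same indices, which is the
prefix-matching advantage mentioned above.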
## Transactions

Code that uses the store can only use it inside a *transaction*. There are two
kinds of transactions: view transactions (read-only) and update transactions
(read/write).

A view transaction runs in a callback passed to the `View` method:

```
s.View(func(tx store.ReadTx) {
	nodes, err = store.FindNodes(tx, store.All)
})
```

This callback can call functions defined in the `store` package that retrieve
and list the various types of objects. `View` operates on an atomic snapshot of
the data store, so changes made while the callback is running won't be visible
to code inside the callback using the supplied `ReadTx`.

An update transaction works similarly, but provides the ability to create,
update, and delete objects:

```
s.Update(func(tx store.Tx) error {
	t2 := &api.Task{
		ID: "testTaskID2",
		Status: api.TaskStatus{
			State: api.TaskStateNew,
		},
		ServiceID:    "testServiceID2",
		DesiredState: api.TaskStateRunning,
	}
	return store.CreateTask(tx, t2)
})
```

If the callback returns `nil`, the changes made inside the callback function are
committed atomically. If it returns any other error value, the transaction gets
rolled back. The changes are never visible to any other readers before the
commit happens, but they are visible to code inside the callback using the
`Tx` argument.

There is an exclusive lock for updates, so only one update transaction can run
at a time. Take care not to do expensive or blocking operations inside an
`Update` callback.

## Batching

Sometimes it's necessary to create or update many objects in the store, but we
want to do this without holding the update lock for an arbitrarily long period
of time, or generating a huge set of changes from the transaction that would
need to be serialized in a Raft write. For this situation, the store provides
primitives to batch iterated operations that don't require atomicity into
transactions of an appropriate size.

Here is an example of a batch operation:

```
err = d.store.Batch(func(batch *store.Batch) error {
	for _, n := range nodes {
		err := batch.Update(func(tx store.Tx) error {
			// check if node is still here
			node := store.GetNode(tx, n.ID)
			if node == nil {
				return nil
			}

			// [...]

			node.Status.State = api.NodeStatus_UNKNOWN
			node.Status.Message = `Node moved to "unknown" state due to leadership change in cluster`

			if err := d.nodes.AddUnknown(node, expireFunc); err != nil {
				return errors.Wrap(err, `adding node in "unknown" state to node store failed`)
			}
			if err := store.UpdateNode(tx, node); err != nil {
				return errors.Wrap(err, "update failed")
			}
			return nil
		})
		if err != nil {
			log.WithField("node", n.ID).WithError(err).Error(`failed to move node to "unknown" state`)
		}
	}
	return nil
})
```

This is a slightly abbreviated version of code in the dispatcher that moves a
set of nodes to the "unknown" state. If there were many nodes in the system,
doing this inside a single `Update` transaction might block updates to the store
for a long time, or exceed the size limit of a serialized transaction. By using
`Batch`, the changes are automatically broken up into a set of transactions.

`Batch` takes a callback which generally contains a loop that iterates over a
set of objects. Every iteration can call `batch.Update` with another nested
callback that performs the actual changes. Changes performed inside a single
`batch.Update` call are guaranteed to land in the same transaction, and
therefore to be applied atomically. However, changes made in different calls to
`batch.Update` may end up in different transactions.

## Watches

The data store provides a real-time feed of insertions, deletions, and
modifications. Any number of listeners can subscribe to this feed, optionally
applying filters to the set of events. This is very useful for building control
loops. For example, the orchestrators watch changes to services to trigger
reconciliation.

To start a watch, use the `state.Watch` function. The first argument is the
watch queue, which can be obtained with the store instance's `WatchQueue`
method. Extra arguments are event specifiers that incoming events are matched
against when filtering. For example, this call returns only task creations,
updates, and deletions that involve a specific node ID:

```
nodeTasks, err := store.Watch(s.WatchQueue(),
	api.EventCreateTask{Task: &api.Task{NodeID: nodeID},
		Checks: []api.TaskCheckFunc{api.TaskCheckNodeID}},
	api.EventUpdateTask{Task: &api.Task{NodeID: nodeID},
		Checks: []api.TaskCheckFunc{api.TaskCheckNodeID}},
	api.EventDeleteTask{Task: &api.Task{NodeID: nodeID},
		Checks: []api.TaskCheckFunc{api.TaskCheckNodeID}},
)
```

There is also a `ViewAndWatch` method on the store that provides access to a
snapshot of the store taken just before the watch starts receiving events. It
guarantees that events following this snapshot won't be missed, and that events
already incorporated in the snapshot won't be received. `ViewAndWatch` involves
holding the store update lock while its callback runs, so it's preferable to use
`View` and `Watch` separately if the use case isn't sensitive to redundant
events. In that case, `Watch` should be called before `View` so that no events
are missed between viewing the snapshot and starting the event stream.
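As a rough sketch of that ordering (not code taken from SwarmKit itself), the
subscription is created before the snapshot is read, so nothing can fall into
the gap between the two. `handleTask` is a hypothetical reconciliation helper,
and error handling is trimmed:

```
// Subscribe first: events from this point on will be delivered on nodeTasks.
nodeTasks, err := store.Watch(s.WatchQueue(),
	api.EventUpdateTask{Task: &api.Task{NodeID: nodeID},
		Checks: []api.TaskCheckFunc{api.TaskCheckNodeID}})
if err != nil {
	return err
}

// Only then read the snapshot.
var tasks []*api.Task
s.View(func(tx store.ReadTx) {
	tasks, err = store.FindTasks(tx, store.ByNodeID(nodeID))
})
if err != nil {
	return err
}

// Reconcile from the snapshot first...
for _, t := range tasks {
	handleTask(t)
}

// ...then consume events from nodeTasks. Early events may repeat state that
// was already present in the snapshot, so handleTask must tolerate duplicates.
```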
## Distributed operation

Data written to the store is automatically replicated to the other managers in
the cluster through the underlying consensus protocol. All active managers have
local in-memory copies of all the data in the store, accessible through
go-memdb.

The current consensus implementation, based on Raft, only allows writes to
happen on the leader. This avoids potentially conflicting writes ending up in
the log, which would have to be reconciled later on. The leader's copy of the
data in the store is the most up-to-date. Other managers may lag behind this
copy if there are replication delays, but will never diverge from it.

## Sequencer

It's important not to overwrite current data with stale data. In some
situations, we might want to take data from the store, hand it to the user, and
then write it back to the store with the user's modifications. The store has
a safeguard to make sure this fails if the data has been updated since the copy
was retrieved.

Every top-level object has a `Meta` field which contains a `Version` object. The
`Version` is managed automatically by the store. When an object is updated, its
`Version` field is increased to distinguish the old version from the new
version. Trying to update an object will fail if the object passed into an
update function has a `Version` which doesn't match the current `Version` of
that object in the store.
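For illustration, here is a contrived sketch of that safeguard (not real
SwarmKit code), assuming reads return independent copies of the object; the
`Availability` changes are just an arbitrary modification:

```
var first, second *api.Node
s.View(func(tx store.ReadTx) {
	// Two reads of the same object; both copies carry the current
	// Meta.Version.
	first = store.GetNode(tx, nodeID)
	second = store.GetNode(tx, nodeID)
})

// The first write succeeds and bumps the object's Meta.Version in the store.
err := s.Update(func(tx store.Tx) error {
	first.Spec.Availability = api.NodeAvailabilityDrain
	return store.UpdateNode(tx, first)
})

// The second write still carries the old Meta.Version, so UpdateNode returns
// an error, and returning that error rolls the transaction back instead of
// silently overwriting the change above.
err = s.Update(func(tx store.Tx) error {
	second.Spec.Availability = api.NodeAvailabilityActive
	return store.UpdateNode(tx, second)
})
```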
`Meta` also contains timestamps that are automatically updated by the store.

To keep version numbers consistent across the cluster, they are provided by the
underlying consensus protocol through the `Proposer` interface. In the case of
the Raft consensus implementation, the version number is simply the current Raft
index at the time the object was last updated. Note that the index is queried
before the change is actually written to Raft, so an object created with
`Version.Index = 5` would most likely be appended to the Raft log at index 6.

The `Proposer` interface also provides the mechanism for the store code to
synchronize changes with the rest of the cluster. `ProposeValue` sends a set of
changes to the other managers in the cluster through the consensus protocol.

## RPC API

In addition to the Go API discussed above, the store exposes watches over gRPC.
There is a watch server that provides an interface very similar to the `Watch`
call. See `api/watch.proto` for the relevant protobuf definitions.

A full gRPC API for the store has been proposed, but had not yet been merged at
the time this document was written. See
https://github.com/docker/swarmkit/pull/1998 for draft code. In this proposal,
the gRPC store API did not support full transactions, but did allow creations
and updates to happen in atomic sets. Implementing full transactions over gRPC
presents some challenges because of the store update lock. If a streaming RPC
could hold the update lock, a misbehaving client or severed network connection
might cause the lock to be held for too long. Transactional APIs might need very
short timeouts or other safeguards.

The purpose of exposing an external gRPC API for the store would be to support
externally implemented control loops. This would make SwarmKit more extensible,
because code that works with store objects directly would no longer need to be
implemented inside the swarmkit repository.

## Generated code

For type safety, the store exposes type-safe helper functions such as
`DeleteNode` and `FindSecrets`. These functions wrap internal methods that are
not type-specific. However, providing these wrappers ended up involving a lot of
boilerplate code. There was also code that had to be duplicated for things like
saving and restoring snapshots of the store, defining events, and indexing
objects in the store.

To make this more manageable, a lot of store code is now automatically generated
by `protobuf/plugin/storeobject/storeobject.go`. It's now a lot easier to add a
new object type to the store, and there is scope for further improvements
through code generation.

The plugin uses the presence of the `docker.protobuf.plugin.store_object` option
to detect top-level objects that belong in the store. A `watch_selectors` field
inside this option specifies which functions should be generated for matching
against specific fields of an object in a `Watch` call.