# Data store design

SwarmKit has an embedded data store for configuration and state. This store is
usually backed by the Raft protocol, but is abstracted from the underlying
consensus protocol, and in principle could use other means to synchronize data
across the cluster. This document focuses on the design of the store itself,
such as the programmer-facing APIs and consistency guarantees, and does not
cover distributed consensus.

## Structure of stored data

The SwarmKit data store is built on top of go-memdb, which stores data in radix
trees.

There are separate tables for each data type, for example nodes, tasks, and so
on. Each table has its own set of indices, which always includes an ID index,
but may include other indices as well. For example, tasks can be indexed by
their service ID and node ID, among several other things.

Under the hood, go-memdb implements an index by adding keys for each index to
the radix tree, prefixed with the index's name. A single object in the data
store may have several keys corresponding to it, because it will have a
different key (and possibly multiple keys) within each index.
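
As an illustration of this layout (not SwarmKit's actual schema; the `Task`
struct and index names below are simplified stand-ins), a go-memdb table with
an ID index and a secondary index can be declared and queried like this:

```
package main

import (
	"fmt"

	memdb "github.com/hashicorp/go-memdb"
)

// Task is a stand-in for a stored object; SwarmKit's real objects are
// protobuf-generated types such as api.Task.
type Task struct {
	ID        string
	ServiceID string
}

func main() {
	// One table per type, each with an "id" index plus secondary indices.
	schema := &memdb.DBSchema{
		Tables: map[string]*memdb.TableSchema{
			"tasks": {
				Name: "tasks",
				Indexes: map[string]*memdb.IndexSchema{
					"id": {
						Name:    "id",
						Unique:  true,
						Indexer: &memdb.StringFieldIndex{Field: "ID"},
					},
					"serviceid": {
						Name:    "serviceid",
						Indexer: &memdb.StringFieldIndex{Field: "ServiceID"},
					},
				},
			},
		},
	}

	db, err := memdb.NewMemDB(schema)
	if err != nil {
		panic(err)
	}

	// Insert in a write transaction, then look the object up through the
	// secondary index; both index keys point at the same object.
	txn := db.Txn(true)
	if err := txn.Insert("tasks", &Task{ID: "t1", ServiceID: "svc1"}); err != nil {
		panic(err)
	}
	txn.Commit()

	readTxn := db.Txn(false)
	obj, err := readTxn.First("tasks", "serviceid", "svc1")
	if err != nil {
		panic(err)
	}
	fmt.Println(obj.(*Task).ID) // prints "t1"
}
```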

There are several advantages to using radix trees in this way. The first is that
it makes prefix matching easy. A second powerful feature of this design is
copy-on-write snapshotting. Since the radix tree consists of a hierarchy of
pointers, the root pointer always refers to a fully consistent state at that
moment in time. Making a change to the tree involves replacing a leaf node with
a new value, and "bubbling up" that change to the root through the intermediate
pointers. To make the change visible to other readers, all it takes is a single
atomic pointer swap that replaces the root of the tree with a new root that
incorporates the changed nodes. The text below will discuss how this is used to
support transactions.
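
To make the snapshotting behavior concrete, here is another illustrative
go-memdb sketch (again not SwarmKit code; most error handling is elided): a
read transaction opened before a commit keeps seeing the old root, while one
opened afterwards sees the new state.

```
package main

import (
	"fmt"

	memdb "github.com/hashicorp/go-memdb"
)

type Node struct{ ID, Hostname string }

func main() {
	schema := &memdb.DBSchema{
		Tables: map[string]*memdb.TableSchema{
			"nodes": {
				Name: "nodes",
				Indexes: map[string]*memdb.IndexSchema{
					"id": {Name: "id", Unique: true,
						Indexer: &memdb.StringFieldIndex{Field: "ID"}},
				},
			},
		},
	}
	db, err := memdb.NewMemDB(schema)
	if err != nil {
		panic(err)
	}

	// Seed the table with an initial object.
	w := db.Txn(true)
	w.Insert("nodes", &Node{ID: "n1", Hostname: "old"})
	w.Commit()

	before := db.Txn(false) // holds the current root pointer

	// Replace the object; committing swaps in a new root atomically.
	w = db.Txn(true)
	w.Insert("nodes", &Node{ID: "n1", Hostname: "new"})
	w.Commit()

	after := db.Txn(false)

	o1, _ := before.First("nodes", "id", "n1")
	o2, _ := after.First("nodes", "id", "n1")
	fmt.Println(o1.(*Node).Hostname, o2.(*Node).Hostname) // "old new"
}
```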

## Transactions

Code that uses the store can only use it inside a *transaction*. There are two
kinds of transactions: view transactions (read-only) and update transactions
(read/write).

A view transaction runs in a callback passed to the `View` method:

```
	s.View(func(tx store.ReadTx) {
		nodes, err = store.FindNodes(tx, store.All)
	})
```

This callback can call functions defined in the `store` package that retrieve
and list the various types of objects. `View` operates on an atomic snapshot of
the data store, so changes made while the callback is running won't be visible
to code inside the callback using the supplied `ReadTx`.

An update transaction works similarly, but provides the ability to create,
update, and delete objects:

```
	s.Update(func(tx store.Tx) error {
		t2 := &api.Task{
			ID: "testTaskID2",
			Status: api.TaskStatus{
				State: api.TaskStateNew,
			},
			ServiceID:    "testServiceID2",
			DesiredState: api.TaskStateRunning,
		}
		return store.CreateTask(tx, t2)
	})
```

If the callback returns `nil`, the changes made inside the callback function are
committed atomically. If it returns any other error value, the transaction gets
rolled back. The changes are never visible to any other readers before the
commit happens, but they are visible to code inside the callback using the
`Tx` argument.
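
For example, here is a sketch of the rollback path, in the same style as the
snippets above (it assumes `s` is the same store instance and uses the
generated `GetTask` and `UpdateTask` helpers; the task ID is arbitrary):

```
	err := s.Update(func(tx store.Tx) error {
		task := store.GetTask(tx, "testTaskID2")
		if task == nil {
			// Returning a non-nil error discards everything written
			// through tx in this callback.
			return errors.New("task disappeared")
		}
		task.DesiredState = api.TaskStateShutdown
		if err := store.UpdateTask(tx, task); err != nil {
			return err // also rolls the transaction back
		}
		return nil // commit atomically
	})
```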

There is an exclusive lock for updates, so only one can happen at once. Take
care not to do expensive or blocking operations inside an `Update` callback.

## Batching

Sometimes it's necessary to create or update many objects in the store, but we
want to do this without holding the update lock for an arbitrarily long period
of time, or generating a huge set of changes from the transaction that would
need to be serialized in a Raft write. For this situation, the store provides
primitives to batch iterated operations that don't require atomicity into
transactions of an appropriate size.

Here is an example of a batch operation:

```
	err = d.store.Batch(func(batch *store.Batch) error {
		for _, n := range nodes {
			err := batch.Update(func(tx store.Tx) error {
				// check if node is still here
				node := store.GetNode(tx, n.ID)
				if node == nil {
					return nil
				}

				// [...]

				node.Status.State = api.NodeStatus_UNKNOWN
				node.Status.Message = `Node moved to "unknown" state due to leadership change in cluster`

				if err := d.nodes.AddUnknown(node, expireFunc); err != nil {
					return errors.Wrap(err, `adding node in "unknown" state to node store failed`)
				}
				if err := store.UpdateNode(tx, node); err != nil {
					return errors.Wrap(err, "update failed")
				}
				return nil
			})
			if err != nil {
				log.WithField("node", n.ID).WithError(err).Error(`failed to move node to "unknown" state`)
			}
		}
		return nil
	})
```

This is a slightly abbreviated version of code in the dispatcher that moves a
set of nodes to the "unknown" state. If there were many nodes in the system,
doing this inside a single `Update` transaction might block updates to the store
for a long time, or exceed the size limit of a serialized transaction. By using
`Batch`, the changes are automatically broken up into a set of transactions.

`Batch` takes a callback which generally contains a loop that iterates over a
set of objects. Every iteration can call `batch.Update` with another nested
callback that performs the actual changes. Changes performed inside a single
`batch.Update` call are guaranteed to land in the same transaction, and
therefore be applied atomically. However, changes made in different calls to
`batch.Update` may end up in different transactions.

## Watches

The data store provides a real-time feed of insertions, deletions, and
modifications. Any number of listeners can subscribe to this feed, optionally
applying filters to the set of events. This is very useful for building control
loops. For example, the orchestrators watch changes to services to trigger
reconciliation.

To start a watch, use the `state.Watch` function. The first argument is the
watch queue, which can be obtained with the store instance's `WatchQueue`
method. Extra arguments are events to be matched against incoming events when
filtering. For example, this call returns only task creations, updates, and
deletions that affect a specific node ID:

```
	nodeTasks, err := store.Watch(s.WatchQueue(),
		api.EventCreateTask{Task: &api.Task{NodeID: nodeID},
			Checks: []api.TaskCheckFunc{api.TaskCheckNodeID}},
		api.EventUpdateTask{Task: &api.Task{NodeID: nodeID},
			Checks: []api.TaskCheckFunc{api.TaskCheckNodeID}},
		api.EventDeleteTask{Task: &api.Task{NodeID: nodeID},
			Checks: []api.TaskCheckFunc{api.TaskCheckNodeID}},
	)
```

There is also a `ViewAndWatch` method on the store that provides access to a
snapshot of the store taken just before the watch starts receiving events. It
guarantees that events following this snapshot won't be missed, and events that
are already incorporated in the snapshot won't be received. `ViewAndWatch`
involves holding the store update lock while its callback runs, so it's
preferable to use `View` and `Watch` separately instead if the use case isn't
sensitive to redundant events. `Watch` should be called before `View` so that
events aren't missed in between viewing a snapshot and starting the event
stream.
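
As a sketch of that ordering (reusing the call shapes shown above and assuming
the watch call returns a channel of events; `reconcile` and `handle` are
hypothetical helpers, and most error handling is trimmed):

```
	// 1. Subscribe first so that no events are missed while the snapshot
	//    is being read.
	taskEvents, err := store.Watch(s.WatchQueue(),
		api.EventUpdateTask{Task: &api.Task{NodeID: nodeID},
			Checks: []api.TaskCheckFunc{api.TaskCheckNodeID}},
	)
	if err != nil {
		return err
	}

	// 2. Read a consistent snapshot of the current state.
	var tasks []*api.Task
	s.View(func(tx store.ReadTx) {
		tasks, err = store.FindTasks(tx, store.ByNodeID(nodeID))
	})
	if err != nil {
		return err
	}
	reconcile(tasks)

	// 3. Consume events from the point the watch was established. Events
	//    already reflected in the snapshot may show up again, so handling
	//    must be idempotent.
	for event := range taskEvents {
		handle(event)
	}
```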

## Distributed operation

Data written to the store is automatically replicated to the other managers in
the cluster through the underlying consensus protocol. All active managers have
local in-memory copies of all the data in the store, accessible through
go-memdb.

The current consensus implementation, based on Raft, only allows writes to
happen on the leader. This avoids potentially conflicting writes ending up in
the log, which would have to be reconciled later on. The leader's copy of the
data in the store is the most up-to-date. Other nodes may lag behind this copy
if there are replication delays, but will never diverge from it.

## Sequencer

It's important not to overwrite current data with stale data. In some
situations, we might want to take data from the store, hand it to the user, and
then write it back to the store with the user's modifications. The store has
a safeguard to make sure this fails if the data has been updated since the copy
was retrieved.

Every top-level object has a `Meta` field which contains a `Version` object. The
`Version` is managed automatically by the store. When an object is updated, its
`Version` field is increased to distinguish the old version from the new
version. Trying to update an object will fail if the object passed into an
update function has a `Version` which doesn't match the current `Version` of
that object in the store.
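
For instance (a sketch built from the helpers shown earlier; `nodeID` is
assumed to identify an existing node), a copy read in one transaction and
written back after a concurrent update will be rejected:

```
	var stale *api.Node
	s.View(func(tx store.ReadTx) {
		stale = store.GetNode(tx, nodeID) // copy carrying Version N
	})

	// Meanwhile another writer updates the same node, so the version
	// stored for it becomes N+1.

	err := s.Update(func(tx store.Tx) error {
		stale.Spec.Availability = api.NodeAvailabilityDrain
		// Fails: stale.Meta.Version no longer matches the stored version.
		return store.UpdateNode(tx, stale)
	})
	// err is non-nil; re-read the object and retry the modification.
```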

`Meta` also contains timestamps that are automatically updated by the store.

To keep version numbers consistent across the cluster, version numbers are
provided by the underlying consensus protocol through the `Proposer` interface.
In the case of the Raft consensus implementation, the version number is simply
the current Raft index at the time that the object was last updated. Note that
the index is queried before the change is actually written to Raft, so an object
created with `Version.Index = 5` would most likely be appended to the Raft log
at index 6.

The `Proposer` interface also provides the mechanism for the store code to
synchronize changes to the rest of the cluster. `ProposeValue` sends a set of
changes to the other managers in the cluster through the consensus protocol.
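
Roughly, the interface has the following shape. This is a paraphrase of the
description above rather than the authoritative definition; the exact method
set and signatures live in the `manager/state` package:

```
	// Proposer (sketch) lets the store stamp versions onto objects and
	// replicate changes through the consensus protocol.
	type Proposer interface {
		// GetVersion returns the version (for Raft, the current index)
		// used to stamp objects modified in a transaction.
		GetVersion() *api.Version
		// ProposeValue sends a batch of store changes to the other
		// managers, calling cb once they have been committed.
		ProposeValue(ctx context.Context, changes []api.StoreAction, cb func()) error
	}
```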

## RPC API

In addition to the Go API discussed above, the store exposes watches over gRPC.
There is a watch server that provides a very similar interface to the `Watch`
call. See `api/watch.proto` for the relevant protobuf definitions.

A full gRPC API for the store has been proposed, but not yet merged at the time
this document was written. See https://github.com/docker/swarmkit/pull/1998 for
draft code. In this proposal, the gRPC store API did not support full
transactions, but did allow creations and updates to happen in atomic sets.
Implementing full transactions over gRPC presents some challenges, because of
the store update lock. If a streaming RPC could hold the update lock, a
misbehaving client or severed network connection might cause this lock to be
held too long. Transactional APIs might need very short timeouts or other
safeguards.

The purpose of exposing an external gRPC API for the store would be to support
externally-implemented control loops. This would make swarmkit more extensible
because code that works with objects directly wouldn't need to be implemented
inside the swarmkit repository anymore.

## Generated code

For type safety, the store exposes type-safe helper functions such as
`DeleteNode` and `FindSecrets`. These functions wrap internal methods that are
not type-specific. However, providing these wrappers ended up involving a lot of
boilerplate code. There was also code that had to be duplicated for things like
saving and restoring snapshots of the store, defining events, and indexing
objects in the store.

To make this more manageable, a lot of the store code is now automatically
generated by `protobuf/plugin/storeobject/storeobject.go`, which makes it much
easier to add a new object type to the store. There is scope for further
improvements through code generation.

The plugin uses the presence of the `docker.protobuf.plugin.store_object` option
to detect top-level objects that can be stored inside the store. There is a
`watch_selectors` field inside this option that specifies which functions should
be generated for matching against specific fields of an object in a `Watch`
call.