# JetStream SIOT Store

- Author: Cliff Brake, last updated: 2024-01-24
- Status: discussion

## Problem

SQLite has worked well as a SIOT store. There are a few things we would like to
improve:

- synchronization of history
  - currently, if a device or server is offline, only the latest state is
    transferred when it reconnects. We would like all history that has
    accumulated while offline to be transferred once reconnected.
- we want history at the edge as well as in the cloud
  - this allows us to use history at the edge to run more advanced algorithms
    like AI
- we currently have to re-compute hashes all the way to the root node anytime
  something changes
  - this may not scale to larger systems
  - it is difficult to get right if things are changing while we re-compute
    hashes -- it requires some type of coordination between the distributed
    systems, which we currently don't have.

## Context/Discussion

The purpose of this document is to explore storing SIOT state in a NATS
JetStream store. SIOT data is stored in a tree of nodes and each node contains
an array of points. Note, the term **"node"** in this document represents a data
structure in a tree, not a physical computer or SIOT instance. The term
**"instance"** will be used to represent devices or SIOT instances.

![nodes](./assets/nodes.png)

Nodes are arranged in a
[directed acyclic graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph).

<img src="./assets/image-20240124105741250.png" alt="image-20240124105741250" style="zoom: 33%;" />

A subset of this tree is synchronized between various instances as shown in the
example below:

![SIOT example tree](./assets/cloud-device-node-tree.png)

The tree topology can be as deep as required to describe the system. To date,
only the current state of a node is synchronized and history (if needed) is
stored externally in a time-series database like InfluxDB and is not
synchronized. The node tree is an excellent data model for IoT systems.

Each node contains an array of points that represent the state of the node. The
points contain a type and a key. The key can be used to describe maps and
arrays. We keep points separate so they can all be updated independently and
easily merged.

With JetStream, we could store points in a stream where the head of the stream
represents the current state of a node or collection of nodes. Each point is
stored in a separate NATS subject.

![image-20240119093623132](./assets/image-20240119093623132.png)

NATS JetStream is a stream-based store where every message in a stream is given
a sequence number. Synchronization is simple in that if a sequence number does
not exist on a remote system, the missing messages are sent.
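
The catch-up idea can be sketched in a few lines of Go. This is only an illustration of "send whatever the remote side has not seen yet", not the actual JetStream replication protocol; the `Msg` type and the `missing` function are hypothetical names introduced for this example.

```go
package main

import "fmt"

// Msg is a hypothetical stand-in for a message stored in a stream.
type Msg struct {
	Seq     uint64 // sequence number assigned by the stream
	Subject string
}

// missing returns the messages in src with a sequence number greater than
// lastSeq, the highest sequence the remote side already holds.
// src is assumed to be ordered by ascending sequence number.
func missing(src []Msg, lastSeq uint64) []Msg {
	for i, m := range src {
		if m.Seq > lastSeq {
			return src[i:]
		}
	}
	return nil
}

func main() {
	// The hub holds sequences 1..3; the leaf last saw sequence 1.
	hub := []Msg{
		{Seq: 1, Subject: "p.n1.temp.0"},
		{Seq: 2, Subject: "p.n1.temp.0"},
		{Seq: 3, Subject: "p.n1.description.0"},
	}
	for _, m := range missing(hub, 1) {
		fmt.Println("send", m.Seq, m.Subject)
	}
}
```

Because the sequence number is the only state the receiver has to report, the sender never needs to hash or compare the data itself.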

NATS also supports leaf nodes (instances) and streams can be synchronized
between hub and leaf instances. If they are disconnected, then streams are
"caught up" after the connection is made again.

Several experiments have been run to understand the basic JetStream
functionality in [this repo](https://github.com/simpleiot/nats-exp).

1. storing and extracting points in a stream
1. using streams to store time-series data and measure performance
1. syncing streams between the hub and leaf instances

### Advantages of JetStream

- JetStream is built into NATS, which we already embed and use.
- History can be stored in a NATS stream instead of externally. Currently, we
  use an external store like InfluxDB to store history.
- JetStream streams can be synchronized between instances.
- JetStream has various retention models so old data can automatically be
  dropped.
- Leverage the NATS AuthN/AuthZ features.
- JetStream is a natural extension of core NATS, so many of the core SIOT
  concepts are still valid and do not need to change.

### Challenges with moving to JetStream

- streams are typically synchronized in one direction. This is a challenge for
  SIOT as the basic premise is data can be modified in any location where a
  user/device has proper permissions. A user may change a configuration in a
  cloud portal or on a local touch-screen.
- sequence numbers must be set by one instance, so you can't have both a leaf
  and a hub instance inserting data into a single stream. This has benefits in
  that it is a very simple and reliable model.
- we are constrained by a simple message subject to label and easily query data.
  This is less flexible than a SQL database, but this constraint can also be an
  advantage in that it forces us into a simple and consistent data model.
- SQLite has a built-in cache. We would likely need to create our own with
  JetStream.

### JetStream consistency model

From this [discussion](https://github.com/nats-io/nats-server/discussions/4577):

> When the doc mentions immediate consistency, it is in contrast to
> [eventual consistency](https://en.wikipedia.org/wiki/Eventual_consistency). It
> is about how 'writes' (i.e. publishing a message to a stream).
>
> JetStream is an immediately consistent distributed storage system in that
> every new message stored in the stream is done so in a unique order (when
> those messages reach the stream leader) and that the acknowledgment that the
> storing of the message has been successful only happens as the result of a
> RAFT vote between the NATS JetStream servers (e.g. 3 of them if replicas=3)
> handling the stream.
>
> This means that when a publishing application receives the positive
> acknowledgement to it's publication to the stream you are guaranteed that
> everyone will see that new message in their updates _in the same order_ (and
> with the same sequence number and time stamp).
>
> This 'non-eventual' consistency is what enables 'compare and set' (i.e.
> compare and publish to a stream) operations on streams: because there can only
> be one new message added to a stream at a time.
>
> To map back to those formal consistency models it means that for writes, NATS
> JetStream is
> [Linearizable](https://jepsen.io/consistency/models/linearizable).

Currently SIOT uses a more "eventually" consistent model where we use data
structures with some light-weight CRDT properties. However, this has the
disadvantage that we have to do things like hash the entire node tree to know if
anything has changed. In a more static system where not much is changing, this
works pretty well, but in a dynamic IoT system where data is changing all the
time, it is hard to scale this model.

### Message/Subject encoding

In the past, we've used the
[Point datastructure](https://docs.simpleiot.org/docs/adr/1-consider-changing-point-data-type.html#proposal-2).
This has worked extremely well at representing reasonably complex data
structures (including maps and arrays) for a node. Yet it has limitations and
constraints that have proven useful in making data easy to store, transmit, and
merge.

```go
// Point is a flexible data structure that can be used to represent
// a sensor value or a configuration parameter.
// Type and Key uniquely identify a point in a node.
type Point struct {
	//-------------------------------------------------------
	// The first two fields uniquely identify a point when receiving updates

	// Type of point (voltage, current, key, etc)
	Type string `json:"type,omitempty"`

	// Key is used to allow a group of points to represent a map or array
	Key string `json:"key,omitempty"`

	//-------------------------------------------------------
	// The following fields are the values for a point

	// Time the point was taken
	Time time.Time `json:"time,omitempty" yaml:"-"`

	// Instantaneous analog or digital value of the point.
	// 0 and 1 are used to represent digital values
	Value float64 `json:"value,omitempty"`

	// Optional text value of the point for data that is best represented
	// as a string rather than a number.
	Text string `json:"text,omitempty"`

	// catchall field for data that does not fit into float or string --
	// should be used sparingly
	Data []byte `json:"data,omitempty"`

	//-------------------------------------------------------
	// Metadata

	// Used to indicate a point has been deleted. This value is only
	// ever incremented. Odd values mean point is deleted.
	Tombstone int `json:"tombstone,omitempty"`

	// Where did this point come from. If from the owning node, it may be blank.
	Origin string `json:"origin,omitempty"`
}
```
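
The `Tombstone` comment in the struct above implies a simple counter rule: the value only goes up, and its parity says whether the point is deleted. A minimal sketch of that rule follows; the helper names `isDeleted` and `toggleDeleted` are hypothetical and not part of the SIOT API.

```go
package main

import "fmt"

// isDeleted reports whether a tombstone counter marks the point as deleted.
// Odd values mean deleted; the counter is assumed to start at 0.
func isDeleted(tombstone int) bool {
	return tombstone%2 == 1
}

// toggleDeleted returns the next counter value. Because the counter is only
// ever incremented, deleting and restoring are both a +1 step.
func toggleDeleted(tombstone int) int {
	return tombstone + 1
}

func main() {
	t := 0
	fmt.Println(t, isDeleted(t))
	t = toggleDeleted(t) // delete
	fmt.Println(t, isDeleted(t))
	t = toggleDeleted(t) // restore
	fmt.Println(t, isDeleted(t))
}
```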

With JetStream, the `Type` and `Key` can be encoded in the message subject:

`p.<node id>.<type>.<key>`

Message subjects are indexed in a stream, so NATS can quickly find messages for
any subject in a stream without scanning the entire stream (see
[discussion 1](https://github.com/nats-io/nats-server/discussions/3772) and
[discussion 2](https://github.com/nats-io/nats-server/discussions/4170)).
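
As a rough sketch of the subject layout above, the helpers below build and parse `p.<node id>.<type>.<key>` subjects. The function names are hypothetical, and the sketch assumes every token is non-empty and contains no `.`, `*`, `>`, or whitespace, since those are not usable inside a NATS subject token.

```go
package main

import (
	"fmt"
	"strings"
)

// pointSubject builds the subject for a node point: p.<node id>.<type>.<key>.
// Callers are assumed to pass valid, non-empty NATS subject tokens.
func pointSubject(nodeID, typ, key string) string {
	return strings.Join([]string{"p", nodeID, typ, key}, ".")
}

// parsePointSubject splits a node point subject back into its parts.
func parsePointSubject(subject string) (nodeID, typ, key string, err error) {
	parts := strings.Split(subject, ".")
	if len(parts) != 4 || parts[0] != "p" {
		return "", "", "", fmt.Errorf("not a point subject: %q", subject)
	}
	return parts[1], parts[2], parts[3], nil
}

func main() {
	s := pointSubject("n1", "value", "0")
	fmt.Println(s)

	id, typ, key, err := parsePointSubject(s)
	fmt.Println(id, typ, key, err)
}
```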

Over time, the Point structure has been simplified. For instance, it used to
also have an `Index` field, but we have learned we can use a single `Key` field
instead. At this point it may make sense to simplify the payload. One idea is to
do away with the `Value` and `Text` fields and simply have a `Data` field. The
components that use the points have to know the data-type anyway to know if they
should use the `Value` or `Text` field. In the past, protobuf encoding was used
as we started with quite a few fields and it provided some flexibility and
convenience. But as we have reduced the number of fields and two of them are now
encoded in the message subject, it may be simpler to have a simple encoding for
`Time`, `Data`, `Tombstone`, and `Origin` in the message payload. The code using
the message would be responsible for converting `Data` into whatever data type
is needed. This would open up the opportunity to encode any type of payload in
the `Data` field and be more flexible for the future.

#### Message payload:

- `Time` (uint64)
- `Tombstone` (byte)
- `OriginLen` (byte)
- `Origin` (string)
- `Data Type` (byte)
- `Data` (length is the message length minus the length of the above fields)

Examples of types:

- 0 - unknown or custom
- 1 - float (32 or 64 bit)
- 2 - int (8, 16, 32, or 64 bit)
- 3 - uint (8, 16, 32, or 64 bit)
- 4 - string
- 5 - JSON
- 6 - Protobuf
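
The field list above could be encoded roughly as follows. This is a sketch under stated assumptions, not a settled wire format: the proposal does not specify byte order (big-endian is assumed here) or the unit of `Time`, and the `Payload`, `Encode`, and `Decode` names are hypothetical.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Payload mirrors the proposed message payload fields.
type Payload struct {
	Time      uint64 // timestamp; unit not fixed by this sketch
	Tombstone byte
	Origin    string // at most 255 bytes, since OriginLen is one byte
	DataType  byte
	Data      []byte
}

// Encode lays the fields out in the order listed in the proposal.
// Big-endian byte order is an assumption.
func (p Payload) Encode() ([]byte, error) {
	if len(p.Origin) > 255 {
		return nil, errors.New("origin longer than 255 bytes")
	}
	buf := make([]byte, 8, 11+len(p.Origin)+len(p.Data))
	binary.BigEndian.PutUint64(buf, p.Time)
	buf = append(buf, p.Tombstone, byte(len(p.Origin)))
	buf = append(buf, p.Origin...)
	buf = append(buf, p.DataType)
	buf = append(buf, p.Data...)
	return buf, nil
}

// Decode reverses Encode. Data is whatever remains after the fixed fields
// and aliases the input slice rather than copying it.
func Decode(b []byte) (Payload, error) {
	if len(b) < 10 {
		return Payload{}, errors.New("payload too short")
	}
	originLen := int(b[9])
	if len(b) < 11+originLen {
		return Payload{}, errors.New("payload truncated")
	}
	return Payload{
		Time:      binary.BigEndian.Uint64(b[:8]),
		Tombstone: b[8],
		Origin:    string(b[10 : 10+originLen]),
		DataType:  b[10+originLen],
		Data:      b[11+originLen:],
	}, nil
}

func main() {
	in := Payload{Time: 1706054400000000000, Origin: "n2", DataType: 4, Data: []byte("on")}
	b, err := in.Encode()
	if err != nil {
		fmt.Println("encode:", err)
		return
	}
	out, err := Decode(b)
	fmt.Println(len(b), out.Origin, out.DataType, string(out.Data), err)
}
```

With this layout the fixed overhead is 11 bytes plus the origin string, and no length prefix is needed for `Data` because the NATS message length is already known.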

Putting `Origin` in the message payload will make it inefficient to query as you
will need to scan and decode all messages. Are there any cases where we will
need to do this? (This is an example where a SQL database is more flexible.) One
solution would be to create another stream where the origin is in the subject.

There are times when the current point model does not fit very well -- for
instance when sending a notification -- this is difficult to encode in an array
of points. I think in these cases encoding the notification data as JSON
probably makes more sense, and the new payload encoding should handle this much
better.

#### Can't send multiple points in a message

In the past, it was common to send multiple points in a message for a node --
for instance when creating a node, or updating an array. However, with the
`type` and `key` encoded in the subject this will no longer work. What are the
implications of having separate messages?

- it will be more complex to create nodes
- when updating an array/map in a node, it will not be updated all at once, but
  over the time it takes all the points to come into the client.
- there is still value in arrays being encoded as points -- for instance a relay
  device that contains two relays. However, for configuration are we better
  served by encoding the struct in the data field as JSON and updating it as
  an atomic unit?
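
To make the last option concrete, here is a small sketch of carrying a whole configuration struct in the `Data` field as JSON, so it is stored and updated as one message. `RelayConfig`, its fields, and the helper functions are hypothetical; the data-type value 5 is taken from the example type list in the payload section.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RelayConfig is a hypothetical configuration struct for a two-relay device.
type RelayConfig struct {
	Description string   `json:"description"`
	RelayNames  []string `json:"relayNames"`
}

// dataTypeJSON is the "JSON" entry from the example data type list.
const dataTypeJSON = 5

// encodeConfig returns the data-type byte and the Data bytes for cfg.
func encodeConfig(cfg RelayConfig) (byte, []byte, error) {
	data, err := json.Marshal(cfg)
	return dataTypeJSON, data, err
}

// decodeConfig checks the data type and unmarshals the whole struct at once.
func decodeConfig(dataType byte, data []byte) (RelayConfig, error) {
	var cfg RelayConfig
	if dataType != dataTypeJSON {
		return cfg, fmt.Errorf("unexpected data type %d", dataType)
	}
	err := json.Unmarshal(data, &cfg)
	return cfg, err
}

func main() {
	typ, data, err := encodeConfig(RelayConfig{
		Description: "pump house",
		RelayNames:  []string{"pump", "fan"},
	})
	fmt.Println(typ, string(data), err)

	cfg, err := decodeConfig(typ, data)
	fmt.Println(cfg.Description, cfg.RelayNames, err)
}
```

The trade-off is that two instances editing different fields of the same struct can no longer be merged field by field; the later message replaces the whole configuration.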

### UI Implications

Because NATS and JetStream subjects overlap, the UI could
[subscribe to the current state changes](https://github.com/simpleiot/simpleiot/tree/master/frontend/lib)
much as is done today. A few things would need to change:

- Getting the initial state could still use the
  [NATS `nodes` API](https://docs.simpleiot.org/docs/ref/api.html). However, the
  `Value` and `Text` fields might be merged into `Data`.
- In the `p.<node id>` subscription, the `Type` and `Key` now would come from
  the message subject.

### Bi-Directional Synchronization

Bi-directional synchronization between two instances may be accomplished by
having two streams for every node. The head of both incoming and outgoing
streams is looked at to determine the current state. If points of the same type
exist in both streams, the point with the latest timestamp wins. In reality, 99%
of the time, one set of data will be set by the Leaf instance (ex: sensor
readings) and another set of data will be set by the upstream Hub instance (ex:
configuration settings) and there will be very little overlap.

![image-20240119094329917](./assets/image-20240119094329917.png)
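
The "latest timestamp wins" rule for the heads of the two streams might look like the sketch below. It assumes each input slice already holds only the newest point per type/key from one stream; the trimmed-down `Point`, the `pointID` key, and `mergeHeads` are hypothetical names, and breaking ties in favor of the first stream is an arbitrary choice.

```go
package main

import (
	"fmt"
	"time"
)

// Point is a reduced version of the SIOT point, enough to show the merge.
type Point struct {
	Type  string
	Key   string
	Time  time.Time
	Value float64
}

// pointID identifies a point within a node.
type pointID struct{ Type, Key string }

// mergeHeads combines the latest points of two streams. When both streams
// hold a point with the same Type and Key, the one with the later Time wins.
// On equal timestamps the point from a is kept.
func mergeHeads(a, b []Point) map[pointID]Point {
	merged := make(map[pointID]Point, len(a)+len(b))
	for _, p := range a {
		merged[pointID{p.Type, p.Key}] = p
	}
	for _, p := range b {
		id := pointID{p.Type, p.Key}
		if cur, ok := merged[id]; !ok || p.Time.After(cur.Time) {
			merged[id] = p
		}
	}
	return merged
}

func main() {
	now := time.Now()
	leaf := []Point{{Type: "temp", Key: "0", Time: now, Value: 21.5}}
	hub := []Point{{Type: "sampleRate", Key: "0", Time: now.Add(-time.Minute), Value: 10}}
	for id, p := range mergeHeads(leaf, hub) {
		fmt.Println(id.Type, id.Key, p.Value)
	}
}
```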

The question arises -- do we really need bi-directional synchronization and the
complexity of having two streams for every node? Every node includes some amount
of configuration which can flow down from upstream instances. Additionally, many
nodes are collecting data which needs to flow back upstream. So it seems a very
common need for every node to have data flowing in both directions. Since this
is a basic requirement, it does not seem like much of a stretch to allow any
data to flow in either stream, and then merge the streams at the endpoints where
the data is used.

### Does it make sense to use NATS to create merged streams?

NATS can source streams into an additional 3rd stream. This might be useful in
that you don't have to read two streams and merge the points to get the current
state. However, there are several disadvantages:

- data would be stored twice
- data is not guaranteed to be in chronological order -- the data would be
  inserted into the 3rd stream when it is received. So you would still have to
  walk back in history to know for sure if you had the latest point. It seems
  simpler to just read the head of two streams and compare them.

### Timestamps

NATS JetStream messages store a timestamp, but the timestamp is when the message
is inserted into the stream, not necessarily when the sample was taken. There
can be some delay between the NATS client sending the message and the server
processing it. Therefore, an additional high-resolution
[64-bit timestamp](https://docs.simpleiot.org/docs/adr/4-time.html) is added to
the beginning of each message.

### Edges

Edges are used to describe the connections between nodes. Nodes can exist in
multiple places in the tree. In the example below, `N2` is a child of both `N1`
and `N3`.

<img src="./assets/image-20240124112003398.png" alt="image-20240124112003398" style="zoom:67%;" />

Edges currently contain the up and downstream node IDs, an array of points, and
a node type. Putting the type in the edge made it efficient to traverse the tree
by loading edges from a SQLite table and indexing the IDs and type. With
JetStream it is less obvious how to store the edge information. SIOT regularly
traverses up and down the tree.

- down: to discover nodes
- up: to propagate points to up subjects

Because edges contain points that can change over time, edge points need to be
stored in a stream, much like we do the node points. If each node has its own
stream, then the child edges for the node could be stored in the same stream as
the node as shown above. This would allow us to traverse the node tree on
startup and perhaps cache all the edges. The following subject can be used for
edge points:

`p.<up node ID>.<down node ID>.<type>.<key>`

Again, this is very similar to the existing
[NATS API](https://docs.simpleiot.org/docs/ref/api.html#nats).

Two special points are present in every edge:

- `nodeType`: defines the type of the downstream node
- `tombstone`: set to true if the downstream node is deleted

One challenge with this model is that much of the code in SIOT uses a
`NodeEdge` datastructure which includes a node and its parent edge. This
collection of data describes this instance of a node and is more useful from a
client perspective. However, `NodeEdge`s are duplicated for every mirrored node
in the tree, so they don't really make sense from a storage and synchronization
perspective. This will likely become more clear after some implementation work.

### NATS `up.*` subjects

In SIOT, we partition the system using the tree structure and nodes that listen
for messages (databases, messaging services, rules, etc.) subscribe to the
`up.*` subject of their parent node. In the example below, each group has its
own database configuration and the Db node only receives points generated in the
group it belongs to. This provides an opportunity for any node at any level in
the tree to listen to messages of another node, as long as:

1. it is equal or higher in the structure
2. it shares an ancestor.

<img src="./assets/image-20240124104619281.png" alt="image-20240124104619281" style="zoom:67%;" />

The use of "up" subjects would not have to change other than the logic that
re-broadcasts points to "up" subjects would need to use the edge cache instead
of querying the SQLite database for edges.
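
A minimal sketch of what such an edge cache lookup could look like is shown below: an in-memory map from each node to its parents, walked upward to find every ancestor whose "up" subject should receive a point. The `edgeCache` type and `ancestors` method are hypothetical, and the sketch deliberately stops at ancestor IDs rather than guessing the exact "up" subject layout.

```go
package main

import "fmt"

// edgeCache maps a node ID to the IDs of its parent nodes. It is assumed to
// be built from the edge points at startup and kept up to date afterwards.
type edgeCache map[string][]string

// ancestors returns every node above nodeID, nearest parents first.
// A node reachable through several paths is reported once.
func (c edgeCache) ancestors(nodeID string) []string {
	seen := map[string]bool{}
	var out []string
	queue := []string{nodeID}
	for len(queue) > 0 {
		id := queue[0]
		queue = queue[1:]
		for _, parent := range c[id] {
			if !seen[parent] {
				seen[parent] = true
				out = append(out, parent)
				queue = append(queue, parent)
			}
		}
	}
	return out
}

func main() {
	// N2 is a child of both N1 and N3, as in the Edges example.
	cache := edgeCache{
		"N2": {"N1", "N3"},
		"N1": {"root"},
		"N3": {"root"},
	}
	fmt.Println(cache.ancestors("N2"))
}
```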

### AuthN/AuthZ

Authorization typically needs to happen at device or group boundaries. Devices
or users will need to be authorized. Users
[have access](https://docs.simpleiot.org/docs/user/users-groups.html) to all
nodes in their parent group or device. If each node has its own stream, that
will simplify AuthZ. Each device or user is explicitly granted permission to
all of the nodes they have access to. If a new node is created that is a child
of a node a user has permission to view, this new node (and the subsequent
streams) is added to the list.
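
One way to picture the permission list is to walk down from the user's parent node and emit one subscribe pattern per reachable node. This is only a sketch: the `children` cache and `allowedSubjects` function are hypothetical, and expressing permissions as `p.<node id>.>` subject patterns (using the NATS `>` wildcard) is an assumption, since the text above only says access is granted per node.

```go
package main

import "fmt"

// children maps a node ID to the IDs of its child nodes. It is assumed to
// come from the same edge information used elsewhere in this document.
type children map[string][]string

// allowedSubjects walks down from root and returns one subject pattern per
// reachable node. ">" matches one or more trailing subject tokens.
func allowedSubjects(c children, root string) []string {
	seen := map[string]bool{root: true}
	queue := []string{root}
	var subjects []string
	for len(queue) > 0 {
		id := queue[0]
		queue = queue[1:]
		subjects = append(subjects, "p."+id+".>")
		for _, child := range c[id] {
			if !seen[child] {
				seen[child] = true
				queue = append(queue, child)
			}
		}
	}
	return subjects
}

func main() {
	tree := children{
		"group1": {"dev1", "dev2"},
		"dev1":   {"sensor1"},
	}
	for _, s := range allowedSubjects(tree, "group1") {
		fmt.Println(s)
	}
}
```

When a new child node is created, re-running the walk (or appending the one new pattern) keeps the list in step with the tree.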

### Are we optimizing the right thing?

Any time you move away from a SQL database, you should
[think long and hard](http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/)
about this. Additionally, there are very nice time-series database solutions out
there. So we should have good reasons for inventing yet-another-database.
However, mainstream SQL and time-series databases all have one big drawback:
they don't support synchronizing subsets of data between distributed systems.

With system design, one approach is to order the problems you are solving by
difficulty with the top of the list being most important/difficult, and then
optimize the system to solve the hard problems first.

1. Synchronizing subsets of data between distributed systems (including history)
2. Be small and efficient enough to deploy at the edge
3. Real-time response
4. Efficient searching through history
5. Flexible data storage/schema
6. Querying nodes and state
7. Arbitrary relationships between data
8. Data encode/decode performance

The number of devices and nodes in systems SIOT is targeting is relatively
small, thus the current node topology can be cached in memory. The history is a
much bigger dataset so using a stream to synchronize, store, and retrieve
time-series data makes a lot of sense.

On #7, will we ever need arbitrary relationships between data? With the node
graph, we can do this fairly well. Edges contain points that can be used to
further characterize the relationship between nodes. With IoT systems, the
relationships between nodes are mostly determined by physical proximity. A
Modbus sensor is connected to a Modbus bus, which is connected to a Gateway,
which is located at a site, which belongs to a customer.

On #8, the network is relatively slow compared to anything else, so if it takes
a little more time to encode/decode data this is typically not a big deal as the
network is the bottleneck.

With an IoT system, the data is primarily 1) sequential in time, and 2)
hierarchical in structure. Thus the streaming/tree approach still appears to be
the best approach.

### Questions

- How chatty is the NATS Leaf-node protocol? Is it efficient enough to use over
  low-bandwidth Cat-M cellular connections (~20-100 Kbps)?
- Is it practical to have 2 streams for every node? A typical edge device may
  have 30 nodes, so this is 60 streams to synchronize. Is the overhead to source
  this many streams over a leaf connection prohibitive?
- Would it make sense to create streams at the device/instance boundaries rather
  than node boundaries?
  - this may limit our AuthZ capabilities where we want to give some users
    access to only part of a cloud instance.
- How robust is the JetStream store compared to SQLite in events like
  [power loss](https://www.sqlite.org/transactional.html)?
- Are there any other features of NATS/JetStream that we should be considering?

## Experiments

Several POC experiments have been run to prove the feasibility of this:

https://github.com/simpleiot/nats-exp

## Decision

Implementation could be broken down into 3 steps:

1. message/subject encoding changes
1. switch store from SQLite to JetStream
1. use JetStream to sync between systems

objections/concerns

## Consequences

what is the impact, both negative and positive.

## Additional Notes/Reference