github.com/simpleiot/simpleiot@v0.18.3/docs/adr/7-jetstream-store.md

# JetStream SIOT Store

- Author: Cliff Brake, last updated: 2024-01-24
- Status: discussion

## Problem

SQLite has worked well as a SIOT store. There are a few things we would like to
improve:

- synchronization of history
  - currently, if a device or server is offline, only the latest state is
    transferred when connected. We would like all history that has accumulated
    when offline to be transferred once reconnected.
- we want history at the edge as well as cloud
  - this allows us to use history at the edge to run more advanced algorithms
    like AI
- we currently have to re-compute hashes all the way to the root node anytime
  something changes
  - this may not scale to larger systems
  - is difficult to get right if things are changing while we re-compute hashes
    -- it requires some type of coordination between the distributed systems,
    which we currently don't have.

## Context/Discussion

The purpose of this document is to explore storing SIOT state in a NATS
JetStream store. SIOT data is stored in a tree of nodes and each node contains
an array of points. Note, the term **"node"** in this document represents a data
structure in a tree, not a physical computer or SIOT instance. The term
**"instance"** will be used to represent devices or SIOT instances.

![nodes](./assets/nodes.png)

Nodes are arranged in a
[directed acyclic graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph).

<img src="./assets/image-20240124105741250.png" alt="image-20240124105741250" style="zoom: 33%;" />

A subset of this tree is synchronized between various instances as shown in the
example below:

![SIOT example tree](./assets/cloud-device-node-tree.png)

The tree topology can be as deep as required to describe the system. To date,
only the current state of a node is synchronized and history (if needed) is
stored externally in a time-series database like InfluxDB and is not
synchronized. The node tree is an excellent data model for IoT systems.

Each node contains an array of points that represent the state of the node. The
points contain a type and a key. The key can be used to describe maps and
arrays. We keep points separate so they can all be updated independently and
easily merged.

With JetStream, we could store points in a stream where the head of the stream
represents the current state of a node or collection of nodes. Each point is
stored in a separate NATS subject.

![image-20240119093623132](./assets/image-20240119093623132.png)

NATS JetStream is a stream-based store where every message in a stream is given
a sequence number. Synchronization is simple in that if a sequence number does
not exist on a remote system, the missing messages are sent.

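The sequence-number idea can be illustrated with a toy in-memory log. This is
only a sketch of the concept, not the JetStream API: the `Stream`, `Msg`, and
`CatchUp` names are hypothetical, and a real system would let NATS perform the
catch-up.

```go
package main

import "fmt"

// Msg is a hypothetical stand-in for a stored stream message.
type Msg struct {
	Seq     uint64 // sequence number assigned by the single writer
	Subject string
	Data    []byte
}

// Stream is a toy append-only log. It is NOT the JetStream API.
type Stream struct {
	msgs []Msg
}

// LastSeq returns the highest stored sequence number, or 0 if empty.
func (s *Stream) LastSeq() uint64 {
	if len(s.msgs) == 0 {
		return 0
	}
	return s.msgs[len(s.msgs)-1].Seq
}

// Append stores a message under the next sequence number.
func (s *Stream) Append(subject string, data []byte) uint64 {
	seq := s.LastSeq() + 1
	s.msgs = append(s.msgs, Msg{Seq: seq, Subject: subject, Data: data})
	return seq
}

// Since returns all messages with a sequence number greater than seq.
func (s *Stream) Since(seq uint64) []Msg {
	var out []Msg
	for _, m := range s.msgs {
		if m.Seq > seq {
			out = append(out, m)
		}
	}
	return out
}

// CatchUp copies every message the replica is missing and reports how many
// were sent. Sequence numbers are preserved.
func CatchUp(src, replica *Stream) int {
	missing := src.Since(replica.LastSeq())
	replica.msgs = append(replica.msgs, missing...)
	return len(missing)
}

func main() {
	leaf, hub := &Stream{}, &Stream{}
	leaf.Append("p.n1.temp.0", []byte("21.5"))
	CatchUp(leaf, hub)

	// The connection drops; the leaf keeps recording history.
	leaf.Append("p.n1.temp.0", []byte("21.7"))
	leaf.Append("p.n1.temp.0", []byte("21.9"))

	// On reconnect, only the missing messages are transferred.
	fmt.Println("sent:", CatchUp(leaf, hub), "hub last seq:", hub.LastSeq())
}
```

Because only one side assigns sequence numbers, the replica never has to
resolve conflicts; it just asks for everything after its last sequence number.
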
NATS also supports leaf nodes (instances) and streams can be synchronized
between hub and leaf instances. If they are disconnected, then streams are
"caught up" after the connection is made again.

Several experiments have been run to understand the basic JetStream
functionality in [this repo](https://github.com/simpleiot/nats-exp).

1. storing and extracting points in a stream
1. using streams to store time-series data and measure performance
1. syncing streams between the hub and leaf instances

### Advantages of JetStream

- JetStream is built into NATS, which we already embed and use.
- History can be stored in a NATS stream instead of externally. Currently, we
  use an external store like InfluxDB to store history.
- JetStream streams can be synchronized between instances.
- JetStream has various retention models so old data can automatically be
  dropped.
- Leverage the NATS AuthN/AuthZ features.
- JetStream is a natural extension of core NATS, so many of the core SIOT
  concepts are still valid and do not need to change.

### Challenges with moving to JetStream

- streams are typically synchronized in one direction. This is a challenge for
  SIOT as the basic premise is data can be modified in any location where a
  user/device has proper permissions. A user may change a configuration in a
  cloud portal or on a local touch-screen.
- sequence numbers must be set by one instance, so you can't have both a leaf
  and a hub instance inserting data into a single stream. This has benefits in
  that it is a very simple and reliable model.
- we are constrained by a simple message subject to label and easily query data.
  This is less flexible than a SQL database, but this constraint can also be an
  advantage in that it forces us into a simple and consistent data model.
- SQLite has a built-in cache. We would likely need to create our own with
  JetStream.

### JetStream consistency model

From this [discussion](https://github.com/nats-io/nats-server/discussions/4577):

> When the doc mentions immediate consistency, it is in contrast to
> [eventual consistency](https://en.wikipedia.org/wiki/Eventual_consistency). It
> is about how 'writes' (i.e. publishing a message to a stream).
>
> JetStream is an immediately consistent distributed storage system in that
> every new message stored in the stream is done so in a unique order (when
> those messages reach the stream leader) and that the acknowledgment that the
> storing of the message has been successful only happens as the result of a
> RAFT vote between the NATS JetStream servers (e.g. 3 of them if replicas=3)
> handling the stream.
>
> This means that when a publishing application receives the positive
> acknowledgement to it's publication to the stream you are guaranteed that
> everyone will see that new message in their updates _in the same order_ (and
> with the same sequence number and time stamp).
>
> This 'non-eventual' consistency is what enables 'compare and set' (i.e.
> compare and publish to a stream) operations on streams: because there can only
> be one new message added to a stream at a time.
>
> To map back to those formal consistency models it means that for writes, NATS
> JetStream is
> [Linearizable](https://jepsen.io/consistency/models/linearizable).

Currently SIOT uses a more "eventually" consistent model where we use data
structures with some lightweight CRDT properties. However, this has the
disadvantage that we have to do things like hash the entire node tree to know if
anything has changed. In a more static system where not much is changing, this
works pretty well, but in a dynamic IoT system where data is changing all the
time, it is hard to scale this model.

### Message/Subject encoding

In the past, we've used the
[Point data structure](https://docs.simpleiot.org/docs/adr/1-consider-changing-point-data-type.html#proposal-2).
This has worked extremely well at representing reasonably complex data
structures (including maps and arrays) for a node. It also has limitations and
constraints that have proven useful in making data easy to store, transmit, and
merge.

```go
import "time"

// Point is a flexible data structure that can be used to represent
// a sensor value or a configuration parameter.
// Type and Key uniquely identify a point in a node.
type Point struct {
	//-------------------------------------------------------
	// The first two fields uniquely identify a point when receiving updates

	// Type of point (voltage, current, key, etc)
	Type string `json:"type,omitempty"`

	// Key is used to allow a group of points to represent a map or array
	Key string `json:"key,omitempty"`

	//-------------------------------------------------------
	// The following fields are the values for a point

	// Time the point was taken
	Time time.Time `json:"time,omitempty" yaml:"-"`

	// Instantaneous analog or digital value of the point.
	// 0 and 1 are used to represent digital values
	Value float64 `json:"value,omitempty"`

	// Optional text value of the point for data that is best represented
	// as a string rather than a number.
	Text string `json:"text,omitempty"`

	// catchall field for data that does not fit into float or string --
	// should be used sparingly
	Data []byte `json:"data,omitempty"`

	//-------------------------------------------------------
	// Metadata

	// Used to indicate a point has been deleted. This value is only
	// ever incremented. Odd values mean point is deleted.
	Tombstone int `json:"tombstone,omitempty"`

	// Where did this point come from. If from the owning node, it may be blank.
	Origin string `json:"origin,omitempty"`
}
```

With JetStream, the `Type` and `Key` can be encoded in the message subject:

`p.<node id>.<type>.<key>`

Message subjects are indexed in a stream, so NATS can quickly find messages for
any subject in a stream without scanning the entire stream (see
[discussion 1](https://github.com/nats-io/nats-server/discussions/3772) and
[discussion 2](https://github.com/nats-io/nats-server/discussions/4170)).

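A minimal sketch of building and parsing such a subject follows. The function
names are hypothetical. The sketch rejects empty tokens and tokens containing
`.`, `*`, `>`, or whitespace, since those are not usable inside a single NATS
subject token; how SIOT would map an empty `Key` to a token is an open question
this sketch does not answer.

```go
package main

import (
	"fmt"
	"strings"
)

// PointSubject builds "p.<node id>.<type>.<key>". (Hypothetical helper.)
func PointSubject(nodeID, typ, key string) (string, error) {
	for _, tok := range []string{nodeID, typ, key} {
		// "." separates tokens, "*" and ">" are wildcards, and empty
		// tokens or whitespace are not valid in a NATS subject.
		if tok == "" || strings.ContainsAny(tok, ".*> \t\r\n") {
			return "", fmt.Errorf("invalid subject token %q", tok)
		}
	}
	return "p." + nodeID + "." + typ + "." + key, nil
}

// ParsePointSubject recovers the node ID, type, and key from a subject.
func ParsePointSubject(subject string) (nodeID, typ, key string, err error) {
	parts := strings.Split(subject, ".")
	if len(parts) != 4 || parts[0] != "p" {
		return "", "", "", fmt.Errorf("not a point subject: %q", subject)
	}
	return parts[1], parts[2], parts[3], nil
}

func main() {
	s, err := PointSubject("n1", "temp", "0")
	if err != nil {
		panic(err)
	}
	id, typ, key, _ := ParsePointSubject(s)
	fmt.Println(s, id, typ, key)
}
```
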
Over time, the Point structure has been simplified. For instance, it used to
also have an `Index` field, but we have learned we can use a single `Key` field
instead. At this point it may make sense to simplify the payload. One idea is to
do away with the `Value` and `Text` fields and simply have a `Data` field. The
components that use the points have to know the data type anyway to know if they
should use the `Value` or `Text` field. In the past, protobuf encoding was used
as we started with quite a few fields and it provided some flexibility and
convenience. But as we have reduced the number of fields and two of them are now
encoded in the message subject, it may be simpler to have a simple encoding for
`Time`, `Data`, `Tombstone`, and `Origin` in the message payload. The code using
the message would be responsible for converting `Data` into whatever data type
is needed. This would open up the opportunity to encode any type of payload in
the `Data` field in the future.

#### Message payload:

- `Time` (uint64)
- `Tombstone` (byte)
- `OriginLen` (byte)
- `Origin` (string)
- `Data Type` (byte)
- `Data` (length determined by the message length minus the length of the above
  fields)

Examples of types:

- 0 - unknown or custom
- 1 - float (32, or 64 bit)
- 2 - int (8, 16, 32, or 64 bit)
- 3 - uint (8, 16, 32, or 64 bit)
- 4 - string
- 5 - JSON
- 6 - Protobuf

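The layout above could be encoded roughly as follows. This is a sketch under
stated assumptions: the byte order (big-endian) and the time representation
(Unix nanoseconds) are not specified in this document, and the `Payload` type
and its helpers are hypothetical names.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"time"
)

// Data type codes, following the example list above.
const (
	TypeUnknown byte = iota
	TypeFloat
	TypeInt
	TypeUint
	TypeString
	TypeJSON
	TypeProtobuf
)

// Payload mirrors the proposed message payload fields.
type Payload struct {
	Time      time.Time
	Tombstone byte
	Origin    string
	DataType  byte
	Data      []byte
}

// Encode packs: Time(8) | Tombstone(1) | OriginLen(1) | Origin | DataType(1) | Data.
func (p Payload) Encode() ([]byte, error) {
	if len(p.Origin) > 255 {
		return nil, errors.New("origin longer than 255 bytes")
	}
	buf := make([]byte, 10, 11+len(p.Origin)+len(p.Data))
	// Assumption: big-endian Unix nanoseconds.
	binary.BigEndian.PutUint64(buf[0:8], uint64(p.Time.UnixNano()))
	buf[8] = p.Tombstone
	buf[9] = byte(len(p.Origin))
	buf = append(buf, p.Origin...)
	buf = append(buf, p.DataType)
	buf = append(buf, p.Data...)
	return buf, nil
}

// DecodePayload reverses Encode. Data is whatever remains after the fixed
// fields, so no explicit data length is stored. Data aliases the input slice.
func DecodePayload(b []byte) (Payload, error) {
	if len(b) < 11 {
		return Payload{}, errors.New("payload too short")
	}
	originLen := int(b[9])
	if len(b) < 11+originLen {
		return Payload{}, errors.New("payload truncated in origin")
	}
	return Payload{
		Time:      time.Unix(0, int64(binary.BigEndian.Uint64(b[0:8]))),
		Tombstone: b[8],
		Origin:    string(b[10 : 10+originLen]),
		DataType:  b[10+originLen],
		Data:      b[11+originLen:],
	}, nil
}

// examplePayload returns sample data used by main.
func examplePayload() Payload {
	return Payload{
		Time:     time.Unix(1706054400, 500),
		Origin:   "n42",
		DataType: TypeString,
		Data:     []byte("hello"),
	}
}

func main() {
	b, err := examplePayload().Encode()
	if err != nil {
		panic(err)
	}
	out, err := DecodePayload(b)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(b), out.Origin, string(out.Data))
}
```

With an empty `Origin` and no data, the fixed overhead in this sketch is 11
bytes per message.
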
Putting `Origin` in the message payload will make it inefficient to query by
origin, as you will need to scan and decode all messages. Are there any cases
where we will need to do this? (This is an example where a SQL database is more
flexible.) One solution would be to create another stream where the origin is in
the subject.

There are times when the current point model does not fit very well -- for
instance when sending a notification -- this is difficult to encode in an array
of points. I think in these cases encoding the notification data as JSON
probably makes more sense and this encoding should work much better.

#### Can't send multiple points in a message

In the past, it was common to send multiple points in a message for a node --
for instance when creating a node, or updating an array. However, with the
`type` and `key` encoded in the subject this will no longer work. What is the
implication for having separate messages?

- will be more complex to create nodes
- when updating an array/map in a node, it will not be updated all at once, but
  over the time it takes all the points to come into the client.
- there is still value in arrays being encoded as points -- for instance a relay
  device that contains two relays. However, for configuration are we better
  served by encoding the struct in the data field as JSON and updating it as
  an atomic unit?

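The trade-off in the last bullet can be sketched by comparing the two
encodings. Everything here is illustrative: the `RelayConfig` struct, the
`relayName` and `config` point types, and the `Message` type are hypothetical
and not part of SIOT.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// Message is a hypothetical subject/payload pair.
type Message struct {
	Subject string
	Data    []byte
}

// RelayConfig is a hypothetical configuration for a relay device.
type RelayConfig struct {
	Names   []string `json:"names"`
	PulseMs int      `json:"pulseMs"`
}

// AsPoints emits one message per array element. Each element can be updated
// and merged independently, but a reader may see a partially updated array.
func AsPoints(nodeID string, names []string) []Message {
	msgs := make([]Message, 0, len(names))
	for i, n := range names {
		msgs = append(msgs, Message{
			Subject: "p." + nodeID + ".relayName." + strconv.Itoa(i),
			Data:    []byte(n),
		})
	}
	return msgs
}

// AsJSON emits the whole struct in one message, so it changes atomically.
func AsJSON(nodeID string, cfg RelayConfig) (Message, error) {
	data, err := json.Marshal(cfg)
	if err != nil {
		return Message{}, err
	}
	return Message{Subject: "p." + nodeID + ".config.0", Data: data}, nil
}

func main() {
	cfg := RelayConfig{Names: []string{"pump", "fan"}, PulseMs: 250}
	for _, m := range AsPoints("n1", cfg.Names) {
		fmt.Println(m.Subject, string(m.Data))
	}
	m, err := AsJSON("n1", cfg)
	if err != nil {
		panic(err)
	}
	fmt.Println(m.Subject, string(m.Data))
}
```
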
### UI Implications

Because NATS and JetStream subjects overlap, the UI could
[subscribe to the current state changes](https://github.com/simpleiot/simpleiot/tree/master/frontend/lib)
much as is done today. A few things would need to change:

- Getting the initial state could still use the
  [NATS `nodes` API](https://docs.simpleiot.org/docs/ref/api.html). However, the
  `Value` and `Text` fields might be merged into `Data`.
- In the `p.<node id>` subscription, the `Type` and `Key` now would come from
  the message subject.

### Bi-Directional Synchronization

Bi-directional synchronization between two instances may be accomplished by
having two streams for every node. The head of both incoming and outgoing
streams is looked at to determine the current state. If points of the same type
exist in both streams, the point with the latest timestamp wins. In reality, 99%
of the time, one set of data will be set by the Leaf instance (ex: sensor
readings) and another set of data will be set by the upstream Hub instance (ex:
configuration settings) and there will be very little overlap.

![image-20240119094329917](./assets/image-20240119094329917.png)

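The merge rule can be sketched as below. This is an illustration, not SIOT
code: the `Head` type and its methods are hypothetical, points are identified
by type and key, and on an exact timestamp tie the point seen first is kept (a
choice this document does not specify).

```go
package main

import (
	"fmt"
	"time"
)

// Point is a reduced version of the SIOT point for this sketch.
type Point struct {
	Type, Key string
	Time      time.Time
	Value     float64
}

// Head holds the newest point per [type, key] for one stream.
type Head map[[2]string]Point

// Update records p only if it is newer than the point already held.
func (h Head) Update(p Point) {
	id := [2]string{p.Type, p.Key}
	if cur, ok := h[id]; !ok || p.Time.After(cur.Time) {
		h[id] = p
	}
}

// Get looks up the current point for a type/key pair.
func (h Head) Get(typ, key string) (Point, bool) {
	p, ok := h[[2]string{typ, key}]
	return p, ok
}

// Merge combines the heads of the incoming and outgoing streams. When a
// point exists in both, the one with the latest timestamp wins.
func Merge(incoming, outgoing Head) Head {
	state := make(Head, len(incoming)+len(outgoing))
	for _, h := range []Head{incoming, outgoing} {
		for _, p := range h {
			state.Update(p)
		}
	}
	return state
}

// at builds a timestamp from Unix seconds.
func at(sec int64) time.Time { return time.Unix(sec, 0) }

func main() {
	incoming, outgoing := Head{}, Head{}
	// Sensor reading set by the leaf instance.
	outgoing.Update(Point{Type: "temp", Key: "0", Time: at(150), Value: 21.5})
	// The same setting changed on both sides; the later change wins.
	incoming.Update(Point{Type: "sampleRate", Key: "0", Time: at(100), Value: 10})
	outgoing.Update(Point{Type: "sampleRate", Key: "0", Time: at(200), Value: 5})

	state := Merge(incoming, outgoing)
	p, _ := state.Get("sampleRate", "0")
	fmt.Println(len(state), p.Value)
}
```
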
The question arises -- do we really need bi-directional synchronization and the
complexity of having two streams for every node? Every node includes some amount
of configuration which can flow down from upstream instances. Additionally, many
nodes are collecting data which needs to flow back upstream. So it seems a very
common need for every node to have data flowing in both directions. Since this
is a basic requirement, it does not seem like much of a stretch to allow any
data to flow in either stream, and then merge the streams at the endpoints where
the data is used.

### Does it make sense to use NATS to create merged streams?

NATS can source streams into an additional 3rd stream. This might be useful in
that you don't have to read two streams and merge the points to get the current
state. However, there are several disadvantages:

- data would be stored twice
- data is not guaranteed to be in chronological order -- the data would be
  inserted into the 3rd stream when it is received. So you would still have to
  walk back in history to know for sure if you had the latest point. It seems
  simpler to just read the head of two streams and compare them.

### Timestamps

NATS JetStream messages store a timestamp, but the timestamp is when the message
is inserted into the stream, not necessarily when the sample was taken. There
can be some delay between the NATS client sending the message and the server
processing it. Therefore, an additional high-resolution
[64-bit timestamp](https://docs.simpleiot.org/docs/adr/4-time.html) is added to
the beginning of each message.

### Edges

Edges are used to describe the connections between nodes. Nodes can exist in
multiple places in the tree. In the example below, `N2` is a child of both `N1`
and `N3`.

<img src="./assets/image-20240124112003398.png" alt="image-20240124112003398" style="zoom:67%;" />

Edges currently contain the up and downstream node IDs, an array of points, and
a node type. Putting the type in the edge made it efficient to traverse the tree
by loading edges from a SQLite table and indexing the IDs and type. With
JetStream it is less obvious how to store the edge information. SIOT regularly
traverses up and down the tree.

- down: to discover nodes
- up: to propagate points to up subjects

Because edges contain points that can change over time, edge points need to be
stored in a stream, much like node points. If each node has its own stream, then
the child edges for the node could be stored in the same stream as the node as
shown above. This would allow us to traverse the node tree on startup and
perhaps cache all the edges. The following subject can be used for edge points:

`p.<up node ID>.<down node ID>.<type>.<key>`

Again, this is very similar to the existing
[NATS API](https://docs.simpleiot.org/docs/ref/api.html#nats).

Two special points are present in every edge:

- `nodeType`: defines the type of the downstream node
- `tombstone`: set to true if the downstream node is deleted

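An edge cache built from these subjects might look like the sketch below. The
`EdgeCache` type is hypothetical, the key token `0` and the text encoding of
the tombstone value (`"true"`) are assumptions, and the linear scan is only
for brevity; a real cache would likely index by node ID.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Edge is a hypothetical cached view of one up/down relationship.
type Edge struct {
	NodeType string // from the `nodeType` edge point
	Deleted  bool   // from the `tombstone` edge point
}

// EdgeCache indexes edges by [up node ID, down node ID].
type EdgeCache map[[2]string]*Edge

// Apply folds one edge-point message into the cache. It returns false when
// the subject is not of the form p.<up>.<down>.<type>.<key>.
func (c EdgeCache) Apply(subject, text string) bool {
	parts := strings.Split(subject, ".")
	if len(parts) != 5 || parts[0] != "p" {
		return false
	}
	id := [2]string{parts[1], parts[2]}
	e, ok := c[id]
	if !ok {
		e = &Edge{}
		c[id] = e
	}
	switch parts[3] {
	case "nodeType":
		e.NodeType = text
	case "tombstone":
		e.Deleted = text == "true" // assumed encoding of the value
	}
	return true
}

// neighbors returns the sorted IDs on the other side of live edges.
// side 0 treats id as the upstream node, side 1 as the downstream node.
func (c EdgeCache) neighbors(id string, side int) []string {
	var out []string
	for k, e := range c {
		if k[side] == id && !e.Deleted {
			out = append(out, k[1-side])
		}
	}
	sort.Strings(out)
	return out
}

// Children supports walking down the tree to discover nodes.
func (c EdgeCache) Children(up string) []string { return c.neighbors(up, 0) }

// Parents supports walking up the tree to propagate points.
func (c EdgeCache) Parents(down string) []string { return c.neighbors(down, 1) }

func main() {
	c := EdgeCache{}
	// N2 is a child of both N1 and N3.
	c.Apply("p.N1.N2.nodeType.0", "device")
	c.Apply("p.N3.N2.nodeType.0", "device")
	fmt.Println(c.Parents("N2"))

	// Deleting the N3 -> N2 edge leaves N2 under N1 only.
	c.Apply("p.N3.N2.tombstone.0", "true")
	fmt.Println(c.Parents("N2"))
}
```
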
One challenge with this model is that much of the code in SIOT uses a
`NodeEdge` data structure which includes a node and its parent edge. This
collection of data describes this instance of a node and is more useful from a
client perspective. However, `NodeEdge` structures are duplicated for every
mirrored node in the tree, so they don't really make sense from a storage and
synchronization perspective. This will likely become more clear after some
implementation work.

### NATS `up.*` subjects

In SIOT, we partition the system using the tree structure and nodes that listen
for messages (databases, messaging services, rules, etc.) subscribe to the
`up.*` stream of their parent node. In the example below, each group has its own
database configuration and the Db node only receives points generated in the
group it belongs to. This provides an opportunity for any node at any level in
the tree to listen to messages of another node, as long as:

1. it is equal or higher in the structure, and
2. it shares an ancestor.

<img src="./assets/image-20240124104619281.png" alt="image-20240124104619281" style="zoom:67%;" />

The use of "up" subjects would not have to change other than the logic that
re-broadcasts points to "up" subjects would need to use the edge cache instead
of querying the SQLite database for edges.

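A rough sketch of that re-broadcast logic, driven by an in-memory parent map
rather than SQLite, is shown below. The `Parents` type is hypothetical, and
the `up.<upstream ID>.<node ID>` subject form is an assumption about the
existing API; check the NATS API reference before relying on it.

```go
package main

import (
	"fmt"
	"sort"
)

// Parents maps a node ID to the IDs of its upstream nodes. In practice this
// would be derived from the edge cache.
type Parents map[string][]string

// Ancestors returns every node above id. Because a node can have several
// parents, each ancestor is visited only once.
func (p Parents) Ancestors(id string) []string {
	seen := map[string]bool{}
	var walk func(string)
	walk = func(n string) {
		for _, up := range p[n] {
			if !seen[up] {
				seen[up] = true
				walk(up)
			}
		}
	}
	walk(id)

	out := make([]string, 0, len(seen))
	for n := range seen {
		out = append(out, n)
	}
	sort.Strings(out) // deterministic order for display
	return out
}

// UpSubjects lists the subjects a point from nodeID would be re-broadcast
// on, one per ancestor. The subject form is an assumption.
func (p Parents) UpSubjects(nodeID string) []string {
	anc := p.Ancestors(nodeID)
	subjects := make([]string, len(anc))
	for i, up := range anc {
		subjects[i] = "up." + up + "." + nodeID
	}
	return subjects
}

func main() {
	tree := Parents{
		"sensor1": {"device1"},
		"device1": {"groupA"},
		"db":      {"groupA"},
		"groupA":  {"root"},
	}
	// A Db node under groupA would listen on groupA's up subjects and so
	// receive points from sensor1.
	for _, s := range tree.UpSubjects("sensor1") {
		fmt.Println(s)
	}
}
```
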
### AuthN/AuthZ

Authorization typically needs to happen at device or group boundaries. Devices
or users will need to be authorized. Users
[have access](https://docs.simpleiot.org/docs/user/users-groups.html) to all
nodes in their parent group or device. If each node has its own stream, that
will simplify AuthZ. Each device or user is explicitly granted permission to
all of the nodes they have access to. If a new node is created that is a child
of a node a user has permission to view, this new node (and its streams) is
added to the list.

### Are we optimizing the right thing?

Any time you move away from a SQL database, you should
[think long and hard](http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/)
about this. Additionally, there are very nice time-series database solutions out
there. So we should have good reasons for inventing yet-another-database.
However, mainstream SQL and time-series databases all have one big drawback:
they don't support synchronizing subsets of data between distributed systems.

With system design, one approach is to order the problems you are solving by
difficulty with the top of the list being most important/difficult, and then
optimize the system to solve the hard problems first.

1. Synchronizing subsets of data between distributed systems (including history)
2. Be small and efficient enough to deploy at the edge
3. Real-time response
4. Efficient searching through history
5. Flexible data storage/schema
6. Querying nodes and state
7. Arbitrary relationships between data
8. Data encode/decode performance

The number of devices and nodes in systems SIOT is targeting is relatively
small, thus the current node topology can be cached in memory. The history is a
much bigger dataset so using a stream to synchronize, store, and retrieve
time-series data makes a lot of sense.

On #7, will we ever need arbitrary relationships between data? With the node
graph, we can do this fairly well. Edges contain points that can be used to
further characterize the relationship between nodes. With IoT systems, the
relationships between nodes are mostly determined by physical proximity. A
Modbus sensor is connected to a Modbus bus, which is connected to a Gateway,
which is located at a site, which belongs to a customer.

On #8, the network is relatively slow compared to anything else, so if it takes
a little more time to encode/decode data this is typically not a big deal as the
network is the bottleneck.

With an IoT system, the data is primarily 1) sequential in time, and 2)
hierarchical in structure. Thus the streaming/tree approach still appears to be
the best approach.

### Questions

- How chatty is the NATS Leaf-node protocol? Is it efficient enough to use over
  low-bandwidth Cat-M cellular connections (~20-100 Kbps)?
- Is it practical to have 2 streams for every node? A typical edge device may
  have 30 nodes, so this is 60 streams to synchronize. Is the overhead to source
  this many streams over a leaf connection prohibitive?
- Would it make sense to create streams at the device/instance boundaries rather
  than node boundaries?
  - this may limit our AuthZ capabilities where we want to give some users
    access to only part of a cloud instance.
- How robust is the JetStream store compared to SQLite in events like
  [power loss](https://www.sqlite.org/transactional.html)?
- Are there any other features of NATS/JetStream that we should be considering?

## Experiments

Several POC experiments have been run to prove the feasibility of this:

https://github.com/simpleiot/nats-exp

## Decision

Implementation could be broken down into 3 steps:

1. message/subject encoding changes
1. switch store from SQLite to JetStream
1. use JetStream to sync between systems

objections/concerns

## Consequences

what is the impact, both negative and positive.

## Additional Notes/Reference