github.com/simpleiot/simpleiot@v0.18.3/docs/adr/7-jetstream-store.md (about) 1 # JetStream SIOT Store 2 3 - Author: Cliff Brake, last updated: 2024-01-24 4 - Status: discussion 5 6 ## Problem 7 8 SQLite has worked well as a SIOT store. There are a few things we would like to 9 improve: 10 11 - synchronization of history 12 - currently, if a device or server is offline, only the latest state is 13 transferred when connected. We would like all history that has accumulated 14 when offline to be transferred once reconnected. 15 - we want history at the edge as well as cloud 16 - this allows us to use history at the edge to run more advanced algorithms 17 like AI 18 - we currently have to re-compute hashes all the way to the root node anytime 19 something changes 20 - this may not scale to larger systems 21 - is difficult to get right if things are changing while we re-compute hashes 22 -- it requires some type of coordination between the distributed systems, 23 which we currently don't have. 24 25 ## Context/Discussion 26 27 The purpose of this document is to explore storing SIOT state in a NATS 28 JetStream store. SIOT data is stored in a tree of nodes and each node contains 29 an array of points. Note, the term **"node"** in this document represents a data 30 structure in a tree, not a physical computer or SIOT instance. The term 31 **"instance"** will be used to represent devices or SIOT instances. 32 33  34 35 Nodes are arranged in a 36 [directed acyclic graph](https://en.wikipedia.org/wiki/Directed_acyclic_graph). 37 38 <img src="./assets/image-20240124105741250.png" alt="image-20240124105741250" style="zoom: 33%;" /> 39 40 A subset of this tree is synchronized between various instances as shown in the 41 below example: 42 43  44 45 The tree topology can be as deep as required to describe the system. To date, 46 only the current state of a node is synchronized and history (if needed) is 47 stored externally in a time-series database like InfluxDB and is not 48 synchronized. The node tree is an excellent data model for IoT systems. 49 50 Each node contains an array of points that represent the state of the node. The 51 points contain a type and a key. The key can be used to describe maps and 52 arrays. We keep points separate so they can all be updated independently and 53 easily merged. 54 55 With JetStream, we could store points in a stream where the head of the stream 56 represents the current state of a Node or collection of nodes. Each point is 57 stored in a separate NATS subject. 58 59  60 61 NATS Jetstream is a stream-based store where every message in a stream is given 62 a sequence number. Synchronization is simple in that if a sequence number does 63 not exist on a remote system, the missing messages are sent. 64 65 NATS also supports leaf nodes (instances) and streams can be synchronized 66 between hub and leaf instances. If they are disconnected, then streams are 67 "caught up" after the connection is made again. 68 69 Several experiments have been run to understand the basic JetStream 70 functionality in [this repo](https://github.com/simpleiot/nats-exp). 71 72 1. storing and extracting points in a stream 73 1. using streams to store time-series data and measure performance 74 1. syncing streams between the hub and leaf instances 75 76 ### Advantages of JetStream 77 78 - JetStream is built into NATS, which we already embed and use. 79 - History can be stored in a NATS stream instead of externally. Currently, we 80 use an external store like InfluxDB to store history. 81 - JetStream streams can be synchronized between instances. 82 - JetStream has various retention models so old data can automatically be 83 dropped. 84 - Leverage the NATS AuthN/AuthZ features. 85 - JetStream is a natural extension of core NATS, so many of the core SIOT 86 concepts are still valid and do not need to change. 87 88 ### Challenges with moving to JetStream 89 90 - streams are typically synchronized in one direction. This is a challenge for 91 SIOT as the basic premise is data can be modified in any location where a 92 user/device has proper permissions. A user may change a configuration in a 93 cloud portal or on a local touch-screen. 94 - sequence numbers must be set by one instance, so you can't have both a leaf 95 and hub nodes inserting data into a single stream. This has benefits in that 96 it is a very simple and reliable model. 97 - we are constrained by a simple message subject to label and easily query data. 98 This is less flexible than a SQL database, but this constraint can also be an 99 advantage in that it forces us into a simple and consistent data model. 100 - SQLite has a built-in cache. We would likely need to create our own with 101 JetStream. 102 103 ### JetStream consistency model 104 105 From this [discussion](https://github.com/nats-io/nats-server/discussions/4577): 106 107 > When the doc mentions immediate consistency, it is in contrast to 108 > [eventual consistency](https://en.wikipedia.org/wiki/Eventual_consistency). It 109 > is about how 'writes' (i.e. publishing a message to a stream). 110 > 111 > JetStream is an immediately consistent distributed storage system in that 112 > every new message stored in the stream is done so in a unique order (when 113 > those messages reach the stream leader) and that the acknowledgment that the 114 > storing of the message has been successful only happens as the result of a 115 > RAFT vote between the NATS JetStream servers (e.g. 3 of them if replicas=3) 116 > handling the stream. 117 > 118 > This means that when a publishing application receives the positive 119 > acknowledgement to it's publication to the stream you are guaranteed that 120 > everyone will see that new message in their updates _in the same order_ (and 121 > with the same sequence number and time stamp). 122 > 123 > This 'non-eventual' consistency is what enables 'compare and set' (i.e. 124 > compare and publish to a stream) operations on streams: because there can only 125 > be one new message added to a stream at a time. 126 > 127 > To map back to those formal consistency models it means that for writes, NATS 128 > JetStream is 129 > [Linearizable](https://jepsen.io/consistency/models/linearizable). 130 131 Currently SIOT uses a more "eventually" consistent model where we used data 132 structures with some light-weight CRDT proprieties. However this has the 133 disadvantage that we have to do things like hash the entire node tree to know if 134 anything has changed. In a more static system where not much is changing, this 135 works pretty well, but in a dynamic IoT system where data is changing all the 136 time, it is hard to scale this model. 137 138 ### Message/Subject encoding 139 140 In the past, we've used the 141 [Point datastructure](https://docs.simpleiot.org/docs/adr/1-consider-changing-point-data-type.html#proposal-2). 142 This has worked extremely well at representing reasonably complex data 143 structures (including maps and arrays) for a node. Yet it has limitations and 144 constraints that have proven useful it making data easy to store, transmit, and 145 merge. 146 147 ```go 148 // Point is a flexible data structure that can be used to represent 149 // a sensor value or a configuration parameter. 150 // ID, Type, and Index uniquely identify a point in a device 151 type Point struct { 152 //------------------------------------------------------- 153 //1st three fields uniquely identify a point when receiving updates 154 155 // Type of point (voltage, current, key, etc) 156 Type string `json:"type,omitempty"` 157 158 // Key is used to allow a group of points to represent a map or array 159 Key string `json:"key,omitempty"` 160 161 //------------------------------------------------------- 162 // The following fields are the values for a point 163 164 // Time the point was taken 165 Time time.Time `json:"time,omitempty" yaml:"-"` 166 167 // Instantaneous analog or digital value of the point. 168 // 0 and 1 are used to represent digital values 169 Value float64 `json:"value,omitempty"` 170 171 // Optional text value of the point for data that is best represented 172 // as a string rather than a number. 173 Text string `json:"text,omitempty"` 174 175 // catchall field for data that does not fit into float or string -- 176 // should be used sparingly 177 Data []byte `json:"data,omitempty"` 178 179 //------------------------------------------------------- 180 // Metadata 181 182 // Used to indicate a point has been deleted. This value is only 183 // ever incremented. Odd values mean point is deleted. 184 Tombstone int `json:"tombstone,omitempty"` 185 186 // Where did this point come from. If from the owning node, it may be blank. 187 Origin string `json:"origin,omitempty"` 188 } 189 ``` 190 191 With JetStream, the `Type`and `Key` can be encoded in the message subject: 192 193 `p.<node id>.<type>.<key>` 194 195 Message subjects are indexed in a stream, so NATS can quickly find messages for 196 any subject in a stream without scanning the entire stream (see 197 [discussion 1](https://github.com/nats-io/nats-server/discussions/3772) and 198 [discussion 2](https://github.com/nats-io/nats-server/discussions/4170)). 199 200 Over time, the Point structure has been simplified. For instance, it used to 201 also have an `Index` field, but we have learned we can use a single `Key` field 202 instead. At this point it may make sense to simplify the payload. One idea is to 203 do away with the `Value` and `Text` fields and simply have a `Data` field. The 204 components that use the points have to know the data-type anyway to know if they 205 should use the `Value` or `Text`field. In the past, protobuf encoding was used 206 as we started with quite a few fields and provided some flexibility and 207 convenience. But as we have reduced the number of fields and two of them are now 208 encoded in the message subject, it may be simpler to have a simple encoding for 209 `Time`, `Data`, `Tombstone`, and `Origin` in the message payload. The code using 210 the message would be responsible for convert `Data` into whatever data type is 211 needed. This would open up the opportunity to encode any type of payload in the 212 future in the `Data` field and be more flexible for the future. 213 214 #### Message payload: 215 216 - `Time` (uint64) 217 - `Tombstone` (byte) 218 - `OriginLen` (byte) 219 - `Origin` (string) 220 - `Data Type` (byte) 221 - `Data` (length determined by the message length subtracted by the length of 222 the above fields) 223 224 Examples of types: 225 226 - 0 - unknown or custom 227 - 1 - float (32, or 64 bit) 228 - 2 - int (8, 16, 32, or 64 bit) 229 - 3 - unit (8, 16, 32, or 65 bit) 230 - 4 - string 231 - 5 - JSON 232 - 6 - Protobuf 233 234 Putting `Origin` in the message subject will make it inefficient to query as you 235 will need to scan and decode all messages. Are there any cases where we will 236 need to do this? (this is an example where a SQL database is more flexible). One 237 solution would be to create another stream where the origin is in the subject. 238 239 There are times when the current point model does not fit very well -- for 240 instance when sending a notification -- this is difficult to encode in an array 241 of points. I think in these cases encoding the notification data as JSON 242 probably makes more sense and this encoding should work much better. 243 244 #### Can't send multiple points in a message 245 246 In the past, it was common to send multiple points in a message for a node -- 247 for instance when creating a node, or updating an array. However, with the 248 `type` and `key` encoded in the subject this will no longer work. What is the 249 implication for having separate messages? 250 251 - will be more complex to create nodes 252 - when updating an array/map in a node, it will not be updated all at once, but 253 over the time it takes all the points to come into the client. 254 - there is still value in arrays being encoded as points -- for instance a relay 255 devices that contains two relays. However, for configuration are we better 256 served by encoding the struct in a the data field as JSON and updating it as 257 an atomic unit? 258 259 ### UI Implications 260 261 Because NATS and JetStream subjects overlap, the UI could 262 [subscribe to the current state changes](https://github.com/simpleiot/simpleiot/tree/master/frontend/lib) 263 much as is done today. A few things would need to change: 264 265 - Getting the initial state could still use the 266 [NATS `nodes` API](https://docs.simpleiot.org/docs/ref/api.html). However, the 267 `Value` and `Text` fields might be merged into `Data`. 268 - In the `p.<node id>` subscription, the `Type` and `Key` now would come from 269 the message subject. 270 271 ### Bi-Directional Synchronization 272 273 Bi-directional synchronization between two instances may be accomplished by 274 having two streams for every node. The head of both incoming and outgoing 275 streams is looked at to determine the current state. If points of the same type 276 exist in both streams, the point with the latest timestamp wins. In reality, 99% 277 of the time, one set of data will be set by the Leaf instance (ex: sensor 278 readings) and another set of data will be set by the upstream Hub instance (ex: 279 configuration settings) and there will be very little overlap. 280 281  282 283 The question arises -- do we really need bi-directional synchronization and the 284 complexity of having two streams for every node? Every node includes some amount 285 of configuration which can flow down from upstream instances. Additionally, many 286 nodes are collecting data which needs to flow back upstream. So it seems a very 287 common need for every node to have data flowing in both directions. Since this 288 is a basic requirement, it does not seem like much of stretch to allow any data 289 to flow in either stream, and then merge the streams at the endpoints where the 290 data is used . 291 292 ### Does it make sense to use NATS to create merged streams? 293 294 NATS can source streams into an additional 3rd stream. This might be useful in 295 that you don't have to read two streams and merge the points to get the current 296 state. However, there are several disadvantages: 297 298 - data would be stored twice 299 - data is not guaranteed to be in chronological order -- the data would be 300 inserted into the 3rd stream when it is received. So you would still have to 301 walk back in history to know for sure if you had the latest point. It seems 302 simpler to just read the head of two streams and compare them. 303 304 ### Timestamps 305 306 NATS JetStream messages store a timestamp, but the timestamp is when the message 307 is inserted into the stream, not necessarily when the sample was taken. There 308 can be some delay between the NATS client sending the message and the server 309 processing it. Therefore, an additional high-resolution 310 [64-bit timestamp](https://docs.simpleiot.org/docs/adr/4-time.html) is added to 311 the beginning of each message. 312 313 ### Edges 314 315 Edges are used to describe the connections between nodes. Nodes can exist in 316 multiple places in the tree. In the below example, `N2` is a child of both `N1` 317 and `N3`. 318 319 <img src="./assets/image-20240124112003398.png" alt="image-20240124112003398" style="zoom:67%;" /> 320 321 Edges currently contain the up and downstream node IDs, an array of points, and 322 a node type. Putting the type in the edge made it efficient to traverse the tree 323 by loading edges from a SQLite table and indexing the IDs and type. With 324 JetStream it is less obvious how to store the edge information. SIOT regularly 325 traverses up and down the tree. 326 327 - down: to discover nodes 328 - up: to propagate points to up subjects 329 330 Because edges contain points that can change over time, edge points need to be 331 stored in a stream, much like we do the node points. If each node has its own 332 stream, then the child edges for the node could be stored in the same stream as 333 the node as shown above. This would allow us to traverse the node tree on 334 startup and perhaps cache all the edges. The following subject can be used for 335 edge points: 336 337 `p.<up node ID>.<down node ID>.<type>.<key>` 338 339 Again, this is very similar to the existing 340 [NATS API](https://docs.simpleiot.org/docs/ref/api.html#nats). 341 342 Two special points are present in every edge: 343 344 - `nodeType`: defines the type of the downstream node 345 - `tombstone`: set to true if the downstream node is deleted 346 347 One challenge with this model is much of the code in the SIOT uses a 348 `NodeEdge` datastructure which includes a node and its parent edge. This 349 collection of data describes this instance of a node and is more useful from a 350 client perspective. However, `NodeEdge`'s are duplicated for every mirrored node 351 in the tree, so don't really make sense from a storage and synchronization 352 perspective. This will likely become more clear after some implementation work. 353 354 ### NATS `up.*` subjects 355 356 In SIOT, we partition the system using the tree structure and nodes that listen 357 for messages (databases, messaging services, rules, etc.) subscribe to the 358 `up.*`stream of their parent node. In the below example, each group has it's own 359 database configuration and the Db node only receives points generated in the 360 group it belongs to. This provides an opportunity for any node at any level in 361 the tree to listen to messages of another node, as long as: 362 363 1. it is equal or higher in the structure 364 2. shares an ancestor. 365 366 <img src="./assets/image-20240124104619281.png" alt="image-20240124104619281" style="zoom:67%;" /> 367 368 The use of "up" subjects would not have to change other than the logic that 369 re-broadcasts points to "up" subjects would need to use the edge cache instead 370 of querying the SQLite database for edges. 371 372 ### AuthN/AuthZ 373 374 Authorization typically needs to happen at device or group boundaries. Devices 375 or users will need to be authorized. Users 376 [have access](https://docs.simpleiot.org/docs/user/users-groups.html) to all 377 nodes in their parent group or device. If each node has its own stream, that 378 will simplify AuthZ. Each device or user are explicitly granted permission to 379 all of the Nodes they have access to. If a new node is created that is a child 380 of a node a user has permission to view, this new node (and the subsequent 381 streams) are added to the list. 382 383 ### Are we optimizing the right thing? 384 385 Any time you move away from a SQL database, you should 386 [think long and hard](http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/) 387 about this. Additionally, there are very nice time-series database solutions out 388 there. So we should have good reasons for inventing yet-another-database. 389 However, mainstream SQL and Time-series databases all have one big drawback: 390 they don't support synchronizing subsets of data between distributed systems. 391 392 With system design, one approach is to order the problems you are solving by 393 difficulty with the top of the list being most important/difficult, and then 394 optimize the system to solve the hard problems first. 395 396 1. Synchronizing subsets of data between distributed systems (including history) 397 2. Be small and efficient enough to deploy at the edge 398 3. Real-time response 399 4. Efficient searching through history 400 5. Flexible data storage/schema 401 6. Querying nodes and state 402 7. Arbitrary relationships between data 403 8. Data encode/decode performance 404 405 The number of devices and nodes in systems SIOT is targeting is relatively 406 small, thus the current node topology can be cached in memory. The history is a 407 much bigger dataset so using a stream to synchronize, store, and retrieve 408 time-series data makes a lot of sense. 409 410 On #7, will we ever need arbitrary relationships between data? With the node 411 graph, we can do this fairly well. Edges contain points that can be used to 412 further characterize the relationship between nodes. With IoT systems your 413 relationships between nodes is mostly determined by physical proximity. A Modbus 414 sensor is connected to a Modbus, which is connected to a Gateway, which is 415 located at a site, which belongs to a customer. 416 417 On #8, the network is relatively slow compared to anything else, so if it takes 418 a little more time to encode/decode data this is typically not a big deal as the 419 network is the bottleneck. 420 421 With an IoT system, the data is primarily 1) sequential in time, and 2) 422 hierarchical in structure. Thus the streaming/tree approach still appears to be 423 the best approach. 424 425 ### Questions 426 427 - How chatty is the NATS Leaf-node protocol? Is it efficient enough to use over 428 low-bandwidth Cat-M cellular connections (~20-100Kbps)? 429 - Is it practical to have 2 streams for every node? A typical edge device may 430 have 30 nodes, so this is 60 streams to synchronize. Is the overhead to source 431 this many nodes over a leaf connection prohibitive? 432 - Would it make sense to create streams at the device/instance boundaries rather 433 than node boundaries? 434 - this may limit our AuthZ capabilities where we want to give some users 435 access to only part of a cloud instance. 436 - How robust is the JetStream store compared to SQLite in events like 437 [power loss](https://www.sqlite.org/transactional.html)? 438 - Are there any other features of NATS/JetStream that we should be considering? 439 440 ## Experiments 441 442 Several POC experiments have been run to prove the feasibility of this: 443 444 https://github.com/simpleiot/nats-exp 445 446 ## Decision 447 448 Implementation could be broken down into 3 steps: 449 450 1. message/subject encoding changes 451 1. switch store from SQLite to Jetstream 452 1. Use Jetsream to sync between systems 453 454 objections/concerns 455 456 ## Consequences 457 458 what is the impact, both negative and positive. 459 460 ## Additional Notes/Reference