# Data Synchronization

See [research](research.md) for information on techniques that may be applicable
to this problem.

Typically, configuration is modified through a user interface, either in the
cloud or with a local UI (e.g. a touchscreen LCD) at an edge device. Rules may
also eventually change values that need to be synchronized. As mentioned above,
the configuration of a `Node` will be stored as `Points`. Typically the UI for
a node will present fields for the needed configuration based on the `Node`
`Type`, whether it be a user, rule, group, edge device, etc.

In the system, the node configuration will be relatively static, but the points
in a node may change often as sensor values change, thus we need to optimize
for efficient synchronization of points. We can't afford the bandwidth to send
the entire node data structure any time something changes.

As IoT systems are fundamentally distributed systems, the question of
synchronization needs to be considered. The client (edge), server (cloud), and
UI (frontend) can all be considered independent systems, and each can make
changes to the same node:

- An edge device with an LCD/keypad may make configuration changes.
- Configuration changes may be made in the Web UI.
- Sensor values will be sent by an edge device.
- Rules running in the cloud may update nodes with calculated values.

Although multiple systems may be updating a node at the same time, it is very
rare that multiple systems will update the same node point at the same time,
because a point typically has only one source. A sensor point will only be
updated by the edge device that has the sensor, a configuration parameter will
only be updated by a user (and there are relatively few admin users), and so
on. Because of this, we can assume collisions in individual point changes will
be rare, and this issue can be ignored. The point with the latest timestamp is
the version to use.

## Real-time Point synchronization

Point changes are handled by sending points to a NATS topic for a node any time
it changes. There are three primary instance types:

1. Cloud: subscribes to point changes on all nodes (wildcard).
1. Edge: subscribes to point changes only for the nodes that exist on the
   instance -- typically a handful of nodes.
1. WebUI: subscribes to point changes for nodes currently being viewed --
   again, typically a small number.

With point synchronization, each instance is responsible for updating the node
data in its local store.

## Catch-up/non real-time synchronization

Sending points over NATS will handle 99% of data synchronization needs, but
there are a few cases this does not cover:

1. One system is offline for some period of time.
1. Data is lost during transmission.
1. Other errors or unforeseen situations.

There are two types of data:

1. periodic sensor readings (we'll call this sample data) that are continuously
   updated
1. configuration data that is infrequently updated

Any node that produces sample data should send values every 10m, even if the
value is not changing. There are several reasons for this (see the sketch after
this list):

- it indicates the data source is still alive
- it makes graphing easier if there is always data to plot
- it covers the synchronization problem for sample data -- a new value will be
  coming soon, so catch-up synchronization is not really needed for sample
  data.
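The following is a minimal Go sketch of this periodic-send behavior using the
[nats.go](https://github.com/nats-io/nats.go) client. The subject name and the
CSV point payload here are illustrative assumptions for this example only --
the real subject layout and encoding are defined by the SimpleIoT NATS API.

```go
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// Hypothetical subject for point changes on one node -- the real
	// subject layout is defined by the SimpleIoT NATS API.
	subject := "node.abc123.points"

	lastValue := 21.5 // most recent sensor reading

	ticker := time.NewTicker(10 * time.Minute)
	defer ticker.Stop()

	for range ticker.C {
		// Publish even if the value has not changed, so upstream
		// instances know the source is still alive, graphs always
		// have data to plot, and lost points are naturally replaced.
		payload := fmt.Sprintf("%d,temperature,%f",
			time.Now().UnixNano(), lastValue)
		if err := nc.Publish(subject, []byte(payload)); err != nil {
			log.Println("publish failed:", err)
		}
	}
}
```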
Config data is not sent periodically. To manage synchronization of config data,
each `edge` will have a `Hash` field that can be compared between instances.

## Node hash

The edge `Hash` field is a hash of:

- edge point CRCs
- node point CRCs (except for repetitive or high-rate sample points)
- child edge `Hash` fields

We store the hash in the `edge` structures because nodes (such as users) can
exist in multiple places in the tree.

This is essentially a Merkle DAG -- see [research](research.md).

Comparing the node `Hash` field allows us to detect node differences. If a
difference is detected, we can then compare the node points and child nodes to
determine the actual differences.

Any time a node point (except for repetitive or high-rate data) is modified,
the node's `Hash` field is updated, and the `Hash` fields in parents,
grandparents, etc. are also computed and updated. This may seem like a lot of
overhead, but if the database is local and the graph is reasonably constructed,
then each update might require reading a dozen or so nodes and perhaps writing
3-5 nodes. Additionally, non-sample-data changes are relatively infrequent.

Initially, synchronization between edge and cloud nodes is supported. The edge
device will contain an "upstream" node that defines a connection to another
instance's NATS server -- typically in the cloud. The edge node is responsible
for synchronizing all state using the following algorithm:

1. Occasionally the edge device fetches the edge device root node hash from the
   cloud.
1. If the hash does not match, the edge device fetches the entire node and
   compares/updates points. If local points need to be updated, this process
   can happen entirely on the edge device. If upstream points need to be
   updated, these are simply transmitted over NATS.
1. If the node hash still does not match, a recursive operation is started to
   fetch child node hashes and the same process is repeated.

### Hash Algorithm

We don't need cryptographic-level hashes, as we are not trying to protect
against malicious actors, but rather to provide a secondary check to ensure all
data has been synchronized. Normally, all data will be sent via points as it
changes, and if all points are received, the hash is not needed. Therefore, we
want to prioritize performance and efficiency over hash strength. The XOR
function has some interesting properties:

- **Commutative: A ⊕ B = B ⊕ A** (elements can be processed in any order and
  produce the same answer)
- **Associative: A ⊕ (B ⊕ C) = (A ⊕ B) ⊕ C** (operations can be grouped in any
  order)
- **Identity: A ⊕ 0 = A**
- **Self-Inverse: A ⊕ A = 0** (an input value can be backed out by simply
  applying it again)

See
[hash_test.go](https://github.com/simpleiot/simpleiot/blob/master/store/hash_test.go)
for tests of the XOR concept.
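A short Go sketch of why these properties matter for an incremental set hash
(the values here are arbitrary placeholders for point CRCs):

```go
package main

import "fmt"

// xorOf combines a set of 32-bit values with XOR. Commutativity and
// associativity mean the result does not depend on processing order.
func xorOf(vals ...uint32) uint32 {
	var h uint32
	for _, v := range vals {
		h ^= v
	}
	return h
}

func main() {
	a := xorOf(0x1111, 0x2222, 0x3333)
	b := xorOf(0x3333, 0x1111, 0x2222) // different order, same hash
	fmt.Println(a == b)                // true

	// Self-inverse: applying a value again backs it out of the hash,
	// so one member of the set can be replaced without recomputing
	// the hash over the entire set.
	c := a ^ 0x2222                         // remove 0x2222
	fmt.Println(c == xorOf(0x1111, 0x3333)) // true
}
```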
### Point CRC

Point CRCs are calculated using the CRC-32 of the following point fields:

- `Time`
- `Type`
- `Key`
- `Text`
- `Value`

### Updating the Node Hash

- edge or node points received
- for points updated:
  - back out the previous point CRC
  - add in the new point CRC
- update upstream hash values (stops at the device node):
  - create a cache of all upstream edges to the root
  - for each upstream edge, back out the old hash and xor in the new hash
  - write all updated edge hash fields

(This process is illustrated in the sketch at the end of this document.)

It should again be emphasized that repetitive or high-rate points should not be
included in the hash, because they will be sent again soon -- we do not need
the hash to ensure they get synchronized. The hash should only include points
that change at slow rates (user changes, state, etc.). Anything machine
generated should be repeated -- even if only every 10m.

The hash is only useful for synchronizing state between a device node tree and
a subset of the upstream node tree. For instances that do not have upstream or
peer instances, there is little value in calculating hash values back to the
root node, and doing so could be computationally intensive for a cloud instance
that has thousands of child nodes.
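To tie the Point CRC and hash-update steps together, here is a minimal Go
sketch. The byte layout fed to CRC-32 and the in-memory `edge` type are
illustrative assumptions, not the actual SimpleIoT implementation -- see the
`store` package in the repository for the real code.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
	"math"
	"time"
)

// Point carries the fields that contribute to the point CRC. The byte
// layout fed to CRC-32 below is an illustrative assumption.
type Point struct {
	Time  time.Time
	Type  string
	Key   string
	Text  string
	Value float64
}

// crc returns the CRC-32 of the point's Time, Type, Key, Text, and Value.
func (p Point) crc() uint32 {
	var buf []byte
	buf = binary.LittleEndian.AppendUint64(buf, uint64(p.Time.UnixNano()))
	buf = append(buf, p.Type...)
	buf = append(buf, p.Key...)
	buf = append(buf, p.Text...)
	buf = binary.LittleEndian.AppendUint64(buf, math.Float64bits(p.Value))
	return crc32.ChecksumIEEE(buf)
}

// edge is a simplified in-memory stand-in for the SimpleIoT edge structure.
type edge struct {
	Hash uint32
	up   *edge // parent edge; nil at the top of this instance's tree
}

// update backs oldCRC out of the edge hash, xors newCRC in, and propagates
// the change upstream. Pass oldCRC = 0 to insert a point and newCRC = 0 to
// remove one. Because every level combines contributions with XOR, the same
// delta applies at each ancestor, so no full recompute is needed.
func (e *edge) update(oldCRC, newCRC uint32) {
	delta := oldCRC ^ newCRC
	for cur := e; cur != nil; cur = cur.up {
		cur.Hash ^= delta
	}
}

func main() {
	device := &edge{}
	child := &edge{up: device}

	// Insert a point.
	p := Point{Time: time.Now(), Type: "description", Text: "pump 1"}
	child.update(0, p.crc())

	// Modify it: back out the previous CRC, add in the new one.
	p2 := p
	p2.Time = time.Now()
	p2.Text = "pump 2"
	child.update(p.crc(), p2.crc())

	fmt.Printf("child %08x device %08x\n", child.Hash, device.Hash)
}
```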