# Data Synchronization

See [research](research.md) for information on techniques that may be applicable
to this problem.

Typically, configuration is modified through a user interface, either in the
cloud or with a local UI (e.g., a touchscreen LCD) at an edge device. Rules may
also eventually change values that need to be synchronized. As mentioned above,
the configuration of a `Node` is stored as `Points`. Typically, the UI for a
node will present fields for the needed configuration based on the `Node`
`Type`, whether it be a user, rule, group, edge device, etc.

In the system, the Node configuration will be relatively static, but the points
in a node may change often as sensor values change, so we need to optimize for
efficient synchronization of points. We can't afford the bandwidth to send the
entire node data structure any time something changes.

As IoT systems are fundamentally distributed systems, the question of
synchronization needs to be considered. The client (edge), server (cloud), and
UI (frontend) can each be considered independent systems that can make changes
to the same node.

- An edge device with an LCD/keypad may make configuration changes.
- Configuration changes may be made in the Web UI.
- Sensor values will be sent by an edge device.
- Rules running in the cloud may update nodes with calculated values.

Although multiple systems may be updating a node at the same time, it is very
rare that multiple systems will update the same node point at the same time. The
reason for this is that a point typically has only one source. A sensor point
will only be updated by the edge device that has the sensor. A configuration
parameter will only be updated by a user, and there are relatively few admin
users, and so on. Because of this, we can assume there will rarely be collisions
in individual point changes, and thus this issue can be ignored: the point with
the latest timestamp is the version to use.
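
To illustrate the last-write-wins rule, here is a minimal sketch of merging an
incoming point into a node's point list. The `Point` struct is a simplified
stand-in for the SimpleIoT point type, and `merge` is a hypothetical helper for
illustration, not the actual store code.

```go
import "time"

// Point is a simplified stand-in for the SimpleIoT point type; the real
// struct carries additional fields.
type Point struct {
	Type  string
	Key   string
	Text  string
	Value float64
	Time  time.Time
}

// merge applies an incoming point using last-write-wins: an existing point
// with the same Type/Key is only replaced if the incoming timestamp is newer.
func merge(points []Point, in Point) []Point {
	for i, p := range points {
		if p.Type == in.Type && p.Key == in.Key {
			if in.Time.After(p.Time) {
				points[i] = in
			}
			return points
		}
	}
	// no existing point with this Type/Key -- append it
	return append(points, in)
}
```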

## Real-time Point synchronization

Point changes are handled by sending points to a NATS subject for a node any
time a point changes. There are three primary instance types:

1. Cloud: will subscribe to point changes on all nodes (wildcard)
1. Edge: will subscribe to point changes only for the nodes that exist on the
   instance -- typically a handful of nodes.
1. WebUI: will subscribe to point changes for nodes currently being viewed --
   again, typically a small number.

With Point Synchronization, each instance is responsible for updating the node
data in its local store.
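
As a concrete illustration of these subscriptions, the sketch below uses the
nats.go client. The subject names shown (`p.>` as the wildcard and
`p.<node id>` for a single node) and the payload handling are assumptions for
illustration only; see the SimpleIoT API reference for the actual subject
layout and point encoding.

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	// Cloud instance: wildcard subscription to point changes for all nodes.
	if _, err := nc.Subscribe("p.>", func(m *nats.Msg) {
		// decode the points in m.Data and update the local store
		log.Printf("points on %v: %v bytes", m.Subject, len(m.Data))
	}); err != nil {
		log.Fatal(err)
	}

	// Edge or UI instance: subscribe only to the handful of nodes it owns
	// or is currently displaying.
	if _, err := nc.Subscribe("p.node-abc123", func(m *nats.Msg) {
		log.Printf("points for node-abc123: %v bytes", len(m.Data))
	}); err != nil {
		log.Fatal(err)
	}

	select {} // block so the subscriptions stay active
}
```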

## Catch-up/non real-time synchronization

Sending points over NATS will handle 99% of data synchronization needs, but
there are a few cases this does not cover:

1. One system is offline for some period of time
1. Data is lost during transmission
1. Other errors or unforeseen situations

There are two types of data:

1. periodic sensor readings (we'll call this sample data) that are continuously
   updated
1. configuration data that is infrequently updated

Any node that produces sample data should send values every 10m, even if the
value is not changing. There are several reasons for this:

- indicates the data source is still alive
- makes graphing easier if there is always data to plot
- covers the synchronization problem for sample data. A new value will be coming
  soon, so we don't really need catch-up synchronization for sample data.

Config data is not sent periodically. To manage synchronization of config data,
each `edge` will have a `Hash` field that can be compared between instances.

## Node hash

The edge `Hash` field is a hash of:

- edge point CRCs
- node point CRCs (except for repetitive or high-rate sample points)
- child edge `Hash` fields

We store the hash in the `edge` structures because nodes (such as users) can
exist in multiple places in the tree.

This is essentially a Merkle DAG -- see [research](research.md).

Comparing the node `Hash` field allows us to detect node differences. If a
difference is detected, we can then compare the node points and child nodes to
determine the actual differences.

Any time a node point (except for repetitive or high-rate data) is modified, the
node's `Hash` field is updated, and the `Hash` fields in parents, grandparents,
etc. are also computed and updated. This may seem like a lot of overhead, but if
the database is local and the graph is reasonably constructed, each update might
require reading a dozen or so nodes and perhaps writing 3-5 nodes. Additionally,
non-sample-data changes are relatively infrequent.

Initially, synchronization between edge and cloud instances is supported. The
edge device will contain an "upstream" node that defines a connection to another
instance's NATS server -- typically in the cloud. The edge device is responsible
for synchronizing all state using the following algorithm (sketched in code
after the list):

1. occasionally the edge device fetches the edge device root node hash from the
   cloud.
1. if the hash does not match, the edge device fetches the entire node and
   compares/updates points. If local points need to be updated, this can happen
   entirely on the edge device. If upstream points need to be updated, these are
   simply transmitted over NATS.
1. if the node hash still does not match, a recursive operation is started to
   fetch child node hashes and the same process is repeated.
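
The following sketch shows the shape of this algorithm. `Store` and `Upstream`
are hypothetical interfaces standing in for the local database and the NATS
request/reply calls to the upstream instance, and `Point` is the simplified
type from the earlier sketch; none of these are actual SimpleIoT APIs.

```go
// Store is a hypothetical view of the local node store.
type Store interface {
	Hash(nodeID string) uint32
	// ReconcilePoints merges newer upstream points into the local store and
	// returns local points that are newer than the upstream copies.
	ReconcilePoints(nodeID string, upstream []Point) (toSend []Point)
}

// Upstream is a hypothetical view of the upstream (e.g., cloud) instance.
type Upstream interface {
	Hash(nodeID string) uint32
	Points(nodeID string) []Point
	SendPoints(nodeID string, pts []Point) error
	Children(nodeID string) []string
}

// SyncNode compares hashes, reconciles points, and recurses into child nodes
// only if the hashes still disagree after the points have been compared.
func SyncNode(local Store, up Upstream, nodeID string) error {
	if local.Hash(nodeID) == up.Hash(nodeID) {
		return nil // already in sync
	}

	// hashes differ: merge newer upstream points locally and send newer
	// local points upstream over NATS (last-write-wins by timestamp)
	toSend := local.ReconcilePoints(nodeID, up.Points(nodeID))
	if err := up.SendPoints(nodeID, toSend); err != nil {
		return err
	}

	// still different -> repeat the same process on the child nodes
	if local.Hash(nodeID) != up.Hash(nodeID) {
		for _, child := range up.Children(nodeID) {
			if err := SyncNode(local, up, child); err != nil {
				return err
			}
		}
	}
	return nil
}
```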

### Hash Algorithm

We don't need cryptographic-level hashes as we are not trying to protect against
malicious actors, but rather to provide a secondary check to ensure all data has
been synchronized. Normally, all data will be sent via points as it changes, and
if all points are received, the Hash is not needed. Therefore, we want to
prioritize performance and efficiency over hash strength. The XOR function has
some interesting properties:

- **Commutative: A ⊕ B = B ⊕ A** (the ability to process elements in any order
  and get the same answer)
- **Associative: A ⊕ (B ⊕ C) = (A ⊕ B) ⊕ C** (we can group operations in any
  order)
- **Identity: A ⊕ 0 = A**
- **Self-Inverse: A ⊕ A = 0** (we can back out an input value by simply applying
  it again)

See
[hash_test.go](https://github.com/simpleiot/simpleiot/blob/master/store/hash_test.go)
for tests of the XOR concept.
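
A small standalone example of these properties -- combining point CRCs with XOR
and backing one out again. This only demonstrates the properties; it is not the
store's actual hash code.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

func main() {
	a := crc32.ChecksumIEEE([]byte("point A"))
	b := crc32.ChecksumIEEE([]byte("point B"))
	c := crc32.ChecksumIEEE([]byte("point C"))

	// Commutative/associative: the combined hash is the same regardless of
	// the order in which the CRCs are processed.
	fmt.Println(a^b^c == c^a^b) // true

	// Self-inverse: XORing a CRC in a second time removes it, so an old
	// point CRC can be backed out and a new one added without recomputing
	// the hash from scratch.
	hash := a ^ b ^ c
	newB := crc32.ChecksumIEEE([]byte("point B v2"))
	hash = hash ^ b ^ newB        // back out old CRC, add new CRC
	fmt.Println(hash == a^newB^c) // true
}
```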

### Point CRC

Point CRCs are calculated using the CRC-32 of the following point fields (see
the sketch after the list):

- `Time`
- `Type`
- `Key`
- `Text`
- `Value`
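
A sketch of a point CRC over those fields is shown below. The exact field
serialization (nanosecond time, IEEE float bits, then the strings) is an
assumption for illustration; see the SimpleIoT source for the actual encoding.

```go
import (
	"encoding/binary"
	"hash/crc32"
	"math"
	"time"
)

// pointCRC computes a CRC-32 over the Time, Type, Key, Text, and Value
// fields of a point. The byte layout here is illustrative only.
func pointCRC(t time.Time, typ, key, text string, value float64) uint32 {
	buf := make([]byte, 0, 64)
	buf = binary.LittleEndian.AppendUint64(buf, uint64(t.UnixNano()))
	buf = append(buf, typ...)
	buf = append(buf, key...)
	buf = append(buf, text...)
	buf = binary.LittleEndian.AppendUint64(buf, math.Float64bits(value))
	return crc32.ChecksumIEEE(buf)
}
```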

### Updating the Node Hash

- edge or node points received
  - for points updated
    - back out previous point CRC
    - add in new point CRC
  - update upstream hash values (stops at device node)
    - create cache of all upstream edges to root
    - for each upstream edge, back out old hash, and XOR in new hash
    - write all updated edge hash fields
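
The sketch below illustrates the back-out/add-in walk described above, using a
simplified `edge` struct with a cached pointer to its parent. It is not the
actual store implementation, which reads and writes edges from the database and
stops at the device node.

```go
// edge is a simplified stand-in for an edge in the node tree.
type edge struct {
	Hash   uint32
	parent *edge // nil at the top of the chain
}

// updateHash applies a point change to an edge and propagates it upstream:
// back out the old value, XOR in the new value, and repeat the adjustment
// with the child's old/new hash at each parent edge.
func updateHash(e *edge, oldPointCRC, newPointCRC uint32) {
	oldChild := e.Hash
	e.Hash ^= oldPointCRC ^ newPointCRC // back out old point CRC, add new
	newChild := e.Hash

	for p := e.parent; p != nil; p = p.parent {
		old := p.Hash
		p.Hash ^= oldChild ^ newChild // back out child's old hash, add new
		oldChild, newChild = old, p.Hash
	}
}
```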

It should again be emphasized that repetitive or high-rate points should not be
included in the hash because they will be sent again soon -- we do not need the
hash to ensure they get synchronized. The hash should only include points that
change at slow rates (user changes, state, etc.). Anything machine generated
should be repeated -- even if only every 10m.

The hash is only useful in synchronizing state between a device node tree and a
subset of the upstream node tree. For instances that do not have an upstream or
peer instances, there is little value in calculating hash values back to the
root node, and doing so could be computationally intensive for a cloud instance
that has thousands of child nodes.