github.com/simpleiot/simpleiot@v0.18.3/docs/ref/reliability.md

# Reliability

Reliability is an important consideration in any IoT system because these
systems are often used to monitor and control critical systems and processes.
Performance is a key aspect of reliability: a system that performs poorly can't
keep up and do its job.

## Point Metrics

The fundamental operation of SimpleIoT is that it processes `points`, which are
changes to `nodes`. If the system can't process points at the rate they arrive,
data starts to back up and the system becomes unresponsive.

Points and other data flow through the NATS messaging system, so it is the first
place to look. We track several metrics, written to the root device node, that
show how the system is performing.

The NATS client buffers the messages received for each subscription, and the
messages are then
[dispatched serially one message at a time](https://docs.nats.io/developing-with-nats/receiving/async).
If the application can't keep up with processing messages, the number of
buffered messages increases. This number is read periodically and its
min/max/avg are written to the `metricNatsPending*` points in the root device
node.

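The min/max/avg reduction described above can be sketched as follows. This is a minimal illustration, not SimpleIoT's implementation: `PendingStats` and `Summarize` are hypothetical names, and the samples are made-up integers. In the nats.go client, a per-subscription pending count is available from `Subscription.Pending`; that SimpleIoT samples it exactly this way is an assumption.

```go
package main

import "fmt"

// PendingStats summarizes one window of pending-message samples.
// The type and its fields are hypothetical names for this sketch.
type PendingStats struct {
	Min, Max int
	Avg      float64
}

// Summarize reduces a window of samples to min/max/avg.
// ok is false when the window is empty.
func Summarize(samples []int) (stats PendingStats, ok bool) {
	if len(samples) == 0 {
		return PendingStats{}, false
	}
	stats.Min, stats.Max = samples[0], samples[0]
	sum := 0
	for _, s := range samples {
		if s < stats.Min {
			stats.Min = s
		}
		if s > stats.Max {
			stats.Max = s
		}
		sum += s
	}
	stats.Avg = float64(sum) / float64(len(samples))
	return stats, true
}

func main() {
	// Made-up pending counts, one per sampling interval.
	stats, ok := Summarize([]int{0, 3, 12, 5})
	fmt.Printf("min=%d max=%d avg=%.1f ok=%v\n", stats.Min, stats.Max, stats.Avg, ok)
}
```
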
The time required to process points is tracked in the `metricNatsCycle*` points
in the root device node. The cycle time is in milliseconds.

We also track point throughput (messages/sec) for various NATS subjects in the
`metricNatsThroughput*` points.

These metrics should be graphed, and notifications sent when they fall outside
the normal range. Rules that trigger on the point type can be installed high in
the tree above a group of devices, so you don't have to write rules for every
device.

## Database interactions

Database operations greatly affect system performance. When points come into the
system, we need to store this data in the primary store (e.g., Genji) and the
time series store (e.g., InfluxDB). The time it takes to read and write data
greatly impacts how much data we can handle.

## IO failures

All errors reading/writing IO devices should be tracked at both the device and
bus level. These can be observed over time and abnormal rates can trigger
notifications. Error counts should be reported at a low rate to avoid using
bandwidth and resources -- especially if multiple counts are incremented on an
error (IO and bus).

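Counting errors in memory and reporting them on a slow timer is one way to keep the reporting rate low. The sketch below is illustrative only: `ErrorCounts` and its methods are hypothetical names, and it assumes a device error also increments the bus count, as the "IO and bus" case above suggests.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// ErrorCounts accumulates IO error counts in memory between reports.
// Hypothetical type for this sketch.
type ErrorCounts struct {
	device int64
	bus    int64
}

// RecordDeviceError counts a failed read/write against both the device
// and the bus it sits on.
func (e *ErrorCounts) RecordDeviceError() {
	atomic.AddInt64(&e.device, 1)
	atomic.AddInt64(&e.bus, 1)
}

// RecordBusError counts a bus-level failure not tied to one device.
func (e *ErrorCounts) RecordBusError() {
	atomic.AddInt64(&e.bus, 1)
}

// Flush returns the counts accumulated since the previous Flush and
// resets them. It is meant to be called from a slow periodic timer.
func (e *ErrorCounts) Flush() (device, bus int64) {
	return atomic.SwapInt64(&e.device, 0), atomic.SwapInt64(&e.bus, 0)
}

func main() {
	var counts ErrorCounts
	counts.RecordDeviceError()
	counts.RecordDeviceError()
	counts.RecordBusError()

	device, bus := counts.Flush()
	fmt.Printf("device errors=%d bus errors=%d\n", device, bus)
}
```
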
## Logging

Many errors are currently reported as log messages. Eventually some effort
should be made to turn these into error counts and possibly store them in the
time series store for later analysis.