# Reliability

Reliability is an important consideration in any IoT system as these systems are
often used to monitor and control critical systems and processes. Performance is
a key aspect of reliability because if the system is not performing well, then
it can't keep up and do its job.

## Point Metrics

The fundamental operation of SimpleIoT is processing `points`, which are
changes to `nodes`. If the system can't process points at the rate they
arrive, data backs up and the system becomes unresponsive.

Points and other data flow through the NATS messaging system, so it is usually
the first place to look when diagnosing performance problems. Several metrics
are written to the root device node to track how the system is performing.

The NATS client buffers the messages received for each subscription and then
[dispatches them serially, one message at a time](https://docs.nats.io/developing-with-nats/receiving/async).
If the application can't keep up with processing messages, the number of
buffered messages increases. This number is sampled periodically, and the
min/max/avg values are written to the `metricNatsPending*` points in the root
device node.
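
The sampling described above can be sketched roughly as follows. This is a
minimal illustration, not SimpleIoT's actual implementation: `pendingReader`,
`pendingStats`, and `fakeSub` are hypothetical names introduced here. The
`Pending()` method is given the same shape as `(*nats.Subscription).Pending` in
the nats.go client (message count, byte count, error), so a real subscription
could presumably be passed in place of the fake.

```go
package main

import "fmt"

// pendingReader is a hypothetical interface for this sketch. Its method has
// the same shape as (*nats.Subscription).Pending in the nats.go client.
type pendingReader interface {
	Pending() (msgs int, bytes int, err error)
}

// pendingStats accumulates min/max/avg over a series of samples.
type pendingStats struct {
	min, max, sum, count int
}

// sample reads the current pending message count and folds it into the stats.
func (s *pendingStats) sample(r pendingReader) error {
	msgs, _, err := r.Pending()
	if err != nil {
		return err
	}
	if s.count == 0 || msgs < s.min {
		s.min = msgs
	}
	if s.count == 0 || msgs > s.max {
		s.max = msgs
	}
	s.sum += msgs
	s.count++
	return nil
}

// avg returns the mean of the samples, or 0 if none have been taken.
func (s *pendingStats) avg() float64 {
	if s.count == 0 {
		return 0
	}
	return float64(s.sum) / float64(s.count)
}

// fakeSub stands in for a real subscription so the sketch runs without a
// NATS server. It replays a fixed list of readings.
type fakeSub struct {
	readings []int
	i        int
}

func (f *fakeSub) Pending() (int, int, error) {
	v := f.readings[f.i%len(f.readings)]
	f.i++
	return v, 0, nil
}

func main() {
	sub := &fakeSub{readings: []int{0, 4, 2}}
	var st pendingStats
	for i := 0; i < 3; i++ {
		if err := st.sample(sub); err != nil {
			fmt.Println("sample error:", err)
			return
		}
	}
	// In a real system these values would be written as points rather than
	// printed.
	fmt.Printf("min=%d max=%d avg=%.1f\n", st.min, st.max, st.avg())
}
```

A consistently rising minimum is arguably a stronger warning sign than an
occasional high maximum, since it suggests the buffer never fully drains.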

The time required to process points is tracked in the `metricNatsCycle*` points
in the root device node. The cycle time is in milliseconds.
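
As a rough illustration of measuring a cycle time in milliseconds, the sketch
below times a single handler call. `cycleMillis` is a hypothetical helper for
this example, and the sleep stands in for real point processing. How SimpleIoT
actually measures and aggregates the cycle time is not shown here.

```go
package main

import (
	"fmt"
	"time"
)

// cycleMillis runs handler once and returns how long it took in milliseconds.
func cycleMillis(handler func()) float64 {
	start := time.Now()
	handler()
	return float64(time.Since(start)) / float64(time.Millisecond)
}

func main() {
	ms := cycleMillis(func() {
		time.Sleep(5 * time.Millisecond) // stand-in for real point processing
	})
	fmt.Printf("cycle time: %.2f ms\n", ms)
}
```

Dividing the `time.Duration` as a float keeps sub-millisecond resolution, which
matters when most cycles complete in well under a millisecond.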

We also track point throughput (messages/sec) for various NATS subjects in the
`metricNatsThroughput*` points.
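
One simple way to derive a messages/sec figure is to count messages per
subject and divide by the elapsed interval at report time. The sketch below
assumes that approach. The `throughput` type and the subject names are
hypothetical, and the type is not safe for concurrent use without additional
locking.

```go
package main

import "fmt"

// throughput counts messages per subject between reports.
// It is not safe for concurrent use; a real implementation would need a mutex.
type throughput struct {
	counts map[string]int
}

func newThroughput() *throughput {
	return &throughput{counts: make(map[string]int)}
}

// record notes one message received on subject.
func (t *throughput) record(subject string) {
	t.counts[subject]++
}

// rates returns messages/sec per subject for an interval of elapsedSecs
// seconds, then resets the counts for the next interval. Nothing is reported
// for a non-positive interval.
func (t *throughput) rates(elapsedSecs float64) map[string]float64 {
	out := make(map[string]float64, len(t.counts))
	if elapsedSecs > 0 {
		for subject, n := range t.counts {
			out[subject] = float64(n) / elapsedSecs
		}
	}
	t.counts = make(map[string]int)
	return out
}

func main() {
	tp := newThroughput()
	// Hypothetical subject names, for illustration only.
	for i := 0; i < 10; i++ {
		tp.record("example.points")
	}
	for i := 0; i < 4; i++ {
		tp.record("example.other")
	}
	for subject, rate := range tp.rates(2.0) {
		fmt.Printf("%s: %.1f msgs/sec\n", subject, rate)
	}
}
```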

These metrics should be graphed, and notifications should be sent when they
fall outside the normal range. Rules that trigger on the point type can be
installed high in the tree above a group of devices so you don't have to write
rules for every device.

## Database interactions

Database operations greatly affect system performance. When points come into
the system, this data must be stored in the primary store (e.g., Genji) and the
time series store (e.g., InfluxDB). The time it takes to read and write data
greatly impacts how much data the system can handle.

## IO failures

All errors reading from or writing to IO devices should be tracked at both the
device and bus level. These counts can be observed over time, and abnormal
rates can trigger notifications. Error counts should be reported at a low rate
to avoid using bandwidth and resources, especially if a single error increments
multiple counts (IO and bus).
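
A minimal sketch of this idea is shown below. It is an assumption about how
such counters could be structured, not SimpleIoT's implementation: the
`errorCounts` type, its methods, and the device names are hypothetical. Each
error increments both the device and bus counts. `report` is meant to be called
from a slow timer and flags whether anything changed, so unchanged counts need
not be sent again.

```go
package main

import "fmt"

// errorCounts tracks IO errors at the bus level and per device.
// It is not safe for concurrent use; a real implementation would need a mutex.
type errorCounts struct {
	bus     int
	devices map[string]int
	dirty   bool // true if counts changed since the last report
}

func newErrorCounts() *errorCounts {
	return &errorCounts{devices: make(map[string]int)}
}

// recordError increments both the per-device count and the bus count.
func (e *errorCounts) recordError(device string) {
	e.devices[device]++
	e.bus++
	e.dirty = true
}

// report returns the cumulative counts and whether they changed since the
// previous call. It is intended to be called at a low rate (for example from
// a slow ticker) so that a burst of errors produces a single update.
func (e *errorCounts) report() (bus int, devices map[string]int, changed bool) {
	snapshot := make(map[string]int, len(e.devices))
	for name, n := range e.devices {
		snapshot[name] = n
	}
	changed = e.dirty
	e.dirty = false
	return e.bus, snapshot, changed
}

func main() {
	ec := newErrorCounts()
	// Hypothetical device names, for illustration only.
	ec.recordError("dev1")
	ec.recordError("dev1")
	ec.recordError("dev2")

	if bus, devices, changed := ec.report(); changed {
		fmt.Println("bus errors:", bus)
		fmt.Println("dev1 errors:", devices["dev1"])
		fmt.Println("dev2 errors:", devices["dev2"])
	}
}
```

Keeping the counts cumulative rather than resetting them on each report means
a dropped report loses no information, since the next one carries the totals.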

## Logging

Many errors are currently reported as log messages. Eventually some effort
should be made to turn these into error counts and possibly store them in the
time series store for later analysis.