github.com/voedger/voedger@v0.0.0-20240520144910-273e84102129/design/monitor/metrics.md (about)

     1  # Contents
     2  - [Abstract](#abstract)
     3  - [Functional Design](#functional-design)
     4    - [General](#general)
     5    - [API Functional Design](#api-functional-design)
     6  - [Technical Design](#technical-design)
     7    - [API](#api)
     8      - [System Resources](#system-resources)
     9      - [System Performance](#system-performance)
    10      - [Dashboard](#dashboard)
    11      - [App Performance](#app-performance)
    12    - [Metrics](#metrics)
    13      - [Writing metrics](#writing-metrics)
    14      - [List of Metrics](#list-of-metrics)
    15      - [Metrics View](#metrics-view)
    16  
    17  # Abstract
    18  As a system architect I want to know which metrics are needed for the monitor app, and the API to query them, so that [user requirements for CE monitoring](https://github.com/heeus/inv-monitoring/tree/master/20220503-user-reqs) can be implemented
    19  
    20  # Functional Design
    21  ## General
    22  - Monitor App Frontend requests metrics from Backend using API
    23  - Monitor App performs the required calculations over metrics if needed (rate, diff, etc.) and shows charts, summaries, etc.
    24  
    25  ## API Functional Design
    26  - General
    27    - select list of nodes (vvms and dbs)
    28  - Time-series charts
    29    - select list of metrics by the time range (from..till)
    30  - Dashboard current values / gauges
    31    - select last metric value (select top 1 metric from the time range order by time desc)
    32  - Dashboard current values / rates (CPU load, IOPS)
    33    - select first and last metric value from the interval
    34  - Dashboard: Applications Overview
    35    - select the list of apps with their versions, partitions, uptime, avg RPS
    36  - Sys Performance IOPS: Worst apps
    37    - Top 5 by request time (Get, GetBatch, Read, Put, PutBatch)
    38    - Top 5 by RPS (Get, GetBatch, Read, Put, PutBatch)
    39    - Bottom 5 by cache hits (Get, GetBatch)
    40    - Top 5 by batch size (PutBatch)
    41  - App: top 10 slow projectors
    42    - select top 10 projector partitions (Name + Partition + Lag)
    43  - App: Partitions balance
    44    - select number of queries and commands for every app partition for the specified period
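
The two dashboard selections above (last value for gauges, first and last value for rates) could be sketched as follows. This is a minimal illustration, not the Monitor app's implementation: the `(timestamp, value)` sample shape, the seconds-based timestamps, and the helper names `last_value` and `rate_first_last` are assumptions.

```python
from typing import List, Optional, Tuple

# (timestamp in seconds, value): an assumed sample shape
Sample = Tuple[int, float]


def last_value(samples: List[Sample]) -> Optional[float]:
    """Gauge: value of the most recent sample, or None for an empty range."""
    if not samples:
        return None
    return max(samples, key=lambda s: s[0])[1]


def rate_first_last(samples: List[Sample]) -> Optional[float]:
    """Rate: counter increase per second between the first and last sample."""
    if len(samples) < 2:
        return None
    ordered = sorted(samples, key=lambda s: s[0])
    (t0, v0), (t1, v1) = ordered[0], ordered[-1]
    if t1 == t0:
        return None
    return (v1 - v0) / (t1 - t0)
```

Both helpers return `None` when the range does not hold enough samples, so the frontend can show "no data" instead of a misleading zero.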
    45  
    46  # Technical Design
    47  
    48  ## API
    49  The following query functions are available in the API:
    50  - `q.monitor.GetNodes` ([{nodename: 'worker1', vvm: true},...])
    51  - `q.monitor.GetApps` ([{app: 'sys/monitor', version: '1.2.3', partitions: 1, uptime: 123123123}])
    52  - `q.monitor.GetMetrics` (return all values for requested metrics over time interval for given app)
    53    - in:
    54      - from
    55      - till
    56      - app
    57      - list of metrics
    58      - list of vvms
    59    - out: array of objects:
    60      - metric_name
    61      - app
    62      - month
    63      - timestamp
    64      - value
    65  
    66  - `q.monitor.GetPartitionsBalance` (return the data to show the Partition Balance over time interval for given app, see below)
    67  - `q.monitor.GetWorstApps` (return the "IOPS/Worst Apps" data over time interval, see below)
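
To make the `q.monitor.GetMetrics` contract more concrete, here is a small in-memory sketch of the selection it describes. The field names come from the `out` list above; the `MetricValue` class, the `select_metrics` helper, the inclusive `from..till` bounds and the result ordering are assumptions. The `list of vvms` input is left out because the `out` objects listed above carry no node field.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class MetricValue:
    """One element of the `out` array; field names follow the list above."""
    metric_name: str
    app: str
    month: int
    timestamp: int
    value: float


def select_metrics(rows: Iterable[MetricValue], frm: int, till: int,
                   app: str, metrics: List[str]) -> List[MetricValue]:
    """Return rows of `app` for the requested metrics with frm <= timestamp <= till."""
    wanted = set(metrics)
    selected = [r for r in rows
                if r.app == app
                and r.metric_name in wanted
                and frm <= r.timestamp <= till]
    # Assumed ordering: grouped by metric, oldest sample first
    return sorted(selected, key=lambda r: (r.metric_name, r.timestamp))
```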
    68  
    69  
    70  ### System Resources
    71  
    72  CPU Usage
    73  - Gets the list of ['node_cpu_idle_seconds_total'] metrics from `q.monitor.GetMetrics` for app 'sys', given nodes and interval
    74  - Calculates `rate` over values
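
A possible reading of the CPU Usage calculation: the rate of the idle-seconds counter is the idle fraction, and usage is its complement. The sketch below assumes that `node_cpu_idle_seconds_total` is summed over `cores` CPUs, that timestamps are in seconds, and that samples are already sorted by time; the function name is illustrative.

```python
from typing import List, Tuple

# (timestamp in seconds, idle seconds counter): an assumed sample shape
Sample = Tuple[int, float]


def cpu_usage_percent(samples: List[Sample], cores: int = 1) -> List[Sample]:
    """CPU usage in percent for every pair of consecutive samples."""
    usage = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order samples
        idle_fraction = (v1 - v0) / (dt * cores)
        usage.append((t1, 100.0 * (1.0 - idle_fraction)))
    return usage
```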
    75  
    76  Mem Usage
    77  - Gets the list of ['node_memory_memavailable_bytes', 'node_memory_memtotal_bytes'] metrics from `q.monitor.GetMetrics` for app 'sys', given nodes and interval
    78  - Calculates percentage over values (avail/total)
    79  
    80  Disk Usage
    81  - Gets the list of ['node_filesystem_free_bytes', 'node_filesystem_size_bytes'] metrics from `q.monitor.GetMetrics` for app 'sys', given nodes and interval
    82  - Calculates percentage over values (avail/total)
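
The percentage used for Mem Usage and Disk Usage can be sketched as a join of two series on the timestamp. The helper name and the `(timestamp, value)` sample shape are assumptions; the result is the available (or free) share, so a "used" figure would be `100 - percent`.

```python
from typing import List, Tuple

# (timestamp, value): an assumed sample shape
Sample = Tuple[int, float]


def percent_by_timestamp(part: List[Sample], total: List[Sample]) -> List[Sample]:
    """100 * part / total for every timestamp present in both series."""
    total_by_ts = dict(total)
    return [(ts, 100.0 * value / total_by_ts[ts])
            for ts, value in part
            if total_by_ts.get(ts)]  # skips missing and zero totals
```

For Mem Usage, `part` would hold `node_memory_memavailable_bytes` and `total` would hold `node_memory_memtotal_bytes`; for Disk Usage, `node_filesystem_free_bytes` and `node_filesystem_size_bytes`.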
    83  
    84  Disk I/O
    85  - Gets the list of ['node_disk_read_bytes_total', 'node_disk_write_bytes_total'] metrics from `q.monitor.GetMetrics` for app 'sys', given nodes and interval
    86  - Calculates the rate of read+write bytes to show data size per second
    87  
    88  Disk IOPS
    89  - Gets the list of ['node_disk_reads_completed_total', 'node_disk_writes_completed_total'] metrics from `q.monitor.GetMetrics` for app 'sys', given nodes and interval
    90  - Calculates the rate of read+write ops to show number of ops per second
    91  
    92  
    93  ### System Performance
    94  
    95  RPS
    96  - Gets the list of ['heeus_cp_commands_total', 'heeus_qp_commands_total'] metrics from `q.monitor.GetMetrics` for app 'sys' and given interval
    97  - Calculates `rate` over values to show number per second
    98  
    99  Status Codes
   100  - Gets the list of ['heeus_http_status_2xx_total', 'heeus_http_status_4xx_total', 'heeus_http_status_5xx_total', 'heeus_http_status_503_total'] metrics from `q.monitor.GetMetrics` for app 'sys' and given interval
   101  - Calculates `diff` over values to show number per interval
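
The `diff` calculation can be read as the increase of a counter between consecutive samples. A minimal sketch, assuming `(timestamp, value)` samples and no counter resets inside the range (a reset would show up as a negative value here):

```python
from typing import List, Tuple

# (timestamp, counter value): an assumed sample shape
Sample = Tuple[int, float]


def diff(samples: List[Sample]) -> List[Sample]:
    """Counter increase between consecutive samples, keyed by the later timestamp."""
    ordered = sorted(samples)
    return [(t1, v1 - v0) for (_, v0), (t1, v1) in zip(ordered, ordered[1:])]
```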
   102  
   103  IOPS
   104  - Gets the list of ['heeus_istoragecache_get_total', 'heeus_istoragecache_getbatch_total', 'heeus_istoragecache_put_total',  'heeus_istoragecache_putbatch_total', 'heeus_istoragecache_read_total'] metrics from `q.monitor.GetMetrics` for app 'sys' and given interval
   105  - Calculates `rate` over values to show number per second
   106  
   107  IOPS: cache hits
   108  - Gets the list of ['heeus_istoragecache_get_total', 'heeus_istoragecache_get_cached_total', 'heeus_istoragecache_getbatch_total', 'heeus_istoragecache_getbatch_cached_total'] metrics from `q.monitor.GetMetrics` for app 'sys' and given interval
   109  - Calculates percentage over `diff` of values (get_cached/get; getbatch_cached/getbatch)
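
The cache-hit percentage above could be computed like this. It is a sketch under the assumption that each counter is reduced to its `(first, last)` values over the interval; the function name is illustrative.

```python
from typing import Optional, Tuple

# (first value, last value) of a counter over the interval: an assumed shape
FirstLast = Tuple[float, float]


def cache_hit_percent(total: FirstLast, cached: FirstLast) -> Optional[float]:
    """100 * diff(cached) / diff(total), or None when there were no requests."""
    requests = total[1] - total[0]
    hits = cached[1] - cached[0]
    if requests <= 0:
        return None
    return 100.0 * hits / requests
```

The same helper applies to both pairs: `get_cached`/`get` and `getbatch_cached`/`getbatch`.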
   110  
   111  Worst Apps
   112  - Gets the report from `q.monitor.GetWorstApps` function (interval)
   113  - Internally the function works with the list of apps and metrics from the previous two paragraphs
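
How `q.monitor.GetWorstApps` ranks apps is not specified here. One way to express the "Top 5" and "Bottom 5" selections from the functional design is shown below; the `top_n` and `bottom_n` helpers and the `{app: value}` input shape are assumptions.

```python
import heapq
from typing import Dict, List, Tuple


def top_n(values: Dict[str, float], n: int = 5) -> List[Tuple[str, float]]:
    """Apps with the largest values, e.g. request time, RPS or batch size."""
    return heapq.nlargest(n, values.items(), key=lambda kv: kv[1])


def bottom_n(values: Dict[str, float], n: int = 5) -> List[Tuple[str, float]]:
    """Apps with the smallest values, e.g. cache-hit percentage."""
    return heapq.nsmallest(n, values.items(), key=lambda kv: kv[1])
```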
   114  
   115  ### Dashboard
   116  
   117  System Resources Overview
   118  - CPU
   119    - Same as [CPU Usage](#cpu-usage), but only rate between first and last value
   120  - Memory
   121    - Same as [Memory Usage](#mem-usage), but only read last value
   122  - Disk
   123    - Same as [Disk Usage](#disk-usage), but only read last value
   124  - IOPS
   125    - Same as [IOPS](#iops), but only rate between first and last value
   126  
   127  System Performance Overview
   128  - RPS
   129    - Same as [RPS](#rps) but only rate between first and last value
   130  - Status Codes
   131    - Same as [Status Codes](#status-codes) but only diff between first and last value
   132  - IOPS
   133    - Same as [IOPS](#iops) but only rate between first and last value
   134  
   135  Applications Overview
   136  - Gets the list of apps with `q.monitor.GetApps` function
   137  - RPS for every app is obtained in the same way as [App RPS](#app-rps), but only the rate between first and last value
   138  
   139  
   140  ### App Performance
   141  
   142  App RPS
   143  - Gets the list of ['heeus_cp_commands_total', 'heeus_qp_commands_total'] metrics from `q.monitor.GetMetrics` for given app and interval
   144  - Calculates `rate` over values to show number per second
   145  
   146  App Status Codes
   147  - Gets the list of ['heeus_http_status_2xx_total', 'heeus_http_status_4xx_total', 'heeus_http_status_5xx_total', 'heeus_http_status_503_total'] metrics from `q.monitor.GetMetrics` for given app and interval
   148  - Calculates `diff` over values to show number per interval
   149  
   150  App Status Codes / Command Processor
   151  - The same, but different metrics ['heeus_cp_http_status_2xx_total', 'heeus_cp_http_status_4xx_total', 'heeus_cp_http_status_5xx_total', 'heeus_cp_http_status_503_total']
   152  
   153  App Status Codes / Query Processor
   154  - The same, but different metrics ['heeus_qp_http_status_2xx_total', 'heeus_qp_http_status_4xx_total', 'heeus_qp_http_status_5xx_total', 'heeus_qp_http_status_503_total']
   155  
   156  App Execution Time / Command Processor
   157  - Gets the list of ['heeus_cp_commands_total', 'heeus_cp_commands_seconds', 'heeus_cp_exec_seconds', 'heeus_cp_validate_seconds', 'heeus_cp_putplog_seconds'] metrics from `q.monitor.GetMetrics` for given app and interval
   158  - Calculates `diff` over values to show the execution time: diff(seconds)/diff(total)
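
The `diff(seconds)/diff(total)` formula gives the average time per command for each processing stage. A sketch, assuming every metric is reduced to its `(first, last)` values over the interval; the function name and input shape are illustrative.

```python
from typing import Dict, Tuple

# (first value, last value) of a metric over the interval: an assumed shape
FirstLast = Tuple[float, float]


def avg_seconds_per_command(metrics: Dict[str, FirstLast],
                            total_metric: str = 'heeus_cp_commands_total') -> Dict[str, float]:
    """diff(seconds) / diff(total) for every `*_seconds` metric in the input."""
    first, last = metrics[total_metric]
    commands = last - first
    if commands <= 0:
        return {}  # nothing was executed during the interval
    return {name: (v1 - v0) / commands
            for name, (v0, v1) in metrics.items()
            if name.endswith('_seconds')}
```

Passing `total_metric='heeus_qp_queries_total'` would apply the same calculation to the Query Processor metrics.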
   159  
   160  App Execution Time / Query Processor
   161  - Gets the list of ['heeus_qp_queries_total',
   162                  'heeus_qp_queries_seconds', 'heeus_qp_build_seconds',
   163                  'heeus_qp_exec_seconds', 'heeus_qp_exec_fields_seconds',
   164                  'heeus_qp_exec_enrich_seconds', 'heeus_qp_exec_filter_seconds',
   165                  'heeus_qp_exec_order_seconds','heeus_qp_exec_count_seconds',
   166                  'heeus_qp_exec_send_seconds'] metrics from `q.monitor.GetMetrics` for given app and interval
   167  - Calculates `diff` over values to show the execution time: diff(seconds)/diff(total)
   168  
   169  App Partitions Balance
   170  - separate query function `q.monitor.GetPartitionsBalance(interval)`, which internally selects the difference between partition metrics over the range
   171  - in:
   172    - range
   173    - appName
   174  - out:
   175    [
   176      {partition: 'P1', queries: 100, commands: 20},
   177      ...
   178    ]
   179  - metrics
   180    - ['heeus_partition_cp_commands_total', 'heeus_partition_qp_commands_total']
   181    - Note that for this case we should add `partition` to the metric, i.e. metrics *may* have a partition
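
The per-partition difference could be computed along these lines. The row shape `(metric_name, partition, timestamp, value)` and the function name are assumptions; the metric names and the output keys follow the description above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (metric_name, partition, timestamp, value): an assumed row shape
Row = Tuple[str, int, int, float]

COMMANDS = 'heeus_partition_cp_commands_total'
QUERIES = 'heeus_partition_qp_commands_total'


def partitions_balance(rows: Iterable[Row]) -> List[Dict[str, object]]:
    """Increase of the command and query counters per partition over the range."""
    series = defaultdict(list)
    for metric, partition, ts, value in rows:
        series[(partition, metric)].append((ts, value))

    def increase(partition: int, metric: str) -> float:
        samples = sorted(series.get((partition, metric), []))
        return samples[-1][1] - samples[0][1] if samples else 0.0

    partitions = sorted({p for p, _ in series})
    return [{'partition': p,
             'queries': increase(p, QUERIES),
             'commands': increase(p, COMMANDS)}
            for p in partitions]
```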
   182  
   183  App Top 10 Slow Projectors
   184  ???
   185  
   186  App Storage / IOPS
   187  - Gets the list of ['heeus_istoragecache_get_total',
   188                  'heeus_istoragecache_getbatch_total', 'heeus_istoragecache_put_total',
   189                  'heeus_istoragecache_putbatch_total', 'heeus_istoragecache_read_total'] metrics from `q.monitor.GetMetrics` for given app and interval
   190  - Calculates `rate` over values to show the ops per second
   191  
   192  App Storage / Execution Time
   193  - Gets the list of ['heeus_istoragecache_get_seconds', 'heeus_istoragecache_get_total',
   194                          'heeus_istoragecache_getbatch_seconds', 'heeus_istoragecache_getbatch_total',
   195                          'heeus_istoragecache_put_seconds', 'heeus_istoragecache_put_total',
   196                          'heeus_istoragecache_putbatch_seconds', 'heeus_istoragecache_putbatch_total',
   197                          'heeus_istoragecache_read_seconds', 'heeus_istoragecache_read_total'] metrics from `q.monitor.GetMetrics` for given app and interval
   198  - Calculates `diff` over values to show the execution time: diff(seconds)/diff(total)
   199  
   200  
   201  App Storage / Cache hits
   202  - Gets the list of ['heeus_istoragecache_get_total', 'heeus_istoragecache_get_cached_total',
   203                  'heeus_istoragecache_getbatch_total', 'heeus_istoragecache_getbatch_cached_total'] metrics from `q.monitor.GetMetrics` for given app and interval
   204  - Calculates `diff` over values to show the cache hit ratio: diff(cached)/diff(total)
   205  
   206  ## Metrics
   207  ### Writing Metrics
   208  Metrics are periodically scraped by the Monitor app and saved in `monitor.MetricsView` with their timestamps
   209  
   210  ### List of Metrics
   211  |                      Metric                       |  VVM  | Partitioned |
   212  | ------------------------------------------------- | ----- | ----------- |
   213  | heeus_http_status_2xx_total                       | yes   | no
   214  | heeus_http_status_4xx_total                       | yes   | no
   215  | heeus_http_status_5xx_total                       | yes   | no
   216  | heeus_http_status_503_total                       | yes   | no
   217  | heeus_cp_http_status_2xx_total                    | yes   | no
   218  | heeus_cp_http_status_4xx_total                    | yes   | no
   219  | heeus_cp_http_status_5xx_total                    | yes   | no
   220  | heeus_cp_http_status_503_total                    | yes   | no
   221  | heeus_qp_http_status_2xx_total                    | yes   | no
   222  | heeus_qp_http_status_4xx_total                    | yes   | no
   223  | heeus_qp_http_status_5xx_total                    | yes   | no
   224  | heeus_qp_http_status_503_total                    | yes   | no
   225  | heeus_istoragecache_get_total                     | yes   | no
   226  | heeus_istoragecache_get_cached_total              | yes   | no
   227  | heeus_istoragecache_getbatch_total                | yes   | no
   228  | heeus_istoragecache_getbatch_cached_total         | yes   | no
   229  | heeus_istoragecache_put_total                     | yes   | no
   230  | heeus_istoragecache_putbatch_total                | yes   | no
   231  | heeus_istoragecache_read_total                    | yes   | no
   232  | heeus_istoragecache_get_seconds                   | yes   | no
   233  | heeus_istoragecache_getbatch_seconds              | yes   | no
   234  | heeus_istoragecache_put_seconds                   | yes   | no
   235  | heeus_istoragecache_putbatch_seconds              | yes   | no
   236  | heeus_istoragecache_read_seconds                  | yes   | no
   237  | heeus_cp_commands_total                           | yes   | no
   238  | heeus_cp_commands_seconds                         | yes   | no
   239  | heeus_cp_exec_seconds                             | yes   | no
   240  | heeus_cp_validate_seconds                         | yes   | no
   241  | heeus_cp_putplog_seconds                          | yes   | no
   242  | heeus_qp_queries_total                            | yes   | no
   243  | heeus_qp_queries_seconds                          | yes   | no
   244  | heeus_qp_build_seconds                            | yes   | no
   245  | heeus_qp_exec_seconds                             | yes   | no
   246  | heeus_qp_exec_fields_seconds                      | yes   | no
   247  | heeus_qp_exec_enrich_seconds                      | yes   | no
   248  | heeus_qp_exec_filter_seconds                      | yes   | no
   249  | heeus_qp_exec_order_seconds                       | yes   | no
   250  | heeus_qp_exec_count_seconds                       | yes   | no
   251  | heeus_qp_exec_send_seconds                        | yes   | no
   252  | heeus_partition_cp_commands_total                 | yes   | yes
   253  | heeus_partition_qp_commands_total                 | yes   | yes
   254  | node_cpu_idle_seconds_total                       | no    | no
   255  | node_memory_memavailable_bytes                    | no    | no
   256  | node_memory_memtotal_bytes                        | no    | no
   257  | node_filesystem_free_bytes                        | no    | no
   258  | node_filesystem_size_bytes                        | no    | no
   259  | node_disk_read_bytes_total                        | no    | no
   260  | node_disk_write_bytes_total                       | no    | no
   261  | node_disk_reads_completed_total                   | no    | no
   262  | node_disk_writes_completed_total                  | no    | no
   263  
   264  ### Metrics View
   265  - PK: app, day_in_month
   266  - CC: metric_name, timestamp, node
   267  - partition
   268  - value: float64
   269  
   270  Partition size calculation:
   271  - Scrape every 15 seconds = 5760 scrapes per day
   272  - 9 non-vvm metrics
   273  - 40 vvm metrics (38 non-partitioned and 2 partitioned)
   274  - 1 node, 3 apps x 10 partitions:
   275    - values per day: (1 [node] * 9 + 3 [apps] * (38 + 2 * 10 [partitions])) * 5760 = 1054080
   276    - partition size: ?
   277  - 2 worker + 3 dbs, 5 apps x 10 partitions
   278    - values per day: (5 [nodes] * 9 + 5 [apps] * (38 + 2 * 10 [partitions])) * 5760 = 1929600
   279    - partition size: ?
   280  - 50 worker + 3 dbs, 5 apps x 10 partitions, 5 apps x 100 partitions
   281    - values per day: (53 [nodes] * 9 + 5 [apps] * (38 + 2 * 10 [partitions]) + 5 [apps] * (38 + 2 * 100 [partitions])) * 5760 = 11272320
   283    - partition size: ?
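
The values-per-day arithmetic can be reproduced with a short script. The constants are the counts stated above (15-second scrape, 9 non-vvm metrics per node, 38 non-partitioned and 2 partitioned vvm metrics per app); the function name and its arguments are illustrative.

```python
SCRAPE_INTERVAL_SECONDS = 15
SCRAPES_PER_DAY = 24 * 60 * 60 // SCRAPE_INTERVAL_SECONDS

NODE_METRICS = 9          # non-vvm metrics, one set per node
APP_METRICS = 38          # non-partitioned vvm metrics, one set per app
PARTITIONED_METRICS = 2   # vvm metrics written once per app partition


def values_per_day(nodes: int, app_partitions: list) -> int:
    """Values written per day; `app_partitions` holds the partition count of each app."""
    per_scrape = nodes * NODE_METRICS
    for partitions in app_partitions:
        per_scrape += APP_METRICS + PARTITIONED_METRICS * partitions
    return per_scrape * SCRAPES_PER_DAY
```

For the "2 worker + 3 dbs, 5 apps x 10 partitions" case, `values_per_day(5, [10] * 5)` gives 1929600, matching the figure above.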
   284  
   285  https://cql-calculator.herokuapp.com/
   286  ```
   287  CREATE TABLE metrics (app text, day_in_month int, metric_name text, timestamp bigint, node text, partition int, value double, PRIMARY KEY ((app, day_in_month), metric_name, timestamp, node))
   288  ```
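
With `(app, day_in_month)` as the partition key, a `from..till` query has to address every day partition that the range touches. The sketch below lists those keys, assuming timestamps are Unix seconds and `day_in_month` is the UTC calendar day; neither is stated above, and the function name is illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import List


def partition_days(frm: int, till: int) -> List[int]:
    """Distinct day_in_month values covered by [frm, till], in chronological order."""
    start = datetime.fromtimestamp(frm, tz=timezone.utc).date()
    end = datetime.fromtimestamp(till, tz=timezone.utc).date()
    days = []
    current = start
    while current <= end:
        if current.day not in days:
            days.append(current.day)
        current += timedelta(days=1)
    return days
```

Each returned day, combined with the app name, identifies one partition; within a partition the clustering columns `metric_name` and `timestamp` narrow the read to the requested metrics and time range.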
   289  
   290  # See Also
   291  - [core-imetrics](https://github.com/heeus/core-imetrics/)
   292  - [A&D CE Monitoring Requirements](https://dev.heeus.io/launchpad/#!19448)
   293  - [Full list of node_exporter metrics](https://github.com/prometheus/node_exporter/blob/master/collector/fixtures/e2e-output.txt)