# Contents
- [Abstract](#abstract)
- [Functional Design](#functional-design)
  - [General](#general)
  - [API Functional Design](#api-functional-design)
- [Technical Design](#technical-design)
  - [API](#api)
    - [System Resources](#system-resources)
    - [System Performance](#system-performance)
    - [Dashboard](#dashboard)
    - [App Performance](#app-performance)
  - [Metrics](#metrics)
    - [Writing metrics](#writing-metrics)
    - [List of Metrics](#list-of-metrics)
    - [Metrics View](#metrics-view)

# Abstract
As a system architect, I want to know which metrics the monitor app needs, and the API to query them, so that the [user requirements for CE monitoring](https://github.com/heeus/inv-monitoring/tree/master/20220503-user-reqs) can be implemented.

# Functional Design
## General
- The Monitor App frontend requests metrics from the backend using the API
- The Monitor App performs the required calculations over metrics where needed (rate, diff, etc.) and shows charts, summaries, etc.

## API Functional Design
- General
  - select list of nodes (vvms and dbs)
- Time-series charts
  - select list of metrics by the time range (from..till)
- Dashboard current values / gauges
  - select last metric value (select top 1 metric from the time range order by time desc)
- Dashboard current values / rates (CPU load, IOPS)
  - select first and last metric value from the interval
- Dashboard: Applications Overview
  - select the list of apps with their versions, partitions, uptime, avg RPS
- Sys Performance IOPS: Worst apps
  - Top 5 by request time (Get, GetBatch, Read, Put, PutBatch)
  - Top 5 by RPS (Get, GetBatch, Read, Put, PutBatch)
  - Bottom 5 by cache hits (Get, GetBatch)
  - Top 5 by batch size (PutBatch)
- App: top 10 slow projectors
  - select top 10 projector partitions (Name + Partition + Lag)
- App: Partitions balance
  - select number of queries and commands for every app partition for the specified period

# Technical Design

## API
The following query functions are available in the API:
- `q.monitor.GetNodes` ([{nodename: 'worker1', vvm: true},...])
- `q.monitor.GetApps` ([{app: 'sys/monitor', version: '1.2.3', partitions: 1, uptime: 123123123}])
- `q.monitor.GetMetrics` (returns all values for the requested metrics over the time interval for the given app)
  - in:
    - from
    - till
    - app
    - list of metrics
    - list of vvms
  - out: array of objects:
    - metric_name
    - app
    - month
    - timestamp
    - value

- `q.monitor.GetPartitionsBalance` (returns the data to show the Partition Balance over the time interval for the given app, see below)
- `q.monitor.GetWorstApps` (returns the "IOPS/Worst Apps" data over the time interval, see below)


### System Resources

CPU Usage
- Gets the list of ['node_cpu_idle_seconds_total'] metrics from `q.monitor.GetMetrics` for app 'sys', given nodes and interval
- Calculates `rate` over values

Mem Usage
- Gets the list of ['node_memory_memavailable_bytes', 'node_memory_memtotal_bytes'] metrics from `q.monitor.GetMetrics` for app 'sys', given nodes and interval
- Calculates percentage over values (avail/total)

Disk Usage
- Gets the list of ['node_filesystem_free_bytes', 'node_filesystem_size_bytes'] metrics from `q.monitor.GetMetrics` for app 'sys', given nodes and interval
- Calculates percentage over values (avail/total)

Disk I/O
- Gets the list of ['node_disk_read_bytes_total', 'node_disk_write_bytes_total'] metrics from `q.monitor.GetMetrics` for app 'sys', given nodes and interval
- Calculates the rate of read+write bytes to show data size per second

Disk IOPS
- Gets the list of ['node_disk_reads_completed_total', 'node_disk_writes_completed_total'] metrics from `q.monitor.GetMetrics` for app 'sys', given nodes and interval
- Calculates the rate of read+write ops to show number of ops per second


### System Performance

RPS
- Gets the list of ['heeus_cp_commands_total', 'heeus_qp_commands_total'] metrics from `q.monitor.GetMetrics` for app 'sys' and given interval
- Calculates `rate` over values to show number per second

Status Codes
- Gets the list of ['heeus_http_status_2xx_total', 'heeus_http_status_4xx_total', 'heeus_http_status_5xx_total', 'heeus_http_status_503_total'] metrics from `q.monitor.GetMetrics` for app 'sys' and given interval
- Calculates `diff` over values to show number per interval

IOPS
- Gets the list of ['heeus_istoragecache_get_total', 'heeus_istoragecache_getbatch_total', 'heeus_istoragecache_put_total', 'heeus_istoragecache_putbatch_total', 'heeus_istoragecache_read_total'] metrics from `q.monitor.GetMetrics` for app 'sys' and given interval
- Calculates `rate` over values to show number per second

IOPS: cache hits
- Gets the list of ['heeus_istoragecache_get_total', 'heeus_istoragecache_get_cached_total', 'heeus_istoragecache_getbatch_total', 'heeus_istoragecache_getbatch_cached_total'] metrics from `q.monitor.GetMetrics` for app 'sys' and given interval
- Calculates percentage over `diff` of values (get_cached/get; getbatch_cached/getbatch)

Worst Apps
- Gets the report from the `q.monitor.GetWorstApps` function (interval)
- Internally the function works with the list of apps and metrics from the previous two paragraphs

### Dashboard

System Resources Overview
- CPU
  - Same as [CPU Usage](#cpu-usage), but only rate between first and last value
- Memory
  - Same as [Memory Usage](#mem-usage), but only read last value
- Disk
  - Same as [Disk Usage](#disk-usage), but only read last value
- IOPS
  - Same as [IOPS](#iops), but only rate between first and last value

System Performance Overview
- RPS
  - Same as [RPS](#rps), but only rate between first and last value
- Status Codes
  - Same as [Status Codes](#status-codes), but only diff between first and last value
- IOPS
  - Same as [IOPS](#iops), but only rate between first and last value

Applications Overview
- Gets the list of apps with the `q.monitor.GetApps` function
- RPS for every app is obtained in the same way as [App RPS](#app-rps), but only as the rate between first and last value


### App Performance

App RPS
- Gets the list of ['heeus_cp_commands_total', 'heeus_qp_commands_total'] metrics from `q.monitor.GetMetrics` for given app and interval
- Calculates `rate` over values to show number per second

App Status Codes
- Gets the list of ['heeus_http_status_2xx_total', 'heeus_http_status_4xx_total', 'heeus_http_status_5xx_total', 'heeus_http_status_503_total'] metrics from `q.monitor.GetMetrics` for given app and interval
- Calculates `diff` over values to show number per interval

App Status Codes / Command Processor
- The same, but different metrics ['heeus_cp_http_status_2xx_total', 'heeus_cp_http_status_4xx_total', 'heeus_cp_http_status_5xx_total', 'heeus_cp_http_status_503_total']

App Status Codes / Query Processor
- The same, but different metrics ['heeus_qp_http_status_2xx_total', 'heeus_qp_http_status_4xx_total', 'heeus_qp_http_status_5xx_total', 'heeus_qp_http_status_503_total']

App Execution Time / Command Processor
- Gets the list of ['heeus_cp_commands_total', 'heeus_cp_commands_seconds', 'heeus_cp_exec_seconds', 'heeus_cp_validate_seconds', 'heeus_cp_putplog_seconds'] metrics from `q.monitor.GetMetrics` for given app and interval
- Calculates `diff` over values to show the execution time: diff(seconds)/diff(total)

App Execution Time / Query Processor
- Gets the list of ['heeus_qp_queries_total',
'heeus_qp_queries_seconds', 'heeus_qp_build_seconds',
'heeus_qp_exec_seconds', 'heeus_qp_exec_fields_seconds',
'heeus_qp_exec_enrich_seconds', 'heeus_qp_exec_filter_seconds',
'heeus_qp_exec_order_seconds', 'heeus_qp_exec_count_seconds',
'heeus_qp_exec_send_seconds'] metrics from `q.monitor.GetMetrics` for given app and interval
- Calculates `diff` over values to show the execution time: diff(seconds)/diff(total)

App Partitions Balance
- separate query function `q.monitor.GetPartitionsBalance(interval)` which internally selects the difference between partition metrics over the range
  - in:
    - range
    - appName
  - out: `[{partition: 'P1', queries: 100, commands: 20}, ...]`
  - metrics
    - ['heeus_partition_cp_commands_total', 'heeus_partition_qp_commands_total']
  - Note that for this case we should add `partition` to the metric, i.e. metrics *may* have a partition

App Top 10 Slow Projectors
???

App Storage / IOPS
- Gets the list of ['heeus_istoragecache_get_total',
'heeus_istoragecache_getbatch_total', 'heeus_istoragecache_put_total',
'heeus_istoragecache_putbatch_total', 'heeus_istoragecache_read_total'] metrics from `q.monitor.GetMetrics` for given app and interval
- Calculates `rate` over values to show the ops per second

App Storage / Execution Time
- Gets the list of ['heeus_istoragecache_get_seconds', 'heeus_istoragecache_get_total',
'heeus_istoragecache_getbatch_seconds', 'heeus_istoragecache_getbatch_total',
'heeus_istoragecache_put_seconds', 'heeus_istoragecache_put_total',
'heeus_istoragecache_putbatch_seconds', 'heeus_istoragecache_putbatch_total',
'heeus_istoragecache_read_seconds', 'heeus_istoragecache_read_total'] metrics from `q.monitor.GetMetrics` for given app and interval
- Calculates `diff` over values to show the execution time: diff(seconds)/diff(total)


App Storage / Cache hits
- Gets the list of ['heeus_istoragecache_get_total', 'heeus_istoragecache_get_cached_total',
'heeus_istoragecache_getbatch_total', 'heeus_istoragecache_getbatch_cached_total'] metrics from `q.monitor.GetMetrics` for given app and interval
- Calculates `diff` over values to show the cache hit ratio: diff(cached)/diff(total)

## Metrics
### Writing Metrics
Metrics are periodically scraped by the Monitor app and saved in `monitor.MetricsView` together with their timestamps.

### List of Metrics
| Metric                                            | VVM   | Partitioned |
| ------------------------------------------------- | ----- | ----------- |
| heeus_http_status_2xx_total                       | yes   | no          |
| heeus_http_status_4xx_total                       | yes   | no          |
| heeus_http_status_5xx_total                       | yes   | no          |
| heeus_http_status_503_total                       | yes   | no          |
| heeus_cp_http_status_2xx_total                    | yes   | no          |
| heeus_cp_http_status_4xx_total                    | yes   | no          |
| heeus_cp_http_status_5xx_total                    | yes   | no          |
| heeus_cp_http_status_503_total                    | yes   | no          |
| heeus_qp_http_status_2xx_total                    | yes   | no          |
| heeus_qp_http_status_4xx_total                    | yes   | no          |
| heeus_qp_http_status_5xx_total                    | yes   | no          |
| heeus_qp_http_status_503_total                    | yes   | no          |
| heeus_istoragecache_get_total                     | yes   | no          |
| heeus_istoragecache_get_cached_total              | yes   | no          |
| heeus_istoragecache_getbatch_total                | yes   | no          |
| heeus_istoragecache_getbatch_cached_total         | yes   | no          |
| heeus_istoragecache_put_total                     | yes   | no          |
| heeus_istoragecache_putbatch_total                | yes   | no          |
| heeus_istoragecache_read_total                    | yes   | no          |
| heeus_istoragecache_get_seconds                   | yes   | no          |
| heeus_istoragecache_getbatch_seconds              | yes   | no          |
| heeus_istoragecache_put_seconds                   | yes   | no          |
| heeus_istoragecache_putbatch_seconds              | yes   | no          |
| heeus_istoragecache_read_seconds                  | yes   | no          |
| heeus_cp_commands_total                           | yes   | no          |
| heeus_cp_commands_seconds                         | yes   | no          |
| heeus_cp_exec_seconds                             | yes   | no          |
| heeus_cp_validate_seconds                         | yes   | no          |
| heeus_cp_putplog_seconds                          | yes   | no          |
| heeus_qp_queries_total                            | yes   | no          |
| heeus_qp_queries_seconds                          | yes   | no          |
| heeus_qp_build_seconds                            | yes   | no          |
| heeus_qp_exec_seconds                             | yes   | no          |
| heeus_qp_exec_fields_seconds                      | yes   | no          |
| heeus_qp_exec_enrich_seconds                      | yes   | no          |
| heeus_qp_exec_filter_seconds                      | yes   | no          |
| heeus_qp_exec_order_seconds                       | yes   | no          |
| heeus_qp_exec_count_seconds                       | yes   | no          |
| heeus_qp_exec_send_seconds                        | yes   | no          |
| heeus_partition_cp_commands_total                 | yes   | yes         |
| heeus_partition_qp_commands_total                 | yes   | yes         |
| node_cpu_idle_seconds_total                       | no    | no          |
| node_memory_memavailable_bytes                    | no    | no          |
| node_memory_memtotal_bytes                        | no    | no          |
| node_filesystem_free_bytes                        | no    | no          |
| node_filesystem_size_bytes                        | no    | no          |
| node_disk_read_bytes_total                        | no    | no          |
| node_disk_write_bytes_total                       | no    | no          |
| node_disk_reads_completed_total                   | no    | no          |
| node_disk_writes_completed_total                  | no    | no          |

### Metrics View
- PK: app, day_in_month
- CC: metric_name, timestamp, node
- partition
- value: float64

Partition size calculation:
- Scrape every 15 seconds = 5760 scrapes per day
- 9 non-vvm metrics
- 40 vvm metrics (38 non-partitioned and 2 partitioned)
- 1 node, 3 apps x 10 partitions:
  - values per day: (1 [node] * 9 + 3 [apps] * 38 + 2 * 10 [partitions]) * 5760 = 823680
  - partition size: ?
- 2 worker + 3 dbs, 5 apps x 10 partitions
  - values per day: (5 [nodes] * 9 + 5 [apps] * (38 + 2 * 10 [partitions])) * 5760 = 1929600
  - partition size: ?
- 50 worker + 3 dbs, 5 apps x 10 partitions, 5 apps x 100 partitions
  - values per day: (50 [nodes] * 9 + 5 [apps] * (38 + 2 * 10 [partitions]) + 5 [apps] * (38 + 2 * 100 [partitions])) * 5760 = 11116800
  - partition size: ?
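
The values-per-day arithmetic above can be reproduced with a short Go sketch. It is only an illustration: `AppGroup` and `ValuesPerDay` are hypothetical names, and the constants are the counts stated in the list (15-second scrape, 9 node metrics, 38 non-partitioned and 2 partitioned VVM metrics). The sketch follows the formula used in the second and third scenarios; the first scenario as written adds the partitioned term once rather than per app, so it is not reproduced here.

```go
package main

import "fmt"

// Counts taken from the partition size calculation list.
const (
	scrapeIntervalSec  = 15
	scrapesPerDay      = 24 * 60 * 60 / scrapeIntervalSec // 5760
	nodeMetrics        = 9                                // non-vvm metrics, one series per node
	appMetrics         = 38                               // non-partitioned vvm metrics per app
	partitionedMetrics = 2                                // partitioned vvm metrics per app partition
)

// AppGroup is a hypothetical helper: a number of apps that share
// the same partition count.
type AppGroup struct {
	Apps       int
	Partitions int
}

// ValuesPerDay estimates how many metric values are written per day:
// (nodes*9 + sum over groups of apps*(38 + 2*partitions)) * 5760.
func ValuesPerDay(nodes int, groups ...AppGroup) int {
	series := nodes * nodeMetrics
	for _, g := range groups {
		series += g.Apps * (appMetrics + partitionedMetrics*g.Partitions)
	}
	return series * scrapesPerDay
}

func main() {
	// 2 worker + 3 dbs, 5 apps x 10 partitions
	fmt.Println(ValuesPerDay(5, AppGroup{Apps: 5, Partitions: 10})) // 1929600
	// 50 nodes, 5 apps x 10 partitions, 5 apps x 100 partitions
	fmt.Println(ValuesPerDay(50,
		AppGroup{Apps: 5, Partitions: 10},
		AppGroup{Apps: 5, Partitions: 100})) // 11116800
}
```

Since the view is keyed by (app, day_in_month), these totals are spread across one storage partition per app per day, which is the quantity the "partition size" entries are meant to estimate.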

https://cql-calculator.herokuapp.com/
```
CREATE TABLE metrics (app text, day_in_month int, metric_name text, timestamp bigint, node text, partition int, value double, PRIMARY KEY ((app, day_in_month), metric_name, timestamp, node))
```

# See Also
- [core-imetrics](https://github.com/heeus/core-imetrics/)
- [A&D CE Monitoring Requirements](https://dev.heeus.io/launchpad/#!19448)
- [Full list of node_exporter metrics](https://github.com/prometheus/node_exporter/blob/master/collector/fixtures/e2e-output.txt)
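
For reference, the `rate`, `diff` and ratio calculations that the Technical Design section applies to `q.monitor.GetMetrics` results can be sketched in Go as follows. This is a minimal sketch under assumptions, not the Monitor App implementation: the `Sample` type, the function names, and the use of Unix seconds for `timestamp` are all hypothetical, samples are assumed to be sorted by time, and counter resets are not handled.

```go
package main

import "fmt"

// Sample is a hypothetical shape for one (timestamp, value) pair
// of a single metric returned by q.monitor.GetMetrics.
type Sample struct {
	Timestamp int64   // assumed to be Unix time in seconds
	Value     float64 // cumulative counter value at that time
}

// Diff returns the counter increase between the first and last sample.
// It assumes samples are sorted by time and the counter did not reset.
func Diff(samples []Sample) float64 {
	if len(samples) < 2 {
		return 0
	}
	return samples[len(samples)-1].Value - samples[0].Value
}

// Rate returns the average increase per second over the interval.
func Rate(samples []Sample) float64 {
	if len(samples) < 2 {
		return 0
	}
	elapsed := float64(samples[len(samples)-1].Timestamp - samples[0].Timestamp)
	if elapsed <= 0 {
		return 0
	}
	return Diff(samples) / elapsed
}

// Ratio returns diff(num)/diff(den), the form used for
// diff(seconds)/diff(total) and diff(cached)/diff(total).
func Ratio(num, den []Sample) float64 {
	d := Diff(den)
	if d == 0 {
		return 0
	}
	return Diff(num) / d
}

func main() {
	total := []Sample{{0, 100}, {15, 130}, {30, 160}}
	cached := []Sample{{0, 40}, {15, 55}, {30, 70}}

	fmt.Println(Diff(total))          // 60
	fmt.Println(Rate(total))          // 2
	fmt.Println(Ratio(cached, total)) // 0.5
}
```

The dashboard variants ("only rate between first and last value") correspond to calling the same functions with just the two boundary samples of the interval.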