.. only:: not (epub or latex or html)

    WARNING: You are looking at unreleased Cilium documentation.
    Please use the official rendered version released here:
    https://docs.cilium.io

.. _metrics:

********************
Monitoring & Metrics
********************

Cilium and Hubble can both be configured to serve `Prometheus
<https://prometheus.io>`_ metrics. Prometheus is a pluggable metrics collection
and storage system and can act as a data source for `Grafana
<https://grafana.com/>`_, a metrics visualization frontend. Unlike some metrics
collectors like statsd, Prometheus requires the collectors to pull metrics from
each source.

Cilium and Hubble metrics can be enabled independently of each other.

Cilium Metrics
==============

Cilium metrics provide insights into the state of Cilium itself, namely
of the ``cilium-agent``, ``cilium-envoy``, and ``cilium-operator`` processes.
To run Cilium with Prometheus metrics enabled, deploy it with the
``prometheus.enabled=true`` Helm value set.

Cilium metrics are exported under the ``cilium_`` Prometheus namespace. Envoy
metrics are exported under the ``envoy_`` Prometheus namespace, of which the
Cilium-defined metrics are exported under the ``envoy_cilium_`` namespace.
When running and collecting in Kubernetes they will be tagged with a pod name
and namespace.

Installation
------------

You can enable metrics for ``cilium-agent`` (including Envoy) with the Helm value
``prometheus.enabled=true``. ``cilium-operator`` metrics are enabled by default;
to disable them, set the Helm value ``operator.prometheus.enabled=false``.

.. parsed-literal::

   helm install cilium |CHART_RELEASE| \\
     --namespace kube-system \\
     --set prometheus.enabled=true \\
     --set operator.prometheus.enabled=true

The ports can be configured via ``prometheus.port``,
``envoy.prometheus.port``, or ``operator.prometheus.port`` respectively.

When metrics are enabled, all Cilium components will have the following
annotations. They can be used to signal Prometheus whether to scrape metrics:

.. code-block:: yaml

    prometheus.io/scrape: true
    prometheus.io/port: 9962

To collect Envoy metrics the Cilium chart will create a Kubernetes headless
service named ``cilium-agent`` with the ``prometheus.io/scrape:'true'`` annotation set:

.. code-block:: yaml

    prometheus.io/scrape: true
    prometheus.io/port: 9964

This headless service is needed in addition to the annotations on the other
Cilium components because each component can only have one Prometheus scrape
and port annotation.

Prometheus will pick up the Cilium and Envoy metrics automatically if the following
option is set in the ``scrape_configs`` section:

.. code-block:: yaml

    scrape_configs:
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: ${1}:${2}
          target_label: __address__

.. _hubble_metrics:

Hubble Metrics
==============

While Cilium metrics allow you to monitor the state of Cilium itself,
Hubble metrics on the other hand allow you to monitor the network behavior
of your Cilium-managed Kubernetes pods with respect to connectivity and security.

Installation
------------

To deploy Cilium with Hubble metrics enabled, you need to enable Hubble with
``hubble.enabled=true`` and provide a set of Hubble metrics you want to
enable via ``hubble.metrics.enabled``.

Some of the metrics can also be configured with additional options.
See the :ref:`Hubble exported metrics<hubble_exported_metrics>`
section for the full list of available metrics and their options.

.. parsed-literal::

   helm install cilium |CHART_RELEASE| \\
     --namespace kube-system \\
     --set prometheus.enabled=true \\
     --set operator.prometheus.enabled=true \\
     --set hubble.enabled=true \\
     --set hubble.metrics.enableOpenMetrics=true \\
     --set hubble.metrics.enabled="{dns,drop,tcp,flow,port-distribution,icmp,httpV2:exemplars=true;labelsContext=source_ip\\,source_namespace\\,source_workload\\,destination_ip\\,destination_namespace\\,destination_workload\\,traffic_direction}"

The port of the Hubble metrics can be configured with the
``hubble.metrics.port`` Helm value.

For details on enabling Hubble metrics with TLS see the
:ref:`hubble_configure_metrics_tls` section of the documentation.

.. Note::

    L7 metrics such as HTTP are only emitted for pods that enable
    :ref:`Layer 7 Protocol Visibility <proxy_visibility>`.

When deployed with a non-empty ``hubble.metrics.enabled`` Helm value, the
Cilium chart will create a Kubernetes headless service named ``hubble-metrics``
with the ``prometheus.io/scrape:'true'`` annotation set:

.. code-block:: yaml

    prometheus.io/scrape: true
    prometheus.io/port: 9965

Set the following options in the ``scrape_configs`` section of Prometheus to
have it scrape all Hubble metrics from the endpoints automatically:

.. code-block:: yaml

    scrape_configs:
      - job_name: 'kubernetes-endpoints'
        scrape_interval: 30s
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: (.+)(?::\d+);(\d+)
            replacement: $1:$2

.. _hubble_open_metrics:

OpenMetrics
-----------

Additionally, you can opt in to `OpenMetrics <https://openmetrics.io>`_ by
setting ``hubble.metrics.enableOpenMetrics=true``.
Enabling OpenMetrics configures the Hubble metrics endpoint to support exporting
metrics in OpenMetrics format when explicitly requested by clients.

OpenMetrics supports additional functionality such as Exemplars, which
associate metrics with traces by embedding trace IDs into the
exported metrics.

Prometheus needs to be configured to take advantage of OpenMetrics and will
only scrape exemplars when the `exemplars storage feature is enabled
<https://prometheus.io/docs/prometheus/latest/feature_flags/#exemplars-storage>`_.

OpenMetrics imposes a few additional requirements on metrics names and labels,
so this functionality is currently opt-in, though we believe all of the Hubble
metrics conform to the OpenMetrics requirements.


.. _clustermesh_apiserver_metrics:

Cluster Mesh API Server Metrics
===============================

Cluster Mesh API Server metrics provide insights into the state of the
``clustermesh-apiserver`` process, the ``kvstoremesh`` process (if enabled),
and the sidecar etcd instance.
Cluster Mesh API Server metrics are exported under the ``cilium_clustermesh_apiserver_``
Prometheus namespace.
KVStoreMesh metrics are exported under the ``cilium_kvstoremesh_``
Prometheus namespace. Etcd metrics are exported under the ``etcd_`` Prometheus namespace.


Installation
------------

You can enable the metrics for different Cluster Mesh API Server components by
setting the following values:

* clustermesh-apiserver: ``clustermesh.apiserver.metrics.enabled=true``
* kvstoremesh: ``clustermesh.apiserver.metrics.kvstoremesh.enabled=true``
* sidecar etcd instance: ``clustermesh.apiserver.metrics.etcd.enabled=true``

.. parsed-literal::

   helm install cilium |CHART_RELEASE| \\
     --namespace kube-system \\
     --set clustermesh.useAPIServer=true \\
     --set clustermesh.apiserver.metrics.enabled=true \\
     --set clustermesh.apiserver.metrics.kvstoremesh.enabled=true \\
     --set clustermesh.apiserver.metrics.etcd.enabled=true

You can configure the ports via ``clustermesh.apiserver.metrics.port``,
``clustermesh.apiserver.metrics.kvstoremesh.port`` and
``clustermesh.apiserver.metrics.etcd.port`` respectively.

You can automatically create a
`Prometheus Operator <https://github.com/prometheus-operator/prometheus-operator>`_
``ServiceMonitor`` by setting ``clustermesh.apiserver.metrics.serviceMonitor.enabled=true``.

Example Prometheus & Grafana Deployment
=======================================

If you don't have an existing Prometheus and Grafana stack running, you can
deploy a stack with:

.. parsed-literal::

    kubectl apply -f \ |SCM_WEB|\/examples/kubernetes/addons/prometheus/monitoring-example.yaml

It will run Prometheus and Grafana in the ``cilium-monitoring`` namespace. If
you have enabled either Cilium or Hubble metrics, they will automatically
be scraped by Prometheus. You can then expose Grafana to access it via your browser.

.. code-block:: shell-session

    kubectl -n cilium-monitoring port-forward service/grafana --address 0.0.0.0 --address :: 3000:3000

Open your browser and access http://localhost:3000/

Metrics Reference
=================

cilium-agent
------------

Configuration
^^^^^^^^^^^^^

To expose any metrics, invoke ``cilium-agent`` with the
``--prometheus-serve-addr`` option. This option takes an ``IP:Port`` pair, but
passing an empty IP (e.g. ``:9962``) will bind the server to all available
interfaces (there is usually only one in a container).

To customize ``cilium-agent`` metrics, configure the ``--metrics`` option with
``"+metric_a -metric_b -metric_c"``, where ``+/-`` means to enable/disable
the metric. For example, for really large clusters, users may consider
disabling the following two metrics as they generate too much data:

- ``cilium_node_connectivity_status``
- ``cilium_node_connectivity_latency_seconds``

You can then configure the agent with ``--metrics="-cilium_node_connectivity_status -cilium_node_connectivity_latency_seconds"``.
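
As a rough sketch, the same selection could be kept in a Helm values file
instead of being passed as a raw agent flag. This assumes that the chart
exposes the list through a ``prometheus.metrics`` value, which this section
does not state; check the values file of your chart version before relying
on it.

.. code-block:: yaml

    # values.yaml (sketch). Assumption: prometheus.metrics is forwarded to
    # the agent's --metrics option, one entry per enabled/disabled metric.
    prometheus:
      enabled: true
      metrics:
        - "-cilium_node_connectivity_status"
        - "-cilium_node_connectivity_latency_seconds"
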

Exported Metrics
^^^^^^^^^^^^^^^^

Endpoint
~~~~~~~~

============================================ ============ ========== ========================================================
Name                                         Labels       Default    Description
============================================ ============ ========== ========================================================
``endpoint``                                              Enabled    Number of endpoints managed by this agent
``endpoint_max_ifindex``                                  Disabled   Maximum interface index observed for existing endpoints
``endpoint_regenerations_total``             ``outcome``  Enabled    Count of all endpoint regenerations that have completed
``endpoint_regeneration_time_stats_seconds`` ``scope``    Enabled    Endpoint regeneration time stats
``endpoint_state``                           ``state``    Enabled    Count of all endpoints
============================================ ============ ========== ========================================================

The default enabled status of ``endpoint_max_ifindex`` is dynamic. On earlier
kernels (typically with version lower than 5.10), Cilium must store the
interface index for each endpoint in the conntrack map, which reserves 16 bits
for this field. If Cilium is running on such a kernel, this metric will be
enabled by default. It can be used to implement an alert if the ifindex is
approaching the limit of 65535. This may be the case in instances of
significant Endpoint churn.
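
The ifindex alert described above could look roughly like the following
Prometheus alerting rule. This is a sketch under stated assumptions: the rule
and group names, the 60000 threshold, the ``for`` duration and the severity
label are illustrative, and the metric name assumes the ``cilium_`` namespace
prefix and a ``pod`` label added at scrape time.

.. code-block:: yaml

    groups:
      - name: cilium-endpoint            # hypothetical group name
        rules:
          - alert: CiliumEndpointIfindexNearLimit   # hypothetical alert name
            # The conntrack map reserves 16 bits for the ifindex, so the
            # hard limit is 65535; 60000 is an arbitrary early-warning mark.
            expr: max by (pod) (cilium_endpoint_max_ifindex) > 60000
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Endpoint ifindex on {{ $labels.pod }} is approaching 65535"
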

Services
~~~~~~~~

================================ ========== ========== ========================================================
Name                             Labels     Default    Description
================================ ========== ========== ========================================================
``services_events_total``                   Enabled    Number of services events labeled by action type
``service_implementation_delay`` ``action`` Enabled    Duration in seconds to propagate the data plane programming of a service, its network and endpoints from the time the service or the service pod was changed excluding the event queue latency
================================ ========== ========== ========================================================

Cluster health
~~~~~~~~~~~~~~

================================ ====== ========== ========================================================
Name                             Labels Default    Description
================================ ====== ========== ========================================================
``unreachable_nodes``                   Enabled    Number of nodes that cannot be reached
``unreachable_health_endpoints``        Enabled    Number of health endpoints that cannot be reached
================================ ====== ========== ========================================================

Node Connectivity
~~~~~~~~~~~~~~~~~

===================================== ============================================================= ========== ========================================================
Name                                  Labels                                                        Default    Description
===================================== ============================================================= ========== ========================================================
``node_connectivity_status``          ``source_cluster``, ``source_node_name``, ``target_cluster``, Enabled    The last observed status of both ICMP and HTTP connectivity between the current Cilium agent and other Cilium nodes
                                      ``target_node_name``, ``target_node_type``, ``type``
``node_connectivity_latency_seconds`` ``address_type``, ``protocol``, ``source_cluster``,           Enabled    The last observed latency between the current Cilium agent and other Cilium nodes in seconds
                                      ``source_node_name``, ``target_cluster``, ``target_node_ip``,
                                      ``target_node_name``, ``target_node_type``, ``type``
===================================== ============================================================= ========== ========================================================

Clustermesh
~~~~~~~~~~~

=============================================== ============================================================ ========== ========================================================
Name                                            Labels                                                       Default    Description
=============================================== ============================================================ ========== ========================================================
``clustermesh_global_services``                 ``source_cluster``, ``source_node_name``                     Enabled    The total number of global services in the cluster mesh
``clustermesh_remote_clusters``                 ``source_cluster``, ``source_node_name``                     Enabled    The total number of remote clusters meshed with the local cluster
``clustermesh_remote_cluster_failures``         ``source_cluster``, ``source_node_name``, ``target_cluster`` Enabled    The total number of failures related to the remote cluster
``clustermesh_remote_cluster_nodes``            ``source_cluster``, ``source_node_name``, ``target_cluster`` Enabled    The total number of nodes in the remote cluster
``clustermesh_remote_cluster_last_failure_ts``  ``source_cluster``, ``source_node_name``, ``target_cluster`` Enabled    The timestamp of the last failure of the remote cluster
``clustermesh_remote_cluster_readiness_status`` ``source_cluster``, ``source_node_name``, ``target_cluster`` Enabled    The readiness status of the remote cluster
=============================================== ============================================================ ========== ========================================================

Datapath
~~~~~~~~

============================================= ============================== ========== ========================================================
Name                                          Labels                         Default    Description
============================================= ============================== ========== ========================================================
``datapath_conntrack_dump_resets_total``      ``area``, ``name``, ``family`` Enabled    Number of conntrack dump resets. Happens when a BPF entry gets removed while dumping the map is in progress.
``datapath_conntrack_gc_runs_total``          ``status``                     Enabled    Number of times that the conntrack garbage collector process was run
``datapath_conntrack_gc_key_fallbacks_total``                                Enabled    The number of alive and deleted conntrack entries at the end of a garbage collector run labeled by datapath family
``datapath_conntrack_gc_entries``             ``family``                     Enabled    The number of alive and deleted conntrack entries at the end of a garbage collector run
``datapath_conntrack_gc_duration_seconds``    ``status``                     Enabled    Duration in seconds of the garbage collector process
============================================= ============================== ========== ========================================================

IPsec
~~~~~

======================= =================== ========== ========================================================
Name                    Labels              Default    Description
======================= =================== ========== ========================================================
``ipsec_xfrm_error``    ``error``, ``type`` Enabled    Total number of xfrm errors
``ipsec_keys``                              Enabled    Number of keys in use
``ipsec_xfrm_states``   ``direction``       Enabled    Number of XFRM states
``ipsec_xfrm_policies`` ``direction``       Enabled    Number of XFRM policies
======================= =================== ========== ========================================================

eBPF
~~~~

====================================== ================================================================== ========== ========================================================
Name                                   Labels                                                             Default    Description
====================================== ================================================================== ========== ========================================================
``bpf_syscall_duration_seconds``       ``operation``, ``outcome``                                         Disabled   Duration of eBPF system call performed
``bpf_map_ops_total``                  ``mapName`` (deprecated), ``map_name``, ``operation``, ``outcome`` Enabled    Number of eBPF map operations performed. ``mapName`` is deprecated and will be removed in 1.10. Use ``map_name`` instead.
``bpf_map_pressure``                   ``map_name``                                                       Enabled    Map pressure is defined as a ratio of the required map size compared to its configured size. Values < 1.0 indicate the map's utilization, while values >= 1.0 indicate that the map is full. Policy map metrics are only reported when the ratio is over 0.1, i.e. 10% full.
``bpf_map_capacity``                   ``map_group``                                                      Enabled    Maximum size of eBPF maps by group of maps (type of map that have the same max capacity size). Map types with size of 65536 are not emitted, missing map types can be assumed to be 65536.
``bpf_maps_virtual_memory_max_bytes``                                                                     Enabled    Max memory used by eBPF maps installed in the system
``bpf_progs_virtual_memory_max_bytes``                                                                    Enabled    Max memory used by eBPF programs installed in the system
====================================== ================================================================== ========== ========================================================

Both ``bpf_maps_virtual_memory_max_bytes`` and ``bpf_progs_virtual_memory_max_bytes``
currently report the system-wide memory usage of eBPF, covering memory that is
directly managed by Cilium as well as memory that is not. This might change in
the future so that only the eBPF memory usage directly managed by Cilium is
reported.
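
One way to use ``bpf_map_pressure`` from the table above is an alert that
fires before a map fills up. The following Prometheus alerting rule is only a
sketch: the names, the 0.9 threshold and the ``for`` duration are assumptions,
and the metric name assumes the ``cilium_`` namespace prefix and a ``pod``
label added at scrape time.

.. code-block:: yaml

    groups:
      - name: cilium-bpf                 # hypothetical group name
        rules:
          - alert: CiliumBPFMapPressureHigh         # hypothetical alert name
            # Values >= 1.0 mean the map is full, so warn at 90% utilization.
            expr: max by (pod, map_name) (cilium_bpf_map_pressure) > 0.9
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "eBPF map {{ $labels.map_name }} on {{ $labels.pod }} is over 90% full"
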

Drops/Forwards (L3/L4)
~~~~~~~~~~~~~~~~~~~~~~

======================= ========================= ========== ========================================================
Name                    Labels                    Default    Description
======================= ========================= ========== ========================================================
``drop_count_total``    ``reason``, ``direction`` Enabled    Total dropped packets
``drop_bytes_total``    ``reason``, ``direction`` Enabled    Total dropped bytes
``forward_count_total`` ``direction``             Enabled    Total forwarded packets
``forward_bytes_total`` ``direction``             Enabled    Total forwarded bytes
======================= ========================= ========== ========================================================

Policy
~~~~~~

========================================== ========== ========== ========================================================
Name                                       Labels     Default    Description
========================================== ========== ========== ========================================================
``policy``                                            Enabled    Number of policies currently loaded
``policy_regeneration_total``                         Enabled    Deprecated, will be removed in Cilium 1.17 - use ``endpoint_regenerations_total`` instead. Total number of policies regenerated successfully
``policy_regeneration_time_stats_seconds`` ``scope``  Enabled    Deprecated, will be removed in Cilium 1.17 - use ``endpoint_regeneration_time_stats_seconds`` instead. Policy regeneration time stats labeled by the scope
``policy_max_revision``                               Enabled    Highest policy revision number in the agent
``policy_change_total``                               Enabled    Number of policy changes by outcome
``policy_endpoint_enforcement_status``                Enabled    Number of endpoints labeled by policy enforcement status
``policy_implementation_delay``            ``source`` Enabled    Time in seconds between a policy change and it being fully deployed into the datapath, labeled by the policy's source
========================================== ========== ========== ========================================================

Policy L7 (HTTP/Kafka/FQDN)
~~~~~~~~~~~~~~~~~~~~~~~~~~~

======================================= ===================================== ========== ========================================================
Name                                    Labels                                Default    Description
======================================= ===================================== ========== ========================================================
``proxy_redirects``                     ``protocol``                          Enabled    Number of redirects installed for endpoints
``proxy_upstream_reply_seconds``        ``error``, ``protocol_l7``, ``scope`` Enabled    Seconds waited for upstream server to reply to a request
``proxy_datapath_update_timeout_total``                                       Disabled   Number of total datapath update timeouts due to FQDN IP updates
``policy_l7_total``                     ``rule``, ``proxy_type``              Enabled    Number of total L7 requests/responses
======================================= ===================================== ========== ========================================================

Identity
~~~~~~~~

========================== =================== ========== ========================================================
Name                       Labels              Default    Description
========================== =================== ========== ========================================================
``identity``               ``type``            Enabled    Number of identities currently allocated
``identity_label_sources`` ``source``          Enabled    Number of identities which contain at least one label from the given label source
``ipcache_errors_total``   ``type``, ``error`` Enabled    Number of errors interacting with the ipcache
``ipcache_events_total``   ``type``            Enabled    Number of events interacting with the ipcache
========================== =================== ========== ========================================================

Events external to Cilium
~~~~~~~~~~~~~~~~~~~~~~~~~

========================= ========== ========== ========================================================
Name                      Labels     Default    Description
========================= ========== ========== ========================================================
``event_ts``              ``source`` Enabled    Last timestamp when Cilium received an event from a control plane source, per resource and per action
``k8s_event_lag_seconds`` ``source`` Disabled   Lag for Kubernetes events - computed value between receiving a CNI ADD event from kubelet and a Pod event received from kube-api-server
========================= ========== ========== ========================================================

Controllers
~~~~~~~~~~~

===================================== ========================== ========== ========================================================
Name                                  Labels                     Default    Description
===================================== ========================== ========== ========================================================
``controllers_runs_total``            ``status``                 Enabled    Number of times that a controller process was run
``controllers_runs_duration_seconds`` ``status``                 Enabled    Duration in seconds of the controller process
``controllers_group_runs_total``      ``status``, ``group_name`` Enabled    Number of times that a controller process was run, labeled by controller group name
``controllers_failing``                                          Enabled    Number of failing controllers
===================================== ========================== ========== ========================================================

The ``controllers_group_runs_total`` metric reports the success and failure
count of each controller within the system, labeled by controller group name
and completion status. Due to the large number of controllers, enabling this
metric is on a per-controller basis. This is configured using an allow-list
which is passed as the ``controller-group-metrics`` configuration flag,
or the ``prometheus.controllerGroupMetrics`` helm value. The current
recommended default set of group names can be found in the values file of
the Cilium Helm chart. The special names "all" and "none" are supported.
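
A minimal sketch of such an allow-list as Helm values follows.
``prometheus.controllerGroupMetrics`` is the value named above; that it takes
a YAML list, and the specific group names shown, are assumptions made for
illustration. Take the authoritative names from the values file of the Cilium
Helm chart.

.. code-block:: yaml

    # values.yaml (sketch). The group names below are examples only.
    prometheus:
      enabled: true
      controllerGroupMetrics:
        - write-cni-file
        - sync-host-ips
        - sync-lb-maps-with-k8s-services
      # To report every controller group instead, the special name "all"
      # would replace the list entries above:
      # controllerGroupMetrics:
      #   - all
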

SubProcess
~~~~~~~~~~

========================== ============= ========== ========================================================
Name                       Labels        Default    Description
========================== ============= ========== ========================================================
``subprocess_start_total`` ``subsystem`` Enabled    Number of times that Cilium has started a subprocess
========================== ============= ========== ========================================================

Kubernetes
~~~~~~~~~~

========================================== ============================================== ========== ========================================================
Name                                       Labels                                         Default    Description
========================================== ============================================== ========== ========================================================
``kubernetes_events_received_total``       ``scope``, ``action``, ``validity``, ``equal`` Enabled    Number of Kubernetes events received
``kubernetes_events_total``                ``scope``, ``action``, ``outcome``             Enabled    Number of Kubernetes events processed
``k8s_cnp_status_completion_seconds``      ``attempts``, ``outcome``                      Enabled    Duration in seconds of how long it took to complete a CNP status update
``k8s_terminating_endpoints_events_total``                                                Enabled    Number of terminating endpoint events received from Kubernetes
========================================== ============================================== ========== ========================================================

Kubernetes Rest Client
~~~~~~~~~~~~~~~~~~~~~~

============================================ ===================================== ========== ========================================================
Name                                         Labels                                Default    Description
============================================ ===================================== ========== ========================================================
``k8s_client_api_latency_time_seconds``      ``path``, ``method``                  Enabled    Duration of processed API calls labeled by path and method
``k8s_client_rate_limiter_duration_seconds`` ``path``, ``method``                  Enabled    Kubernetes client rate limiter latency in seconds. Broken down by path and method
``k8s_client_api_calls_total``               ``host``, ``method``, ``return_code`` Enabled    Number of API calls made to kube-apiserver labeled by host, method and return code
============================================ ===================================== ========== ========================================================

Kubernetes workqueue
~~~~~~~~~~~~~~~~~~~~

=================================================== ======== ========== ========================================================
Name                                                Labels   Default    Description
=================================================== ======== ========== ========================================================
``k8s_workqueue_depth``                             ``name`` Enabled    Current depth of workqueue
``k8s_workqueue_adds_total``                        ``name`` Enabled    Total number of adds handled by workqueue
``k8s_workqueue_queue_duration_seconds``            ``name`` Enabled    Duration in seconds an item stays in workqueue prior to request
``k8s_workqueue_work_duration_seconds``             ``name`` Enabled    Duration in seconds to process an item from workqueue
``k8s_workqueue_unfinished_work_seconds``           ``name`` Enabled    Duration in seconds of work in progress that hasn't been observed by work_duration. Large values indicate stuck threads. You can deduce the number of stuck threads by observing the rate at which this value increases.
``k8s_workqueue_longest_running_processor_seconds`` ``name`` Enabled    Duration in seconds of the longest running processor for workqueue
``k8s_workqueue_retries_total``                     ``name`` Enabled    Total number of retries handled by workqueue
=================================================== ======== ========== ========================================================

IPAM
~~~~

===================== ========== ========== ========================================================
Name                  Labels     Default    Description
===================== ========== ========== ========================================================
``ipam_capacity``     ``family`` Enabled    Total number of IPs in the IPAM pool labeled by family
``ipam_events_total``            Enabled    Number of IPAM events received labeled by action and datapath family type
``ip_addresses``      ``family`` Enabled    Number of allocated IP addresses
===================== ========== ========== ========================================================

KVstore
~~~~~~~

======================================= ============================================ ========== ========================================================
Name                                    Labels                                       Default    Description
======================================= ============================================ ========== ========================================================
``kvstore_operations_duration_seconds`` ``action``, ``kind``, ``outcome``, ``scope`` Enabled    Duration of kvstore operation
``kvstore_events_queue_seconds``        ``action``, ``scope``                        Enabled    Seconds waited before a received event was queued
``kvstore_quorum_errors_total``         ``error``                                    Enabled    Number of quorum errors
``kvstore_sync_errors_total``           ``scope``, ``source_cluster``                Enabled    Number of times synchronization to the kvstore failed
``kvstore_sync_queue_size``             ``scope``, ``source_cluster``                Enabled    Number of elements queued for synchronization in the kvstore
``kvstore_initial_sync_completed``      ``scope``, ``source_cluster``, ``action``    Enabled    Whether the initial synchronization from/to the kvstore has completed
======================================= ============================================ ========== ========================================================

Agent
~~~~~

================================ ================================ ========== ========================================================
Name                             Labels                           Default    Description
================================ ================================ ========== ========================================================
``agent_bootstrap_seconds``      ``scope``, ``outcome``           Enabled    Duration of various bootstrap phases
``api_process_time_seconds``                                      Enabled    Processing time of all the API calls made to the cilium-agent, labeled by API method, API path and returned HTTP code.
542 ================================ ================================ ========== ======================================================== 543 544 FQDN 545 ~~~~ 546 547 ================================== ================================ ============ ======================================================== 548 Name Labels Default Description 549 ================================== ================================ ============ ======================================================== 550 ``fqdn_gc_deletions_total`` Enabled Number of FQDNs that have been cleaned on FQDN garbage collector job 551 ``fqdn_active_names`` ``endpoint`` Disabled Number of domains inside the DNS cache that have not expired (by TTL), per endpoint 552 ``fqdn_active_ips`` ``endpoint`` Disabled Number of IPs inside the DNS cache associated with a domain that has not expired (by TTL), per endpoint 553 ``fqdn_alive_zombie_connections`` ``endpoint`` Disabled Number of IPs associated with domains that have expired (by TTL) yet still associated with an active connection (aka zombie), per endpoint 554 ``fqdn_selectors`` Enabled Number of registered ToFQDN selectors 555 ================================== ================================ ============ ======================================================== 556 557 Jobs 558 ~~~~ 559 560 ================================== ================================ ============ ======================================================== 561 Name Labels Default Description 562 ================================== ================================ ============ ======================================================== 563 ``jobs_errors_total`` ``job`` Enabled Number of jobs runs that returned an error 564 ``jobs_one_shot_run_seconds`` ``job`` Enabled Histogram of one shot job run duration 565 ``jobs_timer_run_seconds`` ``job`` Enabled Histogram of timer job run duration 566 ``jobs_observer_run_seconds`` ``job`` Enabled Histogram of observer job run duration 567 
================================== ================================ ============ ======================================================== 568 569 CIDRGroups 570 ~~~~~~~~~~ 571 572 =================================================== ===================== ============================= 573 Name Labels Default Description 574 =================================================== ===================== ============================= 575 ``cidrgroups_referenced`` Enabled Number of CNPs and CCNPs referencing at least one CiliumCIDRGroup. CNPs with empty or non-existing CIDRGroupRefs are not considered 576 ``cidrgroup_translation_time_stats_seconds`` Disabled CIDRGroup translation time stats 577 =================================================== ===================== ============================= 578 579 .. _metrics_api_rate_limiting: 580 581 API Rate Limiting 582 ~~~~~~~~~~~~~~~~~ 583 584 ============================================== ========================================== ========== ======================================================== 585 Name Labels Default Description 586 ============================================== ========================================== ========== ======================================================== 587 ``api_limiter_adjustment_factor`` ``api_call`` Enabled Most recent adjustment factor for automatic adjustment 588 ``api_limiter_processed_requests_total`` ``api_call``, ``outcome``, ``return_code`` Enabled Total number of API requests processed 589 ``api_limiter_processing_duration_seconds`` ``api_call``, ``value`` Enabled Mean and estimated processing duration in seconds 590 ``api_limiter_rate_limit`` ``api_call``, ``value`` Enabled Current rate limiting configuration (limit and burst) 591 ``api_limiter_requests_in_flight`` ``api_call`` ``value`` Enabled Current and maximum allowed number of requests in flight 592 ``api_limiter_wait_duration_seconds`` ``api_call``, ``value`` Enabled Mean, min, and max wait duration 593 
``api_limiter_wait_history_duration_seconds`` ``api_call`` Disabled Histogram of wait duration per API call processed 594 ============================================== ========================================== ========== ======================================================== 595 596 .. _metrics_bgp_control_plane: 597 598 BGP Control Plane 599 ~~~~~~~~~~~~~~~~~ 600 601 ====================== ============================================= ======== =================================================================== 602 Name Labels Default Description 603 ====================== ============================================= ======== =================================================================== 604 ``session_state`` ``vrouter``, ``neighbor`` Enabled Current state of the BGP session with the peer, Up = 1 or Down = 0 605 ``advertised_routes`` ``vrouter``, ``neighbor``, ``afi``, ``safi`` Enabled Number of routes advertised to the peer 606 ``received_routes`` ``vrouter``, ``neighbor``, ``afi``, ``safi`` Enabled Number of routes received from the peer 607 ====================== ============================================= ======== =================================================================== 608 609 All metrics are enabled only when the BGP Control Plane is enabled. 610 611 cilium-operator 612 --------------- 613 614 Configuration 615 ^^^^^^^^^^^^^ 616 617 ``cilium-operator`` can be configured to serve metrics by running with the 618 option ``--enable-metrics``. By default, the operator will expose metrics on 619 port 9963, the port can be changed with the option 620 ``--operator-prometheus-serve-addr``. 621 622 Exported Metrics 623 ^^^^^^^^^^^^^^^^ 624 625 All metrics are exported under the ``cilium_operator_`` Prometheus namespace. 626 627 .. _ipam_metrics: 628 629 IPAM 630 ~~~~ 631 632 .. Note:: 633 634 IPAM metrics are all ``Enabled`` only if using the AWS, Alibabacloud or Azure IPAM plugins. 

======================================== =================================== ======== ============================================================
Name                                     Labels                              Default  Description
======================================== =================================== ======== ============================================================
``ipam_ips``                             ``type``                            Enabled  Number of IPs allocated
``ipam_ip_allocation_ops``               ``subnet_id``                       Enabled  Number of IP allocation operations.
``ipam_ip_release_ops``                  ``subnet_id``                       Enabled  Number of IP release operations.
``ipam_interface_creation_ops``          ``subnet_id``                       Enabled  Number of interface creation operations.
``ipam_release_duration_seconds``        ``type``, ``status``, ``subnet_id`` Enabled  IP or interface release latency in seconds
``ipam_allocation_duration_seconds``     ``type``, ``status``, ``subnet_id`` Enabled  IP or interface allocation latency in seconds
``ipam_available_interfaces``                                                Enabled  Number of interfaces with addresses available
``ipam_nodes``                           ``category``                        Enabled  Number of nodes by category { total | in-deficit | at-capacity }
``ipam_resync_total``                                                        Enabled  Number of synchronization operations with external IPAM API
``ipam_api_duration_seconds``            ``operation``, ``response_code``    Enabled  Duration of interactions with external IPAM API.
``ipam_api_rate_limit_duration_seconds`` ``operation``                       Enabled  Duration of rate limiting while accessing external IPAM API
``ipam_available_ips``                   ``target_node``                     Enabled  Number of available IPs on a node (taking into account plugin specific NIC/Address limits).
``ipam_used_ips``                        ``target_node``                     Enabled  Number of currently used IPs on a node.
``ipam_needed_ips``                      ``target_node``                     Enabled  Number of IPs needed to satisfy allocation on a node.
======================================== =================================== ======== ============================================================

LB-IPAM
~~~~~~~

===================================== ======== ======== ============================================================
Name                                  Labels   Default  Description
===================================== ======== ======== ============================================================
``lbipam_conflicting_pools_total``             Enabled  Number of conflicting pools
``lbipam_ips_available_total``        ``pool`` Enabled  Number of available IPs per pool
``lbipam_ips_used_total``             ``pool`` Enabled  Number of used IPs per pool
``lbipam_services_matching_total``             Enabled  Number of matching services
``lbipam_services_unsatisfied_total``          Enabled  Number of services which did not get requested IPs
===================================== ======== ======== ============================================================

Controllers
~~~~~~~~~~~

================================ ========================== ======== ============================================================
Name                             Labels                     Default  Description
================================ ========================== ======== ============================================================
``controllers_group_runs_total`` ``status``, ``group_name`` Enabled  Number of times that a controller process was run, labeled by controller group name
================================ ========================== ======== ============================================================

The ``controllers_group_runs_total`` metric reports the success and failure
count of each controller within the system, labeled by controller group name
and completion status. Due to the large number of controllers, this metric is
enabled on a per-controller-group basis. This is configured using an allow-list
which is passed as the ``controller-group-metrics`` configuration flag,
or the ``prometheus.controllerGroupMetrics`` Helm value. The current
recommended default set of group names can be found in the values file of
the Cilium Helm chart. The special names "all" and "none" are supported.

.. _ces_metrics:

CiliumEndpointSlices (CES)
~~~~~~~~~~~~~~~~~~~~~~~~~~

================================= =========== ============================================================
Name                              Labels      Description
================================= =========== ============================================================
``number_of_ceps_per_ces``                    The number of CEPs batched in a CES
``number_of_cep_changes_per_ces`` ``opcode``  The number of changed CEPs in each CES update
``ces_sync_total``                ``outcome`` The number of completed CES syncs by outcome
``ces_queueing_delay_seconds``                CiliumEndpointSlice queueing delay in seconds
================================= =========== ============================================================


Hubble
------

Configuration
^^^^^^^^^^^^^

Hubble metrics are served by a Hubble instance running inside ``cilium-agent``.
The command-line options to configure them are ``--enable-hubble``,
``--hubble-metrics-server``, and ``--hubble-metrics``.
``--hubble-metrics-server`` takes an ``IP:Port`` pair, but
passing an empty IP (e.g. ``:9965``) will bind the server to all available
interfaces. ``--hubble-metrics`` takes a comma-separated list of metrics.
It's also possible to configure Hubble metrics to listen with TLS and
optionally use mTLS for authentication. For details see :ref:`hubble_configure_metrics_tls`.

Some metrics can take additional semicolon-separated options per metric, e.g.
``--hubble-metrics="dns:query;ignoreAAAA,http:destinationContext=workload-name"``
will enable the ``dns`` metric with the ``query`` and ``ignoreAAAA`` options,
and the ``http`` metric with the ``destinationContext=workload-name`` option.

.. _hubble_context_options:

Context Options
^^^^^^^^^^^^^^^

Hubble metrics support configuration via context options.
Supported context options for all metrics:

- ``sourceContext`` - Configures the ``source`` label on metrics for both egress and ingress traffic.
- ``sourceEgressContext`` - Configures the ``source`` label on metrics for egress traffic (takes precedence over ``sourceContext``).
- ``sourceIngressContext`` - Configures the ``source`` label on metrics for ingress traffic (takes precedence over ``sourceContext``).
- ``destinationContext`` - Configures the ``destination`` label on metrics for both egress and ingress traffic.
- ``destinationEgressContext`` - Configures the ``destination`` label on metrics for egress traffic (takes precedence over ``destinationContext``).
- ``destinationIngressContext`` - Configures the ``destination`` label on metrics for ingress traffic (takes precedence over ``destinationContext``).
- ``labelsContext`` - Configures a list of labels to be enabled on metrics.

There are also some context options that are specific to certain metrics.
See the documentation for the individual metrics to see what options are available for each.

See below for details on each of the different context options.
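
As an illustration of the option syntax and context options described above,
the following Helm values sketch enables two Hubble metrics with per-metric
options. It assumes that the ``hubble.metrics.enabled`` Helm value is used to
populate the ``--hubble-metrics`` list in your deployment; the metrics and
options chosen here are arbitrary examples, not a recommended set.

.. code-block:: yaml

    hubble:
      enabled: true
      metrics:
        # Each list entry configures one metric. Options follow the colon and
        # are separated by semicolons, mirroring the --hubble-metrics syntax.
        enabled:
          - "dns:query;ignoreAAAA"
          - "drop:sourceContext=pod;destinationContext=pod"

The entries are quoted so that YAML treats each one as a single string.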

Most Hubble metrics can be configured to add the source and/or destination
context as a label using the ``sourceContext`` and ``destinationContext``
options. The possible values are:

===================== ============================================================
Option Value          Description
===================== ============================================================
``identity``          All Cilium security identity labels
``namespace``         Kubernetes namespace name
``pod``               Kubernetes pod name and namespace name in the form of ``namespace/pod``.
``pod-name``          Kubernetes pod name.
``dns``               All known DNS names of the source or destination (comma-separated)
``ip``                The IPv4 or IPv6 address
``reserved-identity`` Reserved identity label.
``workload``          Kubernetes pod's workload name and namespace in the form of ``namespace/workload-name``.
``workload-name``     Kubernetes pod's workload name (workloads are: Deployment, Statefulset, Daemonset, ReplicationController, CronJob, Job, DeploymentConfig (OpenShift), etc).
``app``               Kubernetes pod's app name, derived from pod labels (``app.kubernetes.io/name``, ``k8s-app``, or ``app``).
===================== ============================================================

When specifying the source and/or destination context, multiple contexts can be
specified by separating them via the ``|`` symbol.
When multiple are specified, then the first non-empty value is added to the
metric as a label. For example, a metric configuration of
``flow:destinationContext=dns|ip`` will first try to use the DNS name of the
target for the label. If no DNS name is known for the target, it will fall back
and use the IP address of the target instead.

.. note::

    There are 3 cases in which the identity label list contains multiple reserved labels:

    1. ``reserved:kube-apiserver`` and ``reserved:host``
    2. ``reserved:kube-apiserver`` and ``reserved:remote-node``
    3. ``reserved:kube-apiserver`` and ``reserved:world``

    In all of these 3 cases, ``reserved-identity`` context returns ``reserved:kube-apiserver``.

Hubble metrics can also be configured with a ``labelsContext`` which allows providing a list of labels
that should be added to the metric. Unlike ``sourceContext`` and ``destinationContext``, which put
different values into the same metric label, ``labelsContext`` puts each value into its own label.

============================= ============================================================
Option Value                  Description
============================= ============================================================
``source_ip``                 The source IP of the flow.
``source_namespace``          The namespace of the pod if the flow source is a Kubernetes pod.
``source_pod``                The pod name if the flow source is a Kubernetes pod.
``source_workload``           The name of the source pod's workload (Deployment, Statefulset, Daemonset, ReplicationController, CronJob, Job, DeploymentConfig (OpenShift)).
``source_workload_kind``      The kind of the source pod's workload, for example, Deployment, Statefulset, Daemonset, ReplicationController, CronJob, Job, DeploymentConfig (OpenShift).
``source_app``                The app name of the source pod, derived from pod labels (``app.kubernetes.io/name``, ``k8s-app``, or ``app``).
``destination_ip``            The destination IP of the flow.
``destination_namespace``     The namespace of the pod if the flow destination is a Kubernetes pod.
``destination_pod``           The pod name if the flow destination is a Kubernetes pod.
``destination_workload``      The name of the destination pod's workload (Deployment, Statefulset, Daemonset, ReplicationController, CronJob, Job, DeploymentConfig (OpenShift)).
``destination_workload_kind`` The kind of the destination pod's workload, for example, Deployment, Statefulset, Daemonset, ReplicationController, CronJob, Job, DeploymentConfig (OpenShift).
``destination_app``           The app name of the destination pod, derived from pod labels (``app.kubernetes.io/name``, ``k8s-app``, or ``app``).
``traffic_direction``         Identifies the traffic direction of the flow. Possible values are ``ingress``, ``egress`` and ``unknown``.
============================= ============================================================

When specifying ``labelsContext``, multiple values can be specified by separating them via the ``,`` symbol.
All labels listed are included in the metric, even if empty. For example, a metric configuration of
``http:labelsContext=source_namespace,source_pod`` will add the ``source_namespace`` and ``source_pod``
labels to all Hubble HTTP metrics.

.. note::

    To limit metrics cardinality, Hubble removes data series bound to a specific pod one minute after the pod is deleted.
    A metric is considered to be bound to a specific pod when at least one of the following conditions is met:

    * ``sourceContext`` is set to ``pod`` and the metric series has a ``source`` label matching ``<pod_namespace>/<pod_name>``
    * ``destinationContext`` is set to ``pod`` and the metric series has a ``destination`` label matching ``<pod_namespace>/<pod_name>``
    * ``labelsContext`` contains both ``source_namespace`` and ``source_pod`` and the metric series labels match the namespace and name of the deleted pod
    * ``labelsContext`` contains both ``destination_namespace`` and ``destination_pod`` and the metric series labels match the namespace and name of the deleted pod

.. _hubble_exported_metrics:

Exported Metrics
^^^^^^^^^^^^^^^^

Hubble metrics are exported under the ``hubble_`` Prometheus namespace.
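
Because of this prefix, a metric listed in this section as
``lost_events_total`` is queried in Prometheus as ``hubble_lost_events_total``.
As a minimal sketch, assuming Prometheus already scrapes the Hubble metrics
endpoint, an alerting rule on that metric could look like the following. The
group name, alert name, threshold, and durations are illustrative assumptions
and not recommendations from the Cilium project.

.. code-block:: yaml

    groups:
      - name: hubble-example          # hypothetical group name
        rules:
          - alert: HubbleLostEvents   # hypothetical alert name
            # Per-source rate of lost events over the last 5 minutes.
            expr: sum by (source) (rate(hubble_lost_events_total[5m])) > 0
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Hubble is losing events (source: {{ $labels.source }})"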

lost events
~~~~~~~~~~~

This metric, unlike other ones, is not directly tied to network flows. It's enabled if any of the other metrics is enabled.

===================== ========== ======== ============================================================
Name                  Labels     Default  Description
===================== ========== ======== ============================================================
``lost_events_total`` ``source`` Enabled  Number of lost events
===================== ========== ======== ============================================================

Labels
""""""

- ``source`` identifies the source of lost events, one of:

  - ``perf_event_ring_buffer``
  - ``observer_events_queue``
  - ``hubble_ring_buffer``


``dns``
~~~~~~~

============================ ======================================= ======== ============================================================
Name                         Labels                                  Default  Description
============================ ======================================= ======== ============================================================
``dns_queries_total``        ``rcode``, ``qtypes``, ``ips_returned`` Disabled Number of DNS queries observed
``dns_responses_total``      ``rcode``, ``qtypes``, ``ips_returned`` Disabled Number of DNS responses observed
``dns_response_types_total`` ``type``, ``qtypes``                    Disabled Number of DNS response types
============================ ======================================= ======== ============================================================

Options
"""""""

============== ============ ============================================================
Option Key     Option Value Description
============== ============ ============================================================
``query``      N/A          Include the query as label ``query``
``ignoreAAAA`` N/A          Ignore any AAAA requests/responses
============== ============ ============================================================

This metric supports :ref:`Context Options<hubble_context_options>`.


``drop``
~~~~~~~~

============== ======================== ======== ============================================================
Name           Labels                   Default  Description
============== ======================== ======== ============================================================
``drop_total`` ``reason``, ``protocol`` Disabled Number of drops
============== ======================== ======== ============================================================

Options
"""""""

This metric supports :ref:`Context Options<hubble_context_options>`.

``flow``
~~~~~~~~

========================= ================================== ======== ============================================================
Name                      Labels                             Default  Description
========================= ================================== ======== ============================================================
``flows_processed_total`` ``type``, ``subtype``, ``verdict`` Disabled Total number of flows processed
========================= ================================== ======== ============================================================

Options
"""""""

This metric supports :ref:`Context Options<hubble_context_options>`.

``flows-to-world``
~~~~~~~~~~~~~~~~~~

This metric counts all non-reply flows containing the ``reserved:world`` label in their
destination identity. By default, dropped flows are counted if and only if the drop reason
is ``Policy denied``. Set ``any-drop`` option to count all dropped flows.

======================== ========================= ======== ============================================================
Name                     Labels                    Default  Description
======================== ========================= ======== ============================================================
``flows_to_world_total`` ``protocol``, ``verdict`` Disabled Total number of flows to ``reserved:world``.
======================== ========================= ======== ============================================================

Options
"""""""

============ ============ ============================================================
Option Key   Option Value Description
============ ============ ============================================================
``any-drop`` N/A          Count any dropped flows regardless of the drop reason.
``port``     N/A          Include the destination port as label ``port``.
``syn-only`` N/A          Only count non-reply SYNs for TCP flows.
============ ============ ============================================================


This metric supports :ref:`Context Options<hubble_context_options>`.

``http``
~~~~~~~~

Deprecated, use ``httpV2`` instead.
These metrics can not be enabled at the same time as ``httpV2``.

================================= ====================================== ======== ============================================================
Name                              Labels                                 Default  Description
================================= ====================================== ======== ============================================================
``http_requests_total``           ``method``, ``protocol``, ``reporter`` Disabled Count of HTTP requests
``http_responses_total``          ``method``, ``status``, ``reporter``   Disabled Count of HTTP responses
``http_request_duration_seconds`` ``method``, ``reporter``               Disabled Histogram of HTTP request duration in seconds
================================= ====================================== ======== ============================================================

Labels
""""""

- ``method`` is the HTTP method of the request/response.
- ``protocol`` is the HTTP protocol of the request (for example, ``HTTP/1.1`` or ``HTTP/2``).
- ``status`` is the HTTP status code of the response.
- ``reporter`` identifies the origin of the request/response. It is set to ``client`` if it originated from the client, ``server`` if it originated from the server, or ``unknown`` if its origin is unknown.

Options
"""""""

This metric supports :ref:`Context Options<hubble_context_options>`.

``httpV2``
~~~~~~~~~~

``httpV2`` is an updated version of the existing ``http`` metrics.
These metrics can not be enabled at the same time as ``http``.

The main difference is that ``http_requests_total`` and
``http_responses_total`` have been consolidated, and use the response flow
data.

Additionally, the source/destination related labels of the
``http_request_duration_seconds`` metric are now from the perspective of the
request. In the ``http`` metrics, the source and destination were swapped,
because the metric uses the response flow data, where they are reversed
relative to the request. ``httpV2`` accounts for this correctly.

================================= ================================================== ======== ============================================================
Name                              Labels                                             Default  Description
================================= ================================================== ======== ============================================================
``http_requests_total``           ``method``, ``protocol``, ``status``, ``reporter`` Disabled Count of HTTP requests
``http_request_duration_seconds`` ``method``, ``reporter``                           Disabled Histogram of HTTP request duration in seconds
================================= ================================================== ======== ============================================================

Labels
""""""

- ``method`` is the HTTP method of the request/response.
- ``protocol`` is the HTTP protocol of the request (for example, ``HTTP/1.1`` or ``HTTP/2``).
- ``status`` is the HTTP status code of the response.
- ``reporter`` identifies the origin of the request/response. It is set to ``client`` if it originated from the client, ``server`` if it originated from the server, or ``unknown`` if its origin is unknown.

Options
"""""""

============= ============ ============================================================
Option Key    Option Value Description
============= ============ ============================================================
``exemplars`` ``true``     Include extracted trace IDs in HTTP metrics. Requires :ref:`OpenMetrics to be enabled<hubble_open_metrics>`.
============= ============ ============================================================

This metric supports :ref:`Context Options<hubble_context_options>`.

``icmp``
~~~~~~~~

============== ==================== ======== ============================================================
Name           Labels               Default  Description
============== ==================== ======== ============================================================
``icmp_total`` ``family``, ``type`` Disabled Number of ICMP messages
============== ==================== ======== ============================================================

Options
"""""""

This metric supports :ref:`Context Options<hubble_context_options>`.

``kafka``
~~~~~~~~~

================================== ==================================================== ======== ============================================================
Name                               Labels                                               Default  Description
================================== ==================================================== ======== ============================================================
``kafka_requests_total``           ``topic``, ``api_key``, ``error_code``, ``reporter`` Disabled Count of Kafka requests by topic
``kafka_request_duration_seconds`` ``topic``, ``api_key``, ``reporter``                 Disabled Histogram of Kafka request duration by topic
================================== ==================================================== ======== ============================================================

Options
"""""""

This metric supports :ref:`Context Options<hubble_context_options>`.
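
Since ``kafka_request_duration_seconds`` is a histogram, Prometheus exposes it
as ``_bucket`` series from which latency percentiles can be estimated. The
following recording rule is a sketch of that technique for the 95th percentile
per topic. The group name, the recorded series name, and the 5-minute window
are assumptions chosen for illustration.

.. code-block:: yaml

    groups:
      - name: hubble-kafka-example    # hypothetical group name
        rules:
          # Hypothetical recorded series name for the per-topic 95th percentile.
          - record: topic:hubble_kafka_request_duration_seconds:p95
            expr: |
              histogram_quantile(
                0.95,
                sum by (le, topic) (rate(hubble_kafka_request_duration_seconds_bucket[5m]))
              )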

``port-distribution``
~~~~~~~~~~~~~~~~~~~~~

================================ ======================================== ========== ==================================================
Name                             Labels                                   Default    Description
================================ ======================================== ========== ==================================================
``port_distribution_total``      ``protocol``, ``port``                   Disabled   Number of packets distributed by destination port
================================ ======================================== ========== ==================================================

Options
"""""""

This metric supports :ref:`Context Options<hubble_context_options>`.

``tcp``
~~~~~~~

================================ ======================================== ========== ==================================================
Name                             Labels                                   Default    Description
================================ ======================================== ========== ==================================================
``tcp_flags_total``              ``flag``, ``family``                     Disabled   TCP flag occurrences
================================ ======================================== ========== ==================================================

Options
"""""""

This metric supports :ref:`Context Options<hubble_context_options>`.

dynamic_exporter_exporters_total
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is a dynamic Hubble exporter metric.

==================================== ======================================== ========== ==================================================
Name                                 Labels                                   Default    Description
==================================== ======================================== ========== ==================================================
``dynamic_exporter_exporters_total`` ``status``                               Enabled    Number of configured Hubble exporters
==================================== ======================================== ========== ==================================================

Labels
""""""

- ``status`` identifies the status of the exporters and can be one of:

  - ``active``
  - ``inactive``

dynamic_exporter_up
~~~~~~~~~~~~~~~~~~~

This is a dynamic Hubble exporter metric.

==================================== ======================================== ========== ==================================================
Name                                 Labels                                   Default    Description
==================================== ======================================== ========== ==================================================
``dynamic_exporter_up``              ``name``                                 Enabled    Status of exporter (1 - active, 0 - inactive)
==================================== ======================================== ========== ==================================================

Labels
""""""

- ``name`` identifies the exporter name.

dynamic_exporter_reconfigurations_total
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is a dynamic Hubble exporter metric.

=========================================== ======================================== ========== ==================================================
Name                                        Labels                                   Default    Description
=========================================== ======================================== ========== ==================================================
``dynamic_exporter_reconfigurations_total`` ``op``                                   Enabled    Number of dynamic exporter reconfigurations
=========================================== ======================================== ========== ==================================================

Labels
""""""

- ``op`` identifies the reconfiguration operation type and can be one of:

  - ``add``
  - ``update``
  - ``remove``

dynamic_exporter_config_hash
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is a dynamic Hubble exporter metric.

==================================== ======================================== ========== ==================================================
Name                                 Labels                                   Default    Description
==================================== ======================================== ========== ==================================================
``dynamic_exporter_config_hash``                                              Enabled    Hash of last applied config
==================================== ======================================== ========== ==================================================

dynamic_exporter_config_last_applied
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is a dynamic Hubble exporter metric.

======================================== ======================================== ========== ==================================================
Name                                     Labels                                   Default    Description
======================================== ======================================== ========== ==================================================
``dynamic_exporter_config_last_applied``                                          Enabled    Timestamp of last applied config
======================================== ======================================== ========== ==================================================

.. _clustermesh_apiserver_metrics_reference:

clustermesh-apiserver
---------------------

Configuration
^^^^^^^^^^^^^

To expose any metrics, invoke ``clustermesh-apiserver`` with the
``--prometheus-serve-addr`` option. This option takes an ``IP:Port`` pair, but
passing an empty IP (e.g. ``:9962``) binds the server to all available
interfaces (there is usually only one in a container).

Exported Metrics
^^^^^^^^^^^^^^^^

All metrics are exported under the ``cilium_clustermesh_apiserver_``
Prometheus namespace.
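
A Prometheus scrape job for this endpoint might look like the following
sketch. The job name and the target address are hypothetical placeholders; the
port matches the ``:9962`` example above, and the relabeling rule keeps only
series in the ``cilium_clustermesh_apiserver_`` namespace.

.. code-block:: yaml

    scrape_configs:
      - job_name: 'clustermesh-apiserver'    # hypothetical job name
        static_configs:
          # Hypothetical address; use the one passed to --prometheus-serve-addr.
          - targets: ['clustermesh-apiserver.kube-system.svc:9962']
        metric_relabel_configs:
          # Keep only the metrics documented in this section.
          - source_labels: [__name__]
            regex: 'cilium_clustermesh_apiserver_.*'
            action: keep

In a Kubernetes cluster, service discovery as shown for the ``kubernetes-pods``
job earlier in this document is usually preferable to static targets.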

Bootstrap
~~~~~~~~~

======================================== ============================================ ========================================================
Name                                     Labels                                       Description
======================================== ============================================ ========================================================
``bootstrap_seconds``                    ``source_cluster``                           Duration in seconds to complete bootstrap
======================================== ============================================ ========================================================

KVstore
~~~~~~~

======================================== ============================================ ========================================================
Name                                     Labels                                       Description
======================================== ============================================ ========================================================
``kvstore_operations_duration_seconds``  ``action``, ``kind``, ``outcome``, ``scope`` Duration of kvstore operation
``kvstore_events_queue_seconds``         ``action``, ``scope``                        Seconds waited before a received event was queued
``kvstore_quorum_errors_total``          ``error``                                    Number of quorum errors
``kvstore_sync_errors_total``            ``scope``, ``source_cluster``                Number of times synchronization to the kvstore failed
``kvstore_sync_queue_size``              ``scope``, ``source_cluster``                Number of elements queued for synchronization in the kvstore
``kvstore_initial_sync_completed``       ``scope``, ``source_cluster``, ``action``    Whether the initial synchronization from/to the kvstore has completed
======================================== ============================================ ========================================================

API Rate Limiting
~~~~~~~~~~~~~~~~~

============================================== ========================================== ========================================================
Name                                           Labels                                     Description
============================================== ========================================== ========================================================
``api_limiter_processed_requests_total``       ``api_call``, ``outcome``, ``return_code`` Total number of API requests processed
``api_limiter_processing_duration_seconds``    ``api_call``, ``value``                    Mean and estimated processing duration in seconds
``api_limiter_rate_limit``                     ``api_call``, ``value``                    Current rate limiting configuration (limit and burst)
``api_limiter_requests_in_flight``             ``api_call``, ``value``                    Current and maximum allowed number of requests in flight
``api_limiter_wait_duration_seconds``          ``api_call``, ``value``                    Mean, min, and max wait duration
============================================== ========================================== ========================================================

Controllers
~~~~~~~~~~~

======================================== ================================================== ========== ========================================================
Name                                     Labels                                             Default    Description
======================================== ================================================== ========== ========================================================
``controllers_group_runs_total``         ``status``, ``group_name``                         Enabled    Number of times that a controller process was run, labeled by controller group name
======================================== ================================================== ========== ========================================================

The ``controllers_group_runs_total`` metric reports the success
and failure count of each controller within the system, labeled by
controller group name and completion status. This metric is enabled
on a per-controller basis.
This is configured using an allow-list, which
is passed as the ``controller-group-metrics`` configuration flag.
The current default set for ``clustermesh-apiserver`` found in the
Cilium Helm chart is the special name "all", which enables the metric
for all controller groups. The special name "none" is also supported.

.. _kvstoremesh_metrics_reference:

kvstoremesh
-----------

Configuration
^^^^^^^^^^^^^

To expose any metrics, invoke ``kvstoremesh`` with the
``--prometheus-serve-addr`` option. This option takes an ``IP:Port`` pair, but
passing an empty IP (e.g. ``:9964``) binds the server to all available
interfaces (there is usually only one interface in a container).

Exported Metrics
^^^^^^^^^^^^^^^^

All metrics are exported under the ``cilium_kvstoremesh_`` Prometheus namespace.

Bootstrap
~~~~~~~~~

======================================== ============================================ ========================================================
Name                                     Labels                                       Description
======================================== ============================================ ========================================================
``bootstrap_seconds``                    ``source_cluster``                           Duration in seconds to complete bootstrap
======================================== ============================================ ========================================================

Remote clusters
~~~~~~~~~~~~~~~

==================================== ======================================= =================================================================
Name                                 Labels                                  Description
==================================== ======================================= =================================================================
``remote_clusters``                  ``source_cluster``                      The total number of remote clusters meshed with the local cluster
``remote_cluster_failures``          ``source_cluster``, ``target_cluster``  The total number of failures related to the remote cluster
``remote_cluster_last_failure_ts``   ``source_cluster``, ``target_cluster``  The timestamp of the last failure of the remote cluster
``remote_cluster_readiness_status``  ``source_cluster``, ``target_cluster``  The readiness status of the remote cluster
==================================== ======================================= =================================================================

KVstore
~~~~~~~

======================================== ============================================ ========================================================
Name                                     Labels                                       Description
======================================== ============================================ ========================================================
``kvstore_operations_duration_seconds``  ``action``, ``kind``, ``outcome``, ``scope`` Duration of kvstore operation
``kvstore_events_queue_seconds``         ``action``, ``scope``                        Seconds waited before a received event was queued
``kvstore_quorum_errors_total``          ``error``                                    Number of quorum errors
``kvstore_sync_errors_total``            ``scope``, ``source_cluster``                Number of times synchronization to the kvstore failed
``kvstore_sync_queue_size``              ``scope``, ``source_cluster``                Number of elements queued for synchronization in the kvstore
``kvstore_initial_sync_completed``       ``scope``, ``source_cluster``, ``action``    Whether the initial synchronization from/to the kvstore has completed
======================================== ============================================ ========================================================

API Rate Limiting
~~~~~~~~~~~~~~~~~

============================================== ========================================== ========================================================
Name                                           Labels                                     Description
============================================== ========================================== ========================================================
``api_limiter_processed_requests_total``       ``api_call``, ``outcome``, ``return_code`` Total number of API requests processed
``api_limiter_processing_duration_seconds``    ``api_call``, ``value``                    Mean and estimated processing duration in seconds
``api_limiter_rate_limit``                     ``api_call``, ``value``                    Current rate limiting configuration (limit and burst)
``api_limiter_requests_in_flight``             ``api_call``, ``value``                    Current and maximum allowed number of requests in flight
``api_limiter_wait_duration_seconds``          ``api_call``, ``value``                    Mean, min, and max wait duration
============================================== ========================================== ========================================================

Controllers
~~~~~~~~~~~

======================================== ================================================== ========== ========================================================
Name                                     Labels                                             Default    Description
======================================== ================================================== ========== ========================================================
``controllers_group_runs_total``         ``status``, ``group_name``                         Enabled    Number of times that a controller process was run, labeled by controller group name
======================================== ================================================== ========== ========================================================

The ``controllers_group_runs_total`` metric reports the success
and failure count of each controller within the system, labeled by
controller group name and completion status. This metric is enabled
on a per-controller basis. This is configured using an allow-list,
which is passed as the ``controller-group-metrics`` configuration
flag.
The current default set for ``kvstoremesh`` found in the
Cilium Helm chart is the special name "all", which enables the metric
for all controller groups. The special name "none" is also supported.

NAT
~~~

.. _nat_metrics:

======================================== ================================================== ========== ========================================================
Name                                     Labels                                             Default    Description
======================================== ================================================== ========== ========================================================
``nat_endpoint_max_connection``          ``family``                                         Enabled    Saturation of the most saturated distinct NAT mapped connection, in terms of egress-IP and remote endpoint address.
======================================== ================================================== ========== ========================================================

This metric monitors Cilium's NAT mapping functionality. NAT is used by features such as Egress Gateway and BPF masquerading.

The NAT map holds mappings for masqueraded connections. Connections held in the NAT table that are masqueraded with the
same egress-IP and are going to the same remote endpoint IP and port all require a unique source port for the mapping.
This means that any Node masquerading connections to a distinct external endpoint is limited by the possible ephemeral source ports.

Given a Node forwarding one or more such egress-IP and remote endpoint tuples, the ``nat_endpoint_max_connection`` metric reports the most saturated such connection, expressed as a percent of the possible source ports available.
This metric is especially useful when using the egress gateway feature, where it's possible to overload a Node if many connections are all going to the same endpoint.
This metric should normally be fairly low.
A high number here may indicate that a Node is reaching its limit for connections to one or more external endpoints.
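
One way to act on this guidance is a Prometheus alerting rule such as the
sketch below. It rests on several assumptions that should be checked against
a live deployment: that the agent exports the metric under the full name
``cilium_nat_endpoint_max_connection``, that the value is reported as a
percentage (if it turns out to be a ratio between 0 and 1, the threshold
would be ``0.8``), and that an 80% threshold held for ten minutes is a
sensible warning level. The group and alert names are hypothetical.

.. code-block:: yaml

    groups:
      - name: cilium-nat                           # hypothetical group name
        rules:
          - alert: CiliumNATSourcePortSaturation   # hypothetical alert name
            # Assumed full metric name and percentage scale; see the note above.
            expr: max by (instance, family) (cilium_nat_endpoint_max_connection) > 80
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "NAT source ports towards one remote endpoint are close to exhaustion"

Grouping by ``instance`` and ``family`` keeps one alert per Node and address
family, which matches the per-Node nature of the limit described above.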