.. only:: not (epub or latex or html)

    WARNING: You are looking at unreleased Cilium documentation.
    Please use the official rendered version released here:
    https://docs.cilium.io

.. _metrics:

********************
Monitoring & Metrics
********************

Cilium and Hubble can both be configured to serve `Prometheus
<https://prometheus.io>`_ metrics. Prometheus is a pluggable metrics collection
and storage system and can act as a data source for `Grafana
<https://grafana.com/>`_, a metrics visualization frontend. Unlike push-based
metrics collectors such as statsd, Prometheus pulls (scrapes) metrics from
each source.

Cilium and Hubble metrics can be enabled independently of each other.

Cilium Metrics
==============

Cilium metrics provide insights into the state of Cilium itself, namely
of the ``cilium-agent``, ``cilium-envoy``, and ``cilium-operator`` processes.
To run Cilium with Prometheus metrics enabled, deploy it with the
``prometheus.enabled=true`` Helm value set.

Cilium metrics are exported under the ``cilium_`` Prometheus namespace. Envoy
metrics are exported under the ``envoy_`` Prometheus namespace, of which the
Cilium-defined metrics are exported under the ``envoy_cilium_`` namespace.
When running and collecting in Kubernetes they will be tagged with a pod name
and namespace.

Installation
------------

You can enable metrics for ``cilium-agent`` (including Envoy) with the Helm value
``prometheus.enabled=true``. ``cilium-operator`` metrics are enabled by default;
to disable them, set the Helm value ``operator.prometheus.enabled=false``.

.. parsed-literal::

   helm install cilium |CHART_RELEASE| \\
     --namespace kube-system \\
     --set prometheus.enabled=true \\
     --set operator.prometheus.enabled=true

The ports can be configured via ``prometheus.port``,
``envoy.prometheus.port``, or ``operator.prometheus.port`` respectively.
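
As a rough sketch, the same settings can be kept in a Helm values file instead
of ``--set`` flags. The file name ``values.yaml`` and the port numbers below are
illustrative assumptions, not values mandated by the chart:

.. code-block:: yaml

    # values.yaml (hypothetical file name)
    prometheus:
      enabled: true
      port: 9962          # cilium-agent metrics port (example value)
    envoy:
      prometheus:
        port: 9964        # Envoy metrics port (example value)
    operator:
      prometheus:
        enabled: true
        port: 9963        # cilium-operator metrics port (example value)

Such a file would be passed to ``helm install`` with ``-f values.yaml``.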

When metrics are enabled, all Cilium components will have the following
annotations. They can be used to signal Prometheus whether to scrape metrics:

.. code-block:: yaml

        prometheus.io/scrape: "true"
        prometheus.io/port: "9962"

To collect Envoy metrics the Cilium chart will create a Kubernetes headless
service named ``cilium-agent`` with the ``prometheus.io/scrape:'true'`` annotation set:

.. code-block:: yaml

        prometheus.io/scrape: "true"
        prometheus.io/port: "9964"

This additional headless service is needed because each component can carry
only one Prometheus scrape and port annotation.

Prometheus will pick up the Cilium and Envoy metrics automatically if the following
option is set in the ``scrape_configs`` section:

.. code-block:: yaml

    scrape_configs:
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
        - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
          action: keep
          regex: true
        - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
          action: replace
          regex: ([^:]+)(?::\d+)?;(\d+)
          replacement: ${1}:${2}
          target_label: __address__

.. _hubble_metrics:

Hubble Metrics
==============

While Cilium metrics allow you to monitor the state of Cilium itself,
Hubble metrics allow you to monitor the network behavior
of your Cilium-managed Kubernetes pods with respect to connectivity and security.

Installation
------------

To deploy Cilium with Hubble metrics enabled, you need to enable Hubble with
``hubble.enabled=true`` and provide a set of Hubble metrics you want to
enable via ``hubble.metrics.enabled``.

Some of the metrics can also be configured with additional options.
See the :ref:`Hubble exported metrics<hubble_exported_metrics>`
section for the full list of available metrics and their options.

.. parsed-literal::

   helm install cilium |CHART_RELEASE| \\
     --namespace kube-system \\
     --set prometheus.enabled=true \\
     --set operator.prometheus.enabled=true \\
     --set hubble.enabled=true \\
     --set hubble.metrics.enableOpenMetrics=true \\
     --set hubble.metrics.enabled="{dns,drop,tcp,flow,port-distribution,icmp,httpV2:exemplars=true;labelsContext=source_ip\\,source_namespace\\,source_workload\\,destination_ip\\,destination_namespace\\,destination_workload\\,traffic_direction}"

The port of the Hubble metrics can be configured with the
``hubble.metrics.port`` Helm value.
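
The long ``--set hubble.metrics.enabled=...`` argument can be easier to read as
a Helm values file. The following is a minimal sketch under the assumption that
the values are supplied via a file (name hypothetical); the port number is an
example. In a values file the commas inside ``labelsContext`` need no
backslash escaping, because that escaping is only required by ``--set``:

.. code-block:: yaml

    # hubble-metrics-values.yaml (hypothetical file name)
    hubble:
      enabled: true
      metrics:
        enableOpenMetrics: true
        port: 9965        # example value
        enabled:
          - dns
          - drop
          - tcp
          - flow
          - port-distribution
          - icmp
          - "httpV2:exemplars=true;labelsContext=source_ip,source_namespace,source_workload,destination_ip,destination_namespace,destination_workload,traffic_direction"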

For details on enabling Hubble metrics with TLS see the
:ref:`hubble_configure_metrics_tls` section of the documentation.

.. Note::

    L7 metrics, such as HTTP, are only emitted for pods that enable
    :ref:`Layer 7 Protocol Visibility <proxy_visibility>`.

When deployed with a non-empty ``hubble.metrics.enabled`` Helm value, the
Cilium chart will create a Kubernetes headless service named ``hubble-metrics``
with the ``prometheus.io/scrape:'true'`` annotation set:

.. code-block:: yaml

        prometheus.io/scrape: "true"
        prometheus.io/port: "9965"

Set the following options in the ``scrape_configs`` section of Prometheus to
have it scrape all Hubble metrics from the endpoints automatically:

.. code-block:: yaml

    scrape_configs:
      - job_name: 'kubernetes-endpoints'
        scrape_interval: 30s
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: (.+)(?::\d+);(\d+)
            replacement: $1:$2

.. _hubble_open_metrics:

OpenMetrics
-----------

Additionally, you can opt in to `OpenMetrics <https://openmetrics.io>`_ by
setting ``hubble.metrics.enableOpenMetrics=true``.
Enabling OpenMetrics configures the Hubble metrics endpoint to support exporting
metrics in OpenMetrics format when explicitly requested by clients.

Using OpenMetrics supports additional functionality such as Exemplars, which
enables associating metrics with traces by embedding trace IDs into the
exported metrics.

Prometheus needs to be configured to take advantage of OpenMetrics and will
only scrape exemplars when the `exemplars storage feature is enabled
<https://prometheus.io/docs/prometheus/latest/feature_flags/#exemplars-storage>`_.
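
For illustration, the feature flag is passed on the Prometheus command line.
The fragment below is a sketch of a container spec, assuming Prometheus runs as
a plain Kubernetes workload; the container name, image and config path are
placeholders:

.. code-block:: yaml

    containers:
      - name: prometheus                 # placeholder name
        image: prom/prometheus           # placeholder image reference
        args:
          - --config.file=/etc/prometheus/prometheus.yml
          # Enables in-memory storage of scraped exemplars.
          - --enable-feature=exemplar-storage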

OpenMetrics imposes a few additional requirements on metrics names and labels,
so this functionality is currently opt-in, though we believe all of the Hubble
metrics conform to the OpenMetrics requirements.


.. _clustermesh_apiserver_metrics:

Cluster Mesh API Server Metrics
===============================

Cluster Mesh API Server metrics provide insights into the state of the
``clustermesh-apiserver`` process, the ``kvstoremesh`` process (if enabled),
and the sidecar etcd instance.
Cluster Mesh API Server metrics are exported under the ``cilium_clustermesh_apiserver_``
Prometheus namespace. KVStoreMesh metrics are exported under the ``cilium_kvstoremesh_``
Prometheus namespace. Etcd metrics are exported under the ``etcd_`` Prometheus namespace.


Installation
------------

You can enable the metrics for different Cluster Mesh API Server components by
setting the following values:

* clustermesh-apiserver: ``clustermesh.apiserver.metrics.enabled=true``
* kvstoremesh: ``clustermesh.apiserver.metrics.kvstoremesh.enabled=true``
* sidecar etcd instance: ``clustermesh.apiserver.metrics.etcd.enabled=true``

.. parsed-literal::

   helm install cilium |CHART_RELEASE| \\
     --namespace kube-system \\
     --set clustermesh.useAPIServer=true \\
     --set clustermesh.apiserver.metrics.enabled=true \\
     --set clustermesh.apiserver.metrics.kvstoremesh.enabled=true \\
     --set clustermesh.apiserver.metrics.etcd.enabled=true

You can configure the ports via ``clustermesh.apiserver.metrics.port``,
``clustermesh.apiserver.metrics.kvstoremesh.port`` and
``clustermesh.apiserver.metrics.etcd.port`` respectively.

You can automatically create a
`Prometheus Operator <https://github.com/prometheus-operator/prometheus-operator>`_
``ServiceMonitor`` by setting ``clustermesh.apiserver.metrics.serviceMonitor.enabled=true``.
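
Expressed as a Helm values file, the Cluster Mesh options above nest as shown
in this sketch. The keys come from the values named in this section; the port
numbers are placeholders chosen for the example and should be checked against
your chart version:

.. code-block:: yaml

    clustermesh:
      useAPIServer: true
      apiserver:
        metrics:
          enabled: true
          port: 9962            # placeholder port
          kvstoremesh:
            enabled: true
            port: 9964          # placeholder port
          etcd:
            enabled: true
            port: 9963          # placeholder port
          serviceMonitor:
            enabled: true       # only useful with Prometheus Operator installed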

Example Prometheus & Grafana Deployment
=======================================

If you don't have an existing Prometheus and Grafana stack running, you can
deploy a stack with:

.. parsed-literal::

    kubectl apply -f \ |SCM_WEB|\/examples/kubernetes/addons/prometheus/monitoring-example.yaml

It will run Prometheus and Grafana in the ``cilium-monitoring`` namespace. If
you have enabled either Cilium or Hubble metrics, they will automatically
be scraped by Prometheus. You can then expose Grafana to access it via your browser.

.. code-block:: shell-session

    kubectl -n cilium-monitoring port-forward service/grafana --address 0.0.0.0 --address :: 3000:3000

Open your browser and access http://localhost:3000/

Metrics Reference
=================

cilium-agent
------------

Configuration
^^^^^^^^^^^^^

To expose any metrics, invoke ``cilium-agent`` with the
``--prometheus-serve-addr`` option. This option takes an ``IP:Port`` pair, but
passing an empty IP (e.g. ``:9962``) will bind the server to all available
interfaces (there is usually only one in a container).

To customize ``cilium-agent`` metrics, configure the ``--metrics`` option with
``"+metric_a -metric_b -metric_c"``, where ``+/-`` means to enable/disable
the metric. For example, for really large clusters, users may consider
disabling the following two metrics, as they generate too much data:

- ``cilium_node_connectivity_status``
- ``cilium_node_connectivity_latency_seconds``

You can then configure the agent with ``--metrics="-cilium_node_connectivity_status -cilium_node_connectivity_latency_seconds"``.
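
Put together, the two options might appear in a container spec as in the
following sketch. This assumes the agent's arguments are set directly on the
container; how flags actually reach ``cilium-agent`` depends on how Cilium is
deployed, so treat the surrounding structure as hypothetical:

.. code-block:: yaml

    containers:
      - name: cilium-agent
        args:
          # Empty IP: listen on all interfaces, port 9962.
          - --prometheus-serve-addr=:9962
          # Disable the two high-cardinality node connectivity metrics.
          - "--metrics=-cilium_node_connectivity_status -cilium_node_connectivity_latency_seconds"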

Exported Metrics
^^^^^^^^^^^^^^^^

Endpoint
~~~~~~~~

============================================ ================================================== ========== ========================================================
Name                                         Labels                                             Default    Description
============================================ ================================================== ========== ========================================================
``endpoint``                                                                                    Enabled    Number of endpoints managed by this agent
``endpoint_max_ifindex``                                                                        Disabled   Maximum interface index observed for existing endpoints
``endpoint_regenerations_total``             ``outcome``                                        Enabled    Count of all endpoint regenerations that have completed
``endpoint_regeneration_time_stats_seconds`` ``scope``                                          Enabled    Endpoint regeneration time stats
``endpoint_state``                           ``state``                                          Enabled    Count of all endpoints
============================================ ================================================== ========== ========================================================

The default enabled status of ``endpoint_max_ifindex`` is dynamic. On earlier
kernels (typically with version lower than 5.10), Cilium must store the
interface index for each endpoint in the conntrack map, which reserves 16 bits
for this field. If Cilium is running on such a kernel, this metric will be
enabled by default. It can be used to implement an alert if the ifindex is
approaching the limit of 65535. This may be the case in instances of
significant Endpoint churn.
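
Such an alert could look like the Prometheus alerting rule sketched below. The
metric name follows from the ``cilium_`` namespace described earlier; the group
name, alert name, threshold of 60000 and ``for`` duration are assumptions
picked for the example:

.. code-block:: yaml

    groups:
      - name: cilium-endpoint            # hypothetical group name
        rules:
          - alert: CiliumEndpointIfindexNearLimit
            # Fire well before the 16-bit limit of 65535 is reached.
            expr: cilium_endpoint_max_ifindex > 60000
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Endpoint interface index is approaching the limit of 65535"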

Services
~~~~~~~~

========================================== ================================================== ========== ========================================================
Name                                       Labels                                             Default    Description
========================================== ================================================== ========== ========================================================
``services_events_total``                                                                     Enabled    Number of services events labeled by action type
``service_implementation_delay``           ``action``                                         Enabled    Duration in seconds to propagate the data plane programming of a service, its network and endpoints from the time the service or the service pod was changed excluding the event queue latency
========================================== ================================================== ========== ========================================================

Cluster health
~~~~~~~~~~~~~~

========================================== ================================================== ========== ========================================================
Name                                       Labels                                             Default    Description
========================================== ================================================== ========== ========================================================
``unreachable_nodes``                                                                         Enabled    Number of nodes that cannot be reached
``unreachable_health_endpoints``                                                              Enabled    Number of health endpoints that cannot be reached
========================================== ================================================== ========== ========================================================
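
Both cluster health metrics lend themselves to simple threshold alerts. The
following rule group is a sketch only: the names and the ``for`` durations are
assumptions, and the right tolerance for transient unreachability depends on
the cluster:

.. code-block:: yaml

    groups:
      - name: cilium-cluster-health      # hypothetical group name
        rules:
          - alert: CiliumUnreachableNodes
            expr: cilium_unreachable_nodes > 0
            for: 10m                     # ignore short blips such as node reboots
          - alert: CiliumUnreachableHealthEndpoints
            expr: cilium_unreachable_health_endpoints > 0
            for: 10m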

Node Connectivity
~~~~~~~~~~~~~~~~~

========================================== ====================================================================================================================================================================== ========== ===================================================================================================================
Name                                       Labels                                                                                                                                                                 Default    Description
========================================== ====================================================================================================================================================================== ========== ===================================================================================================================
``node_connectivity_status``               ``source_cluster``, ``source_node_name``, ``target_cluster``, ``target_node_name``, ``target_node_type``, ``type``                                                     Enabled    The last observed status of both ICMP and HTTP connectivity between the current Cilium agent and other Cilium nodes
``node_connectivity_latency_seconds``      ``address_type``, ``protocol``, ``source_cluster``, ``source_node_name``, ``target_cluster``, ``target_node_ip``, ``target_node_name``, ``target_node_type``, ``type`` Enabled    The last observed latency between the current Cilium agent and other Cilium nodes in seconds
========================================== ====================================================================================================================================================================== ========== ===================================================================================================================

Clustermesh
~~~~~~~~~~~

=============================================== ============================================================ ========== =================================================================
Name                                            Labels                                                       Default    Description
=============================================== ============================================================ ========== =================================================================
``clustermesh_global_services``                 ``source_cluster``, ``source_node_name``                     Enabled    The total number of global services in the cluster mesh
``clustermesh_remote_clusters``                 ``source_cluster``, ``source_node_name``                     Enabled    The total number of remote clusters meshed with the local cluster
``clustermesh_remote_cluster_failures``         ``source_cluster``, ``source_node_name``, ``target_cluster`` Enabled    The total number of failures related to the remote cluster
``clustermesh_remote_cluster_nodes``            ``source_cluster``, ``source_node_name``, ``target_cluster`` Enabled    The total number of nodes in the remote cluster
``clustermesh_remote_cluster_last_failure_ts``  ``source_cluster``, ``source_node_name``, ``target_cluster`` Enabled    The timestamp of the last failure of the remote cluster
``clustermesh_remote_cluster_readiness_status`` ``source_cluster``, ``source_node_name``, ``target_cluster`` Enabled    The readiness status of the remote cluster
=============================================== ============================================================ ========== =================================================================

Datapath
~~~~~~~~

============================================= ================================================== ========== ========================================================
Name                                          Labels                                             Default    Description
============================================= ================================================== ========== ========================================================
``datapath_conntrack_dump_resets_total``      ``area``, ``name``, ``family``                     Enabled    Number of conntrack dump resets. Happens when a BPF entry gets removed while dumping the map is in progress.
``datapath_conntrack_gc_runs_total``          ``status``                                         Enabled    Number of times that the conntrack garbage collector process was run
``datapath_conntrack_gc_key_fallbacks_total``                                                    Enabled    Number of times a key fallback was needed when iterating over the BPF map
``datapath_conntrack_gc_entries``             ``family``                                         Enabled    The number of alive and deleted conntrack entries at the end of a garbage collector run
``datapath_conntrack_gc_duration_seconds``    ``status``                                         Enabled    Duration in seconds of the garbage collector process
============================================= ================================================== ========== ========================================================

IPsec
~~~~~

============================================= ================================================== ========== ===========================================================
Name                                          Labels                                             Default    Description
============================================= ================================================== ========== ===========================================================
``ipsec_xfrm_error``                          ``error``, ``type``                                Enabled    Total number of xfrm errors
``ipsec_keys``                                                                                   Enabled    Number of keys in use
``ipsec_xfrm_states``                         ``direction``                                      Enabled    Number of XFRM states
``ipsec_xfrm_policies``                       ``direction``                                      Enabled    Number of XFRM policies
============================================= ================================================== ========== ===========================================================

eBPF
~~~~

========================================== ===================================================================== ========== ========================================================
Name                                       Labels                                                                Default    Description
========================================== ===================================================================== ========== ========================================================
``bpf_syscall_duration_seconds``           ``operation``, ``outcome``                                            Disabled   Duration of eBPF system calls performed
``bpf_map_ops_total``                      ``mapName`` (deprecated), ``map_name``, ``operation``, ``outcome``    Enabled    Number of eBPF map operations performed. ``mapName`` is deprecated and will be removed in 1.10. Use ``map_name`` instead.
``bpf_map_pressure``                       ``map_name``                                                          Enabled    Map pressure is defined as a ratio of the required map size compared to its configured size. Values < 1.0 indicate the map's utilization, while values >= 1.0 indicate that the map is full. Policy map metrics are only reported when the ratio is over 0.1, i.e. 10% full.
``bpf_map_capacity``                       ``map_group``                                                         Enabled    Maximum size of eBPF maps by group of maps (type of map that have the same max capacity size). Map types with size of 65536 are not emitted, missing map types can be assumed to be 65536.
``bpf_maps_virtual_memory_max_bytes``                                                                            Enabled    Max memory used by eBPF maps installed in the system
``bpf_progs_virtual_memory_max_bytes``                                                                           Enabled    Max memory used by eBPF programs installed in the system
========================================== ===================================================================== ========== ========================================================

Both ``bpf_maps_virtual_memory_max_bytes`` and ``bpf_progs_virtual_memory_max_bytes``
currently report the system-wide memory usage of eBPF, including memory that is
not directly managed by Cilium. This might change in the future so that only
the eBPF memory usage directly managed by Cilium is reported.
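
Since ``bpf_map_pressure`` is a ratio where values at or above 1.0 mean a map
is full, it can be watched with a threshold rule. The sketch below assumes a
warning threshold of 0.9; the group name, alert name and duration are likewise
illustrative:

.. code-block:: yaml

    groups:
      - name: cilium-bpf                 # hypothetical group name
        rules:
          - alert: CiliumBPFMapPressureHigh
            # One series per map, identified by the map_name label.
            expr: cilium_bpf_map_pressure > 0.9
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "eBPF map {{ $labels.map_name }} is more than 90% full"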

Drops/Forwards (L3/L4)
~~~~~~~~~~~~~~~~~~~~~~

========================================== ================================================== ========== ========================================================
Name                                       Labels                                             Default    Description
========================================== ================================================== ========== ========================================================
``drop_count_total``                       ``reason``, ``direction``                          Enabled    Total dropped packets
``drop_bytes_total``                       ``reason``, ``direction``                          Enabled    Total dropped bytes
``forward_count_total``                    ``direction``                                      Enabled    Total forwarded packets
``forward_bytes_total``                    ``direction``                                      Enabled    Total forwarded bytes
========================================== ================================================== ========== ========================================================
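
Because these metrics are running totals, they are usually consumed as rates.
The recording rules below are one possible sketch: the rule names and the
five-minute window are assumptions, while the labels used for aggregation are
the ones listed in the table:

.. code-block:: yaml

    groups:
      - name: cilium-drops-forwards      # hypothetical group name
        rules:
          # Dropped packets per second, broken down by reason and direction.
          - record: cilium:drop_count:rate5m
            expr: sum by (reason, direction) (rate(cilium_drop_count_total[5m]))
          # Forwarded packets per second, broken down by direction.
          - record: cilium:forward_count:rate5m
            expr: sum by (direction) (rate(cilium_forward_count_total[5m]))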

Policy
~~~~~~

========================================== ================================================== ========== ========================================================
Name                                       Labels                                             Default    Description
========================================== ================================================== ========== ========================================================
``policy``                                                                                    Enabled    Number of policies currently loaded
``policy_regeneration_total``                                                                 Enabled    Deprecated, will be removed in Cilium 1.17 - use ``endpoint_regenerations_total`` instead. Total number of policies regenerated successfully
``policy_regeneration_time_stats_seconds`` ``scope``                                          Enabled    Deprecated, will be removed in Cilium 1.17 - use ``endpoint_regeneration_time_stats_seconds`` instead. Policy regeneration time stats labeled by the scope
``policy_max_revision``                                                                       Enabled    Highest policy revision number in the agent
``policy_change_total``                                                                       Enabled    Number of policy changes by outcome
``policy_endpoint_enforcement_status``                                                        Enabled    Number of endpoints labeled by policy enforcement status
``policy_implementation_delay``            ``source``                                         Enabled    Time in seconds between a policy change and it being fully deployed into the datapath, labeled by the policy's source
========================================== ================================================== ========== ========================================================

Policy L7 (HTTP/Kafka/FQDN)
~~~~~~~~~~~~~~~~~~~~~~~~~~~

======================================== ================================================== ========== ========================================================
Name                                     Labels                                             Default    Description
======================================== ================================================== ========== ========================================================
``proxy_redirects``                      ``protocol``                                       Enabled    Number of redirects installed for endpoints
``proxy_upstream_reply_seconds``         ``error``, ``protocol_l7``, ``scope``              Enabled    Seconds waited for upstream server to reply to a request
``proxy_datapath_update_timeout_total``                                                     Disabled   Number of total datapath update timeouts due to FQDN IP updates
``policy_l7_total``                      ``rule``, ``proxy_type``                           Enabled    Number of total L7 requests/responses
======================================== ================================================== ========== ========================================================
   418  
   419  Identity
   420  ~~~~~~~~
   421  
   422  ======================================== ================================================== ========== ========================================================
   423  Name                                     Labels                                             Default    Description
   424  ======================================== ================================================== ========== ========================================================
   425  ``identity``                             ``type``                                           Enabled    Number of identities currently allocated
   426  ``identity_label_sources``               ``source``                                         Enabled    Number of identities which contain at least one label from the given label source
   427  ``ipcache_errors_total``                 ``type``, ``error``                                Enabled    Number of errors interacting with the ipcache
   428  ``ipcache_events_total``                 ``type``                                           Enabled    Number of events interacting with the ipcache
   429  ======================================== ================================================== ========== ========================================================
   430  
   431  Events external to Cilium
   432  ~~~~~~~~~~~~~~~~~~~~~~~~~
   433  
   434  ======================================== ================================================== ========== ========================================================
   435  Name                                     Labels                                             Default    Description
   436  ======================================== ================================================== ========== ========================================================
   437  ``event_ts``                             ``source``                                         Enabled    Last timestamp when Cilium received an event from a control plane source, per resource and per action
   438  ``k8s_event_lag_seconds``                ``source``                                         Disabled   Lag for Kubernetes events - computed as the time between receiving a CNI ADD event from kubelet and receiving a Pod event from kube-apiserver
   439  ======================================== ================================================== ========== ========================================================
   440  
   441  Controllers
   442  ~~~~~~~~~~~
   443  
   444  ======================================== ================================================== ========== ========================================================
   445  Name                                     Labels                                             Default    Description
   446  ======================================== ================================================== ========== ========================================================
   447  ``controllers_runs_total``               ``status``                                         Enabled    Number of times that a controller process was run
   448  ``controllers_runs_duration_seconds``    ``status``                                         Enabled    Duration in seconds of the controller process
   449  ``controllers_group_runs_total``         ``status``, ``group_name``                         Enabled    Number of times that a controller process was run, labeled by controller group name
   450  ``controllers_failing``                                                                     Enabled    Number of failing controllers
   451  ======================================== ================================================== ========== ========================================================
   452  
   453  The ``controllers_group_runs_total`` metric reports the success and failure
   454  count of each controller within the system, labeled by controller group name
   455  and completion status. Due to the large number of controllers, this metric
   456  is enabled on a per-controller basis, using an allow-list that is passed
   457  either as the ``controller-group-metrics`` configuration flag or as the
   458  ``prometheus.controllerGroupMetrics`` Helm value. The current recommended
   459  default set of group names can be found in the values file of the Cilium
   460  Helm chart. The special names ``all`` and ``none`` are supported.
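
A minimal sketch of setting the allow-list through Helm values, assuming that
``prometheus.controllerGroupMetrics`` accepts a YAML list of group names. The
group names shown are illustrative only; consult the values file of your chart
version for the recommended set:

.. code-block:: yaml

    prometheus:
      enabled: true
      # Allow-list of controller group names (example entries, not a
      # recommendation).
      controllerGroupMetrics:
        - write-cni-file
        - sync-host-ips
        - sync-lb-maps-with-k8s-services

The special names ``all`` and ``none`` can be given in place of individual
group names.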
   461  
   462  SubProcess
   463  ~~~~~~~~~~
   464  
   465  ======================================== ================================================== ========== ========================================================
   466  Name                                     Labels                                             Default    Description
   467  ======================================== ================================================== ========== ========================================================
   468  ``subprocess_start_total``               ``subsystem``                                      Enabled    Number of times that Cilium has started a subprocess
   469  ======================================== ================================================== ========== ========================================================
   470  
   471  Kubernetes
   472  ~~~~~~~~~~
   473  
   474  =========================================== ================================================== ========== ========================================================
   475  Name                                        Labels                                             Default    Description
   476  =========================================== ================================================== ========== ========================================================
   477  ``kubernetes_events_received_total``        ``scope``, ``action``, ``validity``, ``equal``     Enabled    Number of Kubernetes events received
   478  ``kubernetes_events_total``                 ``scope``, ``action``, ``outcome``                 Enabled    Number of Kubernetes events processed
   479  ``k8s_cnp_status_completion_seconds``       ``attempts``, ``outcome``                          Enabled    Duration in seconds to complete a CNP status update
   480  ``k8s_terminating_endpoints_events_total``                                                     Enabled    Number of terminating endpoint events received from Kubernetes
   481  =========================================== ================================================== ========== ========================================================
   482  
   483  Kubernetes Rest Client
   484  ~~~~~~~~~~~~~~~~~~~~~~
   485  
   486  ============================================= ============================================= ========== ===========================================================
   487  Name                                          Labels                                        Default    Description
   488  ============================================= ============================================= ========== ===========================================================
   489  ``k8s_client_api_latency_time_seconds``       ``path``, ``method``                          Enabled    Duration of processed API calls labeled by path and method
   490  ``k8s_client_rate_limiter_duration_seconds``  ``path``, ``method``                          Enabled    Kubernetes client rate limiter latency in seconds. Broken down by path and method
   491  ``k8s_client_api_calls_total``                ``host``, ``method``, ``return_code``         Enabled    Number of API calls made to kube-apiserver labeled by host, method and return code
   492  ============================================= ============================================= ========== ===========================================================
   493  
   494  Kubernetes workqueue
   495  ~~~~~~~~~~~~~~~~~~~~
   496  
   497  ==================================================== ============================================= ========== ===========================================================
   498  Name                                                 Labels                                        Default    Description
   499  ==================================================== ============================================= ========== ===========================================================
   500  ``k8s_workqueue_depth``                              ``name``                                      Enabled    Current depth of workqueue
   501  ``k8s_workqueue_adds_total``                         ``name``                                      Enabled    Total number of adds handled by workqueue
   502  ``k8s_workqueue_queue_duration_seconds``             ``name``                                      Enabled    Duration in seconds an item stays in workqueue prior to request
   503  ``k8s_workqueue_work_duration_seconds``              ``name``                                      Enabled    Duration in seconds to process an item from workqueue
   504  ``k8s_workqueue_unfinished_work_seconds``            ``name``                                      Enabled    Duration in seconds of work in progress that hasn't been observed by work_duration. Large values indicate stuck threads. You can deduce the number of stuck threads by observing the rate at which this value increases.
   505  ``k8s_workqueue_longest_running_processor_seconds``  ``name``                                      Enabled    Duration in seconds of the longest running processor for workqueue
   506  ``k8s_workqueue_retries_total``                      ``name``                                      Enabled    Total number of retries handled by workqueue
   507  ==================================================== ============================================= ========== ===========================================================
   508  
   509  IPAM
   510  ~~~~
   511  
   512  ======================================== ============================================ ========== ========================================================
   513  Name                                     Labels                                       Default    Description
   514  ======================================== ============================================ ========== ========================================================
   515  ``ipam_capacity``                        ``family``                                   Enabled    Total number of IPs in the IPAM pool labeled by family
   516  ``ipam_events_total``                                                                 Enabled    Number of IPAM events received labeled by action and datapath family type
   517  ``ip_addresses``                         ``family``                                   Enabled    Number of allocated IP addresses
   518  ======================================== ============================================ ========== ========================================================
   519  
   520  KVstore
   521  ~~~~~~~
   522  
   523  ======================================== ============================================ ========== ========================================================
   524  Name                                     Labels                                       Default    Description
   525  ======================================== ============================================ ========== ========================================================
   526  ``kvstore_operations_duration_seconds``  ``action``, ``kind``, ``outcome``, ``scope`` Enabled    Duration of kvstore operation
   527  ``kvstore_events_queue_seconds``         ``action``, ``scope``                        Enabled    Seconds waited before a received event was queued
   528  ``kvstore_quorum_errors_total``          ``error``                                    Enabled    Number of quorum errors
   529  ``kvstore_sync_errors_total``            ``scope``, ``source_cluster``                Enabled    Number of times synchronization to the kvstore failed
   530  ``kvstore_sync_queue_size``              ``scope``, ``source_cluster``                Enabled    Number of elements queued for synchronization in the kvstore
   531  ``kvstore_initial_sync_completed``       ``scope``, ``source_cluster``, ``action``    Enabled    Whether the initial synchronization from/to the kvstore has completed
   532  ======================================== ============================================ ========== ========================================================
   533  
   534  Agent
   535  ~~~~~
   536  
   537  ================================ ================================ ========== ========================================================
   538  Name                             Labels                           Default    Description
   539  ================================ ================================ ========== ========================================================
   540  ``agent_bootstrap_seconds``      ``scope``, ``outcome``           Enabled    Duration of various bootstrap phases
   541  ``api_process_time_seconds``                                      Enabled    Processing time of all the API calls made to the cilium-agent, labeled by API method, API path and returned HTTP code.
   542  ================================ ================================ ========== ========================================================
   543  
   544  FQDN
   545  ~~~~
   546  
   547  ================================== ================================ ============ ========================================================
   548  Name                               Labels                           Default      Description
   549  ================================== ================================ ============ ========================================================
   550  ``fqdn_gc_deletions_total``                                         Enabled      Number of FQDNs that have been cleaned up by the FQDN garbage collector job
   551  ``fqdn_active_names``              ``endpoint``                     Disabled     Number of domains inside the DNS cache that have not expired (by TTL), per endpoint
   552  ``fqdn_active_ips``                ``endpoint``                     Disabled     Number of IPs inside the DNS cache associated with a domain that has not expired (by TTL), per endpoint
   553  ``fqdn_alive_zombie_connections``  ``endpoint``                     Disabled     Number of IPs associated with domains that have expired (by TTL) yet are still associated with an active connection (aka zombie), per endpoint
   554  ``fqdn_selectors``                                                  Enabled      Number of registered ToFQDN selectors
   555  ================================== ================================ ============ ========================================================
   556  
   557  Jobs
   558  ~~~~
   559  
   560  ================================== ================================ ============ ========================================================
   561  Name                               Labels                           Default      Description
   562  ================================== ================================ ============ ========================================================
   563  ``jobs_errors_total``              ``job``                          Enabled      Number of job runs that returned an error
   564  ``jobs_one_shot_run_seconds``      ``job``                          Enabled      Histogram of one shot job run duration
   565  ``jobs_timer_run_seconds``         ``job``                          Enabled      Histogram of timer job run duration
   566  ``jobs_observer_run_seconds``      ``job``                          Enabled      Histogram of observer job run duration
   567  ================================== ================================ ============ ========================================================
   568  
   569  CIDRGroups
   570  ~~~~~~~~~~
   571  
   572  =================================================== ===================== ========== ========================================================
   573  Name                                                Labels                Default    Description
   574  =================================================== ===================== ========== ========================================================
   575  ``cidrgroups_referenced``                                                 Enabled    Number of CNPs and CCNPs referencing at least one CiliumCIDRGroup. CNPs with empty or non-existing CIDRGroupRefs are not considered
   576  ``cidrgroup_translation_time_stats_seconds``                              Disabled   CIDRGroup translation time stats
   577  =================================================== ===================== ========== ========================================================
   578  
   579  .. _metrics_api_rate_limiting:
   580  
   581  API Rate Limiting
   582  ~~~~~~~~~~~~~~~~~
   583  
   584  ============================================== ========================================== ========== ========================================================
   585  Name                                           Labels                                     Default    Description
   586  ============================================== ========================================== ========== ========================================================
   587  ``api_limiter_adjustment_factor``              ``api_call``                               Enabled    Most recent adjustment factor for automatic adjustment
   588  ``api_limiter_processed_requests_total``       ``api_call``, ``outcome``, ``return_code`` Enabled    Total number of API requests processed
   589  ``api_limiter_processing_duration_seconds``    ``api_call``, ``value``                    Enabled    Mean and estimated processing duration in seconds
   590  ``api_limiter_rate_limit``                     ``api_call``, ``value``                    Enabled    Current rate limiting configuration (limit and burst)
   591  ``api_limiter_requests_in_flight``             ``api_call``, ``value``                    Enabled    Current and maximum allowed number of requests in flight
   592  ``api_limiter_wait_duration_seconds``          ``api_call``, ``value``                    Enabled    Mean, min, and max wait duration
   593  ``api_limiter_wait_history_duration_seconds``  ``api_call``                               Disabled   Histogram of wait duration per API call processed
   594  ============================================== ========================================== ========== ========================================================
   595  
   596  .. _metrics_bgp_control_plane:
   597  
   598  BGP Control Plane
   599  ~~~~~~~~~~~~~~~~~
   600  
   601  ====================== ============================================= ======== ===================================================================
   602  Name                   Labels                                        Default  Description
   603  ====================== ============================================= ======== ===================================================================
   604  ``session_state``      ``vrouter``, ``neighbor``                     Enabled  Current state of the BGP session with the peer, Up = 1 or Down = 0
   605  ``advertised_routes``  ``vrouter``, ``neighbor``, ``afi``, ``safi``  Enabled  Number of routes advertised to the peer
   606  ``received_routes``    ``vrouter``, ``neighbor``, ``afi``, ``safi``  Enabled  Number of routes received from the peer
   607  ====================== ============================================= ======== ===================================================================
   608  
   609  All metrics are enabled only when the BGP Control Plane is enabled.
   610  
   611  cilium-operator
   612  ---------------
   613  
   614  Configuration
   615  ^^^^^^^^^^^^^
   616  
   617  ``cilium-operator`` can be configured to serve metrics by running with the
   618  option ``--enable-metrics``. By default, the operator exposes metrics on
   619  port 9963; the port can be changed with the option
   620  ``--operator-prometheus-serve-addr``.
   621  
   622  Exported Metrics
   623  ^^^^^^^^^^^^^^^^
   624  
   625  All metrics are exported under the ``cilium_operator_`` Prometheus namespace.
   626  
   627  .. _ipam_metrics:
   628  
   629  IPAM
   630  ~~~~
   631  
   632  .. Note::
   633  
   634      IPAM metrics are all ``Enabled`` only if using the AWS, Alibabacloud or Azure IPAM plugins.
   635  
   636  ======================================== ================================================================= ========== ========================================================
   637  Name                                     Labels                                                            Default    Description
   638  ======================================== ================================================================= ========== ========================================================
   639  ``ipam_ips``                             ``type``                                                          Enabled    Number of IPs allocated
   640  ``ipam_ip_allocation_ops``               ``subnet_id``                                                     Enabled    Number of IP allocation operations.
   641  ``ipam_ip_release_ops``                  ``subnet_id``                                                     Enabled    Number of IP release operations.
   642  ``ipam_interface_creation_ops``          ``subnet_id``                                                     Enabled    Number of interface creation operations.
   643  ``ipam_release_duration_seconds``        ``type``, ``status``, ``subnet_id``                               Enabled    Latency of IP or interface release in seconds
   644  ``ipam_allocation_duration_seconds``     ``type``, ``status``, ``subnet_id``                               Enabled    Latency of IP or interface allocation in seconds
   645  ``ipam_available_interfaces``                                                                              Enabled    Number of interfaces with addresses available
   646  ``ipam_nodes``                           ``category``                                                      Enabled    Number of nodes by category { total | in-deficit | at-capacity }
   647  ``ipam_resync_total``                                                                                      Enabled    Number of synchronization operations with external IPAM API
   648  ``ipam_api_duration_seconds``            ``operation``, ``response_code``                                  Enabled    Duration of interactions with external IPAM API.
   649  ``ipam_api_rate_limit_duration_seconds`` ``operation``                                                     Enabled    Duration of rate limiting while accessing external IPAM API
   650  ``ipam_available_ips``                   ``target_node``                                                   Enabled    Number of available IPs on a node (taking into account plugin specific NIC/Address limits).
   651  ``ipam_used_ips``                        ``target_node``                                                   Enabled    Number of currently used IPs on a node.
   652  ``ipam_needed_ips``                      ``target_node``                                                   Enabled    Number of IPs needed to satisfy allocation on a node.
   653  ======================================== ================================================================= ========== ========================================================
   654  
   655  LB-IPAM
   656  ~~~~~~~
   657  
   658  ======================================== ================================================================= ========== ========================================================
   659  Name                                     Labels                                                            Default    Description
   660  ======================================== ================================================================= ========== ========================================================
   661  ``lbipam_conflicting_pools_total``                                                                         Enabled    Number of conflicting pools
   662  ``lbipam_ips_available_total``           ``pool``                                                          Enabled    Number of available IPs per pool
   663  ``lbipam_ips_used_total``                ``pool``                                                          Enabled    Number of used IPs per pool
   664  ``lbipam_services_matching_total``                                                                         Enabled    Number of matching services
   665  ``lbipam_services_unsatisfied_total``                                                                      Enabled    Number of services which did not get requested IPs
   666  ======================================== ================================================================= ========== ========================================================
   667  
   668  Controllers
   669  ~~~~~~~~~~~
   670  
   671  ======================================== ================================================== ========== ========================================================
   672  Name                                     Labels                                             Default    Description
   673  ======================================== ================================================== ========== ========================================================
   674  ``controllers_group_runs_total``         ``status``, ``group_name``                         Enabled    Number of times that a controller process was run, labeled by controller group name
   675  ======================================== ================================================== ========== ========================================================
   676  
   677  The ``controllers_group_runs_total`` metric reports the success and failure
   678  count of each controller within the system, labeled by controller group name
   679  and completion status. Due to the large number of controllers, this metric
   680  is enabled on a per-controller basis, using an allow-list that is passed
   681  either as the ``controller-group-metrics`` configuration flag or as the
   682  ``prometheus.controllerGroupMetrics`` Helm value. The current recommended
   683  default set of group names can be found in the values file of the Cilium
   684  Helm chart. The special names ``all`` and ``none`` are supported.
   685  
   686  .. _ces_metrics:
   687  
   688  CiliumEndpointSlices (CES)
   689  ~~~~~~~~~~~~~~~~~~~~~~~~~~
   690  
   691  ============================================== ================================ ========================================================
   692  Name                                           Labels                           Description
   693  ============================================== ================================ ========================================================
   694  ``number_of_ceps_per_ces``                                                      The number of CEPs batched in a CES
   695  ``number_of_cep_changes_per_ces``              ``opcode``                       The number of changed CEPs in each CES update
   696  ``ces_sync_total``                             ``outcome``                      The number of completed CES syncs by outcome
   697  ``ces_queueing_delay_seconds``                                                  CiliumEndpointSlice queueing delay in seconds
   698  ============================================== ================================ ========================================================
   699  
   700  
   701  Hubble
   702  ------
   703  
   704  Configuration
   705  ^^^^^^^^^^^^^
   706  
   707  Hubble metrics are served by a Hubble instance running inside ``cilium-agent``.
   708  The command-line options to configure them are ``--enable-hubble``,
   709  ``--hubble-metrics-server``, and ``--hubble-metrics``.
   710  ``--hubble-metrics-server`` takes an ``IP:Port`` pair, but
   711  passing an empty IP (e.g. ``:9965``) will bind the server to all available
   712  interfaces. ``--hubble-metrics`` takes a comma-separated list of metrics.
   713  It's also possible to configure Hubble metrics to listen with TLS and
   714  optionally use mTLS for authentication. For details see :ref:`hubble_configure_metrics_tls`.
   715  
   716  Some metrics can take additional semicolon-separated options per metric, e.g.
   717  ``--hubble-metrics="dns:query;ignoreAAAA,http:destinationContext=workload-name"``
   718  will enable the ``dns`` metric with the ``query`` and ``ignoreAAAA`` options,
   719  and the ``http`` metric with the ``destinationContext=workload-name`` option.
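
When deploying with Helm, the same metric list would typically be given through
Helm values rather than the command-line flag. The sketch below assumes the
``hubble.metrics.enabled`` value and assumes that each list entry uses the same
``metric:option;option`` syntax as ``--hubble-metrics``:

.. code-block:: yaml

    hubble:
      enabled: true
      metrics:
        enabled:
          # dns metric with the query and ignoreAAAA options
          - "dns:query;ignoreAAAA"
          # http metric with the destinationContext=workload-name option
          - "http:destinationContext=workload-name"

Each entry is quoted so that the ``:`` and ``;`` characters are passed through
unchanged as part of the string.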
   720  
   721  .. _hubble_context_options:
   722  
   723  Context Options
   724  ^^^^^^^^^^^^^^^
   725  
   726  Hubble metrics support configuration via context options.
   727  Supported context options for all metrics:
   728  
   729  - ``sourceContext`` - Configures the ``source`` label on metrics for both egress and ingress traffic.
   730  - ``sourceEgressContext`` - Configures the ``source`` label on metrics for egress traffic (takes precedence over ``sourceContext``).
   731  - ``sourceIngressContext`` - Configures the ``source`` label on metrics for ingress traffic (takes precedence over ``sourceContext``).
   732  - ``destinationContext`` - Configures the ``destination`` label on metrics for both egress and ingress traffic.
   733  - ``destinationEgressContext`` - Configures the ``destination`` label on metrics for egress traffic (takes precedence over ``destinationContext``).
   734  - ``destinationIngressContext`` - Configures the ``destination`` label on metrics for ingress traffic (takes precedence over ``destinationContext``).
   735  - ``labelsContext`` - Configures a list of labels to be enabled on metrics.
   736  
   737  There are also some context options that are specific to certain metrics.
   738  See the documentation for the individual metrics to see what options are available for each.
   739  
   740  See below for details on each of the different context options.
   741  
   742  Most Hubble metrics can be configured to add the source and/or destination
   743  context as a label using the ``sourceContext`` and ``destinationContext``
   744  options. The possible values are:
   745  
   746  ===================== ===================================================================================
   747  Option Value          Description
   748  ===================== ===================================================================================
   749  ``identity``          All Cilium security identity labels
   750  ``namespace``         Kubernetes namespace name
   751  ``pod``               Kubernetes pod name and namespace name in the form of ``namespace/pod``.
   752  ``pod-name``          Kubernetes pod name.
   753  ``dns``               All known DNS names of the source or destination (comma-separated)
   754  ``ip``                The IPv4 or IPv6 address
   755  ``reserved-identity`` Reserved identity label.
   756  ``workload``          Kubernetes pod's workload name and namespace in the form of ``namespace/workload-name``.
   757  ``workload-name``     Kubernetes pod's workload name (workloads are: Deployment, Statefulset, Daemonset, ReplicationController, CronJob, Job, DeploymentConfig (OpenShift), etc).
   758  ``app``               Kubernetes pod's app name, derived from pod labels (``app.kubernetes.io/name``, ``k8s-app``, or ``app``).
   759  ===================== ===================================================================================
   760  
   761  When specifying the source and/or destination context, multiple contexts can be
   762  specified by separating them via the ``|`` symbol.
   763  When multiple are specified, then the first non-empty value is added to the
   764  metric as a label. For example, a metric configuration of
   765  ``flow:destinationContext=dns|ip`` will first try to use the DNS name of the
   766  target for the label. If no DNS name is known for the target, it will fall back
   767  and use the IP address of the target instead.
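
As an illustration of the fallback behavior, the sketch below configures the
``flow`` metric with a fallback chain for both labels. It assumes the Helm
value ``hubble.metrics.enabled`` is used to pass the metric list; the choice
of ``workload-name|reserved-identity`` for the source is a hypothetical
example, not a recommended setting:

.. code-block:: yaml

    hubble:
      metrics:
        enabled:
          # source label: workload name if known, otherwise the reserved
          # identity label (illustrative choice).
          # destination label: DNS name if known, otherwise the IP address.
          - "flow:sourceContext=workload-name|reserved-identity;destinationContext=dns|ip"

Only the first non-empty value in each chain ends up in the label, so the
order of the ``|``-separated contexts matters.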
   768  
   769  .. note::
   770  
   771     There are 3 cases in which the identity label list contains multiple reserved labels:
   772  
   773     1. ``reserved:kube-apiserver`` and ``reserved:host``
   774     2. ``reserved:kube-apiserver`` and ``reserved:remote-node``
   775     3. ``reserved:kube-apiserver`` and ``reserved:world``
   776  
   777     In all of these 3 cases, ``reserved-identity`` context returns ``reserved:kube-apiserver``.
   778  
   779  Hubble metrics can also be configured with a ``labelsContext``, which takes a list of labels
   780  that should be added to the metric. Unlike ``sourceContext`` and ``destinationContext``, which
   781  put different values into the same metric label, ``labelsContext`` puts each value into its own label.
   782  
   783  ============================== ===============================================================================
   784  Option Value                   Description
   785  ============================== ===============================================================================
   786  ``source_ip``                  The source IP of the flow.
   787  ``source_namespace``           The namespace of the pod if the flow source is a Kubernetes pod.
   788  ``source_pod``                 The pod name if the flow source is a Kubernetes pod.
   789  ``source_workload``            The name of the source pod's workload (Deployment, StatefulSet, DaemonSet, ReplicationController, CronJob, Job, DeploymentConfig (OpenShift)).
   790  ``source_workload_kind``       The kind of the source pod's workload, for example, Deployment, StatefulSet, DaemonSet, ReplicationController, CronJob, Job, DeploymentConfig (OpenShift).
   791  ``source_app``                 The app name of the source pod, derived from pod labels (``app.kubernetes.io/name``, ``k8s-app``, or ``app``).
   792  ``destination_ip``             The destination IP of the flow.
   793  ``destination_namespace``      The namespace of the pod if the flow destination is a Kubernetes pod.
   794  ``destination_pod``            The pod name if the flow destination is a Kubernetes pod.
   795  ``destination_workload``       The name of the destination pod's workload (Deployment, StatefulSet, DaemonSet, ReplicationController, CronJob, Job, DeploymentConfig (OpenShift)).
   796  ``destination_workload_kind``  The kind of the destination pod's workload, for example, Deployment, StatefulSet, DaemonSet, ReplicationController, CronJob, Job, DeploymentConfig (OpenShift).
   797  ``destination_app``            The app name of the destination pod, derived from pod labels (``app.kubernetes.io/name``, ``k8s-app``, or ``app``).
   798  ``traffic_direction``          Identifies the traffic direction of the flow. Possible values are ``ingress``, ``egress`` and ``unknown``.
   799  ============================== ===============================================================================
   800  
   801  When specifying the ``labelsContext``, multiple values can be specified by separating them via the ``,`` symbol.
   802  All labels listed are included in the metric, even if empty. For example, a metric configuration of
   803  ``http:labelsContext=source_namespace,source_pod`` will add the ``source_namespace`` and ``source_pod``
   804  labels to all Hubble HTTP metrics.
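
The difference from the ``|`` fallback can be sketched in Python as follows. The
``labels_for`` helper and the flat dictionary used to represent a flow are
hypothetical and only serve to show that every listed label is emitted, even when
its value is empty.

.. code-block:: python

    def labels_for(spec, flow):
        """Build the extra metric labels for a comma-separated labelsContext.

        ``flow`` is a hypothetical dict holding the values known for a flow,
        keyed by label name (for example "source_namespace").
        """
        names = [name.strip() for name in spec.split(",") if name.strip()]
        # Every requested label is present in the result; unknown values
        # are reported as an empty string instead of being dropped.
        return {name: flow.get(name, "") for name in names}


    labels = labels_for(
        "source_namespace,source_pod",
        {"source_namespace": "default"},
    )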
   805  
   806  .. note::
   807  
   808      To limit metrics cardinality, Hubble removes data series bound to a specific pod one minute after the pod is deleted.
   809      A metric series is considered bound to a specific pod when at least one of the following conditions is met:
   810  
   811      * ``sourceContext`` is set to ``pod`` and the metric series has a ``source`` label matching ``<pod_namespace>/<pod_name>``
   812      * ``destinationContext`` is set to ``pod`` and the metric series has a ``destination`` label matching ``<pod_namespace>/<pod_name>``
   813      * ``labelsContext`` contains both ``source_namespace`` and ``source_pod`` and the metric series labels match the namespace and name of the deleted pod
   814      * ``labelsContext`` contains both ``destination_namespace`` and ``destination_pod`` and the metric series labels match the namespace and name of the deleted pod
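
The four conditions above can be read as a single predicate. The Python sketch below
is one possible reading, not Hubble's implementation: the function name, its
parameters, and the exact-match treatment of ``sourceContext`` and
``destinationContext`` (a value of exactly ``pod``) are assumptions.

.. code-block:: python

    def series_bound_to_pod(series_labels, namespace, name,
                            source_context="", destination_context="",
                            labels_context=()):
        """Return True if a metric series would be tied to the deleted pod.

        ``series_labels`` is a dict of the label values of one metric series.
        """
        pod_ref = f"{namespace}/{name}"
        requested = set(labels_context)

        if source_context == "pod" and series_labels.get("source") == pod_ref:
            return True
        if destination_context == "pod" and series_labels.get("destination") == pod_ref:
            return True
        # labelsContext only binds a series when both the namespace and the
        # pod label are configured and both match the deleted pod.
        for side in ("source", "destination"):
            ns_key, pod_key = f"{side}_namespace", f"{side}_pod"
            if ({ns_key, pod_key} <= requested
                    and series_labels.get(ns_key) == namespace
                    and series_labels.get(pod_key) == name):
                return True
        return False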
   815  
   816  .. _hubble_exported_metrics:
   817  
   818  Exported Metrics
   819  ^^^^^^^^^^^^^^^^
   820  
   821  Hubble metrics are exported under the ``hubble_`` Prometheus namespace.
   822  
   823  lost events
   824  ~~~~~~~~~~~
   825  
   826  This metric, unlike other ones, is not directly tied to network flows. It's enabled if any of the other metrics is enabled.
   827  
   828  ================================ ======================================== ========== ==================================================
   829  Name                             Labels                                   Default    Description
   830  ================================ ======================================== ========== ==================================================
   831  ``lost_events_total``            ``source``                               Enabled    Number of lost events
   832  ================================ ======================================== ========== ==================================================
   833  
   834  Labels
   835  """"""
   836  
   837  - ``source`` identifies the source of lost events, one of:
   838     - ``perf_event_ring_buffer``
   839     - ``observer_events_queue``
   840     - ``hubble_ring_buffer``
   841  
   842  
   843  ``dns``
   844  ~~~~~~~
   845  
   846  ================================ ======================================== ========== ===================================
   847  Name                             Labels                                   Default    Description
   848  ================================ ======================================== ========== ===================================
   849  ``dns_queries_total``            ``rcode``, ``qtypes``, ``ips_returned``  Disabled   Number of DNS queries observed
   850  ``dns_responses_total``          ``rcode``, ``qtypes``, ``ips_returned``  Disabled   Number of DNS responses observed
   851  ``dns_response_types_total``     ``type``, ``qtypes``                     Disabled   Number of DNS response types
   852  ================================ ======================================== ========== ===================================
   853  
   854  Options
   855  """""""
   856  
   857  ============== ============= ====================================================================================
   858  Option Key     Option Value  Description
   859  ============== ============= ====================================================================================
   860  ``query``      N/A           Include the query as label "query"
   861  ``ignoreAAAA`` N/A           Ignore any AAAA requests/responses
   862  ============== ============= ====================================================================================
   863  
   864  This metric supports :ref:`Context Options<hubble_context_options>`.
   865  
   866  
   867  ``drop``
   868  ~~~~~~~~
   869  
   870  ================================ ======================================== ========== ===================================
   871  Name                             Labels                                   Default    Description
   872  ================================ ======================================== ========== ===================================
   873  ``drop_total``                   ``reason``, ``protocol``                 Disabled   Number of drops
   874  ================================ ======================================== ========== ===================================
   875  
   876  Options
   877  """""""
   878  
   879  This metric supports :ref:`Context Options<hubble_context_options>`.
   880  
   881  ``flow``
   882  ~~~~~~~~
   883  
   884  ================================ ======================================== ========== ===================================
   885  Name                             Labels                                   Default    Description
   886  ================================ ======================================== ========== ===================================
   887  ``flows_processed_total``        ``type``, ``subtype``, ``verdict``       Disabled   Total number of flows processed
   888  ================================ ======================================== ========== ===================================
   889  
   890  Options
   891  """""""
   892  
   893  This metric supports :ref:`Context Options<hubble_context_options>`.
   894  
   895  ``flows-to-world``
   896  ~~~~~~~~~~~~~~~~~~
   897  
   898  This metric counts all non-reply flows containing the ``reserved:world`` label in their
   899  destination identity. By default, dropped flows are counted if and only if the drop reason
   900  is ``Policy denied``. Set the ``any-drop`` option to count all dropped flows.
   901  
   902  ================================ ======================================== ========== ============================================
   903  Name                             Labels                                   Default    Description
   904  ================================ ======================================== ========== ============================================
   905  ``flows_to_world_total``         ``protocol``, ``verdict``                Disabled   Total number of flows to ``reserved:world``.
   906  ================================ ======================================== ========== ============================================
   907  
   908  Options
   909  """""""
   910  
   911  ============== ============= ======================================================
   912  Option Key     Option Value  Description
   913  ============== ============= ======================================================
   914  ``any-drop``   N/A           Count any dropped flows regardless of the drop reason.
   915  ``port``       N/A           Include the destination port as label ``port``.
   916  ``syn-only``   N/A           Only count non-reply SYNs for TCP flows.
   917  ============== ============= ======================================================
   918  
   919  
   920  This metric supports :ref:`Context Options<hubble_context_options>`.
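
The counting rule for ``flows-to-world`` and its ``any-drop`` and ``syn-only`` options
can be summarized as a small predicate. The following Python sketch is an
interpretation of the description above; the ``Flow`` class, its field names, and the
verdict strings are hypothetical stand-ins for the real flow data.

.. code-block:: python

    from dataclasses import dataclass


    @dataclass
    class Flow:
        # Hypothetical, simplified view of a flow for illustration only.
        destination_labels: tuple = ()
        is_reply: bool = False
        verdict: str = "FORWARDED"
        drop_reason: str = ""
        protocol: str = "TCP"
        tcp_syn: bool = False


    def counts_toward_flows_to_world(flow, any_drop=False, syn_only=False):
        """Decide whether a flow would be counted, under the stated assumptions."""
        # Only non-reply flows whose destination carries reserved:world qualify.
        if flow.is_reply or "reserved:world" not in flow.destination_labels:
            return False
        # Dropped flows count only for "Policy denied" unless any-drop is set.
        if flow.verdict == "DROPPED" and not any_drop and flow.drop_reason != "Policy denied":
            return False
        # With syn-only, TCP flows are counted only when they are SYNs.
        if syn_only and flow.protocol == "TCP" and not flow.tcp_syn:
            return False
        return True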
   921  
   922  ``http``
   923  ~~~~~~~~
   924  
   925  Deprecated; use ``httpV2`` instead.
   926  These metrics cannot be enabled at the same time as ``httpV2``.
   927  
   928  ================================= ======================================= ========== ==============================================
   929  Name                              Labels                                  Default    Description
   930  ================================= ======================================= ========== ==============================================
   931  ``http_requests_total``           ``method``, ``protocol``, ``reporter``  Disabled   Count of HTTP requests
   932  ``http_responses_total``          ``method``, ``status``, ``reporter``    Disabled   Count of HTTP responses
   933  ``http_request_duration_seconds`` ``method``, ``reporter``                Disabled   Histogram of HTTP request duration in seconds
   934  ================================= ======================================= ========== ==============================================
   935  
   936  Labels
   937  """"""
   938  
   939  - ``method`` is the HTTP method of the request/response.
   940  - ``protocol`` is the HTTP protocol of the request (for example, ``HTTP/1.1`` or ``HTTP/2``).
   941  - ``status`` is the HTTP status code of the response.
   942  - ``reporter`` identifies the origin of the request/response. It is set to ``client`` if it originated from the client, ``server`` if it originated from the server, or ``unknown`` if its origin is unknown.
   943  
   944  Options
   945  """""""
   946  
   947  This metric supports :ref:`Context Options<hubble_context_options>`.
   948  
   949  ``httpV2``
   950  ~~~~~~~~~~
   951  
   952  ``httpV2`` is an updated version of the existing ``http`` metrics.
   953  These metrics cannot be enabled at the same time as ``http``.
   954  
   955  The main difference is that ``http_requests_total`` and
   956  ``http_responses_total`` have been consolidated, and use the response flow
   957  data.
   958  
   959  Additionally, the source/destination-related labels of the
   960  ``http_request_duration_seconds`` metric are now from the perspective of the
   961  request. In the ``http`` metrics, the source and destination were swapped,
   962  because the metric uses the response flow data, in which source and
   963  destination are reversed. ``httpV2`` corrects for this.
   964  
   965  ================================= =================================================== ========== ==============================================
   966  Name                              Labels                                              Default    Description
   967  ================================= =================================================== ========== ==============================================
   968  ``http_requests_total``           ``method``, ``protocol``, ``status``, ``reporter``  Disabled   Count of HTTP requests
   969  ``http_request_duration_seconds`` ``method``, ``reporter``                            Disabled   Histogram of HTTP request duration in seconds
   970  ================================= =================================================== ========== ==============================================
   971  
   972  Labels
   973  """"""
   974  
   975  - ``method`` is the HTTP method of the request/response.
   976  - ``protocol`` is the HTTP protocol of the request (for example, ``HTTP/1.1`` or ``HTTP/2``).
   977  - ``status`` is the HTTP status code of the response.
   978  - ``reporter`` identifies the origin of the request/response. It is set to ``client`` if it originated from the client, ``server`` if it originated from the server, or ``unknown`` if its origin is unknown.
   979  
   980  Options
   981  """""""
   982  
   983  ============== ============== =============================================================================================================
   984  Option Key     Option Value   Description
   985  ============== ============== =============================================================================================================
   986  ``exemplars``  ``true``       Include extracted trace IDs in HTTP metrics. Requires :ref:`OpenMetrics to be enabled<hubble_open_metrics>`.
   987  ============== ============== =============================================================================================================
   988  
   989  This metric supports :ref:`Context Options<hubble_context_options>`.
   990  
   991  ``icmp``
   992  ~~~~~~~~
   993  
   994  ================================ ======================================== ========== ===================================
   995  Name                             Labels                                   Default    Description
   996  ================================ ======================================== ========== ===================================
   997  ``icmp_total``                   ``family``, ``type``                     Disabled   Number of ICMP messages
   998  ================================ ======================================== ========== ===================================
   999  
  1000  Options
  1001  """""""
  1002  
  1003  This metric supports :ref:`Context Options<hubble_context_options>`.
  1004  
  1005  ``kafka``
  1006  ~~~~~~~~~
  1007  
  1008  =================================== ===================================================== ========== ==============================================
  1009  Name                                Labels                                                Default    Description
  1010  =================================== ===================================================== ========== ==============================================
  1011  ``kafka_requests_total``            ``topic``, ``api_key``, ``error_code``, ``reporter``  Disabled   Count of Kafka requests by topic
  1012  ``kafka_request_duration_seconds``  ``topic``, ``api_key``, ``reporter``                  Disabled   Histogram of Kafka request duration by topic
  1013  =================================== ===================================================== ========== ==============================================
  1014  
  1015  Options
  1016  """""""
  1017  
  1018  This metric supports :ref:`Context Options<hubble_context_options>`.
  1019  
  1020  ``port-distribution``
  1021  ~~~~~~~~~~~~~~~~~~~~~
  1022  
  1023  ================================ ======================================== ========== ==================================================
  1024  Name                             Labels                                   Default    Description
  1025  ================================ ======================================== ========== ==================================================
  1026  ``port_distribution_total``      ``protocol``, ``port``                   Disabled   Number of packets distributed by destination port
  1027  ================================ ======================================== ========== ==================================================
  1028  
  1029  Options
  1030  """""""
  1031  
  1032  This metric supports :ref:`Context Options<hubble_context_options>`.
  1033  
  1034  ``tcp``
  1035  ~~~~~~~
  1036  
  1037  ================================ ======================================== ========== ==================================================
  1038  Name                             Labels                                   Default    Description
  1039  ================================ ======================================== ========== ==================================================
  1040  ``tcp_flags_total``              ``flag``, ``family``                     Disabled   TCP flag occurrences
  1041  ================================ ======================================== ========== ==================================================
  1042  
  1043  Options
  1044  """""""
  1045  
  1046  This metric supports :ref:`Context Options<hubble_context_options>`.
  1047  
  1048  dynamic_exporter_exporters_total
  1049  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  1050  
  1051  This is a dynamic Hubble exporter metric.
  1052  
  1053  ==================================== ======================================== ========== ==================================================
  1054  Name                                 Labels                                   Default    Description
  1055  ==================================== ======================================== ========== ==================================================
  1056  ``dynamic_exporter_exporters_total`` ``status``                               Enabled    Number of configured Hubble exporters
  1057  ==================================== ======================================== ========== ==================================================
  1058  
  1059  Labels
  1060  """"""
  1061  
  1062  - ``status`` identifies the status of exporters, can be one of:
  1063     - ``active``
  1064     - ``inactive``
  1065  
  1066  dynamic_exporter_up
  1067  ~~~~~~~~~~~~~~~~~~~
  1068  
  1069  This is a dynamic Hubble exporter metric.
  1070  
  1071  ==================================== ======================================== ========== ==================================================
  1072  Name                                 Labels                                   Default    Description
  1073  ==================================== ======================================== ========== ==================================================
  1074  ``dynamic_exporter_up``              ``name``                                 Enabled    Status of exporter (1 - active, 0 - inactive)
  1075  ==================================== ======================================== ========== ==================================================
  1076  
  1077  Labels
  1078  """"""
  1079  
  1080  - ``name`` identifies the exporter by name
  1081  
  1082  dynamic_exporter_reconfigurations_total
  1083  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  1084  
  1085  This is a dynamic Hubble exporter metric.
  1086  
  1087  =========================================== ======================================== ========== ==================================================
  1088  Name                                        Labels                                   Default    Description
  1089  =========================================== ======================================== ========== ==================================================
  1090  ``dynamic_exporter_reconfigurations_total`` ``op``                                   Enabled    Number of dynamic exporter reconfigurations
  1091  =========================================== ======================================== ========== ==================================================
  1092  
  1093  Labels
  1094  """"""
  1095  
  1096  - ``op`` identifies the reconfiguration operation type, can be one of:
  1097     - ``add``
  1098     - ``update``
  1099     - ``remove``
  1100  
  1101  dynamic_exporter_config_hash
  1102  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  1103  
  1104  This is a dynamic Hubble exporter metric.
  1105  
  1106  ==================================== ======================================== ========== ==================================================
  1107  Name                                 Labels                                   Default    Description
  1108  ==================================== ======================================== ========== ==================================================
  1109  ``dynamic_exporter_config_hash``                                              Enabled    Hash of last applied config
  1110  ==================================== ======================================== ========== ==================================================
  1111  
  1112  dynamic_exporter_config_last_applied
  1113  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  1114  
  1115  This is a dynamic Hubble exporter metric.
  1116  
  1117  ======================================== ======================================== ========== ==================================================
  1118  Name                                     Labels                                   Default    Description
  1119  ======================================== ======================================== ========== ==================================================
  1120  ``dynamic_exporter_config_last_applied``                                          Enabled    Timestamp of last applied config
  1121  ======================================== ======================================== ========== ==================================================
  1122  
  1123  
  1124  
  1125  
  1126  .. _clustermesh_apiserver_metrics_reference:
  1127  
  1128  clustermesh-apiserver
  1129  ---------------------
  1130  
  1131  Configuration
  1132  ^^^^^^^^^^^^^
  1133  
  1134  To expose any metrics, invoke ``clustermesh-apiserver`` with the
  1135  ``--prometheus-serve-addr`` option. This option takes an ``IP:Port`` pair, but
  1136  passing an empty IP (e.g. ``:9962``) binds the server to all available
  1137  interfaces (there is usually only one interface in a container).
  1138  
  1139  Exported Metrics
  1140  ^^^^^^^^^^^^^^^^
  1141  
  1142  All metrics are exported under the ``cilium_clustermesh_apiserver_``
  1143  Prometheus namespace.
  1144  
  1145  Bootstrap
  1146  ~~~~~~~~~
  1147  
  1148  ======================================== ============================================ ========================================================
  1149  Name                                     Labels                                       Description
  1150  ======================================== ============================================ ========================================================
  1151  ``bootstrap_seconds``                    ``source_cluster``                           Duration in seconds to complete bootstrap
  1152  ======================================== ============================================ ========================================================
  1153  
  1154  KVstore
  1155  ~~~~~~~
  1156  
  1157  ======================================== ============================================ ========================================================
  1158  Name                                     Labels                                       Description
  1159  ======================================== ============================================ ========================================================
  1160  ``kvstore_operations_duration_seconds``  ``action``, ``kind``, ``outcome``, ``scope`` Duration of kvstore operation
  1161  ``kvstore_events_queue_seconds``         ``action``, ``scope``                        Seconds waited before a received event was queued
  1162  ``kvstore_quorum_errors_total``          ``error``                                    Number of quorum errors
  1163  ``kvstore_sync_errors_total``            ``scope``, ``source_cluster``                Number of times synchronization to the kvstore failed
  1164  ``kvstore_sync_queue_size``              ``scope``, ``source_cluster``                Number of elements queued for synchronization in the kvstore
  1165  ``kvstore_initial_sync_completed``       ``scope``, ``source_cluster``, ``action``    Whether the initial synchronization from/to the kvstore has completed
  1166  ======================================== ============================================ ========================================================
  1167  
  1168  API Rate Limiting
  1169  ~~~~~~~~~~~~~~~~~
  1170  
  1171  ============================================== ========================================== ========================================================
  1172  Name                                           Labels                                     Description
  1173  ============================================== ========================================== ========================================================
  1174  ``api_limiter_processed_requests_total``       ``api_call``, ``outcome``, ``return_code`` Total number of API requests processed
  1175  ``api_limiter_processing_duration_seconds``    ``api_call``, ``value``                    Mean and estimated processing duration in seconds
  1176  ``api_limiter_rate_limit``                     ``api_call``, ``value``                    Current rate limiting configuration (limit and burst)
  1177  ``api_limiter_requests_in_flight``             ``api_call``, ``value``                    Current and maximum allowed number of requests in flight
  1178  ``api_limiter_wait_duration_seconds``          ``api_call``, ``value``                    Mean, min, and max wait duration
  1179  ============================================== ========================================== ========================================================
  1180  
  1181  Controllers
  1182  ~~~~~~~~~~~
  1183  
  1184  ======================================== ================================================== ========== ========================================================
  1185  Name                                     Labels                                             Default    Description
  1186  ======================================== ================================================== ========== ========================================================
  1187  ``controllers_group_runs_total``         ``status``, ``group_name``                         Enabled    Number of times that a controller process was run, labeled by controller group name
  1188  ======================================== ================================================== ========== ========================================================
  1189  
  1190  The ``controllers_group_runs_total`` metric reports the success
  1191  and failure count of each controller within the system, labeled by
  1192  controller group name and completion status. The metric is enabled
  1193  on a per-controller basis, using an allow-list that
  1194  is passed as the ``controller-group-metrics`` configuration flag.
  1195  The current default set for ``clustermesh-apiserver`` in the
  1196  Cilium Helm chart is the special name "all", which enables the metric
  1197  for all controller groups. The special name "none" is also supported.
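
To make the allow-list semantics concrete, here is a minimal Python sketch. It
assumes that "none" disables the metric for every group and that "all" takes
precedence when both special names are present; neither detail is stated above, and
the ``group_metric_enabled`` helper is hypothetical.

.. code-block:: python

    def group_metric_enabled(group_name, allow_list):
        """Decide whether a controller group reports controllers_group_runs_total.

        ``allow_list`` stands in for the parsed value of the
        ``controller-group-metrics`` flag.
        """
        if "all" in allow_list:
            # Special name: every controller group is enabled.
            return True
        if "none" in allow_list:
            # Special name: assumed here to disable every controller group.
            return False
        # Otherwise only explicitly listed groups are enabled.
        return group_name in allow_list


    # "example-group" is a made-up controller group name.
    enabled_by_default = group_metric_enabled("example-group", ["all"])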
  1198  
  1199  .. _kvstoremesh_metrics_reference:
  1200  
  1201  kvstoremesh
  1202  -----------
  1203  
  1204  Configuration
  1205  ^^^^^^^^^^^^^
  1206  
  1207  To expose any metrics, invoke ``kvstoremesh`` with the
  1208  ``--prometheus-serve-addr`` option. This option takes an ``IP:Port`` pair, but
  1209  passing an empty IP (e.g. ``:9964``) binds the server to all available
  1210  interfaces (there is usually only one interface in a container).
  1211  
  1212  Exported Metrics
  1213  ^^^^^^^^^^^^^^^^
  1214  
  1215  All metrics are exported under the ``cilium_kvstoremesh_`` Prometheus namespace.
  1216  
  1217  Bootstrap
  1218  ~~~~~~~~~
  1219  
  1220  ======================================== ============================================ ========================================================
  1221  Name                                     Labels                                       Description
  1222  ======================================== ============================================ ========================================================
  1223  ``bootstrap_seconds``                    ``source_cluster``                           Duration in seconds to complete bootstrap
  1224  ======================================== ============================================ ========================================================
  1225  
  1226  Remote clusters
  1227  ~~~~~~~~~~~~~~~
  1228  
  1229  ==================================== ======================================= =================================================================
  1230  Name                                 Labels                                  Description
  1231  ==================================== ======================================= =================================================================
  1232  ``remote_clusters``                  ``source_cluster``                      The total number of remote clusters meshed with the local cluster
  1233  ``remote_cluster_failures``          ``source_cluster``, ``target_cluster``  The total number of failures related to the remote cluster
  1234  ``remote_cluster_last_failure_ts``   ``source_cluster``, ``target_cluster``  The timestamp of the last failure of the remote cluster
  1235  ``remote_cluster_readiness_status``  ``source_cluster``, ``target_cluster``  The readiness status of the remote cluster
  1236  ==================================== ======================================= =================================================================
  1237  
  1238  KVstore
  1239  ~~~~~~~
  1240  
  1241  ======================================== ============================================ ========================================================
  1242  Name                                     Labels                                       Description
  1243  ======================================== ============================================ ========================================================
  1244  ``kvstore_operations_duration_seconds``  ``action``, ``kind``, ``outcome``, ``scope`` Duration of kvstore operation
  1245  ``kvstore_events_queue_seconds``         ``action``, ``scope``                        Seconds waited before a received event was queued
  1246  ``kvstore_quorum_errors_total``          ``error``                                    Number of quorum errors
  1247  ``kvstore_sync_errors_total``            ``scope``, ``source_cluster``                Number of times synchronization to the kvstore failed
  1248  ``kvstore_sync_queue_size``              ``scope``, ``source_cluster``                Number of elements queued for synchronization in the kvstore
  1249  ``kvstore_initial_sync_completed``       ``scope``, ``source_cluster``, ``action``    Whether the initial synchronization from/to the kvstore has completed
  1250  ======================================== ============================================ ========================================================
  1251  
  1252  API Rate Limiting
  1253  ~~~~~~~~~~~~~~~~~
  1254  
  1255  ============================================== ========================================== ========================================================
  1256  Name                                           Labels                                     Description
  1257  ============================================== ========================================== ========================================================
  1258  ``api_limiter_processed_requests_total``       ``api_call``, ``outcome``, ``return_code`` Total number of API requests processed
  1259  ``api_limiter_processing_duration_seconds``    ``api_call``, ``value``                    Mean and estimated processing duration in seconds
  1260  ``api_limiter_rate_limit``                     ``api_call``, ``value``                    Current rate limiting configuration (limit and burst)
  1261  ``api_limiter_requests_in_flight``             ``api_call``, ``value``                    Current and maximum allowed number of requests in flight
  1262  ``api_limiter_wait_duration_seconds``          ``api_call``, ``value``                    Mean, min, and max wait duration
  1263  ============================================== ========================================== ========================================================
  1264  
  1265  Controllers
  1266  ~~~~~~~~~~~
  1267  
  1268  ======================================== ================================================== ========== ========================================================
  1269  Name                                     Labels                                             Default    Description
  1270  ======================================== ================================================== ========== ========================================================
  1271  ``controllers_group_runs_total``         ``status``, ``group_name``                         Enabled    Number of times that a controller process was run, labeled by controller group name
  1272  ======================================== ================================================== ========== ========================================================
  1273  
  1274  The ``controllers_group_runs_total`` metric reports the success
  1275  and failure count of each controller within the system, labeled by
  1276  controller group name and completion status. The metric is enabled
  1277  on a per-controller basis through an allow-list, which is passed
  1278  as the ``controller-group-metrics`` configuration flag. The current
  1279  default for ``kvstoremesh`` in the Cilium Helm chart is the special
  1280  name "all", which enables the metric for all controller groups.
  1281  The special name "none" is also supported.
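
The allow-list behavior described above can be illustrated with a short
Python sketch. This is not Cilium's implementation: the function name
``group_metric_enabled``, the example group names, and the assumption that
"none" disables the metric for every group are hypothetical and chosen only
for illustration.

.. code-block:: python

    def group_metric_enabled(group_name, allow_list):
        """Return True if the metric would be recorded for group_name.

        allow_list models the value passed as controller-group-metrics.
        """
        names = set(allow_list)
        if "none" in names:
            # Assumption: "none" switches the metric off for every group.
            return False
        if "all" in names:
            # "all" enables the metric for all controller groups.
            return True
        # Otherwise only explicitly listed groups are enabled.
        return group_name in names


    # Hypothetical group names, used only to exercise the function.
    print(group_metric_enabled("example-group-a", ["all"]))
    print(group_metric_enabled("example-group-a", ["example-group-b"]))

With an allow-list of "all", every group reports the metric. With an explicit
list, only the named groups do.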
  1282  
  1283  NAT
  1284  ~~~
  1285  
  1286  .. _nat_metrics:
  1287  
  1288  ======================================== ================================================== ========== ========================================================
  1289  Name                                     Labels                                             Default    Description
  1290  ======================================== ================================================== ========== ========================================================
  1291  ``nat_endpoint_max_connection``          ``family``                                         Enabled    Saturation of the most saturated distinct NAT mapped connection, in terms of egress-IP and remote endpoint address.
  1292  ======================================== ================================================== ========== ========================================================
  1293  
  1294  These metrics are for monitoring Cilium's NAT mapping functionality. NAT is used by features such as Egress Gateway and BPF masquerading.
  1295  
  1296  The NAT map holds mappings for masqueraded connections. Connections held in the NAT table that are masqueraded with the
  1297  same egress-IP and are going to the same remote endpoint IP and port each require a unique source port for the mapping.
  1298  This means that the number of connections a Node can masquerade to a distinct external endpoint is limited by the available ephemeral source ports.
  1299  
  1300  Given a Node forwarding one or more such egress-IP and remote endpoint tuples, the ``nat_endpoint_max_connection`` metric reports how saturated the most saturated such tuple is, as a percentage of the possible source ports.
  1301  This metric is especially useful when using the egress gateway feature, where it's possible to overload a Node if many connections are all going to the same endpoint.
  1302  Under normal conditions, this metric should be fairly low.
  1303  A high number here may indicate that a Node is reaching its limit for connections to one or more external endpoints.
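
The saturation calculation described above can be sketched in Python. This is
an illustration only, not Cilium's implementation: the ``max_saturation``
function, the input format, and the default source port range of 1024 to
65535 are assumptions made for the example.

.. code-block:: python

    from collections import Counter


    def max_saturation(connections, port_min=1024, port_max=65535):
        """Return the saturation, in percent, of the most saturated tuple.

        connections is an iterable of (egress_ip, remote_ip, remote_port)
        tuples, one entry per masqueraded connection. Every connection
        sharing a tuple needs its own source port.
        """
        # Assumed number of usable source ports per tuple.
        available_ports = port_max - port_min + 1
        per_tuple = Counter(connections)
        if not per_tuple:
            return 0.0
        return 100.0 * max(per_tuple.values()) / available_ports


    # Three connections share one tuple, one connection uses another.
    conns = [("192.0.2.1", "203.0.113.10", 443)] * 3
    conns.append(("192.0.2.1", "203.0.113.20", 443))

    # With a deliberately tiny range of 10 ports, the busiest tuple
    # uses 3 of 10 ports.
    print(max_saturation(conns, port_min=1, port_max=10))  # prints 30.0

Only the busiest tuple determines the result, which is why a single popular
external endpoint can push the value up even when the Node's total connection
count is modest.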
  1304