.. only:: not (epub or latex or html)

    WARNING: You are looking at unreleased Cilium documentation.
    Please use the official rendered version released here:
    http://docs.cilium.io

.. _metrics:

********************
Monitoring & Metrics
********************

``cilium-agent`` and ``cilium-operator`` can be configured to serve `Prometheus
<https://prometheus.io>`_ metrics. Prometheus is a pluggable metrics collection
and storage system and can act as a data source for `Grafana
<https://grafana.com/>`_, a metrics visualization frontend. Unlike push-based
collectors such as statsd, Prometheus pulls (scrapes) metrics from each source.

To run Cilium with Prometheus metrics enabled, deploy it with the
``global.prometheus.enabled=true`` Helm value set.

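If you keep Helm settings in a values file instead of passing ``--set`` on the
command line, the same setting is written as nested YAML. The fragment below is
a minimal sketch; the file name (for example ``values.yaml``) is up to you.

.. code-block:: yaml

    # Values-file equivalent of: --set global.prometheus.enabled=true
    global:
      prometheus:
        enabled: true

Pass the file to Helm with ``-f`` (``--values``).
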
All metrics are exported under the ``cilium`` Prometheus namespace, so every
metric name carries the ``cilium_`` prefix (for example,
``cilium_endpoint_count``). When running and collecting in Kubernetes, metrics
are tagged with the pod name and namespace.

Installation
============

When deployed with the Helm value ``global.prometheus.enabled=true``, all Cilium
components carry annotations that signal to Prometheus whether, and on which
port, to scrape their metrics, for example:

.. code-block:: yaml

        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"

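These annotations are a convention that Prometheus does not act on by itself: a
scrape job has to be configured to honor them. The sketch below shows one common
way to do that with Kubernetes pod service discovery and relabeling. The job
name ``kubernetes-pods`` is illustrative, and the configuration shipped in the
example deployment may differ.

.. code-block:: yaml

    # prometheus.yml fragment (illustrative job name)
    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # Keep only pods annotated with prometheus.io/scrape: "true".
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: "true"
          # Scrape the port given in prometheus.io/port instead of the discovered one.
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          # Attach the pod's namespace and name to every scraped series.
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
          - source_labels: [__meta_kubernetes_pod_name]
            target_label: pod

Prometheus joins multiple ``source_labels`` values with ``;`` before matching
``regex``, which is why the address rewrite expects ``host[:port];port``.
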
Example Prometheus & Grafana Deployment
---------------------------------------

If you don't have an existing Prometheus and Grafana stack running, you can
deploy a stack with:

.. parsed-literal::

    kubectl apply -f \ |SCM_WEB|\/examples/kubernetes/addons/prometheus/monitoring-example.yaml

This runs Prometheus and Grafana in the ``cilium-monitoring`` namespace. To
access Grafana from your browser, forward its service port to your local
machine:

.. code:: bash

    kubectl -n cilium-monitoring port-forward service/grafana 3000:3000

Then open ``http://localhost:3000/`` in your browser.

cilium-agent
============

To expose any metrics, invoke ``cilium-agent`` with the
``--prometheus-serve-addr`` option. This option takes an ``IP:Port`` pair, but
passing an empty IP (e.g. ``:9090``) binds the server to all available
interfaces (there is usually only one in a container).

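When the agent runs as a DaemonSet, the option is passed as a container
argument. The excerpt below is a hypothetical pod template, not the manifest
generated by the Helm chart: only the metrics-related fields are shown, the
container and port names are illustrative, and a real template also needs the
agent's image and its other arguments. It listens on ``:9090`` to match the
``prometheus.io/port`` annotation above.

.. code-block:: yaml

    # Excerpt of a DaemonSet pod template (spec.template.spec); names are illustrative.
    containers:
      - name: cilium-agent
        args:
          # Empty IP: listen on all available interfaces, port 9090.
          - --prometheus-serve-addr=:9090
        ports:
          - name: prometheus
            containerPort: 9090
            protocol: TCP
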
An example Prometheus deployment that scrapes these metrics can be found in
:git-tree:`examples/kubernetes/addons/prometheus/monitoring-example.yaml`.

Exported Metrics
----------------

Endpoint
~~~~~~~~

============================================ ================================================== ========================================================
Name                                         Labels                                             Description
============================================ ================================================== ========================================================
``endpoint_count``                                                                              Number of endpoints managed by this agent
``endpoint_regenerations``                   ``outcome``                                        Count of all endpoint regenerations that have completed
``endpoint_regeneration_time_stats_seconds`` ``scope``                                          Endpoint regeneration time stats
``endpoint_state``                           ``state``                                          Count of all endpoints, labeled by endpoint state
============================================ ================================================== ========================================================

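As an illustration of how these series can be consumed, the Prometheus
recording rules below aggregate endpoint metrics across all agents. This is a
sketch: the group and rule names are hypothetical, and it assumes
``endpoint_state`` is a gauge and ``endpoint_regenerations`` a counter, as
their descriptions suggest.

.. code-block:: yaml

    # Prometheus rule file (group and rule names are hypothetical).
    groups:
      - name: cilium-endpoints
        rules:
          # Endpoints in each state, summed over all agents (assumed gauge: no rate()).
          - record: cilium:endpoint_state:sum
            expr: sum by (state) (cilium_endpoint_state)
          # Completed regenerations per second by outcome (assumed counter: rate()).
          - record: cilium:endpoint_regenerations:rate5m
            expr: sum by (outcome) (rate(cilium_endpoint_regenerations[5m]))

Load rule files through the ``rule_files`` section of the Prometheus
configuration.
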
Services
~~~~~~~~

========================================== ================================================== ========================================================
Name                                       Labels                                             Description
========================================== ================================================== ========================================================
``services_events_total``                                                                     Number of services events labeled by action type
========================================== ================================================== ========================================================

Datapath
~~~~~~~~

============================================= ================================================== ========================================================
Name                                          Labels                                             Description
============================================= ================================================== ========================================================
``datapath_errors_total``                     ``area``, ``name``, ``family``                     Total number of errors that occurred in datapath management
``datapath_conntrack_gc_runs_total``          ``status``                                         Number of times that the conntrack garbage collector process was run
``datapath_conntrack_gc_key_fallbacks_total``                                                    Number of times a key fallback was needed when iterating over the BPF map
``datapath_conntrack_gc_entries``             ``family``                                         The number of alive and deleted conntrack entries at the end of a garbage collector run
``datapath_conntrack_gc_duration_seconds``    ``status``                                         Duration in seconds of the garbage collector process
============================================= ================================================== ========================================================

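A minimal alerting sketch for the datapath counters follows. The group, alert,
and rule names and the ``severity`` label are placeholders to adapt, not
recommended values.

.. code-block:: yaml

    groups:
      - name: cilium-datapath
        rules:
          # Any datapath management error in the last 10 minutes, per area/name/family.
          - alert: CiliumDatapathErrors
            expr: sum by (area, name, family) (increase(cilium_datapath_errors_total[10m])) > 0
            labels:
              severity: warning
          # Conntrack garbage collector runs per second, split by status.
          - record: cilium:datapath_conntrack_gc_runs:rate5m
            expr: sum by (status) (rate(cilium_datapath_conntrack_gc_runs_total[5m]))
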
BPF
~~~

========================================== ================================================== ========================================================
Name                                       Labels                                             Description
========================================== ================================================== ========================================================
``bpf_syscall_duration_seconds``           ``operation``, ``outcome``                         Duration of BPF system calls performed
``bpf_map_ops_total``                      ``mapName``, ``operation``, ``outcome``            Number of BPF map operations performed
========================================== ================================================== ========================================================

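To see which BPF maps are busiest, or which are returning errors, the
map-operation counter can be turned into a per-map rate. The recording rule
below is a sketch with a hypothetical name; it groups by all three labels from
the table rather than assuming particular ``outcome`` values.

.. code-block:: yaml

    groups:
      - name: cilium-bpf
        rules:
          # BPF map operations per second for each map, operation, and outcome.
          - record: cilium:bpf_map_ops:rate5m
            expr: sum by (mapName, operation, outcome) (rate(cilium_bpf_map_ops_total[5m]))

A query such as ``topk(5, cilium:bpf_map_ops:rate5m)`` then lists the five
busiest combinations.
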
Drops/Forwards (L3/L4)
~~~~~~~~~~~~~~~~~~~~~~

========================================== ================================================== ========================================================
Name                                       Labels                                             Description
========================================== ================================================== ========================================================
``drop_count_total``                       ``reason``, ``direction``                          Total dropped packets
``drop_bytes_total``                       ``reason``, ``direction``                          Total dropped bytes
``forward_count_total``                    ``direction``                                      Total forwarded packets
``forward_bytes_total``                    ``direction``                                      Total forwarded bytes
========================================== ================================================== ========================================================

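Drop counters are most useful as rates broken down by ``reason``. The rules
below are a sketch: the record and alert names are hypothetical, and the
threshold of 100 dropped packets per second is an arbitrary placeholder, since
a normal drop rate depends heavily on the cluster and its policies.

.. code-block:: yaml

    groups:
      - name: cilium-drops-forwards
        rules:
          # Dropped packets per second by drop reason and direction.
          - record: cilium:drop_count:rate5m
            expr: sum by (reason, direction) (rate(cilium_drop_count_total[5m]))
          # Forwarded packets per second by direction, for comparison.
          - record: cilium:forward_count:rate5m
            expr: sum by (direction) (rate(cilium_forward_count_total[5m]))
          # Placeholder threshold, summed over all agents: tune for your environment.
          - alert: CiliumHighDropRate
            expr: sum(rate(cilium_drop_count_total[5m])) > 100
            for: 10m
            labels:
              severity: warning
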
Policy
~~~~~~

========================================== ================================================== ========================================================
Name                                       Labels                                             Description
========================================== ================================================== ========================================================
``policy_count``                                                                              Number of policies currently loaded
``policy_regeneration_total``                                                                 Total number of policies regenerated successfully
``policy_regeneration_time_stats_seconds`` ``scope``                                          Policy regeneration time stats labeled by the scope
``policy_max_revision``                                                                       Highest policy revision number in the agent
``policy_import_errors``                                                                      Number of times a policy import has failed
``policy_endpoint_enforcement_status``                                                        Number of endpoints labeled by policy enforcement status
========================================== ================================================== ========================================================

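Failed policy imports are a natural candidate for alerting. The sketch below
assumes ``policy_import_errors`` is a monotonically increasing counter, as its
description implies, so that ``increase()`` applies; the group and alert names
and the label are hypothetical.

.. code-block:: yaml

    groups:
      - name: cilium-policy
        rules:
          # Any failed policy import in the last 10 minutes, evaluated per scraped agent.
          - alert: CiliumPolicyImportErrors
            expr: increase(cilium_policy_import_errors[10m]) > 0
            labels:
              severity: warning
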
Policy L7 (HTTP/Kafka)
~~~~~~~~~~~~~~~~~~~~~~

======================================== ================================================== ========================================================
Name                                     Labels                                             Description
======================================== ================================================== ========================================================
``proxy_redirects``                      ``protocol``                                       Number of redirects installed for endpoints
``proxy_upstream_reply_seconds``                                                            Seconds waited for upstream server to reply to a request
``policy_l7_total``                      ``type``                                           Number of total L7 requests/responses
======================================== ================================================== ========================================================

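For L7 visibility, the request/response counter can be turned into a
per-``type`` rate. This is a sketch with hypothetical rule names; the second
rule assumes ``proxy_redirects`` is a gauge (a current count of installed
redirects), which its description suggests but this page does not state.

.. code-block:: yaml

    groups:
      - name: cilium-l7
        rules:
          # L7 requests/responses per second, split by the type label.
          - record: cilium:policy_l7:rate5m
            expr: sum by (type) (rate(cilium_policy_l7_total[5m]))
          # Installed proxy redirects per protocol (assumed gauge: no rate()).
          - record: cilium:proxy_redirects:sum
            expr: sum by (protocol) (cilium_proxy_redirects)
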
Identity
~~~~~~~~

======================================== ================================================== ========================================================
Name                                     Labels                                             Description
======================================== ================================================== ========================================================
``identity_count``                                                                          Number of identities currently allocated
======================================== ================================================== ========================================================

Events external to Cilium
~~~~~~~~~~~~~~~~~~~~~~~~~

======================================== ================================================== ========================================================
Name                                     Labels                                             Description
======================================== ================================================== ========================================================
``event_ts``                             ``source``                                         Last timestamp when we received an event
======================================== ================================================== ========================================================

Controllers
~~~~~~~~~~~

======================================== ================================================== ========================================================
Name                                     Labels                                             Description
======================================== ================================================== ========================================================
``controllers_runs_total``               ``status``                                         Number of times that a controller process was run
``controllers_runs_duration_seconds``    ``status``                                         Duration in seconds of the controller process
======================================== ================================================== ========================================================

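Controller runs can be tracked the same way. The recording rule below is a
hypothetical example; because this page does not list the values of the
``status`` label, it keeps every status as a separate series instead of
filtering for failures.

.. code-block:: yaml

    groups:
      - name: cilium-controllers
        rules:
          # Controller runs per second, one series per status value.
          - record: cilium:controllers_runs:rate5m
            expr: sum by (status) (rate(cilium_controllers_runs_total[5m]))
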
SubProcess
~~~~~~~~~~

======================================== ================================================== ========================================================
Name                                     Labels                                             Description
======================================== ================================================== ========================================================
``subprocess_start_total``               ``subsystem``                                      Number of times that Cilium has started a subprocess
======================================== ================================================== ========================================================

Kubernetes
~~~~~~~~~~

======================================== ================================================== ========================================================
Name                                     Labels                                             Description
======================================== ================================================== ========================================================
``kubernetes_events_received_total``     ``scope``, ``action``, ``validity``, ``equality``  Number of Kubernetes events received
``kubernetes_events_total``              ``scope``, ``action``, ``outcome``                 Number of Kubernetes events processed
``k8s_cnp_status_completion_seconds``    ``attempts``, ``outcome``                          Duration in seconds of how long it took to complete a CNP status update
======================================== ================================================== ========================================================

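Comparing events received with events processed may give a rough view of how
well the agent keeps up with the Kubernetes API. The rules below are a sketch
with hypothetical names; they aggregate only by labels listed in the table.

.. code-block:: yaml

    groups:
      - name: cilium-kubernetes
        rules:
          # Kubernetes events received per second.
          - record: cilium:kubernetes_events_received:rate5m
            expr: sum by (scope, action) (rate(cilium_kubernetes_events_received_total[5m]))
          # Kubernetes events processed per second, split by outcome.
          - record: cilium:kubernetes_events:rate5m
            expr: sum by (scope, action, outcome) (rate(cilium_kubernetes_events_total[5m]))
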
IPAM
~~~~

======================================== ============================================ ========================================================
Name                                     Labels                                       Description
======================================== ============================================ ========================================================
``ipam_events_total``                                                                 Number of IPAM events received labeled by action and datapath family type
======================================== ============================================ ========================================================

KVstore
~~~~~~~

======================================== ============================================ ========================================================
Name                                     Labels                                       Description
======================================== ============================================ ========================================================
``kvstore_operations_duration_seconds``  ``action``, ``kind``, ``outcome``, ``scope`` Duration of kvstore operations
``kvstore_events_queue_seconds``         ``action``, ``scope``                        Duration in seconds of time a received event was blocked before it could be queued
======================================== ============================================ ========================================================

Agent
~~~~~

================================ ================================ ========================================================
Name                             Labels                           Description
================================ ================================ ========================================================
``agent_bootstrap_seconds``      ``scope``, ``outcome``           Duration of various bootstrap phases
``api_process_time_seconds``                                      Processing time of all the API calls made to the cilium-agent, labeled by API method, API path and returned HTTP code.
================================ ================================ ========================================================

FQDN
~~~~

================================ ================================ ========================================================
Name                             Labels                           Description
================================ ================================ ========================================================
``fqdn_gc_deletions_total``                                       Number of FQDNs that have been cleaned up by the FQDN garbage collector job
================================ ================================ ========================================================

cilium-operator
===============

``cilium-operator`` can be configured to serve metrics by running with the
option ``--enable-metrics``. By default, the operator exposes metrics on port
6942; the port can be changed with the option ``--metrics-address``.

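As with the agent, these options are typically passed as container arguments.
The excerpt below is a hypothetical pod template for the operator Deployment
with illustrative names. It assumes ``--metrics-address`` accepts a
``host:port`` value such as ``:6942`` (check ``cilium-operator --help`` for
your version), and it adds ``prometheus.io`` annotations so that an
annotation-based scrape job also picks up the operator.

.. code-block:: yaml

    # Excerpt of a Deployment pod template (spec.template); names are illustrative.
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "6942"
    spec:
      containers:
        - name: cilium-operator
          args:
            - --enable-metrics
            # Assumed host:port form, matching the default port 6942.
            - --metrics-address=:6942
          ports:
            - name: prometheus
              containerPort: 6942
              protocol: TCP
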
Exported Metrics
----------------

All metrics are exported under the ``cilium_operator_`` Prometheus namespace.

ENI
~~~

================================ ================================ ========================================================
Name                             Labels                           Description
================================ ================================ ========================================================
``eni_ips``                      ``type``                         Number of IPs allocated
``eni_allocation_ops``           ``subnetId``                     Number of IP allocation operations
``eni_interface_creation_ops``   ``subnetId``, ``status``         Number of ENIs allocated
``eni_available``                                                 Number of ENIs with addresses available
``eni_nodes_at_capacity``                                         Number of nodes unable to allocate more addresses
``eni_aws_api_duration_seconds`` ``operation``, ``responseCode``  Duration of interactions with AWS API
``eni_resync_total``                                              Number of synchronization operations to synchronize AWS EC2 metadata
``eni_ec2_rate_limit``           ``operation``                    Number of times the EC2 client rate limiter kicked in
================================ ================================ ========================================================
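
On AWS, nodes that can no longer obtain addresses are worth alerting on. The
rule file below is a sketch: the names, thresholds, and ``for`` duration are
placeholders, and it assumes ``eni_nodes_at_capacity`` is a gauge and
``eni_ec2_rate_limit`` a counter, as their descriptions suggest. Note the
``cilium_operator_`` prefix on the full metric names.

.. code-block:: yaml

    groups:
      - name: cilium-operator-eni
        rules:
          # The at-capacity gauge has stayed above zero for 15 minutes.
          - alert: CiliumENINodesAtCapacity
            expr: cilium_operator_eni_nodes_at_capacity > 0
            for: 15m
            labels:
              severity: warning
          # How often the EC2 client rate limiter kicks in, per API operation.
          - record: cilium_operator:eni_ec2_rate_limit:rate5m
            expr: sum by (operation) (rate(cilium_operator_eni_ec2_rate_limit[5m]))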