<!--
SPDX-FileCopyrightText: 2023 The Crossplane Authors <https://crossplane.io>

SPDX-License-Identifier: CC-BY-4.0
-->
# Monitoring the Upjet runtime

The [Kubernetes controller-runtime] library provides a Prometheus metrics
endpoint by default. Upjet-based providers, including
[upbound/provider-aws], [upbound/provider-azure], [upbound/provider-azuread] and
[upbound/provider-gcp], expose
[various metrics](https://book.kubebuilder.io/reference/metrics-reference.html)
from the controller-runtime to help monitor the health of the runtime
components, such as the [`controller-runtime` client], the [leader election
client] and the [controller workqueues]. In addition, each controller
[exposes](https://github.com/kubernetes-sigs/controller-runtime/blob/60af59f5b22335516850ca11c974c8f614d5d073/pkg/internal/controller/metrics/metrics.go#L25)
metrics related to the reconciliation of the custom resources and the active
reconciliation worker goroutines.

Beyond the metrics exposed by the `controller-runtime`, Upjet-based providers
also expose metrics specific to the Upjet runtime. The Upjet runtime registers
these custom metrics using the
[available extension mechanism](https://book.kubebuilder.io/reference/metrics.html#publishing-additional-metrics),
and they are available from the default `/metrics` endpoint of the provider pod.
The custom metrics exposed by the Upjet runtime are:

- `upjet_terraform_cli_duration`: A histogram metric that reports statistics,
  in seconds, on how long it takes a Terraform CLI invocation to complete.
- `upjet_terraform_active_cli_invocations`: A gauge metric that reports the
  number of active (running) Terraform CLI invocations.
- `upjet_terraform_running_processes`: A gauge metric that reports the
  number of running Terraform CLI and Terraform provider processes.
- `upjet_resource_ttr`: A histogram metric that measures, in seconds,
  the time-to-readiness for managed resources.

Prometheus metrics can have [labels] associated with them to differentiate the
characteristics of the measurements being made, such as distinguishing the CLI
processes from the Terraform provider processes when counting the number of
active Terraform processes running. The labels associated with each of the
above custom Upjet metrics are:

- Labels associated with the `upjet_terraform_cli_duration` metric:
  - `subcommand`: The `terraform` subcommand that's run, e.g., `init`, `apply`,
    `plan`, `destroy`, etc.
  - `mode`: The execution mode of the Terraform CLI, either `sync` (the CLI was
    invoked synchronously as part of a reconcile loop) or `async` (the CLI was
    invoked asynchronously, and the reconciler goroutine polls for and collects
    the results later).
- Labels associated with the `upjet_terraform_active_cli_invocations` metric:
  - `subcommand`: The `terraform` subcommand that's run, e.g., `init`, `apply`,
    `plan`, `destroy`, etc.
  - `mode`: The execution mode of the Terraform CLI, either `sync` (the CLI was
    invoked synchronously as part of a reconcile loop) or `async` (the CLI was
    invoked asynchronously, and the reconciler goroutine polls for and collects
    the results later).
- Labels associated with the `upjet_terraform_running_processes` metric:
  - `type`: Either `cli` for Terraform CLI processes (the `terraform` process)
    or `provider` for Terraform provider processes. Please note that this is a
    best-effort metric that may not precisely catch and report all relevant
    processes. We may improve this in the future if needed, for example by
    watching the `fork` system calls. Even so, it may already prove useful for
    watching for rogue Terraform provider processes.
- Labels associated with the `upjet_resource_ttr` metric:
  - `group`, `version`, `kind`: These labels record the
    [API group, version and kind](https://kubernetes.io/docs/reference/using-api/api-concepts/)
    of the managed resource whose
    [time-to-readiness](https://github.com/crossplane/terrajet/issues/55#issuecomment-929494212)
    measurement is captured.

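As a rough illustration of how these labels show up in a scrape, the sketch
below parses a few lines in the Prometheus text exposition format and keeps
only the `upjet_`-prefixed series. The sample values are invented for the
example and are not real provider output, `parseLine` and `upjetSeries` are
hypothetical helper names, and the parser is deliberately naive: it assumes no
timestamps and no commas or escaped quotes inside label values.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sample is a hand-written, hypothetical excerpt of a /metrics response.
// The values are invented for illustration.
const sample = `# TYPE upjet_terraform_running_processes gauge
upjet_terraform_running_processes{type="cli"} 2
upjet_terraform_running_processes{type="provider"} 5
# TYPE upjet_terraform_active_cli_invocations gauge
upjet_terraform_active_cli_invocations{mode="async",subcommand="apply"} 3
upjet_terraform_active_cli_invocations{mode="sync",subcommand="plan"} 1
controller_runtime_active_workers{controller="example"} 1
`

// series is one parsed sample line: metric name, labels and value.
type series struct {
	Name   string
	Labels map[string]string
	Value  float64
}

// parseLine parses a single exposition line. It reports false for blank
// lines, comments (# HELP / # TYPE) and lines it cannot understand.
func parseLine(line string) (series, bool) {
	line = strings.TrimSpace(line)
	if line == "" || strings.HasPrefix(line, "#") {
		return series{}, false
	}
	// The value is assumed to be the last space-separated field.
	sp := strings.LastIndexByte(line, ' ')
	if sp < 0 {
		return series{}, false
	}
	value, err := strconv.ParseFloat(line[sp+1:], 64)
	if err != nil {
		return series{}, false
	}
	head := strings.TrimSpace(line[:sp])
	s := series{Name: head, Labels: map[string]string{}, Value: value}
	if open := strings.IndexByte(head, '{'); open >= 0 && strings.HasSuffix(head, "}") {
		s.Name = head[:open]
		// Naive split: breaks if a label value itself contains a comma.
		for _, pair := range strings.Split(head[open+1:len(head)-1], ",") {
			if k, v, ok := strings.Cut(pair, "="); ok {
				s.Labels[strings.TrimSpace(k)] = strings.Trim(strings.TrimSpace(v), `"`)
			}
		}
	}
	return s, true
}

// upjetSeries returns the series in text whose names start with "upjet_".
func upjetSeries(text string) []series {
	var out []series
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		if s, ok := parseLine(sc.Text()); ok && strings.HasPrefix(s.Name, "upjet_") {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	for _, s := range upjetSeries(sample) {
		fmt.Printf("%s %v = %g\n", s.Name, s.Labels, s.Value)
	}
}
```

For anything beyond a quick inspection, a full exposition-format parser is a
safer choice than this string splitting.
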
## Examples

You can [export](https://book.kubebuilder.io/reference/metrics.html) all these
custom metrics and the `controller-runtime` metrics from the provider pod for
Prometheus. Here are some examples showing the custom metrics in action from the
Prometheus console:

- `upjet_terraform_active_cli_invocations` gauge metric showing the sync & async
  `terraform init/apply/plan/destroy` invocations: <img width="3000" alt="image"
  src="https://user-images.githubusercontent.com/9376684/223296539-94e7d634-58b0-4d3f-942e-8b857eb92ef7.png">

- `upjet_terraform_running_processes` gauge metric showing both `cli` and
  `provider` labels: <img width="2999" alt="image"
  src="https://user-images.githubusercontent.com/9376684/223297575-18c2232e-b5af-4cc1-916a-d61fe5dfb527.png">

- `upjet_terraform_cli_duration` histogram metric, showing average Terraform CLI
  running times for the last 5m: <img width="2993" alt="image"
  src="https://user-images.githubusercontent.com/9376684/223299401-8f128b74-8d9c-4c82-86c5-26870385bee7.png">

- The medians (0.5-quantiles) for these observations aggregated by the mode and
  Terraform subcommand being invoked: <img width="2999" alt="image"
  src="https://user-images.githubusercontent.com/9376684/223300766-c1adebb9-bd19-4a38-9941-116185d8d39f.png">

- `upjet_resource_ttr` histogram metric, showing average resource TTR for the
  last 10m: <img width="2999" alt="image"
  src="https://user-images.githubusercontent.com/9376684/223309711-edef690e-2a59-419b-bb93-8f837496bec8.png">

- The median (0.5-quantile) for these TTR observations: <img width="3002"
  alt="image"
  src="https://user-images.githubusercontent.com/9376684/223309727-d1a0f4e2-1ed2-414b-be67-478a0575ee49.png">

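The averages and medians charted above are derived from the `_sum`, `_count`
and `_bucket` series that every Prometheus histogram exposes. The sketch below
works through that arithmetic on made-up bucket counts: the mean is the sum
divided by the count, and the quantile is estimated by linear interpolation
inside the bucket that contains the target rank, which is the approach
Prometheus documents for `histogram_quantile`. The bucket boundaries, counts
and helper names are assumptions for illustration and do not reflect Upjet's
actual bucket layout. In PromQL, the same calculation is normally applied to
`rate()` values over a time window rather than to raw counters.

```go
package main

import (
	"fmt"
	"math"
)

// bucket mirrors one "_bucket" series of a Prometheus histogram.
type bucket struct {
	UpperBound float64 // value of the "le" label
	Count      float64 // cumulative number of observations <= UpperBound
}

// exampleBuckets holds hypothetical cumulative counts, sorted by upper
// bound and ending with the +Inf bucket.
var exampleBuckets = []bucket{
	{UpperBound: 1, Count: 2},
	{UpperBound: 5, Count: 6},
	{UpperBound: 10, Count: 9},
	{UpperBound: math.Inf(1), Count: 10},
}

// average returns the mean observation given the "_sum" and "_count" values.
func average(sum, count float64) float64 {
	if count == 0 {
		return math.NaN()
	}
	return sum / count
}

// quantile estimates the q-quantile (0 <= q <= 1) from cumulative buckets.
// It assumes positive bucket bounds, so the first bucket starts at 0.
func quantile(q float64, buckets []bucket) float64 {
	if len(buckets) < 2 || q < 0 || q > 1 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].Count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	// Find the first bucket whose cumulative count reaches the rank.
	i := 0
	for i < len(buckets)-1 && buckets[i].Count < rank {
		i++
	}
	// A rank in the +Inf bucket is reported as the highest finite bound.
	if i == len(buckets)-1 {
		return buckets[len(buckets)-2].UpperBound
	}
	lower, below := 0.0, 0.0
	if i > 0 {
		lower, below = buckets[i-1].UpperBound, buckets[i-1].Count
	}
	upper, count := buckets[i].UpperBound, buckets[i].Count
	if count == below {
		return upper
	}
	// Assume observations are spread evenly within the bucket.
	return lower + (upper-lower)*(rank-below)/(count-below)
}

func main() {
	const exampleSum = 47.0 // hypothetical sum of all observations, in seconds
	total := exampleBuckets[len(exampleBuckets)-1].Count
	fmt.Printf("average: %.2fs\n", average(exampleSum, total))
	fmt.Printf("median:  %.2fs\n", quantile(0.5, exampleBuckets))
}
```

With these made-up numbers the program reports an average of 4.70s and a
median of 4.00s. The median is only as precise as the bucket boundaries allow,
so it can differ from the true median of the raw observations.
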
The samples in the screenshots above were collected by provisioning 10
[upbound/provider-aws] `cognitoidp.UserPool` resources while running the
provider with a poll interval of 1m. In these examples, one can observe that
the resources were polled (reconciled) twice after they acquired the
`Ready=True` condition and were then destroyed.

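The exact queries behind the screenshots are not reproduced in this document.
As a hedged starting point for charting similar graphs, the sketch below builds
PromQL expressions of the usual form for histogram averages and quantiles and
turns them into instant-query URLs for the Prometheus HTTP API
(`/api/v1/query`). The server address, the time windows chosen for the medians
and the helper names `avgQuery`, `quantileQuery` and `queryURL` are assumptions
for illustration.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// avgQuery builds a PromQL expression for the mean observation of a
// histogram metric over the given window, e.g. "5m".
func avgQuery(metric, window string) string {
	return fmt.Sprintf("rate(%[1]s_sum[%[2]s]) / rate(%[1]s_count[%[2]s])", metric, window)
}

// quantileQuery builds a PromQL expression for the q-quantile of a histogram
// metric, aggregated by the given labels ("le" is always included).
func quantileQuery(q float64, metric, window string, by ...string) string {
	groups := strings.Join(append([]string{"le"}, by...), ", ")
	return fmt.Sprintf("histogram_quantile(%g, sum by (%s) (rate(%s_bucket[%s])))",
		q, groups, metric, window)
}

// queryURL returns the instant-query URL for expr on the Prometheus server
// reachable at base.
func queryURL(base, expr string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	u.Path = strings.TrimSuffix(u.Path, "/") + "/api/v1/query"
	u.RawQuery = url.Values{"query": {expr}}.Encode()
	return u.String(), nil
}

func main() {
	const prom = "http://localhost:9090" // assumed address of a Prometheus server
	for _, expr := range []string{
		avgQuery("upjet_terraform_cli_duration", "5m"),
		quantileQuery(0.5, "upjet_terraform_cli_duration", "5m", "mode", "subcommand"),
		avgQuery("upjet_resource_ttr", "10m"),
		quantileQuery(0.5, "upjet_resource_ttr", "10m"),
	} {
		u, err := queryURL(prom, expr)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(u)
	}
}
```

The same expressions can be pasted directly into the Prometheus console. The
gauge metrics need no such wrapping and can be queried by name.
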
## Reference

A full reference of the metrics exposed by Upjet-based providers is available
in [provider_metrics_help.txt](provider_metrics_help.txt).

[Kubernetes controller-runtime]: https://github.com/kubernetes-sigs/controller-runtime
[upbound/provider-aws]: https://github.com/upbound/provider-aws
[upbound/provider-azure]: https://github.com/upbound/provider-azure
[upbound/provider-azuread]: https://github.com/upbound/provider-azuread
[upbound/provider-gcp]: https://github.com/upbound/provider-gcp
[`controller-runtime` client]: https://github.com/kubernetes-sigs/controller-runtime/blob/60af59f5b22335516850ca11c974c8f614d5d073/pkg/metrics/client_go_adapter.go#L40
[leader election client]: https://github.com/kubernetes-sigs/controller-runtime/blob/60af59f5b22335516850ca11c974c8f614d5d073/pkg/metrics/leaderelection.go#L12
[controller workqueues]: https://github.com/kubernetes-sigs/controller-runtime/blob/60af59f5b22335516850ca11c974c8f614d5d073/pkg/metrics/workqueue.go#L40
[labels]: https://prometheus.io/docs/practices/naming/#labels