<!--
SPDX-FileCopyrightText: 2023 The Crossplane Authors <https://crossplane.io>

SPDX-License-Identifier: CC-BY-4.0
-->
# Monitoring the Upjet runtime

The [Kubernetes controller-runtime] library provides a Prometheus metrics
endpoint by default. The Upjet-based providers, including
[upbound/provider-aws], [upbound/provider-azure], [upbound/provider-azuread] and
[upbound/provider-gcp], expose
[various metrics](https://book.kubebuilder.io/reference/metrics-reference.html)
from the controller-runtime to help monitor the health of the various runtime
components, such as the [`controller-runtime` client], the [leader election
client], the [controller workqueues], etc. In addition to these metrics, each
controller also
[exposes](https://github.com/kubernetes-sigs/controller-runtime/blob/60af59f5b22335516850ca11c974c8f614d5d073/pkg/internal/controller/metrics/metrics.go#L25)
various metrics related to the reconciliation of the custom resources and active
reconciliation worker goroutines.

In addition to the metrics exposed by the `controller-runtime`, the Upjet-based
providers also expose metrics specific to the Upjet runtime. The Upjet
runtime registers some custom metrics using the
[available extension mechanism](https://book.kubebuilder.io/reference/metrics.html#publishing-additional-metrics),
and these metrics are available from the default `/metrics` endpoint of the
provider pod. Here are the custom metrics exposed from the Upjet runtime:

- `upjet_terraform_cli_duration`: This is a histogram metric and reports
  statistics, in seconds, on how long it takes a Terraform CLI invocation to
  complete.
- `upjet_terraform_active_cli_invocations`: This is a gauge metric and it's the
  number of active (running) Terraform CLI invocations.
- `upjet_terraform_running_processes`: This is a gauge metric and it's the
  number of running Terraform CLI and Terraform provider processes.
- `upjet_resource_ttr`: This is a histogram metric and it measures, in seconds,
  the time-to-readiness for managed resources.

Prometheus metrics can have [labels] associated with them to differentiate the
characteristics of the measurements being made, such as differentiating between
the CLI processes and the Terraform provider processes when counting the number
of active Terraform processes running. Here is a list of labels associated with
each of the above custom Upjet metrics:

- Labels associated with the `upjet_terraform_cli_duration` metric:
  - `subcommand`: The `terraform` subcommand that's run, e.g., `init`, `apply`,
    `plan`, `destroy`, etc.
  - `mode`: The execution mode of the Terraform CLI, either `sync` (the CLI was
    invoked synchronously as part of a reconcile loop) or `async` (the CLI was
    invoked asynchronously, and the reconciler goroutine will poll and collect
    the results later).
- Labels associated with the `upjet_terraform_active_cli_invocations` metric:
  - `subcommand`: The `terraform` subcommand that's run, e.g., `init`, `apply`,
    `plan`, `destroy`, etc.
  - `mode`: The execution mode of the Terraform CLI, either `sync` (the CLI was
    invoked synchronously as part of a reconcile loop) or `async` (the CLI was
    invoked asynchronously, and the reconciler goroutine will poll and collect
    the results later).
- Labels associated with the `upjet_terraform_running_processes` metric:
  - `type`: Either `cli` for Terraform CLI processes (the `terraform` process)
    or `provider` for the Terraform provider processes. Please note that this is
    a best-effort metric that may not be able to precisely catch and report all
    relevant processes. We may, in the future, improve this if needed by, for
    example, watching the `fork` system calls.
    But currently, it may prove useful for watching rogue Terraform provider
    processes.
- Labels associated with the `upjet_resource_ttr` metric:
  - `group`, `version`, `kind` labels record the
    [API group, version and kind](https://kubernetes.io/docs/reference/using-api/api-concepts/)
    for the managed resource, whose
    [time-to-readiness](https://github.com/crossplane/terrajet/issues/55#issuecomment-929494212)
    measurement is captured.

## Examples

You can [export](https://book.kubebuilder.io/reference/metrics.html) all these
custom metrics and the `controller-runtime` metrics from the provider pod for
Prometheus. Here are some examples showing the custom metrics in action from the
Prometheus console:

- `upjet_terraform_active_cli_invocations` gauge metric showing the sync & async
  `terraform init/apply/plan/destroy` invocations: <img width="3000" alt="image"
  src="https://user-images.githubusercontent.com/9376684/223296539-94e7d634-58b0-4d3f-942e-8b857eb92ef7.png">

- `upjet_terraform_running_processes` gauge metric showing both `cli` and
  `provider` labels: <img width="2999" alt="image"
  src="https://user-images.githubusercontent.com/9376684/223297575-18c2232e-b5af-4cc1-916a-d61fe5dfb527.png">

- `upjet_terraform_cli_duration` histogram metric, showing average Terraform CLI
  running times for the last 5m: <img width="2993" alt="image"
  src="https://user-images.githubusercontent.com/9376684/223299401-8f128b74-8d9c-4c82-86c5-26870385bee7.png">

- The medians (0.5-quantiles) for these observations aggregated by the mode and
  Terraform subcommand being invoked: <img width="2999" alt="image"
  src="https://user-images.githubusercontent.com/9376684/223300766-c1adebb9-bd19-4a38-9941-116185d8d39f.png">

- `upjet_resource_ttr` histogram metric, showing average resource TTR for the
  last 10m: <img width="2999" alt="image"
  src="https://user-images.githubusercontent.com/9376684/223309711-edef690e-2a59-419b-bb93-8f837496bec8.png">

- The median (0.5-quantile) for these TTR observations: <img width="3002"
  alt="image"
  src="https://user-images.githubusercontent.com/9376684/223309727-d1a0f4e2-1ed2-414b-be67-478a0575ee49.png">

These samples have been collected by provisioning 10 [upbound/provider-aws]
`cognitoidp.UserPool` resources while running the provider with a poll interval
of 1m. In these examples, one can observe that the resources were polled
(reconciled) twice after they acquired the `Ready=True` condition and after
that, they were destroyed.

## Reference

You can find a full reference of the exposed metrics from the Upjet-based
providers [here](provider_metrics_help.txt).

[Kubernetes controller-runtime]: https://github.com/kubernetes-sigs/controller-runtime
[upbound/provider-aws]: https://github.com/upbound/provider-aws
[upbound/provider-azure]: https://github.com/upbound/provider-azure
[upbound/provider-azuread]: https://github.com/upbound/provider-azuread
[upbound/provider-gcp]: https://github.com/upbound/provider-gcp
[`controller-runtime` client]: https://github.com/kubernetes-sigs/controller-runtime/blob/60af59f5b22335516850ca11c974c8f614d5d073/pkg/metrics/client_go_adapter.go#L40
[leader election client]: https://github.com/kubernetes-sigs/controller-runtime/blob/60af59f5b22335516850ca11c974c8f614d5d073/pkg/metrics/leaderelection.go#L12
[controller workqueues]: https://github.com/kubernetes-sigs/controller-runtime/blob/60af59f5b22335516850ca11c974c8f614d5d073/pkg/metrics/workqueue.go#L40
[labels]: https://prometheus.io/docs/practices/naming/#labels
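
To inspect the histogram metrics without a Prometheus server, the text served at the `/metrics` endpoint can also be read directly. The following is a minimal Python sketch, not part of Upjet, that relies on the standard Prometheus convention of exposing `_sum` and `_count` series for a histogram. The `SAMPLE` text and its values are made up for illustration, the helper names `parse_samples` and `mean_cli_duration` are hypothetical, and the parser is deliberately simplified (it does not handle escaped quotes or `}` inside label values). It computes the mean duration since process start per `subcommand` and `mode`, which is not the same as the windowed 5m averages mentioned in the examples.

```python
import re
from collections import defaultdict

# Hypothetical scrape output; the sample values are invented for illustration.
SAMPLE = """\
# TYPE upjet_terraform_cli_duration histogram
upjet_terraform_cli_duration_sum{mode="sync",subcommand="init"} 6.0
upjet_terraform_cli_duration_count{mode="sync",subcommand="init"} 4
upjet_terraform_cli_duration_sum{mode="async",subcommand="apply"} 90.0
upjet_terraform_cli_duration_count{mode="async",subcommand="apply"} 3
upjet_terraform_running_processes{type="cli"} 2
"""

# Simplified patterns for "name{labels} value" sample lines.
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Yield (metric_name, labels_dict, float_value) for each sample line."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip anything the simplified pattern cannot read
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def mean_cli_duration(text, metric="upjet_terraform_cli_duration"):
    """Return {(subcommand, mode): mean seconds} computed as _sum / _count."""
    sums = defaultdict(float)
    counts = defaultdict(float)
    for name, labels, value in parse_samples(text):
        key = (labels.get("subcommand"), labels.get("mode"))
        if name == metric + "_sum":
            sums[key] += value
        elif name == metric + "_count":
            counts[key] += value
    return {key: sums[key] / counts[key] for key in counts if counts[key] > 0}


if __name__ == "__main__":
    for (subcommand, mode), mean in sorted(mean_cli_duration(SAMPLE).items()):
        print(f"{subcommand:8s} {mode:6s} {mean:.2f}s")
```

In practice the text would come from the provider pod's `/metrics` endpoint rather than from a hard-coded string, and a Prometheus server remains the better tool for windowed averages and quantiles.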