sigs.k8s.io/cluster-api@v1.7.1/docs/proposals/20220411-cluster-api-state-metrics.md

     1  ---
     2  title: Cluster API State Metrics
     3  authors:
     4    - "@tobiasgiese"
     5    - "@chrischdi"
     6  reviewers:
     7    - "@johannesfrey"
     8    - "@enxebre"
     9    - "@sbueringer"
    10    - "@apricote"
    11    - "@fabriziopandini"
    12  creation-date: 2022-03-03
    13  last-updated: 2022-09-07
    14  status: experimental
    15  ---
    16  
    17  # Cluster API State Metrics
    18  
    19  ## Table of Contents
    20  
    21  <!-- START doctoc generated TOC please keep comment here to allow auto update -->
    22  <!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
    23  
    24  - [Glossary](#glossary)
    25  - [Summary](#summary)
    26  - [Motivation](#motivation)
    27    - [Goals](#goals)
    28    - [Non-Goals](#non-goals)
    29    - [Future Work](#future-work)
    30  - [Proposal](#proposal)
    31    - [User Stories](#user-stories)
    32      - [Story 1](#story-1)
    33      - [Story 2](#story-2)
    34    - [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
    35      - [Scrapable Information](#scrapable-information)
    36      - [Relationship to kube-state-metrics](#relationship-to-kube-state-metrics)
    37      - [How does kube-state-metrics work](#how-does-kube-state-metrics-work)
    38    - [Use kube-state-metrics custom resource configuration](#use-kube-state-metrics-custom-resource-configuration)
    39    - [Security Model](#security-model)
    40    - [Risks and Mitigations](#risks-and-mitigations)
    41  - [Alternatives](#alternatives)
    42    - [Reuse kube-state-metrics packages](#reuse-kube-state-metrics-packages)
    43      - [Package structure for cluster-api-state-metrics](#package-structure-for-cluster-api-state-metrics)
    44    - [Expose metrics by the controllers](#expose-metrics-by-the-controllers)
    45  - [Upgrade Strategy](#upgrade-strategy)
    46  - [Additional Details](#additional-details)
    47    - [Metrics](#metrics)
    48      - [Cluster CR](#cluster-cr)
    49      - [KubeadmControlPlane CR](#kubeadmcontrolplane-cr)
    50      - [MachineDeployment CR](#machinedeployment-cr)
    51      - [MachineSet CR](#machineset-cr)
    52      - [Machine CR](#machine-cr)
    53      - [MachineHealthCheck CR](#machinehealthcheck-cr)
    54    - [Graduation Criteria](#graduation-criteria)
    55  - [Implementation History](#implementation-history)
    56  
    57  <!-- END doctoc generated TOC please keep comment here to allow auto update -->
    58  
    59  ## Glossary
    60  
    61  - `CustomResource` (CR)
    62  - `CustomResourceDefinition` (CRD)
    63  - Cluster API State Metrics (CASM)
    64  
    65  Refer to the [Cluster API Book Glossary].
    66  
    67  ## Summary
    68  
    69  This proposal outlines adding kube-state-metrics as a new component for exposing metrics specific to CAPI's CRs.
    70  
    71  This solution is derived from [mercedes-benz/cluster-api-state-metrics], and all credit for its inception goes to the team that created it: [@chrischdi](https://github.com/chrischdi), [@seanschneeweiss](https://github.com/seanschneeweiss), [@tobiasgiese](https://github.com/tobiasgiese).
    72  
    73  Special thanks go to [Mercedes-Benz], who allowed the donation of this project to the CNCF and to the Cluster API community.
    74  
    75  It also builds on the custom resource feature of [kube-state-metrics], which was introduced in v2.5.0 and improved in v2.6.0.
    76  
    77  ## Motivation
    78  
    79  As of now, CAPI controllers only expose metrics that are provided by controller-runtime. These metrics are specific to the controllers' internal behavior and thus do not provide any information about the state of the clusters provisioned by CAPI.
    80  
    81  This proposal introduces kube-state-metrics together with a Custom Resource configuration to generate a new set of metrics that will help end users monitor the state of the Cluster API resources in their [OpenMetrics]-compatible monitoring system of choice.
    82  
    83  ### Goals
    84  
    85  - Define metrics to monitor the state of the Cluster API resources, thus solving parts of the [metrics umbrella issue #1477]
    86  - Add a Custom Resource configuration for kube-state-metrics, which then exposes the metrics below following the [OpenMetrics] standard
    87  - Make metrics part of the CAPI developer workflow by making them accessible via Tilt
    88  - Implement metrics for core CAPI CRs
    89  
    90  ### Non-Goals
    91  
    92  - Implement metrics for provider specific CAPI CRs
    93  
    94  ### Future Work
    95  
    96  - Metrics for resources of Cluster API providers (e.g., CAPA, CAPO, CAPZ, ...)
    97  - A collection of example alerts, e.g. to alert when
    98    - a KubeadmControlPlane is not healthy.
    99    - 70% of my worker Machines are not healthy.
   100    - 70% of my Machines are blocked on deletion.
   101  - Auto-generation of the metric definition from markers at the type definitions.
   102  - Introduction of generic `*_labels` metrics which work in the same way as `kube_*_labels` metrics.
   103    - This needs implementation on kube-state-metrics side to also include using the configuration from the existing flags.
   104    - For now, a custom user-specified `Info` metric could get configured to create a customized `*_labels` metric which exposes explicitly listed labels.
   105  
   106  ## Proposal
   107  
   108  This proposal introduces a CAPI specific configuration for kube-state-metrics to expose metrics for CAPI specific CRs.
   109  
   110  The configuration could be deployed using the kube-state-metrics helm chart and should leverage the custom-resource configuration file.
   111  
   112  In the future, the configuration for kube-state-metrics may be added to the release artifacts of CAPI.
   113  
   114  ### User Stories
   115  
   116  #### Story 1
   117  
   118  As a service provider/cluster operator, I want to have metrics for the Cluster API CRs to create alerts for cluster lifecycle symptoms that might impact workload availability.
   119  
   120  #### Story 2
   121  
   122  As an application developer, I would like to deploy kube-state-metrics including the CAPI configuration together with the Prometheus stack via Tilt. This enables further analysis using the metrics, such as measuring the duration of the various provisioning phases.
   123  
   124  ### Implementation Details/Notes/Constraints
   125  
   126  #### Scrapable Information
   127  
   128  The following Cluster API CRDs currently exist.
   129  The *In-scope* column marks CRDs for which metrics should be exposed.
   130  In future iterations, other CRs or configuration for provider-specific CRDs could be added.
   131  
   132  | Name                      | API Group/Version                       | In-scope |
   133  |---------------------------|-----------------------------------------|----------|
   134  | Cluster                   | `cluster.x-k8s.io/v1beta1`              | yes      |
   135  | ClusterClass              | `cluster.x-k8s.io/v1beta1`              | no       |
   136  | MachineDeployment         | `cluster.x-k8s.io/v1beta1`              | yes      |
   137  | MachineSet                | `cluster.x-k8s.io/v1beta1`              | yes      |
   138  | Machine                   | `cluster.x-k8s.io/v1beta1`              | yes      |
   139  | KubeadmConfig             | `bootstrap.cluster.x-k8s.io/v1beta1`    | no       |
   140  | KubeadmConfigTemplate     | `bootstrap.cluster.x-k8s.io/v1beta1`    | no       |
   141  | KubeadmControlPlane       | `controlplane.cluster.x-k8s.io/v1beta1` | yes      |
   142  | ClusterResourceSetBinding | `addons.cluster.x-k8s.io/v1beta1`       | no       |
   143  | ClusterResourceSet        | `addons.cluster.x-k8s.io/v1beta1`       | no       |
   144  | MachineHealthCheck        | `cluster.x-k8s.io/v1beta1`              | yes      |
   145  | MachinePool               | `cluster.x-k8s.io/v1beta1`              | yes      |
   146  
   147  The relationships between the resources can be found in the [corresponding section in the cluster-api docs].
   148  
   149  #### Relationship to kube-state-metrics
   150  
   151  There are several CustomResources introduced by Cluster API and Cluster API providers that are similar to core Kubernetes resources.
   152  Because of that it may make sense to implement metrics inspired by kube-state-metrics.
   153  
   154  The following table lists possible mappings from core resources to CRDs:
   155  
   156  | Cluster API CRD       | kube-state-metrics equivalent |
   157  |-----------------------|-------------------------------|
   158  | `Machine`             | [Pod]                         |
   159  | `MachineSet`          | [ReplicaSet]                  |
   160  | `MachineDeployment`   | [Deployment]                  |
   161  | `KubeadmControlPlane` | [Statefulset], [Deployment]   |
   162  
   163  CRDs missing in this table:
   164  
   165  - `Cluster`
   166  - `ClusterClass`
   167  - `ClusterResourceSet`
   168  - `ClusterResourceSetBinding`
   169  - `MachineHealthCheck`
   170  - `MachinePool`
   171  - `KubeadmConfig`
   172  - `KubeadmConfigTemplate`
   173  
   174  The `Cluster` CR carries important information in its status fields, such as `status.Ready` or `status.Conditions`, similar to what [Pod] metrics expose.
   175  
   176  Currently, it is not important to implement metrics for `KubeadmConfig` and `KubeadmConfigTemplate` because both only contain configuration data (e.g., passed via cloud-init to the machine). However, they may be compared to `ConfigMaps` or `Secrets`.
   177  
   178  #### How does kube-state-metrics work
   179  
   180  kube-state-metrics exposes metrics via an HTTP endpoint to be consumed by either Prometheus itself or a compatible scraper [[1]].
   181  
   182  Starting with v1.5, large performance improvements were introduced to kube-state-metrics, which are documented in the [Performance Optimization Proposal](https://github.com/kubernetes/kube-state-metrics/blob/master/docs/design/metrics-store-performance-optimization.md#Proposal). This document also explains the current internals of kube-state-metrics.
   183  
   184  It caches the current state of the metrics using an internal cache and updates this internal state on add, update and delete events of watched resources.
   185  
   186  On requests to `/metrics` the cached data gets concatenated to a single string and returned as a response.
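The caching behavior described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual kube-state-metrics implementation: the `metricsStore` type, its method names, and the sample metric lines are hypothetical and chosen only for this example.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// metricsStore caches pre-rendered metric lines per object, keyed by UID.
// Hypothetical stand-in for the internal cache described above.
type metricsStore struct {
	mu      sync.RWMutex
	metrics map[string][]string // object UID -> rendered metric lines
}

func newMetricsStore() *metricsStore {
	return &metricsStore{metrics: map[string][]string{}}
}

// Upsert handles add and update events by replacing the cached lines.
func (s *metricsStore) Upsert(uid string, lines []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.metrics[uid] = lines
}

// Delete handles delete events by dropping the cached lines.
func (s *metricsStore) Delete(uid string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.metrics, uid)
}

// Render concatenates all cached lines into a single response body, as a
// handler for /metrics would. UIDs are sorted only to keep output stable.
func (s *metricsStore) Render() string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	uids := make([]string, 0, len(s.metrics))
	for uid := range s.metrics {
		uids = append(uids, uid)
	}
	sort.Strings(uids)
	var b strings.Builder
	for _, uid := range uids {
		for _, line := range s.metrics[uid] {
			b.WriteString(line)
			b.WriteByte('\n')
		}
	}
	return b.String()
}

func main() {
	store := newMetricsStore()
	store.Upsert("uid-a", []string{`capi_cluster_info{cluster="a"} 1`})
	store.Upsert("uid-b", []string{`capi_cluster_info{cluster="b"} 1`})
	store.Delete("uid-a") // the cluster "a" object was deleted
	fmt.Print(store.Render())
}
```

The point of this design is that the work of rendering a metric happens once per watch event rather than once per scrape, so a request to `/metrics` only has to concatenate cached strings.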
   187  
   188  ### Use kube-state-metrics custom resource configuration
   189  
   190  kube-state-metrics v2.5.0 introduced an experimental feature to create metrics for CRDs.
   191  The feature for v2.5.0 did not fulfill all requirements for the metrics of this proposal.
   192  Because of that, improvements were contributed to kube-state-metrics to extend the feature and support more use cases, including the metrics of this proposal.
   193  
   194  To implement this proposal, a suitable configuration file for kube-state-metrics should be provided, and kube-state-metrics should be used to expose the metrics.
   195  
   196  ### Security Model
   197  
   198  - RBAC definitions should be generated from markers at the type definitions in the future.
   199  - RBAC definitions should only grant `get`, `list` and `watch` permissions to CRs relevant for the application.
   200  
   201  ### Risks and Mitigations
   202  
   203  This initial implementation provides a baseline on which incremental changes can be introduced in the future. Instead of encompassing all possible use cases under a single proposal, this proposal mitigates the risk of waiting too long to consider all required use cases regarding this topic.
   204  
   205  ## Alternatives
   206  
   207  ### Reuse kube-state-metrics packages
   208  
   209  Cluster-api-state-metrics could re-use packages provided by kube-state-metrics to implement the metrics. This allows re-use of flags, configuration and extended functionality like sharding or TLS configuration without additional implementation.
   210  
   211  When first writing the proposal, this was the favoured option because at that time kube-state-metrics did not support configuration for custom resource metrics. This is no longer the case.
   212  
   213  An [extension mechanism](https://github.com/kubernetes/kube-state-metrics/pull/1644) was introduced to the kube-state-metrics packages which allows using its basic mechanism but defining custom metrics for Custom Resources.
   214  The `k8s.io/kube-state-metrics/v2/pkg/customresource.RegistryFactory`[[2]] interface was [introduced](https://github.com/kubernetes/kube-state-metrics/pull/1644) to allow defining custom metrics for Custom Resources while leveraging kube-state-metrics logic.
   215  
   216  Cluster-api-state-metrics will have to implement the `customresource.RegistryFactory` interface for each custom resource.
   217  The interface defines the function `MetricFamilyGenerators(allowAnnotationsList, allowLabelsList []string) []generator.FamilyGenerator` to be implemented which then contains the specific metric implementations.
   218  A detailed implementation example is available at the [package documentation](https://pkg.go.dev/k8s.io/kube-state-metrics/v2@v2.4.2/pkg/customresource#RegistryFactory).
   219  
   220  These `customresource.RegistryFactory` implementations get used in a `main.go` which configures and starts the application by using `k8s.io/kube-state-metrics/v2/pkg/app.RunKubeStateMetrics(...)`[[3]].
   221  
   222  #### Package structure for cluster-api-state-metrics
   223  
   224  - `/exp/state-metrics/pkg/store` contains the metric implementation exposed by the `customresource.RegistryFactory`[[2]] interface.
   225    - `/exp/state-metrics/pkg/store/{cluster,kubeadmcontrolplane,machinedeployment,machine,...}.go` for the custom resource specific implementation of `customresource.RegistryFactory`
   226    - `/exp/state-metrics/pkg/store/factory.go` for implementing the `Factories()` function which groups and exposes the `customresource.RegistryFactory` implementations of this package
   227  - `/exp/state-metrics/main.go` which:
   228    - imports the CAPI custom resource specific metric *factories* implemented and exposed via `/exp/state-metrics/pkg/store.Factories()`
   229    - imports and uses `k8s.io/kube-state-metrics/v2/pkg/options.NewOptions()`[[3]] to define the same CLI flags and options as kube-state-metrics, except the enabled metrics.
   230    - imports and uses `k8s.io/kube-state-metrics/v2/pkg/app.RunKubeStateMetrics(...)`[[3]] to start the metrics server using the given options and the custom registry factory from `store.Factories()`.
   231  
   232  ### Expose metrics by the controllers
   233  
   234  It may make sense to implement the metrics directly within the CAPI controllers.
   235  This might be a working solution but would have the following disadvantages:
   236  
   237  - Requires changes in at least 3 controllers:
   238    - `kubeadm-bootstrap-controller`
   239    - `kubeadm-control-plane-controller`
   240    - `capi-controller-manager`
   241    - And potentially in every provider-specific controller, e.g.: `capd-controller-manager`
   242  - Bugs in the metrics implementation may also negatively impact provisioning, e.g. if the controller runs out of memory or the application crashes due to `invalid memory address` or `nil pointer dereference`.
   243  
   244  Nevertheless, including metrics directly in the controllers may be valid for future iterations, but certainly not for an experimental feature.
   245  
   246  ## Upgrade Strategy
   247  
   248  After being introduced and validated to be valuable, the configuration file for kube-state-metrics could be provided as a release artifact.
   249  Users could also update their kube-state-metrics configuration after upgrading Cluster API to expose the latest compatible metrics.
   250  The conversion webhooks provided by the controllers for the Custom Resources allow metrics to still be exposed during the version shift, even when the default APIVersion gets changed by the Cluster API upgrade.
   251  
   252  ## Additional Details
   253  
   254  ### Metrics
   255  
   256  #### Cluster CR
   257  
   258  Common labels:
   259  
   260  - `cluster=<cluster-name>`
   261  
   262  | metric name                        | value                             | type  | additional labels/tags                                                                                                                                                                                               | xref                    |
   263  |------------------------------------|-----------------------------------|-------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------|
   264  | `capi_cluster_created`             | `.metadata.creationTimestamp`     | Gauge |                                                                                                                                                                                                                      | common                  |
   265  | `capi_cluster_annotation_paused` * | `1`                               | Gauge | `paused_value=<paused annotation value>`                                                                                                                                                                             |                         |
   266  | `capi_cluster_info`                | `1`                               | Gauge | `topology_version=.spec.topology.version`<br>`topology_class=.spec.topology.class`<br>`control_plane_endpoint_host=.spec.controlPlaneEndpoint.host`<br>`control_plane_endpoint_port=.spec.controlPlaneEndpoint.port` |                         |
   267  | `capi_cluster_spec_paused`         | `.spec.paused`                    | Gauge |                                                                                                                                                                                                                      |                         |
   268  | `capi_cluster_status_condition`    | `.status.conditions==<condition>` | Gauge | `condition=<condition>` `status=<true\|false>`                                                                                                                                                                       |                         |
   269  | `capi_cluster_status_phase`        | `.status.phase==<phase>`          | Gauge | `phase=<phase>`                                                                                                                                                                                                      | [Pod], [cluster phases] |
   270  
   271  *: A metric will only be exposed if the annotation exists. If so, it will always have a value of `1` and expose a label which contains the annotation's value. Because Prometheus drops labels with an empty value, an annotation with an empty value would otherwise be indistinguishable from an annotation that is not set.
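The semantics of the `*_annotation_paused` metrics can be illustrated with a small Go sketch. It assumes the paused annotation key is `cluster.x-k8s.io/paused`. The `pausedMetric` helper and the hand-rendered exposition line are hypothetical and only demonstrate the intended behavior; in the proposed setup the metric would come from kube-state-metrics configuration rather than from custom code.

```go
package main

import (
	"fmt"
	"strconv"
)

// pausedAnnotation is assumed to be the Cluster API paused annotation key.
const pausedAnnotation = "cluster.x-k8s.io/paused"

// pausedMetric returns the metric line for a cluster and true if the paused
// annotation exists. The value is always 1; the annotation value, even an
// empty one, is carried in the paused_value label.
func pausedMetric(cluster string, annotations map[string]string) (string, bool) {
	value, ok := annotations[pausedAnnotation]
	if !ok {
		// Annotation not set: no series is exposed at all.
		return "", false
	}
	line := fmt.Sprintf("capi_cluster_annotation_paused{cluster=%s,paused_value=%s} 1",
		strconv.Quote(cluster), strconv.Quote(value))
	return line, true
}

func main() {
	// Annotation present with an empty value: the series still exists.
	if line, ok := pausedMetric("my-cluster", map[string]string{pausedAnnotation: ""}); ok {
		fmt.Println(line)
	}
	// Annotation absent: nothing is printed.
	if line, ok := pausedMetric("other-cluster", map[string]string{}); ok {
		fmt.Println(line)
	}
}
```

Because the presence of the series itself signals that the annotation is set, a query can test for existence of the metric instead of inspecting the `paused_value` label.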
   272  
   273  #### KubeadmControlPlane CR
   274  
   275  Common labels:
   276  
   277  - `kubeadmcontrolplane=<kubeadmcontrolplane-name>`
   278  
   279  | metric name                                                      | value                                          | type  | additional labels/tags                              | xref         |
   280  |------------------------------------------------------------------|------------------------------------------------|-------|-----------------------------------------------------|--------------|
   281  | `capi_kubeadmcontrolplane_created`                               | `.metadata.creationTimestamp`                  | Gauge |                                                     | common       |
   282  | `capi_kubeadmcontrolplane_annotation_paused` *                   | `1`                                            | Gauge | `paused_value=<paused annotation value>`            |              |
   283  | `capi_kubeadmcontrolplane_status_condition`                      | `.status.conditions==<condition>`              | Gauge | `condition=<condition>` `status=<true\|false>`      |              |
   284  | `capi_kubeadmcontrolplane_status_replicas`                       | `.status.replicas`                             | Gauge |                                                     | [Deployment] |
   285  | `capi_kubeadmcontrolplane_status_replicas_ready`                 | `.status.readyReplicas`                        | Gauge |                                                     | [Deployment] |
   286  | `capi_kubeadmcontrolplane_status_replicas_unavailable`           | `.status.unavailableReplicas`                  | Gauge |                                                     | [Deployment] |
   287  | `capi_kubeadmcontrolplane_status_replicas_updated`               | `.status.updatedReplicas`                      | Gauge |                                                     | [Deployment] |
   288  | `capi_kubeadmcontrolplane_spec_replicas`                         | `.spec.replicas`                               | Gauge |                                                     | [Deployment] |
   289  | `capi_kubeadmcontrolplane_spec_strategy_rollingupdate_max_surge` | `.spec.rolloutStrategy.rollingUpdate.maxSurge` | Gauge |                                                     | [Deployment] |
   290  | `capi_kubeadmcontrolplane_info`                                  | `1`                                            | Gauge | `version=.spec.version`                             | [Pod]        |
   291  | `capi_kubeadmcontrolplane_owner`                                 | `1`                                            | Gauge | `owner_kind=<owner kind>` `owner_name=<owner name>` | [ReplicaSet] |
   292  
   293  *: A metric will only be exposed if the annotation exists. If so, it will always have a value of `1` and expose a label which contains the annotation's value. Because Prometheus drops labels with an empty value, an annotation with an empty value would otherwise be indistinguishable from an annotation that is not set.
   294  
   295  #### MachineDeployment CR
   296  
   297  Common labels:
   298  
   299  - `machinedeployment=<machinedeployment-name>`
   300  
   301  | metric name                                                          | value                                         | type  | additional labels/tags                              | xref                              |
   302  |----------------------------------------------------------------------|-----------------------------------------------|-------|-----------------------------------------------------|-----------------------------------|
   303  | `capi_machinedeployment_created`                                     | `.metadata.creationTimestamp`                 | Gauge |                                                     | common                            |
   304  | `capi_machinedeployment_annotation_paused` *                         | `1`                                           | Gauge | `paused_value=<paused annotation value>`            |                                   |
   305  | `capi_machinedeployment_spec_paused`                                 | `.spec.paused`                                | Gauge |                                                     |                                   |
   306  | `capi_machinedeployment_status_condition`                            | `.status.conditions==<condition>`             | Gauge | `condition=<condition>` `status=<true\|false>`      |                                   |
   307  | `capi_machinedeployment_status_replicas`                             | `.status.replicas`                            | Gauge |                                                     | [Deployment]                      |
   308  | `capi_machinedeployment_status_replicas_available`                   | `.status.availableReplicas`                   | Gauge |                                                     | [Deployment]                      |
   309  | `capi_machinedeployment_status_replicas_ready`                       | `.status.readyReplicas`                       | Gauge |                                                     | [Deployment]                      |
   310  | `capi_machinedeployment_status_replicas_unavailable`                 | `.status.unavailableReplicas`                 | Gauge |                                                     | [Deployment]                      |
   311  | `capi_machinedeployment_status_replicas_updated`                     | `.status.updatedReplicas`                     | Gauge |                                                     | [Deployment]                      |
   312  | `capi_machinedeployment_spec_replicas`                               | `.spec.replicas`                              | Gauge |                                                     | [Deployment]                      |
   313  | `capi_machinedeployment_spec_strategy_rollingupdate_max_unavailable` | `.spec.strategy.rollingUpdate.maxUnavailable` | Gauge |                                                     | [Deployment]                      |
   314  | `capi_machinedeployment_spec_strategy_rollingupdate_max_surge`       | `.spec.strategy.rollingUpdate.maxSurge`       | Gauge |                                                     | [Deployment]                      |
   315  | `capi_machinedeployment_status_phase`                                | `.status.phase==<phase>`                      | Gauge | `phase=<phase>`                                     | [Pod], [machinedeployment phases] |
   316  | `capi_machinedeployment_owner`                                       | `1`                                           | Gauge | `owner_kind=<owner kind>` `owner_name=<owner name>` | [ReplicaSet]                      |
   317  
   318  *: A metric will only be exposed if the annotation exists. If so, it will always have a value of `1` and expose a label which contains the annotation's value. Because Prometheus drops labels with an empty value, an annotation with an empty value would otherwise be indistinguishable from an annotation that is not set.
   319  
   320  #### MachineSet CR
   321  
   322  Common labels:
   323  
   324  - `machineset=<machineset-name>`
   325  
   326  | metric name                                     | value                             | type  | additional labels/tags                              | xref         |
   327  |-------------------------------------------------|-----------------------------------|-------|-----------------------------------------------------|--------------|
   328  | `capi_machineset_created`                       | `.metadata.creationTimestamp`     | Gauge |                                                     | common       |
   329  | `capi_machineset_annotation_paused` *           | `1`                               | Gauge | `paused_value=<paused annotation value>`            |              |
   330  | `capi_machineset_status_available_replicas`     | `.status.availableReplicas`       | Gauge |                                                     | [ReplicaSet] |
   331  | `capi_machineset_status_condition`              | `.status.conditions==<condition>` | Gauge | `condition=<condition>` `status=<true\|false>`      |              |
   332  | `capi_machineset_status_replicas`               | `.status.replicas`                | Gauge |                                                     | [ReplicaSet] |
   333  | `capi_machineset_status_fully_labeled_replicas` | `.status.fullyLabeledReplicas`    | Gauge |                                                     | [ReplicaSet] |
   334  | `capi_machineset_status_ready_replicas`         | `.status.readyReplicas`           | Gauge |                                                     | [ReplicaSet] |
   335  | `capi_machineset_spec_replicas`                 | `.spec.replicas`                  | Gauge |                                                     | [ReplicaSet] |
   336  | `capi_machineset_owner`                         | `1`                               | Gauge | `owner_kind=<owner kind>` `owner_name=<owner name>` | [ReplicaSet] |
   337  
   338  *: A metric will only be exposed if the annotation exists. If so, it will always have a value of `1` and expose a label which contains the annotation's value. Because Prometheus drops labels with an empty value, an annotation with an empty value would otherwise be indistinguishable from an annotation that is not set.
   339  
   340  #### Machine CR
   341  
   342  Common labels:
   343  
   344  - `machine=<machine-name>`
   345  
   346  | metric name                        | value                             | type  | additional labels/tags                                                                                                                       | xref                    |
   347  |------------------------------------|-----------------------------------|-------|----------------------------------------------------------------------------------------------------------------------------------------------|-------------------------|
   348  | `capi_machine_created`             | `.metadata.creationTimestamp`     | Gauge |                                                                                                                                              | common                  |
   349  | `capi_machine_annotation_paused` * | `1`                               | Gauge | `paused_value=<paused annotation value>`                                                                                                     |                         |
   350  | `capi_machine_status_condition`    | `.status.conditions==<condition>` | Gauge | `condition=<condition>`<br>`status=<true\|false>`                                                                                            |                         |
   351  | `capi_machine_status_phase`        | `.status.phase==<phase>`          | Gauge | `phase=<phase>`                                                                                                                              | [Pod], [machine phases] |
   352  | `capi_machine_owner`               | `1`                               | Gauge | `owner_kind=<owner kind>`<br>`owner_name=<owner name>`                                                                                       | [ReplicaSet]            |
   353  | `capi_machine_info`                | `1`                               | Gauge | `internal_ip=<.status.addresses>`<br>`version=<.spec.version>`<br>`provider_id=<.spec.providerID>`<br>`failure_domain=<.spec.failureDomain>` | [Pod]                   |
   354  | `capi_machine_status_noderef`      | `1`                               | Gauge | `node=<.status.nodeRef.name>`                                                                                                                |                         |
   355  
   356  *: A metric will only be exposed if the annotation exists. If so, it will always have a value of `1` and expose a label which contains the annotation's value. Because Prometheus drops labels with an empty value, an annotation with an empty value would otherwise be indistinguishable from an annotation that is not set.
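The `.status.phase==<phase>` notation used for the `*_status_phase` metrics can be read as "one series per phase, with value `1` for the current phase". The following Go sketch illustrates that reading for `capi_machine_status_phase`. The phase list, the `phaseSamples` helper, and the choice to emit `0` for non-matching phases (rather than omitting them) are assumptions made for illustration, not a statement of how kube-state-metrics is configured.

```go
package main

import "fmt"

// machinePhases is an assumed, illustrative list of Machine phases.
var machinePhases = []string{
	"Pending", "Provisioning", "Provisioned", "Running",
	"Deleting", "Deleted", "Failed", "Unknown",
}

// sample is one series of the phase metric: a phase label and its value.
type sample struct {
	Phase string
	Value int
}

// phaseSamples returns one sample per known phase. The sample whose phase
// equals the current phase has value 1, all others have value 0.
func phaseSamples(current string) []sample {
	out := make([]sample, 0, len(machinePhases))
	for _, p := range machinePhases {
		v := 0
		if p == current {
			v = 1
		}
		out = append(out, sample{Phase: p, Value: v})
	}
	return out
}

func main() {
	for _, s := range phaseSamples("Running") {
		fmt.Printf("capi_machine_status_phase{machine=%q,phase=%q} %d\n",
			"worker-0", s.Phase, s.Value)
	}
}
```

Exposing the phase as a label with a boolean value keeps the metric numeric, which makes it possible to count Machines per phase by summing over the series.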
   357  
   358  #### MachineHealthCheck CR
   359  
   360  Common labels:
   361  
   362  - `machinehealthcheck=<machinehealthcheck-name>`
   363  
   364  | metric name                                           | value                             | type  | additional labels/tags                                 | xref         |
   365  |-------------------------------------------------------|-----------------------------------|-------|--------------------------------------------------------|--------------|
   366  | `capi_machinehealthcheck_created`                     | `.metadata.creationTimestamp`     | Gauge |                                                        | common       |
   367  | `capi_machinehealthcheck_annotation_paused` *         | `1`                               | Gauge | `paused_value=<paused annotation value>`               |              |
   368  | `capi_machinehealthcheck_owner`                       | `1`                               | Gauge | `owner_kind=<owner kind>`<br>`owner_name=<owner name>` | [ReplicaSet] |
   369  | `capi_machinehealthcheck_status_condition`            | `.status.conditions==<condition>` | Gauge | `condition=<condition>`<br>`status=<true\|false>`      |              |
   370  | `capi_machinehealthcheck_status_expected_machines`    | `.status.expectedMachines`        | Gauge |                                                        |              |
   371  | `capi_machinehealthcheck_status_current_healthy`      | `.status.currentHealthy`          | Gauge |                                                        |              |
   372  | `capi_machinehealthcheck_status_remediations_allowed` | `.status.remediationsAllowed`     | Gauge |                                                        |              |
   373  
   374  *: The metric is only exposed if the annotation exists. In that case it always has the value `1` and carries a label containing the annotation's value. Prometheus drops labels that have an empty value, so without this dedicated metric an annotation with an empty value would be indistinguishable from an annotation that is not set.
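
The footnote's rule for `capi_machinehealthcheck_annotation_paused` can be sketched in Go as follows. This is an illustration under stated assumptions, not the kube-state-metrics implementation: `pausedMetric` and the object names are hypothetical, and the annotation key is assumed to be Cluster API's `cluster.x-k8s.io/paused`.

```go
package main

import "fmt"

// pausedAnnotation is assumed to be the Cluster API paused annotation key.
const pausedAnnotation = "cluster.x-k8s.io/paused"

// pausedMetric is a hypothetical helper (not part of kube-state-metrics).
// It returns the exposition line for the paused annotation metric, or
// false when the annotation is absent and no series should be exposed.
func pausedMetric(name string, annotations map[string]string) (string, bool) {
	value, ok := annotations[pausedAnnotation]
	if !ok {
		return "", false
	}
	// The sample value is always 1; the annotation value travels in a label.
	return fmt.Sprintf(
		"capi_machinehealthcheck_annotation_paused{machinehealthcheck=%q,paused_value=%q} 1",
		name, value,
	), true
}

func main() {
	// An annotation set to the empty string still produces a series.
	if line, ok := pausedMetric("mhc-0", map[string]string{pausedAnnotation: ""}); ok {
		fmt.Println(line)
	}
	// Without the annotation, nothing is exposed.
	if line, ok := pausedMetric("mhc-1", nil); ok {
		fmt.Println(line)
	}
}
```

Because the series itself signals that the annotation is present, a query only needs to check whether the series exists and does not depend on the `paused_value` label surviving ingestion.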
   375  
   376  ### Graduation Criteria
   377  
   378  The initial plan is to add kube-state-metrics to the `./hack/observability` directory and allow it to be enabled via `tilt`.
   379  
   380  ## Implementation History
   381  
   382  - [x] 09/07/2022: Updated proposal to match current implementation state, removed `_labels` metrics due to lack of functionality in kube-state-metrics
   383  - [x] 03/02/2022: Proposed idea in an issue or [community meeting]
   384  - [x] 03/15/2022: Compile a Google Doc following the CAEP template
   385  - [x] 03/16/2022: Present proposal at a [community meeting]
   386  - [x] 03/16/2022: First round of feedback from community
   387  - [x] 04/11/2022: Open proposal PR
   388  
   389  <!-- Links -->
   390  
   391  [Cluster API Book Glossary]: https://cluster-api.sigs.k8s.io/reference/glossary.html
   392  [mercedes-benz/cluster-api-state-metrics]: https://github.com/mercedes-benz/cluster-api-state-metrics/
   393  [Mercedes-Benz]: https://opensource.mercedes-benz.com/
   394  [OpenMetrics]: https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md
   395  [metrics umbrella issue #1477]: https://github.com/kubernetes-sigs/cluster-api/issues/1477
   396  [corresponding section in the cluster-api docs]: https://cluster-api.sigs.k8s.io/developer/crd-relationships.html#worker-machines-relationships
   397  [Deployment]: https://github.com/kubernetes/kube-state-metrics/blob/main/docs/metrics/workload/deployment-metrics.md
   398  [ReplicaSet]: https://github.com/kubernetes/kube-state-metrics/blob/main/docs/metrics/workload/replicaset-metrics.md
   399  [Pod]: https://github.com/kubernetes/kube-state-metrics/blob/main/docs/metrics/workload/pod-metrics.md
   400  [StatefulSet]: https://github.com/kubernetes/kube-state-metrics/blob/main/docs/metrics/workload/statefulset-metrics.md
   401  [kube-state-metrics]: https://github.com/kubernetes/kube-state-metrics
   402  [1]: https://github.com/kubernetes/kube-state-metrics
   403  [2]: https://github.com/kubernetes/kube-state-metrics/blob/master/pkg/customresource/registry_factory.go#L29
   404  [3]: https://github.com/kubernetes/kube-state-metrics/blob/master/pkg/app/server.go
   405  [4]: https://github.com/kubernetes/kube-state-metrics/issues/457
   406  [machine phases]: https://github.com/kubernetes-sigs/cluster-api/blob/main/api/v1beta1/machine_phase_types.go
   407  [cluster phases]: https://github.com/kubernetes-sigs/cluster-api/blob/main/api/v1beta1/cluster_phase_types.go
   408  [machinedeployment phases]: https://github.com/kubernetes-sigs/cluster-api/blob/07c0a4809361927b15cde2747b34142b7c7ead15/api/v1beta1/machinedeployment_types.go#L222-L224
   409  [community meeting]: https://docs.google.com/document/d/1ushaVqAKYnZ2VN_aa3GyKlS4kEd6bSug13xaXOakAQI