sigs.k8s.io/cluster-api@v1.7.1/docs/book/src/developer/logging.md

# Logging
The Cluster API project is committed to improving the SRE/developer experience when troubleshooting issues, and logging
plays an important part in this goal.

In Cluster API we strive to follow three principles while implementing logging:

- **Logs are for SRE & developers, not for end users!**
  Whenever an end user is required to read logs to understand what is happening in the system, there is most probably an
  opportunity to improve other observability features of our API, e.g. conditions and events.
- **Navigating logs should be easy**:
  We should make sure that SREs/developers can easily drill down into logs while investigating issues, e.g. by making it
  possible to search all the log entries for a specific Machine object, possibly across different controllers/reconciler logs.
- **Cluster API developers MUST use logs!**
  As Cluster API contributors you are not only the ones who implement logs, but also their first users. Use them!
  Provide feedback!

## Upstream Alignment

Kubernetes defines a set of [logging conventions](https://git.k8s.io/community/contributors/devel/sig-instrumentation/logging.md),
as well as tools and libraries for logging.

## Continuous improvement

The foundational items of Cluster API logging are:

- Support for structured logging in all the Cluster API controllers (see [log format](#log-format)).
- Using contextual logging (see [contextual logging](#contextual-logging)).
- Adding a minimal set of key/value pairs in the logger at the beginning of each reconcile loop, so all the subsequent
  log entries will inherit them (see [key value pairs](#keyvalue-pairs)).

Building on these foundations, the long tail of small improvements consists of the following activities:

- Improve consistency of additional key/value pairs added by single log entries (see [key value pairs](#keyvalue-pairs)).
- Improve log messages (see [log messages](#log-messages)).
- Improve consistency of log levels (see [log levels](#log-levels)).

## Log Format

Controllers MUST provide support for [structured logging](https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/1602-structured-logging)
and for the [JSON output format](https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/1602-structured-logging#json-output-format);
quoting the Kubernetes documentation, these are the key elements of this approach:

- Separate a log message from its arguments.
- Treat log arguments as key-value pairs.
- Be easily parsable and queryable.

Cluster API uses all the tooling provided by the Kubernetes community to implement structured logging: [klog](https://github.com/kubernetes/klog), a
[logr](https://github.com/go-logr/logr) wrapper that works with controller runtime, and other utils for exposing flags
in the controller’s main.go.

Ideally, in a future release of Cluster API we will make the JSON output format the default for all the Cluster API
controllers (currently the default is still the text format).

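As a minimal sketch of the three key elements listed above, the following program emits one JSON entry in which the message is a constant string and the variable data travels as key/value pairs. It uses Go's standard `log/slog` package purely as a dependency-free stand-in; Cluster API itself uses klog and logr. The helper name `logJSON`, the key names and the object names are assumptions made for illustration.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log/slog"
)

// logJSON is a hypothetical helper: it writes a single structured entry in
// JSON format and returns the decoded entry so that it can be inspected.
func logJSON(msg string, keysAndValues ...any) (map[string]any, error) {
	var buf bytes.Buffer
	logger := slog.New(slog.NewJSONHandler(&buf, nil))

	// The message stays constant; arguments are passed as key/value pairs.
	logger.Info(msg, keysAndValues...)

	entry := map[string]any{}
	if err := json.Unmarshal(buf.Bytes(), &entry); err != nil {
		return nil, err
	}
	return entry, nil
}

func main() {
	entry, err := logJSON("Creating MachineSet",
		"MachineSet", "default/my-cluster-md-0-abc12", // hypothetical object name
		"replicas", 3,
	)
	if err != nil {
		panic(err)
	}
	// Because the output is JSON, each argument can be queried by its key.
	fmt.Println(entry["msg"], entry["MachineSet"], entry["replicas"])
}
```

Keeping the message free of interpolated values is what makes entries easy to group by message and filter by key.
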
## Contextual logging

[Contextual logging](https://github.com/kubernetes/enhancements/tree/master/keps/sig-instrumentation/3077-contextual-logging)
is the practice of using a logger stored in the context across the entire chain of calls of a reconcile
action. One of the main advantages of this approach is that key value pairs which are added to the logger at the
beginning of the chain are then inherited by all the subsequent log entries created down the chain.

Contextual logging is also embedded in controller runtime. In Cluster API we use contextual logging via controller runtime's
`LoggerFrom(ctx)` and `LoggerInto(ctx, log)` primitives, which ensures that:

- The logger passed to each reconcile call has a unique `reconcileID`, so all the logs being written during a single
  reconcile call can be easily identified (note: controller runtime also adds other useful key value pairs by default).
- The logger has a key value pair identifying the object being reconciled, e.g. a Machine Deployment, so all the logs
  impacting this object can be easily identified.

Cluster API developers MUST ensure that:

- The logger has a set of key value pairs identifying the hierarchy of objects the object being reconciled belongs to,
  e.g. the Cluster a Machine Deployment belongs to, so it will be possible to drill down logs for related Cluster API
  objects while investigating issues.

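The inheritance described above can be sketched with the standard library alone. In this sketch `loggerInto` and `loggerFrom` are hypothetical stand-ins for controller runtime's `LoggerInto(ctx, log)` and `LoggerFrom(ctx)`, `log/slog` replaces the real logr logger, and the `reconcileID` value and object names are made up.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"log/slog"
)

// loggerKey is the private context key under which the logger is stored.
type loggerKey struct{}

// loggerInto returns a copy of ctx that carries the given logger.
func loggerInto(ctx context.Context, l *slog.Logger) context.Context {
	return context.WithValue(ctx, loggerKey{}, l)
}

// loggerFrom returns the logger stored in ctx, or the default logger.
func loggerFrom(ctx context.Context) *slog.Logger {
	if l, ok := ctx.Value(loggerKey{}).(*slog.Logger); ok {
		return l
	}
	return slog.Default()
}

// reconcile enriches the logger once, at the beginning of the call chain.
func reconcile(ctx context.Context, cluster, machineDeployment string) {
	log := loggerFrom(ctx).With("Cluster", cluster, "MachineDeployment", machineDeployment)
	ctx = loggerInto(ctx, log)
	syncReplicas(ctx)
}

// syncReplicas adds no keys of its own, yet its entry carries the inherited ones.
func syncReplicas(ctx context.Context) {
	loggerFrom(ctx).Info("Syncing replicas")
}

// run executes one reconcile call and returns the decoded log entry.
func run() (map[string]any, error) {
	var buf bytes.Buffer
	// The reconcileID is attached before reconcile is called, as a framework would do.
	base := slog.New(slog.NewJSONHandler(&buf, nil)).With("reconcileID", "hypothetical-id")
	reconcile(loggerInto(context.Background(), base), "default/my-cluster", "default/my-cluster-md-0")

	entry := map[string]any{}
	err := json.Unmarshal(buf.Bytes(), &entry)
	return entry, err
}

func main() {
	entry, err := run()
	if err != nil {
		panic(err)
	}
	fmt.Println(entry["msg"], entry["reconcileID"], entry["Cluster"], entry["MachineDeployment"])
}
```

The entry written by `syncReplicas` contains `reconcileID`, `Cluster` and `MachineDeployment` even though that function never mentions them, which is the property that makes drilling down by object possible.
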
## Key/Value Pairs

One of the key elements of structured logging is key-value pairs.

Having consistent key value pairs is a requirement for ensuring readability and for providing support for searching and
correlating lines across logs.

A set of good practices for defining key value pairs is defined in the [Kubernetes Guidelines](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/migration-to-structured-logging.md#name-arguments), and
one of these practices is really important for Cluster API developers:

- Developers MUST use `klog.KObj` or `klog.KRef` functions when logging key value pairs for Kubernetes objects, thus
  ensuring a key value pair representing a Kubernetes object is formatted consistently in all the logs.

Please note that, in order to ensure logs can be easily searched, it is important to ensure consistency for the following
key value pairs (in order of importance):

- Key value pairs identifying the object being reconciled, e.g. a Machine Deployment.
- Key value pairs identifying the hierarchy of objects being reconciled, e.g. the Cluster a Machine Deployment belongs
  to.
- Key value pairs identifying side effects on other objects, e.g. while reconciling a MachineDeployment, the controller
  creates a MachineSet.
- Other key value pairs.

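To illustrate why a single object reference type helps consistency, here is a dependency-free sketch loosely modelled on the reference that `klog.KObj` and `klog.KRef` return. The names `objectRef`, `kobj` and `renderEntry` are hypothetical, and the real klog type is not used here; the point is only that one type yields one text form and one JSON form everywhere.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// objectRef is a hypothetical stand-in for an object reference: it carries
// only the namespace and the name of a Kubernetes object.
type objectRef struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace,omitempty"`
}

// kobj builds a reference; an empty namespace means a cluster-scoped object.
func kobj(namespace, name string) objectRef {
	return objectRef{Name: name, Namespace: namespace}
}

// String is the text representation: "namespace/name", or just "name".
func (r objectRef) String() string {
	if r.Namespace != "" {
		return r.Namespace + "/" + r.Name
	}
	return r.Name
}

// renderEntry renders a JSON log entry with the reference as a nested object.
func renderEntry(msg, key string, ref objectRef) (string, error) {
	b, err := json.Marshal(map[string]any{"msg": msg, key: ref})
	return string(b), err
}

func main() {
	ref := kobj("default", "my-cluster-md-0")
	fmt.Println(ref) // text form, built by String()

	entry, err := renderEntry("Scaling up", "MachineDeployment", ref)
	if err != nil {
		panic(err)
	}
	fmt.Println(entry) // JSON form, with name and namespace as separate fields
}
```

Because the JSON form keeps name and namespace as separate fields under a well-known key, log tooling can filter on either of them without parsing strings.
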
## Log Messages

- A message MUST always start with a capital letter.
- The period at the end of a message MUST be omitted.
- Always prefer logging before the action, so in case of errors there will be an immediate, visual correlation between
  the action log and the corresponding error log. When logging before the action, log verbs should use the -ing form.
- Ideally log messages should surface a different level of detail according to the target log level (see [log levels](#log-levels)
  for more details).

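The first two rules above are mechanical, so they can be checked in code. The following is a small sketch of such a check; `checkMessage` is a hypothetical helper, not an existing Cluster API linter, and it deliberately does not try to judge wording or verb form.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// checkMessage reports violations of the two mechanical rules:
// start with a capital letter, and do not end with a period.
func checkMessage(msg string) []string {
	var problems []string

	first, _ := utf8.DecodeRuneInString(msg)
	if !unicode.IsUpper(first) {
		problems = append(problems, "message must start with a capital letter")
	}
	if strings.HasSuffix(msg, ".") {
		problems = append(problems, "message must not end with a period")
	}
	return problems
}

func main() {
	for _, msg := range []string{
		"Creating MachineSet",      // follows both rules and uses the -ing form
		"created the machine set.", // lowercase start and trailing period
	} {
		fmt.Printf("%q: %v\n", msg, checkMessage(msg))
	}
}
```
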
## Log Levels

Kubernetes provides a set of [recommendations](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-instrumentation/logging.md#what-method-to-use)
for log levels; as a small addition to the above guidelines we would like to add:

- Logs at the lower levels of verbosity (<=3) are meant to document “what happened” by describing how an object status
  is being changed by controller/reconcilers across subsequent reconciliations; as a rule of thumb, it is reasonable
  to assume that a person reading those logs has a deep knowledge of how the system works, but that person should not be
  required to have knowledge of the codebase.
- Logs at higher levels of verbosity (>=4) are meant to document “how it happened”, providing insight on thorny parts of
  the code; a person reading those logs usually has deep knowledge of the codebase.
- Don’t use verbosity higher than 5.

We are using log level 2 as a default verbosity for all core Cluster API
controllers as recommended by the Kubernetes guidelines.

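The effect of the verbosity setting can be sketched as follows. `verboseLogger` is a hypothetical, minimal imitation of a logr-style logger, in which `V(n)` raises the level of an entry and the entry is emitted only when that level does not exceed the configured verbosity (the value of `--v`). The messages are made-up examples.

```go
package main

import "fmt"

// verboseLogger is a hypothetical imitation of a logr-style leveled logger.
type verboseLogger struct {
	verbosity int       // configured verbosity, i.e. the value of --v
	level     int       // level of the entries written through this logger
	sink      *[]string // emitted entries, collected for inspection
}

func newLogger(verbosity int) verboseLogger {
	return verboseLogger{verbosity: verbosity, sink: &[]string{}}
}

// V returns a logger whose entries are written at a level raised by n.
func (l verboseLogger) V(n int) verboseLogger {
	l.level += n
	return l
}

// Info emits the message only if its level is within the configured verbosity.
func (l verboseLogger) Info(msg string) {
	if l.level <= l.verbosity {
		*l.sink = append(*l.sink, fmt.Sprintf("v=%d %s", l.level, msg))
	}
}

// emitted writes three entries and returns those that pass the verbosity filter.
func emitted(verbosity int) []string {
	log := newLogger(verbosity)
	log.Info("Creating MachineSet")                                  // "what happened"
	log.V(2).Info("Scaling up MachineSet")                           // "what happened"
	log.V(4).Info("Comparing MachineSet and MachineDeployment spec") // "how it happened"
	return *log.sink
}

func main() {
	fmt.Println(len(emitted(2)), "entries at the default verbosity 2")
	fmt.Println(len(emitted(5)), "entries at verbosity 5")
}
```

With the default verbosity of 2, only the entries describing what happened are emitted; the entry at level 4 appears once the verbosity is raised.
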
## Trade-offs

When developing logs there are operational trade-offs to take into account, e.g. verbosity vs space allocation, user
readability vs machine readability, maintainability of the logs across the code base.

A reasonable approach for logging is to keep things simple and implement more log verbosity selectively and only on
thorny parts of code. Over time, based on feedback from SRE/developers, more logs can be added to shed light where necessary.

## Developing and testing logs

Our [Tilt](tilt.md) setup offers a batteries-included log suite based on [Promtail](https://grafana.com/docs/loki/latest/clients/promtail/), [Loki](https://grafana.com/docs/loki/latest/fundamentals/overview/) and [Grafana](https://grafana.com/docs/grafana/latest/explore/logs-integration/).

We are working to continuously improve this experience, allowing Cluster API developers to use logs and improve them as part of their development process.

For the best experience exploring the logs using Tilt:
1. Set `--logging-format=json`.
2. Set a high log verbosity, e.g. `--v=5`.
3. Enable Promtail, Loki, and Grafana under `deploy_observability`.

A minimal example of a tilt-settings.yaml file that deploys a ready-to-use logging suite looks like:
```yaml
deploy_observability:
  - promtail
  - loki
  - grafana
enable_providers:
  - docker
  - kubeadm-bootstrap
  - kubeadm-control-plane
extra_args:
  core:
    - "--logging-format=json"
    - "--v=5"
  docker:
    - "--v=5"
    - "--logging-format=json"
  kubeadm-bootstrap:
    - "--v=5"
    - "--logging-format=json"
  kubeadm-control-plane:
    - "--v=5"
    - "--logging-format=json"
```
The above options can be combined with other settings from our [Tilt](tilt.md) setup. Once Tilt is up and running with these settings, users will be able to browse logs using the Grafana Explore UI.

Grafana will normally be available on `localhost:3001`. To explore logs from Loki, open the Explore interface for the DataSource 'Loki'. [This link](http://localhost:3001/explore?datasource%22:%22Loki%22) should work as a shortcut with the default Tilt settings.

### Example queries

In the Log browser the following queries can be used to browse logs by controller, and by specific Cluster API objects. For example:
```
{app="capi-controller-manager"} | json
```
Will return logs from the `capi-controller-manager`, parsed as JSON. Passing the query through the json parser allows filtering by key-value pairs that are part of nested JSON objects. For example `.Cluster.name` becomes `Cluster_name`.

```
{app="capi-controller-manager"} | json | Cluster_name="my-cluster"
```
Will return logs from the `capi-controller-manager` that are associated with the Cluster `my-cluster`.

```
{app="capi-controller-manager"} | json | Cluster_name="my-cluster" | v <= 2
```
Will return logs from the `capi-controller-manager` that are associated with the Cluster `my-cluster` with log level <= 2.

```
{app="capi-controller-manager"} | json | Cluster_name="my-cluster" reconcileID="6f6ad971-bdb6-4fa3-b803-xxxxxxxxxxxx"
```

Will return logs from the `capi-controller-manager`, associated with the Cluster `my-cluster` and the Reconcile ID `6f6ad971-bdb6-4fa3-b803-xxxxxxxxxxxx`. Each reconcile loop will have a unique Reconcile ID.

```
{app="capi-controller-manager"} | json | Cluster_name="my-cluster" reconcileID="6f6ad971-bdb6-4fa3-b803-xxxxxxxxxxxx" controller="cluster" | line_format "{{ .msg }}"
```
Will return logs from the `capi-controller-manager`, associated with the Cluster `my-cluster` and the Reconcile ID `6f6ad971-bdb6-4fa3-b803-xxxxxxxxxxxx`. It further selects only those logs which come from the Cluster controller, and then formats the logs so that only the message is displayed.

```
{app=~"capi-controller-manager|capd-controller-manager|capi-kubeadm-bootstrap-controller-manager|capi-kubeadm-control-plane-controller-manager"} | json | Cluster_name="my-cluster" Machine_name="my-cluster-linux-worker-1" | line_format "{{.controller}} {{.msg}}"
```

Will return the logs from four CAPI providers - the Core provider, Kubeadm Control Plane provider, Kubeadm Bootstrap provider and the Docker infrastructure provider. It filters by the cluster name and the machine name and then formats the log lines to show just the source controller and the message. This allows us to correlate logs and see actions taken by each of these four providers related to the machine `my-cluster-linux-worker-1`.

For more information on formatting and filtering logs using Grafana and Loki see:
- [json parsing](https://grafana.com/docs/loki/latest/clients/promtail/stages/json/)
- [log queries](https://grafana.com/docs/loki/latest/logql/log_queries/)

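To build an intuition for what the `json` parser and the label filters in the queries above do, the following sketch approximates them in Go: nested object keys are joined with `_`, and entries are then selected by the resulting flat label. It is an approximation under simplifying assumptions (arrays are skipped, key names are not sanitized), the function names are hypothetical, and the log lines are made-up samples.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// flatten joins nested object keys with "_", so that
// {"Cluster":{"name":"x"}} yields the label Cluster_name="x".
func flatten(prefix string, in map[string]any, out map[string]string) {
	for k, v := range in {
		key := k
		if prefix != "" {
			key = prefix + "_" + k
		}
		switch t := v.(type) {
		case map[string]any:
			flatten(key, t, out)
		case []any:
			// Arrays are skipped in this sketch.
		default:
			out[key] = fmt.Sprint(t)
		}
	}
}

// labelsOf parses one JSON log line into flat labels.
func labelsOf(line string) (map[string]string, error) {
	var entry map[string]any
	if err := json.Unmarshal([]byte(line), &entry); err != nil {
		return nil, err
	}
	labels := map[string]string{}
	flatten("", entry, labels)
	return labels, nil
}

// messagesFor approximates: | json | Cluster_name="<name>" | line_format "{{ .msg }}"
func messagesFor(lines []string, clusterName string) []string {
	var msgs []string
	for _, line := range lines {
		labels, err := labelsOf(line)
		if err != nil {
			continue // a line that is not valid JSON cannot match the label filter
		}
		if labels["Cluster_name"] == clusterName {
			msgs = append(msgs, labels["msg"])
		}
	}
	return msgs
}

func main() {
	lines := []string{
		`{"v":0,"msg":"Creating MachineSet","Cluster":{"name":"my-cluster","namespace":"default"}}`,
		`{"v":0,"msg":"Deleting Machine","Cluster":{"name":"other","namespace":"default"}}`,
		`not a JSON line`,
	}
	fmt.Println(messagesFor(lines, "my-cluster"))
}
```
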
## What about providers
Cluster API providers are developed by independent teams, and each team is free to define their own processes and
conventions.

However, given that SRE/developers looking at logs are often required to look both at logs from core CAPI and providers,
we encourage providers to adopt and contribute to the guidelines defined in this document.

It is also worth noting that the foundational elements of the approach described in this document are easy to achieve
by leveraging default Kubernetes tooling for logging.
