.. only:: not (epub or latex or html)

    WARNING: You are looking at unreleased Cilium documentation.
    Please use the official rendered version released here:
    https://docs.cilium.io

.. _gs_debugging:

#########
Debugging
#########

Attaching a Debugger
--------------------

Cilium comes with a set of Makefile targets for quickly deploying development
builds to a local :ref:`Kind <gs_kind>` cluster. One of these targets is
``kind-debug-agent``, which generates a container image that wraps the Cilium
agent with a `Delve (dlv) <https://github.com/go-delve/delve>`_ invocation. This
causes the agent process to listen for connections from a debugger front-end on
port 2345.

To build and push a debug image to your local Kind cluster, run:

.. code-block:: shell-session

    $ make kind-debug-agent

.. note::

    The image is automatically pushed to the Kind nodes, but running Cilium
    Pods are not restarted. To do so, run:

    .. code-block:: shell-session

        $ kubectl delete pods -n kube-system -l app.kubernetes.io/name=cilium-agent

If your Kind cluster was set up using ``make kind``, it will automatically
be configured with the following port mappings:

- ``23401``: ``kind-control-plane-1``
- ``2340*``: Subsequent ``kind-control-plane-*`` nodes, if defined
- ``23411``: ``kind-worker-1``
- ``2341*``: Subsequent ``kind-worker-*`` nodes, if defined

The Delve listener supports multiple debugging protocols, so any IDEs or
debugger front-ends that understand either the `Debug Adapter Protocol
<https://microsoft.github.io/debug-adapter-protocol>`_ or Delve API v2 are
supported.
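Any Delve-compatible front-end can use these ports. As a quick check from a
terminal, the ``dlv`` command-line client can attach directly (a minimal
sketch, assuming the default ``make kind`` port mappings listed above and a
local ``dlv`` installation):

.. code-block:: shell-session

    $ dlv connect localhost:23401
    Type 'help' for list of commands.
    (dlv) goroutines
    (dlv) continue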
~~~~~~~~~~~~~~~~~~
Visual Studio Code
~~~~~~~~~~~~~~~~~~

The Cilium repository contains a VS Code launch configuration
(``.vscode/launch.json``) that includes debug targets for the Kind control
plane, the first two ``kind-worker`` nodes and the :ref:`Cilium Operator
<cilium_operator_internals>`.

.. image:: _static/vscode-run-and-debug.png
    :align: center

|

The preceding screenshot is taken from the 'Run And Debug' section in VS Code.
The default shortcut to access this section is ``Shift+Ctrl+D``. Select a target
to attach to, start the debug session and set a breakpoint to halt the agent or
operator on a specific code statement. This only works for Go code; BPF C code
cannot be debugged this way.

See `the VS Code debugging guide <https://code.visualstudio.com/docs/editor/debugging>`_
for more details.

~~~~~~
Neovim
~~~~~~

The Cilium repository contains a `.nvim directory
<https://github.com/cilium/cilium/tree/main/.nvim>`_ with a DAP configuration
as well as a README on how to configure ``nvim-dap``.

toFQDNs and DNS Debugging
-------------------------

The interactions of L3 toFQDNs and L7 DNS rules can be difficult to debug.
Unlike many other policy rules, these are resolved at runtime with unknown
data. Pods may create large numbers of IPs in the cache, or the IPs returned
may not be compatible with our datapath implementation. Sometimes we also just
have bugs.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Isolating the source of toFQDNs issues
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

While there is no single common culprit, the DNS Proxy shares the least code
with the other systems involved and so is likely the least audited component in
this chain. The cascading caching scheme is also complex in its behaviour.
Determining whether an issue is caused by the DNS components, the policy layer
or the datapath is often the first step when debugging toFQDNs related issues.
Generally, working top-down is easiest, as the information needed to verify
low-level correctness can be collected in the initial debug invocations.

REFUSED vs NXDOMAIN responses
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The proxy uses REFUSED DNS responses to indicate a denied request. Some libc
implementations, notably musl which is common in Alpine Linux images, terminate
the whole DNS search in these cases. This often manifests as a connect error in
applications, as the libc lookup returns no data.
To work around this, denied responses can be configured to be NXDOMAIN by
setting ``--tofqdns-dns-reject-response-code=nameError`` on the command line.

Monitor Events
~~~~~~~~~~~~~~

The DNS Proxy emits multiple L7 DNS monitor events: one for the request and one
for the response (if allowed). Often the L7 DNS rules are paired with L3
toFQDNs rules, and events relating to those rules are also relevant.

.. Note::

    Be sure to run ``cilium-dbg monitor`` on the same node as the pod being debugged!
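The numeric argument to ``--related-to`` below is the Cilium endpoint ID of the
pod being debugged. If you do not know it, it can be found by listing the
endpoints on the same node and matching the pod's labels or IP:

.. code-block:: shell-session

    $ kubectl exec pod/cilium-sbp8v -n kube-system -- cilium-dbg endpoint list

With the endpoint ID in hand, ``cilium-dbg monitor`` can be filtered down to
just the related events: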
.. code-block:: shell-session

    $ kubectl exec pod/cilium-sbp8v -n kube-system -- cilium-dbg monitor --related-to 3459
    Listening for events on 4 CPUs with 64x4096 of shared memory
    Press Ctrl-C to quit
    level=info msg="Initializing dissection cache..." subsys=monitor

    -> Request dns from 3459 ([k8s:org=alliance k8s:io.kubernetes.pod.namespace=default k8s:io.cilium.k8s.policy.serviceaccount=default k8s:io.cilium.k8s.policy.cluster=default k8s:class=xwing]) to 0 ([k8s:io.cilium.k8s.policy.serviceaccount=kube-dns k8s:io.kubernetes.pod.namespace=kube-system k8s:k8s-app=kube-dns k8s:io.cilium.k8s.policy.cluster=default]), identity 323->15194, verdict Forwarded DNS Query: cilium.io. A
    -> endpoint 3459 flow 0xe6866e21 identity 15194->323 state reply ifindex lxc84b58cbdabfe orig-ip 10.60.1.115: 10.63.240.10:53 -> 10.60.0.182:42132 udp
    -> Response dns to 3459 ([k8s:org=alliance k8s:io.kubernetes.pod.namespace=default k8s:io.cilium.k8s.policy.serviceaccount=default k8s:io.cilium.k8s.policy.cluster=default k8s:class=xwing]) from 0 ([k8s:io.cilium.k8s.policy.cluster=default k8s:io.cilium.k8s.policy.serviceaccount=kube-dns k8s:io.kubernetes.pod.namespace=kube-system k8s:k8s-app=kube-dns]), identity 323->15194, verdict Forwarded DNS Query: cilium.io. A TTL: 486 Answer: '104.198.14.52'
    -> endpoint 3459 flow 0xe6866e21 identity 15194->323 state reply ifindex lxc84b58cbdabfe orig-ip 10.60.1.115: 10.63.240.10:53 -> 10.60.0.182:42132 udp
    Policy verdict log: flow 0x614e9723 local EP ID 3459, remote ID 16777217, proto 6, egress, action allow, match L3-Only, 10.60.0.182:41510 -> 104.198.14.52:80 tcp SYN

    -> stack flow 0x614e9723 identity 323->16777217 state new ifindex 0 orig-ip 0.0.0.0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp SYN
    -> 0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp SYN
    -> endpoint 3459 flow 0x7388921 identity 16777217->323 state reply ifindex lxc84b58cbdabfe orig-ip 104.198.14.52: 104.198.14.52:80 -> 10.60.0.182:41510 tcp SYN, ACK
    -> stack flow 0x614e9723 identity 323->16777217 state established ifindex 0 orig-ip 0.0.0.0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp ACK
    -> 0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp ACK
    -> stack flow 0x614e9723 identity 323->16777217 state established ifindex 0 orig-ip 0.0.0.0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp ACK
    -> 0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp ACK
    -> endpoint 3459 flow 0x7388921 identity 16777217->323 state reply ifindex lxc84b58cbdabfe orig-ip 104.198.14.52: 104.198.14.52:80 -> 10.60.0.182:41510 tcp ACK
    -> 0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp ACK
    -> stack flow 0x614e9723 identity 323->16777217 state established ifindex 0 orig-ip 0.0.0.0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp ACK, FIN
    -> 0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp ACK, FIN
    -> endpoint 3459 flow 0x7388921 identity 16777217->323 state reply ifindex lxc84b58cbdabfe orig-ip 104.198.14.52: 104.198.14.52:80 -> 10.60.0.182:41510 tcp ACK, FIN
    -> stack flow 0x614e9723 identity 323->16777217 state established ifindex 0 orig-ip 0.0.0.0: 10.60.0.182:41510 -> 104.198.14.52:80 tcp ACK

The above output is for a simple ``curl cilium.io`` in a pod. The L7 DNS request
is the first set of messages and the subsequent L3 connection is the HTTP
component. AAAA DNS lookups commonly happen as well but were removed to simplify
the example.

- If no L7 DNS requests appear, the proxy redirect is not in place. This may
  mean that the policy does not select this endpoint or that there is an issue
  with the proxy redirection. Whether any redirects exist can be checked with
  ``cilium-dbg status --all-redirects`` (see the example after this list).
  In the past, a bug occurred where more permissive L3 rules overrode the
  proxy redirect, causing the proxy to never see the requests.
- If the L7 DNS request is blocked, with an explicit denied message, then the
  requests are not allowed by the proxy. This may be due to a typo in the
  network policy, or the matchPattern rule not allowing this domain. It may
  also be due to a bug in policy propagation to the DNS Proxy.
- If the DNS request is allowed, with an explicit message, and it should not
  be, this may be because a more general policy is in place that allows the
  request. ``matchPattern: "*"`` visibility policies are commonly in place and
  would supersede all other, more restrictive, policies.
  If no other policies are in place, incorrect allows may indicate a bug when
  passing policy information to the proxy. There is no way to dump the rules in
  the proxy, but a debug log is printed when a rule is added. Look for
  ``DNS Proxy updating matchNames in allowed list during UpdateRules``.
  The ``pkg/proxy/dns.go`` file contains the DNS proxy implementation.
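As referenced in the first item above, the active proxy redirects can be listed
from the same Cilium pod (pod name as used throughout this section):

.. code-block:: shell-session

    $ kubectl exec pod/cilium-sbp8v -n kube-system -- cilium-dbg status --all-redirects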
If L7 DNS behaviour seems correct, see the sections below to further isolate
the issue. That the DNS data was recorded can be verified with
``cilium-dbg fqdn cache list``: the IPs in the response should appear in the
cache for the appropriate endpoint. The lookup time is included in the JSON
output of the command.

.. code-block:: shell-session

    $ kubectl exec pod/cilium-sbp8v -n kube-system -- cilium-dbg fqdn cache list
    Endpoint   Source   FQDN         TTL    ExpirationTime             IPs
    3459       lookup   cilium.io.   3600   2020-04-21T15:04:27.146Z   104.198.14.52

As of Cilium 1.16, the ``ExpirationTime`` represents the next time that
the entry will be evaluated for staleness. If the entry ``Source`` is
``lookup``, then the entry will expire at that time. An equivalent entry with
source ``connection`` may be established when a ``lookup`` entry expires. If
the corresponding Endpoint continues to communicate to this domain via one of
the related IP addresses, then Cilium will continue to keep the ``connection``
entry alive. When the expiration time for a ``connection`` entry is reached,
the entry will be re-evaluated to determine whether it is still used by active
connections, and at that time may expire or be renewed with a new target
expiration time.
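The lookup time and other per-entry fields are visible in the JSON form of the
same command (assuming your CLI version supports the common ``-o json`` output
flag; check ``cilium-dbg fqdn cache list --help`` if it differs):

.. code-block:: shell-session

    $ kubectl exec pod/cilium-sbp8v -n kube-system -- cilium-dbg fqdn cache list -o json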
DNS Proxy Errors
~~~~~~~~~~~~~~~~

REFUSED responses are returned when the proxy encounters an error during
processing. This can be confusing to debug, as that is also the response when a
DNS request is denied. An error log is always printed in these cases. Some of
these errors originate in callbacks provided by other packages via the daemon
in cilium-agent.

- ``Rejecting DNS query from endpoint due to error``: This is the "normal"
  policy-reject message. It is a debug log.
- ``cannot extract endpoint IP from DNS request``: The proxy cannot read the
  socket information to read the source endpoint IP. This could mean an
  issue with the datapath routing and information passing.
- ``cannot extract endpoint ID from DNS request``: The proxy cannot use the
  source endpoint IP to get the cilium-internal ID for that endpoint. This is
  different from the Security Identity. This could mean that cilium is not
  managing this endpoint and that something has gone awry. It could also mean a
  routing problem where a packet has arrived at the proxy incorrectly.
- ``cannot extract destination IP:port from DNS request``: The proxy cannot
  read the socket information of the original request to obtain the intended
  target IP:Port. This could mean an issue with the datapath routing and
  information passing.
- ``cannot find server ip in ipcache``: The proxy cannot resolve a Security
  Identity for the target IP of the DNS request. This should always succeed, as
  world catches all IPs not set by more specific entries. This can mean a
  broken ipcache BPF table.
- ``Rejecting DNS query from endpoint due to error``: While checking if the DNS
  request was allowed (based on Endpoint ID, destination IP:Port and the DNS
  query) an error occurred. These errors would come from the internal rule
  lookup in the proxy, the ``allowed`` field.
- ``Timeout waiting for response to forwarded proxied DNS lookup``: The proxy
  forwards requests 1:1 and does not cache. It applies a 10s timeout on
  responses to those requests, as the client will usually retry within this
  period. Bursts of these errors can happen if the DNS target server misbehaves
  and many pods see DNS timeouts. This isn't an actual problem with cilium or
  the proxy, although it can be caused by policy blocking the DNS target server
  if it is in-cluster.
- ``Timed out waiting for datapath updates of FQDN IP information; returning
  response``: When the proxy updates the DNS caches with response data, it
  needs to allow some time for that information to get into the datapath.
  Otherwise, pods would attempt to make the outbound connection (the thing that
  caused the DNS lookup) before the datapath is ready. Many stacks retry the
  SYN in such cases, but some return an error and some apps crash as a result.
  This delay is configurable by setting the
  ``--tofqdns-proxy-response-max-delay`` command line argument but defaults to
  100ms. It can be exceeded if the system is under load.

.. _isolating-source-toFQDNs-issues-identities-policy:

Identities and Policy
~~~~~~~~~~~~~~~~~~~~~

Once a DNS response has been passed back through the proxy and placed in the
DNS cache, ``toFQDNs`` rules can begin using the IPs in the cache. There are
multiple layers of cache:

- A per-Endpoint ``DNSCache`` stores the lookups for this endpoint. It is
  restored on cilium startup with the endpoint. Limits are applied here for
  ``--tofqdns-endpoint-max-ip-per-hostname`` and TTLs are tracked. The
  ``--tofqdns-min-ttl`` is not used here.
- A per-Endpoint ``DNSZombieMapping`` list of IPs that have expired from the
  per-Endpoint cache but are waiting for the Connection Tracking GC to mark
  them in-use or not. This can take up to 12 hours to occur. This list is
  size-limited by ``--tofqdns-max-deferred-connection-deletes``.
- A global ``DNSCache`` where all endpoint and poller DNS data is collected. It
  does apply the ``--tofqdns-min-ttl`` value but not the
  ``--tofqdns-endpoint-max-ip-per-hostname`` value.

If an IP exists in the FQDN cache (check with ``cilium-dbg fqdn cache list``),
then ``toFQDNs`` rules that select a domain name, either explicitly via
``matchName`` or via ``matchPattern``, should cause IPs for that domain to have
allocated Security Identities. These can be listed with:

.. code-block:: shell-session

    $ kubectl exec pod/cilium-sbp8v -n kube-system -- cilium-dbg identity list
    ID         LABELS
    1          reserved:host
    2          reserved:world
    3          reserved:unmanaged
    4          reserved:health
    5          reserved:init
    6          reserved:remote-node
    323        k8s:class=xwing
               k8s:io.cilium.k8s.policy.cluster=default
               k8s:io.cilium.k8s.policy.serviceaccount=default
               k8s:io.kubernetes.pod.namespace=default
               k8s:org=alliance
    ...
    16777217   fqdn:*
               reserved:world

Note that FQDN identities are allocated locally on the node and have a high bit
set, so they are often in the 16-million range. This is also the identity seen
in the monitor output for the HTTP connection above.
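To confirm which Security Identity a particular IP from the FQDN cache
currently maps to, the agent's ipcache can be queried directly (IP and pod name
taken from the examples above):

.. code-block:: shell-session

    $ kubectl exec pod/cilium-sbp8v -n kube-system -- cilium-dbg ipcache get 104.198.14.52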
In cases where there is no matching identity for an IP in the FQDN cache, it may
simply be because no policy selects an associated domain. The policy system
represents each ``toFQDNs:`` rule with a ``FQDNSelector`` instance. These
receive updates from a global ``NameManager`` in the daemon.
They can be listed along with other selectors (roughly corresponding to any L3
rule):

.. code-block:: shell-session

    $ kubectl exec pod/cilium-sbp8v -n kube-system -- cilium-dbg policy selectors
    SELECTOR                                                                                                         USERS   IDENTITIES
    MatchName: , MatchPattern: *                                                                                     1       16777217
    &LabelSelector{MatchLabels:map[string]string{},MatchExpressions:[]LabelSelectorRequirement{},}                   2       1
                                                                                                                             2
                                                                                                                             3
                                                                                                                             4
                                                                                                                             5
                                                                                                                             6
                                                                                                                             323
                                                                                                                             6188
                                                                                                                             15194
                                                                                                                             18892
                                                                                                                             25379
                                                                                                                             29200
                                                                                                                             32255
                                                                                                                             33831
                                                                                                                             16777217
    &LabelSelector{MatchLabels:map[string]string{reserved.none: ,},MatchExpressions:[]LabelSelectorRequirement{},}   1

In this example, identity 16777217 is used by two selectors, one with
``matchPattern: "*"`` and another empty one. This is because of the policy in
use:

.. code-block:: yaml

    apiVersion: cilium.io/v2
    kind: CiliumNetworkPolicy
    metadata:
      name: "tofqdn-dns-visibility"
    spec:
      endpointSelector:
        matchLabels:
          any:org: alliance
      egress:
      - toPorts:
        - ports:
          - port: "53"
            protocol: ANY
          rules:
            dns:
            - matchPattern: "*"
      - toFQDNs:
        - matchPattern: "*"

The L7 DNS rule has an implicit L3 allow-all because it defines only L4 and L7
sections. This is the second selector in the list, and includes all possible L3
identities known in the system. In contrast, the first selector, which
corresponds to the ``toFQDNs: matchPattern: "*"`` rule, would list all
identities for IPs that came from the DNS Proxy.

Unintended DNS Policy Drops
~~~~~~~~~~~~~~~~~~~~~~~~~~~

``toFQDNs`` policy enforcement relies on the source pod performing a DNS query
before using an IP address returned in the DNS response. Sometimes pods may hold
on to a DNS response and start new connections to the same IP address at a later
time. This may trigger policy drops if the DNS response has expired, as requested
by the DNS server in the time-to-live (TTL) value of the response. When DNS is
used for service load balancing, the advertised TTL value may be short (e.g., 60
seconds).

Cilium honors the TTL values returned by the DNS server by default, but you can
override them by setting a minimum TTL using the ``--tofqdns-min-ttl`` flag. This
setting overrides short TTLs and allows the pod to use the IP address in the DNS
response for a longer duration. Existing connections also keep the IP address as
allowed in the policy.

Any new connections opened by the pod using the same IP address without
performing a new DNS query after the (possibly extended) DNS TTL has expired are
dropped by Cilium policy enforcement. To allow pods to use the DNS response
after TTL expiry for new connections, a command line option
``--tofqdns-idle-connection-grace-period`` may be used to keep the IP address /
name mapping valid in the policy for an extended time after DNS TTL expiry. This
option takes effect only if the pod has opened at least one connection during
the DNS TTL period.
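To check whether a cluster overrides these defaults, one place to look is the
Cilium ConfigMap, where the keys mirror the agent flag names without the
leading dashes (no output means the defaults are in use):

.. code-block:: shell-session

    $ kubectl get configmap cilium-config -n kube-system -o yaml | grep tofqdns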
Datapath Plumbing
~~~~~~~~~~~~~~~~~

For a policy to be fully realized, the datapath for an Endpoint must be updated.
In the case of a new DNS-source IP, the FQDN identity associated with it must
propagate from the selectors to the Endpoint-specific policy. Unless a new
policy is being added, this often only involves updating the Policy Map of the
Endpoint with the new FQDN Identity of the IP. This can be verified:

.. code-block:: shell-session

    $ kubectl exec pod/cilium-sbp8v -n kube-system -- cilium-dbg bpf policy get 3459
    DIRECTION   LABELS (source:key[=value])   PORT/PROTO   PROXY PORT   BYTES   PACKETS
    Ingress     reserved:unknown              ANY          NONE         1367    7
    Ingress     reserved:host                 ANY          NONE         0       0
    Egress      reserved:unknown              53/TCP       36447        0       0
    Egress      reserved:unknown              53/UDP       36447        138     2
    Egress      fqdn:*                        ANY          NONE         477     6
                reserved:world

Note that the labels for the identities are resolved here. This resolution can
be skipped, and there may be cases where it doesn't occur:

.. code-block:: shell-session

    $ kubectl exec pod/cilium-sbp8v -n kube-system -- cilium-dbg bpf policy get -n 3459
    DIRECTION   IDENTITY   PORT/PROTO   PROXY PORT   BYTES   PACKETS
    Ingress     0          ANY          NONE         1367    7
    Ingress     1          ANY          NONE         0       0
    Egress      0          53/TCP       36447        0       0
    Egress      0          53/UDP       36447        138     2
    Egress      16777217   ANY          NONE         477     6

L3 ``toFQDNs`` rules are egress only, so we would expect to see an ``Egress``
entry with Security Identity ``16777217``. The L7 rule, used to redirect to the
DNS Proxy, is also present with a populated ``PROXY PORT``. It has a 0
``IDENTITY`` as it is an L3 wildcard, i.e. the policy allows any peer on the
specified port.

An identity missing here can be an error in various places:

- Policy doesn't actually allow this Endpoint to connect. A sanity check is to
  use ``cilium-dbg endpoint list`` to see if cilium thinks it should have policy
  enforcement.
- Endpoint regeneration is slow and the Policy Map has not been updated yet.
  This can occur in cases where we have leaked IPs from the DNS cache (i.e.
  they were never deleted correctly) or when there are legitimately many IPs.
  It can also simply mean an overloaded node or even a deadlock within cilium.
- A more permissive policy has removed the need to include this identity. This
  is likely a bug, however, as the IP would still have an identity allocated
  and it would be included in the Policy Map. In the past, a similar bug
  occurred with the L7 redirect and that would stop this whole process at the
  beginning.

Mutexes / Locks and Data Races
------------------------------

.. Note::

    This section only applies to Golang code.

There are a few options available to debug Cilium data races and deadlocks.

To debug data races, Golang allows ``-race`` to be passed to the compiler to
compile Cilium with race detection. Additionally, the flag can be provided to
``go test`` to detect data races in a testing context.

.. _compile-cilium-with-race-detection:

~~~~~~~~~~~~~~
Race detection
~~~~~~~~~~~~~~

To compile a Cilium binary with race detection, you can do:

.. code-block:: shell-session

    $ make RACE=1

.. Note::

    For building the Operator with race detection, you must also provide
    ``BASE_IMAGE`` which can be the ``cilium/cilium-runtime`` image from the
    root Dockerfile found in the Cilium repository.

To run integration tests with race detection, you can do:

.. code-block:: shell-session

    $ make RACE=1 integration-tests
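The race detector can also be pointed at a single package while iterating on a
change; for example (the package path is only an illustration):

.. code-block:: shell-session

    $ go test -race ./pkg/fqdn/...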
No 478 action is required, besides building the binary with this tag. 479 480 For example: 481 482 .. code-block:: shell-session 483 484 $ make LOCKDEBUG=1 485 $ # Deadlock detection during integration tests: 486 $ make LOCKDEBUG=1 integration-tests 487 488 CPU Profiling and Memory Leaks 489 ------------------------------ 490 491 Cilium bundles ``gops``, a standard tool for Golang applications, which 492 provides the ability to collect CPU and memory profiles using ``pprof``. 493 Inspecting profiles can help identify CPU bottlenecks and memory leaks. 494 495 To capture a profile, take a :ref:`sysdump <sysdump>` of the cluster with the 496 Cilium CLI or more directly, use the ``cilium-bugtool`` command that is 497 included in the Cilium image after enabling ``pprof`` in the Cilium ConfigMap: 498 499 .. code-block:: shell-session 500 501 $ kubectl exec -ti -n kube-system <cilium-pod-name> -- cilium-bugtool --get-pprof --pprof-trace-seconds N 502 $ kubectl cp -n kube-system <cilium-pod-name>:/tmp/cilium-bugtool-<time-generated-name>.tar ./cilium-pprof.tar 503 $ tar xf ./cilium-pprof.tar 504 505 Be mindful that the profile window is the number of seconds passed to 506 ``--pprof-trace-seconds``. Ensure that the number of seconds are enough to 507 capture Cilium while it is exhibiting the problematic behavior to debug. 508 509 There are 6 files that encompass the tar archive: 510 511 .. code-block:: shell-session 512 513 Permissions Size User Date Modified Name 514 .rw-r--r-- 940 chris 6 Jul 14:04 gops-memstats-$(pidof-cilium-agent).md 515 .rw-r--r-- 211k chris 6 Jul 14:04 gops-stack-$(pidof-cilium-agent).md 516 .rw-r--r-- 58 chris 6 Jul 14:04 gops-stats-$(pidof-cilium-agent).md 517 .rw-r--r-- 212 chris 6 Jul 14:04 pprof-cpu 518 .rw-r--r-- 2.3M chris 6 Jul 14:04 pprof-heap 519 .rw-r--r-- 25k chris 6 Jul 14:04 pprof-trace 520 521 The files prefixed with ``pprof-`` are profiles. For more information on each 522 one, see `Julia Evan's blog`_ on ``pprof``. 523 524 To view the CPU or memory profile, simply execute the following command: 525 526 .. code-block:: shell-session 527 528 $ go tool pprof -http localhost:9090 pprof-cpu # for CPU 529 $ go tool pprof -http localhost:9090 pprof-heap # for memory 530 531 This opens a browser window for profile inspection. 532 533 .. _Julia Evan's blog: https://jvns.ca/blog/2017/09/24/profiling-go-with-pprof/