.. only:: not (epub or latex or html)

    WARNING: You are looking at unreleased Cilium documentation.
    Please use the official rendered version released here:
    https://docs.cilium.io

.. _admin_guide:

###############
Troubleshooting
###############

This document describes how to troubleshoot Cilium in different deployment
modes. It focuses on a full deployment of Cilium within a datacenter or public
cloud. If you are just looking for a simple way to experiment, we highly
recommend trying out the :ref:`getting_started` guide instead.

This guide assumes that you have read the :ref:`network_root` and
:ref:`security_root` documentation, which explain all the components and
concepts.

We use GitHub issues to maintain a list of `Cilium Frequently Asked Questions
(FAQ)`_. Check there to see if your question has already been addressed.

Component & Cluster Health
==========================

Kubernetes
----------

An initial overview of Cilium can be retrieved by listing all pods to verify
whether all pods have the status ``Running``:

.. code-block:: shell-session

    $ kubectl -n kube-system get pods -l k8s-app=cilium
    NAME           READY     STATUS    RESTARTS   AGE
    cilium-2hq5z   1/1       Running   0          4d
    cilium-6kbtz   1/1       Running   0          4d
    cilium-klj4b   1/1       Running   0          4d
    cilium-zmjj9   1/1       Running   0          4d

If Cilium encounters a problem that it cannot recover from, it will
automatically report the failure state via ``cilium-dbg status``, which is
regularly queried by the Kubernetes liveness probe to automatically restart
Cilium pods. If a Cilium pod is in state ``CrashLoopBackOff``, this indicates a
permanent failure scenario.

Detailed Status
~~~~~~~~~~~~~~~

If a particular Cilium pod is not in running state, the status and health of
the agent on that node can be retrieved by running ``cilium-dbg status`` in the
context of that pod:

.. code-block:: shell-session

    $ kubectl -n kube-system exec cilium-2hq5z -- cilium-dbg status
    KVStore:                Ok   etcd: 1/1 connected: http://demo-etcd-lab--a.etcd.tgraf.test1.lab.corp.isovalent.link:2379 - 3.2.5 (Leader)
    ContainerRuntime:       Ok   docker daemon: OK
    Kubernetes:             Ok   OK
    Kubernetes APIs:        ["cilium/v2::CiliumNetworkPolicy", "networking.k8s.io/v1::NetworkPolicy", "core/v1::Service", "core/v1::Endpoint", "core/v1::Node", "CustomResourceDefinition"]
    Cilium:                 Ok   OK
    NodeMonitor:            Disabled
    Cilium health daemon:   Ok
    Controller Status:      14/14 healthy
    Proxy Status:           OK, ip 10.2.0.172, port-range 10000-20000
    Cluster health:         4/4 reachable   (2018-06-16T09:49:58Z)

Alternatively, the ``k8s-cilium-exec.sh`` script can be used to run
``cilium-dbg status`` on all nodes. This will provide detailed status and
health information for all nodes in the cluster. First download the script:

.. code-block:: shell-session

    curl -sLO https://raw.githubusercontent.com/cilium/cilium/main/contrib/k8s/k8s-cilium-exec.sh
    chmod +x ./k8s-cilium-exec.sh

... and run ``cilium-dbg status`` on all nodes:

.. code-block:: shell-session

    $ ./k8s-cilium-exec.sh cilium-dbg status
    KVStore:                Ok   Etcd: http://127.0.0.1:2379 - (Leader) 3.1.10
    ContainerRuntime:       Ok
    Kubernetes:             Ok   OK
    Kubernetes APIs:        ["networking.k8s.io/v1beta1::Ingress", "core/v1::Node", "CustomResourceDefinition", "cilium/v2::CiliumNetworkPolicy", "networking.k8s.io/v1::NetworkPolicy", "core/v1::Service", "core/v1::Endpoint"]
    Cilium:                 Ok   OK
    NodeMonitor:            Listening for events on 2 CPUs with 64x4096 of shared memory
    Cilium health daemon:   Ok
    Controller Status:      7/7 healthy
    Proxy Status:           OK, ip 10.15.28.238, 0 redirects, port-range 10000-20000
    Cluster health:         1/1 reachable   (2018-02-27T00:24:34Z)

Detailed information about the status of Cilium can be inspected with the
``cilium-dbg status --verbose`` command. Verbose output includes detailed IPAM
state (allocated addresses), Cilium controller status, and details of the proxy
status.

.. _ts_agent_logs:

Logs
~~~~

To retrieve the log files of a Cilium pod, run the following command (replace
``cilium-1234`` with a pod name returned by
``kubectl -n kube-system get pods -l k8s-app=cilium``):

.. code-block:: shell-session

    kubectl -n kube-system logs --timestamps cilium-1234

If the Cilium pod was already restarted by the liveness probe after
encountering an issue, it can be useful to retrieve the logs of the pod before
the last restart:

.. code-block:: shell-session

    kubectl -n kube-system logs --timestamps -p cilium-1234

Generic
-------

When logged into a host running Cilium, the Cilium CLI can be invoked directly,
e.g.:

.. code-block:: shell-session

    $ cilium-dbg status
    KVStore:                Ok   etcd: 1/1 connected: https://192.168.60.11:2379 - 3.2.7 (Leader)
    ContainerRuntime:       Ok
    Kubernetes:             Ok   OK
    Kubernetes APIs:        ["core/v1::Endpoint", "networking.k8s.io/v1beta1::Ingress", "core/v1::Node", "CustomResourceDefinition", "cilium/v2::CiliumNetworkPolicy", "networking.k8s.io/v1::NetworkPolicy", "core/v1::Service"]
    Cilium:                 Ok   OK
    NodeMonitor:            Listening for events on 2 CPUs with 64x4096 of shared memory
    Cilium health daemon:   Ok
    IPv4 address pool:      261/65535 allocated
    IPv6 address pool:      4/4294967295 allocated
    Controller Status:      20/20 healthy
    Proxy Status:           OK, ip 10.0.28.238, port-range 10000-20000
    Hubble:                 Ok   Current/Max Flows: 2542/4096 (62.06%), Flows/s: 164.21   Metrics: Disabled
    Cluster health:         2/2 reachable   (2018-04-11T15:41:01Z)

.. _hubble_troubleshooting:

Observing Flows with Hubble
===========================

Hubble is a built-in observability tool which allows you to inspect recent flow
events on all endpoints managed by Cilium.

Ensure Hubble is running correctly
----------------------------------

To ensure the Hubble client can connect to the Hubble server running inside
Cilium, you may use the ``hubble status`` command from within a Cilium pod:

.. code-block:: shell-session

    $ hubble status
    Healthcheck (via unix:///var/run/cilium/hubble.sock): Ok
    Current/Max Flows: 4095/4095 (100.00%)
    Flows/s: 164.21

``cilium-agent`` must be running with the ``--enable-hubble`` option (default)
for the Hubble server to be enabled. When deploying Cilium with Helm, make sure
to set the ``hubble.enabled=true`` value.
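
If Hubble was disabled at installation time, it can be enabled again through
Helm. The following is a minimal sketch, assuming the release is named
``cilium``, lives in the ``kube-system`` namespace, and uses the upstream
``cilium/cilium`` chart (adjust names and versions to your environment):

.. code-block:: shell-session

    $ helm upgrade cilium cilium/cilium \
        --namespace kube-system \
        --reuse-values \
        --set hubble.enabled=true

Depending on what changed, the agent pods may need to be restarted (for
example with ``kubectl -n kube-system rollout restart daemonset/cilium``) to
pick up the new configuration.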

To check if Hubble is enabled in your deployment, you may look for the
following output in ``cilium-dbg status``:

.. code-block:: shell-session

    $ cilium-dbg status
    ...
    Hubble:   Ok   Current/Max Flows: 4095/4095 (100.00%), Flows/s: 164.21   Metrics: Disabled
    ...

.. note::
    Pods need to be managed by Cilium in order to be observable by Hubble.
    See how to :ref:`ensure a pod is managed by Cilium<ensure_managed_pod>`
    for more details.

Observing flows of a specific pod
---------------------------------

In order to observe the traffic of a specific pod, you will first have to
:ref:`retrieve the name of the cilium instance managing it<retrieve_cilium_pod>`.
The Hubble CLI is part of the Cilium container image and can be accessed via
``kubectl exec``. For example, the following query shows all events related to
flows which either originated or terminated in the ``default/tiefighter`` pod
in the last three minutes:

.. code-block:: shell-session

    $ kubectl exec -n kube-system cilium-77lk6 -- hubble observe --since 3m --pod default/tiefighter
    May 4 12:47:08.811: default/tiefighter:53875 -> kube-system/coredns-74ff55c5b-66f4n:53 to-endpoint FORWARDED (UDP)
    May 4 12:47:08.811: default/tiefighter:53875 -> kube-system/coredns-74ff55c5b-66f4n:53 to-endpoint FORWARDED (UDP)
    May 4 12:47:08.811: default/tiefighter:53875 <- kube-system/coredns-74ff55c5b-66f4n:53 to-endpoint FORWARDED (UDP)
    May 4 12:47:08.811: default/tiefighter:53875 <- kube-system/coredns-74ff55c5b-66f4n:53 to-endpoint FORWARDED (UDP)
    May 4 12:47:08.811: default/tiefighter:50214 <> default/deathstar-c74d84667-cx5kp:80 to-overlay FORWARDED (TCP Flags: SYN)
    May 4 12:47:08.812: default/tiefighter:50214 <- default/deathstar-c74d84667-cx5kp:80 to-endpoint FORWARDED (TCP Flags: SYN, ACK)
    May 4 12:47:08.812: default/tiefighter:50214 <> default/deathstar-c74d84667-cx5kp:80 to-overlay FORWARDED (TCP Flags: ACK)
    May 4 12:47:08.812: default/tiefighter:50214 <> default/deathstar-c74d84667-cx5kp:80 to-overlay FORWARDED (TCP Flags: ACK, PSH)
    May 4 12:47:08.812: default/tiefighter:50214 <- default/deathstar-c74d84667-cx5kp:80 to-endpoint FORWARDED (TCP Flags: ACK, PSH)
    May 4 12:47:08.812: default/tiefighter:50214 <> default/deathstar-c74d84667-cx5kp:80 to-overlay FORWARDED (TCP Flags: ACK, FIN)
    May 4 12:47:08.812: default/tiefighter:50214 <- default/deathstar-c74d84667-cx5kp:80 to-endpoint FORWARDED (TCP Flags: ACK, FIN)
    May 4 12:47:08.812: default/tiefighter:50214 <> default/deathstar-c74d84667-cx5kp:80 to-overlay FORWARDED (TCP Flags: ACK)

You may also use ``-o json`` to obtain more detailed information about each
flow event.

.. note::
    **Hubble Relay** allows you to query multiple Hubble instances
    simultaneously without having to first manually target a specific node. See
    `Observing flows with Hubble Relay`_ for more information.

Observing flows with Hubble Relay
=================================

Hubble Relay is a service which allows you to query multiple Hubble instances
simultaneously and aggregate the results. See :ref:`hubble_setup` to enable
Hubble Relay if it is not yet enabled and to install the Hubble CLI on your
local machine.

You may access the Hubble Relay service by port-forwarding it locally:

.. code-block:: shell-session

    kubectl -n kube-system port-forward service/hubble-relay --address 0.0.0.0 --address :: 4245:80

This will forward the Hubble Relay service port (``80``) to your local machine
on port ``4245`` on all of its IP addresses.

You can verify that Hubble Relay can be reached by using the Hubble CLI and
running the following command from your local machine:

.. code-block:: shell-session

    hubble status

This command should return an output similar to the following:

::

    Healthcheck (via localhost:4245): Ok
    Current/Max Flows: 16380/16380 (100.00%)
    Flows/s: 46.19
    Connected Nodes: 4/4

You may see details about the nodes that Hubble Relay is connected to by
running the following command:

.. code-block:: shell-session

    hubble list nodes

As Hubble Relay shares the same API as individual Hubble instances, you may
follow the `Observing flows with Hubble`_ section, keeping in mind that the
limitations regarding what can be seen from individual Hubble instances no
longer apply.

Connectivity Problems
=====================

Cilium connectivity tests
-------------------------

The Cilium connectivity test deploys a series of services, deployments, and
CiliumNetworkPolicy resources which use various connectivity paths to connect
to each other. Connectivity paths include paths with and without service
load-balancing, combined with various network policy configurations.

.. note::
    The connectivity tests will only work in a namespace with no other pods
    or network policies applied. If a Cilium Clusterwide Network Policy is
    enabled, it may also interfere with this connectivity check.

To run the connectivity tests, create an isolated test namespace called
``cilium-test`` and deploy the tests into it:

.. parsed-literal::

    kubectl create ns cilium-test
    kubectl apply --namespace=cilium-test -f \ |SCM_WEB|\/examples/kubernetes/connectivity-check/connectivity-check.yaml
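
The deployment can later be removed again by deleting the test namespace:

.. code-block:: shell-session

    kubectl delete ns cilium-test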

The tests cover various functionality of the system. The table below calls out
each test type; if a test passes, it suggests that the referenced subsystem is
functional.

+----------------------------+------------------------------+-----------------------------+-----------------------------+-----------------------------------------+
| Pod-to-pod (intra-host)    | Pod-to-pod (inter-host)      | Pod-to-service (intra-host) | Pod-to-service (inter-host) | Pod-to-external resource                |
+============================+==============================+=============================+=============================+=========================================+
| eBPF routing is functional | Data plane, routing, network | eBPF service map lookup     | VXLAN overlay port if used  | Egress, CiliumNetworkPolicy, masquerade |
+----------------------------+------------------------------+-----------------------------+-----------------------------+-----------------------------------------+

The pod name indicates the connectivity variant and the readiness and liveness
gates indicate success or failure of the test:

.. code-block:: shell-session

    $ kubectl get pods -n cilium-test
    NAME                                                     READY   STATUS    RESTARTS   AGE
    echo-a-6788c799fd-42qxx                                  1/1     Running   0          69s
    echo-b-59757679d4-pjtdl                                  1/1     Running   0          69s
    echo-b-host-f86bd784d-wnh4v                              1/1     Running   0          68s
    host-to-b-multi-node-clusterip-585db65b4d-x74nz          1/1     Running   0          68s
    host-to-b-multi-node-headless-77c64bc7d8-kgf8p           1/1     Running   0          67s
    pod-to-a-allowed-cnp-87b5895c8-bfw4x                     1/1     Running   0          68s
    pod-to-a-b76ddb6b4-2v4kb                                 1/1     Running   0          68s
    pod-to-a-denied-cnp-677d9f567b-kkjp4                     1/1     Running   0          68s
    pod-to-b-intra-node-nodeport-8484fb6d89-bwj8q            1/1     Running   0          68s
    pod-to-b-multi-node-clusterip-f7655dbc8-h5bwk            1/1     Running   0          68s
    pod-to-b-multi-node-headless-5fd98b9648-5bjj8            1/1     Running   0          68s
    pod-to-b-multi-node-nodeport-74bd8d7bd5-kmfmm            1/1     Running   0          68s
    pod-to-external-1111-7489c7c46d-jhtkr                    1/1     Running   0          68s
    pod-to-external-fqdn-allow-google-cnp-b7b6bcdcb-97p75    1/1     Running   0          68s

Information about test failures can be obtained by describing a failed test
pod:

.. code-block:: shell-session

    $ kubectl describe pod pod-to-b-intra-node-hostport
      Warning  Unhealthy  6s (x6 over 56s)   kubelet, agent1  Readiness probe failed: curl: (7) Failed to connect to echo-b-host-headless port 40000: Connection refused
      Warning  Unhealthy  2s (x3 over 52s)   kubelet, agent1  Liveness probe failed: curl: (7) Failed to connect to echo-b-host-headless port 40000: Connection refused

.. _cluster_connectivity_health:

Checking cluster connectivity health
------------------------------------

Cilium can rule out network-fabric-related issues when troubleshooting
connectivity issues by providing reliable health and latency probes between all
cluster nodes and a simulated workload running on each node.

By default when Cilium is run, it launches instances of ``cilium-health`` in
the background to determine the overall connectivity status of the cluster.
This tool periodically runs bidirectional traffic across multiple paths through
the cluster and through each node using different protocols to determine the
health status of each path and protocol. At any point in time, cilium-health
may be queried for the connectivity status of the last probe.

.. code-block:: shell-session

    $ kubectl -n kube-system exec -ti cilium-2hq5z -- cilium-health status
    Probe time:   2018-06-16T09:51:58Z
    Nodes:
      ip-172-0-52-116.us-west-2.compute.internal (localhost):
        Host connectivity to 172.0.52.116:
          ICMP to stack: OK, RTT=315.254µs
          HTTP to agent: OK, RTT=368.579µs
        Endpoint connectivity to 10.2.0.183:
          ICMP to stack: OK, RTT=190.658µs
          HTTP to agent: OK, RTT=536.665µs
      ip-172-0-117-198.us-west-2.compute.internal:
        Host connectivity to 172.0.117.198:
          ICMP to stack: OK, RTT=1.009679ms
          HTTP to agent: OK, RTT=1.808628ms
        Endpoint connectivity to 10.2.1.234:
          ICMP to stack: OK, RTT=1.016365ms
          HTTP to agent: OK, RTT=2.29877ms

For each node, the connectivity will be displayed for each protocol and path,
both to the node itself and to an endpoint on that node. The latency specified
is a snapshot at the last time a probe was run, which is typically once per
minute. The ICMP connectivity row represents Layer 3 connectivity to the
networking stack, while the HTTP connectivity row represents connection to an
instance of the ``cilium-health`` agent running on the host or as an endpoint.
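
Because the reported latencies are a snapshot of the most recent probe, it can
be useful to re-run the command periodically while you make changes and watch
how reachability evolves. A minimal sketch (the pod name is a placeholder and
the 60 second interval roughly matches the usual probe period):

.. code-block:: shell-session

    $ while sleep 60; do
          kubectl -n kube-system exec cilium-2hq5z -- cilium-health status
      done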

.. _monitor:

Monitoring Datapath State
-------------------------

Sometimes you may experience broken connectivity, which may be due to a
number of different causes. A common cause is unwanted packet drops at the
networking level. The ``cilium-dbg monitor`` tool allows you to quickly inspect
whether and where packet drops happen. Following is an example output (use
``kubectl exec`` as in previous examples if running with Kubernetes):

.. code-block:: shell-session

    $ kubectl -n kube-system exec -ti cilium-2hq5z -- cilium-dbg monitor --type drop
    Listening for events on 2 CPUs with 64x4096 of shared memory
    Press Ctrl-C to quit
    xx drop (Policy denied) to endpoint 25729, identity 261->264: fd02::c0a8:210b:0:bf00 -> fd02::c0a8:210b:0:6481 EchoRequest
    xx drop (Policy denied) to endpoint 25729, identity 261->264: fd02::c0a8:210b:0:bf00 -> fd02::c0a8:210b:0:6481 EchoRequest
    xx drop (Policy denied) to endpoint 25729, identity 261->264: 10.11.13.37 -> 10.11.101.61 EchoRequest
    xx drop (Policy denied) to endpoint 25729, identity 261->264: 10.11.13.37 -> 10.11.101.61 EchoRequest
    xx drop (Invalid destination mac) to endpoint 0, identity 0->0: fe80::5c25:ddff:fe8e:78d8 -> ff02::2 RouterSolicitation

The above indicates that a packet to endpoint ID ``25729`` has been dropped due
to violation of the Layer 3 policy.

Handling drop (CT: Map insertion failed)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If connectivity fails and ``cilium-dbg monitor --type drop`` shows ``xx drop (CT:
Map insertion failed)``, then it is likely that the connection tracking table
is filling up and the automatic adjustment of the garbage collector interval is
insufficient.

Setting ``--conntrack-gc-interval`` to an interval lower than the current value
may help. This option controls the time interval between two garbage collection
runs.

By default ``--conntrack-gc-interval`` is set to 0, which translates to using a
dynamic interval. In that case, the interval is updated after each garbage
collection run depending on how many entries were garbage collected. If very
few or no entries were garbage collected, the interval will increase; if many
entries were garbage collected, it will decrease. The current interval value is
reported in the Cilium agent logs.

Alternatively, the values of ``bpf-ct-global-any-max`` and
``bpf-ct-global-tcp-max`` can be increased. Lowering ``conntrack-gc-interval``
trades additional CPU for more frequent garbage collection, while raising
``bpf-ct-global-any-max`` and ``bpf-ct-global-tcp-max`` trades additional
memory for larger tables. You can track conntrack garbage collection related
metrics such as ``datapath_conntrack_gc_runs_total`` and
``datapath_conntrack_gc_entries`` to get visibility into garbage collection
runs. Refer to :ref:`metrics` for more details.

Enabling datapath debug messages
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, datapath debug messages are disabled, and therefore not shown in
``cilium-dbg monitor -v`` output. To enable them, add ``"datapath"`` to
the ``debug-verbose`` option.
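
One way to do this is to set the corresponding key in the ``cilium-config``
ConfigMap and restart the agents so that they pick it up. The sketch below
assumes the default ConfigMap and DaemonSet names, and that the ConfigMap key
mirrors the agent option named above; with Helm, the equivalent is usually the
``debug.verbose`` value:

.. code-block:: shell-session

    $ kubectl -n kube-system patch configmap cilium-config \
        --type merge --patch '{"data":{"debug-verbose":"datapath"}}'
    $ kubectl -n kube-system rollout restart daemonset/cilium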

Policy Troubleshooting
======================

.. _ensure_managed_pod:

Ensure pod is managed by Cilium
-------------------------------

A potential cause for policy enforcement not functioning as expected is that
the networking of the pod selected by the policy is not being managed by
Cilium. The following situations result in unmanaged pods:

* The pod is running in host networking and will use the host's IP address
  directly. Such pods have full network connectivity but Cilium will not
  provide security policy enforcement for such pods by default. To enforce
  policy against these pods, either set ``hostNetwork`` to false or use
  :ref:`HostPolicies`.

* The pod was started before Cilium was deployed. Cilium only manages pods
  that have been deployed after Cilium itself was started. Cilium will not
  provide security policy enforcement for such pods. These pods should be
  restarted in order to ensure that Cilium can provide security policy
  enforcement.

If pod networking is not managed by Cilium, ingress and egress policy rules
selecting the respective pods will not be applied. See the section
:ref:`network_policy` for more details.

For a quick assessment of whether any pods are not managed by Cilium, the
`Cilium CLI <https://github.com/cilium/cilium-cli>`_ will print the number
of managed pods. If this shows that all of the pods are managed by Cilium,
then there is no problem:

.. code-block:: shell-session

    $ cilium status
        /¯¯\
     /¯¯\__/¯¯\    Cilium:         OK
     \__/¯¯\__/    Operator:       OK
     /¯¯\__/¯¯\    Hubble:         OK
     \__/¯¯\__/    ClusterMesh:    disabled
        \__/

    Deployment        cilium-operator    Desired: 2, Ready: 2/2, Available: 2/2
    Deployment        hubble-relay       Desired: 1, Ready: 1/1, Available: 1/1
    Deployment        hubble-ui          Desired: 1, Ready: 1/1, Available: 1/1
    DaemonSet         cilium             Desired: 2, Ready: 2/2, Available: 2/2
    Containers:       cilium-operator    Running: 2
                      hubble-relay       Running: 1
                      hubble-ui          Running: 1
                      cilium             Running: 2
    Cluster Pods:     5/5 managed by Cilium
    ...

You can run the following script to list the pods which are *not* managed by
Cilium:

.. code-block:: shell-session

    $ curl -sLO https://raw.githubusercontent.com/cilium/cilium/main/contrib/k8s/k8s-unmanaged.sh
    $ chmod +x k8s-unmanaged.sh
    $ ./k8s-unmanaged.sh
    kube-system/cilium-hqpk7
    kube-system/kube-addon-manager-minikube
    kube-system/kube-dns-54cccfbdf8-zmv2c
    kube-system/kubernetes-dashboard-77d8b98585-g52k5
    kube-system/storage-provisioner

Understand the rendering of your policy
---------------------------------------

There are always multiple ways to approach a problem. Cilium can provide the
rendering of the aggregate policy provided to it, leaving you to simply compare
it with what you expect the policy to actually be rather than search (and
potentially overlook) every policy. At the expense of reading a very large dump
of an endpoint, this is often a faster path to discovering errant policy
requests in the Kubernetes API.

Start by finding the endpoint you are debugging from the following list. There
are several cross references for you to use in this list, including the IP
address and pod labels:

.. code-block:: shell-session

    kubectl -n kube-system exec -ti cilium-q8wvt -- cilium-dbg endpoint list

When you find the correct endpoint, the first column of every row is the
endpoint ID. Use that to dump the full endpoint information:

.. code-block:: shell-session

    kubectl -n kube-system exec -ti cilium-q8wvt -- cilium-dbg endpoint get 59084

.. image:: images/troubleshooting_policy.png
    :align: center

Importing this dump into a JSON-friendly editor can help you browse and
navigate the information. At the top level of the dump, there are two nodes of
note:

* ``spec``: The desired state of the endpoint
* ``status``: The current state of the endpoint

This is the standard Kubernetes control loop pattern. Cilium is the controller
here, and it is iteratively working to bring the ``status`` in line with the
``spec``.

Opening the ``status``, we can drill down through ``policy.realized.l4``. Do
your ``ingress`` and ``egress`` rules match what you expect? If not, the
reference to the errant rules can be found in the ``derived-from-rules`` node.

Policymap pressure and overflow
-------------------------------

The most important step in debugging policymap pressure is finding out which
node(s) are impacted.

The ``cilium_bpf_map_pressure{map_name="cilium_policy_*"}`` metric monitors the
endpoints' BPF policymap pressure. This metric exposes the maximum BPF map
pressure on the node, meaning the policymap experiencing the most pressure on a
particular node.

Once the node is known, the troubleshooting steps are as follows:

1. Find the Cilium pod on the node experiencing the problematic policymap
   pressure and obtain a shell via ``kubectl exec``.
2. Use ``cilium-dbg policy selectors`` to get an overview of which selectors
   are selecting many identities. As of Cilium v1.15, the output of this
   command additionally displays the namespace and name of the policy resource
   of each selector.
3. The type of selector tells you what sort of policy rule could be having an
   impact. The three existing types of selectors are explained below, each
   with specific guidance depending on the selector type.
4. Consider bumping the policymap size as a last resort (an example follows
   this list). However, keep in mind the following implications:

   * Increased memory consumption for each policymap.
   * Generally, as identities increase in the cluster, the more work Cilium
     performs.
   * At a broader level, if the policy posture is such that all or nearly all
     identities are selected, this suggests that the posture is too permissive.
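
If, after reviewing the selectors, you do decide to raise the policymap size,
the corresponding agent option is ``bpf-policy-map-max``. Below is a sketch
using Helm; the value name ``bpf.policyMapMax`` and the release details are
assumptions, so verify them against your chart version before applying:

.. code-block:: shell-session

    $ helm upgrade cilium cilium/cilium \
        --namespace kube-system \
        --reuse-values \
        --set bpf.policyMapMax=32768

Agents typically need to be restarted for the new map size to take effect.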

+---------------+-------------------------------------------------------------------------------------------------------------+
| Selector type | Form in ``cilium-dbg policy selectors`` output                                                              |
+===============+=============================================================================================================+
| CIDR          | ``&LabelSelector{MatchLabels:map[string]string{cidr.1.1.1.1/32: ,}``                                        |
+---------------+-------------------------------------------------------------------------------------------------------------+
| FQDN          | ``MatchName: , MatchPattern: *``                                                                            |
+---------------+-------------------------------------------------------------------------------------------------------------+
| Label         | ``&LabelSelector{MatchLabels:map[string]string{any.name: curl,k8s.io.kubernetes.pod.namespace: default,}``  |
+---------------+-------------------------------------------------------------------------------------------------------------+

An example output of ``cilium-dbg policy selectors``:

.. code-block:: shell-session

    root@kind-worker:/home/cilium# cilium-dbg policy selectors
    SELECTOR                                                                                                                                                LABELS                          USERS   IDENTITIES
    &LabelSelector{MatchLabels:map[string]string{k8s.io.kubernetes.pod.namespace: kube-system,k8s.k8s-app: kube-dns,},MatchExpressions:[]LabelSelectorRequirement{},}   default/tofqdn-dns-visibility   1   16500
    &LabelSelector{MatchLabels:map[string]string{reserved.none: ,},MatchExpressions:[]LabelSelectorRequirement{},}                                          default/tofqdn-dns-visibility   1
    MatchName: , MatchPattern: *                                                                                                                            default/tofqdn-dns-visibility   1       16777231
                                                                                                                                                                                                    16777232
                                                                                                                                                                                                    16777233
                                                                                                                                                                                                    16860295
                                                                                                                                                                                                    16860322
                                                                                                                                                                                                    16860323
                                                                                                                                                                                                    16860324
                                                                                                                                                                                                    16860325
                                                                                                                                                                                                    16860326
                                                                                                                                                                                                    16860327
                                                                                                                                                                                                    16860328
    &LabelSelector{MatchLabels:map[string]string{any.name: netperf,k8s.io.kubernetes.pod.namespace: default,},MatchExpressions:[]LabelSelectorRequirement{},}           default/tofqdn-dns-visibility   1
    &LabelSelector{MatchLabels:map[string]string{cidr.1.1.1.1/32: ,},MatchExpressions:[]LabelSelectorRequirement{},}                                        default/tofqdn-dns-visibility   1       16860329
    &LabelSelector{MatchLabels:map[string]string{cidr.1.1.1.2/32: ,},MatchExpressions:[]LabelSelectorRequirement{},}                                        default/tofqdn-dns-visibility   1       16860330
    &LabelSelector{MatchLabels:map[string]string{cidr.1.1.1.3/32: ,},MatchExpressions:[]LabelSelectorRequirement{},}                                        default/tofqdn-dns-visibility   1       16860331

From the output above, we see that all three selector types are in use. The
important step here is to determine which selector is selecting the most
identities, because the policy containing that selector is the likely cause of
the policymap pressure.

Label
~~~~~

See the section on :ref:`identity-relevant labels <identity-relevant-labels>`.

Another aspect to consider is the permissiveness of the policies and whether it
could be reduced.

CIDR
~~~~

One way to reduce the number of identities selected by a CIDR selector is to
broaden the range of the CIDR, if possible. For example, the policy in the
example output above contains a ``/32`` rule for each CIDR, rather than using a
wider range such as a ``/30``. Updating the policy to use the wider range
creates a single identity that represents all IPs within the ``/30``, and
therefore only requires the selector to select one identity.

FQDN
~~~~

See the section on :ref:`isolating the source of toFQDNs issues regarding
identities and policy <isolating-source-toFQDNs-issues-identities-policy>`.

etcd (kvstore)
==============

Introduction
------------

Cilium can be operated in CRD mode or in kvstore/etcd mode. When Cilium is
running in kvstore/etcd mode, the kvstore becomes a vital component of the
overall cluster health, as it is required to be available for several
operations.

Operations for which the kvstore is strictly required when running in etcd
mode:

Scheduling of new workloads:
  As part of scheduling workloads/endpoints, agents will perform security
  identity allocation which requires interaction with the kvstore. Even if a
  workload can be scheduled by re-using a known security identity, the state
  propagation of the endpoint details to other nodes will still depend on the
  kvstore, and thus packet drops due to policy enforcement may be observed, as
  other nodes in the cluster will not be aware of the new workload.

Multi cluster:
  All state propagation between clusters depends on the kvstore.

Node discovery:
  New nodes need to register themselves in the kvstore.

Agent bootstrap:
  The Cilium agent will eventually fail if it can't connect to the kvstore at
  bootstrap time; however, the agent will still perform all possible operations
  while waiting for the kvstore to become available.

Operations which *do not* require kvstore availability:

All datapath operations:
  All datapath forwarding, policy enforcement and visibility functions for
  existing workloads/endpoints do not depend on the kvstore. Packets will
  continue to be forwarded and network policy rules will continue to be
  enforced.

  However, if the agent needs to restart as part of the
  :ref:`etcd_recovery_behavior`, there can be:

  * delays in the processing of flow events and metrics
  * short unavailability of layer 7 proxies

NetworkPolicy updates:
  Network policy updates will continue to be processed and applied.

Services updates:
  All updates to services will be processed and applied.

Understanding etcd status
-------------------------

The etcd status is reported when running ``cilium-dbg status``. The following
line represents the status of etcd::

    KVStore:   Ok   etcd: 1/1 connected, lease-ID=29c6732d5d580cb5, lock lease-ID=29c6732d5d580cb7, has-quorum=true: https://192.168.60.11:2379 - 3.4.9 (Leader)

OK:
  The overall status. Either ``OK`` or ``Failure``.

1/1 connected:
  Number of total etcd endpoints and how many of them are reachable.

lease-ID:
  UUID of the lease used for all keys owned by this agent.

lock lease-ID:
  UUID of the lease used for locks acquired by this agent.

has-quorum:
  Status of etcd quorum. Either ``true`` or set to an error.

consecutive-errors:
  Number of consecutive quorum errors. Only printed if errors are present.

https://192.168.60.11:2379 - 3.4.9 (Leader):
  List of all etcd endpoints stating the etcd version and whether the
  particular endpoint is currently the elected leader. If an etcd endpoint
  cannot be reached, the error is shown.
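
To retrieve just this line for a specific agent (the pod name is a
placeholder), you can filter the status output:

.. code-block:: shell-session

    $ kubectl -n kube-system exec cilium-2hq5z -- cilium-dbg status | grep KVStore
    KVStore:   Ok   etcd: 1/1 connected, lease-ID=29c6732d5d580cb5, lock lease-ID=29c6732d5d580cb7, has-quorum=true: https://192.168.60.11:2379 - 3.4.9 (Leader)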

.. _etcd_recovery_behavior:

Recovery behavior
-----------------

In the event of an etcd endpoint becoming unhealthy, etcd should automatically
resolve this by electing a new leader and by failing over to a healthy etcd
endpoint. As long as quorum is preserved, the etcd cluster will remain
functional.

In addition, Cilium performs a periodic background check to determine etcd
health and potentially take action. The check interval depends on the overall
cluster size; the larger the cluster, the longer the `interval
<https://pkg.go.dev/github.com/cilium/cilium/pkg/kvstore?tab=doc#ExtraOptions.StatusCheckInterval>`_:

* If no etcd endpoints can be reached, Cilium will report failure in
  ``cilium-dbg status``. This will cause the liveness and readiness probes of
  Kubernetes to fail and Cilium will be restarted.

* A lock is acquired and released to test a write operation which requires
  quorum. If this operation fails, loss of quorum is reported. If quorum fails
  for three or more intervals in a row, Cilium is declared unhealthy.

* The Cilium operator will constantly write to a heartbeat key
  (``cilium/.heartbeat``). All Cilium agents will watch for updates to this
  heartbeat key. This validates an agent's ability to receive key updates from
  etcd. If the heartbeat key is not updated in time, the quorum check is
  declared to have failed and Cilium is declared unhealthy after three or more
  consecutive failures.

Example of a status with a quorum failure which has not yet reached the
threshold::

    KVStore:   Ok   etcd: 1/1 connected, lease-ID=29c6732d5d580cb5, lock lease-ID=29c6732d5d580cb7, has-quorum=2m2.778966915s since last heartbeat update has been received, consecutive-errors=1: https://192.168.60.11:2379 - 3.4.9 (Leader)

Example of a status with the number of quorum failures exceeding the
threshold::

    KVStore:   Failure   Err: quorum check failed 8 times in a row: 4m28.446600949s since last heartbeat update has been received

.. _troubleshooting_clustermesh:

.. include:: ./troubleshooting_clustermesh.rst

.. _troubleshooting_servicemesh:

.. include:: troubleshooting_servicemesh.rst

Symptom Library
===============

Node to node traffic is being dropped
-------------------------------------

Symptom
~~~~~~~

Endpoint to endpoint communication on a single node succeeds but communication
fails between endpoints across multiple nodes.

Troubleshooting steps
~~~~~~~~~~~~~~~~~~~~~

#. Run ``cilium-health status`` on the nodes of the source and destination
   endpoints. It should describe the connectivity from that node to other
   nodes in the cluster, and to a simulated endpoint on each other node.
   Identify points in the cluster that cannot talk to each other. If the
   command does not describe the status of the other node, there may be an
   issue with the kvstore.

#. Run ``cilium-dbg monitor`` on the nodes of the source and destination
   endpoints and look for packet drops.

When running in :ref:`arch_overlay` mode:

#. Run ``cilium-dbg bpf tunnel list`` and verify that each Cilium node is aware
   of the other nodes in the cluster (see the example after this list). If not,
   check the logfile for errors.

#. If nodes are being populated correctly, run ``tcpdump -n -i cilium_vxlan``
   on each node to verify whether cross-node traffic is being forwarded
   correctly between nodes.

If packets are being dropped:

* verify that the node IPs listed in ``cilium-dbg bpf tunnel list`` can reach
  each other.
* verify that the firewall on each node allows UDP port 8472.
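
For example, to dump the tunnel map of a particular agent and then watch the
VXLAN traffic on the corresponding node (the pod name is a placeholder, and
``tcpdump`` is run on the node itself, for example over SSH, rather than inside
the Cilium pod):

.. code-block:: shell-session

    $ kubectl -n kube-system exec -ti cilium-2hq5z -- cilium-dbg bpf tunnel list
    $ sudo tcpdump -n -i cilium_vxlan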

When running in :ref:`arch_direct_routing` mode:

#. Run ``ip route`` or check your cloud provider router and verify that you
   have routes installed to route the endpoint prefix between all nodes.

#. Verify that the firewall on each node permits routing of the endpoint IPs.


Useful Scripts
==============

.. _retrieve_cilium_pod:

Retrieve Cilium pod managing a particular pod
---------------------------------------------

Identifies the Cilium pod that is managing a particular pod in a namespace:

.. code-block:: shell-session

    k8s-get-cilium-pod.sh <pod> <namespace>

**Example:**

.. code-block:: shell-session

    $ curl -sLO https://raw.githubusercontent.com/cilium/cilium/main/contrib/k8s/k8s-get-cilium-pod.sh
    $ chmod +x k8s-get-cilium-pod.sh
    $ ./k8s-get-cilium-pod.sh luke-pod default
    cilium-zmjj9
    cilium-node-init-v7r9p
    cilium-operator-f576f7977-s5gpq

Execute a command in all Kubernetes Cilium pods
-----------------------------------------------

Runs a command within all Cilium pods of a cluster:

.. code-block:: shell-session

    k8s-cilium-exec.sh <command>

**Example:**

.. code-block:: shell-session

    $ curl -sLO https://raw.githubusercontent.com/cilium/cilium/main/contrib/k8s/k8s-cilium-exec.sh
    $ chmod +x k8s-cilium-exec.sh
    $ ./k8s-cilium-exec.sh uptime
     10:15:16 up 6 days,  7:37,  0 users,  load average: 0.00, 0.02, 0.00
     10:15:16 up 6 days,  7:32,  0 users,  load average: 0.00, 0.03, 0.04
     10:15:16 up 6 days,  7:30,  0 users,  load average: 0.75, 0.27, 0.15
     10:15:16 up 6 days,  7:28,  0 users,  load average: 0.14, 0.04, 0.01

List unmanaged Kubernetes pods
------------------------------

Lists all Kubernetes pods in the cluster for which Cilium does *not* provide
networking. This includes pods running in host-networking mode and pods that
were started before Cilium was deployed.

.. code-block:: shell-session

    k8s-unmanaged.sh

**Example:**

.. code-block:: shell-session

    $ curl -sLO https://raw.githubusercontent.com/cilium/cilium/main/contrib/k8s/k8s-unmanaged.sh
    $ chmod +x k8s-unmanaged.sh
    $ ./k8s-unmanaged.sh
    kube-system/cilium-hqpk7
    kube-system/kube-addon-manager-minikube
    kube-system/kube-dns-54cccfbdf8-zmv2c
    kube-system/kubernetes-dashboard-77d8b98585-g52k5
    kube-system/storage-provisioner

Reporting a problem
===================

Before you report a problem, make sure to retrieve the necessary information
from your cluster before the failure state is lost.

.. _sysdump:

Automatic log & state collection
--------------------------------

.. include:: ../installation/cli-download.rst

Then, execute the ``cilium sysdump`` command to collect troubleshooting
information from your Kubernetes cluster:

.. code-block:: shell-session

    cilium sysdump

Note that by default ``cilium sysdump`` will attempt to collect as many logs as
possible, for all the nodes in the cluster. If your cluster size is above 20
nodes, consider setting the following options to limit the size of the sysdump
(a combined example follows the list). This is not required, but useful for
those who have constraints on bandwidth or upload size.

* Set the ``--node-list`` option to pick only a few nodes in case the cluster
  has many of them.
* Set the ``--logs-since-time`` option to go back in time to when the issues
  started.
* Set the ``--logs-limit-bytes`` option to limit the size of the log files
  (note: passed on to ``kubectl logs``; does not apply to the entire collection
  archive).
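
An example combining these options (the node names and timestamp are
placeholders; check ``cilium sysdump --help`` for the exact format each flag
expects):

.. code-block:: shell-session

    cilium sysdump \
        --node-list node-1,node-2 \
        --logs-since-time 2024-01-15T08:00:00Z \
        --logs-limit-bytes 1073741824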

Ideally, a sysdump that has a full history of a few selected nodes (using
``--node-list``) is preferable to a brief history of all the nodes. The second
recommended option is ``--logs-since-time``, if you are able to narrow down
when the issues started. Lastly, if the Cilium agent and Operator logs are too
large, consider ``--logs-limit-bytes``.

Use ``--help`` to see more options:

.. code-block:: shell-session

    cilium sysdump --help

Single Node Bugtool
~~~~~~~~~~~~~~~~~~~

If you are not running Kubernetes, it is also possible to run the bug
collection tool manually with the scope of a single node.

The ``cilium-bugtool`` command captures potentially useful information about
your environment for debugging. The tool is meant to be used for debugging a
single Cilium agent node. In the Kubernetes case, if you have multiple Cilium
pods, the tool can retrieve debugging information from all of them. The tool
works by archiving a collection of command output and files from several
places. By default, it writes to the ``/tmp`` directory.

Note that the command needs to be run from inside the Cilium pod/container.

.. code-block:: shell-session

    cilium-bugtool

When run with no options as shown above, it will try to copy various files and
execute some commands. If ``kubectl`` is detected, it will search for Cilium
pods. The default label is ``k8s-app=cilium``, but the label and the namespace
can be changed via the ``k8s-label`` and ``k8s-namespace`` options,
respectively.

If you want to capture the archive from a Kubernetes pod, then the process is a
bit different:

.. code-block:: shell-session

    $ # First we need to get the Cilium pod
    $ kubectl get pods --namespace kube-system
      NAME                          READY     STATUS    RESTARTS   AGE
      cilium-kg8lv                  1/1       Running   0          13m
      kube-addon-manager-minikube   1/1       Running   0          1h
      kube-dns-6fc954457d-sf2nk     3/3       Running   0          1h
      kubernetes-dashboard-6xvc7    1/1       Running   0          1h

    $ # Run the bugtool from this pod
    $ kubectl -n kube-system exec cilium-kg8lv -- cilium-bugtool
    [...]

    $ # Copy the archive from the pod
    $ kubectl cp kube-system/cilium-kg8lv:/tmp/cilium-bugtool-20180411-155146.166+0000-UTC-266836983.tar /tmp/cilium-bugtool-20180411-155146.166+0000-UTC-266836983.tar
    [...]

.. note::

    Please check the archive for sensitive information and strip it
    away before sharing it with us.
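
One way to review the contents is to list the files in the archive before
uploading it (the archive name below is taken from the example above and will
differ in your case):

.. code-block:: shell-session

    $ tar tf /tmp/cilium-bugtool-20180411-155146.166+0000-UTC-266836983.tar | less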

Below is an approximate list of the kind of information contained in the
archive:

* Cilium status
* Cilium version
* Kernel configuration
* Resolver configuration
* Cilium endpoint state
* Cilium logs
* Docker logs
* ``dmesg``
* ``ethtool``
* ``ip a``
* ``ip link``
* ``ip r``
* ``iptables-save``
* ``kubectl -n kube-system get pods``
* ``kubectl get pods,svc`` for all namespaces
* ``uname``
* ``uptime``
* ``cilium-dbg bpf * list``
* ``cilium-dbg endpoint get`` for each endpoint
* ``cilium-dbg endpoint list``
* ``hostname``
* ``cilium-dbg policy get``
* ``cilium-dbg service list``


Debugging information
~~~~~~~~~~~~~~~~~~~~~

If you are not running Kubernetes, you can use the ``cilium-dbg debuginfo``
command to retrieve useful debugging information. If you are running
Kubernetes, this command is automatically run as part of the system dump.

``cilium-dbg debuginfo`` can print useful output from the Cilium API. The
output is in Markdown format, so it can be used when reporting a bug on the
`issue tracker`_. Running it without arguments will print to standard output,
but you can also redirect the output to a file:

.. code-block:: shell-session

    cilium-dbg debuginfo -f debuginfo.md

.. note::

    Please check the debuginfo file for sensitive information and strip it
    away before sharing it with us.


Slack assistance
----------------

The `Cilium Slack`_ community is a helpful first point of assistance to get
help troubleshooting a problem or to discuss options on how to address it.
The community is open to anyone.

Report an issue via GitHub
--------------------------

If you believe you have found an issue in Cilium, please report a
`GitHub issue`_ and make sure to attach a system dump as described above so
that developers have the best chance of reproducing the issue.

.. _NodeSelector: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodeselector
.. _RBAC: https://kubernetes.io/docs/reference/access-authn-authz/rbac/
.. _CNI: https://github.com/containernetworking/cni
.. _Volumes: https://kubernetes.io/docs/tasks/configure-pod-container/configure-volume-storage/

.. _Cilium Frequently Asked Questions (FAQ): https://github.com/cilium/cilium/issues?utf8=%E2%9C%93&q=label%3Akind%2Fquestion%20

.. _issue tracker: https://github.com/cilium/cilium/issues
.. _GitHub issue: `issue tracker`_