.. only:: not (epub or latex or html)

    WARNING: You are looking at unreleased Cilium documentation.
    Please use the official rendered version released here:
    http://docs.cilium.io

.. _admin_guide:

###############
Troubleshooting
###############

This document describes how to troubleshoot Cilium in different deployment
modes. It focuses on a full deployment of Cilium within a datacenter or public
cloud. If you are just looking for a simple way to experiment, we highly
recommend trying out the :ref:`gs_guide` instead.

This guide assumes that you have read the :ref:`concepts` documentation, which
explains all the components and concepts.

We use GitHub issues to maintain a list of `Cilium Frequently Asked Questions
(FAQ)`_. Check there to see whether your question is already addressed.

Component & Cluster Health
==========================

Kubernetes
----------

An initial overview of Cilium can be retrieved by listing all pods to verify
that all pods have the status ``Running``:

.. code:: bash

    $ kubectl -n kube-system get pods -l k8s-app=cilium
    NAME           READY     STATUS    RESTARTS   AGE
    cilium-2hq5z   1/1       Running   0          4d
    cilium-6kbtz   1/1       Running   0          4d
    cilium-klj4b   1/1       Running   0          4d
    cilium-zmjj9   1/1       Running   0          4d

If Cilium encounters a problem that it cannot recover from, it will
automatically report the failure state via ``cilium status``, which is
regularly queried by the Kubernetes liveness probe to automatically restart
Cilium pods. If a Cilium pod is in state ``CrashLoopBackoff``, this indicates a
permanent failure scenario.

Detailed Status
~~~~~~~~~~~~~~~

If a particular Cilium pod is not in the ``Running`` state, the status and
health of the agent on that node can be retrieved by running ``cilium status``
in the context of that pod:

.. code:: bash

    $ kubectl -n kube-system exec -ti cilium-2hq5z -- cilium status
    KVStore:                Ok   etcd: 1/1 connected: http://demo-etcd-lab--a.etcd.tgraf.test1.lab.corp.covalent.link:2379 - 3.2.5 (Leader)
    ContainerRuntime:       Ok   docker daemon: OK
    Kubernetes:             Ok   OK
    Kubernetes APIs:        ["cilium/v2::CiliumNetworkPolicy", "networking.k8s.io/v1::NetworkPolicy", "core/v1::Service", "core/v1::Endpoint", "core/v1::Node", "CustomResourceDefinition"]
    Cilium:                 Ok   OK
    NodeMonitor:            Disabled
    Cilium health daemon:   Ok
    Controller Status:      14/14 healthy
    Proxy Status:           OK, ip 10.2.0.172, port-range 10000-20000
    Cluster health:         4/4 reachable   (2018-06-16T09:49:58Z)

Alternatively, the ``k8s-cilium-exec.sh`` script can be used to run ``cilium
status`` on all nodes. This will provide detailed status and health information
for all nodes in the cluster:

.. code:: bash

    $ curl -sLO releases.cilium.io/v1.1.0/tools/k8s-cilium-exec.sh
    $ chmod +x ./k8s-cilium-exec.sh

... and run ``cilium status`` on all nodes:

.. code:: bash

    $ ./k8s-cilium-exec.sh cilium status
    KVStore:                Ok   Etcd: http://127.0.0.1:2379 - (Leader) 3.1.10
    ContainerRuntime:       Ok
    Kubernetes:             Ok   OK
    Kubernetes APIs:        ["extensions/v1beta1::Ingress", "core/v1::Node", "CustomResourceDefinition", "cilium/v2::CiliumNetworkPolicy", "networking.k8s.io/v1::NetworkPolicy", "core/v1::Service", "core/v1::Endpoint"]
    Cilium:                 Ok   OK
    NodeMonitor:            Listening for events on 2 CPUs with 64x4096 of shared memory
    Cilium health daemon:   Ok
    Controller Status:      7/7 healthy
    Proxy Status:           OK, ip 10.15.28.238, 0 redirects, port-range 10000-20000
    Cluster health:         1/1 reachable   (2018-02-27T00:24:34Z)

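If a pod is stuck outside the ``Running`` state, it can also help to check
which node it is scheduled on and which events Kubernetes has recorded for it
before digging into the agent itself. A minimal sketch using standard
``kubectl`` commands (replace the pod name with one from your cluster):

.. code:: bash

    # Show which node each Cilium pod is scheduled on, plus restart counts
    kubectl -n kube-system get pods -l k8s-app=cilium -o wide

    # Inspect recent events and the reason for the last restart of a given pod
    kubectl -n kube-system describe pod cilium-2hq5z
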
Logs
~~~~

To retrieve the log files of a Cilium pod, run the following (replace
``cilium-1234`` with a pod name returned by ``kubectl -n kube-system get pods
-l k8s-app=cilium``):

.. code:: bash

    $ kubectl -n kube-system logs --timestamps cilium-1234

If the Cilium pod was already restarted by the liveness probe after
encountering an issue, it can be useful to retrieve the logs of the pod before
the last restart:

.. code:: bash

    $ kubectl -n kube-system logs --timestamps -p cilium-1234

Generic
-------

When logged into a host running Cilium, the ``cilium`` CLI can be invoked
directly, e.g.:

.. code:: bash

    $ cilium status
    KVStore:                Ok   etcd: 1/1 connected: https://192.168.33.11:2379 - 3.2.7 (Leader)
    ContainerRuntime:       Ok
    Kubernetes:             Ok   OK
    Kubernetes APIs:        ["core/v1::Endpoint", "extensions/v1beta1::Ingress", "core/v1::Node", "CustomResourceDefinition", "cilium/v2::CiliumNetworkPolicy", "networking.k8s.io/v1::NetworkPolicy", "core/v1::Service"]
    Cilium:                 Ok   OK
    NodeMonitor:            Listening for events on 2 CPUs with 64x4096 of shared memory
    Cilium health daemon:   Ok
    IPv4 address pool:      261/65535 allocated
    IPv6 address pool:      4/4294967295 allocated
    Controller Status:      20/20 healthy
    Proxy Status:           OK, ip 10.0.28.238, port-range 10000-20000
    Cluster health:         2/2 reachable   (2018-04-11T15:41:01Z)

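If the summary above does not contain enough detail, ``cilium status`` also
accepts a ``--verbose`` flag that prints a more detailed breakdown; the exact
fields shown depend on the Cilium version:

.. code:: bash

    $ cilium status --verbose
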
Connectivity Problems
=====================

Checking cluster connectivity health
------------------------------------

Cilium makes it possible to rule out network-fabric-related issues when
troubleshooting connectivity problems by providing reliable health and latency
probes between all cluster nodes and a simulated workload running on each node.

By default, when Cilium is run, it launches instances of ``cilium-health`` in
the background to determine the overall connectivity status of the cluster.
This tool periodically runs bidirectional traffic across multiple paths through
the cluster and through each node using different protocols to determine the
health status of each path and protocol. At any point in time, cilium-health
may be queried for the connectivity status of the last probe.

.. code:: bash

    $ kubectl -n kube-system exec -ti cilium-2hq5z -- cilium-health status
    Probe time:   2018-06-16T09:51:58Z
    Nodes:
      ip-172-0-52-116.us-west-2.compute.internal (localhost):
        Host connectivity to 172.0.52.116:
          ICMP to stack: OK, RTT=315.254µs
          HTTP to agent: OK, RTT=368.579µs
        Endpoint connectivity to 10.2.0.183:
          ICMP to stack: OK, RTT=190.658µs
          HTTP to agent: OK, RTT=536.665µs
      ip-172-0-117-198.us-west-2.compute.internal:
        Host connectivity to 172.0.117.198:
          ICMP to stack: OK, RTT=1.009679ms
          HTTP to agent: OK, RTT=1.808628ms
        Endpoint connectivity to 10.2.1.234:
          ICMP to stack: OK, RTT=1.016365ms
          HTTP to agent: OK, RTT=2.29877ms

For each node, the connectivity is displayed for each protocol and path, both
to the node itself and to an endpoint on that node. The latency shown is a
snapshot from the last time a probe ran, which is typically once per minute.
The ICMP rows represent Layer 3 connectivity to the networking stack, while the
HTTP rows represent connectivity to an instance of the ``cilium-health`` agent
running on the host or as an endpoint.

Monitoring Packet Drops
-----------------------

Sometimes you may experience broken connectivity, which can have a number of
different causes. A common cause is unwanted packet drops at the networking
level. The ``cilium monitor`` tool allows you to quickly inspect whether and
where packet drops happen. The following is an example output (use ``kubectl
exec`` as in the previous examples if running with Kubernetes):

.. code:: bash

    $ kubectl -n kube-system exec -ti cilium-2hq5z -- cilium monitor --type drop
    Listening for events on 2 CPUs with 64x4096 of shared memory
    Press Ctrl-C to quit
    xx drop (Policy denied (L3)) to endpoint 25729, identity 261->264: fd02::c0a8:210b:0:bf00 -> fd02::c0a8:210b:0:6481 EchoRequest
    xx drop (Policy denied (L3)) to endpoint 25729, identity 261->264: fd02::c0a8:210b:0:bf00 -> fd02::c0a8:210b:0:6481 EchoRequest
    xx drop (Policy denied (L3)) to endpoint 25729, identity 261->264: 10.11.13.37 -> 10.11.101.61 EchoRequest
    xx drop (Policy denied (L3)) to endpoint 25729, identity 261->264: 10.11.13.37 -> 10.11.101.61 EchoRequest
    xx drop (Invalid destination mac) to endpoint 0, identity 0->0: fe80::5c25:ddff:fe8e:78d8 -> ff02::2 RouterSolicitation

The above indicates that a packet addressed to endpoint ID ``25729`` has been
dropped due to violation of the Layer 3 policy.

Handling drop (CT: Map insertion failed)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If connectivity fails and ``cilium monitor --type drop`` shows ``xx drop (CT:
Map insertion failed)``, then it is likely that the connection tracking table
is filling up and the automatic adjustment of the garbage collector interval is
insufficient. Set ``--conntrack-gc-interval`` to an interval lower than the
default. Alternatively, the values of ``bpf-ct-global-any-max`` and
``bpf-ct-global-tcp-max`` can be increased. Lowering ``conntrack-gc-interval``
trades additional CPU for table headroom, while raising
``bpf-ct-global-any-max`` and ``bpf-ct-global-tcp-max`` increases the amount of
memory consumed.

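Before tuning these options, it can help to get a rough idea of how full the
connection tracking table is. A minimal sketch, run from within a Cilium pod
(the flag values in the comment are illustrative examples, not
recommendations):

.. code:: bash

    # Count the entries currently held in the global connection tracking map
    cilium bpf ct list global | wc -l

    # If the count approaches the configured limits, either lower the garbage
    # collection interval or raise the table sizes via the agent flags, e.g.
    # (illustrative values): --conntrack-gc-interval=2m
    #                        --bpf-ct-global-tcp-max=1000000
    #                        --bpf-ct-global-any-max=500000
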
Policy Troubleshooting
======================

Ensure pod is managed by Cilium
-------------------------------

A potential cause for policy enforcement not functioning as expected is that
the networking of the pod selected by the policy is not being managed by
Cilium. The following situations result in unmanaged pods:

* The pod is running in host networking and will use the host's IP address
  directly. Such pods have full network connectivity but Cilium will not
  provide security policy enforcement for such pods.

* The pod was started before Cilium was deployed. Cilium only manages pods
  that have been deployed after Cilium itself was started. Cilium will not
  provide security policy enforcement for such pods.

If pod networking is not managed by Cilium, ingress and egress policy rules
selecting the respective pods will not be applied. See the section
:ref:`network_policy` for more details.

You can run the following script to list the pods which are *not* managed by
Cilium:

.. code:: bash

    $ ./contrib/k8s/k8s-unmanaged.sh
    kube-system/cilium-hqpk7
    kube-system/kube-addon-manager-minikube
    kube-system/kube-dns-54cccfbdf8-zmv2c
    kube-system/kubernetes-dashboard-77d8b98585-g52k5
    kube-system/storage-provisioner

See section :ref:`policy_tracing` for details and examples on how to use the
policy tracing feature.

Understand the rendering of your policy
---------------------------------------

There are always multiple ways to approach a problem. Cilium can provide the
rendering of the aggregate policy provided to it, leaving you to simply compare
it with what you expect the policy to be, rather than searching (and
potentially overlooking) every policy. At the expense of reading a very large
dump of an endpoint, this is often a faster path to discovering errant policy
requests in the Kubernetes API.

Start by finding the endpoint you are debugging in the following list. There
are several cross references for you to use in this list, including the IP
address and pod labels:

.. code:: bash

    kubectl -n kube-system exec -ti cilium-q8wvt -- cilium endpoint list

When you find the correct endpoint, the first column of every row is the
endpoint ID. Use that to dump the full endpoint information:

.. code:: bash

    kubectl -n kube-system exec -ti cilium-q8wvt -- cilium endpoint get 59084

.. image:: images/troubleshooting_policy.png
    :align: center

Importing this dump into a JSON-friendly editor can help you browse and
navigate the information. At the top level of the dump, there are two nodes of
note:

* ``spec``: The desired state of the endpoint
* ``status``: The current state of the endpoint

This is the standard Kubernetes control loop pattern. Cilium is the controller
here, and it is iteratively working to bring the ``status`` in line with the
``spec``.

Opening the ``status``, we can drill down through ``policy.realized.l4``. Do
your ``ingress`` and ``egress`` rules match what you expect? If not, the
reference to the errant rules can be found in the ``derived-from-rules`` node.

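If you prefer to stay on the command line, the relevant subtree can be
extracted directly from the dump. A minimal sketch, assuming ``jq`` is
installed and that ``cilium endpoint get`` prints the endpoint model as a JSON
list (adjust the filter to match the structure you actually see):

.. code:: bash

    # Print only the realized L4 policy of endpoint 59084, including the
    # derived-from-rules references for each rule
    kubectl -n kube-system exec cilium-q8wvt -- cilium endpoint get 59084 \
        | jq '.[0].status.policy.realized.l4'
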
Symptom Library
===============

Node to node traffic is being dropped
-------------------------------------

Symptom
~~~~~~~

Endpoint to endpoint communication on a single node succeeds but communication
fails between endpoints across multiple nodes.

Troubleshooting steps
~~~~~~~~~~~~~~~~~~~~~

1. Run ``cilium-health status`` on the node of the source and destination
   endpoint. It should describe the connectivity from that node to other
   nodes in the cluster, and to a simulated endpoint on each other node.
   Identify points in the cluster that cannot talk to each other. If the
   command does not describe the status of the other node, there may be an
   issue with the KV-Store.

2. Run ``cilium monitor`` on the node of the source and destination endpoint.
   Look for packet drops.

When running in :ref:`arch_overlay` mode:

3. Run ``cilium bpf tunnel list`` and verify that each Cilium node is aware of
   the other nodes in the cluster. If not, check the logfile for errors.

4. If nodes are being populated correctly, run ``tcpdump -n -i cilium_vxlan``
   on each node to verify whether cross-node traffic is being forwarded
   correctly between nodes.

If packets are being dropped,

* verify that the node IPs listed in ``cilium bpf tunnel list`` can reach each
  other.
* verify that the firewall on each node allows UDP port 8472.

A condensed sketch of these overlay-mode checks is shown after the direct
routing steps below.

When running in :ref:`arch_direct_routing` mode:

3. Run ``ip route`` or check your cloud provider router and verify that you
   have routes installed to route the endpoint prefix between all nodes.

4. Verify that the firewall on each node permits routing of the endpoint IPs.

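The following is a condensed sketch of the overlay-mode checks above, assuming
the default VXLAN setup on UDP port 8472; replace the pod name and node IPs
with values from your environment:

.. code:: bash

    # Verify that this node knows about its peers in the cluster
    kubectl -n kube-system exec -ti cilium-2hq5z -- cilium bpf tunnel list

    # On each node (host shell), watch whether cross-node traffic is actually
    # sent and received on the overlay interface
    tcpdump -n -i cilium_vxlan

    # Also make sure that any firewall between the node IPs listed by the
    # tunnel map allows UDP port 8472.
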
Useful Scripts
==============

Retrieve Cilium pod managing a particular pod
---------------------------------------------

Identifies the Cilium pod that is managing a particular pod in a namespace:

.. code:: bash

    k8s-get-cilium-pod.sh <pod> <namespace>

**Example:**

.. code:: bash

    $ curl -sLO releases.cilium.io/v1.1.0/tools/k8s-get-cilium-pod.sh
    $ ./k8s-get-cilium-pod.sh luke-pod default
    cilium-zmjj9

Execute a command in all Kubernetes Cilium pods
-----------------------------------------------

Runs a command within all Cilium pods of a cluster:

.. code:: bash

    k8s-cilium-exec.sh <command>

**Example:**

.. code:: bash

    $ curl -sLO releases.cilium.io/v1.1.0/tools/k8s-cilium-exec.sh
    $ ./k8s-cilium-exec.sh uptime
     10:15:16 up 6 days,  7:37,  0 users,  load average: 0.00, 0.02, 0.00
     10:15:16 up 6 days,  7:32,  0 users,  load average: 0.00, 0.03, 0.04
     10:15:16 up 6 days,  7:30,  0 users,  load average: 0.75, 0.27, 0.15
     10:15:16 up 6 days,  7:28,  0 users,  load average: 0.14, 0.04, 0.01

List unmanaged Kubernetes pods
------------------------------

Lists all Kubernetes pods in the cluster for which Cilium does *not* provide
networking. This includes pods running in host-networking mode and pods that
were started before Cilium was deployed.

.. code:: bash

    k8s-unmanaged.sh

**Example:**

.. code:: bash

    $ curl -sLO releases.cilium.io/v1.1.0/tools/k8s-unmanaged.sh
    $ ./k8s-unmanaged.sh
    kube-system/cilium-hqpk7
    kube-system/kube-addon-manager-minikube
    kube-system/kube-dns-54cccfbdf8-zmv2c
    kube-system/kubernetes-dashboard-77d8b98585-g52k5
    kube-system/storage-provisioner

Reporting a problem
===================

Automatic log & state collection
--------------------------------

Before you report a problem, make sure to retrieve the necessary information
from your cluster before the failure state is lost. Cilium provides a script
to automatically grab logs and retrieve debug information from all Cilium pods
in the cluster.

The script has the following prerequisites:

* Python >= 2.7
* ``kubectl``, pointing to your cluster before running the tool.

You can download the latest version of the ``cilium-sysdump`` tool using the
following command:

.. code:: bash

    curl -sLO https://github.com/cilium/cilium-sysdump/releases/latest/download/cilium-sysdump.zip
    python cilium-sysdump.zip

You can specify from which nodes to collect the system dumps by passing
node IP addresses via the ``--nodes`` argument:

.. code:: bash

    python cilium-sysdump.zip --nodes=$NODE1_IP,$NODE2_IP2

Use ``--help`` to see more options:

.. code:: bash

    python cilium-sysdump.zip --help

Single Node Bugtool
~~~~~~~~~~~~~~~~~~~

If you are not running Kubernetes, it is also possible to run the bug
collection tool manually with the scope of a single node.

The ``cilium-bugtool`` captures potentially useful information about your
environment for debugging. The tool is meant to be used for debugging a single
Cilium agent node. In the Kubernetes case, if you have multiple Cilium pods,
the tool can retrieve debugging information from all of them. The tool works by
archiving a collection of command output and files from several places. By
default, it writes to the ``/tmp`` directory.

Note that the command needs to be run from inside the Cilium pod/container.

.. code:: bash

    $ cilium-bugtool

When run with no options as shown above, it will try to copy various files and
execute some commands. If ``kubectl`` is detected, it will search for Cilium
pods. The default label is ``k8s-app=cilium``, but this and the namespace can
be changed via ``k8s-label`` and ``k8s-namespace`` respectively.

If you'd prefer to browse the dump, there is an HTTP flag:

.. code:: bash

    $ cilium-bugtool --serve

If you want to capture the archive from a Kubernetes pod, then the process is a
bit different:

.. code:: bash

    # First we need to get the Cilium pod
    $ kubectl get pods --namespace kube-system
      NAME                          READY     STATUS    RESTARTS   AGE
      cilium-kg8lv                  1/1       Running   0          13m
      kube-addon-manager-minikube   1/1       Running   0          1h
      kube-dns-6fc954457d-sf2nk     3/3       Running   0          1h
      kubernetes-dashboard-6xvc7    1/1       Running   0          1h

    # Run the bugtool from this pod
    $ kubectl -n kube-system exec cilium-kg8lv cilium-bugtool
    [...]

    # Copy the archive from the pod
    $ kubectl cp kube-system/cilium-kg8lv:/tmp/cilium-bugtool-20180411-155146.166+0000-UTC-266836983.tar /tmp/cilium-bugtool-20180411-155146.166+0000-UTC-266836983.tar
    [...]

.. Note::

    Please check the archive for sensitive information and strip it
    away before sharing it with us.

Below is an approximate list of the kind of information in the archive.

* Cilium status
* Cilium version
* Kernel configuration
* Resolve configuration
* Cilium endpoint state
* Cilium logs
* Docker logs
* ``dmesg``
* ``ethtool``
* ``ip a``
* ``ip link``
* ``ip r``
* ``iptables-save``
* ``kubectl -n kube-system get pods``
* ``kubectl get pods,svc`` for all namespaces
* ``uname``
* ``uptime``
* ``cilium bpf * list``
* ``cilium endpoint get`` for each endpoint
* ``cilium endpoint list``
* ``hostname``
* ``cilium policy get``
* ``cilium service list``
* ...


Debugging information
~~~~~~~~~~~~~~~~~~~~~

If you are not running Kubernetes, you can use the ``cilium debuginfo`` command
to retrieve useful debugging information. If you are running Kubernetes, this
command is automatically run as part of the system dump.

``cilium debuginfo`` can print useful output from the Cilium API. The output is
in Markdown format, so it can be used when reporting a bug on the `issue
tracker`_. Running without arguments will print to standard output, but you can
also redirect the output to a file:

.. code:: bash

    $ cilium debuginfo -f debuginfo.md

.. Note::

    Please check the debuginfo file for sensitive information and strip it
    away before sharing it with us.


Slack Assistance
----------------

The Cilium Slack community is a helpful first point of assistance to get help
troubleshooting a problem or to discuss options on how to address it.

The Slack community is open to everyone. You can request an invite email by
visiting `Slack <https://cilium.herokuapp.com/>`_.

Report an issue via GitHub
--------------------------

If you believe you have found an issue in Cilium, please report a `GitHub issue
<https://github.com/cilium/cilium/issues>`_ and make sure to attach a system
dump as described above, to give developers the best chance to reproduce the
issue.

.. _Slack channel: https://cilium.herokuapp.com
.. _NodeSelector: https://kubernetes.io/docs/concepts/configuration/assign-pod-node/
.. _RBAC: https://kubernetes.io/docs/admin/authorization/rbac/
.. _CNI: https://github.com/containernetworking/cni
.. _Volumes: https://kubernetes.io/docs/tasks/configure-pod-container/configure-volume-storage/

.. _Cilium Frequently Asked Questions (FAQ): https://github.com/cilium/cilium/issues?utf8=%E2%9C%93&q=label%3Akind%2Fquestion%20

.. _issue tracker: https://github.com/cilium/cilium/issues