.. only:: not (epub or latex or html)

    WARNING: You are looking at unreleased Cilium documentation.
    Please use the official rendered version released here:
    https://docs.cilium.io

.. _performance_tuning:

************
Tuning Guide
************

This guide helps you tune a Cilium installation for optimal performance.

Recommendation
==============

The default out-of-the-box deployment of Cilium is focused on maximum compatibility
rather than optimal performance. If you are a performance-conscious user, here
are the recommended settings for operating Cilium to get the best out of your setup.

.. note::
    In-place upgrade by just enabling the config settings on an existing
    cluster is not possible since these tunings change the underlying datapath
    fundamentals and therefore require Pod or even node restarts.

    The best way to consume this for an existing cluster is to utilize per-node
    configuration for enabling the tunings only on newly spawned nodes which join
    the cluster. See the :ref:`per-node-configuration` page for more details.

Each of the settings for the recommended performance profile is described in more
detail on this page and in this `KubeCon talk <https://sched.co/1R2s5>`__:

- netkit device mode
- eBPF host-routing
- BIG TCP for IPv4/IPv6
- Bandwidth Manager (optional, for BBR congestion control)

**Requirements:**

* Kernel >= 6.8
* Supported NICs for BIG TCP: mlx4, mlx5, ice

To enable the first three settings:

.. tabs::

    .. group-tab:: Helm

        .. parsed-literal::

            helm install cilium |CHART_RELEASE| \\
              --namespace kube-system \\
              --set routingMode=native \\
              --set bpf.datapathMode=netkit \\
              --set bpf.masquerade=true \\
              --set ipv6.enabled=true \\
              --set enableIPv6BIGTCP=true \\
              --set ipv4.enabled=true \\
              --set enableIPv4BIGTCP=true \\
              --set kubeProxyReplacement=true

To additionally enable BBR congestion control, consider adding the following
settings to the above Helm install:

.. tabs::

    .. group-tab:: Helm

        .. parsed-literal::

            --set bandwidthManager.enabled=true \\
            --set bandwidthManager.bbr=true

.. _netkit:

netkit device mode
==================

netkit devices provide connectivity for Pods with the goal of improving throughput
and latency for applications as if they resided directly in the host
namespace, meaning the datapath overhead for network namespaces is reduced down
to zero. The `netkit driver in the kernel <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/netkit.c>`__
has been specifically designed for Cilium's needs and replaces the old-style veth
device type. See also the `KubeCon talk on netkit <https://sched.co/1R2s5>`__ for
more details.

Cilium utilizes netkit in L3 device mode, blackholing traffic from the Pods
when there is no BPF program attached. The Pod-specific BPF programs are attached
inside the netkit peer device, and can only be managed from the host namespace
through Cilium. netkit in combination with eBPF-based host-routing achieves a
fast network namespace switch for off-node traffic ingressing into the Pod or
leaving the Pod. When netkit is enabled, Cilium also utilizes tcx for all
attachments to non-netkit devices. This is done for higher efficiency as well
as for utilizing BPF links for all Cilium attachments. netkit is available for kernel
6.8 and onwards and it also supports BIG TCP. Once the base kernels become more
ubiquitous, the veth device mode of Cilium will be deprecated.

To validate whether your installation is running with netkit, run ``cilium status``
in any of the Cilium Pods and look for the line reporting the status for
"Device Mode", which should state "netkit". Also, ensure that eBPF host-routing
is enabled - the reported status under "Host Routing" must state "BPF".

.. note::
    In-place upgrade by just enabling netkit on an existing cluster is not
    possible since the CNI plugin cannot simply replace veth with netkit after
    Pod creation. Also, running both flavors in parallel is currently not
    supported.

    The best way to consume this for an existing cluster is to utilize per-node
    configuration for enabling netkit on newly spawned nodes which join the
    cluster. See the :ref:`per-node-configuration` page for more details.

**Requirements:**

* Kernel >= 6.8
* Direct-routing configuration or tunneling
* eBPF host-routing

To enable netkit device mode with eBPF host-routing:

.. tabs::

    .. group-tab:: Helm

        .. parsed-literal::

            helm install cilium |CHART_RELEASE| \\
              --namespace kube-system \\
              --set routingMode=native \\
              --set bpf.datapathMode=netkit \\
              --set bpf.masquerade=true \\
              --set kubeProxyReplacement=true

.. _eBPF_Host_Routing:

eBPF Host-Routing
=================

Even when network routing is performed by Cilium using eBPF, by default network
packets still traverse some parts of the regular network stack of the node.
This ensures that all packets still traverse through all of the iptables hooks
in case you depend on them. However, these hooks add significant overhead. For exact
numbers from our test environment, see :ref:`benchmark_throughput` and compare
the results for "Cilium" and "Cilium (legacy host-routing)".
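The datapath mode and host-routing mode of a running cluster can also be inspected
from outside the Pods. The following is a minimal sketch, assuming ``kubectl``
access and Cilium deployed as the ``cilium`` DaemonSet in the ``kube-system``
namespace (the defaults; adjust if your installation differs):

.. code-block:: shell-session

    # Query the agent status from one of the Cilium Pods and extract the
    # datapath-related lines; "Host Routing: BPF" indicates eBPF host-routing,
    # "Device Mode: netkit" indicates netkit devices.
    kubectl -n kube-system exec ds/cilium -- \
        cilium status | grep -E 'Host Routing|Device Mode'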
We introduced `eBPF-based host-routing <https://cilium.io/blog/2020/11/10/cilium-19#veth>`_
in Cilium 1.9 to fully bypass iptables and the upper host stack, and to achieve
a faster network namespace switch compared to regular veth device operation.
This option is automatically enabled if your kernel supports it. To validate
whether your installation is running with eBPF host-routing, run ``cilium status``
in any of the Cilium pods and look for the line reporting the status for
"Host Routing", which should state "BPF".

**Requirements:**

* Kernel >= 5.10
* Direct-routing configuration or tunneling
* eBPF-based kube-proxy replacement
* eBPF-based masquerading

.. _ipv6_big_tcp:

IPv6 BIG TCP
============

IPv6 BIG TCP allows the network stack to prepare larger GSO (transmit) and GRO
(receive) packets to reduce the number of times the stack is traversed, which
improves performance and latency. It reduces the CPU load and helps achieve
higher speeds (i.e. 100Gbit/s and beyond).

To pass such packets through the stack, BIG TCP adds a temporary Hop-By-Hop header
after the IPv6 one which is stripped before transmitting the packet over the wire.

BIG TCP can operate in a DualStack setup: IPv4 packets will use the old, lower
limits (64k) if IPv4 BIG TCP is not enabled, and IPv6 packets will use the new,
larger ones (192k). Both IPv4 BIG TCP and IPv6 BIG TCP can be enabled so that
both use the larger one (192k).

Note that Cilium assumes the default kernel values for the GSO and GRO maximum sizes
are 64k and adjusts them only when necessary, i.e. if BIG TCP is enabled and the
current GSO/GRO maximum sizes are less than 192k, it will try to increase them;
respectively, when BIG TCP is disabled and the current maximum values are more
than 64k, it will try to decrease them.

BIG TCP doesn't require network interface MTU changes.
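Whether the larger limits are in effect on a node can be checked with iproute2,
which reports the per-device GSO/GRO maximum sizes. A minimal sketch, assuming
``eth0`` is the external-facing device (the default limits show up as 65536,
BIG TCP-sized ones as 196608):

.. code-block:: shell-session

    # Show the GSO/GRO maximum sizes currently configured on the device.
    ip -d link show dev eth0 | grep -oE '(gso|gro)(_ipv4)?_max_size [0-9]+'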
.. note::
    In-place upgrade by just enabling BIG TCP on an existing cluster is currently
    not possible since Cilium does not have access into Pods after they have been
    created.

    The best way to consume this for an existing cluster is to either restart Pods
    or to utilize per-node configuration for enabling BIG TCP on newly spawned nodes
    which join the cluster. See the :ref:`per-node-configuration` page for more
    details.

**Requirements:**

* Kernel >= 5.19
* eBPF Host-Routing
* eBPF-based kube-proxy replacement
* eBPF-based masquerading
* Tunneling and encryption disabled
* Supported NICs: mlx4, mlx5, ice

To enable IPv6 BIG TCP:

.. tabs::

    .. group-tab:: Helm

        .. parsed-literal::

            helm install cilium |CHART_RELEASE| \\
              --namespace kube-system \\
              --set routingMode=native \\
              --set bpf.masquerade=true \\
              --set ipv6.enabled=true \\
              --set enableIPv6BIGTCP=true \\
              --set kubeProxyReplacement=true

Note that after toggling the IPv6 BIG TCP option, the Kubernetes Pods must be
restarted for the changes to take effect.

To validate whether your installation is running with IPv6 BIG TCP,
run ``cilium status`` in any of the Cilium pods and look for the line
reporting the status for "IPv6 BIG TCP", which should state "enabled".

IPv4 BIG TCP
============

Similar to IPv6 BIG TCP, IPv4 BIG TCP allows the network stack to prepare larger
GSO (transmit) and GRO (receive) packets to reduce the number of times the stack
is traversed, which improves performance and latency. It reduces the CPU load and
helps achieve higher speeds (i.e. 100Gbit/s and beyond).

To pass such packets through the stack, BIG TCP sets the IPv4 tot_len to 0 and uses
skb->len as the real IPv4 total length. The proper IPv4 tot_len is set before
transmitting the packet over the wire.

BIG TCP can operate in a DualStack setup: IPv6 packets will use the old, lower
limits (64k) if IPv6 BIG TCP is not enabled, and IPv4 packets will use the new,
larger ones (192k). Both IPv4 BIG TCP and IPv6 BIG TCP can be enabled so that
both use the larger one (192k).

Note that Cilium assumes the default kernel values for the GSO and GRO maximum sizes
are 64k and adjusts them only when necessary, i.e. if BIG TCP is enabled and the
current GSO/GRO maximum sizes are less than 192k, it will try to increase them;
respectively, when BIG TCP is disabled and the current maximum values are more
than 64k, it will try to decrease them.

BIG TCP doesn't require network interface MTU changes.

.. note::
    In-place upgrade by just enabling BIG TCP on an existing cluster is currently
    not possible since Cilium does not have access into Pods after they have been
    created.

    The best way to consume this for an existing cluster is to either restart Pods
    or to utilize per-node configuration for enabling BIG TCP on newly spawned nodes
    which join the cluster. See the :ref:`per-node-configuration` page for more
    details.

**Requirements:**

* Kernel >= 6.3
* eBPF Host-Routing
* eBPF-based kube-proxy replacement
* eBPF-based masquerading
* Tunneling and encryption disabled
* Supported NICs: mlx4, mlx5, ice

To enable IPv4 BIG TCP:

.. tabs::

    .. group-tab:: Helm

        .. parsed-literal::

            helm install cilium |CHART_RELEASE| \\
              --namespace kube-system \\
              --set routingMode=native \\
              --set bpf.masquerade=true \\
              --set ipv4.enabled=true \\
              --set enableIPv4BIGTCP=true \\
              --set kubeProxyReplacement=true

Note that after toggling the IPv4 BIG TCP option, the Kubernetes Pods
must be restarted for the changes to take effect.
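One way to restart existing workloads so that they pick up the new limits is a
rolling restart of their controllers. A sketch, assuming a hypothetical
Deployment named ``my-app`` in the ``default`` namespace:

.. code-block:: shell-session

    # Recreate the Pods of the Deployment and wait until they are ready again.
    kubectl -n default rollout restart deployment/my-app
    kubectl -n default rollout status deployment/my-app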
To validate whether your installation is running with IPv4 BIG TCP,
run ``cilium status`` in any of the Cilium pods and look for the line
reporting the status for "IPv4 BIG TCP", which should state "enabled".

Bypass iptables Connection Tracking
===================================

For the case when eBPF Host-Routing cannot be used and thus network packets
still need to traverse the regular network stack in the host namespace,
iptables can add a significant cost. This traversal cost can be minimized
by disabling the connection tracking requirement for all Pod traffic, thus
bypassing the iptables connection tracker.

**Requirements:**

* Kernel >= 4.19.57, >= 5.1.16, >= 5.2
* Direct-routing configuration
* eBPF-based kube-proxy replacement
* eBPF-based masquerading or no masquerading

To enable the iptables connection-tracking bypass:

.. tabs::

    .. group-tab:: Cilium CLI

        .. parsed-literal::

            cilium install |CHART_VERSION| \\
              --set installNoConntrackIptablesRules=true \\
              --set kubeProxyReplacement=true

    .. group-tab:: Helm

        .. parsed-literal::

            helm install cilium |CHART_RELEASE| \\
              --namespace kube-system \\
              --set installNoConntrackIptablesRules=true \\
              --set kubeProxyReplacement=true

Hubble
======

Running with Hubble observability enabled can come at the expense of
performance. The overhead of Hubble is somewhere between 1-15%, depending
on your network traffic patterns and Hubble aggregation settings.

In order to optimize for maximum performance, Hubble can be disabled:

.. tabs::

    .. group-tab:: Cilium CLI

        .. code-block:: shell-session

            cilium hubble disable

    .. group-tab:: Helm

        .. parsed-literal::

            helm install cilium |CHART_RELEASE| \\
              --namespace kube-system \\
              --set hubble.enabled=false

You can also choose to stop exposing event types in which you
are not interested. For instance, if you are mainly interested in
dropped traffic, you can disable "trace" events, which will likely reduce
the overall CPU consumption of the agent:

.. tabs::

    .. group-tab:: Cilium CLI

        .. code-block:: shell-session

            cilium config TraceNotification=disable

    .. group-tab:: Helm

        .. parsed-literal::

            helm install cilium |CHART_RELEASE| \\
              --namespace kube-system \\
              --set bpf.events.trace.enabled=false

.. warning::

    Suppressing one or more event types will impact ``cilium monitor`` as well
    as Hubble observability capabilities, metrics, and exports.

MTU
===

The maximum transmission unit (MTU) can have a significant impact on the network
throughput of a configuration. Cilium will automatically detect the MTU of the
underlying network devices. Therefore, if your system is configured to use
jumbo frames, Cilium will automatically make use of them.

To benefit from this, make sure that your system is configured to use jumbo
frames if your network allows for it.

Bandwidth Manager
=================

Cilium's Bandwidth Manager is responsible for managing network traffic more
efficiently with the goal of improving overall application latency and throughput.

Aside from natively supporting Kubernetes Pod bandwidth annotations, the
`Bandwidth Manager <https://cilium.io/blog/2020/11/10/cilium-19#bwmanager>`_,
first introduced in Cilium 1.9, also sets up Fair Queue (FQ)
queueing disciplines to support TCP stack pacing (e.g. from EDT/BBR) on all
external-facing network devices, as well as setting optimal server-grade sysctl
settings for the networking stack.
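As an illustration of the Pod bandwidth annotations mentioned above, egress
bandwidth for a Pod is capped through the ``kubernetes.io/egress-bandwidth``
annotation, which the Bandwidth Manager enforces on the egress path. A minimal
sketch with a hypothetical Pod:

.. code-block:: yaml

    apiVersion: v1
    kind: Pod
    metadata:
      name: example-app          # hypothetical name
      annotations:
        # Limit egress traffic of this Pod to 10 Mbit/s.
        kubernetes.io/egress-bandwidth: "10M"
    spec:
      containers:
      - name: app
        image: busybox           # hypothetical example image
        command: ["sleep", "infinity"]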
**Requirements:**

* Kernel >= 5.1
* Direct-routing configuration or tunneling
* eBPF-based kube-proxy replacement

To enable the Bandwidth Manager:

.. tabs::

    .. group-tab:: Helm

        .. parsed-literal::

            helm install cilium |CHART_RELEASE| \\
              --namespace kube-system \\
              --set bandwidthManager.enabled=true \\
              --set kubeProxyReplacement=true

To validate whether your installation is running with the Bandwidth Manager,
run ``cilium status`` in any of the Cilium pods and look for the line
reporting the status for "BandwidthManager", which should state "EDT with BPF".

BBR congestion control for Pods
===============================

The base infrastructure around the MQ/FQ setup provided by Cilium's Bandwidth Manager
also allows for use of TCP `BBR congestion control <https://queue.acm.org/detail.cfm?id=3022184>`_
for Pods. BBR is particularly suitable when Pods are exposed behind Kubernetes
Services which face external clients from the Internet. BBR achieves higher
bandwidths and lower latencies for Internet traffic. For example, it has been
`shown <https://cloud.google.com/blog/products/networking/tcp-bbr-congestion-control-comes-to-gcp-your-internet-just-got-faster>`_
that BBR's throughput can reach as much as 2,700x higher than today's best
loss-based congestion control and queueing delays can be 25x lower.

In order for BBR to work reliably for Pods, it requires a 5.18 or higher kernel.
As outlined in our `Linux Plumbers 2021 talk <https://lpc.events/event/11/contributions/953/>`_,
this is needed since older kernels do not retain timestamps of network packets
when switching from the Pod to the host network namespace. Due to the latter, the kernel's
pacing infrastructure does not function properly in general (not specific to Cilium).
We helped fix this issue for recent kernels so that timestamps are retained and
BBR for Pods works.

BBR also needs eBPF Host-Routing in order to retain the network packet's socket
association all the way until the packet hits the FQ queueing discipline on the
physical device in the host namespace.

.. note::
    In-place upgrade by just enabling BBR on an existing cluster is not possible
    since Cilium cannot migrate existing sockets over to BBR congestion control.

    The best way to consume this is to either only enable it on newly built clusters,
    to restart Pods on existing clusters, or to utilize per-node configuration for
    enabling BBR on newly spawned nodes which join the cluster. See the
    :ref:`per-node-configuration` page for more details.

Note that the use of BBR could lead to a higher amount of TCP retransmissions
and more aggressive behavior towards TCP CUBIC connections.

**Requirements:**

* Kernel >= 5.18
* Bandwidth Manager
* eBPF Host-Routing

To enable the Bandwidth Manager with BBR for Pods:

.. tabs::

    .. group-tab:: Helm

        .. parsed-literal::

            helm install cilium |CHART_RELEASE| \\
              --namespace kube-system \\
              --set bandwidthManager.enabled=true \\
              --set bandwidthManager.bbr=true \\
              --set kubeProxyReplacement=true

To validate whether your installation is running with BBR for Pods,
run ``cilium status`` in any of the Cilium pods and look for the line
reporting the status for "BandwidthManager", which should then state
``EDT with BPF`` as well as ``[BBR]``.

XDP Acceleration
================

Cilium has built-in support for accelerating NodePort, LoadBalancer services
and services with externalIPs for the case where the arriving request needs
to be pushed back out of the node when the backend is located on a remote node.

In that case, the network packets do not need to be pushed all the way to the
upper networking stack; with the help of XDP, Cilium is able to process
those requests right at the network driver layer. This helps to reduce
latency and improve the scale-out of services, given that a single node's
forwarding capacity is dramatically increased. The kube-proxy replacement at the XDP layer is
`available from Cilium 1.8 <https://cilium.io/blog/2020/06/22/cilium-18#kubeproxy-removal>`_.

**Requirements:**

* Kernel >= 4.19.57, >= 5.1.16, >= 5.2
* Native XDP supported driver, check :ref:`our driver list <XDP acceleration>`
* Direct-routing configuration
* eBPF-based kube-proxy replacement

To enable XDP Acceleration, check out :ref:`our getting started guide
<XDP acceleration>` which also contains instructions for setting it
up on public cloud providers.

To validate whether your installation is running with XDP Acceleration,
run ``cilium status`` in any of the Cilium pods and look for the line
reporting the status for "XDP Acceleration", which should say "Native".

eBPF Map Sizing
===============

All eBPF maps are created with upper capacity limits. Insertion beyond the
limit would fail or constrain the scalability of the datapath. Cilium
uses auto-derived defaults based on the given ratio of the total system
memory.

However, the upper capacity limits used by the Cilium agent can be overridden
by advanced users. Please refer to the :ref:`bpf_map_limitations` guide.

Linux Kernel
============

In general, we highly recommend using the most recent LTS stable kernel (such
as >= 5.10) provided by the `kernel community <https://www.kernel.org/category/releases.html>`_
or by a downstream distribution of your choice. The newer the kernel, the more
likely it is that various datapath optimizations can be used.

In our Cilium release blogs, we also regularly highlight some of the eBPF-based
kernel work we conduct which implicitly helps Cilium's datapath performance,
such as `replacing retpolines with direct jumps in the eBPF JIT <https://cilium.io/blog/2020/02/18/cilium-17#upstream-linux>`_.

Moreover, the kernel allows you to configure several options which help maximize
network performance.

CONFIG_PREEMPT_NONE
-------------------

Run a kernel version with ``CONFIG_PREEMPT_NONE=y`` set. Some Linux
distributions offer kernel images with this option set, or you can re-compile
the Linux kernel. ``CONFIG_PREEMPT_NONE=y`` is the recommended setting for
server workloads.

Further Considerations
======================

The following additional settings can help tune the system for
specific workloads and reduce jitter:

tuned network-* profiles
------------------------

The `tuned <https://tuned-project.org/>`_ project offers various profiles to
optimize for deterministic performance at the cost of increased power consumption,
for example, ``network-latency`` and ``network-throughput``. To enable
the former, run:

.. code-block:: shell-session

    tuned-adm profile network-latency

Set CPU governor to performance
-------------------------------

CPU frequency scaling up and down can impact latency tests and lead to sub-optimal
performance. To achieve maximum consistent performance, set the CPU governor
to ``performance``:

.. code-block:: bash

    for CPU in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
        echo performance > $CPU
    done

Stop ``irqbalance`` and pin the NIC interrupts to specific CPUs
---------------------------------------------------------------

In case you are running ``irqbalance``, consider disabling it as it might
migrate the NIC's IRQ handling among CPUs and can therefore cause non-deterministic
performance:

.. code-block:: shell-session

    killall irqbalance

We highly recommend pinning the NIC interrupts to specific CPUs in order to
allow for maximum workload isolation!

See `this script <https://github.com/borkmann/netperf_scripts/blob/master/set_irq_affinity>`_
for details and initial pointers on how to achieve this. Note that pinning the
queues can potentially vary in setup between different drivers.

We generally also recommend checking the various documentation and performance tuning
guides from NIC vendors on this matter, such as those from
`Mellanox <https://enterprise-support.nvidia.com/s/article/performance-tuning-for-mellanox-adapters>`_,
`Intel <https://www.intel.com/content/www/us/en/support/articles/000005811/network-and-i-o/ethernet-products.html>`_,
or others for more information.
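To illustrate what such pinning scripts do under the hood: the CPU affinity of an
individual interrupt can be set through procfs. A minimal sketch, assuming the NIC
is ``eth0`` and one of its queue IRQs is number 128 (run as root; the actual IRQ
numbers vary per system and driver):

.. code-block:: shell-session

    # Discover the IRQ numbers used by the NIC's queues.
    grep eth0 /proc/interrupts

    # Pin IRQ 128 to CPU 2 so its handling no longer migrates between CPUs.
    echo 2 > /proc/irq/128/smp_affinity_list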