.. only:: not (epub or latex or html)

    WARNING: You are looking at unreleased Cilium documentation.
    Please use the official rendered version released here:
    https://docs.cilium.io

.. _performance_tuning:

************
Tuning Guide
************

This guide helps you tune a Cilium installation for optimal performance.

Recommendation
==============

The default out-of-the-box deployment of Cilium is focused on maximum compatibility
rather than optimal performance. If you are a performance-conscious user, here
are the recommended settings for operating Cilium to get the best out of your setup.

.. note::
    In-place upgrade by just enabling the config settings on an existing
    cluster is not possible since these tunings change the underlying datapath
    fundamentals and therefore require Pod or even node restarts.

    The best way to consume this for an existing cluster is to utilize per-node
    configuration for enabling the tunings only on newly spawned nodes which join
    the cluster. See the :ref:`per-node-configuration` page for more details.

Each of the settings for the recommended performance profile is described in more
detail on this page and in this `KubeCon talk <https://sched.co/1R2s5>`__:

- netkit device mode
- eBPF host-routing
- BIG TCP for IPv4/IPv6
- Bandwidth Manager (optional, for BBR congestion control)

**Requirements:**

* Kernel >= 6.8
* Supported NICs for BIG TCP: mlx4, mlx5, ice

To enable the first three settings:

.. tabs::

    .. group-tab:: Helm

        .. parsed-literal::

            helm install cilium |CHART_RELEASE| \\
              --namespace kube-system \\
              --set routingMode=native \\
              --set bpf.datapathMode=netkit \\
              --set bpf.masquerade=true \\
              --set ipv6.enabled=true \\
              --set enableIPv6BIGTCP=true \\
              --set ipv4.enabled=true \\
              --set enableIPv4BIGTCP=true \\
              --set kubeProxyReplacement=true

To additionally enable BBR congestion control, consider adding the following
settings to the above Helm install:

.. tabs::

    .. group-tab:: Helm

        .. parsed-literal::

            --set bandwidthManager.enabled=true \\
            --set bandwidthManager.bbr=true

.. _netkit:

netkit device mode
==================

netkit devices provide connectivity for Pods with the goal of improving throughput
and latency for applications as if they resided directly in the host
namespace, meaning the datapath overhead for network namespaces is reduced down
to zero. The `netkit driver in the kernel <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/net/netkit.c>`__
has been specifically designed for Cilium's needs and replaces the old-style veth
device type. See also the `KubeCon talk on netkit <https://sched.co/1R2s5>`__ for
more details.

Cilium utilizes netkit in L3 device mode, blackholing traffic from the Pods
when there is no BPF program attached. The Pod-specific BPF programs are attached
inside the netkit peer device, and can only be managed from the host namespace
through Cilium. netkit in combination with eBPF-based host-routing achieves a
fast network namespace switch for off-node traffic ingressing into the Pod or
leaving the Pod. When netkit is enabled, Cilium also utilizes tcx for all
attachments to non-netkit devices. This is done for higher efficiency as well
as for utilizing BPF links for all Cilium attachments. netkit is available for kernel
6.8 and onwards and it also supports BIG TCP. Once the base kernels become more
ubiquitous, the veth device mode of Cilium will be deprecated.

To validate whether your installation is running with netkit, run ``cilium status``
in any of the Cilium Pods and look for the line reporting the status for
"Device Mode", which should state "netkit". Also, ensure that eBPF host-routing
is enabled - the reported status under "Host Routing" must state "BPF".

.. note::
    In-place upgrade by just enabling netkit on an existing cluster is not
    possible since the CNI plugin cannot simply replace veth with netkit after
    Pod creation. Also, running both flavors in parallel is currently not
    supported.

    The best way to consume this for an existing cluster is to utilize per-node
    configuration for enabling netkit on newly spawned nodes which join the
    cluster. See the :ref:`per-node-configuration` page for more details.

**Requirements:**

* Kernel >= 6.8
* Direct-routing configuration or tunneling
* eBPF host-routing

To enable netkit device mode with eBPF host-routing:

.. tabs::

    .. group-tab:: Helm

        .. parsed-literal::

            helm install cilium |CHART_RELEASE| \\
              --namespace kube-system \\
              --set routingMode=native \\
              --set bpf.datapathMode=netkit \\
              --set bpf.masquerade=true \\
              --set kubeProxyReplacement=true

.. _eBPF_Host_Routing:

eBPF Host-Routing
=================

Even when network routing is performed by Cilium using eBPF, by default network
packets still traverse some parts of the regular network stack of the node.
This ensures that all packets still traverse through all of the iptables hooks
in case you depend on them. However, these hooks add significant overhead. For exact
numbers from our test environment, see :ref:`benchmark_throughput` and compare
the results for "Cilium" and "Cilium (legacy host-routing)".
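The datapath mode and host-routing mode of a running cluster can also be inspected
from outside the Pods. The following is a minimal sketch, assuming ``kubectl``
access and Cilium deployed as the ``cilium`` DaemonSet in the ``kube-system``
namespace (the defaults; adjust if your installation differs):

.. code-block:: shell-session

    # Query the agent status from one of the Cilium Pods and extract the
    # datapath-related lines; "Host Routing: BPF" indicates eBPF host-routing,
    # "Device Mode: netkit" indicates netkit devices.
    kubectl -n kube-system exec ds/cilium -- \
        cilium status | grep -E 'Host Routing|Device Mode'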
We introduced `eBPF-based host-routing <https://cilium.io/blog/2020/11/10/cilium-19#veth>`_
in Cilium 1.9 to fully bypass iptables and the upper host stack, and to achieve
a faster network namespace switch compared to regular veth device operation.
This option is automatically enabled if your kernel supports it. To validate
whether your installation is running with eBPF host-routing, run ``cilium status``
in any of the Cilium pods and look for the line reporting the status for
"Host Routing", which should state "BPF".

**Requirements:**

* Kernel >= 5.10
* Direct-routing configuration or tunneling
* eBPF-based kube-proxy replacement
* eBPF-based masquerading

.. _ipv6_big_tcp:

IPv6 BIG TCP
============

IPv6 BIG TCP allows the network stack to prepare larger GSO (transmit) and GRO
(receive) packets to reduce the number of times the stack is traversed, which
improves performance and latency. It reduces the CPU load and helps achieve
higher speeds (i.e. 100Gbit/s and beyond).

To pass such packets through the stack, BIG TCP adds a temporary Hop-By-Hop header
after the IPv6 one which is stripped before transmitting the packet over the wire.

BIG TCP can operate in a DualStack setup: IPv4 packets will use the old, lower
limits (64k) if IPv4 BIG TCP is not enabled, and IPv6 packets will use the new,
larger ones (192k). Both IPv4 BIG TCP and IPv6 BIG TCP can be enabled so that
both use the larger one (192k).

Note that Cilium assumes the default kernel values for the GSO and GRO maximum sizes
are 64k and adjusts them only when necessary, i.e. if BIG TCP is enabled and the
current GSO/GRO maximum sizes are less than 192k, it will try to increase them;
respectively, when BIG TCP is disabled and the current maximum values are more
than 64k, it will try to decrease them.

BIG TCP doesn't require network interface MTU changes.
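Whether the larger limits are in effect on a node can be checked with iproute2,
which reports the per-device GSO/GRO maximum sizes. A minimal sketch, assuming
``eth0`` is the external-facing device (the default limits show up as 65536,
BIG TCP-sized ones as 196608):

.. code-block:: shell-session

    # Show the GSO/GRO maximum sizes currently configured on the device.
    ip -d link show dev eth0 | grep -oE '(gso|gro)(_ipv4)?_max_size [0-9]+'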
.. note::
    In-place upgrade by just enabling BIG TCP on an existing cluster is currently
    not possible since Cilium does not have access into Pods after they have been
    created.

    The best way to consume this for an existing cluster is to either restart Pods
    or to utilize per-node configuration for enabling BIG TCP on newly spawned nodes
    which join the cluster. See the :ref:`per-node-configuration` page for more
    details.

**Requirements:**

* Kernel >= 5.19
* eBPF Host-Routing
* eBPF-based kube-proxy replacement
* eBPF-based masquerading
* Tunneling and encryption disabled
* Supported NICs: mlx4, mlx5, ice

To enable IPv6 BIG TCP:

.. tabs::

    .. group-tab:: Helm

        .. parsed-literal::

            helm install cilium |CHART_RELEASE| \\
              --namespace kube-system \\
              --set routingMode=native \\
              --set bpf.masquerade=true \\
              --set ipv6.enabled=true \\
              --set enableIPv6BIGTCP=true \\
              --set kubeProxyReplacement=true

Note that after toggling the IPv6 BIG TCP option, the Kubernetes Pods must be
restarted for the changes to take effect.

To validate whether your installation is running with IPv6 BIG TCP,
run ``cilium status`` in any of the Cilium pods and look for the line
reporting the status for "IPv6 BIG TCP", which should state "enabled".

IPv4 BIG TCP
============

Similar to IPv6 BIG TCP, IPv4 BIG TCP allows the network stack to prepare larger
GSO (transmit) and GRO (receive) packets to reduce the number of times the stack
is traversed, which improves performance and latency. It reduces the CPU load and
helps achieve higher speeds (i.e. 100Gbit/s and beyond).

To pass such packets through the stack, BIG TCP sets the IPv4 tot_len to 0 and uses
skb->len as the real IPv4 total length. The proper IPv4 tot_len is set before
transmitting the packet over the wire.

BIG TCP can operate in a DualStack setup: IPv6 packets will use the old, lower
limits (64k) if IPv6 BIG TCP is not enabled, and IPv4 packets will use the new,
larger ones (192k). Both IPv4 BIG TCP and IPv6 BIG TCP can be enabled so that
both use the larger one (192k).

Note that Cilium assumes the default kernel values for the GSO and GRO maximum sizes
are 64k and adjusts them only when necessary, i.e. if BIG TCP is enabled and the
current GSO/GRO maximum sizes are less than 192k, it will try to increase them;
respectively, when BIG TCP is disabled and the current maximum values are more
than 64k, it will try to decrease them.

BIG TCP doesn't require network interface MTU changes.

.. note::
    In-place upgrade by just enabling BIG TCP on an existing cluster is currently
    not possible since Cilium does not have access into Pods after they have been
    created.

    The best way to consume this for an existing cluster is to either restart Pods
    or to utilize per-node configuration for enabling BIG TCP on newly spawned nodes
    which join the cluster. See the :ref:`per-node-configuration` page for more
    details.

**Requirements:**

* Kernel >= 6.3
* eBPF Host-Routing
* eBPF-based kube-proxy replacement
* eBPF-based masquerading
* Tunneling and encryption disabled
* Supported NICs: mlx4, mlx5, ice

To enable IPv4 BIG TCP:

.. tabs::

    .. group-tab:: Helm

        .. parsed-literal::

            helm install cilium |CHART_RELEASE| \\
              --namespace kube-system \\
              --set routingMode=native \\
              --set bpf.masquerade=true \\
              --set ipv4.enabled=true \\
              --set enableIPv4BIGTCP=true \\
              --set kubeProxyReplacement=true

Note that after toggling the IPv4 BIG TCP option, the Kubernetes Pods
must be restarted for the changes to take effect.
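One way to restart existing workloads so that they pick up the new limits is a
rolling restart of their controllers. A sketch, assuming a hypothetical
Deployment named ``my-app`` in the ``default`` namespace:

.. code-block:: shell-session

    # Recreate the Pods of the Deployment and wait until they are ready again.
    kubectl -n default rollout restart deployment/my-app
    kubectl -n default rollout status deployment/my-app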
To validate whether your installation is running with IPv4 BIG TCP,
run ``cilium status`` in any of the Cilium pods and look for the line
reporting the status for "IPv4 BIG TCP", which should state "enabled".

Bypass iptables Connection Tracking
===================================

For the case when eBPF Host-Routing cannot be used and thus network packets
still need to traverse the regular network stack in the host namespace,
iptables can add a significant cost. This traversal cost can be minimized
by disabling the connection tracking requirement for all Pod traffic, thus
bypassing the iptables connection tracker.

**Requirements:**

* Kernel >= 4.19.57, >= 5.1.16, >= 5.2
* Direct-routing configuration
* eBPF-based kube-proxy replacement
* eBPF-based masquerading or no masquerading

To enable the iptables connection-tracking bypass:

.. tabs::

    .. group-tab:: Cilium CLI

        .. parsed-literal::

            cilium install |CHART_VERSION| \\
              --set installNoConntrackIptablesRules=true \\
              --set kubeProxyReplacement=true

    .. group-tab:: Helm

        .. parsed-literal::

            helm install cilium |CHART_RELEASE| \\
              --namespace kube-system \\
              --set installNoConntrackIptablesRules=true \\
              --set kubeProxyReplacement=true

Hubble
======

Running with Hubble observability enabled can come at the expense of
performance. The overhead of Hubble is somewhere between 1-15%, depending
on your network traffic patterns and Hubble aggregation settings.

In order to optimize for maximum performance, Hubble can be disabled:

.. tabs::

    .. group-tab:: Cilium CLI

        .. code-block:: shell-session

            cilium hubble disable

    .. group-tab:: Helm

        .. parsed-literal::

            helm install cilium |CHART_RELEASE| \\
              --namespace kube-system \\
              --set hubble.enabled=false

You can also choose to stop exposing event types in which you
are not interested. For instance, if you are mainly interested in
dropped traffic, you can disable "trace" events, which will likely reduce
the overall CPU consumption of the agent:

.. tabs::

    .. group-tab:: Cilium CLI

        .. code-block:: shell-session

            cilium config TraceNotification=disable

    .. group-tab:: Helm

        .. parsed-literal::

            helm install cilium |CHART_RELEASE| \\
              --namespace kube-system \\
              --set bpf.events.trace.enabled=false

.. warning::

    Suppressing one or more event types will impact ``cilium monitor`` as well
    as Hubble observability capabilities, metrics, and exports.

MTU
===

The maximum transmission unit (MTU) can have a significant impact on the network
throughput of a configuration. Cilium will automatically detect the MTU of the
underlying network devices. Therefore, if your system is configured to use
jumbo frames, Cilium will automatically make use of them.

To benefit from this, make sure that your system is configured to use jumbo
frames if your network allows for it.

Bandwidth Manager
=================

Cilium's Bandwidth Manager is responsible for managing network traffic more
efficiently with the goal of improving overall application latency and throughput.

Aside from natively supporting Kubernetes Pod bandwidth annotations, the
`Bandwidth Manager <https://cilium.io/blog/2020/11/10/cilium-19#bwmanager>`_,
first introduced in Cilium 1.9, also sets up Fair Queue (FQ)
queueing disciplines to support TCP stack pacing (e.g. from EDT/BBR) on all
external-facing network devices, as well as setting optimal server-grade sysctl
settings for the networking stack.
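As an illustration of the Pod bandwidth annotations mentioned above, egress
bandwidth for a Pod is capped through the ``kubernetes.io/egress-bandwidth``
annotation, which the Bandwidth Manager enforces on the egress path. A minimal
sketch with a hypothetical Pod:

.. code-block:: yaml

    apiVersion: v1
    kind: Pod
    metadata:
      name: example-app          # hypothetical name
      annotations:
        # Limit egress traffic of this Pod to 10 Mbit/s.
        kubernetes.io/egress-bandwidth: "10M"
    spec:
      containers:
      - name: app
        image: busybox           # hypothetical example image
        command: ["sleep", "infinity"]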
**Requirements:**

* Kernel >= 5.1
* Direct-routing configuration or tunneling
* eBPF-based kube-proxy replacement

To enable the Bandwidth Manager:

.. tabs::

    .. group-tab:: Helm

        .. parsed-literal::

            helm install cilium |CHART_RELEASE| \\
              --namespace kube-system \\
              --set bandwidthManager.enabled=true \\
              --set kubeProxyReplacement=true

To validate whether your installation is running with the Bandwidth Manager,
run ``cilium status`` in any of the Cilium pods and look for the line
reporting the status for "BandwidthManager", which should state "EDT with BPF".

BBR congestion control for Pods
===============================

The base infrastructure around the MQ/FQ setup provided by Cilium's Bandwidth Manager
also allows for use of TCP `BBR congestion control <https://queue.acm.org/detail.cfm?id=3022184>`_
for Pods. BBR is particularly suitable when Pods are exposed behind Kubernetes
Services which face external clients from the Internet. BBR achieves higher
bandwidths and lower latencies for Internet traffic. For example, it has been
`shown <https://cloud.google.com/blog/products/networking/tcp-bbr-congestion-control-comes-to-gcp-your-internet-just-got-faster>`_
that BBR's throughput can reach as much as 2,700x higher than today's best
loss-based congestion control and queueing delays can be 25x lower.

In order for BBR to work reliably for Pods, it requires a 5.18 or higher kernel.
As outlined in our `Linux Plumbers 2021 talk <https://lpc.events/event/11/contributions/953/>`_,
this is needed since older kernels do not retain timestamps of network packets
when switching from the Pod to the host network namespace. Due to the latter, the kernel's
pacing infrastructure does not function properly in general (not specific to Cilium).
We helped fix this issue for recent kernels so that timestamps are retained and
BBR for Pods works.

BBR also needs eBPF Host-Routing in order to retain the network packet's socket
association all the way until the packet hits the FQ queueing discipline on the
physical device in the host namespace.

.. note::
    In-place upgrade by just enabling BBR on an existing cluster is not possible
    since Cilium cannot migrate existing sockets over to BBR congestion control.

    The best way to consume this is to either only enable it on newly built clusters,
    to restart Pods on existing clusters, or to utilize per-node configuration for
    enabling BBR on newly spawned nodes which join the cluster. See the
    :ref:`per-node-configuration` page for more details.

Note that the use of BBR could lead to a higher amount of TCP retransmissions
and more aggressive behavior towards TCP CUBIC connections.

**Requirements:**

* Kernel >= 5.18
* Bandwidth Manager
* eBPF Host-Routing

To enable the Bandwidth Manager with BBR for Pods:

.. tabs::

    .. group-tab:: Helm

        .. parsed-literal::

            helm install cilium |CHART_RELEASE| \\
              --namespace kube-system \\
              --set bandwidthManager.enabled=true \\
              --set bandwidthManager.bbr=true \\
              --set kubeProxyReplacement=true

To validate whether your installation is running with BBR for Pods,
run ``cilium status`` in any of the Cilium pods and look for the line
reporting the status for "BandwidthManager", which should then state
``EDT with BPF`` as well as ``[BBR]``.

XDP Acceleration
================

Cilium has built-in support for accelerating NodePort, LoadBalancer services
and services with externalIPs for the case where the arriving request needs
to be pushed back out of the node when the backend is located on a remote node.

In that case, the network packets do not need to be pushed all the way to the
upper networking stack; with the help of XDP, Cilium is able to process
those requests right at the network driver layer. This helps to reduce
latency and improve the scale-out of services, given that a single node's
forwarding capacity is dramatically increased. The kube-proxy replacement at the XDP layer is
`available from Cilium 1.8 <https://cilium.io/blog/2020/06/22/cilium-18#kubeproxy-removal>`_.

**Requirements:**

* Kernel >= 4.19.57, >= 5.1.16, >= 5.2
* Native XDP supported driver, check :ref:`our driver list <XDP acceleration>`
* Direct-routing configuration
* eBPF-based kube-proxy replacement

To enable XDP Acceleration, check out :ref:`our getting started guide
<XDP acceleration>` which also contains instructions for setting it
up on public cloud providers.

To validate whether your installation is running with XDP Acceleration,
run ``cilium status`` in any of the Cilium pods and look for the line
reporting the status for "XDP Acceleration", which should say "Native".

eBPF Map Sizing
===============

All eBPF maps are created with upper capacity limits. Insertion beyond the
limit would fail or constrain the scalability of the datapath. Cilium
uses auto-derived defaults based on the given ratio of the total system
memory.

However, the upper capacity limits used by the Cilium agent can be overridden
by advanced users. Please refer to the :ref:`bpf_map_limitations` guide.

Linux Kernel
============

In general, we highly recommend using the most recent LTS stable kernel (such
as >= 5.10) provided by the `kernel community <https://www.kernel.org/category/releases.html>`_
or by a downstream distribution of your choice. The newer the kernel, the more
likely it is that various datapath optimizations can be used.

In our Cilium release blogs, we also regularly highlight some of the eBPF-based
kernel work we conduct which implicitly helps Cilium's datapath performance,
such as `replacing retpolines with direct jumps in the eBPF JIT <https://cilium.io/blog/2020/02/18/cilium-17#upstream-linux>`_.

Moreover, the kernel allows you to configure several options which help maximize
network performance.

CONFIG_PREEMPT_NONE
-------------------

Run a kernel version with ``CONFIG_PREEMPT_NONE=y`` set. Some Linux
distributions offer kernel images with this option set, or you can re-compile
the Linux kernel. ``CONFIG_PREEMPT_NONE=y`` is the recommended setting for
server workloads.

Further Considerations
======================

The following additional settings can help tune the system for
specific workloads and reduce jitter:

tuned network-* profiles
------------------------

The `tuned <https://tuned-project.org/>`_ project offers various profiles to
optimize for deterministic performance at the cost of increased power consumption,
for example, ``network-latency`` and ``network-throughput``. To enable
the former, run:

.. code-block:: shell-session

    tuned-adm profile network-latency

Set CPU governor to performance
-------------------------------

CPU frequency scaling up and down can impact latency tests and lead to sub-optimal
performance. To achieve maximum consistent performance, set the CPU governor
to ``performance``:

.. code-block:: bash

    for CPU in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
        echo performance > $CPU
    done

Stop ``irqbalance`` and pin the NIC interrupts to specific CPUs
---------------------------------------------------------------

In case you are running ``irqbalance``, consider disabling it as it might
migrate the NIC's IRQ handling among CPUs and can therefore cause non-deterministic
performance:

.. code-block:: shell-session

    killall irqbalance

We highly recommend pinning the NIC interrupts to specific CPUs in order to
allow for maximum workload isolation!

See `this script <https://github.com/borkmann/netperf_scripts/blob/master/set_irq_affinity>`_
for details and initial pointers on how to achieve this. Note that pinning the
queues can potentially vary in setup between different drivers.

We generally also recommend checking the various documentation and performance tuning
guides from NIC vendors on this matter, such as those from
`Mellanox <https://enterprise-support.nvidia.com/s/article/performance-tuning-for-mellanox-adapters>`_,
`Intel <https://www.intel.com/content/www/us/en/support/articles/000005811/network-and-i-o/ethernet-products.html>`_,
or others for more information.
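To illustrate what such pinning scripts do under the hood: the CPU affinity of an
individual interrupt can be set through procfs. A minimal sketch, assuming the NIC
is ``eth0`` and one of its queue IRQs is number 128 (run as root; the actual IRQ
numbers vary per system and driver):

.. code-block:: shell-session

    # Discover the IRQ numbers used by the NIC's queues.
    grep eth0 /proc/interrupts

    # Pin IRQ 128 to CPU 2 so its handling no longer migrates between CPUs.
    echo 2 > /proc/irq/128/smp_affinity_list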