.. only:: not (epub or latex or html)

    WARNING: You are looking at unreleased Cilium documentation.
    Please use the official rendered version released here:
    https://docs.cilium.io

.. _performance_report:

*************************
CNI Performance Benchmark
*************************

Introduction
============

This chapter contains performance benchmark numbers for a variety of scenarios.
All tests are performed between containers running on two different bare metal
nodes connected back-to-back by a 100Gbit/s network interface. Upon popular
request, we have included performance numbers for Calico for comparison.

.. admonition:: Video
   :class: attention

   You can also watch Thomas Graf, Co-founder of Cilium, dive deep into this chapter
   in `eCHO episode 5: Network performance benchmarking <https://www.youtube.com/watch?v=2lGag_j4dIw&t=377s>`__.

.. tip::

    To achieve these performance results, follow the :ref:`performance_tuning`.

For more information on the used system and configuration, see
:ref:`test_hardware`. For more details on all tested configurations, see
:ref:`test_configurations`.

The following metrics are collected and reported. Each metric represents a
different traffic pattern that can be required by workloads. See the specific
sections for an explanation of what type of workloads are represented by each
benchmark.

Throughput
    Maximum transfer rate via a single TCP connection and the total transfer rate
    of 32 accumulated connections.

Request/Response Rate
    The number of request/response messages per second that can be transmitted over
    a single TCP connection and over 32 parallel TCP connections.

Connection Rate
    The number of connections per second that can be established in sequence with
    a single request/response payload message transmitted for each new connection. A
    single process and 32 parallel processes are tested.

For the various benchmarks, `netperf`_ has been used to generate the workloads
and to collect the metrics. For spawning parallel netperf sessions,
`super_netperf <https://raw.githubusercontent.com/borkmann/netperf_scripts/master/super_netperf>`_
has been used. Both netperf and super_netperf are frequently used and well
established benchmarking tools in the Linux kernel networking community.
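For orientation, the following sketch shows the flavor of netperf invocations
that map to the three metrics above. These are standard netperf options; the
address ``10.0.0.2`` is a placeholder for the IP of the peer container, and the
exact flags used to produce this report may differ:

.. code-block:: shell-session

    # Throughput: maximum transfer rate over a single TCP stream
    $ netperf -H 10.0.0.2 -t TCP_STREAM -l 60

    # Request/response rate: 1-byte payloads ping-ponged over one connection
    $ netperf -H 10.0.0.2 -t TCP_RR -l 60 -- -r 1,1

    # Connection rate: a new TCP connection for each request/response round-trip
    $ netperf -H 10.0.0.2 -t TCP_CRR -l 60 -- -r 1,1

    # 32 parallel sessions via super_netperf, e.g. the multi-stream throughput test
    $ ./super_netperf 32 -H 10.0.0.2 -t TCP_STREAM -l 60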
.. _benchmark_throughput:

TCP Throughput (TCP_STREAM)
===========================

Throughput testing (TCP_STREAM) is useful to understand the maximum throughput
that can be achieved with a particular configuration. All or most configurations
can achieve line-rate or close to line-rate if enough CPU resources are thrown
at the load. It is therefore important to understand the amount of CPU resources
required to achieve a certain throughput, as these CPU resources will no longer
be available to workloads running on the machine.

This test represents bulk data transfer workloads, e.g. streaming services or
services performing data upload/download.

Single-Stream
-------------

In this test, a single TCP stream is opened between the containers and maximum
throughput is achieved:

.. image:: images/bench_tcp_stream_1_stream.png

We can see that eBPF-based solutions can outperform even the node-to-node
baseline on modern kernels despite performing additional work (forwarding
into the network namespace of the container, policy enforcement, ...). This is
because eBPF is capable of bypassing the iptables layer of the node, which is
still traversed for the node-to-node baseline.

The following graph shows the total CPU consumption across the entire system
while running the benchmark, normalized to a throughput of 50Gbit/s:

.. image:: images/bench_tcp_stream_1_stream_cpu.png

.. tip::

    **Kernel wisdom:** TCP flow performance is limited by the receiver, since
    the sender can offload segmentation work using TSO super-packets. This can
    be observed in the increased CPU usage on the server side above.
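As an aside, whether TSO (and its receive-side counterpart GRO) is active on an
interface can be inspected with ``ethtool``. This is a generic Linux check, not
part of the benchmark suite; ``eth0`` is a placeholder for the actual interface
name:

.. code-block:: shell-session

    $ ethtool -k eth0 | grep -E 'tcp-segmentation-offload|generic-receive-offload'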
Multi-Stream
------------

In this test, 32 processes open 32 parallel TCP connections. Each process
attempts to reach maximum throughput, and the total is reported:

.. image:: images/bench_tcp_stream_32_streams.png

Given that multiple processes are being used, all test configurations can achieve
transfer rates close to the line-rate of the network interface. The main
difference is the amount of CPU resources required to achieve it:

.. image:: images/bench_tcp_stream_32_streams_cpu.png

.. _request_response:

Request/Response Rate (TCP_RR)
==============================

The request/response rate (TCP_RR) primarily measures the latency and
efficiency of handling round-trip forwarding of an individual network packet.
This benchmark drives the highest possible packet rate on the wire and stresses
the per-packet processing cost. It is the opposite of the throughput test,
which maximizes the size of each network packet.

A configuration that does well in this test (delivering high requests-per-second
rates) will also deliver better (lower) network latencies.

This test represents services which maintain persistent connections and exchange
request/response type interactions with other services. This is common for services
using REST or gRPC APIs.

1 Process
---------

In this test, a single TCP connection is opened between the containers and a
single byte is sent back and forth between them. For each round-trip, one
request is counted:

.. image:: images/bench_tcp_rr_1_process.png

eBPF on modern kernels can achieve almost the same request/response rate as the
baseline while only consuming marginally more CPU resources:

.. image:: images/bench_tcp_rr_1_process_cpu.png

32 Processes
------------

In this test, 32 processes open 32 parallel TCP connections. Each process
performs single-byte round-trips. The total number of requests per second is
reported:

.. image:: images/bench_tcp_rr_32_processes.png

Cilium can achieve close to 1M requests/s in this test while consuming about 30%
of the system resources on both the sender and receiver:

.. image:: images/bench_tcp_rr_32_processes_cpu.png

Connection Rate (TCP_CRR)
=========================

The connection rate (TCP_CRR) test measures the efficiency of handling new
connections. It is similar to the request/response rate test but creates a new
TCP connection for each round-trip. This measures the cost of establishing a
connection, transmitting a byte in both directions, and closing the connection.
It is more expensive than the TCP_RR test and puts stress on the cost related
to handling new connections.

This test represents a workload that receives or initiates a lot of TCP
connections. An example where this is the case is a publicly exposed service
that receives connections from many clients. Good examples of this are L4
proxies or services opening many connections to external endpoints. This
benchmark puts the most stress on the system with the least work offloaded to
hardware, so we can expect to see the biggest difference between tested
configurations.

A configuration that does well in this test (delivering high connection rates)
will handle situations with overwhelming connection rates much better, leaving
more CPU resources available to workloads on the system.

1 Process
---------

In this test, a single process opens as many TCP connections as possible in
sequence:

.. image:: images/bench_tcp_crr_1_process.png

The following graph shows the total CPU consumption across the entire system
while running the benchmark:

.. image:: images/bench_tcp_crr_1_process_cpu.png

.. tip::

    **Kernel wisdom:** The CPU resources graph makes it obvious that some
    additional kernel cost is paid at the sender as soon as network namespace
    isolation is performed, as all container workload benchmarks show signs of
    this cost. We will investigate and optimize this aspect in a future release.

32 Processes
------------

In this test, 32 processes running in parallel open as many TCP connections in
sequence as possible. This is by far the most stressful test for the system:

.. image:: images/bench_tcp_crr_32_processes.png

This benchmark outlines major differences between the tested configurations. In
particular, it illustrates the overall cost of iptables, which is optimized to
perform most of the required work once per connection and then cache the
result. This leads to a worst-case performance scenario when a lot of new
connections are expected.

.. note::

    We have not been able to measure stable results for the Calico eBPF
    datapath and are not sure why; the network packet flow was never steady. We
    have thus not included the result. We invite the Calico team to work with
    us to investigate this and re-test.

The following graph shows the total CPU consumption across the entire system
while running the benchmark:

.. image:: images/bench_tcp_crr_32_processes_cpu.png
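To make the per-connection cost described above more tangible, the growth of
the kernel's connection tracking table can be watched on a node while the
TCP_CRR benchmark is running. This is an illustrative aside rather than part of
the benchmark suite, and it assumes iptables connection tracking is active and
the ``conntrack`` tool is installed:

.. code-block:: shell-session

    # number of currently tracked connections
    $ sudo conntrack -C

    # the same counter plus the configured ceiling, via procfs
    $ cat /proc/sys/net/netfilter/nf_conntrack_count
    $ cat /proc/sys/net/netfilter/nf_conntrack_max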
Encryption (WireGuard/IPsec)
============================

Cilium supports encryption via WireGuard® and IPsec. This first section will
look at WireGuard and compare it against using Calico for WireGuard encryption.
If you are interested in IPsec performance and how it compares to WireGuard,
please see :ref:`performance_wireguard_ipsec`.

WireGuard Throughput
--------------------

Looking at TCP throughput first, the following graph shows results for both
1500 bytes MTU and 9000 bytes MTU:

.. image:: images/bench_wireguard_tcp_1_stream.png

.. note::

    The Cilium eBPF kube-proxy replacement combined with WireGuard is currently
    slightly slower than Cilium eBPF + kube-proxy. We have identified the
    problem and will be resolving this deficit in one of the next releases.

The following graph shows the total CPU consumption across the entire system
while running the WireGuard encryption benchmark:

.. image:: images/bench_wireguard_tcp_1_stream_cpu.png

WireGuard Request/Response
--------------------------

The next benchmark measures the request/response rate while encrypting with
WireGuard. See :ref:`request_response` for details on what this test actually
entails.

.. image:: images/bench_wireguard_rr_1_process.png

All tested configurations performed more or less the same. The following graph
shows the total CPU consumption across the entire system while running the
WireGuard encryption benchmark:

.. image:: images/bench_wireguard_rr_1_process_cpu.png

.. _performance_wireguard_ipsec:

WireGuard vs IPsec
------------------

In this section, we compare Cilium encryption using WireGuard and IPsec.
WireGuard is able to achieve a higher maximum throughput:

.. image:: images/bench_wireguard_ipsec_tcp_stream_1_stream.png

However, looking at the CPU resources required to achieve 10Gbit/s of
throughput, WireGuard is less efficient than IPsec:

.. image:: images/bench_wireguard_ipsec_tcp_stream_1_stream_cpu.png

.. tip::

    IPsec performing better than WireGuard in this test is unexpected in some
    ways. A possible explanation is that the IPsec encryption is making use of
    AES-NI instructions whereas the WireGuard implementation is not. This would
    typically lead to IPsec being more efficient when AES-NI offload is
    available and WireGuard being more efficient if the instruction set is not
    available.
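Whether the AES-NI instruction set mentioned above is available on a given
machine can be verified from the CPU flags. This is a generic Linux check,
independent of Cilium; ``aes`` is printed if the instruction set is present:

.. code-block:: shell-session

    $ grep -m1 -o aes /proc/cpuinfo
    aes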
Looking at the request/response rate, IPsec is outperforming WireGuard in our
tests. Unlike for the throughput tests, the MTU does not have any effect as the
packet sizes remain small:

.. image:: images/bench_wireguard_ipsec_tcp_rr_1_process.png
.. image:: images/bench_wireguard_ipsec_tcp_rr_1_process_cpu.png

Test Environment
================

.. _test_hardware:

Test Hardware
-------------

All tests are performed using regular off-the-shelf hardware.

============ ======================================================================================================================================================
Item         Description
============ ======================================================================================================================================================
CPU          `AMD Ryzen 9 3950X <https://www.amd.com/en/support/cpu/amd-ryzen-processors/amd-ryzen-9-desktop-processors/amd-ryzen-9-3950x>`_, AM4 platform, 3.5GHz, 16 cores / 32 threads
Mainboard    `X570 Aorus Master <https://www.gigabyte.com/us/Motherboard/X570-AORUS-MASTER-rev-11-12/sp#sp>`_, PCIe 4.0 x16 support
Memory       `HyperX Fury DDR4-3200 <https://www.hyperxgaming.com/us/memory/fury-ddr4>`_ 128GB, XMP clocked to 3.2GHz
Network Card `Intel E810-CQDA2 <https://ark.intel.com/content/www/us/en/ark/products/192558/intel-ethernet-network-adapter-e810-cqda2.html>`_, dual port, 100Gbit/s per port, PCIe 4.0 x16
Kernel       Linux 5.10 LTS, see also :ref:`performance_tuning`
============ ======================================================================================================================================================

.. _test_configurations:

Test Configurations
-------------------

All tests are performed using a standardized configuration. Upon popular request,
we have included measurements for Calico for direct comparison.

============================ ===================================================================
Configuration Name           Description
============================ ===================================================================
Baseline (Node to Node)      No Kubernetes
Cilium                       Cilium 1.9.6, eBPF host-routing, kube-proxy replacement, No CT
Cilium (legacy host-routing) Cilium 1.9.6, legacy host-routing, kube-proxy replacement, No CT
Calico                       Calico 3.17.3, kube-proxy
Calico eBPF                  Calico 3.17.3, eBPF datapath, No CT
============================ ===================================================================
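For orientation, the ``Cilium`` configuration above roughly corresponds to a
kube-proxy-free installation with eBPF host-routing. With a recent Cilium Helm
chart, this could be approximated as sketched below, where
``bpf.hostLegacyRouting=false`` selects eBPF host-routing (the default on
supported kernels). Note that the report itself deployed Cilium 1.9.6 through
the Ansible playbooks described below, and Helm value names vary between
releases:

.. code-block:: shell-session

    $ helm repo add cilium https://helm.cilium.io/
    $ helm install cilium cilium/cilium --namespace kube-system \
          --set kubeProxyReplacement=true \
          --set bpf.hostLegacyRouting=false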
How to reproduce
================

To ease reproducibility, this report is paired with a set of scripts that can
be found in `cilium/cilium-perf-networking <https://github.com/cilium/cilium-perf-networking>`_.
All scripts in this document refer to this repository. Specifically, we use
`Terraform <https://www.terraform.io/>`_ and `Ansible
<https://www.ansible.com/>`_ to set up the environment and execute benchmarks.
We use `Packet <https://deploy.equinix.com/>`_ bare metal servers as our hardware
platform, but the guide is structured so that it can be easily adapted to other
environments.

Download the Cilium performance evaluation scripts:

.. code-block:: shell-session

    $ git clone https://github.com/cilium/cilium-perf-networking.git
    $ cd cilium-perf-networking

Packet Servers
--------------

To evaluate both :ref:`arch_overlay` and :ref:`native_routing`, we configure
the Packet machines to use a `"Mixed/Hybrid"
<https://deploy.equinix.com/developers/docs/metal/layer2-networking/overview/>`_ network
mode, where the secondary interfaces of the machines share a flat L2 network.
While this can be done via the Packet web UI, we include appropriate Terraform
(version 0.13) files to automate this process.

.. code-block:: shell-session

    $ cd terraform
    $ terraform init
    $ terraform apply -var 'packet_token=API_TOKEN' -var 'packet_project_id=PROJECT_ID'
    $ terraform output ansible_inventory | tee ../packet-hosts.ini
    $ cd ../

The above will provision two servers named ``knb-0`` and ``knb-1`` of type
``c3.small.x86`` and configure them to use a "Mixed/Hybrid" network mode under a
common VLAN named ``knb``. The machines will be provisioned with an
``ubuntu_20_04`` OS. We also create a ``packet-hosts.ini`` file to use as an
inventory file for Ansible.

Verify that the servers are successfully provisioned by executing an ad-hoc
``uptime`` command on them:

.. code-block:: shell-session

    $ cat packet-hosts.ini
    [master]
    136.144.55.223 ansible_python_interpreter=python3 ansible_user=root prv_ip=10.67.33.131 node_ip=10.33.33.10 master=knb-0
    [nodes]
    136.144.55.225 ansible_python_interpreter=python3 ansible_user=root prv_ip=10.67.33.133 node_ip=10.33.33.11
    $ ansible -i packet-hosts.ini all -m shell -a 'uptime'
    136.144.55.223 | CHANGED | rc=0 >>
    09:31:43 up 33 min, 1 user, load average: 0.00, 0.00, 0.00
    136.144.55.225 | CHANGED | rc=0 >>
    09:31:44 up 33 min, 1 user, load average: 0.00, 0.00, 0.00

Next, we use the ``packet-disbond.yaml`` playbook to configure the network
interfaces of the machines. This will destroy the ``bond0`` interface and
configure the first physical interface with the public and private IPs
(``prv_ip``) and the second with the node IP (``node_ip``) that will be used
for our evaluations (see the `Packet documentation
<https://deploy.equinix.com/developers/docs/metal/layer2-networking/overview/>`_ and our
scripts for more info).

.. code-block:: shell-session

    $ ansible-playbook -i packet-hosts.ini playbooks/packet-disbond.yaml

.. note::

    For hardware platforms other than Packet, users need to provide their own
    inventory file (``packet-hosts.ini``) and follow the subsequent steps.

Install Required Software
-------------------------

Install netperf (used for raw host-to-host measurements):

.. code-block:: shell-session

    $ ansible-playbook -i packet-hosts.ini playbooks/install-misc.yaml
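At this point, an optional host-to-host sanity check can confirm that the nodes
reach each other over the benchmark interfaces before any Kubernetes components
are installed. This step is not part of the playbooks; it uses the ``node_ip``
addresses from the inventory above:

.. code-block:: shell-session

    # on knb-1: start the netperf server
    $ netserver

    # on knb-0: run a short throughput test against knb-1's node IP
    $ netperf -H 10.33.33.11 -t TCP_STREAM -l 10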
Install ``kubeadm`` and its dependencies:

.. code-block:: shell-session

    $ ansible-playbook -i packet-hosts.ini playbooks/install-kubeadm.yaml

We use `kubenetbench <https://github.com/cilium/kubenetbench>`_ to execute the
`netperf`_ benchmark in a Kubernetes environment. kubenetbench is a Kubernetes
benchmarking project that is agnostic to the CNI or networking plugin that the
cluster is deployed with. In this report we focus on pod-to-pod communication
between different nodes. To install kubenetbench:

.. code-block:: shell-session

    $ ansible-playbook -i packet-hosts.ini playbooks/install-kubenetbench.yaml

.. _netperf: https://github.com/HewlettPackard/netperf

Running Benchmarks
------------------

.. _tunneling_results:

Tunneling
~~~~~~~~~

Configure Cilium in tunneling (:ref:`arch_overlay`) mode and run the benchmark
suite:

.. code-block:: shell-session

    $ ansible-playbook -e mode=tunneling -i packet-hosts.ini playbooks/install-k8s-cilium.yaml
    $ ansible-playbook -e conf=vxlan -i packet-hosts.ini playbooks/run-kubenetbench.yaml

The first command configures Cilium to use tunneling (``-e mode=tunneling``),
which by default uses the VXLAN overlay. The second executes our benchmark
suite (the ``conf`` variable is used to identify this benchmark run). Once
execution is done, the results are copied back into a folder named after the
``conf`` variable (in this case, ``vxlan``). This directory includes all the
benchmark results as generated by kubenetbench, including netperf output and
system information.

.. _native_routing_results:

Native Routing
~~~~~~~~~~~~~~

We repeat the same operation as before, but configure Cilium to use
:ref:`native_routing` (``-e mode=directrouting``):

.. code-block:: shell-session

    $ ansible-playbook -e mode=directrouting -i packet-hosts.ini playbooks/install-k8s-cilium.yaml
    $ ansible-playbook -e conf=routing -i packet-hosts.ini playbooks/run-kubenetbench.yaml

.. _encryption_results:

Encryption
~~~~~~~~~~

To use encryption with native routing:

.. code-block:: shell-session

    $ ansible-playbook -e kubeproxyfree=disabled -e mode=directrouting -e encryption=yes -i packet-hosts.ini playbooks/install-k8s-cilium.yaml
    $ ansible-playbook -e conf=encryption-routing -i packet-hosts.ini playbooks/run-kubenetbench.yaml

Baseline
~~~~~~~~

To have a point of reference for our results, we execute the same benchmarks
between hosts without Kubernetes running. This provides an effective upper
limit to the performance achievable by Cilium:

.. code-block:: shell-session

    $ ansible-playbook -i packet-hosts.ini playbooks/reset-kubeadm.yaml
    $ ansible-playbook -i packet-hosts.ini playbooks/run-rawnetperf.yaml

The first command removes Kubernetes and reboots the machines to ensure that
there are no residues in the systems, whereas the second executes the same set
of benchmarks between hosts. An alternative would be to run the raw benchmark
before setting up Cilium, in which case only the second command would be
needed.

Cleanup
-------

When done with benchmarking, the allocated Packet resources can be released with:

.. code-block:: shell-session

    $ cd terraform && terraform destroy -var 'packet_token=API_TOKEN' -var 'packet_project_id=PROJECT_ID'