.. only:: not (epub or latex or html)

    WARNING: You are looking at unreleased Cilium documentation.
    Please use the official rendered version released here:
    https://docs.cilium.io

.. _performance_report:

*************************
CNI Performance Benchmark
*************************

Introduction
============

This chapter contains performance benchmark numbers for a variety of scenarios.
All tests are performed between containers running on two different bare metal
nodes connected back-to-back by a 100Gbit/s network interface. Upon popular
request, we have included performance numbers for Calico for comparison.

.. admonition:: Video
   :class: attention

   You can also watch Thomas Graf, Co-founder of Cilium, dive deep into this chapter
   in `eCHO episode 5: Network performance benchmarking <https://www.youtube.com/watch?v=2lGag_j4dIw&t=377s>`__.

.. tip::

    To achieve these performance results, follow the :ref:`performance_tuning`.

For more information on the used system and configuration, see
:ref:`test_hardware`. For more details on all tested configurations, see
:ref:`test_configurations`.

The following metrics are collected and reported. Each metric represents a
different traffic pattern that can be required by workloads. See the specific
sections for an explanation of what type of workloads are represented by each
benchmark.

Throughput
    Maximum transfer rate via a single TCP connection and the total transfer rate
    of 32 accumulated connections.

Request/Response Rate
    The number of request/response messages per second that can be transmitted over
    a single TCP connection and over 32 parallel TCP connections.

Connection Rate
    The number of connections per second that can be established in sequence with
    a single request/response payload message transmitted for each new connection. A
    single process and 32 parallel processes are tested.

For the various benchmarks, `netperf`_ has been used to generate the workloads
and to collect the metrics. For spawning parallel netperf sessions,
`super_netperf <https://raw.githubusercontent.com/borkmann/netperf_scripts/master/super_netperf>`_
has been used. Both netperf and super_netperf are frequently used and well
established benchmarking tools in the Linux kernel networking community.
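For orientation, the following sketch shows the flavor of netperf invocations
that map to the three metrics above. These are standard netperf options; the
address ``10.0.0.2`` is a placeholder for the IP of the peer container, and the
exact flags used to produce this report may differ:

.. code-block:: shell-session

    # Throughput: maximum transfer rate over a single TCP stream
    $ netperf -H 10.0.0.2 -t TCP_STREAM -l 60

    # Request/response rate: 1-byte payloads ping-ponged over one connection
    $ netperf -H 10.0.0.2 -t TCP_RR -l 60 -- -r 1,1

    # Connection rate: a new TCP connection for each request/response round-trip
    $ netperf -H 10.0.0.2 -t TCP_CRR -l 60 -- -r 1,1

    # 32 parallel sessions via super_netperf, e.g. the multi-stream throughput test
    $ ./super_netperf 32 -H 10.0.0.2 -t TCP_STREAM -l 60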
.. _benchmark_throughput:

TCP Throughput (TCP_STREAM)
===========================

Throughput testing (TCP_STREAM) is useful to understand the maximum throughput
that can be achieved with a particular configuration. All or most configurations
can achieve line-rate or close to line-rate if enough CPU resources are thrown
at the load. It is therefore important to understand the amount of CPU resources
required to achieve a certain throughput, as these CPU resources will no longer
be available to workloads running on the machine.

This test represents bulk data transfer workloads, e.g. streaming services or
services performing data upload/download.

Single-Stream
-------------

In this test, a single TCP stream is opened between the containers and maximum
throughput is achieved:

.. image:: images/bench_tcp_stream_1_stream.png

We can see that eBPF-based solutions can outperform even the node-to-node
baseline on modern kernels despite performing additional work (forwarding
into the network namespace of the container, policy enforcement, ...). This is
because eBPF is capable of bypassing the iptables layer of the node, which is
still traversed for the node-to-node baseline.

The following graph shows the total CPU consumption across the entire system
while running the benchmark, normalized to a throughput of 50Gbit/s:

.. image:: images/bench_tcp_stream_1_stream_cpu.png

.. tip::

    **Kernel wisdom:** TCP flow performance is limited by the receiver, since
    the sender can offload segmentation work using TSO super-packets. This can
    be observed in the increased CPU usage on the server side above.
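As an aside, whether TSO (and its receive-side counterpart GRO) is active on an
interface can be inspected with ``ethtool``. This is a generic Linux check, not
part of the benchmark suite; ``eth0`` is a placeholder for the actual interface
name:

.. code-block:: shell-session

    $ ethtool -k eth0 | grep -E 'tcp-segmentation-offload|generic-receive-offload'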
Multi-Stream
------------

In this test, 32 processes open 32 parallel TCP connections. Each process
attempts to reach maximum throughput, and the total is reported:

.. image:: images/bench_tcp_stream_32_streams.png

Given that multiple processes are being used, all test configurations can achieve
transfer rates close to the line-rate of the network interface. The main
difference is the amount of CPU resources required to achieve it:

.. image:: images/bench_tcp_stream_32_streams_cpu.png

.. _request_response:

Request/Response Rate (TCP_RR)
==============================

The request/response rate (TCP_RR) primarily measures the latency and
efficiency of handling round-trip forwarding of an individual network packet.
This benchmark drives the highest possible packet rate on the wire and stresses
the per-packet processing cost. It is the opposite of the throughput test,
which maximizes the size of each network packet.

A configuration that does well in this test (delivering high requests-per-second
rates) will also deliver better (lower) network latencies.

This test represents services which maintain persistent connections and exchange
request/response type interactions with other services. This is common for services
using REST or gRPC APIs.

1 Process
---------

In this test, a single TCP connection is opened between the containers and a
single byte is sent back and forth between them. For each round-trip, one
request is counted:

.. image:: images/bench_tcp_rr_1_process.png

eBPF on modern kernels can achieve almost the same request/response rate as the
baseline while only consuming marginally more CPU resources:

.. image:: images/bench_tcp_rr_1_process_cpu.png

32 Processes
------------

In this test, 32 processes open 32 parallel TCP connections. Each process
performs single-byte round-trips. The total number of requests per second is
reported:

.. image:: images/bench_tcp_rr_32_processes.png

Cilium can achieve close to 1M requests/s in this test while consuming about 30%
of the system resources on both the sender and receiver:

.. image:: images/bench_tcp_rr_32_processes_cpu.png

Connection Rate (TCP_CRR)
=========================

The connection rate (TCP_CRR) test measures the efficiency of handling new
connections. It is similar to the request/response rate test but creates a new
TCP connection for each round-trip. This measures the cost of establishing a
connection, transmitting a byte in both directions, and closing the connection.
It is more expensive than the TCP_RR test and puts stress on the cost related
to handling new connections.

This test represents a workload that receives or initiates a lot of TCP
connections. An example where this is the case is a publicly exposed service
that receives connections from many clients. Good examples of this are L4
proxies or services opening many connections to external endpoints. This
benchmark puts the most stress on the system with the least work offloaded to
hardware, so we can expect to see the biggest difference between tested
configurations.

A configuration that does well in this test (delivering high connection rates)
will handle situations with overwhelming connection rates much better, leaving
more CPU resources available to workloads on the system.

1 Process
---------

In this test, a single process opens as many TCP connections as possible in
sequence:

.. image:: images/bench_tcp_crr_1_process.png

The following graph shows the total CPU consumption across the entire system
while running the benchmark:

.. image:: images/bench_tcp_crr_1_process_cpu.png

.. tip::

    **Kernel wisdom:** The CPU resources graph makes it obvious that some
    additional kernel cost is paid at the sender as soon as network namespace
    isolation is performed, as all container workload benchmarks show signs of
    this cost. We will investigate and optimize this aspect in a future release.

32 Processes
------------

In this test, 32 processes running in parallel open as many TCP connections in
sequence as possible. This is by far the most stressful test for the system:

.. image:: images/bench_tcp_crr_32_processes.png

This benchmark outlines major differences between the tested configurations. In
particular, it illustrates the overall cost of iptables, which is optimized to
perform most of the required work once per connection and then cache the
result. This leads to a worst-case performance scenario when a lot of new
connections are expected.

.. note::

    We have not been able to measure stable results for the Calico eBPF
    datapath and are not sure why; the network packet flow was never steady. We
    have thus not included the result. We invite the Calico team to work with
    us to investigate this and re-test.

The following graph shows the total CPU consumption across the entire system
while running the benchmark:

.. image:: images/bench_tcp_crr_32_processes_cpu.png
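To make the per-connection cost described above more tangible, the growth of
the kernel's connection tracking table can be watched on a node while the
TCP_CRR benchmark is running. This is an illustrative aside rather than part of
the benchmark suite, and it assumes iptables connection tracking is active and
the ``conntrack`` tool is installed:

.. code-block:: shell-session

    # number of currently tracked connections
    $ sudo conntrack -C

    # the same counter plus the configured ceiling, via procfs
    $ cat /proc/sys/net/netfilter/nf_conntrack_count
    $ cat /proc/sys/net/netfilter/nf_conntrack_max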
Encryption (WireGuard/IPsec)
============================

Cilium supports encryption via WireGuard® and IPsec. This first section will
look at WireGuard and compare it against using Calico for WireGuard encryption.
If you are interested in IPsec performance and how it compares to WireGuard,
please see :ref:`performance_wireguard_ipsec`.

WireGuard Throughput
--------------------

Looking at TCP throughput first, the following graph shows results for both
1500 bytes MTU and 9000 bytes MTU:

.. image:: images/bench_wireguard_tcp_1_stream.png

.. note::

    The Cilium eBPF kube-proxy replacement combined with WireGuard is currently
    slightly slower than Cilium eBPF + kube-proxy. We have identified the
    problem and will be resolving this deficit in one of the next releases.

The following graph shows the total CPU consumption across the entire system
while running the WireGuard encryption benchmark:

.. image:: images/bench_wireguard_tcp_1_stream_cpu.png

WireGuard Request/Response
--------------------------

The next benchmark measures the request/response rate while encrypting with
WireGuard. See :ref:`request_response` for details on what this test actually
entails.

.. image:: images/bench_wireguard_rr_1_process.png

All tested configurations performed more or less the same. The following graph
shows the total CPU consumption across the entire system while running the
WireGuard encryption benchmark:

.. image:: images/bench_wireguard_rr_1_process_cpu.png

.. _performance_wireguard_ipsec:

WireGuard vs IPsec
------------------

In this section, we compare Cilium encryption using WireGuard and IPsec.
WireGuard is able to achieve a higher maximum throughput:

.. image:: images/bench_wireguard_ipsec_tcp_stream_1_stream.png

However, looking at the CPU resources required to achieve 10Gbit/s of
throughput, WireGuard is less efficient than IPsec:

.. image:: images/bench_wireguard_ipsec_tcp_stream_1_stream_cpu.png

.. tip::

    IPsec performing better than WireGuard in this test is unexpected in some
    ways. A possible explanation is that the IPsec encryption is making use of
    AES-NI instructions whereas the WireGuard implementation is not. This would
    typically lead to IPsec being more efficient when AES-NI offload is
    available and WireGuard being more efficient if the instruction set is not
    available.
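Whether the AES-NI instruction set mentioned above is available on a given
machine can be verified from the CPU flags. This is a generic Linux check,
independent of Cilium; ``aes`` is printed if the instruction set is present:

.. code-block:: shell-session

    $ grep -m1 -o aes /proc/cpuinfo
    aes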
Looking at the request/response rate, IPsec is outperforming WireGuard in our
tests. Unlike for the throughput tests, the MTU does not have any effect as the
packet sizes remain small:

.. image:: images/bench_wireguard_ipsec_tcp_rr_1_process.png
.. image:: images/bench_wireguard_ipsec_tcp_rr_1_process_cpu.png

Test Environment
================

.. _test_hardware:

Test Hardware
-------------

All tests are performed using regular off-the-shelf hardware.

============ ======================================================================================================================================================
Item         Description
============ ======================================================================================================================================================
CPU          `AMD Ryzen 9 3950X <https://www.amd.com/en/support/cpu/amd-ryzen-processors/amd-ryzen-9-desktop-processors/amd-ryzen-9-3950x>`_, AM4 platform, 3.5GHz, 16 cores / 32 threads
Mainboard    `X570 Aorus Master <https://www.gigabyte.com/us/Motherboard/X570-AORUS-MASTER-rev-11-12/sp#sp>`_, PCIe 4.0 x16 support
Memory       `HyperX Fury DDR4-3200 <https://www.hyperxgaming.com/us/memory/fury-ddr4>`_ 128GB, XMP clocked to 3.2GHz
Network Card `Intel E810-CQDA2 <https://ark.intel.com/content/www/us/en/ark/products/192558/intel-ethernet-network-adapter-e810-cqda2.html>`_, dual port, 100Gbit/s per port, PCIe 4.0 x16
Kernel       Linux 5.10 LTS, see also :ref:`performance_tuning`
============ ======================================================================================================================================================

.. _test_configurations:

Test Configurations
-------------------

All tests are performed using a standardized configuration. Upon popular request,
we have included measurements for Calico for direct comparison.

============================ ===================================================================
Configuration Name           Description
============================ ===================================================================
Baseline (Node to Node)      No Kubernetes
Cilium                       Cilium 1.9.6, eBPF host-routing, kube-proxy replacement, No CT
Cilium (legacy host-routing) Cilium 1.9.6, legacy host-routing, kube-proxy replacement, No CT
Calico                       Calico 3.17.3, kube-proxy
Calico eBPF                  Calico 3.17.3, eBPF datapath, No CT
============================ ===================================================================
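For orientation, the ``Cilium`` configuration above roughly corresponds to a
kube-proxy-free installation with eBPF host-routing. With a recent Cilium Helm
chart, this could be approximated as sketched below, where
``bpf.hostLegacyRouting=false`` selects eBPF host-routing (the default on
supported kernels). Note that the report itself deployed Cilium 1.9.6 through
the Ansible playbooks described below, and Helm value names vary between
releases:

.. code-block:: shell-session

    $ helm repo add cilium https://helm.cilium.io/
    $ helm install cilium cilium/cilium --namespace kube-system \
          --set kubeProxyReplacement=true \
          --set bpf.hostLegacyRouting=false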
How to reproduce
================

To ease reproducibility, this report is paired with a set of scripts that can
be found in `cilium/cilium-perf-networking <https://github.com/cilium/cilium-perf-networking>`_.
All scripts in this document refer to this repository. Specifically, we use
`Terraform <https://www.terraform.io/>`_ and `Ansible
<https://www.ansible.com/>`_ to set up the environment and execute benchmarks.
We use `Packet <https://deploy.equinix.com/>`_ bare metal servers as our hardware
platform, but the guide is structured so that it can be easily adapted to other
environments.

Download the Cilium performance evaluation scripts:

.. code-block:: shell-session

    $ git clone https://github.com/cilium/cilium-perf-networking.git
    $ cd cilium-perf-networking

Packet Servers
--------------

To evaluate both :ref:`arch_overlay` and :ref:`native_routing`, we configure
the Packet machines to use a `"Mixed/Hybrid"
<https://deploy.equinix.com/developers/docs/metal/layer2-networking/overview/>`_ network
mode, where the secondary interfaces of the machines share a flat L2 network.
While this can be done via the Packet web UI, we include appropriate Terraform
(version 0.13) files to automate this process.

.. code-block:: shell-session

    $ cd terraform
    $ terraform init
    $ terraform apply -var 'packet_token=API_TOKEN' -var 'packet_project_id=PROJECT_ID'
    $ terraform output ansible_inventory | tee ../packet-hosts.ini
    $ cd ../

The above will provision two servers named ``knb-0`` and ``knb-1`` of type
``c3.small.x86`` and configure them to use a "Mixed/Hybrid" network mode under a
common VLAN named ``knb``. The machines will be provisioned with an
``ubuntu_20_04`` OS. We also create a ``packet-hosts.ini`` file to use as an
inventory file for Ansible.

Verify that the servers are successfully provisioned by executing an ad-hoc
``uptime`` command on them:

.. code-block:: shell-session

    $ cat packet-hosts.ini
    [master]
    136.144.55.223 ansible_python_interpreter=python3 ansible_user=root prv_ip=10.67.33.131 node_ip=10.33.33.10 master=knb-0
    [nodes]
    136.144.55.225 ansible_python_interpreter=python3 ansible_user=root prv_ip=10.67.33.133 node_ip=10.33.33.11
    $ ansible -i packet-hosts.ini all -m shell -a 'uptime'
    136.144.55.223 | CHANGED | rc=0 >>
    09:31:43 up 33 min, 1 user, load average: 0.00, 0.00, 0.00
    136.144.55.225 | CHANGED | rc=0 >>
    09:31:44 up 33 min, 1 user, load average: 0.00, 0.00, 0.00

Next, we use the ``packet-disbond.yaml`` playbook to configure the network
interfaces of the machines. This will destroy the ``bond0`` interface and
configure the first physical interface with the public and private IPs
(``prv_ip``) and the second with the node IP (``node_ip``) that will be used
for our evaluations (see the `Packet documentation
<https://deploy.equinix.com/developers/docs/metal/layer2-networking/overview/>`_ and our
scripts for more info).

.. code-block:: shell-session

    $ ansible-playbook -i packet-hosts.ini playbooks/packet-disbond.yaml

.. note::

    For hardware platforms other than Packet, users need to provide their own
    inventory file (``packet-hosts.ini``) and follow the subsequent steps.

Install Required Software
-------------------------

Install netperf (used for raw host-to-host measurements):

.. code-block:: shell-session

    $ ansible-playbook -i packet-hosts.ini playbooks/install-misc.yaml
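At this point, an optional host-to-host sanity check can confirm that the nodes
reach each other over the benchmark interfaces before any Kubernetes components
are installed. This step is not part of the playbooks; it uses the ``node_ip``
addresses from the inventory above:

.. code-block:: shell-session

    # on knb-1: start the netperf server
    $ netserver

    # on knb-0: run a short throughput test against knb-1's node IP
    $ netperf -H 10.33.33.11 -t TCP_STREAM -l 10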
Install ``kubeadm`` and its dependencies:

.. code-block:: shell-session

    $ ansible-playbook -i packet-hosts.ini playbooks/install-kubeadm.yaml

We use `kubenetbench <https://github.com/cilium/kubenetbench>`_ to execute the
`netperf`_ benchmark in a Kubernetes environment. kubenetbench is a Kubernetes
benchmarking project that is agnostic to the CNI or networking plugin that the
cluster is deployed with. In this report we focus on pod-to-pod communication
between different nodes. To install kubenetbench:

.. code-block:: shell-session

    $ ansible-playbook -i packet-hosts.ini playbooks/install-kubenetbench.yaml

.. _netperf: https://github.com/HewlettPackard/netperf

Running Benchmarks
------------------

.. _tunneling_results:

Tunneling
~~~~~~~~~

Configure Cilium in tunneling (:ref:`arch_overlay`) mode and run the benchmark
suite:

.. code-block:: shell-session

    $ ansible-playbook -e mode=tunneling -i packet-hosts.ini playbooks/install-k8s-cilium.yaml
    $ ansible-playbook -e conf=vxlan -i packet-hosts.ini playbooks/run-kubenetbench.yaml

The first command configures Cilium to use tunneling (``-e mode=tunneling``),
which by default uses the VXLAN overlay. The second executes our benchmark
suite (the ``conf`` variable is used to identify this benchmark run). Once
execution is done, the results are copied back into a folder named after the
``conf`` variable (in this case, ``vxlan``). This directory includes all the
benchmark results as generated by kubenetbench, including netperf output and
system information.

.. _native_routing_results:

Native Routing
~~~~~~~~~~~~~~

We repeat the same operation as before, but configure Cilium to use
:ref:`native_routing` (``-e mode=directrouting``):

.. code-block:: shell-session

    $ ansible-playbook -e mode=directrouting -i packet-hosts.ini playbooks/install-k8s-cilium.yaml
    $ ansible-playbook -e conf=routing -i packet-hosts.ini playbooks/run-kubenetbench.yaml

.. _encryption_results:

Encryption
~~~~~~~~~~

To use encryption with native routing:

.. code-block:: shell-session

    $ ansible-playbook -e kubeproxyfree=disabled -e mode=directrouting -e encryption=yes -i packet-hosts.ini playbooks/install-k8s-cilium.yaml
    $ ansible-playbook -e conf=encryption-routing -i packet-hosts.ini playbooks/run-kubenetbench.yaml

Baseline
~~~~~~~~

To have a point of reference for our results, we execute the same benchmarks
between hosts without Kubernetes running. This provides an effective upper
limit to the performance achievable by Cilium:

.. code-block:: shell-session

    $ ansible-playbook -i packet-hosts.ini playbooks/reset-kubeadm.yaml
    $ ansible-playbook -i packet-hosts.ini playbooks/run-rawnetperf.yaml

The first command removes Kubernetes and reboots the machines to ensure that
there are no residues in the systems, whereas the second executes the same set
of benchmarks between hosts. An alternative would be to run the raw benchmark
before setting up Cilium, in which case only the second command would be
needed.

Cleanup
-------

When done with benchmarking, the allocated Packet resources can be released with:

.. code-block:: shell-session

    $ cd terraform && terraform destroy -var 'packet_token=API_TOKEN' -var 'packet_project_id=PROJECT_ID'