.. only:: not (epub or latex or html)

    WARNING: You are looking at unreleased Cilium documentation.
    Please use the official rendered version released here:
    https://docs.cilium.io

.. _scalability_guide:

******************
Scalability report
******************

This report is intended for users planning to run Cilium on clusters with more
than 200 nodes in CRD mode (without a kvstore available). During our development
cycle we deployed Cilium on large clusters, and the following options were
suitable for our testing:

=====
Setup
=====

.. code-block:: shell-session

    helm template cilium \
      --namespace kube-system \
      --set endpointHealthChecking.enabled=false \
      --set healthChecking=false \
      --set ipam.mode=kubernetes \
      --set k8sServiceHost=<KUBE-APISERVER-LB-IP-ADDRESS> \
      --set k8sServicePort=<KUBE-APISERVER-LB-PORT-NUMBER> \
      --set prometheus.enabled=true \
      --set operator.prometheus.enabled=true \
      > cilium.yaml

* ``--set endpointHealthChecking.enabled=false`` and
  ``--set healthChecking=false`` disable endpoint health checking entirely.
  However, it is recommended that these features be enabled initially on a
  smaller cluster (3-10 nodes), where they can be used to detect potential
  packet loss caused by firewall rules or hypervisor settings.

* ``--set ipam.mode=kubernetes`` is used since our cloud provider has pod CIDR
  allocation enabled in ``kube-controller-manager``.

* ``--set k8sServiceHost`` and ``--set k8sServicePort`` were set to the IP
  address and port of the load balancer in front of ``kube-apiserver``.
  This allows Cilium to connect to ``kube-apiserver`` without depending on
  kube-proxy.

* ``--set prometheus.enabled=true`` and
  ``--set operator.prometheus.enabled=true`` were set because we had a
  Prometheus server scraping metrics in the entire cluster.

Our testing cluster consisted of 3 controller nodes and 1000 worker nodes.
We followed the recommended settings from the
`official Kubernetes documentation <https://kubernetes.io/docs/setup/best-practices/cluster-large/>`_
and provisioned our machines with the following settings:

* **Cloud provider**: Google Cloud

* **Controllers**: 3x n1-standard-32 (32vCPU, 120GB memory and 50GB SSD, kernel 5.4.0-1009-gcp)

* **Workers**: 1 pool of 1000x custom-2-4096 (2vCPU, 4GB memory and 10GB HDD, kernel 5.4.0-1009-gcp)

* **Metrics**: 1x n1-standard-32 (32vCPU, 120GB memory and 10GB HDD + 500GB HDD),
  a dedicated node for the Prometheus and Grafana pods.

.. note::

    All 3 controller nodes were behind a GCE load balancer.

    Each controller contained ``etcd``, ``kube-apiserver``,
    ``kube-controller-manager`` and ``kube-scheduler`` instances.

    The CPU, memory and disk size set for the workers might be different for
    your use case. You might have pods that require more memory or CPU, so you
    should design your workers based on your requirements.

During our testing we had to set the ``etcd`` option
``quota-backend-bytes=17179869184`` (16GiB) because ``etcd`` failed once it
reached around ``2GiB`` of allocated space.
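For reference, the quota is controlled by etcd's ``--quota-backend-bytes``
flag. The sketch below only illustrates where that flag goes; the remaining
flags, and the way etcd is launched, depend entirely on how your control plane
is provisioned:

.. code-block:: shell-session

    # Illustration only: raise the backend quota to 16GiB (17179869184 bytes).
    # Every other flag shown here is a placeholder for your own etcd setup.
    etcd --name controller-0 \
      --data-dir /var/lib/etcd \
      --quota-backend-bytes=17179869184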
We provisioned our worker nodes without ``kube-proxy``, since Cilium is capable
of performing all the functionality provided by ``kube-proxy``. We created a
load balancer in front of ``kube-apiserver`` to allow Cilium to access
``kube-apiserver`` without ``kube-proxy``, and configured Cilium with the
options ``--set k8sServiceHost=<KUBE-APISERVER-LB-IP-ADDRESS>``
and ``--set k8sServicePort=<KUBE-APISERVER-LB-PORT-NUMBER>``.

Our ``DaemonSet`` ``updateStrategy`` had ``maxUnavailable`` set to 250 pods
instead of 2, but this value highly depends on your requirements when you are
performing a rolling update of Cilium.

=====
Steps
=====

For each step we took, we provide more details below, with our findings and
expected behaviors.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1. Install Kubernetes v1.18.3 with EndpointSlice feature enabled
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To test the most up-to-date functionalities of Kubernetes and Cilium, we
performed our testing with Kubernetes v1.18.3 and the EndpointSlice feature
enabled to improve scalability.

Since Kubernetes requires an ``etcd`` cluster, we deployed v3.4.9.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2. Deploy Prometheus, Grafana and Cilium
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We used Prometheus v2.18.1 and Grafana v7.0.1 to retrieve and analyze
``etcd``, ``kube-apiserver``, ``cilium`` and ``cilium-operator`` metrics.

^^^^^^^^^^^^^^^^^^^^^^^^^^^
3. Provision 2 worker nodes
^^^^^^^^^^^^^^^^^^^^^^^^^^^

This helped us verify that our testing cluster was correctly provisioned
and that all metrics were being gathered.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
4. Deploy 5 namespaces with 25 deployments on each namespace
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* Each deployment had 1 replica (125 pods in total); a sketch of how such a
  workload can be generated is shown after the figures below.

* To measure **only** the resources consumed by Cilium, all deployments used
  the same base image ``registry.k8s.io/pause:3.2``. This image does not have
  any CPU or memory overhead.

* We provisioned a small number of pods in a small cluster to understand the
  CPU usage of Cilium:

.. figure:: images/image_4_01.png

    The mark shows when the creation of 125 pods started.
    As expected, we can see a slight increase in CPU usage, both in the running
    Cilium agents and in the Cilium operator. The agents peaked at 6.8% CPU
    usage on a 2vCPU machine.

.. figure:: images/image_4_02.png

    For the memory usage, we did not see significant memory growth in the
    Cilium agent. On the eBPF memory side, we do see it increasing due to the
    initialization of some eBPF maps for the new pods.
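The exact manifests used in the test are not reproduced in this report. As a
rough sketch only, the workload for this step could be generated along the
following lines; the deployment names are made up for illustration, and the
namespace names follow the ``namespace-1`` convention that appears in the
policy examples later on:

.. code-block:: shell-session

    # Sketch only: create 5 namespaces with 25 single-replica deployments each,
    # all running the pause image so the workload itself adds no CPU or memory
    # overhead. Names are illustrative, not the ones used in the test.
    for ns in $(seq 1 5); do
      kubectl create namespace "namespace-${ns}"
      for d in $(seq 1 25); do
        kubectl -n "namespace-${ns}" create deployment "pause-${d}" \
          --image=registry.k8s.io/pause:3.2
      done
    done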
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
5. Provision 998 additional nodes (total 1000 nodes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. figure:: images/image_5_01.png

    The first mark represents the action of creating nodes, the second mark the
    point when all 1000 Cilium pods were in the ready state. The CPU usage
    increase is expected, since each Cilium agent receives events from
    Kubernetes whenever a new node is provisioned in the cluster. Once all
    nodes were deployed, the CPU usage was 0.15% on average on a 2vCPU node.

.. figure:: images/image_5_02.png

    As we increased the number of nodes in the cluster to 1000, a small growth
    in memory usage is expected across all metrics. However, it is relevant to
    point out that **an increase in the number of nodes does not cause any
    significant increase in Cilium's memory consumption in either the control
    plane or the dataplane.**

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
6. Deploy 25 more deployments on each namespace
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This brings us to a total of
``5 namespaces * (25 old deployments + 25 new deployments) = 250`` deployments
in the entire cluster.
We did not install 250 deployments from the start since we only had 2 nodes,
and that would have created 125 pods on each worker node. According to the
Kubernetes documentation, the maximum recommended number of pods per node
is 100.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
7. Scale each deployment to 200 replicas (50000 pods in total)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Having 5 namespaces with 50 deployments each means that we have 250 unique
security identities. Keeping the cardinality of the labels selected by Cilium
low helps the cluster scale. By default, Cilium has a limit of 16k security
identities, but it can be increased with ``bpf-policy-map-max`` in the Cilium
``ConfigMap``. A sketch of the scale-up commands, and of how the resulting
identity count can be checked, is shown at the end of this step.

.. figure:: images/image_7_01.png

    The first mark represents the action of scaling up the deployments, the
    second mark the point when 50000 pods were in the ready state.

* It is expected to see the CPU usage of Cilium increase since, on each node,
  Cilium agents receive events from Kubernetes when a new pod is scheduled
  and started.

* The average CPU consumption of all Cilium agents was 3.38% on a 2vCPU
  machine. At one point, roughly around minute 15:23, one of those Cilium
  agents peaked at 27.94% CPU usage.

* Cilium Operator had a stable 5% CPU consumption while the pods were being
  created.

.. figure:: images/image_7_02.png

    Similar to the behavior seen while increasing the number of worker nodes,
    adding new pods also increases Cilium's memory consumption.

* As we increased the number of pods from 250 to 50000, we saw a maximum memory
  usage of 573MiB for one of the Cilium agents, while the average was 438MiB.
* For the eBPF memory usage we saw a maximum of 462.7MiB.
* This means that each **Cilium agent's memory increased by 10.5KiB per new pod
  in the cluster.**
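As an illustration only, and reusing the hypothetical namespace naming from the
earlier sketch rather than the exact objects used in the test, the scale-up and
a check of the resulting identity count could look like this:

.. code-block:: shell-session

    # Sketch only: scale every deployment in the 5 test namespaces to 200
    # replicas. Namespace names follow the illustrative convention used above.
    for ns in $(seq 1 5); do
      kubectl -n "namespace-${ns}" scale deployment --all --replicas=200
    done

    # In CRD mode, security identities are stored as CiliumIdentity objects,
    # so their number can be checked directly through the API server:
    kubectl get ciliumidentities --no-headers | wc -l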
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
8. Deploy 250 policies for 1 namespace
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Here we created 125 L4 network policies and 125 L7 policies. Each policy
selected all pods in this namespace and allowed them to send traffic to other
pods in the same namespace. Each of the 250 policies allows access to a
disjoint set of ports. In the end we have 250 different policies selecting
10000 pods.

.. code-block:: yaml

    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    metadata:
      name: "l4-rule-#"
      namespace: "namespace-1"
    spec:
      endpointSelector:
        matchLabels:
          my-label: testing
      fromEndpoints:
        matchLabels:
          my-label: testing
      egress:
      - toPorts:
        - ports:
          - port: "[0-125]+80"  # from 80 to 12580
            protocol: TCP
    ---
    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    metadata:
      name: "l7-rule-#"
      namespace: "namespace-1"
    spec:
      endpointSelector:
        matchLabels:
          my-label: testing
      fromEndpoints:
        matchLabels:
          my-label: testing
      ingress:
      - toPorts:
        - ports:
          - port: '[126-250]+80'  # from 12680 to 25080
            protocol: TCP
          rules:
            http:
            - method: GET
              path: "/path1$"
            - method: PUT
              path: "/path2$"
              headers:
              - 'X-My-Header: true'

.. figure:: images/image_8_01.png

    In this case we saw one of the Cilium agents jumping to 100% CPU usage for
    15 seconds, while the average peak was 40% during a period of 90 seconds.

.. figure:: images/image_8_02.png

    As expected, **increasing the number of policies does not have a
    significant impact on the memory usage of Cilium, since the eBPF policy
    maps have a constant size** once a pod is initialized.

.. figure:: images/image_8_03.png
.. figure:: images/image_8_04.png


The first mark represents the point in time when we ran ``kubectl create`` to
create the ``CiliumNetworkPolicies``. Since we created the 250 policies
sequentially, we cannot properly compute the convergence time. To do that,
we could use a single CNP with multiple policy rules defined under the
``specs`` field (instead of the ``spec`` field).

Nevertheless, we can see that the last Cilium agent incremented its Policy
Revision, which is incremented individually on each Cilium agent every time a
CiliumNetworkPolicy (CNP) is received, between ``15:45:44`` and ``15:45:46``,
and we can see when the last endpoint was regenerated by checking the 99th
percentile of the "Endpoint regeneration time". In this manner, we can see
that it took less than 5s. We can also verify that **the maximum time for an
endpoint to have the policy enforced was less than 600ms.**


^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
9. Deploy 250 policies for CiliumClusterwideNetworkPolicies (CCNP)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The difference between these policies and the ones installed previously is
that these select all pods in all namespaces. To recap, this means that we now
have **250 different network policies selecting 10000 pods and 250 different
network policies selecting 50000 pods on a cluster with 1000 nodes.** Similar
to the previous step, we deployed 125 L4 policies and another 125 L7 policies.
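The clusterwide policies have the same shape as the namespaced ones above,
except that the kind is ``CiliumClusterwideNetworkPolicy`` and, being
cluster-scoped, they carry no ``namespace`` field. As a rough sketch only (the
selector and port template are carried over from the namespaced example for
illustration, not the exact manifests used in the test), an L4 rule could look
like this:

.. code-block:: yaml

    apiVersion: "cilium.io/v2"
    kind: CiliumClusterwideNetworkPolicy
    metadata:
      # Cluster-scoped resource: no namespace field.
      name: "ccnp-l4-rule-#"
    spec:
      endpointSelector:
        matchLabels:
          my-label: testing
      egress:
      - toPorts:
        - ports:
          - port: "[0-125]+80"  # same port template as the namespaced example
            protocol: TCP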
.. figure:: images/image_9_01.png
.. figure:: images/image_9_02.png


Similar to the creation of the previous 250 CNPs, there was also an increase in
CPU usage during the creation of the CCNPs. The CPU usage was similar even
though the policies were effectively selecting more pods.

.. figure:: images/image_9_03.png

    As all pods running on a node are selected by **all 250 CCNPs created**, we
    see an increase in the **Endpoint regeneration time**, which **peaked a
    little above 3s.**


^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
10. "Accidentally" delete 10000 pods
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In this step we "accidentally" deleted 10000 random pods. Kubernetes then
recreates 10000 new pods, which helps us understand what the convergence time
is for all the deployed network policies.

.. figure:: images/image_10_01.png
.. figure:: images/image_10_02.png


* The first mark represents the point in time when pods were "deleted" and the
  second mark represents the point in time when Kubernetes finished recreating
  10k pods.

* Besides the CPU usage slightly increasing while pods are being scheduled in
  the cluster, we did see some interesting data points in the eBPF memory
  usage. As each endpoint can have one or more dedicated eBPF maps, the eBPF
  memory usage is directly proportional to the number of pods running on a
  node. **If the number of pods per node decreases, so does the eBPF memory
  usage.**

.. figure:: images/image_10_03.png

We inferred the time it took for all the endpoints to be regenerated by looking
at the number of Cilium endpoints with policy enforced over time.
Conveniently, we had another metric showing how many Cilium endpoints had
policy enforcement enabled:

.. figure:: images/image_10_04.png

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
11. Control plane metrics over the test run
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The focus of this test was to study the Cilium agent's resource consumption at
scale. However, we also monitored some metrics of the control plane nodes, such
as etcd metrics and the CPU usage of the Kubernetes controllers, and we present
them in the next figures.

.. figure:: images/image_11_01.png

    Memory consumption of the 3 etcd instances during the entire scalability
    testing.

.. figure:: images/image_11_02.png

    CPU usage of the 3 controller nodes, average latency per request type in
    the etcd cluster, as well as the number of operations per second made to
    etcd.

.. figure:: images/image_11_03.png

    All etcd metrics, from left to right, top to bottom: database size,
    disk sync duration, client traffic in, client traffic out, peer traffic in,
    peer traffic out.

=============
Final Remarks
=============

These experiments helped us develop a better understanding of Cilium running
in a large cluster entirely in CRD mode, without depending on etcd. There is
still some work to be done to optimize the memory footprint of the eBPF maps
even further, as well as to reduce the memory footprint of the Cilium agent.
We will address those in the next Cilium version.

We can also conclude that running Cilium in CRD mode scales to clusters with
more than 200 nodes. However, it is worth pointing out that we need to run
more tests to verify Cilium's behavior when it loses connectivity with
``kube-apiserver``, as can happen during a control plane upgrade, for example.
This will also be our focus in the next Cilium version.