.. only:: not (epub or latex or html)

    WARNING: You are looking at unreleased Cilium documentation.
    Please use the official rendered version released here:
    https://docs.cilium.io

.. _scalability_guide:

******************
Scalability report
******************

This report is intended for users planning to run Cilium on clusters with more
than 200 nodes in CRD mode (without a kvstore available). During our development
cycle we deployed Cilium on large clusters, and the following options were
suitable for our testing:

=====
Setup
=====

.. code-block:: shell-session

    helm template cilium \
      --namespace kube-system \
      --set endpointHealthChecking.enabled=false \
      --set healthChecking=false \
      --set ipam.mode=kubernetes \
      --set k8sServiceHost=<KUBE-APISERVER-LB-IP-ADDRESS> \
      --set k8sServicePort=<KUBE-APISERVER-LB-PORT-NUMBER> \
      --set prometheus.enabled=true \
      --set operator.prometheus.enabled=true \
      > cilium.yaml

* ``--set endpointHealthChecking.enabled=false`` and
  ``--set healthChecking=false`` disable endpoint health checking entirely.
  However, it is recommended that these features be enabled initially on a
  smaller cluster (3-10 nodes), where they can be used to detect potential
  packet loss caused by firewall rules or hypervisor settings.

* ``--set ipam.mode=kubernetes`` is used since our cloud provider has pod CIDR
  allocation enabled in ``kube-controller-manager``.

* ``--set k8sServiceHost`` and ``--set k8sServicePort`` were set to the IP
  address and port of the load balancer in front of ``kube-apiserver``.
  This allows Cilium to connect to ``kube-apiserver`` without depending on
  kube-proxy.

* ``--set prometheus.enabled=true`` and
  ``--set operator.prometheus.enabled=true`` were set because we had a
  Prometheus server scraping metrics in the entire cluster.

Our testing cluster consisted of 3 controller nodes and 1000 worker nodes.
We followed the recommended settings from the
`official Kubernetes documentation <https://kubernetes.io/docs/setup/best-practices/cluster-large/>`_
and provisioned our machines with the following settings:

* **Cloud provider**: Google Cloud

* **Controllers**: 3x n1-standard-32 (32vCPU, 120GB memory and 50GB SSD, kernel 5.4.0-1009-gcp)

* **Workers**: 1 pool of 1000x custom-2-4096 (2vCPU, 4GB memory and 10GB HDD, kernel 5.4.0-1009-gcp)

* **Metrics**: 1x n1-standard-32 (32vCPU, 120GB memory and 10GB HDD + 500GB HDD),
  a dedicated node for the Prometheus and Grafana pods.

.. note::

    All 3 controller nodes were behind a GCE load balancer.

    Each controller contained ``etcd``, ``kube-apiserver``,
    ``kube-controller-manager`` and ``kube-scheduler`` instances.

    The CPU, memory and disk size set for the workers might be different for
    your use case. You might have pods that require more memory or CPU, so you
    should design your workers based on your requirements.

During our testing we had to set the ``etcd`` option
``quota-backend-bytes=17179869184`` (16GiB) because ``etcd`` failed once it
reached around ``2GiB`` of allocated space.
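For reference, the quota is controlled by etcd's ``--quota-backend-bytes``
flag. The sketch below only illustrates where that flag goes; the remaining
flags, and the way etcd is launched, depend entirely on how your control plane
is provisioned:

.. code-block:: shell-session

    # Illustration only: raise the backend quota to 16GiB (17179869184 bytes).
    # Every other flag shown here is a placeholder for your own etcd setup.
    etcd --name controller-0 \
      --data-dir /var/lib/etcd \
      --quota-backend-bytes=17179869184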
We provisioned our worker nodes without ``kube-proxy``, since Cilium is capable
of performing all the functionality provided by ``kube-proxy``. We created a
load balancer in front of ``kube-apiserver`` to allow Cilium to access
``kube-apiserver`` without ``kube-proxy``, and configured Cilium with the
options ``--set k8sServiceHost=<KUBE-APISERVER-LB-IP-ADDRESS>``
and ``--set k8sServicePort=<KUBE-APISERVER-LB-PORT-NUMBER>``.

Our ``DaemonSet`` ``updateStrategy`` had ``maxUnavailable`` set to 250 pods
instead of 2, but this value highly depends on your requirements when you are
performing a rolling update of Cilium.

=====
Steps
=====

For each step we took, we provide more details below, with our findings and
expected behaviors.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1. Install Kubernetes v1.18.3 with EndpointSlice feature enabled
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To test the most up-to-date functionalities of Kubernetes and Cilium, we
performed our testing with Kubernetes v1.18.3 and the EndpointSlice feature
enabled to improve scalability.

Since Kubernetes requires an ``etcd`` cluster, we deployed v3.4.9.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2. Deploy Prometheus, Grafana and Cilium
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

We used Prometheus v2.18.1 and Grafana v7.0.1 to retrieve and analyze
``etcd``, ``kube-apiserver``, ``cilium`` and ``cilium-operator`` metrics.

^^^^^^^^^^^^^^^^^^^^^^^^^^^
3. Provision 2 worker nodes
^^^^^^^^^^^^^^^^^^^^^^^^^^^

This helped us verify that our testing cluster was correctly provisioned
and that all metrics were being gathered.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
4. Deploy 5 namespaces with 25 deployments on each namespace
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* Each deployment had 1 replica (125 pods in total); a sketch of how such a
  workload can be generated is shown after the figures below.

* To measure **only** the resources consumed by Cilium, all deployments used
  the same base image ``registry.k8s.io/pause:3.2``. This image does not have
  any CPU or memory overhead.

* We provisioned a small number of pods in a small cluster to understand the
  CPU usage of Cilium:

.. figure:: images/image_4_01.png

    The mark shows when the creation of 125 pods started.
    As expected, we can see a slight increase in CPU usage, both in the running
    Cilium agents and in the Cilium operator. The agents peaked at 6.8% CPU
    usage on a 2vCPU machine.

.. figure:: images/image_4_02.png

    For the memory usage, we did not see significant memory growth in the
    Cilium agent. On the eBPF memory side, we do see it increasing due to the
    initialization of some eBPF maps for the new pods.
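The exact manifests used in the test are not reproduced in this report. As a
rough sketch only, the workload for this step could be generated along the
following lines; the deployment names are made up for illustration, and the
namespace names follow the ``namespace-1`` convention that appears in the
policy examples later on:

.. code-block:: shell-session

    # Sketch only: create 5 namespaces with 25 single-replica deployments each,
    # all running the pause image so the workload itself adds no CPU or memory
    # overhead. Names are illustrative, not the ones used in the test.
    for ns in $(seq 1 5); do
      kubectl create namespace "namespace-${ns}"
      for d in $(seq 1 25); do
        kubectl -n "namespace-${ns}" create deployment "pause-${d}" \
          --image=registry.k8s.io/pause:3.2
      done
    done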
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
5. Provision 998 additional nodes (total 1000 nodes)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. figure:: images/image_5_01.png

    The first mark represents the action of creating nodes, the second mark the
    point when all 1000 Cilium pods were in the ready state. The CPU usage
    increase is expected, since each Cilium agent receives events from
    Kubernetes whenever a new node is provisioned in the cluster. Once all
    nodes were deployed, the CPU usage was 0.15% on average on a 2vCPU node.

.. figure:: images/image_5_02.png

    As we increased the number of nodes in the cluster to 1000, a small growth
    in memory usage is expected across all metrics. However, it is relevant to
    point out that **an increase in the number of nodes does not cause any
    significant increase in Cilium's memory consumption in either the control
    plane or the dataplane.**

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
6. Deploy 25 more deployments on each namespace
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This brings us to a total of
``5 namespaces * (25 old deployments + 25 new deployments) = 250`` deployments
in the entire cluster.
We did not install 250 deployments from the start since we only had 2 nodes,
and that would have created 125 pods on each worker node. According to the
Kubernetes documentation, the maximum recommended number of pods per node
is 100.

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
7. Scale each deployment to 200 replicas (50000 pods in total)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Having 5 namespaces with 50 deployments each means that we have 250 unique
security identities. Keeping the cardinality of the labels selected by Cilium
low helps the cluster scale. By default, Cilium has a limit of 16k security
identities, but it can be increased with ``bpf-policy-map-max`` in the Cilium
``ConfigMap``. A sketch of the scale-up commands, and of how the resulting
identity count can be checked, is shown at the end of this step.

.. figure:: images/image_7_01.png

    The first mark represents the action of scaling up the deployments, the
    second mark the point when 50000 pods were in the ready state.

* It is expected to see the CPU usage of Cilium increase since, on each node,
  Cilium agents receive events from Kubernetes when a new pod is scheduled
  and started.

* The average CPU consumption of all Cilium agents was 3.38% on a 2vCPU
  machine. At one point, roughly around minute 15:23, one of those Cilium
  agents peaked at 27.94% CPU usage.

* Cilium Operator had a stable 5% CPU consumption while the pods were being
  created.

.. figure:: images/image_7_02.png

    Similar to the behavior seen while increasing the number of worker nodes,
    adding new pods also increases Cilium's memory consumption.

* As we increased the number of pods from 250 to 50000, we saw a maximum memory
  usage of 573MiB for one of the Cilium agents, while the average was 438MiB.
* For the eBPF memory usage we saw a maximum of 462.7MiB.
* This means that each **Cilium agent's memory increased by 10.5KiB per new pod
  in the cluster.**
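As an illustration only, and reusing the hypothetical namespace naming from the
earlier sketch rather than the exact objects used in the test, the scale-up and
a check of the resulting identity count could look like this:

.. code-block:: shell-session

    # Sketch only: scale every deployment in the 5 test namespaces to 200
    # replicas. Namespace names follow the illustrative convention used above.
    for ns in $(seq 1 5); do
      kubectl -n "namespace-${ns}" scale deployment --all --replicas=200
    done

    # In CRD mode, security identities are stored as CiliumIdentity objects,
    # so their number can be checked directly through the API server:
    kubectl get ciliumidentities --no-headers | wc -l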
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
8. Deploy 250 policies for 1 namespace
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Here we created 125 L4 network policies and 125 L7 policies. Each policy
selected all pods in this namespace and allowed them to send traffic to other
pods in the same namespace. Each of the 250 policies allows access to a
disjoint set of ports. In the end we have 250 different policies selecting
10000 pods.

.. code-block:: yaml

    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    metadata:
      name: "l4-rule-#"
      namespace: "namespace-1"
    spec:
      endpointSelector:
        matchLabels:
          my-label: testing
      fromEndpoints:
        matchLabels:
          my-label: testing
      egress:
      - toPorts:
        - ports:
          - port: "[0-125]+80"  # from 80 to 12580
            protocol: TCP
    ---
    apiVersion: "cilium.io/v2"
    kind: CiliumNetworkPolicy
    metadata:
      name: "l7-rule-#"
      namespace: "namespace-1"
    spec:
      endpointSelector:
        matchLabels:
          my-label: testing
      fromEndpoints:
        matchLabels:
          my-label: testing
      ingress:
      - toPorts:
        - ports:
          - port: '[126-250]+80'  # from 12680 to 25080
            protocol: TCP
          rules:
            http:
            - method: GET
              path: "/path1$"
            - method: PUT
              path: "/path2$"
              headers:
              - 'X-My-Header: true'

.. figure:: images/image_8_01.png

    In this case we saw one of the Cilium agents jumping to 100% CPU usage for
    15 seconds, while the average peak was 40% during a period of 90 seconds.

.. figure:: images/image_8_02.png

    As expected, **increasing the number of policies does not have a
    significant impact on the memory usage of Cilium, since the eBPF policy
    maps have a constant size** once a pod is initialized.

.. figure:: images/image_8_03.png
.. figure:: images/image_8_04.png


The first mark represents the point in time when we ran ``kubectl create`` to
create the ``CiliumNetworkPolicies``. Since we created the 250 policies
sequentially, we cannot properly compute the convergence time. To do that,
we could use a single CNP with multiple policy rules defined under the
``specs`` field (instead of the ``spec`` field).

Nevertheless, we can see that the last Cilium agent incremented its Policy
Revision, which is incremented individually on each Cilium agent every time a
CiliumNetworkPolicy (CNP) is received, between ``15:45:44`` and ``15:45:46``,
and we can see when the last endpoint was regenerated by checking the 99th
percentile of the "Endpoint regeneration time". In this manner, we can see
that it took less than 5s. We can also verify that **the maximum time for an
endpoint to have the policy enforced was less than 600ms.**


^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
9. Deploy 250 policies for CiliumClusterwideNetworkPolicies (CCNP)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The difference between these policies and the ones installed previously is
that these select all pods in all namespaces. To recap, this means that we now
have **250 different network policies selecting 10000 pods and 250 different
network policies selecting 50000 pods on a cluster with 1000 nodes.** Similar
to the previous step, we deployed 125 L4 policies and another 125 L7 policies.
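The clusterwide policies have the same shape as the namespaced ones above,
except that the kind is ``CiliumClusterwideNetworkPolicy`` and, being
cluster-scoped, they carry no ``namespace`` field. As a rough sketch only (the
selector and port template are carried over from the namespaced example for
illustration, not the exact manifests used in the test), an L4 rule could look
like this:

.. code-block:: yaml

    apiVersion: "cilium.io/v2"
    kind: CiliumClusterwideNetworkPolicy
    metadata:
      # Cluster-scoped resource: no namespace field.
      name: "ccnp-l4-rule-#"
    spec:
      endpointSelector:
        matchLabels:
          my-label: testing
      egress:
      - toPorts:
        - ports:
          - port: "[0-125]+80"  # same port template as the namespaced example
            protocol: TCP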
.. figure:: images/image_9_01.png
.. figure:: images/image_9_02.png


Similar to the creation of the previous 250 CNPs, there was also an increase in
CPU usage during the creation of the CCNPs. The CPU usage was similar even
though the policies were effectively selecting more pods.

.. figure:: images/image_9_03.png

    As all pods running on a node are selected by **all 250 CCNPs created**, we
    see an increase in the **Endpoint regeneration time**, which **peaked a
    little above 3s.**


^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
10. "Accidentally" delete 10000 pods
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In this step we "accidentally" deleted 10000 random pods. Kubernetes then
recreates 10000 new pods, which helps us understand what the convergence time
is for all the deployed network policies.

.. figure:: images/image_10_01.png
.. figure:: images/image_10_02.png


* The first mark represents the point in time when pods were "deleted" and the
  second mark represents the point in time when Kubernetes finished recreating
  10k pods.

* Besides the CPU usage slightly increasing while pods are being scheduled in
  the cluster, we did see some interesting data points in the eBPF memory
  usage. As each endpoint can have one or more dedicated eBPF maps, the eBPF
  memory usage is directly proportional to the number of pods running on a
  node. **If the number of pods per node decreases, so does the eBPF memory
  usage.**

.. figure:: images/image_10_03.png

We inferred the time it took for all the endpoints to be regenerated by looking
at the number of Cilium endpoints with policy enforced over time.
Conveniently, we had another metric showing how many Cilium endpoints had
policy enforcement enabled:

.. figure:: images/image_10_04.png

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
11. Control plane metrics over the test run
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The focus of this test was to study the Cilium agent's resource consumption at
scale. However, we also monitored some metrics of the control plane nodes, such
as etcd metrics and the CPU usage of the Kubernetes controllers, and we present
them in the next figures.

.. figure:: images/image_11_01.png

    Memory consumption of the 3 etcd instances during the entire scalability
    testing.

.. figure:: images/image_11_02.png

    CPU usage of the 3 controller nodes, average latency per request type in
    the etcd cluster, as well as the number of operations per second made to
    etcd.

.. figure:: images/image_11_03.png

    All etcd metrics, from left to right, top to bottom: database size,
    disk sync duration, client traffic in, client traffic out, peer traffic in,
    peer traffic out.

=============
Final Remarks
=============

These experiments helped us develop a better understanding of Cilium running
in a large cluster entirely in CRD mode, without depending on etcd. There is
still some work to be done to optimize the memory footprint of the eBPF maps
even further, as well as to reduce the memory footprint of the Cilium agent.
We will address those in the next Cilium version.

We can also conclude that running Cilium in CRD mode scales to clusters with
more than 200 nodes. However, it is worth pointing out that we need to run
more tests to verify Cilium's behavior when it loses connectivity with
``kube-apiserver``, as can happen during a control plane upgrade, for example.
This will also be our focus in the next Cilium version.