github.com/cilium/cilium@v1.16.2/Documentation/network/kubernetes/kubeproxy-free.rst (about) 1 .. only:: not (epub or latex or html) 2 3 WARNING: You are looking at unreleased Cilium documentation. 4 Please use the official rendered version released here: 5 https://docs.cilium.io 6 7 .. _kubeproxy-free: 8 9 ***************************** 10 Kubernetes Without kube-proxy 11 ***************************** 12 13 This guide explains how to provision a Kubernetes cluster without ``kube-proxy``, 14 and to use Cilium to fully replace it. For simplicity, we will use ``kubeadm`` to 15 bootstrap the cluster. 16 17 For help with installing ``kubeadm`` and for more provisioning options please refer to 18 `the official Kubeadm documentation <https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/>`_. 19 20 .. note:: 21 22 Cilium's kube-proxy replacement depends on the socket-LB feature, 23 which requires a v4.19.57, v5.1.16, v5.2.0 or more recent Linux kernel. 24 Linux kernels v5.3 and v5.8 add additional features that Cilium can use to 25 further optimize the kube-proxy replacement implementation. 26 27 Note that v5.0.y kernels do not have the fix required to run the kube-proxy 28 replacement since at this point in time the v5.0.y stable kernel is end-of-life 29 (EOL) and not maintained anymore on kernel.org. For individual distribution 30 maintained kernels, the situation could differ. Therefore, please check with 31 your distribution. 32 33 Quick-Start 34 ########### 35 36 Initialize the control-plane node via ``kubeadm init`` and skip the 37 installation of the ``kube-proxy`` add-on: 38 39 .. note:: 40 Depending on what CRI implementation you are using, you may need to use the 41 ``--cri-socket`` flag with your ``kubeadm init ...`` command. 42 For example: if you're using Docker CRI you would use 43 ``--cri-socket unix:///var/run/cri-dockerd.sock``. 44 45 .. code-block:: shell-session 46 47 $ kubeadm init --skip-phases=addon/kube-proxy 48 49 Afterwards, join worker nodes by specifying the control-plane node IP address and 50 the token returned by ``kubeadm init`` 51 (for this tutorial, you will want to add at least one worker node to the cluster): 52 53 .. code-block:: shell-session 54 55 $ kubeadm join <..> 56 57 .. note:: 58 59 Please ensure that 60 `kubelet <https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/>`_'s 61 ``--node-ip`` is set correctly on each worker if you have multiple interfaces. 62 Cilium's kube-proxy replacement may not work correctly otherwise. 63 You can validate this by running ``kubectl get nodes -o wide`` to see whether 64 each node has an ``InternalIP`` which is assigned to a device with the same 65 name on each node. 66 67 For existing installations with ``kube-proxy`` running as a DaemonSet, remove it 68 by using the following commands below. 69 70 .. warning:: 71 Be aware that removing ``kube-proxy`` will break existing service connections. It will also stop service related traffic 72 until the Cilium replacement has been installed. 73 74 .. code-block:: shell-session 75 76 $ kubectl -n kube-system delete ds kube-proxy 77 $ # Delete the configmap as well to avoid kube-proxy being reinstalled during a Kubeadm upgrade (works only for K8s 1.19 and newer) 78 $ kubectl -n kube-system delete cm kube-proxy 79 $ # Run on each node with root permissions: 80 $ iptables-save | grep -v KUBE | iptables-restore 81 82 .. 
.. include:: ../../installation/k8s-install-download-release.rst

Next, generate the required YAML files and deploy them.

.. important::

    Make sure you correctly set your ``API_SERVER_IP`` and ``API_SERVER_PORT``
    below with the control-plane node IP address and the kube-apiserver port
    number reported by ``kubeadm init`` (Kubeadm will use port ``6443`` by default).

    Specifying this is necessary as ``kubeadm init`` is run explicitly without setting
    up kube-proxy and as a consequence, although it exports ``KUBERNETES_SERVICE_HOST``
    and ``KUBERNETES_SERVICE_PORT`` with a ClusterIP of the kube-apiserver service
    to the environment, there is no kube-proxy in our setup provisioning that service.
    Therefore, the Cilium agent needs to be made aware of this information with the
    following configuration:

.. parsed-literal::

    API_SERVER_IP=<your_api_server_ip>
    # Kubeadm default is 6443
    API_SERVER_PORT=<your_api_server_port>
    helm install cilium |CHART_RELEASE| \\
        --namespace kube-system \\
        --set kubeProxyReplacement=true \\
        --set k8sServiceHost=${API_SERVER_IP} \\
        --set k8sServicePort=${API_SERVER_PORT}

.. note::

    By default, Cilium automatically mounts the cgroup v2 filesystem required to
    attach BPF cgroup programs at the path ``/run/cilium/cgroupv2``. To do that,
    it temporarily needs to mount the host ``/proc`` inside an init container
    launched by the DaemonSet. If you need to disable the auto-mount, specify
    ``--set cgroup.autoMount.enabled=false``, and set the host mount point where
    the cgroup v2 filesystem is already mounted by using ``--set cgroup.hostRoot``.
    For example, if not already mounted, you can mount the cgroup v2 filesystem by
    running the command below on the host, and specify ``--set cgroup.hostRoot=/sys/fs/cgroup``.

    .. code:: shell-session

        mount -t cgroup2 none /sys/fs/cgroup

This will install Cilium as a CNI plugin with the eBPF kube-proxy replacement to
implement handling of Kubernetes services of type ClusterIP, NodePort, LoadBalancer
and services with externalIPs. The eBPF kube-proxy replacement also supports
hostPort for containers, so the portmap plugin is no longer necessary.

Finally, verify that Cilium has come up correctly on all nodes and is ready to
operate:

.. code-block:: shell-session

    $ kubectl -n kube-system get pods -l k8s-app=cilium
    NAME                READY     STATUS    RESTARTS   AGE
    cilium-fmh8d        1/1       Running   0          10m
    cilium-mkcmb        1/1       Running   0          10m

Note that in the above Helm configuration, ``kubeProxyReplacement`` has been set
to ``true``. This means that the Cilium agent will bail out if the underlying
Linux kernel support is missing.

By default, Helm sets ``kubeProxyReplacement=false``, which only enables
per-packet in-cluster load-balancing of ClusterIP services.

Cilium's eBPF kube-proxy replacement is supported in direct routing as well as in
tunneling mode.

Validate the Setup
##################

After deploying Cilium following the above Quick-Start guide, we can first validate
that the Cilium agent is running in the desired mode:

.. code-block:: shell-session

    $ kubectl -n kube-system exec ds/cilium -- cilium-dbg status | grep KubeProxyReplacement
    KubeProxyReplacement:   True   [eth0 (Direct Routing), eth1]

Use ``--verbose`` for full details:

..
code-block:: shell-session 163 164 $ kubectl -n kube-system exec ds/cilium -- cilium-dbg status --verbose 165 [...] 166 KubeProxyReplacement Details: 167 Status: True 168 Socket LB: Enabled 169 Protocols: TCP, UDP 170 Devices: eth0 (Direct Routing), eth1 171 Mode: SNAT 172 Backend Selection: Random 173 Session Affinity: Enabled 174 Graceful Termination: Enabled 175 NAT46/64 Support: Enabled 176 XDP Acceleration: Disabled 177 Services: 178 - ClusterIP: Enabled 179 - NodePort: Enabled (Range: 30000-32767) 180 - LoadBalancer: Enabled 181 - externalIPs: Enabled 182 - HostPort: Enabled 183 [...] 184 185 As an optional next step, we will create an Nginx Deployment. Then we'll create a new NodePort service and 186 validate that Cilium installed the service correctly. 187 188 The following YAML is used for the backend pods: 189 190 .. code-block:: yaml 191 192 apiVersion: apps/v1 193 kind: Deployment 194 metadata: 195 name: my-nginx 196 spec: 197 selector: 198 matchLabels: 199 run: my-nginx 200 replicas: 2 201 template: 202 metadata: 203 labels: 204 run: my-nginx 205 spec: 206 containers: 207 - name: my-nginx 208 image: nginx 209 ports: 210 - containerPort: 80 211 212 Verify that the Nginx pods are up and running: 213 214 .. code-block:: shell-session 215 216 $ kubectl get pods -l run=my-nginx -o wide 217 NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES 218 my-nginx-756fb87568-gmp8c 1/1 Running 0 62m 10.217.0.149 apoc <none> <none> 219 my-nginx-756fb87568-n5scv 1/1 Running 0 62m 10.217.0.107 apoc <none> <none> 220 221 In the next step, we create a NodePort service for the two instances: 222 223 .. code-block:: shell-session 224 225 $ kubectl expose deployment my-nginx --type=NodePort --port=80 226 service/my-nginx exposed 227 228 Verify that the NodePort service has been created: 229 230 .. code-block:: shell-session 231 232 $ kubectl get svc my-nginx 233 NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE 234 my-nginx NodePort 10.104.239.135 <none> 80:31940/TCP 24m 235 236 With the help of the ``cilium-dbg service list`` command, we can validate that 237 Cilium's eBPF kube-proxy replacement created the new NodePort service. 238 In this example, services with port ``31940`` were created (one for each of devices ``eth0`` and ``eth1``): 239 240 .. code-block:: shell-session 241 242 $ kubectl -n kube-system exec ds/cilium -- cilium-dbg service list 243 ID Frontend Service Type Backend 244 [...] 245 4 10.104.239.135:80 ClusterIP 1 => 10.217.0.107:80 246 2 => 10.217.0.149:80 247 5 0.0.0.0:31940 NodePort 1 => 10.217.0.107:80 248 2 => 10.217.0.149:80 249 6 192.168.178.29:31940 NodePort 1 => 10.217.0.107:80 250 2 => 10.217.0.149:80 251 7 172.16.0.29:31940 NodePort 1 => 10.217.0.107:80 252 2 => 10.217.0.149:80 253 254 Create a variable with the node port for testing: 255 256 .. code-block:: shell-session 257 258 $ node_port=$(kubectl get svc my-nginx -o=jsonpath='{@.spec.ports[0].nodePort}') 259 260 At the same time we can verify, using ``iptables`` in the host namespace, 261 that no ``iptables`` rule for the service is present: 262 263 .. code-block:: shell-session 264 265 $ iptables-save | grep KUBE-SVC 266 [ empty line ] 267 268 Last but not least, a simple ``curl`` test shows connectivity for the exposed 269 NodePort as well as for the ClusterIP: 270 271 .. code-block:: shell-session 272 273 $ curl 127.0.0.1:$node_port 274 <!DOCTYPE html> 275 <html> 276 <head> 277 <title>Welcome to nginx!</title> 278 [....] 279 280 .. 
.. code-block:: shell-session

    $ curl 192.168.178.29:$node_port
    <!DOCTYPE html>
    <html>
    <head>
    <title>Welcome to nginx!</title>
    [....]

.. code-block:: shell-session

    $ curl 172.16.0.29:$node_port
    <!DOCTYPE html>
    <html>
    <head>
    <title>Welcome to nginx!</title>
    [....]

.. code-block:: shell-session

    $ curl 10.104.239.135:80
    <!DOCTYPE html>
    <html>
    <head>
    <title>Welcome to nginx!</title>
    [....]

As can be seen, Cilium's eBPF kube-proxy replacement is set up correctly.

Advanced Configuration
######################

This section covers a few advanced configuration modes for the kube-proxy replacement
that go beyond the above Quick-Start guide and are entirely optional.

Client Source IP Preservation
*****************************

Cilium's eBPF kube-proxy replacement implements various options to avoid
performing SNAT on NodePort requests where the client source IP address would
otherwise be lost on its path to the service endpoint.

- ``externalTrafficPolicy=Local``: The ``Local`` policy is generally supported through
  the eBPF implementation. In-cluster connectivity for services with
  ``externalTrafficPolicy=Local`` is possible: such services can also be reached from
  nodes which have no local backends, because SNAT does not need to be performed and
  therefore all service endpoints remain available for load balancing from the
  in-cluster side.

- ``externalTrafficPolicy=Cluster``: For the ``Cluster`` policy, which is the default
  upon service creation, multiple options exist for achieving client source IP
  preservation for external traffic: operating the kube-proxy replacement in
  :ref:`DSR<DSR Mode>` mode, or in :ref:`Hybrid<Hybrid Mode>` mode if only TCP-based
  services are exposed to the outside world.

Internal Traffic Policy
***********************

Similar to ``externalTrafficPolicy`` described above, Cilium's eBPF kube-proxy
replacement supports ``internalTrafficPolicy``, which applies the same semantics to
in-cluster traffic (see the example manifest after this list).

- For services with ``internalTrafficPolicy=Local``, traffic originating from pods in
  the current cluster is routed only to endpoints on the node the traffic originated
  from.

- ``internalTrafficPolicy=Cluster`` is the default, and it doesn't restrict the
  endpoints that can handle internal (in-cluster) traffic.
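For illustration, a minimal Service manifest that pins both policies to ``Local``
could look like the following sketch (the name and selector reuse the ``my-nginx``
example from the validation steps above):

.. code-block:: yaml

    apiVersion: v1
    kind: Service
    metadata:
      name: my-nginx
    spec:
      type: NodePort
      selector:
        run: my-nginx
      ports:
      - port: 80
      # Only node-local backends serve North-South (external) traffic:
      externalTrafficPolicy: Local
      # Only node-local backends serve East-West (in-cluster) traffic:
      internalTrafficPolicy: Local

This combination corresponds to the last row of the table below.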
345 346 The following table gives an idea of what backends are used to serve connections to a service, 347 depending on the external and internal traffic policies: 348 349 +---------------------+-------------------------------------------------+ 350 | Traffic policy | Service backends used | 351 +----------+----------+-------------------------+-----------------------+ 352 | Internal | External | for North-South traffic | for East-West traffic | 353 +==========+==========+=========================+=======================+ 354 | Cluster | Cluster | All (default) | All (default) | 355 +----------+----------+-------------------------+-----------------------+ 356 | Cluster | Local | Node-local only | All (default) | 357 +----------+----------+-------------------------+-----------------------+ 358 | Local | Cluster | All (default) | Node-local only | 359 +----------+----------+-------------------------+-----------------------+ 360 | Local | Local | Node-local only | Node-local only | 361 +----------+----------+-------------------------+-----------------------+ 362 363 .. _maglev: 364 365 Maglev Consistent Hashing 366 ************************* 367 368 Cilium's eBPF kube-proxy replacement supports consistent hashing by implementing a variant 369 of `The Maglev hashing <https://storage.googleapis.com/pub-tools-public-publication-data/pdf/44824.pdf>`_ 370 in its load balancer for backend selection. This improves resiliency in case of 371 failures. As well, it provides better load balancing properties since Nodes added to the cluster will 372 make consistent backend selection throughout the cluster for a given 5-tuple without 373 having to synchronize state with the other Nodes. Similarly, upon backend removal the backend 374 lookup tables are reprogrammed with minimal disruption for unrelated backends (at most 1% 375 difference in the reassignments) for the given service. 376 377 Maglev hashing for services load balancing can be enabled by setting ``loadBalancer.algorithm=maglev``: 378 379 .. parsed-literal:: 380 381 helm install cilium |CHART_RELEASE| \\ 382 --namespace kube-system \\ 383 --set kubeProxyReplacement=true \\ 384 --set loadBalancer.algorithm=maglev \\ 385 --set k8sServiceHost=${API_SERVER_IP} \\ 386 --set k8sServicePort=${API_SERVER_PORT} 387 388 Note that Maglev hashing is applied only to external (N-S) traffic. For 389 in-cluster service connections (E-W), sockets are assigned to service backends 390 directly, e.g. at TCP connect time, without any intermediate hop and thus are 391 not subject to Maglev. Maglev hashing is also supported for Cilium's 392 :ref:`XDP<XDP Acceleration>` acceleration. 393 394 There are two more Maglev-specific configuration settings: ``maglev.tableSize`` 395 and ``maglev.hashSeed``. 396 397 ``maglev.tableSize`` specifies the size of the Maglev lookup table for each single service. 398 `Maglev <https://storage.googleapis.com/pub-tools-public-publication-data/pdf/44824.pdf>`__ 399 recommends the table size (``M``) to be significantly larger than the number of maximum expected 400 backends (``N``). In practice that means that ``M`` should be larger than ``100 * N`` in 401 order to guarantee the property of at most 1% difference in the reassignments on backend 402 changes. ``M`` must be a prime number. Cilium uses a default size of ``16381`` for ``M``. 
403 The following sizes for ``M`` are supported as ``maglev.tableSize`` Helm option: 404 405 +----------------------------+ 406 | ``maglev.tableSize`` value | 407 +============================+ 408 | 251 | 409 +----------------------------+ 410 | 509 | 411 +----------------------------+ 412 | 1021 | 413 +----------------------------+ 414 | 2039 | 415 +----------------------------+ 416 | 4093 | 417 +----------------------------+ 418 | 8191 | 419 +----------------------------+ 420 | 16381 | 421 +----------------------------+ 422 | 32749 | 423 +----------------------------+ 424 | 65521 | 425 +----------------------------+ 426 | 131071 | 427 +----------------------------+ 428 429 For example, a ``maglev.tableSize`` of ``16381`` is suitable for a maximum of ``~160`` backends 430 per service. If a higher number of backends are provisioned under this setting, then the 431 difference in reassignments on backend changes will increase. 432 433 The ``maglev.hashSeed`` option is recommended to be set in order for Cilium to not rely on the 434 fixed built-in seed. The seed is a base64-encoded 12 byte-random number, and can be 435 generated once through ``head -c12 /dev/urandom | base64 -w0``, for example. 436 Every Cilium agent in the cluster must use the same hash seed for Maglev to work. 437 438 The below deployment example is generating and passing such seed to Helm as well as setting the 439 Maglev table size to ``65521`` to allow for ``~650`` maximum backends for a 440 given service (with the property of at most 1% difference on backend reassignments): 441 442 .. parsed-literal:: 443 444 SEED=$(head -c12 /dev/urandom | base64 -w0) 445 helm install cilium |CHART_RELEASE| \\ 446 --namespace kube-system \\ 447 --set kubeProxyReplacement=true \\ 448 --set loadBalancer.algorithm=maglev \\ 449 --set maglev.tableSize=65521 \\ 450 --set maglev.hashSeed=$SEED \\ 451 --set k8sServiceHost=${API_SERVER_IP} \\ 452 --set k8sServicePort=${API_SERVER_PORT} 453 454 455 Note that enabling Maglev will have a higher memory consumption on each Cilium-managed Node compared 456 to the default of ``loadBalancer.algorithm=random`` given ``random`` does not need the extra lookup 457 tables. However, ``random`` won't have consistent backend selection. 458 459 .. _DSR mode: 460 461 Direct Server Return (DSR) 462 ************************** 463 464 By default, Cilium's eBPF NodePort implementation operates in SNAT mode. That is, 465 when node-external traffic arrives and the node determines that the backend for 466 the LoadBalancer, NodePort, or services with externalIPs is at a remote node, then the 467 node is redirecting the request to the remote backend on its behalf by performing 468 SNAT. This does not require any additional MTU changes. The cost is that replies 469 from the backend need to make the extra hop back to that node to perform the 470 reverse SNAT translation there before returning the packet directly to the external 471 client. 472 473 This setting can be changed through the ``loadBalancer.mode`` Helm option to 474 ``dsr`` in order to let Cilium's eBPF NodePort implementation operate in DSR mode. 475 In this mode, the backends reply directly to the external client without taking 476 the extra hop, meaning, backends reply by using the service IP/port as a source. 477 478 Another advantage in DSR mode is that the client's source IP is preserved, so policy 479 can match on it at the backend node. In the SNAT mode this is not possible. 
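As a rough sketch of what such a policy could look like (the policy name and client
CIDR are placeholders, and ``run: my-nginx`` reuses the earlier example deployment),
a CiliumNetworkPolicy could admit service traffic only from a known external client
range based on the preserved source IP:

.. code-block:: yaml

    apiVersion: cilium.io/v2
    kind: CiliumNetworkPolicy
    metadata:
      name: allow-known-clients
    spec:
      endpointSelector:
        matchLabels:
          run: my-nginx
      ingress:
      - fromCIDR:
        - 192.0.2.0/24        # example external client range
        toPorts:
        - ports:
          - port: "80"
            protocol: TCP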
Since a specific backend can be used by multiple services, the backends need to be
made aware of the service IP/port with which they need to reply. Cilium encodes this
information into the packet (using one of the dispatch mechanisms described below),
at the cost of advertising a lower MTU. For TCP services, Cilium only encodes the
service IP/port for the SYN packet, but not subsequent ones. This optimization also
allows Cilium to operate in a hybrid mode, as detailed in a later subsection, where
DSR is used for TCP and SNAT for UDP in order to avoid an otherwise needed MTU
reduction.

In some public cloud provider environments that implement source/destination IP
address checking (e.g. AWS), the checking has to be disabled in order for the DSR
mode to work.

By default, Cilium applies a special ExternalIP mitigation for the CVE-2020-8554
man-in-the-middle (MITM) vulnerability. This may affect connectivity targeted at an
ExternalIP of the same cluster. The mitigation can be disabled by setting
``bpf.disableExternalIPMitigation`` to ``true``.

.. _DSR mode with Option:

Direct Server Return (DSR) with IPv4 option / IPv6 extension Header
********************************************************************

In this DSR dispatch mode, the service IP/port information is transported to the
backend through a Cilium-specific IPv4 Option or IPv6 Destination Option extension
header. It requires Cilium to be deployed in :ref:`arch_direct_routing`, i.e.
it will not work in :ref:`arch_overlay` mode.

This DSR mode might not work in some public cloud provider environments
because the Cilium-specific IP options could be dropped by an underlying network fabric.
In case of connectivity issues to services where backends are located on
a remote node from the node that is processing the given NodePort request,
first check whether the NodePort request actually arrived on the node
containing the backend. If this was not the case, then consider either switching to
DSR with Geneve (as described below), or switching back to the default SNAT mode.

The above Helm example configuration in a kube-proxy-free environment with DSR-only
mode enabled would look as follows:

.. parsed-literal::

    helm install cilium |CHART_RELEASE| \\
        --namespace kube-system \\
        --set routingMode=native \\
        --set kubeProxyReplacement=true \\
        --set loadBalancer.mode=dsr \\
        --set loadBalancer.dsrDispatch=opt \\
        --set k8sServiceHost=${API_SERVER_IP} \\
        --set k8sServicePort=${API_SERVER_PORT}

.. _DSR mode with Geneve:

Direct Server Return (DSR) with Geneve
**************************************

By default, Cilium with DSR mode encodes the service IP/port in a Cilium-specific
IPv4 option or IPv6 Destination Option extension so that the backends are aware of
the service IP/port with which they need to reply.

However, some data center routers pass packets with unknown IP options to software
processing called "Layer 2 slow path". Those routers drop the packets if the number
of packets with IP options exceeds a given threshold, which may significantly affect
network performance.

Cilium offers another dispatch mode, DSR with Geneve, to avoid this problem.
543 In DSR with Geneve, Cilium encapsulates packets to the Loadbalancer with the Geneve 544 header that includes the service IP/port in the Geneve option and redirects them 545 to the backends. 546 547 The Helm example configuration in a kube-proxy-free environment with DSR and 548 Geneve dispatch enabled would look as follows: 549 550 .. parsed-literal:: 551 helm install cilium |CHART_RELEASE| \\ 552 --namespace kube-system \\ 553 --set routingMode=native \\ 554 --set tunnelProtocol=geneve \\ 555 --set kubeProxyReplacement=true \\ 556 --set loadBalancer.mode=dsr \\ 557 --set loadBalancer.dsrDispatch=geneve \\ 558 --set k8sServiceHost=${API_SERVER_IP} \\ 559 --set k8sServicePort=${API_SERVER_PORT} 560 561 DSR with Geneve is compatible with the Geneve encapsulation mode (:ref:`arch_overlay`). 562 It works with either the direct routing mode or the Geneve tunneling mode. Unfortunately, 563 it doesn't work with the vxlan encapsulation mode. 564 565 The example configuration in DSR with Geneve dispatch and tunneling mode is as follows. 566 567 .. parsed-literal:: 568 helm install cilium |CHART_RELEASE| \\ 569 --namespace kube-system \\ 570 --set routingMode=tunnel \\ 571 --set tunnelProtocol=geneve \\ 572 --set kubeProxyReplacement=true \\ 573 --set loadBalancer.mode=dsr \\ 574 --set loadBalancer.dsrDispatch=geneve \\ 575 --set k8sServiceHost=${API_SERVER_IP} \\ 576 --set k8sServicePort=${API_SERVER_PORT} 577 578 .. _Hybrid mode: 579 580 Hybrid DSR and SNAT Mode 581 ************************ 582 583 Cilium also supports a hybrid DSR and SNAT mode, that is, DSR is performed for TCP 584 and SNAT for UDP connections. 585 This removes the need for manual MTU changes in the network while still benefiting from the latency improvements 586 through the removed extra hop for replies, in particular, when TCP is the main transport 587 for workloads. 588 589 The mode setting ``loadBalancer.mode`` allows to control the behavior through the 590 options ``dsr``, ``snat`` and ``hybrid``. By default the ``snat`` mode is used in the 591 agent. 592 593 A Helm example configuration in a kube-proxy-free environment with DSR enabled in hybrid 594 mode would look as follows: 595 596 .. parsed-literal:: 597 598 helm install cilium |CHART_RELEASE| \\ 599 --namespace kube-system \\ 600 --set routingMode=native \\ 601 --set kubeProxyReplacement=true \\ 602 --set loadBalancer.mode=hybrid \\ 603 --set k8sServiceHost=${API_SERVER_IP} \\ 604 --set k8sServicePort=${API_SERVER_PORT} 605 606 .. _socketlb-host-netns-only: 607 608 Socket LoadBalancer Bypass in Pod Namespace 609 ******************************************* 610 611 The socket-level loadbalancer acts transparent to Cilium's lower layer datapath 612 in that upon ``connect`` (TCP, connected UDP), ``sendmsg`` (UDP), or ``recvmsg`` 613 (UDP) system calls, the destination IP is checked for an existing service IP and 614 one of the service backends is selected as a target. This means that although 615 the application assumes it is connected to the service address, the 616 corresponding kernel socket is actually connected to the backend address and 617 therefore no additional lower layer NAT is required. 
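One way to observe this (a sketch assuming a client pod whose image ships ``nc``
and ``ss``; the addresses reuse the earlier ClusterIP example and the output is
illustrative) is to hold a connection to the service IP open and inspect the
socket from inside the pod. The peer address shown is a backend pod IP rather
than the ClusterIP:

.. code-block:: shell-session

    $ kubectl exec -it client -- sh       # "client" is a placeholder pod name
    # nc 10.104.239.135 80 &              # keep a TCP connection to the ClusterIP open
    # ss -tn
    State   Recv-Q   Send-Q   Local Address:Port    Peer Address:Port
    ESTAB   0        0        10.217.0.200:43312    10.217.0.149:80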
Cilium has built-in support for bypassing the socket-level loadbalancer and falling
back to the tc loadbalancer at the veth interface when a custom redirection or
operation relies on the original ClusterIP within the pod namespace (e.g., Istio
sidecar), or when the socket-level loadbalancer is ineffective due to the Pod's
nature (e.g., KubeVirt, Kata Containers, gVisor).

Setting ``socketLB.hostNamespaceOnly=true`` enables this bypass mode. When enabled,
Cilium skips the socket rewrite in the ``connect()`` and ``sendmsg()`` syscall BPF
hooks, passes the original packet to the next stage of operation (e.g., the stack in
``per-endpoint-routing`` mode), and re-enables the service lookup in the tc BPF
program.

A Helm example configuration in a kube-proxy-free environment with socket LB bypass
looks as follows:

.. parsed-literal::

    helm install cilium |CHART_RELEASE| \\
        --namespace kube-system \\
        --set routingMode=native \\
        --set kubeProxyReplacement=true \\
        --set socketLB.hostNamespaceOnly=true

.. _XDP acceleration:

LoadBalancer & NodePort XDP Acceleration
****************************************

Cilium has built-in support for accelerating NodePort, LoadBalancer services and
services with externalIPs for the case where the arriving request needs to be
forwarded and the backend is located on a remote node. This feature was introduced
in Cilium version `1.8 <https://cilium.io/blog/2020/06/22/cilium-18/#kube-proxy-replacement-at-the-xdp-layer>`_ at
the XDP (eXpress Data Path) layer, where eBPF operates directly in the networking
driver instead of a higher layer.

Setting ``loadBalancer.acceleration`` to ``native`` enables this acceleration. The
option ``disabled`` is the default and disables the acceleration. The majority of
drivers supporting 10G or higher rates also support ``native`` XDP on a recent
kernel. For cloud-based deployments, most of these drivers have SR-IOV variants that
support native XDP as well. For on-prem deployments, the Cilium XDP acceleration can
be used in combination with LoadBalancer service implementations for Kubernetes such
as `MetalLB <https://metallb.universe.tf/>`_. The acceleration can be enabled only
on a single device which is used for direct routing.

For high-scale environments, also consider tweaking the default map sizes to a larger
number of entries, e.g. by setting a higher ``config.bpfMapDynamicSizeRatio``.
See :ref:`bpf_map_limitations` for further details.

The ``loadBalancer.acceleration`` setting is supported for the DSR, SNAT and hybrid
modes. It can be enabled as follows, in this example for ``loadBalancer.mode=hybrid``:

.. parsed-literal::

    helm install cilium |CHART_RELEASE| \\
        --namespace kube-system \\
        --set routingMode=native \\
        --set kubeProxyReplacement=true \\
        --set loadBalancer.acceleration=native \\
        --set loadBalancer.mode=hybrid \\
        --set k8sServiceHost=${API_SERVER_IP} \\
        --set k8sServicePort=${API_SERVER_PORT}

In case of a multi-device environment, where Cilium's device auto-detection selects
more than a single device to expose NodePort, or where a user specifies multiple
devices with ``devices``, the XDP acceleration is enabled on all devices. This means
that each underlying device's driver must have native XDP support on all
Cilium-managed nodes.
If you have an environment where some devices support XDP but others do not, you can
enable XDP on the supported devices by setting ``loadBalancer.acceleration`` to
``best-effort``. In addition, for performance reasons we recommend a kernel >= 5.5
for multi-device XDP acceleration.

A list of drivers supporting XDP can be found in :ref:`the XDP documentation<xdp_drivers>`.

The current Cilium kube-proxy XDP acceleration mode can also be introspected through
the ``cilium-dbg status`` CLI command. If it has been enabled successfully, ``Native``
is shown:

.. code-block:: shell-session

    $ kubectl -n kube-system exec ds/cilium -- cilium-dbg status --verbose | grep XDP
      XDP Acceleration:    Native

Note that packets which have been pushed back out of the device for NodePort handling
right at the XDP layer are not visible in tcpdump, since packet taps come at a much
later stage in the networking stack. Cilium's monitor command or metric counters can
be used instead to gain visibility.

NodePort XDP on AWS
===================

In order to run NodePort XDP on AWS, follow the instructions in the :ref:`k8s_install_quick`
guide to set up an EKS cluster, or use any other method of your preference to set up a
Kubernetes cluster.

If you are following the EKS guide, make sure to create a node group with SSH access,
since a few additional setup steps are needed, and to use a larger instance type which
supports the `Elastic Network Adapter <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking-ena.html>`__ (ena).
As an example, the instance type ``m5n.xlarge`` is used in the config ``nodegroup-config.yaml``:

.. code-block:: yaml

    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig

    metadata:
      name: test-cluster
      region: us-west-2

    nodeGroups:
      - name: ng-1
        instanceType: m5n.xlarge
        desiredCapacity: 2
        ssh:
          allow: true
        ## taint nodes so that application pods are
        ## not scheduled/executed until Cilium is deployed.
        ## Alternatively, see the note below.
        taints:
          - key: "node.cilium.io/agent-not-ready"
            value: "true"
            effect: "NoExecute"

.. note::

    Please make sure to read and understand the documentation page on :ref:`taint effects and unmanaged pods<taint_effects>`.

The node group is created with:

.. code-block:: shell-session

    $ eksctl create nodegroup -f nodegroup-config.yaml

Each of the nodes needs the ``kernel-ng`` and ``ethtool`` packages installed. The
former is needed in order to run a sufficiently recent kernel for eBPF in general and
for native XDP support in the ena driver. The latter is needed to configure channel
parameters for the NIC.

.. code-block:: shell-session

    $ IPS=$(kubectl get no -o jsonpath='{$.items[*].status.addresses[?(@.type=="ExternalIP")].address }{"\\n"}' | tr ' ' '\\n')

    $ for ip in $IPS ; do ssh ec2-user@$ip "sudo amazon-linux-extras install -y kernel-ng && sudo yum install -y ethtool && sudo reboot"; done

Once the nodes come back up, their kernel version should read ``5.4.58-27.104.amzn2.x86_64``
or similar via ``uname -r``. In order to run XDP on ena, make sure the driver version is
at least `2.2.8 <https://github.com/amzn/amzn-drivers/commit/ccbb1fe2c2f2ab3fc6d7827b012ba8ec06f32c39>`__.
The driver version can be inspected through ``ethtool -i eth0``.
For the given kernel version 765 the driver version should be reported as ``2.2.10g``. 766 767 Before Cilium's XDP acceleration can be deployed, there are two settings needed on the 768 network adapter side, that is, MTU needs to be lowered in order to be able to operate 769 with XDP, and number of combined channels need to be adapted. 770 771 The default MTU is set to 9001 on the ena driver. Given XDP buffers are linear, they 772 operate on a single page. A driver typically reserves some headroom for XDP as well 773 (e.g. for encapsulation purpose), therefore, the highest possible MTU for XDP would 774 be 3498. 775 776 In terms of ena channels, the settings can be gathered via ``ethtool -l eth0``. For the 777 ``m5n.xlarge`` instance, the default output should look like:: 778 779 Channel parameters for eth0: 780 Pre-set maximums: 781 RX: 0 782 TX: 0 783 Other: 0 784 Combined: 4 785 Current hardware settings: 786 RX: 0 787 TX: 0 788 Other: 0 789 Combined: 4 790 791 In order to use XDP the channels must be set to at most 1/2 of the value from 792 ``Combined`` above. Both, MTU and channel changes are applied as follows: 793 794 .. code-block:: shell-session 795 796 $ for ip in $IPS ; do ssh ec2-user@$ip "sudo ip link set dev eth0 mtu 3498"; done 797 $ for ip in $IPS ; do ssh ec2-user@$ip "sudo ethtool -L eth0 combined 2"; done 798 799 In order to deploy Cilium, the Kubernetes API server IP and port is needed: 800 801 .. code-block:: shell-session 802 803 $ export API_SERVER_IP=$(kubectl get ep kubernetes -o jsonpath='{$.subsets[0].addresses[0].ip}') 804 $ export API_SERVER_PORT=443 805 806 Finally, the deployment can be upgraded and later rolled-out with the 807 ``loadBalancer.acceleration=native`` setting to enable XDP in Cilium: 808 809 .. parsed-literal:: 810 811 helm upgrade cilium |CHART_RELEASE| \\ 812 --namespace kube-system \\ 813 --reuse-values \\ 814 --set kubeProxyReplacement=true \\ 815 --set loadBalancer.acceleration=native \\ 816 --set loadBalancer.mode=snat \\ 817 --set k8sServiceHost=${API_SERVER_IP} \\ 818 --set k8sServicePort=${API_SERVER_PORT} 819 820 821 NodePort XDP on Azure 822 ===================== 823 824 To enable NodePort XDP on Azure AKS or a self-managed Kubernetes running on Azure, the virtual 825 machines running Kubernetes must have `Accelerated Networking 826 <https://azure.microsoft.com/en-us/updates/accelerated-networking-in-expanded-preview/>`_ 827 enabled. In addition, the Linux kernel on the nodes must also have support for 828 native XDP in the ``hv_netvsc`` driver, which is available in kernel >= 5.6 and was backported to 829 the Azure Linux kernel in 5.4.0-1022. 830 831 On AKS, make sure to use the AKS Ubuntu 22.04 node image with Kubernetes version v1.26 which will 832 provide a Linux kernel with the necessary backports to the ``hv_netvsc`` driver. Please refer to the 833 documentation on `how to configure an AKS cluster 834 <https://docs.microsoft.com/en-us/azure/aks/cluster-configuration>`_ for more details. 835 836 To enable accelerated networking when creating a virtual machine or 837 virtual machine scale set, pass the ``--accelerated-networking`` option to the 838 Azure CLI. Please refer to the guide on how to `create a Linux virtual machine 839 with Accelerated Networking using Azure CLI 840 <https://docs.microsoft.com/en-us/azure/virtual-network/create-vm-accelerated-networking-cli>`_ 841 for more details. 842 843 When *Accelerated Networking* is enabled, ``lspci`` will show a 844 Mellanox ConnectX NIC: 845 846 .. 
code-block:: shell-session 847 848 $ lspci | grep Ethernet 849 2846:00:02.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function] (rev 80) 850 851 XDP acceleration can only be enabled on NICs ConnectX-4 Lx and onwards. 852 853 In order to run XDP, large receive offload (LRO) needs to be disabled on the 854 ``hv_netvsc`` device. If not the case already, this can be achieved by: 855 856 .. code-block:: shell-session 857 858 $ ethtool -K eth0 lro off 859 860 It is recommended to use Azure IPAM for the pod IP address allocation, which 861 will automatically configure your virtual network to route pod traffic correctly: 862 863 .. parsed-literal:: 864 865 helm install cilium |CHART_RELEASE| \\ 866 --namespace kube-system \\ 867 --set ipam.mode=azure \\ 868 --set azure.enabled=true \\ 869 --set azure.resourceGroup=$AZURE_NODE_RESOURCE_GROUP \\ 870 --set azure.subscriptionID=$AZURE_SUBSCRIPTION_ID \\ 871 --set azure.tenantID=$AZURE_TENANT_ID \\ 872 --set azure.clientID=$AZURE_CLIENT_ID \\ 873 --set azure.clientSecret=$AZURE_CLIENT_SECRET \\ 874 --set routingMode=native \\ 875 --set enableIPv4Masquerade=false \\ 876 --set devices=eth0 \\ 877 --set kubeProxyReplacement=true \\ 878 --set loadBalancer.acceleration=native \\ 879 --set loadBalancer.mode=snat \\ 880 --set k8sServiceHost=${API_SERVER_IP} \\ 881 --set k8sServicePort=${API_SERVER_PORT} 882 883 884 When running Azure IPAM on a self-managed Kubernetes cluster, each ``v1.Node`` 885 must have the resource ID of its VM in the ``spec.providerID`` field. 886 Refer to the :ref:`ipam_azure` reference for more information. 887 888 NodePort XDP on GCP 889 =================== 890 891 NodePort XDP on the Google Cloud Platform is currently not supported. Both 892 virtual network interfaces available on Google Compute Engine (the older 893 virtIO-based interface and the newer `gVNIC 894 <https://cloud.google.com/compute/docs/instances/create-vm-with-gvnic>`_) are 895 currently lacking support for native XDP. 896 897 .. _NodePort Devices: 898 899 NodePort Devices, Port and Bind settings 900 **************************************** 901 902 When running Cilium's eBPF kube-proxy replacement, by default, a NodePort or 903 LoadBalancer service or a service with externalIPs will be accessible through 904 the IP addresses of native devices which have the default route on the host or 905 have Kubernetes InternalIP or ExternalIP assigned. InternalIP is preferred over 906 ExternalIP if both exist. To change the devices, set their names in the 907 ``devices`` Helm option, e.g. ``devices='{eth0,eth1,eth2}'``. Each 908 listed device has to be named the same on all Cilium managed nodes. Alternatively 909 if the devices do not match across different nodes, the wildcard option can be 910 used, e.g. ``devices=eth+``, which would match any device starting with prefix 911 ``eth``. If no device can be matched the Cilium agent will try to perform auto 912 detection. 913 914 When multiple devices are used, only one device can be used for direct routing 915 between Cilium nodes. By default, if a single device was detected or specified 916 via ``devices`` then Cilium will use that device for direct routing. 917 Otherwise, Cilium will use a device with Kubernetes InternalIP or ExternalIP 918 set. InternalIP is preferred over ExternalIP if both exist. To change 919 the direct routing device, set the ``nodePort.directRoutingDevice`` Helm 920 option, e.g. ``nodePort.directRoutingDevice=eth1``. 
The wildcard option can be used for the direct routing device as well, just like for
the ``devices`` option, e.g. ``directRoutingDevice=eth+``. If more than one device
matches the wildcard, Cilium sorts them in increasing alphanumerical order and picks
the first one. If the direct routing device does not exist within ``devices``, Cilium
adds the device to the latter list. The direct routing device is also used for
:ref:`the NodePort XDP acceleration<XDP Acceleration>` (if enabled).

In addition, thanks to the socket-LB feature, the NodePort service can by default be
accessed from a host or a pod within the cluster via its public address, any local
address (except for ``docker*`` prefixed names) or the loopback address, e.g.
``127.0.0.1:NODE_PORT``.

If ``kube-apiserver`` was configured to use a non-default NodePort port range, then
the same range must be passed to Cilium via the ``nodePort.range`` option, for
example as ``nodePort.range="10000\,32767"`` for a range of ``10000-32767``. The
default Kubernetes NodePort range is ``30000-32767``.

If the NodePort port range overlaps with the ephemeral port range
(``net.ipv4.ip_local_port_range``), Cilium will append the NodePort range to
the reserved ports (``net.ipv4.ip_local_reserved_ports``). This is needed to
prevent a NodePort service from hijacking traffic of a host-local application
whose source port matches the service port. To disable the modification of
the reserved ports, set ``nodePort.autoProtectPortRanges`` to ``false``.

By default, the NodePort implementation prevents application ``bind(2)`` requests
to NodePort service ports. In such a case, the application will typically see a
``bind: Operation not permitted`` error. For older kernels this happens globally;
starting with v5.7 kernels it by default only applies to the host namespace and
therefore no longer affects ``bind(2)`` requests from application pods. To opt out
of this behavior altogether, expert users can switch ``nodePort.bindProtection`` to
``false``.

.. _Configuring Maps:

Configuring BPF Map Sizes
*************************

For high-scale environments, Cilium's BPF maps can be configured to have higher
limits on the number of entries. Overriding Helm options can be used to tweak
these limits.

To increase the number of entries in Cilium's BPF LB service, backend and
affinity maps, consider overriding the ``bpf.lbMapMax`` Helm option.
The default value of this LB map size is 65536.

.. parsed-literal::

    helm install cilium |CHART_RELEASE| \\
        --namespace kube-system \\
        --set kubeProxyReplacement=true \\
        --set bpf.lbMapMax=131072

.. _kubeproxyfree_hostport:

Container HostPort Support
**************************

Although not part of kube-proxy, Cilium's eBPF kube-proxy replacement also
natively supports ``hostPort`` service mapping without having to use the
Helm CNI chaining option ``cni.chainingMode=portmap``.

By specifying ``kubeProxyReplacement=true``, the native hostPort support is
automatically enabled and therefore no further action is required. Otherwise,
``hostPort.enabled=true`` can be used to enable the feature.
985 986 If the ``hostPort`` is specified without an additional ``hostIP``, then the 987 Pod will be exposed to the outside world with the same local addresses from 988 the node that were detected and used for exposing NodePort services, e.g. 989 the Kubernetes InternalIP or ExternalIP if set. Additionally, the Pod is also 990 accessible through the loopback address on the node such as ``127.0.0.1:hostPort``. 991 If in addition to ``hostPort`` also a ``hostIP`` has been specified for the 992 Pod, then the Pod will only be exposed on the given ``hostIP`` instead. A 993 ``hostIP`` of ``0.0.0.0`` will have the same behavior as if a ``hostIP`` was 994 not specified. The ``hostPort`` must not reside in the configured NodePort 995 port range to avoid collisions. 996 997 An example deployment in a kube-proxy-free environment therefore is the same 998 as in the earlier getting started deployment: 999 1000 .. parsed-literal:: 1001 1002 helm install cilium |CHART_RELEASE| \\ 1003 --namespace kube-system \\ 1004 --set kubeProxyReplacement=true \\ 1005 --set k8sServiceHost=${API_SERVER_IP} \\ 1006 --set k8sServicePort=${API_SERVER_PORT} 1007 1008 1009 Also, ensure that each node IP is known via ``INTERNAL-IP`` or ``EXTERNAL-IP``, 1010 for example: 1011 1012 .. code-block:: shell-session 1013 1014 $ kubectl get nodes -o wide 1015 NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP [...] 1016 apoc Ready master 6h15m v1.17.3 192.168.178.29 <none> [...] 1017 tank Ready <none> 6h13m v1.17.3 192.168.178.28 <none> [...] 1018 1019 If this is not the case, then ``kubelet`` needs to be made aware of it through 1020 specifying ``--node-ip`` through ``KUBELET_EXTRA_ARGS``. Assuming ``eth0`` is 1021 the public facing interface, this can be achieved by: 1022 1023 .. code-block:: shell-session 1024 1025 $ echo KUBELET_EXTRA_ARGS=\"--node-ip=$(ip -4 -o a show eth0 | awk '{print $4}' | cut -d/ -f1)\" | tee -a /etc/default/kubelet 1026 1027 After updating ``/etc/default/kubelet``, kubelet needs to be restarted. 1028 1029 In order to verify whether the HostPort feature has been enabled in Cilium, the 1030 ``cilium-dbg status`` CLI command provides visibility through the ``KubeProxyReplacement`` 1031 info line. If it has been enabled successfully, ``HostPort`` is shown as ``Enabled``, 1032 for example: 1033 1034 .. code-block:: shell-session 1035 1036 $ kubectl -n kube-system exec ds/cilium -- cilium-dbg status --verbose | grep HostPort 1037 - HostPort: Enabled 1038 1039 The following modified example yaml from the setup validation with an additional 1040 ``hostPort: 8080`` parameter can be used to verify the mapping: 1041 1042 .. code-block:: yaml 1043 1044 apiVersion: apps/v1 1045 kind: Deployment 1046 metadata: 1047 name: my-nginx 1048 spec: 1049 selector: 1050 matchLabels: 1051 run: my-nginx 1052 replicas: 1 1053 template: 1054 metadata: 1055 labels: 1056 run: my-nginx 1057 spec: 1058 containers: 1059 - name: my-nginx 1060 image: nginx 1061 ports: 1062 - containerPort: 80 1063 hostPort: 8080 1064 1065 After deployment, we can validate that Cilium's eBPF kube-proxy replacement 1066 exposed the container as HostPort under the specified port ``8080``: 1067 1068 .. code-block:: shell-session 1069 1070 $ kubectl exec -it -n kube-system cilium-fmh8d -- cilium-dbg service list 1071 ID Frontend Service Type Backend 1072 [...] 
    5    192.168.178.29:8080   HostPort       1 => 10.29.207.199:80

Similarly, we can verify through ``iptables`` in the host namespace that
no ``iptables`` rule for the HostPort service is present:

.. code-block:: shell-session

    $ iptables-save | grep HOSTPORT
    [ empty line ]

Last but not least, a simple ``curl`` test shows connectivity for the
exposed HostPort container under the node's IP:

.. code-block:: shell-session

    $ curl 192.168.178.29:8080
    <!DOCTYPE html>
    <html>
    <head>
    <title>Welcome to nginx!</title>
    [....]

Removing the deployment also removes the corresponding HostPort from
the ``cilium-dbg service list`` dump:

.. code-block:: shell-session

    $ kubectl delete deployment my-nginx

kube-proxy Hybrid Modes
***********************

Cilium's eBPF kube-proxy replacement can be configured in several modes: it can
replace kube-proxy entirely, or it can co-exist with kube-proxy on the system if the
underlying Linux kernel does not meet the requirements for a full kube-proxy
replacement.

.. warning::

    When deploying the eBPF kube-proxy replacement under co-existence with
    kube-proxy on the system, be aware that both mechanisms operate independently of
    each other. This means that if the eBPF kube-proxy replacement is added to or
    removed from an already *running* cluster in order to delegate operation from
    kube-proxy to Cilium or back, existing connections must be expected to break
    since, for example, the two NAT tables are not aware of each other. If deployed
    in co-existence on a newly spun-up node/cluster which does not yet serve user
    traffic, this is not an issue.

This section elaborates on the ``kubeProxyReplacement`` options:

- ``kubeProxyReplacement=true``: When using this option, it's highly recommended
  to run a kube-proxy-free Kubernetes setup where Cilium is expected to fully replace
  all kube-proxy functionality. However, if it's not possible to remove kube-proxy for
  specific reasons (e.g. Kubernetes distribution limitations), it's also acceptable to
  leave it deployed in the background. Just be aware of the potential side effects on
  existing nodes as mentioned above when running kube-proxy in co-existence. Once the
  Cilium agent is up and running, it takes care of handling Kubernetes services of type
  ClusterIP, NodePort, LoadBalancer, services with externalIPs as well as HostPort.
  If the underlying kernel version requirements are not met
  (see :ref:`kubeproxy-free` note), then the Cilium agent will bail out on start-up
  with an error message.

- ``kubeProxyReplacement=false``: This option disables any Kubernetes service
  handling by fully relying on kube-proxy instead, except for ClusterIP services
  accessed from pods (pre-v1.6 behavior), or it can be used for a hybrid setup. That
  is, kube-proxy runs in the Kubernetes cluster while Cilium partially replaces and
  optimizes kube-proxy functionality. The ``false`` option requires the user to
  manually specify which components of the eBPF kube-proxy replacement should be
  used. Similarly to the ``true`` mode, the Cilium agent will bail out on start-up
  with an error message if the underlying kernel requirements are not met when
  components are manually enabled.
  For fine-grained configuration, ``socketLB.enabled``, ``nodePort.enabled``,
  ``externalIPs.enabled`` and ``hostPort.enabled`` can be set to ``true``. By
  default all four options are set to ``false``. If you set ``nodePort.enabled``
  to ``true``, make sure to also set ``nodePort.enableHealthCheck`` to ``false``,
  so that the Cilium agent does not start the NodePort health check server
  (``kube-proxy`` will also attempt to start this server, and there would otherwise
  be a clash when Cilium attempts to bind its server to the same port). A few
  example configurations for the ``false`` option are provided below.

.. note::

    Switching from the ``true`` to the ``false`` mode, or vice versa, can break
    existing connections to services in a cluster. The same goes for enabling or
    disabling ``socketLB``. It is recommended to drain all workloads before
    performing such configuration changes.

The following Helm setup would be equivalent to ``kubeProxyReplacement=true``
in a kube-proxy-free environment:

.. parsed-literal::

    helm install cilium |CHART_RELEASE| \\
        --namespace kube-system \\
        --set kubeProxyReplacement=false \\
        --set socketLB.enabled=true \\
        --set nodePort.enabled=true \\
        --set externalIPs.enabled=true \\
        --set hostPort.enabled=true \\
        --set k8sServiceHost=${API_SERVER_IP} \\
        --set k8sServicePort=${API_SERVER_PORT}

The following Helm setup would be equivalent to the default Cilium service
handling in v1.6 or earlier in a kube-proxy environment, that is, serving ClusterIP
for pods:

.. parsed-literal::

    helm install cilium |CHART_RELEASE| \\
        --namespace kube-system \\
        --set kubeProxyReplacement=false

The following Helm setup would optimize Cilium's handling of NodePort, LoadBalancer
and services with externalIPs for external traffic ingressing into the Cilium-managed
node in a kube-proxy environment:

.. parsed-literal::

    helm install cilium |CHART_RELEASE| \\
        --namespace kube-system \\
        --set kubeProxyReplacement=false \\
        --set nodePort.enabled=true \\
        --set externalIPs.enabled=true

In Cilium's Helm chart, the default mode is ``kubeProxyReplacement=false`` for
new deployments.

The current Cilium kube-proxy replacement mode can also be introspected through the
``cilium-dbg status`` CLI command:

.. code-block:: shell-session

    $ kubectl -n kube-system exec ds/cilium -- cilium-dbg status | grep KubeProxyReplacement
    KubeProxyReplacement:   True   [eth0 (DR)]

Graceful Termination
********************

Cilium's eBPF kube-proxy replacement supports graceful termination of service
endpoint pods. The feature requires at least Kubernetes version 1.20, and
the feature gate ``EndpointSliceTerminatingCondition`` needs to be enabled.
By default, the Cilium agent then detects such terminating Pod events, and
increments the metric ``k8s_terminating_endpoints_events_total``. If needed,
the feature can be disabled with the configuration option
``enable-k8s-terminating-endpoint``.

The Cilium agent feature flag can be probed by running the ``cilium-dbg status``
command:

.. code-block:: shell-session

    $ kubectl -n kube-system exec ds/cilium -- cilium-dbg status --verbose
    [...]
    KubeProxyReplacement Details:
      [...]
      Graceful Termination:   Enabled
    [...]

When the Cilium agent receives a Kubernetes update event for a terminating endpoint,
the datapath state for the endpoint is removed so that it won't serve new
connections, but the endpoint's active connections are able to terminate
gracefully. The endpoint state is fully removed when the agent receives
a Kubernetes delete event for the endpoint. The `Kubernetes
pod termination <https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination>`_
documentation contains more background on the behavior and on configuration using
``terminationGracePeriodSeconds``.
There are some special cases, like zero-disruption rolling updates, that require
being able to send traffic to Terminating Pods which are still Serving traffic
during the Terminating period; the Kubernetes blog post
`Advancements in Kubernetes Traffic Engineering
<https://kubernetes.io/blog/2022/12/30/advancements-in-kubernetes-traffic-engineering/#traffic-loss-from-load-balancers-during-rolling-updates>`_
explains this in detail.

.. admonition:: Video
  :class: attention

  To learn more about Cilium's graceful termination support, check out `eCHO Episode 49: Graceful Termination Support with Cilium 1.11 <https://www.youtube.com/watch?v=9GBxJMp6UkI&t=980s>`__.

.. _session-affinity:

Session Affinity
****************

Cilium's eBPF kube-proxy replacement supports Kubernetes service session affinity.
Each connection from the same pod or host to a service configured with
``sessionAffinity: ClientIP`` will always select the same service endpoint.
The default timeout for the affinity is three hours (updated by each request to
the service), but it can be configured through Kubernetes' ``sessionAffinityConfig``
if needed.

The source for the affinity depends on the origin of a request. If a request is
sent from outside the cluster to the service, the request's source IP address is
used for determining the endpoint affinity. If a request is sent from inside
the cluster, then the source depends on whether the socket-LB feature
is used to load balance ClusterIP services. If yes, then the client's network
namespace cookie is used as the source. The latter was introduced in the 5.7
Linux kernel to implement the affinity at the socket layer at which
the socket-LB operates (a source IP is not available there, as the
endpoint selection happens before a network packet has been built by the
kernel). If the socket-LB is not used (i.e. the load balancing is done
at the pod network interface, on a per-packet basis), then the request's source
IP address is used as the source.

The session affinity support is enabled by default for Cilium's kube-proxy
replacement. For users who run on older kernels which do not support network
namespace cookies, a fallback in-cluster mode is implemented, which is based on
a fixed cookie value as a trade-off. This makes all applications on the host
select the same service endpoint for a given service with session affinity
configured. To disable the feature, set ``config.sessionAffinity=false``.

When the fixed cookie value is not used, the session affinity of a service with
multiple ports is per service IP and port.

For users who run with kube-proxy (i.e. with Cilium's kube-proxy replacement
disabled), ClusterIP service load balancing for a request sent from a pod
running in a non-host network namespace is still performed at the pod network
interface (until `GH#16197 <https://github.com/cilium/cilium/issues/16197>`__ is
fixed). For this case, the session affinity support is disabled by default. To
enable the feature, set ``config.sessionAffinity=true``.

kube-proxy Replacement Health Check server
******************************************

To enable the health check server for the kube-proxy replacement, the
``kubeProxyReplacementHealthzBindAddr`` option has to be set (disabled by
default). The option accepts the IP address and port for the health check server
to serve on.
For example, to enable it for IPv4 interfaces, set
``kubeProxyReplacementHealthzBindAddr='0.0.0.0:10256'``; for IPv6, set
``kubeProxyReplacementHealthzBindAddr='[::]:10256'``. The health check server is
accessible via the HTTP ``/healthz`` endpoint.

LoadBalancer Source Ranges Checks
*********************************

When a ``LoadBalancer`` service is configured with ``spec.loadBalancerSourceRanges``,
Cilium's eBPF kube-proxy replacement restricts access to the service from outside
the cluster (e.g. external world traffic) to the allowed CIDRs specified in the field.
If the field is empty, no access restrictions are applied.

When accessing the service from inside a cluster, the kube-proxy replacement will
ignore the field regardless of whether it is set. This means that any pod or any host
process in the cluster will be able to access the ``LoadBalancer`` service internally.

The load balancer source range check feature is enabled by default, and it can be
disabled by setting ``config.svcSourceRangeCheck=false``. It makes sense to disable
the check when running on some cloud providers. For example, `Amazon NLB
<https://kubernetes.io/docs/concepts/services-networking/service/#aws-nlb-support>`__
natively implements the check, so the kube-proxy replacement's check can be disabled.
Meanwhile, the `GKE internal TCP/UDP load balancer
<https://cloud.google.com/kubernetes-engine/docs/how-to/service-parameters#lb_source_ranges>`__
does not, so the feature must be kept enabled in order to restrict access.

Service Proxy Name Configuration
********************************

Like kube-proxy, Cilium also honors the ``service.kubernetes.io/service-proxy-name`` label
and only manages services that carry a matching service-proxy-name value. This name can be
configured by setting the ``k8s.serviceProxyName`` option, and the behavior is identical to
that of kube-proxy. The service proxy name defaults to an empty string, which instructs
Cilium to manage only services that do not have the ``service.kubernetes.io/service-proxy-name``
label.
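
As an illustration, the sketch below runs Cilium with a non-empty service proxy name
(the value ``cilium-proxy`` is only an example) so that it manages just the services
labeled with that value:

.. parsed-literal::

    helm upgrade cilium |CHART_RELEASE| \\
      --namespace kube-system \\
      --reuse-values \\
      --set k8s.serviceProxyName=cilium-proxy

Services intended to be handled by this agent then need the
``service.kubernetes.io/service-proxy-name: cilium-proxy`` label in their metadata.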

For more details on the usage of the ``service.kubernetes.io/service-proxy-name``
label and how it works, take a look at `this KEP
<https://github.com/kubernetes/enhancements/blob/3ad891202dab1fd5211946f10f31b48003bf8113/keps/sig-network/2447-Make-kube-proxy-service-abstraction-optional/README.md>`__.

.. note::

   If Cilium with a non-empty service proxy name is meant to manage all services in
   kube-proxy-free mode, make sure that default Kubernetes services like ``kube-dns``
   and ``kubernetes`` have the required label value.

Traffic Distribution and Topology Aware Hints
*********************************************

The kube-proxy replacement implements both Kubernetes `Topology Aware Routing
<https://kubernetes.io/docs/concepts/services-networking/topology-aware-routing>`__,
and the more recent `Traffic Distribution
<https://kubernetes.io/docs/concepts/services-networking/service/#traffic-distribution>`__
features.

Both of these features work by setting ``hints`` on EndpointSlices that enable
Cilium to route to endpoints residing in the same zone. To enable the feature,
set ``loadBalancer.serviceTopology=true``.

Neighbor Discovery
******************

When kube-proxy replacement is enabled, Cilium performs L2 neighbor discovery of nodes
in the cluster. This is required for the service load balancing to populate L2
addresses for backends, since it is not possible to dynamically resolve neighbors
on demand in the fast-path.

In Cilium 1.10 or earlier, the agent itself contained an ARP resolution library
which triggered discovery and periodic refresh of new nodes joining the cluster.
The resolved neighbor entries were pushed into the kernel and refreshed as PERMANENT
entries. In some rare cases, Cilium 1.10 or earlier might have left stale entries behind
in the neighbor table, causing packets between some nodes to be dropped. To skip the
neighbor discovery and instead rely on the Linux kernel to discover neighbors, you can
pass the ``--enable-l2-neigh-discovery=false`` flag to the cilium-agent. However,
note that relying on the Linux kernel might also cause some packets to be dropped.
For example, a NodePort request can be dropped on an intermediate node (i.e., the
one which received a service packet and is going to forward it to a destination node
which runs the selected service endpoint). This can happen if there is no L2 neighbor
entry in the kernel (because the entry was garbage collected or because the kernel has
not yet performed the neighbor resolution). This is because it is not possible to drive
the neighbor resolution from BPF programs in the fast-path, e.g. at the XDP layer.

From Cilium 1.11 onwards, the neighbor discovery has been fully reworked and the Cilium
internal ARP resolution library has been removed from the agent. The agent now fully
relies on the Linux kernel to discover gateways or hosts on the same L2 network. Both
IPv4 and IPv6 neighbor discovery is supported in the Cilium agent.

As per our kernel work `presented at Plumbers <https://linuxplumbersconf.org/event/11/contributions/953/>`__,
"managed" neighbor entries have been `upstreamed <https://lore.kernel.org/netdev/20211011121238.25542-1-daniel@iogearbox.net/>`__
and are available in Linux kernel v5.16 or later, which the Cilium agent detects
and transparently uses. In this case, the agent pushes down L3 addresses of new nodes
joining the cluster as externally learned "managed" neighbor entries. For introspection,
iproute2 displays them as "managed extern_learn". The "extern_learn" attribute prevents
garbage collection of the entries by the kernel's neighboring subsystem. Such "managed"
neighbor entries are dynamically resolved and periodically refreshed by the Linux kernel
itself even if there is no active traffic for a certain period of time. That is, the
kernel attempts to always keep them in REACHABLE state. For Linux kernels v5.15 or
earlier where "managed" neighbor entries are not available, the Cilium agent similarly
pushes L3 addresses of new nodes into the kernel for dynamic resolution, but with an
agent-triggered periodic refresh. For introspection, iproute2 displays them only as
"extern_learn" in this case. If there is no active traffic for a certain period of
time, a Cilium agent controller triggers the Linux kernel-based re-resolution in order
to attempt to keep them in REACHABLE state. The refresh interval can be changed if needed
through the ``--arping-refresh-period=30s`` flag passed to the cilium-agent. The default
period is ``30s``, which corresponds to the kernel's base reachable time.

The neighbor discovery supports multi-device environments where each node has multiple devices
and multiple next-hops to another node. The Cilium agent pushes neighbor entries for all target
devices, including the direct routing device. Currently, it supports one next-hop per device.
The following example illustrates how the neighbor discovery works in a multi-device environment.
Each node has two devices connected to different L3 networks (10.69.0.64/26 and 10.69.0.128/26),
and a global scope address each (10.69.0.1/26 and 10.69.0.2/26). A next-hop from node1 to node2 is
either ``10.69.0.66 dev eno1`` or ``10.69.0.130 dev eno2``. The Cilium agent pushes neighbor
entries for both ``10.69.0.66 dev eno1`` and ``10.69.0.130 dev eno2`` in this case.

::

    +---------------+     +---------------+
    |     node1     |     |     node2     |
    | 10.69.0.1/26  |     | 10.69.0.2/26  |
    |           eno1+-----+eno1           |
    |            |  |     |  |            |
    | 10.69.0.65/26 |     | 10.69.0.66/26 |
    |               |     |               |
    |           eno2+-----+eno2           |
    |            |  |     |  |            |
    | 10.69.0.129/26|     |10.69.0.130/26 |
    +---------------+     +---------------+

On node1, this results in:

.. code-block:: shell-session

    $ ip route show
    10.69.0.2
            nexthop via 10.69.0.66 dev eno1 weight 1
            nexthop via 10.69.0.130 dev eno2 weight 1

    $ ip neigh show
    10.69.0.66 dev eno1 lladdr 96:eb:75:fd:89:fd extern_learn  REACHABLE
    10.69.0.130 dev eno2 lladdr 52:54:00:a6:62:56 extern_learn  REACHABLE
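
If you prefer to leave neighbor management entirely to the kernel as described above,
the agent-side discovery can be switched off. The following sketch assumes that your
chart version exposes the ``l2NeighDiscovery.enabled`` value; alternatively, pass the
``--enable-l2-neigh-discovery=false`` flag to the cilium-agent directly:

.. parsed-literal::

    helm upgrade cilium |CHART_RELEASE| \\
      --namespace kube-system \\
      --reuse-values \\
      --set l2NeighDiscovery.enabled=false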

.. _external_access_to_clusterip_services:

External Access To ClusterIP Services
*************************************

As per the `Kubernetes Service documentation <https://kubernetes.io/docs/concepts/services-networking/service/#publishing-services-service-types>`__,
Cilium's eBPF kube-proxy replacement by default disallows access to a ClusterIP service from outside the cluster.
This can be allowed by setting ``bpf.lbExternalClusterIP=true``.

Observability
*************

You can trace socket LB related datapath events using Hubble and ``cilium-dbg monitor``.

Apply the following pod and service:

.. code-block:: yaml

    apiVersion: v1
    kind: Pod
    metadata:
      name: nginx
      labels:
        app: proxy
    spec:
      containers:
      - name: nginx
        image: nginx:stable
        ports:
        - containerPort: 80
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: nginx-service
    spec:
      selector:
        app: proxy
      ports:
      - port: 80

Deploy a client pod to start traffic:

.. parsed-literal::

    $ kubectl create -f \ |SCM_WEB|\/examples/kubernetes-dns/dns-sw-app.yaml

.. code-block:: shell-session

    $ kubectl get svc | grep nginx
    nginx-service   ClusterIP   10.96.128.44   <none>   80/TCP   140m

    $ kubectl exec -it mediabot -- curl -v --connect-timeout 5 10.96.128.44

Follow the Hubble :ref:`hubble_cli` guide to see the network flows. The Hubble
output prints datapath events before and after socket LB translation between the
service and the selected service endpoint.

.. code-block:: shell-session

    $ hubble observe --all | grep mediabot
    Jan 13 13:47:20.932: default/mediabot (ID:5618) <> default/nginx-service:80 (world) pre-xlate-fwd TRACED (TCP)
    Jan 13 13:47:20.932: default/mediabot (ID:5618) <> default/nginx:80 (ID:35772) post-xlate-fwd TRANSLATED (TCP)
    Jan 13 13:47:20.932: default/nginx:80 (ID:35772) <> default/mediabot (ID:5618) pre-xlate-rev TRACED (TCP)
    Jan 13 13:47:20.932: default/nginx-service:80 (world) <> default/mediabot (ID:5618) post-xlate-rev TRANSLATED (TCP)
    Jan 13 13:47:20.932: default/mediabot:38750 (ID:5618) <> default/nginx (ID:35772) pre-xlate-rev TRACED (TCP)

Socket LB tracing with Hubble requires the Cilium agent to detect pod cgroup paths.
If you see the warning message ``No valid cgroup base path found: socket load-balancing tracing with Hubble will not work.``
in the Cilium agent logs, you can trace packets using ``cilium-dbg monitor`` instead.

.. note::

   In case of the warning log, please file a GitHub issue with the cgroup path
   for any of your pods, obtained by running the following command on a Kubernetes
   node in your cluster: ``sudo crictl inspectp -o=json $POD_ID | grep cgroup``.

.. code-block:: shell-session

    $ kubectl get pods -o wide
    NAME       READY   STATUS    RESTARTS   AGE     IP             NODE          NOMINATED NODE   READINESS GATES
    mediabot   1/1     Running   0          54m     10.244.1.237   kind-worker   <none>           <none>
    nginx      1/1     Running   0          3h25m   10.244.1.246   kind-worker   <none>           <none>

    $ kubectl exec -n kube-system cilium-rt2jh -- cilium-dbg monitor -v -t trace-sock
    CPU 11: [pre-xlate-fwd] cgroup_id: 479586 sock_cookie: 7123674, dst [10.96.128.44]:80 tcp
    CPU 11: [post-xlate-fwd] cgroup_id: 479586 sock_cookie: 7123674, dst [10.244.1.246]:80 tcp
    CPU 11: [pre-xlate-rev] cgroup_id: 479586 sock_cookie: 7123674, dst [10.244.1.246]:80 tcp
    CPU 11: [post-xlate-rev] cgroup_id: 479586 sock_cookie: 7123674, dst [10.96.128.44]:80 tcp

You can identify the client pod using the printed ``cgroup id`` metadata: the pod's
cgroup path corresponding to the ``cgroup id`` contains the pod's UID. The socket
cookie is a unique socket identifier allocated in the Linux kernel, and the socket
cookie metadata can be used to identify all the trace events from the same socket.

.. code-block:: shell-session

    $ kubectl get pods -o custom-columns=PodName:.metadata.name,PodUID:.metadata.uid
    PodName    PodUID
    mediabot   b620703c-c446-49c7-84c8-e23f4ba5626b
    nginx      73b9938b-7e4b-4cbd-8c4c-67d4f253ccf4

    $ kubectl exec -n kube-system cilium-rt2jh -- find /run/cilium/cgroupv2/ -inum 479586
    Defaulted container "cilium-agent" out of: cilium-agent, mount-cgroup (init), apply-sysctl-overwrites (init), clean-cilium-state (init)
    /run/cilium/cgroupv2/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-besteffort.slice/kubelet-kubepods-besteffort-podb620703c_c446_49c7_84c8_e23f4ba5626b.slice/cri-containerd-4e7fc71c8bef8c05c9fb76d93a186736fca266e668722e1239fe64503b3e80d3.scope

Troubleshooting
***************

Validate BPF cgroup programs attachment
=======================================

Cilium attaches BPF ``cgroup`` programs to enable socket-based load balancing (aka
``host-reachable`` services). If you see connectivity issues for ``clusterIP`` services,
check whether the programs are attached to the host ``cgroup root``. The default ``cgroup``
root is set to ``/run/cilium/cgroupv2``.
Run the following commands from a Cilium agent pod as well as from the underlying
Kubernetes node where the pod is running. If the container runtime in your cluster
runs in cgroup namespace mode, the Cilium agent pod may attach the BPF ``cgroup``
programs to a virtualized cgroup root instead. In such cases, Cilium's kube-proxy
replacement based load balancing may not be effective, leading to connectivity issues.
Make sure that you have the fix from `this pull request <https://github.com/cilium/cilium/pull/16259>`__.

.. code-block:: shell-session

    $ mount | grep cgroup2
    none on /run/cilium/cgroupv2 type cgroup2 (rw,relatime)

    $ bpftool cgroup tree /run/cilium/cgroupv2/
    CgroupPath
    ID       AttachType      AttachFlags     Name
    /run/cilium/cgroupv2
    10613    device          multi
    48497    connect4
    48493    connect6
    48499    sendmsg4
    48495    sendmsg6
    48500    recvmsg4
    48496    recvmsg6
    48498    getpeername4
    48494    getpeername6
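
If the programs are not attached at the default cgroup root, one quick way to check
whether your container runtime places workloads into separate cgroup namespaces is to
list the cgroup namespaces on the node. This is only a rough sketch; the exact output
depends on your runtime and node setup:

.. code-block:: shell-session

    $ sudo lsns -t cgroup

Multiple ``cgroup`` namespace entries indicate that containers run in their own cgroup
namespaces.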

Known Issues
############

For clusters deployed with Cilium version 1.11.14 or earlier, service backend entries could
be leaked in the BPF maps in some instances. The known cases that could lead
to such leaks are race conditions between the deletion of a service backend
while it is terminating and the simultaneous deletion of the service the backend is
associated with. This could lead to duplicate backend entries that could eventually
fill up the ``cilium_lb4_backends_v2`` map.
In such cases, you might see error messages like these in the Cilium agent logs::

    Unable to update element for cilium_lb4_backends_v2 map with file descriptor 15: the map is full, please consider resizing it. argument list too long

While the leak was fixed in Cilium version 1.11.15, clusters affected while running the
problematic Cilium versions 1.11.14 or earlier may, after upgrading to subsequent versions,
still not see the leaked backends cleaned up from the BPF maps when the Cilium agent restarts.
The fixes to clean up leaked duplicate backend entries were backported to older
releases and are available as part of Cilium versions v1.11.16, v1.12.9 and v1.13.2.
Fresh clusters deploying Cilium version 1.11.15 or later don't experience this leak issue.

For more information, see `this GitHub issue <https://github.com/cilium/cilium/issues/23551>`__.

Limitations
###########

* Cilium's eBPF kube-proxy replacement currently cannot be used with :ref:`encryption_ipsec`.
* Cilium's eBPF kube-proxy replacement relies upon the socket-LB feature,
  which uses eBPF cgroup hooks to implement the service translation. Using it with libceph
  deployments currently requires support for the getpeername(2) hook address translation in
  eBPF, which is only available for kernels v5.8 and higher.
* In order to support NFS in the kernel with the socket-LB feature, ensure that
  kernel commit ``0bdf399342c5 ("net: Avoid address overwrite in kernel_connect")``
  is part of your underlying kernel. Linux kernels v6.6 and higher support it. Older
  stable kernels are TBD. For a more detailed discussion see :gh-issue:`21541`.
* Cilium's DSR NodePort mode currently does not operate well in environments with
  TCP Fast Open (TFO) enabled. It is recommended to switch to ``snat`` mode in this
  situation.
* Cilium's eBPF kube-proxy replacement does not support the SCTP transport protocol.
  Only TCP and UDP are supported as transports for services at this point.
* Cilium's eBPF kube-proxy replacement does not allow ``hostPort`` port configurations
  for Pods that overlap with the configured NodePort range. In such a case, the ``hostPort``
  setting will be ignored and a warning emitted to the Cilium agent log. Similarly,
  explicitly binding the ``hostIP`` to the loopback address in the host namespace is
  currently not supported and will log a warning to the Cilium agent log.
* When Cilium's kube-proxy replacement is used with Kubernetes versions (< 1.19) that have
  support for ``EndpointSlices``, ``Services`` without selectors and backing ``Endpoints``
  don't work. The reason is that Cilium only monitors changes made to ``EndpointSlices``
  objects if support is available, and ignores ``Endpoints`` in those cases. The Kubernetes
  1.19 release introduces the ``EndpointSliceMirroring`` controller, which mirrors custom
  ``Endpoints`` resources to corresponding ``EndpointSlices``, thus allowing backing
  ``Endpoints`` to work. For a more detailed discussion see :gh-issue:`12438`.
* When deployed on kernels older than 5.7, Cilium is unable to distinguish between host and
  pod namespaces due to the lack of kernel support for network namespace cookies. As a result,
  Kubernetes services are reachable from all pods via the loopback address.
* The neighbor discovery in a multi-device environment doesn't work with runtime device
  detection, which means that the target devices for the neighbor discovery don't follow
  device changes.
* When the socket-LB feature is enabled, pods sending (connected) UDP traffic to services
  can continue to send traffic to a service backend even after it has been deleted. The
  Cilium agent handles such scenarios by forcefully terminating application sockets that
  are connected to deleted backends, so that the applications can be load balanced to
  active backends. This functionality requires the following kernel configuration options
  to be enabled: ``CONFIG_INET_DIAG``, ``CONFIG_INET_UDP_DIAG`` and ``CONFIG_INET_DIAG_DESTROY``.
* Cilium's BPF-based masquerading is recommended over iptables when using the
  BPF-based NodePort. Otherwise, there is a risk of port collisions between
  BPF and iptables SNAT, which might result in dropped NodePort
  connections :gh-issue:`23604`.

Further Readings
################

The following presentations describe the inner workings of the kube-proxy replacement in eBPF
in great detail:

* "Liberating Kubernetes from kube-proxy and iptables" (KubeCon North America 2019, `slides
  <https://docs.google.com/presentation/d/1cZJ-pcwB9WG88wzhDm2jxQY4Sh8adYg0-N3qWQ8593I/edit>`__,
  `video <https://www.youtube.com/watch?v=bIRwSIwNHC0>`__)
* "Kubernetes service load-balancing at scale with BPF & XDP" (Linux Plumbers 2020, `slides
  <https://linuxplumbersconf.org/event/7/contributions/674/attachments/568/1002/plumbers_2020_cilium_load_balancer.pdf>`__,
  `video <https://www.youtube.com/watch?v=UkvxPyIJAko&t=21s>`__)
* "eBPF as a revolutionary technology for the container landscape" (Fosdem 2020, `slides
  <https://docs.google.com/presentation/d/1VOUcoIxgM_c6M_zAV1dLlRCjyYCMdR3tJv6CEdfLMh8/edit>`__,
  `video <https://fosdem.org/2020/schedule/event/containers_bpf/>`__)
* "Kernel improvements for Cilium socket LB" (LSF/MM/BPF 2020, `slides
  <https://docs.google.com/presentation/d/1w2zlpGWV7JUhHYd37El_AUZzyUNSvDfktrF5MJ5G8Bs/edit#slide=id.g746fc02b5b_2_0>`__)