     1  .. only:: not (epub or latex or html)
     2  
     3      WARNING: You are looking at unreleased Cilium documentation.
     4      Please use the official rendered version released here:
     5      https://docs.cilium.io
     6  
     7  .. _kubeproxy-free:
     8  
     9  *****************************
    10  Kubernetes Without kube-proxy
    11  *****************************
    12  
    13  This guide explains how to provision a Kubernetes cluster without ``kube-proxy``,
    14  and to use Cilium to fully replace it. For simplicity, we will use ``kubeadm`` to
    15  bootstrap the cluster.
    16  
    17  For help with installing ``kubeadm`` and for more provisioning options please refer to
    18  `the official Kubeadm documentation <https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/create-cluster-kubeadm/>`_.
    19  
    20  .. note::
    21  
    22     Cilium's kube-proxy replacement depends on the socket-LB feature,
    23     which requires a v4.19.57, v5.1.16, v5.2.0 or more recent Linux kernel.
    24     Linux kernels v5.3 and v5.8 add additional features that Cilium can use to
    25     further optimize the kube-proxy replacement implementation.
    26  
   Note that v5.0.y kernels do not have the fix required to run the kube-proxy
   replacement, since the v5.0.y stable kernel is end-of-life (EOL) and no longer
   maintained on kernel.org. For distribution-maintained kernels the situation
   may differ, so please check with your distribution.
    32  
    33  Quick-Start
    34  ###########
    35  
    36  Initialize the control-plane node via ``kubeadm init`` and skip the
    37  installation of the ``kube-proxy`` add-on:
    38  
    39  .. note::
    Depending on which CRI implementation you are using, you may need to pass the
    ``--cri-socket`` flag to your ``kubeadm init ...`` command.
    For example, if you are using the Docker CRI (cri-dockerd), use
    ``--cri-socket unix:///var/run/cri-dockerd.sock``.
    44  
    45  .. code-block:: shell-session
    46  
    47      $ kubeadm init --skip-phases=addon/kube-proxy
    48  
    49  Afterwards, join worker nodes by specifying the control-plane node IP address and
    50  the token returned by ``kubeadm init``
    51  (for this tutorial, you will want to add at least one worker node to the cluster):
    52  
    53  .. code-block:: shell-session
    54  
    55      $ kubeadm join <..>
    56  
    57  .. note::
    58  
    59      Please ensure that
    60      `kubelet <https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/>`_'s
    61      ``--node-ip`` is set correctly on each worker if you have multiple interfaces.
    62      Cilium's kube-proxy replacement may not work correctly otherwise.
    63      You can validate this by running ``kubectl get nodes -o wide`` to see whether
    64      each node has an ``InternalIP`` which is assigned to a device with the same
    65      name on each node.
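
    For example, you can cross-check this on each node; the device name below is
    only illustrative:

    .. code-block:: shell-session

        $ kubectl get nodes -o wide
        $ # on each node, verify which device carries the reported InternalIP:
        $ ip -4 -o addr show eth0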
    66  
For existing installations with ``kube-proxy`` running as a DaemonSet, remove it
with the following commands.
    69  
    70  .. warning::
   Be aware that removing ``kube-proxy`` breaks existing service connections. It also
   stops service-related traffic until the Cilium replacement has been installed.
    73  
    74  .. code-block:: shell-session
    75  
    76     $ kubectl -n kube-system delete ds kube-proxy
    77     $ # Delete the configmap as well to avoid kube-proxy being reinstalled during a Kubeadm upgrade (works only for K8s 1.19 and newer)
    78     $ kubectl -n kube-system delete cm kube-proxy
    79     $ # Run on each node with root permissions:
    80     $ iptables-save | grep -v KUBE | iptables-restore
    81  
    82  .. include:: ../../installation/k8s-install-download-release.rst
    83  
    84  Next, generate the required YAML files and deploy them.
    85  
    86  .. important::
    87  
    88     Make sure you correctly set your ``API_SERVER_IP`` and ``API_SERVER_PORT``
    89     below with the control-plane node IP address and the kube-apiserver port
    90     number reported by ``kubeadm init`` (Kubeadm will use port ``6443`` by default).
    91  
Specifying this is necessary because ``kubeadm init`` is run explicitly without setting
up kube-proxy: although Kubernetes exports ``KUBERNETES_SERVICE_HOST``
and ``KUBERNETES_SERVICE_PORT`` with a ClusterIP of the kube-apiserver service
to the environment, there is no kube-proxy in our setup provisioning that service.
Therefore, the Cilium agent needs to be made aware of this information through the following configuration:
    97  
    98  .. parsed-literal::
    99  
   100      API_SERVER_IP=<your_api_server_ip>
   101      # Kubeadm default is 6443
   102      API_SERVER_PORT=<your_api_server_port>
   103      helm install cilium |CHART_RELEASE| \\
   104          --namespace kube-system \\
   105          --set kubeProxyReplacement=true \\
   106          --set k8sServiceHost=${API_SERVER_IP} \\
   107          --set k8sServicePort=${API_SERVER_PORT}
   108  
   109  .. note::
   110  
    By default, Cilium automatically mounts the cgroup v2 filesystem, which is
    required to attach BPF cgroup programs, at the path ``/run/cilium/cgroupv2``.
    To do so, it temporarily mounts the host's ``/proc`` inside an init container
    launched by the DaemonSet. If you need to disable the auto-mount,
    specify ``--set cgroup.autoMount.enabled=false``, and set the host mount point
    where the cgroup v2 filesystem is already mounted by using ``--set cgroup.hostRoot``.
    For example, if not already mounted, you can mount the cgroup v2 filesystem by
    running the command below on the host, and specify ``--set cgroup.hostRoot=/sys/fs/cgroup``.
   119  
   120      .. code:: shell-session
   121  
   122          mount -t cgroup2 none /sys/fs/cgroup
   123  
This installs Cilium as a CNI plugin with the eBPF kube-proxy replacement to
handle Kubernetes services of type ClusterIP, NodePort, LoadBalancer, and
services with externalIPs. The eBPF kube-proxy replacement also supports hostPort
for containers, so the portmap plugin is no longer necessary.
   128  
   129  Finally, as a last step, verify that Cilium has come up correctly on all nodes and
   130  is ready to operate:
   131  
   132  .. code-block:: shell-session
   133  
   134      $ kubectl -n kube-system get pods -l k8s-app=cilium
   135      NAME                READY     STATUS    RESTARTS   AGE
   136      cilium-fmh8d        1/1       Running   0          10m
   137      cilium-mkcmb        1/1       Running   0          10m
   138  
Note that in the above Helm configuration, ``kubeProxyReplacement`` has been set to
``true``. This means that the Cilium agent will bail out on start-up if the
underlying Linux kernel support is missing.
   142  
   143  By default, Helm sets ``kubeProxyReplacement=false``, which only enables
   144  per-packet in-cluster load-balancing of ClusterIP services.
   145  
   146  Cilium's eBPF kube-proxy replacement is supported in direct routing as well as in
   147  tunneling mode.
   148  
   149  Validate the Setup
   150  ##################
   151  
After deploying Cilium with the above Quick-Start guide, we can first validate that
the Cilium agent is running in the desired mode:
   154  
   155  .. code-block:: shell-session
   156  
   157      $ kubectl -n kube-system exec ds/cilium -- cilium-dbg status | grep KubeProxyReplacement
   158      KubeProxyReplacement:   True	[eth0 (Direct Routing), eth1]
   159  
   160  Use ``--verbose`` for full details:
   161  
   162  .. code-block:: shell-session
   163  
   164      $ kubectl -n kube-system exec ds/cilium -- cilium-dbg status --verbose
   165      [...]
   166      KubeProxyReplacement Details:
   167        Status:                True
   168        Socket LB:             Enabled
   169        Protocols:             TCP, UDP
   170        Devices:               eth0 (Direct Routing), eth1
   171        Mode:                  SNAT
   172        Backend Selection:     Random
   173        Session Affinity:      Enabled
   174        Graceful Termination:  Enabled
   175        NAT46/64 Support:      Enabled
   176        XDP Acceleration:      Disabled
   177        Services:
   178        - ClusterIP:      Enabled
   179        - NodePort:       Enabled (Range: 30000-32767)
   180        - LoadBalancer:   Enabled
   181        - externalIPs:    Enabled
   182        - HostPort:       Enabled
   183      [...]
   184  
   185  As an optional next step, we will create an Nginx Deployment. Then we'll create a new NodePort service and
   186  validate that Cilium installed the service correctly.
   187  
   188  The following YAML is used for the backend pods:
   189  
   190  .. code-block:: yaml
   191  
   192      apiVersion: apps/v1
   193      kind: Deployment
   194      metadata:
   195        name: my-nginx
   196      spec:
   197        selector:
   198          matchLabels:
   199            run: my-nginx
   200        replicas: 2
   201        template:
   202          metadata:
   203            labels:
   204              run: my-nginx
   205          spec:
   206            containers:
   207            - name: my-nginx
   208              image: nginx
   209              ports:
   210              - containerPort: 80
   211  
   212  Verify that the Nginx pods are up and running:
   213  
   214  .. code-block:: shell-session
   215  
   216      $ kubectl get pods -l run=my-nginx -o wide
   217      NAME                        READY   STATUS    RESTARTS   AGE   IP             NODE   NOMINATED NODE   READINESS GATES
   218      my-nginx-756fb87568-gmp8c   1/1     Running   0          62m   10.217.0.149   apoc   <none>           <none>
   219      my-nginx-756fb87568-n5scv   1/1     Running   0          62m   10.217.0.107   apoc   <none>           <none>
   220  
   221  In the next step, we create a NodePort service for the two instances:
   222  
   223  .. code-block:: shell-session
   224  
   225      $ kubectl expose deployment my-nginx --type=NodePort --port=80
   226      service/my-nginx exposed
   227  
   228  Verify that the NodePort service has been created:
   229  
   230  .. code-block:: shell-session
   231  
   232      $ kubectl get svc my-nginx
   233      NAME       TYPE       CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
   234      my-nginx   NodePort   10.104.239.135   <none>        80:31940/TCP   24m
   235  
   236  With the help of the ``cilium-dbg service list`` command, we can validate that
   237  Cilium's eBPF kube-proxy replacement created the new NodePort service.
   238  In this example, services with port ``31940`` were created (one for each of devices ``eth0`` and ``eth1``):
   239  
   240  .. code-block:: shell-session
   241  
   242      $ kubectl -n kube-system exec ds/cilium -- cilium-dbg service list
   243      ID   Frontend               Service Type   Backend
   244      [...]
   245      4    10.104.239.135:80      ClusterIP      1 => 10.217.0.107:80
   246                                                 2 => 10.217.0.149:80
   247      5    0.0.0.0:31940          NodePort       1 => 10.217.0.107:80
   248                                                 2 => 10.217.0.149:80
   249      6    192.168.178.29:31940   NodePort       1 => 10.217.0.107:80
   250                                                 2 => 10.217.0.149:80
   251      7    172.16.0.29:31940      NodePort       1 => 10.217.0.107:80
   252                                                 2 => 10.217.0.149:80
   253  
   254  Create a variable with the node port for testing:
   255  
   256  .. code-block:: shell-session
   257  
   258      $ node_port=$(kubectl get svc my-nginx -o=jsonpath='{@.spec.ports[0].nodePort}')
   259  
   260  At the same time we can verify, using ``iptables`` in the host namespace,
   261  that no ``iptables`` rule for the service is present:
   262  
   263  .. code-block:: shell-session
   264  
   265      $ iptables-save | grep KUBE-SVC
   266      [ empty line ]
   267  
   268  Last but not least, a simple ``curl`` test shows connectivity for the exposed
   269  NodePort as well as for the ClusterIP:
   270  
   271  .. code-block:: shell-session
   272  
   273      $ curl 127.0.0.1:$node_port
   274      <!DOCTYPE html>
   275      <html>
   276      <head>
   277      <title>Welcome to nginx!</title>
   278      [....]
   279  
   280  .. code-block:: shell-session
   281  
   282      $ curl 192.168.178.29:$node_port
    <!DOCTYPE html>
    <html>
    <head>
    <title>Welcome to nginx!</title>
   287      [....]
   288  
   289  .. code-block:: shell-session
   290  
   291      $ curl 172.16.0.29:$node_port
    <!DOCTYPE html>
    <html>
    <head>
    <title>Welcome to nginx!</title>
   296      [....]
   297  
   298  .. code-block:: shell-session
   299  
   300      $ curl 10.104.239.135:80
   301      <!DOCTYPE html>
   302      <html>
   303      <head>
   304      <title>Welcome to nginx!</title>
   305      [....]
   306  
   307  As can be seen, Cilium's eBPF kube-proxy replacement is set up correctly.
   308  
   309  Advanced Configuration
   310  ######################
   311  
   312  This section covers a few advanced configuration modes for the kube-proxy replacement
   313  that go beyond the above Quick-Start guide and are entirely optional.
   314  
   315  Client Source IP Preservation
   316  *****************************
   317  
   318  Cilium's eBPF kube-proxy replacement implements various options to avoid
   319  performing SNAT on NodePort requests where the client source IP address would otherwise
   320  be lost on its path to the service endpoint.
   321  
- ``externalTrafficPolicy=Local``: The ``Local`` policy is generally supported through
  the eBPF implementation. In-cluster connectivity for services with ``externalTrafficPolicy=Local``
  is possible, and such services can also be reached from nodes which have no local backends:
  since no SNAT needs to be performed, all service endpoints remain available for
  load balancing from the in-cluster side. An example of creating such a service follows this list.
   327  
- ``externalTrafficPolicy=Cluster``: For the ``Cluster`` policy, which is the default
  upon service creation, multiple options exist for achieving client source IP preservation
  for external traffic: operate the kube-proxy replacement in :ref:`DSR<DSR Mode>` mode, or
  in :ref:`Hybrid<Hybrid Mode>` mode if only TCP-based services are exposed to the outside
  world.
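
As an optional illustration, the ``my-nginx`` Deployment from the validation section above
could be exposed through an additional NodePort service with ``externalTrafficPolicy=Local``.
The service name used here is only an example:

.. code-block:: shell-session

    $ kubectl expose deployment my-nginx --type=NodePort --port=80 --name=my-nginx-local
    $ kubectl patch svc my-nginx-local -p '{"spec":{"externalTrafficPolicy":"Local"}}'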
   333  
   334  Internal Traffic Policy
   335  ***********************
   336  
   337  Similar to ``externalTrafficPolicy`` described above, Cilium's eBPF kube-proxy replacement
   338  supports ``internalTrafficPolicy``, which translates the above semantics to in-cluster traffic.
   339  
- For services with ``internalTrafficPolicy=Local``, traffic originating from pods in the
  current cluster is routed only to endpoints on the node the traffic originated from.
   342  
   343  - ``internalTrafficPolicy=Cluster`` is the default, and it doesn't restrict the endpoints that
   344    can handle internal (in-cluster) traffic.
   345  
   346  The following table gives an idea of what backends are used to serve connections to a service,
   347  depending on the external and internal traffic policies:
   348  
   349  +---------------------+-------------------------------------------------+
   350  | Traffic policy      | Service backends used                           |
   351  +----------+----------+-------------------------+-----------------------+
   352  | Internal | External | for North-South traffic | for East-West traffic |
   353  +==========+==========+=========================+=======================+
   354  | Cluster  | Cluster  | All (default)           | All (default)         |
   355  +----------+----------+-------------------------+-----------------------+
   356  | Cluster  | Local    | Node-local only         | All (default)         |
   357  +----------+----------+-------------------------+-----------------------+
   358  | Local    | Cluster  | All (default)           | Node-local only       |
   359  +----------+----------+-------------------------+-----------------------+
   360  | Local    | Local    | Node-local only         | Node-local only       |
   361  +----------+----------+-------------------------+-----------------------+
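
For example, a ClusterIP service that restricts in-cluster traffic to node-local backends
could be created as follows; the service name is only illustrative and re-uses the
``run: my-nginx`` selector from the validation section:

.. code-block:: shell-session

    $ kubectl apply -f - <<EOF
    apiVersion: v1
    kind: Service
    metadata:
      name: my-nginx-node-local
    spec:
      selector:
        run: my-nginx
      internalTrafficPolicy: Local
      ports:
      - port: 80
    EOF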
   362  
   363  .. _maglev:
   364  
   365  Maglev Consistent Hashing
   366  *************************
   367  
Cilium's eBPF kube-proxy replacement supports consistent hashing by implementing a variant
of `Maglev hashing <https://storage.googleapis.com/pub-tools-public-publication-data/pdf/44824.pdf>`_
in its load balancer for backend selection. This improves resiliency in case of
failures. It also provides better load balancing properties, since nodes added to the cluster
make consistent backend selections throughout the cluster for a given 5-tuple without
having to synchronize state with the other nodes. Similarly, upon backend removal, the backend
lookup tables are reprogrammed with minimal disruption for unrelated backends (at most 1%
difference in the reassignments) for the given service.
   376  
Maglev hashing for service load balancing can be enabled by setting ``loadBalancer.algorithm=maglev``:
   378  
   379  .. parsed-literal::
   380  
   381      helm install cilium |CHART_RELEASE| \\
   382          --namespace kube-system \\
   383          --set kubeProxyReplacement=true \\
   384          --set loadBalancer.algorithm=maglev \\
   385          --set k8sServiceHost=${API_SERVER_IP} \\
   386          --set k8sServicePort=${API_SERVER_PORT}
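
Once Cilium is running with this configuration, you can check that Maglev is in effect.
The exact wording of the output may vary between releases, but the backend selection
reported by ``cilium-dbg status --verbose`` should no longer be ``Random``:

.. code-block:: shell-session

    $ kubectl -n kube-system exec ds/cilium -- cilium-dbg status --verbose | grep "Backend Selection"
      Backend Selection:     Maglev (Table Size: 16381)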
   387  
   388  Note that Maglev hashing is applied only to external (N-S) traffic. For
   389  in-cluster service connections (E-W), sockets are assigned to service backends
   390  directly, e.g. at TCP connect time, without any intermediate hop and thus are
   391  not subject to Maglev. Maglev hashing is also supported for Cilium's
   392  :ref:`XDP<XDP Acceleration>` acceleration.
   393  
   394  There are two more Maglev-specific configuration settings: ``maglev.tableSize``
   395  and ``maglev.hashSeed``.
   396  
   397  ``maglev.tableSize`` specifies the size of the Maglev lookup table for each single service.
   398  `Maglev <https://storage.googleapis.com/pub-tools-public-publication-data/pdf/44824.pdf>`__
   399  recommends the table size (``M``) to be significantly larger than the number of maximum expected
   400  backends (``N``). In practice that means that ``M`` should be larger than ``100 * N`` in
   401  order to guarantee the property of at most 1% difference in the reassignments on backend
   402  changes. ``M`` must be a prime number. Cilium uses a default size of ``16381`` for ``M``.
The following sizes for ``M`` are supported through the ``maglev.tableSize`` Helm option:
   404  
   405  +----------------------------+
   406  | ``maglev.tableSize`` value |
   407  +============================+
   408  | 251                        |
   409  +----------------------------+
   410  | 509                        |
   411  +----------------------------+
   412  | 1021                       |
   413  +----------------------------+
   414  | 2039                       |
   415  +----------------------------+
   416  | 4093                       |
   417  +----------------------------+
   418  | 8191                       |
   419  +----------------------------+
   420  | 16381                      |
   421  +----------------------------+
   422  | 32749                      |
   423  +----------------------------+
   424  | 65521                      |
   425  +----------------------------+
   426  | 131071                     |
   427  +----------------------------+
   428  
For example, a ``maglev.tableSize`` of ``16381`` is suitable for a maximum of ``~160`` backends
per service. If a higher number of backends is provisioned under this setting, then the
difference in reassignments on backend changes will increase.
   432  
Setting the ``maglev.hashSeed`` option is recommended so that Cilium does not rely on the
fixed built-in seed. The seed is a base64-encoded, random 12-byte number which can be
generated once, for example through ``head -c12 /dev/urandom | base64 -w0``.
Every Cilium agent in the cluster must use the same hash seed for Maglev to work.
   437  
The deployment example below generates such a seed and passes it to Helm, and also sets the
Maglev table size to ``65521`` to allow for a maximum of ``~650`` backends for a
given service (with the property of at most 1% difference on backend reassignments):
   441  
   442  .. parsed-literal::
   443  
   444      SEED=$(head -c12 /dev/urandom | base64 -w0)
   445      helm install cilium |CHART_RELEASE| \\
   446          --namespace kube-system \\
   447          --set kubeProxyReplacement=true \\
   448          --set loadBalancer.algorithm=maglev \\
   449          --set maglev.tableSize=65521 \\
   450          --set maglev.hashSeed=$SEED \\
   451          --set k8sServiceHost=${API_SERVER_IP} \\
   452          --set k8sServicePort=${API_SERVER_PORT}
   453  
   454  
Note that enabling Maglev incurs higher memory consumption on each Cilium-managed node compared
to the default ``loadBalancer.algorithm=random``, given that ``random`` does not need the extra
lookup tables. However, ``random`` does not provide consistent backend selection.
   458  
   459  .. _DSR mode:
   460  
   461  Direct Server Return (DSR)
   462  **************************
   463  
By default, Cilium's eBPF NodePort implementation operates in SNAT mode. That is,
when node-external traffic arrives and the node determines that the backend for
the LoadBalancer, NodePort, or externalIPs service is on a remote node, the
node redirects the request to the remote backend on the client's behalf by performing
SNAT. This does not require any additional MTU changes. The cost is that replies
from the backend need to make the extra hop back to that node to perform the
reverse SNAT translation there before returning the packet directly to the external
client.
   472  
   473  This setting can be changed through the ``loadBalancer.mode`` Helm option to
   474  ``dsr`` in order to let Cilium's eBPF NodePort implementation operate in DSR mode.
   475  In this mode, the backends reply directly to the external client without taking
   476  the extra hop, meaning, backends reply by using the service IP/port as a source.
   477  
Another advantage of DSR mode is that the client's source IP is preserved, so policy
can match on it at the backend node. In SNAT mode this is not possible.
Given that a specific backend can be used by multiple services, the backends need to be
made aware of the service IP/port which they need to reply with. Cilium encodes this
information into the packet (using one of the dispatch mechanisms described below),
at the cost of advertising a lower MTU. For TCP services, Cilium
only encodes the service IP/port for the SYN packet, but not for subsequent ones. This
optimization also allows Cilium to operate in a hybrid mode, as detailed in a later
subsection, where DSR is used for TCP and SNAT for UDP in order to avoid an otherwise
necessary MTU reduction.
   488  
   489  In some public cloud provider environments that implement source /
   490  destination IP address checking (e.g. AWS), the checking has to be disabled in
   491  order for the DSR mode to work.
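
For example, on AWS the check can be turned off per instance through the AWS CLI
(the instance ID below is a placeholder):

.. code-block:: shell-session

    $ aws ec2 modify-instance-attribute --instance-id <instance-id> --no-source-dest-check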
   492  
By default, Cilium applies a special externalIP mitigation for the CVE-2020-8554 MITM vulnerability.
This may affect connectivity targeted at an externalIP of the same cluster.
The mitigation can be disabled by setting ``bpf.disableExternalIPMitigation`` to ``true``.
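
A minimal sketch of doing so on an existing installation via Helm:

.. parsed-literal::

    helm upgrade cilium |CHART_RELEASE| \\
        --namespace kube-system \\
        --reuse-values \\
        --set bpf.disableExternalIPMitigation=true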
   496  
   497  .. _DSR mode with Option:
   498  
   499  Direct Server Return (DSR) with IPv4 option / IPv6 extension Header
   500  *******************************************************************
   501  
   502  In this DSR dispatch mode, the service IP/port information is transported to the
   503  backend through a Cilium-specific IPv4 Option or IPv6 Destination Option extension header.
   504  It requires Cilium to be deployed in :ref:`arch_direct_routing`, i.e.
   505  it will not work in :ref:`arch_overlay` mode.
   506  
   507  This DSR mode might not work in some public cloud provider environments
   508  due to the Cilium-specific IP options that could be dropped by an underlying network fabric.
   509  In case of connectivity issues to services where backends are located on
   510  a remote node from the node that is processing the given NodePort request,
   511  first check whether the NodePort request actually arrived on the node
   512  containing the backend. If this was not the case, then consider either switching to
   513  DSR with Geneve (as described below), or switching back to the default SNAT mode.
   514  
   515  The above Helm example configuration in a kube-proxy-free environment with DSR-only mode
   516  enabled would look as follows:
   517  
   518  .. parsed-literal::
   519  
   520      helm install cilium |CHART_RELEASE| \\
   521          --namespace kube-system \\
   522          --set routingMode=native \\
   523          --set kubeProxyReplacement=true \\
   524          --set loadBalancer.mode=dsr \\
   525          --set loadBalancer.dsrDispatch=opt \\
   526          --set k8sServiceHost=${API_SERVER_IP} \\
   527          --set k8sServicePort=${API_SERVER_PORT}
   528  
   529  .. _DSR mode with Geneve:
   530  
   531  Direct Server Return (DSR) with Geneve
   532  **************************************
   533  By default, Cilium with DSR mode encodes the service IP/port in a Cilium-specific
   534  IPv4 option or IPv6 Destination Option extension so that the backends are aware of
   535  the service IP/port, which they need to reply with.
   536  
However, some data center routers pass packets with unknown IP options to software
processing called "Layer 2 slow path". Those routers drop the packets if the number
of packets with IP options exceeds a given threshold, which may significantly affect
network performance.
   541  
Cilium offers another dispatch mode, DSR with Geneve, to avoid this problem.
In DSR with Geneve, Cilium encapsulates packets arriving at the load balancer with a Geneve
header that carries the service IP/port in a Geneve option, and redirects them
to the backends.
   546  
   547  The Helm example configuration in a kube-proxy-free environment with DSR and
   548  Geneve dispatch enabled would look as follows:
   549  
   550  .. parsed-literal::
   551      helm install cilium |CHART_RELEASE| \\
   552          --namespace kube-system \\
   553          --set routingMode=native \\
   554          --set tunnelProtocol=geneve \\
   555          --set kubeProxyReplacement=true \\
   556          --set loadBalancer.mode=dsr \\
   557          --set loadBalancer.dsrDispatch=geneve \\
   558          --set k8sServiceHost=${API_SERVER_IP} \\
   559          --set k8sServicePort=${API_SERVER_PORT}
   560  
   561  DSR with Geneve is compatible with the Geneve encapsulation mode (:ref:`arch_overlay`).
   562  It works with either the direct routing mode or the Geneve tunneling mode. Unfortunately,
   563  it doesn't work with the vxlan encapsulation mode.
   564  
The example configuration for DSR with Geneve dispatch in tunneling mode is as follows:
   566  
   567  .. parsed-literal::
   568      helm install cilium |CHART_RELEASE| \\
   569          --namespace kube-system \\
   570          --set routingMode=tunnel \\
   571          --set tunnelProtocol=geneve \\
   572          --set kubeProxyReplacement=true \\
   573          --set loadBalancer.mode=dsr \\
   574          --set loadBalancer.dsrDispatch=geneve \\
   575          --set k8sServiceHost=${API_SERVER_IP} \\
   576          --set k8sServicePort=${API_SERVER_PORT}
   577  
   578  .. _Hybrid mode:
   579  
   580  Hybrid DSR and SNAT Mode
   581  ************************
   582  
Cilium also supports a hybrid DSR and SNAT mode, that is, DSR is performed for TCP
and SNAT for UDP connections.
This removes the need for manual MTU changes in the network while still benefiting from the
latency improvement of the removed extra hop for replies, in particular when TCP is the
main transport for workloads.
   588  
The ``loadBalancer.mode`` setting controls this behavior through the
options ``dsr``, ``snat`` and ``hybrid``. By default, the agent uses the ``snat`` mode.
   592  
   593  A Helm example configuration in a kube-proxy-free environment with DSR enabled in hybrid
   594  mode would look as follows:
   595  
   596  .. parsed-literal::
   597  
   598      helm install cilium |CHART_RELEASE| \\
   599          --namespace kube-system \\
   600          --set routingMode=native \\
   601          --set kubeProxyReplacement=true \\
   602          --set loadBalancer.mode=hybrid \\
   603          --set k8sServiceHost=${API_SERVER_IP} \\
   604          --set k8sServicePort=${API_SERVER_PORT}
   605  
   606  .. _socketlb-host-netns-only:
   607  
   608  Socket LoadBalancer Bypass in Pod Namespace
   609  *******************************************
   610  
The socket-level load balancer is transparent to Cilium's lower-layer datapath:
upon ``connect`` (TCP, connected UDP), ``sendmsg`` (UDP), or ``recvmsg``
(UDP) system calls, the destination IP is checked for an existing service IP and
one of the service backends is selected as a target. This means that although
the application assumes it is connected to the service address, the
corresponding kernel socket is actually connected to the backend address and
therefore no additional lower-layer NAT is required.
   618  
Cilium has built-in support for bypassing the socket-level load balancer and falling back
to the tc load balancer at the veth interface when a custom redirection/operation relies
on the original ClusterIP within the pod namespace (e.g., Istio sidecar), or when the
socket-level load balancer is ineffective due to the pod's nature (e.g., KubeVirt, Kata
Containers, gVisor).
   624  
Setting ``socketLB.hostNamespaceOnly=true`` enables this bypass mode. When enabled,
Cilium skips the socket rewrite in the ``connect()`` and ``sendmsg()`` syscall BPF hooks,
passes the original packet to the next stage of operation (e.g., the stack in
``per-endpoint-routing`` mode), and re-enables the service lookup in the tc BPF program.
   629  
   630  A Helm example configuration in a kube-proxy-free environment with socket LB bypass
   631  looks as follows:
   632  
   633  .. parsed-literal::
   634  
   635      helm install cilium |CHART_RELEASE| \\
   636          --namespace kube-system \\
   637          --set routingMode=native \\
   638          --set kubeProxyReplacement=true \\
   639          --set socketLB.hostNamespaceOnly=true
   640  
   641  .. _XDP acceleration:
   642  
   643  LoadBalancer & NodePort XDP Acceleration
   644  ****************************************
   645  
Cilium has built-in support for accelerating NodePort and LoadBalancer services as well as
services with externalIPs for the case where the arriving request needs to be
forwarded and the backend is located on a remote node. This feature was introduced
in Cilium version `1.8 <https://cilium.io/blog/2020/06/22/cilium-18/#kube-proxy-replacement-at-the-xdp-layer>`_ at
the XDP (eXpress Data Path) layer, where eBPF operates directly in the networking
driver instead of at a higher layer.
   652  
Setting ``loadBalancer.acceleration`` to ``native`` enables this acceleration.
The option ``disabled`` is the default and disables the acceleration. The majority
of drivers supporting 10G or higher rates also support ``native`` XDP on a recent
kernel. For cloud-based deployments, most of these drivers have SR-IOV variants that
support native XDP as well. For on-prem deployments, the Cilium XDP acceleration can
be used in combination with LoadBalancer service implementations for Kubernetes such
as `MetalLB <https://metallb.universe.tf/>`_. The acceleration can be enabled only
on a single device which is used for direct routing.
   661  
   662  For high-scale environments, also consider tweaking the default map sizes to a larger
   663  number of entries e.g. through setting a higher ``config.bpfMapDynamicSizeRatio``.
   664  See :ref:`bpf_map_limitations` for further details.
   665  
   666  The ``loadBalancer.acceleration`` setting is supported for DSR, SNAT and hybrid
   667  modes and can be enabled as follows for ``loadBalancer.mode=hybrid`` in this example:
   668  
   669  .. parsed-literal::
   670  
   671      helm install cilium |CHART_RELEASE| \\
   672          --namespace kube-system \\
   673          --set routingMode=native \\
   674          --set kubeProxyReplacement=true \\
   675          --set loadBalancer.acceleration=native \\
   676          --set loadBalancer.mode=hybrid \\
   677          --set k8sServiceHost=${API_SERVER_IP} \\
   678          --set k8sServicePort=${API_SERVER_PORT}
   679  
   680  
In a multi-device environment, where Cilium's device auto-detection selects
more than a single device to expose NodePort, or where a user specifies multiple devices
with ``devices``, the XDP acceleration is enabled on all devices. This means that
each underlying device's driver must have native XDP support on all Cilium-managed
nodes. If you have an environment where some devices support XDP but others do not,
you can enable XDP on the supported devices by setting
``loadBalancer.acceleration`` to ``best-effort``. In addition, for performance
reasons we recommend kernel >= 5.5 for the multi-device XDP acceleration.
   689  
   690  A list of drivers supporting XDP can be found in :ref:`the XDP documentation<xdp_drivers>`.
   691  
   692  The current Cilium kube-proxy XDP acceleration mode can also be introspected through
   693  the ``cilium-dbg status`` CLI command. If it has been enabled successfully, ``Native``
   694  is shown:
   695  
   696  .. code-block:: shell-session
   697  
   698      $ kubectl -n kube-system exec ds/cilium -- cilium-dbg status --verbose | grep XDP
   699        XDP Acceleration:    Native
   700  
   701  Note that packets which have been pushed back out of the device for NodePort handling
   702  right at the XDP layer are not visible in tcpdump since packet taps come at a much
   703  later stage in the networking stack. Cilium's monitor command or metric counters can be used
   704  instead for gaining visibility.
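
For example, one way to observe such traffic is Cilium's monitor, here filtering on trace
events (adjust the filter to your needs):

.. code-block:: shell-session

    $ kubectl -n kube-system exec ds/cilium -- cilium-dbg monitor --type trace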
   705  
   706  NodePort XDP on AWS
   707  ===================
   708  
   709  In order to run with NodePort XDP on AWS, follow the instructions in the :ref:`k8s_install_quick`
   710  guide to set up an EKS cluster or use any other method of your preference to set up a
   711  Kubernetes cluster.
   712  
If you are following the EKS guide, make sure to create a node group with SSH access, since
a few additional setup steps are needed, and to choose a larger instance type which supports
the `Elastic Network Adapter <https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/enhanced-networking-ena.html>`__ (ena).
As an example, the instance type ``m5n.xlarge`` is used in the config ``nodegroup-config.yaml``:
   717  
   718  .. code-block:: yaml
   719  
   720    apiVersion: eksctl.io/v1alpha5
   721    kind: ClusterConfig
   722  
   723    metadata:
   724      name: test-cluster
   725      region: us-west-2
   726  
   727    nodeGroups:
   728      - name: ng-1
   729        instanceType: m5n.xlarge
   730        desiredCapacity: 2
   731        ssh:
   732          allow: true
   733        ## taint nodes so that application pods are
   734        ## not scheduled/executed until Cilium is deployed.
   735        ## Alternatively, see the note below.
   736        taints:
   737          - key: "node.cilium.io/agent-not-ready"
   738            value: "true"
   739            effect: "NoExecute"
   740  
   741  .. note::
   742  
   743    Please make sure to read and understand the documentation page on :ref:`taint effects and unmanaged pods<taint_effects>`.
   744  
   745  The nodegroup is created with:
   746  
   747  .. code-block:: shell-session
   748  
   749    $ eksctl create nodegroup -f nodegroup-config.yaml
   750  
Each of the nodes needs the ``kernel-ng`` and ``ethtool`` packages installed. The former is
needed in order to run a sufficiently recent kernel for eBPF in general and for native XDP
support in the ena driver. The latter is needed to configure channel parameters for the NIC.
   754  
   755  .. code-block:: shell-session
   756  
   757    $ IPS=$(kubectl get no -o jsonpath='{$.items[*].status.addresses[?(@.type=="ExternalIP")].address }{"\\n"}' | tr ' ' '\\n')
   758  
   759    $ for ip in $IPS ; do ssh ec2-user@$ip "sudo amazon-linux-extras install -y kernel-ng && sudo yum install -y ethtool && sudo reboot"; done
   760  
Once the nodes come back up, their kernel version as reported by ``uname -r`` should be
``5.4.58-27.104.amzn2.x86_64`` or similar. In order to run XDP on ena, make sure the driver version is at
least `2.2.8 <https://github.com/amzn/amzn-drivers/commit/ccbb1fe2c2f2ab3fc6d7827b012ba8ec06f32c39>`__.
The driver version can be inspected through ``ethtool -i eth0``. For the given kernel version,
the driver version should be reported as ``2.2.10g``.
   766  
Before Cilium's XDP acceleration can be deployed, two settings are needed on the
network adapter side: the MTU needs to be lowered in order to be able to operate
with XDP, and the number of combined channels needs to be adapted.
   770  
The default MTU is set to 9001 on the ena driver. Given that XDP buffers are linear, they
operate on a single page. A driver typically reserves some headroom for XDP as well
(e.g. for encapsulation purposes); therefore, the highest possible MTU for XDP is
3498.
   775  
   776  In terms of ena channels, the settings can be gathered via ``ethtool -l eth0``. For the
   777  ``m5n.xlarge`` instance, the default output should look like::
   778  
   779    Channel parameters for eth0:
   780    Pre-set maximums:
   781    RX:             0
   782    TX:             0
   783    Other:          0
   784    Combined:       4
   785    Current hardware settings:
   786    RX:             0
   787    TX:             0
   788    Other:          0
   789    Combined:       4
   790  
In order to use XDP, the channels must be set to at most half of the value of
``Combined`` above. Both the MTU and channel changes are applied as follows:
   793  
   794  .. code-block:: shell-session
   795  
   796    $ for ip in $IPS ; do ssh ec2-user@$ip "sudo ip link set dev eth0 mtu 3498"; done
   797    $ for ip in $IPS ; do ssh ec2-user@$ip "sudo ethtool -L eth0 combined 2"; done
   798  
In order to deploy Cilium, the Kubernetes API server IP and port are needed:
   800  
   801  .. code-block:: shell-session
   802  
   803    $ export API_SERVER_IP=$(kubectl get ep kubernetes -o jsonpath='{$.subsets[0].addresses[0].ip}')
   804    $ export API_SERVER_PORT=443
   805  
Finally, the deployment can be upgraded and rolled out with the
``loadBalancer.acceleration=native`` setting to enable XDP in Cilium:
   808  
   809  .. parsed-literal::
   810  
   811    helm upgrade cilium |CHART_RELEASE| \\
   812          --namespace kube-system \\
   813          --reuse-values \\
   814          --set kubeProxyReplacement=true \\
   815          --set loadBalancer.acceleration=native \\
   816          --set loadBalancer.mode=snat \\
   817          --set k8sServiceHost=${API_SERVER_IP} \\
   818          --set k8sServicePort=${API_SERVER_PORT}
   819  
   820  
   821  NodePort XDP on Azure
   822  =====================
   823  
   824  To enable NodePort XDP on Azure AKS or a self-managed Kubernetes running on Azure, the virtual
   825  machines running Kubernetes must have `Accelerated Networking
   826  <https://azure.microsoft.com/en-us/updates/accelerated-networking-in-expanded-preview/>`_
   827  enabled. In addition, the Linux kernel on the nodes must also have support for
   828  native XDP in the ``hv_netvsc`` driver, which is available in kernel >= 5.6 and was backported to
   829  the Azure Linux kernel in 5.4.0-1022.
   830  
   831  On AKS, make sure to use the AKS Ubuntu 22.04 node image with Kubernetes version v1.26 which will
   832  provide a Linux kernel with the necessary backports to the ``hv_netvsc`` driver. Please refer to the
   833  documentation on `how to configure an AKS cluster
   834  <https://docs.microsoft.com/en-us/azure/aks/cluster-configuration>`_ for more details.
   835  
   836  To enable accelerated networking when creating a virtual machine or
   837  virtual machine scale set, pass the ``--accelerated-networking`` option to the
   838  Azure CLI. Please refer to the guide on how to `create a Linux virtual machine
   839  with Accelerated Networking using Azure CLI
   840  <https://docs.microsoft.com/en-us/azure/virtual-network/create-vm-accelerated-networking-cli>`_
   841  for more details.
   842  
   843  When *Accelerated Networking* is enabled, ``lspci`` will show a
   844  Mellanox ConnectX NIC:
   845  
   846  .. code-block:: shell-session
   847  
   848      $ lspci | grep Ethernet
   849      2846:00:02.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx Virtual Function] (rev 80)
   850  
XDP acceleration can only be enabled on ConnectX-4 Lx NICs and newer.
   852  
In order to run XDP, large receive offload (LRO) needs to be disabled on the
``hv_netvsc`` device. If this is not already the case, it can be achieved with:
   855  
   856  .. code-block:: shell-session
   857  
   858     $ ethtool -K eth0 lro off
   859  
   860  It is recommended to use Azure IPAM for the pod IP address allocation, which
   861  will automatically configure your virtual network to route pod traffic correctly:
   862  
   863  .. parsed-literal::
   864  
   865     helm install cilium |CHART_RELEASE| \\
   866       --namespace kube-system \\
   867       --set ipam.mode=azure \\
   868       --set azure.enabled=true \\
   869       --set azure.resourceGroup=$AZURE_NODE_RESOURCE_GROUP \\
   870       --set azure.subscriptionID=$AZURE_SUBSCRIPTION_ID \\
   871       --set azure.tenantID=$AZURE_TENANT_ID \\
   872       --set azure.clientID=$AZURE_CLIENT_ID \\
   873       --set azure.clientSecret=$AZURE_CLIENT_SECRET \\
   874       --set routingMode=native \\
   875       --set enableIPv4Masquerade=false \\
   876       --set devices=eth0 \\
   877       --set kubeProxyReplacement=true \\
   878       --set loadBalancer.acceleration=native \\
   879       --set loadBalancer.mode=snat \\
   880       --set k8sServiceHost=${API_SERVER_IP} \\
   881       --set k8sServicePort=${API_SERVER_PORT}
   882  
   883  
   884  When running Azure IPAM on a self-managed Kubernetes cluster, each ``v1.Node``
   885  must have the resource ID of its VM in the ``spec.providerID`` field.
   886  Refer to the :ref:`ipam_azure` reference for more information.
   887  
   888  NodePort XDP on GCP
   889  ===================
   890  
   891  NodePort XDP on the Google Cloud Platform is currently not supported. Both
   892  virtual network interfaces available on Google Compute Engine (the older
   893  virtIO-based interface and the newer `gVNIC
   894  <https://cloud.google.com/compute/docs/instances/create-vm-with-gvnic>`_) are
   895  currently lacking support for native XDP.
   896  
   897  .. _NodePort Devices:
   898  
   899  NodePort Devices, Port and Bind settings
   900  ****************************************
   901  
When running Cilium's eBPF kube-proxy replacement, a NodePort, LoadBalancer or
externalIPs service is by default accessible through the IP addresses of native
devices which have the default route on the host or have a Kubernetes InternalIP
or ExternalIP assigned. InternalIP is preferred over ExternalIP if both exist.
To change the devices, set their names in the
``devices`` Helm option, e.g. ``devices='{eth0,eth1,eth2}'``. Each
listed device has to be named the same on all Cilium-managed nodes. Alternatively,
if the devices do not match across different nodes, the wildcard option can be
used, e.g. ``devices=eth+``, which matches any device starting with the prefix
``eth``. If no device can be matched, the Cilium agent will try to perform
auto-detection.
   913  
When multiple devices are used, only one device can be used for direct routing
between Cilium nodes. By default, if a single device was detected or specified
via ``devices``, then Cilium will use that device for direct routing.
Otherwise, Cilium will use a device with a Kubernetes InternalIP or ExternalIP
set. InternalIP is preferred over ExternalIP if both exist. To change
the direct routing device, set the ``nodePort.directRoutingDevice`` Helm
option, e.g. ``nodePort.directRoutingDevice=eth1``. The wildcard option can be
used here as well, just like for the ``devices`` option, e.g. ``nodePort.directRoutingDevice=eth+``.
If more than one device matches the wildcard, Cilium sorts them
in increasing alphanumerical order and picks the first one. If the direct routing
device is not included in ``devices``, Cilium adds it to that list. The direct
routing device is also used for
:ref:`the NodePort XDP acceleration<XDP Acceleration>` (if enabled).
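
For example, a multi-device setup that pins direct routing to ``eth0`` might be configured
as follows on an existing installation; the device names are placeholders for your environment:

.. parsed-literal::

    helm upgrade cilium |CHART_RELEASE| \\
        --namespace kube-system \\
        --reuse-values \\
        --set devices='{eth0,eth1}' \\
        --set nodePort.directRoutingDevice=eth0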
   927  
In addition, thanks to the socket-LB feature, a NodePort service can by default
be accessed from a host or a pod within the cluster via its public address, any
local address (except for ``docker*``-prefixed device names), or the loopback address, e.g.
``127.0.0.1:NODE_PORT``.
   932  
   933  If ``kube-apiserver`` was configured to use a non-default NodePort port range,
   934  then the same range must be passed to Cilium via the ``nodePort.range``
   935  option, for example, as ``nodePort.range="10000\,32767"`` for a
   936  range of ``10000-32767``. The default Kubernetes NodePort range is ``30000-32767``.
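
As a sketch, such a range could be passed via Helm like this (note the escaped comma
required by ``--set``):

.. parsed-literal::

    helm upgrade cilium |CHART_RELEASE| \\
        --namespace kube-system \\
        --reuse-values \\
        --set nodePort.range="10000\\,32767"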
   937  
If the NodePort port range overlaps with the ephemeral port range
(``net.ipv4.ip_local_port_range``), Cilium will append the NodePort range to
the reserved ports (``net.ipv4.ip_local_reserved_ports``). This is needed to
prevent a NodePort service from hijacking traffic of a host-local application
whose source port matches the service port. To disable the modification of
the reserved ports, set ``nodePort.autoProtectPortRanges`` to ``false``.
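
The resulting kernel configuration can be inspected on a node; whether the NodePort range
actually shows up in the reserved ports depends on whether the two ranges overlap:

.. code-block:: shell-session

    $ sysctl net.ipv4.ip_local_port_range net.ipv4.ip_local_reserved_ports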
   944  
By default, the NodePort implementation prevents application ``bind(2)`` requests
to NodePort service ports. In such a case, the application will typically see a
``bind: Operation not permitted`` error. On older kernels this happens globally;
starting with v5.7 kernels it applies only to the host namespace by default
and therefore no longer affects ``bind(2)`` requests from application pods. In
order to opt out of this behavior entirely, expert users can switch
``nodePort.bindProtection`` to ``false``.
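
As an illustration, binding to the NodePort allocated in the earlier example from the host
namespace should fail with an error along these lines (the exact message depends on the tool):

.. code-block:: shell-session

    $ python3 -c 'import socket; socket.socket().bind(("", 31940))'
    [...]
    PermissionError: [Errno 1] Operation not permitted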
   952  
   953  .. _Configuring Maps:
   954  
   955  Configuring BPF Map Sizes
   956  *************************
   957  
   958  For high-scale environments, Cilium's BPF maps can be configured to have higher
   959  limits on the number of entries. Overriding Helm options can be used to tweak
   960  these limits.
   961  
To increase the number of entries in Cilium's BPF LB service, backend and
affinity maps, consider overriding the ``bpf.lbMapMax`` Helm option.
The default value of this LB map size is 65536.
   965  
   966  .. parsed-literal::
   967  
   968      helm install cilium |CHART_RELEASE| \\
   969          --namespace kube-system \\
   970          --set kubeProxyReplacement=true \\
   971          --set bpf.lbMapMax=131072
   972  
   973  .. _kubeproxyfree_hostport:
   974  
   975  Container HostPort Support
   976  **************************
   977  
   978  Although not part of kube-proxy, Cilium's eBPF kube-proxy replacement also
   979  natively supports ``hostPort`` service mapping without having to use the
   980  Helm CNI chaining option of ``cni.chainingMode=portmap``.
   981  
   982  By specifying ``kubeProxyReplacement=true`` the native hostPort support is
   983  automatically enabled and therefore no further action is required. Otherwise
   984  ``hostPort.enabled=true`` can be used to enable the setting.
   985  
   986  If the ``hostPort`` is specified without an additional ``hostIP``, then the
   987  Pod will be exposed to the outside world with the same local addresses from
   988  the node that were detected and used for exposing NodePort services, e.g.
   989  the Kubernetes InternalIP or ExternalIP if set. Additionally, the Pod is also
   990  accessible through the loopback address on the node such as ``127.0.0.1:hostPort``.
   991  If in addition to ``hostPort`` also a ``hostIP`` has been specified for the
   992  Pod, then the Pod will only be exposed on the given ``hostIP`` instead. A
   993  ``hostIP`` of ``0.0.0.0`` will have the same behavior as if a ``hostIP`` was
   994  not specified. The ``hostPort`` must not reside in the configured NodePort
   995  port range to avoid collisions.
   996  
   997  An example deployment in a kube-proxy-free environment therefore is the same
   998  as in the earlier getting started deployment:
   999  
  1000  .. parsed-literal::
  1001  
  1002      helm install cilium |CHART_RELEASE| \\
  1003          --namespace kube-system \\
  1004          --set kubeProxyReplacement=true \\
  1005          --set k8sServiceHost=${API_SERVER_IP} \\
  1006          --set k8sServicePort=${API_SERVER_PORT}
  1007  
  1008  
  1009  Also, ensure that each node IP is known via ``INTERNAL-IP`` or ``EXTERNAL-IP``,
  1010  for example:
  1011  
  1012  .. code-block:: shell-session
  1013  
  1014      $ kubectl get nodes -o wide
  1015      NAME   STATUS   ROLES    AGE     VERSION   INTERNAL-IP      EXTERNAL-IP   [...]
  1016      apoc   Ready    master   6h15m   v1.17.3   192.168.178.29   <none>        [...]
  1017      tank   Ready    <none>   6h13m   v1.17.3   192.168.178.28   <none>        [...]
  1018  
If this is not the case, then ``kubelet`` needs to be made aware of it by
specifying ``--node-ip`` via ``KUBELET_EXTRA_ARGS``. Assuming ``eth0`` is
the public-facing interface, this can be achieved by:
  1022  
  1023  .. code-block:: shell-session
  1024  
  1025      $ echo KUBELET_EXTRA_ARGS=\"--node-ip=$(ip -4 -o a show eth0 | awk '{print $4}' | cut -d/ -f1)\" | tee -a /etc/default/kubelet
  1026  
  1027  After updating ``/etc/default/kubelet``, kubelet needs to be restarted.
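
On systemd-based distributions, this is typically done with:

.. code-block:: shell-session

    $ systemctl restart kubelet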
  1028  
  1029  In order to verify whether the HostPort feature has been enabled in Cilium, the
  1030  ``cilium-dbg status`` CLI command provides visibility through the ``KubeProxyReplacement``
  1031  info line. If it has been enabled successfully, ``HostPort`` is shown as ``Enabled``,
  1032  for example:
  1033  
  1034  .. code-block:: shell-session
  1035  
  1036      $ kubectl -n kube-system exec ds/cilium -- cilium-dbg status --verbose | grep HostPort
  1037        - HostPort:       Enabled
  1038  
  1039  The following modified example yaml from the setup validation with an additional
  1040  ``hostPort: 8080`` parameter can be used to verify the mapping:
  1041  
  1042  .. code-block:: yaml
  1043  
  1044      apiVersion: apps/v1
  1045      kind: Deployment
  1046      metadata:
  1047        name: my-nginx
  1048      spec:
  1049        selector:
  1050          matchLabels:
  1051            run: my-nginx
  1052        replicas: 1
  1053        template:
  1054          metadata:
  1055            labels:
  1056              run: my-nginx
  1057          spec:
  1058            containers:
  1059            - name: my-nginx
  1060              image: nginx
  1061              ports:
  1062              - containerPort: 80
  1063                hostPort: 8080
  1064  
  1065  After deployment, we can validate that Cilium's eBPF kube-proxy replacement
  1066  exposed the container as HostPort under the specified port ``8080``:
  1067  
  1068  .. code-block:: shell-session
  1069  
  1070      $ kubectl exec -it -n kube-system cilium-fmh8d -- cilium-dbg service list
  1071      ID   Frontend               Service Type   Backend
  1072      [...]
  1073      5    192.168.178.29:8080    HostPort       1 => 10.29.207.199:80
  1074  
  1075  Similarly, we can inspect through ``iptables`` in the host namespace that
  1076  no ``iptables`` rule for the HostPort service is present:
  1077  
  1078  .. code-block:: shell-session
  1079  
  1080      $ iptables-save | grep HOSTPORT
  1081      [ empty line ]
  1082  
  1083  Last but not least, a simple ``curl`` test shows connectivity for the
  1084  exposed HostPort container under the node's IP:
  1085  
  1086  .. code-block:: shell-session
  1087  
  1088      $ curl 192.168.178.29:8080
  1089      <!DOCTYPE html>
  1090      <html>
  1091      <head>
  1092      <title>Welcome to nginx!</title>
  1093      [....]
  1094  
  1095  Removing the deployment also removes the corresponding HostPort from
  1096  the ``cilium-dbg service list`` dump:
  1097  
  1098  .. code-block:: shell-session
  1099  
  1100      $ kubectl delete deployment my-nginx
  1101  
  1102  kube-proxy Hybrid Modes
  1103  ***********************
  1104  
Cilium's eBPF kube-proxy replacement can be configured in several modes, i.e. it can
replace kube-proxy entirely or it can co-exist with kube-proxy on the system if the
underlying Linux kernel does not meet the requirements for a full kube-proxy replacement.
  1108  
  1109  .. warning::
   When deploying the eBPF kube-proxy replacement under co-existence with
   kube-proxy on the system, be aware that both mechanisms operate independently of each
   other. This means that if the eBPF kube-proxy replacement is added to or removed from an
   already *running* cluster in order to delegate operation to or back from kube-proxy,
   existing connections must be expected to break since, for example, the two NAT tables
   are not aware of each other. If deployed in co-existence on a newly
   spawned up node/cluster which does not yet serve user traffic, then this is not an
   issue.
  1118  
  1119  This section elaborates on the ``kubeProxyReplacement`` options:
  1120  
  1121  - ``kubeProxyReplacement=true``: When using this option, it's highly recommended
  1122    to run a kube-proxy-free Kubernetes setup where Cilium is expected to fully replace
  1123    all kube-proxy functionality. However, if it's not possible to remove kube-proxy for
  1124    specific reasons (e.g. Kubernetes distribution limitations), it's also acceptable to
  1125    leave it deployed in the background. Just be aware of the potential side effects on
  1126    existing nodes as mentioned above when running kube-proxy in co-existence. Once the
  1127    Cilium agent is up and running, it takes care of handling Kubernetes services of type
  1128    ClusterIP, NodePort, LoadBalancer, services with externalIPs as well as HostPort.
  1129    If the underlying kernel version requirements are not met
  1130    (see :ref:`kubeproxy-free` note), then the Cilium agent will bail out on start-up
  1131    with an error message.
  1132  
- ``kubeProxyReplacement=false``: This option disables any Kubernetes service
  handling by fully relying on kube-proxy instead, except for ClusterIP services
  accessed from pods (pre-v1.6 behavior). It can also be used for a hybrid setup,
  that is, kube-proxy runs in the Kubernetes cluster while Cilium
  partially replaces and optimizes kube-proxy functionality. The ``false``
  option requires the user to manually specify which components of the eBPF
  kube-proxy replacement should be used.
  1140    Similarly to ``true`` mode, the Cilium agent will bail out on start-up with
  1141    an error message if the underlying kernel requirements are not met when components
  1142    are manually enabled. For
  1143    fine-grained configuration, ``socketLB.enabled``, ``nodePort.enabled``,
  1144    ``externalIPs.enabled`` and ``hostPort.enabled`` can be set to ``true``. By
  1145    default all four options are set to ``false``.
  If you set ``nodePort.enabled`` to ``true``, make sure to also
  set ``nodePort.enableHealthCheck`` to ``false``, so that the Cilium agent does not
  start the NodePort health check server (``kube-proxy`` will also attempt to start
  this server, and there would otherwise be a clash when Cilium attempts to bind its server to the
  same port). A few example configurations
  1151    for the ``false`` option are provided below.
  1152  
  1153  .. note::
  1154  
    Switching from the ``true`` to the ``false`` mode, or vice versa, can break
    existing connections to services in a cluster. The same goes for enabling or
    disabling ``socketLB``. It is recommended to drain all the workloads before
    performing such configuration changes.
  1159  
  The following Helm setup is equivalent to ``kubeProxyReplacement=true``
  in a kube-proxy-free environment:
  1162  
  1163    .. parsed-literal::
  1164  
  1165      helm install cilium |CHART_RELEASE| \\
  1166          --namespace kube-system \\
  1167          --set kubeProxyReplacement=false \\
  1168          --set socketLB.enabled=true \\
  1169          --set nodePort.enabled=true \\
  1170          --set externalIPs.enabled=true \\
  1171          --set hostPort.enabled=true \\
  1172          --set k8sServiceHost=${API_SERVER_IP} \\
  1173          --set k8sServicePort=${API_SERVER_PORT}
  1174  
  1175  
  The following Helm setup is equivalent to the default Cilium service
  handling in v1.6 or earlier in a kube-proxy environment, that is, serving ClusterIP
  for pods:
  1179  
  1180    .. parsed-literal::
  1181  
  1182      helm install cilium |CHART_RELEASE| \\
  1183          --namespace kube-system \\
  1184          --set kubeProxyReplacement=false
  1185  
  The following Helm setup optimizes Cilium's handling of NodePort, LoadBalancer and services
  with externalIPs for external traffic ingressing into the Cilium-managed node in
  a kube-proxy environment:
  1189  
  1190    .. parsed-literal::
  1191  
  1192      helm install cilium |CHART_RELEASE| \\
  1193          --namespace kube-system \\
  1194          --set kubeProxyReplacement=false \\
  1195          --set nodePort.enabled=true \\
  1196          --set externalIPs.enabled=true
  1197  
  1198  In Cilium's Helm chart, the default mode is ``kubeProxyReplacement=false`` for
  1199  new deployments.
  1200  
  1201  The current Cilium kube-proxy replacement mode can also be introspected through the
  1202  ``cilium-dbg status`` CLI command:
  1203  
  1204  .. code-block:: shell-session
  1205  
  1206      $ kubectl -n kube-system exec ds/cilium -- cilium-dbg status | grep KubeProxyReplacement
  1207      KubeProxyReplacement:   True	[eth0 (DR)]
  1208  
  1209  Graceful Termination
  1210  ********************
  1211  
  1212  Cilium's eBPF kube-proxy replacement supports graceful termination of service
  1213  endpoint pods. The feature requires at least Kubernetes version 1.20, and
  1214  the feature gate ``EndpointSliceTerminatingCondition`` needs to be enabled.
  1215  By default, the Cilium agent then detects such terminating Pod events, and
  1216  increments the metric ``k8s_terminating_endpoints_events_total``. If needed,
  1217  the feature can be disabled with the configuration option ``enable-k8s-terminating-endpoint``.
  1218  
The Cilium agent feature flag can be probed by running the ``cilium-dbg status`` command:
  1220  
  1221  .. code-block:: shell-session
  1222  
  1223      $ kubectl -n kube-system exec ds/cilium -- cilium-dbg status --verbose
  1224      [...]
  1225      KubeProxyReplacement Details:
  1226       [...]
  1227       Graceful Termination:  Enabled
  1228      [...]
  1229  
When the Cilium agent receives a Kubernetes update event for a terminating endpoint,
the datapath state for the endpoint is removed so that it won't serve new
connections, but the endpoint's active connections are able to terminate
gracefully. The endpoint state is fully removed when the agent receives
a Kubernetes delete event for the endpoint. The `Kubernetes
pod termination <https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination>`_
documentation contains more background on the behavior and its configuration using ``terminationGracePeriodSeconds``.
Some special cases, such as achieving zero disruption during rolling updates, require the ability to send traffic
to terminating Pods that are still serving traffic during their termination period; the Kubernetes blog post
`Advancements in Kubernetes Traffic Engineering
<https://kubernetes.io/blog/2022/12/30/advancements-in-kubernetes-traffic-engineering/#traffic-loss-from-load-balancers-during-rolling-updates>`_
explains this in detail.
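
As an illustration, the grace period is configured on the workload itself rather than in
Cilium. A minimal Pod manifest along these lines (the name and the ``60`` second value are
arbitrary examples) gives active connections time to drain while Cilium stops steering new
connections to the terminating endpoint:

.. code-block:: yaml

    apiVersion: v1
    kind: Pod
    metadata:
      name: graceful-demo
    spec:
      # Example value: the kubelet waits up to 60 seconds after sending SIGTERM
      # before forcefully killing the container.
      terminationGracePeriodSeconds: 60
      containers:
      - name: web
        image: nginx:stable
        ports:
        - containerPort: 80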
  1242  
  1243  .. admonition:: Video
  1244    :class: attention
  1245  
  1246    To learn more about Cilium's graceful termination support, check out `eCHO Episode 49: Graceful Termination Support with Cilium 1.11 <https://www.youtube.com/watch?v=9GBxJMp6UkI&t=980s>`__.
  1247  
  1248  .. _session-affinity:
  1249  
  1250  Session Affinity
  1251  ****************
  1252  
  1253  Cilium's eBPF kube-proxy replacement supports Kubernetes service session affinity.
  1254  Each connection from the same pod or host to a service configured with
  1255  ``sessionAffinity: ClientIP`` will always select the same service endpoint.
  1256  The default timeout for the affinity is three hours (updated by each request to
  1257  the service), but it can be configured through Kubernetes' ``sessionAffinityConfig``
  1258  if needed.
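
For example, a Service manifest along the following lines (the name and the 30-minute
timeout are illustrative) enables client IP based affinity with a custom timeout:

.. code-block:: yaml

    apiVersion: v1
    kind: Service
    metadata:
      name: nginx-affinity
    spec:
      selector:
        app: proxy
      ports:
      - port: 80
      sessionAffinity: ClientIP
      sessionAffinityConfig:
        clientIP:
          # Overrides the default three-hour (10800s) affinity timeout.
          timeoutSeconds: 1800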
  1259  
  1260  The source for the affinity depends on the origin of a request. If a request is
  1261  sent from outside the cluster to the service, the request's source IP address is
  1262  used for determining the endpoint affinity. If a request is sent from inside
  1263  the cluster, then the source depends on whether the socket-LB feature
  1264  is used to load balance ClusterIP services. If yes, then the client's network
  1265  namespace cookie is used as the source. The latter was introduced in the 5.7
  1266  Linux kernel to implement the affinity at the socket layer at which
  1267  the socket-LB operates (a source IP is not available there, as the
  1268  endpoint selection happens before a network packet has been built by the
kernel). If the socket-LB is not used (i.e. the load balancing is done
  1270  at the pod network interface, on a per-packet basis), then the request's source
  1271  IP address is used as the source.
  1272  
The session affinity support is enabled by default for Cilium's kube-proxy
replacement. For users who run on older kernels which do not support network
namespace cookies, a fallback in-cluster mode is implemented, which is based on
a fixed cookie value as a trade-off. This makes all applications on the host
select the same service endpoint for a given service with session affinity configured.
To disable the feature, set ``config.sessionAffinity=false``.
  1279  
When the fixed cookie value is not used, the session affinity of a service with
multiple ports is per service IP and port: all requests for a
given service sent from the same source and to the same service port will be routed
to the same service endpoint, but two requests for the same service, sent from
the same source but to different service ports, may be routed to distinct service
endpoints.
  1286  
For users who run with kube-proxy (i.e. with Cilium's kube-proxy replacement
disabled), ClusterIP service load balancing for requests sent from a pod
running in a non-host network namespace is still performed at the pod network
interface (until `GH#16197 <https://github.com/cilium/cilium/issues/16197>`__ is
fixed). In this case, the session affinity support is disabled by default. To
enable the feature, set ``config.sessionAffinity=true``.
  1293  
kube-proxy Replacement Health Check Server
  1295  ******************************************

To enable the health check server for the kube-proxy replacement, the
``kubeProxyReplacementHealthzBindAddr`` option has to be set (it is disabled by
default). The option accepts an IP address and port for the health check server
to serve on.
For example, to enable it for IPv4 interfaces, set ``kubeProxyReplacementHealthzBindAddr='0.0.0.0:10256'``;
for IPv6 interfaces, set ``kubeProxyReplacementHealthzBindAddr='[::]:10256'``. The health check server is
accessible via the HTTP ``/healthz`` endpoint.
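
As a sketch, assuming the option maps to a Helm value of the same name, the health check
server can be enabled at install time and probed from a node afterwards (``NODE_IP`` is a
placeholder for one of your node IPs):

.. parsed-literal::

    helm install cilium |CHART_RELEASE| \\
        --namespace kube-system \\
        --set kubeProxyReplacement=true \\
        --set kubeProxyReplacementHealthzBindAddr='0.0.0.0:10256' \\
        --set k8sServiceHost=${API_SERVER_IP} \\
        --set k8sServicePort=${API_SERVER_PORT}

.. code-block:: shell-session

    $ curl http://${NODE_IP}:10256/healthz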
  1303  
  1304  LoadBalancer Source Ranges Checks
  1305  *********************************
  1306  
When a ``LoadBalancer`` service is configured with ``spec.loadBalancerSourceRanges``,
Cilium's eBPF kube-proxy replacement restricts access to the service from outside the
cluster (e.g. external world traffic) to the allowed CIDRs specified in the field. If
the field is empty, no access restrictions are applied.

When the service is accessed from inside the cluster, the kube-proxy replacement
ignores the field, regardless of whether it is set. This means that any pod or any host
process in the cluster is able to access the ``LoadBalancer`` service internally.
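
For example, the following ``LoadBalancer`` Service (the CIDRs are placeholders) only admits
external traffic from the listed source ranges, while in-cluster clients remain unaffected
as described above:

.. code-block:: yaml

    apiVersion: v1
    kind: Service
    metadata:
      name: nginx-lb
    spec:
      type: LoadBalancer
      selector:
        app: proxy
      ports:
      - port: 80
      # Only these external client CIDRs may reach the service from outside
      # the cluster; in-cluster access is not restricted.
      loadBalancerSourceRanges:
      - "192.0.2.0/24"
      - "198.51.100.0/24"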
  1315  
  1316  The load balancer source range check feature is enabled by default, and it can be
  1317  disabled by setting ``config.svcSourceRangeCheck=false``. It makes sense to disable
  1318  the check when running on some cloud providers. E.g. `Amazon NLB
  1319  <https://kubernetes.io/docs/concepts/services-networking/service/#aws-nlb-support>`__
  1320  natively implements the check, so the kube-proxy replacement's feature can be disabled.
  1321  Meanwhile `GKE internal TCP/UDP load balancer
  1322  <https://cloud.google.com/kubernetes-engine/docs/how-to/service-parameters#lb_source_ranges>`__
  1323  does not, so the feature must be kept enabled in order to restrict the access.
  1324  
  1325  Service Proxy Name Configuration
  1326  ********************************
  1327  
Like kube-proxy, Cilium also honors the ``service.kubernetes.io/service-proxy-name`` label
and only manages services that carry a matching service-proxy-name value. The name can be
configured by setting the ``k8s.serviceProxyName`` option, and the behavior is identical to that of
kube-proxy. The service proxy name defaults to an empty string, which instructs Cilium to
only manage services that do not have the ``service.kubernetes.io/service-proxy-name`` label.

For more details on the ``service.kubernetes.io/service-proxy-name`` label and how it
works, take a look at `this KEP
<https://github.com/kubernetes/enhancements/blob/3ad891202dab1fd5211946f10f31b48003bf8113/keps/sig-network/2447-Make-kube-proxy-service-abstraction-optional/README.md>`__.
  1337  
  1338  .. note::
  1339  
  1340      If Cilium with a non-empty service proxy name is meant to manage all services in kube-proxy
  1341      free mode, make sure that default Kubernetes services like ``kube-dns`` and ``kubernetes``
  1342      have the required label value.
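
As a sketch, a Cilium installation configured with a non-empty proxy name (``cilium-proxy``
is an arbitrary example) only manages services labeled with that value:

.. parsed-literal::

    helm install cilium |CHART_RELEASE| \\
        --namespace kube-system \\
        --set k8s.serviceProxyName=cilium-proxy

.. code-block:: yaml

    apiVersion: v1
    kind: Service
    metadata:
      name: nginx-service
      labels:
        # Must match the configured service proxy name, otherwise this
        # Cilium installation ignores the service.
        service.kubernetes.io/service-proxy-name: cilium-proxy
    spec:
      selector:
        app: proxy
      ports:
      - port: 80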
  1343  
  1344  Traffic Distribution and Topology Aware Hints
  1345  *********************************************
  1346  
  1347  The kube-proxy replacement implements both Kubernetes `Topology Aware Routing
  1348  <https://kubernetes.io/docs/concepts/services-networking/topology-aware-routing>`__,
  1349  and the more recent `Traffic Distribution
  1350  <https://kubernetes.io/docs/concepts/services-networking/service/#traffic-distribution>`__
  1351  features.
  1352  
  1353  Both of these features work by setting ``hints`` on EndpointSlices that enable
  1354  Cilium to route to endpoints residing in the same zone. To enable the feature,
  1355  set ``loadBalancer.serviceTopology=true``.
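
As an illustrative sketch (assuming a Kubernetes version that supports the
``trafficDistribution`` field), a Service can opt in as follows once
``loadBalancer.serviceTopology=true`` is set on the Cilium side:

.. code-block:: yaml

    apiVersion: v1
    kind: Service
    metadata:
      name: nginx-topology
    spec:
      selector:
        app: proxy
      ports:
      - port: 80
      # Prefer endpoints in the same zone as the client. On older clusters,
      # the service.kubernetes.io/topology-mode: Auto annotation enables
      # Topology Aware Routing instead.
      trafficDistribution: PreferClose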
  1356  
  1357  Neighbor Discovery
  1358  ******************
  1359  
  1360  When kube-proxy replacement is enabled, Cilium does L2 neighbor discovery of nodes
  1361  in the cluster. This is required for the service load-balancing to populate L2
  1362  addresses for backends since it is not possible to dynamically resolve neighbors
  1363  on demand in the fast-path.
  1364  
  1365  In Cilium 1.10 or earlier, the agent itself contained an ARP resolution library
  1366  where it triggered discovery and periodic refresh of new nodes joining the cluster.
  1367  The resolved neighbor entries were pushed into the kernel and refreshed as PERMANENT
  1368  entries. In some rare cases, Cilium 1.10 or earlier might have left stale entries behind
  1369  in the neighbor table causing packets between some nodes to be dropped. To skip the
  1370  neighbor discovery and instead rely on the Linux kernel to discover neighbors, you can
  1371  pass the ``--enable-l2-neigh-discovery=false`` flag to the cilium-agent. However,
  1372  note that relying on the Linux Kernel might also cause some packets to be dropped.
  1373  For example, a NodePort request can be dropped on an intermediate node (i.e., the
  1374  one which received a service packet and is going to forward it to a destination node
which runs the selected service endpoint). This could happen if there is no L2 neighbor
entry in the kernel (because the entry was garbage collected or because the kernel has
not yet resolved the neighbor). This is because it is not possible to drive
the neighbor resolution from BPF programs in the fast-path, e.g. at the XDP layer.
  1379  
  1380  From Cilium 1.11 onwards, the neighbor discovery has been fully reworked and the Cilium
  1381  internal ARP resolution library has been removed from the agent. The agent now fully
  1382  relies on the Linux kernel to discover gateways or hosts on the same L2 network. Both
  1383  IPv4 and IPv6 neighbor discovery is supported in the Cilium agent. As per our recent
  1384  kernel work `presented at Plumbers <https://linuxplumbersconf.org/event/11/contributions/953/>`__,
  1385  "managed" neighbor entries have been `upstreamed <https://lore.kernel.org/netdev/20211011121238.25542-1-daniel@iogearbox.net/>`__
  1386  and will be available in Linux kernel v5.16 or later which the Cilium agent will detect
  1387  and transparently use. In this case, the agent pushes down L3 addresses of new nodes
  1388  joining the cluster as externally learned "managed" neighbor entries. For introspection,
  1389  iproute2 displays them as "managed extern_learn". The "extern_learn" attribute prevents
  1390  garbage collection of the entries by the kernel's neighboring subsystem. Such "managed"
  1391  neighbor entries are dynamically resolved and periodically refreshed by the Linux kernel
  1392  itself in case there is no active traffic for a certain period of time. That is, the
  1393  kernel attempts to always keep them in REACHABLE state. For Linux kernels v5.15 or
  1394  earlier where "managed" neighbor entries are not present, the Cilium agent similarly
  1395  pushes L3 addresses of new nodes into the kernel for dynamic resolution, but with an
  1396  agent triggered periodic refresh. For introspection, iproute2 displays them only as
  1397  "extern_learn" in this case. If there is no active traffic for a certain period of
  1398  time, then a Cilium agent controller triggers the Linux kernel-based re-resolution for
  1399  attempting to keep them in REACHABLE state. The refresh interval can be changed if needed
  1400  through a ``--arping-refresh-period=30s`` flag passed to the cilium-agent. The default
  1401  period is ``30s`` which corresponds to the kernel's base reachable time.
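
On kernels v5.16 or later, the pushed entries can be inspected on a node as shown below;
the address and MAC are placeholders, and the exact flag ordering depends on your iproute2
version:

.. code-block:: shell-session

    $ ip neigh show
    10.0.0.2 dev eth0 lladdr 96:eb:75:fd:89:fd managed extern_learn REACHABLE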
  1402  
  1403  The neighbor discovery supports multi-device environments where each node has multiple devices
  1404  and multiple next-hops to another node. The Cilium agent pushes neighbor entries for all target
  1405  devices, including the direct routing device. Currently, it supports one next-hop per device.
  1406  The following example illustrates how the neighbor discovery works in a multi-device environment.
  1407  Each node has two devices connected to different L3 networks (10.69.0.64/26 and 10.69.0.128/26),
and each has a global scope address (10.69.0.1/26 and 10.69.0.2/26). A next-hop from node1 to node2 is
  1409  either ``10.69.0.66 dev eno1`` or ``10.69.0.130 dev eno2``. The Cilium agent pushes neighbor
  1410  entries for both ``10.69.0.66 dev eno1`` and ``10.69.0.130 dev eno2`` in this case.
  1411  
  1412  ::
  1413  
  1414      +---------------+     +---------------+
  1415      |    node1      |     |    node2      |
  1416      | 10.69.0.1/26  |     | 10.69.0.2/26  |
  1417      |           eno1+-----+eno1           |
  1418      |           |   |     |   |           |
  1419      | 10.69.0.65/26 |     |10.69.0.66/26  |
  1420      |               |     |               |
  1421      |           eno2+-----+eno2           |
  1422      |           |   |     | |             |
  1423      | 10.69.0.129/26|     | 10.69.0.130/26|
  1424      +---------------+     +---------------+
  1425  
On node1, this results in:
  1427  
  1428  .. code-block:: shell-session
  1429  
  1430      $ ip route show
  1431      10.69.0.2
  1432              nexthop via 10.69.0.66 dev eno1 weight 1
  1433              nexthop via 10.69.0.130 dev eno2 weight 1
  1434  
  1435      $ ip neigh show
  1436      10.69.0.66 dev eno1 lladdr 96:eb:75:fd:89:fd extern_learn  REACHABLE
  1437      10.69.0.130 dev eno2 lladdr 52:54:00:a6:62:56 extern_learn  REACHABLE
  1438  
  1439  .. _external_access_to_clusterip_services:
  1440  
  1441  External Access To ClusterIP Services
  1442  *************************************
  1443  
  1444  As per `k8s Service <https://kubernetes.io/docs/concepts/services-networking/service/#publishing-services-service-types>`__,
  1445  Cilium's eBPF kube-proxy replacement by default disallows access to a ClusterIP service from outside the cluster.
  1446  This can be allowed by setting ``bpf.lbExternalClusterIP=true``.
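
A minimal sketch of enabling this at install time via Helm, following the conventions of the
earlier examples:

.. parsed-literal::

    helm install cilium |CHART_RELEASE| \\
        --namespace kube-system \\
        --set kubeProxyReplacement=true \\
        --set bpf.lbExternalClusterIP=true \\
        --set k8sServiceHost=${API_SERVER_IP} \\
        --set k8sServicePort=${API_SERVER_PORT}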
  1447  
  1448  Observability
  1449  *************
  1450  
You can trace socket LB related datapath events using Hubble and ``cilium-dbg monitor``.
  1452  
  1453  Apply the following pod and service:
  1454  
  1455  .. code-block:: yaml
  1456  
  1457      apiVersion: v1
  1458      kind: Pod
  1459      metadata:
  1460        name: nginx
  1461        labels:
  1462          app: proxy
  1463      spec:
  1464        containers:
  1465        - name: nginx
  1466          image: nginx:stable
  1467          ports:
  1468            - containerPort: 80
  1469      ---
  1470      apiVersion: v1
  1471      kind: Service
  1472      metadata:
  1473        name: nginx-service
  1474      spec:
  1475        selector:
  1476          app: proxy
  1477        ports:
  1478        - port: 80
  1479  
Deploy a client pod to generate traffic:
  1481  
  1482  .. parsed-literal::
  1483  
  1484      $ kubectl create -f \ |SCM_WEB|\/examples/kubernetes-dns/dns-sw-app.yaml
  1485  
  1486  .. code-block:: shell-session
  1487  
  1488      $ kubectl get svc | grep nginx
  1489        nginx-service   ClusterIP   10.96.128.44   <none>        80/TCP    140m
  1490  
  1491      $ kubectl exec -it mediabot -- curl -v --connect-timeout 5 10.96.128.44
  1492  
Follow the Hubble :ref:`hubble_cli` guide to see the network flows. The Hubble
output prints datapath events before and after socket LB translation between the service
and the selected service endpoint.
  1496  
  1497  .. code-block:: shell-session
  1498  
  1499      $ hubble observe --all | grep mediabot
  1500      Jan 13 13:47:20.932: default/mediabot (ID:5618) <> default/nginx-service:80 (world) pre-xlate-fwd TRACED (TCP)
  1501      Jan 13 13:47:20.932: default/mediabot (ID:5618) <> default/nginx:80 (ID:35772) post-xlate-fwd TRANSLATED (TCP)
  1502      Jan 13 13:47:20.932: default/nginx:80 (ID:35772) <> default/mediabot (ID:5618) pre-xlate-rev TRACED (TCP)
  1503      Jan 13 13:47:20.932: default/nginx-service:80 (world) <> default/mediabot (ID:5618) post-xlate-rev TRANSLATED (TCP)
  1504      Jan 13 13:47:20.932: default/mediabot:38750 (ID:5618) <> default/nginx (ID:35772) pre-xlate-rev TRACED (TCP)
  1505  
Socket LB tracing with Hubble requires the Cilium agent to detect pod cgroup paths.
If you see the warning message ``No valid cgroup base path found: socket load-balancing tracing with Hubble will not work.`` in the Cilium agent logs,
you can trace packets using ``cilium-dbg monitor`` instead.
  1509  
  1510  .. note::
  1511  
  1512      In case of the warning log, please file a GitHub issue with the cgroup path
  1513      for any of your pods, obtained by running the following command on a Kubernetes
  1514      node in your cluster: ``sudo crictl inspectp -o=json $POD_ID | grep cgroup``.
  1515  
  1516  .. code-block:: shell-session
  1517  
  1518      $ kubectl get pods -o wide
  1519      NAME       READY   STATUS    RESTARTS   AGE     IP             NODE          NOMINATED NODE   READINESS GATES
  1520      mediabot   1/1     Running   0          54m     10.244.1.237   kind-worker   <none>           <none>
  1521      nginx      1/1     Running   0          3h25m   10.244.1.246   kind-worker   <none>           <none>
  1522  
  1523      $ kubectl exec -n kube-system cilium-rt2jh -- cilium-dbg monitor -v -t trace-sock
  1524      CPU 11: [pre-xlate-fwd] cgroup_id: 479586 sock_cookie: 7123674, dst [10.96.128.44]:80 tcp
  1525      CPU 11: [post-xlate-fwd] cgroup_id: 479586 sock_cookie: 7123674, dst [10.244.1.246]:80 tcp
  1526      CPU 11: [pre-xlate-rev] cgroup_id: 479586 sock_cookie: 7123674, dst [10.244.1.246]:80 tcp
  1527      CPU 11: [post-xlate-rev] cgroup_id: 479586 sock_cookie: 7123674, dst [10.96.128.44]:80 tcp
  1528  
You can identify the client pod using the printed ``cgroup id`` metadata. The pod's
``cgroup path`` corresponding to the ``cgroup id`` contains the pod's UID. The socket
cookie is a unique socket identifier allocated in the Linux kernel. The socket
cookie metadata can be used to identify all the trace events from a socket.
  1533  
  1534  .. code-block:: shell-session
  1535  
  1536      $ kubectl get pods -o custom-columns=PodName:.metadata.name,PodUID:.metadata.uid
  1537      PodName    PodUID
  1538      mediabot   b620703c-c446-49c7-84c8-e23f4ba5626b
  1539      nginx      73b9938b-7e4b-4cbd-8c4c-67d4f253ccf4
  1540  
  1541      $ kubectl exec -n kube-system cilium-rt2jh -- find /run/cilium/cgroupv2/ -inum 479586
  1542      Defaulted container "cilium-agent" out of: cilium-agent, mount-cgroup (init), apply-sysctl-overwrites (init), clean-cilium-state (init)
  1543      /run/cilium/cgroupv2/kubelet.slice/kubelet-kubepods.slice/kubelet-kubepods-besteffort.slice/kubelet-kubepods-besteffort-podb620703c_c446_49c7_84c8_e23f4ba5626b.slice/cri-containerd-4e7fc71c8bef8c05c9fb76d93a186736fca266e668722e1239fe64503b3e80d3.scope
  1544  
  1545  Troubleshooting
  1546  ***************
  1547  
  1548  Validate BPF cgroup programs attachment
  1549  =======================================
  1550  
Cilium attaches BPF ``cgroup`` programs to enable socket-based load-balancing (also known as
``host-reachable`` services). If you see connectivity issues for ``clusterIP`` services,
check whether the programs are attached to the host ``cgroup root``. The default ``cgroup``
root is set to ``/run/cilium/cgroupv2``.
Run the following commands from a Cilium agent pod as well as from the underlying
Kubernetes node where the pod is running. If the container runtime in your cluster
is running in cgroup namespace mode, the Cilium agent pod can attach BPF ``cgroup``
programs to a ``virtualized cgroup root``. In such cases, Cilium's kube-proxy replacement
based load-balancing may not be effective, leading to connectivity issues.
For more information, ensure that you have the fix from this `Pull Request <https://github.com/cilium/cilium/pull/16259>`__.
  1561  
  1562  .. code-block:: shell-session
  1563  
  1564      $ mount | grep cgroup2
  1565      none on /run/cilium/cgroupv2 type cgroup2 (rw,relatime)
  1566  
  1567      $ bpftool cgroup tree /run/cilium/cgroupv2/
  1568      CgroupPath
  1569      ID       AttachType      AttachFlags     Name
  1570      /run/cilium/cgroupv2
  1571      10613    device          multi
  1572      48497    connect4
  1573      48493    connect6
  1574      48499    sendmsg4
  1575      48495    sendmsg6
  1576      48500    recvmsg4
  1577      48496    recvmsg6
  1578      48498    getpeername4
  1579      48494    getpeername6
  1580  
  1581  Known Issues
  1582  ############
  1583  
For clusters deployed with Cilium version 1.11.14 or earlier, service backend entries could
be leaked in the BPF maps in some instances. The known cases that could lead
to such leaks are race conditions between the deletion of a service backend
while it's terminating and the simultaneous deletion of the service the backend is
associated with. This could result in duplicate backend entries that eventually
fill up the ``cilium_lb4_backends_v2`` map.
  1590  In such cases, you might see error messages like these in the Cilium agent logs::
  1591  
  1592      Unable to update element for cilium_lb4_backends_v2 map with file descriptor 15: the map is full, please consider resizing it. argument list too long
  1593  
While the leak was fixed in Cilium version 1.11.15, in some cases clusters upgrading
from the problematic Cilium versions (1.11.14 or earlier) to a subsequent version may not
see the leaked backends cleaned up from the BPF maps after the Cilium agent restarts.
The fixes to clean up leaked duplicate backend entries were backported to older
releases, and are available as part of Cilium versions v1.11.16, v1.12.9 and v1.13.2.
Fresh clusters deploying Cilium versions 1.11.15 or later don't experience this leak issue.
  1600  
  1601  For more information, see `this GitHub issue <https://github.com/cilium/cilium/issues/23551>`__.
  1602  
  1603  Limitations
  1604  ###########
  1605  
  1606      * Cilium's eBPF kube-proxy replacement currently cannot be used with :ref:`encryption_ipsec`.
  1607      * Cilium's eBPF kube-proxy replacement relies upon the socket-LB feature
  1608        which uses eBPF cgroup hooks to implement the service translation. Using it with libceph
  1609        deployments currently requires support for the getpeername(2) hook address translation in
  1610        eBPF, which is only available for kernels v5.8 and higher.
    * In order to support NFS in the kernel with the socket-LB feature, ensure that
  1612        kernel commit ``0bdf399342c5 ("net: Avoid address overwrite in kernel_connect")``
  1613        is part of your underlying kernel. Linux kernels v6.6 and higher support it. Older
  1614        stable kernels are TBD. For a more detailed discussion see :gh-issue:`21541`.
  1615      * Cilium's DSR NodePort mode currently does not operate well in environments with
  1616        TCP Fast Open (TFO) enabled. It is recommended to switch to ``snat`` mode in this
  1617        situation.
  1618      * Cilium's eBPF kube-proxy replacement does not support the SCTP transport protocol.
      Only TCP and UDP are supported as transport protocols for services at this point.
  1620      * Cilium's eBPF kube-proxy replacement does not allow ``hostPort`` port configurations
      for Pods that overlap with the configured NodePort range. In such a case, the ``hostPort``
  1622        setting will be ignored and a warning emitted to the Cilium agent log. Similarly,
  1623        explicitly binding the ``hostIP`` to the loopback address in the host namespace is
  1624        currently not supported and will log a warning to the Cilium agent log.
    * When Cilium's kube-proxy replacement is used with Kubernetes versions (< 1.19) that have
      support for ``EndpointSlices``, ``Services`` without selectors and backing ``Endpoints``
      don't work. The reason is that Cilium only monitors changes made to ``EndpointSlices``
      objects if support is available and ignores ``Endpoints`` in those cases. The Kubernetes 1.19
      release introduces the ``EndpointSliceMirroring`` controller, which mirrors custom ``Endpoints``
      resources to corresponding ``EndpointSlices``, thus allowing backing ``Endpoints``
      to work. For a more detailed discussion see :gh-issue:`12438`.
  1632      * When deployed on kernels older than 5.7, Cilium is unable to distinguish between host and
  1633        pod namespaces due to the lack of kernel support for network namespace cookies. As a result,
  1634        Kubernetes services are reachable from all pods via the loopback address.
    * The neighbor discovery in a multi-device environment doesn't work with runtime device
      detection, which means that the target devices for neighbor discovery don't follow
      device changes.
    * When the socket-LB feature is enabled, pods sending (connected) UDP traffic to services
      can continue to send traffic to a service backend even after it's deleted. The Cilium agent
  1640        handles such scenarios by forcefully terminating application sockets that are connected
  1641        to deleted backends, so that the applications can be load-balanced to active backends.
  1642        This functionality requires these kernel configs to be enabled:
  1643        ``CONFIG_INET_DIAG``, ``CONFIG_INET_UDP_DIAG`` and ``CONFIG_INET_DIAG_DESTROY``.
  1644      * Cilium's BPF-based masquerading is recommended over iptables when using the
  1645        BPF-based NodePort. Otherwise, there is a risk for port collisions between
  1646        BPF and iptables SNAT, which might result in dropped NodePort
  1647        connections :gh-issue:`23604`.
  1648  
  1649  Further Readings
  1650  ################
  1651  
The following presentations describe the inner workings of the eBPF-based kube-proxy replacement
in great detail:
  1654  
  1655      * "Liberating Kubernetes from kube-proxy and iptables" (KubeCon North America 2019, `slides
  1656        <https://docs.google.com/presentation/d/1cZJ-pcwB9WG88wzhDm2jxQY4Sh8adYg0-N3qWQ8593I/edit>`__,
  1657        `video <https://www.youtube.com/watch?v=bIRwSIwNHC0>`__)
  1658      * "Kubernetes service load-balancing at scale with BPF & XDP" (Linux Plumbers 2020, `slides
  1659        <https://linuxplumbersconf.org/event/7/contributions/674/attachments/568/1002/plumbers_2020_cilium_load_balancer.pdf>`__,
  1660        `video <https://www.youtube.com/watch?v=UkvxPyIJAko&t=21s>`__)
  1661      * "eBPF as a revolutionary technology for the container landscape" (Fosdem 2020, `slides
  1662        <https://docs.google.com/presentation/d/1VOUcoIxgM_c6M_zAV1dLlRCjyYCMdR3tJv6CEdfLMh8/edit>`__,
  1663        `video <https://fosdem.org/2020/schedule/event/containers_bpf/>`__)
  1664      * "Kernel improvements for Cilium socket LB" (LSF/MM/BPF 2020, `slides
  1665        <https://docs.google.com/presentation/d/1w2zlpGWV7JUhHYd37El_AUZzyUNSvDfktrF5MJ5G8Bs/edit#slide=id.g746fc02b5b_2_0>`__)