.. only:: not (epub or latex or html)

    WARNING: You are looking at unreleased Cilium documentation.
    Please use the official rendered version released here:

.. _l2_announcements:

*************************************
L2 Announcements / L2 Aware LB (Beta)
*************************************

.. include:: ../beta.rst

L2 Announcements is a feature which makes services visible and reachable on
the local area network. This feature is primarily intended for on-premises
deployments within networks without BGP based routing such as office or
campus networks.

When used, this feature will respond to ARP queries for ExternalIPs and/or
LoadBalancer IPs. These IPs are Virtual IPs (not installed on network
devices) on multiple nodes, so for each service one node at a time will answer
the ARP queries with its MAC address. This node will perform
load balancing with the service load balancing feature, thus acting as a
north/south load balancer.

The advantage of this feature over NodePort services is that each service can
use a unique IP so multiple services can use the same port numbers. When using
NodePorts, it is up to the client to decide to which host to send traffic, and if a node
goes down, the IP+Port combo becomes unusable. With L2 announcements the service
VIP simply migrates to another node and will continue to work.

.. _l2_announcements_settings:

Configuration
#############

The L2 Announcements feature and all the requirements can be enabled as follows:

.. tabs::
    .. group-tab:: Helm

        .. parsed-literal::

            $ helm upgrade cilium |CHART_RELEASE| \\
               --namespace kube-system \\
               --reuse-values \\
               --set l2announcements.enabled=true \\
               --set k8sClientRateLimit.qps={QPS} \\
               --set k8sClientRateLimit.burst={BURST} \\
               --set kubeProxyReplacement=true \\
               --set k8sServiceHost=${API_SERVER_IP} \\
               --set k8sServicePort=${API_SERVER_PORT}


    .. group-tab:: ConfigMap

        .. code-block:: yaml

            enable-l2-announcements: true
            kube-proxy-replacement: true
            k8s-client-qps: {QPS}
            k8s-client-burst: {BURST}

.. warning::
  Sizing the client rate limit (``k8sClientRateLimit.qps`` and ``k8sClientRateLimit.burst``)
  is important when using this feature due to increased API usage. See :ref:`sizing_client_rate_limit` for sizing guidelines.

Prerequisites
#############

* Kube Proxy replacement mode must be enabled. For more information, see
  :ref:`kubeproxy-free`.

* All devices on which L2 Aware LB will be announced should be enabled and included in the
  ``--devices`` flag or ``devices`` Helm option if explicitly set, see :ref:`NodePort Devices`.

* The ``externalIPs.enabled=true`` Helm option must be set if usage of externalIPs
  is desired. Otherwise service load balancing for external IPs is disabled.

Limitations
###########

* The feature currently does not support IPv6/NDP.

* Due to the way L3->L2 translation protocols work, one node receives all
  ARP requests for a specific IP, so no load balancing can happen before traffic hits the cluster.

* The feature currently has no traffic balancing mechanism so nodes within the
  same policy might be asymmetrically loaded. For details see :ref:`l2_announcements_leader_election`.

* The feature is incompatible with ``externalTrafficPolicy: Local`` on services as it may cause
  service IPs to be announced on nodes without pods, causing traffic drops.

Policies
########

Policies provide fine-grained control over which services should be announced,
where, and how. This is an example policy using all optional fields:

.. code-block:: yaml

    apiVersion: ""
    kind: CiliumL2AnnouncementPolicy
    metadata:
      name: policy1
    spec:
      serviceSelector:
        matchLabels:
          color: blue
      nodeSelector:
        matchExpressions:
          - key:
            operator: DoesNotExist
      interfaces:
      - ^eth[0-9]+
      externalIPs: true
      loadBalancerIPs: true

Service Selector
----------------

The service selector is a `label selector <>`__
that determines which services are selected by this policy. If no service
selector is provided, all services are selected by the policy. A service must have
`loadBalancerClass <>`__
unspecified or set to ``io.cilium/l2-announcer`` to be selected by a policy for announcement.

There are a few special purpose selector fields which don't match on labels but
instead on other metadata like ```` or ``.meta.namespace``.

=============================== ===================
Selector                        Field
------------------------------- -------------------
io.kubernetes.service.namespace ``.meta.namespace``
````
=============================== ===================

Node Selector
-------------

The node selector field is a `label selector <>`__
which determines which nodes are candidates to announce the services from.

It might be desirable to pick a subset of nodes in your cluster, since the chosen
node (see :ref:`l2_announcements_leader_election`) will act as the north/south
load balancer for all of the traffic for a particular service.

Interfaces
----------

The interfaces field is a list of regular expressions (`golang syntax <>`__)
that determine over which network interfaces the selected services will be
announced.
This field is optional; if not specified, all interfaces will be used.

The expressions are OR-ed together, so any network device matching any of the
expressions will be matched.

L2 announcements only work if the selected devices are also part of the set of
devices specified in the ``devices`` Helm option, see :ref:`NodePort Devices`.

.. note::
    This selector is NOT a security feature. Services will still be available
    via interfaces on which they are not advertised (for example by hard-coding ARP entries).

IP Types
--------

The ``externalIPs`` and ``loadBalancerIPs`` fields determine what sort of IPs
are announced. They are both set to ``false`` by default, so a functional policy should always
have one or both set to ``true``.

If ``externalIPs`` is ``true``, all IPs in the `.spec.externalIPs <>`__
field are announced. These IPs are managed by service authors.

If ``loadBalancerIPs`` is ``true``, all IPs in the service's ``.status.loadbalancer.ingress`` field
are announced. These can be assigned by :ref:`lb_ipam` which can be configured
by cluster admins for better control over which IPs can be allocated.

.. note::
    If a user intends to use ``externalIPs``, the ``externalIPs.enabled=true``
    Helm option should be set to enable service load balancing for external IPs.

Status
------

If a policy is invalid for any reason, the status of the policy will reflect that.
For example, if an invalid match expression is provided:

.. code-block:: shell-session

    $ kubectl describe l2announcement
    Name:         policy1
    Namespace:
    Labels:       <none>
    Annotations:  <none>
    API Version:
    Kind:         CiliumL2AnnouncementPolicy
    Metadata:
      #[...]
    Spec:
      #[...]
      Service Selector:
        Match Expressions:
          Key:       something
          Operator:  NotIn
          Values:
    Status:
      Conditions:
        Last Transition Time:  2023-05-12T15:39:01Z
        Message:               values: Invalid value: []string(nil): for 'in', 'notin' operators, values set can't be empty
        Observed Generation:   1
        Reason:                error
        Status:                True
        Type:                  io.cilium/bad-service-selector

The status of these error conditions will go to ``False`` as soon as the user
updates the policy to resolve the error.

.. _l2_announcements_leader_election:

Leader Election
###############

Due to the way ARP/NDP works, hosts only store one MAC address per IP, that being
the latest reply they see. This means that only one node in the cluster is allowed
to reply to requests for a given IP.

To implement this behavior, every Cilium agent resolves which services are
selected for its node and will start participating in leader election for every
service. We use the Kubernetes `lease mechanism <>`__
to achieve this. Each service translates to a lease, and the lease holder will start
replying to requests on the selected interfaces.

The lease mechanism uses a first come, first served picking order, so the first
node to claim a lease gets it. This might cause asymmetric traffic distribution.

Leases
------

The leases are created in the same namespace where Cilium is deployed,
typically ``kube-system``. You can inspect the leases with the following command:

.. code-block:: shell-session

    $ kubectl -n kube-system get lease
    NAME                                  HOLDER                                                    AGE
    cilium-l2announce-default-deathstar   worker-node                                               2d20h
    cilium-operator-resource-lock         worker-node2-tPDVulKoRK                                   2d20h
    kube-controller-manager               control-plane-node_9bd97f6c-cd0c-4565-8486-e718deb310e4   2d21h
    kube-scheduler                        control-plane-node_2c490643-dd95-4f73-8862-139afe771ffd   2d21h

The leases starting with ``cilium-l2announce-`` are leases used by this feature.
The last part of the name is the namespace and service name. The holder indicates
the name of the node that currently holds the lease and thus announces the IPs
of that given service.

To inspect a lease:

.. code-block:: shell-session

    $ kubectl -n kube-system get lease/cilium-l2announce-default-deathstar -o yaml
    apiVersion:
    kind: Lease
    metadata:
      creationTimestamp: "2023-05-09T15:13:32Z"
      name: cilium-l2announce-default-deathstar
      namespace: kube-system
      resourceVersion: "449966"
      uid: e3c9c020-6e24-4c5c-9df9-d0c50f6c4cec
    spec:
      acquireTime: "2023-05-09T15:14:20.108431Z"
      holderIdentity: worker-node
      leaseDurationSeconds: 3
      leaseTransitions: 1
      renewTime: "2023-05-12T12:15:26.773020Z"

The ``acquireTime`` is the time at which the current leader acquired the lease.
The ``holderIdentity`` is the name of the current holder/leader node.
If the leader does not renew the lease for ``leaseDurationSeconds`` seconds, a
new leader is chosen. ``leaseTransitions`` indicates how often the lease changed
hands and ``renewTime`` the last time the leader renewed the lease.

There are three Helm options that can be tuned with regard to leases:

* ``l2announcements.leaseDuration`` determines the ``leaseDurationSeconds`` value
  of created leases and by extension how long a leader must be "down" before
  failover occurs.
  Its default value is 15s. It must always be greater than 1s
  and be larger than ``leaseRenewDeadline``.

* ``l2announcements.leaseRenewDeadline`` is the interval at which the leader
  should renew the lease. Its default value is 5s. It must be greater than
  ``leaseRetryPeriod`` by at least 20% and is not allowed to be below ``1ns``.

* ``l2announcements.leaseRetryPeriod`` determines how long the agent waits
  before it tries again if renewing the lease fails. Its default value is 2s. It
  must be smaller than ``leaseRenewDeadline`` by at least 20% and above ``1ns``.

.. note::
    The theoretical shortest time between failure and failover is
    ``leaseDuration - leaseRenewDeadline`` and the longest ``leaseDuration + leaseRenewDeadline``.
    So with the default values, failover occurs between 10s and 20s.
    For the example below, these times are between 2s and 4s.

.. tabs::
    .. group-tab:: Helm

        .. parsed-literal::

            $ helm upgrade cilium |CHART_RELEASE| \\
               --namespace kube-system \\
               --reuse-values \\
               --set l2announcements.enabled=true \\
               --set kubeProxyReplacement=true \\
               --set k8sServiceHost=${API_SERVER_IP} \\
               --set k8sServicePort=${API_SERVER_PORT} \\
               --set k8sClientRateLimit.qps={QPS} \\
               --set k8sClientRateLimit.burst={BURST} \\
               --set l2announcements.leaseDuration=3s \\
               --set l2announcements.leaseRenewDeadline=1s \\
               --set l2announcements.leaseRetryPeriod=200ms

    .. group-tab:: ConfigMap

        .. code-block:: yaml

            enable-l2-announcements: true
            kube-proxy-replacement: true
            l2-announcements-lease-duration: 3s
            l2-announcements-renew-deadline: 1s
            l2-announcements-retry-period: 200ms
            k8s-client-qps: {QPS}
            k8s-client-burst: {BURST}

There is a trade-off between fast failure detection and CPU + network usage.
Each service incurs a CPU and network overhead, so clusters with fewer
services can more easily afford faster failover times. Larger clusters might
need to increase these parameters if the overhead is too high.

.. _sizing_client_rate_limit:

Sizing client rate limit
========================

The leader election process continually generates API traffic. The exact amount
depends on the configured lease duration, the configured renew deadline, and the number
of services using the feature.

The default client rate limit is 5 QPS with allowed bursts up to 10 QPS. This
default limit is quickly reached when utilizing L2 announcements and thus users
should size the client rate limit accordingly.

In a worst case scenario, services are distributed unevenly, so we will assume
a peak load based on the renew deadline. In complex scenarios with multiple
policies over disjoint sets of nodes, the max QPS per node will be lower.

.. code-block:: text

    QPS = #services * (1 / leaseRenewDeadline)

    // example
    #services = 65
    leaseRenewDeadline = 2s
    QPS = 65 * (1 / 2s) = 32.5 QPS

Setting the base QPS to around the calculated value should be sufficient, given
that in multi-node scenarios leases are spread around nodes, and non-holders participating
in the election have a lower QPS.

The burst QPS should be slightly higher to allow for bursts of traffic caused
by other features which also use the API server.

Failover
--------

When nodes participating in leader election detect that the lease holder did not
renew the lease for ``leaseDurationSeconds`` seconds, they will ask
the API server to make them the new holder. The first request to be processed
gets through and the rest are denied.

When a node becomes the leader/holder, it will send out a gratuitous ARP reply
over all of the configured interfaces.
Clients who accept these will update
their ARP tables at once, causing them to send traffic to the new leader/holder.
Not all clients accept gratuitous ARP replies since they can be used for ARP spoofing.
Such clients might experience longer downtime than configured in the leases
since they will only re-query via ARP when the TTL in their internal tables
has been reached.

.. note::
   Since this feature has no IPv6 support yet, only ARP messages are sent, no
   Unsolicited Neighbor Advertisements are sent.

Troubleshooting
###############

This section is a step-by-step guide on how to troubleshoot L2 Announcements,
hopefully solving your issue or narrowing it down to a specific area.

The first thing we need to do is to check that the feature is enabled, kube proxy replacement
is active and optionally that external IPs are enabled.

.. code-block:: shell-session

    $ kubectl -n kube-system exec ds/cilium -- cilium-dbg config --all | grep EnableL2Announcements
    EnableL2Announcements             : true

    $ kubectl -n kube-system exec ds/cilium -- cilium-dbg config --all | grep KubeProxyReplacement
    KubeProxyReplacement              : true

    $ kubectl -n kube-system exec ds/cilium -- cilium-dbg config --all | grep EnableExternalIPs
    EnableExternalIPs                 : true

If ``EnableL2Announcements`` or ``KubeProxyReplacement`` indicates ``false``, make sure to enable the
correct settings and deploy the helm chart :ref:`l2_announcements_settings`. ``EnableExternalIPs`` should be set to ``true`` if you intend to use external IPs.

Next, ensure you have at least one policy configured. L2 announcements will not work without a policy.

.. code-block:: shell-session

    $ kubectl get CiliumL2AnnouncementPolicy
    NAME      AGE
    policy1   6m16s

L2 announcements should now create a lease for every service matched by the policy. We can check the leases like so:

.. code-block:: shell-session

    $ kubectl -n kube-system get lease | grep "cilium-l2announce"
    cilium-l2announce-default-service-red   kind-worker   34s

If the output is empty, then the policy is not correctly configured or the agent is not running correctly.
Check the logs of the agent for error messages:

.. code-block:: shell-session

    $ kubectl -n kube-system logs ds/cilium | grep "l2"

A common error is that the agent is not able to create leases.

.. code-block:: shell-session

    $ kubectl -n kube-system logs ds/cilium | grep "error"
    time="2024-06-25T12:01:43Z" level=error msg="error retrieving resource lock kube-system/cilium-l2announce-default-service-red: \"cilium-l2announce-default-service-red\" is forbidden: User \"system:serviceaccount:kube-system:cilium\" cannot get resource \"leases\" in API group \"\" in the namespace \"kube-system\"" subsys=klog

This can happen if the cluster role of the agent is not correct. This tends to happen when L2 announcements is enabled
without using the helm chart. Redeploy the helm chart or manually update the cluster role by running
``kubectl edit clusterrole cilium`` and adding the following block to the rules:

.. code-block:: yaml

    - apiGroups:
      -
      resources:
      - leases
      verbs:
      - create
      - get
      - update
      - list
      - delete

Another common error is that the configured client rate limit is too low.
This can be seen in the logs as well:

.. code-block:: shell-session

    $ kubectl -n kube-system logs ds/cilium | grep "l2"
    2023-07-04T14:59:51.959400310Z level=info msg="Waited for 1.395439596s due to client-side throttling, not priority and fairness, request: GET:" subsys=klog
    2023-07-04T15:00:12.159409007Z level=info msg="Waited for 1.398748976s due to client-side throttling, not priority and fairness, request: PUT:" subsys=klog

These logs are associated with intermittent failures to renew the lease, connection issues and/or frequent leader changes.
See :ref:`sizing_client_rate_limit` for more information on how to size the client rate limit.

If you find a different L2 related error, please open a GitHub issue with the error message and the
steps you took to get there.

Assuming the leases are created, the next step is to check the agent internal state. Pick a service which isn't working
and inspect its lease. Take the holder name and find the cilium agent pod for the holder node.
Finally, take the name of the cilium agent pod and inspect the l2-announce state:

.. code-block:: shell-session

    $ kubectl -n kube-system get lease cilium-l2announce-default-service-red
    NAME                                    HOLDER        AGE
    cilium-l2announce-default-service-red   <node-name>   20m

    $ kubectl -n kube-system get pod -l '' -o wide | grep <node-name>
    <agent-pod>   1/1   Running   0   35m   kind-worker   <none>   <none>

    $ kubectl -n kube-system exec pod/<agent-pod> -- cilium-dbg statedb l2-announce
    # IP   NetworkInterface
    eth0

The l2-announce state should contain the IP of the service and the network interface it is announced on.
If the lease is present but its IP is not in the l2-announce state, or an entry for a given network device is missing,
double check that the device selector in the policy matches the desired network device (values are regular expressions).
If the filter seems correct or isn't specified, inspect the known devices:

.. code-block:: shell-session

    $ kubectl -n kube-system exec ds/cilium -- cilium-dbg statedb devices
    # Name            Index   Selected   Type     MTU     HWAddr              Flags                    Addresses
    lxc5d23398605f6   10      false      veth     1500    b6:ed:d8:d2:dd:ec   up|broadcast|multicast   fe80::b4ed:d8ff:fed2:ddec
    lxc3bf03c00d6e3   12      false      veth     1500    8a:d1:0c:91:8a:d3   up|broadcast|multicast   fe80::88d1:cff:fe91:8ad3
    eth0              50      true       veth     1500    02:42:ac:13:00:03   up|broadcast|multicast   , fc00:c111::3, fe80::42:acff:fe13:3
    lo                1       false      device   65536                       up|loopback              , ::1
    cilium_net        2       false      veth     1500    1a:a9:2f:4d:d3:3d   up|broadcast|multicast   fe80::18a9:2fff:fe4d:d33d
    cilium_vxlan      4       false      vxlan    1500    2a:05:26:8d:79:9c   up|broadcast|multicast   fe80::2805:26ff:fe8d:799c
    lxc611291f1ecbb   8       false      veth     1500    7a:fb:ec:54:e2:5c   up|broadcast|multicast   fe80::78fb:ecff:fe54:e25c
    lxc_health        16      false      veth     1500    0a:94:bf:49:d5:50   up|broadcast|multicast   fe80::894:bfff:fe49:d550
    cilium_host       3       false      veth     1500    22:32:e2:80:21:34   up|broadcast|multicast   , fd00:10:244:1::f58a

Only devices with ``Selected`` set to ``true`` can be used for L2 announcements. Typically all physical devices with IPs
assigned to them will be considered selected. The ``--devices`` flag or ``devices`` Helm option can be used to filter
out devices. If your desired device is in the list but not selected, check the devices flag/option to see if it filters it out.

Please open a GitHub issue if your desired device doesn't appear in the list or it isn't selected while you believe it should be.

If the L2 state contains the IP and device combination but there are still connection issues, it's time to test ARP
within the cluster. Pick a cilium agent pod other than the lease holder on the same L2 network.
Then use the following command to send an ARP request to the service IP:

.. code-block:: shell-session

    $ kubectl -n kube-system exec pod/cilium-z4ef7 -- sh -c 'apt update && apt install -y arping && arping -i <netdev-on-l2> <service-ip>'
    [omitting apt output...]
    ARPING
    58 bytes from 02:42:ac:13:00:03 ( index=0 time=11.772 usec
    58 bytes from 02:42:ac:13:00:03 ( index=1 time=9.234 usec
    58 bytes from 02:42:ac:13:00:03 ( index=2 time=10.568 usec

If the output is as above yet the service is still unreachable from clients within the same L2 network,
the issue might be client related. If you expect the service to be reachable from outside the L2 network,
and it is not, check the ARP and routing tables of the gateway device.

If the ARP request fails (the output shows ``Timeout``), check the BPF map of the cilium-agent with the lease:

.. code-block:: shell-session

    $ kubectl -n kube-system exec pod/cilium-vxz67 -- bpftool map dump pinned /sys/fs/bpf/tc/globals/cilium_l2_responder_v4
    [{
            "key": {
                "ip4": 655370,
                "ifindex": 50
            },
            "value": {
                "responses_sent": 20
            }
        }
    ]

The ``responses_sent`` field is incremented every time the datapath responds to an ARP request. If the field
is 0, then the ARP request doesn't make it to the node. If the field is greater than 0, the issue is on the
return path. In both cases, inspect the network and the client.

It is still possible that the service is unreachable even though ARP requests are answered. This can happen
for a number of reasons, usually unrelated to L2 announcements, but rather other Cilium features.

One common issue, however, is caused by the usage of ``.Spec.ExternalTrafficPolicy: Local`` on services. This setting
normally tells a load balancer to only forward traffic to nodes with at least 1 ready pod to avoid a second hop.
Unfortunately, L2 announcements isn't currently aware of this setting and will announce the service IP on all nodes
matching policies. If a node without a pod receives traffic, it will drop it. To fix this, set the policy to
``.Spec.ExternalTrafficPolicy: Cluster``.

Please open a GitHub issue if none of the above steps helped you solve your issue.

.. _l2_pod_announcements:

L2 Pod Announcements
####################

L2 Pod Announcements announce Pod IP addresses on the L2 network using
Gratuitous ARP replies. When enabled, the node transmits Gratuitous ARP
replies for every locally created pod, on the configured network
interface. This feature is enabled separately from the above L2
announcements feature.

To enable L2 Pod Announcements, set the following:

.. tabs::
    .. group-tab:: Helm

        .. parsed-literal::

            $ helm upgrade cilium |CHART_RELEASE| \\
               --namespace kube-system \\
               --reuse-values \\
               --set l2podAnnouncements.enabled=true \\
               --set l2podAnnouncements.interface=eth0


    .. group-tab:: ConfigMap

        .. code-block:: yaml

            enable-l2-pod-announcements: true
            l2-pod-announcements-interface: eth0

.. note::
   Since this feature has no IPv6 support yet, only ARP messages are
   sent, no Unsolicited Neighbor Advertisements are sent.
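
To make the term "Gratuitous ARP reply" more concrete, the following Python
sketch builds the bytes of such a frame for one address. It is an illustration
of the generic ARP-over-Ethernet layout only, not Cilium's implementation: the
function names, the example MAC address and the example IP (taken from the
``192.0.2.0/24`` documentation range) are assumptions, and the choice of a
broadcast target hardware address is one common convention that other
implementations may not follow.

.. code-block:: python

    import ipaddress
    import struct

    BROADCAST_MAC = b"\xff" * 6
    ETHERTYPE_ARP = 0x0806
    ARP_OP_REPLY = 2


    def mac_to_bytes(mac: str) -> bytes:
        """Convert 'aa:bb:cc:dd:ee:ff' into 6 raw bytes."""
        raw = bytes.fromhex(mac.replace(":", ""))
        if len(raw) != 6:
            raise ValueError(f"not a MAC address: {mac!r}")
        return raw


    def gratuitous_arp_reply(mac: str, ip: str) -> bytes:
        """Build an Ethernet frame carrying a gratuitous ARP reply (hypothetical helper)."""
        sender_mac = mac_to_bytes(mac)
        sender_ip = ipaddress.IPv4Address(ip).packed

        # Ethernet header: broadcast destination, our MAC as source, ARP ethertype.
        ethernet = BROADCAST_MAC + sender_mac + struct.pack("!H", ETHERTYPE_ARP)

        # ARP header: Ethernet (1), IPv4 (0x0800), address lengths 6 and 4, opcode reply.
        arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, ARP_OP_REPLY)

        # A gratuitous ARP uses the same IP as both sender and target address.
        arp += sender_mac + sender_ip + BROADCAST_MAC + sender_ip
        return ethernet + arp


    if __name__ == "__main__":
        frame = gratuitous_arp_reply("02:42:ac:13:00:03", "192.0.2.10")
        print(frame.hex())

Hosts that accept such a frame update their ARP table entry for the announced
IP to the sender's MAC address without having sent a request first.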