.. only:: not (epub or latex or html)

    WARNING: You are looking at unreleased Cilium documentation.
    Please use the official rendered version released here:
    https://docs.cilium.io

.. _ipam_eni:

#######
AWS ENI
#######

The AWS ENI allocator is specific to Cilium deployments running in the AWS
cloud and performs IP allocation based on IPs of `AWS Elastic Network Interfaces (ENI)
<https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html>`__ by
communicating with the AWS EC2 API.

The architecture ensures that only a single operator communicates with the EC2
service API to avoid rate-limiting issues in large clusters. A pre-allocation
watermark is used to keep a number of IP addresses available for use
on nodes at all times without needing to contact the EC2 API when a new pod is
scheduled in the cluster.

Note that Cilium currently does not support IPv6-only ENIs. Cilium support for
IPv6 ENIs is being tracked in :gh-issue:`18405`, and the related feature of
assigning IPv6 prefixes in :gh-issue:`19251`.

************
Architecture
************

.. image:: eni_arch.png
    :align: center

The AWS ENI allocator builds on top of the CRD-backed allocator. Each node
creates a ``ciliumnodes.cilium.io`` custom resource matching the node name when
Cilium starts up for the first time on that node. It contacts the EC2 metadata
API to retrieve the instance ID, instance type, and VPC information, then it
populates the custom resource with this information. ENI allocation parameters
are provided as agent configuration options and are passed into the custom
resource as well.

The Cilium operator listens for new ``ciliumnodes.cilium.io`` custom resources
and starts managing the IPAM aspect automatically.
It scans the EC2 instances
for existing ENIs with associated IPs and makes them available via the
``spec.ipam.available`` field. It will then constantly monitor the used IP
addresses in the ``status.ipam.used`` field and automatically create ENIs and
allocate more IPs as needed to meet the IP pre-allocation watermark. This ensures
that there are always IPs available.

The selection of subnets to use for allocation as well as attachment of
security groups to new ENIs can be controlled separately for each node. This
makes it possible to hand out pod IPs with differing security groups on
individual nodes.

The corresponding datapath is described in section :ref:`aws_eni_datapath`.

*************
Configuration
*************

* The Cilium agent and operator must be run with the option ``--ipam=eni`` or
  the option ``ipam: eni`` must be set in the ConfigMap. This will enable ENI
  allocation in both the node agent and operator.

* In most scenarios, it makes sense to automatically create the
  ``ciliumnodes.cilium.io`` custom resource when the agent starts up on a node
  for the first time. To enable this, specify the option
  ``--auto-create-cilium-node-resource`` or set
  ``auto-create-cilium-node-resource: "true"`` in the ConfigMap.

* If IPs are limited, run the Operator with option
  ``--aws-release-excess-ips=true``. When enabled, the operator checks the number
  of IPs regularly and attempts to release excess free IPs from ENIs.

* It is generally a good idea to enable metrics in the Operator as well with
  the option ``--enable-metrics``. See the section :ref:`install_metrics` for
  additional information on how to install and run Prometheus including the
  Grafana dashboard.

* By default, ENIs will be tagged with the cluster name, to allow Cilium
  Operator to garbage collect these ENIs if left dangling.
  The cluster name is
  either extracted from Cilium's own ``cluster-name`` flag or from the
  ``aws:eks:cluster-name`` tag on the operator's EC2 instance. If neither
  cluster name is available, a static default cluster name is assumed and
  ENI garbage collection will be performed across all such unnamed clusters.
  You may override this behavior by setting a cluster-specific ``--eni-gc-tags``
  tag set.

Custom ENI Configuration
========================

Custom ENI configuration can be defined with a custom CNI configuration
``ConfigMap``:

Create a CNI configuration
--------------------------

Create a ``cni-config.yaml`` file based on the template below. Fill in the
``subnet-tags`` field, assuming that the subnets in AWS have the tags applied
to them:

.. code-block:: yaml

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cni-configuration
      namespace: kube-system
    data:
      cni-config: |-
        {
          "cniVersion":"0.3.1",
          "name":"cilium",
          "plugins": [
            {
              "cniVersion":"0.3.1",
              "type":"cilium-cni",
              "eni": {
                "subnet-tags":{
                  "foo":"true"
                }
              }
            }
          ]
        }

Additional parameters may be configured in the ``eni`` or ``ipam`` section of
the CNI configuration file. See the list of ENI allocation parameters below
for a reference of the supported options.

Deploy the ``ConfigMap``:

.. code-block:: shell-session

    kubectl apply -f cni-config.yaml

Configure Cilium with subnet-tags-filter
----------------------------------------

Using the instructions above to deploy Cilium and CNI config, specify the
following additional arguments to Helm:

.. code-block:: shell-session

    --set cni.customConf=true \
    --set cni.configMap=cni-configuration

ENI Allocation Parameters
=========================

The following parameters are available to control the ENI creation and IP
allocation:

``InstanceType``
  The AWS EC2 instance type

  *This field is automatically populated when using ``--auto-create-cilium-node-resource``*

``spec.eni.vpc-id``
  The VPC identifier used to create ENIs and select AWS subnets for IP
  allocation.

  *This field is automatically populated when using ``--auto-create-cilium-node-resource``*

``spec.eni.availability-zone``
  The availability zone used to create ENIs and select AWS subnets for IP
  allocation.

  *This field is automatically populated when using ``--auto-create-cilium-node-resource``*

``spec.eni.node-subnet-id``
  The subnet ID of the first ENI of a node. Used as a fallback for subnet
  selection in the case where no subnet IDs or tags are configured.

  *This field is automatically populated when using ``--auto-create-cilium-node-resource``*

``spec.ipam.min-allocate``
  The minimum number of IPs that must be allocated when the node is first
  bootstrapped. It defines the minimum base set of addresses that must be
  available. After reaching this watermark, the PreAllocate and
  MaxAboveWatermark logic takes over to continue allocating IPs.

  If unspecified, no minimum number of IPs is required.

``spec.ipam.max-allocate``
  The maximum number of IPs that can be allocated to the node.
  As the number of allocated IPs approaches this value,
  the effective value of PreAllocate decreases down to 0 in order to
  not attempt to allocate more addresses than defined.

  If unspecified, no maximum number of IPs will be enforced.

``spec.ipam.pre-allocate``
  The number of IP addresses that must be available for allocation at all
  times. It defines the buffer of addresses available immediately without
  requiring the operator to get involved.

  If unspecified, this value defaults to 8.

``spec.ipam.max-above-watermark``
  The maximum number of addresses to allocate beyond the addresses needed to
  reach the PreAllocate watermark. Going above the watermark can help reduce
  the number of API calls to allocate IPs, e.g. when a new ENI is allocated, as
  many secondary IPs as possible are allocated. Limiting the amount can help
  reduce waste of IPs.

  If left unspecified, the value defaults to 0.

``spec.eni.first-interface-index``
  The index of the first ENI to use for IP allocation, e.g. if the node has
  ``eth0``, ``eth1``, ``eth2`` and FirstInterfaceIndex is set to 1, then only
  ``eth1`` and ``eth2`` will be used for IP allocation, ``eth0`` will be
  ignored for PodIP allocation.

  If unspecified, this value defaults to 0 which means that ``eth0`` will
  be used for pod IPs.

``spec.eni.security-group-tags``
  The list of tags which will be used to filter the security groups to
  attach to any ENI that is created and attached to the instance.

  If unspecified, the security group ids passed in the
  ``spec.eni.security-groups`` field will be used.

``spec.eni.security-groups``
  The list of security group ids to attach to any ENI that is created
  and attached to the instance.

  If unspecified, the security group ids of ``eth0`` will be used.

``spec.eni.subnet-ids``
  The subnet IDs used to select the AWS subnets for IP allocation. This is an
  additional requirement on top of requiring to match the availability zone and
  VPC of the instance. This parameter is mutually exclusive with, and takes
  priority over, ``spec.eni.subnet-tags``.

  If unspecified, it will let the operator pick any available subnet in the AZ
  with the most IP addresses available.

``spec.eni.subnet-tags``
  The tags used to select the AWS subnets for IP allocation. This is an
  additional requirement on top of requiring to match the availability zone and
  VPC of the instance.

  If unspecified, no tags are required.

``spec.eni.exclude-interface-tags``
  The tags used to exclude interfaces from IP allocation. Any ENI attached to
  a node which matches this set of tags will be ignored by Cilium and may be
  used for other purposes. This parameter can be used in combination with
  ``subnet-tags`` or ``first-interface-index`` to exclude additional interfaces.

  If unspecified, no tags are used to exclude interfaces.

``spec.eni.delete-on-termination``
  Remove the ENI when the instance is terminated.

  If unspecified, this option is enabled.

*******************
Operational Details
*******************

Cache of ENIs, Subnets, and VPCs
================================

The operator maintains a list of all EC2 ENIs, VPCs and subnets associated with
the AWS account in a cache. For this purpose, the operator performs the
following three EC2 API operations:

* ``DescribeNetworkInterfaces``
* ``DescribeSubnets``
* ``DescribeVpcs``

The cache is updated once per minute or after an IP allocation or ENI creation
has been performed. When triggered based on an allocation or creation, the
operation is performed at most once per second.

Publication of available ENI IPs
================================

Following the update of the cache, all CiliumNode custom resources representing
nodes are updated to publish eventual new IPs that have become available.

In this process, all ENIs with an interface index greater than or equal to
``spec.eni.first-interface-index`` are scanned for all available IPs. All IPs
found are added to ``spec.ipam.available``. Each ENI meeting this criterion is
also added to ``status.eni.enis``.

If this update caused the custom resource to change, the custom resource is
updated using the Kubernetes API methods ``Update()`` and/or ``UpdateStatus()``
if available.

Determination of ENI IP deficits or excess
==========================================

The operator constantly monitors all nodes and detects deficits in available
ENI IP addresses. The check to recognize a deficit is performed on two
occasions:

* When a ``CiliumNode`` custom resource is updated
* When all nodes are scanned at a regular interval (once per minute)

If ``--aws-release-excess-ips`` is enabled, the check to recognize IP excess
is performed during the interval-based scan.

When determining whether a node has a deficit in IP addresses, the following
calculation is performed:

.. code-block:: go

    availableIPs := len(spec.ipam.pool)
    neededIPs = max(spec.ipam.pre-allocate - (availableIPs - len(status.ipam.used)), spec.ipam.min-allocate - availableIPs)
    if spec.ipam.max-allocate > 0 {
        neededIPs = min(max(spec.ipam.max-allocate - availableIPs, 0), neededIPs)
    }

For excess IP calculation:

.. code-block:: go

    availableIPs := len(spec.ipam.pool)
    upperBound := spec.ipam.min-allocate + spec.ipam.max-above-watermark
    switch {
    case availableIPs <= upperBound:
        excessIPs = 0
    case len(status.ipam.used) <= upperBound && len(status.ipam.used) + spec.ipam.pre-allocate <= upperBound:
        excessIPs = availableIPs - upperBound
    default:
        excessIPs = max(availableIPs - len(status.ipam.used) - upperBound, 0)
    }

Upon detection of a deficit, the node is added to the list of nodes which
require IP address allocation. When a deficit is detected using the
interval-based scan, the allocation order of nodes is determined based on the severity
of the deficit, i.e. the node with the biggest deficit will be at the front of
the allocation queue. Nodes that need to release IPs are behind nodes that need
allocation.

The allocation queue is handled on demand but at most once per second.

IP Allocation
=============

When performing IP allocation for a node with an address deficit, the operator
first looks at the ENIs which are already attached to the instance represented
by the CiliumNode resource. All ENIs with an interface index greater than or
equal to ``spec.eni.first-interface-index`` are considered for use.

.. note::

   In order to not use ``eth0`` for IP allocation, set
   ``spec.eni.first-interface-index`` to ``1`` to skip the first interface in
   line.

The operator will then pick the first already allocated ENI which meets the
following criteria:

* The ENI has associated addresses which are not yet used, or the number of
  addresses associated with the ENI is less than the instance type specific
  limit.

* The subnet associated with the ENI has IPs available for allocation.

The following formula is used to determine how many IPs are allocated on the ENI:

.. code-block:: go

    // surgeAllocate kicks in if numPendingPods is greater than NeededAddresses
    min(AvailableOnSubnet, min(AvailableOnENI, NeededAddresses + spec.ipam.max-above-watermark + surgeAllocate))

.. note::

    In scenarios where the pre-allocated number is lower than the number of pending pods on the node, the operator will
    pro-actively allocate more than the pre-allocated number of IPs to avoid having to wait for the next allocation
    cycles.

This means that the number of IPs allocated in a single allocation cycle can be
less than what is required to fulfill ``spec.ipam.pre-allocate``.

In order to allocate the IPs, the method ``AssignPrivateIpAddresses`` of the
EC2 service API is called. When no more ENIs are available meeting the above
criteria, a new ENI is created.

IP Release
==========

When performing IP release for a node with IP excess, the operator scans
ENIs attached to the node with an interface index greater than or equal to
``spec.eni.first-interface-index`` and selects the ENI with the most free IPs
available for release. The following formula is used to determine how many IPs
are available for release on the ENI:

.. code-block:: go

    min(FreeOnENI, (FreeIPs - spec.ipam.pre-allocate - spec.ipam.max-above-watermark))

The operator releases IPs from the selected ENI. If excess free IPs remain
afterwards, the operator attempts to release them in the next release cycle.

In order to release the IPs, the method ``UnassignPrivateIpAddresses`` of the
EC2 service API is called. There is no limit on ENIs per subnet, so ENIs
remain attached to the node.


ENI Creation
============

As long as an instance type is capable of allocating additional ENIs, ENIs are
allocated automatically based on demand.

When allocating an ENI, the first operation performed is to identify the best
subnet.
This is done by searching through all subnets and finding a subnet that
matches the following criteria:

* The VPC ID of the subnet matches ``spec.eni.vpc-id``
* The Availability Zone of the subnet matches
  ``spec.eni.availability-zone``

If set, ``spec.eni.subnet-ids`` or ``spec.eni.subnet-tags`` are used to further
narrow down the set of candidate subnets. Any subnet with an ID in
``subnet-ids`` is a candidate, whereas a subnet must match all ``subnet-tags``
to be a candidate. Note that when ``subnet-ids`` is set, ``subnet-tags`` are
ignored. If multiple subnets match, the subnet with the most available addresses
is selected.

If neither ``subnet-ids`` nor ``subnet-tags`` are set, the operator consults
``spec.eni.node-subnet-id``, attempting to create the ENI in the same subnet as
the primary ENI of the instance. If this is not possible (e.g. if there are not
enough IPs in said subnet), the operator falls back to allocating the IP in the
largest subnet matching VPC and Availability Zone.

After selecting the subnet, the interface index is determined. For this purpose,
all existing ENIs are scanned and the first unused index greater than or equal to
``spec.eni.first-interface-index`` is selected.

After determining the subnet and interface index, the ENI is created and
attached to the EC2 instance using the methods ``CreateNetworkInterface`` and
``AttachNetworkInterface`` of the EC2 API.

The security group ids attached to the ENI are computed in the following order:

1. The field ``spec.eni.security-groups`` is consulted first. If this is set
   then these will be the security group ids attached to the newly created ENI.
2. The field ``spec.eni.security-group-tags`` is consulted. If this is set then
   the operator will list all security groups in the account and will attach to
   the ENI the ones that match the list of tags passed.
3. Finally if none of the above fields are set then the newly created ENI will
   inherit the security group ids of ``eth0`` of the machine.

The description of the ENI will be in the following format:

.. code-block:: go

    "Cilium-CNI (<EC2 instance ID>)"

If the ENI tagging feature is enabled then the ENI will be tagged with the provided information.

ENI Deletion Policy
===================

ENIs can be marked for deletion when the EC2 instance to which the ENI is
attached is terminated. In order to enable this, the option
``spec.eni.delete-on-termination`` can be enabled. If enabled, the ENI
is modified after creation using ``ModifyNetworkInterfaceAttribute`` to specify this
deletion policy.

Node Termination
================

When a node or instance terminates, the Kubernetes apiserver will send a node
deletion event. This event will be picked up by the operator and the operator
will delete the corresponding ``ciliumnodes.cilium.io`` custom resource.
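
As a recap of the arithmetic used in this chapter, the deficit and excess
formulas from the section "Determination of ENI IP deficits or excess" can be
written as compilable Go. The following is a minimal sketch that mirrors the
documented formulas only; the ``ipamState`` type and the helper names are
hypothetical and are not taken from the Cilium code base, and the operator's
actual implementation may differ in detail.

.. code-block:: go

    package main

    import "fmt"

    // ipamState is a hypothetical stand-in for the CiliumNode fields used in
    // the documented formulas. It is not a type from the Cilium code base.
    type ipamState struct {
        Available         int // len(spec.ipam.pool)
        Used              int // len(status.ipam.used)
        PreAllocate       int // spec.ipam.pre-allocate
        MinAllocate       int // spec.ipam.min-allocate
        MaxAllocate       int // spec.ipam.max-allocate, 0 means no limit
        MaxAboveWatermark int // spec.ipam.max-above-watermark
    }

    func minInt(a, b int) int {
        if a < b {
            return a
        }
        return b
    }

    func maxInt(a, b int) int {
        if a > b {
            return a
        }
        return b
    }

    // neededIPs mirrors the documented deficit formula. A result of zero or
    // less means that the node has no deficit.
    func neededIPs(s ipamState) int {
        needed := maxInt(s.PreAllocate-(s.Available-s.Used), s.MinAllocate-s.Available)
        if s.MaxAllocate > 0 {
            needed = minInt(maxInt(s.MaxAllocate-s.Available, 0), needed)
        }
        return needed
    }

    // excessIPs mirrors the documented excess formula.
    func excessIPs(s ipamState) int {
        upperBound := s.MinAllocate + s.MaxAboveWatermark
        switch {
        case s.Available <= upperBound:
            return 0
        case s.Used <= upperBound && s.Used+s.PreAllocate <= upperBound:
            return s.Available - upperBound
        default:
            return maxInt(s.Available-s.Used-upperBound, 0)
        }
    }

    func main() {
        // 10 IPs in the pool, 8 in use, default pre-allocate of 8.
        fmt.Println(neededIPs(ipamState{Available: 10, Used: 8, PreAllocate: 8})) // 6
        // The same node capped at 12 IPs in total.
        fmt.Println(neededIPs(ipamState{Available: 10, Used: 8, PreAllocate: 8, MaxAllocate: 12})) // 2
        // A node holding far more IPs than it uses.
        fmt.Println(excessIPs(ipamState{Available: 30, Used: 5, PreAllocate: 8, MinAllocate: 10, MaxAboveWatermark: 2})) // 13
    }

In the second call, ``max-allocate`` limits the deficit to the two addresses
that still fit under the cap, which illustrates how the effective
pre-allocation shrinks as a node approaches its maximum.
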

.. _ec2privileges:

*******************
Required Privileges
*******************

The following EC2 privileges are required by the Cilium operator in order to
perform ENI creation and IP allocation:

* ``DeleteNetworkInterface``
* ``DescribeNetworkInterfaces``
* ``DescribeSubnets``
* ``DescribeVpcs``
* ``DescribeSecurityGroups``
* ``CreateNetworkInterface``
* ``AttachNetworkInterface``
* ``ModifyNetworkInterfaceAttribute``
* ``AssignPrivateIpAddresses``
* ``CreateTags``

If ENI GC is enabled (which is the default), and ``--cluster-name`` and ``--eni-gc-tags`` are not set to custom values:

* ``DescribeTags``

If releasing excess IPs is enabled:

* ``UnassignPrivateIpAddresses``

If ``--instance-tags-filter`` is used:

* ``DescribeInstances``

*****************************
EC2 instance types ENI limits
*****************************

Currently the EC2 Instance ENI limits (adapters per instance + IPv4/IPv6 IPs per adapter) are
hardcoded in the Cilium codebase for easy out-of-the-box deployment and usage.

The limits can be modified via the ``--aws-instance-limit-mapping`` CLI flag on
the cilium-operator. This allows the user to supply a custom limit.

Additionally the limits can be updated via the EC2 API by passing the
``--update-ec2-adapter-limit-via-api`` CLI flag.
This will require an additional EC2 IAM permission:

* ``DescribeInstanceTypes``

*******
Metrics
*******

The IPAM metrics are documented in the section :ref:`ipam_metrics`.

******************
Node Configuration
******************

The IP address and routes on ENIs attached to the instance will be
managed by the Cilium agent. Therefore, any system service trying to manage
newly attached network interfaces will interfere with Cilium's configuration.
Common scenarios are ``NetworkManager`` or ``systemd-networkd`` automatically
performing DHCP on these interfaces or removing Cilium's IP address when the
carrier is temporarily lost. Be sure to disable these services or configure
your Linux distribution to not manage the newly attached ENI devices.
The following examples configure all Linux network devices named ``eth*``
except ``eth0`` as unmanaged.

.. tabs::

   .. group-tab:: Network Manager

      .. code-block:: shell-session

          # cat <<EOF >/etc/NetworkManager/conf.d/99-unmanaged-devices.conf
          [keyfile]
          unmanaged-devices=interface-name:eth*,except:interface-name:eth0
          EOF
          # systemctl reload NetworkManager

   .. group-tab:: systemd-networkd

      .. code-block:: shell-session

          # cat <<EOF >/etc/systemd/network/99-unmanaged-devices.network
          [Match]
          Name=eth[1-9]*

          [Link]
          Unmanaged=yes
          EOF
          # systemctl restart systemd-networkd