.. only:: not (epub or latex or html)

    WARNING: You are looking at unreleased Cilium documentation.
    Please use the official rendered version released here:
    https://docs.cilium.io

.. _ipam_eni:

#######
AWS ENI
#######

The AWS ENI allocator is specific to Cilium deployments running in the AWS
cloud and performs IP allocation based on IPs of `AWS Elastic Network Interfaces (ENI)
<https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html>`__ by
communicating with the AWS EC2 API.

The architecture ensures that only a single operator communicates with the EC2
service API to avoid rate-limiting issues in large clusters. A pre-allocation
watermark keeps a number of IP addresses available on each node at all times,
so that scheduling a new pod does not require contacting the EC2 API.

Note that Cilium currently does not support IPv6-only ENIs. Cilium support for
IPv6 ENIs is being tracked in :gh-issue:`18405`, and the related feature of
assigning IPv6 prefixes in :gh-issue:`19251`.

************
Architecture
************

.. image:: eni_arch.png
    :align: center

The AWS ENI allocator builds on top of the CRD-backed allocator. Each node
creates a ``ciliumnodes.cilium.io`` custom resource matching the node name when
Cilium starts up for the first time on that node. It contacts the EC2 metadata
API to retrieve the instance ID, instance type, and VPC information, then
populates the custom resource with this information. ENI allocation parameters
are provided as agent configuration options and are passed into the custom
resource as well.

The Cilium operator listens for new ``ciliumnodes.cilium.io`` custom resources
and starts managing the IPAM aspect automatically. It scans the EC2 instances
for existing ENIs with associated IPs and makes them available via the
``spec.ipam.available`` field. It then continuously monitors the used IP
addresses in the ``status.ipam.used`` field and automatically creates ENIs and
allocates more IPs as needed to meet the IP pre-allocation watermark. This
ensures that there are always IPs available.

The selection of subnets to use for allocation as well as attachment of
security groups to new ENIs can be controlled separately for each node. This
makes it possible to hand out pod IPs with differing security groups on
individual nodes.

The corresponding datapath is described in section :ref:`aws_eni_datapath`.

*************
Configuration
*************

* The Cilium agent and operator must be run with the option ``--ipam=eni`` or
  the option ``ipam: eni`` must be set in the ConfigMap. This will enable ENI
  allocation in both the node agent and operator.

* In most scenarios, it makes sense to automatically create the
  ``ciliumnodes.cilium.io`` custom resource when the agent starts up on a node
  for the first time. To enable this, specify the option
  ``--auto-create-cilium-node-resource`` or set
  ``auto-create-cilium-node-resource: "true"`` in the ConfigMap.

* If IPs are limited, run the operator with the option
  ``--aws-release-excess-ips=true``. When enabled, the operator regularly
  checks the number of IPs and attempts to release excess free IPs from ENIs.

* It is generally a good idea to enable metrics in the operator as well with
  the option ``--enable-metrics``. See the section :ref:`install_metrics` for
  additional information on how to install and run Prometheus including the
  Grafana dashboard.

* By default, ENIs will be tagged with the cluster name, to allow the Cilium
  operator to garbage collect these ENIs if left dangling. The cluster name is
  either extracted from Cilium's own ``cluster-name`` flag or from the
  ``aws:eks:cluster-name`` tag on the operator's EC2 instance. If neither
  cluster name is available, a static default cluster name is assumed and
  ENI garbage collection will be performed across all such unnamed clusters.
  You may override this behavior by setting a cluster-specific ``--eni-gc-tags``
  tag set.

Custom ENI Configuration
========================

Custom ENI configuration can be defined with a custom CNI configuration
``ConfigMap``.

Create a CNI configuration
--------------------------

Create a ``cni-config.yaml`` file based on the template below. Fill in the
``subnet-tags`` field, assuming that the subnets in AWS have the tags applied
to them:

.. code-block:: yaml

   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: cni-configuration
     namespace: kube-system
   data:
     cni-config: |-
       {
         "cniVersion":"0.3.1",
         "name":"cilium",
         "plugins": [
           {
             "cniVersion":"0.3.1",
             "type":"cilium-cni",
             "eni": {
               "subnet-tags":{
                 "foo":"true"
               }
             }
           }
         ]
       }

Additional parameters may be configured in the ``eni`` or ``ipam`` section of
the CNI configuration file. See the list of ENI allocation parameters below
for a reference of the supported options.

Deploy the ``ConfigMap``:

.. code-block:: shell-session

   kubectl apply -f cni-config.yaml

Configure Cilium with subnet-tags-filter
----------------------------------------

When deploying Cilium with the CNI configuration created above, pass the
following additional arguments to Helm:

.. code-block:: shell-session

   --set cni.customConf=true \
   --set cni.configMap=cni-configuration

ENI Allocation Parameters
=========================

The following parameters are available to control the ENI creation and IP
allocation:

``InstanceType``
  The AWS EC2 instance type.

  *This field is automatically populated when using ``--auto-create-cilium-node-resource``*

``spec.eni.vpc-id``
  The VPC identifier used to create ENIs and select AWS subnets for IP
  allocation.

  *This field is automatically populated when using ``--auto-create-cilium-node-resource``*

``spec.eni.availability-zone``
  The availability zone used to create ENIs and select AWS subnets for IP
  allocation.

  *This field is automatically populated when using ``--auto-create-cilium-node-resource``*

``spec.eni.node-subnet-id``
  The subnet ID of the first ENI of a node. Used as a fallback for subnet
  selection in the case where no subnet IDs or tags are configured.

  *This field is automatically populated when using ``--auto-create-cilium-node-resource``*

``spec.ipam.min-allocate``
  The minimum number of IPs that must be allocated when the node is first
  bootstrapped. It defines the minimum base set of addresses that must be
  available. After reaching this watermark, the PreAllocate and
  MaxAboveWatermark logic takes over to continue allocating IPs.

  If unspecified, no minimum number of IPs is required.

``spec.ipam.max-allocate``
  The maximum number of IPs that can be allocated to the node.
  As the number of allocated IPs approaches this value, the effective value of
  PreAllocate decreases down to 0 so that no more addresses are allocated than
  defined.

  If unspecified, no maximum number of IPs will be enforced.

``spec.ipam.pre-allocate``
  The number of IP addresses that must be available for allocation at all
  times. It defines the buffer of addresses available immediately without
  requiring the operator to get involved.

  If unspecified, this value defaults to 8.

``spec.ipam.max-above-watermark``
  The maximum number of addresses to allocate beyond the addresses needed to
  reach the PreAllocate watermark. Going above the watermark can help reduce
  the number of API calls to allocate IPs, e.g. when a new ENI is allocated, as
  many secondary IPs as possible are allocated. Limiting the amount can help
  reduce waste of IPs.

  If unspecified, this value defaults to 0.

``spec.eni.first-interface-index``
  The index of the first ENI to use for IP allocation, e.g. if the node has
  ``eth0``, ``eth1``, ``eth2`` and FirstInterfaceIndex is set to 1, then only
  ``eth1`` and ``eth2`` will be used for IP allocation, ``eth0`` will be
  ignored for pod IP allocation.

  If unspecified, this value defaults to 0 which means that ``eth0`` will
  be used for pod IPs.

``spec.eni.security-group-tags``
  The list of tags used to filter the security groups to attach to any ENI
  that is created and attached to the instance.

  If unspecified, the security group IDs passed in the
  ``spec.eni.security-groups`` field will be used.

``spec.eni.security-groups``
  The list of security group IDs to attach to any ENI that is created
  and attached to the instance.

  If unspecified, the security group IDs of ``eth0`` will be used.

``spec.eni.subnet-ids``
  The subnet IDs used to select the AWS subnets for IP allocation. This is an
  additional requirement on top of requiring to match the availability zone and
  VPC of the instance. This parameter is mutually exclusive with, and takes
  priority over, ``spec.eni.subnet-tags``.

  If unspecified, the operator picks the subnet in the AZ with the most IP
  addresses available.

``spec.eni.subnet-tags``
  The tags used to select the AWS subnets for IP allocation. This is an
  additional requirement on top of requiring to match the availability zone and
  VPC of the instance.

  If unspecified, no tags are required.

``spec.eni.exclude-interface-tags``
  The tags used to exclude interfaces from IP allocation. Any ENI attached to
  a node which matches this set of tags will be ignored by Cilium and may be
  used for other purposes. This parameter can be used in combination with
  ``subnet-tags`` or ``first-interface-index`` to exclude additional interfaces.

  If unspecified, no tags are used to exclude interfaces.

``spec.eni.delete-on-termination``
  Remove the ENI when the instance is terminated.

  If unspecified, this option is enabled.

*******************
Operational Details
*******************

Cache of ENIs, Subnets, and VPCs
================================

The operator maintains a list of all EC2 ENIs, VPCs and subnets associated with
the AWS account in a cache. For this purpose, the operator performs the
following three EC2 API operations:

 * ``DescribeNetworkInterfaces``
 * ``DescribeSubnets``
 * ``DescribeVpcs``

The cache is updated once per minute or after an IP allocation or ENI creation
has been performed. When triggered based on an allocation or creation, the
operation is performed at most once per second.

Publication of available ENI IPs
================================

Following the update of the cache, all CiliumNode custom resources representing
nodes are updated to publish any new IPs that have become available.

In this process, all ENIs with an interface index greater than or equal to
``spec.eni.first-interface-index`` are scanned for all available IPs. All IPs
found are added to ``spec.ipam.available``. Each ENI meeting these criteria is
also added to ``status.eni.enis``.

If this update caused the custom resource to change, the custom resource is
updated using the Kubernetes API methods ``Update()`` and/or ``UpdateStatus()``
if available.

Determination of ENI IP deficits or excess
==========================================

The operator constantly monitors all nodes and detects deficits in available
ENI IP addresses. The check to recognize a deficit is performed on two
occasions:

 * When a ``CiliumNode`` custom resource is updated
 * When all nodes are scanned at a regular interval (once per minute)

If ``--aws-release-excess-ips`` is enabled, the check to recognize IP excess
is performed during the interval-based scan.

When determining whether a node has a deficit in IP addresses, the following
calculation is performed:

.. code-block:: go

    availableIPs := len(spec.ipam.pool)
    neededIPs = max(spec.ipam.pre-allocate - (availableIPs - len(status.ipam.used)), spec.ipam.min-allocate - availableIPs)
    if spec.ipam.max-allocate > 0 {
        neededIPs = min(max(spec.ipam.max-allocate - availableIPs, 0), neededIPs)
    }
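
As a worked illustration of the deficit calculation above, the following Go
sketch restates it as a standalone function. The function name, the helper
names and the parameter names are hypothetical and only mirror the CiliumNode
fields. The final clamp to zero is an assumption of this sketch and is not part
of the calculation shown above.

.. code-block:: go

    package main

    import "fmt"

    func minInt(a, b int) int {
        if a < b {
            return a
        }
        return b
    }

    func maxInt(a, b int) int {
        if a > b {
            return a
        }
        return b
    }

    // neededIPs mirrors the deficit calculation for a single node.
    // available is the size of the pool, used the number of IPs in use.
    func neededIPs(available, used, preAllocate, minAllocate, maxAllocate int) int {
        // Either top up the free buffer to preAllocate or reach minAllocate,
        // whichever requires more addresses.
        needed := maxInt(preAllocate-(available-used), minAllocate-available)
        if maxAllocate > 0 {
            // Do not plan beyond the configured maximum for the node.
            needed = minInt(maxInt(maxAllocate-available, 0), needed)
        }
        // Assumption of this sketch: a negative result means "no deficit".
        return maxInt(needed, 0)
    }

    func main() {
        // 10 IPs in the pool, 7 in use, pre-allocate 8: 3 are free, 5 are missing.
        fmt.Println(neededIPs(10, 7, 8, 0, 0)) // 5
        // Same node with max-allocate 12: only 2 more IPs fit under the cap.
        fmt.Println(neededIPs(10, 7, 8, 0, 12)) // 2
        // Fresh node with min-allocate 10: the minimum dominates pre-allocate.
        fmt.Println(neededIPs(0, 0, 8, 10, 0)) // 10
    }

The second call shows how ``max-allocate`` shrinks the effective pre-allocation
as the pool approaches the configured maximum.
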

The excess IP calculation is performed as follows:

.. code-block:: go

    availableIPs := len(spec.ipam.pool)
    upperBound := spec.ipam.min-allocate + spec.ipam.max-above-watermark
    switch {
    case availableIPs <= upperBound:
        excessIPs = 0
    case len(status.ipam.used) <= upperBound && len(status.ipam.used) + spec.ipam.pre-allocate <= upperBound:
        excessIPs = availableIPs - upperBound
    default:
        excessIPs = max(availableIPs - len(status.ipam.used) - upperBound, 0)
    }
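
The excess calculation can be exercised the same way. The sketch below is a
direct transcription of the three cases shown above into a standalone Go
function. The function and parameter names are hypothetical, and the sketch
makes no claim about how the operator structures this logic internally.

.. code-block:: go

    package main

    import "fmt"

    func maxInt(a, b int) int {
        if a > b {
            return a
        }
        return b
    }

    // excessIPs transcribes the three cases of the excess calculation.
    func excessIPs(available, used, preAllocate, minAllocate, maxAboveWatermark int) int {
        upperBound := minAllocate + maxAboveWatermark
        switch {
        case available <= upperBound:
            // The pool has not grown beyond the bound: nothing to release.
            return 0
        case used <= upperBound && used+preAllocate <= upperBound:
            // Usage plus the free buffer still fits below the bound.
            return available - upperBound
        default:
            return maxInt(available-used-upperBound, 0)
        }
    }

    func main() {
        // min-allocate 10 and max-above-watermark 2 give an upper bound of 12.
        fmt.Println(excessIPs(12, 5, 8, 10, 2)) // 0
        fmt.Println(excessIPs(20, 2, 8, 10, 2)) // 8
        fmt.Println(excessIPs(20, 5, 8, 10, 2)) // 3
    }
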

Upon detection of a deficit, the node is added to the list of nodes which
require IP address allocation. When a deficit is detected using the
interval-based scan, the allocation order of nodes is determined based on the
severity of the deficit, i.e. the node with the biggest deficit will be at the
front of the allocation queue. Nodes that need to release IPs are placed behind
nodes that need allocation.

The allocation queue is handled on demand but at most once per second.
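
The queue ordering described above can be sketched as a sort. The
``nodeState`` type and the ``orderQueue`` function are hypothetical names
introduced for illustration. Ordering release-only nodes among themselves by
the size of their excess is an assumption of this sketch, since the text above
only states that they come after nodes with a deficit.

.. code-block:: go

    package main

    import (
        "fmt"
        "sort"
    )

    // nodeState is a hypothetical summary of one node's IP accounting.
    type nodeState struct {
        Name   string
        Needed int // IP deficit, 0 if none
        Excess int // releasable IPs, 0 if none
    }

    // orderQueue puts the largest deficit first. Nodes without a deficit,
    // which can only need a release, end up behind every node with a deficit.
    func orderQueue(nodes []nodeState) {
        sort.SliceStable(nodes, func(i, j int) bool {
            if nodes[i].Needed != nodes[j].Needed {
                return nodes[i].Needed > nodes[j].Needed
            }
            // Assumption: larger excess first among release-only nodes.
            return nodes[i].Excess > nodes[j].Excess
        })
    }

    func main() {
        nodes := []nodeState{
            {Name: "a", Needed: 2},
            {Name: "b", Excess: 5},
            {Name: "c", Needed: 7},
            {Name: "d", Excess: 9},
        }
        orderQueue(nodes)
        for _, n := range nodes {
            fmt.Println(n.Name, n.Needed, n.Excess)
        }
        // Resulting order: c, a, d, b
    }
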

IP Allocation
=============

When performing IP allocation for a node with an address deficit, the operator
first looks at the ENIs which are already attached to the instance represented
by the CiliumNode resource. All ENIs with an interface index greater than or
equal to ``spec.eni.first-interface-index`` are considered for use.

.. note::

   In order to not use ``eth0`` for IP allocation, set
   ``spec.eni.first-interface-index`` to ``1`` to skip the first interface in
   line.

The operator will then pick the first already allocated ENI which meets the
following criteria:

 * The ENI has addresses associated which are not yet used or the number of
   addresses associated with the ENI is less than the instance-type-specific
   limit.

 * The subnet associated with the ENI has IPs available for allocation.

The following formula is used to determine how many IPs are allocated on the ENI:

.. code-block:: go

    // surgeAllocate kicks in if numPendingPods is greater than NeededAddresses
    min(AvailableOnSubnet, min(AvailableOnENI, NeededAddresses + spec.ipam.max-above-watermark + surgeAllocate))

.. note::

   In scenarios where the pre-allocated number is lower than the number of
   pending pods on the node, the operator will proactively allocate more than
   the pre-allocated number of IPs to avoid having to wait for the next
   allocation cycle.

Because the result is capped by the addresses available on the subnet and on
the ENI, the number of IPs allocated in a single allocation cycle can be
less than what is required to fulfill ``spec.ipam.pre-allocate``.

In order to allocate the IPs, the method ``AssignPrivateIpAddresses`` of the
EC2 service API is called. When no more ENIs are available meeting the above
criteria, a new ENI is created.
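
The per-ENI allocation formula above can be sketched in Go as follows. The
function and parameter names are hypothetical. How ``surgeAllocate`` is derived
is an assumption of this sketch: it is taken to be the number of pending pods
beyond the needed addresses, which matches the comment in the formula but is
not spelled out there.

.. code-block:: go

    package main

    import "fmt"

    func minInt(a, b int) int {
        if a < b {
            return a
        }
        return b
    }

    // ipsToAllocate returns how many IPs to request on one ENI in one cycle.
    func ipsToAllocate(availableOnSubnet, availableOnENI, needed, maxAboveWatermark, pendingPods int) int {
        surge := 0
        if pendingPods > needed {
            // Assumption: the surge covers the pods not yet accounted for.
            surge = pendingPods - needed
        }
        want := needed + maxAboveWatermark + surge
        return minInt(availableOnSubnet, minInt(availableOnENI, want))
    }

    func main() {
        // The ENI only has room for 6 more addresses, so fewer than the 8
        // needed addresses are allocated in this cycle.
        fmt.Println(ipsToAllocate(100, 6, 8, 2, 0)) // 6
        // 5 pending pods but only 3 needed addresses: surge adds 2.
        fmt.Println(ipsToAllocate(100, 14, 3, 0, 5)) // 5
    }

The first call illustrates why a single cycle can fall short of the
pre-allocation target: the remaining deficit is picked up in a later cycle,
possibly on another ENI.
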

IP Release
==========

When performing IP release for a node with IP excess, the operator scans
ENIs attached to the node with an interface index greater than or equal to
``spec.eni.first-interface-index`` and selects the ENI with the most free IPs
available for release. The following formula is used to determine how many IPs
are available for release on the ENI:

.. code-block:: go

    min(FreeOnENI, (FreeIPs - spec.ipam.pre-allocate - spec.ipam.max-above-watermark))

The operator releases IPs from the selected ENI. If excess free IPs remain
afterwards, the operator attempts to release them in the next release cycle.

In order to release the IPs, the method ``UnassignPrivateIpAddresses`` of the
EC2 service API is called. There is no limit on ENIs per subnet, so ENIs
remain attached to the node.
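
A minimal Go sketch of the release step described in this section follows. The
``eni`` type and both function names are hypothetical. The sketch assumes that
the interface index bound is inclusive and that a negative result of the
formula means nothing is released.

.. code-block:: go

    package main

    import "fmt"

    // eni is a hypothetical view of one attached interface.
    type eni struct {
        ID    string
        Index int
        Free  int // addresses on the ENI that are not in use
    }

    // pickForRelease returns the eligible ENI with the most free IPs.
    func pickForRelease(enis []eni, firstIndex int) (eni, bool) {
        var best eni
        found := false
        for _, e := range enis {
            if e.Index < firstIndex {
                continue // not managed for pod IPs
            }
            if !found || e.Free > best.Free {
                best, found = e, true
            }
        }
        return best, found
    }

    // releasable applies the release formula for the selected ENI.
    func releasable(freeOnENI, freeIPs, preAllocate, maxAboveWatermark int) int {
        n := freeIPs - preAllocate - maxAboveWatermark
        if freeOnENI < n {
            n = freeOnENI
        }
        if n < 0 {
            n = 0 // assumption: never report a negative count
        }
        return n
    }

    func main() {
        enis := []eni{
            {ID: "eni-0", Index: 0, Free: 9},
            {ID: "eni-1", Index: 1, Free: 4},
            {ID: "eni-2", Index: 2, Free: 8},
        }
        target, _ := pickForRelease(enis, 1)
        fmt.Println(target.ID) // eni-2
        // 12 free IPs on the node, pre-allocate 8, max-above-watermark 1.
        fmt.Println(releasable(target.Free, 12, 8, 1)) // 3
    }
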

ENI Creation
============

As long as an instance type is capable of allocating additional ENIs, ENIs are
allocated automatically based on demand.

When allocating an ENI, the first operation performed is to identify the best
subnet. This is done by searching through all subnets and finding a subnet that
matches the following criteria:

 * The VPC ID of the subnet matches ``spec.eni.vpc-id``
 * The Availability Zone of the subnet matches
   ``spec.eni.availability-zone``

If set, ``spec.eni.subnet-ids`` or ``spec.eni.subnet-tags`` are used to further
narrow down the set of candidate subnets. Any subnet with an ID in
``subnet-ids`` is a candidate, whereas a subnet must match all ``subnet-tags``
to be a candidate. Note that when ``subnet-ids`` is set, ``subnet-tags`` are
ignored. If multiple subnets match, the subnet with the most available addresses
is selected.

If neither ``subnet-ids`` nor ``subnet-tags`` are set, the operator consults
``spec.eni.node-subnet-id``, attempting to create the ENI in the same subnet as
the primary ENI of the instance. If this is not possible (e.g. if there are not
enough IPs in said subnet), the operator falls back to creating the ENI in the
largest subnet matching VPC and Availability Zone.
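
The subnet selection rules above can be sketched in Go as follows. The
``subnet`` type and all function names are hypothetical. To stay short, the
sketch does not model the ``spec.eni.node-subnet-id`` preference that applies
when neither ``subnet-ids`` nor ``subnet-tags`` are set. It only covers the VPC
and Availability Zone match, the ID or tag filter, and the choice of the subnet
with the most available addresses.

.. code-block:: go

    package main

    import "fmt"

    // subnet is a hypothetical view of one AWS subnet.
    type subnet struct {
        ID        string
        VPC       string
        AZ        string
        Tags      map[string]string
        Available int // free addresses in the subnet
    }

    func containsID(ids []string, id string) bool {
        for _, v := range ids {
            if v == id {
                return true
            }
        }
        return false
    }

    // hasAllTags reports whether every wanted tag is present with its value.
    func hasAllTags(have, want map[string]string) bool {
        for k, v := range want {
            if have[k] != v {
                return false
            }
        }
        return true
    }

    // selectSubnet returns the candidate with the most available addresses.
    func selectSubnet(subnets []subnet, vpc, az string, ids []string, tags map[string]string) (subnet, bool) {
        var best subnet
        found := false
        for _, s := range subnets {
            if s.VPC != vpc || s.AZ != az {
                continue
            }
            if len(ids) > 0 {
                // subnet-ids takes priority, subnet-tags are ignored.
                if !containsID(ids, s.ID) {
                    continue
                }
            } else if !hasAllTags(s.Tags, tags) {
                continue
            }
            if !found || s.Available > best.Available {
                best, found = s, true
            }
        }
        return best, found
    }

    func main() {
        foo := map[string]string{"foo": "true"}
        subnets := []subnet{
            {ID: "subnet-a", VPC: "vpc-1", AZ: "us-west-2a", Tags: foo, Available: 50},
            {ID: "subnet-b", VPC: "vpc-1", AZ: "us-west-2a", Tags: foo, Available: 200},
            {ID: "subnet-c", VPC: "vpc-1", AZ: "us-west-2b", Tags: foo, Available: 500},
            {ID: "subnet-d", VPC: "vpc-1", AZ: "us-west-2a", Available: 900},
        }
        byTags, _ := selectSubnet(subnets, "vpc-1", "us-west-2a", nil, foo)
        fmt.Println(byTags.ID) // subnet-b
        byIDs, _ := selectSubnet(subnets, "vpc-1", "us-west-2a", []string{"subnet-a", "subnet-d"}, foo)
        fmt.Println(byIDs.ID) // subnet-d
    }

In the second call the untagged ``subnet-d`` wins even though tags were passed,
because the ID list takes priority and the tags are ignored.
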

After selecting the subnet, the interface index is determined. For this purpose,
all existing ENIs are scanned and the first unused index greater than or equal
to ``spec.eni.first-interface-index`` is selected.
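
The index search can be sketched as a small Go function. The function name is
hypothetical. The sketch assumes that the bound is inclusive, which matches the
``eth1``/``eth2`` example given for ``spec.eni.first-interface-index`` in the
parameter reference.

.. code-block:: go

    package main

    import "fmt"

    // nextInterfaceIndex returns the lowest index at or above firstIndex
    // that is not used by an existing ENI.
    func nextInterfaceIndex(used []int, firstIndex int) int {
        inUse := make(map[int]bool, len(used))
        for _, i := range used {
            inUse[i] = true
        }
        idx := firstIndex
        for inUse[idx] {
            idx++
        }
        return idx
    }

    func main() {
        // eth0, eth1 and eth3 exist, allocation starts at index 1.
        fmt.Println(nextInterfaceIndex([]int{0, 1, 3}, 1)) // 2
        // Only eth0 exists and it is excluded from pod IP allocation.
        fmt.Println(nextInterfaceIndex([]int{0}, 1)) // 1
    }
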

After determining the subnet and interface index, the ENI is created and
attached to the EC2 instance using the methods ``CreateNetworkInterface`` and
``AttachNetworkInterface`` of the EC2 API.

The security group IDs attached to the ENI are computed in the following order:

 1. The field ``spec.eni.security-groups`` is consulted first. If this is set
    then these will be the security group IDs attached to the newly created ENI.
 2. The field ``spec.eni.security-group-tags`` is consulted. If this is set then
    the operator will list all security groups in the account and will attach to
    the ENI the ones that match the list of tags passed.
 3. Finally if none of the above fields are set then the newly created ENI will
    inherit the security group IDs of ``eth0`` of the machine.
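
The precedence of the three sources can be sketched in Go as follows. The
``securityGroup`` type and the function names are hypothetical. Two points are
assumptions of this sketch: a security group is taken to match only if it
carries all of the configured tags, and an empty tag match returns no groups
instead of falling through to ``eth0``.

.. code-block:: go

    package main

    import "fmt"

    // securityGroup is a hypothetical view of one AWS security group.
    type securityGroup struct {
        ID   string
        Tags map[string]string
    }

    // hasAllTags reports whether every wanted tag is present with its value.
    func hasAllTags(have, want map[string]string) bool {
        for k, v := range want {
            if have[k] != v {
                return false
            }
        }
        return true
    }

    // resolveSecurityGroups applies the documented order: explicit IDs first,
    // then groups selected by tags, then the groups of eth0.
    func resolveSecurityGroups(explicit []string, tags map[string]string, all []securityGroup, eth0 []string) []string {
        if len(explicit) > 0 {
            return explicit
        }
        if len(tags) > 0 {
            var ids []string
            for _, sg := range all {
                if hasAllTags(sg.Tags, tags) {
                    ids = append(ids, sg.ID)
                }
            }
            return ids
        }
        return eth0
    }

    func main() {
        all := []securityGroup{
            {ID: "sg-1", Tags: map[string]string{"team": "net"}},
            {ID: "sg-2", Tags: map[string]string{"team": "db"}},
        }
        eth0 := []string{"sg-9"}
        fmt.Println(resolveSecurityGroups([]string{"sg-5"}, map[string]string{"team": "net"}, all, eth0)) // [sg-5]
        fmt.Println(resolveSecurityGroups(nil, map[string]string{"team": "net"}, all, eth0))            // [sg-1]
        fmt.Println(resolveSecurityGroups(nil, nil, all, eth0))                                         // [sg-9]
    }
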

The description of the created ENI has the following format:

.. code-block:: go

    "Cilium-CNI (<EC2 instance ID>)"

If the ENI tagging feature is enabled then the ENI will be tagged with the provided information.

ENI Deletion Policy
===================

ENIs can be marked for deletion when the EC2 instance to which the ENI is
attached is terminated. To enable this, set the option
``spec.eni.delete-on-termination``. If enabled, the ENI is modified after
creation using ``ModifyNetworkInterfaceAttribute`` to specify this deletion
policy.

Node Termination
================

When a node or instance terminates, the Kubernetes apiserver will send a node
deletion event. This event will be picked up by the operator and the operator
will delete the corresponding ``ciliumnodes.cilium.io`` custom resource.

.. _ec2privileges:

*******************
Required Privileges
*******************

The following EC2 privileges are required by the Cilium operator in order to
perform ENI creation and IP allocation:

 * ``DeleteNetworkInterface``
 * ``DescribeNetworkInterfaces``
 * ``DescribeSubnets``
 * ``DescribeVpcs``
 * ``DescribeSecurityGroups``
 * ``CreateNetworkInterface``
 * ``AttachNetworkInterface``
 * ``ModifyNetworkInterfaceAttribute``
 * ``AssignPrivateIpAddresses``
 * ``CreateTags``

If ENI GC is enabled (which is the default), and ``--cluster-name`` and ``--eni-gc-tags`` are not set to custom values:

 * ``DescribeTags``

If releasing excess IPs is enabled:

 * ``UnassignPrivateIpAddresses``

If ``--instance-tags-filter`` is used:

 * ``DescribeInstances``

*****************************
EC2 instance types ENI limits
*****************************

Currently the EC2 Instance ENI limits (adapters per instance + IPv4/IPv6 IPs per adapter) are
hardcoded in the Cilium codebase for easy out-of-the-box deployment and usage.

The limits can be modified via the ``--aws-instance-limit-mapping`` CLI flag on
the cilium-operator. This allows the user to supply a custom limit.

Additionally the limits can be updated via the EC2 API by passing the
``--update-ec2-adapter-limit-via-api`` CLI flag.
This will require an additional EC2 IAM permission:

 * ``DescribeInstanceTypes``

*******
Metrics
*******

The IPAM metrics are documented in the section :ref:`ipam_metrics`.

******************
Node Configuration
******************

The IP address and routes on ENIs attached to the instance will be
managed by the Cilium agent. Therefore, any system service trying to manage
newly attached network interfaces will interfere with Cilium's configuration.
Common scenarios are ``NetworkManager`` or ``systemd-networkd`` automatically
performing DHCP on these interfaces or removing Cilium's IP address when the
carrier is temporarily lost. Be sure to disable these services or configure
your Linux distribution to not manage the newly attached ENI devices.
The following examples configure all Linux network devices named ``eth*``
except ``eth0`` as unmanaged.

.. tabs::

   .. group-tab:: Network Manager

        .. code-block:: shell-session

            # cat <<EOF >/etc/NetworkManager/conf.d/99-unmanaged-devices.conf
            [keyfile]
            unmanaged-devices=interface-name:eth*,except:interface-name:eth0
            EOF
            # systemctl reload NetworkManager

   .. group-tab:: systemd-networkd

        .. code-block:: shell-session

            # cat <<EOF >/etc/systemd/network/99-unmanaged-devices.network
            [Match]
            Name=eth[1-9]*

            [Link]
            Unmanaged=yes
            EOF
            # systemctl restart systemd-networkd