github.com/k8snetworkplumbingwg/sriov-network-operator@v1.2.1-0.20240408194816-2d2e5a45d453/README.md (about)

# sriov-network-operator

The SR-IOV Network Operator is designed to help the user provision and configure the SR-IOV CNI plugin and device plugin in an OpenShift cluster.

## Motivation

SR-IOV networking is an optional feature of an OpenShift cluster. To make it work, several components must be provisioned and configured. It makes sense to have one operator coordinate those components in one place, instead of having them managed by different operators. To hide the complexity, the operator should also provide a simple user interface for enabling SR-IOV.

## Features

- Initialize the supported SR-IOV NIC types on selected nodes.
- Provision/upgrade the SR-IOV device plugin executable on selected nodes.
- Provision/upgrade the SR-IOV CNI plugin executable on selected nodes.
- Manage the configuration of the SR-IOV device plugin on the host.
- Generate net-att-def CRs for the SR-IOV CNI plugin.
- Support operation in a virtualized Kubernetes deployment:
  - Discovers VFs attached to the Virtual Machine (VM).
  - Does not require the associated PFs to be attached.
  - VFs can be associated to SriovNetworks by selecting the appropriate PciAddress as the RootDevice in the SriovNetworkNodePolicy.

## Quick Start

For more details on installing this operator, refer to the [quick-start](doc/quickstart.md) guide.

## API

The SR-IOV network operator introduces the following new CRDs:

- SriovNetwork
- SriovNetworkNodeState
- SriovNetworkNodePolicy

### SriovNetwork

A SriovNetwork custom resource represents a layer-2 broadcast domain to which some SR-IOV devices are attached. It is primarily used to generate a NetworkAttachmentDefinition CR with an SR-IOV CNI plugin configuration.

The SriovNetwork CR also contains a `resourceName`, which is aligned with the `resourceName` of the SR-IOV device plugin. One SriovNetwork object maps to one `resourceName`, but one `resourceName` can be shared by different SriovNetwork CRs.

This CR should be managed by the cluster admin. Here is an example:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: example-network
  namespace: example-namespace
spec:
  ipam: |
    {
      "type": "host-local",
      "subnet": "10.56.217.0/24",
      "rangeStart": "10.56.217.171",
      "rangeEnd": "10.56.217.181",
      "routes": [{
        "dst": "0.0.0.0/0"
      }],
      "gateway": "10.56.217.1"
    }
  vlan: 0
  resourceName: intelnics
```

#### Chaining CNI metaplugins

Additional capabilities can be added to a device configured via SR-IOV by configuring optional metaplugins.

To do this, the `metaPlugins` field must contain an array of one or more additional configurations used to build a [network configuration list](https://github.com/containernetworking/cni/blob/master/SPEC.md#network-configuration-lists), as in the following example:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: example-network
  namespace: example-namespace
spec:
  ipam: |
    {
      "type": "host-local",
      "subnet": "10.56.217.0/24",
      "rangeStart": "10.56.217.171",
      "rangeEnd": "10.56.217.181",
      "routes": [{
        "dst": "0.0.0.0/0"
      }],
      "gateway": "10.56.217.1"
    }
  vlan: 0
  resourceName: intelnics
  metaPlugins: |
    {
      "type": "tuning",
      "sysctl": {
        "net.core.somaxconn": "500"
      }
    },
    {
      "type": "vrf",
      "vrfname": "red"
    }
```

### SriovNetworkNodeState

This custom resource represents the SR-IOV interface states of each host and should only be managed by the operator itself.

- The `spec` of this CR represents the desired configuration that should be applied to the interfaces and the SR-IOV device plugin.
- The `status` contains the current states of the PFs (bare metal only) and of the VFs. It helps the user discover SR-IOV network hardware on a node, or attached VFs in the case of a virtual deployment.

The spec is rendered by sriov-policy-controller and consumed by sriov-config-daemon. The sriov-config-daemon is responsible for updating the `status` field to reflect the latest state; this information can be used as input when creating a SriovNetworkNodePolicy CR.

An example of a SriovNetworkNodeState CR:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodeState
metadata:
  name: worker-node-1
  namespace: sriov-network-operator
spec:
  interfaces:
  - deviceType: vfio-pci
    mtu: 1500
    numVfs: 4
    pciAddress: 0000:86:00.0
status:
  interfaces:
  - deviceID: "1583"
    driver: i40e
    mtu: 1500
    numVfs: 4
    pciAddress: 0000:86:00.0
    maxVfs: 64
    vendor: "8086"
    Vfs:
      - deviceID: 154c
        driver: vfio-pci
        pciAddress: 0000:86:02.0
        vendor: "8086"
      - deviceID: 154c
        driver: vfio-pci
        pciAddress: 0000:86:02.1
        vendor: "8086"
      - deviceID: 154c
        driver: vfio-pci
        pciAddress: 0000:86:02.2
        vendor: "8086"
      - deviceID: 154c
        driver: vfio-pci
        pciAddress: 0000:86:02.3
        vendor: "8086"
  - deviceID: "1583"
    driver: i40e
    mtu: 1500
    pciAddress: 0000:86:00.1
    maxVfs: 64
    vendor: "8086"
```

In this example, the `status` field shows that there are 2 SR-IOV capable NICs on node `worker-node-1`, and the `spec` field shows the expected configuration generated from the combination of SriovNetworkNodePolicy CRs. In a virtual deployment, a single VF is associated with each device.

### SriovNetworkNodePolicy

This CRD is the core of the SR-IOV network operator. This custom resource should be managed by the cluster admin to instruct the operator to:

1. Render the spec of the SriovNetworkNodeState CR for the selected nodes, to configure the SR-IOV interfaces. In a virtual deployment, the VF interface is read-only.
2. Deploy the SR-IOV CNI plugin and device plugin on the selected nodes.
3. Generate the configuration of the SR-IOV device plugin.

An example of a SriovNetworkNodePolicy CR:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-1
  namespace: sriov-network-operator
spec:
  deviceType: vfio-pci
  mtu: 1500
  nicSelector:
    deviceID: "1583"
    rootDevices:
    - 0000:86:00.0
    vendor: "8086"
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 4
  priority: 90
  resourceName: intelnics
```

In this example, the user selected the NIC from vendor '8086' (Intel) with device ID '1583' (XL710 for 40GbE), on nodes where the label 'network-sriov.capable' equals 'true'. For those PFs, the operator creates 4 VFs each, sets the MTU to 1500, and binds the vfio-pci driver to those virtual functions.

In a virtual deployment:
- The MTU of the PF is set by the underlying virtualization platform and cannot be changed by the sriov-network-operator.
- The numVfs parameter has no effect, as there is always 1 VF.
- The deviceType field depends on whether the underlying device/driver is [native-bifurcating or non-bifurcating](https://doc.dpdk.org/guides/howto/flow_bifurcation.html). For example, the supported Mellanox devices support native-bifurcating drivers, so deviceType should be netdevice (the default). The supported Intel devices are non-bifurcating, so deviceType should be set to vfio-pci.
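
As an illustration of the points above, a policy for a virtual deployment might look like the following sketch. The policy name, resource name, and PCI address are hypothetical; the address would be that of the VF as it appears inside the VM. The `deviceType` value follows the Intel (non-bifurcating) case described above.

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-virtual              # hypothetical name
  namespace: sriov-network-operator
spec:
  deviceType: vfio-pci              # non-bifurcating device; netdevice for native-bifurcating drivers
  nicSelector:
    rootDevices:
    - 0000:00:05.0                  # hypothetical PCI address of the VF inside the VM
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 1                         # has no effect in a virtual deployment
  priority: 90
  resourceName: virtualnics         # hypothetical resource name
```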

#### Multiple policies

When multiple SriovNetworkNodePolicy CRs are present, the `priority` field
(0 is the highest priority) is used to resolve any conflicts. Conflicts occur
only when the same PF is referenced by multiple policies. The final desired
configuration is saved in `SriovNetworkNodeState.spec.interfaces`.

Policies are processed in order of priority (lowest priority first), followed by
the `name` field (starting from `a`). Policies with the same **priority** or
**non-overlapping VF groups** (when #-notation is used in the pfName field) are
merged; otherwise only the highest priority policy is applied. In the case of
same-priority policies with overlapping VF groups, only the last processed
policy is applied.
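
The following sketch shows two policies that would be merged under the rules above: both reference the same PF but select non-overlapping VF groups using the #-notation. The interface name `ens803f0`, the policy names, and the resource names are hypothetical, and the `pfNames` selector spelling is an assumption based on the pfName field mentioned above.

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-dpdk                 # hypothetical name
  namespace: sriov-network-operator
spec:
  deviceType: vfio-pci
  nicSelector:
    pfNames:                        # assumed field spelling
    - "ens803f0#0-1"                # VFs 0 and 1 of the hypothetical PF ens803f0
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 4
  priority: 90
  resourceName: dpdknics            # hypothetical resource name
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-netdev               # hypothetical name
  namespace: sriov-network-operator
spec:
  deviceType: netdevice
  nicSelector:
    pfNames:                        # assumed field spelling
    - "ens803f0#2-3"                # VFs 2 and 3 of the same PF
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 4
  priority: 99
  resourceName: netdevnics          # hypothetical resource name
```

If both policies selected `ens803f0#0-3` instead, the VF groups would overlap and, since the priorities differ, only `policy-dpdk` (priority 90, the higher priority) would be applied.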

#### Externally managed virtual functions

When `ExternallyManage` is requested on a policy, the operator skips only the virtual function creation.
It still binds the virtual functions to the requested driver and exposes them via the device plugin.
Another difference is that when such a policy is removed, the operator
does not remove the virtual functions.

*Note:* This means the user must create the virtual functions before applying the policy, or the webhook will reject
the policy creation.

Tools such as nmstate, kubernetes-nmstate, or a simple systemd unit file can be used to create
the virtual functions on boot.

This feature was created to support deployments where the user wants to use some of the virtual functions for host
communication, such as a storage network or out-of-band management, so the virtual functions must exist on boot and not only
after the operator and config-daemon are running.
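
A minimal sketch of such a policy is shown below. It reuses the NIC selector from the earlier `policy-1` example; the policy name and resource name are hypothetical, and the `externallyManaged` field spelling is an assumption that should be checked against the CRD installed in the cluster.

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-externally-managed   # hypothetical name
  namespace: sriov-network-operator
spec:
  deviceType: netdevice
  externallyManaged: true           # assumed field spelling; the operator skips VF creation
  nicSelector:
    deviceID: "1583"
    rootDevices:
    - 0000:86:00.0
    vendor: "8086"
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 4                         # the VFs must already exist on the PF before this is applied
  priority: 90
  resourceName: hostnics            # hypothetical resource name
```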

#### Disabling SR-IOV Config Daemon plugins

It is possible to disable SR-IOV network operator config daemon plugins when their operation
is not needed or is undesirable.

For example, some plugins perform vendor-specific firmware configuration
to enable SR-IOV (e.g. the `mellanox` plugin). Certain deployment environments may prefer to perform such configuration
once during node provisioning, while ensuring the configuration is compatible with any SriovNetworkNodePolicy
defined for the particular environment. This reduces or completely eliminates the need to reboot nodes during SR-IOV
configuration by the operator.

This can be done by setting `spec.disablePlugins` in the SriovOperatorConfig `default` CR to the list of plugins
to disable.

**Example**:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: sriov-network-operator
spec:
  ...
  disablePlugins:
    - mellanox
  ...
```

> **NOTE**: Currently only the `mellanox` plugin can be disabled.

### Parallel draining

It is possible to drain more than one node at a time using this operator.

This is configured via a SriovNetworkPoolConfig, which selects a set of nodes using the node selector and sets, via
`maxUnavailable`, how many nodes from the pool the operator can drain in parallel. `maxUnavailable` can be a number or a percentage.

> **NOTE**: Every node can be part of only one pool. If a node is selected by more than one pool, it will not be drained.

> **NOTE**: If a node is not part of any pool, it gets a default configuration of `maxUnavailable` 1.

**Example**:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkPoolConfig
metadata:
  name: worker
  namespace: sriov-network-operator
spec:
  maxUnavailable: 2
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
```
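
Since `maxUnavailable` can also be a percentage, a pool could be sketched as follows. The pool name and the `example.com/sriov-pool` label are hypothetical; a separate label is used so that the selected nodes do not also fall into the `worker` pool above, because a node selected by more than one pool is not drained.

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkPoolConfig
metadata:
  name: batch                       # hypothetical pool name
  namespace: sriov-network-operator
spec:
  maxUnavailable: "25%"             # percentage form, quoted so it is read as a string
  nodeSelector:
    matchLabels:
      example.com/sriov-pool: batch # hypothetical label; must not overlap with another pool
```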

### Resource Injector Policy

By default, the resource injector webhook has a failure policy of `Ignore`. This was implemented so that pod creation is not blocked
when the webhook is unavailable.

With a feature introduced in Kubernetes 1.28 (Beta) called [MatchConditions](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#matching-requests-matchconditions),
the webhook failure policy can be changed to `Fail`. In this case the operator configures the mutating webhook for the resource
injector to apply only to pods with the secondary network annotation `k8s.v1.cni.cncf.io/networks`.
The feature can be enabled with a feature gate via the SriovOperatorConfig object.

> **NOTE**: The feature is disabled by default.

**Example**:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: sriov-network-operator
spec:
  ...
  featureGates:
    resourceInjectorMatchCondition: true
  ...
```
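
For illustration, a pod that carries the secondary network annotation, and would therefore be matched by the webhook under this feature gate, might look like the sketch below. The pod name and image are hypothetical, and the annotation value assumes that the generated NetworkAttachmentDefinition has the same name as the `example-network` SriovNetwork shown earlier.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sample-pod                  # hypothetical name
  namespace: example-namespace
  annotations:
    # Secondary network annotation that the match condition looks for
    k8s.v1.cni.cncf.io/networks: example-network
spec:
  containers:
  - name: app
    image: registry.example.com/app:latest   # hypothetical image
```

A pod without this annotation would not be sent to the resource injector webhook, so its creation would not be affected if the webhook were unavailable.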

## Components and design

This operator is split into 2 components:

- controller
- sriov-config-daemon

The controller is responsible for:

1. Reading the SriovNetworkNodePolicy CRs and SriovNetwork CRs as input.
2. Rendering the manifests for the SR-IOV CNI plugin and device plugin daemons.
3. Rendering the spec of the SriovNetworkNodeState CR for each node.

The sriov-config-daemon is responsible for:

1. Discovering the SR-IOV NICs on each node, then syncing the status of the SriovNetworkNodeState CR.
2. Taking the spec of the SriovNetworkNodeState CR as input to configure those NICs.

## Workflow

![SR-IOV Network Operator workflow](doc/images/workflow.png)