# sriov-network-operator

The SR-IOV Network Operator is designed to help the user provision and configure the SR-IOV CNI plugin and device plugin in an OpenShift cluster.

## Motivation

SR-IOV networking is an optional feature of an OpenShift cluster. To make it work, several components must be provisioned and configured accordingly. It makes sense to have one operator coordinate those components in one place, instead of having them managed by different operators. To hide the complexity, the operator should also provide a simple user interface for enabling SR-IOV.

## Features

- Initialize the supported SR-IOV NIC types on selected nodes.
- Provision/upgrade the SR-IOV device plugin executable on selected nodes.
- Provision/upgrade the SR-IOV CNI plugin executable on selected nodes.
- Manage the configuration of the SR-IOV device plugin on the host.
- Generate net-att-def CRs for the SR-IOV CNI plugin.
- Supports operation in a virtualized Kubernetes deployment:
  - Discovers VFs attached to the Virtual Machine (VM).
  - Does not require the associated PFs to be attached.
  - VFs can be associated with SriovNetworks by selecting the appropriate PciAddress as the RootDevice in the SriovNetworkNodePolicy.

## Quick Start

For more detail on installing this operator, refer to the [quick-start](doc/quickstart.md) guide.

## API

The SR-IOV network operator introduces the following new CRDs:

- SriovNetwork

- SriovNetworkNodeState

- SriovNetworkNodePolicy

### SriovNetwork

A SriovNetwork custom resource represents a layer-2 broadcast domain to which some SR-IOV devices are attached. It is primarily used to generate a NetworkAttachmentDefinition CR with an SR-IOV CNI plugin configuration.

This SriovNetwork CR also contains the `resourceName`, which is aligned with the `resourceName` of the SR-IOV device plugin. One SriovNetwork object maps to one `resourceName`, but one `resourceName` can be shared by different SriovNetwork CRs.

This CR should be managed by the cluster admin. Here is an example:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: example-network
  namespace: example-namespace
spec:
  ipam: |
    {
      "type": "host-local",
      "subnet": "10.56.217.0/24",
      "rangeStart": "10.56.217.171",
      "rangeEnd": "10.56.217.181",
      "routes": [{
        "dst": "0.0.0.0/0"
      }],
      "gateway": "10.56.217.1"
    }
  vlan: 0
  resourceName: intelnics
```

#### Chaining CNI metaplugins

It is possible to add capabilities to the device configured via SR-IOV by configuring optional metaplugins.

To do this, the `metaPlugins` field must contain an array of one or more additional configurations used to build a [network configuration list](https://github.com/containernetworking/cni/blob/master/SPEC.md#network-configuration-lists), as in the following example:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: example-network
  namespace: example-namespace
spec:
  ipam: |
    {
      "type": "host-local",
      "subnet": "10.56.217.0/24",
      "rangeStart": "10.56.217.171",
      "rangeEnd": "10.56.217.181",
      "routes": [{
        "dst": "0.0.0.0/0"
      }],
      "gateway": "10.56.217.1"
    }
  vlan: 0
  resourceName: intelnics
  metaPlugins: |
    {
      "type": "tuning",
      "sysctl": {
        "net.core.somaxconn": "500"
      }
    },
    {
      "type": "vrf",
      "vrfname": "red"
    }
```

### SriovNetworkNodeState

This custom resource represents the SR-IOV interface states of each host and should only be managed by the operator itself.

- The `spec` of this CR represents the desired configuration that should be applied to the interfaces and the SR-IOV device plugin.
- The `status` contains the current states of the PFs (bare metal only) and of the VFs. It helps the user discover SR-IOV network hardware on a node, or the attached VFs in the case of a virtual deployment.

The spec is rendered by sriov-policy-controller and consumed by sriov-config-daemon. The sriov-config-daemon is responsible for updating the `status` field to reflect the latest status; this information can be used as input when creating a SriovNetworkNodePolicy CR.

An example of a SriovNetworkNodeState CR:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodeState
metadata:
  name: worker-node-1
  namespace: sriov-network-operator
spec:
  interfaces:
  - deviceType: vfio-pci
    mtu: 1500
    numVfs: 4
    pciAddress: 0000:86:00.0
status:
  interfaces:
  - deviceID: "1583"
    driver: i40e
    mtu: 1500
    numVfs: 4
    pciAddress: 0000:86:00.0
    maxVfs: 64
    vendor: "8086"
    Vfs:
      - deviceID: 154c
        driver: vfio-pci
        pciAddress: 0000:86:02.0
        vendor: "8086"
      - deviceID: 154c
        driver: vfio-pci
        pciAddress: 0000:86:02.1
        vendor: "8086"
      - deviceID: 154c
        driver: vfio-pci
        pciAddress: 0000:86:02.2
        vendor: "8086"
      - deviceID: 154c
        driver: vfio-pci
        pciAddress: 0000:86:02.3
        vendor: "8086"
  - deviceID: "1583"
    driver: i40e
    mtu: 1500
    pciAddress: 0000:86:00.1
    maxVfs: 64
    vendor: "8086"
```

In this example, the `status` field shows that there are 2 SR-IOV capable NICs on node 'worker-node-1', and the `spec` field shows the expected configuration generated from the combination of SriovNetworkNodePolicy CRs. In the virtual deployment case, a single VF will be associated with each device.

### SriovNetworkNodePolicy

This CRD is the key to the SR-IOV network operator. This custom resource should be managed by the cluster admin, to instruct the operator to:

1. Render the spec of the SriovNetworkNodeState CR for the selected nodes, to configure the SR-IOV interfaces. In a virtual deployment, the VF interface is read-only.
2. Deploy the SR-IOV CNI plugin and device plugin on the selected nodes.
3. Generate the configuration of the SR-IOV device plugin.

An example of a SriovNetworkNodePolicy CR:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-1
  namespace: sriov-network-operator
spec:
  deviceType: vfio-pci
  mtu: 1500
  nicSelector:
    deviceID: "1583"
    rootDevices:
    - 0000:86:00.0
    vendor: "8086"
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 4
  priority: 90
  resourceName: intelnics
```

In this example, the user selects the NIC from vendor '8086' (Intel) with device ID '1583' (XL710 for 40GbE), on nodes where the 'network-sriov.capable' label equals 'true'. For each of those PFs, the operator creates 4 VFs, sets the MTU to 1500, and binds the vfio-pci driver to those virtual functions.

In a virtual deployment:
- The MTU of the PF is set by the underlying virtualization platform and cannot be changed by the sriov-network-operator.
- The `numVfs` parameter has no effect, as there is always 1 VF.
- The `deviceType` field depends upon whether the underlying device/driver is [native-bifurcating or non-bifurcating](https://doc.dpdk.org/guides/howto/flow_bifurcation.html). For example, the supported Mellanox devices support native-bifurcating drivers, and therefore `deviceType` should be netdevice (default). The supported Intel devices are non-bifurcating, and `deviceType` should be set to vfio-pci.
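
As an illustration of the virtual deployment notes above, a policy might look like the following sketch. The policy name, resource name, and PCI address are hypothetical, and `deviceType: netdevice` assumes a native-bifurcating device; this is not a configuration taken from the operator's documentation.

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-virtual              # hypothetical name
  namespace: sriov-network-operator
spec:
  deviceType: netdevice             # assumes a native-bifurcating device; use vfio-pci for non-bifurcating ones
  nicSelector:
    rootDevices:
    - 0000:00:05.0                  # hypothetical PCI address of the VF attached to the VM
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 1                         # has no effect in a virtual deployment
  priority: 90
  resourceName: virtualnics         # hypothetical resource name
```

The `mtu` field is left out here because, in a virtual deployment, the MTU is set by the virtualization platform.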

#### Multiple policies

When multiple SriovNetworkNodePolicy CRs are present, the `priority` field
(0 is the highest priority) is used to resolve any conflicts. Conflicts occur
only when the same PF is referenced by multiple policies. The final desired
configuration is saved in `SriovNetworkNodeState.spec.interfaces`.

The policy processing order is based on priority (lowest first), followed by the
`name` field (starting from `a`). Policies with the same **priority** or
**non-overlapping VF groups** (when #-notation is used in the pfName field) are
merged; otherwise only the highest priority policy is applied. In the case of
same-priority policies with overlapping VF groups, only the last processed
policy is applied.

#### Externally Manage virtual functions

When `ExternallyManaged` is requested on a policy, the operator skips only the virtual function creation.
It still binds the virtual functions to the requested driver and exposes them via the device plugin.
Another difference is that when such a policy is removed, the operator
does not remove the virtual functions.

*Note:* This means the user must create the virtual functions before applying the policy, or the webhook will reject
the policy creation.

It's possible to use something like nmstate, kubernetes-nmstate, or just a simple systemd unit file to create
the virtual functions on boot.

This feature was created to support deployments where the user wants to use some of the virtual functions for host
communication, such as a storage network or out-of-band management, and the virtual functions must exist on boot and not only
after the operator and config-daemon are running.
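
A minimal sketch of a policy that requests externally managed virtual functions could look like this. It assumes the field is spelled `externallyManaged` in the policy spec; the policy name and resource name are hypothetical, and the 4 VFs on the selected PF are assumed to have been created on the host before the policy is applied.

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-externally-managed   # hypothetical name
  namespace: sriov-network-operator
spec:
  deviceType: vfio-pci
  externallyManaged: true           # assumed spelling; the operator skips VF creation
  nicSelector:
    rootDevices:
    - 0000:86:00.0
  nodeSelector:
    feature.node.kubernetes.io/network-sriov.capable: "true"
  numVfs: 4                         # these VFs must already exist on the host
  priority: 90
  resourceName: hostmanagednics     # hypothetical resource name
```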

#### Disabling SR-IOV Config Daemon plugins

It is possible to disable SR-IOV network operator config daemon plugins in case their operation
is not needed or is undesirable.

As an example, some plugins perform vendor-specific firmware configuration
to enable SR-IOV (e.g. the `mellanox` plugin). Certain deployment environments may prefer to perform such configuration
once during node provisioning, while ensuring the configuration is compatible with any SR-IOV network node policy
defined for the particular environment. This reduces or completely eliminates the need to reboot nodes during SR-IOV
configuration by the operator.

This can be done by setting `spec.disablePlugins` in the SriovOperatorConfig `default` CR to the list of plugins
to disable.

**Example**:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: sriov-network-operator
spec:
  ...
  disablePlugins:
    - mellanox
  ...
```

> **NOTE**: Currently only the `mellanox` plugin can be disabled.

### Parallel draining

It is possible to drain more than one node at a time using this operator.

The configuration is done via the SriovNetworkPoolConfig CR, which selects a set of nodes using the node selector and sets how many
nodes from the pool the operator can drain in parallel. `maxUnavailable` can be a number or a percentage.

> **NOTE**: Each node can be part of only one pool. If a node is selected by more than one pool, it will not be drained.

> **NOTE**: If a node is not part of any pool, it will have a default configuration of `maxUnavailable` 1.

**Example**:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkPoolConfig
metadata:
  name: worker
  namespace: sriov-network-operator
spec:
  maxUnavailable: 2
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
```

### Resource Injector Policy

By default, the resource injector webhook has a failure policy of Ignore. This was implemented so that pod creation is not blocked
when the webhook is unavailable.

With a feature introduced in Kubernetes 1.28 (Beta) called [MatchConditions](https://kubernetes.io/docs/reference/access-authn-authz/extensible-admission-controllers/#matching-requests-matchconditions),
the webhook failure policy can be changed to Fail. In this case the operator configures the mutating webhook for the resource
injector to apply only to pods with the secondary network annotation `k8s.v1.cni.cncf.io/networks`.
It's possible to enable the feature with a feature gate via the SriovOperatorConfig object.

> **NOTE**: The feature is disabled by default.

**Example**:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: sriov-network-operator
spec:
  ...
  featureGates:
    resourceInjectorMatchCondition: true
  ...
```

## Components and design

This operator is split into 2 components:

- controller
- sriov-config-daemon

The controller is responsible for:

1. Reading the SriovNetworkNodePolicy CRs and SriovNetwork CRs as input.
2. Rendering the manifests for the SR-IOV CNI plugin and device plugin daemons.
3. Rendering the spec of the SriovNetworkNodeState CR for each node.

The sriov-config-daemon is responsible for:

1. Discovering the SR-IOV NICs on each node, then syncing the status of the SriovNetworkNodeState CR.
2. Taking the spec of the SriovNetworkNodeState CR as input to configure those NICs.

## Workflow

![SRIOV Network Operator work flow](doc/images/workflow.png)