---
title: software bridge management
authors:
  - ykulazhenkov
reviewers:
creation-date: 15-02-2024
last-updated: 15-02-2024
---

# Software bridge management

## Summary

When a NIC is configured in switchdev mode, a VF representor net device is created for each VF on it.
These representors are used by a software switch (OVS, Linux bridge) to control traffic and configure hardware offloads.
The software bridge is an essential part of using NICs in switchdev mode.

**sriov-network-operator** can set switchdev mode for a NIC and create VFs on it,
but it doesn't provide any functionality to create and configure software bridges.

This document contains a proposal to add limited support for software bridge configuration to the **sriov-network-operator**.

This feature assumes integration with [ovs-cni](https://github.com/k8snetworkplumbingwg/ovs-cni) and
[accelerated-bridge-cni](https://github.com/k8snetworkplumbingwg/accelerated-bridge-cni).

Depends on the [_switchdev and systemd modes refactoring_](switchdev-refactoring.md) feature.

## Motivation

SRIOV Legacy mode is no longer actively developed, and we need to encourage users to migrate to switchdev mode,
which is actively developed and will continue to receive new features and improvements.

To promote switching to switchdev configurations, we need to provide a nice UX for the end user.
This requires providing an easy way to configure software switches, which are prerequisites for NICs in switchdev mode.


### Use Cases

* As a user, I expect that **sriov-network-operator** will install `ovs-cni` and `accelerated-bridge-cni` on hosts.
* As a user, I want to create an `OVSNetwork` CR, which will result in the creation of a `NetworkAttachmentDefinition` CR that uses
  `ovs-cni` and contains the required resource request.
* As a user, I want to create a `BridgeNetwork` CR, which will result in the creation of a `NetworkAttachmentDefinition` CR that uses
  `accelerated-bridge-cni` and contains the required resource request.
* As a user, I want to define configuration for software bridges inside the `SriovNetworkNodePolicy` CR and expect that the
  operator will create the required bridges, configure them, and attach uplinks (physical functions).

### Goals

* handle installation of `ovs-cni` and `accelerated-bridge-cni`
* add `OVSNetwork` and `BridgeNetwork` CRDs as an API for end users to simplify creation of `NetworkAttachmentDefinition` CRs
  for `ovs-cni` and `accelerated-bridge-cni`
* extend `SriovNetworkNodePolicy` CRD to support configuration of software bridges (bridge-level configuration)
* extend `SriovNetworkNodeState` (spec and status) CRD to support configuration of software bridges (bridge-level configuration)
* support configuration of software bridges in both modes (operator's `configurationMode` setting): `daemon` and `systemd`
* implementation should be compatible with the [_Externally Managed PF_](externally-manage-pf.md) feature


### Non-Goals

* replace `SriovNetworkPoolConfig` CRD
* change API for host-level settings, e.g. `ovs-hw-offload`

  _**Note:** we may need to extend this API to support additional options_

* add support for VF-lag use-case

## Proposal

1. deploy `ovs-cni` and `accelerated-bridge-cni` with init containers of the `sriov-network-config-daemon` Pod
2. define `OVSNetwork` and `BridgeNetwork` CRDs and implement controllers for them which will create `NetworkAttachmentDefinition` CRs
3. extend `SriovNetworkNodePolicy` and `SriovNetworkNodeState` (spec and status) CRDs to support configuration of software bridges (bridge-level configuration)


### Workflow Description

Implementation should be compatible with the following workflows:

* [Fully automatic workflow](#fully-automatic-workflow)
* [NIC configuration only flow](#nic-configuration-only-flow) (Externally managed bridge)
* [Externally managed NIC flow](#externally-managed-nic-flow)

#### Fully automatic workflow

This workflow assumes that **sriov-network-operator** handles PF and VF configuration, creation and configuration of the software bridge, announcement of SRIOV resources with the device plugin, and preparation of the `NetworkAttachmentDefinition` CR.

1. User creates `SriovNetworkNodePolicy` where:
    * eswitch mode is set to `switchdev`
    * configuration for the selected software bridge is defined

2. The operator populates `SriovNetworkNodeState` for matching nodes with PF and bridge
   configurations

3. `sriov-network-config-daemon` applies PF configuration, creates and configures the bridge, attaches the PF to the bridge, and applies VF configuration


   _**Note 1:** if the operator runs in the `systemd` mode then bridge creation should happen in the `pre` phase._

   _**Note 2:** we should create a udev rule which will set `NM_UNMANAGED=1` for bridges created by the operator_

4. `sriov-network-config-daemon` should report information about software bridges in the status field of the `SriovNetworkNodeState` CR.

5. SRIOV resources are announced by the Device Plugin

6. User creates `OVSNetwork` or `BridgeNetwork` CR to create `NetworkAttachmentDefinition` CR, which relies on resources announced by the Device Plugin


_**Note:** the created software bridge should be removed during the PF configuration reset_


#### NIC configuration only flow

This workflow is kept to support the existing HW offloading use-case.

In this case, **sriov-network-operator** handles PF and VF configuration, announcement of SRIOV resources with the device plugin, and preparation of the `NetworkAttachmentDefinition` CR.

In some scenarios, it may be compatible with `configurationMode: daemon`, but it is supposed to be used when the operator runs in `systemd` mode.

1. User creates `SriovNetworkNodePolicy` where:
    * eswitch mode is set to `switchdev`

   _**Note:** `SriovNetworkNodePolicy` CR should not include bridge configuration_


2. The operator populates `SriovNetworkNodeState` for matching nodes with PF configurations

3. `pre` systemd service creates VFs and configures the PF

4. NetworkManager, systemd-networkd, or environment-specific scripts create the bridge

5. `post` systemd service binds VFs to the required driver and proceeds with other configuration steps

6. SRIOV resources are announced by the Device Plugin

7. User creates `OVSNetwork` or `BridgeNetwork` CR to create `NetworkAttachmentDefinition` CR, which relies on resources announced by the Device Plugin

   _**Note:** it is possible to use an externally created `NetworkAttachmentDefinition` CR that contains configuration for any CNI plugin that supports
   resources from NICs in switchdev mode._


_**Note 1:** bridge is not removed during the PF configuration reset_

_**Note 2:** information about bridges is not reported in the status field of the `SriovNetworkNodeState` CR_

#### Externally managed NIC flow

In this case, **sriov-network-operator** handles announcement of SRIOV resources with the device plugin and preparation of the `NetworkAttachmentDefinition` CR.

1. User configures PFs, creates VFs, creates the bridge, and attaches PFs to the bridge with custom scripts.

2. User creates `SriovNetworkNodePolicy` where:
    * eswitch mode is set to `switchdev`
    * numVFs is set to be less than or equal to the number of pre-created VFs
    * `externallyManaged` is set to true for PFs

3. `sriov-network-config-daemon` configures VFs

4. SRIOV resources are announced by the Device Plugin

5. User creates `OVSNetwork` or `BridgeNetwork` CR to create `NetworkAttachmentDefinition` CR, which relies on resources announced by the Device Plugin

### API Extensions

#### Environment variables for the operator

| Variable | Description |
| --- | --- |
| `OVS_CNI_IMAGE` | contains full image name for `ovs-cni` |
| `ACCELERATED_BRIDGE_CNI_IMAGE` | contains full image name for `accelerated-bridge-cni` |
| `OVSDB_SOCKET_PATH` | path to the OVSDB socket, passed to the sriov config daemon |


#### Feature flags

`manageSoftwareBridges` - control state of the feature (default: `false`)

#### Daemon command line args

`--ovsdb-socket-path` - path to the OVSDB socket, defaults to `/var/run/openvswitch/db.sock`

`--manage-software-bridges` - enables management of the OVS and Linux bridges by the operator

#### OVSNetwork CRD `new`

```golang
// OVSNetworkSpec defines the desired state of OVSNetwork
type OVSNetworkSpec struct {
	// Namespace of the NetworkAttachmentDefinition custom resource
	NetworkNamespace string `json:"networkNamespace,omitempty"`
	// OVS Network device plugin endpoint resource name
	ResourceName string `json:"resourceName"`
	// Capabilities to be configured for this network.
	// Capabilities supported: (mac|ips), e.g. '{"mac": true}'
	Capabilities string `json:"capabilities,omitempty"`
	// IPAM configuration to be used for this network.
	IPAM string `json:"ipam,omitempty"`
	// name of the OVS bridge, if not set OVS will automatically select bridge
	// based on VF PCI address
	Bridge string `json:"bridge,omitempty"`
	// Vlan to assign for the OVS port
	Vlan uint `json:"vlan,omitempty"`
	// Mtu for the OVS port
	MTU uint `json:"mtu,omitempty"`
	// Trunk configuration for the OVS port
	Trunk []*TrunkConfig `json:"trunk,omitempty"`
	// The type of interface on ovs.
	InterfaceType string `json:"interfaceType,omitempty"`
	// MetaPluginsConfig configuration to be used in order to chain metaplugins
	MetaPluginsConfig string `json:"metaPlugins,omitempty"`
}

// TrunkConfig contains configuration for OVS trunk
type TrunkConfig struct {
	MinID *uint `json:"minID,omitempty"`
	MaxID *uint `json:"maxID,omitempty"`
	ID    *uint `json:"id,omitempty"`
}

// OVSNetworkStatus defines the observed state of OVSNetwork
type OVSNetworkStatus struct {
}

// OVSNetwork is the Schema for the ovsnetworks API
type OVSNetwork struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   OVSNetworkSpec   `json:"spec,omitempty"`
	Status OVSNetworkStatus `json:"status,omitempty"`
}

```

#### BridgeNetwork CRD `new`

```golang
// BridgeNetworkSpec defines the desired state of BridgeNetwork
type BridgeNetworkSpec struct {
	// Namespace of the NetworkAttachmentDefinition custom resource
	NetworkNamespace string `json:"networkNamespace,omitempty"`
	// Device plugin endpoint resource name
	ResourceName string `json:"resourceName"`
	// Capabilities to be configured for this network.
	// Capabilities supported: (mac|ips), e.g. '{"mac": true}'
	Capabilities string `json:"capabilities,omitempty"`
	// IPAM configuration to be used for this network.
	IPAM string `json:"ipam,omitempty"`
	// name of the Linux bridge, if not set will automatically select bridge
	// based on VF PCI address
	Bridge string `json:"bridge,omitempty"`
	// VLAN ID for VF
	Vlan uint `json:"vlan,omitempty"`
	// VLAN Trunk configuration
	Trunk []TrunkConfig `json:"trunk,omitempty"`
	// enable setting matching vlan tags on the bridge uplink interface, default is false
	SetUplinkVlan bool `json:"setUplinkVlan,omitempty"`
	// MTU for VF and representor
	MTU uint `json:"mtu,omitempty"`
	// MetaPluginsConfig configuration to be used in order to chain metaplugins
	MetaPluginsConfig string `json:"metaPlugins,omitempty"`
}

// BridgeNetworkStatus defines the observed state of BridgeNetwork
type BridgeNetworkStatus struct {
}

// BridgeNetwork is the Schema for the bridgenetworks API
type BridgeNetwork struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   BridgeNetworkSpec   `json:"spec,omitempty"`
	Status BridgeNetworkStatus `json:"status,omitempty"`
}

//+kubebuilder:object:root=true

// BridgeNetworkList contains a list of BridgeNetwork
type BridgeNetworkList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`
	Items           []BridgeNetwork `json:"items"`
}

```

#### SriovNetworkNodePolicy CRD

```golang
// SriovNetworkNodePolicySpec defines the desired state of SriovNetworkNodePolicy
type SriovNetworkNodePolicySpec struct {
	// ...existing fields...
	// contains spec for the software bridge
	Bridge Bridge `json:"bridge,omitempty"`
}

// contains spec for the bridge
// only one bridge type can be set
type Bridge struct {
	// contains optional config for OVS bridge
	Ovs *OVSConfig `json:"ovs,omitempty"`
	// contains optional config for Linux bridge
	Linux *LinuxBridgeConfig `json:"linux,omitempty"`
}

// OVSConfig optional configuration for OVS bridge and uplink Interface
type OVSConfig struct {
	// contains bridge level settings
	Bridge OVSBridgeConfig `json:"bridge,omitempty"`
	// contains settings for uplink (PF)
	Uplink OVSUplinkConfig `json:"uplink,omitempty"`
}

// OVSBridgeConfig contains some options from the Bridge table in OVSDB
type OVSBridgeConfig struct {
	DatapathType string            `json:"datapathType,omitempty"`
	ExternalIDs  map[string]string `json:"externalIDs,omitempty"`
	OtherConfig  map[string]string `json:"otherConfig,omitempty"`
}

// OVSUplinkConfig contains PF interface configuration for the bridge
type OVSUplinkConfig struct {
	Interface OVSInterfaceConfig `json:"interface,omitempty"`
	// can be extended to support OVSPortConfig which will include
	// settings from the OVS Port table
}

// OVSInterfaceConfig contains some options from the Interface table of the OVSDB for PF
type OVSInterfaceConfig struct {
	Type        string            `json:"type,omitempty"`
	Options     map[string]string `json:"options,omitempty"`
	ExternalIDs map[string]string `json:"externalIDs,omitempty"`
	OtherConfig map[string]string `json:"otherConfig,omitempty"`
}

// LinuxBridgeConfig optional configuration for Linux bridge and uplink interface
type LinuxBridgeConfig struct {
	Bridge BridgeConfig      `json:"bridge,omitempty"`
	Uplink map[string]string `json:"uplink,omitempty"` // TODO clarify required settings
}

// BridgeConfig contains some options for linux bridge
type BridgeConfig struct {
	VlanFiltering bool `json:"vlanFiltering,omitempty"`
	// +kubebuilder:validation:Enum=802.1Q;802.1ad
	VlanProtocol string `json:"vlanProtocol,omitempty"`
}
```

_**Note 1**: multiple policies can match a single PF (vf range use-case); bridge settings from the policy with
the higher priority (which is the one with the lowest `.spec.priority` value) will be applied to the PF._

_**Note 2**: multiple NICs can match the same policy on a host. In this case a separate bridge will be created for each NIC._

#### SriovNetworkNodeState CRD

```golang
type SriovNetworkNodeStateSpec struct {
	// ...existing fields...
	Interfaces Interfaces `json:"interfaces,omitempty"`
	Bridges    Bridges    `json:"bridges,omitempty"`
}

// SriovNetworkNodeStateStatus defines the observed state of SriovNetworkNodeState
type SriovNetworkNodeStateStatus struct {
	Interfaces    InterfaceExts `json:"interfaces,omitempty"`
	Bridges       Bridges       `json:"bridges,omitempty"`
	SyncStatus    string        `json:"syncStatus,omitempty"`
	LastSyncError string        `json:"lastSyncError,omitempty"`
}

// Bridges contains list of bridges
type Bridges struct {
	OVS   []OVSConfigExt         `json:"ovs,omitempty"`
	Linux []LinuxBridgeConfigExt `json:"linux,omitempty"`
}

type OVSConfigExt struct {
	// name of the bridge
	Name string `json:"name"`
	// bridge-level configuration for the bridge
	Bridge OVSBridgeConfig `json:"bridge,omitempty"`
	// uplink-level bridge configuration for each uplink(PF).
	// in the initial implementation will always contain one element
	Uplinks []OVSUplinkConfigExt `json:"uplinks,omitempty"`
}

type OVSUplinkConfigExt struct {
	// pci address of the PF
	PciAddress string `json:"pciAddress"`
	// name of the PF interface
	Name string `json:"name,omitempty"`
	// configuration from the Interface OVS table for the PF
	Interface OVSInterfaceConfig `json:"interface,omitempty"`
}

type LinuxBridgeConfigExt struct {
	Name    string                       `json:"name"`
	Bridge  BridgeConfig                 `json:"bridge,omitempty"`
	Uplinks []LinuxBridgeUPlinkConfigExt `json:"uplinks,omitempty"`
}

type LinuxBridgeUPlinkConfigExt struct {
	PciAddress string            `json:"pciAddress"`
	Name       string            `json:"name,omitempty"`
	Uplink     map[string]string `json:"uplink,omitempty"`
}

type VirtualFunction struct {
	// ...existing fields...
	// contains VF representor name for NICs in switchdev mode
	RepresentorName string `json:"representorName,omitempty"`
}

```
`SriovNetworkNodeState.spec` and `SriovNetworkNodeState.status` should be extended to contain the same `Bridges` struct.

_**Note:** The `Bridges` struct in the `SriovNetworkNodeState.status` can later be extended based on user feedback
to report additional information required to improve UX._

### Implementation Details/Notes/Constraints

The feature is only supported on baremetal clusters.

#### Dependencies on changes in other projects

The proposed implementation requires changes in `ovs-cni` and `accelerated-bridge-cni`. We need to change their behavior when the `deviceID` argument is provided in CNI ARGS.
If `deviceID` is set and the `bridge` arg is empty, the CNI plugin should try to automatically select the right bridge by following the chain:

VF (PCI address is in `deviceID` arg) > PF > Bond (if PF is part of the bond) > Bridge

_**Note:** `accelerated-bridge-cni` already has similar logic, but currently it selects the bridge from a predefined list of bridges._


#### Phased implementation

The feature assumes phased implementation.

##### Phase 1

Add support for [NIC configuration only flow](#nic-configuration-only-flow) and
[Externally managed NIC flow](#externally-managed-nic-flow) for Open vSwitch

Requirements:
* add bridge auto-selection logic to `ovs-cni`
* define `OVSNetwork` CRD and implement controller for it

##### Phase 2

Add support for [Fully automatic workflow](#fully-automatic-workflow) for Open vSwitch

Requirements:
* requirements from phase 1
* extend `SriovNetworkNodePolicy` and `SriovNetworkNodeState` CRDs to support configuration for ovs (bridge-level configuration)
* modify code:
    * add support for ovs bridges creation
    * add support for reporting information about configured ovs bridges on the node
    * add support for removing auto-created ovs bridges during the PF reset

##### Phase 3

Add support for [NIC configuration only flow](#nic-configuration-only-flow),
[Externally managed NIC flow](#externally-managed-nic-flow) and [Fully automatic workflow](#fully-automatic-workflow) for Linux bridge

Requirements:
* add bridge auto-selection logic to `accelerated-bridge-cni`
* define `BridgeNetwork` CRD and implement controller for it
* extend `SriovNetworkNodePolicy` and `SriovNetworkNodeState` CRDs to support configuration for linux bridge (bridge-level configuration)
* modify code:
    * add support for linux bridges creation
    * add support for reporting information about configured linux bridges on the node
    * add support for removing auto-created linux bridges during the PF reset

### Upgrade & Downgrade considerations

This feature doesn't contain any breaking changes.
Automatic upgrades should be safe and will not require any manual steps.

Downgrading without a PF configuration reset may be problematic and may leave the node in an inconsistent state.
It is recommended to reset PFs attached to bridges first and then do a downgrade.

### Test Plan

New functionality should be covered with unit tests.

Manual and automatic e2e testing will require hardware with NICs that support switchdev mode and hardware offloading for software bridges.



### Alternative options

#### Set configuration for software bridges in SriovNetworkPoolConfig CRD

```golang
type SriovNetworkPoolConfigSpec struct {
	// OvsHardwareOffloadConfig describes the OVS HWOL configuration for selected Nodes
	OvsHardwareOffloadConfig OvsHardwareOffloadConfig `json:"ovsHardwareOffloadConfig,omitempty"`
	// NodeSelector only valid for the fields below
	NodeSelector map[string]string `json:"nodeSelector,omitempty"`
	Bridges      []BridgeConf      `json:"bridges,omitempty"`
}

type BridgeConf struct {
	// same configuration as in the main option
	Bridge *Bridge `json:"bridge"`
	// NicSelector uses the same type as SriovNetworkNodePolicySpec
	NicSelector SriovNetworkNicSelector `json:"nicSelector"`
}

```

The main problem with this option is that reliable scheduling of workloads can only be achieved in a complicated way.
The scheduler considers information from the device plugins to ensure that required resources are available on the host before putting workloads on it.
In the main option outlined in this doc, the bridge configuration is part of the `SriovNetworkNodePolicy`, which means that the bridge is guaranteed to be available on the host if the host announces the resource name defined in `SriovNetworkNodePolicy`.

In the alternative option, there is no guarantee that NodeSelector + NicSelector for a bridge configuration is in sync with the NodeSelector + NicSelectors from a policy, so we can't rely on the SRIOV resource name from the policy for reliable scheduling: a host can have SRIOV VFs but may be missing a bridge.
To solve this problem, we can create a device plugin, which will expose information about available bridges in the form of resources,
e.g., `<prefix>/<bridge_type>_<pool_config_name>`. A user must explicitly request SRIOV + bridge resources while creating a Pod.
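
To make the extra burden of this alternative concrete, the sketch below builds a bridge resource name following the `<prefix>/<bridge_type>_<pool_config_name>` pattern and lists the resources a Pod would have to request. It is only an illustration: the helper names `bridgeResourceName` and `podResourceRequests`, the `example.com` prefix, the pool and VF resource names, and the quantity of one bridge resource per Pod are assumptions, not part of the proposal.

```golang
package main

import "fmt"

// bridgeResourceName builds a resource name of the form
// <prefix>/<bridge_type>_<pool_config_name>.
// Hypothetical helper: the proposal only specifies the name pattern.
func bridgeResourceName(prefix, bridgeType, poolConfigName string) string {
	return fmt.Sprintf("%s/%s_%s", prefix, bridgeType, poolConfigName)
}

// podResourceRequests returns the extended resources a Pod would need to
// request in this alternative: the SRIOV VF resource and the bridge resource.
// Requesting exactly one unit of each is an assumption made for illustration.
func podResourceRequests(sriovResource, bridgeResource string) map[string]int {
	return map[string]int{
		sriovResource:  1,
		bridgeResource: 1,
	}
}

func main() {
	// All names below are made up for the example.
	bridge := bridgeResourceName("example.com", "ovs", "pool1")
	fmt.Println(bridge) // prints "example.com/ovs_pool1"

	requests := podResourceRequests("example.com/switchdev_vfs", bridge)
	fmt.Println(len(requests)) // prints "2": the Pod needs both resources
}
```

In the main option only the SRIOV resource is needed, because the bridge is tied to the same policy that defines that resource.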