github.com/k8snetworkplumbingwg/sriov-network-operator@v1.2.1-0.20240408194816-2d2e5a45d453/doc/design/parallel-node-config.md

---
title: Parallel SR-IOV configuration
authors:
  - SchSeba
reviewers:
  - adrianchiris
  - e0ne
creation-date: 18-07-2023
last-updated: 18-07-2023
---

# Parallel SR-IOV configuration

## Summary
Allow the SR-IOV Network Operator to configure more than one node at the same time.

## Motivation
The SR-IOV Network Operator configures SR-IOV one node at a time, and one NIC at a time. That means we would need to wait
hours or even days to configure all NICs on large cluster deployments. This proposal also moves all draining logic to a
centralized place, which reduces the chance of race conditions and bugs that were previously encountered in
sriov-network-config-daemon during draining.

### Use Cases

### Goals
* Number of drainable nodes should be 1 by default
* Number of drainable nodes should be configurable per pool
* Node pools should be defined by a node selector
* Move all drain-related logic into a centralized place

### Non-Goals
Parallel NIC configuration on the same node is out of scope of this proposal.

## Proposal
Introduce a node pool drain configuration and a controller to meet the goals above.

### Workflow Description
A new Drain controller will be introduced to manage node drain and cordon procedures. This means we no longer need to
perform the drain and use the `drain lock` in the config daemon.
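As a rough illustration of the admission decision the Drain controller makes for a node (a hypothetical sketch, not the actual controller code; `nextState` is an invented helper using the annotation values listed next):

```go
package main

import "fmt"

// Annotation state values from this design.
const (
	DrainIdle      = "Idle"
	DrainRequired  = "Drain_Required"
	RebootRequired = "Reboot_Required"
	Draining       = "Draining"
	DrainComplete  = "DrainComplete"
)

// nextState is a hypothetical helper: given the desired-state annotation,
// the number of nodes currently draining in the pool, and the pool's
// maxUnavailable, it returns the current-state annotation the controller
// would write, and whether the request must be re-queued.
func nextState(desired string, drainingNodes, maxUnavailable int) (string, bool) {
	switch desired {
	case DrainRequired, RebootRequired:
		if drainingNodes >= maxUnavailable {
			return "", true // pool is full: re-queue and retry later
		}
		return Draining, false // admit the node into the drain
	case DrainIdle:
		return DrainIdle, false // node finished configuring: uncordon it
	}
	return "", true
}

func main() {
	s, requeue := nextState(DrainRequired, 0, 1)
	fmt.Println(s, requeue) // Draining false
	_, requeue = nextState(RebootRequired, 1, 1)
	fmt.Println(requeue) // true: another node in the pool is already draining
}
```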
The overall drain process will be covered by the following states:

```golang
NodeDrainAnnotation             = "sriovnetwork.openshift.io/state"
NodeStateDrainAnnotation        = "sriovnetwork.openshift.io/desired-state"
NodeStateDrainAnnotationCurrent = "sriovnetwork.openshift.io/current-state"
DrainIdle                       = "Idle"
DrainRequired                   = "Drain_Required"
RebootRequired                  = "Reboot_Required"
Draining                        = "Draining"
DrainComplete                   = "DrainComplete"
```

The Drain controller will watch the Node annotation `sriovnetwork.openshift.io/state`
and the SriovNetworkNodeState annotation `sriovnetwork.openshift.io/desired-state`,
and write the `sriovnetwork.openshift.io/current-state` annotation on the SriovNetworkNodeState.

The config daemon will read the SriovNetworkNodeState annotation `sriovnetwork.openshift.io/current-state` and write both
the Node annotation `sriovnetwork.openshift.io/state` and the SriovNetworkNodeState annotation `sriovnetwork.openshift.io/desired-state`.

*NOTE:* In the future we are going to drop the node annotation and only use the SriovNetworkNodeState.

Draining procedure:

1. the config daemon marks the node as `Drain_Required` or `Reboot_Required` by writing that value to both the Node annotation `sriovnetwork.openshift.io/state`
   and the SriovNetworkNodeState annotation `sriovnetwork.openshift.io/desired-state`
2. the operator's drain controller reconcile loop finds the node's SriovNetworkPoolConfig
   1. if the number of `Draining` nodes is greater than or equal to `MaxUnavailable`, the operator re-queues the request
   2. if the number of `Draining` nodes is lower than `MaxUnavailable`, the operator starts the draining process
      and sets the SriovNetworkNodeState annotation `sriovnetwork.openshift.io/current-state` to `Draining`
3. on the OpenShift platform, the operator pauses the MachineConfigPool related to the node
4. the operator starts the drain process:
   1. if `Drain_Required`, the operator evicts ONLY pods that use SR-IOV devices
   2. if `Reboot_Required`, the operator evicts ALL the pods on the node
5. the operator moves the `sriovnetwork.openshift.io/current-state` annotation to `DrainComplete`
6. the daemon continues with the configuration; when it is done, it moves both the `sriovnetwork.openshift.io/state`
   annotation on the Node and the `sriovnetwork.openshift.io/desired-state` annotation on the SriovNetworkNodeState back to `Idle`
7. the operator completes the drain by removing the cordon and setting the `sriovnetwork.openshift.io/current-state` annotation to `Idle`

### API Extensions

#### Extend existing CR SriovNetworkPoolConfig
SriovNetworkPoolConfig is currently used only on OpenShift, to provide configuration for
OVS Hardware Offloading. We can extend it to add configuration for the drain
pool. E.g.:

```golang
// SriovNetworkPoolConfigSpec defines the desired state of SriovNetworkPoolConfig
type SriovNetworkPoolConfigSpec struct {
	...

	// nodeSelector specifies a label selector for Nodes
	NodeSelector *metav1.LabelSelector `json:"nodeSelector,omitempty"`

	// maxUnavailable defines either an integer number or percentage
	// of nodes in the pool that can go Unavailable during an update.
	//
	// A value larger than 1 will mean multiple nodes going unavailable during
	// the update, which may affect your workload stress on the remaining nodes.
	// Drain will respect Pod Disruption Budgets (PDBs) such as etcd quorum guards,
	// even if maxUnavailable is greater than one.
	MaxUnavailable *intstr.IntOrString `json:"maxUnavailable,omitempty"`
}
```

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkPoolConfig
metadata:
  name: pool-1
  namespace: network-operator
spec:
  maxUnavailable: "20%"
  nodeSelector:
    matchExpressions:
      - key: some-label
        operator: In
        values:
          - val-2
      - key: other-label
        operator: Exists
```

Once this change is implemented, the `SriovNetworkPoolConfig` configuration will apply to both vanilla Kubernetes
and OpenShift clusters.

### Implementation Constraints

A node can only be part of one pool. If a node is not part of any pool, it will be allocated
to a virtual default pool with a `maxUnavailable` of 1.

_*Note:*_ if you create a pool with an empty selector it will match all the nodes, and you cannot have another pool.

### Upgrade & Downgrade considerations
After an operator upgrade we have to support both the `sriovnetwork.openshift.io/state` node annotation and the `sriovnetwork.openshift.io/desired-state`
annotation on the `SriovNetworkNodeState`. In the future we are going to migrate to only using the annotation on the `SriovNetworkNodeState`.

There is no change in the upgrade from the user's point of view.
If there are no pools, or the node doesn't belong to any pool, `maxUnavailable` will be 1 to preserve the same functionality after the upgrade.

_*Note:*_ no node should be in the `Draining` or `MCP_Paused` state in the node annotation before the upgrade.

### Alternative APIs
#### Option 1: extend SriovOperatorConfig CRD
We can extend the SriovOperatorConfig CRD to include drain pool configuration.
E.g.:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: network-operator
spec:
  # Add fields here
  enableInjector: false
  enableOperatorWebhook: false
  configDaemonNodeSelector: {}
  disableDrain: false
  drainConfig:
    - name: default
      maxParallelNodeConfiguration: 1
      priority: 0 # the lowest priority
    - name: pool-1
      maxParallelNodeConfiguration: 5
      priority: 44
      # empty nodeSelectorTerms means 'all nodes'
      nodeSelectorTerms:
        - matchExpressions:
            - key: some-label
              operator: In
              values:
                - val-1
                - val-2
        - matchExpressions:
            - key: other-label
              operator: Exists
```

We didn't choose this option because SriovOperatorConfig contains Config Daemon-specific options only, while drain
configuration is node-specific.

#### Option 2: New CRD
Add a new `DrainConfiguration` CRD with the fields mentioned in the previous option. E.g.:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovDrainConfig
metadata:
  name: default
  namespace: network-operator
spec:
  maxParallelNodeConfiguration: 1
  priority: 0 # the lowest priority
  # empty nodeSelectorTerms means 'all nodes'
  nodeSelectorTerms:
    - matchExpressions:
        - key: some-label
          operator: In
```

We didn't choose this option because there is an already-defined `SriovNetworkPoolConfig` CRD which could be used for the needed
configuration.

### Test Plan
* Unit tests will be implemented for the new Drain Controller.
* E2E, manual or automated functional testing should include the following test cases:
  * verify that we actually configure SR-IOV on `MaxParallelNodeConfiguration` nodes at the same time
  * check that we don't configure more than `MaxParallelNodeConfiguration` nodes at the same time
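For illustration, this is roughly how a percentage `maxUnavailable` (such as the `20%` in the pool example above) could be scaled against the number of nodes a pool matches. This is a hypothetical sketch: `resolveMaxUnavailable` is an invented helper, and the real operator would more likely rely on `intstr.GetScaledValueFromIntOrPercent` from `k8s.io/apimachinery`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// resolveMaxUnavailable scales a maxUnavailable value ("20%" or "2")
// against the number of nodes matched by the pool's nodeSelector.
// Hypothetical helper, not the operator's actual implementation.
func resolveMaxUnavailable(value string, poolSize int) (int, error) {
	if strings.HasSuffix(value, "%") {
		pct, err := strconv.Atoi(strings.TrimSuffix(value, "%"))
		if err != nil {
			return 0, err
		}
		n := poolSize * pct / 100 // round down to whole nodes
		if n < 1 {
			n = 1 // assumption: always allow at least one node to drain
		}
		return n, nil
	}
	return strconv.Atoi(value) // plain integer value
}

func main() {
	n, _ := resolveMaxUnavailable("20%", 10)
	fmt.Println(n) // 2: at most two of the ten pool nodes drain in parallel
}
```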