---
title: Parallel SR-IOV configuration
authors:
  - SchSeba
reviewers:
  - adrianchiris
  - e0ne
creation-date: 18-07-2023
last-updated: 18-07-2023
---

# Parallel SR-IOV configuration

## Summary
Allow SR-IOV Network Operator to configure more than one node at the same time.

## Motivation
The SR-IOV Network Operator configures SR-IOV one node at a time and one NIC at a time, which means we may need to wait
hours or even days to configure all NICs on large cluster deployments. This proposal also moves all draining logic to a
centralized place, which will reduce the chance of race conditions and bugs that were previously encountered in
sriov-network-config-daemon around draining.

### Use Cases

### Goals
* Number of drainable nodes should be 1 by default
* Number of drainable nodes should be configurable per pool
* Node pools should be defined by a node selector
* Move all drain-related logic into a centralized place


### Non-Goals
Parallel NIC configuration on the same node is out of the scope of this proposal.

## Proposal
Introduce a node pool drain configuration and a controller to meet the goals above.


### Workflow Description
A new Drain controller will be introduced to manage the node drain and cordon procedures. This means the config daemon
no longer needs to perform the drain itself or use the `drain lock`. The overall drain process will be covered by the
following states:

```golang
const (
	NodeDrainAnnotation             = "sriovnetwork.openshift.io/state"
	NodeStateDrainAnnotation        = "sriovnetwork.openshift.io/desired-state"
	NodeStateDrainAnnotationCurrent = "sriovnetwork.openshift.io/current-state"

	DrainIdle      = "Idle"
	DrainRequired  = "Drain_Required"
	RebootRequired = "Reboot_Required"
	Draining       = "Draining"
	DrainComplete  = "DrainComplete"
)
```

The Drain controller will watch the Node annotation `sriovnetwork.openshift.io/state`
and the SriovNetworkNodeState annotation `sriovnetwork.openshift.io/desired-state`,
and will write the `sriovnetwork.openshift.io/current-state` annotation on the SriovNetworkNodeState.

The config daemon will read the SriovNetworkNodeState annotation `sriovnetwork.openshift.io/current-state` and write both
the Node annotation `sriovnetwork.openshift.io/state` and the SriovNetworkNodeState annotation `sriovnetwork.openshift.io/desired-state`.

*NOTE:* In the future we are going to drop the node annotation and only use the SriovNetworkNodeState.
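
As an illustration only, a config daemon helper for writing these annotations could look like the sketch below, assuming
a controller-runtime client; the function name is hypothetical and not part of this proposal:

```golang
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// setNodeDrainState patches the Node annotation described above; the same
// pattern would be applied to the matching SriovNetworkNodeState object for
// the `sriovnetwork.openshift.io/desired-state` annotation.
func setNodeDrainState(ctx context.Context, c client.Client, nodeName, state string) error {
	node := &corev1.Node{}
	if err := c.Get(ctx, types.NamespacedName{Name: nodeName}, node); err != nil {
		return err
	}
	patch := client.MergeFrom(node.DeepCopy())
	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}
	node.Annotations["sriovnetwork.openshift.io/state"] = state // e.g. "Drain_Required"
	return c.Patch(ctx, node, patch)
}
```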

Draining procedure:

1. The config daemon marks the node as `Drain_Required` or `Reboot_Required` by writing that value to both the Node annotation `sriovnetwork.openshift.io/state`
   and the SriovNetworkNodeState annotation `sriovnetwork.openshift.io/desired-state`.
2. The operator's Drain controller reconcile loop finds the SriovNetworkPoolConfig for the node (see the sketch after this list):
   1. if the number of `Draining` nodes is greater than or equal to `maxUnavailable`, the operator will re-queue the request
   2. if the number of `Draining` nodes is lower than `maxUnavailable`, the operator will start the draining process
      and annotate the SriovNetworkNodeState annotation `sriovnetwork.openshift.io/current-state` with `Draining`
3. On the OpenShift platform, the operator pauses the machine config pool related to the node.
4. The operator starts the drain process:
   1. if `Drain_Required`, the operator will evict ONLY the pods that use SR-IOV devices
   2. if `Reboot_Required`, the operator will evict ALL the pods on the node
5. The operator moves the `sriovnetwork.openshift.io/current-state` annotation to `DrainComplete`.
6. The daemon continues with the configuration; when it is done, it moves both the `sriovnetwork.openshift.io/state`
   annotation on the Node and the `sriovnetwork.openshift.io/desired-state` annotation on the SriovNetworkNodeState back to `Idle`.
7. The operator runs the complete-drain step to remove the cordon and sets the `sriovnetwork.openshift.io/current-state` annotation to `Idle`.
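
A minimal sketch of the gating logic in step 2, assuming the constants defined earlier and the operator's
`SriovNetworkNodeState` API type; the helper name is illustrative:

```golang
import (
	sriovnetworkv1 "github.com/k8snetworkplumbingwg/sriov-network-operator/api/v1"
)

// canStartDraining counts the nodes of the pool that are already in the
// `Draining` state and allows one more node to start draining only while the
// pool is below its resolved maxUnavailable budget; otherwise the controller
// re-queues the request.
func canStartDraining(states []sriovnetworkv1.SriovNetworkNodeState, maxUnavailable int) bool {
	draining := 0
	for _, s := range states {
		if s.Annotations[NodeStateDrainAnnotationCurrent] == Draining {
			draining++
		}
	}
	return draining < maxUnavailable
}
```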

### API Extensions

#### Extend existing CR SriovNetworkPoolConfig
SriovNetworkPoolConfig is currently used only on OpenShift, to provide the configuration for
OVS Hardware Offloading. We can extend it with the configuration for the drain
pool. E.g.:

```golang
// SriovNetworkPoolConfigSpec defines the desired state of SriovNetworkPoolConfig
type SriovNetworkPoolConfigSpec struct {
	...

	// nodeSelector specifies a label selector for Nodes
	NodeSelector *metav1.LabelSelector `json:"nodeSelector,omitempty"`

	// maxUnavailable defines either an integer number or percentage
	// of nodes in the pool that can go Unavailable during an update.
	//
	// A value larger than 1 will mean multiple nodes going unavailable during
	// the update, which may affect your workload stress on the remaining nodes.
	// Drain will respect Pod Disruption Budgets (PDBs) such as etcd quorum guards,
	// even if maxUnavailable is greater than one.
	MaxUnavailable *intstr.IntOrString `json:"maxUnavailable,omitempty"`
}
```

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkPoolConfig
metadata:
  name: pool-1
  namespace: network-operator
spec:
  maxUnavailable: "20%"
  nodeSelector:
    matchExpressions:
    - key: some-label
      operator: In
      values:
      - val-2
    - key: other-label
      operator: Exists
```
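
For illustration, a percentage value such as `"20%"` can be resolved to a concrete node count with the upstream
`intstr` helper; the function below is a sketch (its name is hypothetical) that rounds down so the budget is never
exceeded:

```golang
import "k8s.io/apimachinery/pkg/util/intstr"

// maxUnavailableFor resolves spec.maxUnavailable (an integer or a percentage)
// against the number of nodes in the pool; a nil value falls back to the
// proposal's default of one node at a time.
func maxUnavailableFor(maxUnavailable *intstr.IntOrString, poolSize int) (int, error) {
	if maxUnavailable == nil {
		return 1, nil
	}
	// Round down: "20%" of a 10-node pool resolves to 2 draining nodes.
	return intstr.GetScaledValueFromIntOrPercent(maxUnavailable, poolSize, false)
}
```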

Once this change is implemented, the `SriovNetworkPoolConfig` configuration will be applied to both vanilla Kubernetes
and OpenShift clusters.

### Implementation Constraints

A node can only be part of one pool. If the node is not part of any pool, it will be allocated
to a virtual default pool with a `maxUnavailable` of 1.

_*Note:*_ if you create a pool with an empty selector it will match all the nodes, so you cannot have another pool.
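
A sketch of how the controller could match a node to a single pool under these constraints, assuming the extended spec
proposed above (the helper name is hypothetical; the selector conversion is the standard apimachinery one):

```golang
import (
	sriovnetworkv1 "github.com/k8snetworkplumbingwg/sriov-network-operator/api/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// findPool returns the first pool whose nodeSelector matches the node's
// labels. A nil result means the node falls back to the virtual default pool
// with maxUnavailable: 1. An empty selector matches every node, which is why
// only one such pool can exist.
func findPool(node *corev1.Node, pools []sriovnetworkv1.SriovNetworkPoolConfig) (*sriovnetworkv1.SriovNetworkPoolConfig, error) {
	for i := range pools {
		sel, err := metav1.LabelSelectorAsSelector(pools[i].Spec.NodeSelector)
		if err != nil {
			return nil, err
		}
		if sel.Matches(labels.Set(node.Labels)) {
			return &pools[i], nil
		}
	}
	return nil, nil
}
```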

### Upgrade & Downgrade considerations
After an operator upgrade we have to support both the `sriovnetwork.openshift.io/state` node annotation and the `sriovnetwork.openshift.io/desired-state`
annotation in the `SriovNetworkNodeState`. In the future we are going to migrate to only using the annotation in the `SriovNetworkNodeState`.

There is no change in the upgrade from the user's point of view.
If there are no pools, or the node doesn't belong to any pool, `maxUnavailable` will be 1 to preserve the same functionality after the upgrade.

_*Note:*_ no node should be in the `Draining` or `MCP_Paused` state in the node annotation before the upgrade.

### Alternative APIs
#### Option 1: extend SriovOperatorConfig CRD
We can extend the SriovOperatorConfig CRD to include the drain pool configuration. E.g.:

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: network-operator
spec:
  # Add fields here
  enableInjector: false
  enableOperatorWebhook: false
  configDaemonNodeSelector: {}
  disableDrain: false
  drainConfig:
  - name: default
    maxParallelNodeConfiguration: 1
    priority: 0 # the lowest priority
  - name: pool-1
    maxParallelNodeConfiguration: 5
    priority: 44
    # empty nodeSelectorTerms means 'all nodes'
    nodeSelectorTerms:
    - matchExpressions:
      - key: some-label
        operator: In
        values:
        - val-1
        - val-2
    - matchExpressions:
      - key: other-label
        operator: "Exists"
```

We didn't choose this option because SriovOperatorConfig contains Config Daemon-specific options only, while the drain
configuration is node-specific.

#### Option 2: New CRD
Add a new `DrainConfiguration` CRD with the fields mentioned in the previous option. E.g.:
```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovDrainConfig
metadata:
  name: default
  namespace: network-operator
spec:
  maxParallelNodeConfiguration: 1
  priority: 0 # the lowest priority
  # empty nodeSelectorTerms means 'all nodes'
  nodeSelectorTerms:
  - matchExpressions:
    - key: some-label
      operator: In
```

We didn't choose this option because there is already a defined `SriovNetworkPoolConfig` CRD which could be used for the
needed configuration.

### Test Plan
* Unit tests will be implemented for the new Drain controller.
* E2E, manual, or automated functional testing should include the following test cases:
  * verify that we actually configure SR-IOV on `maxUnavailable` nodes at the same time
  * verify that we don't configure more than `maxUnavailable` nodes at the same time