---
title: Externally Manage PF
authors:
  - SchSeba
reviewers:
  - zeeke
  - adrianchiris
creation-date: 12-07-2023
last-updated: 12-07-2023
---

# Externally Manage PF

## Summary

Allow the SR-IOV network operator to configure and allocate a subset of virtual functions (VFs) from
a physical function (PF) that is configured outside the SR-IOV network operator.

## Motivation

This feature lets the operator configure only a subset of the virtual functions on a PF.
A third-party component such as nmstate, kubernetes-nmstate, or NetworkManager can then handle the creation
of the virtual functions and use some of them on the host. Examples include using a virtual function as the primary
NIC for the k8s SDN network or for a storage network.

Before this change, the SR-IOV network operator had to be the only component that used or configured VFs, so the user
could not use some of the VFs for host networking.

### Use Cases

* As a user I want to use a virtual function for the SDN network. For SDN, the network needs to be configured before
k8s is deployed, and these VFs should be available at system startup before pods start running
* As a user I want to create the virtual functions via nmstate
* As a user I want pods to use virtual functions from a pre-configured PF
* As a user I want to allocate virtual functions to pods from a PF with custom configuration/driver
* As a user I want virtual functions to be configured for the storage subsystem before k8s is deployed or pods start at system startup

### Goals

* Allow the SR-IOV network operator to handle the configuration and pod allocation of some or all virtual functions
while the PF configuration is managed by an external entity
* Allow the user to choose how many virtual functions to allocate for the system and which subset to make available to pods

### Non-Goals

* Supporting switchdev mode (may change in the future if there is a request)
* Supporting the creation of the VFs on boot by the operator (the operator's systemd mode can be used for that)

## Proposal

Create a sub-flow in the SR-IOV network operator where the user can request a configuration for all or a subset of the virtual functions
without any changes at the PF level.

The operator will first validate that the requested PF has the requested number of virtual functions allocated. It
will also validate that the requested MTU is configured as expected on the PF.
If that is not the case, the `sriovNetworkNodeState.status.SyncStatus` field will report `Failed`.

Then the operator will configure the subset of virtual functions with the requested driver and will update the device plugin
configmap with the expected information to create the relevant pools.

Existing SR-IOV network config daemon flow:
1. Apply the `numVfs`
2. Configure the MTU on the PF
3. Copy the administrative MAC address from the VFs
4. Bind the right driver for the VF
5. Restart the SR-IOV network device plugin

Externally managed SR-IOV network config daemon flow:
1. Copy the administrative MAC address from the VFs
2. Bind the right driver for the VF
3. Restart the SR-IOV network device plugin

In both flows:
* For the InfiniBand link type, it will generate a random node and port GUID for the interface.
* For RDMA (both ETH and IB), it will perform an unbind/bind of the VF driver to set the RDMA node/port GUID.

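The difference between the two flows can be sketched in Go as follows. This is an illustration only: `configSteps` and the step strings are hypothetical names chosen for this example, not the operator's code.

```go
package main

import "fmt"

// configSteps returns the ordered config daemon steps for a PF.
// Hypothetical helper for illustration only.
func configSteps(externallyManaged bool) []string {
	var steps []string
	if !externallyManaged {
		// PF-level configuration is applied only when the operator owns the PF.
		steps = append(steps, "apply numVfs", "configure MTU on the PF")
	}
	// VF-level steps are shared by both flows.
	steps = append(steps,
		"copy administrative MAC address from the VFs",
		"bind VF driver",
		"restart device plugin",
	)
	return steps
}

func main() {
	for _, managed := range []bool{false, true} {
		fmt.Printf("externallyManaged=%v\n", managed)
		for i, step := range configSteps(managed) {
			fmt.Printf("  %d. %s\n", i+1, step)
		}
	}
}
```

The externally managed flow is the regular flow minus the two PF-level steps, so the sketch models it as a single conditional prefix.
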
### Workflow Description

The user will allocate the virtual functions on the system with any third-party tool such as nmstate, kubernetes-nmstate,
systemd scripts, etc.

The user must perform the SR-IOV allocation/configuration before kubelet starts or, more specifically,
before the SR-IOV network operator config daemon starts running on the node.

The user will then be able to create a policy telling the operator that the PF is externally managed.

If the user wants to create the virtual functions after the SR-IOV network config daemon is already running on the system, they will need
to disable the webhook. The policy will stay in a failed state until the virtual functions needed for the policy exist
on the node. The SR-IOV network config daemon will continue to reconcile until the virtual functions exist.

#### Policy Example:
```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-nic-1
  namespace: sriov-network-operator
spec:
  deviceType: netdevice
  nicSelector:
    pfNames: ["ens3f0#5-9"]
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  numVfs: 10
  priority: 99
  resourceName: sriov_nic_1
  externallyManaged: true
```

The PF and VFs 0-4 are externally managed, while the operator configures and allocates VFs 5-9 as selected by `ens3f0#5-9`.
For example, nmstate will create 10 VFs but will only consume VF 0 and 4 in its configuration. nmstate will also manage the MTU and other parameters of the PF.

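To make the `pfNames` entry above concrete, here is a minimal Go sketch that splits an entry such as `ens3f0#5-9` into a PF name and an inclusive VF index range. The `vfSelector` type and `parsePFEntry` function are hypothetical names for this example and are not taken from the operator's code base.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// vfSelector is a hypothetical type: the PF name plus the inclusive
// VF index range taken from a pfNames entry.
type vfSelector struct {
	PF    string
	First int
	Last  int
	All   bool // true when the entry has no "#first-last" suffix
}

// parsePFEntry splits an entry such as "ens3f0#5-9" into its parts.
func parsePFEntry(entry string) (vfSelector, error) {
	pf, rng, hasRange := strings.Cut(entry, "#")
	if pf == "" {
		return vfSelector{}, fmt.Errorf("missing PF name in %q", entry)
	}
	if !hasRange {
		return vfSelector{PF: pf, All: true}, nil
	}
	firstStr, lastStr, ok := strings.Cut(rng, "-")
	if !ok {
		return vfSelector{}, fmt.Errorf("range in %q is not first-last", entry)
	}
	first, err := strconv.Atoi(firstStr)
	if err != nil {
		return vfSelector{}, fmt.Errorf("bad first index in %q: %w", entry, err)
	}
	last, err := strconv.Atoi(lastStr)
	if err != nil {
		return vfSelector{}, fmt.Errorf("bad last index in %q: %w", entry, err)
	}
	if first < 0 || last < first {
		return vfSelector{}, fmt.Errorf("invalid range %d-%d in %q", first, last, entry)
	}
	return vfSelector{PF: pf, First: first, Last: last}, nil
}

func main() {
	for _, entry := range []string{"ens3f0#5-9", "ens3f0"} {
		sel, err := parsePFEntry(entry)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %+v\n", entry, sel)
	}
}
```
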
#### Another Policy Example:
In this case we allocate all the virtual functions from the PF

```yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-nic-2
  namespace: sriov-network-operator
spec:
  deviceType: netdevice
  nicSelector:
    pfNames: ["ens3f0"]
  nodeSelector:
    node-role.kubernetes.io/worker: ""
  numVfs: 10
  priority: 99
  resourceName: sriov_nic_1
  externallyManaged: true
```

The SR-IOV network operator will use all 10 virtual functions created externally by the user.
One of the main use cases for this is when the user wants to apply custom configuration to the PF and VFs, such as loading
out-of-tree drivers or other settings that the operator doesn't support.

#### Validation
The SR-IOV network operator's validation webhook will check that the requested `numVfs` is equal to the number of VFs the user allocated.
If not, it will reject the policy creation.

The validation webhook will also check that the requested MTU is lower than or equal to the MTU that exists on the PF.
If not, it will reject the policy creation.

*Note:* The same validation will be done in the SR-IOV config-daemon container to cover cases where the user doesn't want to deploy
the webhook and to cover scale-up when new nodes are added. If the verification fails in the policy apply stage,
the `sriovNetworkNodeState.status.SyncStatus` field will report a `Failed` status and the error description will
be exposed in `sriovNetworkNodeState.status.LastSyncError`.

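The status reporting described in the note can be sketched as follows. This is a minimal illustration under stated assumptions: `nodeStateStatus`, `recordSync` and `checkMTU` are hypothetical names, and the `Succeeded` value is assumed here since only `Failed` is named in this document.

```go
package main

import "fmt"

// nodeStateStatus mirrors only the two status fields named in this document.
type nodeStateStatus struct {
	SyncStatus    string
	LastSyncError string
}

// checkMTU is a hypothetical helper: the requested MTU must be lower than
// or equal to the MTU found on the externally managed PF.
func checkMTU(requested, pfMTU int) error {
	if requested > pfMTU {
		return fmt.Errorf("requested MTU %d is higher than PF MTU %d", requested, pfMTU)
	}
	return nil
}

// recordSync stores the outcome of a validation or apply attempt.
// "Failed" comes from the text above; "Succeeded" is an assumed value.
func recordSync(status *nodeStateStatus, err error) {
	if err != nil {
		status.SyncStatus = "Failed"
		status.LastSyncError = err.Error()
		return
	}
	status.SyncStatus = "Succeeded"
	status.LastSyncError = ""
}

func main() {
	var status nodeStateStatus
	recordSync(&status, checkMTU(9000, 1500))
	fmt.Printf("%+v\n", status)
	recordSync(&status, checkMTU(1500, 9000))
	fmt.Printf("%+v\n", status)
}
```
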
#### Configuration

The SR-IOV network operator config daemon will reconcile on the SriovNetworkNodeState update and will follow the regular
virtual function flow, *SKIPPING* only the virtual function allocation.

The SR-IOV network operator will update the SR-IOV Network Device Plugin with the pool information.

Another change in the operator behavior is that when a policy that had `externallyManaged: true` is deleted, the SR-IOV network operator
will *NOT* reset the `numVfs`.

### API Extensions

For SriovNetworkNodePolicy

```golang
// SriovNetworkNodePolicySpec defines the desired state of SriovNetworkNodePolicy
type SriovNetworkNodePolicySpec struct {
...
+ // Don't create the virtual functions; only bind them to the driver and allocate them to the device plugin. Defaults to false.
+ ExternallyManaged bool `json:"externallyManaged,omitempty"`
}
```

For SriovNetworkNodeState

```golang
type Interface struct {
...
+ ExternallyManaged bool      `json:"externallyManaged,omitempty"`
}
```

### Implementation Details/Notes/Constraints

#### Webhook
For the webhook, we add more validations when the policy contains `ExternallyManaged: true`:
* `numVfs` in the policy is equal to or lower than the number of virtual functions on the system
* `MTU` in the policy is equal to or lower than the MTU we discover on the PF
* `LinkType` in the policy equals the link type we discover on the PF

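The three webhook checks listed above can be sketched as a single function. This is a minimal sketch, not the operator's webhook: `policyRequest`, `pfState` and `validateExternallyManaged` are hypothetical names, and treating an empty `LinkType` as "no constraint" and comparing link types case-insensitively are assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// policyRequest holds the policy fields the webhook compares (hypothetical type).
type policyRequest struct {
	NumVfs   int
	MTU      int
	LinkType string
}

// pfState holds what was discovered on the externally managed PF (hypothetical type).
type pfState struct {
	NumVfs   int
	MTU      int
	LinkType string
}

// validateExternallyManaged applies the three checks from the list above.
func validateExternallyManaged(req policyRequest, pf pfState) error {
	if req.NumVfs > pf.NumVfs {
		return fmt.Errorf("numVfs %d requested but only %d VFs exist on the PF", req.NumVfs, pf.NumVfs)
	}
	if req.MTU > pf.MTU {
		return fmt.Errorf("MTU %d requested but the PF MTU is %d", req.MTU, pf.MTU)
	}
	// Assumption: an unset LinkType places no constraint on the PF.
	if req.LinkType != "" && !strings.EqualFold(req.LinkType, pf.LinkType) {
		return fmt.Errorf("linkType %q requested but the PF link type is %q", req.LinkType, pf.LinkType)
	}
	return nil
}

func main() {
	pf := pfState{NumVfs: 10, MTU: 9000, LinkType: "ETH"}
	fmt.Println(validateExternallyManaged(policyRequest{NumVfs: 10, MTU: 1500}, pf))
	fmt.Println(validateExternallyManaged(policyRequest{NumVfs: 12, MTU: 1500}, pf))
}
```

Because the same validation also runs in the config daemon, keeping the checks in one pure function like this would let both callers share it.
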
#### Controller/Manager

The changes in the manager for this feature are minimal: we only copy the `ExternallyManaged` boolean from the policy
to the generated `nodeState.Spec`.

#### Config Daemon

This is where most of the changes for this feature are implemented.

* Run the same validation as the webhook to check that the PF has everything we need to apply the requested
policy, by checking the `numVfs`, `MTU` and `LinkType`.
* Skip all the PF configuration such as `numVfs`, `MTU` and `LinkType`. The daemon will only perform the virtual function
driver binding, administrative MAC allocation and MTU.
* For the InfiniBand link type, generate a random node and port GUID for the interface.
* For RDMA (both ETH and IB), perform an unbind/bind of the VF driver to set the RDMA node/port GUID.
* Reset the device plugin so kubelet will be able to discover the SR-IOV devices.

*NOTE:* The config-daemon will also save on the node a cache (file) of the last applied policy. This is needed to determine
whether we need to reset the PF configuration (`ExternallyManaged` was false) or not when the policy is removed.

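The role of the cache on policy removal can be illustrated with a short Go sketch. The cache format is not specified in this document, so the JSON shape, the `cachedPolicy` type and the `shouldResetPF` function below are all assumptions made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cachedPolicy is a hypothetical, trimmed-down shape for one cache entry.
type cachedPolicy struct {
	PFName            string `json:"pfName"`
	NumVfs            int    `json:"numVfs"`
	ExternallyManaged bool   `json:"externallyManaged,omitempty"`
}

// shouldResetPF reports whether the PF configuration should be reset when
// the policy is removed: only when the last applied policy was not
// externally managed.
func shouldResetPF(cache []byte) (bool, error) {
	var last cachedPolicy
	if err := json.Unmarshal(cache, &last); err != nil {
		return false, fmt.Errorf("reading cached policy: %w", err)
	}
	return !last.ExternallyManaged, nil
}

func main() {
	entries := []string{
		`{"pfName":"ens3f0","numVfs":10,"externallyManaged":true}`,
		`{"pfName":"ens3f1","numVfs":4}`,
	}
	for _, entry := range entries {
		reset, err := shouldResetPF([]byte(entry))
		fmt.Printf("%s -> reset=%v err=%v\n", entry, reset, err)
	}
}
```
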
### Upgrade & Downgrade considerations

The feature supports both upgrade and downgrade, as it only introduces a new field in the API.
A downgrade will cause the operator to treat an externally managed PF as not externally managed and actually configure the PF,
which may cause conflicts in the system.

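One way to see why a downgrade loses the setting: a Go struct that lacks the new field silently drops it when decoding with `encoding/json`, which ignores unknown fields by default. The sketch below uses hypothetical, trimmed-down types (`interfaceOld`, `interfaceNew`) rather than the operator's full `Interface` definition.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// interfaceOld models a release that predates the new field (hypothetical, trimmed).
type interfaceOld struct {
	Name   string `json:"name"`
	NumVfs int    `json:"numVfs,omitempty"`
}

// interfaceNew models a release that includes the new field (hypothetical, trimmed).
type interfaceNew struct {
	Name              string `json:"name"`
	NumVfs            int    `json:"numVfs,omitempty"`
	ExternallyManaged bool   `json:"externallyManaged,omitempty"`
}

// asSeenByOldRelease encodes a spec with the new type, decodes it with the
// old type, and re-encodes it to show which fields survive.
func asSeenByOldRelease(spec interfaceNew) (string, error) {
	data, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	var old interfaceOld
	// Unknown fields such as externallyManaged are ignored here.
	if err := json.Unmarshal(data, &old); err != nil {
		return "", err
	}
	out, err := json.Marshal(old)
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	out, err := asSeenByOldRelease(interfaceNew{Name: "ens3f0", NumVfs: 10, ExternallyManaged: true})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out)
}
```

With the flag gone, the older code path has no way to know that the PF should be left alone.
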
### Test Plan

* Should not allow creating a policy with externallyManaged true if there are no VFs configured
* Should create a policy if the number of requested VFs is equal to the number configured on the PF
* Should create a policy if the number of requested VFs is equal and not delete them when the policy is removed
* Should reset the virtual functions if externallyManaged is false
* Should be able to configure a policy with externallyManaged true if there are no VFs configured and the webhook is disabled