---
title: QoS Resource Manager
authors:
  - "csfldf"
reviewers:
  - "waynepeking348"
  - "xuchen-xiaoying"
creation-date: 2022-10-18
last-updated: 2023-02-22
status: implemented
see-also:
  - "https://github.com/kubewharf/enhanced-k8s/tree/70544afae9af1c7e1d129069e80f4eceb7d039f5/docs/design/qos-resource-manager"
---

# QoS Resource Manager

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
    - [Story 1: Adjust resource allocation results dynamically in QoS aware systems](#story-1-adjust-resource-allocation-results-dynamically-in-qos-aware-systems)
    - [Story 2: Expand customized resource allocation policies in user-developed plugins](#story-2-expand-customized-resource-allocation-policies-in-user-developed-plugins)
    - [Story 3: Allocate shared devices with NUMA affinity for multiple pods](#story-3-allocate-shared-devices-with-numa-affinity-for-multiple-pods)
  - [Risks and Mitigations](#risks-and-mitigations)
    - [UX](#ux)
- [Design Details](#design-details)
  - [Detailed Working Flow](#detailed-working-flow)
    - [Synchronous Pod Admission](#synchronous-pod-admission)
    - [Asynchronous Resource Adjustment](#asynchronous-resource-adjustment)
  - [Pod Resources Checkpoint](#pod-resources-checkpoint)
  - [Simulation: how QRM works](#simulation-how-qrm-works)
    - [Example 1: running a QoS System using QRM](#example-1-running-a-qos-system-using-qrm)
      - [Initialize plugins](#initialize-plugins)
      - [Admit pod with online role](#admit-pod-with-online-role)
      - [Admit pod with offline role](#admit-pod-with-offline-role)
      - [Admit another pod with online role](#admit-another-pod-with-online-role)
      - [Periodically adjust resource allocation](#periodically-adjust-resource-allocation)
    - [Example 2: allocate NUMA-affinity resources with extended policy](#example-2-allocate-numa-affinity-resources-with-extended-policy)
      - [Initialize plugins](#initialize-plugins-1)
      - [Admit pod with storage-service role](#admit-pod-with-storage-service-role)
      - [Admit pod with reranker role](#admit-pod-with-reranker-role)
    - [Example 3: allocate shared NUMA affinitive NICs](#example-3-allocate-shared-numa-affinitive-nics)
      - [Initialize plugins](#initialize-plugins-2)
      - [Admit pod with numa-sharing && cpu-exclusive role](#admit-pod-with-numa-sharing--cpu-exclusive-role)
      - [Admit another pod with the same role](#admit-another-pod-with-the-same-role)
  - [New Flags and Configuration of QRM](#new-flags-and-configuration-of-qrm)
    - [Feature Gate Flag](#feature-gate-flag)
    - [QRM Reconcile Period Flag](#qrm-reconcile-period-flag)
    - [How this proposal affects the kubelet ecosystem](#how-this-proposal-affects-the-kubelet-ecosystem)
      - [Container Manager](#container-manager)
      - [Topology Manager](#topology-manager)
      - [kubeGenericRuntimeManager](#kubegenericruntimemanager)
      - [Kubelet Node Status Setter](#kubelet-node-status-setter)
      - [Pod Resources Server](#pod-resources-server)
  - [Test Plan](#test-plan)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
    - [How can this feature be enabled / disabled in a live cluster?](#how-can-this-feature-be-enabled--disabled-in-a-live-cluster)
    - [Does enabling the feature change any default behavior?](#does-enabling-the-feature-change-any-default-behavior)
    - [Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?](#can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement)
    - [What happens if we reenable the feature if it was previously rolled back?](#what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back)
    - [Are there any tests for feature enablement/disablement?](#are-there-any-tests-for-feature-enablementdisablement)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
    - [How can a rollout or rollback fail? Can it impact already running workloads?](#how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads)
    - [What specific metrics should inform a rollback?](#what-specific-metrics-should-inform-a-rollback)
    - [Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?](#were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested)
    - [Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?](#is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc)
  - [Monitoring Requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
    - [How does this feature react if the API server and/or etcd is unavailable?](#how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable)
    - [What are other known failure modes?](#what-are-other-known-failure-modes)
    - [What steps should be taken if SLOs are not being met to determine the problem?](#what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Appendix](#appendix)
  - [Related Features](#related-features)
<!-- /toc -->

## Summary

Although the CPU Manager and Memory Manager in kubelet can allocate `cpuset.cpus` and `cpuset.mems`
with NUMA affinity, they have several restrictions and are difficult to customize, since all of their policies share
the same checkpoint.

* For instance, only pods with the Guaranteed QoS class can be allocated exclusive `cpuset.cpus` and `cpuset.mems`,
but in each individual production environment, the original Kubernetes QoS classes may not be flexible enough to describe
workloads with different QoS requirements.
* Besides, the allocation logic works in a static way, because it relies only on the numerical value of each resource,
without considering the running state of each node.
* Finally, the current implementation is not pluggable, so if new policies or additional resource managers
(like disk quota or network bandwidth) are needed, we have to modify the kubelet source code and
upgrade kubelet across clusters, which would be costly.

Thus, we propose the `QoS Resource Manager` (abbreviated to `QRM` later in the article) as a new component in the kubelet
ecosystem. It extends the ability of resource allocation in the admission phase, and enables dynamic resource allocation
adjustment for pods with better flexibility.
QRM works in a similar way to the Device Manager, and the resource allocation logic is implemented in external plugins.
It will then periodically collect the latest resource allocation results, along with real-time node running states, and
assemble them as parameters to update through the standard CRI interface, like `cpuset.cpus`, `cpuset.mems` or any other
potential resources needed in the future.

In this way, the allocation and adjustment logic is offloaded to different plugins, and can be customized by
user-defined QoS requirements. In addition, we can implement setting and adjustment logic in plugins for cgroup
parameters supported in the LinuxContainerResources config (eg. `memory.oom_control`, `io.weight`); QRM will set and
update those parameters for pods when the corresponding plugins are registered.

Currently, we have already implemented the QRM framework and multiple plugins, and they are already running in production
to support QoS-aware and heterogeneous systems.

## Motivation

### Goals

* **Pluggable:** make it easier to extend additional resource allocation and adjustment (NUMA affinitive NICs, network or memory
bandwidth, disk capacity etc.) without modifying kubelet source code.
* **Adjustable:** dynamically adjust resource allocation and QoS-control strategies according to the real-time node running states.
* **Customizable:** each resource plugin can perform customized QoS management according to its specific QoS definition.

### Non-Goals

* Expand or overturn the current pod QoS definitions; they will remain Guaranteed, Burstable and BestEffort.
Instead, users should use common annotations to reflect customized QoS types as needed.
* Replace the current implementation of the CPU Manager and Memory Manager; they will still work as general resource
allocation components to match the native QoS definitions.

## Proposal

QRM is a new component of the kubelet ecosystem proposed to extend resource allocation policies. Besides:
* QRM is also a hint provider for Topology Manager, like Device Manager.
* The hints are intended to indicate the preferred resource affinity, and bind the resources for a container
either to a single NUMA node or to a group of NUMA nodes.
* QRM is not restricted to any native QoS definition; instead, it passes the pod metadata to plugins, and plugins
handle it with customized policies.

### User Stories

#### Story 1: Adjust resource allocation results dynamically in QoS aware systems

To improve resource utilization, overselling, either by VPA or by running complementary workloads on one node,
is commonly used in production environments. As a result, the resource consumption state is always in flux,
so static resource allocation (i.e. `cpuset.cpus` or `cpu.cfs_quota_us`) is not enough, especially for workloads
with high performance requirements. A real-time, customized and dynamic mechanism for adjusting resource allocation
results is therefore needed.

The dynamic adjustment of resource allocation results is usually closely tied to the implementation of QoS aware
systems and workload characteristics, so it would be better to provide a general framework in kubelet and offload
the resource allocation logic to plugins. QRM works as such a framework.
#### Story 2: Expand customized resource allocation policies in user-developed plugins

The native CPU/Memory Manager requires that only pods with Guaranteed QoS can be allocated exclusive `cpuset.cpus`
or `cpuset.mems`, but this abstraction lacks flexibility.

For instance, in a hybrid cluster, `offline ETL workloads`, `latency-sensitive web services` and `storage services`
may run on the same node. In this case, there may be three kinds of `cpuset.cpus` pools: one for offline workloads
with shared `cpuset.cpus`, one for web services with shared or exclusive `cpuset.cpus`, and one for storage services
with exclusive `cpuset.cpus`. The same logic may be required for `cpuset.mems` allocation.

In other words, we need a `role-based` or `fine-grained` QoS classification and corresponding resource allocation
logic, which is hard to implement in the general CPU/Memory Manager, but can be implemented in user-developed plugins.

#### Story 3: Allocate shared devices with NUMA affinity for multiple pods

Consider a node that has multiple NUMA nodes and network interfaces, and pods scheduled to this node want to stick
to a single NUMA node and only use the network interface affiliated with that NUMA node.

In this case, multiple pods need to be allocated the same network device. Although the Device Manager and
device plugins could be used, they only allow `exclusive mode`, i.e., a device can only be allocated to one specific
container. A possible workaround is to allocate a `fake device` and set its amount to a large enough value, but
a `fake device` is kind of weird for end users to request as a resource.

With the help of QRM, we can express this `implicit` allocation requirement in annotations and make the customized
plugins support it.

### Risks and Mitigations

#### UX

To improve the UX, the number of new kubelet flags is kept to a minimum. The minimal set of kubelet flags
necessary to configure the QoS Resource Manager is presented in this section.

## Design Details

![](/docs/imgs/qrm-design-overview.png)

As shown in the figure above, QRM works as both a plugin handler added to the kubelet plugin manager,
and a hint provider for Topology Manager.

As a plugin handler, QRM is responsible for the registration of new plugins, and brings the resource
allocation results into effect through the standard CRI interface. The detailed strategy is actually implemented
in plugins, including NUMA affinity calculation, resource allocation, and dynamic adjustment of resources or cgroup
parameter control knobs (eg. `memory.oom_control`, `io.weight`, ...). Based on the dynamic plugin discovery functionality
in kubelet, plugins register with QRM automatically and take effect during the pod lifecycle.

As a hint provider, QRM obtains the preferred NUMA affinity hints for a container's resources from the corresponding
registered plugins.
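
To make the two roles above concrete, below is a minimal Go sketch of the plugin surface that QRM drives. It is assembled only from the method names used in this proposal (`GetTopologyHints()`, `Allocate()`, `GetResourcesAllocation()`); the interface name, field lists and signatures are illustrative assumptions, not the actual gRPC service and message definitions in the katalyst API.
```
// Illustrative sketch only: the real resource plugin API is a gRPC service in
// the katalyst API repository; the Go types below are trimmed placeholders.
package qrm

import "context"

// ResourceRequest carries pod/container metadata, the pod role and resource QoS
// type extracted from pod annotations, the requested quantities, and the best
// NUMA affinity hint already merged by Topology Manager.
type ResourceRequest struct {
	PodUid        string
	PodNamespace  string
	PodName       string
	ContainerName string
	PodRole       string
	ResourceName  string
	Hint          []int // preferred NUMA nodes chosen by Topology Manager
}

// Placeholder response types; see the ResourceAllocationInfo structure in the
// "Pod Resources Checkpoint" section for the fields an allocation carries.
type (
	ResourceHintsResponse          struct{}
	ResourceAllocationResponse     struct{}
	GetResourcesAllocationResponse struct{}
)

// ResourcePlugin is the contract a QRM resource plugin fulfils.
type ResourcePlugin interface {
	// GetTopologyHints returns the preferred NUMA affinity hints for the
	// resource a container requests; QRM forwards them to Topology Manager.
	GetTopologyHints(ctx context.Context, req *ResourceRequest) (*ResourceHintsResponse, error)

	// Allocate reserves the resource for the container and returns the
	// allocation result (OCI property name, result string, envs, annotations).
	Allocate(ctx context.Context, req *ResourceRequest) (*ResourceAllocationResponse, error)

	// GetResourcesAllocation returns the latest allocation results for all
	// active containers; QRM calls it from its periodic reconcileState() loop.
	GetResourcesAllocation(ctx context.Context) (*GetResourcesAllocationResponse, error)
}
```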
### Detailed Working Flow

![](/docs/imgs/qrm-detailed-working-flow.png)

The figure above illustrates the workflow of QRM, including two major processes:
* the synchronous workflow of pod admission and resource allocation
* and the asynchronous workflow of periodical resource adjustment

#### Synchronous Pod Admission

Once kubelet requests a pod admission, for each container in the pod, Topology Manager will query QRM (along with
other hint providers if needed) about the preferred NUMA affinity for each resource that the container requests.

QRM will call `GetTopologyHints()` of the resource plugin to get the preferred NUMA affinity for this resource,
and return hints for each resource to Topology Manager. Topology Manager then figures out which NUMA node or
group of NUMA nodes is the best fit for affinity-aligned allocation of resources/devices, after merging hints from
all hint providers.

After getting the best fit, Topology Manager will call `Allocate()` of all hint providers for the container.
When `Allocate()` is called, QRM will assemble a ResourceRequest for each resource. The ResourceRequest contains
pod and container metadata, pod `role` and resource `QoS type`, the requested resource name and request/limit quantities,
and the best hint obtained in the previous step. Note that pod role and resource QoS type are newly defined,
and are extracted from pod annotations by the keys `kubernetes.io/pod-role` and `kubernetes.io/resource-type`.
They are used to uniquely identify the pod QoS type, which will influence the allocation results.

QRM will then call `Allocate()` of the resource plugin with the ResourceRequest and get a ResourceAllocationResponse.
The ResourceAllocationResponse contains the properties AllocatedQuantity, AllocatationResult, OciPropertyName, Envs,
Annotations, etc. A possible ResourceAllocationResponse example for the QRM CPU plugin would look like:
```
* AllocatedQuantity: 4
* OciPropertyName: CpusetCpus (matching the property name in LinuxContainerResources)
* AllocatationResult: "0-1,8-9" (matching the cgroup cpuset.cpus format)
```

After successfully getting the ResourceAllocationResponse, QRM will cache the allocation result in the pod resources
checkpoint, and make the checkpoint persistent by writing the checkpoint file.

In the PreCreateContainer phase, kubeGenericRuntimeManager will indirectly call `GetResourceRunContainerOptions()` of QRM,
and the allocation results for the container cached in the pod resources checkpoint will be populated into the
[LinuxContainerResources](https://github.com/kubernetes/cri-api/blob/master/pkg/apis/runtime/v1alpha2/api.pb.go#L3075)
config of the CRI API via a reflection mechanism.

The completed LinuxContainerResources config will be embedded in the
[ContainerConfig](https://github.com/kubernetes/cri-api/blob/master/pkg/apis/runtime/v1alpha2/api.pb.go#L4137),
and be passed as a parameter of the
[`CreateContainer()`](https://github.com/kubernetes/cri-api/blob/master/pkg/apis/services.go#L35) API. In this way,
the resource allocation results or cgroup parameter control knobs generated by QRM are taken into the runtime.

The overall calculation is performed for all containers in the pod, and if none of the containers is rejected, the pod
finally becomes admitted and deployed.
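
As an illustration of the reflection step mentioned above, the sketch below shows how an allocation result could be copied into the `LinuxContainerResources` field named by `OciPropertyName`. The trimmed struct shapes and the helper `applyAllocation` are assumptions made for this example, not the actual code behind `GetResourceRunContainerOptions()`.
```
package main

import (
	"fmt"
	"reflect"
)

// Trimmed stand-in for the CRI LinuxContainerResources message.
type LinuxContainerResources struct {
	CpusetCpus string
	CpusetMems string
}

// Trimmed stand-in for one entry of the pod resources checkpoint.
type ResourceAllocationInfo struct {
	OciPropertyName    string
	AllocatationResult string
}

// applyAllocation sets resources.<OciPropertyName> = AllocatationResult.
func applyAllocation(resources *LinuxContainerResources, info ResourceAllocationInfo) error {
	if info.OciPropertyName == "" {
		// Nothing to populate (e.g. a NIC plugin returning only envs/annotations).
		return nil
	}
	field := reflect.ValueOf(resources).Elem().FieldByName(info.OciPropertyName)
	if !field.IsValid() || field.Kind() != reflect.String {
		return fmt.Errorf("unknown or non-string OCI property %q", info.OciPropertyName)
	}
	field.SetString(info.AllocatationResult)
	return nil
}

func main() {
	res := &LinuxContainerResources{}
	// Allocation results as the QRM CPU/Memory plugins might return them.
	for _, info := range []ResourceAllocationInfo{
		{OciPropertyName: "CpusetCpus", AllocatationResult: "0-37,40-77"},
		{OciPropertyName: "CpusetMems", AllocatationResult: "0"},
	} {
		if err := applyAllocation(res, info); err != nil {
			panic(err)
		}
	}
	fmt.Printf("%+v\n", res) // {CpusetCpus:0-37,40-77 CpusetMems:0}
}
```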
#### Asynchronous Resource Adjustment

Dynamic resource adjustment is provided as an enhancement over static resource allocation, and is needed in
some cases (e.g. the QoS aware resource adjustment for resource utilization improvement mentioned in user story 1 above).

To support this case, QRM invokes `reconcileState()` periodically, which in turn calls `GetResourcesAllocation()` of
every registered plugin to get the latest resource allocation results. QRM will then update the pod resources checkpoint and call
[`UpdateContainerResources()`](https://github.com/kubernetes/cri-api/blob/master/pkg/apis/services.go#L47)
to update the cgroup configs for containers according to the latest resource allocation results.

### Pod Resources Checkpoint

The pod resources checkpoint is a cache for QRM to keep track of the resource allocation results made by
registered resource plugins for all active containers and their resource requests. The structure is shown below:
```
type ResourceAllocation map[string]*ResourceAllocationInfo // Keyed by resourceName
type ContainerResources map[string]ResourceAllocation      // Keyed by containerName
type PodResources map[string]ContainerResources            // Keyed by podUID

type podResourcesChk struct {
	sync.RWMutex
	resources PodResources // Keyed by podUID
}

type ResourceAllocationInfo struct {
	OciPropertyName  string `protobuf:"bytes,1,opt,name=oci_property_name,json=ociPropertyName,proto3" json:"oci_property_name,omitempty"`
	IsNodeResource   bool   `protobuf:"varint,2,opt,name=is_node_resource,json=isNodeResource,proto3" json:"is_node_resource,omitempty"`
	IsScalarResource bool   `protobuf:"varint,3,opt,name=is_scalar_resource,json=isScalarResource,proto3" json:"is_scalar_resource,omitempty"`
	// only for resources with true value of IsScalarResource
	AllocatedQuantity  float64           `protobuf:"fixed64,4,opt,name=allocated_quantity,json=allocatedQuantity,proto3" json:"allocated_quantity,omitempty"`
	AllocatationResult string            `protobuf:"bytes,5,opt,name=allocatation_result,json=allocatationResult,proto3" json:"allocatation_result,omitempty"`
	Envs               map[string]string `protobuf:"bytes,6,rep,name=envs,proto3" json:"envs,omitempty" protobuf_key:"bytes,1,opt,name=key,proto3" protobuf_val:"bytes,2,opt,name=value,proto3"`
	Annotations        map[string]string `protobuf:"bytes,7,rep,name=annotations,proto3" json:"annotations,omitempty" protobuf_key:"bytes,1,opt,name=key,proto3" protobuf_val:"bytes,2,opt,name=value,proto3"`
	ResourceHints      *ListOfTopologyHints `protobuf:"bytes,8,opt,name=resource_hints,json=resourceHints,proto3" json:"resource_hints,omitempty"`
}
```

PodResources is organized as a three-layer map, using pod UID, container name and resource name
as the keys of each level. ResourceAllocationInfo is stored in the lowest map, and contains the allocation result
of a specific resource for the identified container. ResourceAllocationInfo currently has these properties:
```
* OciPropertyName
  - it's used to identify which property of the LinuxContainerResources config the allocation result should be populated into.
* IsScalarResource
  - if it's true, the resource allocation result can be quantified, and possibly be used as the foundation of scheduling and
    admitting logic. QRM will compare the requested quantity with the allocated quantity when the active container is re-admitted.
* IsNodeResource
  - if this property and IsScalarResource are both true, QRM will expose the "allocatable" and "capacity" quantities of the
    resource to the kubelet node status setter, which finally sets them into the node status.
  - For instance, quantified resources that are already covered by the kubelet node status setter (eg. cpu, memory, ...)
    don't need to be set into the node status again, so IsNodeResource should be set to false for them. Otherwise, we
    should set IsNodeResource to true for extended quantified resources, so their quantities are set into the node status.
* AllocatedQuantity
  - it represents the allocated quantity of the resource for the container, and is only used for resources with IsScalarResource
    set to true.
* AllocatationResult
  - it represents the resource allocation result for the container, and must be a valid value of the property in
    LinuxContainerResources that OciPropertyName indicates. For example, if OciPropertyName is CpusetCpus, the
    AllocatationResult should be like "0-1,8-9" (a valid value for cgroup cpuset.cpus).
* Envs
  - environment variables that the resource plugin returns and that should be set in the container.
* Annotations
  - annotations that the resource plugin returns and that should be passed to the runtime.
* ResourceHints
  - it's the preferred NUMA affinity matching the AllocatationResult. It will be used when kubelet restarts and
    active containers with allocated resources are re-admitted.
```

### Simulation: how QRM works

#### Example 1: running a QoS System using QRM

Assume that we have a machine with an Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
(80 CPUs, 0-39 in NUMA node0, 40-79 in NUMA node1).

We divide the CPUs in this machine into two pools, one for common online micro-services (eg. web services),
and the other for common offline workloads (eg. ETL or video transcoding tasks).
And we deploy a QoS aware system to adjust the size of those two pools according to performance metrics.

##### Initialize plugins

The QRM CPU plugin starts to work:
- Initialize the two `cpuset.cpus` pools
  - suppose `0-37,40-77` for the online pool, and `38-39,78-79` for the offline pool by default
- The QRM CPU plugin is discovered by the kubelet plugin manager dynamically and registers with QRM

##### Admit pod with online role

Suppose pod1 arrives with one container requesting 4 CPUs, and its pod role is online-micro_service,
meaning that it should be placed in the online pool.

In the pod1 admission phase, QRM calls `Allocate()` of the QRM CPU plugin. The plugin identifies that the container belongs to the
online `cpuset.cpus` pool, so it returns a `ResourceAllocationResponse` like below:
```
{
  "pod_uid": "054a696f-d176-488d-b228-d6046faaf67c",
  "pod_namespace": "default",
  "pod_name": "pod1",
  "container_name": "container0",
  "container_type": "MAIN",
  "container_index": 0,
  "pod_role": "online-micro_service",
  "resource_name": "cpu",
  "allocatation_result": {
    "resource_allocation": {
      "cpu": {
        "oci_property_name": "CpusetCpus",
        "is_node_resource": false,
        "is_scalar_resource": true,
        "allocated_quantity": 76,
        "allocatation_result": "0-37,40-77",
        "resource_hints": [] // no NUMA preference
      }
    }
  }
}
```

QRM caches the allocation result in the pod resources checkpoint.
In the PreCreateContainer phase, the kubeGenericRuntimeManager calls `GetResourceRunContainerOptions()` of
QRM and gets a `LinuxContainerResources` config like below:
```
{
  "cpusetCpus": "0-37,40-77",
}
```

pod1 starts successfully with the `cpuset.cpus` allocated.

##### Admit pod with offline role

Suppose pod2 arrives with one container requesting 2 CPUs, and its pod role is ETL,
meaning that it should be placed in the offline pool.

In the pod2 admission phase, QRM calls `Allocate()` of the QRM CPU plugin. The plugin identifies
that the container belongs to the offline `cpuset.cpus` pool, so it returns a `ResourceAllocationResponse` like below:
```
{
  "pod_uid": "fe8d9c25-6fb4-4983-908f-08e39ebeafe7",
  "pod_namespace": "default",
  "pod_name": "pod2",
  "container_name": "container0",
  "container_type": "MAIN",
  "container_index": 0,
  "pod_role": "ETL",
  "resource_name": "cpu",
  "allocatation_result": {
    "resource_allocation": {
      "cpu": {
        "oci_property_name": "CpusetCpus",
        "is_node_resource": false,
        "is_scalar_resource": true,
        "allocated_quantity": 4,
        "allocatation_result": "38-39,78-79",
        "resource_hints": [] // no NUMA preference
      }
    }
  }
}
```

QRM caches the allocation result in the pod resources checkpoint.

In the PreCreateContainer phase, the kubeGenericRuntimeManager calls `GetResourceRunContainerOptions()` of
QRM and gets a `LinuxContainerResources` config like below:
```
{
  "cpusetCpus": "38-39,78-79",
}
```
pod2 starts successfully with the `cpuset.cpus` allocated.

##### Admit another pod with online role

Similar to pod1, if a pod3 with the online-micro_service role arrives, it will be placed in the online
pool too. So it will get a `LinuxContainerResources` config like below in the `PreCreateContainer` phase:
```
{
  "cpusetCpus": "0-37,40-77",
}
```

##### Periodically adjust resource allocation

After a period of time, the QoS aware system adjusts the online `cpuset.cpus` pool to `0-13,40-53` and
the offline `cpuset.cpus` pool to `14-39,54-79`, according to the system indicators.

QRM invokes `reconcileState()` and gets the latest resource allocation results (like below) from the QRM CPU plugin.
It then updates the pod resources checkpoint, and calls `UpdateContainerResources()` to update the cgroup resources
for containers according to the latest resource allocation results.
```
{
  "pod_resources": {
    "054a696f-d176-488d-b228-d6046faaf67c": { // pod1
      "container0": {
        "cpu": {
          "oci_property_name": "CpusetCpus",
          "is_node_resource": false,
          "is_scalar_resource": true,
          "allocated_quantity": 28,
          "allocatation_result": "0-13,40-53",
          "resource_hints": [] // no NUMA preference
        }
      }
    },
    "fe8d9c25-6fb4-4983-908f-08e39ebeafe7": { // pod2
      "container0": {
        "cpu": {
          "oci_property_name": "CpusetCpus",
          "is_node_resource": false,
          "is_scalar_resource": true,
          "allocated_quantity": 52,
          "allocatation_result": "14-39,54-79",
          "resource_hints": [] // no NUMA preference
        }
      }
    },
    "26731da7-b283-488b-b232-cff611c914e1": { // pod3
      "container0": {
        "cpu": {
          "oci_property_name": "CpusetCpus",
          "is_node_resource": false,
          "is_scalar_resource": true,
          "allocated_quantity": 28,
          "allocatation_result": "0-13,40-53",
          "resource_hints": [] // no NUMA preference
        }
      }
    }
  }
}
```

#### Example 2: allocate NUMA-affinity resources with extended policy

Assume that we have a machine with an Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
(80 CPUs, 0-39 in NUMA node0, 40-79 in NUMA node1; 452 GB memory, 226 GB in NUMA node0, 226 GB in NUMA node1).

And we have multiple latency-sensitive services, including storage services, and the re-caller, retriever and re-ranker services
of information retrieval systems, etc. To meet QoS requirements, we should pin them to exclusive
NUMA nodes by setting `cpuset.cpus` and `cpuset.mems`.

Although the default CPU/Memory Managers already provide the ability to set `cpuset.cpus` and `cpuset.mems`
for containers, the current policies only target the Guaranteed pod QoS class, and can't handle container anti-affinity
at the NUMA node scope with flexibility.

##### Initialize plugins

The QRM CPU/Memory plugins start to work:
- The QRM CPU plugin initializes its checkpoint with the default machine state like below:
```
{
  "machineState": {
    "0": {
      "reserved_cpuset": "0-1",
      "allocated_cpuset": "",
      "default_cpuset": "2-39"
    },
    "1": {
      "reserved_cpuset": "40-41",
      "allocated_cpuset": "",
      "default_cpuset": "42-79"
    }
  },
  "pod_entries": {}
}
```
- The QRM Memory plugin initializes its checkpoint with the default machine state like below:
```
{
  "machineState": {
    "memory": {
      "0": {
        "Allocated": "0",
        "allocatable": "237182648320",
        "free": "237182648320",
        "pod_entries": {},
        "systemReserved": "524288000",
        "total": "237706936320"
      },
      "1": {
        "Allocated": "0",
        "allocatable": "237282263040",
        "free": "237282263040",
        "pod_entries": {},
        "systemReserved": "524288000",
        "total": "237806551040"
      }
    }
  },
  "pod_resource_entries": {}
}
```
- All QRM plugins are discovered by the kubelet plugin manager dynamically and register with QRM

##### Admit pod with storage-service role

A pod1 arrives with one container requesting 20 CPUs and 40GB of memory, and its pod role is storage-service.

In the pod1 admission phase, QRM calls `GetTopologyHints()` of the QRM plugins, and gets the preferred NUMA affinity
hint (10) from both of them.
Then QRM calls `Allocate()` of the plugins:
- calls `Allocate()` of the QRM CPU plugin and gets a `ResourceAllocationResponse` like below:
```
{
  "pod_uid": "bc9e28df-1b5c-4099-8866-6110277184e0",
  "pod_namespace": "default",
  "pod_name": "pod1",
  "container_name": "container0",
  "container_type": "MAIN",
  "container_index": 0,
  "pod_role": "storage-service",
  "resource_name": "cpu",
  "allocatation_result": {
    "resource_allocation": {
      "cpu": {
        "oci_property_name": "CpusetCpus",
        "is_node_resource": false,
        "is_scalar_resource": true,
        "allocated_quantity": 20,
        "allocatation_result": "2-21",
        "resource_hints": [
          {
            "nodes": [0],
            "Preferred": true
          }
        ]
      }
    }
  }
}
```
- calls `Allocate()` of the QRM Memory plugin and gets a `ResourceAllocationResponse` like below:
```
{
  "pod_uid": "bc9e28df-1b5c-4099-8866-6110277184e0",
  "pod_namespace": "default",
  "pod_name": "pod1",
  "container_name": "container0",
  "container_type": "MAIN",
  "container_index": 0,
  "pod_role": "storage-service",
  "resource_name": "memory",
  "allocatation_result": {
    "resource_allocation": {
      "memory": {
        "oci_property_name": "CpusetMems",
        "is_node_resource": false,
        "is_scalar_resource": true,
        "allocated_quantity": 1,
        "allocatation_result": "0",
        "resource_hints": [
          {
            "nodes": [0],
            "Preferred": true
          }
        ]
      }
    }
  }
}
```
QRM caches the allocation results in the pod resources checkpoint.
In the `PreCreateContainer` phase, the kubeGenericRuntimeManager calls `GetResourceRunContainerOptions()` of
QRM and gets a `LinuxContainerResources` config like below:
```
{
  "cpusetCpus": "2-21",
  "cpusetMems": "0",
}
```
pod1 starts successfully with the `cpuset.cpus` and `cpuset.mems` allocated.

##### Admit pod with reranker role

A pod2 arrives with one container requesting 10 CPUs and 20GB of memory, and its pod role is reranker.

Although the quantity of available resources is enough for pod2, the QRM plugins identify from the pod roles that pod1 and
pod2 should follow an anti-affinity requirement at the NUMA node scope.
So the `ResourceAllocationResponses`
for pod2 from the QRM CPU/Memory plugins are like below:
```
{
  "pod_uid": "6f695526-b07c-4baa-90e3-af1dfed2faf8",
  "pod_namespace": "default",
  "pod_name": "pod2",
  "container_name": "container0",
  "container_type": "MAIN",
  "container_index": 0,
  "pod_role": "reranker",
  "resource_name": "cpu",
  "allocatation_result": {
    "resource_allocation": {
      "cpu": {
        "oci_property_name": "CpusetCpus",
        "is_node_resource": false,
        "is_scalar_resource": true,
        "allocated_quantity": 10,
        "allocatation_result": "42-51",
        "resource_hints": [
          {
            "nodes": [1],
            "Preferred": true
          }
        ]
      }
    }
  }
}

{
  "pod_uid": "6f695526-b07c-4baa-90e3-af1dfed2faf8",
  "pod_namespace": "default",
  "pod_name": "pod2",
  "container_name": "container0",
  "container_type": "MAIN",
  "container_index": 0,
  "pod_role": "reranker",
  "resource_name": "memory",
  "allocatation_result": {
    "resource_allocation": {
      "memory": {
        "oci_property_name": "CpusetMems",
        "is_node_resource": false,
        "is_scalar_resource": true,
        "allocated_quantity": 1,
        "allocatation_result": "1",
        "resource_hints": [
          {
            "nodes": [1],
            "Preferred": true
          }
        ]
      }
    }
  }
}
```

QRM caches the allocation results in the pod resources checkpoint. In the `PreCreateContainer` phase,
the `kubeGenericRuntimeManager` calls `GetResourceRunContainerOptions()` of QRM and
gets a `LinuxContainerResources` config like below:
```
{
  "cpusetCpus": "42-51",
  "cpusetMems": "1",
}
```
pod2 starts successfully with the `cpuset.cpus` and `cpuset.mems` allocated.

#### Example 3: allocate shared NUMA affinitive NICs

Assume that we have a machine with an Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
(80 CPUs, 0-39 in NUMA node0, 40-79 in NUMA node1) and 2 Mellanox Technologies MT28841 NICs with a speed of 25000Mb/s
(eth0 is affinitive to NUMA node0, eth1 is affinitive to NUMA node1).

##### Initialize plugins

The QRM CPU/Memory/NIC plugins start to work:
- The QRM CPU/Memory plugins initialize their checkpoints with the default machine state (same as example 2)
- The QRM NIC plugin initializes its checkpoint with the default machine state containing NIC information organized by NUMA affinity:
```
{
  "machineState": {
    "0": {
      "nics": [
        {
          "interface_name": "eth0",
          "numa_node": 0,
          "address": {
            "ipv6": "fdbd:dc05:3:154::20"
          }
        }
      ]
    },
    "1": {
      "nics": [
        {
          "interface_name": "eth1",
          "numa_node": 1,
          "address": {
            "ipv6": "fdbd:dc05:3:155::20"
          },
          "netns": "/var/run/netns/ns1" // must be filled in if the NIC is added to a non-default network namespace
        }
      ]
    }
  },
  "pod_entries": {}
}
```
- All QRM plugins are discovered by the kubelet plugin manager dynamically and register with QRM

##### Admit pod with numa-sharing && cpu-exclusive role

We assume that all related resources in NUMA node0 have been allocated and all allocatable resources in
NUMA node1 are still available. A pod1 arrives with one container requesting 10 CPUs and 20GB of memory,
with hostNetwork set to true, and its pod role is `numa-sharing && cpu-exclusive` (`numa-enhancement` for short).
This role means the pod can be located in the same NUMA node as other pods with the same role,
but it should be allocated exclusive `cpuset.cpus`.
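
For illustration, such a pod might be declared like below, assuming the role is conveyed through the `kubernetes.io/pod-role` annotation introduced earlier; the image name is a hypothetical placeholder:
```
{
  "apiVersion": "v1",
  "kind": "Pod",
  "metadata": {
    "name": "pod1",
    "annotations": {
      "kubernetes.io/pod-role": "numa-enhancement" // assumed annotation value for this role
    }
  },
  "spec": {
    "hostNetwork": true,
    "containers": [
      {
        "name": "container0",
        "image": "example.com/app:latest", // hypothetical image
        "resources": {
          "requests": { "cpu": "10", "memory": "20Gi" },
          "limits": { "cpu": "10", "memory": "20Gi" }
        }
      }
    ]
  }
}
```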
In the pod1 admission phase, QRM calls `GetTopologyHints()` of the QRM plugins, and gets the preferred NUMA affinity
hint (01) from all of them. Then QRM calls `Allocate()` of the plugins:
- calls `Allocate()` of the QRM CPU/Memory plugins and gets `ResourceAllocationResponses` similar to example 2
- calls `Allocate()` of the QRM NIC plugin and gets a `ResourceAllocationResponse` like below:
```
{
  "pod_uid": "c44ba6fd-2ce5-43e0-8d4d-b2224dfaeebd",
  "pod_namespace": "default",
  "pod_name": "pod1",
  "container_name": "container0",
  "container_type": "MAIN",
  "container_index": 0,
  "pod_role": "numa-enhancement",
  "resource_name": "NIC",
  "allocatation_result": {
    "resource_allocation": {
      "NIC": {
        "oci_property_name": "",
        "is_node_resource": false,
        "is_scalar_resource": false,
        "resource_hints": [
          {
            "nodes": [1],
            "Preferred": true
          }
        ],
        "envs": {
          "AFFINITY_NIC_ADDR_IPV6": "fdbd:dc05:3:155::20"
        },
        "annotations": {
          "kubernetes.io/host-netns-path": "/var/run/netns/ns1"
        }
      }
    }
  }
}
```

QRM caches the allocation results in the pod resources checkpoint.
In the `PreCreateContainer` phase, the kubeGenericRuntimeManager calls `GetResourceRunContainerOptions()`
of QRM and gets a `ResourceRunContainerOptions` like below, and its content will be filled into `ContainerConfig`:
```
{
  "envs": {
    "AFFINITY_NIC_ADDR_IPV6": "fdbd:dc05:3:155::20",
  },
  "annotations": {
    "kubernetes.io/host-netns-path": "/var/run/netns/ns1"
  },
  "linux_container_resources": {
    "cpusetCpus": "42-51",
    "cpusetMems": "1",
  },
}
```
pod1 starts successfully with the `cpuset.cpus` and `cpuset.mems` allocated. In addition, the container process can
get the IP address of the NUMA-affinitive NIC from the environment variable `AFFINITY_NIC_ADDR_IPV6`
and bind sockets to that address.

##### Admit another pod with the same role

A pod2 arrives with one container requesting 10 CPUs and 20GB of memory, with hostNetwork set to true, and its pod role is also
`numa-enhancement`. That means pod2 can be located in the same NUMA node as pod1 if the available
resources satisfy its requirements.

Similar to pod1, pod2 gets a `ResourceRunContainerOptions` like below in the admission phase:
```
{
  "envs": {
    "AFFINITY_NIC_ADDR_IPV6": "fdbd:dc05:3:155::20",
  },
  "annotations": {
    "kubernetes.io/host-netns-path": "/var/run/netns/ns1"
  },
  "linux_container_resources": {
    "cpusetCpus": "52-61",
    "cpusetMems": "1",
  },
}
```

Notice that pod1 and pod2 share the same NIC. If we instead implemented a device plugin with the Device Manager to
provide NIC information with NUMA affinity, the allocation would be exclusive, so we couldn't make multiple pods share the same device ID.

### New Flags and Configuration of QRM

#### Feature Gate Flag

A new feature gate flag will be added to enable the QRM feature.
This feature gate will be disabled by default in the initial releases.

Syntax: `--feature-gate=QoSResourceManager=false|true`

#### QRM Reconcile Period Flag

This flag controls the interval between invocations of `reconcileState()`, in which QRM adjusts allocation results dynamically.
If not supplied, its default value is 3s.
Syntax: `--qos-resource-manager-reconcile-period=10s|1m`

#### How this proposal affects the kubelet ecosystem

##### Container Manager

Container Manager will create QRM and register it to Topology Manager as a hint provider.

##### Topology Manager

Topology Manager will call out to QRM to gather topology hints,
and allocate resources with the corresponding registered resource plugins during the pod admission sequence.

##### kubeGenericRuntimeManager

kubeGenericRuntimeManager will indirectly call `GetResourceRunContainerOptions()` of QRM, and get the
`LinuxContainerResources` config populated with the allocation results of the requested resources while executing `startContainer()`.

##### Kubelet Node Status Setter

The MachineInfo setter will be extended to indirectly call `GetCapacity()` of QRM to get the capacity and
allocatable quantities of resources, as well as already-removed resource names, from the registered resource plugins.

##### Pod Resources Server

In order to get the allocated and allocatable resources managed by QRM and its registered resource plugins from the
pod resources server, we make the container manager implement the ResourcesProvider interface defined below:
```
// ResourcesProvider knows how to provide the resources used by the given container
type ResourcesProvider interface {
	// UpdateAllocatedResources frees any Resources that are bound to terminated pods.
	UpdateAllocatedResources()
	// GetTopologyAwareResources returns information about the resources assigned to pods and containers in topology aware format
	GetTopologyAwareResources(pod *v1.Pod, container *v1.Container) []*podresourcesapi.TopologyAwareResource
	// GetTopologyAwareAllocatableResources returns information about all the resources known to the manager in topology aware format
	GetTopologyAwareAllocatableResources() []*podresourcesapi.AllocatableTopologyAwareResource
}
```

When `List()` of the pod resources server is called, the calling chain below will be triggered:
```
(*v1PodResourcesServer).List(...) ->
(*containerManagerImpl).GetTopologyAwareResources(...) ->
(*QoSResourceManagerImpl).GetTopologyAwareResources(...) ->
(*resourcePluginEndpoint).GetTopologyAwareResources(...) for each registered resource plugin
```
QRM will merge the allocated resource responses from the resource plugins, and the pod resources server
will return the merged information to the end user.

When `GetAllocatableResources()` of the pod resources server is called, the calling chain below will be triggered:
```
(*v1PodResourcesServer).GetAllocatableResources(...) ->
(*containerManagerImpl).GetTopologyAwareAllocatableResources(...) ->
(*QoSResourceManagerImpl).GetTopologyAwareAllocatableResources(...) ->
(*resourcePluginEndpoint).GetTopologyAwareAllocatableResources(...) for each registered resource plugin
```
QRM will merge the allocatable resource responses from the resource plugins, and the pod resources server
will return the merged information to the end user.

### Test Plan

We will initialize QRM with mock resource plugins registered, and cover the key points listed below:
- the plugin manager can discover the listening resource plugin dynamically and register it into QRM successfully.
- QRM can return a correct LinuxContainerResources config populated with allocation results.
- Validate allocating resources to containers and getting preferred NUMA affinity hints in QRM through registered resource plugins.
- Validate that `reconcileState()` of QRM updates the cgroup configs for containers according to the latest resource allocation results.
- `GetTopologyAwareAllocatableResources()` and `GetTopologyAwareResources()` of QRM return correct allocatable
and allocated resource information to the pod resources server.
- the pod resources checkpoint is stored and restored normally, and its basic operations work as expected.

## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

##### How can this feature be enabled / disabled in a live cluster?

Feature gate
- Feature gate name: QoSResourceManager
- Components depending on the feature gate: kubelet
- Will enabling / disabling the feature require downtime of the control plane? No
- Will enabling / disabling the feature require downtime or reprovisioning of a node?
(Do not assume Dynamic Kubelet Config feature is enabled). Yes, a kubelet restart is required since it is controlled by a feature gate.

##### Does enabling the feature change any default behavior?

Yes, the pod admission flow changes if QRM is enabled: it will call the plugins to determine
whether the pod is admitted or not. And QRM can't work together with the CPU Manager or Memory Manager if
there are plugins for CPU/Memory allocation registered to QRM.

##### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, it uses a feature gate.

##### What happens if we reenable the feature if it was previously rolled back?

QRM uses the state file to track resource allocations. If the state file is not valid,
it's better to remove the state file and restart kubelet
(e.g. the state file might become invalid because some pods recorded in it have been removed).
But the Manager will reconcile to fix it, so it won't be a big deal.

##### Are there any tests for feature enablement/disablement?

Yes, there are a number of unit tests designated for state file validation.

### Rollout, Upgrade and Rollback Planning

##### How can a rollout or rollback fail? Can it impact already running workloads?

It is possible that the state file will have inconsistent data during the rollout,
because of the kubelet restart, but you can easily fix it by removing the state file and restarting kubelet.
It should not affect any running workloads. And the Manager will reconcile to fix it, so it won't be a big deal.

##### What specific metrics should inform a rollback?

The pod may fail with an admission error because the plugin fails to allocate resources.
You can see the error message under the pod events.

##### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Yes.

##### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

### Monitoring Requirements

###### How can an operator determine if the feature is in use by workloads?

QRM data will also be available under the Pod Resources API.

###### How can someone using this feature know that it is working for their instance?

- After the pod starts, you have two options to verify that containers work as expected:
  - via the Pod Resources API: you will need to connect to the grpc socket and get information from it.
    Please see the pod resources API doc page for more information.
  - checking the relevant container cgroup under the node.
- The pod failed to start because of an admission error.

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

This does not seem relevant to this feature.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

A set of metrics will be added to indicate the running states of QRM and the plugins.
The detailed metric names and meanings will be added to this doc in the future.

### Dependencies

###### Does this feature depend on any specific services running in the cluster?

None

### Scalability

###### Will enabling / using this feature result in any new API calls?

No

###### Will enabling / using this feature result in introducing new API types?

No

###### Will enabling / using this feature result in any new calls to the cloud provider?

No

###### Will enabling / using this feature result in increasing size or count of the existing API objects?

No

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

No, the algorithm will run on a single `goroutine` with minimal memory requirements.

### Troubleshooting

##### How does this feature react if the API server and/or etcd is unavailable?

It is not affected, since the feature does not depend on the API server or etcd.

##### What are other known failure modes?

When enabling or disabling QRM, you must remove the QRM
state file (/var/lib/kubelet/qos_resource_manager_state), otherwise kubelet will fail to start.
You can identify the issue by checking the kubelet log.

##### What steps should be taken if SLOs are not being met to determine the problem?

Not applicable.

## Implementation History

QRM has been developed and is running in our production environment,
but still needs a little effort to produce an open-source version.

## Drawbacks

No objections exist to implementing this KEP.

## Appendix

### Related Features

- [Topology Manager](https://github.com/kubernetes/enhancements/blob/dcc8c7241513373b606198ab0405634af643c500/keps/sig-node/0035-20190130-topology-manager.md)
collects topology hints from various hint providers (e.g. CPU Manager or Device Manager) in order to calculate which
NUMA nodes can offer a suitable amount of resources for a container. The final decision of Topology Manager is
subject to the topology policy (i.e. best-effort, restricted, single-numa-node) and the
possible NUMA affinity of containers. Finally, the Topology Manager determines whether a container in a pod
can be deployed to the node or should be rejected.

- [CPU Manager](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/cpu-manager.md)
provides CPU pinning functionality by using the cgroups cpuset subsystem, and it also provides topology hints,
which indicate CPU core availability at particular NUMA nodes, to Topology Manager.
- [Device Manager](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-management/device-plugin.md)
allows device vendors to advertise their device resources, such as NIC devices or GPU devices, through their
device plugins to kubelet, so that the devices can be utilized by containers. Similarly, the Device Manager provides
topology hints to the Topology Manager. The hints indicate the availability of devices at particular NUMA nodes.

- [Hugepages](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/20190129-hugepages.md) enables
the assignment of pre-allocated hugepage resources to a container.

- [Node Allocatable Feature](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/node-allocatable.md)
helps to increase the stability of node operation, as it pre-reserves compute resources for kubelet and system
processes. In v1.17, the feature supports the following reservable resources: CPU, memory, and ephemeral storage.