---
title: QoS Resource Manager
authors:
  - "csfldf"
reviewers:
  - "waynepeking348"
  - "xuchen-xiaoying"
creation-date: 2022-10-18
last-updated: 2023-02-22
status: implemented
see-also:
  - "https://github.com/kubewharf/enhanced-k8s/tree/70544afae9af1c7e1d129069e80f4eceb7d039f5/docs/design/qos-resource-manager"
---

# QoS Resource Manager

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
    - [Story 1: Adjust resource allocation results dynamically in QoS aware systems](#story-1-adjust-resource-allocation-results-dynamically-in-qos-aware-systems)
    - [Story 2: Expand customized resource allocation policies in user-developed plugins](#story-2-expand-customized-resource-allocation-policies-in-user-developed-plugins)
    - [Story 3: Allocate shared devices with NUMA affinity for multiple pods](#story-3-allocate-shared-devices-with-numa-affinity-for-multiple-pods)
  - [Risks and Mitigations](#risks-and-mitigations)
    - [UX](#ux)
- [Design Details](#design-details)
  - [Detailed Working Flow](#detailed-working-flow)
    - [Synchronous Pod Admission](#synchronous-pod-admission)
    - [Asynchronous Resource Adjustment](#asynchronous-resource-adjustment)
  - [Pod Resources Checkpoint](#pod-resources-checkpoint)
  - [Simulation: how QRM works](#simulation-how-qrm-works)
    - [Example 1: running a QoS System using QRM](#example-1-running-a-qos-system-using-qrm)
      - [Initialize plugins](#initialize-plugins)
      - [Admit pod with online role](#admit-pod-with-online-role)
      - [Admit pod with offline role](#admit-pod-with-offline-role)
      - [Admit another pod with online role](#admit-another-pod-with-online-role)
      - [Periodically adjust resource allocation](#periodically-adjust-resource-allocation)
    - [Example 2: allocate NUMA-affinity resources with extended policy](#example-2-allocate-numa-affinity-resources-with-extended-policy)
      - [Initialize plugins](#initialize-plugins-1)
      - [Admit pod with storage-service role](#admit-pod-with-storage-service-role)
      - [Admit pod with reranker role](#admit-pod-with-reranker-role)
    - [Example 3: allocate shared NUMA affinitive NICs](#example-3-allocate-shared-numa-affinitive-nics)
      - [Initialize plugins](#initialize-plugins-2)
      - [Admit pod with numa-sharing && cpu-exclusive role](#admit-pod-with-numa-sharing--cpu-exclusive-role)
      - [Admit another pod with the same role](#admit-another-pod-with-the-same-role)
  - [New Flags and Configuration of QRM](#new-flags-and-configuration-of-qrm)
    - [Feature Gate Flag](#feature-gate-flag)
    - [QRM Reconcile Period Flag](#qrm-reconcile-period-flag)
    - [How this proposal affects the kubelet ecosystem](#how-this-proposal-affects-the-kubelet-ecosystem)
      - [Container Manager](#container-manager)
      - [Topology Manager](#topology-manager)
      - [kubeGenericRuntimeManager](#kubegenericruntimemanager)
      - [Kubelet Node Status Setter](#kubelet-node-status-setter)
      - [Pod Resources Server](#pod-resources-server)
  - [Test Plan](#test-plan)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
    - [How can this feature be enabled / disabled in a live cluster?](#how-can-this-feature-be-enabled--disabled-in-a-live-cluster)
    - [Does enabling the feature change any default behavior?](#does-enabling-the-feature-change-any-default-behavior)
    - [Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?](#can-the-feature-be-disabled-once-it-has-been-enabled-ie-can-we-roll-back-the-enablement)
    - [What happens if we reenable the feature if it was previously rolled back?](#what-happens-if-we-reenable-the-feature-if-it-was-previously-rolled-back)
    - [Are there any tests for feature enablement/disablement?](#are-there-any-tests-for-feature-enablementdisablement)
  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
    - [How can a rollout or rollback fail? Can it impact already running workloads?](#how-can-a-rollout-or-rollback-fail-can-it-impact-already-running-workloads)
    - [What specific metrics should inform a rollback?](#what-specific-metrics-should-inform-a-rollback)
    - [Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?](#were-upgrade-and-rollback-tested-was-the-upgrade-downgrade-upgrade-path-tested)
    - [Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?](#is-the-rollout-accompanied-by-any-deprecations-andor-removals-of-features-apis-fields-of-api-types-flags-etc)
  - [Monitoring Requirements](#monitoring-requirements)
  - [Dependencies](#dependencies)
  - [Scalability](#scalability)
  - [Troubleshooting](#troubleshooting)
    - [How does this feature react if the API server and/or etcd is unavailable?](#how-does-this-feature-react-if-the-api-server-andor-etcd-is-unavailable)
    - [What are other known failure modes?](#what-are-other-known-failure-modes)
    - [What steps should be taken if SLOs are not being met to determine the problem?](#what-steps-should-be-taken-if-slos-are-not-being-met-to-determine-the-problem)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Appendix](#appendix)
  - [Related Features](#related-features)
<!-- /toc -->

## Summary

Although the CPU Manager and Memory Manager in kubelet can allocate `cpuset.cpus` and `cpuset.mems`
with NUMA affinity, they have several restrictions and are difficult to customize, since all of their policies share
the same checkpoint.

* For instance, only pods with the Guaranteed QoS class can be allocated exclusive `cpuset.cpus` and `cpuset.mems`,
but in each individual production environment, the original Kubernetes QoS classes may not be flexible enough to describe
workloads with different QoS requirements.
* Besides, the allocation logic works in a static way, because it relies only on the numerical value of each resource,
without considering the running state of each node.
* Finally, the current implementation is not pluggable, so if new policies or additional resource managers
(like disk quota or network bandwidth) are needed, we have to modify the kubelet source code and
upgrade kubelet across clusters, which would be costly.

Thus, we propose the `QoS Resource Manager` (abbreviated to `QRM` later in the article) as a new component in the kubelet
ecosystem. It extends the ability of resource allocation in the admission phase, and enables dynamic resource allocation
adjustment for pods with better flexibility.
QRM works in a similar way to the Device Manager, and the resource allocation logic is implemented in external plugins.
It will then periodically collect the latest resource allocation results, along with real-time node running states, and
assemble them as parameters to update through the standard CRI interface, like `cpuset.cpus`, `cpuset.mems` or any other
potential resources needed in the future.

In this way, the allocation and adjustment logic is offloaded to different plugins, and can be customized by
user-defined QoS requirements. In addition, we can implement setting and adjustment logic in plugins for cgroup
parameters supported in the LinuxContainerResources config (eg. `memory.oom_control`, `io.weight`); QRM will set and
update those parameters for pods when the corresponding plugins are registered.

Currently, we have already implemented the QRM framework and multiple plugins, and they are already running in production
to support QoS-aware and heterogeneous systems.

## Motivation

### Goals

* **Pluggable:** make it easier to extend additional resource allocation and adjustment (NUMA affinitive NICs, network or memory
bandwidth, disk capacity etc.) without modifying kubelet source code.
* **Adjustable:** dynamically adjust resource allocation and QoS-control strategies according to the real-time node running states.
* **Customizable:** each resource plugin can perform customized QoS management according to its specific QoS definition.

### Non-Goals

* Expand or overturn the current pod QoS definitions; they will remain Guaranteed, Burstable and BestEffort.
Instead, users should use common annotations to reflect customized QoS types as needed.
* Replace the current implementation of the CPU Manager and Memory Manager; they will still work as general resource
allocation components to match the native QoS definitions.

## Proposal

QRM is a new component of the kubelet ecosystem proposed to extend resource allocation policies. Besides:
* QRM is also a hint provider for Topology Manager, like Device Manager.
* The hints are intended to indicate the preferred resource affinity, and bind the resources for a container
either to a single NUMA node or to a group of NUMA nodes.
* QRM is not restricted to any native QoS definition; instead, it passes the pod metadata to plugins, and plugins
handle it with customized policies.

### User Stories

#### Story 1: Adjust resource allocation results dynamically in QoS aware systems

To improve resource utilization, overselling, either by VPA or by running complementary workloads on one node,
is commonly used in production environments. As a result, the resource consumption state is always in flux,
so static resource allocation (i.e. `cpuset.cpus` or `cpu.cfs_quota_us`) is not enough, especially for workloads
with high performance requirements. A real-time, customized and dynamic mechanism for adjusting resource allocation
results is therefore needed.

The dynamic adjustment of resource allocation results is usually closely tied to the implementation of QoS aware
systems and workload characteristics, so it would be better to provide a general framework in kubelet and offload
the resource allocation logic to plugins. QRM works as such a framework.
#### Story 2: Expand customized resource allocation policies in user-developed plugins

The native CPU/Memory Manager requires that only pods with Guaranteed QoS can be allocated exclusive `cpuset.cpus`
or `cpuset.mems`, but this abstraction lacks flexibility.

For instance, in a hybrid cluster, `offline ETL workloads`, `latency-sensitive web services` and `storage services`
may run on the same node. In this case, there may be three kinds of `cpuset.cpus` pools: one for offline workloads
with shared `cpuset.cpus`, one for web services with shared or exclusive `cpuset.cpus`, and one for storage services
with exclusive `cpuset.cpus`. The same logic may be required for `cpuset.mems` allocation.

In other words, we need a `role-based` or `fine-grained` QoS classification and corresponding resource allocation
logic, which is hard to implement in the general CPU/Memory Manager, but can be implemented in user-developed plugins.

#### Story 3: Allocate shared devices with NUMA affinity for multiple pods

Consider a node that has multiple NUMA nodes and network interfaces, and pods scheduled to this node want to stick
to a single NUMA node and only use the network interface affiliated with that NUMA node.

In this case, multiple pods need to be allocated the same network device. Although the Device Manager and
device plugins could be used, they only allow `exclusive mode`, i.e., a device can only be allocated to one specific
container. A possible workaround is to allocate a `fake device` and set its amount to a large enough value, but
a `fake device` is kind of weird for end users to request as a resource.

With the help of QRM, we can express this `implicit` allocation requirement in annotations and make the customized
plugins support it.

### Risks and Mitigations

#### UX

To improve the UX, the number of new kubelet flags is kept to a minimum. The minimal set of kubelet flags
necessary to configure the QoS Resource Manager is presented in this section.

## Design Details

![](/docs/imgs/qrm-design-overview.png)

As shown in the figure above, QRM works as both a plugin handler added to the kubelet plugin manager,
and a hint provider for Topology Manager.

As a plugin handler, QRM is responsible for the registration of new plugins, and brings the resource
allocation results into effect through the standard CRI interface. The detailed strategy is actually implemented
in plugins, including NUMA affinity calculation, resource allocation, and dynamic adjustment of resources or cgroup
parameter control knobs (eg. `memory.oom_control`, `io.weight`, ...). Based on the dynamic plugin discovery functionality
in kubelet, plugins register with QRM automatically and take effect during the pod lifecycle.

As a hint provider, QRM obtains the preferred NUMA affinity hints for a container's resources from the corresponding
registered plugins.
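
To make the two roles above concrete, below is a minimal Go sketch of the plugin surface that QRM drives. It is assembled only from the method names used in this proposal (`GetTopologyHints()`, `Allocate()`, `GetResourcesAllocation()`); the interface name, field lists and signatures are illustrative assumptions, not the actual gRPC service and message definitions in the katalyst API.
```
// Illustrative sketch only: the real resource plugin API is a gRPC service in
// the katalyst API repository; the Go types below are trimmed placeholders.
package qrm

import "context"

// ResourceRequest carries pod/container metadata, the pod role and resource QoS
// type extracted from pod annotations, the requested quantities, and the best
// NUMA affinity hint already merged by Topology Manager.
type ResourceRequest struct {
	PodUid        string
	PodNamespace  string
	PodName       string
	ContainerName string
	PodRole       string
	ResourceName  string
	Hint          []int // preferred NUMA nodes chosen by Topology Manager
}

// Placeholder response types; see the ResourceAllocationInfo structure in the
// "Pod Resources Checkpoint" section for the fields an allocation carries.
type (
	ResourceHintsResponse          struct{}
	ResourceAllocationResponse     struct{}
	GetResourcesAllocationResponse struct{}
)

// ResourcePlugin is the contract a QRM resource plugin fulfils.
type ResourcePlugin interface {
	// GetTopologyHints returns the preferred NUMA affinity hints for the
	// resource a container requests; QRM forwards them to Topology Manager.
	GetTopologyHints(ctx context.Context, req *ResourceRequest) (*ResourceHintsResponse, error)

	// Allocate reserves the resource for the container and returns the
	// allocation result (OCI property name, result string, envs, annotations).
	Allocate(ctx context.Context, req *ResourceRequest) (*ResourceAllocationResponse, error)

	// GetResourcesAllocation returns the latest allocation results for all
	// active containers; QRM calls it from its periodic reconcileState() loop.
	GetResourcesAllocation(ctx context.Context) (*GetResourcesAllocationResponse, error)
}
```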
### Detailed Working Flow

![](/docs/imgs/qrm-detailed-working-flow.png)

The figure above illustrates the workflow of QRM, including two major processes:
* the synchronous workflow of pod admission and resource allocation
* and the asynchronous workflow of periodical resource adjustment

#### Synchronous Pod Admission

Once kubelet requests a pod admission, for each container in the pod, Topology Manager will query QRM (along with
other hint providers if needed) about the preferred NUMA affinity for each resource that the container requests.

QRM will call `GetTopologyHints()` of the resource plugin to get the preferred NUMA affinity for this resource,
and return hints for each resource to Topology Manager. Topology Manager then figures out which NUMA node or
group of NUMA nodes is the best fit for affinity-aligned allocation of resources/devices, after merging hints from
all hint providers.

After getting the best fit, Topology Manager will call `Allocate()` of all hint providers for the container.
When `Allocate()` is called, QRM will assemble a ResourceRequest for each resource. The ResourceRequest contains
pod and container metadata, pod `role` and resource `QoS type`, the requested resource name and request/limit quantities,
and the best hint obtained in the previous step. Note that pod role and resource QoS type are newly defined,
and are extracted from pod annotations by the keys `kubernetes.io/pod-role` and `kubernetes.io/resource-type`.
They are used to uniquely identify the pod QoS type, which will influence the allocation results.

QRM will then call `Allocate()` of the resource plugin with the ResourceRequest and get a ResourceAllocationResponse.
The ResourceAllocationResponse contains the properties AllocatedQuantity, AllocatationResult, OciPropertyName, Envs,
Annotations, etc. A possible ResourceAllocationResponse example for the QRM CPU plugin would look like:
```
* AllocatedQuantity: 4
* OciPropertyName: CpusetCpus (matching the property name in LinuxContainerResources)
* AllocatationResult: "0-1,8-9" (matching the cgroup cpuset.cpus format)
```

After successfully getting the ResourceAllocationResponse, QRM will cache the allocation result in the pod resources
checkpoint, and make the checkpoint persistent by writing the checkpoint file.

In the PreCreateContainer phase, kubeGenericRuntimeManager will indirectly call `GetResourceRunContainerOptions()` of QRM,
and the allocation results for the container cached in the pod resources checkpoint will be populated into the
[LinuxContainerResources](https://github.com/kubernetes/cri-api/blob/master/pkg/apis/runtime/v1alpha2/api.pb.go#L3075)
config of the CRI API via a reflection mechanism.

The completed LinuxContainerResources config will be embedded in the
[ContainerConfig](https://github.com/kubernetes/cri-api/blob/master/pkg/apis/runtime/v1alpha2/api.pb.go#L4137),
and be passed as a parameter of the
[`CreateContainer()`](https://github.com/kubernetes/cri-api/blob/master/pkg/apis/services.go#L35) API. In this way,
the resource allocation results or cgroup parameter control knobs generated by QRM are taken into the runtime.

The overall calculation is performed for all containers in the pod, and if none of the containers is rejected, the pod
finally becomes admitted and deployed.
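
As an illustration of the reflection step mentioned above, the sketch below shows how an allocation result could be copied into the `LinuxContainerResources` field named by `OciPropertyName`. The trimmed struct shapes and the helper `applyAllocation` are assumptions made for this example, not the actual code behind `GetResourceRunContainerOptions()`.
```
package main

import (
	"fmt"
	"reflect"
)

// Trimmed stand-in for the CRI LinuxContainerResources message.
type LinuxContainerResources struct {
	CpusetCpus string
	CpusetMems string
}

// Trimmed stand-in for one entry of the pod resources checkpoint.
type ResourceAllocationInfo struct {
	OciPropertyName    string
	AllocatationResult string
}

// applyAllocation sets resources.<OciPropertyName> = AllocatationResult.
func applyAllocation(resources *LinuxContainerResources, info ResourceAllocationInfo) error {
	if info.OciPropertyName == "" {
		// Nothing to populate (e.g. a NIC plugin returning only envs/annotations).
		return nil
	}
	field := reflect.ValueOf(resources).Elem().FieldByName(info.OciPropertyName)
	if !field.IsValid() || field.Kind() != reflect.String {
		return fmt.Errorf("unknown or non-string OCI property %q", info.OciPropertyName)
	}
	field.SetString(info.AllocatationResult)
	return nil
}

func main() {
	res := &LinuxContainerResources{}
	// Allocation results as the QRM CPU/Memory plugins might return them.
	for _, info := range []ResourceAllocationInfo{
		{OciPropertyName: "CpusetCpus", AllocatationResult: "0-37,40-77"},
		{OciPropertyName: "CpusetMems", AllocatationResult: "0"},
	} {
		if err := applyAllocation(res, info); err != nil {
			panic(err)
		}
	}
	fmt.Printf("%+v\n", res) // {CpusetCpus:0-37,40-77 CpusetMems:0}
}
```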
#### Asynchronous Resource Adjustment

Dynamic resource adjustment is provided as an enhancement over static resource allocation, and is needed in
some cases (e.g. the QoS aware resource adjustment for resource utilization improvement mentioned in user story 1 above).

To support this case, QRM invokes `reconcileState()` periodically, which in turn calls `GetResourcesAllocation()` of
every registered plugin to get the latest resource allocation results. QRM will then update the pod resources checkpoint and call
[`UpdateContainerResources()`](https://github.com/kubernetes/cri-api/blob/master/pkg/apis/services.go#L47)
to update the cgroup configs for containers according to the latest resource allocation results.

### Pod Resources Checkpoint

The pod resources checkpoint is a cache for QRM to keep track of the resource allocation results made by
registered resource plugins for all active containers and their resource requests. The structure is shown below:
```
type ResourceAllocation map[string]*ResourceAllocationInfo // Keyed by resourceName
type ContainerResources map[string]ResourceAllocation      // Keyed by containerName
type PodResources map[string]ContainerResources            // Keyed by podUID

type podResourcesChk struct {
	sync.RWMutex
	resources PodResources // Keyed by podUID
}

type ResourceAllocationInfo struct {
	OciPropertyName  string `protobuf:"bytes,1,opt,name=oci_property_name,json=ociPropertyName,proto3" json:"oci_property_name,omitempty"`
	IsNodeResource   bool   `protobuf:"varint,2,opt,name=is_node_resource,json=isNodeResource,proto3" json:"is_node_resource,omitempty"`
	IsScalarResource bool   `protobuf:"varint,3,opt,name=is_scalar_resource,json=isScalarResource,proto3" json:"is_scalar_resource,omitempty"`
	// only for resources with true value of IsScalarResource
	AllocatedQuantity  float64           `protobuf:"fixed64,4,opt,name=allocated_quantity,json=allocatedQuantity,proto3" json:"allocated_quantity,omitempty"`
	AllocatationResult string            `protobuf:"bytes,5,opt,name=allocatation_result,json=allocatationResult,proto3" json:"allocatation_result,omitempty"`
	Envs               map[string]string `protobuf:"bytes,6,rep,name=envs,proto3" json:"envs,omitempty" protobuf_key:"bytes,1,opt,name=key,proto3" protobuf_val:"bytes,2,opt,name=value,proto3"`
	Annotations        map[string]string `protobuf:"bytes,7,rep,name=annotations,proto3" json:"annotations,omitempty" protobuf_key:"bytes,1,opt,name=key,proto3" protobuf_val:"bytes,2,opt,name=value,proto3"`
	ResourceHints      *ListOfTopologyHints `protobuf:"bytes,8,opt,name=resource_hints,json=resourceHints,proto3" json:"resource_hints,omitempty"`
}
```

PodResources is organized as a three-layer map, using pod UID, container name and resource name
as the keys of each level. ResourceAllocationInfo is stored in the lowest map, and contains the allocation result
of a specific resource for the identified container. ResourceAllocationInfo currently has these properties:
```
* OciPropertyName
  - it's used to identify which property of the LinuxContainerResources config the allocation result should be populated into.
* IsScalarResource
  - if it's true, the resource allocation result can be quantified, and possibly be used as the foundation of scheduling and
    admitting logic. QRM will compare the requested quantity with the allocated quantity when the active container is re-admitted.
* IsNodeResource
  - if this property and IsScalarResource are both true, QRM will expose the "allocatable" and "capacity" quantities of the
    resource to the kubelet node status setter, which finally sets them into the node status.
  - For instance, quantified resources that are already covered by the kubelet node status setter (eg. cpu, memory, ...)
    don't need to be set into the node status again, so IsNodeResource should be set to false for them. Otherwise, we
    should set IsNodeResource to true for extended quantified resources, so their quantities are set into the node status.
* AllocatedQuantity
  - it represents the allocated quantity of the resource for the container, and is only used for resources with IsScalarResource
    set to true.
* AllocatationResult
  - it represents the resource allocation result for the container, and must be a valid value of the property in
    LinuxContainerResources that OciPropertyName indicates. For example, if OciPropertyName is CpusetCpus, the
    AllocatationResult should be like "0-1,8-9" (a valid value for cgroup cpuset.cpus).
* Envs
  - environment variables that the resource plugin returns and that should be set in the container.
* Annotations
  - annotations that the resource plugin returns and that should be passed to the runtime.
* ResourceHints
  - it's the preferred NUMA affinity matching the AllocatationResult. It will be used when kubelet restarts and
    active containers with allocated resources are re-admitted.
```

### Simulation: how QRM works

#### Example 1: running a QoS System using QRM

Assume that we have a machine with an Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
(80 CPUs, 0-39 in NUMA node0, 40-79 in NUMA node1).

We divide the CPUs in this machine into two pools, one for common online micro-services (eg. web services),
and the other for common offline workloads (eg. ETL or video transcoding tasks).
And we deploy a QoS aware system to adjust the size of those two pools according to performance metrics.

##### Initialize plugins

The QRM CPU plugin starts to work:
- Initialize the two `cpuset.cpus` pools
  - suppose `0-37,40-77` for the online pool, and `38-39,78-79` for the offline pool by default
- The QRM CPU plugin is discovered by the kubelet plugin manager dynamically and registers with QRM

##### Admit pod with online role

Suppose pod1 arrives with one container requesting 4 CPUs, and its pod role is online-micro_service,
meaning that it should be placed in the online pool.

In the pod1 admission phase, QRM calls `Allocate()` of the QRM CPU plugin. The plugin identifies that the container belongs to the
online `cpuset.cpus` pool, so it returns a `ResourceAllocationResponse` like below:
```
{
  "pod_uid": "054a696f-d176-488d-b228-d6046faaf67c",
  "pod_namespace": "default",
  "pod_name": "pod1",
  "container_name": "container0",
  "container_type": "MAIN",
  "container_index": 0,
  "pod_role": "online-micro_service",
  "resource_name": "cpu",
  "allocatation_result": {
    "resource_allocation": {
      "cpu": {
        "oci_property_name": "CpusetCpus",
        "is_node_resource": false,
        "is_scalar_resource": true,
        "allocated_quantity": 76,
        "allocatation_result": "0-37,40-77",
        "resource_hints": [] // no NUMA preference
      }
    }
  }
}
```

QRM caches the allocation result in the pod resources checkpoint.
In the PreCreateContainer phase, the kubeGenericRuntimeManager calls `GetResourceRunContainerOptions()` of
QRM and gets a `LinuxContainerResources` config like below:
```
{
  "cpusetCpus": "0-37,40-77",
}
```

pod1 starts successfully with the `cpuset.cpus` allocated.

##### Admit pod with offline role

Suppose pod2 arrives with one container requesting 2 CPUs, and its pod role is ETL,
meaning that it should be placed in the offline pool.

In the pod2 admission phase, QRM calls `Allocate()` of the QRM CPU plugin. The plugin identifies
that the container belongs to the offline `cpuset.cpus` pool, so it returns a `ResourceAllocationResponse` like below:
```
{
  "pod_uid": "fe8d9c25-6fb4-4983-908f-08e39ebeafe7",
  "pod_namespace": "default",
  "pod_name": "pod2",
  "container_name": "container0",
  "container_type": "MAIN",
  "container_index": 0,
  "pod_role": "ETL",
  "resource_name": "cpu",
  "allocatation_result": {
    "resource_allocation": {
      "cpu": {
        "oci_property_name": "CpusetCpus",
        "is_node_resource": false,
        "is_scalar_resource": true,
        "allocated_quantity": 4,
        "allocatation_result": "38-39,78-79",
        "resource_hints": [] // no NUMA preference
      }
    }
  }
}
```

QRM caches the allocation result in the pod resources checkpoint.

In the PreCreateContainer phase, the kubeGenericRuntimeManager calls `GetResourceRunContainerOptions()` of
QRM and gets a `LinuxContainerResources` config like below:
```
{
  "cpusetCpus": "38-39,78-79",
}
```
pod2 starts successfully with the `cpuset.cpus` allocated.

##### Admit another pod with online role

Similar to pod1, if a pod3 with the online-micro_service role arrives, it will be placed in the online
pool too. So it will get a `LinuxContainerResources` config like below in the `PreCreateContainer` phase:
```
{
  "cpusetCpus": "0-37,40-77",
}
```

##### Periodically adjust resource allocation

After a period of time, the QoS aware system adjusts the online `cpuset.cpus` pool to `0-13,40-53` and
the offline `cpuset.cpus` pool to `14-39,54-79`, according to the system indicators.

QRM invokes `reconcileState()` and gets the latest resource allocation results (like below) from the QRM CPU plugin.
It then updates the pod resources checkpoint, and calls `UpdateContainerResources()` to update the cgroup resources
for containers according to the latest resource allocation results.
```
{
  "pod_resources": {
    "054a696f-d176-488d-b228-d6046faaf67c": { // pod1
      "container0": {
        "cpu": {
          "oci_property_name": "CpusetCpus",
          "is_node_resource": false,
          "is_scalar_resource": true,
          "allocated_quantity": 28,
          "allocatation_result": "0-13,40-53",
          "resource_hints": [] // no NUMA preference
        }
      }
    },
    "fe8d9c25-6fb4-4983-908f-08e39ebeafe7": { // pod2
      "container0": {
        "cpu": {
          "oci_property_name": "CpusetCpus",
          "is_node_resource": false,
          "is_scalar_resource": true,
          "allocated_quantity": 52,
          "allocatation_result": "14-39,54-79",
          "resource_hints": [] // no NUMA preference
        }
      }
    },
    "26731da7-b283-488b-b232-cff611c914e1": { // pod3
      "container0": {
        "cpu": {
          "oci_property_name": "CpusetCpus",
          "is_node_resource": false,
          "is_scalar_resource": true,
          "allocated_quantity": 28,
          "allocatation_result": "0-13,40-53",
          "resource_hints": [] // no NUMA preference
        }
      }
    }
  }
}
```

#### Example 2: allocate NUMA-affinity resources with extended policy

Assume that we have a machine with an Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
(80 CPUs, 0-39 in NUMA node0, 40-79 in NUMA node1; 452 GB memory, 226 GB in NUMA node0, 226 GB in NUMA node1).

And we have multiple latency-sensitive services, including storage services, and the re-caller, retriever and re-ranker services
of information retrieval systems, etc. To meet QoS requirements, we should pin them to exclusive
NUMA nodes by setting `cpuset.cpus` and `cpuset.mems`.

Although the default CPU/Memory Managers already provide the ability to set `cpuset.cpus` and `cpuset.mems`
for containers, the current policies only target the Guaranteed pod QoS class, and can't handle container anti-affinity
at the NUMA node scope with flexibility.

##### Initialize plugins

The QRM CPU/Memory plugins start to work:
- The QRM CPU plugin initializes its checkpoint with the default machine state like below:
```
{
  "machineState": {
    "0": {
      "reserved_cpuset": "0-1",
      "allocated_cpuset": "",
      "default_cpuset": "2-39"
    },
    "1": {
      "reserved_cpuset": "40-41",
      "allocated_cpuset": "",
      "default_cpuset": "42-79"
    }
  },
  "pod_entries": {}
}
```
- The QRM Memory plugin initializes its checkpoint with the default machine state like below:
```
{
  "machineState": {
    "memory": {
      "0": {
        "Allocated": "0",
        "allocatable": "237182648320",
        "free": "237182648320",
        "pod_entries": {},
        "systemReserved": "524288000",
        "total": "237706936320"
      },
      "1": {
        "Allocated": "0",
        "allocatable": "237282263040",
        "free": "237282263040",
        "pod_entries": {},
        "systemReserved": "524288000",
        "total": "237806551040"
      }
    }
  },
  "pod_resource_entries": {}
}
```
- All QRM plugins are discovered by the kubelet plugin manager dynamically and register with QRM

##### Admit pod with storage-service role

A pod1 arrives with one container requesting 20 CPUs and 40GB of memory, and its pod role is storage-service.

In the pod1 admission phase, QRM calls `GetTopologyHints()` of the QRM plugins, and gets the preferred NUMA affinity
hint (10) from both of them.
Then QRM calls `Allocate()` of the plugins:
- calls `Allocate()` of the QRM CPU plugin and gets a `ResourceAllocationResponse` like below:
```
{
  "pod_uid": "bc9e28df-1b5c-4099-8866-6110277184e0",
  "pod_namespace": "default",
  "pod_name": "pod1",
  "container_name": "container0",
  "container_type": "MAIN",
  "container_index": 0,
  "pod_role": "storage-service",
  "resource_name": "cpu",
  "allocatation_result": {
    "resource_allocation": {
      "cpu": {
        "oci_property_name": "CpusetCpus",
        "is_node_resource": false,
        "is_scalar_resource": true,
        "allocated_quantity": 20,
        "allocatation_result": "2-21",
        "resource_hints": [
          {
            "nodes": [0],
            "Preferred": true
          }
        ]
      }
    }
  }
}
```
- calls `Allocate()` of the QRM Memory plugin and gets a `ResourceAllocationResponse` like below:
```
{
  "pod_uid": "bc9e28df-1b5c-4099-8866-6110277184e0",
  "pod_namespace": "default",
  "pod_name": "pod1",
  "container_name": "container0",
  "container_type": "MAIN",
  "container_index": 0,
  "pod_role": "storage-service",
  "resource_name": "memory",
  "allocatation_result": {
    "resource_allocation": {
      "memory": {
        "oci_property_name": "CpusetMems",
        "is_node_resource": false,
        "is_scalar_resource": true,
        "allocated_quantity": 1,
        "allocatation_result": "0",
        "resource_hints": [
          {
            "nodes": [0],
            "Preferred": true
          }
        ]
      }
    }
  }
}
```
QRM caches the allocation results in the pod resources checkpoint.
In the `PreCreateContainer` phase, the kubeGenericRuntimeManager calls `GetResourceRunContainerOptions()` of
QRM and gets a `LinuxContainerResources` config like below:
```
{
  "cpusetCpus": "2-21",
  "cpusetMems": "0",
}
```
pod1 starts successfully with the `cpuset.cpus` and `cpuset.mems` allocated.

##### Admit pod with reranker role

A pod2 arrives with one container requesting 10 CPUs and 20GB of memory, and its pod role is reranker.

Although the quantity of available resources is enough for pod2, the QRM plugins identify from the pod roles that pod1 and
pod2 should follow an anti-affinity requirement at the NUMA node scope.
So the `ResourceAllocationResponses`
for pod2 from the QRM CPU/Memory plugins are like below:
```
{
  "pod_uid": "6f695526-b07c-4baa-90e3-af1dfed2faf8",
  "pod_namespace": "default",
  "pod_name": "pod2",
  "container_name": "container0",
  "container_type": "MAIN",
  "container_index": 0,
  "pod_role": "reranker",
  "resource_name": "cpu",
  "allocatation_result": {
    "resource_allocation": {
      "cpu": {
        "oci_property_name": "CpusetCpus",
        "is_node_resource": false,
        "is_scalar_resource": true,
        "allocated_quantity": 10,
        "allocatation_result": "42-51",
        "resource_hints": [
          {
            "nodes": [1],
            "Preferred": true
          }
        ]
      }
    }
  }
}

{
  "pod_uid": "6f695526-b07c-4baa-90e3-af1dfed2faf8",
  "pod_namespace": "default",
  "pod_name": "pod2",
  "container_name": "container0",
  "container_type": "MAIN",
  "container_index": 0,
  "pod_role": "reranker",
  "resource_name": "memory",
  "allocatation_result": {
    "resource_allocation": {
      "memory": {
        "oci_property_name": "CpusetMems",
        "is_node_resource": false,
        "is_scalar_resource": true,
        "allocated_quantity": 1,
        "allocatation_result": "1",
        "resource_hints": [
          {
            "nodes": [1],
            "Preferred": true
          }
        ]
      }
    }
  }
}
```

QRM caches the allocation results in the pod resources checkpoint. In the `PreCreateContainer` phase,
the `kubeGenericRuntimeManager` calls `GetResourceRunContainerOptions()` of QRM and
gets a `LinuxContainerResources` config like below:
```
{
  "cpusetCpus": "42-51",
  "cpusetMems": "1",
}
```
pod2 starts successfully with the `cpuset.cpus` and `cpuset.mems` allocated.

#### Example 3: allocate shared NUMA affinitive NICs

Assume that we have a machine with an Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz
(80 CPUs, 0-39 in NUMA node0, 40-79 in NUMA node1) and 2 Mellanox Technologies MT28841 NICs with a speed of 25000Mb/s
(eth0 is affinitive to NUMA node0, eth1 is affinitive to NUMA node1).

##### Initialize plugins

The QRM CPU/Memory/NIC plugins start to work:
- The QRM CPU/Memory plugins initialize their checkpoints with the default machine state (same as example 2)
- The QRM NIC plugin initializes its checkpoint with the default machine state containing NIC information organized by NUMA affinity:
```
{
  "machineState": {
    "0": {
      "nics": [
        {
          "interface_name": "eth0",
          "numa_node": 0,
          "address": {
            "ipv6": "fdbd:dc05:3:154::20"
          }
        }
      ]
    },
    "1": {
      "nics": [
        {
          "interface_name": "eth1",
          "numa_node": 1,
          "address": {
            "ipv6": "fdbd:dc05:3:155::20"
          },
          "netns": "/var/run/netns/ns1" // must be filled in if the NIC is added to a non-default network namespace
        }
      ]
    }
  },
  "pod_entries": {}
}
```
- All QRM plugins are discovered by the kubelet plugin manager dynamically and register with QRM

##### Admit pod with numa-sharing && cpu-exclusive role

We assume that all related resources in NUMA node0 have been allocated and all allocatable resources in
NUMA node1 are still available. A pod1 arrives with one container requesting 10 CPUs and 20GB of memory,
with hostNetwork set to true, and its pod role is `numa-sharing && cpu-exclusive` (`numa-enhancement` for short).
This role means the pod can be located in the same NUMA node as other pods with the same role,
but it should be allocated exclusive `cpuset.cpus`.
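
For illustration, such a pod might be declared like below, assuming the role is conveyed through the `kubernetes.io/pod-role` annotation introduced earlier; the image name is a hypothetical placeholder:
```
{
  "apiVersion": "v1",
  "kind": "Pod",
  "metadata": {
    "name": "pod1",
    "annotations": {
      "kubernetes.io/pod-role": "numa-enhancement" // assumed annotation value for this role
    }
  },
  "spec": {
    "hostNetwork": true,
    "containers": [
      {
        "name": "container0",
        "image": "example.com/app:latest", // hypothetical image
        "resources": {
          "requests": { "cpu": "10", "memory": "20Gi" },
          "limits": { "cpu": "10", "memory": "20Gi" }
        }
      }
    ]
  }
}
```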
In the pod1 admission phase, QRM calls `GetTopologyHints()` of the QRM plugins, and gets the preferred NUMA affinity
hint (01) from all of them. Then QRM calls `Allocate()` of the plugins:
- calls `Allocate()` of the QRM CPU/Memory plugins and gets `ResourceAllocationResponses` similar to example 2
- calls `Allocate()` of the QRM NIC plugin and gets a `ResourceAllocationResponse` like below:
```
{
  "pod_uid": "c44ba6fd-2ce5-43e0-8d4d-b2224dfaeebd",
  "pod_namespace": "default",
  "pod_name": "pod1",
  "container_name": "container0",
  "container_type": "MAIN",
  "container_index": 0,
  "pod_role": "numa-enhancement",
  "resource_name": "NIC",
  "allocatation_result": {
    "resource_allocation": {
      "NIC": {
        "oci_property_name": "",
        "is_node_resource": false,
        "is_scalar_resource": false,
        "resource_hints": [
          {
            "nodes": [1],
            "Preferred": true
          }
        ],
        "envs": {
          "AFFINITY_NIC_ADDR_IPV6": "fdbd:dc05:3:155::20"
        },
        "annotations": {
          "kubernetes.io/host-netns-path": "/var/run/netns/ns1"
        }
      }
    }
  }
}
```

QRM caches the allocation results in the pod resources checkpoint.
In the `PreCreateContainer` phase, the kubeGenericRuntimeManager calls `GetResourceRunContainerOptions()`
of QRM and gets a `ResourceRunContainerOptions` like below, and its content will be filled into `ContainerConfig`:
```
{
  "envs": {
    "AFFINITY_NIC_ADDR_IPV6": "fdbd:dc05:3:155::20",
  },
  "annotations": {
    "kubernetes.io/host-netns-path": "/var/run/netns/ns1"
  },
  "linux_container_resources": {
    "cpusetCpus": "42-51",
    "cpusetMems": "1",
  },
}
```
pod1 starts successfully with the `cpuset.cpus` and `cpuset.mems` allocated. In addition, the container process can
get the IP address of the NUMA-affinitive NIC from the environment variable `AFFINITY_NIC_ADDR_IPV6`
and bind sockets to that address.

##### Admit another pod with the same role

A pod2 arrives with one container requesting 10 CPUs and 20GB of memory, with hostNetwork set to true, and its pod role is also
`numa-enhancement`. That means pod2 can be located in the same NUMA node as pod1 if the available
resources satisfy its requirements.

Similar to pod1, pod2 gets a `ResourceRunContainerOptions` like below in the admission phase:
```
{
  "envs": {
    "AFFINITY_NIC_ADDR_IPV6": "fdbd:dc05:3:155::20",
  },
  "annotations": {
    "kubernetes.io/host-netns-path": "/var/run/netns/ns1"
  },
  "linux_container_resources": {
    "cpusetCpus": "52-61",
    "cpusetMems": "1",
  },
}
```

Notice that pod1 and pod2 share the same NIC. If we instead implemented a device plugin with the Device Manager to
provide NIC information with NUMA affinity, the allocation would be exclusive, so we couldn't make multiple pods share the same device ID.

### New Flags and Configuration of QRM

#### Feature Gate Flag

A new feature gate flag will be added to enable the QRM feature.
This feature gate will be disabled by default in the initial releases.

Syntax: `--feature-gate=QoSResourceManager=false|true`

#### QRM Reconcile Period Flag

This flag controls the interval between invocations of `reconcileState()`, in which QRM adjusts allocation results dynamically.
If not supplied, its default value is 3s.
Syntax: `--qos-resource-manager-reconcile-period=10s|1m`

#### How this proposal affects the kubelet ecosystem

##### Container Manager

Container Manager will create QRM and register it to Topology Manager as a hint provider.

##### Topology Manager

Topology Manager will call out to QRM to gather topology hints,
and allocate resources with the corresponding registered resource plugins during the pod admission sequence.

##### kubeGenericRuntimeManager

kubeGenericRuntimeManager will indirectly call `GetResourceRunContainerOptions()` of QRM, and get the
`LinuxContainerResources` config populated with the allocation results of the requested resources while executing `startContainer()`.

##### Kubelet Node Status Setter

The MachineInfo setter will be extended to indirectly call `GetCapacity()` of QRM to get the capacity and
allocatable quantities of resources, as well as already-removed resource names, from the registered resource plugins.

##### Pod Resources Server

In order to get the allocated and allocatable resources managed by QRM and its registered resource plugins from the
pod resources server, we make the container manager implement the ResourcesProvider interface defined below:
```
// ResourcesProvider knows how to provide the resources used by the given container
type ResourcesProvider interface {
	// UpdateAllocatedResources frees any Resources that are bound to terminated pods.
	UpdateAllocatedResources()
	// GetTopologyAwareResources returns information about the resources assigned to pods and containers in topology aware format
	GetTopologyAwareResources(pod *v1.Pod, container *v1.Container) []*podresourcesapi.TopologyAwareResource
	// GetTopologyAwareAllocatableResources returns information about all the resources known to the manager in topology aware format
	GetTopologyAwareAllocatableResources() []*podresourcesapi.AllocatableTopologyAwareResource
}
```

When `List()` of the pod resources server is called, the calling chain below will be triggered:
```
(*v1PodResourcesServer).List(...) ->
(*containerManagerImpl).GetTopologyAwareResources(...) ->
(*QoSResourceManagerImpl).GetTopologyAwareResources(...) ->
(*resourcePluginEndpoint).GetTopologyAwareResources(...) for each registered resource plugin
```
QRM will merge the allocated resource responses from the resource plugins, and the pod resources server
will return the merged information to the end user.

When `GetAllocatableResources()` of the pod resources server is called, the calling chain below will be triggered:
```
(*v1PodResourcesServer).GetAllocatableResources(...) ->
(*containerManagerImpl).GetTopologyAwareAllocatableResources(...) ->
(*QoSResourceManagerImpl).GetTopologyAwareAllocatableResources(...) ->
(*resourcePluginEndpoint).GetTopologyAwareAllocatableResources(...) for each registered resource plugin
```
QRM will merge the allocatable resource responses from the resource plugins, and the pod resources server
will return the merged information to the end user.

### Test Plan

We will initialize QRM with mock resource plugins registered, and cover the key points listed below:
- the plugin manager can discover the listening resource plugin dynamically and register it into QRM successfully.
- QRM can return a correct LinuxContainerResources config populated with allocation results.
- Validate allocating resources to containers and getting preferred NUMA affinity hints in QRM through registered resource plugins.
- Validate that `reconcileState()` of QRM updates the cgroup configs for containers according to the latest resource allocation results.
- `GetTopologyAwareAllocatableResources()` and `GetTopologyAwareResources()` of QRM return correct allocatable
and allocated resource information to the pod resources server.
- the pod resources checkpoint is stored and restored normally, and its basic operations work as expected.

## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

##### How can this feature be enabled / disabled in a live cluster?

Feature gate
- Feature gate name: QoSResourceManager
- Components depending on the feature gate: kubelet
- Will enabling / disabling the feature require downtime of the control plane? No
- Will enabling / disabling the feature require downtime or reprovisioning of a node?
(Do not assume Dynamic Kubelet Config feature is enabled). Yes, a kubelet restart is required since it is controlled by a feature gate.

##### Does enabling the feature change any default behavior?

Yes, the pod admission flow changes if QRM is enabled: it will call the plugins to determine
whether the pod is admitted or not. And QRM can't work together with the CPU Manager or Memory Manager if
there are plugins for CPU/Memory allocation registered to QRM.

##### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, it uses a feature gate.

##### What happens if we reenable the feature if it was previously rolled back?

QRM uses the state file to track resource allocations. If the state file is not valid,
it's better to remove the state file and restart kubelet
(e.g. the state file might become invalid because some pods recorded in it have been removed).
But the Manager will reconcile to fix it, so it won't be a big deal.

##### Are there any tests for feature enablement/disablement?

Yes, there are a number of unit tests designated for state file validation.

### Rollout, Upgrade and Rollback Planning

##### How can a rollout or rollback fail? Can it impact already running workloads?

It is possible that the state file will have inconsistent data during the rollout,
because of the kubelet restart, but you can easily fix it by removing the state file and restarting kubelet.
It should not affect any running workloads. And the Manager will reconcile to fix it, so it won't be a big deal.

##### What specific metrics should inform a rollback?

The pod may fail with an admission error because the plugin fails to allocate resources.
You can see the error message under the pod events.

##### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Yes.

##### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

### Monitoring Requirements

###### How can an operator determine if the feature is in use by workloads?

QRM data will also be available under the Pod Resources API.

###### How can someone using this feature know that it is working for their instance?

- After the pod starts, you have two options to verify that containers work as expected:
  - via the Pod Resources API: you will need to connect to the grpc socket and get information from it.
    Please see the pod resources API doc page for more information.
  - checking the relevant container cgroup under the node.
- The pod failed to start because of an admission error.

###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?

This does not seem relevant to this feature.

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

A set of metrics will be added to indicate the running states of QRM and the plugins.
The detailed metric names and meanings will be added to this doc in the future.

### Dependencies

###### Does this feature depend on any specific services running in the cluster?

None

### Scalability

###### Will enabling / using this feature result in any new API calls?

No

###### Will enabling / using this feature result in introducing new API types?

No

###### Will enabling / using this feature result in any new calls to the cloud provider?

No

###### Will enabling / using this feature result in increasing size or count of the existing API objects?

No

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

No

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

No, the algorithm will run on a single `goroutine` with minimal memory requirements.

### Troubleshooting

##### How does this feature react if the API server and/or etcd is unavailable?

It is not affected, since the feature does not depend on the API server or etcd.

##### What are other known failure modes?

When enabling or disabling QRM, you must remove the QRM
state file (/var/lib/kubelet/qos_resource_manager_state), otherwise kubelet will fail to start.
You can identify the issue by checking the kubelet log.

##### What steps should be taken if SLOs are not being met to determine the problem?

Not applicable.

## Implementation History

QRM has been developed and is running in our production environment,
but still needs a little effort to produce an open-source version.

## Drawbacks

No objections exist to implementing this KEP.

## Appendix

### Related Features

- [Topology Manager](https://github.com/kubernetes/enhancements/blob/dcc8c7241513373b606198ab0405634af643c500/keps/sig-node/0035-20190130-topology-manager.md)
collects topology hints from various hint providers (e.g. CPU Manager or Device Manager) in order to calculate which
NUMA nodes can offer a suitable amount of resources for a container. The final decision of Topology Manager is
subject to the topology policy (i.e. best-effort, restricted, single-numa-node) and the
possible NUMA affinity of containers. Finally, the Topology Manager determines whether a container in a pod
can be deployed to the node or should be rejected.

- [CPU Manager](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/cpu-manager.md)
provides CPU pinning functionality by using the cgroups cpuset subsystem, and it also provides topology hints,
which indicate CPU core availability at particular NUMA nodes, to Topology Manager.
- [Device Manager](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/resource-management/device-plugin.md)
allows device vendors to advertise their device resources, such as NIC devices or GPU devices, through their
device plugins to kubelet, so that the devices can be utilized by containers. Similarly, the Device Manager provides
topology hints to the Topology Manager. The hints indicate the availability of devices at particular NUMA nodes.

- [Hugepages](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/20190129-hugepages.md) enables
the assignment of pre-allocated hugepage resources to a container.

- [Node Allocatable Feature](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/node-allocatable.md)
helps to increase the stability of node operation, as it pre-reserves compute resources for kubelet and system
processes. In v1.17, the feature supports the following reservable resources: CPU, memory, and ephemeral storage.