---
title: Enhance ORM by NRI
authors:
  - "airren"
  - "hle2"
reviewers:
  - "caohe"
creation-date: 2024-03-03
last-updated: 2024-04-24
status: implementable

---

# Enhance ORM by NRI

<!--ts-->
* [Enhance ORM by NRI](#enhance-orm-by-nri)
   * [Summary](#summary)
   * [Motivation](#motivation)
      * [Goals](#goals)
      * [Non-Goals/Future Work](#non-goalsfuture-work)
   * [Proposal](#proposal)
      * [User Stories](#user-stories)
         * [Story1: Use origin kubernetes without intrusive modifications](#story1-use-origin-kubernetes-without-intrusive-modifications)
         * [Story2: Synchronous configuration of QoS policies and injection of environment variables](#story2-synchronous-configuration-of-qos-policies-and-injection-of-environment-variables)
      * [Requirements](#requirements)
         * [Functional Requirements](#functional-requirements)
         * [Non-Functional Requirements](#non-functional-requirements)
      * [Design Details](#design-details)
         * [Detailed working flow](#detailed-working-flow)
         * [Addon](#addon)
         * [Modification](#modification)
         * [Test Plan](#test-plan)
   * [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
      * [Feature Enablement and Rollback](#feature-enablement-and-rollback)
         * [How can this feature be enabled / disabled in a live cluster?](#how-can-this-feature-be-enabled--disabled-in-a-live-cluster)
      * [Troubleshooting](#troubleshooting)
         * [How does this feature react if the NRI not supported?](#how-does-this-feature-react-if-the-nri-not-supported)
         * [How to handle resource allocation failures?](#how-to-handle-resource-allocation-failures)
         * [What happens if the NRI stub times out or if the socket connection fails?](#what-happens-if-the-nri-stub-times-out-or-if-the-socket-connection-fails)
   * [Appendix](#appendix)
   * [Implementation History](#implementation-history)

<!-- Created by https://github.com/ekalinin/github-markdown-toc -->
<!-- Added by: airren, at: Wed Mar 27 14:55:54 CST 2024 -->

<!--te-->

## Summary

Latency-sensitive services need sufficient resource guarantees across a wide
range of business scenarios, especially when online and offline tasks are mixed
on the same nodes. This requires Kubernetes to provide more granular resource
management capabilities, enhance container isolation, and reduce interference
between containers.

As of now, Kubernetes does not offer a fully comprehensive resource management
solution. Many open-source projects in the Kubernetes ecosystem have devised
their own methods to modify the deployment and management processes of pods,
enabling fine-grained resource allocation.

There are various approaches to extending Kubernetes, which we have summarized
as follows.

![kubernetes-enhance-overview](kubernetes-enhance-overview.png)

All the methods listed above can enhance Kubernetes, but except for the standalone
approach, they unavoidably involve intrusive modifications to the upstream Kubernetes
components, making it difficult for users to stay synchronized with upstream
components. Although the standalone approach avoids modifications to upstream
components, its asynchronous update method has numerous drawbacks of its own.

NRI emerged to remove the need for intrusive modifications to Kubernetes and
changes to its default process, and to give developers a more unified
implementation approach.

[NRI](https://github.com/containerd/nri) is a plugin-based node resource management approach introduced by
the upstream community. Using NRI, Kubernetes' node resource management capabilities
can be enhanced through plugins without intrusive modifications to the upstream
Kubernetes components.

> NRI allows plugging domain- or vendor-specific custom logic into OCI-compatible
> runtimes. This logic can make controlled changes to containers or perform extra
> actions outside the scope of OCI at certain points in a containers lifecycle.
> This can be used, for instance, for improved allocation and management of devices
> and other container resources.

![nri-architecture](nri-architecture.png)

This proposal describes how to enhance Katalyst using NRI, allowing Katalyst to
be deployed on upstream Kubernetes and making it easier to maintain and use.

## Motivation

Katalyst enhances Kubernetes resource management policies on a single node through
the QoS Resource Manager (QRM). However, the current QRM mode involves intrusive
modifications to the Kubelet, which is inconvenient for users who run upstream
Kubernetes rather than the KubeWharf distribution. To address this, Katalyst
proposes the ORM architecture, which provides a solution decoupled from the Kubelet
as a supplement to the QRM solution.

The ORM architecture has two implementation approaches. The first approach,
named Bypass, polls the Kubelet's API for pod events on the current node and
updates pod resources. This approach is asynchronous and cannot inject parameters
such as environment variables. The other approach is based on NRI. NRI (Node
Resource Interface) is a general framework for CRI-compatible container runtime
plugin extensions. It offers a mechanism for extensions to monitor pod/container
states and make limited configuration modifications. Using NRI, Katalyst can
synchronously modify resources and inject other information, such as environment
variables, during pod events.
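
To make the difference from Bypass concrete: in a synchronous hook, the plugin hands its changes back to the runtime before the container starts. The following is a minimal sketch of that idea in plain Go. The types `allocation` and `containerAdjustment`, the function names, and the sample environment variable names are all hypothetical stand-ins for illustration; they are not the NRI API or Katalyst code.

```go
package main

import (
	"fmt"
	"sort"
)

// allocation is a hypothetical result of a resource plugin's admit step.
type allocation struct {
	CPUSet string            // CPUs assigned to the container, e.g. "0-3"
	Env    map[string]string // environment variables to inject
}

// containerAdjustment is a hypothetical, simplified stand-in for the
// adjustment a plugin returns from a container-creation hook.
type containerAdjustment struct {
	CPUSet string
	Env    []string // "KEY=VALUE" entries
}

// buildAdjustment converts an allocation into an adjustment. Keys are
// sorted so that the resulting Env order is deterministic.
func buildAdjustment(a allocation) containerAdjustment {
	keys := make([]string, 0, len(a.Env))
	for k := range a.Env {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	adj := containerAdjustment{CPUSet: a.CPUSet}
	for _, k := range keys {
		adj.Env = append(adj.Env, k+"="+a.Env[k])
	}
	return adj
}

// demoAdjustment builds an adjustment from illustrative sample values.
func demoAdjustment() containerAdjustment {
	return buildAdjustment(allocation{
		CPUSet: "0-3",
		Env:    map[string]string{"QOS_LEVEL": "dedicated_cores", "NUMA_NODE": "0"},
	})
}

func main() {
	adj := demoAdjustment()
	fmt.Println(adj.CPUSet, adj.Env)
}
```

Because the adjustment is returned from the hook itself, the cpuset and the environment variables are in place when the container process starts, which an asynchronous poller cannot guarantee.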

### Goals

- Expand Katalyst's ORM mode using NRI to enhance the resource management capabilities
  of Kubernetes.
- Support fine-grained resource control when containerd is used as the CRI runtime.

### Non-Goals/Future Work

- Support for other runtimes besides containerd, such as cri-o and docker.

## Proposal

Unlike QRM or ORM's Bypass Mode, the Katalyst-agent works as an NRI plugin that
subscribes to pod/container lifecycle events from the CRI runtime (containerd in
this proposal). The Katalyst-agent then either returns an adjusted container spec
from the hook events or actively pushes an update to the container spec.

- Get pod/container lifecycle events and pod or container information from NRI.
- Transform the NRI format information into CRI format to reuse the existing admit
  implementation of the QRM plugins.
- Update the NRI format container spec in the CRI runtime.
- While reconciling, use NRI UpdateContainer to reconfigure resources.

**NRI Enhanced ORM (along with kubelet polling)**

![orm-architecture](orm-architecture.png)

### User Stories

#### Story1: Use origin kubernetes without intrusive modifications

Extending and enhancing Kubernetes' resource management capabilities is a common
requirement in many business scenarios. At the same time, users often need all
Kubernetes components to remain consistent with the upstream community, with no
intrusive modifications to the original Kubernetes components. With NRI mode
enabled, deploying Katalyst on existing clusters does not require restarting the
original cluster. Enhancements to the original Kubernetes can be achieved through
a plugin-based approach.

#### Story2: Synchronous configuration of QoS policies and injection of environment variables

When enhancing QoS policies in Kubernetes, synchronous modification is the most
efficient method. With NRI Mode enabled, Katalyst plugins can synchronously modify
pod resources during pod creation, ensuring QoS policy allocation before pod
execution. NRI Mode also makes dynamic updates to pod resources possible.
During pod creation, adjustments to pod resources, device binding,
RDT, and environment variable injection can be achieved via NRI Mode.

### Requirements

- containerd must be upgraded to v1.7.0 or later.

#### Functional Requirements

- Support all functionalities corresponding to Bypass Mode under the existing ORM
  architecture. This includes adjusting a container's cpuset / cfsquota and memory QoS.
- Support injecting environment variables into containers.

#### Non-Functional Requirements

- Achieve synchronous configuration of QoS policies, improving the
  responsiveness of QoS policy configuration.
- Remain fully compatible with upstream native Kubernetes components, requiring no
  intrusive modifications.

### Design Details

#### Detailed working flow

![orm-nri-details](orm-nir-details.png)

In this part, the method based on Kubelet API polling is referred to as
**_Bypass_** Mode, while the method based on NRI is referred to as **_NRI_** Mode.

#### Addon

- The ORM supports two operational modes: Bypass or NRI. Only one mode can be active
  at any given time. When a new ORM Manager is created, the operational mode is
  determined by reading the configuration. Changing the mode during runtime is
  not supported.

  ```go
  type workMode string

  const (
  	workModeNri    workMode = "nri"
  	workModeBypass workMode = "bypass"
  )

  type ManagerImpl struct {
  	ctx context.Context
  	....
  	// ORM run mode: bypass or nri.
  	// Bypass mode is triggered by polling the kubelet API to get pod events.
  	// NRI mode requires containerd version >= 1.7.0 with NRI enabled.
  	mode workMode
  	....
  }

  func NewManager(... config *config.Configuration) {
  	// init ORM work mode with essential components
  	m.initORMWorkMode(config, metaServer, emitter)
  }

  func (m *ManagerImpl) initORMWorkMode(config *config.Configuration, metaServer *metaserver.MetaServer, emitter metrics.MetricEmitter) {
  	// init ORM work mode according to the configuration and NRI status
  }
  ```

- The ORM ManagerImpl functions as an NRI stub, implementing processing logic
  within the corresponding hook event functions.

  ```go
  import "github.com/containerd/nri/pkg/stub"

  type ManagerImpl struct {
  	ctx context.Context
  	....
  	// nriStub is the implementation of the NRI event handlers
  	nriStub stub.Stub
  	// nriMask stores the specific events that need to be hooked
  	nriMask    stub.EventMask
  	nriOptions []stub.Option
  	nriConf    nriConfig
  	....
  }
  ```

- To enhance the ORM implementation, three hook functions are required:
  `RunPodSandbox()`, `CreateContainer()`, and `RemovePodSandbox()`.

  **Step 1**, during `RunPodSandbox()`, the `Admit()` function is triggered.
  If `Admit()` succeeds, resources are allocated for the container, and the pod
  creation process continues. If `Admit()` fails, pod creation also fails.

  ```go
  func (m *ManagerImpl) RunPodSandbox(pod *api.PodSandbox) error {
  	err := m.processAddPod(pod.Uid)
  	if err != nil {
  		klog.Errorf("[ORM] RunPodSandbox processAddPod fail, pod: %s/%s/%s, err: %v",
  			pod.Namespace, pod.Name, pod.Uid, err)
  	}
  	return err
  }
  ```

  **Step 2**, after a successful `Admit()`, the process proceeds to the
  `CreateContainer()` event. At this point, resources have been allocated for the
  container by `Admit()`. The corresponding resources are written into the
  container's spec, which is then returned.

  ```go
  func (m *ManagerImpl) CreateContainer(pod *api.PodSandbox, container *api.Container) (*api.ContainerAdjustment, []*api.ContainerUpdate, error) {
  	// Update the container spec from the podResources
  	adjust, err := m.updateContainer(pod, container)
  	return adjust, nil, err
  }
  ```

  **Step 3**, during `RemovePodSandbox()`, all resource allocations related to
  the pod are released.

  ```go
  func (m *ManagerImpl) RemovePodSandbox(pod *api.PodSandbox) error {
  	err := m.processDeletePod(pod.Uid)
  	if err != nil {
  		klog.Errorf("[ORM] RemovePodSandbox processDeletePod fail, pod: %s/%s/%s, err: %v",
  			pod.Namespace, pod.Name, pod.Uid, err)
  	}
  	return err
  }
  ```

#### Modification

- In NRI Mode, once resources have been allocated in `Admit()`, `Allocate()`
  does not need to execute `syncContainer()`; it should simply return after the
  resources have been allocated.

  ```go
  func (m *ManagerImpl) Allocate(pod *v1.Pod, container *v1.Container) error {
  	....
  	err := m.addContainer(pod, container)
  	// return right after resource allocation when running in NRI mode
  	if err != nil || m.mode == workModeNri {
  		return err
  	}
  	err = m.syncContainer(pod, container)
  	return err
  }
  ```

- In NRI Mode, the executor in `syncContainer()` can be implemented through NRI's
  `updateContainer()`.

  ```go
  if m.mode == workModeNri {
  	m.updateContainerByNRI(pod, container)
  } else {
  	m.syncContainer(pod, &container)
  }
  ```

- The `metaServer` becomes a member variable of the ORM `ManagerImpl` because it is
  used in both Bypass and NRI modes.
- In NRI mode, halt the MetaManager's Reconcile and use NRI to hook the Pod/Container events.
- In NRI mode, execution is carried out through NRI, so no Executor needs to be created.

#### Test Plan

We will test the enhancement of ORM by NRI in a real cluster by deploying resource
management plugins that simulate task invocation to configure QoS policies. The
tests will cover the key points listed below:

- ORM completes registration with containerd as an NRI plugin and establishes a connection.
- ORM can set the correct LinuxContainerResources configuration for containers through
  NRI, based on the allocation results.
- ORM can add environment variables to containers through NRI.
- Validate that `reconcileState()` of ORM updates the cgroup configs for containers
  with the latest resource allocation results.

## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

#### How can this feature be enabled / disabled in a live cluster?

This feature is disabled by default and can be enabled through configuration.
If a failure is detected in the NRI runtime environment while NRI mode is enabled,
ORM falls back to Bypass Mode.

### Troubleshooting

#### How does this feature react if the NRI not supported?

It will fall back to the Bypass mode of ORM.

#### How to handle resource allocation failures?

If admission fails, the pod enters a retry loop.

#### What happens if the NRI stub times out or if the socket connection fails?

Currently, if the NRI plugin times out, containerd stops invoking the plugin.
To address this, the following strategy needs to be adopted:

- On timeout, invoke `stub.Restart` in `OnClose()` to re-create the connection to containerd.
- Run `Admit()` with a context that carries a configured timeout; if it times out, try again.

## Appendix

NRI: [https://github.com/containerd/nri](https://github.com/containerd/nri)

ORM PR: [#406](https://github.com/kubewharf/katalyst-core/pull/406) [#430](https://github.com/kubewharf/katalyst-core/issues/430)

## Implementation History

- [x] 01/16/2024 Proposed idea in community meeting
- [x] 03/12/2024 Compile a document following the proposal template
- [x] 03/19/2024 Present proposal at a community meeting
- [x] 04/20/2024 Complete the basic functionalities of NRI as covered in the detailed
  design
- [ ] 05/10/2024 Commence the first round of testing
- [ ] 05/20/2024 Open proposal PR for code
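
As a supplement to the Troubleshooting answer on timeouts, the "`Admit()` with a timeout context, retry on timeout" strategy could be sketched roughly as follows. This is a minimal illustration in plain Go, not Katalyst code: `admitFunc`, `admitWithRetry`, `slowThenFast`, and `runDemo` are hypothetical names, and the real `Admit()` signature may differ.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// admitFunc stands in for the ORM Admit() call (hypothetical signature).
type admitFunc func(ctx context.Context) error

// admitWithRetry runs admit under a per-attempt timeout. It retries only
// when an attempt hit the deadline; any other error is returned at once.
// It reports how many attempts were made.
func admitWithRetry(parent context.Context, timeout time.Duration, maxAttempts int, admit admitFunc) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		ctx, cancel := context.WithTimeout(parent, timeout)
		err = admit(ctx)
		cancel() // release the timer before the next attempt
		if err == nil {
			return attempt, nil
		}
		if !errors.Is(err, context.DeadlineExceeded) {
			return attempt, err
		}
	}
	return maxAttempts, fmt.Errorf("admit timed out after %d attempts: %w", maxAttempts, err)
}

// slowThenFast simulates an admit call that blocks until the deadline on
// its first invocation and succeeds immediately afterwards.
func slowThenFast() admitFunc {
	calls := 0
	return func(ctx context.Context) error {
		calls++
		if calls == 1 {
			<-ctx.Done()
			return ctx.Err()
		}
		return nil
	}
}

// runDemo wires the pieces together with illustrative values.
func runDemo() (int, error) {
	return admitWithRetry(context.Background(), 20*time.Millisecond, 3, slowThenFast())
}

func main() {
	attempts, err := runDemo()
	fmt.Println(attempts, err)
}
```

Bounding each attempt separately keeps a single stalled plugin call from blocking the hook indefinitely, while the attempt limit keeps the total time spent in the hook predictable.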