
---
title: Enhance ORM by NRI
authors:
  - "airren"
  - "hle2"
reviewers:
  - "caohe"
creation-date: 2024-03-03
last-updated: 2024-04-24
status: implementable

---
# Enhance ORM by NRI

<!--ts-->
* [Enhance ORM by NRI](#enhance-orm-by-nri)
   * [Summary](#summary)
   * [Motivation](#motivation)
      * [Goals](#goals)
      * [Non-Goals/Future Work](#non-goalsfuture-work)
   * [Proposal](#proposal)
      * [User Stories](#user-stories)
         * [Story1: Use origin kubernetes without  intrusive modifications](#story1-use-origin-kubernetes-without--intrusive-modifications)
         * [Story2: Synchronous configuration of QoS policies and injection of environment variables](#story2-synchronous-configuration-of-qos-policies-and-injection-of-environment-variables)
      * [Requirements](#requirements)
         * [Functional Requirements](#functional-requirements)
         * [Non-Functional Requirements](#non-functional-requirements)
      * [Design Details](#design-details)
         * [Detailed working flow](#detailed-working-flow)
         * [Addon](#addon)
         * [Modification](#modification)
         * [Test Plan](#test-plan)
   * [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
      * [Feature Enablement and Rollback](#feature-enablement-and-rollback)
         * [How can this feature be enabled / disabled in a live cluster?](#how-can-this-feature-be-enabled--disabled-in-a-live-cluster)
      * [Troubleshooting](#troubleshooting)
         * [How does this feature react if the NRI not supported?](#how-does-this-feature-react-if-the-nri-not-supported)
         * [How to handle resource allocation failures?](#how-to-handle-resource-allocation-failures)
         * [What happens if the NRI stub times out or if the socket connection fails?](#what-happens-if-the-nri-stub-times-out-or-if-the-socket-connection-fails)
   * [Appendix](#appendix)
   * [Implementation History](#implementation-history)

<!-- Created by -->
<!-- Added by: airren, at: Wed Mar 27 14:55:54 CST 2024 -->

<!--te-->
## Summary

To meet the needs of diverse business scenarios, latency-sensitive services need
sufficient resource guarantees, especially when online and offline tasks are
co-located. This requires Kubernetes to provide finer-grained resource management,
stronger container isolation, and less interference between containers.

Kubernetes does not currently offer a comprehensive resource management solution.
Many open-source projects in the Kubernetes ecosystem have devised their own ways
of modifying pod deployment and management to enable fine-grained resource
allocation.

There are various approaches to extending Kubernetes, summarized below.

![kubernetes-enhance-overview](kubernetes-enhance-overview.png)

All of the methods above can enhance Kubernetes, but every one except the
standalone approach requires intrusive modifications to upstream Kubernetes
components, which makes it difficult for users to stay in sync with upstream.
The standalone approach avoids modifying upstream components, but its asynchronous
update model has numerous drawbacks of its own.

NRI emerged to remove the need for intrusive modifications to Kubernetes and
changes to its default workflow, and to give developers a more unified way to
implement such extensions.

NRI is a plugin-based node resource management approach introduced by the upstream
community. With NRI, Kubernetes' node resource management capabilities can be
enhanced through plugins, without intrusive modifications to upstream Kubernetes
components.

> NRI allows plugging domain- or vendor-specific custom logic into OCI-compatible
> runtimes. This logic can make controlled changes to containers or perform extra
> actions outside the scope of OCI at certain points in a containers lifecycle.
> This can be used, for instance, for improved allocation and management of devices
> and other container resources.

![nri-architecture](nri-architecture.png)

This proposal describes how to enhance Katalyst using NRI, so that Katalyst can be
deployed on unmodified upstream Kubernetes and becomes easier to maintain and use.
## Motivation

Katalyst enhances Kubernetes resource management policies on a single node through
the QoS Resource Manager (QRM). However, the current QRM mode requires intrusive
modifications to the Kubelet, which is inconvenient for users who run upstream
Kubernetes rather than the Kubewharf distribution. To address this, Katalyst
proposes the ORM architecture, which is decoupled from the Kubelet and supplements
the QRM solution.

The ORM architecture has two implementation approaches. The first, named Bypass,
polls the Kubelet's API for pod events on the current node and updates pod
resources. This approach is asynchronous and cannot inject parameters such as
environment variables. The second is based on NRI (Node Resource Interface), a
general framework for plugin extensions to CRI-compatible container runtimes. It
offers a mechanism for extensions to monitor pod/container state and make limited
configuration modifications. Using NRI, Katalyst can synchronously modify
resources and inject other information, such as environment variables, during pod
events.

### Goals

- Extend Katalyst's ORM mode using NRI to enhance the resource management
capabilities of Kubernetes.
- Support fine-grained resource control when containerd is used as the CRI runtime.

### Non-Goals/Future Work

- Support for runtimes other than containerd, such as CRI-O and Docker.
## Proposal

Unlike QRM or ORM's Bypass Mode, the Katalyst-agent works as an NRI plugin that
subscribes to pod/container lifecycle events from the CRI runtime (containerd in
this proposal). The Katalyst-agent then either returns an adjusted container spec
from the hook events or updates the container spec through an active update.
Specifically, it will:

- Get pod/container lifecycle events and pod or container information from NRI.
- Transform the NRI-format information into CRI format so that the existing admit
implementation of the QRM plugins can be reused.
- Update the NRI-format container spec to the CRI runtime.
- While reconciling, use NRI `UpdateContainer` to reconfigure resources.

**NRI Enhanced ORM (along with kubelet polling)**

![orm-architecture](orm-architecture.png)
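The format conversion step in the list above can be pictured with a small sketch. It does not use the real NRI or CRI packages: `nriResources` and `criResources` are simplified, hypothetical stand-ins whose field names only loosely follow the two APIs, and `toCRI` is an illustrative helper rather than Katalyst code.

```go
package main

import "fmt"

// nriResources is a hypothetical, simplified stand-in for the Linux resources
// that NRI reports for a container. Pointer fields model optional values
// (nil means "not set").
type nriResources struct {
	CPUShares   *uint64
	CPUQuota    *int64
	CPUPeriod   *uint64
	CpusetCpus  string
	CpusetMems  string
	MemoryLimit *int64
}

// criResources is a hypothetical, simplified stand-in for the CRI-style view
// that an existing admit path would consume.
type criResources struct {
	CpuShares          int64
	CpuQuota           int64
	CpuPeriod          int64
	CpusetCpus         string
	CpusetMems         string
	MemoryLimitInBytes int64
}

// toCRI copies every field that is set and leaves unset fields at zero.
func toCRI(in *nriResources) *criResources {
	if in == nil {
		return nil
	}
	out := &criResources{CpusetCpus: in.CpusetCpus, CpusetMems: in.CpusetMems}
	if in.CPUShares != nil {
		out.CpuShares = int64(*in.CPUShares)
	}
	if in.CPUQuota != nil {
		out.CpuQuota = *in.CPUQuota
	}
	if in.CPUPeriod != nil {
		out.CpuPeriod = int64(*in.CPUPeriod)
	}
	if in.MemoryLimit != nil {
		out.MemoryLimitInBytes = *in.MemoryLimit
	}
	return out
}

func main() {
	quota, period := int64(150000), uint64(100000)
	res := toCRI(&nriResources{CPUQuota: &quota, CPUPeriod: &period, CpusetCpus: "0-3"})
	fmt.Printf("%+v\n", *res)
}
```

Keeping the conversion in one pure function makes it easy to unit test independently of the runtime connection.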
### User Stories

#### Story1: Use origin kubernetes without  intrusive modifications

Extending and enhancing Kubernetes' resource management capabilities is a common
requirement in many business scenarios. At the same time, users usually want all
Kubernetes components to remain consistent with the upstream community and want to
avoid intrusive modifications to the original Kubernetes components. With NRI mode
enabled, deploying Katalyst on an existing cluster does not require restarting the
cluster, and the enhancements to Kubernetes are delivered through a plugin-based
approach.

#### Story2: Synchronous configuration of QoS policies and injection of environment variables

When enhancing QoS policies in Kubernetes, synchronous modification is the most
efficient method. With NRI Mode enabled, Katalyst plugins can synchronously modify
pod resources during pod creation, so that the QoS policy is applied before the
pod starts running. NRI Mode also allows pod resources to be updated dynamically.
During pod creation, adjustments to pod resources, device binding, RDT, and
environment variable injection can all be performed through NRI Mode.
### Requirements

- containerd must be upgraded to v1.7.0 or later.

#### Functional Requirements

- Support all functionality that Bypass Mode provides under the existing ORM
architecture. This includes adjusting a container's cpuset, CFS quota, and memory QoS.
- Support injecting environment variables into containers.
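As a worked illustration of the CFS quota adjustment mentioned above, the sketch below converts a CPU limit in millicores into a quota for a given period. The function name, the 100 ms default period, and the 1 ms lower bound are assumptions chosen for the example; they are not taken from the ORM code.

```go
package main

import "fmt"

const (
	defaultPeriodUs = 100000 // assumed CFS period of 100ms, in microseconds
	minQuotaUs      = 1000   // assumed lower bound of 1ms, in microseconds
)

// milliCPUToQuota converts a CPU limit in millicores into a CFS quota in
// microseconds for the given period. A non-positive limit yields 0, which
// this sketch treats as "no quota configured".
func milliCPUToQuota(milliCPU, periodUs int64) int64 {
	if milliCPU <= 0 {
		return 0
	}
	quota := milliCPU * periodUs / 1000
	if quota < minQuotaUs {
		quota = minQuotaUs
	}
	return quota
}

func main() {
	fmt.Println(milliCPUToQuota(1500, defaultPeriodUs)) // 1.5 CPUs -> 150000
	fmt.Println(milliCPUToQuota(5, defaultPeriodUs))    // clamped to 1000
}
```

For example, a limit of 1.5 CPUs with a 100 ms period gives 1500 * 100000 / 1000 = 150000 microseconds of runtime per period.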
#### Non-Functional Requirements

- QoS policies are configured synchronously, which improves the responsiveness of
QoS policy configuration.
- The feature is fully compatible with upstream native Kubernetes components and
requires no intrusive modifications.
### Design Details

#### Detailed working flow

![orm-nri-details](orm-nir-details.png)

In this section, the method based on polling the Kubelet API is referred to as
**_Bypass_** Mode, and the method based on NRI is referred to as **_NRI_** Mode.
#### Addon

- The ORM supports two operational modes: Bypass or NRI. Only one mode can be active
at any given time. When a new ORM Manager is created, the operational mode is
determined by reading the configuration, and it cannot be changed at runtime.

    ```go
    type workMode string

    const (
        workModeNri    workMode = "nri"
        workModeBypass workMode = "bypass"
    )

    type ManagerImpl struct {
        ctx context.Context
        ....
        // ORM run mode: bypass or nri.
        // Bypass mode polls the kubelet API to get pod events.
        // NRI mode requires containerd >= 1.7.0 with NRI enabled.
        mode workMode
        ....
    }

    func NewManager(... config *config.Configuration) {
        // Initialize the ORM work mode together with its essential components.
        m.initORMWorkMode(config, metaServer, emitter)
    }

    func (m *ManagerImpl) initORMWorkMode(config *config.Configuration, metaServer *metaserver.MetaServer, emitter metrics.MetricEmitter) {
        // Initialize the ORM work mode according to the configuration and NRI status.
    }
    ```
- The ORM `ManagerImpl` functions as an NRI stub and implements its processing logic
within the corresponding hook event functions.

    ```go
    import ""

    type ManagerImpl struct {
        ctx context.Context
        ....
        // nriStub is the implementation of the NRI event handlers.
        nriStub stub.Stub
        // nriMask stores the specific events that need to be hooked.
        nriMask    stub.EventMask
        nriOptions []stub.Option
        nriConf    nriConfig
        ....
    }
    ```
- The enhanced ORM implementation requires three hook functions:
`RunPodSandbox()`, `CreateContainer()`, and `RemovePodSandbox()`.

  **Step 1**: during `RunPodSandbox()`, the `Admit()` function is triggered.
If `Admit()` succeeds, resources are allocated for the container and the pod
creation process continues. If `Admit()` fails, pod creation also fails.

    ```go
    func (m *ManagerImpl) RunPodSandbox(pod *api.PodSandbox) error {
        // Admit the pod and allocate its resources before any container is created.
        err := m.processAddPod(pod.Uid)
        if err != nil {
            klog.Errorf("[ORM] RunPodSandbox processAddPod fail, pod: %s/%s/%s, err: %v",
                pod.Namespace, pod.Name, pod.Uid, err)
        }
        return err
    }
    ```

  **Step 2**: after a successful `Admit()`, the process proceeds to the
`CreateContainer()` event. At this point, `Admit()` has already allocated
resources for the container. The corresponding resources are written into the
container's spec, which is then returned.

    ```go
    func (m *ManagerImpl) CreateContainer(pod *api.PodSandbox, container *api.Container) (*api.ContainerAdjustment, []*api.ContainerUpdate, error) {
        // Update the container spec from the pod's allocated resources.
        adjust, err := m.updateContainer(pod, container)
        return adjust, nil, err
    }
    ```

  **Step 3**: during `RemovePodSandbox()`, all resource allocations related to
the pod are released.

    ```go
    func (m *ManagerImpl) RemovePodSandbox(pod *api.PodSandbox) error {
        // Release every allocation that was made for this pod.
        err := m.processDeletePod(pod.Uid)
        if err != nil {
            klog.Errorf("[ORM] RemovePodSandbox processDeletePod fail, pod: %s/%s/%s, err: %v",
                pod.Namespace, pod.Name, pod.Uid, err)
        }
        return err
    }
    ```
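To see how the three hooks fit together, the following self-contained sketch replays the flow against an in-memory allocation store. It does not use the NRI packages: `podSandbox`, `container`, `adjustment`, `allocation`, and `manager` are hypothetical stand-ins, and the `admit` callback takes the place of the real `Admit()` path.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the NRI pod, container, and adjustment types.
type podSandbox struct{ UID, Namespace, Name string }
type container struct{ Name string }
type adjustment struct {
	CpusetCpus string
	Env        map[string]string
}

// allocation is what the hypothetical admit step decides for a pod.
type allocation struct {
	CpusetCpus string
	Env        map[string]string
}

type manager struct {
	admit  func(pod *podSandbox) (*allocation, error) // stand-in for Admit()
	allocs map[string]*allocation                     // keyed by pod UID
}

// RunPodSandbox admits the pod; an error here aborts pod creation.
func (m *manager) RunPodSandbox(pod *podSandbox) error {
	a, err := m.admit(pod)
	if err != nil {
		return fmt.Errorf("admit pod %s/%s: %w", pod.Namespace, pod.Name, err)
	}
	m.allocs[pod.UID] = a
	return nil
}

// CreateContainer returns the adjustment derived from the stored allocation.
func (m *manager) CreateContainer(pod *podSandbox, c *container) (*adjustment, error) {
	a, ok := m.allocs[pod.UID]
	if !ok {
		return nil, errors.New("no allocation for pod " + pod.UID)
	}
	return &adjustment{CpusetCpus: a.CpusetCpus, Env: a.Env}, nil
}

// RemovePodSandbox releases everything held for the pod.
func (m *manager) RemovePodSandbox(pod *podSandbox) error {
	delete(m.allocs, pod.UID)
	return nil
}

func main() {
	m := &manager{
		allocs: map[string]*allocation{},
		admit: func(*podSandbox) (*allocation, error) {
			return &allocation{CpusetCpus: "2-3", Env: map[string]string{"EXAMPLE_KEY": "example"}}, nil
		},
	}
	pod := &podSandbox{UID: "uid-1", Namespace: "default", Name: "demo"}
	if err := m.RunPodSandbox(pod); err != nil {
		panic(err)
	}
	adj, _ := m.CreateContainer(pod, &container{Name: "app"})
	fmt.Println(adj.CpusetCpus, adj.Env["EXAMPLE_KEY"])
	_ = m.RemovePodSandbox(pod)
	fmt.Println(len(m.allocs))
}
```

Because the allocation is stored at sandbox time, `CreateContainer` only reads state and can return quickly, which matters when the runtime is waiting synchronously on the plugin.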
#### Modification

- In NRI Mode, once `Admit()` has completed the resource allocation, `Allocate()`
does not need to execute `syncContainer()`; it should simply return after the
resources have been allocated.

    ```go
    func (m *ManagerImpl) Allocate(pod *v1.Pod, container *v1.Container) error {
        ....
        err := m.addContainer(pod, container)
        // In NRI mode, return as soon as the resources have been allocated.
        if err != nil || m.mode == workModeNri {
            return err
        }
        return m.syncContainer(pod, container)
    }
    ```

- In NRI Mode, the executor in `syncContainer()` can be implemented through NRI's
`updateContainer()`.

    ```go
    if m.mode == workModeNri {
        m.updateContainerByNRI(pod, container)
    } else {
        m.syncContainer(pod, &container)
    }
    ```

- The `metaServer` becomes a member variable of the ORM `ManagerImpl` because it is
used in both Bypass and NRI modes.
- In NRI mode, the MetaManager's Reconcile is halted and NRI is used to hook the pod/container events.
- In NRI mode, execution is carried out through NRI, so no Executor needs to be created.
#### Test Plan

We will test the NRI enhancement of ORM in a real cluster by deploying simulated
resource management plugins that configure QoS policies. The tests will cover the
key points listed below:

- ORM registers with containerd as an NRI plugin and establishes a connection.
- ORM can apply the correct `LinuxContainerResources` configuration, based on the
allocation results, to containers through NRI.
- ORM can add environment variables to containers through NRI.
- `reconcileState()` of ORM updates the cgroup configuration of containers with the
latest resource allocation results.
## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

#### How can this feature be enabled / disabled in a live cluster?

This feature is disabled by default and can be enabled through configuration.
If a failure is detected in the NRI runtime environment while NRI mode is enabled,
ORM falls back to Bypass Mode.
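The enablement and fallback rule can be sketched as a single decision made at startup. In the sketch below, `resolveWorkMode`, the `configured` string, and the `nriAvailable` flag are hypothetical names; how the configuration is read and how NRI availability is probed are left out.

```go
package main

import "fmt"

type workMode string

const (
	workModeNri    workMode = "nri"
	workModeBypass workMode = "bypass"
)

// resolveWorkMode picks the ORM work mode once at startup.
// configured is the value read from configuration (hypothetical option);
// nriAvailable reports whether the NRI runtime check passed (hypothetical probe).
func resolveWorkMode(configured string, nriAvailable bool) workMode {
	if workMode(configured) == workModeNri && nriAvailable {
		return workModeNri
	}
	// The default, unknown values, and NRI failures all end up in Bypass mode.
	return workModeBypass
}

func main() {
	fmt.Println(resolveWorkMode("nri", true))  // nri
	fmt.Println(resolveWorkMode("nri", false)) // bypass
	fmt.Println(resolveWorkMode("", true))     // bypass
}
```

Treating every case other than "NRI requested and available" as Bypass keeps the feature off by default and makes the fallback path the same as the default path.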
### Troubleshooting

#### How does this feature react if the NRI not supported?

ORM falls back to Bypass mode.

#### How to handle resource allocation failures?

If `Admit()` fails, the pod enters a retry loop.

#### What happens if the NRI stub times out or if the socket connection fails?

Currently, if the NRI plugin times out, containerd stops invoking the plugin.
To address this, the following strategy is adopted.

On timeout, invoke `stub.Restart` in `OnClose()` to re-establish the connection to containerd.

In addition, run `Admit()` with a context that carries a configurable timeout; if it times out, pod creation is attempted again.
## Appendix

NRI : [](

ORM PR: #406, #430

## Implementation History

- [x] 01/16/2024 Proposed idea in community meeting
- [x] 03/12/2024 Compile a document following the proposal template
- [x] 03/19/2024 Present proposal at a community meeting
- [x] 04/20/2024 Complete the basic functionalities of NRI as covered in the detailed design
- [ ] 05/10/2024 Commence the first round of testing
- [ ] 05/20/2024 Open proposal PR for code