# Sharing devices in volcano

## Introduction

We implement a common interface for shareable devices (GPU, NPU, FPGA, ...) called Devices, and use it to reimplement the current gpu-share mechanism. The goal is to make device sharing easy to implement and better organised. If you wish to give vc-scheduler the ability to share another device, all you need to do is implement the methods in Devices and place your logic under pkg/scheduler/api/devices.

## Background

We intend to give volcano the ability to share third-party resources like GPU, NPU, etc. in the near future. At first, I tried to implement this logic based on predicate.gpushare, but I soon realised that the logic is scattered across device_info.go, node_info.go, pod_info.go, and the whole predicate folder. If I followed the implementation of predicate.gpushare, I would have no choice but to hack deeply into the vc-scheduler api. Sooner or later the vc-scheduler api would be crowded with various device-sharing logic, which is probably not what we want.

## Implementation

### Interface Devices design

The design of Devices is shown below:

```
type Devices interface {
	//following two functions used in node_info
	//AddResource is to add the corresponding device resource of this 'pod' into current scheduler cache
	AddResource(pod *v1.Pod)
	//SubResource is to subtract the corresponding device resource of this 'pod' from current scheduler cache
	SubResource(pod *v1.Pod)

	//following four functions used in predicate
	//HasDeviceRequest checks if the 'pod' requests this device
	HasDeviceRequest(pod *v1.Pod) bool
	//FilterNode checks if the 'pod' fits in current node
	// The first return value represents the filtering result, and the value range is "0, 1, 2, 3"
	// 0: Success
	// Success means that plugin ran correctly and found pod schedulable.

	// 1: Error
	// Error is used for internal plugin errors, unexpected input, etc.

	// 2: Unschedulable
	// Unschedulable is used when a plugin finds a pod unschedulable. The scheduler might attempt to
	// preempt other pods to get this pod scheduled. Use UnschedulableAndUnresolvable to make the
	// scheduler skip preemption.
	// The accompanying status message should explain why the pod is unschedulable.

	// 3: UnschedulableAndUnresolvable
	// UnschedulableAndUnresolvable is used when a plugin finds a pod unschedulable and
	// preemption would not change anything. Plugins should return Unschedulable if it is possible
	// that the pod can get scheduled with preemption.
	// The accompanying status message should explain why the pod is unschedulable.
	FilterNode(pod *v1.Pod) (int, string, error)

	//Allocate action in predicate
	Allocate(kubeClient kubernetes.Interface, pod *v1.Pod) error
	//Release action in predicate
	Release(kubeClient kubernetes.Interface, pod *v1.Pod) error

	//used for debug and monitor
	GetStatus() string
}
```

The first two methods are used by node_info to update the cluster status. The following four methods are used in predicate, where allocation and deallocation actually take place. The last one is a monitoring method for debugging.

### Create a separate package for gpushare-related methods, and use Devices methods to reimplement it

There are two steps. First, we create a new package in "pkg/scheduler/api/devices/nvidia/gpushare" and implement the Devices methods in it. Then we separate the gpushare-related logic from "scheduler.api" and the "predicate plugin", and move it into package "pkg/scheduler/api/devices/nvidia/gpushare". The package contains the following files: device.go (which implements the SharedDevicePool interface methods), share.go (which contains private methods for device.go), and type.go (which contains const values and definitions).
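
To make the Devices contract described above more concrete, here is a minimal sketch of a type that satisfies a trimmed version of the interface. It is not the gpushare implementation: `Pod`, `toyDevices`, `newToyDevices`, and the memory-based accounting are hypothetical stand-ins chosen so that the example compiles without Kubernetes dependencies, and `Allocate` and `Release` are omitted because they need a Kubernetes client. Only the four status codes follow the values listed in the interface comments.

```go
package main

import (
	"errors"
	"fmt"
)

// Pod is a hypothetical stand-in for *v1.Pod so that the sketch compiles
// without Kubernetes dependencies.
type Pod struct {
	Name   string
	Memory int // requested device memory; the unit is an assumption
}

// FilterNode status codes, numbered as in the interface comments.
const (
	Success = iota
	Error
	Unschedulable
	UnschedulableAndUnresolvable
)

// Devices is a trimmed copy of the interface. Allocate and Release are
// left out because they need a Kubernetes client.
type Devices interface {
	AddResource(pod *Pod)
	SubResource(pod *Pod)
	HasDeviceRequest(pod *Pod) bool
	FilterNode(pod *Pod) (int, string, error)
	GetStatus() string
}

// toyDevices models one pool of device memory on a single node.
type toyDevices struct {
	total int
	used  map[string]int // pod name -> memory held by that pod
}

func newToyDevices(total int) *toyDevices {
	return &toyDevices{total: total, used: map[string]int{}}
}

// free returns the memory not held by any pod.
func (d *toyDevices) free() int {
	left := d.total
	for _, m := range d.used {
		left -= m
	}
	return left
}

func (d *toyDevices) HasDeviceRequest(pod *Pod) bool { return pod != nil && pod.Memory > 0 }

func (d *toyDevices) AddResource(pod *Pod) {
	if d.HasDeviceRequest(pod) {
		d.used[pod.Name] = pod.Memory
	}
}

func (d *toyDevices) SubResource(pod *Pod) {
	if pod != nil {
		delete(d.used, pod.Name)
	}
}

func (d *toyDevices) FilterNode(pod *Pod) (int, string, error) {
	switch {
	case pod == nil:
		return Error, "nil pod", errors.New("nil pod")
	case pod.Memory > d.total:
		// Evicting other pods cannot help: the node is simply too small.
		return UnschedulableAndUnresolvable, "request exceeds node capacity", nil
	case pod.Memory > d.free():
		// Preempting a pod that holds memory could make room.
		return Unschedulable, "not enough free device memory", nil
	}
	return Success, "", nil
}

func (d *toyDevices) GetStatus() string {
	return fmt.Sprintf("used %d of %d", d.total-d.free(), d.total)
}

func main() {
	var dev Devices = newToyDevices(16)
	dev.AddResource(&Pod{Name: "a", Memory: 12})
	code, msg, _ := dev.FilterNode(&Pod{Name: "b", Memory: 8})
	fmt.Println(code, msg, "|", dev.GetStatus())
}
```

The sketch separates the two unschedulable cases the way the interface comments describe them: a request that could fit after preemption returns Unschedulable, while a request larger than the whole node returns UnschedulableAndUnresolvable.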

Details of the method mapping are shown in the table below:

| origin file | corresponding file(s) in new package |
| ------------- | ------------- |
| pkg/scheduler/api/node_info.go | pkg/scheduler/api/devices/nvidia/gpushare/device_info.go, pkg/scheduler/api/devices/nvidia/gpushare/share.go |
| pkg/scheduler/api/device_info.go | pkg/scheduler/api/devices/nvidia/gpushare/device_info.go, pkg/scheduler/api/devices/nvidia/gpushare/share.go |
| pkg/scheduler/api/pod_info.go | pkg/scheduler/api/devices/nvidia/gpushare/share.go |
| pkg/scheduler/plugins/predicates/predicates.go | pkg/scheduler/api/devices/nvidia/gpushare/device_info.go |
| pkg/scheduler/plugins/predicates/gpu.go | pkg/scheduler/api/devices/nvidia/gpushare/share.go |

## How to add a new device-share policy

### 1. Define your device in /pkg/scheduler/api/shared_device_pool.go

Name your policy and put it in shared_device_pool.go as follows:

```
const (
	GPUSharingDevice = "GpuShare"
	Your_new_sharing_policy = "xxxxx"
)
```

### 2. Create a new package in /pkg/scheduler/api/devices/"your device name"/"your policy name"

For example, if you want to implement an NPU share policy, you are recommended to create a package in /pkg/scheduler/api/devices/ascend/npushare

### 3. Implement methods of interface shared_device_pool, and put them in your new package

Note that you can't refer to any struct or method in scheduler.api, to avoid cyclic imports. If there is anything in scheduler.api that you *must* use, you should modify the SharedDevicePool interface to pass it in.

The methods defined in the SharedDevicePool interface and their descriptions are shown in the table below:

| interface | invoker file | information |
| ------------- | ------------ | ------------- |
| AddResource(pod *v1.Pod) | pkg/scheduler/api/node_info.go | Add the 'pod' and its resources into the scheduler cache |
| SubResource(pod *v1.Pod) | pkg/scheduler/api/node_info.go | Delete the 'pod' and subtract its resources from the scheduler cache |
| HasDeviceRequest(pod *v1.Pod) bool | pkg/scheduler/plugins/predicates/predicates.go | Check whether this 'pod' requests a portion of this device |
| FilterNode(pod *v1.Pod) (int, string, error) | pkg/scheduler/plugins/predicates/predicates.go | Check whether the portion of the device this pod requests can fit in the current node |
| Allocate(kubeClient kubernetes.Interface, pod *v1.Pod) error | pkg/scheduler/plugins/predicates/predicates.go | Allocate the portion of this device from the current node to this pod |
| Release(kubeClient kubernetes.Interface, pod *v1.Pod) error | pkg/scheduler/plugins/predicates/predicates.go | Deallocate the portion of this device from this pod |
| GetStatus() string | none | Used for debug and monitor |

### 4. Add your initialization code in /pkg/scheduler/api/node_info.go

This is the *only* place where you hack into scheduler.api: you have to register your policy during initialization of the node struct.

```
// setNodeOthersResource initializes sharable devices
func (ni *NodeInfo) setNodeOthersResource(node *v1.Node) {
	ni.Others[GPUSharingDevice] = gpushare.NewGPUDevices(ni.Name, node)
	//ni.Others["your device sharing policy name"] = your device sharing package initialization method
}
```

### 5. Check if your policy is enabled in /pkg/scheduler/plugins/predicates/predicates.go

This is the *only* place where you hack into predicates.go: here the scheduler checks whether your policy is enabled in the scheduler configuration.

predicates.go:

```
...
// Checks whether predicate.GPUSharingEnable is provided or not, if given, modifies the value in predicateEnable struct.
args.GetBool(&gpushare.GpuSharingEnable, GPUSharingPredicate)
args.GetBool(&gpushare.GpuNumberEnable, GPUNumberPredicate)
args.GetBool(&gpushare.NodeLockEnable, NodeLockEnable)
args.GetBool("your policy enable variable","your policy enable parameter")
...
```
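
Steps 4 and 5 can be sketched together in a self-contained form. The code below is only an illustration under stated assumptions: `Arguments` and its `GetBool` are simplified stand-ins for the scheduler's argument map rather than its real implementation, `NodeInfo` is reduced to the two fields the example needs, and every NPU-related name (`NPUSharingDevice`, `NPUSharingPredicate`, `NpuSharingEnable`, `npuDevices`, `newNPUDevices`) is hypothetical.

```go
package main

import "fmt"

// Arguments is a simplified stand-in for the plugin argument map that is
// read from the scheduler configuration.
type Arguments map[string]interface{}

// GetBool overwrites *ptr only when key is present and holds a bool,
// so a missing or malformed argument leaves the default untouched.
func (a Arguments) GetBool(ptr *bool, key string) {
	if ptr == nil {
		return
	}
	if v, ok := a[key].(bool); ok {
		*ptr = v
	}
}

// Hypothetical names for a new NPU sharing policy.
const (
	NPUSharingDevice    = "NpuShare"
	NPUSharingPredicate = "predicate.NPUSharingEnable"
)

// NpuSharingEnable is the package-level switch that step 5 would set.
var NpuSharingEnable bool

// NodeInfo is reduced to the fields this sketch needs.
type NodeInfo struct {
	Name   string
	Others map[string]interface{}
}

// npuDevices is a placeholder for the type that implements the device
// interface in the new package.
type npuDevices struct{ nodeName string }

func newNPUDevices(nodeName string) *npuDevices {
	return &npuDevices{nodeName: nodeName}
}

// setNodeOthersResource registers the policy on the node (step 4).
func (ni *NodeInfo) setNodeOthersResource() {
	ni.Others[NPUSharingDevice] = newNPUDevices(ni.Name)
}

func main() {
	// Step 5: read the enable flag from the plugin arguments.
	args := Arguments{NPUSharingPredicate: true}
	args.GetBool(&NpuSharingEnable, NPUSharingPredicate)

	// Step 4: register the device pool during node initialization.
	ni := &NodeInfo{Name: "node-1", Others: map[string]interface{}{}}
	ni.setNodeOthersResource()

	_, registered := ni.Others[NPUSharingDevice]
	fmt.Println(NpuSharingEnable, registered)
}
```

Keeping the registration key and the enable flag in the policy's own package means the only edits outside that package are the two single lines shown in steps 4 and 5.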