## Runtime Constraints example

This example demonstrates how Kubernetes enforces runtime constraints for compute resources.

### Prerequisites

For the purpose of this example, we will spin up a 1 node cluster using the Vagrant provider that
is not running any additional add-ons that consume node resources. Starting with an empty cluster
makes our demonstration of compute resources easier to follow.

```
$ export KUBERNETES_PROVIDER=vagrant
$ export NUM_NODES=1
$ export KUBE_ENABLE_CLUSTER_MONITORING=none
$ export KUBE_ENABLE_CLUSTER_DNS=false
$ export KUBE_ENABLE_CLUSTER_UI=false
$ cluster/kube-up.sh
```

We should now have a single node cluster running 0 pods.

```
$ cluster/kubectl.sh get nodes
NAME         LABELS                              STATUS    AGE
10.245.1.3   kubernetes.io/hostname=10.245.1.3   Ready     17m
$ cluster/kubectl.sh get pods --all-namespaces
```

When demonstrating runtime constraints, it's useful to show what happens when a node is under heavy load. For
this scenario, we have a single node with 2 cpus and 1GB of memory to demonstrate behavior under load, but the
results extend to multi-node scenarios.

### CPU requests

Each container in a pod may specify the amount of CPU it requests on a node. CPU requests are used at schedule
time, and represent the minimum amount of CPU that should be reserved for your container to run.

When executing your container, the Kubelet maps your container's CPU requests to CFS shares in the Linux kernel.
CFS CPU shares do not impose a ceiling on the actual amount of CPU the container can use. Instead, they define a
relative weight across all containers on the system for how much CPU time each container should get if there is
CPU contention.

Let's demonstrate this concept using a simple container that will consume as much CPU as possible.

```
$ cluster/kubectl.sh run cpuhog \
    --image=busybox \
    --requests=cpu=100m \
    -- md5sum /dev/urandom
```

This will create a single pod on your node that requests 1/10 of a CPU, but it has no limit on how much CPU it
may actually consume on the node.

To demonstrate this, if you SSH into the node, you will see the container consuming as much CPU as possible.

```
$ vagrant ssh node-1
$ sudo docker stats $(sudo docker ps -q)
CONTAINER      CPU %      MEM USAGE/LIMIT     MEM %     NET I/O
6b593b1a9658   0.00%      1.425 MB/1.042 GB   0.14%     1.038 kB/738 B
ae8ae4ffcfe4   150.06%    831.5 kB/1.042 GB   0.08%     0 B/0 B
```

As you can see, it is consuming as much CPU as it can, roughly 150% of a single CPU, because no limit constrains it.

If we scale our replication controller to 20 pods, we should see that each container is given an equal proportion
of CPU time.

```
$ cluster/kubectl.sh scale rc/cpuhog --replicas=20
```

Once all the pods are running, you will see on your node that each container is getting approximately an equal
proportion of CPU time.

```
$ sudo docker stats $(sudo docker ps -q)
CONTAINER      CPU %      MEM USAGE/LIMIT     MEM %     NET I/O
089e2d061dee   9.24%      786.4 kB/1.042 GB   0.08%     0 B/0 B
0be33d6e8ddb   10.48%     823.3 kB/1.042 GB   0.08%     0 B/0 B
0f4e3c4a93e0   10.43%     786.4 kB/1.042 GB   0.08%     0 B/0 B
```

Each container is getting approximately 10% of a CPU, in proportion to its request. With 20 pods each requesting
100m, the node's 2 cpus are now fully requested, so no additional pods that request CPU can be scheduled here.
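If you are curious how the request is wired into the kernel, you can inspect the container's cgroup on the node.
The kubelet converts a CPU request to CFS shares, which for a 100m request works out to roughly 102 shares. The
path below assumes Docker's default cgroupfs layout, so treat this as a sketch and adjust it for your host.

```
$ vagrant ssh node-1
# Pick one of the cpuhog containers from `sudo docker ps`, then look up its full id.
# The cgroup path assumes Docker's default cgroupfs driver.
$ FULL_ID=$(sudo docker inspect --format '{{.Id}}' <cpuhog-container-id>)
$ sudo cat /sys/fs/cgroup/cpu/docker/$FULL_ID/cpu.shares
102
```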
To summarize: CPU requests are used to schedule pods to the node in a manner that provides a weighted distribution
of CPU time when under contention. If the node is not being actively consumed by other containers, a container is
able to burst and use as much CPU time as is available. If there is contention for CPU, CPU time is shared based
on the requested values.

Let's delete all existing resources in preparation for the next scenario, and verify that all the pods are deleted
and terminated.

```
$ cluster/kubectl.sh delete rc --all
$ cluster/kubectl.sh get pods
NAME      READY     STATUS    RESTARTS   AGE
```

### CPU limits

So what do you do if you want to control the maximum amount of CPU that your container can burst to use, in order
to provide a consistent level of service independent of CPU contention on the node? You can specify an upper limit
on the total amount of CPU that a pod's container may consume.

To enforce this feature, your node must run a Docker version >= 1.7, and your operating system kernel must have
support for CFS quota enabled. Finally, the Kubelet must be started with the following flag:

```
kubelet --cpu-cfs-quota=true
```

To demonstrate, let's create the same pod again, but this time set an upper limit of 50% of a single CPU.

```
$ cluster/kubectl.sh run cpuhog \
    --image=busybox \
    --requests=cpu=100m \
    --limits=cpu=500m \
    -- md5sum /dev/urandom
```

Let's SSH into the node, and look at usage stats.

```
$ vagrant ssh node-1
$ sudo su
$ docker stats $(docker ps -q)
CONTAINER      CPU %      MEM USAGE/LIMIT     MEM %     NET I/O
2a196edf7de2   47.38%     835.6 kB/1.042 GB   0.08%     0 B/0 B
...
```

As you can see, the container is no longer allowed to consume all available CPU on the node. Instead, it is limited
to 50% of a CPU over every 100ms period, so the reported value will hover around 50% but may oscillate above and
below it.

Let's delete all existing resources in preparation for the next scenario, and verify that all the pods are deleted
and terminated.

```
$ cluster/kubectl.sh delete rc --all
$ cluster/kubectl.sh get pods
NAME      READY     STATUS    RESTARTS   AGE
```

### Memory requests

By default, a container is able to consume as much memory on the node as possible. In order to improve placement
of your pods in the cluster, it is recommended to specify the amount of memory your container will require to run.
The scheduler will then take available node memory capacity into account prior to binding your pod to a node.

Let's demonstrate this by creating a pod that runs a single container which requests 100Mi of memory. The container
repeatedly allocates and writes to 200MB of memory, sleeping 1 second between iterations.

```
$ cluster/kubectl.sh run memhog \
    --image=derekwaynecarr/memhog \
    --requests=memory=100Mi \
    --command \
    -- /bin/sh -c "while true; do memhog -r100 200m; sleep 1; done"
```

If you look at the output of docker stats on the node:

```
$ docker stats $(docker ps -q)
CONTAINER      CPU %      MEM USAGE/LIMIT     MEM %     NET I/O
2badf74ae782   0.00%      1.425 MB/1.042 GB   0.14%     816 B/348 B
a320182967fa   105.81%    214.2 MB/1.042 GB   20.56%    0 B/0 B
```

As you can see, the container is using approximately 200MB of memory, and is limited only by the roughly 1GB of
memory available on the node.
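You can confirm on the node that no memory ceiling was applied to this container. The cgroup path below assumes
Docker's default cgroupfs layout, and the exact "unlimited" value reported depends on your kernel, so this is only
a sketch.

```
# Look up the memhog container's full id, then read its memory cgroup limit.
# With no memory limit set on the container, the value shown is the kernel's
# effectively-unlimited default rather than anything Kubernetes chose.
$ FULL_ID=$(sudo docker inspect --format '{{.Id}}' <memhog-container-id>)
$ sudo cat /sys/fs/cgroup/memory/docker/$FULL_ID/memory.limit_in_bytes
9223372036854771712
```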
We scheduled the pod against a 100Mi request, but have burst our memory usage to a greater value.

We refer to this as the container having __Burstable__ quality of service for memory.

Let's delete all existing resources in preparation for the next scenario, and verify that all the pods are deleted
and terminated.

```
$ cluster/kubectl.sh delete rc --all
$ cluster/kubectl.sh get pods
NAME      READY     STATUS    RESTARTS   AGE
```

### Memory limits

If you specify a memory limit, you can constrain the amount of memory your container can use.

For example, let's limit our container to 200Mi of memory, and have it consume only 100MB.

```
$ cluster/kubectl.sh run memhog \
    --image=derekwaynecarr/memhog \
    --limits=memory=200Mi \
    --command -- /bin/sh -c "while true; do memhog -r100 100m; sleep 1; done"
```

If you look at the output of docker stats on the node:

```
$ docker stats $(docker ps -q)
CONTAINER      CPU %      MEM USAGE/LIMIT     MEM %     NET I/O
5a7c22ae1837   125.23%    109.4 MB/209.7 MB   52.14%    0 B/0 B
c1d7579c9291   0.00%      1.421 MB/1.042 GB   0.14%     1.038 kB/816 B
```

As you can see, the container is now limited to 200Mi of memory (reported as 209.7 MB), and is only consuming
109.4MB of it on the node.

Let's demonstrate what happens if you exceed your allowed memory usage by creating a replication controller whose
pod keeps being OOM killed because it attempts to allocate 300MB of memory, but is limited to 200Mi.

```
$ cluster/kubectl.sh run memhog-oom --image=derekwaynecarr/memhog --limits=memory=200Mi --command -- memhog -r100 300m
```

If we describe the created pod, we will see that it keeps restarting until it ultimately goes into a
CrashLoopBackOff. It is OOMKilled each time it attempts to exceed its memory limit, and is then restarted.

```
$ cluster/kubectl.sh get pods
NAME               READY     STATUS             RESTARTS   AGE
memhog-oom-gj9hw   0/1       CrashLoopBackOff   2          26s
$ cluster/kubectl.sh describe pods/memhog-oom-gj9hw | grep -C 3 "Terminated"
      memory:                   200Mi
    State:                      Waiting
      Reason:                   CrashLoopBackOff
    Last Termination State:     Terminated
      Reason:                   OOMKilled
      Exit Code:                137
      Started:                  Wed, 23 Sep 2015 15:23:58 -0400
```

Let's clean up before proceeding further.

```
$ cluster/kubectl.sh delete rc --all
```

### What if my node runs out of memory?

If you only schedule __Guaranteed__ memory containers, where the request is equal to the limit, then you are not in
major danger of causing an OOM event on your node. If any individual container consumes more than its specified
limit, it will be killed.

If you schedule __BestEffort__ memory containers, where neither the request nor the limit is specified, or
__Burstable__ memory containers, where the request is less than any specified limit, then it is possible for the
containers on a node to collectively use more memory than is actually available on the node.

If this occurs, the system will prioritize which containers are killed based on their quality of service. This is
done by using the OOM score adjustment (`oom_score_adj`) feature in the Linux kernel, which provides a heuristic to
rank a process between -1000 and 1000. Processes with lower values are preserved in favor of processes with higher
values. The system daemons (kubelet, kube-proxy, docker) all run with low `oom_score_adj` values.
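If you want to see these scores on the node, you can read the value assigned to a container's main process. The
container id is a placeholder and the lookup below is only a sketch; adapt it to your environment.

```
# Find the container's init process and read its OOM score adjustment.
# Lower (more negative) values make the kernel's OOM killer less likely to pick it.
$ PID=$(sudo docker inspect --format '{{.State.Pid}}' <container-id>)
$ cat /proc/$PID/oom_score_adj
```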
In simplest terms, __Guaranteed__ memory containers are given a lower value than __Burstable__ containers, which in
turn are given a lower value than __BestEffort__ containers. As a consequence, __BestEffort__ containers should be
killed before containers in the other tiers.

To demonstrate this, let's spin up a set of different replication controllers that will overcommit the node.

```
$ cluster/kubectl.sh run mem-guaranteed --image=derekwaynecarr/memhog --replicas=2 --requests=cpu=10m --limits=memory=600Mi --command -- memhog -r100000 500m
$ cluster/kubectl.sh run mem-burstable --image=derekwaynecarr/memhog --replicas=2 --requests=cpu=10m,memory=600Mi --command -- memhog -r100000 100m
$ cluster/kubectl.sh run mem-besteffort --replicas=10 --image=derekwaynecarr/memhog --requests=cpu=10m --command -- memhog -r10000 500m
```

This will induce a SystemOOM event.

```
$ cluster/kubectl.sh get events | grep OOM
43m       8m        178       10.245.1.3   Node      SystemOOM   {kubelet 10.245.1.3}   System OOM encountered
```

If you look at the pods:

```
$ cluster/kubectl.sh get pods
NAME                   READY     STATUS             RESTARTS   AGE
...
mem-besteffort-zpnpm   0/1       CrashLoopBackOff   4          3m
mem-burstable-n0yz1    1/1       Running            0          4m
mem-burstable-q3dts    1/1       Running            0          4m
mem-guaranteed-fqsw8   1/1       Running            0          4m
mem-guaranteed-rkqso   1/1       Running            0          4m
```

You see that our __BestEffort__ pod goes into a restart cycle, while the pods with higher levels of quality of
service continue to function.

As you can see, we rely on the kernel to react to system OOM events. Depending on how your host operating system
was configured, and which process the kernel ultimately decides to kill on your node, you may experience unstable
results. In addition, during an OOM event, while the kernel is cleaning up processes, the system may experience
significant periods of slowdown or appear unresponsive. As a result, while the system allows you to overcommit on
memory, we recommend that you do not induce a system OOM.
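When you are finished experimenting, you can clean up the remaining replication controllers and, if you no longer
need it, tear down the cluster. The kube-down.sh step assumes the same Vagrant setup used in the prerequisites.

```
$ cluster/kubectl.sh delete rc --all
$ cluster/kube-down.sh
```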