# Making use of Local PVs and Block Volume mode

Virtlet currently uses a custom flexvolume driver to handle raw block
devices and Ceph volumes. This makes VM pods less consistent with
"plain" Kubernetes pods. Another problem is that we may want to support
a persistent rootfs in the future. As there's now Local Persistent Volume
support (beta as of 1.10) and Block Volume support (alpha as of 1.10)
in Kubernetes, we may use these features in Virtlet to avoid the
flexvolume hacks and gain persistent rootfs support.

This document contains the results of the research and will be turned
into a more detailed proposal later if we decide to make use of
the block PVs.

The research is based on
[this Kubernetes blog post](https://kubernetes.io/blog/2018/04/13/local-persistent-volumes-beta/#enabling-smarter-scheduling-and-volume-binding)
and
[the raw block volume description](https://kubernetes.io/docs/concepts/storage/persistent-volumes/#raw-block-volume-support)
from the Kubernetes documentation.

First, I'll describe how the block PVs can be used in Virtlet, and
then I'll give a detailed description of how the experiments were
conducted.

## Using block PVs in Virtlet

As it turns out, non-local block PVs are no different from local
block PVs from the CRI point of view. They're configured using the
`volumeDevices` section of the container spec in the pod and the `volumes`
section of the pod spec, and are passed as the `devices` section of the
container config to the `CreateContainer()` CRI call:

```yaml
devices:
- container_path: /dev/testpvc
  host_path: /var/lib/kubelet/pods/65b0c985-6e81-11e8-be27-769e6e14e66a/volumeDevices/kubernetes.io~local-volume/local-block-pv
  permissions: mrw
```

Virtlet can use `host_path` to attach the device to the VM using a
`DomainDisk`, and `container_path` to mount it inside the VM using
cloud-init. The handling of local and non-local PVs doesn't differ
on the CRI level.
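As an illustration of the cloud-init part, here's a minimal sketch of a
user-data fragment Virtlet could generate so that the attached disk shows up
at the requested `container_path` inside the VM. This is hypothetical rather
than Virtlet's actual output, and the guest device name `/dev/vdb` is an
assumption that depends on how the disk gets attached:

```yaml
#cloud-config
# Hypothetical user-data fragment: the block volume is assumed to appear
# in the guest as /dev/vdb; the symlink exposes it at the path requested
# via container_path in the CRI container config.
runcmd:
- ln -s /dev/vdb /dev/testpvc
```

Whether a symlink, a udev rule or something else is used is an
implementation detail; the point is that `host_path` and `container_path`
from the CRI call carry all the information needed.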
Supporting non-local PVs will automatically give Virtlet support for
all the Kubernetes volume types that support the block mode, which
include Ceph, FibreChannel, and the persistent disks on AWS, GCP and
Azure, with the list probably growing larger in the future. It will
also give automatic support for CSI plugins that support the block
mode. The caveat is that the block mode is alpha as of Kubernetes
1.10 and it wasn't checked for earlier Kubernetes versions.

The use of block PVs will eliminate the need for custom flexvolumes at
some point (after block volumes become GA and we stop supporting
earlier Kubernetes versions). There's one caveat: with block PVs, the
Ceph RBDs will be mapped on the nodes by `kubelet` instead of being
consumed by qemu by means of `librbd`. It's not clear, though, whether
this will be good or bad from the performance standpoint. If we still
need custom volume types, flexvolumes may be replaced with
[CSI](https://kubernetes.io/blog/2018/04/10/container-storage-interface-beta/).

More advantages of using block PVs instead of custom flexvolumes
include having VM pods differ even less from "plain" pods, and the
possibility of making use of automatic PV provisioning in the future.

There's also a possibility of using the block PVs (local or non-local)
for the persistent rootfs. It's possible to copy the image onto the PV
upon the first use, and then have another pod reuse the PV after the
original one is destroyed. For local PVs, the scheduler will always
place the pod on the node where the local PV resides (this constitutes
the so-called "gravity"). There's a problem with this approach:
there's no reliable way for a CRI implementation to find the PV that
corresponds to a block device, so Virtlet will have to examine the
contents of the PV to see if it's being used for the first time. This also
means that Virtlet will have a hard time establishing the correspondence
between PVs and the images that are applied to them (e.g. imagine a PV
being used by a pod with a different image later). It's possible to
overcome these problems by either storing the metadata on the block
device itself somehow, or using CRDs and PV metadata to keep track of
"pet" VMs and their root filesystems. The use of local PVs will take
much of the burden off the corresponding controller, though.

## Experimenting with the Local Persistent Volumes

First, we need to define a storage class that specifies
`volumeBindingMode: WaitForFirstConsumer`, which is
[needed](https://kubernetes.io/blog/2018/04/13/local-persistent-volumes-beta/#enabling-smarter-scheduling-and-volume-binding)
for proper pod scheduling:
```yaml
---
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
```

Below is a definition of a Local Persistent Volume:
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-block-pv
spec:
  capacity:
    storage: 100Mi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  volumeMode: Block
  local:
    path: /dev/loop3
  claimRef:
    name: local-block-pvc
    namespace: default
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - kube-node-1
```

The important parts here are the following: `volumeMode: Block`
setting the block volume mode, the local volume source specification
that makes the PV use `/dev/loop3`
```yaml
  local:
    path: /dev/loop3
```
and a `nodeAffinity` spec that pins the local PV to `kube-node-1`:
```yaml
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - kube-node-1
```

The following PVC makes use of that PV (it's referenced explicitly via
`claimRef` above, but we could let Kubernetes associate the PV with the
PVC instead), also including `volumeMode: Block`:
```yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: local-block-pvc
spec:
  accessModes:
  - ReadWriteOnce
  volumeMode: Block
  storageClassName: local-storage
  resources:
    requests:
      storage: 100Mi
```

And, finally, a pod that makes use of the PVC:
```yaml
---
kind: Pod
apiVersion: v1
metadata:
  name: test-block-pod
spec:
  containers:
  - name: ubuntu
    image: ubuntu:16.04
    command:
    - /bin/sh
    - -c
    - sleep 30000
    volumeDevices:
    - devicePath: /dev/testpvc
      name: testpvc
  volumes:
  - name: testpvc
    persistentVolumeClaim:
      claimName: local-block-pvc
```

In the pod definition, we're using `volumeDevices` with `devicePath`
instead of `volumeMounts` with `mountPath`. This will make the node's
`/dev/loop3` appear as `/dev/testpvc` inside the pod's container:

```
$ kubectl exec test-block-pod -- ls -l /dev/testpvc
brw-rw---- 1 root disk 7, 3 Jun 12 20:44 /dev/testpvc
$ kubectl exec test-block-pod -- mkfs.ext4 /dev/testpvc
Discarding device blocks: done
Creating filesystem with 102400 1k blocks and 25688 inodes
Filesystem UUID: a02f7560-23a6-45c1-b10a-6e0a1b1eee72
Superblock backups stored on blocks:
	8193, 24577, 40961, 57345, 73729

Allocating group tables: done
Writing inode tables: done
Creating journal (4096 blocks): mke2fs 1.42.13 (17-May-2015)
done
Writing superblocks and filesystem accounting information: done
```

The important part is that the pod gets automatically scheduled on the
node where the local PV used by the PVC resides:
```
$ kubectl get pods test-block-pod -o wide
NAME             READY     STATUS    RESTARTS   AGE       IP           NODE
test-block-pod   1/1       Running   0          21m       10.244.2.9   kube-node-1
```

From the CRI point of view, the following container config is passed to
the `CreateContainer()` call, as seen in the CRI Proxy logs (the pod sandbox
config is omitted for brevity as it doesn't contain any mount- or
device-related information):
```yaml
I0612 20:44:29.869566    1038 proxy.go:126] ENTER: /runtime.v1alpha2.RuntimeService/CreateContainer():
config:
  annotations:
    io.kubernetes.container.hash: ff82c6d3
    io.kubernetes.container.restartCount: "0"
    io.kubernetes.container.terminationMessagePath: /dev/termination-log
    io.kubernetes.container.terminationMessagePolicy: File
    io.kubernetes.pod.terminationGracePeriod: "30"
  command:
  - /bin/sh
  - -c
  - sleep 30000
  devices:
  - container_path: /dev/testpvc
    host_path: /var/lib/kubelet/pods/65b0c985-6e81-11e8-be27-769e6e14e66a/volumeDevices/kubernetes.io~local-volume/local-block-pv
    permissions: mrw
  envs:
  - key: KUBERNETES_SERVICE_PORT_HTTPS
    value: "443"
  - key: KUBERNETES_PORT
    value: tcp://10.96.0.1:443
  - key: KUBERNETES_PORT_443_TCP
    value: tcp://10.96.0.1:443
  - key: KUBERNETES_PORT_443_TCP_PROTO
    value: tcp
  - key: KUBERNETES_PORT_443_TCP_PORT
    value: "443"
  - key: KUBERNETES_PORT_443_TCP_ADDR
    value: 10.96.0.1
  - key: KUBERNETES_SERVICE_HOST
    value: 10.96.0.1
  - key: KUBERNETES_SERVICE_PORT
    value: "443"
  image:
    image: sha256:5e8b97a2a0820b10338bd91674249a94679e4568fd1183ea46acff63b9883e9c
  labels:
    io.kubernetes.container.name: ubuntu
    io.kubernetes.pod.name: test-block-pod
    io.kubernetes.pod.namespace: default
    io.kubernetes.pod.uid: 65b0c985-6e81-11e8-be27-769e6e14e66a
  linux:
    resources:
      cpu_shares: 2
      oom_score_adj: 1000
    security_context:
      namespace_options:
        pid: 1
      run_as_user: {}
  log_path: ubuntu/0.log
  metadata:
    name: ubuntu
  mounts:
  - container_path: /var/run/secrets/kubernetes.io/serviceaccount
    host_path: /var/lib/kubelet/pods/65b0c985-6e81-11e8-be27-769e6e14e66a/volumes/kubernetes.io~secret/default-token-7zwlh
    readonly: true
  - container_path: /etc/hosts
    host_path: /var/lib/kubelet/pods/65b0c985-6e81-11e8-be27-769e6e14e66a/etc-hosts
  - container_path: /dev/termination-log
    host_path: /var/lib/kubelet/pods/65b0c985-6e81-11e8-be27-769e6e14e66a/containers/ubuntu/2be42601
```
The important part is this:
```yaml
devices:
- container_path: /dev/testpvc
  host_path: /var/lib/kubelet/pods/65b0c985-6e81-11e8-be27-769e6e14e66a/volumeDevices/kubernetes.io~local-volume/local-block-pv
  permissions: mrw
```

If we look at the node, we'll see that `host_path` is a symlink to
`/dev/loop3`, the device specified in the local block PV:
```
root@kube-node-1:/# ls -l /var/lib/kubelet/pods/65b0c985-6e81-11e8-be27-769e6e14e66a/volumeDevices/kubernetes.io~local-volume/local-block-pv
lrwxrwxrwx 1 root root 10 Jun 13 08:31 /var/lib/kubelet/pods/65b0c985-6e81-11e8-be27-769e6e14e66a/volumeDevices/kubernetes.io~local-volume/local-block-pv -> /dev/loop3
```

`container_path` denotes the path to the device inside the container.

The `permissions` field is described in the CRI spec as follows:
```
// Cgroups permissions of the device, candidates are one or more of
// * r - allows container to read from the specified device.
// * w - allows container to write to the specified device.
// * m - allows container to create device files that do not yet exist.
```

Also note that the device is not listed in `mounts`.

There's a
[tool](https://github.com/kubernetes-incubator/external-storage/tree/master/local-volume)
for automatic provisioning of Local Persistent Volumes that's part of the
[external-storage](https://github.com/kubernetes-incubator/external-storage)
project. Right now it may not be very useful for Virtlet, but it may
gain some important features later, like support for automatic
partitioning and fs formatting.

## Experimenting with non-local ("plain") Persistent Volumes

Let's check "plain" PVs now. We'll be using Ceph block volumes.

Below are some tricks that make kubeadm-dind-cluster compatible with
Ceph. Some of them may be useful for running
[Rook](https://github.com/rook/rook) on k-d-c, too.

For Ceph RBDs to work with Kubernetes Ceph PVs (not just Virtlet's
flexvolume-based ones), I had to make `rbd` work on the DIND nodes, so
the following change had to be made to kubeadm-dind-cluster's main
script (observed in [Rook's](https://github.com/rook/rook) DIND
setup):
```diff
diff --git a/dind-cluster.sh b/dind-cluster.sh
index e9118e2..24a0a78 100755
--- a/dind-cluster.sh
+++ b/dind-cluster.sh
@@ -645,6 +645,9 @@ function dind::run {
            --hostname "${container_name}" \
            -l mirantis.kubeadm_dind_cluster \
            -v ${volume_name}:/dind \
+           -v /dev:/dev \
+           -v /sys/bus:/sys/bus \
+           -v /var/run/docker.sock:/opt/outer-docker.sock \
            ${opts[@]+"${opts[@]}"} \
            "${DIND_IMAGE}" \
            ${args[@]+"${args[@]}"}
```

The following file had to be added as a fake `rbd` command to each DIND node
(borrowed from the [Rook scripts](https://github.com/rook/rook/blob/cd2b69915958e7453b3fc5031f59179058163dcd/tests/scripts/dind-cluster-rbd)):
```bash
#!/bin/bash
DOCKER_HOST=unix:///opt/outer-docker.sock /usr/bin/docker run --rm -v /sys:/sys --net=host --privileged=true ceph/base rbd "$@"
```
It basically executes the `rbd` command from the `ceph/base` image using the
host docker daemon in the host network namespace.
So let's bring up the cluster:
```bash
./dind-cluster.sh up
```

Disable rate limiting so journald doesn't choke on the CRI Proxy logs on node 1:
```bash
docker exec kube-node-1 /bin/bash -c 'echo "RateLimitInterval=0" >>/etc/systemd/journald.conf && systemctl restart systemd-journald'
```

Enable the `BlockVolume` feature gate for kubelet on node 1
(`MountPropagation` is enabled by default in 1.10, so let's just
replace it):
```bash
docker exec kube-node-1 /bin/bash -c 'sed -i "s/MountPropagation/BlockVolume/" /lib/systemd/system/kubelet.service && systemctl daemon-reload && systemctl restart kubelet'
```

Install CRI Proxy so we can grab the logs:
```bash
CRIPROXY_DEB_URL="${CRIPROXY_DEB_URL:-https://github.com/Mirantis/criproxy/releases/download/v0.14.0/criproxy-nodeps_0.14.0_amd64.deb}"
docker exec kube-node-1 /bin/bash -c "curl -sSL '${CRIPROXY_DEB_URL}' >/criproxy.deb && dpkg -i /criproxy.deb && rm /criproxy.deb"
```

Taint node 2 so we get everything scheduled on node 1:
```bash
kubectl taint nodes kube-node-2 dedicated=foobar:NoSchedule
```

Now we need to add the `rbd` command to the 'hypokube' image that's used
by the control plane (we need it for `kube-controller-manager`). The
proper way would be to use the node's `rbd` wrapper by mounting the host
docker socket into the container, but as the controller manager
doesn't need the `rbd map` command, which requires host access, we can
just install the `rbd` package here. We only need to make sure it's new
enough to support commands like `rbd status` that are invoked by the
controller manager:

```bash
docker exec kube-master /bin/bash -c 'docker rm -f tmp; docker run --name tmp mirantis/hypokube:final /bin/bash -c "echo deb http://ftp.debian.org/debian jessie-backports main >>/etc/apt/sources.list && apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y ceph-common=10.2.5-6~bpo8+1 libradosstriper1 ntp librados2=10.2.5-6~bpo8+1 librbd1=10.2.5-6~bpo8+1 python-cephfs=10.2.5-6~bpo8+1 libcephfs1=10.2.5-6~bpo8+1" && docker commit tmp mirantis/hypokube:final && docker rm -f tmp'
```

At this point, we need to edit the following files on the `kube-master` node,
adding `--feature-gates=BlockVolume=true` to the end of the `command:` list in
each pod's only container, as illustrated below:

* `/etc/kubernetes/manifests/kube-apiserver.yaml`
* `/etc/kubernetes/manifests/kube-scheduler.yaml`
* `/etc/kubernetes/manifests/kube-controller-manager.yaml`

Likely, updating just the controller manager may suffice, but I didn't
check. This will cause the pods to restart and use the updated
`mirantis/hypokube:final` image.
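For reference, the edit looks roughly like this (a sketch based on the
`kube-controller-manager` manifest; the other flags are abridged and will
differ in a real kubeadm-generated file):

```yaml
# /etc/kubernetes/manifests/kube-controller-manager.yaml (excerpt, flags abridged)
spec:
  containers:
  - name: kube-controller-manager
    command:
    - kube-controller-manager
    - --kubeconfig=/etc/kubernetes/controller-manager.conf
    - --feature-gates=BlockVolume=true   # the added flag
```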
Now let's start the Ceph demo container:
```bash
MON_IP=$(docker exec kube-master route | grep default | awk '{print $2}')
CEPH_PUBLIC_NETWORK=${MON_IP}/16
docker run -d --net=host -e MON_IP=${MON_IP} \
       -e CEPH_PUBLIC_NETWORK=${CEPH_PUBLIC_NETWORK} \
       -e CEPH_DEMO_UID=foo \
       -e CEPH_DEMO_ACCESS_KEY=foo \
       -e CEPH_DEMO_SECRET_KEY=foo \
       -e CEPH_DEMO_BUCKET=foo \
       -e DEMO_DAEMONS="osd mds" \
       --name ceph_cluster docker.io/ceph/daemon demo
```

Create a pool there:
```bash
docker exec ceph_cluster ceph osd pool create kube 8 8
```

Create an image for testing (it's important to use `rbd create` with the
`layering` feature here so as not to get a feature mismatch error
later when creating a pod):
```bash
docker exec ceph_cluster rbd create tstimg \
       --size 11M --pool kube --image-feature layering
```

Set up a Kubernetes secret for use with Ceph:
```bash
admin_secret="$(docker exec ceph_cluster ceph auth get-key client.admin)"
kubectl create secret generic ceph-admin \
        --type="kubernetes.io/rbd" \
        --from-literal=key="${admin_secret}" \
        --namespace=kube-system
```

Copy the `rbd` replacement script presented earlier to each node:
```bash
for n in kube-{master,node-{1,2}}; do
  docker cp dind-cluster-rbd ${n}:/usr/bin/rbd
done
```

Now we can create a test PV, PVC and a pod.

Let's define a storage class:
```yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: ceph-testnew
provisioner: kubernetes.io/rbd
parameters:
  monitors: 10.192.0.1:6789
  adminId: admin
  adminSecretName: ceph-admin
  adminSecretNamespace: kube-system
  pool: kube
  userId: admin
  userSecretName: ceph-admin
  userSecretNamespace: kube-system
  fsType: ext4
  imageFormat: "1"
  # the following was disabled while testing non-block PVs
  imageFeatures: "layering"
```
Actually, automatic provisioning didn't work for me because it
was setting `volumeMode: Filesystem` in the PVs, but this was probably
a mistake on my part; judging from the Kubernetes source, it should be
fixable otherwise.
Let's define a block PV:
```yaml
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: test-block-pv
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 10Mi
  claimRef:
    name: ceph-block-pvc
    namespace: default
  persistentVolumeReclaimPolicy: Delete
  rbd:
    image: tstimg
    keyring: /etc/ceph/keyring
    monitors:
    - 10.192.0.1:6789
    pool: kube
    secretRef:
      name: ceph-admin
      namespace: kube-system
    user: admin
  storageClassName: ceph-testnew
  volumeMode: Block
```

The difference from the "usual" RBD PV is `volumeMode: Block` here,
and the same goes for the PVC:
```yaml
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: ceph-block-pvc
spec:
  accessModes:
  - ReadWriteOnce
  volumeMode: Block
  storageClassName: ceph-testnew
  resources:
    requests:
      storage: 10Mi
```

Now, the pod itself, with `volumeDevices` instead of `volumeMounts`:
```yaml
kind: Pod
apiVersion: v1
metadata:
  name: ceph-block-pod
spec:
  containers:
  - name: ubuntu
    image: ubuntu:16.04
    command:
    - /bin/sh
    - -c
    - sleep 30000
    volumeDevices:
    - name: data
      devicePath: /dev/cephdev
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: ceph-block-pvc
```

Let's do `kubectl apply -f ceph-test.yaml` (`ceph-test.yaml`
containing all of the yaml documents above), and try it out:

```
$ kubectl exec ceph-block-pod -- ls -l /dev/cephdev
brw-rw---- 1 root disk 252, 0 Jun 12 20:19 /dev/cephdev
$ kubectl exec ceph-block-pod -- mkfs.ext4 /dev/cephdev
mke2fs 1.42.13 (17-May-2015)
Discarding device blocks: done
Creating filesystem with 11264 1k blocks and 2816 inodes
Filesystem UUID: 81ce32e8-bf37-4bc8-88bf-674bf6f79d14
Superblock backups stored on blocks:
	8193

Allocating group tables: done
Writing inode tables: done
Creating journal (1024 blocks): done
Writing superblocks and filesystem accounting information: done
```

Let's capture CRI Proxy logs:
```
docker exec kube-node-1 journalctl -xe -n 20000 -u criproxy|egrep --line-buffered -v '/run/virtlet.sock|\]: \{\}|/var/run/dockershim.sock|ImageFsInfo' >/tmp/log.txt
```

The following is the important part of the log which is slightly
cleaned up:
```
I0612 20:19:38.681852    1038 proxy.go:126] ENTER: /runtime.v1alpha2.RuntimeService/CreateContainer():
config:
  annotations:
    io.kubernetes.container.hash: d0c4a380
    io.kubernetes.container.restartCount: "0"
    io.kubernetes.container.terminationMessagePath: /dev/termination-log
    io.kubernetes.container.terminationMessagePolicy: File
    io.kubernetes.pod.terminationGracePeriod: "30"
  command:
  - /bin/sh
  - -c
  - sleep 30000
  devices:
  - container_path: /dev/cephdev
    host_path: /var/lib/kubelet/pods/ebb11dcb-6e7d-11e8-be27-769e6e14e66a/volumeDevices/kubernetes.io~rbd/test-block-pv
    permissions: mrw
  envs:
  - key: KUBERNETES_PORT
    value: tcp://10.96.0.1:443
  - key: KUBERNETES_PORT_443_TCP
    value: tcp://10.96.0.1:443
  - key: KUBERNETES_PORT_443_TCP_PROTO
    value: tcp
  - key: KUBERNETES_PORT_443_TCP_PORT
    value: "443"
  - key: KUBERNETES_PORT_443_TCP_ADDR
    value: 10.96.0.1
  - key: KUBERNETES_SERVICE_HOST
    value: 10.96.0.1
  - key: KUBERNETES_SERVICE_PORT
    value: "443"
  - key: KUBERNETES_SERVICE_PORT_HTTPS
    value: "443"
  image:
    image: sha256:5e8b97a2a0820b10338bd91674249a94679e4568fd1183ea46acff63b9883e9c
  labels:
    io.kubernetes.container.name: ubuntu
    io.kubernetes.pod.name: ceph-block-pod
    io.kubernetes.pod.namespace: default
    io.kubernetes.pod.uid: ebb11dcb-6e7d-11e8-be27-769e6e14e66a
  linux:
    resources:
      cpu_shares: 2
      oom_score_adj: 1000
    security_context:
      namespace_options:
        pid: 1
      run_as_user: {}
  log_path: ubuntu/0.log
  metadata:
    name: ubuntu
  mounts:
  - container_path: /var/run/secrets/kubernetes.io/serviceaccount
    host_path: /var/lib/kubelet/pods/ebb11dcb-6e7d-11e8-be27-769e6e14e66a/volumes/kubernetes.io~secret/default-token-7zwlh
    readonly: true
  - container_path: /etc/hosts
    host_path: /var/lib/kubelet/pods/ebb11dcb-6e7d-11e8-be27-769e6e14e66a/etc-hosts
  - container_path: /dev/termination-log
    host_path: /var/lib/kubelet/pods/ebb11dcb-6e7d-11e8-be27-769e6e14e66a/containers/ubuntu/577593a5
```

Again, we have this here:
```yaml
devices:
- container_path: /dev/cephdev
  host_path: /var/lib/kubelet/pods/ebb11dcb-6e7d-11e8-be27-769e6e14e66a/volumeDevices/kubernetes.io~rbd/test-block-pv
  permissions: mrw
```

The `host_path` points to a mapped RBD:
```
root@kube-node-1:/# ls -l /var/lib/kubelet/pods/ebb11dcb-6e7d-11e8-be27-769e6e14e66a/volumeDevices/kubernetes.io~rbd/test-block-pv
lrwxrwxrwx 1 root root 9 Jun 12 20:19 /var/lib/kubelet/pods/ebb11dcb-6e7d-11e8-be27-769e6e14e66a/volumeDevices/kubernetes.io~rbd/test-block-pv -> /dev/rbd0
```

An unpleasant part about RBDs+DIND is that the machine may hang on
some commands or refuse to reboot if the RBDs aren't unmapped. If the
k-d-c cluster is already torn down (but the `ceph_cluster` container is
still alive), the following commands can be used to list and unmap RBDs
on the Linux host:

```
# rbd showmapped
id pool image  snap device
0  kube tstimg -    /dev/rbd0
# rbd unmap -o force kube/tstimg
```