# The problem of fast persistent rootfs setup

Right now, the persistent rootfs for VMs is initialized by means of
`qemu-img convert`, which converts a QCOW2 image to a raw one and
writes the result over a block device mapped on the host. This is
slow. It's possible to overcome the problem by utilizing the new
persistent volume snapshotting feature in Kubernetes 1.13. It's also
possible to implement a solution for VMs which use a local libvirt
volume as their root filesystem. In both cases, we'll need to add a
CRD and a controller for managing persistent root filesystems.

## Defining VM identities

To support faster rootfs initialization, we need to introduce the
concept of a VM Identity. VM Identities are defined like this:

```yaml
---
apiVersion: "virtlet.k8s/v1"
kind: VirtletVMIdentitySet
metadata:
  name: test-identity-set
spec:
  # specify the image to use. A sha256 digest is required here
  # to avoid possible inconsistencies.
  image: virtlet.cloud/foobar-image@sha256:a8dd75ecffd4cdd96072d60c2237b448e0c8b2bc94d57f10fdbc8c481d9005b8
  # specify the SMBIOS UUID (optional). If the UUID is specified,
  # only a single VM at a time can utilize this IdentitySet.
  firmwareUUID: 1b4d298f-6151-40b0-a1d4-269fc41d48f0
  # specify the type of VM identity:
  # 'Local' for libvirt-backed identities
  # 'PersistentVolume' for PV-backed identities
  type: Local
  # specify the size to use; defaults to the virtual size
  # of the image
  size: 1Gi
  # for non-local PVs, storageClassName specifies the storage
  # class to use
  # storageClassName: csi-rbd
```

and they can be associated with VM pods like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: persistent-vm
  annotations:
    kubernetes.io/target-runtime: virtlet.cloud
    VirtletVMIdentitySet: test-identity-set
    # uncomment to use instead of the pod name
    # VirtletVMIdentityName: override-identity-name
spec:
  ...
  containers:
  - name: vm
    # must match the image specified in test-identity-set
    image: virtlet.cloud/foobar-image@sha256:a8dd75ecffd4cdd96072d60c2237b448e0c8b2bc94d57f10fdbc8c481d9005b8
    ...
```

The VM Identity Sets, and the VM Identity objects generated from them,
are managed by the VM Identity Controller, which is created as a
Deployment from `virtletctl`-generated yaml.

A VM Identity object looks like this:
```yaml
apiVersion: "virtlet.k8s/v1"
kind: VirtletVMIdentity
metadata:
  name: persistent-vm
spec:
  # spec fields are copied from the VMIdentitySet
  image: virtlet.cloud/foobar-image@sha256:a8dd75ecffd4cdd96072d60c2237b448e0c8b2bc94d57f10fdbc8c481d9005b8
  firmwareUUID: 1b4d298f-6151-40b0-a1d4-269fc41d48f0
  type: Local
  size: 1Gi
status:
  creationTime: 2018-10-19T06:07:40Z
  # the 'ready' field is true when this identity is
  # ready for use in a VM. Virtlet will delay
  # the actual startup of the pod till the identity
  # is ready.
  ready: true
```

When a pod is created, a VM Identity object is instantiated if it
doesn't exist yet. The VM Identity objects are custom resources that
have their own name (as all k8s objects do) and a pointer to a
`VMIdentitySet`. The name of the VM Identity object defaults to that
of the pod unless the `VirtletVMIdentityName` annotation is specified.
This extra level of indirection is needed to enable VM identities for
`StatefulSet`s.

The VM Identity controller takes care of managing the PVCs
corresponding to the identities, as well as asking the local Virtlet
processes to make libvirt volumes.
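To make the controller side a bit more concrete, here is a rough
sketch of the Go types that could back these CRDs. This is only a
sketch derived from the yaml above, following the usual
`k8s.io/apimachinery` conventions; the exact field set (e.g. using
`resource.Quantity` instead of a plain string for the size) would be
settled during implementation:

```go
package v1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// VMIdentityType selects the backend used for the persistent rootfs.
type VMIdentityType string

const (
	// VMIdentityTypeLocal denotes a libvirt-backed identity.
	VMIdentityTypeLocal VMIdentityType = "Local"
	// VMIdentityTypePersistentVolume denotes a PV-backed identity.
	VMIdentityTypePersistentVolume VMIdentityType = "PersistentVolume"
)

// VirtletVMIdentitySetSpec mirrors the spec shown in the yaml above.
type VirtletVMIdentitySetSpec struct {
	// Image is the image reference; a sha256 digest is required.
	Image string `json:"image"`
	// FirmwareUUID is the optional SMBIOS UUID. If set, only a single
	// VM at a time can use this identity set.
	FirmwareUUID string `json:"firmwareUUID,omitempty"`
	// Type is either Local or PersistentVolume.
	Type VMIdentityType `json:"type"`
	// Size defaults to the virtual size of the image.
	Size string `json:"size,omitempty"`
	// StorageClassName is used for non-local PVs.
	StorageClassName string `json:"storageClassName,omitempty"`
}

// VirtletVMIdentitySet is the user-facing object referenced by VM pods.
type VirtletVMIdentitySet struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              VirtletVMIdentitySetSpec `json:"spec"`
}

// VirtletVMIdentityStatus is maintained by the VM Identity controller.
type VirtletVMIdentityStatus struct {
	CreationTime metav1.Time `json:"creationTime,omitempty"`
	// Ready is true when the identity can be used by a VM; Virtlet
	// delays the actual startup of the pod until then.
	Ready bool `json:"ready"`
}

// VirtletVMIdentity is generated per pod from a VirtletVMIdentitySet;
// its spec fields are copied from the set.
type VirtletVMIdentity struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              VirtletVMIdentitySetSpec `json:"spec"`
	Status            VirtletVMIdentityStatus  `json:"status,omitempty"`
}
```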
## Using libvirt-backed identities

The libvirt-backed VM identity objects correspond to libvirt volumes
that have QCOW2 images as backing files. When such an identity object
is deleted, the volume is deleted, too. When a VM pod is created on a
host other than the one currently hosting the corresponding libvirt
volume, the volume is moved to the target host. This enables offline
migrations of VMs with a libvirt-backed persistent root filesystem.

The basic idea of orchestrating the lifecycle of the libvirt volumes
corresponding to VM identities is to have the VM Identity controller
contact the Virtlet process on the node and make it perform the
actions on the libvirt volume objects.

## Using persistent volume snapshots

Kubernetes v1.13 adds support for
[persistent volume snapshots](https://kubernetes.io/blog/2018/10/09/introducing-volume-snapshot-alpha-for-kubernetes/)
for CSI, which is supported by the Ceph CSI driver, among others. We
can keep a pool of snapshots that correspond to different images.
After that, we can create new PVs from these snapshots, resizing them
if necessary (we should recheck that such resizes are possible once
snapshotting stabilizes for Ceph CSI).

There are some issues with the current Ceph CSI and the snapshotter, see
[Appendix A](#appendix-a-experimenting-with-ceph-csi-and-external-snapshotter)
for details.

We'll be dealing with Ceph CSI in the description below.

For each image with a hash mentioned in a VM identity, Virtlet will
create a PVC, wait for it to be provisioned, write the raw image
contents to it and then make a snapshot. Once an image stops being
used, the corresponding PVC, PV and snapshots are automatically
removed (implementation note: this can probably be achieved by means
of an extra CRD for the image plus the Kubernetes ownership
mechanisms). Then, for each identity, a new final PV is made based on
the snapshot. The final PVs are deleted when the identity disappears.

## More possibilities

We can consider extending the
[LVM2 CSI plugin](https://github.com/mesosphere/csilvm) to support
snapshots so that it can be used for local PV snapshots.

## Further ideas

We can auto-create a 1-replica StatefulSet for identities if the user
wants it. This will make the VM pods "hidden" and we'll have a CRD for
"pet" VMs.

We can also generate SMBIOS UUIDs for the pods of multi-replica
StatefulSets using UUIDv5 (a hash based on a UUID from the
VirtletVMIdentitySet yaml and the name of the pod), as sketched below.
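A minimal sketch of that derivation, assuming the
`github.com/google/uuid` package and using the identity set's
`firmwareUUID` as the namespace (the function name is illustrative):

```go
package identity

import "github.com/google/uuid"

// podFirmwareUUID derives a per-pod SMBIOS UUID from the UUID specified
// in the VirtletVMIdentitySet and the pod name, using UUIDv5 (SHA-1
// based name hashing), so the result is stable across pod restarts.
func podFirmwareUUID(identitySetUUID, podName string) (string, error) {
	ns, err := uuid.Parse(identitySetUUID)
	if err != nil {
		return "", err
	}
	return uuid.NewSHA1(ns, []byte(podName)).String(), nil
}
```

For example, `podFirmwareUUID("1b4d298f-6151-40b0-a1d4-269fc41d48f0",
"persistent-vm-0")` always yields the same UUID for replica 0, while
each replica of the StatefulSet gets its own distinct, deterministic
UUID.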
## Appendix A. Experimenting with Ceph CSI and external-snapshotter

For this experiment, kubeadm-dind-cluster with k8s 1.13 was used.
The following settings were applied:
```console
$ export FEATURE_GATES="BlockVolume=true,CSIBlockVolume=true,VolumeSnapshotDataSource=true"
$ export KUBELET_FEATURE_GATES="BlockVolume=true,CSIBlockVolume=true,VolumeSnapshotDataSource=true"
$ export ENABLE_CEPH=1
```

We'll need ceph-csi:
```console
$ git clone https://github.com/ceph/ceph-csi.git
```

We'll be using `external-snapshotter` for snapshots. In my case it was
hitting gRPC errors, probably due to a race with the CSI socket
initialization, so I had to add a container running a `sleep` command
to the CSI RBD plugin yaml:

```
diff --git a/deploy/rbd/kubernetes/csi-rbdplugin.yaml b/deploy/rbd/kubernetes/csi-rbdplugin.yaml
index d641a78..c8da074 100644
--- a/deploy/rbd/kubernetes/csi-rbdplugin.yaml
+++ b/deploy/rbd/kubernetes/csi-rbdplugin.yaml
@@ -38,6 +38,22 @@ spec:
               mountPath: /var/lib/kubelet/plugins/csi-rbdplugin
             - name: registration-dir
               mountPath: /registration
+        - name: csi-snapshotter
+          image: quay.io/k8scsi/csi-snapshotter:v0.4.0
+          command:
+            - /bin/sh
+            - -c
+            - "sleep 60000"
+          # args:
+          #   - "--csi-address=$(CSI_ENDPOINT)"
+          #   - "--connection-timeout=15s"
+          env:
+            - name: CSI_ENDPOINT
+              value: unix://var/lib/kubelet/plugins/csi-rbdplugin/csi.sock
+          imagePullPolicy: Always
+          volumeMounts:
+            - name: plugin-dir
+              mountPath: /var/lib/kubelet/plugins/csi-rbdplugin
         - name: csi-rbdplugin
           securityContext:
             privileged: true
```

Deploy the Ceph CSI plugin:
```console
$ ./plugin-deploy.sh
```

After deploying the Ceph CSI plugin, I had to start the snapshotter
manually from within the `csi-snapshotter` container of the
csi-rbdplugin pod:

```console
$ kubectl exec -it -c csi-snapshotter csi-rbdplugin-zwjl2 /bin/sh
$ /csi-snapshotter --logtostderr --csi-address /var/lib/kubelet/plugins/csi-rbdplugin/csi.sock --connection-timeout=15s --v=10
```

As a quick hack (for testing only!), you can dumb down RBAC to make it
work:
```console
kubectl create clusterrolebinding permissive-binding \
  --clusterrole=cluster-admin \
  --user=admin \
  --user=kubelet \
  --group=system:serviceaccounts
```

Start the Ceph demo container:
```console
$ MON_IP=$(docker exec kube-master route | grep default | awk '{print $2}')
$ CEPH_PUBLIC_NETWORK=${MON_IP}/16
$ docker run -d --net=host -e MON_IP=${MON_IP} \
    -e CEPH_PUBLIC_NETWORK=${CEPH_PUBLIC_NETWORK} \
    -e CEPH_DEMO_UID=foo \
    -e CEPH_DEMO_ACCESS_KEY=foo \
    -e CEPH_DEMO_SECRET_KEY=foo \
    -e CEPH_DEMO_BUCKET=foo \
    -e DEMO_DAEMONS="osd mds" \
    --name ceph_cluster docker.io/ceph/daemon demo
```

Create the pool:
```console
$ docker exec ceph_cluster ceph osd pool create kube 8 8
```

Create the secret:
```console
admin_secret="$(docker exec ceph_cluster ceph auth get-key client.admin)"
kubectl create secret generic csi-rbd-secret \
  --type="kubernetes.io/rbd" \
  --from-literal=admin="${admin_secret}" \
  --from-literal=kubernetes="${admin_secret}"
```

Then we can create the k8s objects for the storage class, the PVC and
the pod:
```yaml
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: csi-rbd
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: csi-rbdplugin
parameters:
  # Comma separated list of Ceph monitors.
  # If using FQDNs, make sure the csi plugin's dns policy is appropriate.
  monitors: 10.192.0.1:6789

  # If the "monitors" parameter is not set, the driver gets the monitors
  # from the same secret as the admin/user credentials; "monValueFromSecret"
  # provides the key in the secret whose value is the monitors.
  #monValueFromSecret: "monitors"

  # Ceph pool into which the RBD image shall be created
  pool: kube

  # RBD image format. Defaults to "2".
  imageFormat: "2"

  # RBD image features. Available for imageFormat: "2". CSI RBD currently supports only the `layering` feature.
  imageFeatures: layering

  # The secrets have to contain Ceph admin credentials.
  csiProvisionerSecretName: csi-rbd-secret
  csiProvisionerSecretNamespace: default
  csiNodePublishSecretName: csi-rbd-secret
  csiNodePublishSecretNamespace: default

  # Ceph users for operating RBD
  adminid: admin
  userid: admin
  # uncomment the following to use rbd-nbd as the mounter on supported nodes
  #mounter: rbd-nbd
reclaimPolicy: Delete
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: csi-rbd
---
apiVersion: v1
kind: Pod
metadata:
  name: csirbd-demo-pod
spec:
  containers:
    - name: web-server
      image: nginx
      volumeMounts:
        - name: mypvc
          mountPath: /var/lib/www/html
  volumes:
    - name: mypvc
      persistentVolumeClaim:
        claimName: rbd-pvc
        readOnly: false
```

The pod will start and you'll observe the PV being mounted under
`/var/lib/www/html` inside it (there'll be a `lost+found` directory
there).

For snapshots, you can create a snapshot class like this:
```yaml
apiVersion: snapshot.storage.k8s.io/v1alpha1
kind: VolumeSnapshotClass
metadata:
  name: csi-snapclass
snapshotter: csi-rbdplugin
parameters:
  monitors: 10.192.0.1:6789
  pool: kube
  imageFormat: "2"
  imageFeatures: layering
  csiSnapshotterSecretName: csi-rbd-secret
  csiSnapshotterSecretNamespace: default
  adminid: admin
  userid: admin
```

Then you can create a snapshot:
```yaml
apiVersion: snapshot.storage.k8s.io/v1alpha1
kind: VolumeSnapshot
metadata:
  name: rbd-pvc-snapshot
spec:
  snapshotClassName: csi-snapclass
  source:
    name: rbd-pvc
    kind: PersistentVolumeClaim
```

The snapshot can be observed to become `ready`:
```console
$ kubectl get volumesnapshot rbd-pvc-snapshot -o yaml
apiVersion: snapshot.storage.k8s.io/v1alpha1
kind: VolumeSnapshot
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"snapshot.storage.k8s.io/v1alpha1","kind":"VolumeSnapshot","metadata":{"annotations":{},"name":"rbd-pvc-snapshot","namespace":"default"},"spec":{"snapshotClassName":"csi-snapclass","source":{"kind":"PersistentVolumeClaim","name":"rbd-pvc"}}}
  creationTimestamp: 2018-10-19T06:07:39Z
  generation: 1
  name: rbd-pvc-snapshot
  namespace: default
  resourceVersion: "7444"
  selfLink: /apis/snapshot.storage.k8s.io/v1alpha1/namespaces/default/volumesnapshots/rbd-pvc-snapshot
  uid: 48337825-d365-11e8-aec0-fae2979a43cc
spec:
  snapshotClassName: csi-snapclass
  snapshotContentName: snapcontent-48337825-d365-11e8-aec0-fae2979a43cc
  source:
    kind: PersistentVolumeClaim
    name: rbd-pvc
status:
  creationTime: 2018-10-19T06:07:40Z
  ready: true
  restoreSize: 1Gi
```
You can also find the snapshot using Ceph tools in the `ceph_cluster`
container: use `rbd ls --pool=kube` to list the RBD images and then
inspect the one corresponding to your PV, e.g.:
```console
$ rbd snap ls kube/pvc-aff10346d36411e8
```
(if you've created just one PV/PVC, there'll be just one image in the
`kube` pool).

In theory, you should be able to make a new PV from the snapshot like this:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-restore
spec:
  # storageClassName: csi-rbd
  dataSource:
    name: rbd-pvc-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```

but unfortunately this part didn't work for me. I was getting PVC
modification errors, and even commenting out `storageClassName: csi-rbd`
didn't help (doing so did help with similar problems in the
external-storage project).

Another problem I encountered: when I attempted to make a block PVC for
the Virtlet persistent rootfs, bound to the following PV,
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: test-block-pv
spec:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 10Mi
  claimRef:
    name: ceph-block-pvc
    namespace: default
  persistentVolumeReclaimPolicy: Delete
  volumeMode: Block
  rbd:
    image: tstimg
    monitors:
      - 10.192.0.1:6789
    pool: kube
    secretRef:
      name: ceph-admin
```

the PV was actually created, but the PVC remained pending because of
another problem with updating the PVC. Obviously, either
`external-snapshotter`, `ceph-csi` or both projects need some fixes,
but we can hope that this functionality will mature soon (or we can
help fix it).