# The problem of fast persistent rootfs setup

Right now, the persistent rootfs for VMs is initialized by means of
`qemu-img convert`, which converts a QCOW2 image to a raw one and
writes the result over a block device mapped on the host, so starting
a VM has to wait for a full image copy. It's possible to overcome the
problem by utilizing the new persistent volume snapshotting feature in
Kubernetes 1.13. It's also possible to implement a solution for VMs
which use a local libvirt volume as their root filesystem. In both
cases, we'll need to add a CRD and a controller for managing
persistent root filesystems.
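
For reference, the current path boils down to a full conversion-and-copy
onto the mapped device; a minimal sketch (the image and device paths here
are made up for illustration, not the actual ones Virtlet uses):

```console
$ # convert the QCOW2 image to raw, writing the result directly over
$ # the block device mapped on the host for the VM's root filesystem
$ qemu-img convert -O raw /var/lib/virtlet/images/foobar-image.qcow2 /dev/mapper/virtlet-rootfs
```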

## Defining VM identities

To support faster rootfs initialization, we need to introduce the
concept of a VM Identity. VM Identities are defined like this:

```yaml
---
apiVersion: "virtlet.k8s/v1"
kind: VirtletVMIdentitySet
metadata:
  name: test-identity-set
spec:
  # specify the image to use. sha256 digest is required here
  # to avoid possible inconsistencies.
  image: virtlet.cloud/foobar-image@sha256:a8dd75ecffd4cdd96072d60c2237b448e0c8b2bc94d57f10fdbc8c481d9005b8
  # specify the SMBIOS UUID (optional). If the UUID is specified,
  # only a single VM at a time can use this IdentitySet
  firmwareUUID: 1b4d298f-6151-40b0-a1d4-269fc41d48f0
  # specify the type of VM identity:
  #   'Local' for libvirt-backed identities
  #   'PersistentVolume' for PV-backed identities
  type: Local
  # specify the size to use, defaults to the virtual size
  # of the image
  size: 1Gi
  # for non-local PVs, storageClassName specifies the storage
  # class to use
  # storageClassName: csi-rbd
```

and they can be associated with VM pods like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: persistent-vm
  annotations:
    kubernetes.io/target-runtime: virtlet.cloud
    VirtletVMIdentitySet: test-identity-set
    # uncomment to use instead of the pod name
    # VirtletVMIdentityName: override-identity-name
spec:
  ...
  containers:
  - name: vm
    # must match the image specified in test-identity-set
    image: virtlet.cloud/foobar-image@sha256:a8dd75ecffd4cdd96072d60c2237b448e0c8b2bc94d57f10fdbc8c481d9005b8
...
```

The VM Identity Sets and the VM Identity objects generated from them
are managed by the VM Identity Controller, which is created as a
Deployment from `virtletctl`-generated yaml.
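
Deploying it would then follow the usual Virtlet bootstrap flow
(assuming, as this proposal implies, that the controller's Deployment
is added to the output of the existing `virtletctl gen` command):

```console
$ virtletctl gen > virtlet.yaml
$ kubectl apply -f virtlet.yaml
```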

The VM identity object looks like this:
```yaml
apiVersion: "virtlet.k8s/v1"
kind: VirtletVMIdentity
metadata:
  name: persistent-vm
spec:
  # spec fields are copied from VMIdentitySet
  image: virtlet.cloud/foobar-image@sha256:a8dd75ecffd4cdd96072d60c2237b448e0c8b2bc94d57f10fdbc8c481d9005b8
  firmwareUUID: 1b4d298f-6151-40b0-a1d4-269fc41d48f0
  type: Local
  size: 1Gi
status:
  creationTime: 2018-10-19T06:07:40Z
  # the 'ready' field is true when this identity is
  # ready for use in a VM. Virtlet will delay
  # the actual startup of the pod until the identity
  # is ready.
  ready: true
```

When a pod is created, a VM Identity object is instantiated if it
doesn't exist yet. The VM Identity objects are custom resources that
have their own name (as all k8s objects do) and a pointer to a
`VMIdentitySet`. The name of the VM Identity object defaults to that
of the pod unless the `VirtletVMIdentityName` annotation is specified.
This extra level of indirection is needed to enable VM identities for
`StatefulSets`.

The VM Identity controller takes care of managing the PVCs
corresponding to the identities, as well as asking the local Virtlet
processes to create libvirt volumes.

## Using libvirt-backed identities

The libvirt-backed VM identity objects correspond to libvirt volumes
that have QCOW2 images as backing files. When such an identity object
is deleted, the volume is deleted, too. When a VM pod is created on a
host other than the one currently hosting the corresponding libvirt
volume, the volume is moved to the target host. This enables offline
migration of VMs with a libvirt-backed persistent root filesystem.
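
Because the root volume is just a QCOW2 overlay with the image as its
backing file, no full image copy is needed. A minimal sketch of the
volume creation step on the node, using `virsh` with made-up pool and
volume names (Virtlet itself would presumably do this through the
libvirt API rather than the CLI):

```console
$ # create a qcow2 volume backed by the (read-only) image volume;
$ # pool name, volume names and size are illustrative
$ virsh vol-create-as --pool volumes rootfs-persistent-vm 1G \
        --format qcow2 --backing-vol foobar-image --backing-vol-format qcow2
```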

The basic idea of orchestrating the lifecycle of the libvirt volumes
that correspond to VM identities is to have the VM Identity controller
contact the Virtlet process on the node and make it perform the
necessary actions on the libvirt volume objects.

## Using persistent volume snapshots

Kubernetes v1.13 adds support for
[persistent volume snapshots](https://kubernetes.io/blog/2018/10/09/introducing-volume-snapshot-alpha-for-kubernetes/)
for CSI, which is supported by the Ceph CSI driver, among others. We
can keep a pool of snapshots that correspond to different images.
After that, we can create new PVs from these snapshots, resizing them
if necessary (we should recheck that such resizes are possible once
snapshotting stabilizes for Ceph CSI).

There are some issues with the current Ceph CSI and the snapshotter;
see
[Appendix A](#appendix-a-experimenting-with-ceph-csi-and-external-snapshotter)
for details.

We'll be dealing with Ceph CSI in the description below.

For each image with a hash mentioned in a VM identity, Virtlet will
create a PVC, wait for it to be provisioned, write the raw image
contents to it, and then make a snapshot. Once an image stops being
used, the corresponding PVC, PV, and snapshots are automatically
removed (implementation note: this can probably be achieved by means
of an extra CRD for the image plus the Kubernetes ownership
mechanism). Then, for each identity, a new final PV is made based on
the snapshot. The final PVs are deleted when the identity disappears.
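
Stripped of the controller machinery, the per-image preparation step
amounts to something like the sketch below. The manifest file names and
the device path are hypothetical; in practice the raw image would be
written from whatever pod or process has the image PVC attached in
block mode:

```console
$ kubectl apply -f image-pvc.yaml             # hypothetical block-mode PVC for the image
$ qemu-img convert -O raw foobar-image.qcow2 /dev/xvdb   # /dev/xvdb: hypothetical device the PVC shows up as
$ kubectl apply -f image-pvc-snapshot.yaml    # hypothetical VolumeSnapshot of the image PVC
```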

## More possibilities

We can consider extending the
[LVM2 CSI plugin](https://github.com/mesosphere/csilvm) to support
snapshots so it can be used for local PV snapshots.

## Further ideas

We can auto-create a 1-replica StatefulSet for identities if the user
wants it. This will make VM pods "hidden" and we'll have a CRD for
"pet" VMs.

We can also generate SMBIOS UUIDs for the pods of multi-replica
StatefulSets using UUIDv5 (a hash based on a UUID from the
VirtletVMIdentitySet yaml and the name of the pod).
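
As an illustration of the UUIDv5 scheme (not part of the proposal's
tooling), a reasonably recent util-linux `uuidgen` can derive such a
name-based UUID from the `firmwareUUID` above and a StatefulSet pod
name:

```console
$ # UUIDv5 = SHA-1 name-based UUID, here using the identity set's UUID
$ # as the namespace; the pod name is just an example
$ uuidgen --sha1 --namespace 1b4d298f-6151-40b0-a1d4-269fc41d48f0 --name persistent-vm-0
```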

## Appendix A. Experimenting with Ceph CSI and external-snapshotter

For this experiment, kubeadm-dind-cluster with k8s 1.13 was used.
The following settings were applied:
```console
$ export FEATURE_GATES="BlockVolume=true,CSIBlockVolume=true,VolumeSnapshotDataSource=true"
$ export KUBELET_FEATURE_GATES="BlockVolume=true,CSIBlockVolume=true,VolumeSnapshotDataSource=true"
$ export ENABLE_CEPH=1
```

We'll need ceph-csi:
```console
$ git clone https://github.com/ceph/ceph-csi.git
```

We'll be using `external-snapshotter` for snapshots. In my case it was
hitting gRPC errors, probably due to a race with the CSI socket
initialization, so I had to add a container running a `sleep` command
to the CSI RBD plugin yaml:

```diff
diff --git a/deploy/rbd/kubernetes/csi-rbdplugin.yaml b/deploy/rbd/kubernetes/csi-rbdplugin.yaml
index d641a78..c8da074 100644
--- a/deploy/rbd/kubernetes/csi-rbdplugin.yaml
+++ b/deploy/rbd/kubernetes/csi-rbdplugin.yaml
@@ -38,6 +38,22 @@ spec:
               mountPath: /var/lib/kubelet/plugins/csi-rbdplugin
             - name: registration-dir
               mountPath: /registration
+        - name: csi-snapshotter
+          image: quay.io/k8scsi/csi-snapshotter:v0.4.0
+          command:
+          - /bin/sh
+          - -c
+          - "sleep 60000"
+          # args:
+          #   - "--csi-address=$(CSI_ENDPOINT)"
+          #   - "--connection-timeout=15s"
+          env:
+            - name: CSI_ENDPOINT
+              value: unix://var/lib/kubelet/plugins/csi-rbdplugin/csi.sock
+          imagePullPolicy: Always
+          volumeMounts:
+            - name: plugin-dir
+              mountPath: /var/lib/kubelet/plugins/csi-rbdplugin
         - name: csi-rbdplugin
           securityContext:
             privileged: true
```

Deploy the Ceph CSI plugin:
```console
$ ./plugin-deploy.sh
```

After deploying the Ceph CSI plugin, I had to start the snapshotter
manually from within the `csi-snapshotter` container of the
`csi-rbdplugin` pod:

```console
$ kubectl exec -it -c csi-snapshotter csi-rbdplugin-zwjl2 /bin/sh
$ /csi-snapshotter --logtostderr --csi-address /var/lib/kubelet/plugins/csi-rbdplugin/csi.sock --connection-timeout=15s --v=10
```

As a quick hack (for testing only!), you can dumb down RBAC to make it
work:
```console
$ kubectl create clusterrolebinding permissive-binding \
        --clusterrole=cluster-admin \
        --user=admin \
        --user=kubelet \
        --group=system:serviceaccounts
```

Start Ceph demo container:
```console
$ MON_IP=$(docker exec kube-master route | grep default | awk '{print $2}')
$ CEPH_PUBLIC_NETWORK=${MON_IP}/16
$ docker run -d --net=host -e MON_IP=${MON_IP} \
         -e CEPH_PUBLIC_NETWORK=${CEPH_PUBLIC_NETWORK} \
         -e CEPH_DEMO_UID=foo \
         -e CEPH_DEMO_ACCESS_KEY=foo \
         -e CEPH_DEMO_SECRET_KEY=foo \
         -e CEPH_DEMO_BUCKET=foo \
         -e DEMO_DAEMONS="osd mds" \
         --name ceph_cluster docker.io/ceph/daemon demo
```

Create the pool:
```console
$ docker exec ceph_cluster ceph osd pool create kube 8 8
```

Create the secret:
```console
$ admin_secret="$(docker exec ceph_cluster ceph auth get-key client.admin)"
$ kubectl create secret generic csi-rbd-secret \
        --type="kubernetes.io/rbd" \
        --from-literal=admin="${admin_secret}" \
        --from-literal=kubernetes="${admin_secret}"
```

Then we can create the k8s objects for the storage class, the PVC, and the pod:
```yaml
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
   name: csi-rbd
   annotations:
       storageclass.kubernetes.io/is-default-class: "true"
provisioner: csi-rbdplugin
parameters:
    # Comma separated list of Ceph monitors
    # if using FQDN, make sure the csi plugin's dns policy is appropriate.
    monitors: 10.192.0.1:6789

    # if the "monitors" parameter is not set, the driver gets the monitors
    # from the same secret as the admin/user credentials. "monValueFromSecret"
    # provides the key in the secret whose value is the mons
    #monValueFromSecret: "monitors"

    # Ceph pool into which the RBD image shall be created
    pool: kube

    # RBD image format. Defaults to "2".
    imageFormat: "2"

    # RBD image features. Available for imageFormat: "2". CSI RBD currently supports only the `layering` feature.
    imageFeatures: layering

    # The secrets have to contain Ceph admin credentials.
    csiProvisionerSecretName: csi-rbd-secret
    csiProvisionerSecretNamespace: default
    csiNodePublishSecretName: csi-rbd-secret
    csiNodePublishSecretNamespace: default

    # Ceph users for operating RBD
    adminid: admin
    userid: admin
    # uncomment the following to use rbd-nbd as mounter on supported nodes
    #mounter: rbd-nbd
reclaimPolicy: Delete
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: rbd-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: csi-rbd
---
apiVersion: v1
kind: Pod
metadata:
  name: csirbd-demo-pod
spec:
  containers:
   - name: web-server
     image: nginx
     volumeMounts:
       - name: mypvc
         mountPath: /var/lib/www/html
  volumes:
   - name: mypvc
     persistentVolumeClaim:
       claimName: rbd-pvc
       readOnly: false
```

The pod will start and you'll observe the PV being mounted under
`/var/lib/www/html` inside it (there will be a `lost+found` directory
there).
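
A quick way to verify this with the demo pod defined above:

```console
$ kubectl exec csirbd-demo-pod -c web-server -- ls /var/lib/www/html
lost+found
```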

For snapshots, you can create a snapshot class like this:
```yaml
apiVersion: snapshot.storage.k8s.io/v1alpha1
kind: VolumeSnapshotClass
metadata:
  name: csi-snapclass
snapshotter: csi-rbdplugin
parameters:
  monitors: 10.192.0.1:6789
  pool: kube
  imageFormat: "2"
  imageFeatures: layering
  csiSnapshotterSecretName: csi-rbd-secret
  csiSnapshotterSecretNamespace: default
  adminid: admin
  userid: admin
```

Then you can create a snapshot:
```yaml
apiVersion: snapshot.storage.k8s.io/v1alpha1
kind: VolumeSnapshot
metadata:
  name: rbd-pvc-snapshot
spec:
  snapshotClassName: csi-snapclass
  source:
    name: rbd-pvc
    kind: PersistentVolumeClaim
```

The snapshot can be observed to become `ready`:
```console
$ kubectl get volumesnapshot rbd-pvc-snapshot -o yaml
apiVersion: snapshot.storage.k8s.io/v1alpha1
kind: VolumeSnapshot
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"snapshot.storage.k8s.io/v1alpha1","kind":"VolumeSnapshot","metadata":{"annotations":{},"name":"rbd-pvc-snapshot","namespace":"default"},"spec":{"snapshotClassName":"csi-snapclass","source":{"kind":"PersistentVolumeClaim","name":"rbd-pvc"}}}
  creationTimestamp: 2018-10-19T06:07:39Z
  generation: 1
  name: rbd-pvc-snapshot
  namespace: default
  resourceVersion: "7444"
  selfLink: /apis/snapshot.storage.k8s.io/v1alpha1/namespaces/default/volumesnapshots/rbd-pvc-snapshot
  uid: 48337825-d365-11e8-aec0-fae2979a43cc
spec:
  snapshotClassName: csi-snapclass
  snapshotContentName: snapcontent-48337825-d365-11e8-aec0-fae2979a43cc
  source:
    kind: PersistentVolumeClaim
    name: rbd-pvc
status:
  creationTime: 2018-10-19T06:07:40Z
  ready: true
  restoreSize: 1Gi
```

You can also find it using Ceph tools in the `ceph_cluster` container.
You can use `rbd ls --pool=kube` to list the RBD images and then
inspect the one corresponding to your PV, e.g.
```console
$ rbd snap ls kube/pvc-aff10346d36411e8
```
(if you've created just one PV/PVC, there'll be just one image in the
`kube` pool).

In theory, you should be able to make a new PV from the snapshot like this:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-restore
spec:
  # storageClassName: csi-rbd
  dataSource:
    name: rbd-pvc-snapshot
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
```

but unfortunately this part didn't work for me. I was getting PVC
modification errors, and even commenting out `storageClassName: csi-rbd`
didn't help (that did help with similar problems in the external-storage
project).

Another problem I encountered: when I attempted to make a block PVC
for the Virtlet persistent rootfs using the following PV,
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: test-block-pv
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 10Mi
  claimRef:
    name: ceph-block-pvc
    namespace: default
  persistentVolumeReclaimPolicy: Delete
  volumeMode: Block
  rbd:
    image: tstimg
    monitors:
    - 10.192.0.1:6789
    pool: kube
    secretRef:
      name: ceph-admin
```

the PV actually got created, but the PVC remained pending because of
another problem with updating the PVC. Obviously, either
`external-snapshotter`, `ceph-csi`, or both projects need some fixes,
but we can hope this functionality will mature soon (or we can help
fix it).