# Improving VM isolation

The current `qemu.conf` used by Virtlet's libvirt deployment looks
like this:
```
stdio_handler = "file"
user = "root"
group = "root"
# we need to do network management stuff in vmwrapper
clear_emulator_capabilities = 0

cgroup_device_acl = [
        "/dev/null", "/dev/full", "/dev/zero",
        "/dev/random", "/dev/urandom",
        "/dev/ptmx", "/dev/kvm", "/dev/kqemu",
        "/dev/rtc", "/dev/hpet", "/dev/net/tun",
    ]
```

It has some apparent problems: VMs run as root, capabilities are not
dropped, and the list of devices available to processes in the VM's
cgroup is extended. On top of that, if an attacker breaks out of a
VM, they will be able to take complete control of the Kubernetes node
where the Virtlet pod runs, because Virtlet and libvirt run in a
single privileged container.
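
For reference, a tightened configuration might look roughly like this
(a sketch only; the exact user/group and device list depend on the
isolation approach chosen, and `clear_emulator_capabilities = 1` is
only possible once vmwrapper no longer needs extra capabilities for
network setup):
```
stdio_handler = "file"
# run emulator processes as an unprivileged user
user = "qemu"
group = "qemu"
# let libvirt drop capabilities before exec'ing the emulator
clear_emulator_capabilities = 1

# only the devices the emulator actually needs
cgroup_device_acl = [
        "/dev/null", "/dev/full", "/dev/zero",
        "/dev/random", "/dev/urandom",
        "/dev/ptmx", "/dev/kvm", "/dev/net/tun",
    ]
```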

There are three approaches to fixing this, described in the following
sections, followed by a section with recommendations common to all of
them.

## Using a single separate container for VMs

One possibility to reduce the attack surface is adding an extra
non-privileged container for all the VMs. This way an attacker who
escaped a VM will only be able to compromise other VMs on the same
node, but not the node itself.

When using this approach, we'll still have to use libvirt's cgroup
management mechanism, relying on the libvirt API for VM resource
management. Moreover, we'll have to use libvirt's own cgroup
hierarchy. In theory, it should be possible to tell libvirt to make
VM cgroups children of some Docker container cgroup by following
[the example](https://libvirt.org/cgroups.html#customPartiton) given in the libvirt documentation,
e.g.:
```xml
<resource>
  <partition>/docker/158056e84da21859909b054fa8143f505579f6743715cc289e2231445fb4e939/vm</partition>
</resource>
```
so that, e.g., for the `cpu` controller libvirt would use the path
`/sys/fs/cgroup/cpu/docker/158056e84da21859909b054fa8143f505579f6743715cc289e2231445fb4e939/vm.partition`.

Unfortunately this doesn't work because libvirt adds the `.partition`
suffix not just to the last item in the path but to every part of it,
so it tries to use
`/sys/fs/cgroup/cpu/docker.partition/158056e84da21859909b054fa8143f505579f6743715cc289e2231445fb4e939.partition/vm.partition`,
and apparently this behavior can't be overridden. Thus, we'll have to
use the default libvirt cgroup hierarchy, and any resource limits
imposed on the VM container itself will not affect the VMs.
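
The path-mangling behavior described above can be illustrated with a
small sketch (a hypothetical helper, not libvirt code; libvirt also
escapes some characters in path components, which is ignored here):

```python
def libvirt_partition_to_cgroup(controller: str, partition: str) -> str:
    """Mimic how libvirt maps a <partition> path to a cgroup fs path:
    every path component gets the ".partition" suffix, not just the
    last one."""
    parts = [p for p in partition.split("/") if p]
    mangled = "/".join(p + ".partition" for p in parts)
    return "/sys/fs/cgroup/%s/%s" % (controller, mangled)
```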

The plan for this approach is to make a separate container for VMs in
the Virtlet DaemonSet definition and then make `vmwrapper` perform
extra steps instead of just executing the actual emulator:

1. Find a process running inside the VM container
2. Find the required emulator binary and open it for reading
3. Use an
   [nsenter-like](https://github.com/docker/libcontainer/tree/6aeb7e1fa51f04f1253f79fc86da4b608fcb3b59/nsenter)
   mechanism to enter the namespaces of the process that runs in the
   VM container, including the mount namespace
4. Use the `devices` cgroup controller to disable access to the
   devices that are not needed for the VM (although probably needed
   for `vmwrapper`'s network setup mechanism)
5. Drop the capabilities of the current process (another option:
   switch to a non-root user)
6. Execute the emulator binary using `/proc/self/fd/NNN`, where `NNN`
   is the file descriptor from step 2 (that's what
   [fexecve()](https://github.com/lattera/glibc/blob/a2f34833b1042d5d8eeb263b4cf4caaea138c4ad/sysdeps/unix/sysv/linux/fexecve.c#L28)
   does on Linux). This way, we avoid the need for the emulator
   binary to be available in the VM container.
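
Steps 2 and 6 can be sketched as follows (an illustration of the
fexecve()-style trick in Python rather than Virtlet's actual Go code;
the helper name is hypothetical):

```python
import os

def exec_via_fd(binary_path, argv):
    """Open a binary for reading, then exec it through /proc/self/fd/NNN
    (the same trick glibc's fexecve() uses on Linux)."""
    fd = os.open(binary_path, os.O_RDONLY)
    # execve() resolves the /proc/self/fd/NNN path before close-on-exec
    # descriptors are closed, so O_CLOEXEC on fd is not a problem here.
    os.execv("/proc/self/fd/%d" % fd, argv)
```

In vmwrapper the fd would be opened before entering the VM container's
mount namespace, so the binary never has to exist inside it.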

A simple non-Virtlet image like `busybox` should be used for the VM
container (the one that will run the emulator). The VM container must
be able to access the paths used by the emulator, such as volumes and
the monitor socket.

Pros:
* Relatively easy to implement
* Provides a reasonable level of security

Cons:
* Escaping a single VM means controlling all the VMs on the node
* Need to implement libvirt-specific resource monitoring functionality
  (the `ContainerStats()`, `ListContainerStats()` and `ImageFsInfo()`
  CRI calls)

## Using dedicated QEMU/KVM container per VM with libvirt cgroups

This approach basically repeats the previous one, with one important
difference: instead of starting a single VM container as part of the
Virtlet pod, we use the Docker API to run a new container for each
VM. The mechanics of vmwrapper remain the same, differing only in how
the PID of a process inside the target container is obtained.
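
Extracting that PID could be sketched like this (the helper is
hypothetical; Docker reports a running container's init PID in the
`State.Pid` field of `docker inspect` output):

```python
import json

def pid_from_inspect(inspect_json: str) -> int:
    """Extract the init PID of a running container from the output of
    `docker inspect <container>` (a JSON array with one object)."""
    info = json.loads(inspect_json)[0]
    pid = info["State"]["Pid"]
    if pid == 0:
        raise RuntimeError("container is not running")
    return pid
```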

Pros:
* Provides a better level of security than a single container for all
  VMs, as escaping a VM only leads to the compromise of a single
  container

Cons:
* Harder to implement than a single container for all VMs
* Need to access the docker socket
* Need to implement libvirt-specific resource monitoring functionality
  (the `ContainerStats()`, `ListContainerStats()` and `ImageFsInfo()`
  CRI calls)

For more info on implementing resource monitoring, see
[the corresponding section](http://libvirt.org/apps.html#monitoring)
in the libvirt documentation. Basically, we can tap directly into the
libvirt cgroup hierarchy to get CPU and memory usage, and for
filesystem usage just look at the number and size of the volumes in
use.
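
Such direct cgroup reads can be sketched as follows (assuming the
standard cgroup v1 layout; the mount point, the VM cgroup path and
the helper name are illustrative, not Virtlet code):

```python
import os

def read_vm_usage(cgroup_root: str, vm_path: str) -> dict:
    """Read cumulative CPU usage (ns) and current memory usage (bytes)
    for a VM from its cgroup v1 controller directories."""
    def read_int(controller, filename):
        path = os.path.join(cgroup_root, controller, vm_path, filename)
        with open(path) as f:
            return int(f.read().strip())
    return {
        "cpu_usage_ns": read_int("cpuacct", "cpuacct.usage"),
        "memory_bytes": read_int("memory", "memory.usage_in_bytes"),
    }
```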

## Using dedicated QEMU/KVM container per VM with docker cgroups

This approach is the same as the previous one, except that we also
use Docker cgroups for the VM containers. In this case, libvirt
cgroups are effectively unused, and we use mechanisms closely
resembling those of the standard kubelet dockershim for resource
limits and monitoring.

Pros:
* Provides a better level of security than a single container for all
  VMs, as escaping a VM only leads to the compromise of a single
  container
* Resources are managed in the standard Kubernetes way; we just mimic
  kubelet's dockershim

Cons:
* Harder to implement than a single container for all VMs
* Need to access the docker socket
* Need to redo resource limits in Virtlet

## Additional security measures

The following additional security measures can be taken no matter
which approach we choose:
* Use a separate container for libvirt. This entails changing how we
  prepare the tap fd in vmwrapper, because mounting the network
  namespace directory can be problematic in some cases (e.g. because
  of one of `/run` or `/var/run` being a symbolic link). Basically,
  the tap fd must be prepared on the Virtlet side and then sent over
  a Unix domain socket to the vmwrapper process (the socket may
  reside on an `emptyDir` volume). With the current version of Go
  this is somewhat complicated because of the problems with switching
  namespaces inside a Go process, so this will mean starting a
  subprocess that prepares and sends the file descriptor.
* Use a Unix domain socket for libvirt. Currently we use a TCP socket
  and listen for connections on `localhost`, thus providing
  additional possibilities for an attack.
* Add AppArmor/SELinux support out of the box
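
The tap fd passing mentioned in the first item relies on SCM_RIGHTS
ancillary data over a Unix domain socket. Sketched in Python 3.9+
(Virtlet itself is written in Go, so this only illustrates the
mechanism):

```python
import socket

def send_fd(sock: socket.socket, fd: int) -> None:
    # Ship a file descriptor to the peer as SCM_RIGHTS ancillary data;
    # the one-byte payload is just a marker.
    socket.send_fds(sock, [b"tap"], [fd])

def recv_fd(sock: socket.socket) -> int:
    # The kernel installs a fresh descriptor for the same open file
    # description in the receiving process.
    msg, fds, flags, addr = socket.recv_fds(sock, 16, 1)
    return fds[0]
```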

## Recommendations

The recommendation is to begin with the first approach as the easiest
one, moving to the container-per-VM approach with libvirt cgroups
later when we can. After we have a container per VM, we need to
decide whether moving to Docker cgroups will really help us.

The *additional security measures* need to be implemented, too. These
changes may be done in any sequence.

## Appendix

CRI data structures relevant to resource usage statistics:

```protobuf
// ContainerAttributes provides basic information of the container.
message ContainerAttributes {
    // ID of the container.
    string id = 1;
    // Metadata of the container.
    ContainerMetadata metadata = 2;
    // Key-value pairs that may be used to scope and select individual resources.
    map<string,string> labels = 3;
    // Unstructured key-value map holding arbitrary metadata.
    // Annotations MUST NOT be altered by the runtime; the value of this field
    // MUST be identical to that of the corresponding ContainerConfig used to
    // instantiate the Container this status represents.
    map<string,string> annotations = 4;
}

// ContainerStats provides the resource usage statistics for a container.
message ContainerStats {
    // Information of the container.
    ContainerAttributes attributes = 1;
    // CPU usage gathered from the container.
    CpuUsage cpu = 2;
    // Memory usage gathered from the container.
    MemoryUsage memory = 3;
    // Usage of the writeable layer.
    FilesystemUsage writable_layer = 4;
}

// CpuUsage provides the CPU usage information.
message CpuUsage {
    // Timestamp in nanoseconds at which the information were collected. Must be > 0.
    int64 timestamp = 1;
    // Cumulative CPU usage (sum across all cores) since object creation.
    UInt64Value usage_core_nano_seconds = 2;
}

// MemoryUsage provides the memory usage information.
message MemoryUsage {
    // Timestamp in nanoseconds at which the information were collected. Must be > 0.
    int64 timestamp = 1;
    // The amount of working set memory in bytes.
    UInt64Value working_set_bytes = 2;
}

// FilesystemUsage provides the filesystem usage information.
message FilesystemUsage {
    // Timestamp in nanoseconds at which the information were collected. Must be > 0.
    int64 timestamp = 1;
    // The underlying storage of the filesystem.
    StorageIdentifier storage_id = 2;
    // UsedBytes represents the bytes used for images on the filesystem.
    // This may differ from the total bytes used on the filesystem and may not
    // equal CapacityBytes - AvailableBytes.
    UInt64Value used_bytes = 3;
    // InodesUsed represents the inodes used by the images.
    // This may not equal InodesCapacity - InodesAvailable because the underlying
    // filesystem may also be used for purposes other than storing images.
    UInt64Value inodes_used = 4;
}
```

## References

The initial research of Virtlet security was done by [Adam Heczko](mailto:aheczko@mirantis.com).