# Improving VM isolation

The current `qemu.conf` used by Virtlet's libvirt deployment looks
like this:
```
stdio_handler = "file"
user = "root"
group = "root"
# we need to do network management stuff in vmwrapper
clear_emulator_capabilities = 0

cgroup_device_acl = [
    "/dev/null", "/dev/full", "/dev/zero",
    "/dev/random", "/dev/urandom",
    "/dev/ptmx", "/dev/kvm", "/dev/kqemu",
    "/dev/rtc", "/dev/hpet", "/dev/net/tun",
]
```

It has some apparent problems: VMs run as root, capabilities are not
dropped, and the list of devices available to processes in the VM's
cgroup is extended. On top of that, if an attacker breaks out of a VM,
they will be able to take complete control of the Kubernetes node
where the Virtlet pod runs, because Virtlet and libvirt run in a
single privileged container.

There are three approaches to fixing this, described in the following
sections, followed by a section with recommendations that are common
to all the approaches.

## Using a single separate container for VMs

One possibility to reduce the attack surface is adding an extra
non-privileged container for all the VMs. This way an attacker who
escaped a VM will only be able to compromise other VMs on the same
node, but not the node itself.

When using this approach, we'll still have to use libvirt's cgroup
management mechanism, relying on the libvirt API for VM resource
management. Moreover, we'll have to use libvirt's own cgroup
hierarchy.
In theory, it should be possible to tell libvirt to make VM cgroups
children of some Docker container cgroup by following
[the example](https://libvirt.org/cgroups.html#customPartiton) given
in the libvirt documentation, e.g.:
```xml
<resource>
  <partition>/docker/158056e84da21859909b054fa8143f505579f6743715cc289e2231445fb4e939/vm</partition>
</resource>
```
so that e.g. for the `cpu` controller libvirt would use the path
`/sys/fs/cgroup/cpu/docker/158056e84da21859909b054fa8143f505579f6743715cc289e2231445fb4e939/vm.partition`

Unfortunately, this doesn't work, because libvirt adds the
`.partition` suffix not just to the last item in the path but to every
part of it, so it tries to use
`/sys/fs/cgroup/cpu/docker.partition/158056e84da21859909b054fa8143f505579f6743715cc289e2231445fb4e939.partition/vm.partition`
and apparently this behavior can't be overridden. Thus, we'll have to
use the default libvirt cgroup hierarchy, and any resource limits
imposed on the VM container itself will not affect the VMs.

The plan for this approach is to add a separate container for VMs in
the Virtlet DaemonSet definition and then make `vmwrapper` perform
extra steps instead of just executing the actual emulator:

1. Find a process running inside the VM container.
2. Find the required emulator binary and open it for reading.
3. Use an
   [nsenter-like](https://github.com/docker/libcontainer/tree/6aeb7e1fa51f04f1253f79fc86da4b608fcb3b59/nsenter)
   mechanism to enter the namespaces of the process that runs in the VM
   container, including the mount namespace.
4. Use the `devices` cgroup controller to disable access to the devices
   that are not needed for the VM (although some are probably needed for
   `vmwrapper`'s network setup mechanism).
5. Drop capabilities of the current process (another option: switch to
   a non-root user).
6. Execute the emulator binary using `/proc/self/fd/NNN`, where `NNN` is
   the file descriptor from step 2 (that's what
   [fexecve()](https://github.com/lattera/glibc/blob/a2f34833b1042d5d8eeb263b4cf4caaea138c4ad/sysdeps/unix/sysv/linux/fexecve.c#L28)
   does on Linux). This way, we may avoid the need for the emulator
   binary to be available in the VM container.

A simple non-Virtlet image like `busybox` should be used for the VM
container (the one running the emulator itself). The VM container must
be able to access the paths used by the emulator, such as volumes and
the monitor socket.

Pros:
* Relatively easy to implement
* Provides a reasonable level of security

Cons:
* Escaping a single VM means controlling all the VMs on the node
* Need to implement libvirt-specific resource monitoring functionality
  (`ContainerStats()`, `ListContainerStats()` and `ImageFsInfo()` CRI
  calls)

## Using a dedicated QEMU/KVM container per VM with libvirt cgroups

This approach basically repeats the previous one, with one important
difference: instead of just starting a single VM container as part of
the Virtlet pod, we use the Docker API to run a new container for each
VM. The mechanics of `vmwrapper` remain the same, differing only in
how the PID of a process inside the target container is obtained.

Pros:
* Provides a better level of security than a single container for all
  VMs, as escaping the VM will only lead to the compromise of a single
  container

Cons:
* Harder to implement than a single container for all VMs
* Need to access the Docker socket
* Need to implement libvirt-specific resource monitoring functionality
  (`ContainerStats()`, `ListContainerStats()` and `ImageFsInfo()` CRI
  calls)

For more info on implementing resource monitoring, see
[the corresponding section](http://libvirt.org/apps.html#monitoring)
in the libvirt documentation.
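As a rough illustration of what tapping into the cgroup hierarchy
looks like, here is a minimal Python sketch that reads CPU and memory
usage for one VM from a cgroup v1 layout. The controller file names
are the standard `cpuacct`/`memory` ones, but the partition path and
the function itself are illustrative assumptions, not Virtlet code:

```python
import os


def read_vm_usage(cgroup_root, vm_partition):
    """Read CPU and memory usage for one VM from a cgroup v1 hierarchy.

    cgroup_root is normally /sys/fs/cgroup; vm_partition is the path of
    the VM's cgroup under each controller (illustrative -- the actual
    path depends on how libvirt lays out its hierarchy).
    """
    def read_int(controller, filename):
        path = os.path.join(cgroup_root, controller,
                            vm_partition.lstrip("/"), filename)
        with open(path) as f:
            return int(f.read().strip())

    return {
        # cumulative CPU time in nanoseconds; maps onto
        # CpuUsage.usage_core_nano_seconds
        "cpu_usage_ns": read_int("cpuacct", "cpuacct.usage"),
        # bytes currently charged to the cgroup; MemoryUsage's working
        # set would additionally subtract inactive file pages
        "memory_usage_bytes": read_int("memory", "memory.usage_in_bytes"),
    }
```

In practice the partition path would have to be discovered from
libvirt, and filesystem usage would come from the number and sizes of
the VM's volumes rather than from cgroups.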
But basically, we can just tap directly into the libvirt cgroup
hierarchy to get CPU and memory usage, and for filesystem usage just
look at the number and size of the volumes in use.

## Using a dedicated QEMU/KVM container per VM with docker cgroups

This approach is the same as the previous one, except that we also use
Docker cgroups for the VM containers. In this case, the libvirt
cgroups just don't do anything, and we use mechanisms closely
resembling those used in the standard kubelet dockershim for resource
limits and monitoring.

Pros:
* Provides a better level of security than a single container for all
  VMs, as escaping the VM will only lead to the compromise of a single
  container
* Resources are managed in the standard Kubernetes way; we just mimic
  kubelet's dockershim

Cons:
* Harder to implement than a single container for all VMs
* Need to access the Docker socket
* Need to redo resource limits in Virtlet

## Additional security measures

The following additional security measures can be taken no matter
which approach we choose:
* Use a separate container for libvirt. This entails changing how we
  prepare the tap fd in `vmwrapper`, because mounting the network
  namespace directory can be problematic in some cases (e.g. because
  one of `/run` or `/var/run` is a symbolic link). Basically, the tap
  fd must be prepared on the Virtlet side and then sent over a Unix
  domain socket to the `vmwrapper` process (the socket may reside on an
  `emptyDir` volume). With the current version of Go this is somewhat
  complicated because of a problem with switching namespaces inside a
  Go process, so this will mean starting a subprocess that will prepare
  and send the file descriptor.
* Use a Unix domain socket for libvirt.
  Currently, we use a TCP socket and listen for connections on
  `localhost`, thus providing additional possibilities for an attack.
* Add AppArmor/SELinux support out of the box

## Recommendations

The recommendation is to begin with the first approach as the easiest
one, moving to the container-per-VM approach with libvirt cgroups when
we can. Once we have a container per VM, we can decide whether moving
to Docker cgroups will really help us.

The *additional security measures* need to be implemented, too. These
changes may be done in any sequence.

## Appendix

CRI data structures relevant to resource usage statistics:

```protobuf
// ContainerAttributes provides basic information of the container.
message ContainerAttributes {
  // ID of the container.
  string id = 1;
  // Metadata of the container.
  ContainerMetadata metadata = 2;
  // Key-value pairs that may be used to scope and select individual resources.
  map<string,string> labels = 3;
  // Unstructured key-value map holding arbitrary metadata.
  // Annotations MUST NOT be altered by the runtime; the value of this field
  // MUST be identical to that of the corresponding ContainerConfig used to
  // instantiate the Container this status represents.
  map<string,string> annotations = 4;
}

// ContainerStats provides the resource usage statistics for a container.
message ContainerStats {
  // Information of the container.
  ContainerAttributes attributes = 1;
  // CPU usage gathered from the container.
  CpuUsage cpu = 2;
  // Memory usage gathered from the container.
  MemoryUsage memory = 3;
  // Usage of the writeable layer.
  FilesystemUsage writable_layer = 4;
}

// CpuUsage provides the CPU usage information.
message CpuUsage {
  // Timestamp in nanoseconds at which the information were collected. Must be > 0.
  int64 timestamp = 1;
  // Cumulative CPU usage (sum across all cores) since object creation.
  UInt64Value usage_core_nano_seconds = 2;
}

// MemoryUsage provides the memory usage information.
message MemoryUsage {
  // Timestamp in nanoseconds at which the information were collected. Must be > 0.
  int64 timestamp = 1;
  // The amount of working set memory in bytes.
  UInt64Value working_set_bytes = 2;
}

// FilesystemUsage provides the filesystem usage information.
message FilesystemUsage {
  // Timestamp in nanoseconds at which the information were collected. Must be > 0.
  int64 timestamp = 1;
  // The underlying storage of the filesystem.
  StorageIdentifier storage_id = 2;
  // UsedBytes represents the bytes used for images on the filesystem.
  // This may differ from the total bytes used on the filesystem and may not
  // equal CapacityBytes - AvailableBytes.
  UInt64Value used_bytes = 3;
  // InodesUsed represents the inodes used by the images.
  // This may not equal InodesCapacity - InodesAvailable because the underlying
  // filesystem may also be used for purposes other than storing images.
  UInt64Value inodes_used = 4;
}
```

## References

The initial research of Virtlet security was done by [Adam Heczko](mailto:aheczko@mirantis.com).