# Improving VM isolation

The current `qemu.conf` used by Virtlet's libvirt deployment looks
like this:
```
stdio_handler = "file"
user = "root"
group = "root"
# we need to do network management stuff in vmwrapper
clear_emulator_capabilities = 0

cgroup_device_acl = [
        "/dev/null", "/dev/full", "/dev/zero",
        "/dev/random", "/dev/urandom",
        "/dev/ptmx", "/dev/kvm", "/dev/kqemu",
        "/dev/rtc", "/dev/hpet", "/dev/net/tun",
    ]
```

It has some apparent problems: VMs run as root, capabilities are not
dropped, and the list of devices available to processes in the VM's
cgroup is extended. On top of that, if an attacker breaks out of a
VM, they will be able to take complete control of the Kubernetes node
where the Virtlet pod runs, because Virtlet and libvirt run in a
single privileged container.
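
For reference, a tightened configuration might look roughly like this
(a sketch only; the exact user/group and device list depend on the
isolation approach chosen, and `clear_emulator_capabilities = 1` is
only possible once vmwrapper no longer needs extra capabilities for
network setup):
```
stdio_handler = "file"
# run emulator processes as an unprivileged user
user = "qemu"
group = "qemu"
# let libvirt drop capabilities before exec'ing the emulator
clear_emulator_capabilities = 1

# only the devices the emulator actually needs
cgroup_device_acl = [
        "/dev/null", "/dev/full", "/dev/zero",
        "/dev/random", "/dev/urandom",
        "/dev/ptmx", "/dev/kvm", "/dev/net/tun",
    ]
```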

There are three approaches to fixing this, described in the following
sections, followed by a section with recommendations common to all of
them.

## Using a single separate container for VMs

One possibility to reduce the attack surface is adding an extra
non-privileged container for all the VMs. This way an attacker who
escaped a VM will only be able to compromise other VMs on the same
node, but not the node itself.

When using this approach, we'll still have to use libvirt's cgroup
management mechanism, relying on the libvirt API for VM resource
management. Moreover, we'll have to use libvirt's own cgroup
hierarchy. In theory, it should be possible to tell libvirt to make
VM cgroups children of some Docker container cgroup by following
[the example](https://libvirt.org/cgroups.html#customPartiton) given in the libvirt documentation,
e.g.:
```xml
<resource>
  <partition>/docker/158056e84da21859909b054fa8143f505579f6743715cc289e2231445fb4e939/vm</partition>
</resource>
```
so that, e.g., for the `cpu` controller libvirt would use the path
`/sys/fs/cgroup/cpu/docker/158056e84da21859909b054fa8143f505579f6743715cc289e2231445fb4e939/vm.partition`.

Unfortunately this doesn't work because libvirt adds the `.partition`
suffix not just to the last item in the path but to every part of it,
so it tries to use
`/sys/fs/cgroup/cpu/docker.partition/158056e84da21859909b054fa8143f505579f6743715cc289e2231445fb4e939.partition/vm.partition`,
and apparently this behavior can't be overridden. Thus, we'll have to
use the default libvirt cgroup hierarchy, and any resource limits
imposed on the VM container itself will not affect the VMs.
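
The path-mangling behavior described above can be illustrated with a
small sketch (a hypothetical helper, not libvirt code; libvirt also
escapes some characters in path components, which is ignored here):

```python
def libvirt_partition_to_cgroup(controller: str, partition: str) -> str:
    """Mimic how libvirt maps a <partition> path to a cgroup fs path:
    every path component gets the ".partition" suffix, not just the
    last one."""
    parts = [p for p in partition.split("/") if p]
    mangled = "/".join(p + ".partition" for p in parts)
    return "/sys/fs/cgroup/%s/%s" % (controller, mangled)
```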

The plan for this approach is to make a separate container for VMs in
the Virtlet DaemonSet definition and then make `vmwrapper` perform
extra steps instead of just executing the actual emulator:

1. Find a process running inside the VM container
2. Find the required emulator binary and open it for reading
3. Use an
   [nsenter-like](https://github.com/docker/libcontainer/tree/6aeb7e1fa51f04f1253f79fc86da4b608fcb3b59/nsenter)
   mechanism to enter the namespaces of the process that runs in the
   VM container, including the mount namespace
4. Use the `devices` cgroup controller to disable access to the
   devices that are not needed for the VM (although probably needed
   for `vmwrapper`'s network setup mechanism)
5. Drop the capabilities of the current process (another option:
   switch to a non-root user)
6. Execute the emulator binary using `/proc/self/fd/NNN`, where `NNN`
   is the file descriptor from step 2 (that's what
   [fexecve()](https://github.com/lattera/glibc/blob/a2f34833b1042d5d8eeb263b4cf4caaea138c4ad/sysdeps/unix/sysv/linux/fexecve.c#L28)
   does on Linux). This way, we avoid the need for the emulator
   binary to be available in the VM container.
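
Steps 2 and 6 can be sketched as follows (an illustration of the
fexecve()-style trick in Python rather than Virtlet's actual Go code;
the helper name is hypothetical):

```python
import os

def exec_via_fd(binary_path, argv):
    """Open a binary for reading, then exec it through /proc/self/fd/NNN
    (the same trick glibc's fexecve() uses on Linux)."""
    fd = os.open(binary_path, os.O_RDONLY)
    # execve() resolves the /proc/self/fd/NNN path before close-on-exec
    # descriptors are closed, so O_CLOEXEC on fd is not a problem here.
    os.execv("/proc/self/fd/%d" % fd, argv)
```

In vmwrapper the fd would be opened before entering the VM container's
mount namespace, so the binary never has to exist inside it.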

A simple non-Virtlet image like `busybox` should be used for the VM
container (the one that will run the emulator). The VM container must
be able to access the paths used by the emulator, such as volumes and
the monitor socket.

Pros:
* Relatively easy to implement
* Provides a reasonable level of security

Cons:
* Escaping a single VM means controlling all the VMs on the node
* Need to implement libvirt-specific resource monitoring functionality
  (the `ContainerStats()`, `ListContainerStats()` and `ImageFsInfo()`
  CRI calls)

## Using dedicated QEMU/KVM container per VM with libvirt cgroups

This approach basically repeats the previous one, with one important
difference: instead of starting a single VM container as part of the
Virtlet pod, we use the Docker API to run a new container for each
VM. The mechanics of vmwrapper remain the same, differing only in how
the PID of a process inside the target container is obtained.
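
Extracting that PID could be sketched like this (the helper is
hypothetical; Docker reports a running container's init PID in the
`State.Pid` field of `docker inspect` output):

```python
import json

def pid_from_inspect(inspect_json: str) -> int:
    """Extract the init PID of a running container from the output of
    `docker inspect <container>` (a JSON array with one object)."""
    info = json.loads(inspect_json)[0]
    pid = info["State"]["Pid"]
    if pid == 0:
        raise RuntimeError("container is not running")
    return pid
```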

Pros:
* Provides a better level of security than a single container for all
  VMs, as escaping a VM only leads to the compromise of a single
  container

Cons:
* Harder to implement than a single container for all VMs
* Need to access the docker socket
* Need to implement libvirt-specific resource monitoring functionality
  (the `ContainerStats()`, `ListContainerStats()` and `ImageFsInfo()`
  CRI calls)

For more info on implementing resource monitoring, see
[the corresponding section](http://libvirt.org/apps.html#monitoring)
in the libvirt documentation. Basically, we can tap directly into the
libvirt cgroup hierarchy to get CPU and memory usage, and for
filesystem usage just look at the number and size of the volumes in
use.
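
Such direct cgroup reads can be sketched as follows (assuming the
standard cgroup v1 layout; the mount point, the VM cgroup path and
the helper name are illustrative, not Virtlet code):

```python
import os

def read_vm_usage(cgroup_root: str, vm_path: str) -> dict:
    """Read cumulative CPU usage (ns) and current memory usage (bytes)
    for a VM from its cgroup v1 controller directories."""
    def read_int(controller, filename):
        path = os.path.join(cgroup_root, controller, vm_path, filename)
        with open(path) as f:
            return int(f.read().strip())
    return {
        "cpu_usage_ns": read_int("cpuacct", "cpuacct.usage"),
        "memory_bytes": read_int("memory", "memory.usage_in_bytes"),
    }
```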

## Using dedicated QEMU/KVM container per VM with docker cgroups

This approach is the same as the previous one, except that we also
use Docker cgroups for the VM containers. In this case, libvirt
cgroups are effectively unused, and we use mechanisms closely
resembling those of the standard kubelet dockershim for resource
limits and monitoring.

Pros:
* Provides a better level of security than a single container for all
  VMs, as escaping a VM only leads to the compromise of a single
  container
* Resources are managed in the standard Kubernetes way; we just mimic
  kubelet's dockershim

Cons:
* Harder to implement than a single container for all VMs
* Need to access the docker socket
* Need to redo resource limits in Virtlet

## Additional security measures

The following additional security measures can be taken no matter
which approach we choose:
* Use a separate container for libvirt. This entails changing how we
  prepare the tap fd in vmwrapper, because mounting the network
  namespace directory can be problematic in some cases (e.g. because
  of one of `/run` or `/var/run` being a symbolic link). Basically,
  the tap fd must be prepared on the Virtlet side and then sent over
  a Unix domain socket to the vmwrapper process (the socket may
  reside on an `emptyDir` volume). With the current version of Go
  this is somewhat complicated because of the problems with switching
  namespaces inside a Go process, so this will mean starting a
  subprocess that prepares and sends the file descriptor.
* Use a Unix domain socket for libvirt. Currently we use a TCP socket
  and listen for connections on `localhost`, thus providing
  additional possibilities for an attack.
* Add AppArmor/SELinux support out of the box
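
The tap fd passing mentioned in the first item relies on SCM_RIGHTS
ancillary data over a Unix domain socket. Sketched in Python 3.9+
(Virtlet itself is written in Go, so this only illustrates the
mechanism):

```python
import socket

def send_fd(sock: socket.socket, fd: int) -> None:
    # Ship a file descriptor to the peer as SCM_RIGHTS ancillary data;
    # the one-byte payload is just a marker.
    socket.send_fds(sock, [b"tap"], [fd])

def recv_fd(sock: socket.socket) -> int:
    # The kernel installs a fresh descriptor for the same open file
    # description in the receiving process.
    msg, fds, flags, addr = socket.recv_fds(sock, 16, 1)
    return fds[0]
```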

## Recommendations

The recommendation is to begin with the first approach as the easiest
one, moving to the container-per-VM approach with libvirt cgroups
later when we can. After we have a container per VM, we need to
decide whether moving to Docker cgroups will really help us.

The *additional security measures* need to be implemented, too. These
changes may be done in any sequence.

## Appendix

CRI data structures relevant to resource usage statistics:

```protobuf
// ContainerAttributes provides basic information of the container.
message ContainerAttributes {
    // ID of the container.
    string id = 1;
    // Metadata of the container.
    ContainerMetadata metadata = 2;
    // Key-value pairs that may be used to scope and select individual resources.
    map<string,string> labels = 3;
    // Unstructured key-value map holding arbitrary metadata.
    // Annotations MUST NOT be altered by the runtime; the value of this field
    // MUST be identical to that of the corresponding ContainerConfig used to
    // instantiate the Container this status represents.
    map<string,string> annotations = 4;
}

// ContainerStats provides the resource usage statistics for a container.
message ContainerStats {
    // Information of the container.
    ContainerAttributes attributes = 1;
    // CPU usage gathered from the container.
    CpuUsage cpu = 2;
    // Memory usage gathered from the container.
    MemoryUsage memory = 3;
    // Usage of the writeable layer.
    FilesystemUsage writable_layer = 4;
}

// CpuUsage provides the CPU usage information.
message CpuUsage {
    // Timestamp in nanoseconds at which the information were collected. Must be > 0.
    int64 timestamp = 1;
    // Cumulative CPU usage (sum across all cores) since object creation.
    UInt64Value usage_core_nano_seconds = 2;
}

// MemoryUsage provides the memory usage information.
message MemoryUsage {
    // Timestamp in nanoseconds at which the information were collected. Must be > 0.
    int64 timestamp = 1;
    // The amount of working set memory in bytes.
    UInt64Value working_set_bytes = 2;
}

// FilesystemUsage provides the filesystem usage information.
message FilesystemUsage {
    // Timestamp in nanoseconds at which the information were collected. Must be > 0.
    int64 timestamp = 1;
    // The underlying storage of the filesystem.
    StorageIdentifier storage_id = 2;
    // UsedBytes represents the bytes used for images on the filesystem.
    // This may differ from the total bytes used on the filesystem and may not
    // equal CapacityBytes - AvailableBytes.
    UInt64Value used_bytes = 3;
    // InodesUsed represents the inodes used by the images.
    // This may not equal InodesCapacity - InodesAvailable because the underlying
    // filesystem may also be used for purposes other than storing images.
    UInt64Value inodes_used = 4;
}
```

## References

The initial research of Virtlet security was done by [Adam Heczko](mailto:aheczko@mirantis.com).