Table of Contents
=================

* [What is it?](#what-is-it)
* [Background](#background)
* [Out of scope](#out-of-scope)
  * [virtcontainers and Kubernetes CRI](#virtcontainers-and-kubernetes-cri)
* [Design](#design)
  * [Sandboxes](#sandboxes)
  * [Hypervisors](#hypervisors)
  * [Agents](#agents)
  * [Shim](#shim)
  * [Proxy](#proxy)
* [API](#api)
  * [Sandbox API](#sandbox-api)
  * [Container API](#container-api)
* [Networking](#networking)
  * [CNM](#cnm)
* [Storage](#storage)
  * [How to check if container uses devicemapper block device as its rootfs](#how-to-check-if-container-uses-devicemapper-block-device-as-its-rootfs)
* [Devices](#devices)
  * [How to pass a device using VFIO-passthrough](#how-to-pass-a-device-using-vfio-passthrough)
* [Developers](#developers)
* [Persistent storage plugin support](#persistent-storage-plugin-support)
* [Experimental features](#experimental-features)

# What is it?

`virtcontainers` is a Go library that can be used to build hardware-virtualized container
runtimes.

# Background

The few existing VM-based container runtimes (Clear Containers, runV, rkt's
KVM stage 1) all share the same hardware virtualization semantics but use
different code bases to implement them. The goal of `virtcontainers` is to
factor this code into a common Go library.

Ideally, VM-based container runtime implementations would become translation
layers from the runtime specification they implement (e.g. the
[OCI runtime-spec][oci] or the [Kubernetes CRI][cri]) to the
`virtcontainers` API.

`virtcontainers` was used as a foundational package for the
[Clear Containers][cc] [runtime][cc-runtime] implementation.
[oci]: https://github.com/opencontainers/runtime-spec
[cri]: https://git.k8s.io/community/contributors/devel/sig-node/container-runtime-interface.md
[cc]: https://github.com/clearcontainers/
[cc-runtime]: https://github.com/clearcontainers/runtime/

# Out of scope

Implementing a container runtime is out of scope for this project. Any
tools or executables in this repository are provided for demonstration or
testing purposes only.

## virtcontainers and Kubernetes CRI

The `virtcontainers` API is loosely inspired by the Kubernetes [CRI][cri]
because we believe it provides the right level of abstraction for
containerized sandboxes. However, despite the API similarities between the
two projects, the goal of `virtcontainers` is _not_ to build a CRI
implementation, but instead to provide a generic, runtime-specification
agnostic, hardware-virtualized containers library that other projects can
leverage to implement CRI themselves.

# Design

## Sandboxes

The `virtcontainers` execution unit is a _sandbox_: `virtcontainers` users
start sandboxes in which containers run.

`virtcontainers` creates a sandbox by starting a virtual machine and setting
the sandbox up within that environment. Starting a sandbox means launching
all of its containers within the VM's runtime environment.

## Hypervisors

The `virtcontainers` package relies on hypervisors to start and stop the
virtual machines where sandboxes run. A hypervisor is defined by an
implementation of the `Hypervisor` interface; the default implementation is
the QEMU one.
### Update cloud-hypervisor client code

See [docs](pkg/cloud-hypervisor/README.md).

## Agents

During the lifecycle of a container, the runtime running on the host needs to
interact with the virtual machine guest OS in order to start new commands to
be executed as part of a given container workload, set up new networking
routes or interfaces, fetch a container's standard or error output, and so on.
There are many existing and potential solutions to this problem, and
`virtcontainers` abstracts it through the `Agent` interface.

## Shim

In some cases the runtime will need a translation shim between the higher
level container stack (e.g. Docker) and the virtual machine holding the
container workload. This is needed for container stacks that make strong
assumptions about the nature of the containers they are monitoring. In cases
where they assume containers are simply regular host processes, a shim layer
is needed to translate host-specific semantics into, e.g., agent-controlled
virtual machine ones.

## Proxy

When hardware-virtualized containers have limited I/O multiplexing
capabilities, runtimes may decide to rely on an external host proxy to
support cases where several runtime instances are talking to the same
container.

# API

The high level `virtcontainers` API is the following:

## Sandbox API

* `CreateSandbox(sandboxConfig SandboxConfig)` creates a Sandbox.
  The virtual machine is started and the Sandbox is prepared.

* `DeleteSandbox(sandboxID string)` deletes a Sandbox.
  The virtual machine is shut down and all information related to the Sandbox is removed.
  The function will fail if the Sandbox is running; in that case `StopSandbox()` has to be called first.

* `StartSandbox(sandboxID string)` starts an already created Sandbox.
  The Sandbox and all its containers are started.
* `RunSandbox(sandboxConfig SandboxConfig)` creates and starts a Sandbox.
  This performs `CreateSandbox()` + `StartSandbox()`.

* `StopSandbox(sandboxID string)` stops an already running Sandbox.
  The Sandbox and all its containers are stopped.

* `PauseSandbox(sandboxID string)` pauses an existing Sandbox.

* `ResumeSandbox(sandboxID string)` resumes a paused Sandbox.

* `StatusSandbox(sandboxID string)` returns a detailed Sandbox status.

* `ListSandbox()` lists all Sandboxes on the host.
  It returns a detailed status for every Sandbox.

## Container API

* `CreateContainer(sandboxID string, containerConfig ContainerConfig)` creates a Container on an existing Sandbox.

* `DeleteContainer(sandboxID, containerID string)` deletes a Container from a Sandbox.
  If the Container is running, it has to be stopped first.

* `StartContainer(sandboxID, containerID string)` starts an already created Container.
  The Sandbox has to be running.

* `StopContainer(sandboxID, containerID string)` stops an already running Container.

* `EnterContainer(sandboxID, containerID string, cmd Cmd)` enters an already running Container and runs a given command.

* `StatusContainer(sandboxID, containerID string)` returns a detailed Container status.

* `KillContainer(sandboxID, containerID string, signal syscall.Signal, all bool)` sends a signal to one or all containers inside a Sandbox.

An example tool using the `virtcontainers` API is provided in the `hack/virtc` package.

For further details, see the [API documentation](documentation/api/1.0/api.md).

# Networking

`virtcontainers` supports the two major container networking models: the [Container Network Model (CNM)][cnm] and the [Container Network Interface (CNI)][cni].

Typically, the former is the Docker default networking model while the latter is used on Kubernetes deployments.
[cnm]: https://github.com/docker/libnetwork/blob/master/docs/design.md
[cni]: https://github.com/containernetworking/cni/

## CNM

![High-level CNM Diagram](documentation/network/CNM_overall_diagram.png)

__CNM lifecycle__

1. `RequestPool`

2. `CreateNetwork`

3. `RequestAddress`

4. `CreateEndPoint`

5. `CreateContainer`

6. Create `config.json`

7. Create PID and network namespace

8. `ProcessExternalKey`

9. `JoinEndPoint`

10. `LaunchContainer`

11. Launch

12. Run container

![Detailed CNM Diagram](documentation/network/CNM_detailed_diagram.png)

__Runtime network setup with CNM__

1. Read `config.json`

2. Create the network namespace ([code](https://github.com/containers/virtcontainers/blob/0.5.0/cnm.go#L108-L120))

3. Call the prestart hook (from inside the netns) ([code](https://github.com/containers/virtcontainers/blob/0.5.0/api.go#L46-L49))

4. Scan network interfaces inside the netns and get the name of the interface created by the prestart hook ([code](https://github.com/containers/virtcontainers/blob/0.5.0/cnm.go#L70-L106))

5. Create the bridge and TAP device, and link them together with the network interface previously created ([code](https://github.com/containers/virtcontainers/blob/0.5.0/network.go#L123-L205))

6. Start the VM inside the netns and start the container ([code](https://github.com/containers/virtcontainers/blob/0.5.0/api.go#L66-L70))

__Drawbacks of CNM__

There are three drawbacks to using CNM instead of CNI:
* The way we call into it is not very explicit: we have to re-exec the `dockerd` binary so that it can accept parameters and execute the prestart hook related to network setup.
* Implicit way to designate the network namespace: instead of explicitly giving the netns to `dockerd`, we give it the PID of our runtime so that it can find the netns from this PID.
  This means we have to make sure we are in the right netns while calling the hook, otherwise the veth pair will be created in the wrong netns.
* No results are passed back from the hook: we have to scan the network interfaces to discover which one has been created inside the netns. This introduces more latency because it forces us to scan the network in the `CreateSandbox` path, which is critical for starting the VM as quickly as possible.

# Storage

Container workloads are shared with the virtualized environment through 9pfs.
The devicemapper storage driver is a special case. The driver uses dedicated
block devices rather than formatted filesystems, and operates at the block
level rather than the file level. This knowledge is used to directly use the
underlying block device for the container root filesystem instead of the
overlay filesystem; the block device maps to the top read-write layer of the
overlay. This approach gives much better I/O performance than sharing the
container filesystem through 9pfs.

The approach above does introduce a limitation in terms of dynamic file copy
in/out of the container via `docker cp` operations.
The copy operation from host to container accesses the mounted filesystem on
the host side. This is not expected to work and may lead to inconsistencies,
as the block device will be written to simultaneously from two different
mounts.
The copy operation from container to host will work, provided the user calls
`sync(1)` from within the container prior to the copy, to make sure any
outstanding cached data is written to the block device.

```
docker cp [OPTIONS] CONTAINER:SRC_PATH HOST:DEST_PATH
docker cp [OPTIONS] HOST:SRC_PATH CONTAINER:DEST_PATH
```

The ability to hotplug block devices has been added, which makes it possible
to use block devices for containers started after the VM has been launched.
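As a concrete sketch of the container-to-host copy described above (the container name and paths are illustrative, not from the source):

```shell
# Flush the container's cached writes to the block device first,
# so the copy sees consistent data ("mycontainer" is hypothetical).
docker exec mycontainer sync

# Then copy from the container to the host.
docker cp mycontainer:/var/log/app.log ./app.log
```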
## How to check if container uses devicemapper block device as its rootfs

Start a container and call `mount(8)` within it. You should see `/` mounted
on the `/dev/vda` device.

# Devices

Support has been added to pass [VFIO](https://www.kernel.org/doc/Documentation/vfio.txt)
assigned devices on the docker command line with `--device`.
Support for passing other devices, including block devices, with `--device`
has not been added yet.

## How to pass a device using VFIO-passthrough

1. Requirements

   An IOMMU group represents the smallest set of devices for which the IOMMU
   has visibility and which is isolated from other groups. VFIO uses this
   information to enforce safe ownership of devices for userspace.

   You will need Intel VT-d capable hardware. Check that IOMMU support is
   enabled in your host kernel by verifying that `CONFIG_VFIO_NOIOMMU` is not
   set in the kernel configuration. If it is set, you will need to rebuild
   your kernel.

   The following kernel configuration options need to be enabled:
   ```
   CONFIG_VFIO_IOMMU_TYPE1=m
   CONFIG_VFIO=m
   CONFIG_VFIO_PCI=m
   ```

   In addition, you need to pass `intel_iommu=on` on the kernel command line.

2. Identify the BDF (Bus-Device-Function) of the PCI device to be assigned.

   ```
   $ lspci -D | grep -e Ethernet -e Network
   0000:01:00.0 Ethernet controller: Intel Corporation Ethernet Controller 10-Gigabit X540-AT2 (rev 01)

   $ BDF=0000:01:00.0
   ```

3. Find the vendor and device ID.

   ```
   $ lspci -n -s $BDF
   01:00.0 0200: 8086:1528 (rev 01)
   ```

4. Find the IOMMU group.

   ```
   $ readlink /sys/bus/pci/devices/$BDF/iommu_group
   ../../../../kernel/iommu_groups/16
   ```

5. Unbind the device from the host driver.

   ```
   $ echo $BDF | sudo tee /sys/bus/pci/devices/$BDF/driver/unbind
   ```

6. Bind the device to `vfio-pci`.
   ```
   $ sudo modprobe vfio-pci
   $ echo 8086 1528 | sudo tee /sys/bus/pci/drivers/vfio-pci/new_id
   $ echo $BDF | sudo tee --append /sys/bus/pci/drivers/vfio-pci/bind
   ```

7. Check `/dev/vfio`.

   ```
   $ ls /dev/vfio
   16 vfio
   ```

8. Start a Clear Containers container, passing the VFIO group on the docker command line:

   ```
   docker run -it --device=/dev/vfio/16 centos/tools bash
   ```

9. Running `lspci` within the container should show the device among the
   PCI devices. The driver for the device needs to be present within the
   Clear Containers kernel. If the driver is missing, you can add it to your
   custom container kernel using the [osbuilder](https://github.com/clearcontainers/osbuilder)
   tooling.

# Developers

For information on how to build, develop and test `virtcontainers`, see the
[developer documentation](documentation/Developers.md).

# Persistent storage plugin support

See the [persistent storage plugin documentation](persist/plugin).

# Experimental features

See the [experimental features documentation](experimental).