# Security Model

[TOC]

gVisor was created in order to provide additional defense against the
exploitation of kernel bugs by untrusted userspace code. In order to understand
how gVisor achieves this goal, it is first necessary to understand the basic
threat model.

## Threats: The Anatomy of an Exploit

An exploit takes advantage of a software or hardware bug in order to escalate
privileges, gain access to privileged data, or disrupt services. All of the
possible interactions that a malicious application can have with the rest of the
system (attack vectors) define the attack surface. We categorize these attack
vectors into several common classes.

### System API

An operating system or hypervisor exposes an abstract System API in the form of
system calls and traps. This API may be documented and stable, as with Linux, or
it may be abstracted behind a library, as with Windows (i.e. win32.dll or
ntdll.dll). The System API includes all standard interfaces that application
code uses to interact with the system. This includes high-level abstractions
that are derived from low-level system calls, such as system files, sockets and
namespaces.

Although the System API is exposed to applications by design, bugs and race
conditions within the kernel or hypervisor may occasionally be exploitable via
the API. This is common in part due to the fact that most kernels and
hypervisors are written in [C][clang], which is well-suited to interfacing with
hardware but often prone to security issues. In order to exploit these issues, a
typical attack might involve some combination of the following:

1. Opening or creating some combination of files, sockets or other descriptors.
1. Passing crafted, malicious arguments, structures or packets.
1.
   Racing with multiple threads in order to hit specific code paths.

For example, for the [Dirty Cow][dirtycow] privilege escalation bug, an
application would open a specific file in `/proc` or use a specific `ptrace`
system call, and use multiple threads in order to trigger a race condition when
touching a fresh page of memory. The attacker then gains control over a page of
memory belonging to the system. With additional privileges or access to
privileged data in the kernel, an attacker will often be able to employ
additional techniques to gain full access to the rest of the system.

While bugs in the implementation of the System API are readily fixed, they are
also the most common form of exploit. The exposure created by this class of
exploit is what gVisor aims to minimize and control, described in detail below.

### System ABI

Hardware and software exploits occasionally exist in execution paths that are
not part of an intended System API. In this case, exploits may be found as part
of implicit actions the hardware or privileged system code takes in response to
certain events, such as traps or interrupts. For example, the recent
[POPSS][popss] flaw required only native code execution (no specific system call
or file access). In that case, the Xen hypervisor was similarly vulnerable,
highlighting that hypervisors are not immune to this vector.

### Side Channels

Hardware side channels may be exploitable by any code running on a system:
native, sandboxed, or virtualized. However, many host-level mitigations against
hardware side channels are still effective with a sandbox. For example, kernels
built with retpoline protect against some speculative execution attacks
(Spectre) and frame poisoning may protect against L1 terminal fault (L1TF)
attacks.
Hypervisors may introduce additional complications in this regard, as
there is no mitigation against an application in a normally functioning Virtual
Machine (VM) exploiting the L1TF vulnerability for another VM on the sibling
hyperthread.

### Other Vectors

The above categories in no way represent an exhaustive list of exploits, as we
focus only on running untrusted code from within the operating system or
hypervisor. We do not consider other ways that a more generic adversary may
interact with a system, such as inserting a portable storage device with a
malicious filesystem image, using a combination of crafted keyboard or touch
inputs, or saturating a network device with ill-formed packets.

Furthermore, high-level systems may contain exploitable components. An attacker
need not escalate privileges within a container if there’s an exploitable
network-accessible service on the host or some other API path. *A sandbox is not
a substitute for a secure architecture*.

## Goals: Limiting Exposure

![Threat model](security.png "Threat model.")

gVisor’s primary design goal is to minimize the System API attack vector through
multiple layers of defense, while still providing a process model. There are two
primary security principles that inform this design. First, the application’s
direct interactions with the host System API are intercepted by the Sentry,
which implements the System API instead. Second, the System API accessible to
the Sentry itself is minimized to a safer, restricted set. The first principle
minimizes the possibility of direct exploitation of the host System API by
applications, and the second principle minimizes indirect exploitability, that
is, exploitation of the host by way of a compromised or buggy Sentry (e.g.
chaining an exploit).

The first principle is similar to the security basis for a Virtual Machine (VM).
With a VM, an application’s interactions with the host are replaced by
interactions with a guest operating system and a set of virtualized hardware
devices. These hardware devices are then implemented via the host System API by
a Virtual Machine Monitor (VMM). The Sentry similarly prevents direct
interactions by providing its own implementation of the System API that the
application must interact with. Applications are not able to directly craft
specific arguments or flags for the host System API, or interact directly with
host primitives.

For both the Sentry and a VMM, it’s worth noting that while direct interactions
are not possible, indirect interactions are still possible. For example, a read
on a host-backed file in the Sentry may ultimately result in a host read system
call (made by the Sentry, not by passing through arguments from the
application), similar to how a read on a block device in a VM may result in the
VMM issuing a corresponding host read system call from a backing file.

An important distinction from a VM is that the Sentry implements a System API
based directly on host System API primitives instead of relying on virtualized
hardware and a guest operating system. This selects a distinct set of
trade-offs, largely in the performance, efficiency and compatibility domains.
Since transitions in and out of the sandbox are relatively expensive, a guest
operating system will typically take ownership of resources. For example, in the
above case, the guest operating system may read the block device data in a local
page cache, to avoid subsequent reads. This may lead to better performance but
lower efficiency, since memory may be wasted or duplicated. The Sentry opts
instead to defer to the host for many operations during runtime, for improved
efficiency but lower performance in some use cases.

### What can a sandbox do?

An application in a gVisor sandbox is permitted to do most things a standard
container can do: for example, applications can read and write files mapped
within the container, make network connections, etc. As described above,
gVisor's primary goal is to limit exposure to bugs and exploits while still
allowing most applications to run. Even so, gVisor will limit some operations
that might be permitted with a standard container. Even with appropriate
capabilities, a user in a gVisor sandbox will only be able to manipulate
virtualized system resources (e.g. the system time, kernel settings or
filesystem attributes) and not underlying host system resources.

While the sandbox virtualizes many operations for the application, we limit the
sandbox's own interactions with the host to the following high-level operations:

1. Communicate with a Gofer process via a connected socket. The sandbox may
   receive new file descriptors from the Gofer process, corresponding to opened
   files. These files can then be read from and written to by the sandbox.
1. Make a minimal set of host system calls. The calls do not include the
   creation of new sockets (unless host networking mode is enabled) or opening
   files. The calls include duplication and closing of file descriptors,
   synchronization, timers and signal management.
1. Read and write packets to a virtual ethernet device. This is not required if
   host networking is enabled (or networking is disabled).

### System ABI, Side Channels and Other Vectors

gVisor relies on the host operating system and the platform for defense against
hardware-based attacks. Given the nature of these vulnerabilities, there is
little defense that gVisor can provide (there’s no guarantee that additional
hardware measures, such as virtualization, memory encryption, etc. would
actually decrease the attack surface).
Note that this is true even when using
hardware virtualization for acceleration, as the host kernel or hypervisor is
ultimately responsible for defending against attacks from within malicious
guests.

gVisor similarly relies on the host resource mechanisms (cgroups) for defense
against resource exhaustion and denial of service attacks. Network policy
controls should be applied at the container level to ensure appropriate network
policy enforcement. Note that the sandbox itself is not capable of altering or
configuring these mechanisms, and the sandbox itself should make an attacker
less likely to exploit or override these controls through other means.

## Principles: Defense-in-Depth

For gVisor development, there are several engineering principles that are
employed in order to ensure that the system meets its design goals.

1. No system call is passed through directly to the host. Every supported call
   has an independent implementation in the Sentry that is unlikely to suffer
   from identical vulnerabilities that may appear in the host. This has the
   consequence that all kernel features used by applications require an
   implementation within the Sentry.
1. Only common, universal functionality is implemented. Some filesystems,
   network devices or modules may expose specialized functionality to user
   space applications via mechanisms such as extended attributes, raw sockets
   or ioctls. Since the Sentry is responsible for implementing the full system
   call surface, we do not implement or pass through these specialized APIs.
1. The host surface exposed to the Sentry is minimized. While the system call
   surface is not trivial, it is explicitly enumerated and controlled. The
   Sentry is not permitted to open new files, create new sockets or do many
   other interesting things on the host.
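
The first and third principles above can be pictured as two separate tables:
one that maps application system calls to independent implementations, and one
that enumerates the host calls available to the Sentry itself. The following is
a minimal sketch under stated assumptions, not gVisor's actual code: the names
`sentry`, `handler`, `hostAllowed` and `hostCall` are hypothetical, and the
allowlist contents are chosen only to mirror the operations described earlier.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoSys stands in for ENOSYS: the call has no implementation in the sketch.
var errNoSys = errors.New("function not implemented")

// errHostDenied reports a host call outside the enumerated allowlist.
var errHostDenied = errors.New("host call not permitted")

// hostAllowed is a hypothetical, explicitly enumerated set of host calls.
// "open" and "socket" are deliberately absent.
var hostAllowed = map[string]bool{
	"read":  true,
	"write": true,
	"dup":   true,
	"close": true,
	"futex": true,
}

// hostCall models a request from the sandbox to the host. In a real system
// the restriction would be enforced by the host rather than by a map lookup
// inside the same process; the lookup here is purely illustrative.
func hostCall(name string) error {
	if !hostAllowed[name] {
		return fmt.Errorf("%s: %w", name, errHostDenied)
	}
	return nil
}

// handler is an independent implementation of one application system call.
// It never forwards the application's arguments to the host unchanged.
type handler func(args []uint64) (uint64, error)

// sentry holds virtualized state and the application-facing call table.
type sentry struct {
	pid   uint64
	table map[string]handler
}

func newSentry() *sentry {
	s := &sentry{pid: 1}
	s.table = map[string]handler{
		// Served entirely from virtualized state: no host interaction.
		"getpid": func([]uint64) (uint64, error) { return s.pid, nil },
		// May indirectly cause a host read, issued on the sandbox's own terms.
		"read": func(args []uint64) (uint64, error) {
			if len(args) < 3 {
				return 0, errors.New("read: bad arguments")
			}
			if err := hostCall("read"); err != nil {
				return 0, err
			}
			return args[2], nil // pretend the full count was read
		},
	}
	return s
}

// syscall dispatches an application call. Anything not in the table is
// rejected instead of being passed through.
func (s *sentry) syscall(name string, args ...uint64) (uint64, error) {
	h, ok := s.table[name]
	if !ok {
		return 0, fmt.Errorf("%s: %w", name, errNoSys)
	}
	return h(args)
}

func main() {
	s := newSentry()
	pid, err := s.syscall("getpid")
	fmt.Println(pid, err)
	n, err := s.syscall("read", 3, 0, 16)
	fmt.Println(n, err)
	_, err = s.syscall("ioctl", 3, 0x1234)
	fmt.Println(err)
	fmt.Println(hostCall("open"))
}
```

Keeping the two tables separate reflects the layering described in this
document: an unimplemented application call fails inside the sandbox, and even
a buggy handler can only reach the small set of enumerated host calls.
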

Additionally, we have practical restrictions that are imposed on the project to
minimize the risk of Sentry exploitability. For example:

1. Unsafe code is carefully controlled. All unsafe code is isolated in files
   that end with "unsafe.go", in order to facilitate validation and auditing.
   No file without the unsafe suffix may import the unsafe package.
1. No CGo is allowed. The Sentry must be a pure Go binary.
1. External imports are not generally allowed within the core packages. Only
   limited external imports are used within the setup code. The code available
   inside the Sentry is carefully controlled, to ensure that the above rules
   are effective.

Finally, we recognize that security is a process, and that vigilance is
critical. Beyond our security disclosure process, the Sentry is fuzzed
continuously to identify potential bugs and races proactively, and production
crashes are recorded and triaged to similarly identify material issues.

## FAQ

### Is this more or less secure than a Virtual Machine?

The security of a VM depends to a large extent on what is exposed from the host
kernel and userspace support code. For example, device emulation code in the
host kernel (e.g. APIC) or optimizations (e.g. vhost) can be more complex than a
simple system call, and exploits carry the same risks. Similarly, the userspace
support code is frequently unsandboxed, and exploits, while rare, may allow
unfettered access to the system.

Some platforms leverage the same virtualization hardware as VMs in order to
provide better system call interception performance. However, gVisor does not
implement any device emulation, and instead opts to use a sandboxed host System
API directly. Both approaches significantly reduce the original attack surface.
Ultimately, since gVisor is capable of using the same hardware mechanism, one
should not assume that the mere use of virtualization hardware makes a system
more or less secure, just as it would be a mistake to make the claim that the
use of a unibody alone makes a car safe.

### Does this stop hardware side channels?

In general, gVisor does not provide protection against hardware side channels,
although it may make exploits that rely on direct access to the host System API
more difficult to use. To minimize exposure, you should follow relevant guidance
from vendors and keep your host kernel and firmware up-to-date.

### Is this just a ptrace sandbox?

No: the term “ptrace sandbox” generally refers to software that uses the Linux
ptrace facility to inspect and authorize system calls made by applications,
enforcing a specific policy. These commonly suffer from two issues. First,
vulnerable system calls may be authorized by the sandbox, as the application
still has direct access to some System API. Second, it’s impossible to avoid
time-of-check, time-of-use race conditions without disabling multi-threading.

In gVisor, the platforms that use ptrace operate differently. The stubs that are
traced are never allowed to continue execution into the host kernel and complete
a call directly. Instead, all system calls are interpreted and handled by the
Sentry itself, which reflects resulting register state back into the tracee
before continuing execution in userspace. This is very similar to the mechanism
used by User-Mode Linux (UML).

[dirtycow]: https://en.wikipedia.org/wiki/Dirty_COW
[clang]: https://en.wikipedia.org/wiki/C_(programming_language)
[popss]: https://nvd.nist.gov/vuln/detail/CVE-2018-8897
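
The time-of-check, time-of-use issue mentioned in the ptrace answer above can be
illustrated with a small simulation. This is only a sketch: `appMemory`,
`allowed`, `filterThenForward` and `interpret` are hypothetical names, the
`/tmp/` policy is invented for the example, and the racing application thread
is modeled as a callback so that the outcome is deterministic. It also assumes
that an interpreting handler copies its argument once and then works only on
that private copy, which this document does not spell out.

```go
package main

import (
	"fmt"
	"strings"
)

// appMemory models memory shared by all threads of an untrusted application.
type appMemory struct{ path []byte }

// allowed is a hypothetical policy: only paths under /tmp/ may be opened.
func allowed(path string) bool {
	return strings.HasPrefix(path, "/tmp/")
}

// filterThenForward models a check-and-forward sandbox. It checks the argument
// in application memory, then lets the host kernel read that memory again.
// racer runs between the two reads, standing in for a second thread.
func filterThenForward(mem *appMemory, racer func()) (string, bool) {
	if !allowed(string(mem.path)) { // time of check
		return "", false
	}
	racer()
	return string(mem.path), true // time of use: memory is read a second time
}

// interpret models a handler that copies the argument in once, then checks
// and acts on that same private copy.
func interpret(mem *appMemory, racer func()) (string, bool) {
	path := string(mem.path) // single copy-in
	if !allowed(path) {
		return "", false
	}
	racer()
	return path, true
}

func main() {
	mem := &appMemory{path: []byte("/tmp/ok")}
	swap := func() { mem.path = []byte("/etc/shadow") }

	got, _ := filterThenForward(mem, swap)
	fmt.Println("check-and-forward opened:", got) // /etc/shadow

	mem.path = []byte("/tmp/ok")
	got, _ = interpret(mem, swap)
	fmt.Println("interpreting handler opened:", got) // /tmp/ok
}
```

In the first function the value that was checked and the value that was used
can differ, which is the race a policy-only filter cannot close while other
threads keep running. In the second, the application can still change its own
memory, but the change no longer affects the call that is already in flight.
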