# gVisor Security Basics - Part 1

This blog is a space for engineers and community members to share perspectives
and deep dives on technology and design within the gVisor project. Though our
logo suggests we're in the business of space exploration (or perhaps fighting
sea monsters), we're actually in the business of sandboxing Linux containers.
When we created gVisor, we had three specific goals in mind: _container-native
security_, _resource efficiency_, and _platform portability_. To put it simply,
gVisor provides _efficient defense-in-depth for containers anywhere_.

This post addresses gVisor's _container-native security_, specifically how
gVisor provides strong isolation between an application and the host OS. Future
posts will address _resource efficiency_ (how gVisor preserves container
benefits like fast starts, smaller snapshots, and less memory overhead than VMs)
and _platform portability_ (run gVisor wherever Linux OCI containers run).
Delivering on each of these goals requires careful security considerations and a
robust design.

## What does "sandbox" mean?

gVisor allows the execution of untrusted containers, preventing them from
adversely affecting the host. This means that the untrusted container is
prevented from attacking or spying on either the host kernel or any other peer
userspace processes on the host.

For example, if you are a cloud container hosting service, running containers
from different customers on the same virtual machine means that a compromise of
one container can expose another customer's data. Properly configured, gVisor
can provide sufficient isolation to allow different customers to run containers
on the same host. There are many aspects to the proper configuration, including
limiting file and network access, which we will discuss in future posts.

## The cost of compromise

gVisor was designed around the premise that any security boundary could
potentially be compromised with enough time and resources. We tried to optimize
for a solution that was as costly and time-consuming for an attacker as
possible, at every layer.

Consequently, gVisor was built through a combination of intentional design
principles and specific technology choices that work together to provide the
security isolation needed for running hostile containers on a host. We'll dig
into it in the next section!

# Design Principles

gVisor was designed with some common
[secure design](https://en.wikipedia.org/wiki/Secure_by_design) principles in
mind: Defense-in-Depth, Principle of Least-Privilege, Attack Surface Reduction
and Secure-by-Default[^1].

In general, Design Principles outline good engineering practices, but in the
case of security, they also can be thought of as a set of tactics. In a
real-life castle, there is no single defensive feature. Rather, there are many
in combination: redundant walls, scattered drawbridges, small bottleneck
entrances, moats, etc.

A simplified version of the design is below
([more detailed version](/docs/))[^2]:

![Figure 1](/assets/images/2019-11-18-security-basics-figure1.png "Simplified design of gVisor.")

In order to discuss design principles, the following components are important to
know:

*   runsc - binary that packages the Sentry, platform, and Gofer(s) that run
    containers. runsc is the drop-in binary for running gVisor in Docker and
    Kubernetes.
*   Untrusted Application - container running in the sandbox. The terms
    "untrusted application" and "container" are used interchangeably in this
    article.
*   Platform Syscall Switcher - intercepts syscalls from the application and
    passes them to the Sentry with no further handling.
*   Sentry - The "application kernel" in userspace that serves the untrusted
    application. Each application instance has its own Sentry. The Sentry
    handles syscalls, routes I/O to Gofers, and manages memory and CPU, all in
    userspace. The Sentry is allowed to make limited, filtered syscalls to the
    host OS.
*   Gofer - a process that specifically handles different types of I/O for the
    Sentry (usually disk I/O). Gofers are also allowed to make filtered syscalls
    to the Host OS.
*   Host OS - the actual OS on which gVisor containers are running, always some
    flavor of Linux (sorry, Windows/macOS users).

It is important to emphasize what is being protected from the untrusted
application in this diagram: the host OS and other userspace applications.

In this post, we are only discussing security-related features of gVisor, and
you might ask, "What about performance, compatibility and stability?" We will
cover these considerations in future posts.

## Defense-in-Depth

For gVisor, Defense-in-Depth means each component of the software stack trusts
the other components as little as possible.

It may seem strange that we would want our own software components to distrust
each other. But by limiting the trust between small, discrete components, each
component is forced to defend itself against potentially malicious input. And
when you stack these components on top of each other, you can ensure that
multiple security barriers must be overcome by an attacker.

And this leads us to how Defense-in-Depth is applied to gVisor: no single
vulnerability should compromise the host.

In the "Attacker's Advantage / Defender's Dilemma," the defender must succeed
all the time while the attacker only needs to succeed once.
Defense-in-Depth inverts this principle: once the attacker successfully
compromises any given software component, they are immediately faced with
needing to compromise a subsequent, distinct layer in order to move laterally or
acquire more privilege.

For example, the untrusted container is isolated from the Sentry. The Sentry is
isolated from host I/O operations by serving those requests in separate
processes called Gofers. And both the untrusted container and its associated
Gofers are isolated from the host process that is running the sandbox.

An additional benefit is that this generally leads to more robust and stable
software, forcing interfaces to be strictly defined and tested to ensure all
inputs are properly parsed and bounds checked.

## Least-Privilege

The principle of Least-Privilege implies that each software component has only
the permissions it needs to function, and no more.

Least-Privilege is applied throughout gVisor. Each component and, more
importantly, each interface between the components is designed so that only the
minimum level of permission is required for it to perform its function.
Specifically, the closer you are to the untrusted application, the less
privilege you have.

![Figure 2](/assets/images/2019-11-18-security-basics-figure2.png "runsc components and their privileges.")

This is evident in how runsc (the drop-in gVisor binary for Docker/Kubernetes)
constructs the sandbox. The Sentry has the least privilege possible (it can't
even open a file!). Gofers are only allowed file access, so even if one were
compromised, the host network would be unavailable. Only the runsc binary itself
has full access to the host OS, and even runsc's access to the host OS is often
limited through capabilities / chroot / namespacing.

Designing a system with Defense-in-Depth and Least-Privilege in mind encourages
small, separate, single-purpose components, each with very restricted
privileges.

## Attack Surface Reduction

There are no bugs in unwritten code. In other words, gVisor supports a feature
if and only if it is needed to run host Linux containers.

### Host Application/Sentry Interface:

There are a lot of things gVisor does not need to do. For example, it does not
need to support arbitrary device drivers, nor does it need to support video
playback. By not implementing what will not be used, we avoid introducing
potential bugs in our code.

That is not to say gVisor has limited functionality! Quite the opposite: we
analyzed what is actually needed to run Linux containers and today the Sentry
supports 237 syscalls[^3]<sup>,</sup>[^4], along with the range of critical
/proc and /dev files. However, gVisor does not support every syscall in the
Linux kernel. There are about 350 syscalls[^5] within the 5.3.11 version of the
Linux kernel, many of which do not apply to Linux containers that typically host
cloud-like workloads. For example, we don't support old versions of epoll
(epoll_ctl_old, epoll_wait_old), because they are deprecated in Linux and no
supported workloads use them.

Furthermore, any exploited vulnerabilities in the implemented syscalls (or
Sentry code in general) only apply to gaining control of the Sentry. More on
this in a later post.

### Sentry/Host OS Interface:

The Sentry's interactions with the Host OS are restricted in many ways. For
instance, no syscall is "passed-through" from the untrusted application to the
host OS. All syscalls are intercepted and interpreted. In the case where the
Sentry needs to call the Host OS, we severely limit the syscalls that the Sentry
itself is allowed to make to the host kernel[^6].

For example, there are many file-system based attacks, where manipulation of
files or their paths can lead to compromise of the host[^7]. As a result, the
Sentry does not allow any syscall that creates or opens a file descriptor. All
file descriptors must be donated to the sandbox. By disallowing open or creation
of file descriptors, we eliminate entire categories of these file-based attacks.

This does not affect functionality though. For example, during startup, runsc
will donate FDs to the Sentry that allow for mapping STDIN/STDOUT/STDERR to the
sandboxed application. Also the Gofer may donate an FD to the Sentry, allowing
for direct access to some files. And most files will be remotely accessed
through the Gofers, in which case no FDs are donated to the Sentry.

The Sentry itself is only allowed access to specific
[whitelisted syscalls](https://github.com/google/gvisor/blob/master/runsc/config/config.go).
Without networking, the Sentry needs 53 host syscalls in order to function, and
with networking, it uses an additional 15[^8]. By limiting the whitelist to only
these needed syscalls, we radically reduce the amount of host OS attack surface.
If any attempt is made to call something outside the whitelist, the call is
immediately blocked and the sandbox is killed by the Host OS.

### Sentry/Gofer Interface:

The Sentry communicates with the Gofer through a local unix domain socket (UDS)
via a version of the 9P protocol[^9]. The UDS file descriptor is passed to the
sandbox during initialization and all communication between the Sentry and Gofer
happens via 9P. We will go more into how Gofers work in future posts.

### End Result

So, of the 350 syscalls in the Linux kernel, the Sentry needs to implement only
237 of them to support containers. At most, the Sentry only needs to call 68 of
the host Linux syscalls.
In other words, with gVisor, applications get the vast majority (and growing)
of the functionality of Linux containers for only 68 possible syscalls to the
Host OS. 350 syscalls to 68 is attack surface reduction.

![Figure 3](/assets/images/2019-11-18-security-basics-figure3.png "Reduction of Attack Surface of the Syscall Table. Note that the Sentry's Syscall Emulation Layer keeps the Containerized Process from ever calling the Host OS.")

## Secure-by-default

The default choice for a user should be safe. If users need to run a less secure
configuration of the sandbox for the sake of performance or application
compatibility, they must make the choice explicitly.

An example of this might be a networking application that is performance
sensitive. Instead of using the safer, Go-based Netstack in the Sentry, the
untrusted container can instead use the host Linux networking stack directly.
However, this means the untrusted container will be directly interacting with
the host, without the safety benefits of the sandbox. It also means that an
attack could directly compromise the host through this path.

These less secure configurations are **not** the default. In fact, the user must
take action to change the configuration and run in a less secure mode.
Additionally, these actions make it very obvious that a less secure
configuration is being used.

This can be as simple as forcing a default runtime flag option to the secure
option. gVisor does this by always using its internal netstack by default.
However, for certain performance sensitive applications, we allow the usage of
the host OS networking stack, but it requires the user to actively set a
flag[^10].

# Technology Choices

Technology choices for gVisor mainly involve things that will give us a security
boundary.

At a higher level, boundaries in software can describe a great many things:
boundaries between threads, boundaries between processes, boundaries between CPU
privilege levels, and more.

Security boundaries are interfaces that are designed and built so that entire
classes of bugs/vulnerabilities are eliminated.

For example, the Sentry and Gofers are implemented using Go. Go was chosen for a
number of the features it provided. Go is a fast, statically-typed, compiled
language that has efficient multi-threading support, garbage collection and a
constrained set of "unsafe" operations.

Using these features enabled safe array and pointer handling. This means entire
classes of vulnerabilities were eliminated, such as buffer overflows and
use-after-free.

Another example is our use of very strict syscall switching to ensure that the
Sentry is always the first software component that parses and interprets the
calls being made by the untrusted container. Here is an instance where different
platforms use different solutions, but all of them share this common trait,
whether it is through the use of ptrace "a la PTRACE_ATTACH"[^11] or kvm's
ring0[^12].

Finally, one of the most restrictive choices was to use seccomp to restrict the
Sentry from being able to open or create a file descriptor on the host. All file
I/O is required to go through Gofers. Preventing the opening or creation of file
descriptors eliminates whole categories of bugs around file permissions
[like this one](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-4557)[^13].

# To be continued - Part 2

In part 2 of this blog post, we will explore gVisor from an attacker's point of
view. We will use it as an opportunity to examine the specific strengths and
weaknesses of each gVisor component.

We will also use it to introduce Google's Vulnerability Reward Program[^14], and
other ways the community can contribute to help make gVisor safe, fast and
stable.
<br>
<br>

--------------------------------------------------------------------------------

[^1]: [https://en.wikipedia.org/wiki/Secure_by_design](https://en.wikipedia.org/wiki/Secure_by_design)
[^2]: [https://gvisor.dev/docs/architecture_guide](https://gvisor.dev/docs/architecture_guide/)
[^3]: [https://github.com/google/gvisor/blob/master/pkg/sentry/syscalls/linux/linux64_amd64.go](https://github.com/google/gvisor/blob/master/pkg/sentry/syscalls/syscalls.go)

<!-- mdformat off(mdformat formats this into multiple lines) -->
[^4]: Internally that is, it doesn't call to the Host OS to implement them, in fact that is explicitly disallowed, more on that in the future.
<!-- mdformat on -->

[^5]: [https://elixir.bootlin.com/linux/latest/source/arch/x86/entry/syscalls/syscall_64.tbl#L345](https://elixir.bootlin.com/linux/latest/source/arch/x86/entry/syscalls/syscall_64.tbl#L345)
[^6]: [https://github.com/google/gvisor/tree/master/runsc/boot/filter](https://github.com/google/gvisor/tree/master/runsc/boot/filter)
[^7]: [https://en.wikipedia.org/wiki/Dirty_COW](https://en.wikipedia.org/wiki/Dirty_COW)
[^8]: [https://github.com/google/gvisor/blob/master/runsc/boot/config.go](https://github.com/google/gvisor/blob/master/runsc/boot/config.go)

<!-- mdformat off(mdformat breaks this url by escaping the parenthesis) -->
[^9]: [https://en.wikipedia.org/wiki/9P_(protocol)](https://en.wikipedia.org/wiki/9P_(protocol))
<!-- mdformat on -->

[^10]: [https://gvisor.dev/docs/user_guide/networking/#network-passthrough](https://gvisor.dev/docs/user_guide/networking/#network-passthrough)
[^11]: [https://github.com/google/gvisor/blob/c7e901f47a09eaac56bd4813227edff016fa6bff/pkg/sentry/platform/ptrace/subprocess.go#L390](https://github.com/google/gvisor/blob/c7e901f47a09eaac56bd4813227edff016fa6bff/pkg/sentry/platform/ptrace/subprocess.go#L390)
[^12]: [https://github.com/google/gvisor/blob/c7e901f47a09eaac56bd4813227edff016fa6bff/pkg/sentry/platform/ring0/kernel_amd64.go#L182](https://github.com/google/gvisor/blob/c7e901f47a09eaac56bd4813227edff016fa6bff/pkg/sentry/platform/ring0/kernel_amd64.go#L182)
[^13]: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-4557](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-4557)
[^14]: [https://www.google.com/about/appsecurity/reward-program/index.html](https://www.google.com/about/appsecurity/reward-program/index.html)