# gVisor Security Basics - Part 1

This blog is a space for engineers and community members to share perspectives
and deep dives on technology and design within the gVisor project. Though our
logo suggests we're in the business of space exploration (or perhaps fighting
sea monsters), we're actually in the business of sandboxing Linux containers.
When we created gVisor, we had three specific goals in mind: _container-native
security_, _resource efficiency_, and _platform portability_. To put it simply,
gVisor provides _efficient defense-in-depth for containers anywhere_.

This post addresses gVisor's _container-native security_, specifically how
gVisor provides strong isolation between an application and the host OS. Future
posts will address _resource efficiency_ (how gVisor preserves container
benefits like fast starts, smaller snapshots, and less memory overhead than VMs)
and _platform portability_ (run gVisor wherever Linux OCI containers run).
Delivering on each of these goals requires careful security considerations and a
robust design.

## What does "sandbox" mean?

gVisor allows the execution of untrusted containers, preventing them from
adversely affecting the host. This means that the untrusted container is
prevented from attacking or spying on either the host kernel or any other peer
userspace processes on the host.

For example, if you are a cloud container hosting service, running containers
from different customers on the same virtual machine means that a compromise of
one container can expose other customers' data. Properly configured, gVisor can
provide sufficient isolation to allow different customers to run containers on
the same host. There are many aspects to the proper configuration, including
limiting file and network access, which we will discuss in future posts.

## The cost of compromise

gVisor was designed around the premise that any security boundary could
potentially be compromised with enough time and resources. We tried to optimize
for a solution that was as costly and time-consuming for an attacker as
possible, at every layer.

Consequently, gVisor was built through a combination of intentional design
principles and specific technology choices that work together to provide the
security isolation needed for running hostile containers on a host. We'll dig
into it in the next section!

# Design Principles

gVisor was designed with some common
[secure design](https://en.wikipedia.org/wiki/Secure_by_design) principles in
mind: Defense-in-Depth, Principle of Least-Privilege, Attack Surface Reduction
and Secure-by-Default[^1].

In general, Design Principles outline good engineering practices, but in the
case of security, they also can be thought of as a set of tactics. In a
real-life castle, there is no single defensive feature. Rather, there are many
in combination: redundant walls, scattered drawbridges, small bottleneck
entrances, moats, etc.

A simplified version of the design is below
([more detailed version](/docs/))[^2]:

![Figure 1](/assets/images/2019-11-18-security-basics-figure1.png "Simplified design of gVisor.")

In order to discuss design principles, the following components are important to
know:

* runsc - binary that packages the Sentry, platform, and Gofer(s) that run
  containers. runsc is the drop-in binary for running gVisor in Docker and
  Kubernetes.
* Untrusted Application - container running in the sandbox. "Untrusted
  application" and "container" are used interchangeably in this article.
* Platform Syscall Switcher - intercepts syscalls from the application and
  passes them to the Sentry with no further handling.
* Sentry - the "application kernel" in userspace that serves the untrusted
  application. Each application instance has its own Sentry. The Sentry
  handles syscalls, routes I/O to Gofers, and manages memory and CPU, all in
  userspace. The Sentry is allowed to make limited, filtered syscalls to the
  host OS.
* Gofer - a process that specifically handles different types of I/O for the
  Sentry (usually disk I/O). Gofers are also allowed to make filtered syscalls
  to the host OS.
* Host OS - the actual OS on which gVisor containers are running, always some
  flavor of Linux (sorry, Windows/macOS users).

It is important to emphasize what is being protected from the untrusted
application in this diagram: the host OS and other userspace applications.

In this post, we are only discussing security-related features of gVisor, and
you might ask, "What about performance, compatibility and stability?" We will
cover these considerations in future posts.

## Defense-in-Depth

For gVisor, Defense-in-Depth means each component of the software stack trusts
the other components as little as possible.

It may seem strange that we would want our own software components to distrust
each other. But by limiting the trust between small, discrete components, each
component is forced to defend itself against potentially malicious input. And
when you stack these components on top of each other, you can ensure that
multiple security barriers must be overcome by an attacker.

And this leads us to how Defense-in-Depth is applied to gVisor: no single
vulnerability should compromise the host.

In the "Attacker's Advantage / Defender's Dilemma," the defender must succeed
all the time while the attacker only needs to succeed once. Defense-in-Depth
inverts this principle: once the attacker successfully compromises any given
software component, they are immediately faced with needing to compromise a
subsequent, distinct layer in order to move laterally or acquire more privilege.

For example, the untrusted container is isolated from the Sentry. The Sentry is
isolated from host I/O operations by serving those requests in separate
processes called Gofers. And both the untrusted container and its associated
Gofers are isolated from the host process that is running the sandbox.

An additional benefit is that this generally leads to more robust and stable
software, forcing interfaces to be strictly defined and tested to ensure all
inputs are properly parsed and bounds checked.
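
As a rough illustration of what "parsed and bounds checked" can look like at a component boundary, here is a minimal Go sketch. It is not gVisor code: the wire format (a 4-byte little-endian length followed by the payload), the `maxPayload` limit, and the `parseMessage` name are all assumptions made for this example.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// maxPayload is a hypothetical upper bound chosen for this sketch; it is not
// a gVisor constant.
const maxPayload = 1 << 16

var errMalformed = errors.New("malformed message")

// parseMessage treats buf as hostile input. It expects a 4-byte little-endian
// length followed by exactly that many payload bytes, and rejects anything
// else instead of trusting the sender's claimed length.
func parseMessage(buf []byte) ([]byte, error) {
	if len(buf) < 4 {
		return nil, errMalformed
	}
	n := binary.LittleEndian.Uint32(buf[:4])
	if n > maxPayload || uint64(n) != uint64(len(buf)-4) {
		return nil, errMalformed
	}
	return buf[4:], nil
}

func main() {
	good := []byte{2, 0, 0, 0, 'h', 'i'}
	// The header claims far more data than is actually present.
	bad := []byte{0xff, 0xff, 0xff, 0xff, 'h', 'i'}

	_, errGood := parseMessage(good)
	_, errBad := parseMessage(bad)
	fmt.Println(errGood, errBad)
}
```

The point of the sketch is the posture rather than the format: the receiver validates every field before using it, so a compromised sender gains nothing by lying about sizes.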

## Least-Privilege

The principle of Least-Privilege implies that each software component has only
the permissions it needs to function, and no more.

Least-Privilege is applied throughout gVisor. Each component, and more
importantly each interface between components, is designed so that only the
minimum level of permission is required for it to perform its function.
Specifically, the closer you are to the untrusted application, the less
privilege you have.

![Figure 2](/assets/images/2019-11-18-security-basics-figure2.png "runsc components and their privileges.")

This is evident in how runsc (the drop-in gVisor binary for Docker/Kubernetes)
constructs the sandbox. The Sentry has the least privilege possible (it can't
even open a file!). Gofers are only allowed file access, so even if one were
compromised, the host network would be unavailable. Only the runsc binary itself
has full access to the host OS, and even runsc's access to the host OS is often
limited through capabilities / chroot / namespacing.

Designing a system with Defense-in-Depth and Least-Privilege in mind encourages
small, separate, single-purpose components, each with very restricted
privileges.
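
The privilege gradient described above can be restated as a small Go table. This is only a model of the prose, not how runsc actually enforces anything (that is done with seccomp, capabilities, chroot, and namespaces); the `privilege` type, the flag names, and the `can` helper are hypothetical.

```go
package main

import "fmt"

// privilege is a hypothetical bitmask used only for this illustration.
type privilege uint8

const (
	fileAccess privilege = 1 << iota
	networkAccess
	fullHost
)

// granted restates the text: the Sentry cannot open files, Gofers get file
// access only, and runsc itself holds the broadest access.
var granted = map[string]privilege{
	"sentry": 0,
	"gofer":  fileAccess,
	"runsc":  fileAccess | networkAccess | fullHost,
}

// can reports whether a component holds every bit in p. Unknown components
// map to the zero value and therefore hold nothing.
func can(component string, p privilege) bool {
	return p != 0 && granted[component]&p == p
}

func main() {
	for _, c := range []string{"sentry", "gofer", "runsc"} {
		fmt.Printf("%-6s file=%v network=%v\n", c, can(c, fileAccess), can(c, networkAccess))
	}
}
```

Note that the default for anything not listed is "no privilege", which mirrors the idea that access must be granted deliberately rather than removed after the fact.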

## Attack Surface Reduction

There are no bugs in unwritten code. In other words, gVisor supports a feature
if and only if it is needed to run host Linux containers.

### Host Application/Sentry Interface:

There are a lot of things gVisor does not need to do. For example, it does not
need to support arbitrary device drivers, nor does it need to support video
playback. By not implementing what will not be used, we avoid introducing
potential bugs in our code.

That is not to say gVisor has limited functionality! Quite the opposite: we
analyzed what is actually needed to run Linux containers, and today the Sentry
supports 237 syscalls[^3]<sup>,</sup>[^4], along with the range of critical
/proc and /dev files. However, gVisor does not support every syscall in the
Linux kernel. There are about 350 syscalls[^5] within the 5.3.11 version of the
Linux kernel, many of which do not apply to Linux containers that typically host
cloud-like workloads. For example, we don't support old versions of epoll
(epoll_ctl_old, epoll_wait_old), because they are deprecated in Linux and no
supported workloads use them.
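
One way to picture "support only what is needed" is a dispatch table in which anything without an entry simply fails. The Go sketch below is a toy model, not the Sentry's real syscall table (see footnote 3 for that): the `handler` type, the `table` contents, and the choice to return a "not implemented" error are assumptions for illustration. The syscall numbers are the x86-64 ones (39 for getpid, 214 for epoll_ctl_old).

```go
package main

import (
	"errors"
	"fmt"
)

// errNoSys stands in for an "unimplemented syscall" result; how the real
// Sentry reports unsupported calls is not covered in this post.
var errNoSys = errors.New("function not implemented")

// handler is a hypothetical signature for an in-sandbox syscall implementation.
type handler func(args [6]uintptr) (uintptr, error)

// table lists the only syscalls this toy kernel implements. Everything else
// has no code behind it, and therefore no bugs to exploit.
var table = map[uintptr]handler{
	39: func([6]uintptr) (uintptr, error) { return 1, nil }, // getpid: a sandbox-local PID
}

// dispatch looks up the handler and refuses anything that is not in the table.
func dispatch(sysno uintptr, args [6]uintptr) (uintptr, error) {
	h, ok := table[sysno]
	if !ok {
		return 0, errNoSys
	}
	return h(args)
}

func main() {
	pid, err := dispatch(39, [6]uintptr{})
	fmt.Println("getpid:", pid, err)

	_, err = dispatch(214, [6]uintptr{}) // epoll_ctl_old is deliberately absent
	fmt.Println("epoll_ctl_old:", err)
}
```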

Furthermore, any exploited vulnerabilities in the implemented syscalls (or
Sentry code in general) only apply to gaining control of the Sentry. More on
this in a later post.

### Sentry/Host OS Interface:

The Sentry's interactions with the Host OS are restricted in many ways. For
instance, no syscall is "passed-through" from the untrusted application to the
host OS. All syscalls are intercepted and interpreted. In the case where the
Sentry needs to call the Host OS, we severely limit the syscalls that the Sentry
itself is allowed to make to the host kernel[^6].

For example, there are many file-system based attacks, where manipulation of
files or their paths can lead to compromise of the host[^7]. As a result, the
Sentry does not allow any syscall that creates or opens a file descriptor. All
file descriptors must be donated to the sandbox. By disallowing open or creation
of file descriptors, we eliminate entire categories of these file-based attacks.

This does not affect functionality though. For example, during startup, runsc
will donate FDs to the Sentry that allow for mapping STDIN/STDOUT/STDERR to the
sandboxed application. Also the Gofer may donate an FD to the Sentry, allowing
for direct access to some files. And most files will be remotely accessed
through the Gofers, in which case no FDs are donated to the Sentry.

The Sentry itself is only allowed access to specific
[whitelisted syscalls](https://github.com/google/gvisor/blob/master/runsc/config/config.go).
Without networking, the Sentry needs 53 host syscalls in order to function, and
with networking, it uses an additional 15[^8]. By limiting the whitelist to only
these needed syscalls, we radically reduce the amount of host OS attack surface.
If the Sentry attempts to call anything outside the whitelist, the call is
blocked immediately and the sandbox is killed by the Host OS.
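
The decision logic of such a filter is small enough to sketch in Go. The real enforcement is done by seccomp in the host kernel using the rules referenced in footnote 6; the sketch below only models the allow-or-kill decision. The `action` type, the `filter` function, and the four syscall names in `allowed` are illustrative assumptions, not the actual list.

```go
package main

import "fmt"

// action is the hypothetical verdict of the filter.
type action int

const (
	allow action = iota
	kill
)

// allowed is an illustrative subset only; the real list lives in the gVisor
// source tree and is considerably longer.
var allowed = map[string]bool{
	"read":  true,
	"write": true,
	"futex": true,
	"mmap":  true,
}

// filter is default-deny: anything not explicitly listed ends the sandbox.
func filter(syscallName string) action {
	if allowed[syscallName] {
		return allow
	}
	return kill
}

func main() {
	for _, name := range []string{"read", "open"} {
		if filter(name) == allow {
			fmt.Println(name, "-> allowed")
		} else {
			fmt.Println(name, "-> sandbox killed")
		}
	}
}
```

The important design choice is the default: a syscall added to Linux tomorrow is denied automatically, because nothing is permitted unless someone deliberately lists it.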

### Sentry/Gofer Interface:

The Sentry communicates with the Gofer through a local unix domain socket (UDS)
via a version of the 9P protocol[^9]. The UDS file descriptor is passed to the
sandbox during initialization and all communication between the Sentry and Gofer
happens via 9P. We will go more into how Gofers work in future posts.

### End Result

So, of the 350 syscalls in the Linux kernel, the Sentry needs to implement only
237 of them to support containers. At most, the Sentry only needs to call 68 of
the host Linux syscalls. In other words, with gVisor, applications get the vast
majority (and growing) of the functionality of Linux containers for only 68
possible syscalls to the Host OS. 350 syscalls to 68 is attack surface reduction.
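
Spelled out, using only the counts quoted in this post (which will drift as both gVisor and Linux evolve):

```latex
\underbrace{53}_{\text{without networking}} + \underbrace{15}_{\text{networking}} = 68,
\qquad
\frac{68}{350} \approx 0.19,
\qquad
\frac{237}{350} \approx 0.68
```

That is, roughly two thirds of the Linux syscall table is implemented inside the Sentry, while under a fifth of it remains reachable on the host.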

![Figure 3](/assets/images/2019-11-18-security-basics-figure3.png "Reduction of Attack Surface of the Syscall Table. Note that the Sentry's Syscall Emulation Layer keeps the Containerized Process from ever calling the Host OS.")

## Secure-by-default

The default choice for a user should be safe. If users need to run a less secure
configuration of the sandbox for the sake of performance or application
compatibility, they must make the choice explicitly.

An example of this might be a networking application that is performance
sensitive. Instead of using the safer, Go-based Netstack in the Sentry, the
untrusted container can instead use the host Linux networking stack directly.
However, this means the untrusted container will be directly interacting with
the host, without the safety benefits of the sandbox. It also means that an
attack could directly compromise the host through this path.

These less secure configurations are **not** the default. In fact, the user must
take action to change the configuration and run in a less secure mode.
Additionally, these actions make it very obvious that a less secure
configuration is being used.

This can be as simple as forcing a default runtime flag option to the secure
option. gVisor does this by always using its internal netstack by default.
However, for certain performance sensitive applications, we allow the usage of
the host OS networking stack, but it requires the user to actively set a
flag[^10].
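
In code, "secure by default" mostly comes down to what value a setting takes when the user says nothing. The Go sketch below shows the pattern with the standard `flag` package. It is not runsc's flag handling: the `config` type, the `parse` function, and the flag name and values are stand-ins chosen for this example (the real option is described in the documentation linked from footnote 10).

```go
package main

import (
	"flag"
	"fmt"
)

// config is a hypothetical, one-field sandbox configuration.
type config struct {
	network string
}

// parse returns the configuration for the given arguments. With no arguments
// the isolated network stack is selected; the less isolated mode must be
// requested by name.
func parse(args []string) (config, error) {
	fs := flag.NewFlagSet("sandbox", flag.ContinueOnError)
	network := fs.String("network", "sandbox",
		`network stack: "sandbox" (default, isolated) or "host" (less isolated)`)
	if err := fs.Parse(args); err != nil {
		return config{}, err
	}
	switch *network {
	case "sandbox", "host":
		return config{network: *network}, nil
	}
	return config{}, fmt.Errorf("unknown network mode %q", *network)
}

func main() {
	def, _ := parse(nil)
	optIn, _ := parse([]string{"-network=host"})
	fmt.Println("default:", def.network)
	fmt.Println("explicit opt-in:", optIn.network)
}
```

Unknown values are rejected rather than silently mapped to some mode, so a typo cannot accidentally weaken the configuration.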

# Technology Choices

Technology choices for gVisor mainly involve things that will give us a security
boundary.

At a high level, a boundary in software can mean many things: the boundary
between threads, between processes, between CPU privilege levels, and more.

Security boundaries are interfaces that are designed and built so that entire
classes of bugs/vulnerabilities are eliminated.

For example, the Sentry and Gofers are implemented using Go. Go was chosen for a
number of the features it provides. Go is a fast, statically-typed, compiled
language that has efficient multi-threading support, garbage collection and a
constrained set of "unsafe" operations.

Using these features enabled safe array and pointer handling. This means entire
classes of vulnerabilities were eliminated, such as buffer overflows and
use-after-free.
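
A short Go example makes the bounds-checking point concrete. An out-of-range index in Go triggers a run-time panic instead of reading or writing adjacent memory; the `readAt` helper below (a name invented for this sketch) converts that panic into an ordinary error.

```go
package main

import "fmt"

// readAt returns buf[i]. If i is out of range, the Go runtime panics before
// any memory is touched; the deferred function turns that into an error.
func readAt(buf []byte, i int) (b byte, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("rejected access: %v", r)
		}
	}()
	return buf[i], nil
}

func main() {
	buf := []byte{10, 20, 30}

	b, err := readAt(buf, 1)
	fmt.Println(b, err)

	// In C, the equivalent read would silently return whatever sits past the
	// end of the array. Here it is stopped.
	_, err = readAt(buf, 100)
	fmt.Println(err)
}
```

The same guarantee does not extend to code that uses the `unsafe` package, which is why a constrained and auditable set of "unsafe" operations matters.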

Another example is our use of very strict syscall switching to ensure that the
Sentry is always the first software component that parses and interprets the
calls being made by the untrusted container. Here is an instance where different
platforms use different solutions, but all of them share this common trait,
whether it is through the use of ptrace "a la PTRACE_ATTACH"[^11] or kvm's
ring0[^12].

Finally, one of the most restrictive choices was to use seccomp to restrict the
Sentry from being able to open or create a file descriptor on the host. All file
I/O is required to go through Gofers. Preventing the opening or creation of file
descriptors eliminates whole categories of bugs around file permissions
[like this one](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-4557)[^13].

# To be continued - Part 2

In part 2 of this blog post, we will explore gVisor from an attacker's point of
view. We will use it as an opportunity to examine the specific strengths and
weaknesses of each gVisor component.

We will also use it to introduce Google's Vulnerability Reward Program[^14], and
other ways the community can contribute to help make gVisor safe, fast and
stable.
<br>
<br>

--------------------------------------------------------------------------------

[^1]: [https://en.wikipedia.org/wiki/Secure_by_design](https://en.wikipedia.org/wiki/Secure_by_design)
[^2]: [https://gvisor.dev/docs/architecture_guide](https://gvisor.dev/docs/architecture_guide/)
[^3]: [https://github.com/google/gvisor/blob/master/pkg/sentry/syscalls/linux/linux64_amd64.go](https://github.com/google/gvisor/blob/master/pkg/sentry/syscalls/syscalls.go)
[^4]: Internally, that is: the Sentry doesn't call the Host OS to implement them. In fact, that is explicitly disallowed; more on that in the future.
[^5]: [https://elixir.bootlin.com/linux/latest/source/arch/x86/entry/syscalls/syscall_64.tbl#L345](https://elixir.bootlin.com/linux/latest/source/arch/x86/entry/syscalls/syscall_64.tbl#L345)
[^6]: [https://github.com/google/gvisor/tree/master/runsc/boot/filter](https://github.com/google/gvisor/tree/master/runsc/boot/filter)
[^7]: [https://en.wikipedia.org/wiki/Dirty_COW](https://en.wikipedia.org/wiki/Dirty_COW)
[^8]: [https://github.com/google/gvisor/blob/master/runsc/boot/config.go](https://github.com/google/gvisor/blob/master/runsc/boot/config.go)
[^9]: [https://en.wikipedia.org/wiki/9P_(protocol)](https://en.wikipedia.org/wiki/9P_(protocol))
[^10]: [https://gvisor.dev/docs/user_guide/networking/#network-passthrough](https://gvisor.dev/docs/user_guide/networking/#network-passthrough)
[^11]: [https://github.com/google/gvisor/blob/c7e901f47a09eaac56bd4813227edff016fa6bff/pkg/sentry/platform/ptrace/subprocess.go#L390](https://github.com/google/gvisor/blob/c7e901f47a09eaac56bd4813227edff016fa6bff/pkg/sentry/platform/ptrace/subprocess.go#L390)
[^12]: [https://github.com/google/gvisor/blob/c7e901f47a09eaac56bd4813227edff016fa6bff/pkg/sentry/platform/ring0/kernel_amd64.go#L182](https://github.com/google/gvisor/blob/c7e901f47a09eaac56bd4813227edff016fa6bff/pkg/sentry/platform/ring0/kernel_amd64.go#L182)
[^13]: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-4557](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-4557)
[^14]: [https://www.google.com/about/appsecurity/reward-program/index.html](https://www.google.com/about/appsecurity/reward-program/index.html)