# gVisor Security Basics - Part 1

This blog is a space for engineers and community members to share perspectives
and deep dives on technology and design within the gVisor project. Though our
logo suggests we're in the business of space exploration (or perhaps fighting
sea monsters), we're actually in the business of sandboxing Linux containers.
When we created gVisor, we had three specific goals in mind: _container-native
security_, _resource efficiency_, and _platform portability_. To put it simply,
gVisor provides _efficient defense-in-depth for containers anywhere_.

This post addresses gVisor's _container-native security_, specifically how
gVisor provides strong isolation between an application and the host OS. Future
posts will address _resource efficiency_ (how gVisor preserves container
benefits like fast starts, smaller snapshots, and less memory overhead than VMs)
and _platform portability_ (run gVisor wherever Linux OCI containers run).
Delivering on each of these goals requires careful security considerations and a
robust design.

## What does "sandbox" mean?

gVisor allows the execution of untrusted containers, preventing them from
adversely affecting the host. This means that the untrusted container is
prevented from attacking or spying on either the host kernel or any other peer
userspace processes on the host.

For example, if you are a cloud container hosting service, running containers
from different customers on the same virtual machine means that a compromise of
one container could expose other customers' data. Properly configured, gVisor
can provide sufficient isolation to allow different customers to run containers
on the same host. There are many aspects to the proper configuration, including
limiting file and network access, which we will discuss in future posts.

## The cost of compromise

gVisor was designed around the premise that any security boundary could
potentially be compromised with enough time and resources. We tried to optimize
for a solution that was as costly and time-consuming for an attacker as
possible, at every layer.

Consequently, gVisor was built through a combination of intentional design
principles and specific technology choices that work together to provide the
security isolation needed for running hostile containers on a host. We'll dig
into it in the next section!

# Design Principles

gVisor was designed with some common
[secure design](https://en.wikipedia.org/wiki/Secure_by_design) principles in
mind: Defense-in-Depth, Principle of Least-Privilege, Attack Surface Reduction
and Secure-by-Default[^1].

In general, design principles outline good engineering practices, but in the
case of security, they can also be thought of as a set of tactics. In a
real-life castle, there is no single defensive feature. Rather, there are many
in combination: redundant walls, scattered drawbridges, small bottleneck
entrances, moats, etc.

A simplified version of the design is below
([more detailed version](/docs/))[^2]:

![Figure 1](/assets/images/2019-11-18-security-basics-figure1.png "Simplified design of gVisor.")

In order to discuss design principles, the following components are important to
know:

*   runsc - binary that packages the Sentry, platform, and Gofer(s) that run
    containers. runsc is the drop-in binary for running gVisor in Docker and
    Kubernetes.
*   Untrusted Application - container running in the sandbox. "Untrusted
    application" and "container" are used interchangeably in this article.
*   Platform Syscall Switcher - intercepts syscalls from the application and
    passes them to the Sentry with no further handling.
*   Sentry - the "application kernel" in userspace that serves the untrusted
    application. Each application instance has its own Sentry. The Sentry
    handles syscalls, routes I/O to Gofers, and manages memory and CPU, all in
    userspace. The Sentry is allowed to make limited, filtered syscalls to the
    host OS.
*   Gofer - a process that specifically handles different types of I/O for the
    Sentry (usually disk I/O). Gofers are also allowed to make filtered syscalls
    to the Host OS.
*   Host OS - the actual OS on which gVisor containers are running, always some
    flavor of Linux (sorry, Windows/MacOS users).

It is important to emphasize what is being protected from the untrusted
application in this diagram: the host OS and other userspace applications.

In this post, we are only discussing security-related features of gVisor, and
you might ask, "What about performance, compatibility and stability?" We will
cover these considerations in future posts.

## Defense-in-Depth

For gVisor, Defense-in-Depth means each component of the software stack trusts
the other components as little as possible.

It may seem strange that we would want our own software components to distrust
each other. But by limiting the trust between small, discrete components, each
component is forced to defend itself against potentially malicious input. And
when you stack these components on top of each other, you can ensure that
multiple security barriers must be overcome by an attacker.

And this leads us to how Defense-in-Depth is applied to gVisor: no single
vulnerability should compromise the host.

In the "Attacker's Advantage / Defender's Dilemma," the defender must succeed
all the time while the attacker only needs to succeed once. Defense-in-Depth
inverts this principle: once the attacker successfully compromises any given
software component, they are immediately faced with needing to compromise a
subsequent, distinct layer in order to move laterally or acquire more privilege.

For example, the untrusted container is isolated from the Sentry. The Sentry is
isolated from host I/O operations by serving those requests in separate
processes called Gofers. And both the untrusted container and its associated
Gofers are isolated from the host process that is running the sandbox.

An additional benefit is that this generally leads to more robust and stable
software, forcing interfaces to be strictly defined and tested to ensure all
inputs are properly parsed and bounds checked.

## Least-Privilege

The principle of Least-Privilege implies that each software component has only
the permissions it needs to function, and no more.

Least-Privilege is applied throughout gVisor. Each component and, more
importantly, each interface between the components is designed so that only the
minimum level of permission is required for it to perform its function.
Specifically, the closer you are to the untrusted application, the less
privilege you have.

![Figure 2](/assets/images/2019-11-18-security-basics-figure2.png "runsc components and their privileges.")

This is evident in how runsc (the drop-in gVisor binary for Docker/Kubernetes)
constructs the sandbox. The Sentry has the least privilege possible (it can't
even open a file!). Gofers are only allowed file access, so even if a Gofer
were compromised, the host network would remain out of its reach. Only the
runsc binary itself has full access to the host OS, and even runsc's access to
the host OS is often limited through capabilities / chroot / namespacing.

Designing a system with Defense-in-Depth and Least-Privilege in mind encourages
small, separate, single-purpose components, each with very restricted
privileges.

## Attack Surface Reduction

There are no bugs in unwritten code. In other words, gVisor supports a feature
if and only if it is needed to run host Linux containers.

### Host Application/Sentry Interface:

There are a lot of things gVisor does not need to do. For example, it does not
need to support arbitrary device drivers, nor does it need to support video
playback. By not implementing what will not be used, we avoid introducing
potential bugs in our code.

That is not to say gVisor has limited functionality! Quite the opposite: we
analyzed what is actually needed to run Linux containers, and today the Sentry
supports 237 syscalls[^3]<sup>,</sup>[^4], along with the range of critical
/proc and /dev files. However, gVisor does not support every syscall in the
Linux kernel. There are about 350 syscalls[^5] within the 5.3.11 version of the
Linux kernel, many of which do not apply to Linux containers that typically host
cloud-like workloads. For example, we don't support old versions of epoll
(epoll_ctl_old, epoll_wait_old), because they are deprecated in Linux and no
supported workloads use them.

Furthermore, any exploited vulnerabilities in the implemented syscalls (or
Sentry code in general) only apply to gaining control of the Sentry. More on
this in a later post.

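To make the idea of "implementing only what is needed" concrete, here is a minimal sketch of a syscall dispatch table. It is not the Sentry's code: the `handler` type, the `table` contents, and the choice to answer unknown syscalls with an `ENOSYS`-style error are all assumptions made for illustration. The syscall numbers follow the x86-64 table (39 is getpid, 102 is getuid, 214 is epoll_ctl_old).

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNoSys stands in for ENOSYS ("function not implemented").
// Returning it for unsupported syscalls is an assumption of this sketch.
var ErrNoSys = errors.New("ENOSYS: syscall not implemented")

// handler is a hypothetical signature for a syscall emulated in userspace.
type handler func(args [6]uintptr) (uintptr, error)

// table maps syscall numbers to handlers implemented inside the sandbox.
// Both entries are placeholders that return made-up values.
var table = map[uintptr]handler{
	39:  func([6]uintptr) (uintptr, error) { return 1, nil }, // getpid: fake PID
	102: func([6]uintptr) (uintptr, error) { return 0, nil }, // getuid: fake UID
}

// Dispatch runs the emulated handler if one exists. A syscall without a
// handler is rejected; it is never forwarded to the host.
func Dispatch(nr uintptr, args [6]uintptr) (uintptr, error) {
	h, ok := table[nr]
	if !ok {
		return 0, ErrNoSys
	}
	return h(args)
}

func main() {
	fmt.Println(Dispatch(39, [6]uintptr{}))  // emulated getpid
	fmt.Println(Dispatch(214, [6]uintptr{})) // epoll_ctl_old: no handler
}
```

The point of the sketch is the shape of the table: code that is never written for `epoll_ctl_old` cannot contain a bug, and a missing entry fails closed.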
### Sentry/Host OS Interface:

The Sentry's interactions with the Host OS are restricted in many ways. For
instance, no syscall is "passed-through" from the untrusted application to the
host OS. All syscalls are intercepted and interpreted. In the case where the
Sentry needs to call the Host OS, we severely limit the syscalls that the Sentry
itself is allowed to make to the host kernel[^6].

For example, there are many file-system based attacks, where manipulation of
files or their paths can lead to compromise of the host[^7]. As a result, the
Sentry does not allow any syscall that creates or opens a file descriptor. All
file descriptors must be donated to the sandbox. By disallowing open or creation
of file descriptors, we eliminate entire categories of these file-based attacks.

This does not affect functionality though. For example, during startup, runsc
will donate FDs to the Sentry that allow for mapping STDIN/STDOUT/STDERR to the
sandboxed application. Also the Gofer may donate an FD to the Sentry, allowing
for direct access to some files. And most files will be remotely accessed
through the Gofers, in which case no FDs are donated to the Sentry.

The Sentry itself is only allowed access to specific
[whitelisted syscalls](https://github.com/google/gvisor/blob/master/runsc/config/config.go).
Without networking, the Sentry needs 53 host syscalls in order to function, and
with networking, it uses an additional 15[^8]. By limiting the whitelist to only
these needed syscalls, we radically reduce the amount of host OS attack surface.
If any attempts are made to call something outside the whitelist, it is
immediately blocked and the sandbox is killed by the Host OS.

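The allow-or-kill decision described above can be modeled in a few lines. This is only a userspace simulation under stated assumptions: the real filter is enforced by the host kernel through seccomp, and the `allowed` set below is a tiny invented subset, not the Sentry's actual list. The names `Action`, `Check`, and `allowed` are hypothetical.

```go
package main

import "fmt"

// Action is the outcome of checking a host syscall against the allowlist.
type Action int

const (
	Allow Action = iota // the call proceeds to the host kernel
	Kill                // the sandbox is terminated
)

func (a Action) String() string {
	if a == Allow {
		return "allow"
	}
	return "kill"
}

// allowed is an illustrative subset chosen for this sketch. Note that
// nothing able to open or create a file descriptor appears in it.
var allowed = map[string]bool{
	"read":  true,
	"write": true,
	"futex": true,
	"mmap":  true,
}

// Check fails closed: any syscall not explicitly listed results in Kill.
func Check(name string) Action {
	if allowed[name] {
		return Allow
	}
	return Kill
}

func main() {
	for _, name := range []string{"read", "open", "socket"} {
		fmt.Printf("%-6s -> %v\n", name, Check(name))
	}
}
```

An allowlist differs from a denylist in its default: a syscall added to Linux tomorrow is blocked without anyone having to update the filter.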
### Sentry/Gofer Interface:

The Sentry communicates with the Gofer through a local unix domain socket (UDS)
via a version of the 9P protocol[^9]. The UDS file descriptor is passed to the
sandbox during initialization and all communication between the Sentry and Gofer
happens via 9P. We will go more into how Gofers work in future posts.

### End Result

So, of the roughly 350 syscalls in the Linux kernel, the Sentry needs to
implement only 237 of them to support containers. At most, the Sentry itself
needs to call only 68 of the host Linux syscalls (53 without networking, plus
15 with it). In other words, with gVisor, applications get the vast majority
(and a growing share) of the functionality of Linux containers while exposing
only 68 possible syscalls to the Host OS. Going from 350 syscalls to 68 is
attack surface reduction.

![Figure 3](/assets/images/2019-11-18-security-basics-figure3.png "Reduction of Attack Surface of the Syscall Table. Note that the Sentry's Syscall Emulation Layer keeps the Containerized Process from ever calling the Host OS.")

## Secure-by-Default

The default choice for a user should be safe. If users need to run a less secure
configuration of the sandbox for the sake of performance or application
compatibility, they must make the choice explicitly.

An example of this might be a networking application that is performance
sensitive. Instead of using the safer, Go-based Netstack in the Sentry, the
untrusted container can instead use the host Linux networking stack directly.
However, this means the untrusted container will be directly interacting with
the host, without the safety benefits of the sandbox. It also means that an
attack could directly compromise the host through this path.

These less secure configurations are **not** the default. In fact, the user must
take action to change the configuration and run in a less secure mode.
Additionally, these actions make it very obvious that a less secure
configuration is being used.

This can be as simple as forcing a default runtime flag option to the secure
option. gVisor does this by always using its internal netstack by default.
However, for certain performance sensitive applications, we allow the usage of
the host OS networking stack, but it requires the user to actively set a
flag[^10].

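As a rough illustration of a secure default, the sketch below parses a single option with Go's standard `flag` package. The flag name `network` and the values `sandbox` and `host` are assumptions made for this example; see the networking guide in the footnote for the actual option. `ParseNetwork` is a hypothetical helper.

```go
package main

import (
	"flag"
	"fmt"
)

// ParseNetwork returns the selected network mode. With no arguments it
// returns the safe default; the less secure mode must be requested by name.
func ParseNetwork(args []string) (string, error) {
	fs := flag.NewFlagSet("sandbox-demo", flag.ContinueOnError)
	network := fs.String("network", "sandbox",
		"network stack: sandbox (safe default) or host (less isolated)")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	switch *network {
	case "sandbox", "host":
		return *network, nil
	default:
		// Unknown values are rejected rather than silently accepted.
		return "", fmt.Errorf("unknown network mode %q", *network)
	}
}

func main() {
	mode, _ := ParseNetwork(nil)
	fmt.Println("no flags:", mode)

	mode, _ = ParseNetwork([]string{"-network=host"})
	fmt.Println("explicit opt-out:", mode)
}
```

The less secure mode is reachable only through an argument that names it, so anyone reading the command line can see that the default was overridden.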
# Technology Choices

Technology choices for gVisor mainly involve things that will give us a security
boundary.

At a high level, the term "boundary" in software can describe a great many
things: boundaries between threads, boundaries between processes, boundaries
between CPU privilege levels, and more.

Security boundaries are interfaces that are designed and built so that entire
classes of bugs/vulnerabilities are eliminated.

For example, the Sentry and Gofers are implemented using Go. Go was chosen for a
number of the features it provides. Go is a fast, statically-typed, compiled
language that has efficient multi-threading support, garbage collection and a
constrained set of "unsafe" operations.

Using these features enabled safe array and pointer handling. This means entire
classes of vulnerabilities were eliminated, such as buffer overflows and
use-after-free.

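A short sketch shows what bounds checking buys in practice. In Go, indexing past the end of a slice triggers a runtime panic instead of reading adjacent memory, and that panic can be turned into an ordinary error. The function name `ReadAt` is invented for this example and is not part of gVisor.

```go
package main

import "fmt"

// ReadAt returns buf[i]. If i is out of range, the Go runtime panics
// rather than touching memory outside the slice; the deferred function
// converts that panic into an error for the caller.
func ReadAt(buf []byte, i int) (b byte, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("rejected out-of-bounds access: %v", r)
		}
	}()
	return buf[i], nil
}

func main() {
	buf := []byte{0xde, 0xad, 0xbe, 0xef}

	b, err := ReadAt(buf, 1)
	fmt.Printf("in range:     %#x, err=%v\n", b, err)

	// In C, the equivalent read could silently return neighboring memory.
	b, err = ReadAt(buf, 10)
	fmt.Printf("out of range: %#x, err=%v\n", b, err)
}
```

The out-of-range read fails loudly and returns no data, which is the property that rules out a classic buffer over-read.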
Another example is our use of very strict syscall switching to ensure that the
Sentry is always the first software component that parses and interprets the
calls being made by the untrusted container. Here is an instance where different
platforms use different solutions, but all of them share this common trait,
whether it is through the use of ptrace "a la PTRACE_ATTACH"[^11] or kvm's
ring0[^12].

Finally, one of the most restrictive choices was to use seccomp to restrict the
Sentry from being able to open or create a file descriptor on the host. All file
I/O is required to go through Gofers. Preventing the opening or creation of file
descriptors eliminates whole categories of bugs around file permissions
[like this one](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-4557)[^13].

# To be continued - Part 2

In part 2 of this blog post, we will explore gVisor from an attacker's point of
view. We will use it as an opportunity to examine the specific strengths and
weaknesses of each gVisor component.

We will also use it to introduce Google's Vulnerability Reward Program[^14], and
other ways the community can contribute to help make gVisor safe, fast and
stable.
<br>
<br>

--------------------------------------------------------------------------------

[^1]: [https://en.wikipedia.org/wiki/Secure_by_design](https://en.wikipedia.org/wiki/Secure_by_design)
[^2]: [https://gvisor.dev/docs/architecture_guide](https://gvisor.dev/docs/architecture_guide/)
[^3]: [https://github.com/google/gvisor/blob/master/pkg/sentry/syscalls/linux/linux64_amd64.go](https://github.com/google/gvisor/blob/master/pkg/sentry/syscalls/syscalls.go)

<!-- mdformat off(mdformat formats this into multiple lines) -->
[^4]: Internally, that is: the Sentry doesn't call out to the Host OS to implement them. In fact, that is explicitly disallowed; more on that in the future.
<!-- mdformat on -->

[^5]: [https://elixir.bootlin.com/linux/latest/source/arch/x86/entry/syscalls/syscall_64.tbl#L345](https://elixir.bootlin.com/linux/latest/source/arch/x86/entry/syscalls/syscall_64.tbl#L345)
[^6]: [https://github.com/google/gvisor/tree/master/runsc/boot/filter](https://github.com/google/gvisor/tree/master/runsc/boot/filter)
[^7]: [https://en.wikipedia.org/wiki/Dirty_COW](https://en.wikipedia.org/wiki/Dirty_COW)
[^8]: [https://github.com/google/gvisor/blob/master/runsc/boot/config.go](https://github.com/google/gvisor/blob/master/runsc/boot/config.go)

<!-- mdformat off(mdformat breaks this url by escaping the parenthesis) -->
[^9]: [https://en.wikipedia.org/wiki/9P_(protocol)](https://en.wikipedia.org/wiki/9P_(protocol))
<!-- mdformat on -->

[^10]: [https://gvisor.dev/docs/user_guide/networking/#network-passthrough](https://gvisor.dev/docs/user_guide/networking/#network-passthrough)
[^11]: [https://github.com/google/gvisor/blob/c7e901f47a09eaac56bd4813227edff016fa6bff/pkg/sentry/platform/ptrace/subprocess.go#L390](https://github.com/google/gvisor/blob/c7e901f47a09eaac56bd4813227edff016fa6bff/pkg/sentry/platform/ptrace/subprocess.go#L390)
[^12]: [https://github.com/google/gvisor/blob/c7e901f47a09eaac56bd4813227edff016fa6bff/pkg/sentry/platform/ring0/kernel_amd64.go#L182](https://github.com/google/gvisor/blob/c7e901f47a09eaac56bd4813227edff016fa6bff/pkg/sentry/platform/ring0/kernel_amd64.go#L182)
[^13]: [https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-4557](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2016-4557)
[^14]: [https://www.google.com/about/appsecurity/reward-program/index.html](https://www.google.com/about/appsecurity/reward-program/index.html)