# Security Model

[TOC]

gVisor was created in order to provide additional defense against the
exploitation of kernel bugs by untrusted userspace code. In order to understand
how gVisor achieves this goal, it is first necessary to understand the basic
threat model.
## Threats: The Anatomy of an Exploit

An exploit takes advantage of a software or hardware bug in order to escalate
privileges, gain access to privileged data, or disrupt services. All of the
possible interactions that a malicious application can have with the rest of the
system (attack vectors) define the attack surface. We categorize these attack
vectors into several common classes.

### System API

An operating system or hypervisor exposes an abstract System API in the form of
system calls and traps. This API may be documented and stable, as with Linux, or
it may be abstracted behind a library, as with Windows (e.g. win32.dll or
ntdll.dll). The System API includes all standard interfaces that application
code uses to interact with the system. This includes high-level abstractions
that are derived from low-level system calls, such as system files, sockets and
namespaces.
    27  
Although the System API is exposed to applications by design, bugs and race
conditions within the kernel or hypervisor may occasionally be exploitable via
the API. This is common in part because most kernels and hypervisors are
written in [C][clang], which is well-suited to interfacing with hardware but
often prone to security issues. In order to exploit these issues, a typical
attack might involve some combination of the following:

1.  Opening or creating some combination of files, sockets or other descriptors.
1.  Passing crafted, malicious arguments, structures or packets.
1.  Racing with multiple threads in order to hit specific code paths.
    38  
For example, for the [Dirty Cow][dirtycow] privilege escalation bug, an
application would open a specific file in `/proc` or use a specific `ptrace`
system call, and use multiple threads in order to trigger a race condition when
touching a fresh page of memory. The attacker then gains control over a page of
memory belonging to the system. With additional privileges or access to
privileged data in the kernel, an attacker will often be able to employ
additional techniques to gain full access to the rest of the system.

While bugs in the implementation of the System API are readily fixed, they are
also the most common form of exploit. The exposure created by this class of
exploit is what gVisor aims to minimize and control, described in detail below.
    50  
### System ABI

Hardware and software exploits occasionally exist in execution paths that are
not part of an intended System API. In this case, exploits may be found as part
of implicit actions the hardware or privileged system code takes in response to
certain events, such as traps or interrupts. For example, the
[POPSS][popss] flaw required only native code execution (no specific system call
or file access). In that case, the Xen hypervisor was similarly vulnerable,
highlighting that hypervisors are not immune to this vector.
    60  
### Side Channels

Hardware side channels may be exploitable by any code running on a system:
native, sandboxed, or virtualized. However, many host-level mitigations against
hardware side channels are still effective with a sandbox. For example, kernels
built with retpoline protect against some speculative execution attacks
(Spectre) and frame poisoning may protect against L1 terminal fault (L1TF)
attacks. Hypervisors may introduce additional complications in this regard, as
there is no mitigation against an application in a normally functioning Virtual
Machine (VM) exploiting the L1TF vulnerability for another VM on the sibling
hyperthread.
    72  
### Other Vectors

The above categories in no way represent an exhaustive list of exploits, as we
focus only on running untrusted code from within the operating system or
hypervisor. We do not consider other ways that a more generic adversary may
interact with a system, such as inserting a portable storage device with a
malicious filesystem image, using a combination of crafted keyboard or touch
inputs, or saturating a network device with ill-formed packets.

Furthermore, high-level systems may contain exploitable components. An attacker
need not escalate privileges within a container if there’s an exploitable
network-accessible service on the host or some other API path. *A sandbox is not
a substitute for a secure architecture*.
    86  
## Goals: Limiting Exposure

![Threat model](security.png "Threat model.")

gVisor’s primary design goal is to minimize the System API attack vector through
multiple layers of defense, while still providing a process model. There are two
primary security principles that inform this design. First, the application’s
direct interactions with the host System API are intercepted by the Sentry,
which implements the System API instead. Second, the System API accessible to
the Sentry itself is minimized to a safer, restricted set. The first principle
minimizes the possibility of direct exploitation of the host System API by
applications, and the second principle minimizes indirect exploitability:
exploitation of the host through a buggy or compromised Sentry (e.g. chaining
an exploit).
   100  
The first principle is similar to the security basis for a Virtual Machine (VM).
With a VM, an application’s interactions with the host are replaced by
interactions with a guest operating system and a set of virtualized hardware
devices. These hardware devices are then implemented via the host System API by
a Virtual Machine Monitor (VMM). The Sentry similarly prevents direct
interactions by providing its own implementation of the System API that the
application must interact with. Applications are not able to directly craft
specific arguments or flags for the host System API, or interact directly with
host primitives.

For both the Sentry and a VMM, it’s worth noting that while direct interactions
are not possible, indirect interactions are still possible. For example, a read
on a host-backed file in the Sentry may ultimately result in a host read system
call (made by the Sentry, not by passing through arguments from the
application), similar to how a read on a block device in a VM may result in the
VMM issuing a corresponding host read system call from a backing file.
   117  
An important distinction from a VM is that the Sentry implements a System API
based directly on host System API primitives instead of relying on virtualized
hardware and a guest operating system. This selects a distinct set of
trade-offs, largely in the performance, efficiency and compatibility domains.
Since transitions in and out of the sandbox are relatively expensive, a guest
operating system will typically take ownership of resources. For example, in the
above case, the guest operating system may read the block device data in a local
page cache, to avoid subsequent reads. This may lead to better performance but
lower efficiency, since memory may be wasted or duplicated. The Sentry opts
instead to defer to the host for many operations during runtime, for improved
efficiency but lower performance in some use cases.
   129  
### What can a sandbox do?

An application in a gVisor sandbox is permitted to do most things a standard
container can do: for example, applications can read and write files mapped
within the container, make network connections, etc. As described above,
gVisor's primary goal is to limit exposure to bugs and exploits while still
allowing most applications to run. Even so, gVisor will limit some operations
that might be permitted with a standard container. Even with appropriate
capabilities, a user in a gVisor sandbox will only be able to manipulate
virtualized system resources (e.g. the system time, kernel settings or
filesystem attributes) and not underlying host system resources.

While the sandbox virtualizes many operations for the application, we limit the
sandbox's own interactions with the host to the following high-level operations:
   144  
1.  Communicate with a Gofer process via a connected socket. The sandbox may
    receive new file descriptors from the Gofer process, corresponding to opened
    files. These files can then be read from and written to by the sandbox.
1.  Make a minimal set of host system calls. The calls do not include the
    creation of new sockets (unless host networking mode is enabled) or opening
    files. The calls include duplication and closing of file descriptors,
    synchronization, timers and signal management.
1.  Read and write packets to a virtual ethernet device. This is not required if
    host networking is enabled (or networking is disabled).
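The first of these operations, receiving opened files over a connected socket,
relies on standard Unix file-descriptor passing (`SCM_RIGHTS` ancillary data).
The sketch below illustrates the kernel facility involved, not gVisor's actual
Gofer protocol; the `sendFD`/`recvFD`/`demo` names are invented for this
example:

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// sendFD passes an open file descriptor over a Unix socket using
// SCM_RIGHTS ancillary data — the kernel facility that lets a Gofer-style
// file proxy hand already-opened files to a sandbox that cannot call
// open(2) itself.
func sendFD(sock int, fd int) error {
	rights := syscall.UnixRights(fd)
	return syscall.Sendmsg(sock, []byte{0}, rights, nil, 0)
}

// recvFD receives a single file descriptor sent with sendFD.
func recvFD(sock int) (int, error) {
	buf := make([]byte, 1)
	oob := make([]byte, syscall.CmsgSpace(4)) // room for one 32-bit fd
	_, oobn, _, _, err := syscall.Recvmsg(sock, buf, oob, 0)
	if err != nil {
		return -1, err
	}
	msgs, err := syscall.ParseSocketControlMessage(oob[:oobn])
	if err != nil {
		return -1, err
	}
	fds, err := syscall.ParseUnixRights(&msgs[0])
	if err != nil {
		return -1, err
	}
	return fds[0], nil
}

// demo opens a file on the "Gofer" side, sends only its descriptor across
// a socket pair, and reads the contents back on the "sandbox" side.
func demo() string {
	pair, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM, 0)
	if err != nil {
		panic(err)
	}
	f, err := os.CreateTemp("", "gofer-demo")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	f.WriteString("hello from the gofer")
	f.Seek(0, 0)
	if err := sendFD(pair[0], int(f.Fd())); err != nil {
		panic(err)
	}
	fd, err := recvFD(pair[1])
	if err != nil {
		panic(err)
	}
	received := os.NewFile(uintptr(fd), "received")
	data := make([]byte, 64)
	n, _ := received.Read(data)
	return string(data[:n])
}

func main() {
	fmt.Println(demo())
}
```

Note that the receiving side never issues an `open(2)` of its own: it reads
through a descriptor that was opened elsewhere, which is what allows the
sandbox's syscall filter to deny file-opening calls entirely.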
   154  
### System ABI, Side Channels and Other Vectors

gVisor relies on the host operating system and the platform for defense against
hardware-based attacks. Given the nature of these vulnerabilities, there is
little defense that gVisor can provide (there’s no guarantee that additional
hardware measures, such as virtualization, memory encryption, etc. would
actually decrease the attack surface). Note that this is true even when using
hardware virtualization for acceleration, as the host kernel or hypervisor is
ultimately responsible for defending against attacks from within malicious
guests.

gVisor similarly relies on the host resource mechanisms (cgroups) for defense
against resource exhaustion and denial of service attacks. Network policy
controls should be applied at the container level to ensure appropriate network
policy enforcement. Note that the sandbox itself is not capable of altering or
configuring these mechanisms, and the sandbox should make it harder for an
attacker to exploit or override these controls through other means.
   172  
## Principles: Defense-in-Depth

For gVisor development, there are several engineering principles that are
employed in order to ensure that the system meets its design goals.

1.  No system call is passed through directly to the host. Every supported call
    has an independent implementation in the Sentry that is unlikely to suffer
    from identical vulnerabilities that may appear in the host. This has the
    consequence that all kernel features used by applications require an
    implementation within the Sentry.
1.  Only common, universal functionality is implemented. Some filesystems,
    network devices or modules may expose specialized functionality to user
    space applications via mechanisms such as extended attributes, raw sockets
    or ioctls. Since the Sentry is responsible for implementing the full system
    call surface, we do not implement or pass through these specialized APIs.
1.  The host surface exposed to the Sentry is minimized. While the system call
    surface is not trivial, it is explicitly enumerated and controlled. The
    Sentry is not permitted to open new files, create new sockets or do many
    other interesting things on the host.
   192  
Additionally, we have practical restrictions that are imposed on the project to
minimize the risk of Sentry exploitability. For example:

1.  Unsafe code is carefully controlled. All unsafe code is isolated in files
    that end with "unsafe.go", in order to facilitate validation and auditing.
    No file without the unsafe suffix may import the unsafe package.
1.  No CGo is allowed. The Sentry must be a pure Go binary.
1.  External imports are not generally allowed within the core packages. Only
    limited external imports are used within the setup code. The code available
    inside the Sentry is carefully controlled, to ensure that the above rules
    are effective.
   204  
Finally, we recognize that security is a process, and that vigilance is
critical. Beyond our security disclosure process, the Sentry is fuzzed
continuously to identify potential bugs and races proactively, and production
crashes are recorded and triaged to similarly identify material issues.
   209  
## FAQ

### Is this more or less secure than a Virtual Machine?

The security of a VM depends to a large extent on what is exposed from the host
kernel and userspace support code. For example, device emulation code in the
host kernel (e.g. APIC) or optimizations (e.g. vhost) can be more complex than a
simple system call, and exploits carry the same risks. Similarly, the userspace
support code is frequently unsandboxed, and exploits, while rare, may allow
unfettered access to the system.

Some platforms leverage the same virtualization hardware as VMs in order to
provide better system call interception performance. However, gVisor does not
implement any device emulation, and instead opts to use a sandboxed host System
API directly. Both approaches significantly reduce the original attack surface.
Ultimately, since gVisor is capable of using the same hardware mechanism, one
should not assume that the mere use of virtualization hardware makes a system
more or less secure, just as it would be a mistake to make the claim that the
use of a unibody alone makes a car safe.
   229  
### Does this stop hardware side channels?

In general, gVisor does not provide protection against hardware side channels,
although it may make exploits that rely on direct access to the host System API
more difficult to use. To minimize exposure, you should follow relevant guidance
from vendors and keep your host kernel and firmware up-to-date.
   236  
### Is this just a ptrace sandbox?

No: the term “ptrace sandbox” generally refers to software that uses the Linux
ptrace facility to inspect and authorize system calls made by applications,
enforcing a specific policy. These commonly suffer from two issues. First,
vulnerable system calls may be authorized by the sandbox, as the application
still has direct access to some System API. Second, it’s impossible to avoid
time-of-check, time-of-use race conditions without disabling multi-threading.

In gVisor, the platforms that use ptrace operate differently. The stubs that are
traced are never allowed to continue execution into the host kernel and complete
a call directly. Instead, all system calls are interpreted and handled by the
Sentry itself, which reflects the resulting register state back into the tracee
before continuing execution in userspace. This is very similar to the mechanism
used by User-Mode Linux (UML).
   252  
[dirtycow]: https://en.wikipedia.org/wiki/Dirty_COW
[clang]: https://en.wikipedia.org/wiki/C_(programming_language)
[popss]: https://nvd.nist.gov/vuln/detail/CVE-2018-8897