# What is gVisor?

gVisor is an application kernel, written in Go, that implements a substantial
portion of the [Linux system call interface][linux]. It provides an additional
layer of isolation between running applications and the host operating system.

gVisor includes an [Open Container Initiative (OCI)][oci] runtime called `runsc`
that makes it easy to work with existing container tooling. The `runsc` runtime
integrates with Docker and Kubernetes, making it simple to run sandboxed
containers.

gVisor can be used with Docker, Kubernetes, or directly using `runsc`. Use the
links below to see detailed instructions for each of them:

*   [Docker](./user_guide/quick_start/docker.md): The quickest and easiest way
    to get started.
*   [Kubernetes](./user_guide/quick_start/kubernetes.md): Isolate Pods in your
    K8s cluster with gVisor.
*   [OCI Quick Start](./user_guide/quick_start/oci.md): Expert mode. Customize
    gVisor for your environment.

## What does gVisor do?

gVisor provides a virtualized environment in order to sandbox containers. The
system interfaces normally implemented by the host kernel are moved into a
distinct, per-sandbox application kernel in order to minimize the risk of a
container escape exploit. However, gVisor does not introduce large fixed
overheads, and it retains a process-like model with respect to resource
utilization.

## How is this different?

Two other approaches are commonly taken to provide stronger isolation than
native containers.

**Machine-level virtualization**, such as [KVM][kvm] and [Xen][xen], exposes
virtualized hardware to a guest kernel via a Virtual Machine Monitor (VMM). This
virtualized hardware is generally enlightened (paravirtualized) and additional
mechanisms can be used to improve the visibility between the guest and host
(e.g. balloon drivers, paravirtualized spinlocks). Running containers in
distinct virtual machines can provide great isolation, compatibility and
performance (though nested virtualization may bring challenges in this area),
but for containers it often requires additional proxies and agents, and may
bring a larger resource footprint and slower start-up times.

![Machine-level virtualization](Machine-Virtualization.png "Machine-level virtualization")

**Rule-based execution**, such as [seccomp][seccomp], [SELinux][selinux] and
[AppArmor][apparmor], allows the specification of a fine-grained security policy
for an application or container. These schemes typically rely on hooks
implemented inside the host kernel to enforce the rules. If the surface can be
made small enough, then this is an excellent way to sandbox applications and
maintain native performance. However, in practice it can be extremely difficult
(if not impossible) to reliably define a policy for arbitrary, previously
unknown applications, making this approach challenging to apply universally.

![Rule-based execution](Rule-Based-Execution.png "Rule-based execution")

Rule-based execution is often combined with additional layers for
defense-in-depth.

**gVisor** provides a third isolation mechanism, distinct from those above.

gVisor intercepts application system calls and acts as the guest kernel, without
the need for translation through virtualized hardware. gVisor may be thought of
as either a merged guest kernel and VMM, or as seccomp on steroids. This
architecture allows it to provide a flexible resource footprint (i.e. one based
on threads and memory mappings, not fixed guest physical resources) while also
lowering the fixed costs of virtualization. However, this comes at the price of
reduced application compatibility and higher per-system-call overhead.

![gVisor](Layers.png "gVisor")

On top of this, gVisor employs rule-based execution to provide defense-in-depth
(details below).

gVisor's approach is similar to [User Mode Linux (UML)][uml], although UML
virtualizes hardware internally and thus provides a fixed resource footprint.

Each of the above approaches may excel in distinct scenarios. For example,
machine-level virtualization will face challenges achieving high density, while
gVisor may provide poor performance for system-call-heavy workloads.

## Why Go?

gVisor is written in [Go][golang] in order to avoid security pitfalls that can
plague kernels. With Go, there are strong types, built-in bounds checks, no
uninitialized variables, no use-after-free, no stack overflow, and a built-in
race detector. However, the use of Go has its challenges, and the runtime often
introduces performance overhead.

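Two of the properties listed above, bounds checks and the absence of
uninitialized variables, can be seen in a few lines of ordinary Go. The sketch
below is illustrative only and is not taken from gVisor; `readAt` and `region`
are hypothetical names.

```go
package main

import "fmt"

// readAt returns buf[i]. The compiler-inserted bounds check panics on an
// out-of-range index instead of reading adjacent memory; the deferred
// recover turns that panic into an ordinary error.
func readAt(buf []byte, i int) (b byte, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	return buf[i], nil
}

// region is a hypothetical record type used only for this example.
type region struct {
	start, length uint64
	writable      bool
}

func main() {
	buf := make([]byte, 4)
	if _, err := readAt(buf, 8); err != nil {
		fmt.Println("out-of-range read rejected:", err)
	}

	// Declared without an initializer, r still has well-defined contents:
	// every field holds its zero value.
	var r region
	fmt.Printf("%+v\n", r)
}
```

The race detector mentioned above is enabled separately, with the `-race`
flag of `go build`, `go run`, or `go test`.
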
## What are the different components?

A gVisor sandbox consists of multiple processes. These processes collectively
comprise an environment in which one or more containers can be run.

Each sandbox has its own isolated instance of:

*   The **Sentry**, which is a kernel that runs the containers and intercepts
    and responds to system calls made by the application.

Each container running in the sandbox has its own isolated instance of:

*   A **Gofer**, which provides file system access to the containers.

![gVisor architecture diagram](Sentry-Gofer.png "gVisor architecture diagram")

## What is runsc?

The entrypoint to running a sandboxed container is the `runsc` executable.
`runsc` implements the [Open Container Initiative (OCI)][oci] runtime
specification, which is used by Docker and Kubernetes. This means that
OCI-compatible _filesystem bundles_ can be run by `runsc`. A filesystem bundle
consists of a `config.json` file containing container configuration, and a root
filesystem for the container. Please see the [OCI runtime spec][runtime-spec]
for more information on filesystem bundles. `runsc` implements multiple commands
that perform various functions such as starting, stopping, listing, and querying
the status of containers.

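To make the bundle layout described above concrete, the following sketch
writes a directory containing a `config.json` and an empty `rootfs`
directory. It is a minimal illustration, not a complete or validated OCI
configuration: only the `ociVersion`, `process.args`, `process.cwd`,
`root.path`, and `root.readonly` fields are set, the Go type and function
names are hypothetical, and the version string and `/bin/true` command are
assumed values. Because the root filesystem is left empty, the result shows
the shape of a bundle rather than something that would actually run.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// The types below model a small subset of the OCI runtime configuration.
type process struct {
	Args []string `json:"args"`
	Cwd  string   `json:"cwd"`
}

type root struct {
	Path     string `json:"path"`
	Readonly bool   `json:"readonly"`
}

type spec struct {
	Version string  `json:"ociVersion"`
	Process process `json:"process"`
	Root    root    `json:"root"`
}

// minimalSpec builds a configuration that runs args from "/" with a
// read-only root filesystem located at <bundle>/rootfs.
func minimalSpec(args []string) spec {
	return spec{
		Version: "1.0.0", // assumed spec version for this example
		Process: process{Args: args, Cwd: "/"},
		Root:    root{Path: "rootfs", Readonly: true},
	}
}

// encode renders the configuration as indented JSON.
func encode(s spec) ([]byte, error) {
	return json.MarshalIndent(s, "", "  ")
}

// writeBundle creates dir/rootfs and dir/config.json.
func writeBundle(dir string, s spec) error {
	if err := os.MkdirAll(filepath.Join(dir, "rootfs"), 0o755); err != nil {
		return err
	}
	data, err := encode(s)
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, "config.json"), data, 0o644)
}

func main() {
	dir, err := os.MkdirTemp("", "bundle-")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer os.RemoveAll(dir)

	if err := writeBundle(dir, minimalSpec([]string{"/bin/true"})); err != nil {
		fmt.Println("error:", err)
		return
	}
	data, err := os.ReadFile(filepath.Join(dir, "config.json"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%s\n", data)
}
```
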
### Sentry {#sentry}

The Sentry is the largest component of gVisor. It can be thought of as an
application kernel. The Sentry implements all the kernel functionality needed by
the application, including: system calls, signal delivery, memory management and
page faulting logic, the threading model, and more.

When the application makes a system call, the
[Platform](./architecture_guide/platforms.md) redirects the call to the Sentry,
which will do the necessary work to service it. It is important to note that the
Sentry does not pass system calls through to the host kernel. As a userspace
application, the Sentry will make some host system calls to support its
operation, but it does not allow the application to directly control the system
calls it makes. For example, the Sentry is not able to open files directly; file
system operations that extend beyond the sandbox (not internal `/proc` files,
pipes, etc.) are sent to the Gofer, described below.

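The flow described above can be pictured with a toy model: intercepted calls
are looked up in a table and serviced by the kernel's own code, calls that are
not implemented fail rather than falling through to the host, and file opens
are delegated to a separate file server. This is a conceptual sketch only. It
does not reflect the Sentry's real data structures or the Platform mechanism,
and every name in it (`kernel`, `handler`, `fileServer`, `mapGofer`,
`errNoSys`) is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoSys stands in for the error returned for an unimplemented call.
var errNoSys = errors.New("function not implemented")

// fileServer is a stand-in for the component that owns file system access.
type fileServer interface {
	Open(path string) (string, error)
}

// handler services one call using only the toy kernel's own state.
type handler func(k *kernel, arg string) (string, error)

type kernel struct {
	table map[string]handler
	gofer fileServer
	pid   int
}

func newKernel(g fileServer) *kernel {
	k := &kernel{gofer: g, pid: 1}
	k.table = map[string]handler{
		// Answered entirely from in-sandbox state.
		"getpid": func(k *kernel, _ string) (string, error) {
			return fmt.Sprint(k.pid), nil
		},
		// Not opened directly: the request is handed to the file server.
		"open": func(k *kernel, path string) (string, error) {
			return k.gofer.Open(path)
		},
	}
	return k
}

// dispatch services an intercepted call. A call missing from the table is
// rejected; it is never forwarded anywhere else.
func (k *kernel) dispatch(name, arg string) (string, error) {
	h, ok := k.table[name]
	if !ok {
		return "", errNoSys
	}
	return h(k, arg)
}

// mapGofer is an in-memory file server used to exercise the model.
type mapGofer map[string]string

func (m mapGofer) Open(path string) (string, error) {
	if contents, ok := m[path]; ok {
		return contents, nil
	}
	return "", errors.New("no such file")
}

func main() {
	k := newKernel(mapGofer{"/etc/hostname": "sandbox"})
	for _, call := range [][2]string{
		{"getpid", ""},
		{"open", "/etc/hostname"},
		{"ptrace", ""},
	} {
		out, err := k.dispatch(call[0], call[1])
		fmt.Printf("%s(%q) = %q, %v\n", call[0], call[1], out, err)
	}
}
```
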
### Gofer {#gofer}

The Gofer is a standard host process that is started with each container and
communicates with the Sentry via the [9P protocol][9p] over a socket or shared
memory channel. The Sentry process is started in a restricted seccomp container
without access to file system resources. The Gofer mediates all access to
these resources, providing an additional level of isolation.

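The mediation idea can be sketched as a small request/reply loop: one side
holds no file system access and can only send path requests over a channel,
while the other side decides which host path each request maps to. The sketch
below is an assumption-laden illustration, not gVisor's Gofer. It does not
speak 9P, a Go channel stands in for the socket, the confinement shown is
purely lexical (it ignores symlinks), and the names `request`, `resolve`,
`serve`, and the `/srv/rootfs` root are hypothetical.

```go
package main

import (
	"fmt"
	"path"
)

// request is a toy message: a path plus a channel for the reply.
type request struct {
	path  string
	reply chan string
}

// resolve maps a requested path to a location under root. Cleaning the path
// as if it were rooted at "/" removes any leading ".." elements, so the
// result always stays below root.
func resolve(root, p string) string {
	return path.Join(root, path.Clean("/"+p))
}

// serve plays the mediator: it answers each request with the confined path.
func serve(root string, reqs <-chan request) {
	for r := range reqs {
		r.reply <- resolve(root, r.path)
	}
}

func main() {
	reqs := make(chan request)
	go serve("/srv/rootfs", reqs)

	for _, p := range []string{"etc/hosts", "../../etc/passwd"} {
		reply := make(chan string)
		reqs <- request{path: p, reply: reply}
		fmt.Println(p, "->", <-reply)
	}
	close(reqs)
}
```
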
### Application {#application}

The application is a normal Linux binary provided to gVisor in an OCI runtime
bundle. gVisor aims to provide an environment equivalent to Linux v4.4, so
applications should be able to run unmodified. However, gVisor does not
presently implement every system call, `/proc` file, or `/sys` file, so some
incompatibilities may occur. See [Compatibility](./user_guide/compatibility.md)
for more information.

[9p]: https://en.wikipedia.org/wiki/9P_(protocol)
[apparmor]: https://wiki.ubuntu.com/AppArmor
[golang]: https://golang.org
[kvm]: https://www.linux-kvm.org
[linux]: https://en.wikipedia.org/wiki/Linux_kernel_interfaces
[oci]: https://www.opencontainers.org
[runtime-spec]: https://github.com/opencontainers/runtime-spec
[seccomp]: https://www.kernel.org/doc/Documentation/prctl/seccomp_filter.txt
[selinux]: https://selinuxproject.org
[uml]: http://user-mode-linux.sourceforge.net/
[xen]: https://www.xenproject.org