# Resource Model

[TOC]

The resource model for gVisor does not assume a fixed number of threads of
execution (i.e. vCPUs) or amount of physical memory. Where possible, decisions
about underlying physical resources are delegated to the host system, where
optimizations can be made with global information. This delegation allows the
sandbox to be highly dynamic in terms of resource usage: spanning a large
number of cores and a large amount of memory when busy, and yielding those
resources back to the host when not.

In other words, the shape of the sandbox should closely track the shape of the
sandboxed process:

![Resource model](resources.png "Workloads of different shapes.")

## Processes

Much like a Virtual Machine (VM), a gVisor sandbox appears as an opaque process
on the system. Processes within the sandbox do not manifest as processes on the
host system, and process-level interactions within the sandbox require entering
the sandbox (e.g. via a [Docker exec][exec]).

## Networking

The sandbox attaches a network endpoint to the system, but runs its own network
stack. All network resources, other than packets in flight on the host, exist
only inside the sandbox, bound by relevant resource limits.

You can interact with network endpoints exposed by the sandbox, just as you
would any other container, but network introspection similarly requires
entering the sandbox.

## Files

Files in the sandbox may be backed by different implementations. For host-native
files (where a file descriptor is available), the Gofer may return a file
descriptor to the Sentry via [SCM_RIGHTS][scmrights][^1].

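As an illustration, the sketch below (not gVisor's actual Gofer protocol code)
shows one way to receive a file descriptor over a Unix domain socket via
`SCM_RIGHTS` using `golang.org/x/sys/unix`; the socket setup and the `recvFD`
helper name are assumptions made for the example.

```go
package fdpass

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// recvFD reads one message from a connected Unix domain socket and extracts
// a single file descriptor passed in an SCM_RIGHTS control message.
func recvFD(sockFD int) (*os.File, error) {
	buf := make([]byte, 1)                 // payload byte (ignored here)
	oob := make([]byte, unix.CmsgSpace(4)) // room for one 32-bit descriptor
	_, oobn, _, _, err := unix.Recvmsg(sockFD, buf, oob, 0)
	if err != nil {
		return nil, err
	}
	msgs, err := unix.ParseSocketControlMessage(oob[:oobn])
	if err != nil || len(msgs) == 0 {
		return nil, fmt.Errorf("no control message: %v", err)
	}
	fds, err := unix.ParseUnixRights(&msgs[0])
	if err != nil || len(fds) == 0 {
		return nil, fmt.Errorf("no file descriptor in message: %v", err)
	}
	return os.NewFile(uintptr(fds[0]), "gofer-provided"), nil
}
```
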
These files may be read from and written to through standard system calls, and
also mapped into the associated application's address space. This allows the
same host memory to be shared across multiple sandboxes, although this mechanism
does not preclude the use of side-channels (see [Security Model](./security.md)).

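To make the mapping path concrete, here is a small sketch (the file path and
size are illustrative) of an application mapping a file with `MAP_SHARED`, so
that its loads and stores go through the same host memory as any other mapping
of that file:

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// Illustrative path; the file is assumed to exist and be at least 4 KiB.
	fd, err := unix.Open("/tmp/shared.dat", unix.O_RDWR, 0)
	if err != nil {
		panic(err)
	}
	defer unix.Close(fd)

	// A MAP_SHARED mapping: loads and stores hit the same host pages as
	// every other mapping of this file, in this sandbox or another.
	data, err := unix.Mmap(fd, 0, 4096, unix.PROT_READ|unix.PROT_WRITE,
		unix.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer unix.Munmap(data)

	data[0] = 42 // plain memory access, no read/write system calls
	fmt.Println("first byte:", data[0])
}
```
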
Note that some file systems exist only within the context of the sandbox. For
example, in many cases a `tmpfs` mount will be available at `/tmp` or
`/dev/shm`, which allocates memory directly from the sandbox memory file (see
below). Ultimately, these will be accounted against relevant limits in a
similar way to the host-native case.

## Threads

The Sentry models individual task threads with [goroutines][goroutine]. As a
result, each task thread is a lightweight [green thread][greenthread], and may
not correspond to an underlying host thread.

However, application execution is modelled as a blocking system call with the
Sentry. This means that additional host threads may be created, *depending on
the number of active application threads*. In practice, the number of host
threads used by a busy application will converge on its number of active
threads, and the host will be able to make scheduling decisions about all
application threads.

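The sketch below (ordinary Go, not Sentry code) illustrates the effect
described above: goroutines that block in raw system calls cause the Go
runtime to create additional host threads, so the host thread count tracks the
number of *active* tasks rather than the total number of goroutines.

```go
package main

import (
	"fmt"
	"runtime/pprof"
	"time"

	"golang.org/x/sys/unix"
)

func main() {
	threads := func() int { return pprof.Lookup("threadcreate").Count() }
	fmt.Println("host threads created so far:", threads())

	// 64 "tasks", each blocked in a raw system call. Running application
	// code looks much the same to the Sentry: one blocking call per active
	// task thread.
	for i := 0; i < 64; i++ {
		go unix.Nanosleep(&unix.Timespec{Sec: 2}, nil)
	}
	time.Sleep(time.Second)
	fmt.Println("host threads while tasks are active:", threads())
}
```
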
## Time

Time in the sandbox is provided by the Sentry, through its own [vDSO][vdso] and
time-keeping implementation. This is distinct from the host time, and no state
is shared with the host, although the time will be initialized with the host
clock.

The Sentry runs timers to note the passage of time, much like a kernel running
on hardware (though the timers are software timers, in this case). These timers
provide updates to the vDSO, the time returned through system calls, and the
time recorded for usage or limit tracking (e.g. [RLIMIT_CPU][rlimit]).

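For example, an application inside the sandbox might read its CPU usage and
its `RLIMIT_CPU` ceiling as sketched below; both values are served by the
Sentry's time and usage tracking rather than read directly from the host
(these are ordinary Linux APIs, nothing gVisor-specific):

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// CPU time consumed so far, as accounted by the kernel (here: the Sentry).
	var usage unix.Rusage
	if err := unix.Getrusage(unix.RUSAGE_SELF, &usage); err != nil {
		panic(err)
	}

	// The RLIMIT_CPU ceiling that the same accounting is checked against.
	var limit unix.Rlimit
	if err := unix.Getrlimit(unix.RLIMIT_CPU, &limit); err != nil {
		panic(err)
	}

	fmt.Printf("cpu: %ds user, %ds system, RLIMIT_CPU (soft): %d\n",
		usage.Utime.Sec, usage.Stime.Sec, limit.Cur)
}
```
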
When all application threads are idle, the Sentry disables timers until an event
occurs that wakes either the Sentry or an application thread, similar to a
[tickless kernel][tickless]. This allows the Sentry to achieve near zero CPU
usage for idle applications.

## Memory

The Sentry implements its own memory management, including demand-paging and a
Sentry internal page cache for files that cannot be used natively. A single
[memfd][memfd] backs all application memory.

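The underlying primitive is sketched below: an anonymous `memfd` is created,
sized, and mapped, with pages allocated only as they are touched. This shows
only the host-level mechanism, not the Sentry's actual memory-file management,
and the name and size are illustrative.

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// Anonymous, memory-backed file descriptor.
	fd, err := unix.MemfdCreate("app-memory", unix.MFD_CLOEXEC)
	if err != nil {
		panic(err)
	}
	defer unix.Close(fd)

	const size = 1 << 20 // 1 MiB of application "RAM"
	if err := unix.Ftruncate(fd, size); err != nil {
		panic(err)
	}

	mem, err := unix.Mmap(fd, 0, size, unix.PROT_READ|unix.PROT_WRITE,
		unix.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer unix.Munmap(mem)

	mem[0] = 1 // first touch faults the page in; usage is charged to the memfd
	fmt.Println("mapped", len(mem), "bytes backed by memfd", fd)
}
```
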
### Address spaces

The creation of address spaces is platform-specific. For some platforms,
additional "stub" processes may be created on the host in order to support
additional address spaces. These stubs are subject to various limits applied at
the sandbox level (e.g. PID limits).

### Physical memory

The host is able to manage physical memory using regular means (e.g. tracking
working sets, reclaiming and swapping under pressure). The Sentry lazily
populates host mappings for applications, and allows the host to demand-page
those regions, which is critical for the functioning of those mechanisms.

In order to avoid excessive overhead, the Sentry does not demand-page individual
pages. Instead, it selects appropriate regions based on heuristics. There is a
trade-off here: the Sentry is unable to trivially determine which pages are
active and which are not. Even if pages were individually faulted, the host may
select pages to be reclaimed or swapped without the Sentry's knowledge.

Therefore, memory usage statistics within the sandbox (e.g. via `proc`) are
approximations. The Sentry maintains an internal breakdown of memory usage, and
can collect accurate information but only through a relatively expensive API
call. In any case, it would likely be considered unwise to share precise
information about how the host is managing memory with the sandbox.

Finally, when an application marks a region of memory as no longer needed, for
example via a call to [madvise][madvise], the Sentry *releases this memory back
to the host*. There can be performance penalties for this, since it may be
cheaper in many cases to retain the memory and use it to satisfy some other
request. However, releasing it immediately to the host allows the host to more
effectively multiplex resources and apply an efficient global policy.

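A minimal sketch of the application side of this: an anonymous region is
touched and then marked disposable with `MADV_DONTNEED`, at which point the
backing memory can be freed immediately (plain Linux usage, not Sentry
internals):

```go
package main

import (
	"golang.org/x/sys/unix"
)

func main() {
	const size = 1 << 20
	buf, err := unix.Mmap(-1, 0, size, unix.PROT_READ|unix.PROT_WRITE,
		unix.MAP_PRIVATE|unix.MAP_ANONYMOUS)
	if err != nil {
		panic(err)
	}
	for i := range buf {
		buf[i] = 0xff // touch every page so memory is actually committed
	}

	// Tell the kernel (here, the Sentry) that the contents are no longer
	// needed; the backing pages can be released right away.
	if err := unix.Madvise(buf, unix.MADV_DONTNEED); err != nil {
		panic(err)
	}
	_ = unix.Munmap(buf)
}
```
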
## Limits

All Sentry threads and Sentry memory are subject to a container cgroup. However,
application usage will not appear as anonymous memory usage, and will instead be
accounted to the `memfd`. All anonymous memory will correspond to Sentry usage,
and host memory charged to the container will work as standard.

The cgroups can be monitored for standard signals: pressure indicators,
threshold notifiers, etc., and can also be adjusted dynamically. Note that the
Sentry itself may listen for pressure signals in its containing cgroup, in order
to purge internal caches.

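As an example of host-side monitoring, the sketch below reads the cgroup v2
pressure and event files for a sandbox and adjusts its memory limit; the
cgroup path is an assumption and will differ per runtime and environment.

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	cg := "/sys/fs/cgroup/system.slice/my-sandbox" // illustrative path

	// PSI pressure indicators: "some"/"full" stall percentages.
	pressure, err := os.ReadFile(cg + "/memory.pressure")
	if err != nil {
		panic(err)
	}

	// Event counters, including how often the memory limit was hit.
	events, err := os.ReadFile(cg + "/memory.events")
	if err != nil {
		panic(err)
	}
	fmt.Printf("memory.pressure:\n%s\nmemory.events:\n%s", pressure, events)

	// Limits can also be adjusted dynamically by writing memory.max.
	_ = os.WriteFile(cg+"/memory.max", []byte("1073741824"), 0644)
}
```
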
[goroutine]: https://tour.golang.org/concurrency/1
[greenthread]: https://en.wikipedia.org/wiki/Green_threads
[scheduler]: https://morsmachine.dk/go-scheduler
[vdso]: https://en.wikipedia.org/wiki/VDSO
[rlimit]: http://man7.org/linux/man-pages/man2/getrlimit.2.html
[tickless]: https://en.wikipedia.org/wiki/Tickless_kernel
[memfd]: http://man7.org/linux/man-pages/man2/memfd_create.2.html
[scmrights]: http://man7.org/linux/man-pages/man7/unix.7.html
[madvise]: http://man7.org/linux/man-pages/man2/madvise.2.html
[exec]: https://docs.docker.com/engine/reference/commandline/exec/
[^1]: Unless host networking is enabled, the Sentry is not able to create or
    open host file descriptors itself; it can only receive them in this way
    from the Gofer.