# Resource Model

[TOC]

The resource model for gVisor does not assume a fixed number of threads of
execution (i.e. vCPUs) or amount of physical memory. Where possible, decisions
about underlying physical resources are delegated to the host system, where
optimizations can be made with global information. This delegation allows the
sandbox to be highly dynamic in terms of resource usage: spanning a large number
of cores and a large amount of memory when busy, and yielding those resources
back to the host when not.

In other words, the shape of the sandbox should closely track the shape of the
sandboxed process:

![Resource model](resources.png "Workloads of different shapes.")

## Processes

Much like a Virtual Machine (VM), a gVisor sandbox appears as an opaque process
on the system. Processes within the sandbox do not manifest as processes on the
host system, and process-level interactions within the sandbox require entering
the sandbox (e.g. via a [Docker exec][exec]).

## Networking

The sandbox attaches a network endpoint to the system, but runs its own network
stack. All network resources, other than packets in flight on the host, exist
only inside the sandbox, bound by relevant resource limits.

You can interact with network endpoints exposed by the sandbox, just as you
would any other container, but network introspection similarly requires entering
the sandbox.

## Files

Files in the sandbox may be backed by different implementations. For host-native
files (where a file descriptor is available), the Gofer may return a file
descriptor to the Sentry via [SCM_RIGHTS][scmrights][^1].

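As an illustrative sketch (not the actual Gofer protocol), the snippet below
shows how a file descriptor can be donated across a Unix socket pair using
`SCM_RIGHTS`; this is the same kernel mechanism the Gofer uses to hand host
file descriptors to the Sentry. The socket setup and the file path are
hypothetical.

```go
// fdpass.go: minimal sketch of SCM_RIGHTS file-descriptor donation over a
// Unix socket pair. Paths and error handling are simplified for illustration.
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// A connected pair of Unix sockets standing in for the Gofer<->Sentry channel.
	fds, err := unix.Socketpair(unix.AF_UNIX, unix.SOCK_STREAM, 0)
	if err != nil {
		panic(err)
	}
	goferSide, sentrySide := fds[0], fds[1]

	// "Gofer" side: open a host file and donate its descriptor.
	f, err := os.Open("/etc/hostname") // hypothetical host-native file
	if err != nil {
		panic(err)
	}
	rights := unix.UnixRights(int(f.Fd()))
	if err := unix.Sendmsg(goferSide, []byte{0}, rights, nil, 0); err != nil {
		panic(err)
	}

	// "Sentry" side: receive the donated descriptor from the control message.
	buf := make([]byte, 1)
	oob := make([]byte, unix.CmsgSpace(4))
	_, oobn, _, _, err := unix.Recvmsg(sentrySide, buf, oob, 0)
	if err != nil {
		panic(err)
	}
	cmsgs, err := unix.ParseSocketControlMessage(oob[:oobn])
	if err != nil {
		panic(err)
	}
	receivedFDs, err := unix.ParseUnixRights(&cmsgs[0])
	if err != nil {
		panic(err)
	}
	fmt.Println("received donated host fd:", receivedFDs[0])
}
```
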
These files may be read from and written to through standard system calls, and
also mapped into the associated application's address space. This allows the
same host memory to be shared across multiple sandboxes, although this mechanism
does not preclude the use of side-channels (see
[Security Model](./security.md)).

Note that some file systems exist only within the context of the sandbox. For
example, in many cases a `tmpfs` mount will be available at `/tmp` or
`/dev/shm`, which allocates memory directly from the sandbox memory file (see
below). Ultimately, these will be accounted against relevant limits in a similar
way to the host-native case.

## Threads

The Sentry models individual task threads with [goroutines][goroutine]. As a
result, each task thread is a lightweight [green thread][greenthread], and may
not correspond to an underlying host thread.

However, application execution is modelled as a blocking system call within the
Sentry. This means that additional host threads may be created, *depending on
the number of active application threads*. In practice, the number of host
threads used by a busy application will converge on its number of active
application threads, and the host will be able to make scheduling decisions
about all application threads.

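This effect can be observed with ordinary Go: goroutines are cheap, but each
goroutine that blocks in a host system call temporarily occupies an OS thread,
so the host thread count tracks the number of concurrently blocked calls. The
sketch below is not Sentry code; it simply demonstrates the behavior by
blocking goroutines in reads on never-written pipes and reporting the process
thread count.

```go
// threads.go: sketch showing that goroutines blocked in host system calls
// consume host threads, while idle (parked) goroutines do not.
package main

import (
	"fmt"
	"os"
	"strings"
	"time"

	"golang.org/x/sys/unix"
)

// hostThreads returns the "Threads:" count from /proc/self/status (Linux only).
func hostThreads() string {
	data, _ := os.ReadFile("/proc/self/status")
	for _, line := range strings.Split(string(data), "\n") {
		if strings.HasPrefix(line, "Threads:") {
			return strings.TrimSpace(strings.TrimPrefix(line, "Threads:"))
		}
	}
	return "?"
}

func main() {
	fmt.Println("threads before:", hostThreads())

	// Block 64 goroutines in a host read() on pipes that are never written.
	// Each blocked call occupies an OS thread, so the thread count grows.
	for i := 0; i < 64; i++ {
		var p [2]int
		if err := unix.Pipe(p[:]); err != nil {
			panic(err)
		}
		go func(fd int) {
			buf := make([]byte, 1)
			unix.Read(fd, buf) // blocking host system call
		}(p[0])
	}

	time.Sleep(time.Second) // give the runtime time to spawn threads
	fmt.Println("threads while blocked:", hostThreads())
}
```
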
## Time

Time in the sandbox is provided by the Sentry, through its own [vDSO][vdso] and
time-keeping implementation. This is distinct from the host time, and no state
is shared with the host, although the time will be initialized with the host
clock.

The Sentry runs timers to note the passage of time, much like a kernel running
on hardware (though the timers are software timers, in this case). These timers
provide updates to the vDSO, the time returned through system calls, and the
time recorded for usage or limit tracking (e.g. [RLIMIT_CPU][rlimit]).

When all application threads are idle, the Sentry disables timers until an event
occurs that wakes either the Sentry or an application thread, similar to a
[tickless kernel][tickless]. This allows the Sentry to achieve near zero CPU
usage for idle applications.

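The tickless pattern itself is simple to sketch: a timer is armed only while a
deadline is pending, and the loop otherwise blocks until an event arrives. The
snippet below is a minimal illustration of that pattern, not the Sentry's
actual timer implementation; both channels are hypothetical stand-ins for
events and deadlines.

```go
// tickless.go: minimal sketch of a tickless event loop. A timer exists only
// while a deadline is pending; with no deadlines the loop blocks and consumes
// no CPU, analogous to the Sentry disabling timers when all threads are idle.
package main

import (
	"fmt"
	"time"
)

func run(events <-chan string, deadlines <-chan time.Time) {
	var timer *time.Timer
	var fire <-chan time.Time // nil channel blocks forever: the "tickless" state

	for {
		select {
		case ev, ok := <-events:
			if !ok {
				return
			}
			fmt.Println("woken by event:", ev)
		case d := <-deadlines:
			// Arm a timer only because a deadline now exists.
			if timer != nil {
				timer.Stop()
			}
			timer = time.NewTimer(time.Until(d))
			fire = timer.C
		case t := <-fire:
			fmt.Println("timer fired at:", t)
			fire = nil // back to the idle, timer-free state
		}
	}
}

func main() {
	events := make(chan string)
	deadlines := make(chan time.Time)
	go run(events, deadlines)

	deadlines <- time.Now().Add(50 * time.Millisecond)
	time.Sleep(100 * time.Millisecond)
	events <- "external wakeup"
	close(events)
	time.Sleep(10 * time.Millisecond)
}
```
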
## Memory

The Sentry implements its own memory management, including demand-paging and a
Sentry internal page cache for files that cannot be used natively. A single
[memfd][memfd] backs all application memory.

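For reference, a memfd is an anonymous, file-descriptor-backed region of memory
that can be sized, mapped, and shared like a regular file. The sketch below
creates one and maps it; this is roughly the primitive that backs application
memory, though the name and sizes here are purely illustrative.

```go
// memfile.go: sketch of creating and mapping a memfd, the primitive that
// backs all application memory in a gVisor sandbox. Names and sizes are
// illustrative only.
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// Create an anonymous memory file. It lives entirely in memory and is
	// accounted to this process (or its cgroup), not to any filesystem.
	fd, err := unix.MemfdCreate("illustrative-memory-file", unix.MFD_CLOEXEC)
	if err != nil {
		panic(err)
	}
	defer unix.Close(fd)

	const size = 1 << 20 // 1 MiB backing region
	if err := unix.Ftruncate(fd, size); err != nil {
		panic(err)
	}

	// Map the file into our address space; pages are allocated lazily as they
	// are touched, which is what lets the host demand-page the sandbox.
	mem, err := unix.Mmap(fd, 0, size, unix.PROT_READ|unix.PROT_WRITE, unix.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer unix.Munmap(mem)

	copy(mem, []byte("hello from the memory file"))
	fmt.Println(string(mem[:26]))
}
```
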
### Address spaces

The creation of address spaces is platform-specific. For some platforms,
additional "stub" processes may be created on the host in order to support
additional address spaces. These stubs are subject to various limits applied at
the sandbox level (e.g. PID limits).

### Physical memory

The host is able to manage physical memory using regular means (e.g. tracking
working sets, reclaiming and swapping under pressure). The Sentry lazily
populates host mappings for applications, and allows the host to demand-page
those regions, which is critical for the functioning of those mechanisms.

In order to avoid excessive overhead, the Sentry does not demand-page individual
pages. Instead, it selects appropriate regions based on heuristics. There is a
trade-off here: the Sentry is unable to trivially determine which pages are
active and which are not. Even if pages were individually faulted, the host may
select pages to be reclaimed or swapped without the Sentry's knowledge.

Therefore, memory usage statistics within the sandbox (e.g. via `proc`) are
approximations. The Sentry maintains an internal breakdown of memory usage, and
can collect accurate information, but only through a relatively expensive API
call. In any case, it would likely be considered unwise to share precise
information about how the host is managing memory with the sandbox.

Finally, when an application marks a region of memory as no longer needed, for
example via a call to [madvise][madvise], the Sentry *releases this memory back
to the host*. There can be performance penalties for this, since it may be
cheaper in many cases to retain the memory and use it to satisfy some other
request. However, releasing it immediately to the host allows the host to more
effectively multiplex resources and apply an efficient global policy.

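The application-side half of this flow is the ordinary `madvise` call. The
sketch below is a plain Linux program, not Sentry code: it maps an anonymous
region, touches it, and then tells the kernel the pages are no longer needed.
Inside a sandbox, the analogous hint is what prompts the Sentry to return
backing memory to the host.

```go
// release.go: sketch of an application hinting that memory is no longer
// needed via madvise(MADV_DONTNEED).
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	const size = 8 << 20 // 8 MiB scratch region; the size is illustrative

	// Map an anonymous, private region and touch every page so it is backed
	// by real memory.
	mem, err := unix.Mmap(-1, 0, size,
		unix.PROT_READ|unix.PROT_WRITE,
		unix.MAP_PRIVATE|unix.MAP_ANONYMOUS)
	if err != nil {
		panic(err)
	}
	defer unix.Munmap(mem)
	for i := 0; i < size; i += 4096 {
		mem[i] = 1
	}
	fmt.Println("region populated")

	// Tell the kernel the contents are no longer needed. The backing pages
	// can be freed immediately; the next access returns zero-filled pages.
	if err := unix.Madvise(mem, unix.MADV_DONTNEED); err != nil {
		panic(err)
	}
	fmt.Println("pages released, first byte is now:", mem[0])
}
```
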
## Limits

All Sentry threads and Sentry memory are subject to a container cgroup. However,
application usage will not appear as anonymous memory usage, and will instead be
accounted to the `memfd`. All anonymous memory will correspond to Sentry usage,
and host memory charged to the container will be accounted in the standard way.

The cgroups can be monitored for standard signals: pressure indicators,
threshold notifiers, etc., and can also be adjusted dynamically. Note that the
Sentry itself may listen for pressure signals in its containing cgroup, in order
to purge internal caches.

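Because the sandbox is just a process in a cgroup, the usual monitoring
interfaces apply. The sketch below reads current usage, the limit, and (on
cgroup v2) memory pressure for a container's cgroup; the cgroup path is
hypothetical and depends on how the runtime and host are configured.

```go
// cgwatch.go: sketch of monitoring a sandbox's memory via its cgroup (v2).
// The cgroup path is hypothetical; the runtime and host configuration
// determine where the sandbox's cgroup actually lives.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func readFile(dir, name string) string {
	data, err := os.ReadFile(filepath.Join(dir, name))
	if err != nil {
		return "(unavailable: " + err.Error() + ")"
	}
	return strings.TrimSpace(string(data))
}

func main() {
	// Hypothetical cgroup v2 directory for the sandbox's container.
	cg := "/sys/fs/cgroup/my-sandbox"

	// Both Sentry anonymous memory and application memory charged to the
	// memfd show up in the cgroup's totals.
	fmt.Println("memory.current:", readFile(cg, "memory.current"))
	fmt.Println("memory.max:    ", readFile(cg, "memory.max"))

	// PSI pressure information is one of the "standard signals" that can be
	// watched, and that the Sentry itself may react to by purging caches.
	fmt.Println("memory.pressure:")
	fmt.Println(readFile(cg, "memory.pressure"))
}
```
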
[goroutine]: https://tour.golang.org/concurrency/1
[greenthread]: https://en.wikipedia.org/wiki/Green_threads
[scheduler]: https://morsmachine.dk/go-scheduler
[vdso]: https://en.wikipedia.org/wiki/VDSO
[rlimit]: http://man7.org/linux/man-pages/man2/getrlimit.2.html
[tickless]: https://en.wikipedia.org/wiki/Tickless_kernel
[memfd]: http://man7.org/linux/man-pages/man2/memfd_create.2.html
[scmrights]: http://man7.org/linux/man-pages/man7/unix.7.html
[madvise]: http://man7.org/linux/man-pages/man2/madvise.2.html
[exec]: https://docs.docker.com/engine/reference/commandline/exec/
[^1]: Unless host networking is enabled, the Sentry is not able to create or
    open host file descriptors itself; it can only receive them in this way
    from the Gofer.