# Resource Model

[TOC]

The resource model for gVisor does not assume a fixed number of threads of
execution (i.e. vCPUs) or amount of physical memory. Where possible, decisions
about underlying physical resources are delegated to the host system, where
optimizations can be made with global information. This delegation allows the
sandbox to be highly dynamic in terms of resource usage: spanning a large number
of cores and a large amount of memory when busy, and yielding those resources
back to the host when not.

In other words, the shape of the sandbox should closely track the shape of the
sandboxed process:

![Resource model](resources.png "Workloads of different shapes.")

## Processes

Much like a Virtual Machine (VM), a gVisor sandbox appears as an opaque process
on the system. Processes within the sandbox do not manifest as processes on the
host system, and process-level interactions within the sandbox require entering
the sandbox (e.g. via a [Docker exec][exec]).

## Networking

The sandbox attaches a network endpoint to the system, but runs its own network
stack. All network resources, other than packets in flight on the host, exist
only inside the sandbox, bound by relevant resource limits.

You can interact with network endpoints exposed by the sandbox, just as you
would any other container, but network introspection similarly requires entering
the sandbox.

## Files

Files in the sandbox may be backed by different implementations. For host-native
files (where a file descriptor is available), the Gofer may return a file
descriptor to the Sentry via [SCM_RIGHTS][scmrights][^1].

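As an illustrative sketch (not the actual Gofer protocol), the snippet below
shows how a file descriptor can be donated across a Unix socket pair using
`SCM_RIGHTS`; this is the same kernel mechanism the Gofer uses to hand host
file descriptors to the Sentry. The socket setup and the file path are
hypothetical.

```go
// fdpass.go: minimal sketch of SCM_RIGHTS file-descriptor donation over a
// Unix socket pair. Paths and error handling are simplified for illustration.
package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// A connected pair of Unix sockets standing in for the Gofer<->Sentry channel.
	fds, err := unix.Socketpair(unix.AF_UNIX, unix.SOCK_STREAM, 0)
	if err != nil {
		panic(err)
	}
	goferSide, sentrySide := fds[0], fds[1]

	// "Gofer" side: open a host file and donate its descriptor.
	f, err := os.Open("/etc/hostname") // hypothetical host-native file
	if err != nil {
		panic(err)
	}
	rights := unix.UnixRights(int(f.Fd()))
	if err := unix.Sendmsg(goferSide, []byte{0}, rights, nil, 0); err != nil {
		panic(err)
	}

	// "Sentry" side: receive the donated descriptor from the control message.
	buf := make([]byte, 1)
	oob := make([]byte, unix.CmsgSpace(4))
	_, oobn, _, _, err := unix.Recvmsg(sentrySide, buf, oob, 0)
	if err != nil {
		panic(err)
	}
	cmsgs, err := unix.ParseSocketControlMessage(oob[:oobn])
	if err != nil {
		panic(err)
	}
	receivedFDs, err := unix.ParseUnixRights(&cmsgs[0])
	if err != nil {
		panic(err)
	}
	fmt.Println("received donated host fd:", receivedFDs[0])
}
```
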
These files may be read from and written to through standard system calls, and
also mapped into the associated application's address space. This allows the
same host memory to be shared across multiple sandboxes, although this mechanism
does not preclude the use of side-channels (see
[Security Model](./security.md)).

Note that some file systems exist only within the context of the sandbox. For
example, in many cases a `tmpfs` mount will be available at `/tmp` or
`/dev/shm`, which allocates memory directly from the sandbox memory file (see
below). Ultimately, these will be accounted against relevant limits in a similar
way to the host-native case.

## Threads

The Sentry models individual task threads with [goroutines][goroutine]. As a
result, each task thread is a lightweight [green thread][greenthread], and may
not correspond to an underlying host thread.

However, application execution is modelled as a blocking system call within the
Sentry. This means that additional host threads may be created, *depending on
the number of active application threads*. In practice, the number of host
threads used by a busy application will converge on its number of active
application threads, and the host will be able to make scheduling decisions
about all application threads.

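This effect can be observed with ordinary Go: goroutines are cheap, but each
goroutine that blocks in a host system call temporarily occupies an OS thread,
so the host thread count tracks the number of concurrently blocked calls. The
sketch below is not Sentry code; it simply demonstrates the behavior by
blocking goroutines in reads on never-written pipes and reporting the process
thread count.

```go
// threads.go: sketch showing that goroutines blocked in host system calls
// consume host threads, while idle (parked) goroutines do not.
package main

import (
	"fmt"
	"os"
	"strings"
	"time"

	"golang.org/x/sys/unix"
)

// hostThreads returns the "Threads:" count from /proc/self/status (Linux only).
func hostThreads() string {
	data, _ := os.ReadFile("/proc/self/status")
	for _, line := range strings.Split(string(data), "\n") {
		if strings.HasPrefix(line, "Threads:") {
			return strings.TrimSpace(strings.TrimPrefix(line, "Threads:"))
		}
	}
	return "?"
}

func main() {
	fmt.Println("threads before:", hostThreads())

	// Block 64 goroutines in a host read() on pipes that are never written.
	// Each blocked call occupies an OS thread, so the thread count grows.
	for i := 0; i < 64; i++ {
		var p [2]int
		if err := unix.Pipe(p[:]); err != nil {
			panic(err)
		}
		go func(fd int) {
			buf := make([]byte, 1)
			unix.Read(fd, buf) // blocking host system call
		}(p[0])
	}

	time.Sleep(time.Second) // give the runtime time to spawn threads
	fmt.Println("threads while blocked:", hostThreads())
}
```
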
## Time

Time in the sandbox is provided by the Sentry, through its own [vDSO][vdso] and
time-keeping implementation. This is distinct from the host time, and no state
is shared with the host, although the time will be initialized with the host
clock.

The Sentry runs timers to note the passage of time, much like a kernel running
on hardware (though the timers are software timers, in this case). These timers
provide updates to the vDSO, the time returned through system calls, and the
time recorded for usage or limit tracking (e.g. [RLIMIT_CPU][rlimit]).

When all application threads are idle, the Sentry disables timers until an event
occurs that wakes either the Sentry or an application thread, similar to a
[tickless kernel][tickless]. This allows the Sentry to achieve near zero CPU
usage for idle applications.

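The tickless pattern itself is simple to sketch: a timer is armed only while a
deadline is pending, and the loop otherwise blocks until an event arrives. The
snippet below is a minimal illustration of that pattern, not the Sentry's
actual timer implementation; both channels are hypothetical stand-ins for
events and deadlines.

```go
// tickless.go: minimal sketch of a tickless event loop. A timer exists only
// while a deadline is pending; with no deadlines the loop blocks and consumes
// no CPU, analogous to the Sentry disabling timers when all threads are idle.
package main

import (
	"fmt"
	"time"
)

func run(events <-chan string, deadlines <-chan time.Time) {
	var timer *time.Timer
	var fire <-chan time.Time // nil channel blocks forever: the "tickless" state

	for {
		select {
		case ev, ok := <-events:
			if !ok {
				return
			}
			fmt.Println("woken by event:", ev)
		case d := <-deadlines:
			// Arm a timer only because a deadline now exists.
			if timer != nil {
				timer.Stop()
			}
			timer = time.NewTimer(time.Until(d))
			fire = timer.C
		case t := <-fire:
			fmt.Println("timer fired at:", t)
			fire = nil // back to the idle, timer-free state
		}
	}
}

func main() {
	events := make(chan string)
	deadlines := make(chan time.Time)
	go run(events, deadlines)

	deadlines <- time.Now().Add(50 * time.Millisecond)
	time.Sleep(100 * time.Millisecond)
	events <- "external wakeup"
	close(events)
	time.Sleep(10 * time.Millisecond)
}
```
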
## Memory

The Sentry implements its own memory management, including demand-paging and a
Sentry internal page cache for files that cannot be used natively. A single
[memfd][memfd] backs all application memory.

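For reference, a memfd is an anonymous, file-descriptor-backed region of memory
that can be sized, mapped, and shared like a regular file. The sketch below
creates one and maps it; this is roughly the primitive that backs application
memory, though the name and sizes here are purely illustrative.

```go
// memfile.go: sketch of creating and mapping a memfd, the primitive that
// backs all application memory in a gVisor sandbox. Names and sizes are
// illustrative only.
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	// Create an anonymous memory file. It lives entirely in memory and is
	// accounted to this process (or its cgroup), not to any filesystem.
	fd, err := unix.MemfdCreate("illustrative-memory-file", unix.MFD_CLOEXEC)
	if err != nil {
		panic(err)
	}
	defer unix.Close(fd)

	const size = 1 << 20 // 1 MiB backing region
	if err := unix.Ftruncate(fd, size); err != nil {
		panic(err)
	}

	// Map the file into our address space; pages are allocated lazily as they
	// are touched, which is what lets the host demand-page the sandbox.
	mem, err := unix.Mmap(fd, 0, size, unix.PROT_READ|unix.PROT_WRITE, unix.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer unix.Munmap(mem)

	copy(mem, []byte("hello from the memory file"))
	fmt.Println(string(mem[:26]))
}
```
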
### Address spaces

The creation of address spaces is platform-specific. For some platforms,
additional "stub" processes may be created on the host in order to support
additional address spaces. These stubs are subject to various limits applied at
the sandbox level (e.g. PID limits).

### Physical memory

The host is able to manage physical memory using regular means (e.g. tracking
working sets, reclaiming and swapping under pressure). The Sentry lazily
populates host mappings for applications, and allows the host to demand-page
those regions, which is critical for the functioning of those mechanisms.

In order to avoid excessive overhead, the Sentry does not demand-page individual
pages. Instead, it selects appropriate regions based on heuristics. There is a
trade-off here: the Sentry is unable to trivially determine which pages are
active and which are not. Even if pages were individually faulted, the host may
select pages to be reclaimed or swapped without the Sentry's knowledge.

Therefore, memory usage statistics within the sandbox (e.g. via `proc`) are
approximations. The Sentry maintains an internal breakdown of memory usage, and
can collect accurate information, but only through a relatively expensive API
call. In any case, it would likely be considered unwise to share precise
information about how the host is managing memory with the sandbox.

Finally, when an application marks a region of memory as no longer needed, for
example via a call to [madvise][madvise], the Sentry *releases this memory back
to the host*. There can be performance penalties for this, since it may be
cheaper in many cases to retain the memory and use it to satisfy some other
request. However, releasing it immediately to the host allows the host to more
effectively multiplex resources and apply an efficient global policy.

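The application-side half of this flow is the ordinary `madvise` call. The
sketch below is a plain Linux program, not Sentry code: it maps an anonymous
region, touches it, and then tells the kernel the pages are no longer needed.
Inside a sandbox, the analogous hint is what prompts the Sentry to return
backing memory to the host.

```go
// release.go: sketch of an application hinting that memory is no longer
// needed via madvise(MADV_DONTNEED).
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	const size = 8 << 20 // 8 MiB scratch region; the size is illustrative

	// Map an anonymous, private region and touch every page so it is backed
	// by real memory.
	mem, err := unix.Mmap(-1, 0, size,
		unix.PROT_READ|unix.PROT_WRITE,
		unix.MAP_PRIVATE|unix.MAP_ANONYMOUS)
	if err != nil {
		panic(err)
	}
	defer unix.Munmap(mem)
	for i := 0; i < size; i += 4096 {
		mem[i] = 1
	}
	fmt.Println("region populated")

	// Tell the kernel the contents are no longer needed. The backing pages
	// can be freed immediately; the next access returns zero-filled pages.
	if err := unix.Madvise(mem, unix.MADV_DONTNEED); err != nil {
		panic(err)
	}
	fmt.Println("pages released, first byte is now:", mem[0])
}
```
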
## Limits

All Sentry threads and Sentry memory are subject to a container cgroup. However,
application usage will not appear as anonymous memory usage, and will instead be
accounted to the `memfd`. All anonymous memory will correspond to Sentry usage,
and host memory charged to the container will be accounted in the standard way.

The cgroups can be monitored for standard signals: pressure indicators,
threshold notifiers, etc., and can also be adjusted dynamically. Note that the
Sentry itself may listen for pressure signals in its containing cgroup, in order
to purge internal caches.

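Because the sandbox is just a process in a cgroup, the usual monitoring
interfaces apply. The sketch below reads current usage, the limit, and (on
cgroup v2) memory pressure for a container's cgroup; the cgroup path is
hypothetical and depends on how the runtime and host are configured.

```go
// cgwatch.go: sketch of monitoring a sandbox's memory via its cgroup (v2).
// The cgroup path is hypothetical; the runtime and host configuration
// determine where the sandbox's cgroup actually lives.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

func readFile(dir, name string) string {
	data, err := os.ReadFile(filepath.Join(dir, name))
	if err != nil {
		return "(unavailable: " + err.Error() + ")"
	}
	return strings.TrimSpace(string(data))
}

func main() {
	// Hypothetical cgroup v2 directory for the sandbox's container.
	cg := "/sys/fs/cgroup/my-sandbox"

	// Both Sentry anonymous memory and application memory charged to the
	// memfd show up in the cgroup's totals.
	fmt.Println("memory.current:", readFile(cg, "memory.current"))
	fmt.Println("memory.max:    ", readFile(cg, "memory.max"))

	// PSI pressure information is one of the "standard signals" that can be
	// watched, and that the Sentry itself may react to by purging caches.
	fmt.Println("memory.pressure:")
	fmt.Println(readFile(cg, "memory.pressure"))
}
```
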
[goroutine]: https://tour.golang.org/concurrency/1
[greenthread]: https://en.wikipedia.org/wiki/Green_threads
[scheduler]: https://morsmachine.dk/go-scheduler
[vdso]: https://en.wikipedia.org/wiki/VDSO
[rlimit]: http://man7.org/linux/man-pages/man2/getrlimit.2.html
[tickless]: https://en.wikipedia.org/wiki/Tickless_kernel
[memfd]: http://man7.org/linux/man-pages/man2/memfd_create.2.html
[scmrights]: http://man7.org/linux/man-pages/man7/unix.7.html
[madvise]: http://man7.org/linux/man-pages/man2/madvise.2.html
[exec]: https://docs.docker.com/engine/reference/commandline/exec/
[^1]: Unless host networking is enabled, the Sentry is not able to create or
    open host file descriptors itself; it can only receive them in this way
    from the Gofer.