This package provides an emulation of Linux semantics for application virtual
memory mappings.

For completeness, this document also describes aspects of the memory management
subsystem defined outside this package.

# Background

We begin by describing semantics for virtual memory in Linux.

A virtual address space is defined as a collection of mappings from virtual
addresses to physical memory. However, userspace applications do not configure
mappings to physical memory directly. Instead, applications configure memory
mappings from virtual addresses to offsets into a file using the `mmap` system
call.[^mmap-anon] For example, a call to:

    mmap(
        /* addr = */ 0x400000,
        /* length = */ 0x1000,
        PROT_READ | PROT_WRITE,
        MAP_SHARED,
        /* fd = */ 3,
        /* offset = */ 0);

creates a mapping of length 0x1000 bytes, starting at virtual address (VA)
0x400000, to offset 0 in the file represented by file descriptor (FD) 3. Within
the Linux kernel, virtual memory mappings are represented by *virtual memory
areas* (VMAs). Supposing that FD 3 represents file /tmp/foo, the state of the
virtual memory subsystem after the `mmap` call may be depicted as:

    VMA:     VA:0x400000 -> /tmp/foo:0x0

Establishing a virtual memory area does not necessarily establish a mapping to a
physical address, because Linux has not necessarily provisioned physical memory
to store the file's contents. Thus, if the application attempts to read the
contents of VA 0x400000, it may incur a *page fault*, a CPU exception that
forces the kernel to create such a mapping to service the read.

For a file, doing so consists of several logical phases:

1.  The kernel allocates physical memory to store the contents of the required
    part of the file, and copies file contents to the allocated memory.
    Supposing that the kernel chooses the physical memory at physical address
    (PA) 0x2fb000, the resulting state of the system is:

        VMA:     VA:0x400000 -> /tmp/foo:0x0
        Filemap:                /tmp/foo:0x0 -> PA:0x2fb000

    (In Linux the state of the mapping from file offset to physical memory is
    stored in `struct address_space`, but to avoid confusion with other notions
    of address space we will refer to this system as filemap, named after Linux
    kernel source file `mm/filemap.c`.)

2.  The kernel stores the effective mapping from virtual to physical address in
    a *page table entry* (PTE) in the application's *page tables*, which are
    used by the CPU's virtual memory hardware to perform address translation.
    The resulting state of the system is:

        VMA:     VA:0x400000 -> /tmp/foo:0x0
        Filemap:                /tmp/foo:0x0 -> PA:0x2fb000
        PTE:     VA:0x400000 -----------------> PA:0x2fb000

    The PTE is required for the application to actually use the contents of the
    mapped file as virtual memory. However, the PTE is derived from the VMA and
    filemap state, both of which are independently mutable, such that mutations
    to either will affect the PTE. For example:

    -   The application may remove the VMA using the `munmap` system call. This
        breaks the mapping from VA:0x400000 to /tmp/foo:0x0, and consequently
        the mapping from VA:0x400000 to PA:0x2fb000. However, it does not
        necessarily break the mapping from /tmp/foo:0x0 to PA:0x2fb000, so a
        future mapping of the same file offset may reuse this physical memory.

    -   The application may invalidate the file's contents by passing a length
        of 0 to the `ftruncate` system call. This breaks the mapping from
        /tmp/foo:0x0 to PA:0x2fb000, and consequently the mapping from
        VA:0x400000 to PA:0x2fb000.
        However, it does not break the mapping from
        VA:0x400000 to /tmp/foo:0x0, so future changes to the file's contents
        may again be made visible at VA:0x400000 after another page fault
        results in the allocation of a new physical address.

    Note that, in order to correctly break the mapping from VA:0x400000 to
    PA:0x2fb000 in the latter case, filemap must also store a *reverse mapping*
    from /tmp/foo:0x0 to VA:0x400000 so that it can locate and remove the PTE.

[^mmap-anon]: Memory mappings to non-files are discussed in later sections.

## Private Mappings

The preceding example considered VMAs created using the `MAP_SHARED` flag, which
means that PTEs derived from the mapping should always use physical memory that
represents the current state of the mapped file.[^mmap-dev-zero] Applications
can alternatively pass the `MAP_PRIVATE` flag to create a *private mapping*.
Private mappings are *copy-on-write*.

Suppose that the application instead created a private mapping in the previous
example. In Linux, the state of the system after a read page fault would be:

    VMA:     VA:0x400000 -> /tmp/foo:0x0 (private)
    Filemap:                /tmp/foo:0x0 -> PA:0x2fb000
    PTE:     VA:0x400000 -----------------> PA:0x2fb000 (read-only)

Now suppose the application attempts to write to VA:0x400000. For a shared
mapping, the write would be propagated to PA:0x2fb000, and the kernel would be
responsible for ensuring that the write is later propagated to the mapped file.
For a private mapping, the write incurs another page fault since the PTE is
marked read-only. In response, the kernel allocates physical memory to store the
mapping's *private copy* of the file's contents, copies file contents to the
allocated memory, and changes the PTE to map to the private copy.
Supposing that the kernel chooses the physical memory at physical address (PA)
0x5ea000, the resulting state of the system is:

    VMA:     VA:0x400000 -> /tmp/foo:0x0 (private)
    Filemap:                /tmp/foo:0x0 -> PA:0x2fb000
    PTE:     VA:0x400000 -----------------> PA:0x5ea000

Note that the filemap mapping from /tmp/foo:0x0 to PA:0x2fb000 may still exist,
but is now irrelevant to this mapping.

[^mmap-dev-zero]: Modulo files with special mmap semantics such as `/dev/zero`.

## Anonymous Mappings

Rather than passing a file to the `mmap` system call, applications can request
an *anonymous* mapping by passing the `MAP_ANONYMOUS` flag. Semantically, an
anonymous mapping is essentially a mapping to an ephemeral file initially filled
with zero bytes. Practically speaking, this is how shared anonymous mappings are
implemented, but private anonymous mappings do not result in the creation of an
ephemeral file; since there would be no way to modify the contents of the
underlying file through a private mapping, all private anonymous mappings use a
single shared page filled with zero bytes until copy-on-write occurs.

# Virtual Memory in the Sentry

The sentry implements application virtual memory atop a host kernel, introducing
an additional level of indirection to the above.

Consider the same scenario as in the previous section. Since the sentry handles
application system calls, the effect of an application `mmap` system call is to
create a VMA in the sentry (as opposed to the host kernel):

    Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0

When the application first incurs a page fault on this address, the host kernel
delivers information about the page fault to the sentry in a platform-dependent
manner, and the sentry handles the fault:

1.  The sentry allocates memory to store the contents of the required part of
    the file, and copies file contents to the allocated memory. However, since
    the sentry is implemented atop a host kernel, it does not configure mappings
    to physical memory directly. Instead, mappable "memory" in the sentry is
    represented by a host file descriptor and offset, since (as noted in
    "Background") this is the memory mapping primitive provided by the host
    kernel. In general, memory is allocated from a temporary host file using the
    `pgalloc` package. Supposing that the sentry allocates offset 0x3000 from
    host file "memory-file", the resulting state is:

        Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0
        Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000

2.  The sentry stores the effective mapping from virtual address to host file in
    a host VMA by invoking the `mmap` system call:

        Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0
        Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
        Host VMA:       VA:0x400000 -----------------> host:memory-file:0x3000

3.  The sentry returns control to the application, which immediately incurs the
    page fault again.[^mmap-populate] However, since a host VMA now exists for
    the faulting virtual address, the host kernel now handles the page fault as
    described in "Background":

        Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0
        Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
        Host VMA:       VA:0x400000 -----------------> host:memory-file:0x3000
        Host filemap:                                  host:memory-file:0x3000 -> PA:0x2fb000
        Host PTE:       VA:0x400000 --------------------------------------------> PA:0x2fb000

Thus, from an implementation standpoint, host VMAs serve the same purpose in the
sentry that PTEs do in Linux. As in Linux, sentry VMA and filemap state is
independently mutable, and the desired state of host VMAs is derived from that
state.
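
The first two fault-handling steps can be modeled with a small simulation. This is a minimal sketch, not sentry code: the types `fileOffset` and `toySentry` and the methods `mmap` and `handleFault` are hypothetical names introduced only for illustration, and plain Go maps stand in for the sentry VMA set, the sentry filemap, and the host `mmap` call. Step 3 (the host kernel servicing the repeated fault) happens outside the sentry and is not modeled.

```go
package main

import "fmt"

const pageSize = 0x1000

// fileOffset names a page-aligned offset within a file.
type fileOffset struct {
	file string
	off  uint64
}

// toySentry models the layers of state from the walkthrough.
// Every name here is hypothetical; this is not gVisor's API.
type toySentry struct {
	vmas     map[uint64]fileOffset     // sentry VMA: VA -> application file offset
	filemap  map[fileOffset]fileOffset // sentry filemap: application file offset -> host file offset
	hostVMAs map[uint64]fileOffset     // host VMA: VA -> host file offset (derived state)
	nextFree uint64                    // next unallocated offset in the toy memory file
}

func newToySentry() *toySentry {
	return &toySentry{
		vmas:     make(map[uint64]fileOffset),
		filemap:  make(map[fileOffset]fileOffset),
		hostVMAs: make(map[uint64]fileOffset),
		nextFree: 0x3000, // arbitrary; chosen to match the example in the text
	}
}

// mmap records a sentry VMA only; no host state is created until a fault.
func (s *toySentry) mmap(va uint64, file string, off uint64) {
	s.vmas[va] = fileOffset{file: file, off: off}
}

// handleFault models steps 1 and 2 for a single page-aligned address.
func (s *toySentry) handleFault(va uint64) error {
	src, ok := s.vmas[va]
	if !ok {
		return fmt.Errorf("no sentry VMA at %#x", va)
	}
	backing, ok := s.filemap[src]
	if !ok {
		// Step 1: allocate backing memory (copying file contents is omitted).
		backing = fileOffset{file: "memory-file", off: s.nextFree}
		s.nextFree += pageSize
		s.filemap[src] = backing
	}
	// Step 2: a real sentry would create a host VMA here; the toy records it.
	s.hostVMAs[va] = backing
	return nil
}

func main() {
	s := newToySentry()
	s.mmap(0x400000, "/tmp/foo", 0)
	fmt.Println("host VMAs before fault:", len(s.hostVMAs))
	if err := s.handleFault(0x400000); err != nil {
		fmt.Println("fault failed:", err)
		return
	}
	h := s.hostVMAs[0x400000]
	fmt.Printf("Host VMA: VA:%#x -> host:%s:%#x\n", uint64(0x400000), h.file, h.off)
}
```

The toy's host VMA entry is only ever written from the other two maps, mirroring the point that host VMA state is derived from sentry VMA and filemap state.
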

[^mmap-populate]: The sentry could force the host kernel to establish PTEs when
    it creates the host VMA by passing the `MAP_POPULATE` flag to the `mmap`
    system call, but usually does not. This is because, to reduce the number of
    page faults that require handling by the sentry and (correspondingly) the
    number of host `mmap` system calls, the sentry usually creates host VMAs
    that are much larger than the single faulting page.

## Private Mappings

The sentry implements private mappings consistently with Linux. Before
copy-on-write, the private mapping example given in the Background results in:

    Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0 (private)
    Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
    Host VMA:       VA:0x400000 -----------------> host:memory-file:0x3000 (read-only)
    Host filemap:                                  host:memory-file:0x3000 -> PA:0x2fb000
    Host PTE:       VA:0x400000 --------------------------------------------> PA:0x2fb000 (read-only)

When the application attempts to write to this address, the host kernel delivers
information about the resulting page fault to the sentry. Analogous to Linux,
the sentry allocates memory to store the mapping's private copy of the file's
contents, copies file contents to the allocated memory, and changes the host VMA
to map to the private copy. Supposing that the sentry chooses the offset 0x4000
in host file `memory-file` to store the private copy, the state of the system
after copy-on-write is:

    Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0 (private)
    Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
    Host VMA:       VA:0x400000 -----------------> host:memory-file:0x4000
    Host filemap:                                  host:memory-file:0x4000 -> PA:0x5ea000
    Host PTE:       VA:0x400000 --------------------------------------------> PA:0x5ea000

However, this highlights an important difference between Linux and the sentry.
In Linux, page tables are concrete (architecture-dependent) data structures
owned by the kernel. Conversely, the sentry has the ability to create and
destroy host VMAs using host system calls, but it does not have direct access to
their state. Thus, as written, if the application invokes the `munmap` system
call to remove the sentry VMA, it is non-trivial for the sentry to determine
that it should deallocate `host:memory-file:0x4000`. This implies that the
sentry must retain information about the host VMAs that it has created.

## Anonymous Mappings

The sentry implements anonymous mappings consistently with Linux, except that
there is no shared zero page.

# Implementation Constructs

In Linux:

-   A virtual address space is represented by `struct mm_struct`.

-   VMAs are represented by `struct vm_area_struct`, stored in `struct
    mm_struct::mmap`.

-   Mappings from file offsets to physical memory are stored in `struct
    address_space`.

-   Reverse mappings from file offsets to virtual mappings are stored in `struct
    address_space::i_mmap`.

-   Physical memory pages are represented by a pointer to `struct page` or an
    index called a *page frame number* (PFN), represented by `pfn_t`.

-   PTEs are represented by architecture-dependent type `pte_t`, stored in a
    table hierarchy rooted at `struct mm_struct::pgd`.

In the sentry:

-   A virtual address space is represented by type [`mm.MemoryManager`][mm].

-   Sentry VMAs are represented by type [`mm.vma`][mm], stored in
    `mm.MemoryManager.vmas`.

-   Mappings from sentry file offsets to host file offsets are abstracted
    through interface method [`memmap.Mappable.Translate`][memmap].

-   Reverse mappings from sentry file offsets to virtual mappings are abstracted
    through interface methods
    [`memmap.Mappable.AddMapping` and `memmap.Mappable.RemoveMapping`][memmap].

-   Host files that may be mapped into host VMAs are represented by type
    [`memmap.File`][memmap].

-   Host VMAs are represented in the sentry by type [`mm.pma`][mm] ("platform
    mapping area"), stored in `mm.MemoryManager.pmas`.

-   Creation and destruction of host VMAs is abstracted through interface
    methods
    [`platform.AddressSpace.MapFile` and `platform.AddressSpace.Unmap`][platform].

[memmap]: https://github.com/google/gvisor/blob/master/pkg/sentry/memmap/memmap.go
[mm]: https://github.com/google/gvisor/blob/master/pkg/sentry/mm/mm.go
[pgalloc]: https://github.com/google/gvisor/blob/master/pkg/sentry/pgalloc/pgalloc.go
[platform]: https://github.com/google/gvisor/blob/master/pkg/sentry/platform/platform.go
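
To illustrate why the sentry keeps its own record of each host VMA, the following toy model tracks one record per mapped page and consults it on `munmap`. It is a sketch under stated assumptions, not the real implementation: `toyMM`, its fields, and its methods are hypothetical, the `pma` struct here only borrows the name of the real type, and the actual gVisor types and interface signatures are considerably richer. The choice to free only private copies is also an assumption made for the example.

```go
package main

import "fmt"

// pma is a toy "platform mapping area": the sentry's own record of one host VMA.
// It borrows the name of the real type but none of its fields.
type pma struct {
	hostOff uint64 // offset in the toy memory file that the host VMA maps
	private bool   // true if hostOff holds a private (copy-on-write) copy
}

// toyMM is a toy stand-in for an address space. All names are hypothetical.
type toyMM struct {
	vmas  map[uint64]string // VA -> mapped application file (sentry VMAs)
	pmas  map[uint64]pma    // VA -> record of the host VMA created for that VA
	freed []uint64          // memory-file offsets released so far
}

func newToyMM() *toyMM {
	return &toyMM{
		vmas: make(map[uint64]string),
		pmas: make(map[uint64]pma),
	}
}

// fault records a host VMA backed by the filemap's page for this VA.
func (m *toyMM) fault(va, filemapOff uint64) {
	m.pmas[va] = pma{hostOff: filemapOff}
}

// copyOnWrite repoints the record at a freshly allocated private copy.
func (m *toyMM) copyOnWrite(va, privateOff uint64) {
	m.pmas[va] = pma{hostOff: privateOff, private: true}
}

// munmap removes the sentry VMA and uses the pma record to decide what to free.
// In this toy, filemap-owned memory is left alone and only private copies are freed.
func (m *toyMM) munmap(va uint64) {
	if p, ok := m.pmas[va]; ok {
		if p.private {
			m.freed = append(m.freed, p.hostOff)
		}
		delete(m.pmas, va)
	}
	delete(m.vmas, va)
}

func main() {
	m := newToyMM()
	m.vmas[0x400000] = "/tmp/foo"
	m.fault(0x400000, 0x3000)       // read fault: host VMA maps the filemap page
	m.copyOnWrite(0x400000, 0x4000) // write fault: host VMA now maps a private copy
	m.munmap(0x400000)
	fmt.Printf("freed offsets: %#x\n", m.freed)
}
```

Without the `pmas` map, `munmap` would have no way to learn that offset 0x4000 belongs to this mapping, since the host kernel does not expose host VMA state back to the sentry.
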