This package provides utilities for implementing virtual filesystem objects.

[TOC]

## Page cache

`CachingInodeOperations` implements a page cache for files that cannot use the
host page cache. Normally these are files that store their data in a remote
filesystem. This also applies to files that are accessed on a platform that
does not support directly memory-mapping host file descriptors (e.g. the
ptrace platform).

A `CachingInodeOperations` buffers regions of a single file into memory. It is
owned by an `fs.Inode`, the in-memory representation of a file (all open file
descriptors are backed by an `fs.Inode`). The `fs.Inode` provides operations
for reading memory into a `CachingInodeOperations`, to represent the contents
of the file in-memory, and for writing memory out, to relieve memory pressure
on the kernel and to synchronize in-memory changes to filesystems.

A `CachingInodeOperations` enables readable and/or writable memory access to
file content. Files can be mapped shared or private, see mmap(2). When a file
is mapped shared, changes to the file via write(2) and truncate(2) are
reflected in the shared memory region. Conversely, when the shared memory
region is modified, changes to the file are visible via read(2). Multiple
shared mappings of the same file are coherent with each other. This is
consistent with Linux.

When a file is mapped private, updates to the mapped memory are not visible to
other memory mappings. Updates to the mapped memory are also not reflected in
the file content as seen by read(2). If the file is changed after a private
mapping is created, for instance by write(2), the change to the file may or
may not be reflected in the private mapping. This is consistent with Linux.

A `CachingInodeOperations` keeps track of ranges of memory that were modified
(or "dirtied"). When the file is explicitly synced via fsync(2), only the
dirty ranges are written out to the filesystem. Any error returned indicates a
failure to write all dirty memory of a `CachingInodeOperations` to the
filesystem. In this case the filesystem may be in an inconsistent state. The
same operation can be performed on the shared memory itself using msync(2). If
neither fsync(2) nor msync(2) is performed, then the dirty memory is written
out in accordance with the `CachingInodeOperations` eviction strategy (see
below) and there is no guarantee that memory will be written out successfully
in full.

### Memory allocation and eviction

A `CachingInodeOperations` implements the following allocation and eviction
strategy:

-   Memory is allocated and brought up to date with the contents of a file
    when a region of mapped memory is accessed (or "faulted on").

-   Dirty memory is written out to filesystems when an fsync(2) or msync(2)
    operation is performed on a memory-mapped file (a minimal msync(2) sketch
    follows this list), for all memory-mapped files when the sentry saves its
    state, and/or when there are no longer any memory mappings of a range of a
    file, see munmap(2). As the latter implies, in the absence of a panic or
    SIGKILL, dirty memory is written out for all memory-mapped files when an
    application exits.

-   Memory is freed when there are no longer any memory mappings of a range
    of a file (e.g. when an application exits). This behavior is consistent
    with Linux for shared memory that has been locked via mlock(2).
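As a minimal sketch of the msync(2) path above, the following program maps a
file shared, dirties part of it through the mapping, and synchronously writes
the dirty range back. The file path is hypothetical, and the file is assumed
to be at least 4k long.

```
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
  int fd = open("/tmp/example", O_RDWR); /* hypothetical file */
  if (fd < 0) {
    perror("open");
    return 1;
  }

  /* Create a 4k shared mapping; writes through it dirty the page cache. */
  char *p = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (p == MAP_FAILED) {
    perror("mmap");
    return 1;
  }

  /* Dirty the first 8 bytes via the mapping. */
  memcpy(p, "aaaaaaaa", 8);

  /* Synchronously write the dirty range back to the filesystem. As with
   * fsync(2), an error here means some dirty memory may not have been
   * written out. */
  if (msync(p, 0x1000, MS_SYNC) != 0) {
    perror("msync");
    return 1;
  }

  munmap(p, 0x1000);
  close(fd);
  return 0;
}
```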
Notably, memory is not allocated for read(2) or write(2) operations. This
means that reads and writes to the file are only accelerated by a
`CachingInodeOperations` if the file being read or written has been memory
mapped *and* if the shared memory has been accessed at the region being read
or written. This diverges from Linux, which proactively buffers memory into a
page cache on read(2) (i.e. readahead) and delays writing it out to
filesystems on write(2) (i.e. writeback). The absence of these optimizations
is not visible to applications beyond suboptimal performance when repeatedly
reading and/or writing the same region of a file. See
[Future Work](#future-work) for plans to implement these optimizations.

Additionally, memory held by `CachingInodeOperations` instances is currently
unbounded in size. A `CachingInodeOperations` does not write out dirty memory
and free it under system memory pressure. This can cause pathological memory
usage.

When memory is written back, a `CachingInodeOperations` may write regions of
shared memory that were never modified. This is due to the strategy of
minimizing page faults (see below) and handling only a subset of memory write
faults. In the absence of an application or sentry crash, it is guaranteed
that if a region of shared memory was written to, it is written back to a
filesystem.

### Life of a shared memory mapping

A file is memory mapped via mmap(2). For example, if `A` is an address, an
application may execute:

```
mmap(A, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
```

This creates a shared mapping of `fd` that reflects 4k of the contents of `fd`
starting at offset 0, accessible at address `A`. This in turn creates a
virtual memory area ("vma") which indicates that [`A`, `A`+0x1000) is now a
valid address range for this application to access.

At this point, memory has not been allocated in the file's
`CachingInodeOperations`. It is also the case that the address range [`A`,
`A`+0x1000) has not been mapped on the host on behalf of the application. If
the application then tries to modify 8 bytes of the shared memory:

```
char buffer[] = "aaaaaaaa";
memcpy(A, buffer, 8);
```

The host then sends a `SIGSEGV` to the sentry because the address range [`A`,
`A`+8) is not mapped on the host. The `SIGSEGV` indicates that the access was
a write. The sentry looks up the vma associated with [`A`, `A`+8) and finds
the file that was mapped and its `CachingInodeOperations`. It then calls
`CachingInodeOperations.Translate`, which allocates memory to back [`A`,
`A`+8). It may choose to allocate more memory (i.e. do "readahead") to
minimize subsequent faults.

Memory that is allocated comes from a host tmpfs file (see
`pgalloc.MemoryFile`). The host tmpfs file memory is brought up to date with
the contents of the mapped file on its filesystem. The region of the host
tmpfs file that reflects the mapped file is then mapped into the host address
space of the application so that subsequent memory accesses do not repeatedly
generate a `SIGSEGV`.

The range that was allocated, including any extra memory allocation to
minimize faults, is marked dirty due to the write fault. This overcounts
dirty memory if the extra memory allocated is never modified.
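Assembled into a self-contained program, the walkthrough so far might look
like the sketch below. The file path is hypothetical, the mapping address is
left for the kernel to choose (mmap(2) treats a non-`MAP_FIXED` address as a
hint), error handling is elided for brevity, and the comments map each step to
the sentry behavior described above.

```
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
  int fd = open("/tmp/example", O_RDWR); /* hypothetical file */

  /* Create the shared mapping. Under the sentry, no memory has been
   * allocated in the file's CachingInodeOperations yet, and nothing has
   * been mapped on the host for this range. */
  char *a = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

  /* First write access: the host faults, the sentry handles the resulting
   * SIGSEGV, Translate allocates backing memory (possibly more than 8
   * bytes), and the allocated range is marked dirty. */
  char buffer[] = "aaaaaaaa";
  memcpy(a, buffer, 8);

  /* Subsequent accesses within the established host mapping do not fault
   * again. */
  memcpy(a + 8, buffer, 8);

  munmap(a, 0x1000);
  close(fd);
  return 0;
}
```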
To make the scenario more interesting, imagine that this application spawns
another process and maps the same file in the exact same way:

```
mmap(A, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
```

Imagine that this process then tries to modify the file again but with only 4
bytes:

```
char buffer[] = "bbbb";
memcpy(A, buffer, 4);
```

Since the first process has already mapped and accessed the same region of
the file writable, `CachingInodeOperations.Translate` is called but returns
the memory that has already been allocated rather than allocating new memory.
The address range [`A`, `A`+0x1000) reflects the same cached view of the file
as the first process sees. For example, reading 8 bytes from the file from
either process via read(2) starting at offset 0 returns a consistent
"bbbbaaaa".

When this process no longer needs the shared memory, it may do:

```
munmap(A, 0x1000);
```

At this point, the modified memory cached by the `CachingInodeOperations` is
not written back to the file because it is still in use by the first process
that mapped it. When the first process also does:

```
munmap(A, 0x1000);
```

Then the last memory mapping of the file at the range [0, 0x1000) is gone.
The file's `CachingInodeOperations` then starts writing back memory marked
dirty to the file on its filesystem. Once writing completes, regardless of
whether it was successful, the `CachingInodeOperations` frees the memory
cached at the range [0, 0x1000).

Subsequent read(2) or write(2) operations on the file go directly to the
filesystem since the `CachingInodeOperations` no longer holds memory for that
range.

## Future Work

### Page cache

The sentry does not yet implement the readahead and writeback optimizations
for read(2) and write(2) respectively. To do so, on read(2) and/or write(2)
the sentry must ensure that memory is allocated in a page cache to read or
write into. However, the sentry cannot boundlessly allocate memory. If it
did, the host would eventually OOM-kill the sentry+application process. This
means that the sentry must implement a page cache memory allocation strategy
that is bounded by a global user or container imposed limit. When this limit
is approached, the sentry must decide which page cache memory to free so that
it can allocate more memory. If it makes a poor decision, the sentry may end
up freeing and re-allocating memory to back regions of files that are
frequently used, nullifying the optimization (and in some cases causing worse
performance due to the overhead of memory allocation and general management).
This is a form of "cache thrashing".

In Linux, much research has been done to select and implement a lightweight
but optimal page cache eviction algorithm. Linux makes use of hardware page
bits to keep track of whether memory has been accessed. The sentry does not
have direct access to hardware. Implementing a similarly lightweight and
optimal page cache eviction algorithm will require either introducing a
kernel interface to obtain these page bits or finding a suitable alternative
proxy for access events.
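Purely as an illustration of the kind of lightweight policy in question, the
sketch below implements a "clock" (second-chance) scan over cached pages. The
`cached_page` type and its `accessed` field are hypothetical; the access bit
is assumed to be maintained in software, standing in for the hardware page
bits the sentry cannot observe.

```
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical descriptor for one cached page. */
struct cached_page {
  bool accessed; /* set in software when the page is touched */
  /* ... backing range, dirty bit, etc. ... */
};

/* Clock (second-chance) eviction: sweep a circular array of pages, clearing
 * access bits as we go, and evict the first page whose bit was already
 * clear. Pages touched since the last sweep get a "second chance". Assumes
 * npages > 0. */
size_t pick_victim(struct cached_page *pages, size_t npages, size_t *hand) {
  for (;;) {
    struct cached_page *p = &pages[*hand];
    size_t victim = *hand;
    *hand = (*hand + 1) % npages;
    if (!p->accessed)
      return victim;    /* cold page: evict it */
    p->accessed = false; /* warm page: clear the bit, spare it this round */
  }
}

int main(void) {
  struct cached_page pages[4] = {{true}, {false}, {true}, {true}};
  size_t hand = 0;
  /* Page 1 has a clear access bit, so it is chosen first. */
  printf("victim: %zu\n", pick_victim(pages, 4, &hand));
  return 0;
}
```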
In Linux, readahead happens by default but is not always ideal. For instance,
for files that are not read sequentially, it would be better to read only the
requested regions of the file rather than optimistically cache some number of
bytes ahead of the read (up to 2MB in Linux) when the cached bytes won't be
accessed. Linux implements the fadvise64(2) system call for applications to
specify that a range of a file will not be accessed sequentially. The advice
bit FADV_RANDOM turns off the readahead optimization for the given range in
the given file. However, fadvise64(2) is rarely used by applications, so
Linux also implements a readahead backoff strategy for reads that are not
sequential. To ensure that application performance is not degraded, the
sentry must implement a similar backoff strategy.
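For reference, an application that knows its accesses are random can disable
readahead for a whole file through the portable posix_fadvise wrapper; the
file path below is hypothetical, and a length of 0 applies the advice from
the offset through the end of the file:

```
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
  int fd = open("/tmp/example", O_RDONLY); /* hypothetical file */
  if (fd < 0) {
    perror("open");
    return 1;
  }

  /* Advise the kernel that accesses will be random. On Linux this turns off
   * readahead for the file. Note that posix_fadvise returns the error
   * number directly rather than setting errno. */
  int err = posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
  if (err != 0) {
    fprintf(stderr, "posix_fadvise: %d\n", err);
    return 1;
  }

  /* ... random reads via pread(2) ... */
  close(fd);
  return 0;
}
```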