This package provides utilities for implementing virtual filesystem objects.

[TOC]

## Page cache

`CachingInodeOperations` implements a page cache for files that cannot use the
host page cache. Normally these are files that store their data in a remote
filesystem. This also applies to files that are accessed on a platform that
does not support directly memory-mapping host file descriptors (e.g. the
ptrace platform).

A `CachingInodeOperations` buffers regions of a single file into memory. It is
owned by an `fs.Inode`, the in-memory representation of a file (all open file
descriptors are backed by an `fs.Inode`). The `fs.Inode` provides operations
for reading memory into a `CachingInodeOperations`, to represent the contents
of the file in-memory, and for writing memory out, to relieve memory pressure
on the kernel and to synchronize in-memory changes to filesystems.

A `CachingInodeOperations` enables readable and/or writable memory access to
file content. Files can be mapped shared or private, see mmap(2). When a file
is mapped shared, changes to the file via write(2) and truncate(2) are
reflected in the shared memory region. Conversely, when the shared memory
region is modified, changes to the file are visible via read(2). Multiple
shared mappings of the same file are coherent with each other. This is
consistent with Linux.

When a file is mapped private, updates to the mapped memory are not visible to
other memory mappings. Updates to the mapped memory are also not reflected in
the file content as seen by read(2). If the file is changed after a private
mapping is created, for instance by write(2), the change to the file may or
may not be reflected in the private mapping. This is consistent with Linux.

A `CachingInodeOperations` keeps track of ranges of memory that were modified
(or "dirtied"). When the file is explicitly synced via fsync(2), only the
dirty ranges are written out to the filesystem. Any error returned indicates a
failure to write all dirty memory of a `CachingInodeOperations` to the
filesystem. In this case the filesystem may be in an inconsistent state. The
same operation can be performed on the shared memory itself using msync(2). If
neither fsync(2) nor msync(2) is performed, then the dirty memory is written
out in accordance with the `CachingInodeOperations` eviction strategy (see
below) and there is no guarantee that memory will be written out successfully
in full.

### Memory allocation and eviction

A `CachingInodeOperations` implements the following allocation and eviction
strategy:

-   Memory is allocated and brought up to date with the contents of a file
    when a region of mapped memory is accessed (or "faulted on").

-   Dirty memory is written out to filesystems when an fsync(2) or msync(2)
    operation is performed on a memory-mapped file (a minimal msync(2) sketch
    follows this list), for all memory-mapped files when the sentry saves its
    state, and/or when there are no longer any memory mappings of a range of a
    file, see munmap(2). As the latter implies, in the absence of a panic or
    SIGKILL, dirty memory is written out for all memory-mapped files when an
    application exits.

-   Memory is freed when there are no longer any memory mappings of a range
    of a file (e.g. when an application exits). This behavior is consistent
    with Linux for shared memory that has been locked via mlock(2).
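As a minimal sketch of the msync(2) path above, the following program maps a
file shared, dirties part of it through the mapping, and synchronously writes
the dirty range back. The file path is hypothetical, and the file is assumed
to be at least 4k long.

```
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
  int fd = open("/tmp/example", O_RDWR); /* hypothetical file */
  if (fd < 0) {
    perror("open");
    return 1;
  }

  /* Create a 4k shared mapping; writes through it dirty the page cache. */
  char *p = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (p == MAP_FAILED) {
    perror("mmap");
    return 1;
  }

  /* Dirty the first 8 bytes via the mapping. */
  memcpy(p, "aaaaaaaa", 8);

  /* Synchronously write the dirty range back to the filesystem. As with
   * fsync(2), an error here means some dirty memory may not have been
   * written out. */
  if (msync(p, 0x1000, MS_SYNC) != 0) {
    perror("msync");
    return 1;
  }

  munmap(p, 0x1000);
  close(fd);
  return 0;
}
```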
Notably, memory is not allocated for read(2) or write(2) operations. This
means that reads and writes to the file are only accelerated by a
`CachingInodeOperations` if the file being read or written has been memory
mapped *and* if the shared memory has been accessed at the region being read
or written. This diverges from Linux, which proactively buffers memory into a
page cache on read(2) (i.e. readahead) and delays writing it out to
filesystems on write(2) (i.e. writeback). The absence of these optimizations
is not visible to applications beyond suboptimal performance when repeatedly
reading and/or writing the same region of a file. See
[Future Work](#future-work) for plans to implement these optimizations.

Additionally, memory held by `CachingInodeOperations` instances is currently
unbounded in size. A `CachingInodeOperations` does not write out dirty memory
and free it under system memory pressure. This can cause pathological memory
usage.

When memory is written back, a `CachingInodeOperations` may write regions of
shared memory that were never modified. This is due to the strategy of
minimizing page faults (see below) and handling only a subset of memory write
faults. In the absence of an application or sentry crash, it is guaranteed
that if a region of shared memory was written to, it is written back to a
filesystem.

### Life of a shared memory mapping

A file is memory mapped via mmap(2). For example, if `A` is an address, an
application may execute:

```
mmap(A, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
```

This creates a shared mapping of `fd` that reflects 4k of the contents of `fd`
starting at offset 0, accessible at address `A`. This in turn creates a
virtual memory area ("vma") which indicates that [`A`, `A`+0x1000) is now a
valid address range for this application to access.

At this point, memory has not been allocated in the file's
`CachingInodeOperations`. It is also the case that the address range [`A`,
`A`+0x1000) has not been mapped on the host on behalf of the application. If
the application then tries to modify 8 bytes of the shared memory:

```
char buffer[] = "aaaaaaaa";
memcpy(A, buffer, 8);
```

The host then sends a `SIGSEGV` to the sentry because the address range [`A`,
`A`+8) is not mapped on the host. The `SIGSEGV` indicates that the access was
a write. The sentry looks up the vma associated with [`A`, `A`+8) and finds
the file that was mapped and its `CachingInodeOperations`. It then calls
`CachingInodeOperations.Translate`, which allocates memory to back [`A`,
`A`+8). It may choose to allocate more memory (i.e. do "readahead") to
minimize subsequent faults.

Memory that is allocated comes from a host tmpfs file (see
`pgalloc.MemoryFile`). The host tmpfs file memory is brought up to date with
the contents of the mapped file on its filesystem. The region of the host
tmpfs file that reflects the mapped file is then mapped into the host address
space of the application so that subsequent memory accesses do not repeatedly
generate a `SIGSEGV`.

The range that was allocated, including any extra memory allocation to
minimize faults, is marked dirty due to the write fault. This overcounts
dirty memory if the extra memory allocated is never modified.
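Assembled into a self-contained program, the walkthrough so far might look
like the sketch below. The file path is hypothetical, the mapping address is
left for the kernel to choose (mmap(2) treats a non-`MAP_FIXED` address as a
hint), error handling is elided for brevity, and the comments map each step to
the sentry behavior described above.

```
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
  int fd = open("/tmp/example", O_RDWR); /* hypothetical file */

  /* Create the shared mapping. Under the sentry, no memory has been
   * allocated in the file's CachingInodeOperations yet, and nothing has
   * been mapped on the host for this range. */
  char *a = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

  /* First write access: the host faults, the sentry handles the resulting
   * SIGSEGV, Translate allocates backing memory (possibly more than 8
   * bytes), and the allocated range is marked dirty. */
  char buffer[] = "aaaaaaaa";
  memcpy(a, buffer, 8);

  /* Subsequent accesses within the established host mapping do not fault
   * again. */
  memcpy(a + 8, buffer, 8);

  munmap(a, 0x1000);
  close(fd);
  return 0;
}
```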
To make the scenario more interesting, imagine that this application spawns
another process and maps the same file in the exact same way:

```
mmap(A, 0x1000, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
```

Imagine that this process then tries to modify the file again but with only 4
bytes:

```
char buffer[] = "bbbb";
memcpy(A, buffer, 4);
```

Since the first process has already mapped and accessed the same region of
the file writable, `CachingInodeOperations.Translate` is called but returns
the memory that has already been allocated rather than allocating new memory.
The address range [`A`, `A`+0x1000) reflects the same cached view of the file
as the first process sees. For example, reading 8 bytes from the file from
either process via read(2) starting at offset 0 returns a consistent
"bbbbaaaa".

When this process no longer needs the shared memory, it may do:

```
munmap(A, 0x1000);
```

At this point, the modified memory cached by the `CachingInodeOperations` is
not written back to the file because it is still in use by the first process
that mapped it. When the first process also does:

```
munmap(A, 0x1000);
```

Then the last memory mapping of the file at the range [0, 0x1000) is gone.
The file's `CachingInodeOperations` then starts writing back memory marked
dirty to the file on its filesystem. Once writing completes, regardless of
whether it was successful, the `CachingInodeOperations` frees the memory
cached at the range [0, 0x1000).

Subsequent read(2) or write(2) operations on the file go directly to the
filesystem since the `CachingInodeOperations` no longer holds memory for that
range.

## Future Work

### Page cache

The sentry does not yet implement the readahead and writeback optimizations
for read(2) and write(2) respectively. To do so, on read(2) and/or write(2)
the sentry must ensure that memory is allocated in a page cache to read or
write into. However, the sentry cannot boundlessly allocate memory. If it
did, the host would eventually OOM-kill the sentry+application process. This
means that the sentry must implement a page cache memory allocation strategy
that is bounded by a global user or container imposed limit. When this limit
is approached, the sentry must decide which page cache memory to free so that
it can allocate more memory. If it makes a poor decision, the sentry may end
up freeing and re-allocating memory to back regions of files that are
frequently used, nullifying the optimization (and in some cases causing worse
performance due to the overhead of memory allocation and general management).
This is a form of "cache thrashing".

In Linux, much research has been done to select and implement a lightweight
but optimal page cache eviction algorithm. Linux makes use of hardware page
bits to keep track of whether memory has been accessed. The sentry does not
have direct access to hardware. Implementing a similarly lightweight and
optimal page cache eviction algorithm will require either introducing a
kernel interface to obtain these page bits or finding a suitable alternative
proxy for access events.
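Purely as an illustration of the kind of lightweight policy in question, the
sketch below implements a "clock" (second-chance) scan over cached pages. The
`cached_page` type and its `accessed` field are hypothetical; the access bit
is assumed to be maintained in software, standing in for the hardware page
bits the sentry cannot observe.

```
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical descriptor for one cached page. */
struct cached_page {
  bool accessed; /* set in software when the page is touched */
  /* ... backing range, dirty bit, etc. ... */
};

/* Clock (second-chance) eviction: sweep a circular array of pages, clearing
 * access bits as we go, and evict the first page whose bit was already
 * clear. Pages touched since the last sweep get a "second chance". Assumes
 * npages > 0. */
size_t pick_victim(struct cached_page *pages, size_t npages, size_t *hand) {
  for (;;) {
    struct cached_page *p = &pages[*hand];
    size_t victim = *hand;
    *hand = (*hand + 1) % npages;
    if (!p->accessed)
      return victim;    /* cold page: evict it */
    p->accessed = false; /* warm page: clear the bit, spare it this round */
  }
}

int main(void) {
  struct cached_page pages[4] = {{true}, {false}, {true}, {true}};
  size_t hand = 0;
  /* Page 1 has a clear access bit, so it is chosen first. */
  printf("victim: %zu\n", pick_victim(pages, 4, &hand));
  return 0;
}
```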
In Linux, readahead happens by default but is not always ideal. For instance,
for files that are not read sequentially, it would be better to read only the
requested regions of the file rather than optimistically cache some number of
bytes ahead of the read (up to 2MB in Linux) when the cached bytes won't be
accessed. Linux implements the fadvise64(2) system call for applications to
specify that a range of a file will not be accessed sequentially. The advice
bit FADV_RANDOM turns off the readahead optimization for the given range in
the given file. However, fadvise64(2) is rarely used by applications, so
Linux also implements a readahead backoff strategy for reads that are not
sequential. To ensure that application performance is not degraded, the
sentry must implement a similar backoff strategy.
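For reference, an application that knows its accesses are random can disable
readahead for a whole file through the portable posix_fadvise wrapper; the
file path below is hypothetical, and a length of 0 applies the advice from
the offset through the end of the file:

```
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
  int fd = open("/tmp/example", O_RDONLY); /* hypothetical file */
  if (fd < 0) {
    perror("open");
    return 1;
  }

  /* Advise the kernel that accesses will be random. On Linux this turns off
   * readahead for the file. Note that posix_fadvise returns the error
   * number directly rather than setting errno. */
  int err = posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
  if (err != 0) {
    fprintf(stderr, "posix_fadvise: %d\n", err);
    return 1;
  }

  /* ... random reads via pread(2) ... */
  close(fd);
  return 0;
}
```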