github.com/SagerNet/gvisor@v0.0.0-20210707092255-7731c139d75c/pkg/sentry/mm/README.md

This package provides an emulation of Linux semantics for application virtual
memory mappings.

For completeness, this document also describes aspects of the memory management
subsystem defined outside this package.

# Background

We begin by describing semantics for virtual memory in Linux.
A virtual address space is defined as a collection of mappings from virtual
addresses to physical memory. However, userspace applications do not configure
mappings to physical memory directly. Instead, applications configure memory
mappings from virtual addresses to offsets into a file using the `mmap` system
call.[^mmap-anon] For example, a call to:

    mmap(
        /* addr = */ 0x400000,
        /* length = */ 0x1000,
        PROT_READ | PROT_WRITE,
        MAP_SHARED,
        /* fd = */ 3,
        /* offset = */ 0);

creates a mapping of length 0x1000 bytes, starting at virtual address (VA)
0x400000, to offset 0 in the file represented by file descriptor (FD) 3. Within
the Linux kernel, virtual memory mappings are represented by *virtual memory
areas* (VMAs). Supposing that FD 3 represents file /tmp/foo, the state of the
virtual memory subsystem after the `mmap` call may be depicted as:

    VMA:     VA:0x400000 -> /tmp/foo:0x0

Establishing a virtual memory area does not necessarily establish a mapping to a
physical address, because Linux has not necessarily provisioned physical memory
to store the file's contents. Thus, if the application attempts to read the
contents of VA 0x400000, it may incur a *page fault*, a CPU exception that
forces the kernel to create such a mapping to service the read.
    38  
For a file, doing so consists of several logical phases:

1.  The kernel allocates physical memory to store the contents of the required
    part of the file, and copies file contents to the allocated memory.
    Supposing that the kernel chooses the physical memory at physical address
    (PA) 0x2fb000, the resulting state of the system is:

        VMA:     VA:0x400000 -> /tmp/foo:0x0
        Filemap:                /tmp/foo:0x0 -> PA:0x2fb000

    (In Linux the state of the mapping from file offset to physical memory is
    stored in `struct address_space`, but to avoid confusion with other notions
    of address space we will refer to this system as filemap, named after Linux
    kernel source file `mm/filemap.c`.)

2.  The kernel stores the effective mapping from virtual to physical address in
    a *page table entry* (PTE) in the application's *page tables*, which are
    used by the CPU's virtual memory hardware to perform address translation.
    The resulting state of the system is:

        VMA:     VA:0x400000 -> /tmp/foo:0x0
        Filemap:                /tmp/foo:0x0 -> PA:0x2fb000
        PTE:     VA:0x400000 -----------------> PA:0x2fb000

    The PTE is required for the application to actually use the contents of the
    mapped file as virtual memory. However, the PTE is derived from the VMA and
    filemap state, both of which are independently mutable, such that mutations
    to either will affect the PTE. For example:

    -   The application may remove the VMA using the `munmap` system call. This
        breaks the mapping from VA:0x400000 to /tmp/foo:0x0, and consequently
        the mapping from VA:0x400000 to PA:0x2fb000. However, it does not
        necessarily break the mapping from /tmp/foo:0x0 to PA:0x2fb000, so a
        future mapping of the same file offset may reuse this physical memory.

    -   The application may invalidate the file's contents by passing a length
        of 0 to the `ftruncate` system call. This breaks the mapping from
        /tmp/foo:0x0 to PA:0x2fb000, and consequently the mapping from
        VA:0x400000 to PA:0x2fb000. However, it does not break the mapping from
        VA:0x400000 to /tmp/foo:0x0, so future changes to the file's contents
        may again be made visible at VA:0x400000 after another page fault
        results in the allocation of a new physical address.

    Note that, in order to correctly break the mapping from VA:0x400000 to
    PA:0x2fb000 in the latter case, filemap must also store a *reverse mapping*
    from /tmp/foo:0x0 to VA:0x400000 so that it can locate and remove the PTE.

[^mmap-anon]: Memory mappings to non-files are discussed in later sections.
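The shared-mapping behavior described above can be sketched as a short C
function. The helper name `demo_shared_mapping` and the temporary file are
illustrative (not part of this package); error handling is reduced to
assertions, and we let the kernel choose the virtual address rather than
requesting the fixed 0x400000 used in the text.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 0 on success. */
int demo_shared_mapping(void) {
    char path[] = "/tmp/fooXXXXXX";
    int fd = mkstemp(path);              /* stands in for /tmp/foo (FD 3) */
    assert(fd >= 0);
    assert(ftruncate(fd, 0x1000) == 0);  /* file must cover the mapped range */

    /* mmap establishes the VMA; no physical memory is provisioned yet. */
    char *va = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    assert(va != MAP_FAILED);

    /* The first access faults: the kernel allocates physical memory for the
       file page (filemap) and installs a PTE; the write lands in the page
       cache. */
    memcpy(va, "hello", 5);

    /* The same physical page backs the file, so the write is visible through
       ordinary reads on the descriptor. */
    char buf[5];
    assert(pread(fd, buf, 5, 0) == 5);
    assert(memcmp(buf, "hello", 5) == 0);

    /* munmap removes the VMA (and hence the PTE); the filemap entry for
       /tmp/foo:0x0 may outlive it. */
    assert(munmap(va, 0x1000) == 0);
    close(fd);
    unlink(path);
    return 0;
}
```

The final `munmap` corresponds to the first mutation example above: the
VA-to-file and VA-to-PA mappings are broken, but the file-to-PA mapping may
persist in the page cache.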
    87  
## Private Mappings

The preceding example considered VMAs created using the `MAP_SHARED` flag, which
means that PTEs derived from the mapping should always use physical memory that
represents the current state of the mapped file.[^mmap-dev-zero] Applications
can alternatively pass the `MAP_PRIVATE` flag to create a *private mapping*.
Private mappings are *copy-on-write*.

Suppose that the application instead created a private mapping in the previous
example. In Linux, the state of the system after a read page fault would be:

    VMA:     VA:0x400000 -> /tmp/foo:0x0 (private)
    Filemap:                /tmp/foo:0x0 -> PA:0x2fb000
    PTE:     VA:0x400000 -----------------> PA:0x2fb000 (read-only)

Now suppose the application attempts to write to VA:0x400000. For a shared
mapping, the write would be propagated to PA:0x2fb000, and the kernel would be
responsible for ensuring that the write is later propagated to the mapped file.
For a private mapping, the write incurs another page fault since the PTE is
marked read-only. In response, the kernel allocates physical memory to store the
mapping's *private copy* of the file's contents, copies file contents to the
allocated memory, and changes the PTE to map to the private copy. Supposing that
the kernel chooses the physical memory at physical address (PA) 0x5ea000, the
resulting state of the system is:

    VMA:     VA:0x400000 -> /tmp/foo:0x0 (private)
    Filemap:                /tmp/foo:0x0 -> PA:0x2fb000
    PTE:     VA:0x400000 -----------------> PA:0x5ea000

Note that the filemap mapping from /tmp/foo:0x0 to PA:0x2fb000 may still exist,
but is now irrelevant to this mapping.

[^mmap-dev-zero]: Modulo files with special mmap semantics such as `/dev/zero`.
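The copy-on-write behavior can be checked with a minimal C sketch, under the
same caveats as before (hypothetical helper name, assertion-based error
handling): a write through a `MAP_PRIVATE` mapping is redirected to a private
copy and never reaches the file.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 0 on success. */
int demo_private_mapping(void) {
    char path[] = "/tmp/fooXXXXXX";
    int fd = mkstemp(path);
    assert(fd >= 0);
    assert(write(fd, "file contents...", 16) == 16);

    char *va = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    assert(va != MAP_FAILED);

    /* Read fault: the PTE initially points (read-only) at the shared page
       cache page, so the mapping sees the file's contents. */
    assert(memcmp(va, "file contents...", 16) == 0);

    /* Write fault: the kernel copies the page and redirects the PTE to the
       private copy, so the write is NOT propagated to the file. */
    memcpy(va, "private write!!!", 16);

    char buf[16];
    assert(pread(fd, buf, 16, 0) == 16);
    assert(memcmp(buf, "file contents...", 16) == 0); /* file unchanged */
    assert(memcmp(va, "private write!!!", 16) == 0);  /* mapping changed */

    assert(munmap(va, 0x1000) == 0);
    close(fd);
    unlink(path);
    return 0;
}
```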
   121  
## Anonymous Mappings

Instead of passing a file to the `mmap` system call, applications can request an
*anonymous* mapping by passing the `MAP_ANONYMOUS` flag. Semantically, an
anonymous mapping is essentially a mapping to an ephemeral file initially filled
with zero bytes. Practically speaking, this is how shared anonymous mappings are
implemented; private anonymous mappings do not result in the creation of an
ephemeral file, since there would be no way to modify the contents of the
underlying file through a private mapping. Instead, all private anonymous
mappings use a single shared page filled with zero bytes until copy-on-write
occurs.
   133  
# Virtual Memory in the Sentry

The sentry implements application virtual memory atop a host kernel, introducing
an additional level of indirection to the above.

Consider the same scenario as in the previous section. Since the sentry handles
application system calls, the effect of an application `mmap` system call is to
create a VMA in the sentry (as opposed to the host kernel):

    Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0

When the application first incurs a page fault on this address, the host kernel
delivers information about the page fault to the sentry in a platform-dependent
manner, and the sentry handles the fault:

1.  The sentry allocates memory to store the contents of the required part of
    the file, and copies file contents to the allocated memory. However, since
    the sentry is implemented atop a host kernel, it does not configure mappings
    to physical memory directly. Instead, mappable "memory" in the sentry is
    represented by a host file descriptor and offset, since (as noted in
    "Background") this is the memory mapping primitive provided by the host
    kernel. In general, memory is allocated from a temporary host file using the
    [`pgalloc`][pgalloc] package. Supposing that the sentry allocates offset
    0x3000 from host file "memory-file", the resulting state is:

        Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0
        Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000

2.  The sentry stores the effective mapping from virtual address to host file in
    a host VMA by invoking the `mmap` system call:

        Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0
        Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
          Host VMA:     VA:0x400000 -----------------> host:memory-file:0x3000

3.  The sentry returns control to the application, which immediately incurs the
    page fault again.[^mmap-populate] However, since a host VMA now exists for
    the faulting virtual address, the host kernel now handles the page fault as
    described in "Background":

        Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0
        Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
          Host VMA:     VA:0x400000 -----------------> host:memory-file:0x3000
          Host filemap:                                host:memory-file:0x3000 -> PA:0x2fb000
          Host PTE:     VA:0x400000 --------------------------------------------> PA:0x2fb000

Thus, from an implementation standpoint, host VMAs serve the same purpose in the
sentry that PTEs do in Linux. As in Linux, sentry VMA and filemap state is
independently mutable, and the desired state of host VMAs is derived from that
state.
[^mmap-populate]: The sentry could force the host kernel to establish PTEs when
    it creates the host VMA by passing the `MAP_POPULATE` flag to the `mmap`
    system call, but usually does not. This is because the sentry usually
    creates host VMAs that are much larger than the single faulting page, in
    order to reduce the number of page faults that require handling by the
    sentry and (correspondingly) the number of host `mmap` system calls.
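The three fault-handling steps above can be imitated from userspace with a C
sketch. This is not the sentry's code (the sentry is written in Go, and maps at
the application's faulting address with `MAP_FIXED`); it only illustrates the
technique of backing "memory" with a host file descriptor and offset, assuming
Linux `memfd_create` is available. The helper name and offsets are
illustrative.

```c
#define _GNU_SOURCE /* for memfd_create */
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 0 on success. */
int demo_sentry_style_fault(void) {
    /* "memory-file": allocated memory is identified by a host file offset,
       not a physical address. */
    int memfd = memfd_create("memory-file", 0);
    assert(memfd >= 0);
    assert(ftruncate(memfd, 0x10000) == 0);

    /* Step 1: allocate an offset (say 0x3000) and copy the mapped file's
       contents into it. */
    off_t alloc = 0x3000;
    assert(pwrite(memfd, "contents of /tmp/foo:0x0", 24, alloc) == 24);

    /* Step 2: install a host VMA from the faulting VA to memory-file:0x3000.
       (Here the kernel picks the VA for us.) */
    char *va = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_SHARED,
                    memfd, alloc);
    assert(va != MAP_FAILED);

    /* Step 3: when the application retries the access, the host kernel
       installs the PTE itself; no further handling is needed here. */
    assert(memcmp(va, "contents of /tmp/foo:0x0", 24) == 0);

    assert(munmap(va, 0x1000) == 0);
    close(memfd);
    return 0;
}
```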
   192  
## Private Mappings

The sentry implements private mappings consistently with Linux. Before
copy-on-write, the private mapping example given in the Background results in:

    Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0 (private)
    Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
      Host VMA:     VA:0x400000 -----------------> host:memory-file:0x3000 (read-only)
      Host filemap:                                host:memory-file:0x3000 -> PA:0x2fb000
      Host PTE:     VA:0x400000 --------------------------------------------> PA:0x2fb000 (read-only)

When the application attempts to write to this address, the host kernel delivers
information about the resulting page fault to the sentry. Analogous to Linux,
the sentry allocates memory to store the mapping's private copy of the file's
contents, copies file contents to the allocated memory, and changes the host VMA
to map to the private copy. Supposing that the sentry chooses the offset 0x4000
in host file `memory-file` to store the private copy, the state of the system
after copy-on-write is:

    Sentry VMA:     VA:0x400000 -> /tmp/foo:0x0 (private)
    Sentry filemap:                /tmp/foo:0x0 -> host:memory-file:0x3000
      Host VMA:     VA:0x400000 -----------------> host:memory-file:0x4000
      Host filemap:                                host:memory-file:0x4000 -> PA:0x5ea000
      Host PTE:     VA:0x400000 --------------------------------------------> PA:0x5ea000

However, this highlights an important difference between Linux and the sentry.
In Linux, page tables are concrete (architecture-dependent) data structures
owned by the kernel. Conversely, the sentry has the ability to create and
destroy host VMAs using host system calls, but it does not have direct access to
their state. Thus, as written, if the application invokes the `munmap` system
call to remove the sentry VMA, it is non-trivial for the sentry to determine
that it should deallocate `host:memory-file:0x4000`. This implies that the
sentry must retain information about the host VMAs that it has created.
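This copy-on-write step can likewise be sketched in C: copy the contents to a
new offset in the memory file, then atomically replace the host VMA in place
with `MAP_FIXED` so that the same virtual address maps the private copy. As
before, this is an illustration of the technique rather than the sentry's
actual (Go) implementation; helper name and offsets are illustrative, and Linux
`memfd_create` is assumed.

```c
#define _GNU_SOURCE /* for memfd_create */
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 0 on success. */
int demo_sentry_style_cow(void) {
    int memfd = memfd_create("memory-file", 0);
    assert(memfd >= 0);
    assert(ftruncate(memfd, 0x10000) == 0);

    /* Shared copy at memory-file:0x3000, mapped read-only. A write through
       this host VMA would fault. */
    assert(pwrite(memfd, "original", 8, 0x3000) == 8);
    char *va = mmap(NULL, 0x1000, PROT_READ, MAP_SHARED, memfd, 0x3000);
    assert(va != MAP_FAILED);
    assert(memcmp(va, "original", 8) == 0);

    /* On the write fault: allocate a new offset (0x4000), copy the contents,
       and replace the host VMA in place with MAP_FIXED so the same VA now
       maps the private copy, writable. */
    char page[0x1000];
    assert(pread(memfd, page, 0x1000, 0x3000) == 0x1000);
    assert(pwrite(memfd, page, 0x1000, 0x4000) == 0x1000);
    char *va2 = mmap(va, 0x1000, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_FIXED, memfd, 0x4000);
    assert(va2 == va);

    memcpy(va, "modified", 8);

    /* The original copy at memory-file:0x3000 is untouched. */
    char buf[8];
    assert(pread(memfd, buf, 8, 0x3000) == 8);
    assert(memcmp(buf, "original", 8) == 0);

    assert(munmap(va, 0x1000) == 0);
    close(memfd);
    return 0;
}
```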
   226  
## Anonymous Mappings

The sentry implements anonymous mappings consistently with Linux, except that
there is no shared zero page.

# Implementation Constructs

In Linux:

-   A virtual address space is represented by `struct mm_struct`.

-   VMAs are represented by `struct vm_area_struct`, stored in `struct
    mm_struct::mmap`.

-   Mappings from file offsets to physical memory are stored in `struct
    address_space`.

-   Reverse mappings from file offsets to virtual mappings are stored in `struct
    address_space::i_mmap`.

-   Physical memory pages are represented by a pointer to `struct page` or an
    index called a *page frame number* (PFN), represented by `pfn_t`.

-   PTEs are represented by architecture-dependent type `pte_t`, stored in a
    table hierarchy rooted at `struct mm_struct::pgd`.

In the sentry:

-   A virtual address space is represented by type [`mm.MemoryManager`][mm].

-   Sentry VMAs are represented by type [`mm.vma`][mm], stored in
    `mm.MemoryManager.vmas`.

-   Mappings from sentry file offsets to host file offsets are abstracted
    through interface method [`memmap.Mappable.Translate`][memmap].

-   Reverse mappings from sentry file offsets to virtual mappings are abstracted
    through interface methods
    [`memmap.Mappable.AddMapping` and `memmap.Mappable.RemoveMapping`][memmap].

-   Host files that may be mapped into host VMAs are represented by type
    [`platform.File`][platform].

-   Host VMAs are represented in the sentry by type [`mm.pma`][mm] ("platform
    mapping area"), stored in `mm.MemoryManager.pmas`.

-   Creation and destruction of host VMAs is abstracted through interface
    methods
    [`platform.AddressSpace.MapFile` and `platform.AddressSpace.Unmap`][platform].

[memmap]: https://github.com/google/gvisor/blob/master/pkg/sentry/memmap/memmap.go
[mm]: https://github.com/google/gvisor/blob/master/pkg/sentry/mm/mm.go
[pgalloc]: https://github.com/google/gvisor/blob/master/pkg/sentry/pgalloc/pgalloc.go
[platform]: https://github.com/google/gvisor/blob/master/pkg/sentry/platform/platform.go