This package provides an implementation of the Linux virtual filesystem.

[TOC]

## Overview

-   An `fs.Dirent` caches an `fs.Inode` in memory at a path in the VFS, giving
    the `fs.Inode` a relative position with respect to other `fs.Inode`s.

-   If an `fs.Dirent` is referenced by two file descriptors, then those file
    descriptors are coherent with each other: they depend on the same
    `fs.Inode`.

-   A mount point is an `fs.Dirent` for which `fs.Dirent.mounted` is true. It
    exposes the root of a mounted filesystem.

-   The `fs.Inode` produced by a registered filesystem on mount(2) owns an
    `fs.MountedFilesystem` from which other `fs.Inode`s will be looked up. For a
    remote filesystem, the `fs.MountedFilesystem` owns the connection to that
    remote filesystem.

-   In general:

```
fs.Inode <------------------------------
|                                      |
|                                      |
produced by                            |
exactly one                            |
|                             responsible for the
|                             virtual identity of
v                                      |
fs.MountedFilesystem -------------------
```
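
The relationships above can be sketched with simplified stand-in types. Every type, field, and function below (`MountedFilesystem`, `Inode`, `Dirent`, `File`, `NewInode`, `Coherent`) is a hypothetical model written for illustration, not the package's actual definition. The sketch only shows that each inode is produced by exactly one mounted filesystem, which assigns its virtual identity, and that files sharing a dirent share an inode.

```go
package main

import "fmt"

// MountedFilesystem is a simplified, hypothetical stand-in for
// fs.MountedFilesystem: it hands out virtual inode numbers.
type MountedFilesystem struct {
	Device  string
	nextIno uint64
}

// Inode is a hypothetical stand-in for fs.Inode. Every Inode is produced by
// exactly one MountedFilesystem, which is responsible for its virtual identity.
type Inode struct {
	Ino uint64
	MFS *MountedFilesystem
}

// Dirent is a hypothetical stand-in for fs.Dirent: it places an Inode at a
// position relative to other Dirents.
type Dirent struct {
	Name    string
	Inode   *Inode
	Parent  *Dirent
	Mounted bool // true for a mount point
}

// File models an open file that references a Dirent.
type File struct {
	Dirent *Dirent
	Offset int64
}

// NewInode allocates an Inode whose virtual number is assigned by m.
func (m *MountedFilesystem) NewInode() *Inode {
	m.nextIno++
	return &Inode{Ino: m.nextIno, MFS: m}
}

// Coherent reports whether two open files depend on the same Inode.
func Coherent(a, b *File) bool {
	return a.Dirent.Inode == b.Dirent.Inode
}

func main() {
	mfs := &MountedFilesystem{Device: "example-device"}
	root := &Dirent{Name: "/", Inode: mfs.NewInode(), Mounted: true}
	child := &Dirent{Name: "data", Inode: mfs.NewInode(), Parent: root}

	f1, f2 := &File{Dirent: child}, &File{Dirent: child}
	fmt.Println("coherent:", Coherent(f1, f2))
	fmt.Println("same mounted filesystem:", child.Inode.MFS == root.Inode.MFS)
}
```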

Glossary:

-   VFS: virtual filesystem.

-   inode: a virtual file object holding a cached view of a file on a backing
    filesystem (includes metadata and page caches).

-   superblock: the virtual state of a mounted filesystem (e.g. the virtual
    inode number set).

-   mount namespace: a view of the mounts under a root (during path traversal,
    the VFS makes visible/follows the mount point that is in the current task's
    mount namespace).

## Save and restore

An application's hard dependencies on filesystem state can be broken down into
two categories:

-   The state necessary to execute a traversal on or view the *virtual*
    filesystem hierarchy, regardless of what files an application has open.

-   The state necessary to represent open files.

The first is always necessary to save and restore. An application may never have
any open file descriptors, but across save and restore it should see a coherent
view of any mount namespace. NOTE(b/63601033): Currently only one "initial"
mount namespace is supported.

The second is so that system calls across save and restore are coherent with
each other (e.g. so that unintended re-reads or overwrites do not occur).

Specifically this state is:

-   An `fs.MountManager` containing mount points.

-   A `kernel.FDTable` containing pointers to open files.

Anything else managed by the VFS that can be easily loaded into memory from a
filesystem is synced back to those filesystems and is not saved. Examples are
pages in page caches used for optimizations (i.e. readahead and writeback), and
directory entries used to accelerate path lookups.

### Mount points

Saving and restoring a mount point means saving and restoring:

-   The root of the mounted filesystem.

-   Mount flags, which control how the VFS interacts with the mounted
    filesystem.

-   Any relevant metadata about the mounted filesystem.

-   All `fs.Inode`s referenced by the application that reside under the mount
    point.

`fs.MountedFilesystem` is metadata about a filesystem that is mounted. It is
referenced by every `fs.Inode` loaded into memory under the mount point,
including the `fs.Inode` of the mount point itself. The `fs.MountedFilesystem`
maps file objects on the filesystem to a virtualized `fs.Inode` number and vice
versa.

To restore all `fs.Inode`s under a given mount point, each `fs.Inode` leverages
its dependency on an `fs.MountedFilesystem`. Since the `fs.MountedFilesystem`
knows how an `fs.Inode` maps to a file object on a backing filesystem, this
mapping can be trivially consulted by each `fs.Inode` when the `fs.Inode` is
restored.

In detail, a mount point is saved in two steps:

-   First, after the kernel is paused but before state.Save, we walk all mount
    namespaces and install a mapping from `fs.Inode` numbers to file paths
    relative to the root of the mounted filesystem in each
    `fs.MountedFilesystem`. This is subsequently called the set of `fs.Inode`
    mappings.

-   Second, during state.Save, each `fs.MountedFilesystem` decides whether to
    save the set of `fs.Inode` mappings. In-memory filesystems, like tmpfs, have
    no need to save a set of `fs.Inode` mappings, since the `fs.Inode`s can be
    entirely encoded in the state file. Each `fs.MountedFilesystem` also
    optionally saves the device name from when the filesystem was originally
    mounted. Each `fs.Inode` saves its virtual identifier and a reference to an
    `fs.MountedFilesystem`.
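
The two save steps can be sketched as follows. This is a minimal model under stated assumptions: the types `inode`, `dirent`, and `mountedFilesystem`, the methods `installMappings` and `savedMappings`, and the `inMemory` flag are all hypothetical names chosen for illustration. Step one walks the tree and records each inode number with its path relative to the mount root. Step two lets an in-memory filesystem skip saving the mappings.

```go
package main

import (
	"fmt"
	"path"
)

// inode and dirent are hypothetical, simplified stand-ins for fs.Inode and
// fs.Dirent.
type inode struct{ ino uint64 }

type dirent struct {
	name     string
	inode    *inode
	children []*dirent
}

// mountedFilesystem is a hypothetical stand-in for fs.MountedFilesystem.
type mountedFilesystem struct {
	inMemory bool              // e.g. a tmpfs-like filesystem
	mappings map[uint64]string // inode number -> path relative to the root
}

// installMappings models the first step: walk the tree under root and record
// each inode number with its path relative to root.
func (m *mountedFilesystem) installMappings(root *dirent) {
	m.mappings = make(map[uint64]string)
	var walk func(d *dirent, rel string)
	walk = func(d *dirent, rel string) {
		m.mappings[d.inode.ino] = rel
		for _, c := range d.children {
			walk(c, path.Join(rel, c.name))
		}
	}
	walk(root, ".")
}

// savedMappings models the second step: an in-memory filesystem saves no
// mappings, because its inodes are fully encoded in the state file.
func (m *mountedFilesystem) savedMappings() map[uint64]string {
	if m.inMemory {
		return nil
	}
	return m.mappings
}

// exampleTree builds a small tree: / -> etc -> hosts.
func exampleTree() *dirent {
	return &dirent{name: "/", inode: &inode{1}, children: []*dirent{
		{name: "etc", inode: &inode{2}, children: []*dirent{
			{name: "hosts", inode: &inode{3}},
		}},
	}}
}

func main() {
	m := &mountedFilesystem{}
	m.installMappings(exampleTree())
	fmt.Println(m.savedMappings())
}
```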

A mount point is restored in two steps:

-   First, before state.Load, all mount configurations are stored in a global
    `fs.RestoreEnvironment`. This tells us what mount points the user wants to
    restore and how to re-establish pointers to backing filesystems.

-   Second, during state.Load, each `fs.MountedFilesystem` optionally searches
    for a mount in the `fs.RestoreEnvironment` that matches its saved device
    name. The `fs.MountedFilesystem` then re-establishes a pointer to the root
    of the mounted filesystem. For example, the mount specification provides the
    network connection for a mounted remote filesystem client to communicate
    with its remote file server. The `fs.MountedFilesystem` also trivially loads
    its set of `fs.Inode` mappings. When an `fs.Inode` is encountered, the
    `fs.Inode` loads its virtual identifier and its reference to an
    `fs.MountedFilesystem`. It uses the `fs.MountedFilesystem` to obtain the
    root of the mounted filesystem and the `fs.Inode` mappings to obtain the
    relative file path to its data. With these, the `fs.Inode` re-establishes a
    pointer to its file object.

A mount point can trivially restore its `fs.Inode`s in parallel since
`fs.Inode`s have a restore dependency on their `fs.MountedFilesystem` and not on
each other.
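
The restore flow and its parallelism can be sketched as below. The types `restoreEnvironment`, `mountedFilesystem`, and `inode`, and the string `backing` field that stands in for a pointer to a file object, are assumptions made for illustration only. The mounted filesystem is restored first by matching its saved device name. Each inode then depends only on that mounted filesystem, so the inodes can be restored concurrently.

```go
package main

import (
	"fmt"
	"path"
	"sync"
)

// restoreEnvironment is a hypothetical stand-in for fs.RestoreEnvironment: it
// maps a saved device name to the root of the re-mounted backing filesystem.
type restoreEnvironment map[string]string

// mountedFilesystem is a hypothetical stand-in for fs.MountedFilesystem.
type mountedFilesystem struct {
	device   string            // saved device name
	mappings map[uint64]string // saved inode number -> relative path
	root     string            // re-established during restore
}

// inode is a hypothetical stand-in for fs.Inode.
type inode struct {
	ino     uint64
	mfs     *mountedFilesystem
	backing string // stand-in for the pointer to the file object
}

// restore finds the mount that matches the saved device name.
func (m *mountedFilesystem) restore(env restoreEnvironment) error {
	root, ok := env[m.device]
	if !ok {
		return fmt.Errorf("no mount configured for device %q", m.device)
	}
	m.root = root
	return nil
}

// restore re-establishes the inode's file object from the mount root and the
// saved relative path. It reads only from the mounted filesystem.
func (i *inode) restore() {
	i.backing = path.Join(i.mfs.root, i.mfs.mappings[i.ino])
}

// restoreAll restores inodes concurrently; this is safe in the sketch because
// each inode depends only on its already-restored mountedFilesystem.
func restoreAll(inodes []*inode) {
	var wg sync.WaitGroup
	for _, i := range inodes {
		wg.Add(1)
		go func(i *inode) {
			defer wg.Done()
			i.restore()
		}(i)
	}
	wg.Wait()
}

func main() {
	mfs := &mountedFilesystem{
		device:   "remote0",
		mappings: map[uint64]string{1: ".", 2: "etc/hosts"},
	}
	if err := mfs.restore(restoreEnvironment{"remote0": "/mnt/restored"}); err != nil {
		panic(err)
	}
	inodes := []*inode{{ino: 1, mfs: mfs}, {ino: 2, mfs: mfs}}
	restoreAll(inodes)
	for _, i := range inodes {
		fmt.Println(i.ino, i.backing)
	}
}
```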

### Open files

An `fs.File` references the following filesystem objects:

```
fs.File -> fs.Dirent -> fs.Inode -> fs.MountedFilesystem
```

The `fs.Inode` is restored using its `fs.MountedFilesystem`. The
[Mount points](#mount-points) section above describes how this happens in
detail. The `fs.Dirent` restores its pointer to an `fs.Inode`, pointers to
parent and children `fs.Dirent`s, and the basename of the file.

Otherwise, an `fs.File` restores flags, an offset, and a unique identifier (only
used internally).

It may use the `fs.Inode`, which it indirectly holds a reference on through the
`fs.Dirent`, to re-establish an open file handle on the backing filesystem (e.g.
to continue reading and writing).

## Overlay

The overlay implementation in the fs package takes Linux overlayfs as a frame of
reference but corrects for several POSIX consistency errors.

In Linux overlayfs, the `struct inode` used for reading and writing to the same
file may be different. This is because the `struct inode` is dissociated from
the process of copying up the file from the lower to the upper directory. Since
flock(2) and fcntl(2) locks, inotify(7) watches, page caches, and a file's
identity are all stored directly or indirectly off the `struct inode`, these
properties of the `struct inode` may be stale after the first modification. This
can lead to file locking bugs, missed inotify events, and inconsistent data in
shared memory mappings of files, to name a few problems.

The fs package maintains a single `fs.Inode` to represent a directory entry in
an overlay and defines operations on this `fs.Inode` which synchronize with the
copy up process. This achieves several things:

+   File locks, inotify watches, and the identity of the file need not be copied
    at all.

+   Memory mappings of files coordinate with the copy up process so that if a
    file in the lower directory is memory mapped, all references to it are
    invalidated, forcing the application to re-fault on memory mappings of the
    file under the upper directory.

The `fs.Inode` holds metadata about files in the upper and/or lower directories
via an `fs.overlayEntry`. The `fs.overlayEntry` implements the `fs.Mappable`
interface. It multiplexes between upper and lower directory memory mappings and
stores a copy of memory references so they can be transferred to the upper
directory `fs.Mappable` when the file is copied up.

The lower filesystem in an overlay may contain another (nested) overlay, but the
upper filesystem may not contain another overlay. In other words, nested
overlays form a tree structure that only allows branching in the lower
filesystem.

Caching decisions in the overlay are delegated to the upper filesystem, meaning
that the `Keep` and `Revalidate` methods on the overlay return the same values
as the upper filesystem. A small wrinkle is that the lower filesystem is not
allowed to return `true` from `Revalidate`, as the overlay cannot reload inodes
from the lower filesystem. A lower filesystem that does return `true` from
`Revalidate` will trigger a panic.
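
The delegation rule can be sketched as follows. The `cachePolicy` interface, the `fixedPolicy` and `overlayPolicy` types, and the `panics` helper are hypothetical. In particular, the `Keep` and `Revalidate` methods here take no arguments, which is a simplification for illustration and not the package's real signatures.

```go
package main

import "fmt"

// cachePolicy is a hypothetical, argument-free model of the Keep and
// Revalidate caching decisions.
type cachePolicy interface {
	Keep() bool
	Revalidate() bool
}

// fixedPolicy always answers with fixed values.
type fixedPolicy struct{ keep, revalidate bool }

func (p fixedPolicy) Keep() bool       { return p.keep }
func (p fixedPolicy) Revalidate() bool { return p.revalidate }

// overlayPolicy delegates caching decisions to the upper filesystem.
type overlayPolicy struct{ upper, lower cachePolicy }

func (o overlayPolicy) Keep() bool { return o.upper.Keep() }

// Revalidate returns the upper filesystem's answer and panics if the lower
// filesystem asks for revalidation, which the overlay cannot honor.
func (o overlayPolicy) Revalidate() bool {
	if o.lower.Revalidate() {
		panic("lower filesystem returned true from Revalidate")
	}
	return o.upper.Revalidate()
}

// panics reports whether calling f panics.
func panics(f func()) (p bool) {
	defer func() { p = recover() != nil }()
	f()
	return false
}

func main() {
	ok := overlayPolicy{
		upper: fixedPolicy{keep: true, revalidate: true},
		lower: fixedPolicy{},
	}
	fmt.Println(ok.Keep(), ok.Revalidate())

	bad := overlayPolicy{upper: fixedPolicy{}, lower: fixedPolicy{revalidate: true}}
	fmt.Println("panics:", panics(func() { bad.Revalidate() }))
}
```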

The `fs.Inode` also holds a reference to an `fs.MountedFilesystem` that
normalizes across the mounted filesystem state of the upper and lower
directories.

When a file is copied from the lower to the upper directory, attempts to
interact with the file block until the copy completes. All copying synchronizes
with rename(2).
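
The blocking behavior during copy up can be sketched with a reader/writer lock. This is an assumption-laden model: the `overlayEntry` struct below is a hypothetical simplification that stores file contents as byte slices, and using a `sync.RWMutex` named `copyMu` is an illustrative choice, not a statement about the package's actual locking. Readers share the lock, while the first write copies the lower contents up under the exclusive lock, so concurrent readers wait until the copy completes and then see the upper file.

```go
package main

import (
	"fmt"
	"sync"
)

// overlayEntry is a hypothetical, simplified model of a single overlay file.
type overlayEntry struct {
	copyMu sync.RWMutex
	lower  []byte // contents in the lower directory (never modified)
	upper  []byte // contents in the upper directory; nil until copied up
}

// read returns the current contents, preferring the upper file once it exists.
// It blocks while a copy up holds copyMu exclusively.
func (e *overlayEntry) read() string {
	e.copyMu.RLock()
	defer e.copyMu.RUnlock()
	if e.upper != nil {
		return string(e.upper)
	}
	return string(e.lower)
}

// write appends p, copying the file up first if it only exists in the lower
// directory. The copy and the write happen under the exclusive lock.
func (e *overlayEntry) write(p []byte) {
	e.copyMu.Lock()
	defer e.copyMu.Unlock()
	if e.upper == nil {
		// Copy up: duplicate the lower contents into the upper file.
		e.upper = append([]byte{}, e.lower...)
	}
	e.upper = append(e.upper, p...)
}

func main() {
	e := &overlayEntry{lower: []byte("abc")}
	fmt.Println(e.read())
	e.write([]byte("d"))
	fmt.Println(e.read(), string(e.lower))
}
```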

## Future Work

### Overlay

When a file is copied from a lower directory to an upper directory, several
locks are taken: the global `renameMu` and the `copyMu` of the `fs.Inode` being
copied. This blocks operations on the file, including fault handling of memory
mappings. Performance could be improved by copying files into a temporary
directory that resides on the same filesystem as the upper directory and doing
an atomic rename, holding locks only during the rename operation.
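
The proposed improvement can be sketched with the standard library. This is only an illustration of the copy-then-rename pattern on a host filesystem, not code from the package: `copyUpViaRename` and `demo` are hypothetical names, and the sketch places the temporary file in the destination's own directory as the simplest way to stay on the same filesystem. In a real implementation only the final `os.Rename` step would need to run under the locks.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// copyUpViaRename copies lowerPath into a temporary file next to upperPath and
// then renames it into place. Only the rename needs to be atomic with respect
// to other operations on the file.
func copyUpViaRename(lowerPath, upperPath string) (err error) {
	src, err := os.Open(lowerPath)
	if err != nil {
		return err
	}
	defer src.Close()

	// Create the temporary file in the destination directory so that the
	// rename never crosses a filesystem boundary.
	tmp, err := os.CreateTemp(filepath.Dir(upperPath), ".copyup-*")
	if err != nil {
		return err
	}
	defer func() {
		if err != nil {
			os.Remove(tmp.Name()) // best-effort cleanup on failure
		}
	}()

	if _, err = io.Copy(tmp, src); err != nil {
		tmp.Close()
		return err
	}
	if err = tmp.Close(); err != nil {
		return err
	}
	// The slow copy is done; this is the only step that would hold locks.
	return os.Rename(tmp.Name(), upperPath)
}

// demo runs the copy in a throwaway directory and returns the upper contents.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "overlay-sketch")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	lower := filepath.Join(dir, "lower.txt")
	upper := filepath.Join(dir, "upper.txt")
	if err := os.WriteFile(lower, []byte("hello"), 0o644); err != nil {
		return "", err
	}
	if err := copyUpViaRename(lower, upper); err != nil {
		return "", err
	}
	b, err := os.ReadFile(upper)
	return string(b), err
}

func main() {
	out, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```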

Additionally, files are copied up synchronously. For large files, this causes
noticeable latency. Performance could be improved by pipelining copies at
non-overlapping file offsets.
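
One possible reading of this idea is to split the file into disjoint ranges and copy them concurrently. The sketch below assumes that reading, and every name in it (`memFile`, `copyChunks`, `pipelinedCopy`) is hypothetical. It starts one goroutine per chunk for brevity; a real implementation would likely bound the number of workers.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
	"sync"
)

// memFile is a hypothetical fixed-size in-memory destination.
type memFile struct{ data []byte }

// WriteAt writes p at off; writes to disjoint ranges do not interfere.
func (m *memFile) WriteAt(p []byte, off int64) (int, error) {
	if off < 0 || off+int64(len(p)) > int64(len(m.data)) {
		return 0, io.ErrShortWrite
	}
	return copy(m.data[off:], p), nil
}

// copyChunks copies size bytes from src to dst in chunks of at most chunk
// bytes. Each chunk covers a non-overlapping offset range and is copied by
// its own goroutine.
func copyChunks(dst io.WriterAt, src io.ReaderAt, size, chunk int64) error {
	if chunk <= 0 || size < 0 {
		return errors.New("chunk must be positive and size non-negative")
	}
	var wg sync.WaitGroup
	errs := make(chan error, (size+chunk-1)/chunk) // room for one error per chunk
	for off := int64(0); off < size; off += chunk {
		n := chunk
		if off+n > size {
			n = size - off
		}
		wg.Add(1)
		go func(off, n int64) {
			defer wg.Done()
			buf := make([]byte, n)
			// Read exactly the range [off, off+n).
			if _, err := io.ReadFull(io.NewSectionReader(src, off, n), buf); err != nil {
				errs <- err
				return
			}
			if _, err := dst.WriteAt(buf, off); err != nil {
				errs <- err
			}
		}(off, n)
	}
	wg.Wait()
	close(errs)
	return <-errs // nil if no worker reported an error
}

// pipelinedCopy copies s into a fresh memFile and returns the result.
func pipelinedCopy(s string, chunk int64) (string, error) {
	dst := &memFile{data: make([]byte, len(s))}
	if err := copyChunks(dst, strings.NewReader(s), int64(len(s)), chunk); err != nil {
		return "", err
	}
	return string(dst.data), nil
}

func main() {
	out, err := pipelinedCopy("0123456789", 3)
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```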