# The gVisor Virtual Filesystem

THIS PACKAGE IS CURRENTLY EXPERIMENTAL AND NOT READY OR ENABLED FOR PRODUCTION
USE. For the filesystem implementation currently used by gVisor, see the `fs`
package.

## Implementation Notes

### Reference Counting

Filesystem, Dentry, Mount, MountNamespace, and FileDescription are all
reference-counted. Mount and MountNamespace are exclusively VFS-managed; when
their reference count reaches zero, VFS releases their resources. Filesystem and
FileDescription management is shared between VFS and filesystem implementations;
when their reference count reaches zero, VFS notifies the implementation by
calling `FilesystemImpl.Release()` or `FileDescriptionImpl.Release()`
respectively and then releases VFS-owned resources. Dentries are exclusively
managed by filesystem implementations; reference count changes are abstracted
through DentryImpl, which should release resources when the reference count
reaches zero.
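
The shared-management case for FileDescription can be sketched as follows. This is a minimal, self-contained model and not the package's actual code: the `refs` and `vfsReleased` fields, the simplified `Init`, `IncRef`, and `DecRef` signatures, and the `myFD` type are all assumptions made for illustration. It only shows the ordering described above: when the last reference is dropped, the implementation's `Release()` is called first, and VFS-owned resources are released afterwards.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// FileDescriptionImpl is a trimmed-down, hypothetical stand-in for the
// implementation-side interface. Only Release is modeled.
type FileDescriptionImpl interface {
	Release()
}

// FileDescription models the VFS-side object that owns the reference count.
type FileDescription struct {
	refs        int64 // accessed with sync/atomic
	vfsReleased bool  // illustrative marker: VFS-owned resources released
	impl        FileDescriptionImpl
}

// Init sets the initial reference, which belongs to the caller.
func (fd *FileDescription) Init(impl FileDescriptionImpl) {
	fd.refs = 1
	fd.impl = impl
}

// IncRef takes an additional reference.
func (fd *FileDescription) IncRef() {
	atomic.AddInt64(&fd.refs, 1)
}

// DecRef drops a reference. When the count reaches zero it notifies the
// implementation and then releases VFS-owned state.
func (fd *FileDescription) DecRef() {
	n := atomic.AddInt64(&fd.refs, -1)
	if n < 0 {
		panic("DecRef called without a matching reference")
	}
	if n == 0 {
		fd.impl.Release()     // notify the filesystem implementation first
		fd.vfsReleased = true // then release VFS-owned resources
	}
}

// myFD is a hypothetical filesystem-specific file description.
type myFD struct {
	vfsfd    FileDescription
	releases int // number of times Release has been called
}

// Release implements FileDescriptionImpl.
func (f *myFD) Release() { f.releases++ }

func newMyFD() *myFD {
	f := &myFD{}
	f.vfsfd.Init(f)
	return f
}

func main() {
	f := newMyFD()
	f.vfsfd.IncRef() // e.g. a second file descriptor created by dup(2)
	f.vfsfd.DecRef()
	fmt.Println("after first DecRef, releases:", f.releases)
	f.vfsfd.DecRef()
	fmt.Println("after last DecRef, releases:", f.releases)
}
```

In this model, `Release()` runs exactly once, on the `DecRef` that drops the count to zero.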

Filesystem references are held by:

-   Mount: Each referenced Mount holds a reference on the mounted Filesystem.

Dentry references are held by:

-   FileDescription: Each referenced FileDescription holds a reference on the
    Dentry through which it was opened, via `FileDescription.vd.dentry`.

-   Mount: Each referenced Mount holds a reference on its mount point and on the
    mounted filesystem root. The mount point is mutable (`mount(MS_MOVE)`).

Mount references are held by:

-   FileDescription: Each referenced FileDescription holds a reference on the
    Mount on which it was opened, via `FileDescription.vd.mount`.

-   Mount: Each referenced Mount holds a reference on its parent, which is the
    mount containing its mount point.

-   VirtualFilesystem: A reference is held on each Mount that has been connected
    to a mount point, but not yet unmounted.

MountNamespace and FileDescription references are held by users of VFS. The
expectation is that each `kernel.Task` holds a reference on its corresponding
MountNamespace, and each file descriptor holds a reference on its represented
FileDescription.

Notes:

-   Dentries do not hold a reference on their owning Filesystem. Instead, all
    uses of a Dentry occur in the context of a Mount, which holds a reference on
    the relevant Filesystem (see e.g. the VirtualDentry type). As a corollary,
    when releasing references on both a Dentry and its corresponding Mount, the
    Dentry's reference must be released first (because releasing the Mount's
    reference may release the last reference on the Filesystem, whose state may
    be required to release the Dentry reference).
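
The release-order rule in the note above can be illustrated with a toy model. Everything here is a simplified assumption, not the package's implementation: the reference counts are plain, non-atomic integers, `Dentry.DecRef` returns an error purely so the failure is observable, and `decRefWrongOrder` is a hypothetical helper that exists only to show what goes wrong.

```go
package main

import (
	"errors"
	"fmt"
)

// Filesystem is released when its last reference is dropped.
type Filesystem struct {
	refs     int
	released bool
}

func (fs *Filesystem) DecRef() {
	fs.refs--
	if fs.refs == 0 {
		fs.released = true
	}
}

// Mount holds a reference on the mounted Filesystem.
type Mount struct {
	refs int
	fs   *Filesystem
}

func (m *Mount) DecRef() {
	m.refs--
	if m.refs == 0 {
		m.fs.DecRef() // may drop the last Filesystem reference
	}
}

// Dentry points at its Filesystem but holds no reference on it.
type Dentry struct {
	refs int
	fs   *Filesystem
}

// DecRef needs live Filesystem state in this model, so it fails if the
// Filesystem has already been released.
func (d *Dentry) DecRef() error {
	if d.fs.released {
		return errors.New("dentry released after its filesystem")
	}
	d.refs--
	return nil
}

// VirtualDentry pairs a Dentry with the Mount through which it is used.
type VirtualDentry struct {
	mount  *Mount
	dentry *Dentry
}

// DecRef releases the Dentry reference first, then the Mount reference.
func (vd VirtualDentry) DecRef() error {
	err := vd.dentry.DecRef()
	vd.mount.DecRef()
	return err
}

// decRefWrongOrder is a hypothetical helper showing the incorrect order.
func (vd VirtualDentry) decRefWrongOrder() error {
	vd.mount.DecRef()
	return vd.dentry.DecRef()
}

// newVirtualDentry builds a pair whose Mount holds the only Filesystem
// reference.
func newVirtualDentry() VirtualDentry {
	fs := &Filesystem{refs: 1}
	return VirtualDentry{
		mount:  &Mount{refs: 1, fs: fs},
		dentry: &Dentry{refs: 1, fs: fs},
	}
}

func main() {
	fmt.Println("dentry first:", newVirtualDentry().DecRef())
	fmt.Println("mount first: ", newVirtualDentry().decRefWrongOrder())
}
```

Releasing the Mount first drops the last Filesystem reference in this model, so the subsequent Dentry release finds the Filesystem already gone.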

### The Inheritance Pattern

Filesystem, Dentry, and FileDescription are all concepts featuring both state
that must be shared between VFS and filesystem implementations, and operations
that are implementation-defined. To facilitate this, each of these three
concepts follows the same pattern, shown below for Dentry:

```go
// Dentry represents a node in a filesystem tree.
type Dentry struct {
  // VFS-required dentry state.
  parent *Dentry
  // ...

  // impl is the DentryImpl associated with this Dentry. impl is immutable.
  // This should be the last field in Dentry.
  impl DentryImpl
}

// Init must be called before first use of d.
func (d *Dentry) Init(impl DentryImpl) {
  d.impl = impl
}

// Impl returns the DentryImpl associated with d.
func (d *Dentry) Impl() DentryImpl {
  return d.impl
}

// DentryImpl contains implementation-specific details of a Dentry.
// Implementations of DentryImpl should contain their associated Dentry by
// value as their first field.
type DentryImpl interface {
  // VFS-required implementation-defined dentry operations.
  IncRef()
  // ...
}
```

This construction, which is essentially a type-safe analogue to Linux's
`container_of` pattern, has the following properties:

-   VFS works almost exclusively with pointers to Dentry rather than DentryImpl
    interface objects, such as in the type of `Dentry.parent`. This avoids
    interface method calls (which are somewhat expensive to perform, and defeat
    inlining and escape analysis), reduces the size of VFS types (since an
    interface object is two pointers in size), and allows pointers to be loaded
    and stored atomically using `sync/atomic`. Implementation-defined behavior
    is accessed via `Dentry.impl` when required.

-   Filesystem implementations can access the implementation-defined state
    associated with objects of VFS types by type-asserting or type-switching
    (e.g. `Dentry.Impl().(*myDentry)`). Type assertions to a concrete type
    require only an equality comparison of the interface object's type pointer
    to a static constant, and are consequently very fast.

-   Filesystem implementations can access the VFS state associated with objects
    of implementation-defined types directly.

-   VFS and implementation-defined state for a given type occupy the same
    object, minimizing memory allocations and maximizing memory locality. `impl`
    is the last field in `Dentry`, and `Dentry` is the first field in
    `DentryImpl` implementations, for similar reasons: this tends to cause
    fetching of the `Dentry.impl` interface object to also fetch `DentryImpl`
    fields, either because they are in the same cache line or via next-line
    prefetching.
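
As a sketch of how a filesystem implementation might use this pattern, the example below repeats the `Dentry` and `DentryImpl` definitions in reduced form so that it compiles on its own, then adds a hypothetical `myDentry` type. The names `myDentry`, `vfsd`, `newMyDentry`, `parentName`, and `interfaceIsTwoWords` are assumptions for illustration only, and the reference count is a plain integer rather than an atomic one.

```go
package main

import (
	"fmt"
	"unsafe"
)

// Dentry and DentryImpl are reduced copies of the definitions above.
type Dentry struct {
	parent *Dentry
	impl   DentryImpl // last field
}

func (d *Dentry) Init(impl DentryImpl) { d.impl = impl }
func (d *Dentry) Impl() DentryImpl     { return d.impl }

type DentryImpl interface {
	IncRef()
}

// myDentry is a hypothetical filesystem-specific dentry. It contains its
// Dentry by value as the first field, so both live in one allocation.
type myDentry struct {
	vfsd Dentry
	refs int64
	name string
}

// IncRef implements DentryImpl.
func (d *myDentry) IncRef() { d.refs++ }

func newMyDentry(name string, parent *myDentry) *myDentry {
	d := &myDentry{refs: 1, name: name}
	d.vfsd.Init(d)
	if parent != nil {
		// Implementation code reaches VFS state directly.
		d.vfsd.parent = &parent.vfsd
	}
	return d
}

// parentName goes from a VFS-level *Dentry back to implementation state
// with a type assertion on Impl().
func parentName(d *myDentry) string {
	if d.vfsd.parent == nil {
		return ""
	}
	return d.vfsd.parent.Impl().(*myDentry).name
}

// interfaceIsTwoWords reports whether an interface value occupies twice
// the size of a plain pointer.
func interfaceIsTwoWords() bool {
	var p *Dentry
	var i DentryImpl
	return unsafe.Sizeof(i) == 2*unsafe.Sizeof(p)
}

func main() {
	root := newMyDentry("/", nil)
	child := newMyDentry("etc", root)

	// VFS-side code calls implementation-defined behavior through impl.
	child.vfsd.Impl().IncRef()

	fmt.Println("parent name:", parentName(child))
	fmt.Println("child refs:", child.refs)
	fmt.Println("interface is two words:", interfaceIsTwoWords())
}
```

Because `Dentry.parent` is a `*Dentry` rather than a `DentryImpl`, walking up the tree needs no interface method call. The type assertion is only paid where implementation state is actually required.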

## Future Work

-   Most `mount(2)` features, and unmounting, are incomplete.

-   VFS1 filesystems are not directly compatible with VFS2. It may be possible
    to implement shims that implement `vfs.FilesystemImpl` for
    `fs.MountNamespace`, `vfs.DentryImpl` for `fs.Dirent`, and
    `vfs.FileDescriptionImpl` for `fs.File`, which may be adequate for
    filesystems that are not performance-critical (e.g. sysfs); however, it is
    not clear that this will be less effort than simply porting the filesystems
    in question. Practically speaking, the following filesystems will probably
    need to be ported or made compatible through a shim to evaluate filesystem
    performance on realistic workloads:

    -   devfs/procfs/sysfs, which will realistically be necessary to execute
        most applications. (Note that procfs and sysfs do not support hard
        links, so they do not require the complexity of separate inode objects.
        Also note that Linux's /dev is actually a variant of tmpfs called
        devtmpfs.)

    -   tmpfs. This should be relatively straightforward: copy/paste memfs,
        store regular file contents in pgalloc-allocated memory instead of
        `[]byte`, and add support for file timestamps. (In fact, it probably
        makes more sense to convert memfs to tmpfs and not keep the former.)

    -   A remote filesystem, either lisafs (if it is ready by the time that
        other benchmarking prerequisites are) or v9fs (aka 9P, aka gofers).

    -   epoll files.

    Filesystems that will need to be ported before switching to VFS2, but can
    probably be skipped for early testing:

    -   overlayfs, which is needed for (at least) synthetic mount points.

    -   Support for host ttys.

    -   timerfd files.

    Filesystems that can probably be dropped:

    -   ashmem, which is far too incomplete to use.

    -   binder, which is similarly far too incomplete to use.

-   Save/restore. For instance, it is unclear if the current implementation of
    the `state` package supports the inheritance pattern described above.

-   Many features that were previously implemented by VFS must now be
    implemented by individual filesystems (though, in most cases, this should
    consist of calls to hooks or libraries provided by `vfs` or other packages).
    This includes, but is not necessarily limited to:

    -   Block and character device special files

    -   Inotify

    -   File locking

    -   `O_ASYNC`