# Replacing 9P

## Background

The Linux filesystem model consists of the following key aspects (modulo mounts,
which are outside the scope of this discussion):

-   A `struct inode` represents a "filesystem object", such as a directory or a
    regular file. "Filesystem object" is most precisely defined by the practical
    properties of an inode, such as an immutable type (regular file, directory,
    symbolic link, etc.) and its independence from the path originally used to
    obtain it.

-   A `struct dentry` represents a node in a filesystem tree. Semantically, each
    dentry is immutably associated with an inode representing the filesystem
    object at that position. (Linux implements optimizations involving reuse of
    unreferenced dentries, which allows their associated inodes to change, but
    this is outside the scope of this discussion.)

-   A `struct file` represents an open file description (hereafter FD) and is
    needed to perform I/O. Each FD is immutably associated with the dentry
    through which it was opened.

The current gVisor virtual filesystem implementation (hereafter VFS1) closely
imitates the Linux design:

-   `struct inode` => `fs.Inode`

-   `struct dentry` => `fs.Dirent`

-   `struct file` => `fs.File`

gVisor accesses most external filesystems through a variant of the 9P2000.L
protocol, including extensions for performance (`walkgetattr`) and for features
not supported by vanilla 9P2000.L (`flushf`, `lconnect`). The 9P protocol family
is inode-based; 9P fids represent a file (equivalently "filesystem object"),
and the protocol is structured around alternately obtaining fids to represent
files (with `walk` and, in gVisor, `walkgetattr`) and performing operations on
those fids.

In the sections below, a **shared** filesystem is a filesystem that is *mutably*
accessible by multiple concurrent clients, and a **non-shared** filesystem is a
filesystem that is either read-only or accessible by only a single client.

## Problems

### Serialization of Path Component RPCs

Broadly speaking, VFS1 traverses each path component in a pathname, alternating
between verifying that each traversed dentry represents an inode that represents
a searchable directory and moving to the next dentry in the path.

In the context of a remote filesystem, the structure of this traversal means
that - modulo caching - a path involving N components requires at least N-1
*sequential* RPCs to obtain metadata for intermediate directories, incurring
significant latency. (In vanilla 9P2000.L, 2(N-1) RPCs are required: N-1 `walk`
and N-1 `getattr`. We added the `walkgetattr` RPC to reduce this overhead.) On
non-shared filesystems, this overhead is primarily significant during
application startup; caching mitigates much of this overhead at steady state. On
shared filesystems, where correct caching requires revalidation (requiring RPCs
for each revalidated directory anyway), this overhead is consistently ruinous.

### Inefficient RPCs

9P is not exceptionally economical with RPCs in general. In addition to the
issue described above:

-   Opening an existing file in 9P involves at least 2 RPCs: `walk` to produce
    an unopened fid representing the file, and `lopen` to open the fid.

-   Creating a file also involves at least 2 RPCs: `walk` to produce an unopened
    fid representing the parent directory, and `lcreate` to create the file and
    convert the fid to an open fid representing the created file. In practice,
    both the Linux and gVisor 9P clients expect to have an unopened fid for the
    created file (necessitating an additional `walk`), as well as attributes for
    the created file (necessitating an additional `getattr`), for a total of 4
    RPCs. (In a shared filesystem, where whether a file already exists can
    change between RPCs, a correct implementation of `open(O_CREAT)` would have
    to alternate between these two paths (plus `clunk`ing the temporary fid
    between alternations, since the nature of the `fid` differs between the two
    paths). Neither Linux nor gVisor implements the required alternation, so
    `open(O_CREAT)` without `O_EXCL` can spuriously fail with `EEXIST` on both.)

-   Closing (`clunk`ing) a fid requires an RPC. VFS1 issues this RPC
    asynchronously in an attempt to reduce critical path latency, but scheduling
    overhead makes this not clearly advantageous in practice.

-   `read` and `readdir` can return partial reads without a way to indicate EOF,
    necessitating an additional final read to detect EOF.

-   Operations that affect filesystem state do not consistently return updated
    filesystem state. In gVisor, the client implementation attempts to handle
    this by tracking what it thinks updated state "should" be; this is complex,
    and especially brittle for timestamps (which are often not arbitrarily
    settable). In Linux, the client implementation invalidates cached metadata
    whenever it performs such an operation, and reloads it when a dentry
    corresponding to an inode with no valid cached metadata is revalidated; this
    is simple, but necessitates an additional `getattr`.
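
The `open(O_CREAT)` alternation described above can be sketched as follows.
`walkOpen` and `createOpen` are hypothetical stand-ins for the walk+`lopen` and
walk+`lcreate` RPC pairs (backed here by a toy in-memory map), not part of any
real 9P client:

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errNotExist = errors.New("ENOENT")
	errExist    = errors.New("EEXIST")

	// A toy in-memory "remote filesystem" standing in for the gofer.
	remote = map[string]int{}
	nextFD = 1
)

// walkOpen models walk + lopen: open an existing file or fail with ENOENT.
func walkOpen(name string) (int, error) {
	if fd, ok := remote[name]; ok {
		return fd, nil
	}
	return -1, errNotExist
}

// createOpen models walk(parent) + lcreate: create the file or fail with EEXIST.
func createOpen(name string) (int, error) {
	if _, ok := remote[name]; ok {
		return -1, errExist
	}
	fd := nextFD
	nextFD++
	remote[name] = fd
	return fd, nil
}

// openOrCreate is the alternation a correct open(O_CREAT) (without O_EXCL)
// client would need on a shared filesystem: retry whenever the file's
// existence changes between the two RPCs. (A real client would also clunk
// the temporary fid between alternations.)
func openOrCreate(name string) (int, error) {
	for {
		if fd, err := walkOpen(name); !errors.Is(err, errNotExist) {
			return fd, err
		}
		if fd, err := createOpen(name); !errors.Is(err, errExist) {
			return fd, err
		}
		// Lost a race with a concurrent creator/remover; try again.
	}
}

func main() {
	fd, _ := openOrCreate("/foo")
	fmt.Println("opened fd", fd)
}
```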

### Dentry/Inode Ambiguity

As noted above, 9P's documentation tends to imply that unopened fids represent
an inode. In practice, most filesystem APIs present very limited interfaces for
working with inodes at best, such that the interpretation of unopened fids
varies:

-   Linux's 9P client associates unopened fids with (dentry, uid) pairs. When
    caching is enabled, it also associates each inode with the first fid opened
    writably that references that inode, in order to support page cache
    writeback.

-   gVisor's 9P client associates unopened fids with inodes, and also caches
    opened fids in inodes in a manner similar to Linux.

-   The runsc fsgofer associates unopened fids with both "dentries" (host
    filesystem paths) and "inodes" (host file descriptors); which is used
    depends on the operation invoked on the fid.

For non-shared filesystems, this confusion has resulted in correctness issues
that are (in gVisor) currently handled by a number of coarse-grained locks that
serialize renames with all other filesystem operations. For shared filesystems,
this means inconsistent behavior in the presence of concurrent mutation.

## Design

Almost all Linux filesystem syscalls describe filesystem resources in one of two
ways:

-   Path-based: A filesystem position is described by a combination of a
    starting position and a sequence of path components relative to that
    position, where the starting position is one of:

    -   The VFS root (defined by mount namespace and chroot), for absolute paths

    -   The VFS position of an existing FD, for relative paths passed to `*at`
        syscalls (e.g. `fstatat`)

    -   The current working directory, for relative paths passed to non-`*at`
        syscalls and `*at` syscalls with `AT_FDCWD`

-   File-description-based: A filesystem object is described by an existing FD,
    passed to a `f*` syscall (e.g. `fstat`).
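
The two addressing modes above map naturally onto two RPC shapes. A minimal
sketch, with hypothetical type and field names (not the actual wire format):

```go
package main

import (
	"fmt"
	"strings"
)

// PathOp describes a filesystem position as a starting FD plus path
// components, mirroring path-based syscalls.
type PathOp struct {
	StartFD    uint32   // A root FD, an existing FD, or a CWD FD.
	Components []string // Path components relative to StartFD.
}

// FDOp describes a filesystem object by an existing open FD, mirroring
// f* syscalls such as fstat.
type FDOp struct {
	FD uint32
}

// toPathOp converts an application-supplied path into PathOp form; rootFD
// and cwdFD are hypothetical FDs for the VFS root and working directory.
func toPathOp(path string, rootFD, cwdFD uint32) PathOp {
	start := cwdFD
	if strings.HasPrefix(path, "/") {
		start = rootFD
	}
	var comps []string
	for _, c := range strings.Split(path, "/") {
		if c != "" && c != "." {
			comps = append(comps, c) // ".." is left to server-side traversal.
		}
	}
	return PathOp{StartFD: start, Components: comps}
}

func main() {
	fmt.Println(toPathOp("/etc/passwd", 1, 2)) // absolute: anchored at rootFD
	fmt.Println(toPathOp("a/b", 1, 2))         // relative: anchored at cwdFD
}
```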

Many of our issues with 9P arise from its (and VFS's) interposition of a model
based on inodes between the filesystem syscall API and filesystem
implementations. We propose to replace 9P with a protocol that does not feature
inodes at all, and instead closely follows the filesystem syscall API by
featuring only path-based and FD-based operations, with minimal deviations as
necessary to ameliorate deficiencies in the syscall interface (see below). This
approach addresses the issues described above:

-   Even on shared filesystems, most application filesystem syscalls are
    translated to a single RPC (possibly excepting special cases described
    below), which is a logical lower bound.

-   The behavior of application syscalls on shared filesystems is
    straightforwardly predictable: path-based syscalls are translated to
    path-based RPCs, which will re-lookup the file at that path, and FD-based
    syscalls are translated to FD-based RPCs, which use an existing open file
    without performing another lookup. (This is at least true on gofers that
    proxy the host local filesystem; other filesystems that lack support for
    e.g. certain operations on FDs may have different behavior, but this
    divergence is at least still predictable and inherent to the underlying
    filesystem implementation.)

Note that this approach is only feasible in gVisor's next-generation virtual
filesystem (VFS2), which does not assume the existence of inodes and allows the
remote filesystem client to translate whole path-based syscalls into RPCs. Thus
one of the unavoidable tradeoffs associated with such a protocol vs. 9P is the
inability to construct a Linux client that is performance-competitive with
gVisor.

### File Permissions

Many filesystem operations are side-effectual, such that file permissions must
be checked before such operations take effect. The simplest approach to file
permission checking is for the sentry to obtain permissions from the remote
filesystem, then apply permission checks in the sentry before performing the
application-requested operation. However, this requires an additional RPC per
application syscall (which can't be mitigated by caching on shared filesystems).
Alternatively, we may delegate file permission checking to gofers. In general,
file permission checks depend on the following properties of the accessor:

-   Filesystem UID/GID

-   Supplementary GIDs

-   Effective capabilities in the accessor's user namespace (i.e. the accessor's
    effective capability set)

-   All UIDs and GIDs mapped in the accessor's user namespace (which determine
    if the accessor's capabilities apply to accessed files)

We may choose to delay implementation of file permission checking delegation,
although this is potentially costly since it doubles the number of required RPCs
for most operations on shared filesystems. We may also consider compromise
options, such as only delegating file permission checks for accessors in the
root user namespace.
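
For illustration, a minimal sketch of the Unix mode-bit check that these
properties feed into (ignoring capabilities and user-namespace mappings, which
the options above would also have to account for):

```go
package main

import "fmt"

// Permission bits requested by an access check (same encoding as Unix rwx).
const (
	mayRead  = 4
	mayWrite = 2
	mayExec  = 1
)

// mayAccess applies the classic owner/group/other mode-bit check for an
// accessor with the given filesystem UID, GID, and supplementary GIDs.
func mayAccess(mode uint16, fileUID, fileGID, uid, gid uint32, groups []uint32, want uint16) bool {
	var bits uint16
	switch {
	case uid == fileUID:
		bits = (mode >> 6) & 7 // Owner class.
	case gid == fileGID || inGroups(groups, fileGID):
		bits = (mode >> 3) & 7 // Group class.
	default:
		bits = mode & 7 // Other class.
	}
	return bits&want == want
}

func inGroups(groups []uint32, gid uint32) bool {
	for _, g := range groups {
		if g == gid {
			return true
		}
	}
	return false
}

func main() {
	// A 0640 file owned by 1000:1000.
	fmt.Println(mayAccess(0o640, 1000, 1000, 1000, 1000, nil, mayWrite))          // owner write
	fmt.Println(mayAccess(0o640, 1000, 1000, 2000, 2000, []uint32{1000}, mayRead)) // supplementary group read
	fmt.Println(mayAccess(0o640, 1000, 1000, 2000, 2000, nil, mayRead))           // other class: denied
}
```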

### Symbolic Links

gVisor usually interprets symbolic link targets in its VFS rather than on the
filesystem containing the symbolic link; thus e.g. a symlink to
"/proc/self/maps" on a remote filesystem resolves to said file in the sentry's
procfs rather than the host's. This implies that:

-   Remote filesystem servers that proxy filesystems supporting symlinks must
    check if each path component is a symlink during path traversal.

-   Absolute symlinks require that the sentry restart the operation at its
    contextual VFS root (which is task-specific and may not be on a remote
    filesystem at all), so if a remote filesystem server encounters an absolute
    symlink during path traversal on behalf of a path-based operation, it must
    terminate path traversal and return the symlink target.

-   Relative symlinks begin target resolution in the parent directory of the
    symlink, so in theory most relative symlinks can be handled automatically
    during the path traversal that encounters the symlink, provided that said
    traversal is supplied with the number of remaining symlinks before `ELOOP`.
    However, the new path traversed by the symlink target may cross VFS mount
    boundaries, such that it's only safe for remote filesystem servers to
    speculatively follow relative symlinks for side-effect-free operations such
    as `stat` (where the sentry can simply ignore results that are inapplicable
    due to crossing mount boundaries). We may choose to delay implementation of
    this feature, at the cost of an additional RPC per relative symlink (note
    that even if the symlink target crosses a mount boundary, the sentry will
    need to `stat` the path to the mount boundary to confirm that each traversed
    component is an accessible directory); until it is implemented, relative
    symlinks may be handled like absolute symlinks, by terminating path
    traversal and returning the symlink target.

The possibility of symlinks (and the possibility of a compromised sentry) means
that the sentry may issue RPCs with paths that, in the absence of symlinks,
would traverse beyond the root of the remote filesystem. For example, the sentry
may issue an RPC with a path like "/foo/../..", on the premise that if "/foo" is
a symlink then the resulting path may be elsewhere on the remote filesystem. To
handle this, path traversal must also track its current depth below the remote
filesystem root, and terminate path traversal if it would ascend beyond this
point.
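
This depth check can be sketched as a pure function over path components (names
and error text are illustrative):

```go
package main

import (
	"errors"
	"fmt"
)

// checkDepth walks the components of a server-side path and fails if
// traversal would ever ascend above the remote filesystem root.
func checkDepth(components []string) error {
	depth := 0
	for _, c := range components {
		switch c {
		case "", ".":
			// No movement.
		case "..":
			depth--
			if depth < 0 {
				return errors.New("path would escape the filesystem root")
			}
		default:
			depth++
		}
	}
	return nil
}

func main() {
	fmt.Println(checkDepth([]string{"foo", ".."}))       // ok: ends at the root
	fmt.Println(checkDepth([]string{"foo", "..", ".."})) // error: ascends past it
}
```

Note that "/foo/../.." from the example above fails at the second "..", before
any filesystem operation is attempted with the out-of-tree path.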

### Path Traversal

Since path-based VFS operations will translate to path-based RPCs, filesystem
servers will need to handle path traversal. From the perspective of a given
filesystem implementation in the server, there are two basic approaches to path
traversal:

-   Inode-walk: For each path component, obtain a handle to the underlying
    filesystem object (e.g. with `open(O_PATH)`), check if that object is a
    symlink (as described above) and that it is accessible by the caller (e.g.
    with `fstat()`), then continue to the next path component (e.g. with
    `openat()`). This ensures that the checked filesystem object is the one
    used to obtain the next object in the traversal, which is intuitively
    appealing. However, while this approach works for host local filesystems,
    it requires features that are not widely supported by other filesystems.

-   Path-walk: For each path component, use a path-based operation to determine
    if the filesystem object currently referred to by that path component is a
    symlink / is accessible. This is highly portable, but suffers from quadratic
    behavior (at the level of the underlying filesystem implementation, the
    first path component will be traversed a number of times equal to the number
    of path components in the path).

The implementation should support either option by delegating path traversal to
filesystem implementations within the server (like VFS and the remote filesystem
protocol itself), as inode-walking is still safe, efficient, amenable to FD
caching, and implementable on non-shared host local filesystems (a sufficiently
common case as to be worth considering in the design).

Both approaches are susceptible to race conditions that may permit sandboxed
filesystem escapes:

-   Under inode-walk, a malicious application may cause a directory to be moved
    (with `rename`) during path traversal, such that the filesystem
    implementation incorrectly determines whether subsequent inodes are located
    in paths that should be visible to sandboxed applications.

-   Under path-walk, a malicious application may cause a non-symlink file to be
    replaced with a symlink during path traversal, such that following path
    operations will incorrectly follow the symlink.

Both race conditions can, to some extent, be mitigated in filesystem server
implementations by synchronizing path traversal with the hazardous operations in
question. However, shared filesystems are frequently used to share data between
sandboxed and unsandboxed applications in a controlled way, and in some cases a
malicious sandboxed application may be able to take advantage of a hazardous
filesystem operation performed by an unsandboxed application. In some cases,
filesystem features may be available to ensure safety even in such cases (e.g.
[the new openat2() syscall](https://man7.org/linux/man-pages/man2/openat2.2.html)),
but it is not clear how to solve this problem in general. (Note that this issue
is not specific to our design; rather, it is a fundamental limitation of
filesystem sandboxing.)

### Filesystem Multiplexing

A given sentry may need to access multiple distinct remote filesystems (e.g.
different volumes for a given container). In many cases, there is no advantage
to serving these filesystems from distinct filesystem servers, or accessing them
through distinct connections (factors such as maximum RPC concurrency should be
based on available host resources). Therefore, the protocol should support
multiplexing of distinct filesystem trees within a single session. 9P supports
this by allowing multiple calls to the `attach` RPC to produce fids representing
distinct filesystem trees, but this is somewhat clunky; we propose a much
simpler mechanism wherein each message that conveys a path also conveys a
numeric filesystem ID that identifies a filesystem tree.
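
A minimal sketch of this mechanism; the message and server shapes here are
illustrative, not actual protocol types:

```go
package main

import "fmt"

// PathMsg is the path-carrying part of a hypothetical RPC: alongside the
// path components, it names the filesystem tree the path is relative to.
type PathMsg struct {
	FSID       uint32   // Identifies one filesystem tree within the session.
	Components []string // Path components relative to that tree's root.
}

// Server resolves filesystem IDs to per-tree roots before traversal.
type Server struct {
	trees map[uint32]string // FSID -> host root directory (illustrative).
}

func (s *Server) rootFor(m PathMsg) (string, error) {
	root, ok := s.trees[m.FSID]
	if !ok {
		return "", fmt.Errorf("unknown filesystem ID %d", m.FSID)
	}
	return root, nil
}

func main() {
	s := &Server{trees: map[uint32]string{0: "/volumes/rootfs", 1: "/volumes/data"}}
	fmt.Println(s.rootFor(PathMsg{FSID: 1, Components: []string{"logs"}}))
}
```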

## Alternatives Considered

### Additional Extensions to 9P

There are at least three conceptual aspects to 9P:

-   Wire format: messages with a 4-byte little-endian size prefix, strings with
    a 2-byte little-endian size prefix, etc. Whether the wire format is worth
    retaining is unclear; in particular, it's unclear that the 9P wire format
    has a significant advantage over protobufs, which are substantially easier
    to extend. Note that the official Go protobuf implementation is widely known
    to suffer from a significant number of performance deficiencies, so if we
    choose to switch to protobuf, we may need to use an alternative toolchain
    such as `gogo/protobuf` (which is also widely used in the Go ecosystem, e.g.
    by Kubernetes).

-   Filesystem model: fids, qids, etc. Discarding this is one of the motivations
    for this proposal.

-   RPCs: Twalk, Tlopen, etc. In addition to the previously described
    inefficiencies, most of these are dependent on the filesystem model and
    therefore must be discarded.

### FUSE

The FUSE (Filesystem in Userspace) protocol is frequently used to provide
arbitrary userspace filesystem implementations to a host Linux kernel.
Unfortunately, FUSE is also inode-based, and therefore doesn't address any of
the problems we have with 9P.

### virtio-fs

virtio-fs is an ongoing project aimed at improving Linux VM filesystem
performance when accessing Linux host filesystems (vs. virtio-9p). In brief, it
is based on:

-   Using a FUSE client in the guest that communicates over virtio with a FUSE
    server in the host.

-   Using DAX to map the host page cache into the guest.

-   Using a file metadata table in shared memory to avoid VM exits for metadata
    updates.

None of these improvements seem applicable to gVisor:

-   As explained above, FUSE is still inode-based, so it is still susceptible to
    most of the problems we have with 9P.

-   Our use of host file descriptors already allows us to leverage the host page
    cache for file contents.

-   Our need for shared filesystem coherence is usually based on a user
    requirement that an out-of-sandbox filesystem mutation is guaranteed to be
    visible by all subsequent observations from within the sandbox, or vice
    versa; it's not clear that this can be guaranteed without a synchronous
    signaling mechanism like an RPC.