# Foreword

This document describes an ongoing project to support FUSE filesystems within
the sentry. This is intended to become the final documentation for this
subsystem, and is therefore written in the past tense. However, FUSE support is
currently incomplete and the document will be updated as things progress.

# FUSE: Filesystem in Userspace

The sentry supports dispatching filesystem operations to a FUSE server, allowing
FUSE filesystems to be used within a sandbox.

## Overview

FUSE has two main components:

1.  A client kernel driver (canonically `fuse.ko` in Linux), which forwards
    filesystem operations (usually initiated by syscalls) to the server.

2.  A server, which is a userspace daemon that implements the actual filesystem.

The sentry implements the client component, which allows a server daemon running
within the sandbox to implement a filesystem within the sandbox.

A FUSE filesystem is initialized with `mount(2)`, typically with the help of a
utility like `fusermount(1)`. Various mount options exist for establishing
ownership and access permissions on the filesystem, but the most important mount
option is a file descriptor used to establish communication between the client
and server.

The FUSE device FD is obtained by opening `/dev/fuse`. During regular operation,
the client and server use the FUSE protocol described in `fuse(4)` to service
filesystem operations. See the "FUSE Protocol" section in the appendix for more
information about this protocol. The core of the sentry support for FUSE is the
client-side implementation of this protocol.
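
As a rough illustration of the mount step, the sketch below opens `/dev/fuse`
and builds the option string that carries the device FD to `mount(2)`. The
option names (`fd`, `rootmode`, `user_id`, `group_id`) are the ones the Linux
FUSE driver parses; the helper `fuseMountOptions` is hypothetical and only
exists for this example.

```go
package main

import (
	"fmt"
	"os"
)

// fuseMountOptions builds the data argument of mount(2) for a FUSE
// filesystem. This helper is hypothetical; only the option names come from
// the Linux FUSE driver.
func fuseMountOptions(fd uintptr, rootMode uint32, uid, gid int) string {
	// rootmode is the file mode of the filesystem root, written in octal.
	return fmt.Sprintf("fd=%d,rootmode=%o,user_id=%d,group_id=%d", fd, rootMode, uid, gid)
}

func main() {
	// Opening /dev/fuse yields the FD that links the client and the server.
	dev, err := os.OpenFile("/dev/fuse", os.O_RDWR, 0)
	if err != nil {
		fmt.Println("open /dev/fuse:", err)
		return
	}
	defer dev.Close()

	// 0o40000 is S_IFDIR: the root of the filesystem is a directory.
	fmt.Println(fuseMountOptions(dev.Fd(), 0o40000, os.Getuid(), os.Getgid()))
}
```

The resulting string would be passed as the last argument of `mount(2)`, which
is roughly what `fusermount(1)` does on behalf of an unprivileged server.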

## FUSE in the Sentry

The sentry's FUSE client has the following components:

-   An implementation of `/dev/fuse`.

-   A filesystem for mapping syscalls to FUSE ops. One point of contention may
    be the lack of inodes in the VFS layer. We can tentatively implement a
    kernfs-based filesystem to bridge the gap in APIs. The kernfs base
    functionality can serve the role of the Linux inode cache, and the
    filesystem can map syscalls to kernfs inode operations; see the
    `kernfs.Inode` interface.

The FUSE protocol lends itself well to marshaling with `go_marshal`. The various
request and response packets can be defined in the ABI package and converted to
and from the wire format using `go_marshal`.
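
To make the marshaling idea concrete, here is a hand-written stand-in for the
kind of code `go_marshal` would generate, applied to the first four fields of
`struct fuse_init_in`. It uses `encoding/binary` rather than `go_marshal`
itself, the method names are only loosely modeled on generated code and should
be treated as assumptions, and little-endian byte order is assumed (FUSE uses
host byte order).

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// InitIn mirrors the first four fields of struct fuse_init_in.
type InitIn struct {
	Major        uint32
	Minor        uint32
	MaxReadahead uint32
	Flags        uint32
}

// SizeBytes returns the encoded size of InitIn.
func (*InitIn) SizeBytes() int { return 16 }

// MarshalBytes writes i to dst, which must hold at least SizeBytes() bytes.
func (i *InitIn) MarshalBytes(dst []byte) {
	binary.LittleEndian.PutUint32(dst[0:4], i.Major)
	binary.LittleEndian.PutUint32(dst[4:8], i.Minor)
	binary.LittleEndian.PutUint32(dst[8:12], i.MaxReadahead)
	binary.LittleEndian.PutUint32(dst[12:16], i.Flags)
}

// UnmarshalBytes reads i from src, which must hold at least SizeBytes() bytes.
func (i *InitIn) UnmarshalBytes(src []byte) {
	i.Major = binary.LittleEndian.Uint32(src[0:4])
	i.Minor = binary.LittleEndian.Uint32(src[4:8])
	i.MaxReadahead = binary.LittleEndian.Uint32(src[8:12])
	i.Flags = binary.LittleEndian.Uint32(src[12:16])
}

func main() {
	in := InitIn{Major: 7, Minor: 31, MaxReadahead: 128 << 10}
	buf := make([]byte, in.SizeBytes())
	in.MarshalBytes(buf)

	var out InitIn
	out.UnmarshalBytes(buf)
	fmt.Printf("round trip ok: %v, wire bytes: % x\n", out == in, buf)
}
```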

### Design Goals

-   While filesystem performance is always important, the sentry's FUSE support
    is primarily concerned with compatibility, with performance as a secondary
    concern.

-   Avoid deadlocks from a hung server daemon.

-   Consider the potential for denial of service from a malicious server daemon.
    Protecting itself from userspace is already a design goal for the sentry,
    but this needs additional consideration for FUSE. Normally, an operating
    system doesn't rely on userspace to make progress with filesystem
    operations. Since this changes with FUSE, it opens up the possibility of
    creating a chain of dependencies controlled by userspace, which could affect
    an entire sandbox. For example: a FUSE op can block a syscall, which could
    be holding a subsystem lock, which can then block another task goroutine.

### Milestones

Below are some broad goals to aim for while implementing FUSE in the sentry.
Many FUSE ops can be grouped into broad categories of functionality, and most
ops can be implemented in parallel.

#### Minimal client that can mount a trivial FUSE filesystem

-   Implement `/dev/fuse` - a character device used to establish an FD for
    communication between the sentry and the server daemon.

-   Implement basic FUSE ops like `FUSE_INIT`.

#### Read-only mount with basic file operations

-   Implement the majority of file, directory and file descriptor FUSE ops. For
    this milestone, we can skip uncommon or complex operations like mmap, mknod,
    file locking, poll, and extended attributes. We can stub these out along
    with any ops that modify the filesystem. The exact list of required ops is
    to be determined, but the goal is to mount a real filesystem as read-only,
    and be able to read contents from the filesystem in the sentry.

#### Full read-write support

-   Implement the remaining FUSE ops and decide if we can omit rarely used
    operations like ioctl.

### Design Details

#### Lifecycle for a FUSE Request

-   User invokes a syscall
-   Sentry prepares the corresponding request
    -   If the FUSE device is available
        -   Write the request in binary
    -   If the FUSE device is full
        -   Kernel task blocks until it is available
-   Sentry notifies the readers of the FUSE device that it's ready to be read
-   FUSE daemon reads the request and processes it
-   Sentry waits until a reply is written to the FUSE device
    -   but returns directly for async requests
-   FUSE daemon writes to the FUSE device
-   Sentry processes the reply
    -   For sync requests, unblock the blocked kernel task
    -   For async requests, execute the pre-specified callback, if any
-   Sentry returns the syscall to the user
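
The lifecycle above can be sketched with plain Go channels. This is a toy model
under stated assumptions, not the sentry's code: `device`, `request`, `call`,
`callAsync`, and `serve` are all hypothetical names, a bounded channel stands in
for the FUSE device, and a goroutine plays the role of the FUSE daemon.

```go
package main

import "fmt"

// request is a hypothetical in-flight FUSE request.
type request struct {
	unique uint64
	op     string
	reply  chan string // receives the daemon's response
}

// device stands in for the FUSE device: a bounded queue of pending requests.
type device struct {
	queue chan *request
}

// call models a sync request: it blocks while the device is full, then blocks
// again until the daemon replies.
func (d *device) call(unique uint64, op string) string {
	r := &request{unique: unique, op: op, reply: make(chan string, 1)}
	d.queue <- r
	return <-r.reply
}

// callAsync models an async request: it returns once the request is queued
// and runs cb when the reply arrives.
func (d *device) callAsync(unique uint64, op string, cb func(string)) {
	r := &request{unique: unique, op: op, reply: make(chan string, 1)}
	d.queue <- r
	go func() { cb(<-r.reply) }()
}

// serve plays the FUSE daemon: read a request, process it, write a reply.
func (d *device) serve() {
	for r := range d.queue {
		r.reply <- "done:" + r.op
	}
}

func main() {
	d := &device{queue: make(chan *request, 2)}
	go d.serve()

	fmt.Println(d.call(1, "FUSE_GETATTR"))

	done := make(chan string)
	d.callAsync(2, "FUSE_RELEASE", func(s string) { done <- s })
	fmt.Println(<-done)
}
```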

#### Channels and Queues for Requests in Different Stages

`connection.initializedChan`

-   a channel that requests issued before connection initialization block on.

`fd.queue`

-   a queue of requests that haven’t been read by the FUSE daemon yet.

`fd.completions`

-   a map of the requests that have been prepared but have not yet received a
    response, including the ones on the `fd.queue`.

`fd.waitQueue`

-   a queue of waiters that are waiting for the FUSE device FD to be available,
    such as the FUSE daemon.

`fd.fullQueueCh`

-   a channel that the kernel task will be blocked on when the FD is not
    available.
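
One way to picture how these pieces fit together is the simplified model below.
The field names follow the list above, but the types, the methods, and the use
of `fullQueueCh` as a counting semaphore are assumptions made for illustration;
`initializedChan` and `waitQueue` are left out to keep the sketch short.

```go
package main

import (
	"fmt"
	"sync"
)

// request is a hypothetical in-flight request.
type request struct {
	unique uint64
	done   chan struct{} // closed when a reply arrives
}

// deviceFD is a simplified model of the FUSE device FD state.
type deviceFD struct {
	mu          sync.Mutex
	queue       []*request          // not yet read by the daemon
	completions map[uint64]*request // no response yet, includes queue
	fullQueueCh chan struct{}       // one token per in-flight request
}

func newDeviceFD(maxActive int) *deviceFD {
	return &deviceFD{
		completions: make(map[uint64]*request),
		fullQueueCh: make(chan struct{}, maxActive),
	}
}

// enqueue blocks while maxActive requests are already in flight.
func (fd *deviceFD) enqueue(unique uint64) *request {
	fd.fullQueueCh <- struct{}{}
	r := &request{unique: unique, done: make(chan struct{})}
	fd.mu.Lock()
	fd.queue = append(fd.queue, r)
	fd.completions[unique] = r
	fd.mu.Unlock()
	return r
}

// read models the daemon reading a request: the request leaves the queue but
// stays in completions until it is answered.
func (fd *deviceFD) read() (*request, bool) {
	fd.mu.Lock()
	defer fd.mu.Unlock()
	if len(fd.queue) == 0 {
		return nil, false
	}
	r := fd.queue[0]
	fd.queue = fd.queue[1:]
	return r, true
}

// write models the daemon replying: it completes the request and frees a slot.
func (fd *deviceFD) write(unique uint64) bool {
	fd.mu.Lock()
	r, ok := fd.completions[unique]
	delete(fd.completions, unique)
	fd.mu.Unlock()
	if !ok {
		return false
	}
	close(r.done)
	<-fd.fullQueueCh
	return true
}

func main() {
	fd := newDeviceFD(2)
	r := fd.enqueue(1)
	fd.enqueue(2)
	fd.read()
	fd.write(1)
	<-r.done
	fmt.Println("queued:", len(fd.queue), "awaiting reply:", len(fd.completions))
}
```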

#### Basic I/O Implementation

Currently we have implemented basic read and write functionality for FUSE. We
describe the design and ways to improve it here:

##### Basic FUSE Read

The VFS expects implementations of `vfs.FileDescriptionImpl.Read()` and
`vfs.FileDescriptionImpl.PRead()`. When a syscall is made, it eventually
reaches our implementation of those interface functions, located at
`pkg/sentry/fsimpl/fuse/regular_file.go` for regular files.

After validating the input, the sentry sends `FUSE_READ` requests to the FUSE
daemon. The FUSE daemon returns the data after the `fuse_out_header` in its
responses. For the first version, we create a copy of that data in kernel
memory, represented as a byte slice in the marshalled struct. Currently this
happens for all FUSE responses, in
`pkg/sentry/fsimpl/fuse/dev.go:writeLocked()`. We then copy directly from this
intermediate buffer to the input buffer provided by the read syscall.

There is an extra requirement for FUSE: when mounting the FUSE fs, the mounter
or the FUSE daemon can specify a `max_read` or a `max_pages` parameter. These
bound the number of bytes read in each `FUSE_READ` request. We implemented code
to handle the resulting fragmented reads.
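
A minimal sketch of such a fragmented read loop is shown below. It is not the
sentry's implementation: `readChunk`, `readFragmented`, and `fileReader` are
hypothetical names, and `readChunk` stands in for sending one `FUSE_READ`
request and receiving its payload.

```go
package main

import "fmt"

// readChunk stands in for one FUSE_READ round trip: it returns up to size
// bytes starting at off, and fewer at end of file.
type readChunk func(off uint64, size uint32) []byte

// readFragmented splits one read of size bytes into requests of at most
// maxRead bytes each, stopping early on a short read.
func readFragmented(off uint64, size, maxRead uint32, read readChunk) []byte {
	if maxRead == 0 {
		return nil
	}
	var out []byte
	for size > 0 {
		n := size
		if n > maxRead {
			n = maxRead
		}
		data := read(off, n)
		out = append(out, data...)
		if uint32(len(data)) < n {
			break // short read: end of file reached
		}
		off += uint64(n)
		size -= n
	}
	return out
}

// fileReader serves reads from an in-memory file, playing the FUSE daemon.
func fileReader(file []byte) readChunk {
	return func(off uint64, size uint32) []byte {
		if off >= uint64(len(file)) {
			return nil
		}
		end := off + uint64(size)
		if end > uint64(len(file)) {
			end = uint64(len(file))
		}
		return file[off:end]
	}
}

func main() {
	read := fileReader([]byte("hello, fuse world"))
	fmt.Printf("%q\n", readFragmented(0, 64, 4, read))
}
```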

To improve performance, we should ideally copy the data from the FUSE daemon's
responses into a buffer cache instead of a single-use temporary buffer, as
several other existing sentry filesystem implementations do. Directly mapping
the memory of one process into another could also boost performance, but we
chose not to do so in order to keep the processes isolated.

##### Basic FUSE Write

The VFS invokes implementations of `vfs.FileDescriptionImpl.Write()` and
`vfs.FileDescriptionImpl.PWrite()` on the regular file descriptor of FUSE when a
user makes a `write(2)` or `pwrite(2)` syscall.

For valid writes, the sentry sends the bytes to write after a `FUSE_WRITE`
header (which can be regarded as a request with 2 payloads) to the FUSE daemon.
For the first version, we allocate a buffer inside kernel memory to store the
bytes from the user, and copy directly from that buffer to the memory of the
FUSE daemon. This happens at `pkg/sentry/fsimpl/fuse/dev.go:readLocked()`.

The parameters `max_write` and `max_pages` restrict the number of bytes in one
`FUSE_WRITE`. The current implementation handles the resulting fragmented
writes.
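
The sketch below shows one way a write could be split under these limits. It
assumes that the per-request limit is the smaller of `max_write` and
`max_pages` pages, and that pages are 4096 bytes; both are assumptions for
illustration, and `writeReq`, `maxPayload`, and `splitWrite` are hypothetical
names.

```go
package main

import "fmt"

const pageSize = 4096 // assumed page size

// writeReq is a hypothetical description of one FUSE_WRITE request.
type writeReq struct {
	Offset uint64
	Data   []byte
}

// maxPayload returns the assumed per-request byte limit: the smaller of
// max_write and max_pages worth of pages.
func maxPayload(maxWrite, maxPages uint32) uint64 {
	limit := uint64(maxPages) * pageSize
	if uint64(maxWrite) < limit {
		limit = uint64(maxWrite)
	}
	return limit
}

// splitWrite cuts data into consecutive requests of at most limit bytes.
func splitWrite(off uint64, data []byte, limit uint64) []writeReq {
	if limit == 0 {
		return nil
	}
	var reqs []writeReq
	for len(data) > 0 {
		n := uint64(len(data))
		if n > limit {
			n = limit
		}
		reqs = append(reqs, writeReq{Offset: off, Data: data[:n]})
		off += n
		data = data[n:]
	}
	return reqs
}

func main() {
	limit := maxPayload(4096, 32)
	for _, r := range splitWrite(100, make([]byte, 10000), limit) {
		fmt.Printf("FUSE_WRITE offset=%d size=%d\n", r.Offset, len(r.Data))
	}
}
```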

To improve performance, the extra copy created to store the bytes to write can
be replaced by the buffer cache as well.

# Appendix

## FUSE Protocol

The FUSE protocol is a request-response protocol. All requests are initiated by
the client. The wire format for the protocol is raw C structs serialized to
memory.

All FUSE requests begin with the following request header:

```c
struct fuse_in_header {
  uint32_t len;       // Length of the request, including this header.
  uint32_t opcode;    // Requested operation.
  uint64_t unique;    // A unique identifier for this request.
  uint64_t nodeid;    // ID of the filesystem object being operated on.
  uint32_t uid;       // UID of the requesting process.
  uint32_t gid;       // GID of the requesting process.
  uint32_t pid;       // PID of the requesting process.
  uint32_t padding;
};
```

The request is then followed by a payload specific to the `opcode`.

All responses begin with this response header:

```c
struct fuse_out_header {
  uint32_t len;       // Length of the response, including this header.
  int32_t  error;     // Status of the request, 0 if success.
  uint64_t unique;    // The unique identifier from the corresponding request.
};
```

The response payload also depends on the request `opcode`. If `error != 0`, the
response payload must be empty.

### Operations

The following is a list of all FUSE operations used in `fuse_in_header.opcode`
as of Linux v4.4, and a brief description of their purpose. These are defined in
`uapi/linux/fuse.h`. Many of these have a corresponding request and response
payload struct; `fuse(4)` has details for some of these. We also note how these
operations map to the sentry virtual filesystem.

#### FUSE meta-operations

These operations are specific to FUSE and don't have a corresponding action in a
generic filesystem.

-   `FUSE_INIT`: This operation initializes a new FUSE filesystem, and is the
    first message sent by the client after mount. This is used for version and
    feature negotiation. This is related to `mount(2)`.
-   `FUSE_DESTROY`: Tears down a FUSE filesystem, related to `umount(2)`.
-   `FUSE_INTERRUPT`: Interrupts an in-flight operation, specified by the
    `fuse_in_header.unique` value provided in the corresponding request header.
    The client can send at most one of these per request, and will enter an
    uninterruptible wait for a reply. The server is expected to reply promptly.
-   `FUSE_FORGET`: A hint to the server that it should evict the indicated
    node from any caches. This is wired up to `(struct
    super_operations).evict_inode` in Linux, which is in turn hooked into the
    inode cache shrinker, which is typically triggered by system memory
    pressure.
-   `FUSE_BATCH_FORGET`: Batch version of `FUSE_FORGET`.

#### Filesystem Syscalls

These FUSE ops map directly to an equivalent filesystem syscall, or family of
syscalls. The relevant syscalls have a similar name to the operation, unless
otherwise noted.

Node creation:

-   `FUSE_MKNOD`
-   `FUSE_MKDIR`
-   `FUSE_CREATE`: This is equivalent to `open(2)` and `creat(2)`, which
    atomically create and open a node.

Node attributes and extended attributes:

-   `FUSE_GETATTR`
-   `FUSE_SETATTR`
-   `FUSE_SETXATTR`
-   `FUSE_GETXATTR`
-   `FUSE_LISTXATTR`
-   `FUSE_REMOVEXATTR`

Node link manipulation:

-   `FUSE_READLINK`
-   `FUSE_LINK`
-   `FUSE_SYMLINK`
-   `FUSE_UNLINK`

Directory operations:

-   `FUSE_RMDIR`
-   `FUSE_RENAME`
-   `FUSE_RENAME2`
-   `FUSE_OPENDIR`: `open(2)` for directories.
-   `FUSE_RELEASEDIR`: `close(2)` for directories.
-   `FUSE_READDIR`
-   `FUSE_READDIRPLUS`
-   `FUSE_FSYNCDIR`: `fsync(2)` for directories.
-   `FUSE_LOOKUP`: Establishes a unique identifier for a FS node. This is
    reminiscent of `VirtualFilesystem.GetDentryAt` in that it resolves a path
    component to a node. However, the returned identifier is opaque to the
    client. The server must remember this mapping, as this is how the client
    will reference the node in the future.
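
To illustrate the bookkeeping that `FUSE_LOOKUP` places on the server, here is
a toy node table. It is a sketch of one possible server-side design, not part
of the sentry: `nodeTable`, `lookup`, and `forget` are hypothetical names, the
table does not check that a name actually exists, and the lookup-count handling
follows the usual convention that `FUSE_FORGET` carries the number of lookups
to drop. The root node ID of 1 matches `FUSE_ROOT_ID`.

```go
package main

import (
	"fmt"
	"strings"
)

// node is what the server remembers for each ID it has handed out.
type node struct {
	path    string
	nlookup uint64 // lookups the client has not yet forgotten
}

// nodeTable maps opaque node IDs to server-side state.
type nodeTable struct {
	next   uint64
	byID   map[uint64]*node
	byPath map[string]uint64
}

func newNodeTable() *nodeTable {
	return &nodeTable{
		next:   2, // ID 1 is reserved for the root
		byID:   map[uint64]*node{1: {path: "/"}},
		byPath: map[string]uint64{"/": 1},
	}
}

// lookup resolves name under parent and returns a stable node ID, allocating
// one on first use. It fails if the parent ID is unknown.
func (t *nodeTable) lookup(parent uint64, name string) (uint64, bool) {
	p, ok := t.byID[parent]
	if !ok {
		return 0, false
	}
	path := strings.TrimSuffix(p.path, "/") + "/" + name
	id, ok := t.byPath[path]
	if !ok {
		id = t.next
		t.next++
		t.byID[id] = &node{path: path}
		t.byPath[path] = id
	}
	t.byID[id].nlookup++
	return id, true
}

// forget drops nlookup references and evicts the node once none remain.
func (t *nodeTable) forget(id, nlookup uint64) {
	n, ok := t.byID[id]
	if !ok || id == 1 {
		return
	}
	if nlookup >= n.nlookup {
		delete(t.byPath, n.path)
		delete(t.byID, id)
		return
	}
	n.nlookup -= nlookup
}

func main() {
	tbl := newNodeTable()
	dir, _ := tbl.lookup(1, "etc")
	file, _ := tbl.lookup(dir, "hosts")
	fmt.Println(dir, file, tbl.byID[file].path)
}
```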

File operations:

-   `FUSE_OPEN`: `open(2)` for files.
-   `FUSE_RELEASE`: `close(2)` for files.
-   `FUSE_FSYNC`
-   `FUSE_FALLOCATE`
-   `FUSE_COPY_FILE_RANGE`
-   `FUSE_SETUPMAPPING`: Creates a memory map on a file for `mmap(2)`.
-   `FUSE_REMOVEMAPPING`: Removes a memory map for `munmap(2)`.

File locking:

-   `FUSE_GETLK`
-   `FUSE_SETLK`
-   `FUSE_SETLKW`

File descriptor operations:

-   `FUSE_IOCTL`
-   `FUSE_POLL`
-   `FUSE_LSEEK`

Filesystem operations:

-   `FUSE_STATFS`

#### Permissions

-   `FUSE_ACCESS` is used to check if a node is accessible, as part of many
    syscall implementations. Maps to `vfs.FilesystemImpl.AccessAt` in the
    sentry.

#### I/O Operations

These ops are used to read and write file pages. They're used to implement I/O
syscalls like `read(2)` and `write(2)`, as well as `mmap(2)`.

-   `FUSE_READ`
-   `FUSE_WRITE`

#### Miscellaneous

-   `FUSE_FLUSH`: Used by the client to indicate when a file descriptor is
    closed. Distinct from `FUSE_FSYNC`, which corresponds to an `fsync(2)`
    syscall from the user. Maps to `vfs.FileDescriptionImpl.Release` in the
    sentry.
-   `FUSE_BMAP`: Old address space API for block defrag. Probably not needed.
-   `FUSE_NOTIFY_REPLY`: [TODO: what does this do?]

# References

-   [fuse(4) Linux manual page](https://www.man7.org/linux/man-pages/man4/fuse.4.html)
-   [Linux kernel FUSE documentation](https://www.kernel.org/doc/html/latest/filesystems/fuse.html)
-   [The reference implementation of the Linux FUSE (Filesystem in Userspace)
    interface](https://github.com/libfuse/libfuse)
-   [The kernel interface of FUSE](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/fuse.h)