gvisor.dev/gvisor@v0.0.0-20240520182842-f9d4d51c7e0f/pkg/sentry/vfs/g3doc/fuse.md

# Foreword

This document describes an ongoing project to support FUSE filesystems within
the sentry. It is intended to become the final documentation for this
subsystem, and is therefore written as though the work were complete. However,
FUSE support is currently incomplete, and the document will be updated as
things progress.

# FUSE: Filesystem in Userspace

The sentry supports dispatching filesystem operations to a FUSE server, allowing
FUSE filesystems to be used within a sandbox.

## Overview

FUSE has two main components:

1. A client kernel driver (canonically `fuse.ko` in Linux), which forwards
   filesystem operations (usually initiated by syscalls) to the server.

2. A server, which is a userspace daemon that implements the actual filesystem.

The sentry implements the client component, which allows a server daemon running
within the sandbox to implement a filesystem within the sandbox.

A FUSE filesystem is initialized with `mount(2)`, typically with the help of a
utility like `fusermount(1)`. Various mount options exist for establishing
ownership and access permissions on the filesystem, but the most important mount
option is a file descriptor used to establish communication between the client
and server.

The FUSE device FD is obtained by opening `/dev/fuse`. During regular operation,
the client and server use the FUSE protocol described in `fuse(4)` to service
filesystem operations. See the "FUSE Protocol" section in the appendix for more
information about this protocol. The core of the sentry's support for FUSE is
the client-side implementation of this protocol.

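To make the role of the file descriptor mount option concrete, the sketch below builds the option string that is passed as the data argument of `mount(2)`. The option names `fd`, `rootmode`, `user_id` and `group_id` are the ones Linux uses for FUSE mounts; the helper name `fuseMountData` and the sample values are illustrative assumptions, not sentry code.

```go
package main

import (
	"fmt"
	"strings"
)

// fuseMountData is a hypothetical helper that builds the comma-separated
// option string for a FUSE mount. fd is the descriptor obtained by opening
// /dev/fuse; rootMode is the file type of the mount root, written in octal.
func fuseMountData(fd int, rootMode, uid, gid uint32, extra ...string) string {
	opts := []string{
		fmt.Sprintf("fd=%d", fd),
		fmt.Sprintf("rootmode=%o", rootMode),
		fmt.Sprintf("user_id=%d", uid),
		fmt.Sprintf("group_id=%d", gid),
	}
	opts = append(opts, extra...)
	return strings.Join(opts, ",")
}

func main() {
	// 0o40000 is S_IFDIR: the root of the mount is a directory.
	fmt.Println(fuseMountData(3, 0o40000, 1000, 1000, "allow_other"))
}
```

The resulting string is what a mounter such as `fusermount(1)` would hand to `mount(2)` together with the filesystem type; the actual mount call is omitted here because it requires privileges.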
## FUSE in the Sentry

The sentry's FUSE client has the following components:

- An implementation of `/dev/fuse`.

- A filesystem for mapping syscalls to FUSE ops. One point of contention may
  be the lack of inodes in the VFS layer. We can tentatively implement a
  kernfs-based filesystem to bridge the gap in APIs. The kernfs base
  functionality can serve the role of the Linux inode cache, and the
  filesystem can map syscalls to kernfs inode operations; see the
  `kernfs.Inode` interface.

The FUSE protocol lends itself well to marshaling with `go_marshal`. The various
request and response packets can be defined in the ABI package and converted to
and from the wire format using `go_marshal`.

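As a rough illustration of what converting a packet to and from the wire format involves, the sketch below hand-encodes the request header with the standard `encoding/binary` package. This is a stand-in for `go_marshal`, not its generated code; the type and function names are hypothetical, and little-endian byte order is an assumption that holds for hosts such as x86-64.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// fuseInHeader mirrors struct fuse_in_header as shown in the appendix.
type fuseInHeader struct {
	Len     uint32
	Opcode  uint32
	Unique  uint64
	NodeID  uint64
	UID     uint32
	GID     uint32
	PID     uint32
	Padding uint32
}

const fuseInHeaderSize = 40 // sum of the field sizes above

// marshal encodes the header field by field in little-endian order.
func (h fuseInHeader) marshal() []byte {
	b := make([]byte, fuseInHeaderSize)
	le := binary.LittleEndian
	le.PutUint32(b[0:4], h.Len)
	le.PutUint32(b[4:8], h.Opcode)
	le.PutUint64(b[8:16], h.Unique)
	le.PutUint64(b[16:24], h.NodeID)
	le.PutUint32(b[24:28], h.UID)
	le.PutUint32(b[28:32], h.GID)
	le.PutUint32(b[32:36], h.PID)
	le.PutUint32(b[36:40], h.Padding)
	return b
}

// unmarshalInHeader decodes a header from the start of b.
func unmarshalInHeader(b []byte) (fuseInHeader, error) {
	if len(b) < fuseInHeaderSize {
		return fuseInHeader{}, fmt.Errorf("short header: %d bytes", len(b))
	}
	le := binary.LittleEndian
	return fuseInHeader{
		Len:     le.Uint32(b[0:4]),
		Opcode:  le.Uint32(b[4:8]),
		Unique:  le.Uint64(b[8:16]),
		NodeID:  le.Uint64(b[16:24]),
		UID:     le.Uint32(b[24:28]),
		GID:     le.Uint32(b[28:32]),
		PID:     le.Uint32(b[32:36]),
		Padding: le.Uint32(b[36:40]),
	}, nil
}

func main() {
	// Opcode 15 is FUSE_READ; Len also covers a payload that is not shown.
	h := fuseInHeader{Len: fuseInHeaderSize + 16, Opcode: 15, Unique: 2, NodeID: 1, UID: 1000, GID: 1000, PID: 42}
	decoded, err := unmarshalInHeader(h.marshal())
	fmt.Println(decoded == h, err)
}
```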
### Design Goals

- While filesystem performance is always important, the sentry's FUSE support
  is primarily concerned with compatibility, with performance as a secondary
  concern.

- Avoid deadlocks from a hung server daemon.

- Consider the potential for denial of service from a malicious server daemon.
  Protecting itself from userspace is already a design goal for the sentry,
  but needs additional consideration for FUSE. Normally, an operating system
  doesn't rely on userspace to make progress with filesystem operations. Since
  this changes with FUSE, it opens up the possibility of creating a chain of
  dependencies controlled by userspace, which could affect an entire sandbox.
  For example: a FUSE op can block a syscall, which could be holding a
  subsystem lock, which can then block another task goroutine.

### Milestones

Below are some broad goals to aim for while implementing FUSE in the sentry.
Many FUSE ops can be grouped into broad categories of functionality, and most
ops can be implemented in parallel.

#### Minimal client that can mount a trivial FUSE filesystem

- Implement `/dev/fuse`, a character device used to establish an FD for
  communication between the sentry and the server daemon.

- Implement basic FUSE ops like `FUSE_INIT`.

#### Read-only mount with basic file operations

- Implement the majority of file, directory and file descriptor FUSE ops. For
  this milestone, we can skip uncommon or complex operations like mmap, mknod,
  file locking, poll, and extended attributes. We can stub these out along
  with any ops that modify the filesystem. The exact list of required ops is
  to be determined, but the goal is to mount a real filesystem as read-only,
  and be able to read contents from the filesystem in the sentry.

#### Full read-write support

- Implement the remaining FUSE ops and decide if we can omit rarely used
  operations like ioctl.

### Design Details

#### Lifecycle for a FUSE Request

- User invokes a syscall
- Sentry prepares the corresponding request
  - If the FUSE device is available
    - Write the request in binary form
  - If the FUSE device is full
    - Kernel task blocks until it becomes available
- Sentry notifies the readers of the FUSE device that it's ready to be read
- FUSE daemon reads the request and processes it
- Sentry waits until a reply is written to the FUSE device
  - but returns directly for async requests
- FUSE daemon writes the reply to the FUSE device
- Sentry processes the reply
  - For sync requests, unblock the blocked kernel task
  - For async requests, execute the pre-specified callback, if any
- Sentry returns from the syscall to the user

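The synchronous path through the steps above can be sketched with plain Go channels. This is a simplified model under stated assumptions, not the sentry's implementation: the `device`, `request` and `response` types are hypothetical, the daemon is a goroutine in the same process, and async requests and their callbacks are left out.

```go
package main

import (
	"fmt"
	"sync"
)

// request and response are simplified stand-ins for the wire packets.
type request struct {
	unique uint64
	opcode uint32
	data   string
}

type response struct {
	unique uint64
	data   string
}

// device models the FUSE device shared by the client and the daemon.
type device struct {
	mu          sync.Mutex
	nextUnique  uint64
	queue       chan request             // requests not yet read by the daemon
	completions map[uint64]chan response // requests still awaiting a reply
}

func newDevice(depth int) *device {
	return &device{
		queue:       make(chan request, depth),
		completions: make(map[uint64]chan response),
	}
}

// call issues a synchronous request: it registers a completion, queues the
// request (blocking while the queue is full) and waits for the reply.
func (d *device) call(opcode uint32, data string) response {
	d.mu.Lock()
	d.nextUnique++
	id := d.nextUnique
	done := make(chan response, 1)
	d.completions[id] = done
	d.mu.Unlock()

	d.queue <- request{unique: id, opcode: opcode, data: data}
	return <-done
}

// reply is invoked when the daemon writes a response. It matches the response
// to its request by unique ID and unblocks the waiting caller.
func (d *device) reply(r response) bool {
	d.mu.Lock()
	done, ok := d.completions[r.unique]
	delete(d.completions, r.unique)
	d.mu.Unlock()
	if !ok {
		return false // no request is waiting for this ID
	}
	done <- r
	return true
}

// serve plays the role of the FUSE daemon: read a request, process it, reply.
func serve(d *device) {
	for req := range d.queue {
		d.reply(response{unique: req.unique, data: fmt.Sprintf("op %d: %s", req.opcode, req.data)})
	}
}

func main() {
	d := newDevice(1)
	go serve(d)
	fmt.Println(d.call(15, "hello").data) // opcode 15 is FUSE_READ
}
```

Keying completions by the unique ID is what lets replies arrive in any order; an async request would register a callback in place of the channel.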
#### Channels and Queues for Requests in Different Stages

`connection.initializedChan`

- a channel that requests issued before connection initialization block on.

`fd.queue`

- a queue of requests that haven’t been read by the FUSE daemon yet.

`fd.completions`

- a map of the requests that have been prepared but have not yet received a
  response, including the ones on the `fd.queue`.

`fd.waitQueue`

- a queue of waiters, such as the FUSE daemon, that are waiting for the FUSE
  device FD to become available.

`fd.fullQueueCh`

- a channel that the kernel task blocks on when the FD is not available.

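The gating role of `connection.initializedChan` can be illustrated with a channel that is closed once initialization completes. This is a minimal sketch that assumes the gate is opened by closing the channel after the `FUSE_INIT` reply is processed; the reduced `connection` type and its methods are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// connection is a reduced, hypothetical stand-in for the sentry's FUSE
// connection; only the initialization gate is modeled.
type connection struct {
	initializedChan chan struct{}
	initOnce        sync.Once
}

func newConnection() *connection {
	return &connection{initializedChan: make(chan struct{})}
}

// setInitialized is called once the FUSE_INIT reply has been processed.
// Closing the channel releases every blocked request at once.
func (c *connection) setInitialized() {
	c.initOnce.Do(func() { close(c.initializedChan) })
}

// initialized reports, without blocking, whether the gate is open.
func (c *connection) initialized() bool {
	select {
	case <-c.initializedChan:
		return true
	default:
		return false
	}
}

// send blocks until the connection is initialized, then "issues" the request.
func (c *connection) send(op string) string {
	<-c.initializedChan
	return "sent " + op
}

func main() {
	c := newConnection()
	results := make(chan string, 2)
	for _, op := range []string{"FUSE_GETATTR", "FUSE_LOOKUP"} {
		go func(op string) { results <- c.send(op) }(op)
	}
	fmt.Println("initialized:", c.initialized())
	c.setInitialized()
	<-results
	<-results
	fmt.Println("initialized:", c.initialized())
}
```

A closed channel never blocks receivers, so requests issued after initialization pass through `send` without waiting.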
#### Basic I/O Implementation

We have implemented basic read and write functionality for FUSE. This section
describes the design and ways to improve it.

##### Basic FUSE Read

The VFS expects implementations of `vfs.FileDescriptionImpl.Read()` and
`vfs.FileDescriptionImpl.PRead()`. When a syscall is made, it eventually
reaches our implementation of those interface functions, located at
`pkg/sentry/fsimpl/fuse/regular_file.go` for regular files.

After validating the input, the sentry sends `FUSE_READ` requests to the FUSE
daemon. The FUSE daemon returns the data after the `fuse_out_header` in its
response. For the first version, we create a copy of that data in kernel
memory, represented as a byte slice in the marshalled struct. This currently
happens for all FUSE responses, in
`pkg/sentry/fsimpl/fuse/dev.go:writeLocked()`. We then copy directly from this
intermediate buffer to the input buffer provided by the read syscall.

There is an extra requirement for FUSE: when mounting the FUSE filesystem, the
mounter or the FUSE daemon can specify a `max_read` or a `max_pages` parameter.
These set the upper bound on the number of bytes to read in each `FUSE_READ`
request. We implemented the code to handle the resulting fragmented reads.

To improve performance, we should ideally copy the data from the FUSE daemon's
responses into a buffer cache, as several other filesystem implementations in
the sentry do, instead of into a single-use temporary buffer. Directly mapping
the memory of one process into another could also boost performance, but we
chose not to do so in order to keep the processes isolated.

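The fragmentation imposed by `max_read` can be shown with a small helper that splits one large read into requests of bounded size. This is a sketch of the idea only; `splitRead` and `readChunk` are hypothetical names and do not come from the sentry source.

```go
package main

import "fmt"

// readChunk describes one FUSE_READ request: read size bytes at offset.
type readChunk struct {
	offset uint64
	size   uint32
}

// splitRead divides a read of n bytes starting at off into requests of at
// most maxRead bytes each. It returns nil when there is nothing to read or
// when maxRead is zero.
func splitRead(off uint64, n, maxRead uint32) []readChunk {
	if n == 0 || maxRead == 0 {
		return nil
	}
	var chunks []readChunk
	for n > 0 {
		size := n
		if size > maxRead {
			size = maxRead
		}
		chunks = append(chunks, readChunk{offset: off, size: size})
		off += uint64(size)
		n -= size
	}
	return chunks
}

func main() {
	for _, c := range splitRead(4096, 10000, 4096) {
		fmt.Printf("FUSE_READ offset=%d size=%d\n", c.offset, c.size)
	}
}
```

A real read loop would issue these requests one at a time and would likely stop early when a reply carries fewer bytes than requested, since that usually signals end of file.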
##### Basic FUSE Write

The VFS invokes implementations of `vfs.FileDescriptionImpl.Write()` and
`vfs.FileDescriptionImpl.PWrite()` on the regular file descriptor of FUSE when a
user makes a `write(2)` or `pwrite(2)` syscall.

For valid writes, the sentry sends the bytes to write after a `FUSE_WRITE`
header (this can be regarded as a request with two payloads) to the FUSE
daemon. For the first version, we allocate a buffer inside kernel memory to
store the bytes from the user, and copy directly from that buffer to the memory
of the FUSE daemon. This happens at
`pkg/sentry/fsimpl/fuse/dev.go:readLocked()`.

The parameters `max_write` and `max_pages` restrict the number of bytes in one
`FUSE_WRITE`. The current implementation handles fragmented writes.

To improve performance, the extra copy created to store the bytes to write can
be replaced by the buffer cache as well.

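A fragmented write can be sketched as a loop that sends at most `max_write` bytes per request and advances by the number of bytes the daemon reports as written. The names `writeFragmented` and `sendFn` are hypothetical, and stopping on a short write is an assumption about reasonable behavior rather than a description of the sentry's code.

```go
package main

import "fmt"

// sendFn stands in for issuing one FUSE_WRITE request. It returns the number
// of bytes the daemon reports as written.
type sendFn func(off uint64, data []byte) (uint32, error)

// writeFragmented writes data at off in pieces of at most maxWrite bytes and
// returns the total number of bytes accepted by the daemon.
func writeFragmented(off uint64, data []byte, maxWrite uint32, send sendFn) (uint64, error) {
	if maxWrite == 0 {
		return 0, fmt.Errorf("max_write must be positive")
	}
	var total uint64
	for len(data) > 0 {
		n := len(data)
		if n > int(maxWrite) {
			n = int(maxWrite)
		}
		written, err := send(off, data[:n])
		if err != nil {
			return total, err
		}
		if int(written) > n {
			return total, fmt.Errorf("daemon reported %d bytes for a %d-byte request", written, n)
		}
		total += uint64(written)
		off += uint64(written)
		data = data[written:]
		if int(written) < n {
			break // short write: report what was accepted so far
		}
	}
	return total, nil
}

func main() {
	var file []byte
	// A fake daemon that stores the bytes in memory and accepts everything.
	send := func(off uint64, data []byte) (uint32, error) {
		file = append(file[:off], data...)
		return uint32(len(data)), nil
	}
	n, err := writeFragmented(0, []byte("hello, fuse"), 4, send)
	fmt.Println(n, err, string(file))
}
```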
# Appendix

## FUSE Protocol

The FUSE protocol is a request-response protocol. All requests are initiated by
the client. The wire format for the protocol is raw C structs serialized to
memory.

All FUSE requests begin with the following request header:

```c
#include <stdint.h>

struct fuse_in_header {
  uint32_t len;     // Length of the request, including this header.
  uint32_t opcode;  // Requested operation.
  uint64_t unique;  // A unique identifier for this request.
  uint64_t nodeid;  // ID of the filesystem object being operated on.
  uint32_t uid;     // UID of the requesting process.
  uint32_t gid;     // GID of the requesting process.
  uint32_t pid;     // PID of the requesting process.
  uint32_t padding;
};
```

The request is then followed by a payload specific to the `opcode`.

All responses begin with this response header:

```c
#include <stdint.h>

struct fuse_out_header {
  uint32_t len;     // Length of the response, including this header.
  int32_t error;    // Status of the request: 0 on success, else a negated errno.
  uint64_t unique;  // The unique identifier from the corresponding request.
};
```

The response payload also depends on the request `opcode`. If `error != 0`, the
response payload must be empty.

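The response rules above translate into a few checks when a reply is parsed. The sketch below decodes a `fuse_out_header` with `encoding/binary` and enforces that the length matches and that an error reply carries no payload. The function names are hypothetical, and little-endian byte order is an assumption about the host.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const fuseOutHeaderSize = 16

// fuseOutHeader mirrors struct fuse_out_header.
type fuseOutHeader struct {
	Len    uint32
	Error  int32
	Unique uint64
}

// parseReply splits a raw reply into header and payload and validates it.
func parseReply(b []byte) (fuseOutHeader, []byte, error) {
	if len(b) < fuseOutHeaderSize {
		return fuseOutHeader{}, nil, errors.New("reply shorter than fuse_out_header")
	}
	h := fuseOutHeader{
		Len:    binary.LittleEndian.Uint32(b[0:4]),
		Error:  int32(binary.LittleEndian.Uint32(b[4:8])),
		Unique: binary.LittleEndian.Uint64(b[8:16]),
	}
	if int(h.Len) != len(b) {
		return h, nil, fmt.Errorf("header len %d does not match reply size %d", h.Len, len(b))
	}
	payload := b[fuseOutHeaderSize:]
	if h.Error != 0 && len(payload) != 0 {
		return h, nil, errors.New("error reply must not carry a payload")
	}
	return h, payload, nil
}

// buildReply assembles a reply the way a server would, for demonstration.
func buildReply(unique uint64, errno int32, payload []byte) []byte {
	b := make([]byte, fuseOutHeaderSize, fuseOutHeaderSize+len(payload))
	binary.LittleEndian.PutUint32(b[0:4], uint32(fuseOutHeaderSize+len(payload)))
	binary.LittleEndian.PutUint32(b[4:8], uint32(errno))
	binary.LittleEndian.PutUint64(b[8:16], unique)
	return append(b, payload...)
}

func main() {
	h, payload, err := parseReply(buildReply(7, 0, []byte("data")))
	fmt.Println(h.Unique, h.Error, string(payload), err)

	// -2 is the negated ENOENT on Linux.
	h, _, err = parseReply(buildReply(8, -2, nil))
	fmt.Println(h.Unique, h.Error, err)
}
```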
### Operations

The following is a list of all FUSE operations used in `fuse_in_header.opcode`
as of Linux v4.4, and a brief description of their purpose. These are defined in
`uapi/linux/fuse.h`. Many of these have a corresponding request and response
payload struct; `fuse(4)` has details for some of these. We also note how these
operations map to the sentry virtual filesystem.

#### FUSE meta-operations

These operations are specific to FUSE and don't have a corresponding action in a
generic filesystem.

- `FUSE_INIT`: This operation initializes a new FUSE filesystem, and is the
  first message sent by the client after mount. This is used for version and
  feature negotiation. This is related to `mount(2)`.
- `FUSE_DESTROY`: Tears down a FUSE filesystem, related to `umount(2)`.
- `FUSE_INTERRUPT`: Interrupts an in-flight operation, specified by the
  `fuse_in_header.unique` value provided in the corresponding request header.
  The client can send at most one of these per request, and will enter an
  uninterruptible wait for a reply. The server is expected to reply promptly.
- `FUSE_FORGET`: A hint to the server that it should evict the indicated node
  from any caches. This is wired up to
  `(struct super_operations).evict_inode` in Linux, which is in turn hooked up
  to the inode cache shrinker, which is typically triggered by system memory
  pressure.
- `FUSE_BATCH_FORGET`: Batch version of `FUSE_FORGET`.

#### Filesystem Syscalls

These FUSE ops map directly to an equivalent filesystem syscall, or family of
syscalls. The relevant syscalls have a similar name to the operation, unless
otherwise noted.

Node creation:

- `FUSE_MKNOD`
- `FUSE_MKDIR`
- `FUSE_CREATE`: Equivalent to `creat(2)`, or `open(2)` with `O_CREAT`, which
  atomically creates and opens a node.

Node attributes and extended attributes:

- `FUSE_GETATTR`
- `FUSE_SETATTR`
- `FUSE_SETXATTR`
- `FUSE_GETXATTR`
- `FUSE_LISTXATTR`
- `FUSE_REMOVEXATTR`

Node link manipulation:

- `FUSE_READLINK`
- `FUSE_LINK`
- `FUSE_SYMLINK`
- `FUSE_UNLINK`

Directory operations:

- `FUSE_RMDIR`
- `FUSE_RENAME`
- `FUSE_RENAME2`
- `FUSE_OPENDIR`: `open(2)` for directories.
- `FUSE_RELEASEDIR`: `close(2)` for directories.
- `FUSE_READDIR`
- `FUSE_READDIRPLUS`
- `FUSE_FSYNCDIR`: `fsync(2)` for directories.
- `FUSE_LOOKUP`: Establishes a unique identifier for a FS node. This is
  reminiscent of `VirtualFilesystem.GetDentryAt` in that it resolves a path
  component to a node. However, the returned identifier is opaque to the
  client. The server must remember this mapping, as this is how the client
  will reference the node in the future.

File operations:

- `FUSE_OPEN`: `open(2)` for files.
- `FUSE_RELEASE`: `close(2)` for files.
- `FUSE_FSYNC`
- `FUSE_FALLOCATE`
- `FUSE_COPY_FILE_RANGE`
- `FUSE_SETUPMAPPING`: Creates a memory map on a file for `mmap(2)`.
- `FUSE_REMOVEMAPPING`: Removes a memory map for `munmap(2)`.

File locking:

- `FUSE_GETLK`
- `FUSE_SETLK`
- `FUSE_SETLKW`

File descriptor operations:

- `FUSE_IOCTL`
- `FUSE_POLL`
- `FUSE_LSEEK`

Filesystem operations:

- `FUSE_STATFS`

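The `FUSE_LOOKUP` entry above notes that the server must remember the mapping from node IDs to nodes. A server-side table for this might look like the following sketch. It is an illustration of the server's bookkeeping, not sentry code; the `nodeTable` type is hypothetical, and the lookup counting models how `FUSE_FORGET` lets the server know when an ID can be released.

```go
package main

import "fmt"

const rootNodeID = 1 // FUSE_ROOT_ID in uapi/linux/fuse.h

// entryKey identifies a directory entry by parent node ID and name.
type entryKey struct {
	parent uint64
	name   string
}

// nodeTable is a hypothetical server-side table that hands out node IDs.
type nodeTable struct {
	next    uint64
	byEntry map[entryKey]uint64
	lookups map[uint64]uint64 // outstanding lookup count per node ID
}

func newNodeTable() *nodeTable {
	return &nodeTable{
		next:    rootNodeID + 1,
		byEntry: make(map[entryKey]uint64),
		lookups: make(map[uint64]uint64),
	}
}

// lookup handles FUSE_LOOKUP: it returns the existing ID for the entry or
// allocates a new one, and counts the lookup.
func (t *nodeTable) lookup(parent uint64, name string) uint64 {
	k := entryKey{parent, name}
	id, ok := t.byEntry[k]
	if !ok {
		id = t.next
		t.next++
		t.byEntry[k] = id
	}
	t.lookups[id]++
	return id
}

// forget handles FUSE_FORGET: it drops n lookups and evicts the node once no
// lookups remain.
func (t *nodeTable) forget(id, n uint64) {
	if t.lookups[id] > n {
		t.lookups[id] -= n
		return
	}
	delete(t.lookups, id)
	for k, v := range t.byEntry {
		if v == id {
			delete(t.byEntry, k)
		}
	}
}

func main() {
	tbl := newNodeTable()
	a := tbl.lookup(rootNodeID, "a.txt")
	b := tbl.lookup(rootNodeID, "b.txt")
	fmt.Println(a, b, tbl.lookup(rootNodeID, "a.txt"))
	tbl.forget(a, 2)
	fmt.Println(len(tbl.byEntry))
}
```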
#### Permissions

- `FUSE_ACCESS` is used to check if a node is accessible, as part of many
  syscall implementations. Maps to `vfs.FilesystemImpl.AccessAt` in the
  sentry.

#### I/O Operations

These ops are used to read and write file pages. They're used to implement both
I/O syscalls like `read(2)` and `write(2)`, and memory-mapped I/O through
`mmap(2)`.

- `FUSE_READ`
- `FUSE_WRITE`

#### Miscellaneous

- `FUSE_FLUSH`: Used by the client to indicate when a file descriptor is
  closed. Distinct from `FUSE_FSYNC`, which corresponds to an `fsync(2)`
  syscall from the user. Maps to `vfs.FileDescriptionImpl.Release` in the
  sentry.
- `FUSE_BMAP`: Old address space API for block defrag. Probably not needed.
- `FUSE_NOTIFY_REPLY`: [TODO: what does this do?]

# References

- [fuse(4) Linux manual page](https://www.man7.org/linux/man-pages/man4/fuse.4.html)
- [Linux kernel FUSE documentation](https://www.kernel.org/doc/html/latest/filesystems/fuse.html)
- [The reference implementation of the Linux FUSE (Filesystem in Userspace)
  interface](https://github.com/libfuse/libfuse)
- [The kernel interface of FUSE](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/uapi/linux/fuse.h)