# Replacing 9P

## Background

The Linux filesystem model consists of the following key aspects (modulo mounts,
which are outside the scope of this discussion):

- A `struct inode` represents a "filesystem object", such as a directory or a
  regular file. "Filesystem object" is most precisely defined by the practical
  properties of an inode, such as an immutable type (regular file, directory,
  symbolic link, etc.) and its independence from the path originally used to
  obtain it.

- A `struct dentry` represents a node in a filesystem tree. Semantically, each
  dentry is immutably associated with an inode representing the filesystem
  object at that position. (Linux implements optimizations involving reuse of
  unreferenced dentries, which allows their associated inodes to change, but
  this is outside the scope of this discussion.)

- A `struct file` represents an open file description (hereafter FD) and is
  needed to perform I/O. Each FD is immutably associated with the dentry
  through which it was opened.

The current gVisor virtual filesystem implementation (hereafter VFS1) closely
imitates the Linux design:

- `struct inode` => `fs.Inode`

- `struct dentry` => `fs.Dirent`

- `struct file` => `fs.File`

gVisor accesses most external filesystems through a variant of the 9P2000.L
protocol, including extensions for performance (`walkgetattr`) and for features
not supported by vanilla 9P2000.L (`flushf`, `lconnect`). The 9P protocol family
is inode-based; 9P fids represent a file (equivalently "file system object"),
and the protocol is structured around alternately obtaining fids to represent
files (with `walk` and, in gVisor, `walkgetattr`) and performing operations on
those fids.
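
As a rough illustration of the fid model, the following Go sketch uses a toy
in-memory server. It is not the 9P wire protocol and not gVisor's client; the
names `toyServer`, `walk`, `getattr`, and `resolve` are hypothetical. A real
`Twalk` can carry several names in one message, whereas this sketch models a
client that walks one component at a time and then operates on the resulting
fid.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// fid is a client-chosen handle, loosely modeled on 9P fids.
type fid uint32

// toyServer maps fids to paths in a fixed in-memory tree and counts RPCs.
type toyServer struct {
	files map[string]bool // set of existing paths; "" is the root
	fids  map[fid]string  // fid -> path of the file it represents
	rpcs  int             // number of simulated RPCs served
}

// newToyServer creates a server whose fid 0 represents the root.
func newToyServer(paths ...string) *toyServer {
	s := &toyServer{files: map[string]bool{"": true}, fids: map[fid]string{0: ""}}
	for _, p := range paths {
		s.files[p] = true
	}
	return s
}

// walk obtains newFid by stepping from an existing fid to one child name.
func (s *toyServer) walk(from, newFid fid, name string) error {
	s.rpcs++
	parent, ok := s.fids[from]
	if !ok {
		return errors.New("unknown fid")
	}
	child := strings.TrimPrefix(parent+"/"+name, "/")
	if !s.files[child] {
		return errors.New("no such file")
	}
	s.fids[newFid] = child
	return nil
}

// getattr is an operation performed on a previously obtained fid.
func (s *toyServer) getattr(f fid) (string, error) {
	s.rpcs++
	p, ok := s.fids[f]
	if !ok {
		return "", errors.New("unknown fid")
	}
	return "/" + p, nil
}

// resolve walks one component at a time from the root fid, then performs
// getattr on the final fid.
func resolve(s *toyServer, path string) (string, error) {
	cur := fid(0)
	for i, name := range strings.Split(strings.Trim(path, "/"), "/") {
		next := fid(i + 1)
		if err := s.walk(cur, next, name); err != nil {
			return "", err
		}
		cur = next
	}
	return s.getattr(cur)
}

func main() {
	s := newToyServer("a", "a/b", "a/b/c")
	attr, err := resolve(s, "/a/b/c")
	fmt.Println(attr, err, s.rpcs)
}
```

In this toy model, every component costs one simulated RPC before any
operation can be issued on the file itself.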

In the sections below, a **shared** filesystem is a filesystem that is *mutably*
accessible by multiple concurrent clients, such that a **non-shared** filesystem
is a filesystem that is either read-only or accessible by only a single client.

## Problems

### Serialization of Path Component RPCs

Broadly speaking, VFS1 traverses each path component in a pathname, alternating
between verifying that each traversed dentry represents an inode that represents
a searchable directory and moving to the next dentry in the path.

In the context of a remote filesystem, the structure of this traversal means
that - modulo caching - a path involving N components requires at least N-1
*sequential* RPCs to obtain metadata for intermediate directories, incurring
significant latency. (In vanilla 9P2000.L, 2(N-1) RPCs are required: N-1 `walk`
and N-1 `getattr`. We added the `walkgetattr` RPC to reduce this overhead.) On
non-shared filesystems, this overhead is primarily significant during
application startup; caching mitigates much of this overhead at steady state. On
shared filesystems, where correct caching requires revalidation (requiring RPCs
for each revalidated directory anyway), this overhead is consistently ruinous.

### Inefficient RPCs

9P is not exceptionally economical with RPCs in general. In addition to the
issue described above:

- Opening an existing file in 9P involves at least 2 RPCs: `walk` to produce
  an unopened fid representing the file, and `lopen` to open the fid.

- Creating a file also involves at least 2 RPCs: `walk` to produce an unopened
  fid representing the parent directory, and `lcreate` to create the file and
  convert the fid to an open fid representing the created file.
  In practice, both the Linux and gVisor 9P clients expect to have an unopened
  fid for the created file (necessitating an additional `walk`), as well as
  attributes for the created file (necessitating an additional `getattr`), for
  a total of 4 RPCs. (In a shared filesystem, where whether a file already
  exists can change between RPCs, a correct implementation of `open(O_CREAT)`
  would have to alternate between these two paths (plus `clunk`ing the
  temporary fid between alternations, since the nature of the `fid` differs
  between the two paths). Neither Linux nor gVisor implements the required
  alternation, so `open(O_CREAT)` without `O_EXCL` can spuriously fail with
  `EEXIST` on both.)

- Closing (`clunk`ing) a fid requires an RPC. VFS1 issues this RPC
  asynchronously in an attempt to reduce critical path latency, but scheduling
  overhead makes this not clearly advantageous in practice.

- `read` and `readdir` can return partial reads without a way to indicate EOF,
  necessitating an additional final read to detect EOF.

- Operations that affect filesystem state do not consistently return updated
  filesystem state. In gVisor, the client implementation attempts to handle
  this by tracking what it thinks updated state "should" be; this is complex,
  and especially brittle for timestamps (which are often not arbitrarily
  settable). In Linux, the client implementation invalidates cached metadata
  whenever it performs such an operation, and reloads it when a dentry
  corresponding to an inode with no valid cached metadata is revalidated; this
  is simple, but necessitates an additional `getattr`.

### Dentry/Inode Ambiguity

As noted above, 9P's documentation tends to imply that unopened fids represent
an inode.
In practice, most filesystem APIs present very limited interfaces for
working with inodes at best, such that the interpretation of unopened fids
varies:

- Linux's 9P client associates unopened fids with (dentry, uid) pairs. When
  caching is enabled, it also associates each inode with the first fid opened
  writably that references that inode, in order to support page cache
  writeback.

- gVisor's 9P client associates unopened fids with inodes, and also caches
  opened fids in inodes in a manner similar to Linux.

- The runsc fsgofer associates unopened fids with both "dentries" (host
  filesystem paths) and "inodes" (host file descriptors); which is used
  depends on the operation invoked on the fid.

For non-shared filesystems, this confusion has resulted in correctness issues
that are (in gVisor) currently handled by a number of coarse-grained locks that
serialize renames with all other filesystem operations. For shared filesystems,
this means inconsistent behavior in the presence of concurrent mutation.

## Design

Almost all Linux filesystem syscalls describe filesystem resources in one of two
ways:

- Path-based: A filesystem position is described by a combination of a
  starting position and a sequence of path components relative to that
  position, where the starting position is one of:

  - The VFS root (defined by mount namespace and chroot), for absolute paths

  - The VFS position of an existing FD, for relative paths passed to `*at`
    syscalls (e.g. `fstatat`)

  - The current working directory, for relative paths passed to non-`*at`
    syscalls and `*at` syscalls with `AT_FDCWD`

- File-description-based: A filesystem object is described by an existing FD,
  passed to a `f*` syscall (e.g. `fstat`).
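
The difference between the two forms can be observed on a host local
filesystem. The following Go sketch (the helper name `renameDemo` and the file
names are made up for illustration) opens a file, renames a different file over
its path, and then compares a path-based `os.Stat` with an FD-based
`File.Stat`. It assumes POSIX rename semantics, where an open file remains
usable after its name is replaced.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// renameDemo returns the file size seen by a path-based lookup and by an
// FD-based lookup after the path has been re-pointed at a different file.
func renameDemo() (int64, int64, error) {
	dir, err := os.MkdirTemp("", "pathvsfd")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)

	p := filepath.Join(dir, "data")
	if err := os.WriteFile(p, []byte("old"), 0o600); err != nil {
		return 0, 0, err
	}
	f, err := os.Open(p)
	if err != nil {
		return 0, 0, err
	}
	defer f.Close()

	// Replace what the path refers to by renaming a new file over it.
	tmp := filepath.Join(dir, "data.new")
	if err := os.WriteFile(tmp, []byte("newer contents"), 0o600); err != nil {
		return 0, 0, err
	}
	if err := os.Rename(tmp, p); err != nil {
		return 0, 0, err
	}

	// Path-based: the name is looked up again and finds the new file.
	pathInfo, err := os.Stat(p)
	if err != nil {
		return 0, 0, err
	}
	// FD-based: the existing open file is used without another lookup.
	fdInfo, err := f.Stat()
	if err != nil {
		return 0, 0, err
	}
	return pathInfo.Size(), fdInfo.Size(), nil
}

func main() {
	pathSize, fdSize, err := renameDemo()
	fmt.Println(pathSize, fdSize, err)
}
```

The path-based call reports the size of the replacement file, while the
FD-based call still reports the size of the file that was originally opened.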

Many of our issues with 9P arise from its (and VFS') interposition of a model
based on inodes between the filesystem syscall API and filesystem
implementations. We propose to replace 9P with a protocol that does not feature
inodes at all, and instead closely follows the filesystem syscall API by
featuring only path-based and FD-based operations, with minimal deviations as
necessary to ameliorate deficiencies in the syscall interface (see below). This
approach addresses the issues described above:

- Even on shared filesystems, most application filesystem syscalls are
  translated to a single RPC (possibly excepting special cases described
  below), which is a logical lower bound.

- The behavior of application syscalls on shared filesystems is
  straightforwardly predictable: path-based syscalls are translated to
  path-based RPCs, which will re-lookup the file at that path, and FD-based
  syscalls are translated to FD-based RPCs, which use an existing open file
  without performing another lookup. (This is at least true on gofers that
  proxy the host local filesystem; other filesystems that lack support for
  e.g. certain operations on FDs may have different behavior, but this
  divergence is at least still predictable and inherent to the underlying
  filesystem implementation.)

Note that this approach is only feasible in gVisor's next-generation virtual
filesystem (VFS2), which does not assume the existence of inodes and allows the
remote filesystem client to translate whole path-based syscalls into RPCs. Thus
one of the unavoidable tradeoffs associated with such a protocol vs. 9P is the
inability to construct a Linux client that is performance-competitive with
gVisor.

### File Permissions

Many filesystem operations are side-effectual, such that file permissions must
be checked before such operations take effect.
The simplest approach to file
permission checking is for the sentry to obtain permissions from the remote
filesystem, then apply permission checks in the sentry before performing the
application-requested operation. However, this requires an additional RPC per
application syscall (which can't be mitigated by caching on shared filesystems).
Alternatively, we may delegate file permission checking to gofers. In general,
file permission checks depend on the following properties of the accessor:

- Filesystem UID/GID

- Supplementary GIDs

- Effective capabilities in the accessor's user namespace (i.e. the accessor's
  effective capability set)

- All UIDs and GIDs mapped in the accessor's user namespace (which determine
  if the accessor's capabilities apply to accessed files)

We may choose to delay implementation of file permission checking delegation,
although this is potentially costly since it doubles the number of required RPCs
for most operations on shared filesystems. We may also consider compromise
options, such as only delegating file permission checks for accessors in the
root user namespace.

### Symbolic Links

gVisor usually interprets symbolic link targets in its VFS rather than on the
filesystem containing the symbolic link; thus e.g. a symlink to
"/proc/self/maps" on a remote filesystem resolves to said file in the sentry's
procfs rather than the host's. This implies that:

- Remote filesystem servers that proxy filesystems supporting symlinks must
  check if each path component is a symlink during path traversal.

- Absolute symlinks require that the sentry restart the operation at its
  contextual VFS root (which is task-specific and may not be on a remote
  filesystem at all), so if a remote filesystem server encounters an absolute
  symlink during path traversal on behalf of a path-based operation, it must
  terminate path traversal and return the symlink target.

- Relative symlinks begin target resolution in the parent directory of the
  symlink, so in theory most relative symlinks can be handled automatically
  during the path traversal that encounters the symlink, provided that said
  traversal is supplied with the number of remaining symlinks before `ELOOP`.
  However, the new path traversed by the symlink target may cross VFS mount
  boundaries, such that it's only safe for remote filesystem servers to
  speculatively follow relative symlinks for side-effect-free operations such
  as `stat` (where the sentry can simply ignore results that are inapplicable
  due to crossing mount boundaries). We may choose to delay implementation of
  this feature, at the cost of an additional RPC per relative symlink (note
  that even if the symlink target crosses a mount boundary, the sentry will
  need to `stat` the path to the mount boundary to confirm that each traversed
  component is an accessible directory); until it is implemented, relative
  symlinks may be handled like absolute symlinks, by terminating path
  traversal and returning the symlink target.

The possibility of symlinks (and the possibility of a compromised sentry) means
that the sentry may issue RPCs with paths that, in the absence of symlinks,
would traverse beyond the root of the remote filesystem. For example, the sentry
may issue an RPC with a path like "/foo/../..", on the premise that if "/foo" is
a symlink then the resulting path may be elsewhere on the remote filesystem.
To handle this, path traversal must also track its current depth below the
remote filesystem root, and terminate path traversal if it would ascend beyond
this point.

### Path Traversal

Since path-based VFS operations will translate to path-based RPCs, filesystem
servers will need to handle path traversal. From the perspective of a given
filesystem implementation in the server, there are two basic approaches to path
traversal:

- Inode-walk: For each path component, obtain a handle to the underlying
  filesystem object (e.g. with `open(O_PATH)`), check if that object is a
  symlink (as described above) and that that object is accessible by the
  caller (e.g. with `fstat()`), then continue to the next path component (e.g.
  with `openat()`). This ensures that the checked filesystem object is the one
  used to obtain the next object in the traversal, which is intuitively
  appealing. However, while this approach works for host local filesystems, it
  requires features that are not widely supported by other filesystems.

- Path-walk: For each path component, use a path-based operation to determine
  if the filesystem object currently referred to by that path component is a
  symlink / is accessible. This is highly portable, but suffers from quadratic
  behavior (at the level of the underlying filesystem implementation, the
  first path component will be traversed a number of times equal to the number
  of path components in the path).

The implementation should support either option by delegating path traversal to
filesystem implementations within the server (like VFS and the remote filesystem
protocol itself), as inode-walking is still safe, efficient, amenable to FD
caching, and implementable on non-shared host local filesystems (a sufficiently
common case as to be worth considering in the design).
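
A minimal Go sketch of the path-walk approach, combined with the symlink and
depth rules from the previous section, might look as follows. It is an
illustration under simplifying assumptions, not the proposed server: the names
`pathWalk`, `walkResult`, `errEscape`, and `errNotDir` are hypothetical,
permission checks are omitted, every symlink (absolute or relative) terminates
traversal, and nothing here guards against concurrent renames or symlink
replacement.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

var (
	errEscape = errors.New("path ascends above filesystem root")
	errNotDir = errors.New("not a directory")
)

// walkResult is a hypothetical reply type for this sketch.
type walkResult struct {
	Resolved   string // components below root that were traversed
	LinkTarget string // set if traversal stopped at a symlink
	Remaining  string // components after the symlink, left to the caller
}

// pathWalk resolves rel below root using only path-based calls.
func pathWalk(root, rel string) (walkResult, error) {
	var stack []string  // len(stack) is the current depth below root
	parentIsDir := true // root is assumed to be a directory
	comps := strings.Split(rel, "/")
	for i, c := range comps {
		if c == "" || c == "." {
			continue
		}
		if !parentIsDir {
			return walkResult{}, errNotDir
		}
		if c == ".." {
			if len(stack) == 0 {
				return walkResult{}, errEscape // would ascend beyond root
			}
			stack = stack[:len(stack)-1]
			continue
		}
		stack = append(stack, c)
		// Lstat re-traverses every earlier component of p, which is the
		// source of the quadratic behavior.
		p := filepath.Join(root, filepath.Join(stack...))
		fi, err := os.Lstat(p)
		if err != nil {
			return walkResult{}, err
		}
		if fi.Mode()&os.ModeSymlink != 0 {
			target, err := os.Readlink(p)
			if err != nil {
				return walkResult{}, err
			}
			// Stop here and hand the target back to the caller.
			return walkResult{
				Resolved:   filepath.Join(stack[:len(stack)-1]...),
				LinkTarget: target,
				Remaining:  strings.Join(comps[i+1:], "/"),
			}, nil
		}
		parentIsDir = fi.IsDir()
	}
	return walkResult{Resolved: filepath.Join(stack...)}, nil
}

// walkDemo builds a small tree in a temporary directory and walks three paths.
func walkDemo() (walkResult, walkResult, bool, error) {
	root, err := os.MkdirTemp("", "pathwalk")
	if err != nil {
		return walkResult{}, walkResult{}, false, err
	}
	defer os.RemoveAll(root)
	if err := os.MkdirAll(filepath.Join(root, "a", "b"), 0o700); err != nil {
		return walkResult{}, walkResult{}, false, err
	}
	if err := os.Symlink("/proc/self/maps", filepath.Join(root, "a", "link")); err != nil {
		return walkResult{}, walkResult{}, false, err
	}
	plain, err := pathWalk(root, "a/b/../b")
	if err != nil {
		return walkResult{}, walkResult{}, false, err
	}
	stopped, err := pathWalk(root, "a/link/x/y")
	if err != nil {
		return walkResult{}, walkResult{}, false, err
	}
	_, escErr := pathWalk(root, "a/../..")
	return plain, stopped, errors.Is(escErr, errEscape), nil
}

func main() {
	plain, stopped, escaped, err := walkDemo()
	fmt.Printf("%+v\n%+v\n%v %v\n", plain, stopped, escaped, err)
}
```

An inode-walk variant would instead hold a handle to each traversed directory
and open the next component relative to it, so that the object checked is the
object used for the next step.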

Both approaches are susceptible to race conditions that may permit sandboxed
filesystem escapes:

- Under inode-walk, a malicious application may cause a directory to be moved
  (with `rename`) during path traversal, such that the filesystem
  implementation incorrectly determines whether subsequent inodes are located
  in paths that should be visible to sandboxed applications.

- Under path-walk, a malicious application may cause a non-symlink file to be
  replaced with a symlink during path traversal, such that following path
  operations will incorrectly follow the symlink.

Both race conditions can, to some extent, be mitigated in filesystem server
implementations by synchronizing path traversal with the hazardous operations in
question. However, shared filesystems are frequently used to share data between
sandboxed and unsandboxed applications in a controlled way, and in some cases a
malicious sandboxed application may be able to take advantage of a hazardous
filesystem operation performed by an unsandboxed application. In some cases,
filesystem features may be available to ensure safety even in such cases (e.g.
[the new openat2() syscall](https://man7.org/linux/man-pages/man2/openat2.2.html)),
but it is not clear how to solve this problem in general. (Note that this issue
is not specific to our design; rather, it is a fundamental limitation of
filesystem sandboxing.)

### Filesystem Multiplexing

A given sentry may need to access multiple distinct remote filesystems (e.g.
different volumes for a given container). In many cases, there is no advantage
to serving these filesystems from distinct filesystem servers, or accessing them
through distinct connections (factors such as maximum RPC concurrency should be
based on available host resources).
Therefore, the protocol should support
multiplexing of distinct filesystem trees within a single session. 9P supports
this by allowing multiple calls to the `attach` RPC to produce fids representing
distinct filesystem trees, but this is somewhat clunky; we propose a much
simpler mechanism wherein each message that conveys a path also conveys a
numeric filesystem ID that identifies a filesystem tree.

## Alternatives Considered

### Additional Extensions to 9P

There are at least three conceptual aspects to 9P:

- Wire format: messages with a 4-byte little-endian size prefix, strings with
  a 2-byte little-endian size prefix, etc. Whether the wire format is worth
  retaining is unclear; in particular, it's unclear that the 9P wire format
  has a significant advantage over protobufs, which are substantially easier
  to extend. Note that the official Go protobuf implementation is widely known
  to suffer from a significant number of performance deficiencies, so if we
  choose to switch to protobuf, we may need to use an alternative toolchain
  such as `gogo/protobuf` (which is also widely used in the Go ecosystem, e.g.
  by Kubernetes).

- Filesystem model: fids, qids, etc. Discarding this is one of the motivations
  for this proposal.

- RPCs: Twalk, Tlopen, etc. In addition to previously-described
  inefficiencies, most of these are dependent on the filesystem model and
  therefore must be discarded.

### FUSE

The FUSE (Filesystem in Userspace) protocol is frequently used to provide
arbitrary userspace filesystem implementations to a host Linux kernel.
Unfortunately, FUSE is also inode-based, and therefore doesn't address any of
the problems we have with 9P.

### virtio-fs

virtio-fs is an ongoing project aimed at improving Linux VM filesystem
performance when accessing Linux host filesystems (vs. virtio-9p).
In brief, it is based on:

- Using a FUSE client in the guest that communicates over virtio with a FUSE
  server in the host.

- Using DAX to map the host page cache into the guest.

- Using a file metadata table in shared memory to avoid VM exits for metadata
  updates.

None of these improvements seem applicable to gVisor:

- As explained above, FUSE is still inode-based, so it is still susceptible to
  most of the problems we have with 9P.

- Our use of host file descriptors already allows us to leverage the host page
  cache for file contents.

- Our need for shared filesystem coherence is usually based on a user
  requirement that an out-of-sandbox filesystem mutation is guaranteed to be
  visible by all subsequent observations from within the sandbox, or vice
  versa; it's not clear that this can be guaranteed without a synchronous
  signaling mechanism like an RPC.