github.com/SagerNet/gvisor@v0.0.0-20210707092255-7731c139d75c/pkg/sentry/fs/README.md

This package provides an implementation of the Linux virtual filesystem.

[TOC]

## Overview

- An `fs.Dirent` caches an `fs.Inode` in memory at a path in the VFS, giving
  the `fs.Inode` a relative position with respect to other `fs.Inode`s.

- If an `fs.Dirent` is referenced by two file descriptors, then those file
  descriptors are coherent with each other: they depend on the same
  `fs.Inode`.

- A mount point is an `fs.Dirent` for which `fs.Dirent.mounted` is true. It
  exposes the root of a mounted filesystem.

- The `fs.Inode` produced by a registered filesystem on mount(2) owns an
  `fs.MountedFilesystem` from which other `fs.Inode`s will be looked up. For a
  remote filesystem, the `fs.MountedFilesystem` owns the connection to that
  remote filesystem.

- In general:

  ```
  fs.Inode <------------------------------
      |                                  |
      |                                  |
  produced by                            |
  exactly one                            |
      |                         responsible for the
      |                         virtual identity of
      v                                  |
  fs.MountedFilesystem -------------------
  ```

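The relationships above can be illustrated with a minimal sketch. The types
below are simplified stand-ins invented for this example, not the real
definitions from the fs package, and the `Coherent` helper is hypothetical. The
sketch only shows that two files opened through the same dirent observe the same
inode, and that every inode points back at the mounted filesystem that produced
it.

```go
package main

import "fmt"

// MountedFilesystem is a simplified stand-in for fs.MountedFilesystem.
type MountedFilesystem struct {
	Name string
}

// Inode is a simplified stand-in for fs.Inode. Every Inode is produced by
// exactly one MountedFilesystem.
type Inode struct {
	MountedFS *MountedFilesystem
	Size      int64
}

// Dirent is a simplified stand-in for fs.Dirent: it caches an Inode at a
// position in the tree.
type Dirent struct {
	Name    string
	Inode   *Inode
	Parent  *Dirent
	Mounted bool // true for a mount point
}

// File is a simplified stand-in for an open file.
type File struct {
	Dirent *Dirent
	Offset int64
}

// Coherent reports whether two files depend on the same Inode.
// It is a hypothetical helper for this sketch only.
func Coherent(a, b *File) bool {
	return a.Dirent.Inode == b.Dirent.Inode
}

func main() {
	msrc := &MountedFilesystem{Name: "tmpfs"}
	root := &Dirent{Name: "/", Inode: &Inode{MountedFS: msrc}, Mounted: true}
	data := &Dirent{Name: "data", Inode: &Inode{MountedFS: msrc}, Parent: root}

	// Two files opened through the same Dirent share one Inode, so a change
	// made through one is visible through the other.
	f1, f2 := &File{Dirent: data}, &File{Dirent: data}
	f1.Dirent.Inode.Size = 42
	fmt.Println(Coherent(f1, f2), f2.Dirent.Inode.Size)
}
```
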
Glossary:

- VFS: virtual filesystem.

- inode: a virtual file object holding a cached view of a file on a backing
  filesystem (includes metadata and page caches).

- superblock: the virtual state of a mounted filesystem (e.g. the virtual
  inode number set).

- mount namespace: a view of the mounts under a root (during path traversal,
  the VFS makes visible/follows the mount point that is in the current task's
  mount namespace).

## Save and restore

An application's hard dependencies on filesystem state can be broken down into
two categories:

- The state necessary to execute a traversal on or view the *virtual*
  filesystem hierarchy, regardless of what files an application has open.

- The state necessary to represent open files.

The first must always be saved and restored. An application may never have any
open file descriptors, but across save and restore it should see a coherent
view of any mount namespace. NOTE(b/63601033): Currently only one "initial"
mount namespace is supported.

The second is needed so that system calls across save and restore are coherent
with each other (e.g. so that unintended re-reads or overwrites do not occur).

Specifically this state is:

- An `fs.MountManager` containing mount points.

- A `kernel.FDTable` containing pointers to open files.

Anything else managed by the VFS that can be easily loaded into memory from a
filesystem is synced back to those filesystems and is not saved. Examples are
pages in page caches used for optimizations (i.e. readahead and writeback), and
directory entries used to accelerate path lookups.

### Mount points

Saving and restoring a mount point means saving and restoring:

- The root of the mounted filesystem.

- Mount flags, which control how the VFS interacts with the mounted
  filesystem.

- Any relevant metadata about the mounted filesystem.

- All `fs.Inode`s referenced by the application that reside under the mount
  point.

`fs.MountedFilesystem` is metadata about a filesystem that is mounted. It is
referenced by every `fs.Inode` loaded into memory under the mount point
including the `fs.Inode` of the mount point itself. The `fs.MountedFilesystem`
maps file objects on the filesystem to a virtualized `fs.Inode` number and vice
versa.

To restore all `fs.Inode`s under a given mount point, each `fs.Inode` leverages
its dependency on an `fs.MountedFilesystem`. Since the `fs.MountedFilesystem`
knows how an `fs.Inode` maps to a file object on a backing filesystem, this
mapping can be trivially consulted by each `fs.Inode` when the `fs.Inode` is
restored.

In detail, a mount point is saved in two steps:

- First, after the kernel is paused but before state.Save, we walk all mount
  namespaces and install a mapping from `fs.Inode` numbers to file paths
  relative to the root of the mounted filesystem in each
  `fs.MountedFilesystem`. This is subsequently called the set of `fs.Inode`
  mappings.

- Second, during state.Save, each `fs.MountedFilesystem` decides whether to
  save the set of `fs.Inode` mappings. In-memory filesystems, like tmpfs, have
  no need to save a set of `fs.Inode` mappings, since the `fs.Inode`s can be
  entirely encoded in the state file. Each `fs.MountedFilesystem` also
  optionally saves the device name from when the filesystem was originally
  mounted. Each `fs.Inode` saves its virtual identifier and a reference to an
  `fs.MountedFilesystem`.

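The first save step can be sketched as a tree walk that records, for every
inode number, the path relative to the root of the mounted filesystem. The
`node` type, `inodeMappings`, and `sampleTree` are hypothetical names made up
for this illustration; the real package walks `fs.Dirent`s in each mount
namespace rather than a toy tree.

```go
package main

import (
	"fmt"
	"path"
)

// node is a hypothetical in-memory directory entry with a virtual inode
// number. It stands in for the dirent tree under one mount point.
type node struct {
	name     string
	ino      uint64
	children []*node
}

// inodeMappings walks the tree rooted at root and returns a map from inode
// number to the path relative to root. The root itself maps to ".".
func inodeMappings(root *node) map[uint64]string {
	m := make(map[uint64]string)
	var walk func(n *node, rel string)
	walk = func(n *node, rel string) {
		m[n.ino] = rel
		for _, c := range n.children {
			walk(c, path.Join(rel, c.name))
		}
	}
	walk(root, ".")
	return m
}

// sampleTree builds a small tree: / -> etc -> hosts.
func sampleTree() *node {
	return &node{name: "/", ino: 1, children: []*node{
		{name: "etc", ino: 2, children: []*node{
			{name: "hosts", ino: 3},
		}},
	}}
}

func main() {
	m := inodeMappings(sampleTree())
	fmt.Println(m[1], m[2], m[3])
}
```

Keeping the paths relative to the mount root is what lets the same mappings be
reused when the filesystem is re-attached at restore time.
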
A mount point is restored in two steps:

- First, before state.Load, all mount configurations are stored in a global
  `fs.RestoreEnvironment`. This tells us what mount points the user wants to
  restore and how to re-establish pointers to backing filesystems.

- Second, during state.Load, each `fs.MountedFilesystem` optionally searches
  for a mount in the `fs.RestoreEnvironment` that matches its saved device
  name. The `fs.MountedFilesystem` then re-establishes a pointer to the root
  of the mounted filesystem. For example, the mount specification provides the
  network connection for a mounted remote filesystem client to communicate
  with its remote file server. The `fs.MountedFilesystem` also trivially loads
  its set of `fs.Inode` mappings. When an `fs.Inode` is encountered, the
  `fs.Inode` loads its virtual identifier and its reference to an
  `fs.MountedFilesystem`. It uses the `fs.MountedFilesystem` to obtain the
  root of the mounted filesystem and the `fs.Inode` mappings to obtain the
  relative file path to its data. With these, the `fs.Inode` re-establishes a
  pointer to its file object.

A mount point can trivially restore its `fs.Inode`s in parallel since
`fs.Inode`s have a restore dependency on their `fs.MountedFilesystem` and not on
each other.

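The restore flow above can be sketched as follows. All names here
(`restoreEnvironment`, `mountedFS`, `inode`, `restoreAll`) are hypothetical and
heavily simplified: the environment is modeled as a map from device name to a
new root path, and an inode's "file object" is modeled as a plain path string.
The sketch shows the mounted filesystem being restored first, after which the
inodes are restored concurrently because each depends only on the mounted
filesystem.

```go
package main

import (
	"fmt"
	"path"
	"sync"
)

// restoreEnvironment is a hypothetical map from saved device name to the
// root that the mount should be re-attached to.
type restoreEnvironment map[string]string

// mountedFS is a simplified stand-in for fs.MountedFilesystem.
type mountedFS struct {
	device   string
	root     string            // re-established on restore
	mappings map[uint64]string // inode number -> path relative to root
}

// restore finds the mount matching the saved device name.
func (m *mountedFS) restore(env restoreEnvironment) error {
	root, ok := env[m.device]
	if !ok {
		return fmt.Errorf("no mount configured for device %q", m.device)
	}
	m.root = root
	return nil
}

// inode is a simplified stand-in for fs.Inode.
type inode struct {
	ino     uint64
	fs      *mountedFS
	backing string // location of the file object, set on restore
}

// restore uses only the mounted filesystem, never another inode.
func (i *inode) restore() {
	i.backing = path.Join(i.fs.root, i.fs.mappings[i.ino])
}

// restoreAll restores every inode in its own goroutine. This is safe because
// the goroutines only read the shared mountedFS and each writes its own inode.
func restoreAll(inodes []*inode) {
	var wg sync.WaitGroup
	for _, i := range inodes {
		wg.Add(1)
		go func(i *inode) {
			defer wg.Done()
			i.restore()
		}(i)
	}
	wg.Wait()
}

func main() {
	env := restoreEnvironment{"dev0": "/mnt/data"}
	mfs := &mountedFS{device: "dev0", mappings: map[uint64]string{1: ".", 2: "a/b"}}
	if err := mfs.restore(env); err != nil {
		fmt.Println(err)
		return
	}
	inodes := []*inode{{ino: 1, fs: mfs}, {ino: 2, fs: mfs}}
	restoreAll(inodes)
	fmt.Println(inodes[0].backing, inodes[1].backing)
}
```
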
### Open files

An `fs.File` references the following filesystem objects:

```go
fs.File -> fs.Dirent -> fs.Inode -> fs.MountedFilesystem
```

The `fs.Inode` is restored using its `fs.MountedFilesystem`. The
[Mount points](#mount-points) section above describes how this happens in
detail. The `fs.Dirent` restores its pointer to an `fs.Inode`, pointers to
parent and children `fs.Dirent`s, and the basename of the file.

Otherwise an `fs.File` restores flags, an offset, and a unique identifier (only
used internally).

It may use the `fs.Inode`, which it indirectly holds a reference on through the
`fs.Dirent`, to re-establish an open file handle on the backing filesystem (e.g.
to continue reading and writing).

## Overlay

The overlay implementation in the fs package takes Linux overlayfs as a frame of
reference but corrects for several POSIX consistency errors.

In Linux overlayfs, the `struct inode` used for reading and writing to the same
file may be different. This is because the `struct inode` is dissociated from
the process of copying up the file from the lower to the upper directory. Since
flock(2) and fcntl(2) locks, inotify(7) watches, page caches, and a file's
identity are all stored directly or indirectly off the `struct inode`, these
properties of the `struct inode` may be stale after the first modification. This
can lead to file locking bugs, missed inotify events, and inconsistent data in
shared memory mappings of files, to name a few problems.

The fs package maintains a single `fs.Inode` to represent a directory entry in
an overlay and defines operations on this `fs.Inode` which synchronize with the
copy up process. This achieves several things:

+ File locks, inotify watches, and the identity of the file need not be copied
  at all.

+ Memory mappings of files coordinate with the copy up process so that if a
  file in the lower directory is memory mapped, all references to it are
  invalidated, forcing the application to re-fault on memory mappings of the
  file under the upper directory.

The `fs.Inode` holds metadata about files in the upper and/or lower directories
via an `fs.overlayEntry`. The `fs.overlayEntry` implements the `fs.Mappable`
interface. It multiplexes between upper and lower directory memory mappings and
stores a copy of memory references so they can be transferred to the upper
directory `fs.Mappable` when the file is copied up.

The lower filesystem in an overlay may contain another (nested) overlay, but the
upper filesystem may not contain another overlay. In other words, nested
overlays form a tree structure that only allows branching in the lower
filesystem.

Caching decisions in the overlay are delegated to the upper filesystem, meaning
that the Keep and Revalidate methods on the overlay return the same values as
the upper filesystem. A small wrinkle is that the lower filesystem is not
allowed to return `true` from Revalidate, as the overlay cannot reload inodes
from the lower filesystem. A lower filesystem that does return `true` from
Revalidate will trigger a panic.

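The delegation rule for caching decisions can be sketched as below. The
`cachePolicy` interface and its single-argument method signatures are
assumptions made for this example and do not match the real method signatures in
the fs package; `overlayPolicy` and `fixedPolicy` are hypothetical. The sketch
shows only the rule itself: answers come from the upper filesystem, and a lower
filesystem that asks for revalidation causes a panic.

```go
package main

import "fmt"

// cachePolicy is a hypothetical, simplified view of the Keep and Revalidate
// methods that a mounted filesystem provides.
type cachePolicy interface {
	Keep(name string) bool
	Revalidate(name string) bool
}

// overlayPolicy delegates caching decisions to the upper filesystem.
type overlayPolicy struct {
	upper, lower cachePolicy
}

// Keep returns whatever the upper filesystem returns.
func (o overlayPolicy) Keep(name string) bool {
	return o.upper.Keep(name)
}

// Revalidate returns whatever the upper filesystem returns, but panics if the
// lower filesystem asks for revalidation, since the overlay cannot reload
// inodes from the lower filesystem.
func (o overlayPolicy) Revalidate(name string) bool {
	if o.lower.Revalidate(name) {
		panic("overlay cannot revalidate inodes from the lower filesystem")
	}
	return o.upper.Revalidate(name)
}

// fixedPolicy always gives the same answers; it exists only for the demo.
type fixedPolicy struct{ keep, revalidate bool }

func (f fixedPolicy) Keep(string) bool       { return f.keep }
func (f fixedPolicy) Revalidate(string) bool { return f.revalidate }

func main() {
	p := overlayPolicy{
		upper: fixedPolicy{keep: true, revalidate: true},
		lower: fixedPolicy{},
	}
	fmt.Println(p.Keep("file"), p.Revalidate("file"))
}
```
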
The `fs.Inode` also holds a reference to an `fs.MountedFilesystem` that
normalizes across the mounted filesystem state of the upper and lower
directories.

When a file is copied from the lower to the upper directory, attempts to
interact with the file block until the copy completes. All copying synchronizes
with rename(2).

## Future Work

### Overlay

When a file is copied from a lower directory to an upper directory, several
locks are taken: the global `renameMu` and the `copyMu` of the `fs.Inode` being
copied. This blocks operations on the file, including fault handling of memory
mappings. Performance could be improved by copying files into a temporary
directory that resides on the same filesystem as the upper directory and doing
an atomic rename, holding locks only during the rename operation.

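The proposed improvement can be sketched with ordinary host files. This is not
how the fs package copies up today; it is an illustration of the
copy-then-rename idea using the standard library, with a package-level mutex
standing in for the locks mentioned above. `copyUp` and `demo` are hypothetical
names. The temporary file is created in the same directory as the destination so
that the rename stays on one filesystem.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
	"sync"
)

// renameMu is a stand-in for the lock that serializes with rename; it is not
// the lock from the fs package.
var renameMu sync.Mutex

// copyUp copies lowerPath to upperPath by writing a temporary file next to
// upperPath and renaming it into place. The lock is held only for the rename.
func copyUp(lowerPath, upperPath string) error {
	src, err := os.Open(lowerPath)
	if err != nil {
		return err
	}
	defer src.Close()

	tmp, err := os.CreateTemp(filepath.Dir(upperPath), ".copyup-*")
	if err != nil {
		return err
	}
	// Clean up the temporary file on failure. After a successful rename the
	// name no longer exists and this call fails harmlessly.
	defer os.Remove(tmp.Name())

	if _, err := io.Copy(tmp, src); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}

	renameMu.Lock()
	defer renameMu.Unlock()
	return os.Rename(tmp.Name(), upperPath)
}

// demo copies a small file and returns the content seen at the destination.
func demo() (string, error) {
	dir, err := os.MkdirTemp("", "overlay-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	lower := filepath.Join(dir, "lower.txt")
	upper := filepath.Join(dir, "upper.txt")
	if err := os.WriteFile(lower, []byte("hello"), 0o644); err != nil {
		return "", err
	}
	if err := copyUp(lower, upper); err != nil {
		return "", err
	}
	data, err := os.ReadFile(upper)
	return string(data), err
}

func main() {
	fmt.Println(demo())
}
```
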
Additionally, files are copied up synchronously. For large files, this causes
noticeable latency. Performance could be improved by pipelining copies at
non-overlapping file offsets.
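
One way to picture copying at non-overlapping offsets is a small worker pool
over `io.ReaderAt` and `io.WriterAt`. The sketch below is an assumption about
what such pipelining could look like, not a description of existing code;
`copyChunks`, `memFile`, and `demo` are hypothetical names, and the chunk size
and worker count are arbitrary.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"sync"
)

// memFile is a fixed-size in-memory io.WriterAt used only for this sketch.
type memFile []byte

func (m memFile) WriteAt(p []byte, off int64) (int, error) {
	if off < 0 || off+int64(len(p)) > int64(len(m)) {
		return 0, errors.New("write out of range")
	}
	return copy(m[off:], p), nil
}

// copyChunks copies size bytes from src to dst in chunks of at most chunk
// bytes. Each offset is handled by exactly one worker, so no two workers
// touch the same byte range. It returns the first error encountered.
func copyChunks(dst io.WriterAt, src io.ReaderAt, size, chunk int64, workers int) error {
	if chunk <= 0 || workers <= 0 {
		return errors.New("chunk and workers must be positive")
	}
	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)
	fail := func(err error) { once.Do(func() { firstErr = err }) }

	offsets := make(chan int64)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			buf := make([]byte, chunk)
			// Keep draining offsets even after an error so that the
			// producer below never blocks.
			for off := range offsets {
				n := chunk
				if size-off < n {
					n = size - off
				}
				m, err := src.ReadAt(buf[:n], off)
				if err != nil && !(err == io.EOF && int64(m) == n) {
					fail(err)
					continue
				}
				if _, err := dst.WriteAt(buf[:n], off); err != nil {
					fail(err)
				}
			}
		}()
	}
	for off := int64(0); off < size; off += chunk {
		offsets <- off
	}
	close(offsets)
	wg.Wait()
	return firstErr
}

// demo copies a buffer whose length is not a multiple of the chunk size and
// reports whether the destination matches the source.
func demo() bool {
	src := make([]byte, 10000)
	for i := range src {
		src[i] = byte(i % 251)
	}
	dst := make(memFile, len(src))
	if err := copyChunks(dst, bytes.NewReader(src), int64(len(src)), 4096, 3); err != nil {
		return false
	}
	return bytes.Equal(dst, src)
}

func main() {
	fmt.Println(demo())
}
```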