# Inotify

Inotify is a mechanism for monitoring filesystem events in Linux (see
inotify(7)). An inotify instance can be used to monitor files and directories
for modifications, creation/deletion, etc. The inotify API consists of system
calls that create inotify instances (inotify_init/inotify_init1) and that add
watches on files to an instance or remove them
(inotify_add_watch/inotify_rm_watch). Events are generated from various places
in the sentry, including the syscall layer, the vfs layer, the process fd table,
and within each filesystem implementation. This document outlines the
implementation details of inotify.

## Inotify Objects

Inotify data structures are implemented in the vfs package.

### vfs.Inotify

Inotify instances are represented by vfs.Inotify objects, which implement
vfs.FileDescriptionImpl. As in Linux, inotify fds are backed by a
pseudo-filesystem (anonfs). Each inotify instance receives events from a set of
vfs.Watch objects, which can be modified with inotify_add_watch(2) and
inotify_rm_watch(2). An application can retrieve events by reading the inotify
fd.

### vfs.Watches

The set of all watches held on a single file (i.e., the watch target) is stored
in vfs.Watches. Each watch belongs to a different inotify instance (an instance
can only have one watch on any watch target). The watches are stored in a map
indexed by their vfs.Inotify owner’s id. Hard links and file descriptions to a
single file all share the same vfs.Watches (with the exception of the gofer
filesystem, described in a later section). Activity on the target causes its
vfs.Watches to generate notifications on its watches’ inotify instances.

### vfs.Watch

A vfs.Watch is a single watch, owned by one inotify instance and applied to one
watch target.
Both the vfs.Inotify owner and the vfs.Watches on the target hold the vfs.Watch,
which leads to some complicated locking behavior (see Lock Ordering). Whenever a
watch is notified of an event on its target, it queues events to its inotify
instance for delivery to the user.

### vfs.Event

vfs.Event is a simple struct encapsulating all the fields for an inotify event.
It is generated by vfs.Watches and forwarded to the watches' owners. It is
serialized to the user during read(2) syscalls on the associated vfs.Inotify's
fd.

## Lock Ordering

There are three locks related to the inotify implementation:

*   Inotify.mu: the inotify instance lock.
*   Inotify.evMu: the inotify event queue lock.
*   Watches.mu: the watch set lock, used to protect the collection of watches
    on a target.

The correct lock ordering for inotify code is:

Inotify.mu -> Watches.mu -> Inotify.evMu.

Note that we use a distinct lock to protect the inotify event queue. If we
simply used Inotify.mu, we could simultaneously have locks being acquired in the
order of Inotify.mu -> Watches.mu and Watches.mu -> Inotify.mu, which would
cause deadlocks. For instance, adding a watch to an inotify instance would
require locking Inotify.mu, and then adding the same watch to the target would
cause Watches.mu to be held. At the same time, generating an event on the target
would require Watches.mu to be held before iterating through each watch, and
then notifying the owner of each watch would cause Inotify.mu to be held.

See the vfs package comment to understand how inotify locks fit into the overall
ordering of filesystem locks.

## Watch Targets in Different Filesystem Implementations

In Linux, watches reside on inodes at the virtual filesystem layer. As a result,
hard links and file descriptions on a single file all share the same watch set.
In the sentry, there is no common inode structure across filesystem types (some
may not even have inodes), so we have to plumb inotify support through each
specific filesystem implementation. Some of the technical considerations are
outlined below.

### Tmpfs

For filesystems with inodes, like tmpfs, the design is quite similar to that of
Linux, where watches reside on the inode.

### Pseudo-filesystems

Technically, because inotify is implemented at the vfs layer in Linux,
pseudo-filesystems on top of kernfs support inotify passively. However, watches
can only track explicit filesystem operations like read/write, open/close,
mknod, etc., so watches on a target like /proc/self/fd will not generate events
every time a new fd is added or removed. As of this writing, we leave inotify
unimplemented in kernfs and anonfs; it does not seem particularly useful.

### Gofer Filesystem (fsimpl/gofer)

The gofer filesystem has several traits that make it difficult to support
inotify:

*   **There are no inodes.** A file is represented as a dentry that holds an
    unopened p9 file (and possibly an open FID), through which the Sentry
    interacts with the gofer.
    *   *Solution:* Because there is no inode structure stored in the sandbox,
        inotify watches must be held on the dentry. For the purposes of inotify,
        we assume that every dentry corresponds to a unique inode, which may
        cause unexpected behavior in the presence of hard links, where multiple
        dentries should share the same set of watches. Indeed, it is impossible
        for us to be absolutely sure whether dentries correspond to the same
        file or not, due to the following point:
*   **The Sentry cannot always be aware of hard links on the remote
    filesystem.** There is no way for us to confirm whether two files on the
    remote filesystem are actually links to the same inode. QIDs and inodes are
    not always 1:1.
    The assumption that dentries and inodes are 1:1 is inevitably broken if
    there are remote hard links that we cannot detect.
    *   *Solution:* This is an issue with gofer fs in general, not only inotify,
        and we will have to live with it.
*   **Dentries can be cached, and then evicted.** Dentry lifetime does not
    correspond to file lifetime. Because gofer fs is not entirely in-memory, the
    absence of a dentry does not mean that the corresponding file does not
    exist, nor does a dentry reaching zero references mean that the
    corresponding file no longer exists. When a dentry reaches zero references,
    it will be cached, in case the file at that path is needed again in the
    future. However, the dentry may be evicted from the cache, which will cause
    a new dentry to be created next time the same file path is used. The
    existing watches will be lost.
    *   *Solution:* When a dentry reaches zero references, do not cache it if it
        has any watches, so we can avoid eviction/destruction. Note that if the
        dentry was deleted or invalidated (d.vfsd.IsDead()), we should still
        destroy it along with its watches. Additionally, when a dentry’s last
        watch is removed, we cache it if it also has zero references. This way,
        the dentry can eventually be evicted from memory if it is no longer
        needed.
*   **Dentries can be invalidated.** Another issue with dentry lifetime is that
    the remote file at the file path represented may change from underneath the
    dentry. In this case, the next time that the dentry is used, it will be
    invalidated and a new dentry will replace it. It is then not clear what
    should be done with the watches on the old dentry.
    *   *Solution:* Silently destroy the watches when invalidation occurs. We
        have no way of knowing exactly what happened, when it happens.
        Inotify instances on NFS files in Linux probably behave in a similar
        fashion, since inotify is implemented at the vfs layer and is not aware
        of the complexities of remote file systems.
    *   An alternative would be to issue some kind of event upon invalidation,
        e.g. a delete event, but this has several issues:
        *   We cannot discern whether the remote file was invalidated because it
            was moved, deleted, etc. This information is crucial, because these
            cases should result in different events. Furthermore, the watches
            should only be destroyed if the file has been deleted.
        *   Moreover, the mechanism for detecting whether the underlying file
            has changed is to check whether a new QID is given by the gofer.
            This may result in false positives, e.g. suppose that the server
            closed and re-opened the same file, which may result in a new QID.
        *   Finally, the time of the event may be completely different from the
            time of the file modification, since a dentry is not immediately
            notified when the underlying file has changed. It would be quite
            unexpected to receive the notification when invalidation was
            triggered, i.e. the next time the file was accessed within the
            sandbox, because then the read/write/etc. operation on the file
            would not result in the expected event.
    *   Another point in favor of the first solution: inotify in Linux can
        already be lossy on local filesystems (one of the sacrifices made so
        that filesystem performance isn’t killed), and it is lossy on NFS for
        similar reasons to gofer fs. Therefore, it is better for inotify to be
        silent than to emit incorrect notifications.
*   **There may be external users of the remote filesystem.** We can only track
    operations performed on the file within the sandbox. This is sufficient
    under InteropModeExclusive, but whenever there are external users, the set
    of actions we are aware of is incomplete.
    *   *Solution:* We could either return an error or just issue a warning when
        inotify is used without InteropModeExclusive. Although the resulting
        notifications are incomplete, VFS1 allows inotify when the filesystem is
        shared, and Linux does the same for remote filesystems (as mentioned
        above, inotify sits at the vfs level).

## Dentry Interface

For events that must be generated above the vfs layer, we provide the following
DentryImpl methods to allow interactions with targets on any FilesystemImpl:

*   **InotifyWithParent()** generates events on the dentry’s watches as well as
    its parent’s.
*   **Watches()** retrieves the watch set of the target represented by the
    dentry. This is used to access and modify watches on a target.
*   **OnZeroWatches()** performs cleanup tasks after the last watch is removed
    from a dentry. This is needed by gofer fs, which must allow a watched dentry
    to be cached once it has no more watches. Most implementations can just do
    nothing. Note that OnZeroWatches() must be called after all inotify locks
    are released to preserve lock ordering, since it may acquire
    FilesystemImpl-specific locks.

## IN_EXCL_UNLINK

There are several options that can be set for a watch, specified as part of the
mask in inotify_add_watch(2). In particular, IN_EXCL_UNLINK requires some
additional support in each filesystem.

A watch with IN_EXCL_UNLINK will not generate events for its target if it
corresponds to a path that was unlinked. For instance, if an fd is opened on
“foo/bar” and “foo/bar” is subsequently unlinked, any reads/writes/etc. on the
fd will be ignored by watches on “foo” or “foo/bar” with IN_EXCL_UNLINK. This
requires each DentryImpl to keep track of whether it has been unlinked, in order
to determine whether events should be sent to watches with IN_EXCL_UNLINK.

## IN_ONESHOT

One-shot watches expire after generating a single event.
When an event occurs, all one-shot watches on the target that successfully
generated an event are removed. Lock ordering can cause the management of
one-shot watches to be quite expensive; see Watches.Notify() for more
information.
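
The interaction between the lock ordering and one-shot watches can be sketched
as follows. This is a minimal, self-contained illustration and not the sentry's
code: the type names mirror the ones used in this document, but the fields, the
method signatures (AddWatch, rmWatch, queue, Notify), and the use of plain
strings as events are simplifying assumptions. It shows why expired one-shot
watches are collected under Watches.mu and removed only after that lock is
released, since removal needs Inotify.mu, which comes earlier in the ordering.

```go
package main

import (
	"fmt"
	"sync"
)

// Inotify is a simplified stand-in for vfs.Inotify.
type Inotify struct {
	id      int
	mu      sync.Mutex // instance lock: first in the lock order
	watches map[*Watches]*Watch
	evMu    sync.Mutex // event queue lock: last in the lock order
	events  []string
}

// Watch is owned by one Inotify and applied to one target.
type Watch struct {
	owner   *Inotify
	target  *Watches
	oneShot bool
	expired bool // guarded by target.mu in this sketch
}

// Watches is the watch set of one target, keyed by owner id.
type Watches struct {
	mu sync.Mutex // watch set lock: second in the lock order
	ws map[int]*Watch
}

func NewInotify(id int) *Inotify {
	return &Inotify{id: id, watches: make(map[*Watches]*Watch)}
}

func NewWatches() *Watches { return &Watches{ws: make(map[int]*Watch)} }

// AddWatch acquires Inotify.mu, then Watches.mu.
func (i *Inotify) AddWatch(t *Watches, oneShot bool) {
	i.mu.Lock()
	defer i.mu.Unlock()
	w := &Watch{owner: i, target: t, oneShot: oneShot}
	i.watches[t] = w
	t.mu.Lock()
	t.ws[i.id] = w
	t.mu.Unlock()
}

// rmWatch acquires Inotify.mu, then Watches.mu. It does nothing if w was
// already removed or replaced.
func (i *Inotify) rmWatch(w *Watch) {
	i.mu.Lock()
	defer i.mu.Unlock()
	if i.watches[w.target] != w {
		return
	}
	delete(i.watches, w.target)
	w.target.mu.Lock()
	delete(w.target.ws, i.id)
	w.target.mu.Unlock()
}

// queue takes only Inotify.evMu, so it may be called with Watches.mu held.
func (i *Inotify) queue(ev string) {
	i.evMu.Lock()
	i.events = append(i.events, ev)
	i.evMu.Unlock()
}

// Notify acquires Watches.mu, then Inotify.evMu. Expired one-shot watches
// are only recorded here; taking Inotify.mu under Watches.mu would invert
// the lock order, so removal happens after Watches.mu is dropped.
func (t *Watches) Notify(ev string) {
	var expired []*Watch
	t.mu.Lock()
	for _, w := range t.ws {
		if w.expired {
			continue
		}
		w.owner.queue(ev)
		if w.oneShot {
			w.expired = true
			expired = append(expired, w)
		}
	}
	t.mu.Unlock()
	for _, w := range expired {
		w.owner.rmWatch(w)
	}
}

func main() {
	target := NewWatches()
	once, always := NewInotify(1), NewInotify(2)
	once.AddWatch(target, true)
	always.AddWatch(target, false)
	target.Notify("IN_MODIFY")
	target.Notify("IN_MODIFY")
	// main is single-threaded, so the fields are read without locking.
	fmt.Println(len(once.events), len(always.events), len(target.ws))
}
```

In this sketch the one-shot instance receives one event, the persistent instance
receives both, and only the persistent watch remains on the target. The second
pass over the expired watches is the extra cost that the lock ordering imposes.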