# Inotify

Inotify implements the like-named filesystem event notification system for the
sentry, see `inotify(7)`.

## Architecture

For the most part, the sentry implementation of inotify mirrors the Linux
architecture. Inotify instances (i.e. the fd returned by inotify_init(2)) are
backed by a pseudo-filesystem. Events are generated from various places in the
sentry, including the [syscall layer][syscall_dir], the [vfs layer][dirent] and
the [process fd table][fd_table]. Watches are stored in inodes, and generated
events are queued to the inotify instance owning the watches for delivery to
the user.

## Objects

Here is a brief description of the objects involved in the sentry inotify
mechanism, and how they interact:

### [`fs.Inotify`][inotify]

-   An inotify instance, created by inotify_init(2)/inotify_init1(2).
-   The inotify fd has an `fs.Dirent` and supports filesystem syscalls to read
    events.
-   Has multiple `fs.Watch`es, with at most one watch per target inode per
    inotify instance.
-   Has an instance `id` which is globally unique. This is *not* the fd number
    for this instance, since the fd can be duped. This `id` is not externally
    visible.

### [`fs.Watch`][watch]

-   An inotify watch, created/deleted by
    inotify_add_watch(2)/inotify_rm_watch(2).
-   Owned by an `fs.Inotify` instance; each watch keeps a pointer to its
    `owner`.
-   Associated with a single `fs.Inode`, which is the watch `target`. While
    the watch is active, it indirectly pins `target` to memory. See the
    "Reference Model" section for a detailed explanation.
-   Filesystem operations on `target` generate `fs.Event`s.

### [`fs.Event`][event]

-   A simple struct encapsulating all the fields for an inotify event.
-   Generated by `fs.Watch`es and forwarded to the watches' `owner`s.
-   Serialized to the user during read(2) syscalls on the associated
    `fs.Inotify`'s fd.

### [`fs.Dirent`][dirent]

-   Many inotify events are generated inside dirent methods. Events are
    generated in the dirent methods rather than `fs.Inode` methods because
    some events carry the name of the subject node, and node names are
    generally unavailable in an `fs.Inode`.
-   Dirents do not directly contain state for any watches. Instead, they
    forward notifications to the underlying `fs.Inode`.

### [`fs.Inode`][inode]

-   Interacts with inotify through `fs.Watch`es.
-   Inodes contain a map of all active `fs.Watch`es on them.
-   An `fs.Inotify` instance can have at most one `fs.Watch` per inode.
    `fs.Watch`es on an inode are indexed by their `owner`'s `id`.
-   All inotify logic is encapsulated in the [`Watches`][inode_watches]
    struct in an inode. Logically, `Watches` is the set of inotify watches on
    the inode.

## Reference Model

The sentry inotify implementation has a complex reference model. An inotify
watch observes a single inode. For efficient lookup, the state for a watch is
stored directly on the target inode. This state needs to persist for the
lifetime of the watch. Unlike usual filesystem metadata, the watch state has
no "on-disk" representation, so it cannot be reconstructed by the filesystem
if the inode is flushed from memory. This effectively means we need to keep
any inode with active watches pinned to memory.

We can't just hold an extra ref on the inode to pin it to memory because some
filesystems (such as gofer-based filesystems) don't have persistent inodes.
In such a filesystem, if we just pin the inode, nothing prevents the
enclosing dirent from being GCed.
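The "at most one watch per inotify instance per inode" rule above can be made
concrete with a small sketch. This is an illustrative Go model, not the actual
gVisor code: the `Watches` and `Watch` types and their fields here are
simplified assumptions, standing in for the real `fs.Watches` set indexed by
the owner's globally unique `id`.

```go
package main

import (
	"fmt"
	"sync"
)

// Watch is a simplified stand-in for the sentry's fs.Watch: it records the
// id of the owning inotify instance and the event mask. (Illustrative only.)
type Watch struct {
	ownerID uint64
	mask    uint32
}

// Watches sketches the per-inode watch set: at most one watch per inotify
// instance, keyed by the owner's globally unique id (never by fd number,
// which can be duped).
type Watches struct {
	mu sync.Mutex
	ws map[uint64]*Watch
}

// Add registers a watch for the given owner. If that owner already has a
// watch on this inode, the existing watch is updated and returned, mirroring
// inotify_add_watch(2) semantics of updating rather than duplicating.
func (w *Watches) Add(ownerID uint64, mask uint32) *Watch {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.ws == nil {
		w.ws = make(map[uint64]*Watch)
	}
	if existing, ok := w.ws[ownerID]; ok {
		existing.mask = mask // Update the existing watch in place.
		return existing
	}
	watch := &Watch{ownerID: ownerID, mask: mask}
	w.ws[ownerID] = watch
	return watch
}

func main() {
	var watches Watches
	a := watches.Add(1, 0x100)
	b := watches.Add(1, 0x200) // Same owner: same watch, mask updated.
	c := watches.Add(2, 0x100) // Different owner: a distinct watch.
	fmt.Println(a == b, a == c, a.mask)
}
```

Keying the set by instance `id` rather than fd is what makes the invariant
hold even when the inotify fd has been dup(2)ed.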
Once the dirent is GCed, the pinned inode is unreachable -- these filesystems
generate a new inode by re-reading the node state on the next walk.
Incidentally, hardlinks also don't work on these filesystems for this reason.

To prevent the above scenario, when a new watch is added on an inode, we *pin*
the dirent we used to reach the inode. Note that due to hardlinks, this dirent
may not be the only dirent pointing to the inode. Attempting to set an inotify
watch via multiple hardlinks to the same file results in the same watch being
returned for both links. However, for each new dirent we use to reach the
same inode, we add a new pin. We need a new pin for each new dirent used to
reach the inode because we have no guarantees about the deletion order of the
different links to the inode.

## Lock Ordering

There are four locks related to the inotify implementation:

-   `Inotify.mu`: the inotify instance lock.
-   `Inotify.evMu`: the inotify event queue lock.
-   `Watch.mu`: the watch lock, used to protect pins.
-   `fs.Watches.mu`: the inode watch set lock, used to protect the collection
    of watches on an inode.

The correct lock ordering for inotify code is:

`Inotify.mu` -> `fs.Watches.mu` -> `Watch.mu` -> `Inotify.evMu`.

We need a distinct lock for the event queue because by the time a goroutine
attempts to queue a new event, it is already holding `fs.Watches.mu`. If we
used `Inotify.mu` to also protect the event queue, this would violate the
above lock ordering.
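The reason `evMu` exists can be sketched in a few lines of Go. This is a
simplified model under assumed type definitions, not the real gVisor
implementation: the point is only that the event-queueing path runs while
`fs.Watches.mu` is already held, so it must take only the queue lock, never
the instance lock.

```go
package main

import (
	"fmt"
	"sync"
)

// Inotify is a cut-down stand-in for fs.Inotify. evMu protects only the
// event queue, so it can be acquired last, while fs.Watches.mu is held.
// (Illustrative definitions, not the actual gVisor types.)
type Inotify struct {
	mu     sync.Mutex // Instance lock; ordered first, never taken below.
	evMu   sync.Mutex // Event queue lock; ordered last.
	events []string
}

// queueEvent appends an event under evMu only. Taking Inotify.mu here
// instead would invert the documented order
// Inotify.mu -> fs.Watches.mu -> Watch.mu -> Inotify.evMu.
func (i *Inotify) queueEvent(ev string) {
	i.evMu.Lock()
	defer i.evMu.Unlock()
	i.events = append(i.events, ev)
}

// Watches is a cut-down per-inode watch set; mu plays the role of
// fs.Watches.mu in the ordering above.
type Watches struct {
	mu     sync.Mutex
	owners []*Inotify
}

// Notify delivers an event to every watching instance. fs.Watches.mu is
// held across the loop, so the queueing path may only ever take evMu.
func (w *Watches) Notify(ev string) {
	w.mu.Lock()
	defer w.mu.Unlock()
	for _, owner := range w.owners {
		owner.queueEvent(ev) // fs.Watches.mu -> Inotify.evMu: consistent.
	}
}

func main() {
	owner := &Inotify{}
	watches := &Watches{owners: []*Inotify{owner}}
	watches.Notify("IN_CREATE: foo")
	fmt.Println(owner.events)
}
```

With a single instance lock, `Notify` would acquire `Inotify.mu` while
holding `fs.Watches.mu`, which is exactly the inversion the split lock
avoids.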
[dirent]: https://github.com/google/gvisor/blob/master/pkg/sentry/fs/dirent.go
[event]: https://github.com/google/gvisor/blob/master/pkg/sentry/fs/inotify_event.go
[fd_table]: https://github.com/google/gvisor/blob/master/pkg/sentry/kernel/fd_table.go
[inode]: https://github.com/google/gvisor/blob/master/pkg/sentry/fs/inode.go
[inode_watches]: https://github.com/google/gvisor/blob/master/pkg/sentry/fs/inode_inotify.go
[inotify]: https://github.com/google/gvisor/blob/master/pkg/sentry/fs/inotify.go
[syscall_dir]: https://github.com/google/gvisor/blob/master/pkg/sentry/syscalls/linux/
[watch]: https://github.com/google/gvisor/blob/master/pkg/sentry/fs/inotify_watch.go