# Inotify

Inotify implements the like-named filesystem event notification system for the
sentry; see `inotify(7)`.

## Architecture

For the most part, the sentry implementation of inotify mirrors the Linux
architecture. Inotify instances (i.e. the fds returned by inotify_init(2)) are
backed by a pseudo-filesystem. Events are generated from various places in the
sentry, including the [syscall layer][syscall_dir], the [VFS layer][dirent], and
the [process fd table][fd_table]. Watches are stored in inodes, and generated
events are queued to the inotify instance owning the watches for delivery to the
user.

## Objects

Here is a brief description of the existing and new objects involved in the
sentry inotify mechanism, and how they interact:

### [`fs.Inotify`][inotify]

-   An inotify instance, created by inotify_init(2)/inotify_init1(2).
-   The inotify fd has an `fs.Dirent` and supports filesystem syscalls to read
    events.
-   Has multiple `fs.Watch`es, with at most one watch per target inode, per
    inotify instance.
-   Has an instance `id` which is globally unique. This is *not* the fd number
    for this instance, since the fd can be duped. This `id` is not externally
    visible.

### [`fs.Watch`][watch]

-   An inotify watch, created/deleted by
    inotify_add_watch(2)/inotify_rm_watch(2).
-   Owned by an `fs.Inotify` instance; each watch keeps a pointer to the
    `owner`.
-   Associated with a single `fs.Inode`, which is the watch `target`. While the
    watch is active, it indirectly pins `target` in memory. See the "Reference
    Model" section for a detailed explanation.
-   Filesystem operations on `target` generate `fs.Event`s.

### [`fs.Event`][event]

-   A simple struct encapsulating all the fields for an inotify event.
-   Generated by `fs.Watch`es and forwarded to the watches' `owner`s.
-   Serialized to the user during read(2) syscalls on the associated
    `fs.Inotify`'s fd.

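To make the serialization step concrete, here is a minimal sketch of how an
event could be laid out in the `struct inotify_event` format described in
`inotify(7)`: four 32-bit fields (`wd`, `mask`, `cookie`, `len`) followed by an
optional NUL-padded name. The `event` type, its field names, and the helpers
are hypothetical and not the sentry's own. The little-endian byte order and the
rule of padding the name to a multiple of the header size are assumptions of
this sketch.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// headerSize is the size of the fixed part of struct inotify_event:
// wd, mask, cookie and len, four bytes each.
const headerSize = 16

// event is a hypothetical stand-in for fs.Event; the field names are
// illustrative, not the sentry's.
type event struct {
	wd     int32  // watch descriptor
	mask   uint32 // event bits
	cookie uint32 // links related events, such as the two halves of a rename
	name   string // optional name of the subject node
}

// paddedNameLen returns the size of the name field: the name plus a NUL
// terminator, rounded up to a multiple of headerSize (an assumed padding
// rule). Events without a name carry no name field.
func paddedNameLen(name string) int {
	if name == "" {
		return 0
	}
	return (len(name) + 1 + headerSize - 1) / headerSize * headerSize
}

// marshal lays the event out as a read(2) caller would expect to see it.
// Little-endian byte order is assumed here.
func (e event) marshal() []byte {
	nameLen := paddedNameLen(e.name)
	buf := make([]byte, headerSize+nameLen)
	binary.LittleEndian.PutUint32(buf[0:], uint32(e.wd))
	binary.LittleEndian.PutUint32(buf[4:], e.mask)
	binary.LittleEndian.PutUint32(buf[8:], e.cookie)
	binary.LittleEndian.PutUint32(buf[12:], uint32(nameLen))
	copy(buf[headerSize:], e.name) // the remaining bytes stay zero (NUL padding)
	return buf
}

func main() {
	// 0x2 is IN_MODIFY on Linux.
	b := event{wd: 1, mask: 0x2, name: "a.txt"}.marshal()
	fmt.Println(len(b)) // prints 32: 16-byte header plus 16-byte padded name
}
```

Because `len` is part of each record, a reader can walk a buffer containing
several events by advancing `headerSize + len` bytes at a time.
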
### [`fs.Dirent`][dirent]

-   Many inotify events are generated inside dirent methods. Events are
    generated in the dirent methods rather than `fs.Inode` methods because some
    events carry the name of the subject node, and node names are generally
    unavailable in an `fs.Inode`.
-   Dirents do not directly contain state for any watches. Instead, they forward
    notifications to the underlying `fs.Inode`.

### [`fs.Inode`][inode]

-   Interacts with inotify through `fs.Watch`es.
-   Inodes contain a map of all active `fs.Watch`es on them.
-   An `fs.Inotify` instance can have at most one `fs.Watch` per inode.
    `fs.Watch`es on an inode are indexed by their `owner`'s `id`.
-   All inotify logic is encapsulated in the [`Watches`][inode_watches] struct
    in an inode. Logically, `Watches` is the set of inotify watches on the
    inode.

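The indexing scheme described above can be sketched as a map keyed by the
owning instance's `id`. This is an illustration under assumptions, not the
sentry's code: the lowercase `inotify`, `watch`, and `watches` types and the
`add` and `interested` methods are hypothetical names. Replacing the mask of
an existing watch follows the default behavior documented in `inotify(7)` for
inotify_add_watch(2) without `IN_MASK_ADD`.

```go
package main

import (
	"fmt"
	"sync"
)

// All types here are simplified, hypothetical stand-ins for the sentry's.

type inotify struct {
	id uint64 // globally unique instance id, distinct from any fd number
}

type watch struct {
	owner *inotify
	mask  uint32 // events the owner is interested in
}

// watches is the set of watches on one inode, keyed by the owner's id.
type watches struct {
	mu sync.RWMutex
	ws map[uint64]*watch
}

// add returns the watch owned by owner on this inode, creating it if needed.
// An existing watch has its mask replaced instead of a second watch being
// created, which keeps the "one watch per inode per instance" invariant.
func (w *watches) add(owner *inotify, mask uint32) *watch {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.ws == nil {
		w.ws = make(map[uint64]*watch)
	}
	if existing, ok := w.ws[owner.id]; ok {
		existing.mask = mask
		return existing
	}
	nw := &watch{owner: owner, mask: mask}
	w.ws[owner.id] = nw
	return nw
}

// interested returns the ids of the instances whose watch mask matches ev.
func (w *watches) interested(ev uint32) []uint64 {
	w.mu.RLock()
	defer w.mu.RUnlock()
	var ids []uint64
	for id, wt := range w.ws {
		if wt.mask&ev != 0 {
			ids = append(ids, id)
		}
	}
	return ids
}

func main() {
	var inodeWatches watches
	a, b := &inotify{id: 1}, &inotify{id: 2}
	first := inodeWatches.add(a, 0x2)
	second := inodeWatches.add(a, 0x4) // same instance: same watch, new mask
	inodeWatches.add(b, 0x2)
	// prints: true 2 [2]
	fmt.Println(first == second, len(inodeWatches.ws), inodeWatches.interested(0x2))
}
```

Keying by `id` rather than by fd number means that duplicated fds for the same
instance all resolve to the same watch.
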
## Reference Model

The sentry inotify implementation has a complex reference model. An inotify
watch observes a single inode. For efficient lookup, the state for a watch is
stored directly on the target inode. This state needs to persist for the
lifetime of the watch. Unlike usual filesystem metadata, the watch state has no
"on-disk" representation, so it cannot be reconstructed by the filesystem if
the inode is flushed from memory. This effectively means we need to keep any
inodes with active watches pinned in memory.

We can't just hold an extra ref on the inode to pin it in memory because some
filesystems (such as gofer-based filesystems) don't have persistent inodes. In
such a filesystem, if we just pin the inode, nothing prevents the enclosing
dirent from being GCed. Once the dirent is GCed, the pinned inode is
unreachable, because these filesystems generate a new inode by re-reading the
node state on the next walk. Incidentally, hardlinks also don't work on these
filesystems for this reason.

To prevent the above scenario, when a new watch is added on an inode, we *pin*
the dirent we used to reach the inode. Note that due to hardlinks, this dirent
may not be the only dirent pointing to the inode. Attempting to set an inotify
watch via multiple hardlinks to the same file results in the same watch being
returned for both links. However, for each new dirent we use to reach the same
inode, we add a new pin. We need a new pin for each new dirent used to reach the
inode because we have no guarantees about the deletion order of the different
links to the inode.

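The pinning rule can be illustrated with a small sketch: one watch, a set of
pinned dirents, and one reference taken per distinct dirent. The `dirent` and
`watch` types and the `pin` and `unpin` methods are hypothetical
simplifications. In particular, the plain integer reference count is an
assumption made for brevity and is only safe here because every access happens
under the watch lock.

```go
package main

import (
	"fmt"
	"sync"
)

// dirent is a hypothetical stand-in for fs.Dirent with a bare reference
// count. A real reference count would need its own synchronization.
type dirent struct {
	name string
	refs int
}

// watch tracks which dirents it has pinned; mu protects pins.
type watch struct {
	mu   sync.Mutex
	pins map[*dirent]bool
}

// pin takes one reference on d the first time the watch is reached through
// it. Reaching the watch again through the same dirent adds nothing.
func (w *watch) pin(d *dirent) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.pins == nil {
		w.pins = make(map[*dirent]bool)
	}
	if !w.pins[d] {
		w.pins[d] = true
		d.refs++
	}
}

// unpin drops the reference taken by pin, for example when the link that d
// represents is deleted.
func (w *watch) unpin(d *dirent) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.pins[d] {
		delete(w.pins, d)
		d.refs--
	}
}

func main() {
	// Two hardlinks to the same inode, watched by a single watch.
	linkA := &dirent{name: "a"}
	linkB := &dirent{name: "b"}
	var w watch

	w.pin(linkA)
	w.pin(linkA) // same dirent again: no extra pin
	w.pin(linkB) // new dirent for the same inode: new pin
	fmt.Println(linkA.refs, linkB.refs, len(w.pins)) // prints: 1 1 2

	// Deleting one link releases only that link's pin, whichever goes first.
	w.unpin(linkA)
	fmt.Println(linkA.refs, linkB.refs, len(w.pins)) // prints: 0 1 1
}
```

Since each link holds its own pin, the inode stays reachable through the
remaining link regardless of the order in which the links are removed.
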
## Lock Ordering

There are four locks related to the inotify implementation:

-   `Inotify.mu`: the inotify instance lock.
-   `Inotify.evMu`: the inotify event queue lock.
-   `Watch.mu`: the watch lock, used to protect pins.
-   `fs.Watches.mu`: the inode watch set lock, used to protect the collection
    of watches on the inode.

The correct lock ordering for inotify code is:

`Inotify.mu` -> `fs.Watches.mu` -> `Watch.mu` -> `Inotify.evMu`.

We need a distinct lock for the event queue because by the time a goroutine
attempts to queue a new event, it is already holding `fs.Watches.mu`. If we used
`Inotify.mu` to also protect the event queue, this would violate the above lock
ordering.

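The following sketch shows why the separate event queue lock matters. It
models two paths: adding a watch, which takes `Inotify.mu` and then the watch
set lock, and notification, which takes the watch set lock and then
`Inotify.evMu`. All types and method names (`addWatch`, `queueEvent`,
`notify`) are hypothetical. The watch set is keyed by instance pointer instead
of by `id` to keep the example short, and `Watch.mu` appears only as a field
since pins are not modeled.

```go
package main

import (
	"fmt"
	"sync"
)

// Simplified, hypothetical stand-ins for fs.Inotify, fs.Watch and fs.Watches.

type inotify struct {
	mu     sync.Mutex // instance lock; first in the order
	evMu   sync.Mutex // event queue lock; last in the order
	events []string
}

type watch struct {
	mu    sync.Mutex // would protect pins, which this sketch omits
	owner *inotify
}

type watches struct {
	mu sync.RWMutex // protects ws
	ws map[*inotify]*watch
}

// addWatch follows Inotify.mu -> Watches.mu.
func (i *inotify) addWatch(target *watches) {
	i.mu.Lock()
	defer i.mu.Unlock()
	target.mu.Lock()
	defer target.mu.Unlock()
	if target.ws == nil {
		target.ws = make(map[*inotify]*watch)
	}
	if _, ok := target.ws[i]; !ok {
		target.ws[i] = &watch{owner: i}
	}
}

// queueEvent takes only evMu, so it is safe to call with Watches.mu held.
func (i *inotify) queueEvent(ev string) {
	i.evMu.Lock()
	defer i.evMu.Unlock()
	i.events = append(i.events, ev)
}

// notify follows Watches.mu -> Inotify.evMu. Had queueEvent taken Inotify.mu
// instead, this path would acquire Watches.mu -> Inotify.mu, the reverse of
// addWatch, and the two paths could deadlock.
func (w *watches) notify(ev string) {
	w.mu.RLock()
	defer w.mu.RUnlock()
	for _, wt := range w.ws {
		wt.owner.queueEvent(ev)
	}
}

func main() {
	var inst inotify
	var target watches
	inst.addWatch(&target)

	// Run both paths concurrently; both respect the ordering above.
	var wg sync.WaitGroup
	for n := 0; n < 100; n++ {
		wg.Add(2)
		go func() { defer wg.Done(); inst.addWatch(&target) }()
		go func() { defer wg.Done(); target.notify("IN_MODIFY") }()
	}
	wg.Wait()

	inst.evMu.Lock()
	fmt.Println(len(inst.events)) // prints 100: one event per notify call
	inst.evMu.Unlock()
}
```

Giving the queue its own lock at the end of the order lets the notification
path append events without ever needing the instance lock.
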
[dirent]: https://github.com/google/gvisor/blob/master/pkg/sentry/fs/dirent.go
[event]: https://github.com/google/gvisor/blob/master/pkg/sentry/fs/inotify_event.go
[fd_table]: https://github.com/google/gvisor/blob/master/pkg/sentry/kernel/fd_table.go
[inode]: https://github.com/google/gvisor/blob/master/pkg/sentry/fs/inode.go
[inode_watches]: https://github.com/google/gvisor/blob/master/pkg/sentry/fs/inode_inotify.go
[inotify]: https://github.com/google/gvisor/blob/master/pkg/sentry/fs/inotify.go
[syscall_dir]: https://github.com/google/gvisor/blob/master/pkg/sentry/syscalls/linux/
[watch]: https://github.com/google/gvisor/blob/master/pkg/sentry/fs/inotify_watch.go