gvisor.dev/gvisor@v0.0.0-20240520182842-f9d4d51c7e0f/pkg/sentry/vfs/g3doc/inotify.md

     1  # Inotify
     2  
     3  Inotify is a mechanism for monitoring filesystem events in Linux; see
     4  inotify(7). An inotify instance can be used to monitor files and directories for
     5  modifications, creation/deletion, etc. The inotify API consists of system calls
     6  that create inotify instances (inotify_init/inotify_init1) and add/remove
     7  watches on files to an instance (inotify_add_watch/inotify_rm_watch). Events are
     8  generated from various places in the sentry, including the syscall layer, the
     9  vfs layer, the process fd table, and within each filesystem implementation. This
    10  document outlines the implementation details of inotify.
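
For orientation, here is a minimal user-space sketch of the API described above, written against Go's standard `syscall` package on Linux. It is not sentry code. The helper name `firstCreateEvent`, the temporary directory, and the little-endian decoding of `struct inotify_event` are assumptions made for illustration.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// firstCreateEvent (hypothetical helper) watches a fresh temporary directory
// for IN_CREATE, creates one file in it, and returns the name and mask of the
// first event read from the inotify fd. Linux only.
func firstCreateEvent(name string) (string, uint32, error) {
	dir, err := os.MkdirTemp("", "inotify-demo")
	if err != nil {
		return "", 0, err
	}
	defer os.RemoveAll(dir)

	// inotify_init1(2): create the inotify instance.
	fd, err := syscall.InotifyInit1(syscall.IN_CLOEXEC)
	if err != nil {
		return "", 0, err
	}
	defer syscall.Close(fd)

	// inotify_add_watch(2): watch the directory for file creation.
	wd, err := syscall.InotifyAddWatch(fd, dir, syscall.IN_CREATE)
	if err != nil {
		return "", 0, err
	}
	// inotify_rm_watch(2): drop the watch before the fd is closed.
	defer syscall.InotifyRmWatch(fd, uint32(wd))

	f, err := os.Create(filepath.Join(dir, name))
	if err != nil {
		return "", 0, err
	}
	f.Close()

	// read(2) returns one or more struct inotify_event records:
	// wd int32, mask uint32, cookie uint32, len uint32, then len name bytes.
	buf := make([]byte, 4096)
	n, err := syscall.Read(fd, buf)
	if err != nil {
		return "", 0, err
	}
	if n < syscall.SizeofInotifyEvent {
		return "", 0, errors.New("short inotify read")
	}
	// Assumption: little-endian host (e.g. amd64 or arm64).
	mask := binary.LittleEndian.Uint32(buf[4:8])
	nameLen := int(binary.LittleEndian.Uint32(buf[12:16]))
	end := syscall.SizeofInotifyEvent + nameLen
	if end > n {
		return "", 0, errors.New("truncated inotify event")
	}
	// The name is NUL-padded; trim the padding.
	raw := bytes.TrimRight(buf[syscall.SizeofInotifyEvent:end], "\x00")
	return string(raw), mask, nil
}

func main() {
	name, mask, err := firstCreateEvent("hello.txt")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("name=%s create=%v\n", name, mask&syscall.IN_CREATE != 0)
}
```

The file is created before the read, so the event is already queued and the read does not block indefinitely.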
    11  
    12  ## Inotify Objects
    13  
    14  Inotify data structures are implemented in the vfs package.
    15  
    16  ### vfs.Inotify
    17  
    18  Inotify instances are represented by vfs.Inotify objects, which implement
    19  vfs.FileDescriptionImpl. As in Linux, inotify fds are backed by a
    20  pseudo-filesystem (anonfs). Each inotify instance receives events from a set of
    21  vfs.Watch objects, which can be modified with inotify_add_watch(2) and
    22  inotify_rm_watch(2). An application can retrieve events by reading the inotify
    23  fd.
    24  
    25  ### vfs.Watches
    26  
    27  The set of all watches held on a single file (i.e., the watch target) is stored
    28  in vfs.Watches. Each watch will belong to a different inotify instance (an
    29  instance can only have one watch on any watch target). The watches are stored in
    30  a map indexed by their vfs.Inotify owner’s id. Hard links and file descriptions
    31  to a single file will all share the same vfs.Watches (with the exception of the
    32  gofer filesystem, described in a later section). Activity on the target causes
    33  its vfs.Watches to generate notifications on its watches’ inotify instances.
    34  
    35  ### vfs.Watch
    36  
    37  A vfs.Watch is a single watch, owned by one inotify instance and applied to one
    38  watch target. Both the vfs.Inotify owner and the vfs.Watches on the target hold
    39  the vfs.Watch, which leads to some complicated locking behavior (see Lock
    40  Ordering). Whenever a watch is notified of an event on its target, it queues
    41  events to its inotify instance for delivery to the user.
    42  
    43  ### vfs.Event
    44  
    45  vfs.Event is a simple struct encapsulating all the fields for an inotify event.
    46  It is generated by vfs.Watches and forwarded to the watches' owners. It is
    47  serialized to the user during read(2) syscalls on the associated vfs.Inotify's
    48  fd.
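
The relationships between these four objects can be sketched with a simplified, single-threaded model. The types below reuse the names from this section but are illustrative stand-ins, not the real vfs definitions. The fields, `NewInotify`, and the method signatures are assumptions, and locking is omitted entirely.

```go
package main

import "fmt"

// Event is a simplified stand-in for vfs.Event.
type Event struct {
	WD   int32  // watch descriptor of the watch that matched
	Mask uint32 // event bits that matched
	Name string // name of the affected entry, if any
}

// Inotify is a simplified stand-in for vfs.Inotify (no fd, no locks).
type Inotify struct {
	id      uint64
	nextWD  int32
	watches map[int32]*Watch // watches owned by this instance, by descriptor
	events  []Event          // queue that read(2) would drain
}

// Watch ties exactly one owner to one target.
type Watch struct {
	owner *Inotify
	wd    int32
	mask  uint32
}

// Watches is the set of watches on one target, keyed by owner id, so an
// instance can hold at most one watch per target.
type Watches struct {
	ws map[uint64]*Watch
}

// NewInotify is a hypothetical constructor for this sketch.
func NewInotify(id uint64) *Inotify {
	return &Inotify{id: id, watches: make(map[int32]*Watch)}
}

// AddWatch registers the watch with both the owner and the target. If the
// owner already watches the target, the existing watch is updated instead
// (simplification: the mask is always replaced).
func (i *Inotify) AddWatch(target *Watches, mask uint32) int32 {
	if w, ok := target.ws[i.id]; ok {
		w.mask = mask
		return w.wd
	}
	i.nextWD++
	w := &Watch{owner: i, wd: i.nextWD, mask: mask}
	if target.ws == nil {
		target.ws = make(map[uint64]*Watch)
	}
	target.ws[i.id] = w
	i.watches[w.wd] = w
	return w.wd
}

// Notify forwards an event on the target to every interested owner's queue.
func (t *Watches) Notify(name string, events uint32) {
	for _, w := range t.ws {
		if matched := w.mask & events; matched != 0 {
			w.owner.events = append(w.owner.events,
				Event{WD: w.wd, Mask: matched, Name: name})
		}
	}
}

func main() {
	a, b := NewInotify(1), NewInotify(2)
	var file Watches
	a.AddWatch(&file, 0x2)     // IN_MODIFY
	b.AddWatch(&file, 0x1|0x2) // IN_ACCESS | IN_MODIFY
	file.Notify("", 0x2)
	fmt.Println(len(a.events), len(b.events))
}
```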
    49  
    50  ## Lock Ordering
    51  
    52  There are three locks related to the inotify implementation:
    53  
    54  *   Inotify.mu: the inotify instance lock.
    55  *   Inotify.evMu: the inotify event queue lock.
    56  *   Watches.mu: the watch set lock, used to protect the collection of watches on a target.
    57  
    58  The correct lock ordering for inotify code is:
    59  
    60  Inotify.mu -> Watches.mu -> Inotify.evMu.
    61  
    62  Note that we use a distinct lock to protect the inotify event queue. If we
    63  simply used Inotify.mu, we could simultaneously have locks being acquired in the
    64  order of Inotify.mu -> Watches.mu and Watches.mu -> Inotify.mu, which would
    65  cause deadlocks. For instance, adding a watch to an inotify instance would
    66  require locking Inotify.mu, and then adding the same watch to the target would
    67  cause Watches.mu to be held. At the same time, generating an event on the target
    68  would require Watches.mu to be held before iterating through each watch, and
    69  then notifying the owner of each watch would cause Inotify.mu to be held.
    70  
    71  See the vfs package comment to understand how inotify locks fit into the overall
    72  ordering of filesystem locks.
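
A minimal sketch of why the separate queue lock avoids the inversion, assuming simplified types. The field layout and the `stress` helper are invented for illustration and do not mirror the real vfs structs. Adding a watch takes Inotify.mu and then Watches.mu, while notification takes Watches.mu and then Inotify.evMu, so no path acquires Inotify.mu while holding Watches.mu.

```go
package main

import (
	"fmt"
	"sync"
)

// Inotify is a simplified instance with the two locks described above.
type Inotify struct {
	mu      sync.Mutex // instance lock: protects watches
	watches map[*Watches]struct{}

	evMu   sync.Mutex // event queue lock: protects events
	events []uint32
}

// Watches is a simplified watch set for one target.
type Watches struct {
	mu     sync.Mutex          // watch set lock: protects owners
	owners map[*Inotify]uint32 // owner -> mask
}

// AddWatch acquires Inotify.mu, then Watches.mu.
func (i *Inotify) AddWatch(t *Watches, mask uint32) {
	i.mu.Lock()
	defer i.mu.Unlock()
	t.mu.Lock()
	defer t.mu.Unlock()
	i.watches[t] = struct{}{}
	t.owners[i] = mask
}

// Notify acquires Watches.mu, then each owner's Inotify.evMu. It never takes
// Inotify.mu, so it cannot invert the order used by AddWatch.
func (t *Watches) Notify(events uint32) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for owner, mask := range t.owners {
		if matched := mask & events; matched != 0 {
			owner.evMu.Lock()
			owner.events = append(owner.events, matched)
			owner.evMu.Unlock()
		}
	}
}

// stress (hypothetical helper) runs n AddWatch calls and n Notify calls
// concurrently and returns the number of queued events.
func stress(n int) int {
	inst := &Inotify{watches: map[*Watches]struct{}{}}
	target := &Watches{owners: map[*Inotify]uint32{}}
	inst.AddWatch(target, 0x2) // the watch exists before any Notify runs

	var wg sync.WaitGroup
	for k := 0; k < n; k++ {
		wg.Add(2)
		go func() { defer wg.Done(); inst.AddWatch(target, 0x2) }()
		go func() { defer wg.Done(); target.Notify(0x2) }()
	}
	wg.Wait()

	inst.evMu.Lock()
	defer inst.evMu.Unlock()
	return len(inst.events)
}

func main() {
	fmt.Println("queued events:", stress(100))
}
```

If Notify appended to the queue under Inotify.mu instead of Inotify.evMu, the two goroutine bodies above would acquire the same pair of locks in opposite orders and could deadlock.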
    73  
    74  ## Watch Targets in Different Filesystem Implementations
    75  
    76  In Linux, watches reside on inodes at the virtual filesystem layer. As a result,
    77  all hard links and file descriptions on a single file share the same watch set.
    78  In gVisor, there is no common inode structure across filesystem types (some may
    79  not even have inodes), so we have to plumb inotify support through each specific
    80  filesystem implementation. Some of the technical considerations are outlined
    81  below.
    82  
    83  ### Tmpfs
    84  
    85  For filesystems with inodes, like tmpfs, the design is quite similar to that of
    86  Linux, where watches reside on the inode.
    87  
    88  ### Pseudo-filesystems
    89  
    90  Technically, because inotify is implemented at the vfs layer in Linux,
    91  pseudo-filesystems on top of kernfs support inotify passively. However, watches
    92  can only track explicit filesystem operations like read/write, open/close,
    93  mknod, etc., so watches on a target like /proc/self/fd will not generate events
    94  every time a new fd is added or removed. As of this writing, we leave inotify
    95  unimplemented in kernfs and anonfs; it does not seem particularly useful.
    96  
    97  ### Gofer Filesystem (fsimpl/gofer)
    98  
    99  The gofer filesystem has several traits that make it difficult to support
   100  inotify:
   101  
   102  *   **There are no inodes.** A file is represented as a dentry that holds an
   103      unopened p9 file (and possibly an open FID), through which the Sentry
   104      interacts with the gofer.
   105      *   *Solution:* Because there is no inode structure stored in the sandbox,
   106          inotify watches must be held on the dentry. For the purposes of inotify,
   107          we assume that every dentry corresponds to a unique inode, which may
   108          cause unexpected behavior in the presence of hard links, where multiple
   109          dentries should share the same set of watches. Indeed, it is impossible
   110          for us to be absolutely sure whether dentries correspond to the same
   111          file or not, due to the following point:
   112  *   **The Sentry cannot always be aware of hard links on the remote
   113      filesystem.** There is no way for us to confirm whether two files on the
   114      remote filesystem are actually links to the same inode. QIDs and inodes are
   115      not always 1:1. The assumption that dentries and inodes are 1:1 is
   116      inevitably broken if there are remote hard links that we cannot detect.
   117      *   *Solution:* this is an issue with gofer fs in general, not only inotify,
   118          and we will have to live with it.
   119  *   **Dentries can be cached, and then evicted.** Dentry lifetime does not
   120      correspond to file lifetime. Because gofer fs is not entirely in-memory, the
   121      absence of a dentry does not mean that the corresponding file does not
   122      exist, nor does a dentry reaching zero references mean that the
   123      corresponding file no longer exists. When a dentry reaches zero references,
   124      it will be cached, in case the file at that path is needed again in the
   125      future. However, the dentry may be evicted from the cache, which will cause
   126      a new dentry to be created next time the same file path is used. The
   127      existing watches will be lost.
   128      *   *Solution:* When a dentry reaches zero references, do not cache it if it
   129          has any watches, so we can avoid eviction/destruction. Note that if the
   130          dentry was deleted or invalidated (d.vfsd.IsDead()), we should still
   131          destroy it along with its watches. Additionally, when a dentry’s last
   132          watch is removed, we cache it if it also has zero references. This way,
   133          the dentry can eventually be evicted from memory if it is no longer
   134          needed.
   135  *   **Dentries can be invalidated.** Another issue with dentry lifetime is that
   136      the remote file at the file path represented may change from underneath the
   137      dentry. In this case, the next time that the dentry is used, it will be
   138      invalidated and a new dentry will replace it. It is not clear what should
   139      be done with the watches on the old dentry in that situation.
   140      *   *Solution:* Silently destroy the watches when invalidation occurs. We
   141          have no way of knowing exactly what happened or when it happened. Inotify
   142          instances on NFS files in Linux probably behave in a similar fashion,
   143          since inotify is implemented at the vfs layer and is not aware of the
   144          complexities of remote file systems.
   145      *   An alternative would be to issue some kind of event upon invalidation,
   146          e.g. a delete event, but this has several issues:
   147          *   We cannot discern whether the remote file was invalidated because it was
   148              moved, deleted, etc. This information is crucial, because these cases
   149              should result in different events. Furthermore, the watches should only
   150              be destroyed if the file has been deleted.
   151          *   Moreover, the mechanism for detecting whether the underlying file has
   152              changed is to check whether a new QID is given by the gofer. This may
   153              result in false positives, e.g. suppose that the server closed and
   154              re-opened the same file, which may result in a new QID.
   155          *   Finally, the time of the event may be completely different from the time
   156              of the file modification, since a dentry is not immediately notified
   157              when the underlying file has changed. It would be quite unexpected to
   158              receive the notification when invalidation was triggered, i.e. the next
   159              time the file was accessed within the sandbox, because then the
   160              read/write/etc. operation on the file would not result in the expected
   161              event.
   162      *   Another point in favor of the first solution: inotify in Linux can
   163          already be lossy on local filesystems (one of the sacrifices made so
   164          that filesystem performance isn’t killed), and it is lossy on NFS for
   165          similar reasons to gofer fs. Therefore, it is better for inotify to be
   166          silent than to emit incorrect notifications.
   167  *   **There may be external users of the remote filesystem.** We can only track
   168      operations performed on the file within the sandbox. This is sufficient
   169      under InteropModeExclusive, but whenever there are external users, the set
   170      of actions we are aware of is incomplete.
   171      *   *Solution:* We could either return an error or just issue a warning when
   172          inotify is used without InteropModeExclusive. Although faulty, VFS1
   173          allows it when the filesystem is shared, and Linux does the same for
   174          remote filesystems (as mentioned above, inotify sits at the vfs level).
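
The caching rule from the "Dentries can be cached, and then evicted" item above can be condensed into a small decision function. This is a sketch under stated assumptions: the `dentry` struct, the `action` type, and both method names are hypothetical and far simpler than the real gofer dentry, which tracks references and watches under its own locks.

```go
package main

import "fmt"

// action is what the filesystem does with a dentry (hypothetical type).
type action int

const (
	keep    action = iota // leave the dentry in memory, uncached
	cache                 // put it in the dentry cache; it may be evicted later
	destroy               // drop it together with any remaining watches
)

// dentry is a toy model of a gofer dentry.
type dentry struct {
	refs    int  // outstanding references
	watches int  // inotify watches held on this dentry
	dead    bool // deleted or invalidated
}

// onZeroRefs decides what to do when the last reference goes away.
func (d *dentry) onZeroRefs() action {
	switch {
	case d.dead:
		// Deleted or invalidated: destroy it even if it is watched.
		return destroy
	case d.watches > 0:
		// Watched: keep it out of the cache so it cannot be evicted.
		return keep
	default:
		return cache
	}
}

// onZeroWatches decides what to do when the last watch is removed.
func (d *dentry) onZeroWatches() action {
	if d.refs == 0 {
		// Nothing else pins it: make it evictable again.
		return cache
	}
	return keep
}

func main() {
	watched := &dentry{watches: 1}
	fmt.Println(watched.onZeroRefs() == keep)

	watched.watches = 0
	fmt.Println(watched.onZeroWatches() == cache)

	gone := &dentry{watches: 2, dead: true}
	fmt.Println(gone.onZeroRefs() == destroy)
}
```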
   175  
   176  ## Dentry Interface
   177  
   178  For events that must be generated above the vfs layer, we provide the following
   179  DentryImpl methods to allow interactions with targets on any FilesystemImpl:
   180  
   181  *   **InotifyWithParent()** generates events on the dentry’s watches as well as
   182      its parent’s.
   183  *   **Watches()** retrieves the watch set of the target represented by the
   184      dentry. This is used to access and modify watches on a target.
   185  *   **OnZeroWatches()** performs cleanup tasks after the last watch is removed
   186      from a dentry. This is needed by gofer fs, which must allow a watched dentry
   187      to be cached once it has no more watches. Most implementations can just do
   188      nothing. Note that OnZeroWatches() must be called after all inotify locks
   189      are released to preserve lock ordering, since it may acquire
   190      FilesystemImpl-specific locks.
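
The ordering constraint on OnZeroWatches() can be illustrated with a reduced interface. The sketch below keeps only two of the three methods and drops all parameters. The `Dentry` interface, `removeWatch`, and `cachingDentry` are hypothetical names, not the real vfs.DentryImpl API.

```go
package main

import (
	"fmt"
	"sync"
)

// Watches is a simplified watch set: owner id -> mask.
type Watches struct {
	mu sync.Mutex
	ws map[uint64]uint32
}

// Dentry is a reduced, hypothetical version of the interface described above.
type Dentry interface {
	Watches() *Watches
	OnZeroWatches()
}

// removeWatch deletes ownerID's watch from d's watch set. OnZeroWatches is
// called only after Watches.mu has been released, because the implementation
// may take filesystem-specific locks.
func removeWatch(d Dentry, ownerID uint64) {
	ws := d.Watches()
	ws.mu.Lock()
	if _, ok := ws.ws[ownerID]; !ok {
		ws.mu.Unlock()
		return
	}
	delete(ws.ws, ownerID)
	empty := len(ws.ws) == 0
	ws.mu.Unlock()

	if empty {
		d.OnZeroWatches()
	}
}

// cachingDentry models a filesystem that must react to the last watch going
// away, as gofer fs does.
type cachingDentry struct {
	fsMu    sync.Mutex // stand-in for a FilesystemImpl-specific lock
	watches Watches
	cached  bool
}

func (d *cachingDentry) Watches() *Watches { return &d.watches }

func (d *cachingDentry) OnZeroWatches() {
	d.fsMu.Lock()
	defer d.fsMu.Unlock()
	d.cached = true
}

// noopDentry models the common case where no cleanup is needed.
type noopDentry struct{ watches Watches }

func (d *noopDentry) Watches() *Watches { return &d.watches }
func (d *noopDentry) OnZeroWatches()    {}

func main() {
	d := &cachingDentry{}
	d.watches.ws = map[uint64]uint32{1: 0x2, 2: 0x2}
	removeWatch(d, 1)
	fmt.Println("cached after first removal:", d.cached)
	removeWatch(d, 2)
	fmt.Println("cached after last removal:", d.cached)
}
```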
   191  
   192  ## IN_EXCL_UNLINK
   193  
   194  There are several options that can be set for a watch, specified as part of the
   195  mask in inotify_add_watch(2). In particular, IN_EXCL_UNLINK requires some
   196  additional support in each filesystem.
   197  
   198  A watch with IN_EXCL_UNLINK will not generate events for its target if it
   199  corresponds to a path that was unlinked. For instance, if an fd is opened on
   200  “foo/bar” and “foo/bar” is subsequently unlinked, any reads/writes/etc. on the
   201  fd will be ignored by watches on “foo” or “foo/bar” with IN_EXCL_UNLINK. This
   202  requires each DentryImpl to keep track of whether it has been unlinked, in order
   203  to determine whether events should be sent to watches with IN_EXCL_UNLINK.
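
As a rough illustration of the filtering described above, the sketch below models a child dentry that records whether it has been unlinked and consults that flag when notifying its own watches and its parent's. The types and function names are hypothetical. Only the flag values for IN_MODIFY and IN_EXCL_UNLINK are taken from Linux.

```go
package main

import "fmt"

const (
	inModify     uint32 = 0x00000002 // IN_MODIFY
	inExclUnlink uint32 = 0x04000000 // IN_EXCL_UNLINK
)

// watch counts the events it would have queued.
type watch struct {
	mask      uint32
	delivered int
}

// dentry is a toy watch target that tracks whether it was unlinked.
type dentry struct {
	parent   *dentry
	unlinked bool
	watches  []*watch
}

// notifyAll delivers events to ws, skipping IN_EXCL_UNLINK watches when the
// path that generated the event has been unlinked.
func notifyAll(ws []*watch, events uint32, unlinked bool) {
	for _, w := range ws {
		if unlinked && w.mask&inExclUnlink != 0 {
			continue
		}
		if w.mask&events != 0 {
			w.delivered++
		}
	}
}

// notifyWithParent notifies watches on d and on its parent, using d's
// unlinked state for both.
func (d *dentry) notifyWithParent(events uint32) {
	notifyAll(d.watches, events, d.unlinked)
	if d.parent != nil {
		notifyAll(d.parent.watches, events, d.unlinked)
	}
}

func main() {
	excl := &watch{mask: inModify | inExclUnlink}
	plain := &watch{mask: inModify}
	foo := &dentry{watches: []*watch{excl, plain}}
	bar := &dentry{parent: foo}

	bar.notifyWithParent(inModify) // write through an fd while foo/bar exists
	bar.unlinked = true            // foo/bar is unlinked, fd stays open
	bar.notifyWithParent(inModify) // write through the same fd afterwards

	fmt.Println("excl:", excl.delivered, "plain:", plain.delivered)
}
```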
   204  
   205  ## IN_ONESHOT
   206  
   207  One-shot watches expire after generating a single event. When an event occurs,
   208  all one-shot watches on the target that successfully generated an event are
   209  removed. Lock ordering can cause the management of one-shot watches to be quite
   210  expensive; see Watches.Notify() for more information.