## EXT(2/3/4) File System

This is a filesystem driver which supports ext2, ext3 and ext4 filesystems.
Linux has specialized drivers for each variant but none that supports all. This
library takes advantage of ext's backward compatibility and understands the
internal organization of on-disk structures to support all variants.

This driver implementation diverges from the Linux implementations in being more
forgiving about versioning. For instance, if a filesystem contains both
extent-based inodes and classical block-map-based inodes, this driver will not
complain and will interpret both correctly, whereas in Linux this would be an
issue. This blurs the line between the three ext fs variants.
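As a minimal sketch of what interpreting both inode kinds can look like, the ext4 on-disk format marks extent-based inodes with the `EXT4_EXTENTS_FL` bit (`0x80000`) in the inode flags, so a reader can decide per inode rather than per filesystem. The helper name `layoutOf` is hypothetical and is not taken from this driver.

```go
package main

import "fmt"

// extentsFlag is EXT4_EXTENTS_FL from the ext4 on-disk format. When it is set
// in an inode's flags, the inode's block area holds an extent tree root
// instead of a classical block map.
const extentsFlag = 0x80000

// layoutOf is a hypothetical helper that names how one inode's data is
// organized, based only on that inode's flags.
func layoutOf(inodeFlags uint32) string {
	if inodeFlags&extentsFlag != 0 {
		return "extent tree"
	}
	return "block map"
}

func main() {
	// Two inodes from the same image may use different layouts.
	for _, flags := range []uint32{0x80000, 0x0} {
		fmt.Println(layoutOf(flags))
	}
}
```

Deciding per inode is what lets one image mix both layouts without the reader rejecting it.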

Ext2 is considered deprecated as of Red Hat Enterprise Linux 7, and ext3 has
been superseded by ext4, which offers large performance gains. It is therefore
recommended to upgrade older filesystem images to ext4 using e2fsprogs.

### Read Only

This driver currently allows only read-only operations. Many of the design
decisions rely on this restriction. There are plans to implement writes; the
work involved is documented in the Future Work section.

### Performance

One of the biggest wins of this driver is that it talks directly to the
underlying block device (or whatever persistent storage is being used), instead
of making expensive RPCs to a gofer.

Another advantage is that ext fs supports fast concurrent reads. Currently the
device is represented using an `io.ReaderAt`, which allows for concurrent reads.
All reads are passed directly to the device driver, which intelligently serves
the read requests in the optimal order. There is no congestion due to locking
while reading at the filesystem level.
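The sketch below illustrates why an `io.ReaderAt` suits concurrent reads: each call carries its own offset, so goroutines share no seek position and need no lock. `memDevice` and `readBlocks` are hypothetical names used only for this example; a real device would typically be an `*os.File`.

```go
package main

import (
	"fmt"
	"io"
	"sync"
)

// memDevice is an in-memory stand-in for the block device (illustration only).
type memDevice []byte

// ReadAt implements io.ReaderAt. It never mutates the device, so it is safe
// to call from many goroutines at once.
func (d memDevice) ReadAt(p []byte, off int64) (int, error) {
	if off < 0 || off >= int64(len(d)) {
		return 0, io.EOF
	}
	n := copy(p, d[off:])
	if n < len(p) {
		return n, io.EOF
	}
	return n, nil
}

// readBlocks reads count consecutive blocks of blockSize bytes, one goroutine
// per block. No lock is taken because every read names its own offset.
func readBlocks(dev io.ReaderAt, blockSize, count int) ([][]byte, error) {
	blocks := make([][]byte, count)
	errs := make([]error, count)
	var wg sync.WaitGroup
	for i := 0; i < count; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			buf := make([]byte, blockSize)
			n, err := dev.ReadAt(buf, int64(i)*int64(blockSize))
			if n < len(buf) {
				errs[i] = err // a short read always comes with a non-nil error
				return
			}
			blocks[i] = buf
		}(i)
	}
	wg.Wait()
	for _, err := range errs {
		if err != nil {
			return nil, err
		}
	}
	return blocks, nil
}

func main() {
	blocks, err := readBlocks(memDevice("aabbccdd"), 2, 4)
	fmt.Println(len(blocks), err)
}
```

Each goroutine writes only to its own slot in `blocks` and `errs`, so the sketch needs no synchronization beyond the `WaitGroup`.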

Reads are optimized further in the way file data is transferred to user
memory. Ext fs copies file data from disk directly into user memory with no
additional allocations on the way. The only way to get faster is to preload file
data into memory (see the Future Work section).

The internal structures used to represent files, inodes and file descriptors use
a lot of inheritance (struct embedding in Go). Because an interface adds a level
of indirection through an internal pointer, it can quickly fragment a structure
across memory. As this runs alongside a full-blown kernel (which is memory
intensive), having a fragmented struct might hurt performance. Hence these
internal structures, though interfaced, are tightly packed in memory using the
same inheritance pattern that pkg/sentry/vfs uses. The
pkg/sentry/fsimpl/ext/disklayout package makes an exception to this pattern for
reasons documented in the package.
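To illustrate the packing idea in general terms, the sketch below embeds a common struct by value inside a specialized one, so both live in a single allocation while still being usable through an interface. The type and field names are hypothetical and do not reproduce the driver's or `pkg/sentry/vfs`'s actual definitions.

```go
package main

import "fmt"

// inode holds fields common to every file type (hypothetical fields).
type inode struct {
	inodeNum uint32
	size     uint64
}

// fileSize is defined once on the common part and promoted to embedders.
func (i *inode) fileSize() uint64 { return i.size }

// regularFile embeds inode by value: the common and specialized fields sit
// contiguously in one allocation, with no pointer hop between them.
type regularFile struct {
	inode
	firstBlock uint64
}

// sized is an interface view over any type that embeds inode.
type sized interface {
	fileSize() uint64
}

func main() {
	f := &regularFile{inode: inode{inodeNum: 12, size: 4096}, firstBlock: 100}
	var s sized = f // satisfied through the promoted method
	fmt.Println(s.fileSize(), f.inodeNum, f.firstBlock)
}
```

The alternative, holding the common part behind an interface or pointer field, would place the two halves in separate allocations.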

### Security

This driver also intends to help sandbox the container better by reducing the
surface of the host kernel that the application touches. It prevents the
application from exploiting vulnerabilities in the host filesystem driver. All
`io.ReaderAt.ReadAt()` calls are translated to `pread(2)` calls, which are
passed directly to the device driver in the kernel. This reduces the attack
surface.
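As a small illustration of a read that goes through `ReadAt`, the sketch below checks the ext superblock magic. In the ext on-disk format the superblock starts 1024 bytes into the device and its magic field (`0xEF53`, little-endian) sits at byte 56 of the superblock. When the `io.ReaderAt` is an `*os.File` on Linux, Go's `ReadAt` issues `pread(2)`. The names `hasExtMagic` and `fakeImage` are hypothetical, and an in-memory image is used so the example runs without a device.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

const (
	superblockOffset = 1024   // the superblock starts 1024 bytes into the device
	magicOffset      = 56     // offset of the magic field inside the superblock
	extMagic         = 0xEF53 // shared by ext2, ext3 and ext4
)

// hasExtMagic reports whether dev carries the ext superblock magic. It issues
// a single positioned read and never writes to the device.
func hasExtMagic(dev io.ReaderAt) (bool, error) {
	var buf [2]byte
	if _, err := dev.ReadAt(buf[:], superblockOffset+magicOffset); err != nil {
		return false, err
	}
	return binary.LittleEndian.Uint16(buf[:]) == extMagic, nil
}

// fakeImage builds an in-memory image in which only the magic field is set.
func fakeImage(magic uint16) io.ReaderAt {
	img := make([]byte, 2048)
	binary.LittleEndian.PutUint16(img[superblockOffset+magicOffset:], magic)
	return bytes.NewReader(img)
}

func main() {
	ok, err := hasExtMagic(fakeImage(extMagic))
	fmt.Println(ok, err)
}
```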

The application cannot affect any host filesystem other than the one the user
passes in via the block device.

### Future Work

#### Write

To support write operations we would need to modify the block device underneath.
Currently, the driver does not modify the device at all, not even to update
the access times on reads. Modifying the filesystem incorrectly can corrupt it
and render it unreadable for other correct ext(x) drivers. Hence metadata
structures must be modified with caution.

Ext4 specifically is built for performance and has added a lot of complexity to
how metadata structures are modified. For instance, files are organized via an
extent tree, which must be kept balanced, and file data blocks must be placed in
the same extent as much as possible to increase locality. Such properties must
be maintained while modifying the tree.

Ext filesystems rely heavily on locality, which plays a big role in their
performance. The block allocation algorithm in Linux does a good job of keeping
related data together. This behavior must be maintained as much as possible,
or filesystem performance may degrade over time.

Ext4 also supports a wide variety of features which are specialized for varying
use cases. Implementing all of them can get difficult very quickly.

Ext(x) checksums all its metadata structures to check for corruption, so
any modification of a metadata struct must be accompanied by re-checksumming the
struct. Linux filesystem drivers also order on-disk updates intelligently to
avoid corrupting the filesystem while remaining performant. The in-memory
metadata structures must be kept in sync with what is on disk.
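The modify-then-re-checksum discipline can be sketched as follows. This is an illustration under stated assumptions, not ext4's exact scheme: the record layout (checksum in the trailing four bytes) and the names `seal` and `verify` are hypothetical, and although ext4's metadata checksums are based on CRC32C, its seeding and per-structure field placement differ from what Go's `hash/crc32` computes here.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/crc32"
)

// castagnoli selects the CRC32C polynomial.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// seal recomputes the checksum over everything except the trailing 4 bytes
// and stores it there (hypothetical layout, for illustration only).
func seal(rec []byte) {
	body := rec[:len(rec)-4]
	binary.LittleEndian.PutUint32(rec[len(rec)-4:], crc32.Checksum(body, castagnoli))
}

// verify reports whether the stored checksum matches the record body.
func verify(rec []byte) bool {
	body := rec[:len(rec)-4]
	return binary.LittleEndian.Uint32(rec[len(rec)-4:]) == crc32.Checksum(body, castagnoli)
}

func main() {
	rec := make([]byte, 16)
	seal(rec)
	rec[0] = 0xFF            // modify a metadata field
	fmt.Println(verify(rec)) // false: the stored checksum is now stale
	seal(rec)                // re-checksum after the modification
	fmt.Println(verify(rec)) // true
}
```

A writer that skips the second `seal` leaves a record that other drivers would treat as corrupt.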

Some important structures are also replicated across the filesystem.
All replicas must be updated when their original copy is updated. There is also
provision for snapshotting, which must be kept in mind, although it should not
affect this implementation unless we allow users to create filesystem snapshots.

Ext3 introduced journaling, and ext4 uses the jbd2 journaling layer. The journal
must be updated appropriately.

#### Performance

To improve performance we should implement a buffer cache and, optionally,
read-ahead for small files. While doing so we must also keep memory usage in
mind and place a reasonable cap on how much file data we hold in memory.
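One possible shape for such a capped cache is a least-recently-used map keyed by block number. The sketch below is a hypothetical design, not a planned implementation: it caps the number of cached blocks rather than bytes, and it is not safe for concurrent use without an external lock.

```go
package main

import (
	"container/list"
	"fmt"
)

// entry is one cached block.
type entry struct {
	blockNum uint64
	data     []byte
}

// blockCache holds at most capacity blocks and evicts the least recently used.
type blockCache struct {
	capacity int
	order    *list.List // front is the most recently used entry
	byBlock  map[uint64]*list.Element
}

func newBlockCache(capacity int) *blockCache {
	return &blockCache{
		capacity: capacity,
		order:    list.New(),
		byBlock:  make(map[uint64]*list.Element),
	}
}

// get returns the cached block and marks it as recently used.
func (c *blockCache) get(blockNum uint64) ([]byte, bool) {
	el, ok := c.byBlock[blockNum]
	if !ok {
		return nil, false
	}
	c.order.MoveToFront(el)
	return el.Value.(*entry).data, true
}

// put stores a block, evicting the oldest entry once the cap is exceeded.
func (c *blockCache) put(blockNum uint64, data []byte) {
	if el, ok := c.byBlock[blockNum]; ok {
		el.Value.(*entry).data = data
		c.order.MoveToFront(el)
		return
	}
	c.byBlock[blockNum] = c.order.PushFront(&entry{blockNum: blockNum, data: data})
	if c.order.Len() > c.capacity {
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.byBlock, oldest.Value.(*entry).blockNum)
	}
}

func main() {
	c := newBlockCache(2)
	c.put(1, []byte("one"))
	c.put(2, []byte("two"))
	c.get(1)                  // block 1 becomes the most recently used
	c.put(3, []byte("three")) // exceeds the cap, so block 2 is evicted
	_, ok := c.get(2)
	fmt.Println(ok) // false
}
```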

#### Features

Our current implementation will work with most ext4 filesystems for read-only
purposes. However, the following features are not supported yet:

-   Journal
-   Snapshotting
-   Extended Attributes
-   Hash Tree Directories
-   Meta Block Groups
-   Multiple Mount Protection
-   Bigalloc
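A tool could report which of the features listed above an image advertises by inspecting the three feature words in the superblock. The sketch below is illustrative only: the bit values are the ones I understand the ext4 on-disk format to define and should be checked against the kernel headers, and `unsupported` is a hypothetical helper that does not describe how this driver behaves.

```go
package main

import "fmt"

// Feature bits as defined by the ext4 on-disk format (assumed values; verify
// against the kernel's ext4 headers before relying on them).
const (
	compatHasJournal = 0x0004 // Journal
	compatExtAttr    = 0x0008 // Extended Attributes
	compatDirIndex   = 0x0020 // Hash Tree Directories
	incompatMetaBG   = 0x0010 // Meta Block Groups
	incompatMMP      = 0x0100 // Multiple Mount Protection
	roCompatSnapshot = 0x0080 // Snapshotting
	roCompatBigalloc = 0x0200 // Bigalloc
)

// unsupported lists which of the features named in this README are advertised
// by the given superblock feature words.
func unsupported(compat, incompat, roCompat uint32) []string {
	var out []string
	check := func(word, bit uint32, name string) {
		if word&bit != 0 {
			out = append(out, name)
		}
	}
	check(compat, compatHasJournal, "Journal")
	check(roCompat, roCompatSnapshot, "Snapshotting")
	check(compat, compatExtAttr, "Extended Attributes")
	check(compat, compatDirIndex, "Hash Tree Directories")
	check(incompat, incompatMetaBG, "Meta Block Groups")
	check(incompat, incompatMMP, "Multiple Mount Protection")
	check(roCompat, roCompatBigalloc, "Bigalloc")
	return out
}

func main() {
	// An image with a journal, hashed directories and bigalloc enabled.
	fmt.Println(unsupported(0x0004|0x0020, 0, 0x0200))
}
```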