ProxyFS: A write-journaled, extent-based filesystem
===================================================

ProxyFS is a write-journaled, extent-based filesystem that utilizes
properties of an object storage backend. In this section we will cover
the motivation and rationale behind the ProxyFS architecture.

Background
----------

Traditional filesystems utilize mutable blocks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Filesystems are typically architected to make use of a set of one or
more block storage devices, each presenting the property that a write of
a block followed by a read of that same block will return the written
contents. A subsequent overwrite of that block followed by a second read
will return the contents of that second write. Such file systems make
use of this property as they modify in place the contents of the
underlying block storage to maintain application data and file system
metadata.
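
As a minimal illustration of this contract, the sketch below models a
block device as a simple in-memory structure (the blockDevice type and
its methods are hypothetical stand-ins, not ProxyFS code): a read of a
block always returns the contents of the most recent write to that
block.

.. code-block:: go

    package main

    import "fmt"

    // blockDevice is a toy in-memory model of the mutable-block
    // contract: reading a block returns whatever was most recently
    // written to it.
    type blockDevice struct {
        blocks map[uint64][]byte
    }

    // WriteBlock overwrites the block at blockIndex in place.
    func (d *blockDevice) WriteBlock(blockIndex uint64, data []byte) {
        d.blocks[blockIndex] = append([]byte(nil), data...)
    }

    // ReadBlock returns the current contents of the block at blockIndex.
    func (d *blockDevice) ReadBlock(blockIndex uint64) []byte {
        return d.blocks[blockIndex]
    }

    func main() {
        dev := &blockDevice{blocks: make(map[uint64][]byte)}
        dev.WriteBlock(7, []byte("version 1"))
        dev.WriteBlock(7, []byte("version 2"))
        fmt.Println(string(dev.ReadBlock(7))) // always prints "version 2"
    }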

Scale-out architectures leverage eventually consistent properties
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Efforts to satisfy ever-growing storage capacity needs have led to the
development of what are termed “scale-out” architectures. Such systems
are characterized by independent nodes cooperating to present a single
system view, albeit with several challenges. The biggest challenge is
keeping all the nodes in sync with that single system view. So-called
“eventually consistent” solutions are designed to push the limits of
scale by relaxing this consistency goal in various ways.

Object APIs leverage scale-out architectures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A third storage trend has been the advent of object storage. In this
model, arbitrarily sized objects support a PUT, GET, and DELETE model
where the atomicity of block storage is retained, but at the object
level. Various eventually consistent object storage systems are now
available (e.g. Amazon S3 and OpenStack Swift). In such systems, objects
are identified by a Uniform Resource Locator, or URL.
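
A minimal Go sketch of this model might look like the interface below
(the ObjectStore name and method signatures are illustrative
assumptions, not the actual Swift, S3, or ProxyFS API):

.. code-block:: go

    // Package store sketches the whole-object PUT/GET/DELETE model.
    package store

    // ObjectStore is a hypothetical, minimal object API: entire objects
    // are written, read, and deleted by URL. There is no operation that
    // modifies part of an existing object in place.
    type ObjectStore interface {
        // Put atomically stores the entire object at the given URL.
        Put(url string, data []byte) error

        // Get returns the entire contents of the object at the given
        // URL. Under eventual consistency, this may be a stale version.
        Get(url string) ([]byte, error)

        // Delete removes the object at the given URL.
        Delete(url string) error
    }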

Shifting applications from the use of file systems to object storage is
complicated when features innate to file systems, but not present in
the chosen object storage system, are a requirement. While the concept
of an object shares a lot with that of a file, there are important
differences. For one, operations on objects typically preclude
modification of their contents short of rewriting the entire object.
File systems also typically support reorganization by means of some sort
of rename or move operation. Often, scale-out storage systems achieve
their scale precisely by dropping support for this rename capability.

Enabling a transition for simultaneous filesystem and object access
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

And yet, the advantages of moving from a traditional file system to a
scale-out, eventually consistent object storage solution are compelling.
A transition strategy is needed that enables applications currently
accessing their data via file system semantics to move to object
storage. Of the many challenges in providing this transition, one of the
biggest is the need for the file system to deal with the eventual
consistency property of the underlying object storage solution.

PUT-ting a strategy together
----------------------------

Consider two successive writes (PUTs) of the same URL. In an eventually
consistent object storage solution, a subsequent read (GET) may return
either version of the object. This could be due to a preferred location
to store the object being temporarily unavailable during some part of
the write sequence.
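
To make the hazard concrete, the toy sketch below models an eventually
consistent store in which each PUT initially lands on one replica and
each GET is served by an arbitrary replica (the replicaSet type and the
URL used are purely illustrative): after two PUTs to the same URL, a GET
may return either version, or even no data at all.

.. code-block:: go

    package main

    import (
        "fmt"
        "math/rand"
    )

    // replicaSet is a toy model of eventual consistency: a PUT reaches
    // one replica immediately and the others only "eventually", so a
    // GET served by an arbitrary replica may observe an older state.
    type replicaSet struct {
        replicas [3]map[string][]byte
    }

    func newReplicaSet() *replicaSet {
        r := &replicaSet{}
        for i := range r.replicas {
            r.replicas[i] = make(map[string][]byte)
        }
        return r
    }

    // Put writes to a single, arbitrarily chosen replica; propagation
    // to the remaining replicas is assumed to happen later.
    func (r *replicaSet) Put(url string, data []byte) {
        r.replicas[rand.Intn(len(r.replicas))][url] = data
    }

    // Get reads from an arbitrary replica, which may hold a stale copy.
    func (r *replicaSet) Get(url string) []byte {
        return r.replicas[rand.Intn(len(r.replicas))][url]
    }

    func main() {
        store := newReplicaSet()
        store.Put("/v1/account/container/object", []byte("first write"))
        store.Put("/v1/account/container/object", []byte("second write"))
        // Depending on which replicas the PUTs and the GET happened to
        // hit, this may print "first write", "second write", or nothing.
        fmt.Println(string(store.Get("/v1/account/container/object")))
    }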

Using unique URLs
-----------------

Next consider that the first write was to a new URL. No second write has
ever been performed to that URL, hence a subsequent read (GET) could
only return the first write’s data (or, of course, an error due to
unavailability). Thus there is no confusion as to which version of an
object’s data will be returned.

A filesystem would typically map itself logically onto a set of blocks
of storage. As blocks become unreferenced (e.g. due to a file’s
deletion), those blocks would be returned to a pool and re-used later.
If a file is modified, it is also possible to modify the underlying
blocks storing that file’s data directly. In either case, confusion as
to which version of a block’s contents will be returned would be fatal
to a file system.

The solution is to take advantage of the arbitrarily large namespace
offered by the URL model of the object storage system by only ever
writing to a given URL once. In this way, a file system will not be
confused by the underlying storage system returning stale versions of
file system data.
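
A sketch of this write-once discipline follows, assuming a hypothetical
nonce counter (which, in a real system, would itself have to be durably
persisted) embedded in each object name so that no URL is ever reused:

.. code-block:: go

    package main

    import (
        "fmt"
        "sync/atomic"
    )

    // nonce hands out strictly increasing values. Because each value is
    // used at most once, every object name built from it is unique, and
    // the resulting URL is only ever written a single time.
    var nonce uint64

    // nextObjectURL returns a never-before-used URL within the given
    // container. The naming scheme shown is purely illustrative.
    func nextObjectURL(container string) string {
        n := atomic.AddUint64(&nonce, 1)
        return fmt.Sprintf("/v1/account/%s/%016X", container, n)
    }

    func main() {
        // Two logical updates to the same file land in two distinct
        // objects; a GET of either URL can only ever see one version.
        fmt.Println(nextObjectURL("filedata")) // .../0000000000000001
        fmt.Println(nextObjectURL("filedata")) // .../0000000000000002
    }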

A filesystem is represented by inodes and extents
-------------------------------------------------

A filesystem must represent a number of data structures that are
persisted in the underlying storage system. The basic element of a file
system is termed an inode. An inode is a representation of familiar file
system concepts such as files and directories (though there are
typically other types of inodes supported). Each inode has associated
metadata, typically of a fixed size. For file inodes, there is also a
mapping from logical extents (starting offset and length) to locations
in the storage system. For directory inodes, there is also a list of
mappings from names to inodes representing the directory’s contents. The
set of inodes as well as the contents of each file and directory inode
can be of arbitrary size and must support modification.
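
One possible in-memory shape for these structures is sketched below
(the field names and types are illustrative assumptions, not the actual
ProxyFS definitions): a file inode maps logical extents onto object
locations, while a directory inode maps names onto inode numbers.

.. code-block:: go

    // Package inode sketches the data structures described above.
    package inode

    import "time"

    // InodeNumber identifies an inode within the filesystem.
    type InodeNumber uint64

    // Metadata is the fixed-size portion common to every inode.
    type Metadata struct {
        InodeNumber      InodeNumber
        LinkCount        uint64
        Mode             uint32
        UserID, GroupID  uint32
        ModificationTime time.Time
    }

    // Extent maps a contiguous logical byte range of a file onto a
    // byte range within some object in the object store.
    type Extent struct {
        FileOffset   uint64 // starting offset within the file
        Length       uint64 // length of the extent in bytes
        ObjectURL    string // object holding the data
        ObjectOffset uint64 // starting offset within that object
    }

    // FileInode carries metadata plus the extent map; the map can grow
    // arbitrarily large and must support efficient modification.
    type FileInode struct {
        Metadata
        Extents []Extent // in practice held in a B+Tree, not a slice
    }

    // DirInode carries metadata plus the name-to-inode mapping.
    type DirInode struct {
        Metadata
        Entries map[string]InodeNumber // likewise a B+Tree in practice
    }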

ProxyFS utilizes a B+Tree for filesystem inodes and extents
-----------------------------------------------------------

Support for arbitrary size in each of the above can be accomplished by
many means. ProxyFS utilizes a B+Tree structure. Lookups, insertions,
modifications, and deletions to a B+Tree structure are able to make
efficient use of in-memory caching to achieve high performance with the
ability to scale massively.

The drawback of the B+Tree structure is in how data is typically
persisted in the underlying storage. B+Trees attempt to keep the nodes
of the tree below a maximum size such that they may be modified in
place. When the underlying storage is an eventually consistent object
storage system, modifications in place are problematic due to this lack
of strong consistency. Instead, ProxyFS uses a write-journaling approach
that selects a unique object into which updates are made. Each
modification of a B+Tree must therefore rewrite all nodes from the
updated leaf node all the way up to the root node. When these are each
written to the object storage system, only the location of the root node
need be tracked in order to access a consistent view of the filesystem
element (be that a file or directory).
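
A simplified sketch of this copy-up behavior follows, assuming a
hypothetical node layout (this is not the actual ProxyFS B+Tree
implementation): updating a leaf rewrites every node on the path to the
root into fresh, never-overwritten locations, and recording the new root
location is all that is needed to expose the new, consistent view.

.. code-block:: go

    // Package btree sketches a copy-on-write B+Tree update against a
    // write-once object store.
    package btree

    // Location names where a serialized node lives in the object store:
    // an object (written exactly once) plus a byte range within it.
    type Location struct {
        ObjectURL string
        Offset    uint64
        Length    uint64
    }

    // node is a drastically simplified B+Tree node: either a leaf
    // holding values or an interior node holding child locations.
    type node struct {
        keys     []string
        values   [][]byte   // leaf only
        children []Location // interior only
    }

    // writeNode appends a serialized node to the current journal object
    // and returns its (never reused) location. Details are elided.
    func writeNode(n *node) Location { return Location{} }

    // readNode fetches and deserializes the node at loc. Details elided.
    func readNode(loc Location) *node { return &node{} }

    // childIndex picks which child subtree covers key. Details elided.
    func childIndex(keys []string, key string) int { return 0 }

    // update replaces key's value and returns the location of a NEW
    // root. Every node on the path from the leaf up to the root is
    // rewritten; nothing already in the object store is modified.
    func update(rootLoc Location, key string, value []byte) Location {
        root := readNode(rootLoc)
        if len(root.children) == 0 { // leaf: rewrite it with the update
            for i, k := range root.keys {
                if k == key {
                    root.values[i] = value
                }
            }
            return writeNode(root)
        }
        // Interior node: recurse into the covering child, then rewrite
        // this node so that it points at the child's new location.
        i := childIndex(root.keys, key)
        root.children[i] = update(root.children[i], key, value)
        return writeNode(root)
    }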

Like any write-journaling system, portions of the journal become
unreferenced over time. For ProxyFS, the unreferenced elements are
portions of objects or entire objects. Garbage collection operations
include compaction (where the still-referenced portions of objects are
copied to new objects) and simple object deletion (for objects no longer
referenced at all). Such activity must be balanced with the workload
being applied to both the file system and the underlying storage system.
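
A hedged sketch of what such a collection pass could look like is shown
below (the types, thresholds, and helper functions are assumptions for
illustration, not ProxyFS’s actual compaction logic): objects with no
live references are deleted outright, while sparsely referenced objects
have their live portions copied into new objects.

.. code-block:: go

    // Package gc sketches a garbage collection pass over journal objects.
    package gc

    // objectUsage summarizes how much of a journal object is still
    // referenced by some B+Tree node or file extent.
    type objectUsage struct {
        URL        string
        TotalBytes uint64
        LiveBytes  uint64
    }

    // collect decides, per object, whether to delete or compact it.
    // deleteObject and compactObject stand in for the real operations,
    // and the 25% threshold is an arbitrary illustrative policy.
    func collect(objects []objectUsage) {
        for _, o := range objects {
            switch {
            case o.LiveBytes == 0:
                // Nothing references this object any longer.
                deleteObject(o.URL)
            case o.LiveBytes*4 < o.TotalBytes:
                // Mostly dead: copy the still-referenced portions into
                // a new object, then delete this one.
                compactObject(o.URL)
            }
        }
    }

    func deleteObject(url string)  { /* issue a DELETE; elided */ }
    func compactObject(url string) { /* copy live extents; elided */ }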