Data Flows
==========

As an example of how the system works, this section walks through basic
filesystem operations to illustrate how everything is put together. It
is meant to be an overview of the basic operations and how they flow
through ProxyFS.

Filesystem Writes
-----------------

When a filesystem client performs a write, two things happen. First,
the data bits need to be written to storage. Second, the filesystem
tree needs to be updated to record the existence of that file. Let’s
walk through how those two operations happen.

Client initiates a write
~~~~~~~~~~~~~~~~~~~~~~~~

After a client has mounted a filesystem volume, the client initiates a
write request which is received by the ProxyFS process.

Pick a unique object for “strong read-your-writes” consistency
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To store data in the back-end object storage cluster, a unique object
name is chosen with the help of the *nonce* configuration so that each
“block” of storage has a unique URL and inherits the “strong
read-your-writes” property of the object storage backend.

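As a rough sketch of the idea (not ProxyFS’s actual implementation),
each block of data can be named from a never-reused nonce value, so
every write lands in a brand-new object path. The ``nonceSource`` and
``objectPath`` helpers, the account/container names, and the path
layout below are all hypothetical:

.. code-block:: go

    package main

    import (
        "fmt"
        "sync/atomic"
    )

    // nonceSource hands out monotonically increasing values. This sketch
    // keeps a simple in-memory counter; a real system would reserve
    // ranges of nonce values so they stay unique across restarts.
    type nonceSource struct {
        next uint64
    }

    func (n *nonceSource) fetch() uint64 {
        return atomic.AddUint64(&n.next, 1)
    }

    // objectPath builds a unique URL path for a new block of data. The
    // naming scheme shown here is illustrative only.
    func objectPath(account, container string, nonce uint64) string {
        return fmt.Sprintf("/v1/%s/%s/%016X", account, container, nonce)
    }

    func main() {
        nonces := &nonceSource{}
        // Every write lands in a brand-new object, so reading it back
        // never races with an overwrite of an existing object.
        fmt.Println(objectPath("AUTH_example", "LogSegments_00", nonces.fetch()))
        fmt.Println(objectPath("AUTH_example", "LogSegments_00", nonces.fetch()))
    }
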
Pooled connections and writes to object storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Writes are managed with a pool of maintained connections to an object
API endpoint. One of these connections is chosen and the data for this
write is streamed to the backend storage.

This allows ProxyFS to bridge the gap between comparatively small
filesystem write sizes and larger object sizes by streaming multiple
filesystem writes into a single back-end object. This improves write
performance, as objects are optimized for streaming, sequential writes.

More data from subsequent writes is accumulated into this open
connection for this file until one of two things happens: either the
“max flush size” is reached, or a timeout (“max flush time”) expires.
Either event causes the connection to close, and the data is stored in
the back-end object storage. The object that is written is called a
log-segment.

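A minimal sketch of that flush decision, with illustrative type and
field names rather than the actual configuration keys or internal
plumbing:

.. code-block:: go

    package main

    import (
        "fmt"
        "time"
    )

    // flushPolicy mirrors the two triggers described above: a size limit
    // ("max flush size") and an age limit ("max flush time"). The field
    // names are illustrative, not the actual configuration keys.
    type flushPolicy struct {
        maxFlushSize uint64
        maxFlushTime time.Duration
    }

    // openLogSegment tracks how much data has been streamed into the
    // currently open object and when the first byte arrived.
    type openLogSegment struct {
        bytesWritten uint64
        openedAt     time.Time
    }

    // shouldFlush reports whether the open connection should be closed,
    // turning the accumulated data into a completed log-segment.
    func (p flushPolicy) shouldFlush(seg openLogSegment, now time.Time) bool {
        if seg.bytesWritten >= p.maxFlushSize {
            return true
        }
        return now.Sub(seg.openedAt) >= p.maxFlushTime
    }

    func main() {
        policy := flushPolicy{maxFlushSize: 10 << 20, maxFlushTime: 10 * time.Second}
        seg := openLogSegment{bytesWritten: 4 << 20, openedAt: time.Now().Add(-12 * time.Second)}

        // Whichever limit is hit first wins.
        fmt.Println("flush now?", policy.shouldFlush(seg, time.Now())) // true (time limit)
    }
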
Data is persisted in back-end object storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the back-end object storage, data must be stored in
containers/buckets. How many objects are stored in a given
container/bucket and how many containers/buckets are utilized are both
configurable.

The storage location that is used is also configurable, which enables
various back-end storage policies to be used, or enables swift-s3-sync
to archive data to other S3 storage targets.

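As an illustration of those knobs (the ``placementConfig`` type and its
field names are invented for this sketch, not the actual settings),
log-segments could be spread across a fixed set of containers, rolling
to the next container after a configured number of objects:

.. code-block:: go

    package main

    import "fmt"

    // placementConfig stands in for the configurable values mentioned
    // above: how many containers to use and how many objects to put in
    // each. Field names are illustrative only.
    type placementConfig struct {
        containersPerVolume uint64
        objectsPerContainer uint64
    }

    // containerName picks a container for the nth object written to the
    // volume, filling each container up to the configured count before
    // moving on, and wrapping around the configured number of containers.
    func (c placementConfig) containerName(objectIndex uint64) string {
        containerIndex := (objectIndex / c.objectsPerContainer) % c.containersPerVolume
        return fmt.Sprintf("LogSegments_%02d", containerIndex)
    }

    func main() {
        cfg := placementConfig{containersPerVolume: 4, objectsPerContainer: 1000}
        fmt.Println(cfg.containerName(0))    // LogSegments_00
        fmt.Println(cfg.containerName(1500)) // LogSegments_01
    }
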
Update filesystem metadata
~~~~~~~~~~~~~~~~~~~~~~~~~~

The next thing that has to happen is that the filesystem metadata needs
to be updated. A new log-segment has been stored that represents new
data in this volume, and the filesystem needs to be updated to reflect
it. Any new inodes or extents need to be captured in the filesystem
metadata.

The data structure that represents the filesystem is called
“headhunter”. The name comes from the fact that each modification of a
B+Tree affects every node on the path from the updated leaf node all
the way up to the root node, so a log-structured filesystem is always
updating the “head” of the filesystem tree.

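A toy sketch of that effect, using a plain tree of named nodes rather
than a real B+Tree (``node`` and ``updatePath`` are invented for
illustration): replacing one leaf forces a new copy of every node on
the path to the root, producing a new “head”.

.. code-block:: go

    package main

    import "fmt"

    // node is a drastically simplified stand-in for a B+Tree node; real
    // nodes hold sorted keys and many children.
    type node struct {
        name     string
        children []*node
    }

    // updatePath models the copy-on-write behaviour described above: to
    // change a leaf, every node on the path from the root to that leaf
    // is rewritten, yielding a new root (a new "head" of the tree).
    func updatePath(root *node, path []int, newLeaf *node) *node {
        if len(path) == 0 {
            return newLeaf
        }
        copied := &node{name: root.name + "'", children: append([]*node{}, root.children...)}
        copied.children[path[0]] = updatePath(root.children[path[0]], path[1:], newLeaf)
        return copied
    }

    func main() {
        leafA := &node{name: "leafA"}
        leafB := &node{name: "leafB"}
        inner := &node{name: "inner", children: []*node{leafA, leafB}}
        root := &node{name: "root", children: []*node{inner}}

        // Replacing leafB produces new copies of inner and root; the old
        // tree remains intact and readable while the new head is built.
        newRoot := updatePath(root, []int{0, 1}, &node{name: "leafB-v2"})
        fmt.Println(root.name, "->", newRoot.name) // root -> root'
    }
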
Persisting filesystem metadata
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The filesystem metadata is persisted in the back-end object storage.
ProxyFS utilizes the same trick of creating unique objects using a
*nonce*, so that persisted filesystem metadata can also take advantage
of “strong read-your-writes” for new data.

Checkpoints of the filesystem tree are created and “snapshotted” into
the back-end object storage. This ensures that any filesystem metadata
that is stored in the back-end storage represents a consistent view of
the filesystem.

A few methods are used to determine when a checkpoint of the filesystem
metadata should be stored in the back-end object storage.

One method is time-based: a checkpoint can be initiated and stored in
the back-end object storage at a configured interval. The default
configuration is 10 seconds.

Additionally, a checkpoint can be triggered when the filesystem client
asks for one. For example, if the client unmounts the filesystem, the
client can ask for an explicit flush. Another example is when, after a
file write, the client asks for a special close/flush. This will also
trigger a checkpoint of the filesystem tree to be made and stored in
the back-end object storage.

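A simplified sketch of how those two triggers could be combined; the
``runCheckpointer`` function and the ``checkpoint`` callback are
placeholders invented for this example, and the real daemon’s structure
is more involved:

.. code-block:: go

    package main

    import (
        "fmt"
        "time"
    )

    // runCheckpointer persists the filesystem tree either on a timer or
    // whenever an explicit flush is requested (for example at unmount or
    // after a close that demands a flush). checkpoint() stands in for
    // writing a consistent snapshot of the B+Tree to object storage.
    func runCheckpointer(interval time.Duration, explicit <-chan struct{}, done <-chan struct{}, checkpoint func(reason string)) {
        ticker := time.NewTicker(interval)
        defer ticker.Stop()
        for {
            select {
            case <-ticker.C:
                checkpoint("interval")
            case <-explicit:
                checkpoint("client-requested flush")
            case <-done:
                checkpoint("shutdown")
                return
            }
        }
    }

    func main() {
        explicit := make(chan struct{})
        done := make(chan struct{})

        go runCheckpointer(10*time.Second, explicit, done, func(reason string) {
            fmt.Println("checkpoint:", reason)
        })

        explicit <- struct{}{}             // e.g. the client unmounted the volume
        close(done)                        // shutting down also checkpoints
        time.Sleep(100 * time.Millisecond) // let the goroutine finish (demo only)
    }
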
Checkpoints may also be a useful tool for management software to perform
various volume management functions such as moving volumes, shutting
down services, etc.

Replay log for Zero RPO
~~~~~~~~~~~~~~~~~~~~~~~

ProxyFS additionally utilizes replay logs as a method of keeping track
of changes to the filesystem. In addition to updating the B+Tree that
represents the filesystem, a small log is kept that contains only the
instructions on how to apply the filesystem metadata updates.

This small replay log does not store file data, just filesystem
metadata. This replay log is kept to ensure that no filesystem metadata
updates are lost if the system is rebooted or there is power loss.

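A rough sketch of the idea: each metadata mutation is appended to a
small log as a self-describing record, so the tree can be rebuilt after
a crash by replaying the records written since the last checkpoint. The
``replayRecord`` format and ``appendRecord`` helper below are invented
for illustration and do not match the actual on-disk format.

.. code-block:: go

    package main

    import (
        "bufio"
        "encoding/json"
        "fmt"
        "os"
    )

    // replayRecord is an invented, minimal record describing one
    // metadata mutation. Only instructions are logged, never file data.
    type replayRecord struct {
        Op     string `json:"op"`     // e.g. "create", "link", "extend"
        Inode  uint64 `json:"inode"`  // inode the mutation applies to
        Detail string `json:"detail"` // operation-specific arguments
    }

    // appendRecord writes one record and forces it to stable storage so
    // a reboot or power loss cannot lose the update.
    func appendRecord(f *os.File, rec replayRecord) error {
        w := bufio.NewWriter(f)
        if err := json.NewEncoder(w).Encode(rec); err != nil {
            return err
        }
        if err := w.Flush(); err != nil {
            return err
        }
        return f.Sync() // fsync: the record survives a crash from here on
    }

    func main() {
        f, err := os.OpenFile("replay.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
        if err != nil {
            panic(err)
        }
        defer f.Close()

        if err := appendRecord(f, replayRecord{Op: "create", Inode: 42, Detail: "/dir/file.txt"}); err != nil {
            panic(err)
        }
        fmt.Println("logged metadata update; replay after a crash re-applies it")
    }
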
Additionally, this replay log can be utilized by management software to
manage volume migrations, or assist in failover.

Object API Writes
-----------------

For object API writes with the AWS S3 or OpenStack Swift API, the Swift
Proxy provides access to the object storage back end. Object storage
organizes its namespace with accounts and buckets/containers. ProxyFS
creates a volume for each account in the system, and the volume’s
top-level directories map to buckets/containers.

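As a loose illustration of that mapping (the ``parseObjectURL``
function and Swift-style URL layout below are assumed for this sketch,
not taken from the ProxyFS source), an object URL resolves to a volume,
a top-level directory, and a path inside it:

.. code-block:: go

    package main

    import (
        "fmt"
        "strings"
    )

    // parseObjectURL splits a Swift-style path such as
    // /v1/AUTH_example/bucket/dir/file.txt into the account (which
    // selects the ProxyFS volume), the bucket/container (which becomes a
    // top-level directory), and the remaining object name.
    func parseObjectURL(path string) (account, container, object string, err error) {
        parts := strings.SplitN(strings.TrimPrefix(path, "/"), "/", 4)
        if len(parts) < 3 || parts[0] != "v1" {
            return "", "", "", fmt.Errorf("not an object URL: %q", path)
        }
        account, container = parts[1], parts[2]
        if len(parts) == 4 {
            object = parts[3]
        }
        return account, container, object, nil
    }

    func main() {
        account, container, object, err := parseObjectURL("/v1/AUTH_example/photos/2021/cat.jpg")
        if err != nil {
            panic(err)
        }
        // Volume "AUTH_example", top-level directory "photos",
        // file path "2021/cat.jpg" inside that directory.
        fmt.Println(account, container, object)
    }
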
Middleware powers many of the functions provided by the Swift Proxy
node. ProxyFS provides an additional middleware that enables any Swift
Proxy to read and write data for a ProxyFS-enabled account.

When a request is made to write data via the S3 or Swift API, the
ProxyFS middleware writes data using the log-structured data format,
utilizing non-overlapping *nonce* values to create uniquely named
segments.

In Swift, there is a Container (Bucket) service that needs to be
informed when there is a new object in its namespace. For
ProxyFS-enabled accounts, rather than contacting the Container service,
the volume’s ProxyFS service is contacted and informed which new
segments need to be added to the filesystem namespace.

Multi-part upload APIs are accommodated by “coalescing” multiple parts
into a single file.

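A sketch of what such a coalesce request might carry, with invented
types and paths (the actual middleware-to-ProxyFS exchange differs):
the destination file name plus the ordered list of already-written
parts to be combined into a single file.

.. code-block:: go

    package main

    import "fmt"

    // coalesceRequest is an invented representation of a multi-part
    // upload completion: the destination path plus the parts, in order,
    // that were already written during the upload.
    type coalesceRequest struct {
        DestinationPath string   // e.g. "bucket/video.mp4"
        PartPaths       []string // objects uploaded for each part, in order
    }

    func main() {
        req := coalesceRequest{
            DestinationPath: "bucket/video.mp4",
            PartPaths: []string{
                "bucket/video.mp4/part-0001",
                "bucket/video.mp4/part-0002",
                "bucket/video.mp4/part-0003",
            },
        }
        // The listed parts are combined into a single file at
        // DestinationPath in the filesystem namespace.
        fmt.Printf("coalesce %d parts into %s\n", len(req.PartPaths), req.DestinationPath)
    }
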
Filesystem Reads
----------------

Filesystem reads differ from object API reads in that object reads are
optimized for larger, sequential requests, whereas filesystem clients
may read in smaller segments.

While object APIs do support range-read requests, it’s not necessarily
efficient to do very small, frequent reads. ProxyFS will issue a range
read request to “read ahead” of the filesystem client and cache data
from the log segment. The size of the “read ahead” is configurable. The
total size of the read cache is also configurable, and each volume can
be assigned a relative weight of that cache to utilize.

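A simplified sketch of that read-ahead behaviour; the ``readCache``
type, its ``readAheadSize`` knob, and the cache keying are invented for
this example, and the real cache is shared with per-volume weights as
described above:

.. code-block:: go

    package main

    import "fmt"

    // readCache is a toy read-ahead cache: instead of fetching exactly
    // the bytes a client asked for, it fetches (and remembers) a whole
    // aligned "read ahead" region of the log segment.
    type readCache struct {
        readAheadSize uint64            // invented knob for this sketch
        lines         map[string][]byte // key: segment + aligned offset
        fetch         func(segment string, offset, length uint64) []byte
    }

    // read serves a small client read, fetching a larger aligned region
    // on a cache miss so that nearby reads are served from memory.
    func (c *readCache) read(segment string, offset, length uint64) []byte {
        alignedOffset := (offset / c.readAheadSize) * c.readAheadSize
        key := fmt.Sprintf("%s@%d", segment, alignedOffset)
        line, ok := c.lines[key]
        if !ok {
            line = c.fetch(segment, alignedOffset, c.readAheadSize) // ranged GET
            c.lines[key] = line
        }
        start := offset - alignedOffset
        return line[start : start+length]
    }

    func main() {
        fetches := 0
        cache := &readCache{
            readAheadSize: 1 << 20, // 1 MiB read-ahead, illustrative only
            lines:         map[string][]byte{},
            fetch: func(segment string, offset, length uint64) []byte {
                fetches++ // pretend this is a ranged GET to object storage
                return make([]byte, length)
            },
        }

        _ = cache.read("LogSegments_00/0001", 4096, 4096) // miss: one ranged GET
        _ = cache.read("LogSegments_00/0001", 8192, 4096) // hit: served from cache
        fmt.Println("backend fetches:", fetches)          // 1
    }
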
There is a separate pool of connections to the object storage backend
(the size of which is configurable to support various read patterns).

When a read is requested, the filesystem metadata is used to translate
the requested range of the referenced inode into specific log-segments
in the back-end object storage. Data is cached and the appropriate byte
ranges are served back to the client.

Object Reads
------------

When a read request is made, the object server doesn’t itself know how
to map the URI (/account/bucket/file) to log-segments in the backend
storage. The Proxy Server configured with the ProxyFS middleware will
query the volume’s ProxyFS server with the URI and in response be
provided with a “read plan” that contains the appropriate log segments
and byte ranges needed to respond to the read request.

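As a sketch of what a read plan amounts to (the ``readPlanStep`` and
``serveRead`` types below are invented for illustration): an ordered
list of log-segment references and byte ranges that, concatenated,
satisfy the requested read.

.. code-block:: go

    package main

    import "fmt"

    // readPlanStep names one log-segment plus the byte range within it
    // that contributes to the response. Types here are illustrative.
    type readPlanStep struct {
        ObjectPath string // log-segment object in back-end storage
        Offset     uint64 // starting byte within that object
        Length     uint64 // number of bytes to read
    }

    // serveRead walks the plan in order, issuing ranged GETs (stubbed
    // out here) and concatenating the pieces into the response body.
    func serveRead(plan []readPlanStep, rangedGet func(step readPlanStep) []byte) []byte {
        var response []byte
        for _, step := range plan {
            response = append(response, rangedGet(step)...)
        }
        return response
    }

    func main() {
        // A hypothetical plan returned by the volume's ProxyFS server
        // for a file whose data spans two log-segments.
        plan := []readPlanStep{
            {ObjectPath: "/v1/AUTH_example/LogSegments_00/00000000000000A1", Offset: 0, Length: 65536},
            {ObjectPath: "/v1/AUTH_example/LogSegments_01/00000000000000B7", Offset: 1024, Length: 4096},
        }
        body := serveRead(plan, func(step readPlanStep) []byte {
            return make([]byte, step.Length) // stand-in for a ranged GET
        })
        fmt.Println("assembled", len(body), "bytes from", len(plan), "log-segments")
    }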