Data Flows
==========

As an example of how the system works, this section walks through basic
filesystem operations to illustrate how the pieces of ProxyFS fit
together. It is meant to be an overview of the basic operations and how
they flow through ProxyFS.

Filesystem Writes
-----------------

When a filesystem client performs a write, two things happen. First, the
data itself must be written to storage. Second, the filesystem tree must
be updated so that it records the existence of that file. Let’s walk
through how those two operations happen.

Client initiates a write
~~~~~~~~~~~~~~~~~~~~~~~~

After a client has mounted a filesystem volume, the client initiates a
write request which is received by the ProxyFS process.

Pick a unique object for “strong read-your-writes” consistency
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To store data in the back-end object storage cluster, a unique object
name is chosen with the help of the *nonce* configuration so that each
“block” of storage has a unique URL and inherits the “strong
read-your-writes” property of the object storage backend.

Pooled connections and writes to object storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Writes are managed with a pool of maintained connections to an object
API endpoint. One of these connections is chosen and the data for this
write is streamed to the backend storage.

This allows ProxyFS to bridge the comparatively small filesystem write
sizes and the larger object sizes by streaming multiple filesystem
writes into a single back-end object. This improves write performance as
objects are optimized for streaming, sequential writes.

More write data is accumulated into this open connection for the file
until one of two things happens: either the “max flush size” is reached,
or a timeout (“max flush time”) expires. Either event closes the
connection, and the data is stored in the back-end object storage. The
object that is written is called a log segment.

Data is persisted in back-end object storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the back-end object storage, data must be stored in
containers/buckets. How many objects are stored in a given
container/bucket, and how many containers/buckets are used, are both
configurable.

The storage location used is also configurable, which enables various
back-end storage policies to be used, or enables swift-s3-sync to
archive data to other S3 storage targets.

Update filesystem metadata
~~~~~~~~~~~~~~~~~~~~~~~~~~

The next thing that has to happen is that the filesystem metadata needs
to be updated. A new log segment has been stored that represents new
data in this volume. The filesystem needs to be updated to reflect this
new data: any new inodes or extents need to be captured in the
filesystem metadata.

The data structure that represents the filesystem is called
“headhunter”, because each modification of a B+Tree affects all nodes in
the filesystem from the updated leaf node all the way up to the root
node. This means that the log-structured filesystem is always updating
the “head” of the filesystem tree.
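
To make the “head of the tree” idea concrete, here is a minimal,
hypothetical sketch of a copy-on-write tree update. It is not the actual
headhunter implementation; the node layout, keys, and tree depth are
invented for illustration. Modifying one leaf produces new copies of
every node on the path to the root, so each update yields a new root, or
“head”, while the previous tree (and any checkpoint that references it)
remains intact.

.. code-block:: go

    // Minimal copy-on-write tree sketch; types and depth are illustrative only.
    package main

    import "fmt"

    type node struct {
        keys     []uint64
        children []*node // nil for leaf nodes
    }

    // insert returns a new node for every level it touches, leaving the old
    // tree (and any checkpoint that still references it) untouched.
    func insert(n *node, key uint64) *node {
        copied := &node{keys: append([]uint64{}, n.keys...)}
        if n.children == nil { // leaf: record the key here
            copied.keys = append(copied.keys, key)
            return copied
        }
        copied.children = append([]*node{}, n.children...)
        last := len(copied.children) - 1 // simplified: always descend into the last child
        copied.children[last] = insert(copied.children[last], key)
        return copied
    }

    func main() {
        leaf := &node{keys: []uint64{1, 2}}
        root := &node{keys: []uint64{2}, children: []*node{leaf}}

        newRoot := insert(root, 3) // one leaf update produces a brand-new root ("head")
        fmt.Println(root != newRoot, root.children[0] != newRoot.children[0]) // true true
    }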
Persisting filesystem metadata
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The filesystem metadata is persisted in the back-end object storage.
ProxyFS uses the same trick of creating unique objects with a *nonce* so
that any persistence of the filesystem metadata can also take advantage
of “strong read-your-writes” consistency for new data.

Checkpoints of the filesystem tree are created and “snapshotted” into
the back-end object storage. This ensures that any filesystem metadata
stored in the back-end storage represents a consistent view of the
filesystem.

A few methods are used to determine when a checkpoint of the filesystem
metadata should be stored in the back-end object storage.

One method is time-based. A checkpoint can be initiated and stored in
the back-end object storage at a configured interval. The default
configuration is 10 seconds.

Additionally, a checkpoint can be triggered when the filesystem client
asks for one. For example, if the client unmounts the filesystem, the
client can ask for an explicit flush. Another example is when, after a
file write, the client asks for an explicit close/flush. This also
triggers a checkpoint of the filesystem tree to be made and stored in
the back-end object storage.

Checkpoints may also be a useful tool for management software performing
various volume management functions such as moving volumes, shutting
down services, etc.

Replay log for Zero RPO
~~~~~~~~~~~~~~~~~~~~~~~

ProxyFS additionally utilizes replay logs as a method of keeping track
of changes to the filesystem. In addition to updating the B+Tree that
represents the filesystem, a small log is kept that contains only the
instructions on how to apply the filesystem metadata updates.

This small replay log does not store file data, just filesystem
metadata. The replay log ensures that no filesystem metadata updates are
lost if the system is rebooted or power is lost.

Additionally, this replay log can be utilized by management software to
manage volume migrations or assist in failover.

Object API Writes
-----------------

For object API writes with the AWS S3 or OpenStack Swift API, the Swift
Proxy provides access to the object storage back end. Object storage
manages its namespace with accounts and buckets/containers as its
namespace constructs. ProxyFS creates a volume for each account in the
system, and the top-level directories map to buckets/containers.

Middleware powers many of the functions provided by the Swift Proxy
node. ProxyFS provides an additional middleware that enables any Swift
Proxy to read and write data for a ProxyFS-enabled account.

When a request is made to write data via the S3 or Swift API, the
ProxyFS middleware writes data using the log-structured data format,
utilizing non-overlapping *nonce* values to create uniquely named log
segments.

In Swift, there is a Container (Bucket) service that needs to be
informed when there is a new object in its namespace. For
ProxyFS-enabled accounts, rather than contacting the Container service,
the volume’s ProxyFS service is contacted and informed which new
segments need to be added to the filesystem namespace.

Multi-part upload APIs are accommodated by “coalescing” multiple parts
into a single file.
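
As a rough illustration of that flow, here is a hedged sketch of what
the middleware conceptually does for a PUT to a ProxyFS-enabled account.
The helper names, the nonce allocator, and the URL layout below are
hypothetical stand-ins, not the real middleware or RPC API: the body is
streamed into a uniquely named log segment, and ProxyFS (rather than the
Container service) is told to link that segment into the namespace.

.. code-block:: go

    package main

    import (
        "fmt"
        "io"
        "strings"
    )

    // nonce stands in for ProxyFS's monotonically increasing nonce allocator.
    var nonce uint64 = 0x1000

    func nextNonce() uint64 { nonce++; return nonce }

    // putLogSegment stands in for streaming the request body into a uniquely
    // named log segment in the object back end.
    func putLogSegment(account string, body io.Reader) (string, int64, error) {
        n, err := io.Copy(io.Discard, body) // placeholder for the streamed upload
        if err != nil {
            return "", 0, err
        }
        url := fmt.Sprintf("/v1/%s/LogSegments/%016X", account, nextNonce())
        return url, n, nil
    }

    func main() {
        segURL, length, _ := putLogSegment("AUTH_test", strings.NewReader("hello world"))

        // Rather than notifying the Swift Container service, the middleware would
        // issue an RPC to the volume's ProxyFS server: "link this segment to
        // /bucket/file in the filesystem namespace."
        fmt.Printf("register with ProxyFS: %s (%d bytes) -> /bucket/file\n", segURL, length)
    }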
Filesystem Reads
----------------

Filesystem reads differ from object API reads in that object reads are
optimized for larger, sequential reads, whereas filesystem reads may
come in smaller segments.

While object APIs do support range-read requests, it is not necessarily
efficient to do very small, frequent reads. ProxyFS will issue a range
read request to “read ahead” of the filesystem client and cache data
from the log segment. The size of the “read ahead” is configurable. The
total size of the read cache is also configurable, and each volume can
be configured with a relative weight determining how much of that cache
it will use.

There is a separate pool of connections to the object storage backend
(the size of which is configurable to support various read patterns).

When a read is requested, the filesystem metadata translates the
requested range of the referenced inode to specific log segments in the
back-end object storage. Data is cached and the appropriate byte ranges
are served back to the client.

Object Reads
------------

When a read request is made, the object server does not itself know how
to map the URI (/account/bucket/file) to log segments in the backend
storage. The Proxy Server configured with the ProxyFS middleware will
query the volume’s ProxyFS server with the URI and, in response, be
provided with a “read plan” that contains the appropriate log segments
and byte ranges needed to satisfy the read request.
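
To make that concrete, here is a minimal sketch of the kind of
information a read plan conveys. The struct and field names are
hypothetical, not the actual ProxyFS types: for a requested byte range
of a file, the plan lists, in order, which log-segment objects to read
and which byte range to fetch from each.

.. code-block:: go

    package main

    import "fmt"

    // readPlanStep is an illustrative stand-in for one entry of a read plan.
    type readPlanStep struct {
        ObjectPath string // log segment holding this part of the file
        Offset     int64  // where the data starts within that segment
        Length     int64  // how many bytes to read from that segment
    }

    func main() {
        // e.g. a 12 KiB read at some file offset that spans two log segments
        plan := []readPlanStep{
            {ObjectPath: "/v1/AUTH_test/LogSegments/00000000000000A1", Offset: 8192, Length: 4096},
            {ObjectPath: "/v1/AUTH_test/LogSegments/00000000000000B7", Offset: 0, Length: 8192},
        }
        for _, step := range plan {
            fmt.Printf("GET %s bytes=%d-%d\n", step.ObjectPath, step.Offset, step.Offset+step.Length-1)
        }
    }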