ProxyFS: A write-journaled, extent-based filesystem
===================================================

ProxyFS is a write-journaled, extent-based filesystem that utilizes
properties of an object storage backend. In this section we will cover
the motivation and rationale behind the ProxyFS architecture.

Background
----------

Traditional filesystems utilize mutable blocks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Filesystems are typically architected to make use of a set of one or
more block storage devices, each presenting the property that a write of
a block followed by a read of that same block will return the written
contents. A subsequent overwrite of that block followed by a second read
will return the contents of that second write. Such file systems rely on
this property as they modify in place the contents of the underlying
block storage to maintain application data and file system metadata.

Scale-out architectures leverage eventually consistent properties
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Efforts to satisfy ever-growing storage capacity needs have led to the
development of what are termed “scale-out” architectures. Such systems
are characterized by independent nodes cooperating to present a single
system view, albeit with several challenges. The biggest challenge is
keeping all the nodes in sync with that single system view. So-called
“eventually consistent” solutions are designed to push the limits of
scale by relaxing this consistency goal in various ways.

Object APIs leverage scale-out architectures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A third storage trend has been the advent of object storage.
In this
model, arbitrarily sized objects support a PUT, GET, and DELETE model
where the atomicity of block storage is retained, but at the object
level. Various eventually consistent object storage systems are now
available (e.g. Amazon S3, OpenStack Swift, etc.). In such systems,
objects are identified by a Uniform Resource Locator, or URL.

Shifting applications from the use of file systems to object storage is
complicated when various features innate to file systems but not present
in the chosen object storage system are a requirement. While the concept
of an object shares a lot with that of a file, there are important
differences. For one, operations on objects typically preclude
modification of their contents short of rewriting the entire object.
File systems also typically support reorganization by means of some sort
of rename or move operation. Often, the means by which scale-out storage
systems are able to scale so well has been at the cost of dropping
support for this rename capability.

Enabling a transition for simultaneous filesystem and object access
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

And yet, the advantages of moving from a traditional file system to a
scale-out, eventually consistent object storage solution are compelling.
A transition strategy enabling applications currently accessing their
data via file system semantics to move to object storage is needed. Of
all the challenges in providing this transition (there are many), one of
the biggest is the need of the file system to deal with the eventual
consistency property of the underlying object storage solution.

PUT-ting a strategy together
----------------------------

Consider two successive writes (PUTs) to the same URL. In an eventually
consistent object storage system, a subsequent read (GET) may return
either version of the object.
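This race can be made concrete with a toy model of an eventually
consistent store (illustrative only, not ProxyFS code): one replica is
unavailable during the second PUT, so a GET served by that replica
later returns the stale first version.

```python
class EventuallyConsistentStore:
    """Toy model of an eventually consistent object store.

    Each PUT is applied to every replica except one that may be
    temporarily unavailable; a GET may be served by any replica.
    """

    def __init__(self, replicas=3):
        self.replicas = [dict() for _ in range(replicas)]

    def put(self, url, data, unavailable=None):
        # 'unavailable' simulates a replica that is down for this write.
        for i, replica in enumerate(self.replicas):
            if i != unavailable:
                replica[url] = data

    def get(self, url, served_by=0):
        # A GET may be served by any replica, including a stale one.
        return self.replicas[served_by].get(url)


store = EventuallyConsistentStore()
store.put("/v1/acct/cont/obj", b"version-1")                 # all replicas see it
store.put("/v1/acct/cont/obj", b"version-2", unavailable=2)  # replica 2 misses it

print(store.get("/v1/acct/cont/obj", served_by=0))  # b'version-2'
print(store.get("/v1/acct/cont/obj", served_by=2))  # b'version-1' (stale!)
```

Until replica 2 catches up, readers cannot tell which version a GET of
that URL will return.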
This could be due to a preferred location
to store the object being temporarily unavailable during some part of
the write sequence.

Using unique URLs
-----------------

Next consider that the first write was to a new URL. No second write has
ever been performed to that URL, hence a subsequent read (GET) could
only return the first write’s data (or, of course, an error due to
unavailability). Thus there is no confusion as to which version of an
object’s data will be returned.

A filesystem would typically map itself logically onto a set of blocks
of storage. As blocks become unreferenced (e.g. due to a file’s
deletion), those blocks would be returned to a pool and re-used later.
If a file is modified, it is also possible to modify the underlying
blocks storing the data for that file directly. In either case,
confusion as to which version of a block’s contents will be returned
would be fatal to a file system.

The solution is to take advantage of the arbitrarily large namespace
offered by the URL model of the object storage system by only ever
writing to a given URL once. In this way, a file system will not be
confused by the underlying storage system returning stale versions of
file system data.

A filesystem is represented by inodes and extents
--------------------------------------------------

A filesystem must represent a number of data structures that are
persisted in the underlying storage system. The basic element of a file
system is termed an inode. An inode is a representation of familiar file
system concepts such as files and directories (though other types of
inodes are typically supported as well). Each inode has associated
metadata, typically of a fixed size. For file inodes, there is also a
mapping from logical extents (starting offset and length) to locations
in the storage system.
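Such an extent map can be pictured as a small table. The sketch below
uses hypothetical names and shapes (not the actual ProxyFS structures)
to show each logical extent mapped to a write-once object and an offset
within it:

```python
from collections import namedtuple

# Hypothetical extent shape, for illustration only.
Extent = namedtuple("Extent", ["file_offset", "length", "object_url", "object_offset"])

# A 12 KiB file whose data landed in two different write-once objects:
extent_map = [
    Extent(file_offset=0,    length=8192, object_url="/v1/a/c/obj-0001", object_offset=0),
    Extent(file_offset=8192, length=4096, object_url="/v1/a/c/obj-0007", object_offset=512),
]

def locate(extent_map, offset):
    """Translate a logical file offset to (object_url, offset_within_object)."""
    for e in extent_map:
        if e.file_offset <= offset < e.file_offset + e.length:
            return e.object_url, e.object_offset + (offset - e.file_offset)
    return None  # a hole, or past end-of-file

print(locate(extent_map, 10000))  # ('/v1/a/c/obj-0007', 2320)
```

A read at any logical offset is thus a lookup in this map followed by a
ranged GET of the named object; writes simply append new extents that
point at freshly written objects.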
For directory inodes, there is also a list of
mappings from names to inodes representing the directory’s contents. The
set of inodes, as well as the contents of each file and directory inode,
can be of arbitrary size and must support modification.

ProxyFS utilizes a B+Tree for filesystem inodes and extents
-----------------------------------------------------------

Support for arbitrary size in each of the above can be accomplished by
many means. ProxyFS utilizes a B+Tree structure. Lookups, insertions,
modifications, and deletions in a B+Tree are able to make efficient use
of in-memory caching to achieve high performance with the ability to
scale massively.

The drawback of the B+Tree structure is in how data is typically
persisted in the underlying storage. B+Trees attempt to keep the nodes
of the tree below a maximum size such that they may be modified in
place. When the underlying storage is an eventually consistent object
storage system, modifications in place are problematic due to this lack
of strong consistency. Instead, ProxyFS uses a write-journaling approach
that selects a unique object into which updates are made. Each
modification of a B+Tree must therefore affect all nodes from the
updated leaf node all the way up to the root node. When these are each
written to the object storage system, only the root node’s location need
be known in order to access a consistent view of the filesystem element
(be that a file or directory).

Like any write-journaling system, portions of the journal become
unreferenced over time. For ProxyFS, the unreferenced elements are
portions of objects or entire objects. Garbage collection operations
include compaction (where the still-referenced portions of objects are
copied to new objects) and simple object deletion (for objects not
currently referenced).
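The compaction case can be sketched in a few lines (illustrative only,
with a plain dict standing in for the object store and hypothetical
extent records): the still-referenced byte ranges of a mostly dead
object are copied into a fresh write-once object, the extent map is
repointed, and the old object becomes deletable.

```python
def compact(store, extent_map, old_url, new_url):
    """Copy the still-referenced ranges of old_url into new_url (a fresh,
    never-before-written URL), repointing extents as we go."""
    new_data = bytearray()
    new_map = []
    for e in extent_map:
        if e["object_url"] == old_url:
            start = e["object_offset"]
            chunk = store[old_url][start:start + e["length"]]
            # Repoint this extent at its new home before appending the data.
            e = {**e, "object_url": new_url, "object_offset": len(new_data)}
            new_data += chunk
        new_map.append(e)
    store[new_url] = bytes(new_data)   # single PUT of the compacted object
    del store[old_url]                 # old object is now fully unreferenced
    return new_map


# A 64-byte object of which only bytes 0-7 and 32-39 are still referenced:
store = {"/v1/a/c/obj-0001": bytes(range(64))}
extent_map = [
    {"file_offset": 0, "length": 8, "object_url": "/v1/a/c/obj-0001", "object_offset": 0},
    {"file_offset": 8, "length": 8, "object_url": "/v1/a/c/obj-0001", "object_offset": 32},
]
extent_map = compact(store, extent_map, "/v1/a/c/obj-0001", "/v1/a/c/obj-0002")
print(len(store["/v1/a/c/obj-0002"]))  # 16 (bytes surviving out of 64)
```

Note that compaction itself obeys the write-once rule: the surviving
data goes into a brand-new URL rather than being rewritten in place.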
Such activity must be balanced with the workload being
applied to both the file system and the underlying storage system.
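The copy-on-write update at the heart of this scheme, rewriting every
node on the path from a modified leaf up to the root into a never-reused
location, can be sketched as follows (a toy two-level tree with
hypothetical names, not the ProxyFS implementation):

```python
import itertools

_seq = itertools.count(1)

def put_once(store, payload):
    """Write payload to a brand-new URL (never overwritten); return the URL."""
    url = f"/v1/a/c/log-{next(_seq):08d}"
    assert url not in store  # the write-once invariant
    store[url] = payload
    return url

def update_leaf(store, root_url, leaf_key, new_value):
    """Copy-on-write update: rewrite the touched leaf and every ancestor
    up to the root, each into a fresh object; return the new root URL."""
    root = dict(store[root_url])            # copy; never modify in place
    leaf = dict(store[root[leaf_key]])
    leaf["value"] = new_value
    root[leaf_key] = put_once(store, leaf)  # new leaf object
    return put_once(store, root)            # new root object


store = {}
leaf_a = put_once(store, {"value": "old"})
root1 = put_once(store, {"a": leaf_a})
root2 = update_leaf(store, root1, "a", "new")

# Only the new root's URL need be recorded; the old tree remains intact.
print(store[store[root2]["a"]]["value"])  # 'new'
print(store[store[root1]["a"]]["value"])  # 'old'
```

Because every node lands at a fresh URL, a reader holding either root
URL sees a fully consistent snapshot, no matter how stale any
individual replica of the object store happens to be.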