# muxfys

[![GoDoc](https://godoc.org/github.com/VertebrateResequencing/muxfys?status.svg)](https://godoc.org/github.com/VertebrateResequencing/muxfys)
[![Go Report Card](https://goreportcard.com/badge/github.com/VertebrateResequencing/muxfys)](https://goreportcard.com/report/github.com/VertebrateResequencing/muxfys)
[![Build Status](https://travis-ci.org/VertebrateResequencing/muxfys.svg?branch=master)](https://travis-ci.org/VertebrateResequencing/muxfys)
[![Coverage Status](https://coveralls.io/repos/github/VertebrateResequencing/muxfys/badge.svg?branch=master)](https://coveralls.io/github/VertebrateResequencing/muxfys?branch=master)

    go get github.com/VertebrateResequencing/muxfys

muxfys is a pure Go library for temporarily mounting, in-process, multiple
different remote file systems or object stores on the same mount point as a
"filey" system. Currently only support for S3-like systems has been implemented.

It is fast and easy to use, with nothing else to install and no root
permissions needed (except to initially install/configure fuse: on old
Linux you may need to install fuse-utils, and on macOS you'll need to install
osxfuse; for both you must ensure that 'user_allow_other' is set in
/etc/fuse.conf or equivalent).

It has good S3 compatibility, working with AWS Signature Version 4 (Amazon S3,
Minio, et al.) and AWS Signature Version 2 (Google Cloud Storage, OpenStack
Swift, Ceph Object Gateway, Riak CS, et al.).

It allows "multiplexing": you can mount multiple different S3 buckets (or
subdirectories of the same bucket) on the same local directory. This makes the
commands you want to run against the files in your buckets much simpler, e.g.
instead of mounting s3://publicbucket, s3://myinputbucket and s3://myoutputbucket
on separate mount points and running:

    $ myexe -ref /mnt/publicbucket/refs/human/ref.fa -i /mnt/myinputbucket/xyz/123/input.file > /mnt/myoutputbucket/xyz/123/output.file

you could multiplex the 3 buckets (at the desired paths) on to the directory you
will work from and just run:

    $ myexe -ref ref.fa -i input.file > output.file

It is a "filey" system ('fys' instead of 'fs') in that it cares about
performance and efficiency first, and POSIX second. It is designed around a
particular use case:

Non-interactively read a small handful of files whose paths you already know,
probably a few times for small files and only once for large files, then upload
a few large files. E.g. we want to mount S3 buckets that contain thousands of
unchanging cache files, and a few big input files that we process using those
cache files, and finally generate some results.

In particular this means we hold on to directory and file attributes forever and
assume they don't change externally. Permissions are ignored and only you get
read/write access.

When using muxfys, you 1) mount, 2) do something that needs the files in your S3
bucket(s), 3) unmount. Then repeat 1-3 for other things that need data in your
S3 buckets.

# Performance

To get a basic sense of performance, a 1GB file in a Ceph Object Gateway S3
bucket was read, twice in a row for tools with caching, using the methods that
worked for me (I had to hack minfs to get it to work); units are seconds
(average of 3 attempts) needed to read the whole file:

| method         | fresh | cached |
|----------------|-------|--------|
| s3cmd          | 5.9   | n/a    |
| mc             | 7.9   | n/a    |
| minfs          | 40    | n/a    |
| s3fs           | 12.1  | n/a    |
| s3fs caching   | 12.2  | 1.0    |
| muxfys         | 5.7   | n/a    |
| muxfys caching | 5.8   | 0.7    |

I.e.
minfs is very slow, and muxfys is about 2x faster than s3fs, with no
noticeable performance penalty for fuse mounting vs simply downloading the files
you need to local disk. (You also get the benefit of being able to seek and read
only small parts of the remote file, without having to download the whole
thing.)

The same story holds true when performing the above test 100 times
~simultaneously; while some reads take much longer due to Ceph/network overload,
muxfys remains on average twice as fast as s3fs. The only significant change is
that s3cmd starts to fail.

For a real-world test, some data processing and analysis was done with samtools,
a tool that can end up reading small parts of very large files.
www.htslib.org/workflow was partially followed to map fastqs with 441 read pairs
(extracted from an old human chr20 mapping). Mapping, sorting and calling were
carried out, in addition to creating and viewing a cram. The different caching
strategies used were: cup == reference-related files cached, fastq files
uncached, working in a normal POSIX directory; cuf == as cup, but working in a
fuse mounted writable directory; uuf == as cuf, but with no caching for the
reference-related files. The local(mc) method involved downloading all files
with mc first, with the cached result being the maximum possible performance:
that of running bwa and samtools when all required files are accessed from the
local POSIX filesystem. Units are seconds (average of 3 attempts):

| method     | fresh | cached |
|------------|-------|--------|
| local(mc)  | 157   | 40     |
| s3fs.cup   | 175   | 50     |
| muxfys.cup | 80    | 45     |
| muxfys.cuf | 79    | 44     |
| muxfys.uuf | 88    | n/a    |

I.e. muxfys is about 2x faster than just downloading all required files manually,
and over 2x faster than using s3fs. There isn't much performance loss when the
data is cached vs maximum possible performance.
There's no noticeable penalty
(indeed it's a little faster) for working directly in a muxfys-mounted
directory.

Finally, to compare to a highly optimised tool written in C that has built-in
support (via libcurl) for reading from S3, samtools was once again used, this
time to read 100bp (the equivalent of a few lines) from an 8GB indexed cram
file. The builtin(mc) method involved downloading the single required cram cache
file from S3 first using mc, then relying on samtools' built-in S3 support by
giving it the s3:// path to the cram file; the cached result involves samtools
reading this cache file and the cram's index files from the local POSIX
filesystem, but it still reads cram data itself from the remote S3 system. The
other methods used samtools normally, giving it paths within the fuse mount(s)
created. The different caching strategies used were: cu == reference-related
files cached, cram-related files uncached; cc == everything cached; uu ==
nothing cached. Units are seconds (average of 3 attempts):

| method      | fresh | cached |
|-------------|-------|--------|
| builtin(mc) | 1.3   | 0.5    |
| s3fs.cu     | 4.3   | 1.7    |
| s3fs.cc     | 4.4   | 0.5    |
| s3fs.uu     | 4.4   | 2.2    |
| muxfys.cu   | 0.3   | 0.1    |
| muxfys.cc   | 0.3   | 0.06   |
| muxfys.uu   | 0.3   | 0.1    |

I.e. muxfys is much faster than s3fs (more than 2x faster, probably due to much
faster and more efficient stating of files), and using it also gives a
significant benefit over using a tool's built-in support for S3.

# Status & Limitations

The only `RemoteAccessor` implemented so far is for S3-like object stores.

In cached mode, random reads and writes have been implemented.

In non-cached mode, random reads and serial writes have been implemented.
(It is unlikely that random uncached writes will be implemented.)
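The two modes above are selected per mount via the `RemoteConfig` used at mount
time. As a hedged sketch only (using the `RemoteConfig` fields shown in the
Usage section below; `accessor` is assumed to come from
`muxfys.NewS3Accessor`), toggling `CacheData` chooses between them:

```go
// Sketch: cached vs non-cached mounts differ only in their RemoteConfig.
// `accessor` is assumed to be a *muxfys.S3Accessor, as created in Usage.

// Cached mode: random reads and random writes are supported.
cached := &muxfys.RemoteConfig{
	Accessor:  accessor,
	CacheData: true,
	Write:     true,
}

// Non-cached mode: random reads work, but writes must be serial.
uncached := &muxfys.RemoteConfig{
	Accessor: accessor,
	Write:    true,
}
```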
Non-POSIX behaviours:

* does not store file mode/owner/group
* does not support hardlinks
* symlinks are only supported temporarily in a cached writeable mount: they
  can be created and used, but do not get uploaded
* `atime` (and typically `ctime`) is always the same as `mtime`
* `mtime` of files is not stored remotely (remote file mtimes are of their
  upload time, and muxfys only guarantees that files are uploaded in the order
  of their mtimes)
* does not upload empty directories, and can't rename remote directories
* `fsync` is ignored; files are only flushed on `close`

# Guidance

`CacheData: true` will usually give you the best performance. Not setting an
explicit CacheDir will also give the best performance: if you read a small
part of a large file, only the part you read will be downloaded and cached in
the unique CacheDir.

Only turn on `Write` mode if you have to write.

Use `CacheData: false` if you will read more data than can be stored on local
disk.

If you know that you will definitely end up reading the same data multiple times
(either during a mount, or from different mounts) on the same machine, and have
sufficient local disk space, use `CacheData: true` and set an explicit CacheDir
(with a constant absolute path, eg. starting in /tmp). Doing this results in any
file read downloading the whole remote file to cache it, which can be wasteful
if you only need to read a small part of a large file. (But this is the only way
that muxfys can coordinate the cache amongst independent processes.)
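That last recommendation can be sketched as follows (a hedged example using the
`RemoteConfig` fields from the Usage section; the cache path is hypothetical,
and `accessor` is assumed to come from `muxfys.NewS3Accessor`):

```go
// Sketch: a constant, absolute CacheDir lets independent processes on the
// same machine share one cache. The trade-off: with an explicit CacheDir,
// reading any part of a file downloads the whole remote file.
sharedConfig := &muxfys.RemoteConfig{
	Accessor:  accessor,                  // from muxfys.NewS3Accessor, as in Usage
	CacheData: true,
	CacheDir:  "/tmp/muxfys_shared_cache", // hypothetical constant absolute path
}
```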
# Usage

```go
import (
    "log"
    "os"

    "github.com/VertebrateResequencing/muxfys"
)

// fully manual S3 configuration
accessorConfig := &muxfys.S3Config{
    Target:    "https://s3.amazonaws.com/mybucket/subdir",
    Region:    "us-east-1",
    AccessKey: os.Getenv("AWS_ACCESS_KEY_ID"),
    SecretKey: os.Getenv("AWS_SECRET_ACCESS_KEY"),
}
accessor, err := muxfys.NewS3Accessor(accessorConfig)
if err != nil {
    log.Fatal(err)
}
remoteConfig1 := &muxfys.RemoteConfig{
    Accessor: accessor,
    CacheDir: "/tmp/muxfys/cache",
    Write:    true,
}

// or read configuration from standard AWS S3 config files and environment
// variables
accessorConfig, err = muxfys.S3ConfigFromEnvironment("default",
    "myotherbucket/another/subdir")
if err != nil {
    log.Fatalf("could not read config from environment: %s\n", err)
}
accessor, err = muxfys.NewS3Accessor(accessorConfig)
if err != nil {
    log.Fatal(err)
}
remoteConfig2 := &muxfys.RemoteConfig{
    Accessor:  accessor,
    CacheData: true,
}

cfg := &muxfys.Config{
    Mount:     "/tmp/muxfys/mount",
    CacheBase: "/tmp",
    Retries:   3,
    Verbose:   true,
}

fs, err := muxfys.New(cfg)
if err != nil {
    log.Fatalf("bad configuration: %s\n", err)
}

err = fs.Mount(remoteConfig1, remoteConfig2)
if err != nil {
    log.Fatalf("could not mount: %s\n", err)
}
fs.UnmountOnDeath()

// read from & write to files in /tmp/muxfys/mount, which contains the
// contents of mybucket/subdir and myotherbucket/another/subdir; writes will
// get uploaded to mybucket/subdir when you Unmount()

err = fs.Unmount()
if err != nil {
    log.Fatalf("could not unmount: %s\n", err)
}

logs := fs.Logs()
```

# Provenance

There are many ways of accessing data in S3 buckets. Common tools include s3cmd
for direct up/download of particular files, and s3fs for fuse-mounting a bucket.
But these are not written in Go.

Amazon provide aws-sdk-go for interacting with S3, but this does not work with
(my) Ceph Object Gateway and possibly other implementations of S3.

minio-go is an alternative Go library that provides good compatibility with a
wide variety of S3-like systems.

There are at least 3 Go libraries for creating fuse-mounted file-systems.
github.com/jacobsa/fuse was based on bazil.org/fuse, claiming higher
performance. Also claiming high performance is github.com/hanwen/go-fuse.

There are at least 2 projects that implement fuse-mounting of S3 buckets:

* github.com/minio/minfs is implemented using minio-go and bazil, but in my
  hands was very slow. It is designed to be run as root, requiring file-based
  configuration.
* github.com/kahing/goofys is implemented using aws-sdk-go and jacobsa/fuse,
  making it incompatible with (my) Ceph Object Gateway.

Both are designed to be run as daemons as opposed to being used in-process.

muxfys is implemented using minio-go for compatibility, and hanwen/go-fuse for
speed. (In my testing, hanwen/go-fuse and jacobsa/fuse did not have noticeably
different performance characteristics, but go-fuse was easier to write for.)
However, some of its read code is inspired by goofys. Thanks to minimising
calls to the remote S3 system, and only implementing what S3 is generally
capable of, it shares and adds to goofys' non-POSIX behaviours.

## Versioning

This project adheres to [Semantic Versioning](http://semver.org/). See
CHANGELOG.md for a description of changes.

If you want to rely on a stable API, vendor the library, updating within a
desired version. For example, you could use [Glide](https://glide.sh) and:

    $ glide get github.com/VertebrateResequencing/muxfys#^2.0.0