Introduction
------------

This feature allows MinIO to serve a shared NAS drive across multiple MinIO instances. No special configuration changes are required to enable this feature. Access to files stored on the NAS volume is locked and synchronized by default.

Motivation
----------

Since each MinIO instance serves a single tenant, users increasingly want to run multiple MinIO instances on the same backend, managed by an existing NAS (NFS, GlusterFS, or other distributed filesystems) rather than a local disk. This feature is also implemented with minimal disruption to the user and the overall UI in mind.

Restrictions
------------

* A PutObject() is blocked and waits if another GetObject() is in progress.
* A CompleteMultipartUpload() is blocked and waits if another PutObject() or GetObject() is in progress.
* Cannot run FS mode as a remote disk RPC.

## How To Run?

Running MinIO instances on a shared backend is no different from running on a stand-alone disk. No special configuration changes are required to enable this feature. Access to files stored on the NAS volume is locked and synchronized by default. The following examples show this for each supported operating system:

### Ubuntu 16.04 LTS

Example 1: Start a MinIO instance on a shared backend mounted and available at `/path/to/nfs-volume`.

On Linux server1
```shell
minio gateway nas /path/to/nfs-volume
```

On Linux server2
```shell
minio gateway nas /path/to/nfs-volume
```

### Windows 2012 Server

Example 1: Start a MinIO instance on a shared backend mounted and available at `\\remote-server\cifs`.

On Windows server1
```cmd
minio.exe gateway nas \\remote-server\cifs\data
```

On Windows server2
```cmd
minio.exe gateway nas \\remote-server\cifs\data
```

Alternatively, if `\\remote-server\cifs` is mounted as the `D:\` drive:

On Windows server1
```cmd
minio.exe gateway nas D:\data
```

On Windows server2
```cmd
minio.exe gateway nas D:\data
```

Architecture
------------

## POSIX/Win32 Locks

### Lock process

Within the same MinIO instance, locking is handled by the existing in-memory namespace locks (**sync.RWMutex** et al.). To synchronize locks between many MinIO instances, we use POSIX `fcntl()` locks on Unix systems and the `LockFileEx()` Win32 API on Windows. A write lock request blocks if a neighboring MinIO instance holds any read locks on the same path. Likewise, a read lock request blocks while a write lock is held on that path.

### Unlock process

Filesystem locks are released simply by closing the file descriptor (fd) that was originally opened for the lock operation. Closing the fd tells the kernel to relinquish all the locks held on the path by the current process. This gets trickier when the same process has many readers on the same path: closing an fd would relinquish the locks for all concurrent readers as well. To handle this properly, a simple fd reference count is implemented, and the same fd is shared between many readers. Each time a reader closes the fd, the reference count is decremented. Once the reference count reaches zero, we can be sure that no readers are active, so we close the underlying file descriptor, which relinquishes the read lock held on the path.

This does not apply to writes, because any unique object has only one writer at a time, while it can have many readers.

## Handling Concurrency

An example here shows how the contention is handled with GetObject().

GetObject() holds a read lock on `fs.json`.

```go
fsMetaPath := pathJoin(fs.fsPath, minioMetaBucket, bucketMetaPrefix, bucket, object, fsMetaJSONFile)
rlk, err := fs.rwPool.Open(fsMetaPath)
if err != nil {
	return toObjectErr(err, bucket, object)
}
defer rlk.Close()

// ... you can perform other operations here ...

_, err = io.Copy(writer, reader)

// ... after a successful copy, the deferred Close() releases the read lock ...
```

When a concurrent PutObject() is requested on the same object, PutObject() attempts a write lock on `fs.json`.

```go
fsMetaPath := pathJoin(fs.fsPath, minioMetaBucket, bucketMetaPrefix, bucket, object, fsMetaJSONFile)
wlk, err := fs.rwPool.Create(fsMetaPath)
if err != nil {
	return ObjectInfo{}, toObjectErr(err, bucket, object)
}
// This close will allow for locks to be synchronized on `fs.json`.
defer wlk.Close()
```

From the snippets above, note that the following call in PutObject() blocks until GetObject() has finished writing to the client.

```go
wlk, err := fs.rwPool.Create(fsMetaPath)
```

This restriction is needed so that corrupted data is not returned to the client in the middle of I/O. The logic works the other way around as well: while a PutObject() is in progress, a GetObject() waits for the PutObject() to complete.

### Caveats (concurrency)

Consider, for example, three servers sharing the same backend.

On minio1

- DeleteObject(object1) --> lock acquired on `fs.json` while object1 is being deleted.

On minio2

- PutObject(object1) --> lock waiting until DeleteObject finishes.

On minio3

- PutObject(object1) --> (concurrent request while PutObject on minio2 is checking whether `fs.json` exists)

Once the lock is acquired, minio2 validates that the file really exists, to avoid holding a lock on an fd whose file has already been deleted.
However, this opens a race with a third server that is also attempting to write the same file before minio2 can validate that the file exists. It is possible that `fs.json` is created in that window, so the lock acquired by minio2 might be invalid, which can lead to an inconsistency.

This is a known problem and cannot be solved by POSIX `fcntl()` locks. It is considered a limitation of shared filesystems.