storj.io/minio@v0.0.0-20230509071714-0cbc90f649b1/docs/shared-backend/DESIGN.md

Introduction [![Slack](https://slack.min.io/slack?type=svg)](https://slack.min.io)
------------

This feature allows MinIO to serve a shared NAS drive across multiple MinIO instances. No special configuration changes are required to enable it. Access to files stored on the NAS volume is locked and synchronized by default.

Motivation
----------

Each MinIO instance serves a single tenant, so users increasingly want to run multiple MinIO instances on the same backend, managed by an existing NAS (NFS, GlusterFS, or other distributed filesystems) rather than a local disk. The feature is also implemented with minimal disruption to the user and the overall UI in mind.

Restrictions
------------

* A PutObject() is blocked and waits if another GetObject() on the same object is in progress.
* A CompleteMultipartUpload() is blocked and waits if another PutObject() or GetObject() on the same object is in progress.
* Cannot run FS mode as a remote disk RPC.

## How To Run?

Running MinIO instances on a shared backend is no different from running on a stand-alone disk. No special configuration changes are required, and access to files stored on the NAS volume is locked and synchronized by default. The following examples show the commands for each operating system:

### Ubuntu 16.04 LTS

Example 1: Start MinIO instances on a shared backend mounted and available at `/path/to/nfs-volume`.

On linux server1
```shell
minio gateway nas /path/to/nfs-volume
```

On linux server2
```shell
minio gateway nas /path/to/nfs-volume
```

### Windows 2012 Server

Example 1: Start MinIO instances on a shared backend mounted and available at `\\remote-server\cifs`.

On windows server1
```cmd
minio.exe gateway nas \\remote-server\cifs\data
```

On windows server2
```cmd
minio.exe gateway nas \\remote-server\cifs\data
```

Alternatively, if `\\remote-server\cifs` is mounted as the `D:\` drive:

On windows server1
```cmd
minio.exe gateway nas D:\data
```

On windows server2
```cmd
minio.exe gateway nas D:\data
```

Architecture
------------

## POSIX/Win32 Locks

### Lock process

Within the same MinIO instance, locking is handled by the existing in-memory namespace locks (**sync.RWMutex** et al.). To synchronize locks between multiple MinIO instances, we leverage POSIX `fcntl()` locks on Unix systems and the `LockFileEx()` Win32 API on Windows. A write lock request blocks if a neighboring MinIO instance holds any read locks on the same path. Likewise, a read lock request blocks while a write lock is active on that path.

### Unlock process

A filesystem lock is released by closing the file descriptor (fd) that was originally opened for the lock operation. Closing the fd tells the kernel to relinquish all locks the current process holds on the path. This gets trickier when the same process has many readers on the same path: closing one fd would relinquish the locks for all concurrent readers as well. To manage this, a simple fd reference count is implemented, and the same fd is shared among all readers. Each reader that closes the fd decrements the reference count. Once the count reaches zero, no readers remain active, so the underlying file descriptor is closed, which relinquishes the read lock held on the path.

This does not apply to writes, because any unique object has only one writer at a time, while it may have many readers.

## Handling Concurrency

The following example shows how contention is handled with GetObject().

GetObject() holds a read lock on `fs.json`.

```go
fsMetaPath := pathJoin(fs.fsPath, minioMetaBucket, bucketMetaPrefix, bucket, object, fsMetaJSONFile)
rlk, err := fs.rwPool.Open(fsMetaPath)
if err != nil {
	return toObjectErr(err, bucket, object)
}
defer rlk.Close()

// ... you can perform other operations here ...

_, err = io.Copy(writer, reader)

// ... after a successful copy, the deferred rlk.Close() releases the read lock ...
```

A concurrent PutObject() is then requested on the same object. PutObject() attempts a write lock on `fs.json`.

```go
fsMetaPath := pathJoin(fs.fsPath, minioMetaBucket, bucketMetaPrefix, bucket, object, fsMetaJSONFile)
wlk, err := fs.rwPool.Create(fsMetaPath)
if err != nil {
	return ObjectInfo{}, toObjectErr(err, bucket, object)
}
// This close will allow for locks to be synchronized on `fs.json`.
defer wlk.Close()
```

From the snippets above, one can see that the following line blocks until GetObject() has finished writing to the client.

```go
wlk, err := fs.rwPool.Create(fsMetaPath)
```

This restriction is needed so that corrupted data is not returned to the client in the middle of I/O. The logic works the other way around as well: while a PutObject() is in progress, GetObject() waits for the PutObject() to complete.

### Caveats (concurrency)

Consider, for example, three servers sharing the same backend.

On minio1

- DeleteObject(object1) --> lock acquired on `fs.json` while object1 is being deleted.

On minio2

- PutObject(object1) --> lock waiting until DeleteObject finishes.

On minio3

- PutObject(object1) --> (concurrent request, issued while the PutObject on minio2 is checking whether `fs.json` exists)

Once the lock is acquired, minio2 validates that the file really exists, to avoid holding a lock on an fd whose file has already been deleted. But this opens a race with the third server, which is also attempting to write the same file before minio2 can validate that the file exists. It is possible that `fs.json` has been created in the meantime, so the lock acquired by minio2 might be invalid, which can lead to an inconsistency.

This is a known problem and cannot be solved by POSIX fcntl locks. It is considered a limit of shared filesystems.