Introduction [![Slack](https://slack.min.io/slack?type=svg)](https://slack.min.io)
------------

This feature allows multiple MinIO instances to serve a shared NAS drive. No special configuration changes are required to enable it. Access to files stored on the NAS volume is locked and synchronized by default.

Motivation
----------

Since each MinIO instance serves a single tenant, users increasingly want to run multiple MinIO instances on the same backend, managed by an existing NAS (NFS, GlusterFS, or another distributed filesystem) rather than a local disk. This feature is also implemented with minimal disruption to the user and the overall UI in mind.

Restrictions
------------

* A PutObject() is blocked and waits if a GetObject() on the same object is in progress.
* A CompleteMultipartUpload() is blocked and waits if a PutObject() or GetObject() on the same object is in progress.
* Cannot run FS mode as a remote disk RPC.

## How To Run?

Running MinIO instances on a shared backend is no different from running on a stand-alone disk. As noted in the introduction, no special configuration changes are required, and access to files stored on the NAS volume is locked and synchronized by default. The following examples show the commands for each operating system:

### Ubuntu 16.04 LTS

Example 1: Start MinIO instances on a shared backend mounted and available at `/path/to/nfs-volume`.

On Linux server1
```shell
minio gateway nas /path/to/nfs-volume
```

On Linux server2
```shell
minio gateway nas /path/to/nfs-volume
```

### Windows 2012 Server

Example 1: Start MinIO instances on a shared backend mounted and available at `\\remote-server\cifs`.

On Windows server1
```cmd
minio.exe gateway nas \\remote-server\cifs\data
```

On Windows server2
```cmd
minio.exe gateway nas \\remote-server\cifs\data
```

Alternatively, if `\\remote-server\cifs` is mounted as the `D:\` drive:

On Windows server1
```cmd
minio.exe gateway nas D:\data
```

On Windows server2
```cmd
minio.exe gateway nas D:\data
```

Architecture
------------------

## POSIX/Win32 Locks

### Lock process

Within a single MinIO instance, locking is handled by the existing in-memory namespace locks (**sync.RWMutex** et al.). To synchronize locks between multiple MinIO instances, we use POSIX `fcntl()` locks on Unix systems and the `LockFileEx()` Win32 API on Windows. A write-lock request blocks if a neighboring MinIO instance holds any read lock on the same path. Likewise, a read-lock request blocks while a write lock on that path is held.

### Unlock process

A filesystem lock is released by closing the file descriptor (fd) that was originally opened for the lock operation. Closing the fd tells the kernel to relinquish all locks the current process holds on that path. This gets trickier when the same process has many readers on the same path, because closing one fd would relinquish the locks for all concurrent readers as well. To handle this, a simple fd reference count is implemented and the same fd is shared between all readers. Each reader that closes the fd decrements the reference count; once the count reaches zero, no readers remain active, so the underlying file descriptor is closed, which relinquishes the read lock held on the path.

This does not apply to writes, because any unique object has at most one writer at a time, while it can have many readers.

## Handling Concurrency

The following example shows how contention is handled with GetObject().

GetObject() holds a read lock on `fs.json`.

```go
	fsMetaPath := pathJoin(fs.fsPath, minioMetaBucket, bucketMetaPrefix, bucket, object, fsMetaJSONFile)
	rlk, err := fs.rwPool.Open(fsMetaPath)
	if err != nil {
		return toObjectErr(err, bucket, object)
	}
	// The read lock is held until this deferred Close() runs.
	defer rlk.Close()

	// ... you can perform other operations here ...

	_, err = io.Copy(writer, reader)

	// ... once the copy completes and the function returns, the read lock is released ...
```

A concurrent PutObject() is then requested on the same object, and it attempts to take a write lock on `fs.json`.

```go
	fsMetaPath := pathJoin(fs.fsPath, minioMetaBucket, bucketMetaPrefix, bucket, object, fsMetaJSONFile)
	wlk, err := fs.rwPool.Create(fsMetaPath)
	if err != nil {
		return ObjectInfo{}, toObjectErr(err, bucket, object)
	}
	// This close will allow for locks to be synchronized on `fs.json`.
	defer wlk.Close()
```

From the snippets above, notice that the following line blocks until GetObject() has finished writing to the client.

```go
	wlk, err := fs.rwPool.Create(fsMetaPath)
```

This restriction is needed so that corrupted data is never returned to the client in the middle of I/O. The logic also works the other way around: while a PutObject() is in progress, a GetObject() waits for it to complete.

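The ordering this produces can be seen with the in-process equivalent, a `sync.RWMutex`, which the lock process above names as the mechanism used within a single instance. The sketch below is only an analogue of the cross-instance behavior: `run` and the event strings are hypothetical, and no file locks are involved.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// run simulates one GetObject holding a read lock while a concurrent
// PutObject waits for the write lock, and returns the order of events.
func run() []string {
	var (
		lk     sync.RWMutex // stands in for the lock on fs.json
		mu     sync.Mutex   // protects events
		events []string
	)
	record := func(e string) {
		mu.Lock()
		events = append(events, e)
		mu.Unlock()
	}

	lk.RLock() // GetObject starts streaming to the client
	done := make(chan struct{})
	go func() {
		lk.Lock() // PutObject blocks here until the reader unlocks
		record("put: write lock acquired")
		lk.Unlock()
		close(done)
	}()

	time.Sleep(50 * time.Millisecond) // the writer is parked on Lock() meanwhile
	record("get: copy finished")
	lk.RUnlock() // only now can the writer proceed
	<-done
	return events
}

func main() {
	for _, e := range run() {
		fmt.Println(e)
	}
	// prints:
	// get: copy finished
	// put: write lock acquired
}
```
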
### Caveats (concurrency)

Consider, for example, three servers sharing the same backend.

On minio1

- DeleteObject(object1) --> lock acquired on `fs.json` while object1 is being deleted.

On minio2

- PutObject(object1) --> lock waiting until DeleteObject finishes.

On minio3

- PutObject(object1) --> (concurrent request while PutObject on minio2 is checking whether `fs.json` exists)

Once the lock is acquired, minio2 validates that the file really exists, to avoid holding a lock on an fd whose file has already been deleted. However, this races with the third server, which is attempting to write the same file before minio2 can finish that validation. `fs.json` may be created in that window, so the lock acquired by minio2 may be invalid, which can lead to an inconsistency.

This is a known problem that cannot be solved with POSIX fcntl locks. It is considered a limit of the shared filesystem.