
# Distributed Server Design Guide [![Slack](https://slack.min.io/slack?type=svg)](https://slack.min.io)

This document explains the design, architecture and advanced use cases of the MinIO distributed server.

## Command-line

```
NAME:
  minio server - start object storage server

USAGE:
  minio server [FLAGS] DIR1 [DIR2..]
  minio server [FLAGS] DIR{1...64}
  minio server [FLAGS] DIR{1...64} DIR{65...128}

DIR:
  DIR points to a directory on a filesystem. When you want to combine
  multiple drives into a single large system, pass one directory per
  filesystem separated by space. You may also use a '...' convention
  to abbreviate the directory arguments. Remote directories in a
  distributed setup are encoded as HTTP(s) URIs.
```
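
The `{x...y}` ellipsis notation is simply shorthand for listing each directory argument explicitly; for example, the following two invocations describe the same four-drive setup:

```
minio server /data{1...4}
minio server /data1 /data2 /data3 /data4
```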

## Common usage

Standalone erasure coded configuration with 4 sets of 16 drives each (64 drives in total).

```
minio server dir{1...64}
```

Distributed erasure coded configuration with 64 sets of 16 drives each (16 hosts * 64 drives = 1024 drives in total).

```
minio server http://host{1...16}/export{1...64}
```

## Architecture

Expansion of ellipses and the choice of erasure sets based on this expansion are automated in MinIO. Here are some of the details of the underlying erasure coding behavior.

- MinIO uses the [Reed-Solomon](https://github.com/klauspost/reedsolomon) erasure coding scheme, which has a maximum of 256 total shards, i.e. 128 data and 128 parity. MinIO's design works around this limitation through practical architectural choices.

- An erasure set is a single erasure coding unit within a MinIO deployment. An object is sharded within an erasure set. The erasure set size is calculated automatically based on the number of drives. MinIO supports an unlimited number of drives, but each erasure set contains a minimum of 2 and a maximum of 16 drives.

- We limited the erasure set size to 16 drives because erasure coding with more than 16 shards becomes chatty without any performance advantage. Additionally, a 16-drive erasure set can tolerate the loss of up to 8 drives per object, which is plenty in any practical scenario.

- The erasure set size is chosen automatically based on the number of drives available. For example, with 32 servers and 32 drives each there are 1024 drives in total; in this scenario the erasure set size becomes 16. The choice is based on the greatest common divisor (GCD) of acceptable erasure set sizes, which range from *4 to 16*.

- *If the total number of drives has many common divisors, the algorithm chooses the minimum number of erasure sets possible for an erasure set size of any N*. In the example with 1024 drives, 4, 8 and 16 are GCD factors: with 16 drives per set we get 64 sets, with 8 drives per set we get 128 sets, and with 4 drives per set we get 256 sets. So the algorithm automatically chooses 64 sets, which is *16 \* 64 = 1024* drives in total.

- *If the total number of nodes is odd, the GCD algorithm gives affinity to erasure set sizes that distribute uniformly across nodes*. This ensures that the same number of drives from each node participates in any erasure set. For example, if you have 2 nodes with 180 drives, the GCD is 15, but this would lead to an uneven distribution: one of the nodes would contribute more drives to some sets. To avoid this, affinity is given to node-uniform layouts, which leads to the next best GCD factor of 12 and a uniform distribution. A simplified sketch of this selection logic is shown below.
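
The following is a minimal sketch of such a selection, using a hypothetical helper `chooseSetSize`; it is not MinIO's actual implementation (see `cmd/endpoint-ellipses.go` in this repository for the real logic), but it reproduces the two examples above by picking the largest candidate size between 4 and 16 that divides the total drive count and keeps the per-node drive contribution uniform.

```go
package main

import "fmt"

// chooseSetSize is a hypothetical, simplified sketch of erasure set size
// selection. It returns the largest candidate size between 4 and 16 that
// splits the total drive count into whole erasure sets and distributes
// drives uniformly across nodes, or 0 if no candidate fits.
func chooseSetSize(totalDrives, nodeCount int) int {
	for size := 16; size >= 4; size-- {
		if totalDrives%size != 0 {
			continue // must split into whole erasure sets
		}
		// Prefer layouts where every node contributes the same number of
		// drives to each erasure set (or a set fits entirely on one node).
		if size%nodeCount != 0 && nodeCount%size != 0 {
			continue
		}
		return size
	}
	return 0
}

func main() {
	// 32 nodes with 32 drives each: 1024 drives -> sets of 16 (64 sets).
	fmt.Println(chooseSetSize(1024, 32)) // 16
	// 180 drives across 2 nodes: 15 divides 180 but is not node-uniform,
	// so the next best factor 12 is chosen instead.
	fmt.Println(chooseSetSize(180, 2)) // 12
}
```
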
- In this algorithm, we also make sure that we spread the drives out evenly. The MinIO server expands the ellipses passed as arguments. Here is a sample expansion to demonstrate the process.

```
minio server http://host{1...2}/export{1...8}
```

Expected expansion

```
> http://host1/export1
> http://host2/export1
> http://host1/export2
> http://host2/export2
> http://host1/export3
> http://host2/export3
> http://host1/export4
> http://host2/export4
> http://host1/export5
> http://host2/export5
> http://host1/export6
> http://host2/export6
> http://host1/export7
> http://host2/export7
> http://host1/export8
> http://host2/export8
```

*A noticeable trait of this expansion is that it chooses unique hosts such that the setup provides maximum protection and availability.*
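
As a rough illustration of this interleaving (an assumption based on the expected expansion above, not the actual expansion code), the sketch below iterates over the drive index in the outer loop and the host index in the inner loop, so that consecutive endpoints always land on different hosts:

```go
package main

import "fmt"

// Reproduce the endpoint ordering shown above for
// http://host{1...2}/export{1...8}: the export index varies in the outer
// loop and the host index in the inner loop.
func main() {
	const hosts, exports = 2, 8
	for e := 1; e <= exports; e++ {
		for h := 1; h <= hosts; h++ {
			fmt.Printf("http://host%d/export%d\n", h, e)
		}
	}
}
```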

- The erasure set for an object is chosen during `PutObject()`: the object name is used to find the right erasure set using the following pseudo code.

```go
// hashes the key returning an integer.
func sipHashMod(key string, cardinality int, id [16]byte) int {
        if cardinality <= 0 {
                return -1
        }
        sip := siphash.New(id[:])
        sip.Write([]byte(key))
        return int(sip.Sum64() % uint64(cardinality))
}
```

The input key is the object name specified in `PutObject()`, and the function returns a unique index. This index identifies the erasure set where the object will reside. The function is a consistent hash for a given object name, i.e. for a given object name the returned index is always the same.
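
For illustration, here is a self-contained version of the same pseudo code, assuming the `github.com/dchest/siphash` package and a made-up deployment ID; it shows that the same object name always maps to the same erasure set index:

```go
package main

import (
	"fmt"

	"github.com/dchest/siphash"
)

// Same logic as the pseudo code above: hash the object name into one of
// `cardinality` erasure sets, keyed by the deployment ID.
func sipHashMod(key string, cardinality int, id [16]byte) int {
	if cardinality <= 0 {
		return -1
	}
	sip := siphash.New(id[:])
	sip.Write([]byte(key))
	return int(sip.Sum64() % uint64(cardinality))
}

func main() {
	// Made-up deployment ID and a cluster with 64 erasure sets.
	id := [16]byte{1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16}
	for i := 0; i < 3; i++ {
		// The same object name always yields the same set index.
		fmt.Println(sipHashMod("photos/2024/beach.jpg", 64, id))
	}
}
```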

- Write and read quorum are required to be satisfied only across the erasure set for an object. Healing is also done per object, within the erasure set which contains the object.

- MinIO does erasure coding at the object level, not at the volume level, unlike other object storage vendors. This allows applications to choose a different storage class per object by setting `x-amz-storage-class=STANDARD/REDUCED_REDUNDANCY` on each upload, effectively utilizing the capacity of the cluster. Additionally, this can be enforced using IAM policies to make sure clients upload with the correct HTTP headers (a client-side sketch follows after this list).

- MinIO also supports expansion of existing clusters in server pools. Each pool is a self-contained entity with the same SLAs (read/write quorum) for each object as the original cluster. By using the existing namespace for lookup validation, MinIO ensures conflicting objects are not created. When no such object exists, MinIO simply uses the least used pool to place new objects.
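
As a minimal client-side sketch of per-object storage classes (this is not part of the server; it assumes the `github.com/minio/minio-go/v7` SDK, and the endpoint, credentials, bucket, and object names below are placeholders), an application could request reduced redundancy for a single upload like this:

```go
package main

import (
	"context"
	"log"
	"strings"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	// Placeholder endpoint and credentials for illustration only.
	client, err := minio.New("minio.example.net:9000", &minio.Options{
		Creds:  credentials.NewStaticV4("ACCESS_KEY", "SECRET_KEY", ""),
		Secure: true,
	})
	if err != nil {
		log.Fatal(err)
	}

	data := strings.NewReader("hello")
	// Ask for the REDUCED_REDUNDANCY storage class for this object; the
	// server applies the corresponding erasure coding settings.
	_, err = client.PutObject(context.Background(), "mybucket", "myobject",
		data, int64(data.Len()), minio.PutObjectOptions{
			StorageClass: "REDUCED_REDUNDANCY",
		})
	if err != nil {
		log.Fatal(err)
	}
}
```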

### There are no limits on how many server pools can be combined

```
minio server http://host{1...32}/export{1...32} http://host{1...12}/export{1...12}
```

In the above example there are two server pools:

- pool 1: 32 * 32 = 1024 drives
- pool 2: 12 * 12 = 144 drives

> Notice the requirement of a common SLA here: the original cluster had 1024 drives with 16 drives per erasure set and a default parity of '4', so the second pool is expected to have a minimum of 8 drives per erasure set to match the original cluster's SLA (parity count) of '4'. A stripe of '12' drives per erasure set in the second pool satisfies the original pool's parity count.

Refer to the sizing guide [here](https://github.com/minio/minio/blob/master/docs/distributed/SIZING.md) for details on the default parity count chosen for different erasure stripe sizes.

MinIO places new objects in server pools based on proportionate free space, per pool. The following pseudo code demonstrates this behavior.

```go
func getAvailablePoolIdx(ctx context.Context) int {
        serverPools := z.getServerPoolsAvailableSpace(ctx)
        total := serverPools.TotalAvailable()
        // Pick a random point in the total available space; pools with more
        // free space cover a larger range and are proportionally more likely
        // to be chosen.
        choose := rand.Uint64() % total
        atTotal := uint64(0)
        for _, pool := range serverPools {
                atTotal += pool.Available
                if atTotal > choose && pool.Available > 0 {
                        return pool.Index
                }
        }
        // Should not happen, but print values just in case.
        panic(fmt.Errorf("reached end of serverPools (total: %v, atTotal: %v, choose: %v)", total, atTotal, choose))
}
```

## Other usages

### Advanced use cases with multiple ellipses

Standalone erasure coded configuration with 4 sets of 16 drives each, which spans drives across controllers (4 controllers * 16 drives = 64 drives).

```
minio server /mnt/controller{1...4}/data{1...16}
```

Standalone erasure coded configuration with 16 sets, 16 drives per set, across mounts and controllers (4 mounts * 4 controllers * 16 drives = 256 drives).

```
minio server /mnt{1...4}/controller{1...4}/data{1...16}
```

Distributed erasure coded configuration with 2 sets, 16 drives per set, across hosts (32 hosts * 1 drive = 32 drives).

```
minio server http://host{1...32}/disk1
```

Distributed erasure coded configuration with rack-level redundancy, 32 sets in total, 16 drives per set (4 racks * 8 hosts * 16 drives = 512 drives). Because the expansion spreads each erasure set across hosts in different racks, objects can remain available even if an entire rack goes offline.

```
minio server http://rack{1...4}-host{1...8}.example.net/export{1...16}
```