
- Feature Name: Copysets
- Status: draft
- Start Date: 2018-12-04
- Authors: Vijay Karthik, Mohammed Hassan
- RFC PR: (PR # after acceptance of initial draft)
- Cockroach Issue: [#25194](https://github.com/cockroachdb/cockroach/issues/25194)

# Table of Contents

- [Table of Contents](#table-of-contents)
- [Summary](#summary)
- [Motivation](#motivation)
- [Guide-level explanation](#guide-level-explanation)
  - [Design](#design)
    - [Managing copysets](#managing-copysets)
      - [Optimal diversity copyset allocation](#optimal-diversity-copyset-allocation)
      - [Minimize data movement copyset allocation](#minimize-data-movement-copyset-allocation)
      - [Swaps](#swaps)
      - [Copyset re-generation](#copyset-re-generation)
  - [Rebalancing ranges](#rebalancing-ranges)
    - [Copyset score](#copyset-score)
  - [Drawbacks](#drawbacks)
  - [Rationale and Alternatives](#rationale-and-alternatives)
    - [Chainsets](#chainsets)
  - [Testing scenarios](#testing-scenarios)

# Summary

Copysets reduce the probability of data loss in the presence of multi-node
failures in large clusters.

This RFC presents a design for integrating copysets in cockroach and discusses
its tradeoffs. Copysets were previously discussed in
[RFC #6484](https://github.com/cockroachdb/cockroach/pull/6484).

More details on copysets can be found in the
[academic literature](https://web.stanford.edu/~skatti/pubs/usenix13-copysets.pdf).

# Motivation
In large clusters, the simultaneous loss of multiple nodes carries a very high
probability of data loss. For example, consider a cluster of 100 nodes using a
replication factor of 3 and holding ~10k ranges. The simultaneous loss of 2 or
more nodes is very likely to cause data loss, since some range out of the 10k
could have 2 of its 3 replicas on the 2 lost nodes. This probability can be
reduced by adding localities to nodes, since cockroach tolerates the failure of
all nodes in a single locality, but the loss of two nodes in different
localities again has a high probability of data loss.

Copysets significantly reduce the probability of data loss in the presence
of multi-node failures.

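To make the risk concrete, here is a rough back-of-the-envelope calculation (a
sketch, not from the RFC) which assumes replicas are placed uniformly at random
and treats ranges as independent:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	const nodes, ranges = 100.0, 10000.0
	// Probability that a specific pair of failed nodes holds 2 of the 3
	// replicas of one given range: C(3,2) / C(100,2).
	pRange := 3.0 / (nodes * (nodes - 1) / 2)
	// Probability that at least one of the ~10k ranges loses quorum.
	pLoss := 1 - math.Pow(1-pRange, ranges)
	fmt.Printf("P(quorum loss | 2 failed nodes) ~= %.1f%%\n", pLoss*100)
	// Prints ~99.8%: with random placement, losing any 2 of 100 nodes
	// almost certainly removes 2 replicas of some range.
}
```
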
# Guide-level explanation
Copysets divide the cluster into disjoint sets of nodes. The size of each set
is based on the replication factors in use; separate copysets are created
for each replication factor. A range should prefer to allocate its replicas
within a copyset rather than spread its replicas across copysets.

There are two major components:
1. Managing copysets (which node belongs to which copyset).
   Copyset assignments should take the locality of nodes into account so that
   locality fault tolerance is not lost. Node additions, removals, and crashes
   should be taken into account when assigning nodes to copysets.
2. Rebalancing all replicas of a range to reside within a single copyset on a
   best-effort basis.
   Rebalancing replicas into copysets is important, but some properties, like
   constraints set by a user, should take priority over copysets.

Copysets will initially be an opt-in feature (based on a cluster setting) and
implemented for a scatter width of `replication_factor - 1` (eventually it will
be extended to support higher scatter widths).
**For simplicity, we will explain the design without considering scatter width
in copysets.**
## Design
### Managing copysets
The cluster will be divided into copysets. For each replication factor in the
cluster, separate copysets will be generated.

The requirements for copysets of a replication factor are:
1. There should be no overlap of nodes between copysets for a scatter width of
   rf - 1, and minimal overlap between copysets for scatter widths >= rf
   (where rf is the replication factor).
2. Copysets should be locality fault tolerant (each node in a copyset should
   preferably be from a different locality).
3. Copysets should rebalance on node additions, removals, and failures.

Copysets are generated for each replication factor used in the system.
Better failure tolerance could be provided if copysets for different
replication factors were aligned, but this is not the case in the presented
strategies.

Two possible strategies for copyset allocation are presented below.

#### Optimal diversity copyset allocation
Optimal allocation (for locality diversity) of copysets for a particular
replication factor can be done as follows:
```
1. Compute num_copysets = floor(num_stores/replication_factor).
2. Sort stores in increasing order of locality.
3. Assign copysets to stores in a round-robin fashion.
```
For example, consider the case where we have stores as follows:
```
Locality1: S1  S2  S3
Locality2: S4  S5  S6
Locality3: S7  S8  S9 S10
```
Copysets for RF 3 would be created as
```
num_copysets = floor(10/3) = 3
CS1: S1 S4 S7 S10
CS2: S2 S5 S8
CS3: S3 S6 S9
```
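
A minimal Go sketch of this strategy (hypothetical `Store` type and package
names; not the actual allocator code):

```go
package copysets

import "sort"

// Store is a simplified stand-in for a cockroach store descriptor.
type Store struct {
	ID       int
	Locality string
}

// optimalDiversityCopysets sorts stores by locality and deals them out
// round-robin, so stores from the same locality land in different copysets.
func optimalDiversityCopysets(stores []Store, rf int) map[int][]Store {
	numCopysets := len(stores) / rf // floor(num_stores / replication_factor)
	if numCopysets == 0 {
		return nil // not enough stores for even one copyset
	}
	sort.Slice(stores, func(i, j int) bool {
		if stores[i].Locality != stores[j].Locality {
			return stores[i].Locality < stores[j].Locality
		}
		return stores[i].ID < stores[j].ID
	})
	copysets := make(map[int][]Store, numCopysets)
	for i, s := range stores {
		copysets[i%numCopysets] = append(copysets[i%numCopysets], s)
	}
	return copysets
}
```

Running this on the ten stores above yields the same three copysets, with the
surplus store `S10` joining the first one.
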
#### Minimize data movement copyset allocation
In this strategy the goal is to minimize data movement when copysets are
regenerated with a different store list (some stores added, some stores
removed).

This allocation tries to create a copyset-store mapping (with incremental
changes over the previously used copysets) which is diverse in locality.
It tries to minimize the number of changes to the previously used copysets and
to ensure that each store in a copyset belongs to a different locality when
possible.
The allocation (sketched in Go after this list):
1. Computes the required number of copysets for the new store list.
2. Assigns previously existing stores to the same copyset ID they belonged to
   (if that copyset ID still exists based on (1)), as long as the copyset size
   stays below the replication factor.
3. Adds the newly added stores (those not present in the previous copyset
   allocation) and the stores remaining from (2) to the empty spots in each
   copyset (a copyset has an empty spot if it has fewer than
   replication-factor stores after the carried-over stores are assigned, or if
   it is the last copyset).
4. Swaps stores between copysets to avoid duplicate localities in a single
   copyset until this converges (diversity cannot be improved further).

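A sketch of steps 1 through 3, reusing the hypothetical `Store` type and
package from the sketch above (the swap phase of step 4 is covered in the next
section):

```go
// minimizeMovementCopysets rebuilds copysets for a new store list while
// keeping stores in their previous copyset where possible. prev maps a
// store ID to the copyset ID it belonged to in the previous allocation.
func minimizeMovementCopysets(stores []Store, prev map[int]int, rf int) map[int][]Store {
	numCopysets := len(stores) / rf // step 1
	if numCopysets == 0 {
		return nil // not enough stores for even one copyset
	}
	copysets := make(map[int][]Store, numCopysets)
	var unassigned []Store
	// Step 2: carry over stores whose previous copyset still exists and
	// has not yet reached the replication factor.
	for _, s := range stores {
		if id, ok := prev[s.ID]; ok && id < numCopysets && len(copysets[id]) < rf {
			copysets[id] = append(copysets[id], s)
		} else {
			unassigned = append(unassigned, s)
		}
	}
	// Step 3: place new and leftover stores into the remaining empty
	// spots; any surplus goes to the last copyset.
	for id := 0; id < numCopysets; id++ {
		for len(copysets[id]) < rf && len(unassigned) > 0 {
			copysets[id] = append(copysets[id], unassigned[0])
			unassigned = unassigned[1:]
		}
	}
	copysets[numCopysets-1] = append(copysets[numCopysets-1], unassigned...)
	return copysets
}
```
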
#### Swaps
Swaps are made between a source copyset and a target copyset, and guarantee
that the diversity of the source copyset increases while the diversity of the
target copyset does not decrease (or, if it decreases, it remains greater than
the replication factor).

Store swaps are made between a source copyset and a target copyset based
on the localities present in the source and target copysets. The conditions
required for a swap are:
1. The source copyset has diversity < replication factor. This means that
   the source copyset has two stores in a particular locality. One of these
   stores will be the source swap candidate.
2. The target copyset has a locality not present in the source copyset
   (let's call this the target locality). A store from this locality will be
   the target swap candidate.
3. One of the following is true:
   1. The locality of the source swap candidate is not present in the target
      copyset.
   2. The target copyset either
      1. has two stores in the target locality, or
      2. has diversity > replication factor.

By diversity above we mean the number of distinct localities in a copyset.

Point (3) above ensures that the diversity of the target copyset does not
decrease (or, if it decreases, it does not fall below the replication factor).

A single iteration of swaps considers all `(n choose 2)` copyset combinations,
where `n` is the number of copysets. These iterations continue until the sum of
the diversities of all copysets cannot be improved further (no swap candidates
are found for a whole iteration).

For example, consider the case where we have stores as follows:
```
Locality1: S1  S2  S3
Locality2: S4  S5  S6
Locality3: S7  S8  S9
Locality4: S10 S11 S12 S13
```
And an initial copyset allocation of
```
CS1: S1 S5 S9
CS2: S2 S6 S10
CS3: S3 S7 S11
CS4: S4 S8 S12 S13
```

Say store `S6` is removed.

After step 2 (assign stores to the same copyset ID until its size reaches rf),
we have
```
CS1: S1 S5 S9
CS2: S2 S10
CS3: S3 S7 S11
CS4: S4 S8 S12
```

After filling empty spots by adding the remaining stores (`S13` in this case)
```
CS1: S1 S5 S9
CS2: S2 S10 S13
CS3: S3 S7 S11
CS4: S4 S8 S12
```

After swaps (between `CS1` and `CS2`, since `CS2` has 2 stores from `Locality4`)
```
CS1: S1 S5 S13
CS2: S2 S10 S9
CS3: S3 S7 S11
CS4: S4 S8 S12
```

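The swap test itself can be sketched as follows (continuing the hypothetical
package above); in the example, swapping `S13` with `S9` passes this test:

```go
// localities returns locality -> store count for a copyset; the number
// of keys is the copyset's "diversity" as defined above.
func localities(cs []Store) map[string]int {
	m := map[string]int{}
	for _, s := range cs {
		m[s.Locality]++
	}
	return m
}

// swapImproves reports whether swapping src (in the source copyset) with
// dst (in the target copyset) increases source diversity without hurting
// the target, per conditions (1)-(3) above.
func swapImproves(source, target []Store, src, dst Store, rf int) bool {
	srcLoc, dstLoc := localities(source), localities(target)
	// (1) The source has a duplicated locality and src is a duplicate.
	if len(srcLoc) >= rf || srcLoc[src.Locality] < 2 {
		return false
	}
	// (2) dst's locality must be new to the source copyset.
	if srcLoc[dst.Locality] > 0 {
		return false
	}
	// (3) The target's diversity must not drop below the replication
	// factor: src's locality is new to the target, or the target has a
	// duplicate of dst's locality, or it has diversity to spare.
	return dstLoc[src.Locality] == 0 || dstLoc[dst.Locality] >= 2 || len(dstLoc) > rf
}
```
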
This strategy may not achieve the best possible diversity, but it tries to
ensure that each store within a copyset is in a different locality.

#### Copyset re-generation
The store list considered for copyset allocation will be the current live
stores. Live stores will be computed the same way the allocator detects live
stores (but throttled stores will not be excluded).
Copysets will be re-generated only once the store list has been stable, i.e.
unchanged for 3 ticks (each tick has a 10s interval).

The copyset allocation can be persisted as a proto in the distributed KV layer.
The copyset strategy which minimizes data movement requires copysets to be
persisted (it requires the previous state to be global and to survive
restarts). The node with the lowest live node ID in the cluster will manage
(persist) copysets. Other nodes will periodically (every 10s) cache the
persisted copysets and use them for rebalancing.

Copysets will only be re-generated (and persisted) if the store list changes.
In steady state, all nodes will periodically read the persisted copysets and
there will be no need to re-generate and persist new copysets.

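A sketch of the re-generation loop on the managing node, with hypothetical
`generateCopysets` and `persistCopysets` hooks:

```go
package copysets

import (
	"reflect"
	"sort"
)

// manager tracks how long the live store list has been stable. tick is
// called every 10s with the current live store IDs.
type manager struct {
	lastStores  []int
	stableTicks int
}

func (m *manager) tick(liveStores []int) {
	sort.Ints(liveStores)
	if !reflect.DeepEqual(liveStores, m.lastStores) {
		// The store list changed; reset the stability counter.
		m.lastStores, m.stableTicks = liveStores, 0
		return
	}
	m.stableTicks++
	if m.stableTicks == 3 {
		// Stable for 3 ticks: regenerate and persist exactly once
		// for this store list (hypothetical hooks).
		// persistCopysets(generateCopysets(liveStores))
	}
}
```
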
The cluster can tolerate the failure of one node within each copyset for RF=3.
For example, a 100-node cluster can tolerate the simultaneous failure of 33
nodes in the best case (for RF=3) without suffering any data loss.

## Rebalancing ranges
Ranges need to be rebalanced to be contained within a copyset.
There are two range rebalancers currently used in cockroach:
1. Replicate queue
2. Store rebalancer

This RFC will explain the implementation of copyset rebalancing for the
replicate queue, which processes one replica at a time.
Replica rebalancing by the store rebalancer will be disabled if copysets are
enabled (at least for the initial version). The store rebalancer can still
perform leaseholder rebalancing.

The allocator uses a scoring function to
1. decide which store to use for a new replica of a range,
2. decide which replica to remove when a range has more replicas than required,
3. decide whether a replica has to move from one store to another such that
   the resultant score for the range will be higher.

The scoring function considers the following (given in order of priority):
1. Zone constraints (which are constraints on having certain tables in certain
   zones).
2. Disk fullness: checks whether the source or target is too full.
3. Diversity score difference: the diversity score is proportional to the
   number of different localities the range has a replica in. It is computed
   over the `(n choose 2)` replica pairs based on their localities, where n is
   the number of replicas.
4. Convergence score difference: the convergence score is used to avoid moving
   ranges whose movement would cause the stats (range count) of a store to
   move away from the global mean.
5. Balance score difference: the balance score is the normalized utilization
   of a node. It currently considers the number of ranges. Nodes with a low
   balance score are preferred.
6. Range count difference: stores with a low range count are preferred.

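Conceptually the comparison is lexicographic: a lower-priority dimension only
matters when all higher-priority dimensions tie. A simplified, hypothetical
sketch (not the actual allocator code; comparison directions follow the
descriptions above):

```go
// candidateScore is a simplified, hypothetical view of the priority
// ordering above: an earlier dimension only yields to a later one on ties.
type candidateScore struct {
	constraintsOK bool    // (1) zone constraints satisfied
	diskOK        bool    // (2) source/target not too full
	diversity     float64 // (3) higher is better
	convergence   float64 // (4) higher is better
	balance       float64 // (5) lower is better
	rangeCount    int     // (6) lower is better
}

// better reports whether a beats b, comparing dimensions in priority order.
func (a candidateScore) better(b candidateScore) bool {
	if a.constraintsOK != b.constraintsOK {
		return a.constraintsOK
	}
	if a.diskOK != b.diskOK {
		return a.diskOK
	}
	if a.diversity != b.diversity {
		return a.diversity > b.diversity
	}
	if a.convergence != b.convergence {
		return a.convergence > b.convergence
	}
	if a.balance != b.balance {
		return a.balance < b.balance
	}
	return a.rangeCount < b.rangeCount
}
```
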
### Copyset score
For rebalancing ranges into copysets, a new "copyset score" will be added to
the allocator. Priority-wise it will sit between (2) and (3) above: zone
constraints and disk fullness take priority over the copyset score.

Since copyset allocation already considers diversity, its priority can be
placed above the diversity score.
If copysets are disabled in the cluster, this score will have no impact on
rebalancing.

The copyset score (higher is better) of a range is high if:
1. The range is completely contained within a copyset.
2. The copysets the range is in are under-utilized. We want each copyset to
   be equally loaded.
   If a range is completely contained in a copyset `x`, it should be moved
   completely to a copyset `y` only if the nodes in copyset `y` have a
   **significantly** lower load (for example, the nodes in `y` have a lot more
   free disk space).

So the following replica transition for a range of RF 3 should be allowed in
case (2):
`x x x -> x x y -> x y y -> y y y`
where `x x x` means that the 3 replicas of the range are in copyset `x`.

Let's say `r` is the replication factor of a range. Each of its replicas
belongs to a node with a particular copyset ID. We can formally define the
scores as:
1. Homogeneity score: `number of pairwise-same copyset IDs / (r choose 2)`.
2. Idle score: this score is proportional to how "idle" a store is. For
   starters we can consider this to be the % disk free (available capacity /
   total capacity of the store). We want ranges to migrate to copysets with
   significantly lower load.
   1. The idle score of a store is proportional to the idleness of the store,
      e.g. the % disk free on the store.
   2. The idle score of a copyset is the lowest idle score of the stores in
      the copyset.
   3. The idle score of a range is the weighted average idle score of the
      copysets of the stores the range is present in. A range can be a part of
      multiple copysets while it is in flux (examples given below).

The copyset score can be defined as
`(k * homogeneity_score + idle_score) / (k + 1)`.
It is normalized and lies between 0 and 1.

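A sketch of the resulting score, with hypothetical lookup functions for a
store's copyset ID and a copyset's idle score (continuing the package above):

```go
// copysetScore combines range homogeneity with copyset idleness, as
// defined above; k trades off the two terms (see "Computation of k").
func copysetScore(replicas []Store, k float64,
	copysetOf func(Store) int, copysetIdle func(int) float64) float64 {
	r := len(replicas)
	// Homogeneity: fraction of replica pairs sharing a copyset ID.
	same, pairs := 0, r*(r-1)/2
	for i := 0; i < r; i++ {
		for j := i + 1; j < r; j++ {
			if copysetOf(replicas[i]) == copysetOf(replicas[j]) {
				same++
			}
		}
	}
	homogeneity := float64(same) / float64(pairs)
	// Idle score of the range: average idle score of the copysets its
	// replicas are in (a range can span copysets while in flux).
	idle := 0.0
	for _, s := range replicas {
		idle += copysetIdle(copysetOf(s))
	}
	idle /= float64(r)
	// Normalized to [0, 1].
	return (k*homogeneity + idle) / (k + 1)
}
```
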
#### Computation of k
Let's say we want to migrate a range from a copyset `x` to a copyset `y` if
the idle score of `y` differs by more than `d` (configurable).
If `d` is too small it could lead to thrashing of replicas, so we can use a
value like 15%.
Though the calculations below may seem a bit complex, to the end user we only
need to expose `d`, which is easy to understand: the idle-score difference
between two copysets beyond which a range will migrate.

For example, if the idle score of `x` is `a` and that of `y` is `a + d`, we
require:
```
copyset_score(x x x) <= copyset_score(x x y)
k * homogeneityScore(x x x) + idleScore(x x x) <= k * homogeneityScore(x x y) + idleScore(x x y)
# Generalizing for replication factor r (r = 3 below)
    homogeneityScore(x x x) = 1
    idleScore(x x x) = ra/r = a
    homogeneityScore(x x y) = (r-1 choose 2) / (r choose 2) # since 1 copyset is different
    idleScore(x x y) = ((r-1) * a + a + d)/r = (ra + d) / r
# So we get
k * 1 + a <= k * (r-1 choose 2) / (r choose 2) + (ra + d) / r
=> k <= d / 2
```
For example, for `r = 3`, `d = 0.15`, an idle score of `x` of `0.2`, and an
idle score of `y` of `0.36`:
```
totalScore(x x x) = 0.075 * 1 + 0.2 = 0.275
totalScore(x x y) = 0.075 * 0.33 + (0.2 + 0.2 + 0.36)/3 = 0.278
```
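
Checked mechanically (using the unnormalized `totalScore` form; dividing by
`k + 1` does not change the comparison):

```go
package main

import "fmt"

func main() {
	const k = 0.075 // k = d/2 with d = 0.15
	idleX, idleY := 0.2, 0.36
	// Unnormalized score: k * homogeneity + mean copyset idle score.
	xxx := k*1.0 + (3*idleX)/3             // homogeneity 1: all replicas in x
	xxy := k*(1.0/3.0) + (2*idleX+idleY)/3 // homogeneity 1/3
	fmt.Printf("xxx=%.3f xxy=%.3f migrate=%v\n", xxx, xxy, xxy > xxx)
	// Prints: xxx=0.275 xxy=0.278 migrate=true
}
```
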
So a range will migrate from
```
(x x x) -> (x x y) -> (x y y) -> (y y y)
```
The above migration will not happen if `y` has an idle score of `0.34` (since
`d = 0.15`).
The first step `(x x x) -> (x x y)` is the hardest, as homogeneity is broken.
The proof for this is given above.
For the `(x x y) -> (x y y)` step, the homogeneity score remains the same and
the idle score improves (since `y` has a better idle score).
For the `(x y y) -> (y y y)` step, both the homogeneity score and the idle
score improve.
When a range actually migrates from `(x x x)` to `(x x y)`, it goes through an
intermediate step `(x x x y)`, after which one `x` replica is removed, but
similar math applies.

This scoring function will allow ranges to organically move into copysets
and will try to maintain approximately equal load among copysets. Thrashing
will be avoided by choosing an appropriate value of `d`.

## Drawbacks
1. Copysets increase recovery time, since only nodes within the copyset of a
   crashed node can up-replicate its data. This can be mitigated by choosing a
   higher scatter width (scatter width is described in the
   [academic literature](https://web.stanford.edu/~skatti/pubs/usenix13-copysets.pdf)).
2. Zone constraints will not be supported in the initial version of copysets.
   Copyset allocation can later be tweaked to respect zone constraints.
3. Heterogeneous clusters: copysets will work in heterogeneous clusters, but
   each copyset will be limited by the weakest node in the copyset (since the
   idle score of a copyset is the lowest node idle score). This may be
   something we can live with.
4. Copysets don't play well with the store rebalancer. For the first cut,
   store-based replica rebalancing will be disabled when copysets are enabled.
   Similar logic can be incorporated into the store rebalancer at a later
   point.

Due to the above drawbacks, copysets will be disabled by default, and there
will be a cluster setting with which users can enable copysets if they are OK
with the above drawbacks.

## Rationale and Alternatives
There can be multiple approaches for both copyset allocation and the scoring
function. The design in this RFC is intentionally simple, and the respective
algorithms can be tweaked independently later.

### Chainsets
[Chainsets](http://hackingdistributed.com/2014/02/14/chainsets/) are one way
to make incremental changes to copysets, but again potentially at the cost
of reduced locality diversity. The length of the chain used in chainsets
could be considered equivalent to the replication factor in cockroach.

## Testing scenarios
Apart from unit tests, roachtests can be added which verify copyset-based
rebalancing in the presence of
1. Node addition / removal
2. Node crashes (up to 1/3rd of the cluster)
3. Changes of replication factors
4. Locality fault tolerance
5. Changes to constraints