
# Raft snapshots and why you see them when you oughtn't

Original author: Tobias Grieger

# Introduction

Each and every Range in CockroachDB is fundamentally a *Raft* group
(with a lot of garnish around it). Raft is a consensus protocol that
provides a distributed state machine being fed an ordered log of
commands, the *Raft log*. Raft is a leader-based protocol that keeps
leadership stable, which means that typically one of the members of
the group is the designated leader and is in charge of distributing
the log to followers and marking log entries as committed.

In typical operation (read: “practically always”) the process that
plays out with each write to a Range is that each of the members of
the Raft group receives new entries (corresponding to some digested
form of the write request) from the leader, which it acknowledges. A
little later, the leader sends an updated *Commit index* instructing the
followers to apply the current (and any preceding) writes to their
state, in the order given by the log.
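
To make this concrete, here is a minimal sketch of the follower-side
apply loop (toy types and names, invented for illustration; this is
not CockroachDB's actual code):

```go
package main

import "fmt"

// Entry is a single command in the Raft log.
type Entry struct {
	Index   uint64
	Command string
}

// follower models the bits of Raft state that matter here: the log,
// the highest committed index, and the highest applied index.
type follower struct {
	log     []Entry // log[i] holds the entry at index i+1
	commit  uint64  // highest index known to be committed
	applied uint64  // highest index applied to the state machine
}

// applyCommitted applies every committed-but-unapplied entry to the
// state machine (here: printing), strictly in log order.
func (f *follower) applyCommitted() {
	for f.applied < f.commit {
		e := f.log[f.applied] // the entry at index f.applied+1
		fmt.Printf("applying index %d: %s\n", e.Index, e.Command)
		f.applied = e.Index
	}
}

func main() {
	f := &follower{
		log:    []Entry{{1, "set a=1"}, {2, "set a=2"}, {3, "set a=3"}},
		commit: 2, // the leader has marked index 2 as committed
	}
	f.applyCommitted() // applies 1 and 2; entry 3 waits for a higher commit index
}
```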

A *Raft snapshot* is necessary when the leader does not have the log
entries a follower needs and thus needs to send a full initial
state. This happens in two scenarios:

1. Log truncation. It’s not economical to store all of the Raft log in
   perpetuity. For example, imagine a workload that updates a single
   key-value pair over and over (no MVCC). The Raft log’s growth is
   unbounded, yet the size of the state is constant.

   When all followers have applied a command, a prefix of the log up
   to that command can be discarded. But followers can be unavailable
   for extended periods of time; it may not always be wise to wait for
   *all* followers to have caught up, especially if the Raft log has
   already grown very large. If we don’t wait and an affected follower
   wants to catch up, it has to do so through a snapshot.

2. A configuration change adds a new follower to the Raft group. For
   example, a three-member group gets turned into a four-member
   group. Naively, the new member starts with a clean slate and would
   need all of the Raft log. The details are a bit in the weeds, but
   even if the log had never been truncated (see 1.), we’d still need a
   snapshot, because a Range’s initial state is not the empty state
   (i.e. if a replica applied all of the log on top of an empty state,
   it wouldn’t end up in the correct state).

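The leader-side decision between catching a follower up from the log
and falling back to a snapshot boils down to a comparison like the
following sketch (invented names; in CockroachDB this logic lives in
the vendored etcd/raft library):

```go
// needsSnapshot reports whether a follower can no longer be caught up
// from the leader's log. firstIndex is the first index the leader
// still has (everything below it was truncated away); nextIndex is
// the next entry the follower needs. Illustrative only.
func needsSnapshot(firstIndex, nextIndex uint64) bool {
	// If the follower's next entry was truncated away, only a
	// snapshot can catch it up. A brand-new follower (nextIndex 1,
	// say) typically falls into this bucket too.
	return nextIndex < firstIndex
}
```
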
# Terminology

We have already introduced *Raft* and take it that you know what a *Range*
is. Here is some more terminology and notation which is also helpful
for deciphering the logs, should you ever have to.

**Replica:** a member of a Range, represented at runtime by a
`*storage.Replica` object (housed by a `*storage.Store`). A Range
typically has an odd number of Replicas, and practically always three
or five.

**Replica ID:** A Range is identified by a Range ID, but Raft doesn’t even
know which Range it is handling (we do all of that routing in our
transport layer). All Raft knows is the members of the group, which
are identified by *Replica IDs*. For each range, the initial Replica IDs
are 1, 2, 3, and each newly added Replica is assigned the next higher
(and never previously used) one [1]. For example, if a Range has Replicas
with IDs 1, 2, 4, then you can deduce that member 3 was at some point
removed from the range and a new replica 4 added (in some unknown
order).

[1] The replica ID counter is stored in the range descriptor.

The notation you would find in the logs for a member with Replica ID 3
of a Range with ID 17 is r17/3.

**Raft snapshot:** a snapshot that is requested internally by Raft. This
is different from a snapshot that CockroachDB sends preemptively; see
below.
    75  
    76  **Raft snapshot queue:** a Store-level queue that sends the snapshots
    77  requested by Raft. Prior to the introduction of the queue, Raft
    78  snapshots would often bring a cluster to its knees by overloading the
    79  system. The queue makes sure there aren’t ever too many Raft snapshots
    80  in flight at the same time.
    81  
    82  **Snapshot reservation:** A Store-level mechanism that sits on the receive
    83  path of a snapshot, to make sure we’re not accepting too many
    84  (preemptive or Raft) snapshots at once (which could overload the
    85  node).

**Replica Change (or Replication Change):** the process of adding a
Replica to or removing one from a Range. An addition usually involves a:

**Preemptive snapshot:** a snapshot we send before adding a replica to the
group. The idea is that we’ll opportunistically place state on the
store that will house the replica once it is added, so that it can
catch up from the Raft log instead of needing a Raft snapshot. This
minimizes the time during which the follower is part of the commit
quorum but unable to participate in said quorum.

A preemptive snapshot is rate limited (at the time of writing to 2 MB/s)
and creates a “placeholder” Replica that has no associated Raft group
(its Replica ID is zero) and is typically short-lived (it either turns
into a full Replica or gets deleted). See:

**Replica GC:** the process of deleting Replicas that are not members
of their Raft group any more. Determining whether a Replica is GC’able
is surprisingly tricky.

**Quota pool:** A flow control mechanism that prevents “faster” replicas
from leaving a slower (but “functional”, for example on a slower
network connection) Replica behind. Without a quota pool, the Raft
group would end up in a weakened state where losing one of the
up-to-date replicas incurs unavailability until the straggler has
caught up. Letting a follower fall behind can also trigger periodic
Raft log truncations that cut it off, requiring regular Raft
snapshots.
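
A minimal sketch of the idea (heavily simplified, with invented names;
the real quota pool lives in the storage package and differs in
detail): proposals acquire quota proportional to their size before
being proposed, and quota is only returned once all functioning
followers have caught up, so a slow follower backpressures new writes
instead of falling ever further behind.

```go
package quota

import "sync"

// Pool is a toy proposal-quota pool.
type Pool struct {
	mu    sync.Mutex
	cond  *sync.Cond
	avail int64 // bytes of quota currently available
}

func NewPool(size int64) *Pool {
	p := &Pool{avail: size}
	p.cond = sync.NewCond(&p.mu)
	return p
}

// Acquire blocks until n bytes of quota are available and takes them.
// Called before proposing a command to Raft.
func (p *Pool) Acquire(n int64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for p.avail < n {
		p.cond.Wait()
	}
	p.avail -= n
}

// Release returns n bytes of quota. In the real system this happens
// once all live followers have durably received the proposal.
func (p *Pool) Release(n int64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.avail += n
	p.cond.Broadcast()
}
```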

**Quorum:** strictly more than half of the number of Replicas in a
Range. For example, with three Replicas, quorum is two. With four
Replicas, quorum is three. With five Replicas, quorum is three.
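
In code, quorum is just integer division (a trivial helper, spelled
out for clarity):

```go
// quorum returns the minimum number of replicas that must agree:
// n=3 gives 2, n=4 gives 3, n=5 gives 3.
func quorum(n int) int {
	return n/2 + 1
}
```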

**Log truncation:** the process of discarding a prefix of the Raft
log. The Raft leader periodically computes a truncation index and
sends out a Raft command that instructs replicas to delete the
corresponding prefix of their Raft logs. These operations are
triggered by the Raft log queue.
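
A hedged sketch of the leader-side computation (invented names; the
actual Raft log queue weighs more inputs, such as log size and pending
snapshots): the naive truncation index is bounded by the lowest index
that every follower has durably received.

```go
// truncationIndex returns the highest log index that can be discarded
// without cutting off any of the given followers: the minimum of
// their durably-received ("match") indexes. Illustrative only.
func truncationIndex(matchIndexes []uint64) uint64 {
	if len(matchIndexes) == 0 {
		return 0
	}
	min := matchIndexes[0]
	for _, m := range matchIndexes[1:] {
		if m < min {
			min = m
		}
	}
	return min
}
```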

**Range Split (or just Split):** an operation that splits a Range into
two Ranges by shrinking a Range (from the right) and creating a new
one (with members on the same stores) from the remainder. For example,
r17 with original bounds [a,z) might split into r17 with bounds [a,c)
and r18 with bounds [c,z). The right-hand side is initialized with the
data from the (original) left-hand side Replica.
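
The bookkeeping can be pictured as pure descriptor manipulation (a toy
sketch; `RangeDesc` and friends are invented stand-ins for the real
range descriptor):

```go
// RangeDesc is a toy stand-in for a range descriptor.
type RangeDesc struct {
	ID       int
	StartKey string
	EndKey   string // exclusive
}

// split shrinks lhs from the right at splitKey and returns the new
// right-hand side, which inherits lhs's old end key (and, not shown
// here, its replica placement). For splitKey "c":
// r17 [a,z) -> r17 [a,c) and r18 [c,z).
func split(lhs *RangeDesc, splitKey string, newID int) RangeDesc {
	rhs := RangeDesc{ID: newID, StartKey: splitKey, EndKey: lhs.EndKey}
	lhs.EndKey = splitKey
	return rhs
}
```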

**Range Merge (or just Merge):** the reverse of a Split, but more
complex to implement than a Split for a variety of reasons, the most
relevant to this document being that it needs the Replicas to be
colocated (i.e. with replicas on the same stores).

We’ll discuss the interplay between the various mechanisms above in
more detail, but this is a good moment to reflect on the title of this
document. Raft snapshots are expensive in that they transfer loads of
data around; it’s typically the right choice to avoid them. When could
they pop up outside of failure situations?

1. We control replication changes and have preemptive snapshots (which
   still move the same amount of data, but can happen at a leisurely
   pace without running a vulnerable Raft group); no Raft snapshots
   are involved from the looks of it.

2. We control Raft log truncation; if followers are online, the quota
   pool makes sure that the followers are always at “comparable” log
   positions, so a (significant) prefix of the log that can be
   truncated without causing a Raft snapshot should always be
   available.

3. A split shouldn’t require a Raft snapshot: before the split, all
   the data is on whichever stores the replicas are on. After the
   split, the data is in the same place, just kept across two ranges:
   a bookkeeping change.

4. Merges also shouldn’t need Raft snapshots. They perform replica
   changes first (see above), followed by a bookkeeping change that
   doesn’t move data any more.

It seems that the only “reasonable” situation in which you’d expect
Raft snapshots is if a node goes offline for some time and the Raft
log grows so much that we’d prefer a snapshot for catching up the node
once it comes back.

Pop quiz: how many Raft snapshots would you expect if you started a
fresh 15-node cluster on a 2.1.2-alpha SHA from November 2018, and
imported a significant amount (TBs) of data into it?

![unexpected snapshot diagram](raft-snapshots.png)

I bet ~15,000 wasn’t exactly what you were thinking of, and yet this
was a problem that was long overlooked and only recently addressed by
means of mitigation rather than resolution of root causes. It’s not
just expensive to have that many snapshots. You can see that thousands
of snapshots are requested, and the graph isn’t exactly in a hurry to
dip back towards zero. This is in part because the Raft snapshot queue
had (and has) efficiency concerns, but fundamentally consider that
each queued snapshot might correspond to transferring ~64 MB of data
over the network (closer to ~32 MB in practice). That’s on the order of
hundreds of gigabytes, i.e. it’ll take time no matter how hard we try.

At the end of this document, you will have an understanding of how
this graph came to be, how it looks today (spoiler: mostly a zero
flatline), and what the root causes are.

# Fantastic Raft snapshots and where to find them

We will now go into the details and discover how we might see Raft
snapshots and how they might proliferate as seen in the above
graph. In doing so, we’ll discover a “meta root cause”, which is “poor
interplay between mechanisms that have been mentioned above”. To
motivate this, consider the following conflicts of interest:

- Log truncation wants to keep the log very short (this is also a
  performance concern). Snapshots want their starting index to not be
  truncated away (or they’re useless).
- Replica GC wants to remove old data as fast as possible. Replication
  changes want their preemptive snapshots kept around until they have
  completed.

It’s a good warm-up exercise to go through why these conflicts would
cause a snapshot.

## Problem 1: Truncation cuts off an in-flight snapshot

Say a replication change sends a preemptive snapshot at log
index 123. Some writes arrive at the range; the Raft log queue checks
the state of the group, sees three up-to-date followers, and issues a
truncation to index 127. The preemptive snapshot arrives and is
applied. The replication change completes and a new Replica is formed
from the preemptive snapshot. It will immediately require a Raft
snapshot because it needs index 124 next, but the leader only has 127
and higher.

**Solution** (already in place): track the indexes of in-flight snapshots
and don’t truncate them away. Similar mechanisms make sure that we
don’t cut off followers that are lagging behind slightly, unless we
really, really think it’s the right thing to do.
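
A sketch of that constraint (invented names; the real bookkeeping
tracks outstanding snapshots per Replica): the truncation index is
clamped so that the log still connects to any in-flight snapshot.

```go
// safeTruncationIndex clamps a desired truncation index (the first
// index the leader will keep) so that a follower restored from an
// in-flight snapshot at snapIndex can still catch up from the log:
// its next needed entry will be snapIndex+1. Illustrative only.
func safeTruncationIndex(desired, snapIndex uint64) uint64 {
	if snapIndex > 0 && desired > snapIndex+1 {
		return snapIndex + 1 // e.g. desired=127, snapIndex=123 -> keep 124 and up
	}
	return desired
}
```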

## Problem 2: Preemptive snapshots accidentally replicaGC’ed

Again a preemptive snapshot arrives at a Store. While the replication
change hasn’t completed, this preemptive Replica is GC’able and may be
removed if picked up by the replica GC queue. If this happens and the
replication change completes, the newly added follower will have a
clean slate and will need a Raft snapshot to recover, putting the
Range in a vulnerable state until it has done so.

**Solution** (unimplemented): delay the GC if we know that the replica
change is still ongoing. Note that this is more difficult than it
sounds. The “driver” of the replica change is another node, and for
all we know it might’ve aborted the change (in which case the snapshot
needs to be deleted ASAP). Simply delaying GC of preemptive snapshots
via a timeout mechanism is tempting, but can backfire: a “wide”
preemptive snapshot may be kept around erroneously and can block
“smaller” snapshots that are now sent after the Range has split into
smaller parts. We can detect this situation, but perhaps it is simpler
to add a direct RPC from the “driver” to the preemptive snapshot to
keep it alive for the right amount of time.
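
One way to picture the keepalive idea (purely hypothetical; nothing
like this is implemented, and all names are invented):

```go
import "time"

// preemptiveReplica is a toy model of a placeholder replica created
// from a preemptive snapshot.
type preemptiveReplica struct {
	lastKeepalive time.Time // refreshed via RPC by the change's driver
}

// canGC reports whether the replica GC queue may destroy the
// placeholder: only once the driver has stopped refreshing it for a
// full TTL, i.e. the replication change completed or was abandoned.
func (r *preemptiveReplica) canGC(now time.Time, ttl time.Duration) bool {
	return now.Sub(r.lastKeepalive) > ttl
}
```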

These two were relatively straightforward. If we throw splits into the
mix, things get a lot worse.

## Problem 3: Splits can exponentiate the effect of problem 2

Consider the situation in problem 2, but between the preemptive
snapshot being applied and the replication change succeeding, the
range is split 100 times (splitting small ranges is fast, and this
happens during IMPORT/RESTORE). The preemptive snapshot is removed, so
the Replica will receive a new snapshot. This snapshot is likely to
post-date all of the splits, so it will be much “narrower” than
before. In applying it, everything outside of its bounds is deleted,
and that’s the data for the 100 new ranges that we will now need Raft
snapshots for.

**Solution:** This is not an issue if Problem 2 is solved, but see
Problem 5 for a related problem.

## Problem 4: The split trigger can cause Raft snapshots

This needs a bit of detail about how splits are implemented. A split
is essentially a transaction with a special “split trigger” attached
to its commit. Upon applying the commit, each replica carries out the
logical split into two ranges as described by the split trigger,
culminating in the creation of a new Replica (the right-hand side) and
its initialization with data (inherited from the left-hand side). This
is conceptually easy, but things always get more complicated in a
distributed system. Note that we have multiple replicas that may be
applying the split trigger at different times. What if one replica
took, say, a second longer to apply it? All replicas except for the
straggler will already have initialized their new right-hand sides,
and those right-hand sides are configured to talk to the right-hand
side on our straggler - which hasn’t executed the split trigger yet!
The result is that upon receiving messages addressed to this
yet-nonexistent replica, the straggler creates it as a completely
empty replica. This replica will promptly [2] tell the leader that it
needs a snapshot. In all likelihood, the split trigger is going to
execute very soon and render the snapshot useless, but the damage has
been done.

[2] The replicas that have already processed the split may form a
quorum and start accepting requests. These requests are also routed to
the node that hasn’t processed the split yet, and to serve them, the
slower node needs a snapshot.

**Solution:** what’s implemented today is a mechanism that simply
delays snapshots to “blank” replicas for multiple seconds. This works,
but there are situations in which the “blank” replica really needs the
Raft snapshot (see Problems 3, 5, and 6). The idea is to replace this
with a mechanism that can detect whether there’s an outstanding split
trigger that will obviate the snapshot.
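
The timeout-based workaround amounts to something like the following
sketch (simplified, with invented names):

```go
import "time"

// shouldDelayRaftSnapshot reports whether to hold off on sending a
// Raft snapshot to a replica that contains no data yet, in the hope
// that a pending split trigger will initialize it instead. The grace
// period is on the order of seconds. Illustrative only.
func shouldDelayRaftSnapshot(replicaIsEmpty bool, createdAt, now time.Time, grace time.Duration) bool {
	return replicaIsEmpty && now.Sub(createdAt) < grace
}
```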

## Problem 5: Replica removal after a split can cause Raft snapshots

This is Problem 4, but with a scenario that really needs the Raft
snapshot. Imagine a range splits into two and immediately afterwards,
a replication change (on the original range, i.e. the left-hand side)
removes one of the followers. This can lead to that follower Replica
being replica GC’ed before it has applied the split trigger, and so
the removed data encompasses the data that should become the right-hand
side (which is still configured to have a replica on that store). A
Raft snapshot is necessary, and note in particular that the
timeout-based mechanism from Solution 4 has to expire first,
accidentally delaying this snapshot.

**Solution:** See end of Solution 4 to avoid the delayed snapshot. To
avoid the need for a snapshot in the first place, avoid (or delay)
removing replicas that are pending application of a split trigger, or
do the opposite and delay splits when a replication change is
ongoing. Not implemented to date. Both options are problematic as they
can intertwine the split and replicate queues, which already have
hidden dependencies.

## Problem 6: Splitting a range that needs snapshots leads to two such ranges

If a range which has a follower that needs a snapshot (for whatever
reason) is split, the result will be two Raft snapshots required to
catch up the left-hand side and right-hand side of the follower,
respectively. You might argue that these snapshots will be smaller and
add up to the original size and so that’s OK, but this isn’t how it
works, owing to various deficiencies of the Raft snapshot queue.

This is particularly bad since we also know from Problems 4 and 5 that
we sometimes accidentally delay the Raft snapshot for seconds (before
we even queue it), leaving ample time for splits to create more
snapshots. To add insult to injury, all new ranges created in this
state are also false positives for the timeout mechanism in Problem 4.

Note that this is similar to Problem 3 in some sense, but it can’t be
detected by the split.

**Solution:** Splits typically happen on the Raft leader and can thus
peek into the Raft state and detect pending snapshots. A mechanism to
delay the split in such cases was implemented, though it doesn’t apply
to splits carried out through the “split queue” (as that queue does
not react well to being throttled). The main scenario in which this
problem was observed is IMPORT/RESTORE, which carries out rapid splits
and replication changes.
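
Conceptually, the pre-split check looks something like this sketch
(toy types; the real check reads follower progress out of the vendored
etcd/raft status and considers more than just this one state):

```go
// ProgressState mirrors the follower-progress states Raft tracks:
// probing, replicating, or being caught up via snapshot.
type ProgressState int

const (
	StateProbe ProgressState = iota
	StateReplicate
	StateSnapshot
)

// delaySplit reports whether a split should wait: if any follower is
// currently being caught up via snapshot, splitting now would turn
// one pending snapshot into two.
func delaySplit(followerStates []ProgressState) bool {
	for _, s := range followerStates {
		if s == StateSnapshot {
			return true
		}
	}
	return false
}
```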

## Commonalities

All of these phenomena are in principle observable by running the
restore2TB and import/tpch/nodes=X roachtests, though at this point
all that’s left of the problem is a small spike in Raft snapshots at
the beginning of the test (the likely root cause being Problem 3),
which used to be exponentiated by Problem 6 (before it was fixed).

# Summary

Raft snapshots shouldn’t be necessary in any of the situations
described above, and considerable progress has been made towards
making that a reality. The mitigation of Solution 6 is powerful and
guards against the occasional Raft snapshot becoming a problem, but it
is nevertheless worth going the extra mile to add “deep stability” to
this property: a zero rate of Raft snapshots is a good indicator of
cluster health for production and stability testing. Trying to achieve
it has already led to the discovery of a number of surprising
interactions and buglets across the core code base, and not achieving
it will make it hard to end-to-end test this property.