- Feature Name: Flushable Ingested SSTable
- Status: in-progress
- Start Date: 2022-03-11
- Authors: Mufeez Amjad
- RFC PR: [#1586](https://github.com/cockroachdb/pebble/pull/1586)
- Pebble Issues: [#25](https://github.com/cockroachdb/pebble/issues/25)
- Cockroach Issues:

## Summary

To avoid a forced flush when ingesting SSTables that have an overlap with a
memtable, we "lazily" add the SSTs to the LSM as a `*flushableEntry` to
`d.mu.mem.queue`. In comparison to a regular ingest which adds the SSTs to the
lowest possible level, the SSTs will get placed in the memtable queue before
they are eventually flushed (to the lowest level possible). This state is only
persisted in memory until a flush occurs, thus we require a WAL entry to replay
the ingestion in the event of a crash.

## Motivation

Currently, if any of the SSTs that need to be ingested have an overlap with a
memtable, we
[wait](https://github.com/cockroachdb/pebble/blob/56c5aebe151977964db7e464bb6c87ebd3451bd5/ingest.go#L671)
for the memtable to be flushed before the ingestion can proceed. This is to
satisfy the invariant that newer entries (those in the ingested SSTs) in the LSM
have a higher sequence number than old entries (those in the memtables). This
problem is also present for subsequent normal writes that are blocked behind the
ingest waiting for their sequence number to be made visible.

## Technical Design

The proposed design is mostly taken from Peter's suggestion in #25. The core
requirements are:
1. Replayable WAL entry for the ingest.
2. Implementation of the `flushable` interface for a new `ingestedSSTables` struct.
3. Lazily adding the ingested SSTs to the LSM.
4. Flushing logic to move SSTs into L0-L6.

<br>

### 1. WAL Entry

We require a WAL entry to make the ingestion into the flushable queue
replayable, and there is a need for a new type of WAL entry that does not get
applied to the memtable. Two approaches were considered:
1. Using `seqnum=0` to differentiate this new WAL entry.
2. Introducing a new `InternalKeyKind` for the new WAL entry,
   `InternalKeyKindIngestSST`.

We believe the second approach is better because it avoids modifying batch
headers, which can be messy, and because `seqnum=0` is already used for
unapplied batches. The second approach also allows a simpler, cleaner
implementation because it uses the extensibility of `InternalKeyKind` and is
similar to the treatment of `InternalKeyKindLogData`. It also follows the
correct seqnum semantics for SSTable ingestion in the event of a WAL replay:
each SST in the ingestion batch already gets its own sequence number.

This change will need to be gated on a `FormatMajorVersion` because if the store
is opened with an older version of Pebble, Pebble will not understand any WAL
entry that contains the new `InternalKeyKind`.

<br>

When performing an ingest (with overlap), we create a batch with the header:

```
+-------------+------------+--- ... ---+
| SeqNum (8B) | Count (4B) |  Entries  |
+-------------+------------+--- ... ---+
```

where `SeqNum` is the current running sequence number in the WAL, `Count` is the
number of ingested SSTs, and each entry has the form:

```
+-----------+-----------------+-------------------+
| Kind (1B) | Key (varstring) | Value (varstring) |
+-----------+-----------------+-------------------+
```

where `Kind` is `InternalKeyKindIngestSST`, and `Key` is a path to the
ingested SST on disk.

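As a rough illustration of this layout, the following self-contained Go sketch
encodes and decodes such a batch. It assumes a little-endian fixed-width header
and uvarint length-prefixed strings for `varstring`, and it assumes the `Value`
of each entry is left empty. The constant `kindIngestSST` (a placeholder value)
and the helpers `encodeIngestBatch` and `decodeIngestBatch` are hypothetical
names for this example, not Pebble's API.

```go
package main

import (
	"encoding/binary"
	"errors"
)

// kindIngestSST is a placeholder for InternalKeyKindIngestSST. The value is
// an assumption made for this sketch only.
const kindIngestSST byte = 0xFF

// headerLen is SeqNum (8B) + Count (4B).
const headerLen = 12

// appendVarstring appends a uvarint length prefix followed by the bytes of s.
func appendVarstring(buf []byte, s string) []byte {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], uint64(len(s)))
	buf = append(buf, tmp[:n]...)
	return append(buf, s...)
}

// readVarstring reads one length-prefixed string and returns the remainder.
func readVarstring(b []byte) (s, rest []byte, err error) {
	n, w := binary.Uvarint(b)
	if w <= 0 || n > uint64(len(b)-w) {
		return nil, nil, errors.New("malformed varstring")
	}
	end := w + int(n)
	return b[w:end], b[end:], nil
}

// encodeIngestBatch writes the header and one entry per SST path. Each entry
// is Kind (1B), Key (the path), and an empty Value.
func encodeIngestBatch(seqNum uint64, paths []string) []byte {
	buf := make([]byte, headerLen)
	binary.LittleEndian.PutUint64(buf[0:8], seqNum)
	binary.LittleEndian.PutUint32(buf[8:12], uint32(len(paths)))
	for _, p := range paths {
		buf = append(buf, kindIngestSST)
		buf = appendVarstring(buf, p)
		buf = appendVarstring(buf, "")
	}
	return buf
}

// decodeIngestBatch is the inverse of encodeIngestBatch. It rejects batches
// that contain any entry whose kind is not kindIngestSST.
func decodeIngestBatch(buf []byte) (seqNum uint64, paths []string, err error) {
	if len(buf) < headerLen {
		return 0, nil, errors.New("batch too short")
	}
	seqNum = binary.LittleEndian.Uint64(buf[0:8])
	count := binary.LittleEndian.Uint32(buf[8:12])
	rest := buf[headerLen:]
	for i := uint32(0); i < count; i++ {
		if len(rest) == 0 || rest[0] != kindIngestSST {
			return 0, nil, errors.New("unexpected entry kind")
		}
		rest = rest[1:]
		var key []byte
		if key, rest, err = readVarstring(rest); err != nil {
			return 0, nil, err
		}
		if _, rest, err = readVarstring(rest); err != nil {
			return 0, nil, err
		}
		paths = append(paths, string(key))
	}
	return seqNum, paths, nil
}
```

Calling `decodeIngestBatch(encodeIngestBatch(42, paths))` should return the
same sequence number and paths that were passed in.
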
When replaying the WAL, we check every batch's first entry. If its kind is
`InternalKeyKindIngestSST`, we continue reading the rest of the entries in the
batch of SSTs and replay the ingestion steps: we construct a `flushableEntry`
and add it to the flushable queue:

```go
// Runs inside the WAL replay loop; buf holds the bytes of one batch record.
b = Batch{db: d}
b.SetRepr(buf.Bytes())
seqNum := b.SeqNum()
maxSeqNum = seqNum + uint64(b.Count())
br := b.Reader()
if kind, _, _, _ := br.Next(); kind == InternalKeyKindIngestSST {
  // Continue reading the rest of the batch, construct a flushable of
  // sstables with the correct seqnums, and add it to the queue.
  buf.Reset()
  continue
}
```


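The replay decision can be sketched in isolation as follows. This is a
simplified model rather than Pebble code: `entry`, `replayedSST`, and
`replayIngest` are hypothetical names, `kindIngestSST` is a placeholder value,
and the sketch assumes that the i-th SST in the batch is assigned the sequence
number `seqNum + i`.

```go
package main

// kindIngestSST is a placeholder for InternalKeyKindIngestSST (assumed value).
const kindIngestSST byte = 0xFF

// entry is a decoded batch entry: its kind and its key (here, an SST path).
type entry struct {
	kind byte
	key  string
}

// replayedSST pairs an ingested SST path with the seqnum it is replayed at.
type replayedSST struct {
	path   string
	seqNum uint64
}

// replayIngest inspects the first entry of a batch. If it is an ingest entry,
// it returns one replayedSST per entry with consecutive sequence numbers
// starting at the batch's seqNum, and true. Otherwise it returns false and
// the caller would apply the batch to the memtable as usual.
func replayIngest(seqNum uint64, entries []entry) ([]replayedSST, bool) {
	if len(entries) == 0 || entries[0].kind != kindIngestSST {
		return nil, false
	}
	ssts := make([]replayedSST, 0, len(entries))
	for i, e := range entries {
		ssts = append(ssts, replayedSST{path: e.key, seqNum: seqNum + uint64(i)})
	}
	return ssts, true
}
```

The returned slice corresponds to the files that would back the
`flushableEntry` added to the queue during replay.
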
### 2. `flushable` Implementation

Introduce a new flushable type: `ingestedSSTables`.

```go
type ingestedSSTables struct {
    files []*fileMetadata
    size  uint64

    cmp      Compare
    newIters tableNewIters
}
```

which implements the following functions from the `flushable` interface:

#### 1. `newIter(o *IterOptions) internalIterator`

We return a `levelIter` since the ingested SSTables do not overlap one another,
and we can treat them like a level in the LSM.

```go
levelSlice := manifest.NewLevelSliceKeySorted(s.cmp, s.files)
return newLevelIter(*o, s.cmp, nil, s.newIters, levelSlice.Iter(), 0, nil)
```

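To illustrate why non-overlapping files can be treated like a level, here is a
minimal sketch of the lookup such an iterator can rely on: once the files are
sorted by key, at most one file can contain any given key, and it can be found
with a binary search. The `fileMeta` type and the `sortFiles` and `findFile`
helpers are hypothetical and use plain string keys. They are not Pebble's
`fileMetadata` or `levelIter`.

```go
package main

import "sort"

// fileMeta is a toy stand-in for a file's key bounds (both inclusive).
type fileMeta struct {
	smallest, largest string
}

// sortFiles orders files by their smallest key, as a key-sorted level
// slice would.
func sortFiles(files []fileMeta) {
	sort.Slice(files, func(i, j int) bool {
		return files[i].smallest < files[j].smallest
	})
}

// findFile returns the index of the single file whose bounds contain key,
// or -1 if no file does. It requires files to be sorted and non-overlapping.
func findFile(files []fileMeta, key string) int {
	// First file whose largest key is >= key.
	i := sort.Search(len(files), func(i int) bool {
		return files[i].largest >= key
	})
	if i < len(files) && files[i].smallest <= key {
		return i
	}
	return -1
}
```

With overlapping files this search would not be sufficient, since several files
could contain the same key and all of them would need to be merged.
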
<br>

On the client side, this iterator would have to be used like this:
```go
var iter internalIteratorWithStats
var rangeDelIter keyspan.FragmentIterator
iter = base.WrapIterWithStats(mem.newIter(&dbi.opts))
switch mem.flushable.(type) {
case *ingestedSSTables:
    iter.(*levelIter).initRangeDel(&rangeDelIter)
default:
    rangeDelIter = mem.newRangeDelIter(&dbi.opts)
}

mlevels = append(mlevels, mergingIterLevel{
    iter:         iter,
    rangeDelIter: rangeDelIter,
})
```

#### 2. `newFlushIter(o *IterOptions, bytesFlushed *uint64) internalIterator`

#### 3. `newRangeDelIter(o *IterOptions) keyspan.FragmentIterator`

The above two methods would return `nil`. By doing so, in `c.newInputIter()`:
```go
if flushIter := f.newFlushIter(nil, &c.bytesIterated); flushIter != nil {
    iters = append(iters, flushIter)
}
if rangeDelIter := f.newRangeDelIter(nil); rangeDelIter != nil {
    iters = append(iters, rangeDelIter)
}
```
we ensure that no iterators on `ingestedSSTables` will be used while flushing in
`c.runCompaction()`.

The special-cased flush process for this flushable is described in
[Section 4](#4-flushing-logic-to-move-ssts-into-l0-l6).

#### 4. `newRangeKeyIter(o *IterOptions) keyspan.FragmentIterator`

Implementation will wait until range key support lands in `levelIter`.

#### 5. `inuseBytes() uint64` and `totalBytes() uint64`

For both functions, we return 0.

Returning 0 for `inuseBytes()` means that the calculation of `c.maxOverlapBytes`
is not affected by the SSTs (the ingested SSTs don't participate in the
compaction).

We don't want the size of the ingested SSTs to contribute to the size of the
memtable when determining whether or not to stall writes
(`MemTableStopWritesThreshold`); they should contribute to the L0 read-amp
instead (`L0StopWritesThreshold`). Thus, we'll have to special-case ingested
SSTs in `d.makeRoomForWrite()`.

`totalBytes()` represents the number of bytes allocated by the flushable, which
in our case is 0. A consequence of this is that the size of the SSTs does not
count towards the flush threshold calculation. However, by setting
`flushableEntry.flushForced` we can achieve the same behaviour.

#### 6. `readyForFlush() bool`

The flushable of ingested SSTs can always be flushed because the files are
already on disk, so we return true.

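The size and readiness methods can be sketched against a reduced version of the
interface. This is a toy model: the three-method `flushable` interface below
covers only the methods discussed here, and `toyMemTable` and `memQueueBytes`
are hypothetical helpers that show the effect on memory accounting.

```go
package main

// flushable is a reduced, illustrative subset of the interface.
type flushable interface {
	inuseBytes() uint64
	totalBytes() uint64
	readyForFlush() bool
}

// toyMemTable is a hypothetical memtable with a fixed allocation.
type toyMemTable struct {
	allocated uint64
	full      bool
}

func (m *toyMemTable) inuseBytes() uint64  { return m.allocated }
func (m *toyMemTable) totalBytes() uint64  { return m.allocated }
func (m *toyMemTable) readyForFlush() bool { return m.full }

// ingestedSSTables tracks files that already live on disk.
type ingestedSSTables struct {
	size uint64 // total on-disk size of the ingested files
}

// inuseBytes and totalBytes return 0: the SSTs hold no memtable memory.
func (s *ingestedSSTables) inuseBytes() uint64 { return 0 }
func (s *ingestedSSTables) totalBytes() uint64 { return 0 }

// readyForFlush is always true because the files are already on disk.
func (s *ingestedSSTables) readyForFlush() bool { return true }

// memQueueBytes sums totalBytes over the queue, as a memtable-size based
// stall check might. Ingested SSTs do not contribute to the sum.
func memQueueBytes(queue []flushable) uint64 {
	var total uint64
	for _, f := range queue {
		total += f.totalBytes()
	}
	return total
}
```

In this model a queue holding a 64-byte memtable and a 1 MiB set of ingested
SSTs still reports 64 bytes, which is why the flush has to be requested through
`flushForced` rather than through the size threshold.
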
### 3. Lazily adding the ingested SSTs to the LSM

The steps to add the ingested SSTs to the flushable queue are:
1. Detect that an overlap exists (existing logic).

Add a check that falls back to the old ingestion logic of blocking the ingest on
the flush when `len(d.mu.mem.queue) >= MemTableStopWritesThreshold - 1`. This
reduces the chance that many short, overlapping, and successive ingestions cause
a memtable write stall.

Additionally, to mitigate the hiccup on subsequent normal writes, we could wait
before the call to `d.commit.AllocateSeqNum` until:
1. the number of immutable memtables and `ingestedSSTs` in the flushable queue
   is below a certain threshold (to prevent building up too many sublevels)
2. the number of immutable memtables is low. This could lead to starvation if
   there is a high rate of normal writes.

2. Create a batch with the list of ingested SSTs.
```go
b := newBatch()
for _, path := range paths {
    b.IngestSSTs([]byte(path), nil)
}
```
3. Apply the batch.

In the call to `d.commit.AllocateSeqNum`, `b.count` sequence numbers are already
allocated before the `prepare` step. When we identify a memtable overlap, we
commit the batch to the WAL manually (through logic similar to
`commitPipeline.prepare`). The `apply` step would be a no-op if we performed a
WAL write in the `prepare` step. We would also need to truncate the memtable/WAL
after this step.

5. Create `ingestedSSTables` flushable and `flushableEntry`.

We'd need to call `ingestUpdateSeqNum` on these SSTs before adding them to the
flushable. This is to respect the sequence number ordering invariant while the
SSTs reside in the flushable queue.

6. Add to flushable queue.

Pebble requires that the last entry in `d.mu.mem.queue` is the mutable memtable
with value `d.mu.mem.mutable`. When adding a `flushableEntry` to the queue, we
want to maintain this invariant. To do this we pass `nil` as the batch to
`d.makeRoomForWrite()`. The result is:

```
| immutable old memtable | mutable new memtable |
```

We then append our new `flushableEntry`, and swap the last two elements in
`d.mu.mem.queue`:

```
| immutable old memtable | ingestedSSTables | mutable new memtable |
```

Because we add the ingested SSTs to the flushable queue when there is overlap,
and are skipping applying the version edit through the `apply` step of the
ingestion, we ensure that the SSTs are only added to the LSM once.

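The fallback check from step 1 and the queue manipulation from step 6 can be
sketched together on a toy queue. Everything here is illustrative: `memQueue`,
`rotate` (standing in for `d.makeRoomForWrite(nil)`), `canIngestAsFlushable`,
and `addIngested` are hypothetical names, and locking, the WAL write, and
sequence number handling are all omitted.

```go
package main

type entryKind int

const (
	kindMemtable entryKind = iota
	kindIngestedSSTs
)

// flushableEntry is a toy queue element.
type flushableEntry struct {
	kind        entryKind
	flushForced bool
}

// memQueue models d.mu.mem: mutable must always be the last queue element.
type memQueue struct {
	queue   []*flushableEntry
	mutable *flushableEntry
}

func newMemQueue() *memQueue {
	m := &flushableEntry{kind: kindMemtable}
	return &memQueue{queue: []*flushableEntry{m}, mutable: m}
}

// canIngestAsFlushable is false when the queue is long enough that the
// ingest should fall back to blocking on the flush.
func (q *memQueue) canIngestAsFlushable(stopWritesThreshold int) bool {
	return len(q.queue) < stopWritesThreshold-1
}

// rotate makes the current mutable memtable immutable by appending a fresh
// mutable memtable to the queue.
func (q *memQueue) rotate() {
	m := &flushableEntry{kind: kindMemtable}
	q.queue = append(q.queue, m)
	q.mutable = m
}

// addIngested rotates the memtable, appends the ingested-SSTs entry, and
// swaps the last two elements so the mutable memtable stays last.
func (q *memQueue) addIngested() *flushableEntry {
	q.rotate()
	e := &flushableEntry{kind: kindIngestedSSTs, flushForced: true}
	q.queue = append(q.queue, e)
	n := len(q.queue)
	q.queue[n-2], q.queue[n-1] = q.queue[n-1], q.queue[n-2]
	return e
}
```

Starting from a single mutable memtable, one call to `addIngested` leaves the
queue as old memtable, ingested SSTs, new mutable memtable, matching the second
diagram above.
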
7. Call `d.maybeScheduleFlush()`.

Because we've added an immutable memtable to the flushable queue and set
`flushForced` on the `flushableEntry`, this will result in a flush. This call
can be done asynchronously.

We can then return to the caller without waiting for the flush to finish.

### 4. Flushing logic to move SSTs into L0-L6

By returning `nil` for both `flushable.newFlushIter()` and
`flushable.newRangeDelIter()`, the `ingestedSSTables` flushable will not be
flushed normally.

The suggestion in issue #25 is to move the SSTs from the flushable queue into
L0. However, only the tables that overlap with the memtable will need to target
L0 (because they will likely overlap with L0 post flush), the others can be
moved to lower levels in the LSM. We can use the existing logic in
`ingestTargetLevel` to determine which level to move the ingested SSTables to
during `c.runCompaction()`. However, it's important to do this step after the
memtable has been flushed to use the correct `version` when determining overlap.

The flushable of ingested SSTs should not influence the bounds on the
compaction, so we will have to skip updating `c.smallest` and `c.largest` in
`d.newFlush()` for this flushable.
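
The level selection can be illustrated with a much-simplified model of the
target-level search. This sketch is not Pebble's `ingestTargetLevel`: it ignores
the base level, in-progress compactions, and sequence numbers, and the
`keyRange`, `overlapsAny`, and `targetLevel` names are hypothetical. It only
shows why the `version` used must already include the flushed memtable.

```go
package main

const numLevels = 7

// keyRange is a toy file boundary with inclusive string keys.
type keyRange struct {
	smallest, largest string
}

func overlaps(a, b keyRange) bool {
	return a.smallest <= b.largest && b.smallest <= a.largest
}

func overlapsAny(files []keyRange, r keyRange) bool {
	for _, f := range files {
		if overlaps(f, r) {
			return true
		}
	}
	return false
}

// targetLevel walks down from L0 and returns the deepest level the SST can
// be placed in without sitting at or below a level that holds overlapping
// data. An overlap in L0 itself forces the SST into L0.
func targetLevel(version [numLevels][]keyRange, sst keyRange) int {
	target := 0
	for level := 0; level < numLevels; level++ {
		if overlapsAny(version[level], sst) {
			return target
		}
		target = level
	}
	return target
}
```

In this model, an SST covering `[c, d]` targets L6 in an empty version, but
targets L0 once a flushed memtable covering `[a, m]` has been added to L0. If
the level were chosen before the flush, the SST could end up below older data
that overlaps it.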