- Feature Name: Virtual sstables
- Status: in-progress
- Start Date: 2022-10-27
- Authors: Arjun Nair
- RFC PR: https://github.com/cockroachdb/pebble/pull/2116
- Pebble Issues:
  https://github.com/cockroachdb/pebble/issues/1683


**Design Draft**

# Summary

The RFC outlines the design to enable virtualization of physical sstables
in Pebble.

A virtual sstable has no associated physical data on disk, and is instead backed
by an existing physical sstable. Each physical sstable may be shared by one or
more virtual sstables.

Initially, the design will be used to lower the read-amp and the write-amp
caused by certain ingestions. Sometimes an ingestion is unable to place an
incoming file, which has no data overlap with other files in the lsm, lower in
the lsm because its file boundaries overlap with files already in the lsm. In
this case, we are forced to place the file higher in the lsm, sometimes in L0,
which can cause higher read-amp and unnecessary write-amp as the file is later
moved lower down the lsm. See
https://github.com/cockroachdb/cockroach/issues/80589 for the problem occurring
in practice.

Eventually, the design will also be used for the disaggregated storage masking
use-case: https://github.com/cockroachdb/cockroach/pull/70419/files.

This document describes the design of virtual sstables in Pebble with enough
detail to aid the implementation and code review.

# Design

### Ingestion

When an sstable is ingested into Pebble, we try to place it in the lowest level
at which it has no data overlap and no file boundary overlap. We can make use of
virtual sstables in the cases where we're forced to place the ingested sstable
at a higher level due to file boundary overlap, but no data overlap.

```
                                  s2
ingest:                     [i-j-------n]
                                  s1
L6:                 [e---g-----------------p---r]
             a b c d e f g h i j k l m n o p q r s t u v w x y z
```

Consider the sstable s1 in L6 and the ingesting sstable s2. It is clear that
the file boundaries of s1 and s2 overlap, but there is no data overlap as shown
in the diagram. Currently, we will be forced to ingest the sstable s2 into a
level higher than L6. With virtual sstables, we can split the existing sstable
s1 into two sstables s3 and s4 as shown in the following diagram.

```
                       s3         s2        s4
L6:                 [e---g]-[i-j-------n]-[p---r]
             a b c d e f g h i j k l m n o p q r s t u v w x y z
```

The sstable s1 will be deleted from the lsm. If s1 is a physical sstable, then
we will keep the file on disk for as long as it is needed to back the
virtual sstables.
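
To make the bound computation concrete, here is a minimal, hypothetical Go
sketch (none of these names are Pebble APIs) of how the bounds of s3 and s4
could be derived from s1's bounds and the keys adjacent to the ingested file:

```
package main

import "fmt"

// virtualBounds describes the user-key bounds of one virtual sstable.
type virtualBounds struct {
	smallest, largest string
}

// splitBounds returns the bounds of the two virtual sstables that replace the
// existing sstable s1 when the ingested sstable s2 lands strictly inside s1's
// key gap. keyBelow/keyAbove are the largest key of s1 below s2's smallest key
// and the smallest key of s1 above s2's largest key; in Pebble these would be
// discovered while checking for data overlap (e.g. in overlapWithIterator).
func splitBounds(s1Smallest, s1Largest, keyBelow, keyAbove string) (left, right virtualBounds) {
	left = virtualBounds{smallest: s1Smallest, largest: keyBelow}
	right = virtualBounds{smallest: keyAbove, largest: s1Largest}
	return left, right
}

func main() {
	// Matches the diagram: s1 = [e, r], s2 = [i, n], and the s1 keys adjacent
	// to s2 are g and p, so s3 = [e, g] and s4 = [p, r].
	s3, s4 := splitBounds("e", "r", "g", "p")
	fmt.Println(s3, s4)
}
```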

There are cases where the ingesting sstables have no data overlap with existing
sstables, but we can't make use of virtual sstables. Consider:
```
                                  s2
ingest:               [f-----i-j-------n]
                                  s1
L6:                 [e---g-----------------p---r]
             a b c d e f g h i j k l m n o p q r s t u v w x y z
```
We cannot use virtual sstables in the above scenario for two reasons:
1. We don't have a quick method of detecting no data overlap.
2. We will be forced to split the sstable in L6 into more than two virtual
   sstables, but we want to avoid many small virtual sstables in the lsm.

Note that in Cockroach, the easier-to-solve case happens very regularly when an
sstable spans a range boundary (which pebble has no knowledge of), and we ingest
a snapshot of a range in between the two already-present ranges.

`ingestFindTargetLevel` changes:
- The `ingestFindTargetLevel` function is used to determine the target level
  of the file which is being ingested. Currently, this function returns an `int`
  which is the target level for the ingesting file. Two additional return
  parameters, `[]manifest.NewFileEntry` and `*manifest.DeletedFileEntry`, will be
  added to the function (see the sketch of the revised signature after this
  list).
- If `ingestFindTargetLevel` decides to split an existing sstable into virtual
  sstables, then it will return new and deleted entries. Otherwise, it will only
  return the target level of the ingesting file.
- Within the `ingestFindTargetLevel` function, the `overlapWithIterator`
  function is used to quickly detect data overlap. In the case with file
  boundary overlap, but no data overlap, in the lowest possible level, we will
  split the existing sstable into virtual sstables and generate the
  `NewFileEntry`s and the `DeletedFileEntry`. The `FilemetaData` section
  describes how the various fields in the `FilemetaData` will be computed for
  the newly created virtual sstables.

- Note that we will not split physical sstables into virtual sstables in L0 for
  the use case described in this RFC. The benefit of doing so would be to reduce
  the number of L0 sublevels, but the cost would be additional implementation
  complexity (see the `FilemetaData` section). We also want to avoid too many
  virtual sstables in the lsm as they can lead to space amp (see the
  `Compactions` section). However, in the future, for the disaggregated storage
  masking case, we would need to support ingestion and use of virtual sstables
  in L0.

- Note that we may need an upper bound on the number of times an sstable is
  split into smaller virtual sstables. We can further reduce the risk of many
  small sstables:
  1. For CockroachDB's snapshot ingestion, there is one large sst (up to 512MB)
     and many tiny ones. We can choose to apply this splitting logic only for
     the large sst. It is ok for the tiny ssts to be ingested into L0.
  2. Split only if the ingested sst is at least half the size of the sst being
     split. So if we have a smaller ingested sst, we will pick a higher level to
     split at (where the ssts are smaller). The lifetime of virtual ssts at a
     higher level is shorter, so there is lower risk of littering the LSM with
     long-lived small virtual ssts.
  3. For the disaggregated storage implementation, we can avoid masking for tiny
     sstables being ingested and instead write a range delete like we currently
     do. Precise details on the masking use case are out of the scope of this
     RFC.
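
As referenced above, the following is a rough sketch of the revised
`ingestFindTargetLevel` return values. The existing parameters are elided and
the exact signature in Pebble may differ; it is meant only to illustrate the
proposal.

```
package pebble

import "github.com/cockroachdb/pebble/internal/manifest"

// ingestFindTargetLevelSketch illustrates the proposed additional return
// values; the real function's parameters (iterators, version, bounds, etc.)
// are elided and its actual signature may differ.
func ingestFindTargetLevelSketch() (
	targetLevel int,
	newFiles []manifest.NewFileEntry, // virtual sstables created by a split
	deletedFile *manifest.DeletedFileEntry, // the sstable replaced by the split
	err error,
) {
	// If no split is performed, newFiles is empty and deletedFile is nil, and
	// only targetLevel is meaningful.
	return targetLevel, newFiles, deletedFile, err
}
```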

`ingestApply` changes:
- The new and deleted file entries returned by the `ingestFindTargetLevel`
  function will be added to the version edit in `ingestApply` (see the sketch
  following this list).
- We will appropriately update the `levelMetrics` based on the new information
  returned by `ingestFindTargetLevel`.
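
Below is an illustrative sketch (not the actual `ingestApply` code) of how the
returned entries could be folded into the version edit; the helper name and
variables are hypothetical.

```
package pebble

import "github.com/cockroachdb/pebble/internal/manifest"

// buildIngestVersionEditSketch shows how the ingested file, the virtual
// sstables created by a split, and the replaced physical sstable could be
// recorded in a single version edit.
func buildIngestVersionEditSketch(
	targetLevel int,
	ingestedMeta *manifest.FileMetadata,
	splitFiles []manifest.NewFileEntry,
	deletedEntry *manifest.DeletedFileEntry,
	replacedMeta *manifest.FileMetadata,
) *manifest.VersionEdit {
	ve := &manifest.VersionEdit{}
	// The ingested sstable itself.
	ve.NewFiles = append(ve.NewFiles, manifest.NewFileEntry{
		Level: targetLevel,
		Meta:  ingestedMeta,
	})
	// Virtual sstables produced by splitting the existing sstable, if any.
	ve.NewFiles = append(ve.NewFiles, splitFiles...)
	if deletedEntry != nil {
		// The split sstable is removed from the level, but its physical file
		// stays on disk while virtual sstables reference it.
		ve.DeletedFiles = map[manifest.DeletedFileEntry]*manifest.FileMetadata{
			*deletedEntry: replacedMeta,
		}
	}
	return ve
}
```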


### `FilemetaData` changes

Each virtual sstable will have a unique file metadata value associated with it.
The metadata may be borrowed from the backing physical sstable, or it may be
unique to the virtual sstable.

This RFC lists the fields in the `FileMetadata` struct with information on
how each field will be populated.

`Atomic.AllowedSeeks`: This field is used for read-triggered compactions, and we
can populate it for each virtual sstable since virtual sstables can be picked
for compactions.

`Atomic.statsValid`: We can set this to true (`1`) when the virtual sstable is
created. On virtual sstable creation we will estimate the table stats of the
virtual sstable based on the table stats of the physical sstable. We can also
set this to `0` and let the table stats job asynchronously compute the stats.

`refs`: This will be turned into a pointer which will be shared by the
virtual/physical sstables. See the deletion section of the RFC to learn how the
`refs` count will be used.

`FileNum`: We could give each virtual sstable its own file number or share
the file number between all the virtual sstables. In the former case, the virtual
sstables will be distinguished by the file number, and will have an additional
metadata field to indicate the file number of the parent sstable. In the latter
case, we can use a few of the most significant bits of the 64 bit file number to
distinguish the virtual sstables.

The benefit of sharing a single file number among the virtual sstables is that
we don't need to use additional space to store the file number of the backing
physical sstable.

It might make sense to give each virtual sstable its own file number. Virtual
sstables are picked for compactions, and compactions and compaction picking
expect a unique file number for each of the files being compacted.
For example, read compactions will use the file number of a file to determine
if a file picked for compaction has already been compacted, the version edit
will expect a different file number for each virtual sstable, etc.
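
A minimal sketch of the "own file number" option follows. The struct and field
names here are purely illustrative and are not part of Pebble's `FileMetadata`;
the alternative option would instead tag a few of the most significant bits of
the shared 64 bit file number.

```
package pebble

import "github.com/cockroachdb/pebble/internal/base"

// virtualFileIdentitySketch illustrates the "own file number" option.
type virtualFileIdentitySketch struct {
	// FileNum is unique to the virtual sstable, so compaction picking, read
	// compactions, and version edits can treat it like any other file.
	FileNum base.FileNum
	// ParentFileNum identifies the backing physical sstable whose file is
	// actually opened and read from disk.
	ParentFileNum base.FileNum
}
```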

There are direct references to the `FilemetaData.FileNum` throughout Pebble. For
example, the file number is accessed when the `DB.Checkpoint` function is
called. This function iterates through the files in each level of the lsm,
constructs the filepath using the file number, and reads the file from disk. In
such cases, it is important to exclude virtual sstables.

`Size`: We compute this using linear interpolation on the number of blocks in
the parent sstable and the number of blocks in the newly created virtual sstable.
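
A minimal sketch of the interpolation (the exact notion of "blocks" used in
Pebble, e.g. data blocks vs. index blocks, may differ):

```
package main

import "fmt"

// estimateVirtualSizeSketch scales the backing file's size by the fraction of
// its blocks that fall within the virtual sstable's bounds.
func estimateVirtualSizeSketch(physicalSize, physicalBlocks, virtualBlocks uint64) uint64 {
	if physicalBlocks == 0 {
		return 0
	}
	return physicalSize * virtualBlocks / physicalBlocks
}

func main() {
	// A 128 MiB physical sstable with 4096 blocks, 1024 of which fall within
	// the virtual bounds, yields an estimated size of 32 MiB.
	fmt.Println(estimateVirtualSizeSketch(128<<20, 4096, 1024))
}
```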

`SmallestSeqNum/LargestSeqNum`: These fields depend on the parent sstable,
but we would need to perform a scan of the physical sstable to compute these
accurately for the virtual sstable upon creation. Instead, we could convert
these fields into lower and upper bounds of the sequence numbers in a file.

These fields are used for L0 sublevels, pebble tooling, delete compaction hints,
and a lot of plumbing. We don't need to worry about the L0 sublevels use case
because we won't have virtual sstables in L0 for the use case in this RFC. For
the rest of the use cases, a lower bound for the smallest seq number and an
upper bound for the largest seq number will work.

TODO(bananabrick): Add more detail for any delete compaction hint changes if
necessary.

`Smallest/Largest`: These, along with the smallest/largest ranges for the range
and point keys, can be computed upon virtual sstable creation. Precisely, these
can be computed when we try to detect data overlap in the `overlapWithIterator`
function during ingestion.

`Stats`: `TableStats` will either be computed upon virtual sstable creation
using linear interpolation on the block counts of the virtual/physical sstables
or asynchronously using the file bounds of the virtual sstable.

`PhysicalState`: We can add an additional struct with state associated with
physical ssts which have been virtualized.

```
type PhysicalState struct {
  // Total refs across all virtual ssts * versions. That is, if the same virtual
  // sst is present in multiple versions, it may have multiple refs, if the
  // btree node is not the same.
  totalRefs int32

  // Number of virtual ssts in the latest version that refer to this physical
  // SST. Will be 1 if there is only a physical sst, or there is only 1 virtual
  // sst referencing this physical sst.
  // INVARIANT: refsInLatestVersion <= totalRefs
  // refsInLatestVersion == 0 is a zombie sstable.
  refsInLatestVersion int32

  fileSize uint64

  // If sst is not virtualized and in latest version
  // virtualSizeSumInLatestVersion == fileSize. If
  // virtualSizeSumInLatestVersion > 0 and
  // virtualSizeSumInLatestVersion/fileSize is very small, the corresponding
  // virtual sst(s) should be candidates for compaction. These candidates can be
  // tracked via btree annotations. Incrementally updated in
  // BulkVersionEdit.Apply, when updating refsInLatestVersion.
  virtualSizeSumInLatestVersion uint64
}
```
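
As a sketch of the bookkeeping implied by the comments above (the method names
are hypothetical and extend the struct shown here; the struct comments suggest
the updates would happen incrementally in `BulkVersionEdit.Apply`):

```
// addedToLatestVersion records a virtual sst that references this physical
// sst being added to the latest version.
func (p *PhysicalState) addedToLatestVersion(virtualSize uint64) {
	p.refsInLatestVersion++
	p.virtualSizeSumInLatestVersion += virtualSize
}

// removedFromLatestVersion records such a virtual sst being removed from the
// latest version. Once refsInLatestVersion reaches 0, the physical sst is a
// zombie: it is kept alive only by older versions.
func (p *PhysicalState) removedFromLatestVersion(virtualSize uint64) {
	p.refsInLatestVersion--
	p.virtualSizeSumInLatestVersion -= virtualSize
}
```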

The `Deletion` section and the `Compactions` section describe why we need to
store the `PhysicalState`.

### Deletion of physical and virtual sstables

We want to ensure that the physical sstable is only deleted from disk when no
version references it, and when there are no virtual sstables which are backed
by the physical sstable.

Since `FilemetaData.refs` is a pointer which is shared by the physical and
virtual sstables, the physical sstable won't be deleted when it is removed
from the latest version, as the `FilemetaData.refs` count will have been
increased when the virtual sstable is added to a version. Therefore, we only
need to ensure that the physical sstable is eventually deleted when there are
no versions which reference it.

Sstables are deleted from disk by the `DB.doDeleteObsoleteFiles` function which
looks for files to delete in the `DB.mu.versions.obsoleteTables` slice.
So we need to ensure that any physical sstable which was virtualized is added to
the obsolete tables list iff `FilemetaData.refs` is 0.

Sstables are added to the obsolete file list when a `Version` is unrefed and
when `DB.scanObsoleteFiles` is called when Pebble is opened.

When a `Version` is unrefed, sstables referenced by it are only added to the
obsolete table list if the `FilemetaData.refs` count hits 0 for the sstable.
With virtual sstables, we can have a case where the last version which directly
references a physical sstable is unrefed, but the physical sstable is not added
to the obsolete table list because its `FilemetaData.refs` count is still
non-zero due to indirect references through virtual sstables. Since the last
`Version` which directly references the physical sstable is deleted, the
physical sstable would otherwise never get added to the obsolete table list.
Since virtual sstables keep track of their parent physical sstable, we can
simply add the physical sstable to the obsolete table list when the last
virtual sstable which references it is deleted.
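
An illustrative sketch of that rule (simplified stand-in types, not Pebble's
actual metadata or deletion code):

```
package main

import "sync/atomic"

// fileMetaSketch is a simplified stand-in for the file metadata, with only the
// fields needed to illustrate the deletion rule.
type fileMetaSketch struct {
	refs   *int32          // shared by a physical sstable and its virtual sstables
	parent *fileMetaSketch // nil for physical sstables
}

// unref applies the rule: the physical file becomes obsolete only when the
// shared refs count hits zero, regardless of whether the final unref came from
// the physical sstable itself or from the last virtual sstable backed by it.
func unref(meta *fileMetaSketch, obsoleteTables []*fileMetaSketch) []*fileMetaSketch {
	if atomic.AddInt32(meta.refs, -1) == 0 {
		physical := meta
		if meta.parent != nil {
			physical = meta.parent
		}
		obsoleteTables = append(obsoleteTables, physical)
	}
	return obsoleteTables
}

func main() {
	refs := int32(2) // one ref from an older version, one from a virtual sstable
	physical := &fileMetaSketch{refs: &refs}
	virtual := &fileMetaSketch{refs: &refs, parent: physical}

	var obsolete []*fileMetaSketch
	obsolete = unref(physical, obsolete) // older version unrefs: refs == 1, nothing to delete
	obsolete = unref(virtual, obsolete)  // last virtual sstable unrefs: physical file is obsolete
	_ = obsolete
}
```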

`DB.scanObsoleteFiles` will delete any file which isn't referenced by the
`VersionSet.versions` list. So, it's possible that a physical sstable associated
with a virtual sstable will be deleted. This problem can be fixed by a small
tweak in `d.mu.versions.addLiveFileNums` to treat the parent sstable of
a virtual sstable as a live file.

Deleted files still referenced by older versions are considered zombie sstables.
We can extend the definition of zombie sstables to be any sstable which is not
directly, or indirectly through virtual sstables, referenced by the latest
version. See the `PhysicalState` subsection of the `FilemetaData` section
where we describe how the references in the latest version will be tracked.


### Reading from virtual sstables

Since virtual sstables do not exist on disk, we will have to redirect reads
to the physical sstable which backs the virtual sstable.

All reads to the physical files go through the table cache which opens the file
on disk and creates a `Reader` for the reads. The table cache currently creates
a `FileNum` -> `Reader` mapping for the physical sstables.

Most of the functions in the table cache API take the file metadata of the file
as a parameter. Examples include `newIters`, `newRangeKeyIter`, `withReader`,
etc. Each of these functions then calls a subsequent function on the sstable
`Reader`.

In the `Reader` API, some functions only really need to be called on physical
sstables, whereas some functions need to be called on both physical and virtual
sstables. For example, the `Reader.EstimateDiskUsage` function or the
`Reader.Layout` function only need to be called on physical sstables, whereas
functions like `Reader.NewIter` and `Reader.NewCompactionIter` need to work
with virtual sstables.

We could either have an abstraction over the physical sstable `Reader` per
virtual sstable, or update the `Reader` API to accept file bounds of the
sstable. In the latter case, we would create one `Reader` on the physical
sstable for all of the virtual sstables, and update the `Reader` API to accept
the file bounds of the sstable.

Changes required to share a `Reader` on the physical sstable among the virtual
sstables (a sketch follows this list):
- If the file metadata of the virtual sstable is passed into the table cache, on
  a table cache miss, the table cache will load the Reader for the physical
  sstable. This step can be performed in the `tableCacheValue.load` function. On
  a table cache hit, the file number of the parent sstable will be used to fetch
  the appropriate sstable `Reader`.
- The `Reader` API will be updated to support reads from virtual sstables. For
  example, the `NewCompactionIter` function will take additional
  `lower,upper []byte` parameters.
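
A simplified sketch of the shared-`Reader` lookup (stand-in types; the real
table cache in Pebble is sharded, reference counted, and handles eviction):

```
package main

// fileNum is a stand-in for Pebble's file number type.
type fileNum uint64

// readerSketch stands in for an sstable Reader opened on a physical file.
type readerSketch struct{ physicalFileNum fileNum }

// metaSketch is a simplified file metadata: a virtual sstable records the file
// number of its backing physical sstable.
type metaSketch struct {
	fileNum       fileNum
	parentFileNum fileNum // equal to fileNum for physical sstables
	virtual       bool
}

// cacheSketch keys Readers by the physical file number, so all virtual
// sstables backed by the same file share one Reader.
type cacheSketch struct {
	readers map[fileNum]*readerSketch
}

func (c *cacheSketch) getReader(m metaSketch) *readerSketch {
	key := m.fileNum
	if m.virtual {
		key = m.parentFileNum
	}
	if r, ok := c.readers[key]; ok {
		return r // hit: reuse the backing physical sstable's Reader
	}
	// Miss: open the physical file (elided) and cache its Reader; in Pebble
	// this would correspond to tableCacheValue.load.
	r := &readerSketch{physicalFileNum: key}
	c.readers[key] = r
	return r
}

func main() {
	c := &cacheSketch{readers: map[fileNum]*readerSketch{}}
	v1 := metaSketch{fileNum: 12, parentFileNum: 7, virtual: true}
	v2 := metaSketch{fileNum: 13, parentFileNum: 7, virtual: true}
	_ = c.getReader(v1) == c.getReader(v2) // true: one shared Reader for file 7
}
```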

Updates to iterators:
- `Reader.NewIter` already has `lower,upper []byte` parameters so this requires
   no change.
- Add `lower,upper` fields to the `Reader.NewCompactionIter`. The function
  initializes single level and two level iterators, and we can pass in the
  `lower,upper` values to those. TODO(bananabrick): Make sure that the value
  of `bytesIterated` in the compaction iterator is still accurate.
- `Reader.NewRawRangeKeyIter/NewRawRangeDelIter`: We need to add `lower/upper`
   fields to the functions. Both iterators make use of a `fragmentBlockIter`. We
   could filter keys above the `fragmentBlockIter` or add filtering within the
   `fragmentBlockIter`. To add filtering within the `fragmentBlockIter` we will
   initialize it with two additional `lower/upper []byte` fields.
- We would need to update the `SetBounds` logic for the sstable iterators to
  never set bounds for the iterators outside the virtual sstable bounds.
  Otherwise, keys outside the virtual sstable bounds, but inside the physical
  sstable bounds, could be surfaced (see the clamping sketch after this list).
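
A minimal sketch of the bound clamping (using the default bytewise comparison
for simplicity; Pebble would use the configured `Compare`, and exclusive upper
bounds need more care than shown here):

```
package main

import "bytes"

// clampBoundsSketch intersects caller-provided iterator bounds with the
// virtual sstable's bounds so that keys present only in the backing physical
// sstable are never surfaced.
func clampBoundsSketch(lower, upper, virtualLower, virtualUpper []byte) ([]byte, []byte) {
	if lower == nil || bytes.Compare(lower, virtualLower) < 0 {
		lower = virtualLower
	}
	if upper == nil || bytes.Compare(upper, virtualUpper) > 0 {
		upper = virtualUpper
	}
	return lower, upper
}

func main() {
	// Virtual bounds [i, n]: a caller asking for [e, z) is clamped to [i, n].
	lo, hi := clampBoundsSketch([]byte("e"), []byte("z"), []byte("i"), []byte("n"))
	_, _ = lo, hi
}
```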

TODO(bananabrick): Add a section about sstable properties, if necessary.

### Compactions

Virtual sstables can be picked for compactions. If the `FilemetaData` and the
iterator stack changes work, then compaction shouldn't require much, if any,
additional work.

Virtual sstables which are picked for compactions may cause space amplification.
For example, say we have two virtual sstables `a` and `b` in L5, backed by a
physical sstable `c`, and the sstable `a` is picked for a compaction. We will
write some additional data into L6, but we won't delete sstable `c` because
sstable `b` still refers to it. In the worst case, sstable `b` will never be
picked for compaction and will never be compacted into, and we'll have permanent
space amplification. We should try to prioritize compaction of sstable `b` to
prevent such a scenario.

See the `PhysicalState` subsection in the `FilemetaData` section to see how
we'll store compaction picking metrics to reduce virtual sstable space-amp.
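
A sketch of the heuristic implied by the `PhysicalState` comments: when the
virtual data still live in the latest version is a small fraction of the
backing file, compacting the remaining virtual sstable(s) frees a
disproportionate amount of disk space. The threshold below is illustrative.

```
package main

// shouldPrioritizeVirtualCompactionSketch reports whether the virtual
// sstable(s) backed by a physical file should be prioritized for compaction,
// based on virtualSizeSumInLatestVersion / fileSize from PhysicalState.
func shouldPrioritizeVirtualCompactionSketch(virtualSizeSum, fileSize uint64) bool {
	if fileSize == 0 || virtualSizeSum == 0 {
		return false
	}
	return float64(virtualSizeSum)/float64(fileSize) < 0.05
}

func main() {
	// A 128 MiB physical file with only 2 MiB still referenced by the latest
	// version is a good candidate for prioritized compaction.
	_ = shouldPrioritizeVirtualCompactionSketch(2<<20, 128<<20)
}
```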

### `VersionEdit` decode/encode

Any additional fields added to the `FilemetaData` need to be supported in the
version edit `decode/encode` functions.