- Feature Name: Virtual sstables
- Status: in-progress
- Start Date: 2022-10-27
- Authors: Arjun Nair
- RFC PR: https://github.com/cockroachdb/pebble/pull/2116
- Pebble Issues:
  https://github.com/cockroachdb/pebble/issues/1683


**Design Draft**

# Summary

This RFC outlines the design for virtualizing physical sstables in Pebble.

A virtual sstable has no associated physical data on disk; it is instead backed
by an existing physical sstable. Each physical sstable may be shared by one or
more virtual sstables.

Initially, the design will be used to lower the read-amp and write-amp caused
by certain ingestions. Sometimes an ingestion cannot place an incoming file,
which has no data overlap with any file in the lsm, low in the lsm because its
file boundaries overlap with existing files. In this case we are forced to
place the file higher in the lsm, sometimes in L0, which causes higher read-amp
and unnecessary write-amp as the file is later compacted down the lsm. See
https://github.com/cockroachdb/cockroach/issues/80589 for the problem occurring
in practice.

Eventually, the design will also be used for the disaggregated storage masking
use case: https://github.com/cockroachdb/cockroach/pull/70419/files.

This document describes the design of virtual sstables in Pebble with enough
detail to aid the implementation and code review.

# Design

### Ingestion

When an sstable is ingested into Pebble, we try to place it in the lowest level
without any data overlap or any file boundary overlap. We can make use of
virtual sstables in the cases where we're forced to place the ingested sstable
at a higher level due to file boundary overlap, but no data overlap.

```
                             s2
ingest:                 [i-j-------n]
                    s1
L6:             [e---g-----------------p---r]
         a b c d e f g h i j k l m n o p q r s t u v w x y z
```

Consider the sstable s1 in L6 and the ingesting sstable s2. It is clear that
the file boundaries of s1 and s2 overlap, but there is no data overlap, as
shown in the diagram. Currently, we will be forced to ingest the sstable s2
into a level higher than L6. With virtual sstables, we can split the existing
sstable s1 into two virtual sstables s3 and s4, as shown in the following
diagram.

```
                   s3        s2          s4
L6:             [e---g]-[i-j-------n]-[p---r]
         a b c d e f g h i j k l m n o p q r s t u v w x y z
```

The sstable s1 will be deleted from the lsm. If s1 was a physical sstable, then
we will keep the file on disk for as long as it is needed to back the virtual
sstables.

There are cases where the ingesting sstables have no data overlap with existing
sstables, but we can't make use of virtual sstables. Consider:
```
                           s2
ingest:           [f-----i-j-------n]
                    s1
L6:             [e---g-----------------p---r]
         a b c d e f g h i j k l m n o p q r s t u v w x y z
```
We cannot use virtual sstables in the above scenario for two reasons:
1. We don't have a quick method of detecting no data overlap.
2. We would be forced to split the sstable in L6 into more than two virtual
   sstables, but we want to avoid many small virtual sstables in the lsm.
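Returning to the simple case above (s1 split into s3 and s4), the following is
a minimal sketch of how the bounds of the two virtual sstables could be derived
from the existing file and the ingested file. The `bounds` type and
`splitForIngest` helper are hypothetical; the real implementation operates on
`manifest.FileMetadata`, and it would tighten the resulting bounds to keys that
are actually present in the backing sstable (g and p in the diagram), which it
can discover while detecting data overlap.

```
package main

import "fmt"

// bounds is a simplified, hypothetical stand-in for an sstable's user-key
// boundaries. The real code works with manifest.FileMetadata and InternalKeys.
type bounds struct {
	smallest, largest string
}

// splitForIngest sketches the two-way split described above: given an existing
// sstable (s1) whose boundaries overlap an ingested sstable (s2) with no data
// overlap, it derives loose bounds for the two virtual sstables (s3, s4). In
// practice the bounds would be tightened to the largest key in the backing
// file below ingest.smallest and the smallest key above ingest.largest.
func splitForIngest(existing, ingest bounds) (left, right bounds) {
	left = bounds{smallest: existing.smallest, largest: ingest.smallest}
	right = bounds{smallest: ingest.largest, largest: existing.largest}
	return left, right
}

func main() {
	s1 := bounds{"e", "r"}
	s2 := bounds{"i", "n"}
	s3, s4 := splitForIngest(s1, s2)
	fmt.Printf("s3=%v s4=%v\n", s3, s4) // s3={e i} s4={n r}
}
```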
Note that in Cockroach, the easier-to-solve case happens very regularly: an
sstable spans a range boundary (which Pebble has no knowledge of), and we
ingest a snapshot of a range that slots in between the two already-present
ranges.

`ingestFindTargetLevel` changes:
- The `ingestFindTargetLevel` function is used to determine the target level
  of the file which is being ingested. Currently, this function returns an
  `int` which is the target level for the ingesting file. Two additional return
  values, `[]manifest.NewFileEntry` and `*manifest.DeletedFileEntry`, will be
  added to the function.
- If `ingestFindTargetLevel` decides to split an existing sstable into virtual
  sstables, then it will return new and deleted entries. Otherwise, it will
  only return the target level of the ingesting file.
- Within the `ingestFindTargetLevel` function, the `overlapWithIterator`
  function is used to quickly detect data overlap. In the case with file
  boundary overlap, but no data overlap, in the lowest possible level, we will
  split the existing sstable into virtual sstables and generate the
  `NewFileEntry`s and the `DeletedFileEntry`. The `FileMetadata` section
  describes how the various fields in the `FileMetadata` will be computed for
  the newly created virtual sstables.

- Note that we will not split physical sstables into virtual sstables in L0 for
  the use case described in this RFC. The benefit of doing so would be to
  reduce the number of L0 sublevels, but the cost would be additional
  implementation complexity (see the `FileMetadata` section). We also want to
  avoid too many virtual sstables in the lsm as they can lead to space amp (see
  the `Compactions` section). However, in the future, for the disaggregated
  storage masking case, we would need to support ingestion and use of virtual
  sstables in L0.

- Note that we may need an upper bound on the number of times an sstable is
  split into smaller virtual sstables. We can further reduce the risk of many
  small sstables:
  1. For CockroachDB's snapshot ingestion, there is one large sst (up to 512MB)
     and many tiny ones. We can choose to apply this splitting logic only for
     the large sst. It is ok for the tiny ssts to be ingested into L0.
  2. Split only if the ingested sst is at least half the size of the sst being
     split. So if we have a smaller ingested sst, we will pick a higher level
     to split at (where the ssts are smaller). The lifetime of virtual ssts at
     a higher level is shorter, so there is lower risk of littering the LSM
     with long-lived small virtual ssts.
  3. For the disaggregated storage implementation, we can avoid masking for
     tiny sstables being ingested and instead write a range delete like we
     currently do. Precise details on the masking use case are out of the scope
     of this RFC.

`ingestApply` changes:
- The new and deleted file entries returned by the `ingestFindTargetLevel`
  function will be added to the version edit in `ingestApply` (see the sketch
  after this list).
- We will appropriately update the `levelMetrics` based on the new information
  returned by `ingestFindTargetLevel`.
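The following is a rough, self-contained sketch of how the extra return values
from `ingestFindTargetLevel` could flow into the version edit built by
`ingestApply`. The types below are simplified stand-ins for
`manifest.FileMetadata`, `manifest.NewFileEntry`, `manifest.DeletedFileEntry`,
and `manifest.VersionEdit`, and the function signatures are illustrative
rather than the real ones.

```
package main

import "fmt"

// Simplified stand-ins for the manifest types referenced above.
type FileMetadata struct {
	FileNum  uint64
	Smallest string
	Largest  string
}

type NewFileEntry struct {
	Level int
	Meta  *FileMetadata
}

type DeletedFileEntry struct {
	Level   int
	FileNum uint64
}

type VersionEdit struct {
	NewFiles     []NewFileEntry
	DeletedFiles map[DeletedFileEntry]*FileMetadata
}

// ingestFindTargetLevel returns the target level for the ingested file and,
// if an existing sstable must be split into virtual sstables, the resulting
// new and deleted file entries.
func ingestFindTargetLevel(ingest *FileMetadata) (int, []NewFileEntry, *DeletedFileEntry) {
	// Hypothetical outcome for the s1/s2/s3/s4 example from the Ingestion
	// section: s1 (FileNum 1) in L6 is split into virtual sstables s3 and s4.
	s3 := &FileMetadata{FileNum: 2, Smallest: "e", Largest: "g"}
	s4 := &FileMetadata{FileNum: 3, Smallest: "p", Largest: "r"}
	return 6, []NewFileEntry{{Level: 6, Meta: s3}, {Level: 6, Meta: s4}},
		&DeletedFileEntry{Level: 6, FileNum: 1}
}

// ingestApply folds the ingested file and any split results into a version edit.
func ingestApply(ingest *FileMetadata) *VersionEdit {
	level, newFiles, deleted := ingestFindTargetLevel(ingest)
	ve := &VersionEdit{DeletedFiles: map[DeletedFileEntry]*FileMetadata{}}
	ve.NewFiles = append(ve.NewFiles, NewFileEntry{Level: level, Meta: ingest})
	ve.NewFiles = append(ve.NewFiles, newFiles...)
	if deleted != nil {
		ve.DeletedFiles[*deleted] = nil // metadata lookup omitted in this sketch
	}
	return ve
}

func main() {
	s2 := &FileMetadata{FileNum: 4, Smallest: "i", Largest: "n"}
	fmt.Printf("%+v\n", ingestApply(s2))
}
```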
### `FileMetadata` changes

Each virtual sstable will have a unique file metadata value associated with it.
The metadata may be borrowed from the backing physical sstable, or it may be
unique to the virtual sstable.

This RFC lists the fields in the `FileMetadata` struct along with how each
field will be populated.

`Atomic.AllowedSeeks`: This field is used for read-triggered compactions, and
we can populate it for each virtual sstable since virtual sstables can be
picked for compactions.

`Atomic.statsValid`: We can set this to true (`1`) when the virtual sstable is
created. On virtual sstable creation we will estimate the table stats of the
virtual sstable based on the table stats of the physical sstable. We can also
set this to `0` and let the table stats job compute the stats asynchronously.

`refs`: This will be turned into a pointer which is shared by the
virtual/physical sstables. See the deletion section of the RFC for how the
`refs` count will be used.

`FileNum`: We could give each virtual sstable its own file number, or share the
file number between all the virtual sstables. In the former case, the virtual
sstables will be distinguished by their file numbers, and will have an
additional metadata field to indicate the file number of the parent sstable. In
the latter case, we can use a few of the most significant bits of the 64-bit
file number to distinguish the virtual sstables.

The benefit of using a single file number for all the virtual sstables is that
we don't need to use additional space to store the file number of the backing
physical sstable.

It might make sense to give each virtual sstable its own file number. Virtual
sstables are picked for compactions, and compactions and compaction picking
expect a unique file number for each of the files being compacted. For example,
read compactions will use the file number of the file to determine whether a
file picked for compaction has already been compacted, the version edit will
expect a different file number for each virtual sstable, etc.

There are direct references to `FileMetadata.FileNum` throughout Pebble. For
example, the file number is accessed when the `DB.Checkpoint` function is
called. This function iterates through the files in each level of the lsm,
constructs the filepath using the file number, and reads the file from disk. In
such cases, it is important to exclude virtual sstables.

`Size`: We compute this using linear interpolation on the number of blocks in
the parent sstable and the number of blocks in the newly created virtual
sstable.
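As a rough illustration of that interpolation (the helper and its inputs are
hypothetical; the real computation would derive block counts from the backing
sstable's index):

```
package main

import "fmt"

// estimateVirtualSize sketches the linear interpolation described above: the
// virtual sstable's size is estimated as the fraction of the backing physical
// sstable's data blocks that fall within the virtual sstable's bounds,
// multiplied by the physical file's size. All names here are illustrative.
func estimateVirtualSize(physicalFileSize, physicalBlocks, virtualBlocks uint64) uint64 {
	if physicalBlocks == 0 {
		return 0
	}
	return physicalFileSize * virtualBlocks / physicalBlocks
}

func main() {
	// A 128 MiB physical sstable with 32768 data blocks, of which 4096 fall
	// within the virtual sstable's bounds.
	fmt.Println(estimateVirtualSize(128<<20, 32768, 4096)) // 16777216 (16 MiB)
}
```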
`SmallestSeqNum/LargestSeqNum`: These fields depend on the parent sstable, but
we would need to perform a scan of the physical sstable to compute them
accurately for the virtual sstable upon creation. Instead, we could convert
these fields into lower and upper bounds on the sequence numbers in a file.

These fields are used for L0 sublevels, Pebble tooling, delete compaction
hints, and a lot of plumbing. We don't need to worry about the L0 sublevels use
case because we won't have virtual sstables in L0 for the use case in this RFC.
For the rest of the use cases, a lower bound for the smallest sequence number
and an upper bound for the largest sequence number will work.

TODO(bananabrick): Add more detail for any delete compaction hint changes if
necessary.

`Smallest/Largest`: These, along with the smallest/largest ranges for the range
and point keys, can be computed upon virtual sstable creation. Precisely, these
can be computed when we try to detect data overlap in the `overlapWithIterator`
function during ingestion.

`Stats`: `TableStats` will either be computed upon virtual sstable creation
using linear interpolation on the block counts of the virtual/physical
sstables, or asynchronously using the file bounds of the virtual sstable.

`PhysicalState`: We can add an additional struct with state associated with
physical ssts which have been virtualized.

```
type PhysicalState struct {
  // Total refs across all virtual ssts * versions. That is, if the same virtual
  // sst is present in multiple versions, it may have multiple refs, if the
  // btree node is not the same.
  totalRefs int32

  // Number of virtual ssts in the latest version that refer to this physical
  // SST. Will be 1 if there is only a physical sst, or there is only 1 virtual
  // sst referencing this physical sst.
  // INVARIANT: refsInLatestVersion <= totalRefs
  // refsInLatestVersion == 0 is a zombie sstable.
  refsInLatestVersion int32

  fileSize uint64

  // If the sst is not virtualized and is in the latest version,
  // virtualSizeSumInLatestVersion == fileSize. If
  // virtualSizeSumInLatestVersion > 0 and
  // virtualSizeSumInLatestVersion/fileSize is very small, the corresponding
  // virtual sst(s) should be candidates for compaction. These candidates can be
  // tracked via btree annotations. Incrementally updated in
  // BulkVersionEdit.Apply, when updating refsInLatestVersion.
  virtualSizeSumInLatestVersion uint64
}
```

The `Deletion` section and the `Compactions` section describe why we need to
store the `PhysicalState`.

### Deletion of physical and virtual sstables

We want to ensure that a physical sstable is only deleted from disk when no
version references it, and when there are no virtual sstables which are backed
by the physical sstable.

Since `FileMetadata.refs` is a pointer which is shared by the physical and
virtual sstables, the physical sstable won't be deleted when it is removed from
the latest version, as the `FileMetadata.refs` count will have been increased
when the virtual sstable was added to a version. Therefore, we only need to
ensure that the physical sstable is eventually deleted when there are no
versions which reference it.

Sstables are deleted from disk by the `DB.doDeleteObsoleteFiles` function,
which looks for files to delete in the `DB.mu.versions.obsoleteTables` slice.
So we need to ensure that any physical sstable which was virtualized is added
to the obsolete tables list iff `FileMetadata.refs` is 0.

Sstables are added to the obsolete file list when a `Version` is unrefed and
when `DB.scanObsoleteFiles` is called when Pebble is opened.

When a `Version` is unrefed, the sstables referenced by it are only added to
the obsolete table list if the `FileMetadata.refs` count hits 0 for the
sstable. With virtual sstables, we can have a case where the last version which
directly references a physical sstable is unrefed, but the physical sstable is
not added to the obsolete table list because its `FileMetadata.refs` count is
not 0 due to indirect references through virtual sstables. Since the last
`Version` which directly references the physical sstable has been deleted, the
physical sstable will never get added to the obsolete table list. Since virtual
sstables keep track of their parent physical sstable, we can simply add the
physical sstable to the obsolete table list when the last virtual sstable which
references it is deleted.
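A minimal sketch of that rule, assuming a shared reference count; the
`physicalTable`, `obsoleteTables`, and `unref` names are stand-ins, since the
real logic lives in version unref'ing and obsolete-file collection rather than
in a standalone helper:

```
package main

import "fmt"

// physicalTable is a simplified stand-in for the backing physical sstable's
// shared state: a refs count shared by the physical sstable and all virtual
// sstables backed by it.
type physicalTable struct {
	fileNum uint64
	refs    int32
}

// obsoleteTables is a stand-in for DB.mu.versions.obsoleteTables.
var obsoleteTables []uint64

// unref decrements the shared count; only when it reaches zero, meaning no
// version and no virtual sstable references the file, is the physical sstable
// added to the obsolete table list for deletion from disk.
func unref(p *physicalTable) {
	p.refs--
	if p.refs == 0 {
		obsoleteTables = append(obsoleteTables, p.fileNum)
	}
}

func main() {
	// One direct reference from a version plus two virtual sstables.
	p := &physicalTable{fileNum: 7, refs: 3}
	unref(p) // last version that directly references the physical sstable
	unref(p) // first virtual sstable deleted
	unref(p) // last virtual sstable deleted -> file becomes obsolete
	fmt.Println(obsoleteTables) // [7]
}
```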
`DB.scanObsoleteFiles` will delete any file which isn't referenced by the
`VersionSet.versions` list. So it's possible that a physical sstable associated
with a virtual sstable will be deleted. This problem can be fixed by a small
tweak in `d.mu.versions.addLiveFileNums` to treat the parent sstable of a
virtual sstable as a live file.

Deleted files still referenced by older versions are considered zombie
sstables. We can extend the definition of zombie sstables to be any sstable
which is not referenced by the latest version, either directly or indirectly
through virtual sstables. See the `PhysicalState` subsection of the
`FileMetadata` section, which describes how the references in the latest
version will be tracked.


### Reading from virtual sstables

Since virtual sstables do not exist on disk, we will have to redirect reads to
the physical sstable which backs the virtual sstable.

All reads of physical files go through the table cache, which opens the file on
disk and creates a `Reader` for the reads. The table cache currently creates a
`FileNum` -> `Reader` mapping for the physical sstables.

Most of the functions in the table cache API take the file metadata of the file
as a parameter. Examples include `newIters`, `newRangeKeyIter`, `withReader`,
etc. Each of these functions then calls a subsequent function on the sstable
`Reader`.

In the `Reader` API, some functions only really need to be called on physical
sstables, whereas some functions need to be called on both physical and virtual
sstables. For example, the `Reader.EstimateDiskUsage` function or the
`Reader.Layout` function only need to be called on physical sstables, whereas
functions like `Reader.NewIter` and `Reader.NewCompactionIter` need to work
with virtual sstables.

We could either have an abstraction over the physical sstable `Reader` per
virtual sstable, or update the `Reader` API to accept the file bounds of the
sstable. In the latter case, we would create one `Reader` on the physical
sstable for all of the virtual sstables, and update the `Reader` API to accept
the file bounds of the sstable.

Changes required to share a `Reader` on the physical sstable among the virtual
sstables (see the sketch after this list):
- If the file metadata of the virtual sstable is passed into the table cache,
  on a table cache miss, the table cache will load the `Reader` for the
  physical sstable. This step can be performed in the `tableCacheValue.load`
  function. On a table cache hit, the file number of the parent sstable will be
  used to fetch the appropriate sstable `Reader`.
- The `Reader` API will be updated to support reads from virtual sstables. For
  example, the `NewCompactionIter` function will take additional
  `lower,upper []byte` parameters.
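A minimal sketch of that lookup and of bounds-aware iterator construction,
using simplified stand-ins for the table cache, `Reader`, and file metadata
(the real `tableCacheValue.load` and `sstable.Reader` APIs differ):

```
package main

import "fmt"

// fileMetadata is a simplified stand-in for manifest.FileMetadata with just
// enough fields for this sketch.
type fileMetadata struct {
	fileNum       uint64
	isVirtual     bool
	parentFileNum uint64 // backing physical sstable, if virtual
	smallest      string // virtual bounds
	largest       string
}

// reader is a stand-in for sstable.Reader; it is always opened on a physical file.
type reader struct{ fileNum uint64 }

// newIter mirrors the idea of passing the virtual sstable's bounds down to the
// iterator so reads never escape the virtual bounds.
func (r *reader) newIter(lower, upper string) string {
	return fmt.Sprintf("iter over file %d bounded to [%s, %s]", r.fileNum, lower, upper)
}

// tableCache is a stand-in for Pebble's table cache: FileNum -> Reader.
type tableCache struct{ readers map[uint64]*reader }

// getReader returns the Reader for the physical file backing meta, loading it
// on a miss. A virtual sstable is resolved to its parent's file number first.
func (c *tableCache) getReader(meta *fileMetadata) *reader {
	fileNum := meta.fileNum
	if meta.isVirtual {
		fileNum = meta.parentFileNum
	}
	r, ok := c.readers[fileNum]
	if !ok {
		r = &reader{fileNum: fileNum} // open the physical file on disk
		c.readers[fileNum] = r
	}
	return r
}

func main() {
	c := &tableCache{readers: map[uint64]*reader{}}
	s3 := &fileMetadata{fileNum: 2, isVirtual: true, parentFileNum: 1, smallest: "e", largest: "g"}
	fmt.Println(c.getReader(s3).newIter(s3.smallest, s3.largest))
}
```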
Updates to iterators:
- `Reader.NewIter` already has `lower,upper []byte` parameters, so this
  requires no change.
- Add `lower,upper` fields to `Reader.NewCompactionIter`. The function
  initializes single-level and two-level iterators, and we can pass the
  `lower,upper` values to those. TODO(bananabrick): Make sure that the value of
  `bytesIterated` in the compaction iterator is still accurate.
- `Reader.NewRawRangeKeyIter/NewRawRangeDelIter`: We need to add `lower/upper`
  fields to the functions. Both iterators make use of a `fragmentBlockIter`. We
  could filter keys above the `fragmentBlockIter` or add filtering within the
  `fragmentBlockIter`. To add filtering within the `fragmentBlockIter`, we will
  initialize it with two additional `lower/upper []byte` fields.
- We would need to update the `SetBounds` logic for the sstable iterators to
  never set bounds for the iterators outside the virtual sstable bounds.
  Otherwise, keys outside the virtual sstable bounds, but inside the physical
  sstable bounds, could be surfaced.

TODO(bananabrick): Add a section about sstable properties, if necessary.

### Compactions

Virtual sstables can be picked for compactions. If the `FileMetadata` and the
iterator stack changes work, then compactions shouldn't require much, if any,
additional work.

Virtual sstables which are picked for compactions may cause space
amplification. For example, say we have two virtual sstables `a` and `b` in L5,
backed by a physical sstable `c`, and sstable `a` is picked for a compaction.
We will write some additional data into L6, but we won't delete sstable `c`
because sstable `b` still refers to it. In the worst case, sstable `b` will
never be picked for compaction and will never be compacted into, and we'll have
permanent space amplification. We should try to prioritize compaction of
sstable `b` to prevent such a scenario.

See the `PhysicalState` subsection in the `FileMetadata` section for how we'll
store compaction-picking metrics to reduce virtual sstable space amp.

### `VersionEdit` decode/encode

Any additional fields added to the `FileMetadata` need to be supported in the
version edit `decode/encode` functions.
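Purely as an illustration of the kind of round-trip a new field needs, here is
a generic varint-based encode/decode of a hypothetical backing-file-number
field. The tag value and layout are invented for this sketch; Pebble's actual
`VersionEdit` encoding defines its own tag constants and record format.

```
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// tagVirtualBackingFileNum is a made-up tag; the real version edit encoding
// uses its own tag constants and layout.
const tagVirtualBackingFileNum = 200

// encodeBackingFileNum appends a (tag, value) pair for a virtual sstable's
// backing physical file number to an encoded buffer.
func encodeBackingFileNum(buf *bytes.Buffer, backingFileNum uint64) {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], tagVirtualBackingFileNum)
	buf.Write(tmp[:n])
	n = binary.PutUvarint(tmp[:], backingFileNum)
	buf.Write(tmp[:n])
}

// decodeBackingFileNum reads the (tag, value) pair back, returning an error if
// the tag does not match.
func decodeBackingFileNum(r *bytes.Reader) (uint64, error) {
	tag, err := binary.ReadUvarint(r)
	if err != nil {
		return 0, err
	}
	if tag != tagVirtualBackingFileNum {
		return 0, fmt.Errorf("unexpected tag %d", tag)
	}
	return binary.ReadUvarint(r)
}

func main() {
	var buf bytes.Buffer
	encodeBackingFileNum(&buf, 123)
	v, err := decodeBackingFileNum(bytes.NewReader(buf.Bytes()))
	fmt.Println(v, err) // 123 <nil>
}
```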