- Feature Name: External Storage for DistSQL processors
- Status: completed
- Start Date: 2017-05-11
- Authors: Arjun Narayan and Alfonso Subiotto Marques
- RFC PR: [#16069](https://github.com/cockroachdb/cockroach/pull/16069)
- Cockroach Issue: [#15206](https://github.com/cockroachdb/cockroach/issues/15206)

# Summary
Add support for DistSQL processors to use external storage in addition
to memory to store intermediate data when processing large DistSQL
queries.

# Motivation and Background
Currently, DistSQL processors have a hard limit on the amount of
memory they can use before they get OOM-killed. This limits the size
of certain queries, particularly when efficient secondary indices do
not exist, so some amount of computation must be done while holding a
large amount of intermediate state at a processor. For instance, a
JOIN on two very large tables, where there is no ordering on either
table's JOIN predicate columns, must either

* Use a HashJoiner, which stores one table in a single large
  hashtable and streams the other table, looking up matches in the
  hashtable and emitting rows when appropriate.

* Sort both tables by the JOIN predicate columns, and then stream them
  to a MergeJoiner.

Either of these two solutions runs out of memory at some query size,
and requires external storage to process larger queries.

# Scope
This problem was first encountered in running TPC-H queries at
moderately sized scale factors (5 and above). It was exacerbated by
the fact that DistSQL currently does not plan MergeJoins and resorts
to HashJoins in all cases. TPC-H itself should not be bottlenecked on
the lack of external storage, as relevant secondary indices exist for
all queries that we are aware of, such that optimal planning keeps
tuples flowing smoothly through the dataflow graph without requiring
any processor to accumulate a disproportionate number of
tuples. However, users might still want to run the occasional
analytics query without indexes, and we should support that use case.

At a minimum, external storage should provide enough swap space that
it stops all DistSQL processors from being OOM-killed on queries on
the TPC-H scale factor 300 dataset. Whatever storage format is chosen,
it should be compatible with all DistSQL processors.

# Related Work

None of this is new. Existing databases use both paths for query
execution, and typically use a cost-based query planner to decide
_which_ of the two paths to use. So for us, it's more a case of
prioritization and beginning with an implementation that is
extensible. External sorting is typically the first implemented in
OLTP databases, since it comes in handy for building new indices, but
we have a different pathway for executing those. On-disk storage for
query execution only comes into play for OLAP-style queries.

For example, in the case of running a JOIN:

1. They have implementations for the GRACE hash join, as well as the
   external sorter + merge join.

2. They have extensive microbenchmarks for the underlying costs of
   each operation (e.g. the total time cost of writing a tuple into a
   serialized stream, or the expected time cost of doing a random
   insert into a sorted on-disk BTree). They have extensive table
   statistics.

3. At query planning time, with the table statistics, the database can
   compute the cost of using either plan (i.e. estimate the number of
   passes the GRACE hash join would take and multiply it out with the
   average cost of writing each tuple, etc.) and choose the better
   one.

4. We don't have (2), and thus we don't have a cost-based query
   planner capable of doing (3). Right now, our implementations only
   do purely logical query planning (i.e. pursue only those wins that
   are asymptotic wins), so the RocksDB path is along those lines as
   well.

# Detailed requirements
Examining all DistSQL processors, we have the following data
structures used:

| Processor name  | On-disk data structure |
|-----------------|-----------------------:|
| NoopCore        | n/a                    |
| TableReader     | n/a                    |
| JoinReader      | n/a                    |
| Sorter          | Sorted map             |
| Aggregator      | Map                    |
| Distinct        | Map                    |
| MergeJoiner     | n/a                    |
| HashJoiner      | Map                    |
| Values          | n/a                    |
| AlgebraicSetOp  | Map                    |

Essentially, we have two data types to store: a sorted map and a hash
map. The processors that use a hash map (distinct, hashjoin, and
algebraic set op) do internally benefit from keys being sorted, but
when keys are sorted, they emit rows eagerly and don't need to build
expensive in-memory state! They thus only accumulate state when keys
are *not* sorted. Thus, sorting is in direct opposition to
accumulating state large enough for external storage.
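To make the two requirements concrete, here is a minimal sketch of the
access patterns the processors above need. The interface and type names
are illustrative only, not existing CockroachDB APIs:

```go
package external

// Row stands in for an encoded DistSQL row; the real representation
// differs.
type Row []byte

// SortedMap is what the Sorter needs: inserts in arbitrary order,
// then a single scan in key order once the input is exhausted.
type SortedMap interface {
	Put(key, value Row) error
	// NewIterator returns the stored pairs in ascending key order.
	NewIterator() Iterator
	Close() error
}

// HashMap is what the HashJoiner, Aggregator, Distinct, and
// AlgebraicSetOp processors need when their inputs are unsorted:
// point lookups and updates by an arbitrary key.
type HashMap interface {
	Put(key, value Row) error
	Get(key Row) (value Row, ok bool, err error)
	Close() error
}

// Iterator walks key/value pairs; when the backing store keeps keys
// sorted, the pairs come back in key order for free.
type Iterator interface {
	Next() (key, value Row, ok bool)
	Close() error
}
```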
# Potential solutions

## Option 1: On-disk sort-only
The first and simplest solution is to create an on-disk method
exclusively for the Sorter processor. All other processors use less
memory when the input data is appropriately sorted, so the planner
would leverage the on-disk-capable Sorter to sort data upstream.

The simplest on-disk sorter would shell out to an external Unix
`sort` process, or some other on-disk-capable sorter, feeding it rows
and reading back the sorted stream of tuples. More seriously, we could
use an external sort implementation that's more battle tested, or even
roll our own native Go implementation (e.g. see Postgresql's
[external sort](https://github.com/postgres/postgres/blob/master/src/backend/utils/sort/tuplesort.c)).

The biggest downside of only supporting external sorting is that we
might have to do multiple re-sorts for a complex query plan (for
instance, if a query plan has multiple sequential joins on multiple
different join conditions, that would require a re-sort after the
first join and before the second), since each processor's ordering
requirements must be treated as a stricter precondition than they
currently are (at least until we have sophisticated table statistics
and join planning that can take that into account). This is a
quick-and-dirty fix.

## Option 2: Use temporary flat files to page out partial streams.
For processors that require an unsorted map, we could use temporary
files to page out partial output streams, a la GRACE hash joins,
merging them back later. Walking through two examples of how flat
files would be used by processors:

### Example 1: HashJoin with on-disk streams
Consider the following algorithm for processing a HashJoin (a code
sketch follows the list):

1. Read n rows from the right input stream, where n is the upper limit
   on the number of rows we can hold in memory.

2. Read all rows from the left input stream, and where we see a match,
   serialize the row and emit it to a flat file.

3. Read the next n rows from the right input stream, re-read all rows
   from the start of the left input stream, and emit matches to a
   second flat file (note that DistSQL does not currently allow for
   rereading streams, so that capability would have to be added as
   well to make this work).

4. Repeat until we exhaust the right input stream.

5. Read all the flat files back, merging them (if we need to preserve
   some ordering), and emit tuples.
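A compact sketch of the pass structure above, under several
assumptions: `Row` is a toy key/payload pair, `RowStream` is a
hypothetical stream interface with the rewind capability noted in step
3, and matches are spilled with `encoding/gob`. This is not the DistSQL
HashJoiner, just an illustration of the loop:

```go
package blockjoin

import (
	"encoding/gob"
	"fmt"
	"os"
)

// Row is a toy row: a join key plus a payload. Real DistSQL rows are
// encoded datums; this only illustrates the pass structure.
type Row struct {
	Key     string
	Payload string
}

// RowStream is a hypothetical stream that, unlike today's DistSQL
// streams, supports being read from the start multiple times.
type RowStream interface {
	Next() (Row, bool)
	Rewind()
}

// blockHashJoin follows steps 1-4: load up to maxRows of the right
// input into a hash table, stream the whole left input against it,
// spill matches to a flat file, and repeat until the right input is
// exhausted. The spill files are handed back for the merge/emit phase
// of step 5 (not shown).
func blockHashJoin(left, right RowStream, maxRows int) ([]string, error) {
	var spillFiles []string
	for pass := 0; ; pass++ {
		// Steps 1 and 3: read the next block of the right input.
		ht := make(map[string][]Row)
		n := 0
		for ; n < maxRows; n++ {
			r, ok := right.Next()
			if !ok {
				break
			}
			ht[r.Key] = append(ht[r.Key], r)
		}
		if n == 0 {
			break // Step 4: right input exhausted.
		}

		// Step 2: stream the left input from the start, spill matches.
		f, err := os.CreateTemp("", fmt.Sprintf("join-pass-%d-*", pass))
		if err != nil {
			return nil, err
		}
		enc := gob.NewEncoder(f)
		left.Rewind()
		for {
			l, ok := left.Next()
			if !ok {
				break
			}
			for _, r := range ht[l.Key] {
				if err := enc.Encode([2]Row{l, r}); err != nil {
					return nil, err
				}
			}
		}
		if err := f.Close(); err != nil {
			return nil, err
		}
		spillFiles = append(spillFiles, f.Name())
	}
	return spillFiles, nil
}
```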
### Example 2: MergeSort with on-disk streams
1. Read n rows from the input stream, inserting them into an in-memory
   heap. Flush the heap to disk as a sorted run.

2. Repeat until the input stream is exhausted.

3. Merge all the files.
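A minimal sketch of those three steps, with rows modeled as plain
strings, a sorted in-memory buffer standing in for the heap, and
newline-delimited temp files as the on-disk runs. A real implementation
would use DistSQL's row encoding and a heap-based k-way merge:

```go
package extsort

import (
	"bufio"
	"os"
	"sort"
)

// externalSort accumulates up to maxRows rows in memory, flushes them
// as a sorted run, repeats until the input is exhausted, and then
// merges all runs, calling emit for each row in sorted order.
func externalSort(input <-chan string, maxRows int, emit func(string)) error {
	var runs []string

	// flush writes one sorted run to a temporary file.
	flush := func(buf []string) error {
		sort.Strings(buf)
		f, err := os.CreateTemp("", "sort-run-*")
		if err != nil {
			return err
		}
		w := bufio.NewWriter(f)
		for _, row := range buf {
			if _, err := w.WriteString(row + "\n"); err != nil {
				return err
			}
		}
		if err := w.Flush(); err != nil {
			return err
		}
		if err := f.Close(); err != nil {
			return err
		}
		runs = append(runs, f.Name())
		return nil
	}

	// Steps 1-2: build sorted runs.
	buf := make([]string, 0, maxRows)
	for row := range input {
		buf = append(buf, row)
		if len(buf) == maxRows {
			if err := flush(buf); err != nil {
				return err
			}
			buf = buf[:0]
		}
	}
	if len(buf) > 0 {
		if err := flush(buf); err != nil {
			return err
		}
	}

	// Step 3: merge the runs. A repeated minimum scan over the run
	// heads keeps the sketch short; a heap would be used in practice.
	scanners := make([]*bufio.Scanner, len(runs))
	heads := make([]string, len(runs))
	live := make([]bool, len(runs))
	for i, name := range runs {
		f, err := os.Open(name)
		if err != nil {
			return err
		}
		defer f.Close()
		scanners[i] = bufio.NewScanner(f)
		live[i] = scanners[i].Scan()
		if live[i] {
			heads[i] = scanners[i].Text()
		}
	}
	for {
		min := -1
		for i := range runs {
			if live[i] && (min == -1 || heads[i] < heads[min]) {
				min = i
			}
		}
		if min == -1 {
			return nil // all runs drained
		}
		emit(heads[min])
		live[min] = scanners[min].Scan()
		if live[min] {
			heads[min] = scanners[min].Text()
		}
	}
}
```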
### Downsides:
The first downside is the runtime: GRACE hash joins are not
particularly pretty in worst-case algorithmic analysis, and this is
evident in that the hash join reads complete streams multiple times.

The second downside is that we need to build a new processor that can
provide this buffering service. However, this is mitigated by the fact
that a single BufferedStream processor could be reused everywhere,
since all it does is write a stream to disk and reread it as many
times as necessary.

Finally, this requires a bunch of processor rewriting work: each
processor would have to be rewritten to deal with multiple passes, and
this also makes the physical planner's job more critical: runtimes can
get pathological if we have to resort to paging out streams, so
getting ordering requirements right becomes even more important.

While writing serialized streams to disk has the advantage of having
no write amplification, it has very high read amplification, since
streams are reread multiple times. Unfortunately, GRACE hash joins can
become very expensive for very large datasets, where the number of
passes over the data exceeds `log(n)`. Thus, this is not the most
scalable implementation. However, in combination with a specialized
sorter node (which does recover the `n log(n)` complexity) and some
syntax for explicitly hinting when to sort data, we can recover
reasonable performance at scale.

## Option 3: Use RocksDB for storing sorted maps
For processors that require (or could benefit from) a sorted map,
RocksDB offers the obvious straightforward on-disk representation. A
way to elegantly augment existing processors is for them to use a new
map library, which stores a map in memory but flushes it to a
temporary RocksDB keyspace when it reaches some threshold. Instead of
then using a second in-memory map and having to merge streams on the
return path, the processors continue using the on-disk map, inserting
directly into RocksDB, and performing a single scan at the end.

This version also gives us random-access reads and writes, which makes
using external storage for aggregations with a large number of buckets
simple: we simply use the KV interface to store aggregations by
bucket. As a bonus, when we finally read off the aggregations, they
will be in sorted order.

Do note that, configured correctly, RocksDB recovers the algorithm in
Option 2, Example 2: we insert sequential sorted `WriteBatches` and
turn compaction off. This leaves `cgo` overhead as the only remaining
downside of this path.

# Proposal

We propose to use RocksDB for external storage, using a dedicated
additional RocksDB instance solely for DistSQL processors. This is the
most general implementation (in terms of scaling to very complex
queries), is completely opaque to the user, requires no special syntax
at large scales, and requires less implementation work in modifying
processors one by one (their internal maps would be changed to use an
opaque KV library that switches from in-memory maps to RocksDB at a
configurable threshold). The internal knowledge base for tuning and
working with RocksDB is available at CockroachDB, so this reduces
future implementation risk as well.
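To illustrate what that opaque KV library could look like, here is a
rough sketch. The `DiskStore` interface stands in for the dedicated
RocksDB instance, and every name below is an assumption made for the
sketch rather than existing CockroachDB code:

```go
package spillmap

// DiskStore stands in for the dedicated RocksDB instance; these
// methods are assumptions for the sketch, not an existing API.
type DiskStore interface {
	Put(key, value []byte) error
	Get(key []byte) (value []byte, ok bool, err error)
	// ClearRange deletes an entire temporary keyspace, e.g. via a
	// range deletion tombstone.
	ClearRange(prefix []byte) error
}

// SpillableMap presents a single map interface to a processor,
// regardless of whether the data currently lives in memory or in the
// processor's carved-out on-disk keyspace.
type SpillableMap struct {
	threshold int    // number of in-memory entries before spilling
	prefix    []byte // this processor's temporary keyspace
	mem       map[string][]byte
	disk      DiskStore
	spilled   bool
}

func NewSpillableMap(threshold int, prefix []byte, disk DiskStore) *SpillableMap {
	return &SpillableMap{
		threshold: threshold,
		prefix:    prefix,
		mem:       make(map[string][]byte),
		disk:      disk,
	}
}

func (m *SpillableMap) Put(key, value []byte) error {
	if m.spilled {
		return m.disk.Put(m.diskKey(key), value)
	}
	m.mem[string(key)] = value
	if len(m.mem) < m.threshold {
		return nil
	}
	// Threshold crossed: move everything into the on-disk keyspace and
	// keep writing there, so there is never a second in-memory map to
	// merge on the read path.
	for k, v := range m.mem {
		if err := m.disk.Put(m.diskKey([]byte(k)), v); err != nil {
			return err
		}
	}
	m.mem, m.spilled = nil, true
	return nil
}

func (m *SpillableMap) Get(key []byte) ([]byte, bool, error) {
	if !m.spilled {
		v, ok := m.mem[string(key)]
		return v, ok, nil
	}
	return m.disk.Get(m.diskKey(key))
}

// Close drops the temporary keyspace once the processor is done.
func (m *SpillableMap) Close() error {
	if m.spilled {
		return m.disk.ClearRange(m.prefix)
	}
	return nil
}

// diskKey prepends the processor's keyspace prefix, copying so that
// callers' buffers are never aliased.
func (m *SpillableMap) diskKey(key []byte) []byte {
	out := make([]byte, 0, len(m.prefix)+len(key))
	out = append(out, m.prefix...)
	return append(out, key...)
}
```

The same switch-at-threshold shape applies to the sorted-map case, with
the added benefit that a post-spill scan comes back in key order.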
## Benchmarking overhead

As a sanity check to ensure that this isn't wildly off-base, we will
first benchmark using RocksDB with minimal compaction versus a
standard external sort library, and make sure the overhead isn't
egregious.

We need to isolate the following costs:

1. External sort vs. RocksDB: We will sort a billion rows using a
   standard external sort, and separately insert them (in random
   order) into RocksDB and scan them back.
2. RocksDB vs. pure Go: We will also try, as a stretch goal, to
   measure the RocksDB overhead against a pure-Go implementation, such
   as [Badger](https://github.com/dgraph-io/badger).

## RocksDB Implementation Concerns
The primary concern in utilizing disk is contention with the existing
transactional RocksDB storage engine for disk I/O. We want to allow
for the use of an explicit flag to denote the temporary storage
location to be used, if the node is deployed with multiple disks.

The second concern is that the existing RocksDB instance is tuned to
be as efficient as possible for reads, and thus trades off much higher
write amplification (via aggressive compactions) in exchange for much
lower read amplification. This is not the tradeoff that we want.

Finally, RocksDB is very complex, and for this use case we don't need
the majority of its database features, so it's potentially
overkill. We have to be careful that the overhead is not too high.

These concerns lead us to a few rules for implementing the RocksDB
option:

* Use a separate RocksDB instance explicitly for DistSQL processors.

* By default this instance uses a `tmp_storage` location under the
  first store's directory, overridable by a command line flag. It is
  deleted and recreated from scratch whenever the node starts.

* Disable write-ahead logging (`write_option.disableWAL`) to lower
  write amplification.

* Processors carve out temporary keyspaces in this RocksDB instance.

* Use range deletion tombstones to rapidly delete a processor's
  entire keyspace when it is done.

* Experiment with compaction settings (including the extreme of
  turning it off completely), since, fundamentally, we need to write
  data once and read it back once.

* This is an ephemeral RocksDB instance, and some cloud providers
  provide non-persistent disks. We want to eventually support
  configuring nodes to use non-persistent disks for this use case.

* `storage/engine` is currently tailored almost exclusively for use by
  the `storage` package. This implementation should reuse as much of
  the existing `engine` code as possible, modifying (and generalizing)
  the package where appropriate. Eventually, other users who also need
  a non-persistent node-local key-value store can use it as well.

## Potential optimizations: planning subsequent processors with additional orderings

Using RocksDB to store on-disk data will result in on-disk data being
sorted by key. Thus, for example, if a HashJoin evicts to disk, we can
recover an extra ordering that was not expected at planning
time. Currently, there is no way for a DistSQL stream to indicate that
tuples are being delivered with additional orderings not specified in
the physical plan, but if we could send this information in the
producer metadata, downstream processors could potentially take
advantage of the additional sorting.
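Purely as an illustration of that idea, here is one shape the hint
could take; the type names below are stand-ins and do not correspond
to existing DistSQL definitions:

```go
package orderinghint

// ColumnOrderInfo and ColumnOrdering are illustrative stand-ins for a
// DistSQL ordering specification.
type ColumnOrderInfo struct {
	ColIdx    int
	Ascending bool
}

type ColumnOrdering []ColumnOrderInfo

// ProducerMetadata sketches a metadata record that a producer could
// interleave with rows to announce an ordering it happens to deliver
// (for example, key order recovered from a RocksDB spill) beyond what
// the physical plan promised.
type ProducerMetadata struct {
	AdditionalOrdering ColumnOrdering
}

// satisfiesOrdering reports whether the announced ordering is at least
// as strong as the ordering a downstream processor would like; a
// consumer that cannot use the hint simply ignores the metadata.
func satisfiesOrdering(announced, required ColumnOrdering) bool {
	if len(announced) < len(required) {
		return false
	}
	for i, col := range required {
		if announced[i] != col {
			return false
		}
	}
	return true
}
```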