- Feature Name: External Storage for DistSQL processors
- Status: completed
- Start Date: 2017-05-11
- Authors: Arjun Narayan and Alfonso Subiotto Marques
- RFC PR: [#16069](https://github.com/cockroachdb/cockroach/pull/16069)
- Cockroach Issue: [#15206](https://github.com/cockroachdb/cockroach/issues/15206)

# Summary
Add support for DistSQL processors to use external storage in addition
to memory to store intermediate data when processing large DistSQL
queries.

# Motivation and Background
Currently, DistSQL processors have a hard limit on the amount of
memory they can use before they get OOM-killed. This limits the size
of certain queries, particularly when efficient secondary indices do
not exist, so some amount of computation must be done while holding a
large amount of intermediate state at a processor. For instance, a
JOIN on two very large tables, where there is no ordering on either
table's JOIN predicate columns, must either

* Use a HashJoiner, which stores one table in a single large
  hashtable and streams the other table, looking up matches in the
  hashtable and emitting rows when appropriate.

* Sort both tables by the JOIN predicate columns, and then stream them
  to a MergeJoiner.

Either of these two solutions runs out of memory at some query size,
and requires external storage to process larger queries.

# Scope
This problem was first encountered in running TPC-H queries at
moderately sized scale factors (5 and above). It was exacerbated by
the fact that DistSQL currently does not plan MergeJoins and resorts
to HashJoins in all cases. TPC-H itself should not be bottlenecked on
the lack of external storage, as relevant secondary indices exist for
all queries that we are aware of, such that optimal planning keeps
tuples flowing smoothly through the dataflow graph without requiring
any processor to accumulate a disproportionate number of
tuples. However, users might still want to run the occasional
analytics query without indexes, and we should support that use case.

At a minimum, external storage should provide enough swap space that
it stops all DistSQL processors from being OOM-killed on queries on
the TPC-H scale factor 300 dataset. Whatever storage format is chosen,
it should be compatible with all DistSQL processors.

# Related Work

None of this is new. Existing databases use both paths for query
execution, and typically use a cost-based query planner to decide
_which_ of the two paths to use. So for us, it's more a case of
prioritization and beginning with an implementation that is
extensible. External sorting is typically the first implemented in
OLTP databases, since it comes in handy for building new indices, but
we have a different pathway for executing those. On-disk storage for
query execution only comes into play for OLAP-style queries.

For example, in the case of running a JOIN:

1. They have implementations for the GRACE hash join, as well as the
   external sorter + merge join.

2. They have extensive microbenchmarks for the underlying costs of
   each operation (e.g. the total time cost of writing a tuple into a
   serialized stream, or the expected time cost of doing a random
   insert into a sorted on-disk BTree). They have extensive table
   statistics.

3. At query planning time, with the table statistics, the database can
   compute the cost of using either plan (i.e. estimate the number of
   passes the GRACE hash join would take and multiply it out with the
   average cost of writing each tuple, etc.) and choose the better
   one.

4. We don't have (2), and thus we don't have a cost-based query
   planner capable of doing (3). Right now, our implementations only
   do purely logical query planning (i.e. pursue only those wins that
   are asymptotic wins), so the RocksDB path is along those lines as
   well.

# Detailed requirements
Examining all DistSQL processors, we have the following data
structures used:

| Processor name  | On-disk data structure |
|-----------------|-----------------------:|
| NoopCore        | n/a                    |
| TableReader     | n/a                    |
| JoinReader      | n/a                    |
| Sorter          | Sorted map             |
| Aggregator      | Map                    |
| Distinct        | Map                    |
| MergeJoiner     | n/a                    |
| HashJoiner      | Map                    |
| Values          | n/a                    |
| AlgebraicSetOp  | Map                    |

Essentially, we have two data types to store: a sorted map and a hash
map. The processors that use a hash map (distinct, hashjoin, and
algebraic set op) do internally benefit from keys being sorted, but
when keys are sorted, they emit rows eagerly and don't need to build
expensive in-memory state! They thus only accumulate state when keys
are *not* sorted. Thus, sorting is in direct opposition to
accumulating state large enough for external storage.
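To make the two requirements concrete, here is a minimal sketch of the
access patterns the processors above need. The interface and type names
are illustrative only, not existing CockroachDB APIs:

```go
package external

// Row stands in for an encoded DistSQL row; the real representation
// differs.
type Row []byte

// SortedMap is what the Sorter needs: inserts in arbitrary order,
// then a single scan in key order once the input is exhausted.
type SortedMap interface {
	Put(key, value Row) error
	// NewIterator returns the stored pairs in ascending key order.
	NewIterator() Iterator
	Close() error
}

// HashMap is what the HashJoiner, Aggregator, Distinct, and
// AlgebraicSetOp processors need when their inputs are unsorted:
// point lookups and updates by an arbitrary key.
type HashMap interface {
	Put(key, value Row) error
	Get(key Row) (value Row, ok bool, err error)
	Close() error
}

// Iterator walks key/value pairs; when the backing store keeps keys
// sorted, the pairs come back in key order for free.
type Iterator interface {
	Next() (key, value Row, ok bool)
	Close() error
}
```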
# Potential solutions

## Option 1: On-disk sort-only
The first and simplest solution is to create an on-disk method
exclusively for the Sorter processor. All other processors use less
memory when the input data is appropriately sorted, so the planner
would leverage the on-disk-capable Sorter to sort data upstream.

The simplest on-disk sorter would shell out to an external Unix
`sort` process, or some other on-disk-capable sorter, feeding it rows
and reading back the sorted stream of tuples. More seriously, we could
use an external sort implementation that's more battle tested, or even
roll our own native Go implementation (e.g. see Postgresql's
[external sort](https://github.com/postgres/postgres/blob/master/src/backend/utils/sort/tuplesort.c)).

The biggest downside of only supporting external sorting is that we
might have to do multiple re-sorts for a complex query plan (for
instance, if a query plan has multiple sequential joins on multiple
different join conditions, that would require a re-sort after the
first join and before the second), since each processor's ordering
requirements must be treated as a stricter precondition than they
currently are (at least until we have sophisticated table statistics
and join planning that can take that into account). This is a
quick-and-dirty fix.

## Option 2: Use temporary flat files to page out partial streams.
For processors that require an unsorted map, we could use temporary
files to page out partial output streams, a la GRACE hash joins,
merging them back later. Walking through two examples of how flat
files would be used by processors:

### Example 1: HashJoin with on-disk streams
Consider the following algorithm for processing a HashJoin (a code
sketch follows the list):

1. Read n rows from the right input stream, where n is the upper limit
   on the number of rows we can hold in memory.

2. Read all rows from the left input stream, and where we see a match,
   serialize the row and emit it to a flat file.

3. Read the next n rows from the right input stream, re-read all rows
   from the start of the left input stream, and emit matches to a
   second flat file (note that DistSQL does not currently allow for
   rereading streams, so that capability would have to be added as
   well to make this work).

4. Repeat until we exhaust the right input stream.

5. Read all the flat files back, merging them (if we need to preserve
   some ordering), and emit tuples.
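A compact sketch of the pass structure above, under several
assumptions: `Row` is a toy key/payload pair, `RowStream` is a
hypothetical stream interface with the rewind capability noted in step
3, and matches are spilled with `encoding/gob`. This is not the DistSQL
HashJoiner, just an illustration of the loop:

```go
package blockjoin

import (
	"encoding/gob"
	"fmt"
	"os"
)

// Row is a toy row: a join key plus a payload. Real DistSQL rows are
// encoded datums; this only illustrates the pass structure.
type Row struct {
	Key     string
	Payload string
}

// RowStream is a hypothetical stream that, unlike today's DistSQL
// streams, supports being read from the start multiple times.
type RowStream interface {
	Next() (Row, bool)
	Rewind()
}

// blockHashJoin follows steps 1-4: load up to maxRows of the right
// input into a hash table, stream the whole left input against it,
// spill matches to a flat file, and repeat until the right input is
// exhausted. The spill files are handed back for the merge/emit phase
// of step 5 (not shown).
func blockHashJoin(left, right RowStream, maxRows int) ([]string, error) {
	var spillFiles []string
	for pass := 0; ; pass++ {
		// Steps 1 and 3: read the next block of the right input.
		ht := make(map[string][]Row)
		n := 0
		for ; n < maxRows; n++ {
			r, ok := right.Next()
			if !ok {
				break
			}
			ht[r.Key] = append(ht[r.Key], r)
		}
		if n == 0 {
			break // Step 4: right input exhausted.
		}

		// Step 2: stream the left input from the start, spill matches.
		f, err := os.CreateTemp("", fmt.Sprintf("join-pass-%d-*", pass))
		if err != nil {
			return nil, err
		}
		enc := gob.NewEncoder(f)
		left.Rewind()
		for {
			l, ok := left.Next()
			if !ok {
				break
			}
			for _, r := range ht[l.Key] {
				if err := enc.Encode([2]Row{l, r}); err != nil {
					return nil, err
				}
			}
		}
		if err := f.Close(); err != nil {
			return nil, err
		}
		spillFiles = append(spillFiles, f.Name())
	}
	return spillFiles, nil
}
```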
### Example 2: MergeSort with on-disk streams
1. Read n rows from the input stream, inserting them into an in-memory
   heap. Flush the heap to disk as a sorted run.

2. Repeat until the input stream is exhausted.

3. Merge all the files.
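A minimal sketch of those three steps, with rows modeled as plain
strings, a sorted in-memory buffer standing in for the heap, and
newline-delimited temp files as the on-disk runs. A real implementation
would use DistSQL's row encoding and a heap-based k-way merge:

```go
package extsort

import (
	"bufio"
	"os"
	"sort"
)

// externalSort accumulates up to maxRows rows in memory, flushes them
// as a sorted run, repeats until the input is exhausted, and then
// merges all runs, calling emit for each row in sorted order.
func externalSort(input <-chan string, maxRows int, emit func(string)) error {
	var runs []string

	// flush writes one sorted run to a temporary file.
	flush := func(buf []string) error {
		sort.Strings(buf)
		f, err := os.CreateTemp("", "sort-run-*")
		if err != nil {
			return err
		}
		w := bufio.NewWriter(f)
		for _, row := range buf {
			if _, err := w.WriteString(row + "\n"); err != nil {
				return err
			}
		}
		if err := w.Flush(); err != nil {
			return err
		}
		if err := f.Close(); err != nil {
			return err
		}
		runs = append(runs, f.Name())
		return nil
	}

	// Steps 1-2: build sorted runs.
	buf := make([]string, 0, maxRows)
	for row := range input {
		buf = append(buf, row)
		if len(buf) == maxRows {
			if err := flush(buf); err != nil {
				return err
			}
			buf = buf[:0]
		}
	}
	if len(buf) > 0 {
		if err := flush(buf); err != nil {
			return err
		}
	}

	// Step 3: merge the runs. A repeated minimum scan over the run
	// heads keeps the sketch short; a heap would be used in practice.
	scanners := make([]*bufio.Scanner, len(runs))
	heads := make([]string, len(runs))
	live := make([]bool, len(runs))
	for i, name := range runs {
		f, err := os.Open(name)
		if err != nil {
			return err
		}
		defer f.Close()
		scanners[i] = bufio.NewScanner(f)
		live[i] = scanners[i].Scan()
		if live[i] {
			heads[i] = scanners[i].Text()
		}
	}
	for {
		min := -1
		for i := range runs {
			if live[i] && (min == -1 || heads[i] < heads[min]) {
				min = i
			}
		}
		if min == -1 {
			return nil // all runs drained
		}
		emit(heads[min])
		live[min] = scanners[min].Scan()
		if live[min] {
			heads[min] = scanners[min].Text()
		}
	}
}
```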
### Downsides:
The first downside is the runtime: GRACE hash joins are not
particularly pretty in worst-case algorithmic analysis, and this is
evident in that the hash join reads complete streams multiple times.

The second downside is that we need to build a new processor that can
provide this buffering service. However, this is mitigated by the fact
that a single BufferedStream processor could be reused everywhere,
since all it does is write a stream to disk and reread it as many
times as necessary.

Finally, this requires a bunch of processor rewriting work: each
processor would have to be rewritten to deal with multiple passes, and
this also makes the physical planner's job more critical: runtimes can
get pathological if we have to resort to paging out streams, so
getting ordering requirements right becomes even more important.

While writing serialized streams to disk has the advantage of having
no write amplification, it has very high read amplification, since
streams are reread multiple times. Unfortunately, GRACE hash joins can
become very expensive for very large datasets, where the number of
passes over the data exceeds `log(n)`. Thus, this is not the most
scalable implementation. However, in combination with a specialized
sorter node (which does recover the `n log(n)` complexity) and some
syntax for explicitly hinting when to sort data, we can recover
reasonable performance at scale.

## Option 3: Use RocksDB for storing sorted maps
For processors that require (or could benefit from) a sorted map,
RocksDB offers the obvious straightforward on-disk representation. A
way to elegantly augment existing processors is for them to use a new
map library, which stores a map in memory but flushes it to a
temporary RocksDB keyspace when it reaches some threshold. Instead of
then using a second in-memory map and having to merge streams on the
return path, the processors continue using the on-disk map, inserting
directly into RocksDB, and performing a single scan at the end.

This version also gives us random-access reads and writes, which makes
using external storage for aggregations with a large number of buckets
simple: we simply use the KV interface to store aggregations by
bucket. As a bonus, when we finally read off the aggregations, they
will be in sorted order.

Do note that, configured correctly, RocksDB recovers the algorithm in
Option 2, Example 2: we insert sequential sorted `WriteBatches` and
turn compaction off. This leaves `cgo` overhead as the only remaining
downside of this path.

# Proposal

We propose to use RocksDB for external storage, using a dedicated
additional RocksDB instance solely for DistSQL processors. This is the
most general implementation (in terms of scaling to very complex
queries), is completely opaque to the user, requires no special syntax
at large scales, and requires less implementation work in modifying
processors one by one (their internal maps would be changed to use an
opaque KV library that switches from in-memory maps to RocksDB at a
configurable threshold). The internal knowledge base for tuning and
working with RocksDB is available at CockroachDB, so this reduces
future implementation risk as well.
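To illustrate what that opaque KV library could look like, here is a
rough sketch. The `DiskStore` interface stands in for the dedicated
RocksDB instance, and every name below is an assumption made for the
sketch rather than existing CockroachDB code:

```go
package spillmap

// DiskStore stands in for the dedicated RocksDB instance; these
// methods are assumptions for the sketch, not an existing API.
type DiskStore interface {
	Put(key, value []byte) error
	Get(key []byte) (value []byte, ok bool, err error)
	// ClearRange deletes an entire temporary keyspace, e.g. via a
	// range deletion tombstone.
	ClearRange(prefix []byte) error
}

// SpillableMap presents a single map interface to a processor,
// regardless of whether the data currently lives in memory or in the
// processor's carved-out on-disk keyspace.
type SpillableMap struct {
	threshold int    // number of in-memory entries before spilling
	prefix    []byte // this processor's temporary keyspace
	mem       map[string][]byte
	disk      DiskStore
	spilled   bool
}

func NewSpillableMap(threshold int, prefix []byte, disk DiskStore) *SpillableMap {
	return &SpillableMap{
		threshold: threshold,
		prefix:    prefix,
		mem:       make(map[string][]byte),
		disk:      disk,
	}
}

func (m *SpillableMap) Put(key, value []byte) error {
	if m.spilled {
		return m.disk.Put(m.diskKey(key), value)
	}
	m.mem[string(key)] = value
	if len(m.mem) < m.threshold {
		return nil
	}
	// Threshold crossed: move everything into the on-disk keyspace and
	// keep writing there, so there is never a second in-memory map to
	// merge on the read path.
	for k, v := range m.mem {
		if err := m.disk.Put(m.diskKey([]byte(k)), v); err != nil {
			return err
		}
	}
	m.mem, m.spilled = nil, true
	return nil
}

func (m *SpillableMap) Get(key []byte) ([]byte, bool, error) {
	if !m.spilled {
		v, ok := m.mem[string(key)]
		return v, ok, nil
	}
	return m.disk.Get(m.diskKey(key))
}

// Close drops the temporary keyspace once the processor is done.
func (m *SpillableMap) Close() error {
	if m.spilled {
		return m.disk.ClearRange(m.prefix)
	}
	return nil
}

// diskKey prepends the processor's keyspace prefix, copying so that
// callers' buffers are never aliased.
func (m *SpillableMap) diskKey(key []byte) []byte {
	out := make([]byte, 0, len(m.prefix)+len(key))
	out = append(out, m.prefix...)
	return append(out, key...)
}
```

The same switch-at-threshold shape applies to the sorted-map case, with
the added benefit that a post-spill scan comes back in key order.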
## Benchmarking overhead

As a sanity check to ensure that this isn't wildly off-base, we will
first benchmark using RocksDB with minimal compaction versus a
standard external sort library, and make sure the overhead isn't
egregious.

We need to isolate the following costs:

1. External sort vs. RocksDB: We will sort a billion rows using a
   standard external sort, and separately insert them (in random
   order) into RocksDB and scan them back.
2. RocksDB vs. pure Go: We will also try, as a stretch goal, to
   measure the RocksDB overhead against a pure-Go implementation, such
   as [Badger](https://github.com/dgraph-io/badger).

## RocksDB Implementation Concerns
The primary concern in utilizing disk is contention with the existing
transactional RocksDB storage engine for disk I/O. We want to allow
for the use of an explicit flag to denote the temporary storage
location to be used, if the node is deployed with multiple disks.

The second concern is that the existing RocksDB instance is tuned to
be as efficient as possible for reads, and thus trades off much higher
write amplification (via aggressive compactions) in exchange for much
lower read amplification. This is not the tradeoff that we want.

Finally, RocksDB is very complex, and for this use case we don't need
the majority of its database features, so it's potentially
overkill. We have to be careful that the overhead is not too high.

These concerns lead us to a few rules for implementing the RocksDB
option:

* Use a separate RocksDB instance explicitly for DistSQL processors.

* By default this instance uses a `tmp_storage` location under the
  first store's directory, overridable by a command line flag. It is
  deleted and recreated from scratch whenever the node starts.

* Disable write-ahead logging (`write_option.disableWAL`) to lower
  write amplification.

* Processors carve out temporary keyspaces in this RocksDB instance.

* Use range deletion tombstones to rapidly delete a processor's
  entire keyspace when it is done.

* Experiment with compaction settings (including the extreme of
  turning it off completely), since, fundamentally, we need to write
  data once and read it back once.

* This is an ephemeral RocksDB instance, and some cloud providers
  provide non-persistent disks. We want to eventually support
  configuring nodes to use non-persistent disks for this use case.

* `storage/engine` is currently tailored almost exclusively for use by
  the `storage` package. This implementation should reuse as much of
  the existing `engine` code as possible, modifying (and generalizing)
  the package where appropriate. Eventually, other users who also need
  a non-persistent node-local key-value store can use it as well.

## Potential optimizations: planning subsequent processors with additional orderings

Using RocksDB to store on-disk data will result in on-disk data being
sorted by key. Thus, for example, if a HashJoin evicts to disk, we can
recover an extra ordering that was not expected at planning
time. Currently, there is no way for a DistSQL stream to indicate that
tuples are being delivered with additional orderings not specified in
the physical plan, but if we could send this information in the
producer metadata, downstream processors could potentially take
advantage of the additional sorting.
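Purely as an illustration of that idea, here is one shape the hint
could take; the type names below are stand-ins and do not correspond
to existing DistSQL definitions:

```go
package orderinghint

// ColumnOrderInfo and ColumnOrdering are illustrative stand-ins for a
// DistSQL ordering specification.
type ColumnOrderInfo struct {
	ColIdx    int
	Ascending bool
}

type ColumnOrdering []ColumnOrderInfo

// ProducerMetadata sketches a metadata record that a producer could
// interleave with rows to announce an ordering it happens to deliver
// (for example, key order recovered from a RocksDB spill) beyond what
// the physical plan promised.
type ProducerMetadata struct {
	AdditionalOrdering ColumnOrdering
}

// satisfiesOrdering reports whether the announced ordering is at least
// as strong as the ordering a downstream processor would like; a
// consumer that cannot use the hint simply ignores the metadata.
func satisfiesOrdering(announced, required ColumnOrdering) bool {
	if len(announced) < len(required) {
		return false
	}
	for i, col := range required {
		if announced[i] != col {
			return false
		}
	}
	return true
}
```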