- Feature Name: Vectorized External Storage
- Status: draft
- Start Date: 2019-11-13
- Authors: Alfonso Subiotto Marqués and Yahor Yuzefovich
- RFC PR: [#42546](https://github.com/cockroachdb/cockroach/pull/42546)
- Cockroach Issue: [#37303](https://github.com/cockroachdb/cockroach/issues/37303)

# Summary

Add the ability for vectorized operators to store intermediate results
on disk when these are too large to fit in main memory, by adding an
on-disk queue abstraction. Without this ability, queries that use these
vectorized operators will encounter an out of memory error and will not be
able to complete. Examples of such operators are sorts and joins, among
others. Adding support for vectorized external storage will allow the
full vectorized execution engine to be turned on by default, unlocking
performance gains for analytical queries that use these buffering operators.

Note that CockroachDB already supports external
storage in its row-oriented execution engine (covered
[here](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20170522_external_storage.md));
this RFC covers external storage for the vectorized execution engine.

# Motivation

The vectorized execution engine offers significant performance gains for
analytical queries
(https://www.cockroachlabs.com/blog/how-we-built-a-vectorized-execution-engine/).
However, at the time of writing (19.2.0), only queries that use a constant
amount of memory are run through the vectorized execution engine by
default. This excludes a large family of queries that use operators such as
sorts or joins, because running those operators could lead to an out of
memory error. Adding external storage would allow these vectorized operators
to be turned on by default, which will result in performance improvements
for analytical workloads.

# Guide-level explanation

The full list of components that need external storage is:

| Name | Reason | Solution |
|---|---|---|
| Hash router | Buffering data for blocked output | On-disk queue |
| Merge joiner | Buffering equality groups | On-disk queue |
| Hash joiner | Buffering the build table | GRACE hash join |
| Hash aggregator | Too many groups to fit in memory | GRACE hash join |
| Sorter | Buffering data to sort | External merge sort |
| Unordered distinct | Seen table too big to fit in memory | GRACE hash join |
| Window functions | Buffering window function results | Mix of external merge sort and GRACE hash join |

All of these components will use one or more of the following:
- On-disk queue data structure
- [External merge sort
algorithm](https://en.wikipedia.org/wiki/External_sorting#External_merge_sort)
- [GRACE hash join
algorithm](https://en.wikipedia.org/wiki/Hash_join#Grace_hash_join)

Similarly to the row execution engine [RFC on external
storage](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20170522_external_storage.md#detailed-requirements),
the algorithms to implement are limited to an external sort and an external
hash join. However, this proposal differs from the original external storage
RFC in that instead of using a sorted KV storage engine as a map and
delegating on-disk sorting to it, the vectorized engine will use disk only
to store data and will use external-storage-aware operators to perform
external merge sorts and GRACE hash joins. This ends up being a lot more
performant (refer to the [Alternatives](#rationale-and-alternatives)
section).

The proposal in this RFC is to introduce an on-disk queue data structure
backed by flat files where batches of columnar data are serialized
using the [Arrow IPC format](https://arrow.apache.org/docs/ipc.html).
This queue can be used directly by most operators that need to buffer
unlimited data in FIFO order. There will also be an additional abstraction
that will give the caller the option to use separate queues as distinct
partitions. This last abstraction will be used to flush and re-read
partitions in the case of a GRACE hash join or external merge sort (a sketch
of one possible shape for this abstraction follows).
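To make the partition abstraction concrete, here is a minimal sketch of what
such an interface could look like. This is illustrative only; the names
(`PartitionedQueue`, `EnqueueTo`, `DequeueFrom`) are hypothetical and not
taken from the prototype:

```
// PartitionedQueue is a hypothetical abstraction over several on-disk
// queues, each addressed by a partition index. A GRACE hash join would
// route each input batch to the partition chosen by hashing its equality
// columns, then re-read one partition at a time during the probe phase.
type PartitionedQueue interface {
	// EnqueueTo appends a batch to the queue backing the given partition,
	// creating the partition if it does not exist yet.
	EnqueueTo(partitionIdx int, batch coldata.Batch) error
	// DequeueFrom removes and returns the next batch from the given
	// partition in FIFO order.
	DequeueFrom(partitionIdx int) (coldata.Batch, error)
	// Close closes all partitions and removes their backing files.
	Close() error
}
```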
The impact on the end user will be improved performance for large
analytical queries, as queries that would previously not run through
the vectorized execution engine will do so after the implementation of
external storage. There will be no other user-facing changes; the change
proposed in this RFC is purely an improvement to an existing feature.

# Reference-level explanation

Given that external merge sorts and GRACE hash joins are well-established
algorithms (external sorts more so; refer to [this
issue](https://github.com/cockroachdb/cockroach/issues/24582) for
an explanation of a GRACE hash join), this RFC will focus on the
backend infrastructure that will allow these algorithms to spill
to disk, by describing the design of a single queue.

## Detailed design

https://github.com/asubiotto/cockroach/tree/prototype is a
prototype of the proposed on-disk queue backed by flat files. A
Pebble-backed alternative was also considered in that prototype.
The choice of storage method is examined in more detail in the
[Rationale and Alternatives section](#rationale-and-alternatives).

The requirements for the storage method are that:
- It must provide encryption if required.
- It must provide the capability for write throttling. Heavy use of a
temporary store should not block writes to the primary store.
- Temporary files should be cleaned up on startup in case of a crash.

Using the same filesystem `vfs.FS` abstraction as Pebble does will
allow us to write an on-disk queue backed by flat files that satisfies the
first two of these requirements. The `vfs.FS` will be the same one the
temporary storage engine uses, so we will piggy-back off of the existing
temporary file management code. The interface is as simple as:

```
type Queue interface {
	Enqueue(coldata.Batch) error
	Dequeue() (coldata.Batch, error)
	Close() error
}
```

The flat file `Queue` implementation will have knobs to configure
the in-memory buffer size to amortize write overhead, and the maximum
size of a file, after which it will roll over to a new file. Different
configurations will be benchmarked to find good default values.
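As an illustration of how a buffering operator would drive this interface,
here is a minimal sketch of a spill-then-replay loop. It assumes, for
illustration only, that `Dequeue` returns a nil or zero-length batch once
the queue is exhausted; the real operators will be more involved:

```
// spillAndReplay is a hypothetical sketch: it drains an input of
// coldata.Batches to disk in FIFO order, then reads them all back. It
// assumes input returns a zero-length batch once exhausted. Operators
// like the hash router would interleave Enqueues and Dequeues instead.
func spillAndReplay(q Queue, input func() coldata.Batch, output func(coldata.Batch)) error {
	defer q.Close()
	// Write phase: flush every input batch to the on-disk queue.
	for b := input(); b.Length() != 0; b = input() {
		if err := q.Enqueue(b); err != nil {
			return err
		}
	}
	// Read phase: dequeue batches in the order they were written.
	for {
		b, err := q.Dequeue()
		if err != nil {
			return err
		}
		if b == nil || b.Length() == 0 {
			return nil
		}
		output(b)
	}
}
```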
Note that the maximum in-use disk space will be limited by a disk monitor,
similarly to what is currently done with external storage in the row
execution engine. As bytes are written to disk, they are accounted for by a
`mon.BytesMonitor`, which additionally provides observability into how much
space is in use at a given time, as well as how much disk space a given
query used, through `EXPLAIN ANALYZE`.

The implementation will create unique files using the following naming scheme:
```
<FlowID>_<QueueID>_<FileID>
```
A `Queue` will be created with a unique integer `ID` and will keep a counter
of the number of files created, which it will use as a suffix for
the file names to maintain uniqueness within the `Queue`. Prefixing
the filenames with the `FlowID` UUID (which uniquely identifies a
sub-plan of a distributed query running on a given node) will allow
the `Flow` to perform any file cleanup once the query completes (in
case of orphaned files when `recover`ing from possible `panic`s).

Thankfully, serialization of `coldata.Batch`es is already
implemented in the `colserde` package using the [Arrow IPC
format](https://arrow.apache.org/docs/ipc.html). These serialized bytes will
be buffered until a flush is required, at which point they are written to a
file with an accompanying file footer. The start offset of these written
bytes, as well as the number of bytes written, will be stored for when the
batches should be dequeued. A current limitation is that bytes cannot be
deserialized at a finer granularity than they were written at, although we
are planning on fixing this
(https://github.com/cockroachdb/cockroach/issues/42045).
Snappy compression is used to minimize file size.

The aforementioned abstraction of partitions as distinct queues can be
trivially implemented by instantiating as many queues as there are
partitions. This could mean many small files, since a queue must have at
least one file. The problem with this design is that there might be many
open file descriptors at once if the user of the queue needs to read a
little data from each partition, as in the case of an external merge sort,
for example. The number of open file descriptors a process may hold is
limited by the operating system, so this could lead to crashes. However,
this problem can be mitigated by enforcing a maximum number of open file
descriptors, scoped both to a temporary storage engine and to each queue.
The implementation of partitions would keep as many file descriptors open
as allowed, closing the least recently used file when a new partition must
be read from (see the sketch below).
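A minimal sketch of that least-recently-used descriptor management follows.
The `fileCache` type and its fields are hypothetical and only illustrate the
eviction policy described above, on top of Pebble's `vfs.FS`:

```
// fileCache is a hypothetical LRU cache of open file descriptors for
// partition files, capped at maxOpen descriptors. Opening a file when
// the cache is full closes the least recently used one first.
type fileCache struct {
	fs      vfs.FS
	maxOpen int
	// order holds the names of open files, least recently used first.
	order []string
	open  map[string]vfs.File
}

func (c *fileCache) get(name string) (vfs.File, error) {
	if f, ok := c.open[name]; ok {
		c.markUsed(name)
		return f, nil
	}
	if len(c.open) >= c.maxOpen {
		// Evict the least recently used descriptor to stay under the cap.
		// Close errors are ignored here for brevity.
		lru := c.order[0]
		c.order = c.order[1:]
		_ = c.open[lru].Close()
		delete(c.open, lru)
	}
	f, err := c.fs.Open(name)
	if err != nil {
		return nil, err
	}
	c.open[name] = f
	c.order = append(c.order, name)
	return f, nil
}

// markUsed moves name to the most recently used position.
func (c *fileCache) markUsed(name string) {
	for i, n := range c.order {
		if n == name {
			c.order = append(append(c.order[:i:i], c.order[i+1:]...), name)
			return
		}
	}
}
```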
## Rationale and Alternatives

A queue backed by Pebble was also considered as an alternative. The
configuration options for this queue are the write buffer size and
the maximum value size. Batches are enqueued up to the maximum value
size and written to a buffered writer with an ordinal as the key.

The benchmark used to test the performance of both queue implementations is
a simple write-everything then read-everything benchmark run on a Linux
gceworker. This simulates the general case of having a single goroutine
flush data which it then rereads and operates on. The flat file `Queue`
implementation syncs files every `512KiB`, similarly to Pebble. Note that
both implementations use snappy compression.

Before benchmarking each implementation with a varying buffer size,
the data size to store was fixed at `512MiB` and each implementation was
independently benchmarked to find the optimal maximum file size in the
case of flat files, and value size in the case of Pebble, while keeping
the write buffer size constant at 0. The options used for Pebble are
the same ones the Pebble temporary storage instance uses (see
https://github.com/cockroachdb/cockroach/blob/09b391536f72349c3951ce6b75c5231b58933b07/pkg/storage/engine/temp_engine.go#L117).



Given these graphs, a fixed file size of `32MiB` was chosen for the flat
queue implementation and a `128KiB` maximum value size for the Pebble
implementation.



Overall, we see a ~50% improvement in the best case by using flat files
instead of Pebble as the on-disk queue implementation.

Initially, the downside of choosing flat files over Pebble was that
there wasn't a clear understanding of how the requirements outlined
in [Detailed Design](#detailed-design) would be satisfied when using
flat files. However, since the `vfs.FS` implementation that the
Pebble temporary engine uses can be reused, there is no upside to
using Pebble, barring perhaps implementation simplicity, since details
like compression and file descriptor management are taken care of.
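For concreteness, the write-everything then read-everything benchmark shape
described above could look roughly like the following. This is a simplified
sketch against the `Queue` interface using the standard `testing` package;
`makeQueue`, `makeBatch`, and `sizeOf` are hypothetical helpers, not the
actual benchmark code:

```
// BenchmarkQueue is a simplified sketch of the benchmark shape used
// above: a single goroutine enqueues a fixed amount of data and then
// dequeues all of it. makeQueue, makeBatch, and sizeOf are hypothetical.
func BenchmarkQueue(b *testing.B) {
	const dataSize = 512 << 20 // 512MiB of batches per iteration.
	batch := makeBatch()       // A reusable batch of columnar test data.
	batchSize := sizeOf(batch) // Bytes occupied by one batch.
	for i := 0; i < b.N; i++ {
		q := makeQueue()
		// Write phase: flush everything to disk.
		for written := 0; written < dataSize; written += batchSize {
			if err := q.Enqueue(batch); err != nil {
				b.Fatal(err)
			}
		}
		// Read phase: read everything back in FIFO order, assuming a
		// nil or zero-length batch signals that the queue is empty.
		for {
			out, err := q.Dequeue()
			if err != nil {
				b.Fatal(err)
			}
			if out == nil || out.Length() == 0 {
				break
			}
		}
		if err := q.Close(); err != nil {
			b.Fatal(err)
		}
	}
}
```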