- Feature Name: Vectorized External Storage
- Status: draft
- Start Date: 2019-11-13
- Authors: Alfonso Subiotto Marqués and Yahor Yuzefovich
- RFC PR: [#42546](https://github.com/cockroachdb/cockroach/pull/42546)
- Cockroach Issue: [#37303](https://github.com/cockroachdb/cockroach/issues/37303)

# Summary

Add an on-disk queue abstraction so that vectorized operators can store
intermediate results on disk when they are too large to fit in main memory.
Without this ability, queries that use these vectorized operators will
encounter an out-of-memory error and will not be able to complete. Examples
of these operators are sorts and joins (among others). Adding support for
vectorized external storage will allow the full vectorized execution engine
to be turned on by default, unlocking performance gains for analytical
queries that use these buffering operators.

Note that CockroachDB already supports external
storage in its row-oriented execution engine (covered
[here](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20170522_external_storage.md));
this RFC covers external storage for the vectorized execution engine.

# Motivation

The vectorized execution engine offers significant performance gains for
analytical queries
(https://www.cockroachlabs.com/blog/how-we-built-a-vectorized-execution-engine/).
However, at the time of writing (19.2.0), only queries that use a constant
amount of memory are run through the vectorized execution engine by
default. This excludes a large family of queries that use operators such as
sorts or joins, because their use could lead to an out-of-memory error. Adding
external storage will allow these vectorized operators to be turned on by
default, resulting in performance improvements for analytical workloads.

# Guide-level explanation

The full list of components that need external storage is:

| Name | Reason | Solution |
|---|---|---|
| Hash router | Buffering data for blocked output | On-disk queue |
| Merge joiner | Buffering equality groups | On-disk queue |
| Hash joiner | Buffering the build table | GRACE hash join |
| Hash aggregator | Too many groups to fit in memory | GRACE hash join |
| Sorter | Buffering data to sort | External merge sort |
| Unordered distinct | Seen table too big to fit in memory | GRACE hash join |
| Window functions | Buffering window function results | Mix of external merge sort and GRACE hash join |

All of these components will use one or more of the following (a sketch of how
an external merge sort uses the on-disk queue follows this list):
- On-disk queue data structure
- [External merge sort algorithm](https://en.wikipedia.org/wiki/External_sorting#External_merge_sort)
- [GRACE hash join algorithm](https://en.wikipedia.org/wiki/Hash_join#Grace_hash_join)

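To make the composition concrete, here is a minimal sketch (not CockroachDB
code) of an external merge sort built on top of a FIFO on-disk queue. Rows are
plain `int`s rather than `coldata.Batch`es, the standard library `sort`
package is assumed to be imported, and `queue`/`newQueue` are hypothetical
stand-ins for the abstraction described in the detailed design below; the
important property is that the merge phase only keeps one in-memory batch per
sorted run.

```
// queue is a hypothetical stand-in for the on-disk FIFO queue proposed in
// this RFC; rows are ints instead of coldata.Batches for brevity.
type queue interface {
	enqueue(batch []int) error
	dequeue() ([]int, error) // returns nil once the queue is empty
}

// externalSort sorts input while holding at most maxRows rows in memory at a
// time during the run-building phase. newQueue returns a disk-backed queue.
func externalSort(input []int, maxRows int, newQueue func() queue) ([]int, error) {
	// Phase 1: split the input into memory-sized chunks, sort each chunk in
	// memory, and flush it to its own on-disk queue (a "sorted run").
	var runs []queue
	for start := 0; start < len(input); start += maxRows {
		end := start + maxRows
		if end > len(input) {
			end = len(input)
		}
		chunk := append([]int(nil), input[start:end]...)
		sort.Ints(chunk)
		q := newQueue()
		if err := q.enqueue(chunk); err != nil {
			return nil, err
		}
		runs = append(runs, q)
	}

	// Phase 2: merge the runs by repeatedly emitting the smallest head row.
	// Only the current batch of each run is held in memory; exhausted batches
	// are refilled from disk. A real implementation would use a heap.
	heads := make([][]int, len(runs))
	for i, q := range runs {
		batch, err := q.dequeue()
		if err != nil {
			return nil, err
		}
		heads[i] = batch
	}
	var out []int
	for {
		min, minIdx := 0, -1
		for i, h := range heads {
			if len(h) > 0 && (minIdx == -1 || h[0] < min) {
				min, minIdx = h[0], i
			}
		}
		if minIdx == -1 {
			return out, nil // all runs exhausted
		}
		out = append(out, min)
		heads[minIdx] = heads[minIdx][1:]
		if len(heads[minIdx]) == 0 {
			// Refill this run's head batch from disk.
			next, err := runs[minIdx].dequeue()
			if err != nil {
				return nil, err
			}
			heads[minIdx] = next
		}
	}
}
```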
Similarly to the row execution engine [RFC on external
storage](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20170522_external_storage.md#detailed-requirements),
the algorithms to implement are limited to an external sort and an external
hash join. However, this proposal differs from the original external storage
RFC in that instead of using a sorted KV storage engine as a map and
delegating on-disk sorting to it, the vectorized engine will use disk only to
store data and use external-storage aware operators to perform external merge
sorts and GRACE hash joins. This approach is significantly more performant
(refer to the [Rationale and Alternatives](#rationale-and-alternatives) section).

The proposal in this RFC is to introduce an on-disk queue data structure
backed by flat files where batches of columnar data are serialized
using the [Arrow IPC format](https://arrow.apache.org/docs/ipc.html).
This queue can be used directly by most operators that need to buffer
an unbounded amount of data in FIFO order. There will also be an additional
abstraction that will give the caller the option to use separate queues as
distinct partitions. This last abstraction will be used to flush and re-read
partitions in the case of a GRACE hash join or external merge sort.

The impact on the end user will be improved performance for large
analytical queries, as queries that would previously not run through
the vectorized execution engine will do so after the implementation of
external storage. There will be no other user-facing changes; the change
proposed in this RFC is purely focused on improving an existing feature.

# Reference-level explanation

Given that external merge sorts and GRACE hash joins are well-established
algorithms (external sorts more so; refer to [this
issue](https://github.com/cockroachdb/cockroach/issues/24582) for
an explanation of a GRACE hash join), this RFC will focus on the
backend infrastructure that will allow these algorithms to spill
to disk by describing the design of a single queue.

## Detailed design

A prototype of the proposed on-disk queue backed by flat files is available at
https://github.com/asubiotto/cockroach/tree/prototype. A Pebble-backed
alternative was also considered in that prototype. The choice of storage
method is examined in more detail in the
[Rationale and Alternatives section](#rationale-and-alternatives).

The requirements for the storage method are:
- It must provide encryption if required.
- It must provide the capability for write throttling: heavy use of a
temporary store should not block writes to the primary store.
- Temporary files should be cleaned up on startup in case of a crash.

Using the same filesystem `vfs.FS` abstraction as Pebble does will
allow us to write an on-disk queue backed by flat files that satisfies the
first two of these requirements. The `vfs.FS` will be the same one used by the
temporary storage engine, so we will piggy-back off of the existing
temporary file management code. The interface is as simple as:

```
type Queue interface {
    // Enqueue serializes a batch and adds it to the back of the queue.
    Enqueue(coldata.Batch) error
    // Dequeue removes and returns the batch at the front of the queue
    // (FIFO order).
    Dequeue() (coldata.Batch, error)
    // Close closes the queue and releases its resources.
    Close() error
}
```

The flat file `Queue` implementation will have knobs to configure
the in-memory buffer size, used to amortize write overhead, and the maximum
size of a file, after which the queue rolls over to a new file. Different
configurations will be benchmarked to find good default values.

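As an illustration only, a configuration struct for these knobs might look
like the following sketch. The names (`DiskQueueCfg`, `NewDiskQueue`,
`newDiskQueue`) and the defaults are placeholders rather than the final API;
the `FS` field refers to Pebble's `vfs` package.

```
// DiskQueueCfg is a hypothetical set of knobs for the flat file Queue.
type DiskQueueCfg struct {
	// FS is the same vfs.FS abstraction used by the temporary storage engine,
	// which provides encryption and write throttling when configured.
	FS vfs.FS
	// Dir is the temporary storage directory to create files in.
	Dir string
	// BufferSizeBytes is the size of the in-memory buffer that serialized
	// batches accumulate in before being flushed, amortizing write overhead.
	BufferSizeBytes int
	// MaxFileSizeBytes is the maximum size of a single file; once a file
	// grows past it, the queue rolls over to a new file.
	MaxFileSizeBytes int
}

// NewDiskQueue returns a Queue backed by flat files. The defaults below are
// placeholders; the real values will be chosen by benchmarking.
func NewDiskQueue(cfg DiskQueueCfg) (Queue, error) {
	if cfg.BufferSizeBytes == 0 {
		cfg.BufferSizeBytes = 512 << 10 // 512KiB
	}
	if cfg.MaxFileSizeBytes == 0 {
		cfg.MaxFileSizeBytes = 32 << 20 // 32MiB
	}
	// newDiskQueue (hypothetical) creates the first file through cfg.FS and
	// returns the flat file implementation.
	return newDiskQueue(cfg)
}
```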
Note that the maximum in-use disk space will be limited by a disk monitor,
similarly to what is currently done with external storage in the row execution
engine. As bytes are written to disk, they are accounted for by a
`mon.BytesMonitor`, which additionally provides observability into how much
space is being used at a given time as well as how much disk space was used
by a given query through `EXPLAIN ANALYZE`.

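Roughly, this accounting can hook into the queue's flush path as in the sketch
below. `diskAcc` is assumed to be a `mon.BoundAccount` obtained from the
flow's disk monitor, and the remaining fields and helpers are hypothetical; an
accounting error surfaces as a query error before the temporary disk budget is
exceeded.

```
// flush writes the buffered serialized batches to the current file, first
// registering the written bytes with the disk monitor.
func (q *diskQueue) flush() error {
	n := int64(len(q.writeBuffer))
	// Grow errors out if this write would exceed the temporary disk budget.
	if err := q.diskAcc.Grow(q.ctx, n); err != nil {
		return err
	}
	if _, err := q.file.Write(q.writeBuffer); err != nil {
		return err
	}
	q.writeBuffer = q.writeBuffer[:0]
	return nil
}

// Close removes the queue's files and releases all disk space accounted for
// by this queue back to the monitor.
func (q *diskQueue) Close() error {
	q.diskAcc.Close(q.ctx)
	return q.removeFiles() // hypothetical helper
}
```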
The implementation will create unique files using the following naming scheme:
```
<FlowID>_<QueueID>_<FileID>
```
A `Queue` will be created with a unique integer `ID` and keep a counter
for the number of files created, which it will use as a suffix for
the file names to maintain uniqueness within the `Queue`. Prefixing
the filenames with the `FlowID` UUID (which uniquely describes a
sub-plan of a distributed query running on a given node) will allow
the `Flow` to perform any file cleanup once the query completes (in
case of orphaned files when `recover`ing from possible `panic`s).

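For illustration, file naming and crash cleanup could look roughly like the
following sketch (helper and field names are hypothetical; `fs` is assumed to
be the same `vfs.FS` the files were created with):

```
// nextFilename returns the name of the next file for this queue, e.g.
// "<flow UUID>_3_7" for queue 3's eighth file.
func (q *diskQueue) nextFilename() string {
	name := fmt.Sprintf("%s_%d_%d", q.flowID, q.queueID, q.fileCounter)
	q.fileCounter++
	return name
}

// cleanupFlowFiles removes any files left behind by a flow, e.g. when
// recovering from a panic or after a crash.
func cleanupFlowFiles(fs vfs.FS, dir string, flowID string) error {
	files, err := fs.List(dir)
	if err != nil {
		return err
	}
	for _, f := range files {
		if strings.HasPrefix(f, flowID+"_") {
			if err := fs.Remove(filepath.Join(dir, f)); err != nil {
				return err
			}
		}
	}
	return nil
}
```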
Thankfully, serialization of `coldata.Batch`es is already
implemented in the `colserde` package using the [Arrow IPC
format](https://arrow.apache.org/docs/ipc.html). These serialized bytes will
be buffered until a flush is required, at which point they are written to a
file with an accompanying file footer. The start offset and length of each
written region will be recorded so that the batches can later be dequeued.
A current limitation is that bytes cannot be deserialized at a finer
granularity than they are written at, although we are planning
on fixing this (https://github.com/cockroachdb/cockroach/issues/42045).
Snappy compression is used to minimize file size.

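A rough sketch of that bookkeeping (field names are hypothetical): each flush
records which region of which file it wrote, so that `Dequeue` knows how many
bytes to read back and deserialize together.

```
// fileRegion describes the bytes written by a single flush. Because batches
// cannot currently be deserialized at a finer granularity than they were
// written at, Dequeue reads back one whole region at a time.
type fileRegion struct {
	fileID int   // which of the queue's files the region lives in
	offset int64 // start offset of the region within that file
	length int64 // number of (compressed) bytes written
}

// recordFlush is called after the write buffer has been flushed so that the
// bytes can later be read back in FIFO order.
func (q *diskQueue) recordFlush(numBytes int64) {
	q.regions = append(q.regions, fileRegion{
		fileID: q.curFileID,
		offset: q.curFileOffset,
		length: numBytes,
	})
	q.curFileOffset += numBytes
}
```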
The aforementioned abstraction of partitions as distinct queues can be trivially
implemented by instantiating as many queues as there are partitions. This could
mean many small files, as a queue must have at least one file. The problem with
this design is that there might be many open file descriptors at once if the
user of the queue needs to read a small amount of data from each partition (as
in an external merge sort, for example). The number of file descriptors a
process may hold open is limited on most systems, so this could lead to
crashes. However, this problem can be mitigated by enforcing a maximum number
of open file descriptors scoped to a temporary storage engine as well as per
queue. The implementation of partitions would keep as many file descriptors
open as allowed, closing the least recently used file when a new partition
must be read from.

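A hypothetical sketch of what this partitioning abstraction could look like
(names are placeholders, not a settled API):

```
// PartitionedQueue exposes a set of independent FIFO queues, one per
// partition, as needed by a GRACE hash join or an external merge sort.
type PartitionedQueue interface {
	// Enqueue adds a batch to the given partition, creating it if necessary.
	Enqueue(partition int, batch coldata.Batch) error
	// Dequeue removes and returns the next batch from the given partition.
	Dequeue(partition int) (coldata.Batch, error)
	Close() error
}

// A flat file implementation would wrap one Queue per partition and keep at
// most maxOpenFDs file descriptors open, closing the least recently used
// partition's file before reading from a partition whose file is closed.
type partitionedDiskQueue struct {
	partitions map[int]*diskQueue
	// lru orders partitions with open file descriptors from least to most
	// recently used.
	lru        []int
	maxOpenFDs int
}
```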
## Rationale and Alternatives

A queue backed by Pebble was also considered as an alternative. The
configuration options for this queue are the write buffer size and
the maximum value size. Batches are enqueued up to the maximum value
size and written to a buffered writer with an ordinal as the key.

The benchmark used to test the performance of both queue implementations is
a simple write-everything then read-everything benchmark run on a Linux
gceworker. This simulates the general case of having a single goroutine flush
data which it then rereads and operates on. The flat file `Queue`
implementation syncs files every `512KiB`, similarly to Pebble. Note that both
implementations use Snappy compression.

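The benchmark is roughly of the following shape (a simplified sketch, not the
actual benchmark code; `makeRandomBatches` and `newTestQueue` are hypothetical
helpers):

```
const dataSize = 512 << 20 // 512MiB of data, as in the comparison below

// BenchmarkQueue enqueues dataSize bytes worth of batches and then dequeues
// everything, simulating an operator that spills data and re-reads it.
func BenchmarkQueue(b *testing.B) {
	batches := makeRandomBatches(dataSize) // hypothetical helper
	b.SetBytes(dataSize)
	for i := 0; i < b.N; i++ {
		q := newTestQueue(b) // flat file or Pebble backed, hypothetical helper
		for _, batch := range batches {
			if err := q.Enqueue(batch); err != nil {
				b.Fatal(err)
			}
		}
		for {
			batch, err := q.Dequeue()
			if err != nil {
				b.Fatal(err)
			}
			// Assume a nil or zero-length batch signals that the queue is empty.
			if batch == nil || batch.Length() == 0 {
				break
			}
		}
		if err := q.Close(); err != nil {
			b.Fatal(err)
		}
	}
}
```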
Before benchmarking each implementation with a varying buffer size,
the data size to store was fixed at `512MiB` and each implementation was
independently benchmarked to find the optimal maximum file size in the
case of flat files, and value size in the case of Pebble, while keeping
the write buffer size constant at 0. The options used for Pebble are
the same ones the Pebble temporary storage instance uses (see
https://github.com/cockroachdb/cockroach/blob/09b391536f72349c3951ce6b75c5231b58933b07/pkg/storage/engine/temp_engine.go#L117).

![Find](images/vectorized_external_storage1.png?raw=true "Find")

Given these graphs, a fixed file size of `32MiB` was chosen for the flat queue
implementation and a `128KiB` maximum value size for the Pebble implementation.

![Compare](images/vectorized_external_storage2.png?raw=true "Compare")

Overall, in the best case we see a ~50% improvement by using flat files
instead of Pebble as the on-disk queue implementation.

Initially, the downside of choosing flat files over Pebble was that
there wasn’t a clear understanding of how the requirements outlined
in [Detailed Design](#detailed-design) would be satisfied when using
flat files. However, since the `vfs.FS` implementation that the
Pebble temporary engine uses can be reused, there is no upside to
using Pebble, barring perhaps implementation simplicity, since details
like compression and file descriptor management are taken care of.