
     1  - Feature Name: External Storage for DistSQL processors
     2  - Status: completed
     3  - Start Date: 2017-05-11
     4  - Authors: Arjun Narayan and Alfonso Subiotto Marques
     5  - RFC PR: [#16069](https://github.com/cockroachdb/cockroach/pull/16069)
     6  - Cockroach Issue: [#15206](https://github.com/cockroachdb/cockroach/issues/15206)
     7  
     8  # Summary
     9  Add support for DistSQL processors to use external storage in addition
    10  to memory to store intermediate data when processing large DistSQL
    11  queries.
    12  
    13  # Motivation and Background
    14  Currently, DistSQL processors have a hard limit on the amount of
    15  memory they can use before they get OOM-killed. This limits the size
    16  of certain queries, particularly when efficient secondary indices do
    17  not exist, so some amount of computation must be done holding a large
    18  amount of intermediate state at a processor. For instance, a JOIN on
    19  two very large tables, where there is no ordering on either table's
    20  JOIN predicate columns, must either
    21  
* Use a HashJoiner, which stores one table in a single large
  hashtable and streams the other table, looking up matches in the
  hashtable and emitting rows when appropriate.
    25  
    26  * Sort both tables by the JOIN predicate columns, and then stream them
    27    to a MergeJoiner.
    28  
    29  Either of these two solutions runs out of memory at some query size,
    30  and requires external storage to process larger queries.
    31  
    32  # Scope
    33  This problem was first encountered in running TPC-H queries on
    34  moderately sized scale factors (5 and above). It was exacerbated by
    35  the fact that DistSQL currently does not plan MergeJoins and resorts
    36  to HashJoins in all cases. TPC-H itself should not be bottlenecked on
    37  the lack of external storage, as relevant secondary indices exist for
    38  all queries that we are aware of, such that optimal planning keeps
    39  tuples flowing smoothly through the dataflow graph without requiring
any processor to accumulate a disproportionate number of
    41  tuples. However, users might still want to run the occasional
    42  analytics query without indexes, and we should support that use case.
    43  
At a minimum, external storage should provide enough swap space to
keep DistSQL processors from being OOM-killed on queries over the
TPC-H scale factor 300 dataset. Whatever storage format is chosen
should be compatible with all DistSQL processors.
    48  
    49  # Related Work
    50  
None of this is new. Existing databases use both the hash-based and
the sort-based path for query execution, and typically use a
cost-based query planner to decide _which_ of the two paths to
use. So for us, it's more a case of prioritization and beginning with
an implementation that is extensible. External sorting is typically
the first to be implemented in OLTP databases, since it comes in
handy for building new indices, but we have a different pathway for
executing those. On-disk storage for query execution only comes into
play for OLAP-style queries.
    59  
For example, consider how such databases run a JOIN:
    61  
    62  1. They have implementations for the GRACE hash join, as well as the
    63     external sorter + merge join.
    64  
    65  2. They have extensive microbenchmarks for the underlying costs of
    66     each operation (e.g. total time cost of writing a tuple into a
    67     serialized stream, or the expected time cost of doing a random
    68     insert into a sorted on-disk BTree). They have extensive table
    69     statistics.
    70  
    71  3. At query planning time, with the table statistics, the database can
    72     compute the cost of using either plan (i.e. estimate the number of
    73     passes the GRACE hash join would take and multiply it out with the
    74     average cost of writing each tuple, etc.) and choose the better
    75     one.
    76  
4. We don't have (2), and thus we don't have a cost-based query
   planner capable of doing (3). Right now, our implementations only
   do purely logical query planning (i.e. pursue only those
   optimizations that are asymptotic wins), so the RocksDB path is
   along those lines as well. A rough sketch of the cost comparison
   described in (3) follows below.
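
For illustration only, here is a minimal sketch (in Go, with made-up
statistics and hypothetical function names; nothing below comes from
CockroachDB) of the kind of back-of-the-envelope I/O cost comparison
described in (3), using standard textbook approximations for the two
join strategies.

```go
package main

import (
	"fmt"
	"math"
)

// All sizes are in blocks (pages). These are standard textbook
// approximations with illustrative inputs, not measurements of any
// CockroachDB component.

// passes estimates how many partitioning (or merge) passes are needed
// before each partition (or run) fits in memory.
func passes(blocks, memBlocks float64) float64 {
	if blocks <= memBlocks {
		return 0
	}
	return math.Ceil(math.Log(blocks/memBlocks) / math.Log(memBlocks-1))
}

// graceHashJoinIOs: each partitioning pass reads and writes both inputs;
// the final join pass reads both inputs once.
func graceHashJoinIOs(r, s, mem float64) float64 {
	p := passes(math.Min(r, s), mem) // partition until the smaller side fits
	return 2*p*(r+s) + (r + s)
}

// sortMergeJoinIOs: externally sort both inputs, then merge-join them in a
// single streaming pass (ignoring the option of pipelining the final merge).
func sortMergeJoinIOs(r, s, mem float64) float64 {
	sortCost := func(b float64) float64 {
		if b <= mem {
			return b // fits in memory: a single read suffices
		}
		runs := math.Ceil(b / mem)
		mergePasses := math.Ceil(math.Log(runs) / math.Log(mem-1))
		return 2 * b * (1 + mergePasses) // each pass reads and writes b blocks
	}
	return sortCost(r) + sortCost(s) + (r + s)
}

func main() {
	// Illustrative table statistics: R is 1e6 blocks, S is 4e6 blocks, and
	// the processor has 1e4 blocks of working memory.
	r, s, mem := 1e6, 4e6, 1e4
	fmt.Printf("GRACE hash join: ~%.0f block I/Os\n", graceHashJoinIOs(r, s, mem))
	fmt.Printf("sort-merge join: ~%.0f block I/Os\n", sortMergeJoinIOs(r, s, mem))
}
```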
    81  
    82  # Detailed requirements
Examining all DistSQL processors, these are the data structures each
one would need on disk:
    85  
| Processor name  | On-disk data structure |
|-----------------|------------------------|
    88  | NoopCore        | n/a       |
    89  | TableReader     | n/a       |
    90  | JoinReader      | n/a       |
    91  | Sorter          | Sorted Map|
    92  | Aggregator      | Map       |
    93  | Distinct        | Map       |
    94  | MergeJoiner     | n/a       |
    95  | HashJoiner      | Map       |
    96  | Values          | n/a       |
    97  | AlgebraicSetOp  | Map       |
    98  
Essentially, we have two data types to store: a sorted map and a hash
map. The processors that use a hash map (distinct, hashjoin, and
algebraic set op) do internally benefit from keys being sorted: when
the keys are sorted, they emit rows eagerly and don't need to build up
expensive in-memory state. They thus only accumulate state when keys
are *not* sorted, so sorting is in direct opposition to accumulating
state large enough to need external storage.
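
To make the distinction concrete, below is a minimal sketch of the two
container shapes a processor would need; the interface names are
hypothetical and are not existing CockroachDB types. Either shape
could be backed by memory below some threshold and by external storage
above it.

```go
package distsqlstorage

// EncodedRow is a serialized tuple; the encoding is left abstract here.
type EncodedRow []byte

// HashMap is the container shape needed by the HashJoiner, Distinct,
// Aggregator, and AlgebraicSetOp processors: point lookups and inserts
// keyed by the (encoded) grouping or equality columns.
type HashMap interface {
	Put(key []byte, row EncodedRow) error
	Get(key []byte) ([]EncodedRow, error)
	Close() error
}

// SortedMap is the container shape needed by the Sorter: inserts in
// arbitrary order, followed by a single iteration in key order.
type SortedMap interface {
	Put(key []byte, row EncodedRow) error
	// NewIterator returns the stored rows in ascending key order.
	NewIterator() Iterator
	Close() error
}

// Iterator walks rows in key order.
type Iterator interface {
	Next() (key []byte, row EncodedRow, ok bool)
	Close() error
}
```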
   106  
   107  # Potential solutions
   108  
   109  ## Option 1: On-disk sort-only
The first and simplest solution is to add an on-disk code path
exclusively to the Sorter processor. All other processors use less
memory when their input data is appropriately sorted, so the planner
would leverage the on-disk-capable Sorter to sort data upstream.
   114  
The simplest on-disk sorter is an external Unix `sort` process, or
some other disk-backed sorter: we feed it rows and read back the
sorted stream of tuples. More seriously, we could use a more
battle-tested external sort implementation, or even roll our own
native Go implementation (e.g. see PostgreSQL's
[external sort](https://github.com/postgres/postgres/blob/master/src/backend/utils/sort/tuplesort.c)).
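
As an illustration of the quick-and-dirty end of that spectrum, the
sketch below shells out to Unix `sort` and streams rows through it;
the newline-delimited, key-prefixed row encoding is an assumption made
for the example and has nothing to do with DistSQL's actual row
encoding.

```go
package main

import (
	"bufio"
	"fmt"
	"os/exec"
)

// sortViaUnixSort streams newline-delimited, key-prefixed rows through an
// external `sort` process and returns them in sorted order.
func sortViaUnixSort(rows []string) ([]string, error) {
	cmd := exec.Command("sort")
	stdin, err := cmd.StdinPipe()
	if err != nil {
		return nil, err
	}
	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return nil, err
	}
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	// Feed the unsorted rows to the child process in the background so that
	// writing and reading can overlap without deadlocking on pipe buffers.
	go func() {
		w := bufio.NewWriter(stdin)
		for _, r := range rows {
			fmt.Fprintln(w, r)
		}
		w.Flush()
		stdin.Close() // signal EOF so `sort` can emit its output
	}()
	// Read back the sorted stream.
	var sorted []string
	scanner := bufio.NewScanner(stdout)
	for scanner.Scan() {
		sorted = append(sorted, scanner.Text())
	}
	if err := scanner.Err(); err != nil {
		return nil, err
	}
	return sorted, cmd.Wait()
}

func main() {
	out, err := sortViaUnixSort([]string{"b|row2", "a|row1", "c|row3"})
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```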
   122  
The biggest downside of only supporting external sorting is that we
might have to do multiple re-sorts for a complex query plan (for
instance, a query plan with multiple sequential joins on different
join conditions would require a re-sort after the first join and
before the second), since each processor's ordering requirements must
be treated as a stricter precondition than they currently are (at
least until we have sophisticated table statistics and join planning
that can take this into account). This is a quick-and-dirty fix.
   132  
   133  ## Option 2: Use temporary flat files to page out partial streams.
   134  For processors that require an unsorted map, we could use temporary
   135  files to page out partial output streams, a la GRACE hash joins,
merging them back later. Here are two walkthroughs of how flat files
would be used by processors:
   138  
   139  ### Example 1: HashJoin with on-disk streams
   140  Consider the following algorithm for
   141  processing a HashJoin:
   142  
1. Read n rows from the right input stream, where n is the upper limit
   on the number of rows we can hold in memory, and build a hash table
   over them.
2. Read all rows from the left input stream and, where we see a match
   in the hash table, serialize the joined row and emit it to a flat
   file.
3. Read the next n rows from the right input stream, reread the left
   input stream from the start, and emit matches to a second flat file
   (note that DistSQL does not currently allow for rereading streams,
   so that capability would have to be added to make this work).
   154  
   155  4. Repeat until we exhaust the right input stream.
   156  
5. Read all the flat files back, merging them (if we need to preserve
   some ordering) and emitting tuples. A sketch of this scheme follows
   below.
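
Here is a minimal sketch of steps 1-5, with in-memory slices standing
in for DistSQL streams and a trivial `key|payload` text encoding;
every name is hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

type row struct {
	key, payload string
}

// blockHashJoin joins left and right on row.key, holding at most blockSize
// right rows in memory at a time. Matches for each block are spilled to a
// temporary flat file and read back at the end, mirroring steps 1-5 above.
func blockHashJoin(left, right []row, blockSize int) ([]string, error) {
	var spillFiles []string
	for start := 0; start < len(right); start += blockSize {
		// Step 1: build a hash table over the next block of right rows.
		end := start + blockSize
		if end > len(right) {
			end = len(right)
		}
		ht := make(map[string][]row)
		for _, r := range right[start:end] {
			ht[r.key] = append(ht[r.key], r)
		}

		// Steps 2-4: reread the left input and spill matches to a flat file.
		f, err := os.CreateTemp("", "hashjoin-spill-*")
		if err != nil {
			return nil, err
		}
		w := bufio.NewWriter(f)
		for _, l := range left {
			for _, r := range ht[l.key] {
				fmt.Fprintf(w, "%s|%s|%s\n", l.key, l.payload, r.payload)
			}
		}
		if err := w.Flush(); err != nil {
			return nil, err
		}
		f.Close()
		spillFiles = append(spillFiles, f.Name())
	}

	// Step 5: read the flat files back and emit the joined rows.
	var out []string
	for _, name := range spillFiles {
		f, err := os.Open(name)
		if err != nil {
			return nil, err
		}
		sc := bufio.NewScanner(f)
		for sc.Scan() {
			out = append(out, sc.Text())
		}
		f.Close()
		os.Remove(name)
	}
	return out, nil
}

func main() {
	left := []row{{"a", "l1"}, {"b", "l2"}, {"a", "l3"}}
	right := []row{{"a", "r1"}, {"c", "r2"}, {"b", "r3"}}
	joined, err := blockHashJoin(left, right, 2)
	if err != nil {
		panic(err)
	}
	for _, j := range joined {
		fmt.Println(j)
	}
}
```

Note how the left input is read once per block of the right input;
that reread is exactly the read amplification discussed under the
downsides below.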
   159  
   160  ### Example 2: MergeSort with on-disk streams
1. Read n rows from the input stream, inserting them into an in-memory
   heap. Flush the heap to disk.
   163  
   164  2. Repeat until the input stream is exhausted.
   165  
3. Merge all the files (see the sketch below).
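
Here is a minimal sketch of this run-and-merge scheme, with an
in-memory sort of each run standing in for the heap and plain strings
standing in for rows; again, all names are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
)

// externalSort sorts arbitrarily many lines using at most memLimit lines of
// memory: it flushes sorted runs to temporary files (steps 1-2) and then
// merges the runs (step 3).
func externalSort(input []string, memLimit int) ([]string, error) {
	// Steps 1-2: cut the input into sorted runs and flush each run to disk.
	var runFiles []string
	for start := 0; start < len(input); start += memLimit {
		end := start + memLimit
		if end > len(input) {
			end = len(input)
		}
		run := append([]string(nil), input[start:end]...)
		sort.Strings(run)
		f, err := os.CreateTemp("", "sort-run-*")
		if err != nil {
			return nil, err
		}
		w := bufio.NewWriter(f)
		for _, line := range run {
			fmt.Fprintln(w, line)
		}
		w.Flush()
		f.Close()
		runFiles = append(runFiles, f.Name())
	}

	// Step 3: merge the runs by repeatedly emitting the smallest head.
	type cursor struct {
		sc   *bufio.Scanner
		head string
		ok   bool
	}
	var cursors []*cursor
	for _, name := range runFiles {
		f, err := os.Open(name)
		if err != nil {
			return nil, err
		}
		defer f.Close()
		defer os.Remove(name)
		c := &cursor{sc: bufio.NewScanner(f)}
		c.ok = c.sc.Scan()
		c.head = c.sc.Text()
		cursors = append(cursors, c)
	}
	var out []string
	for {
		min := -1
		for i, c := range cursors {
			if c.ok && (min == -1 || c.head < cursors[min].head) {
				min = i
			}
		}
		if min == -1 {
			return out, nil // all runs exhausted
		}
		c := cursors[min]
		out = append(out, c.head)
		c.ok = c.sc.Scan()
		c.head = c.sc.Text()
	}
}

func main() {
	sorted, err := externalSort([]string{"d", "b", "e", "a", "c"}, 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(sorted)
}
```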
   167  
   168  ### Downsides:
The first downside is in the runtime: GRACE hash joins are not
particularly pretty in worst-case algorithmic analysis, and this is
evident in that the hash join reads complete streams multiple times.

The second downside is that we need to build a new processor that can
provide this buffering service. However, this is mitigated by the fact
that a BufferedStream processor would reuse the same code, since all
it does is write a stream to disk and reread it as many times as
necessary.
   178  
   179  Finally, this requires a bunch of processor rewriting work: each
   180  processor would have to be rewritten to deal with multiple passes, and
   181  this also makes the physical planner's job more critical: runtimes can
   182  get pathological if we have to resort to paging out streams, so
   183  getting ordering requirements right becomes even more important.
   184  
While writing serialized streams to disk has the advantage of having
no write amplification, it has very high read amplification, since
streams are reread multiple times. Unfortunately, GRACE hash joins can
become very expensive for very large datasets, where the number of
passes over the data exceeds `log(n)`. Thus, this is not the most
scalable implementation. However, in combination with a specialized
sorter node (which does recover the `n log(n)` complexity) and some
syntax for explicitly hinting when to sort data, we can recover
reasonable performance at scale.
   194  
   195  
   196  ## Option 3: Use RocksDB for storing sorted maps
   197  For processors that require (or could benefit from) a sorted map,
   198  RocksDB offers the obvious straightforward on-disk representation. A
   199  way to elegantly augment existing processors is for them to use a new
   200  map library, which stores a map in-memory, but flushes it to a
   201  temporary RocksDB keyspace when it reaches some threshold. Instead of
   202  then using a second in-memory map and having to merge streams on the
   203  return path, the processors continue using the on-disk map, inserting
   204  directly into RocksDB, and performing a single scan at the end.
   205  
This version also gives us random-access reads and writes, which makes
it simple to use external storage for aggregations with a large number
of buckets: we simply use the KV interface to store aggregations keyed
by bucket. As a bonus, when we finally read off the aggregations, they
will be in sorted order.
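
Here is a minimal sketch of that bucket-keyed pattern, using a toy
in-memory map as a stand-in for the processor's temporary sorted
keyspace (in the proposal below, the store would be RocksDB; all names
are hypothetical).

```go
package main

import (
	"fmt"
	"sort"
)

// sortedKV is a stand-in for a processor's temporary sorted keyspace; in the
// proposal this would be backed by RocksDB rather than a Go map.
type sortedKV map[string]int64

// add does a read-modify-write of the running aggregate for one bucket.
func (kv sortedKV) add(bucket string, delta int64) {
	kv[bucket] += delta
}

// scan returns the buckets in key order, which is what a scan over the
// processor's on-disk keyspace would give us for free.
func (kv sortedKV) scan(emit func(bucket string, sum int64)) {
	keys := make([]string, 0, len(kv))
	for k := range kv {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		emit(k, kv[k])
	}
}

func main() {
	// SUM(v) GROUP BY bucket over a small input stream.
	agg := sortedKV{}
	for _, in := range []struct {
		bucket string
		v      int64
	}{{"b", 2}, {"a", 1}, {"b", 3}, {"c", 5}} {
		agg.add(in.bucket, in.v)
	}
	agg.scan(func(b string, s int64) { fmt.Printf("%s -> %d\n", b, s) })
}
```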
   211  
Do note that, if configured correctly, RocksDB recovers the algorithm
in Option 2 Example 2: inserting sequential sorted `WriteBatches` with
compaction turned off amounts to writing sorted runs to disk and
merging them on read. This leaves `cgo` overhead as the only remaining
downside of this path.
   216  
   217  # Proposal
   218  
We propose to use RocksDB for external storage, with a dedicated
additional RocksDB instance used solely by DistSQL processors. This is
the most general implementation (in terms of scaling to very complex
queries), is completely opaque to the user, requires no special syntax
at large scales, and requires less implementation work in modifying
processors one-by-one (their internal maps would be changed to use an
opaque KV library that switches from in-memory maps to RocksDB at a
configurable threshold). The internal knowledge base for tuning and
working with RocksDB already exists at CockroachDB, so this reduces
future implementation risk as well.
   229  
   230  ## Benchmarking overhead
   231  
As a sanity check to ensure that this isn't wildly off base, we will
first benchmark RocksDB with minimal compaction against a standard
external sort library, and make sure the overhead isn't egregious.
   235  
   236  We need to isolate the following costs:
   237  
1. External sort vs RocksDB: We will sort a billion rows using a
   standard external sort, and separately insert them (in random
   order) into RocksDB and scan them back.
   241  2. RocksDB vs Pure-go: We will also try, as a stretch goal, to measure
   242     the RocksDB overhead against a pure-Go implementation, such
   243     as [Badger](https://github.com/dgraph-io/badger).
   244  
   245  ## RocksDB Implementation Concerns
The primary concern in utilizing disk is contention with the existing
transactional RocksDB storage engine for disk I/O. We want to allow
the use of an explicit flag to denote the temporary storage location
to be used if the node is deployed with multiple disks.
   250  
   251  The second concern is that the existing RocksDB instance is tuned to
   252  be as efficient as possible for reads, and thus trades off much higher
   253  write amplification (via aggressive compactions) in exchange for much
   254  lower read amplification. This is not the same tradeoff that we
   255  want.
   256  
   257  Finally, RocksDB is very complex, and for this use case we don't need
   258  the majority of its database features, so it's potentially
   259  overkill. We have to be careful that the overhead is not too high.
   260  
These concerns lead us to a few rules for implementing the RocksDB
option:
   263  
   264  * Use a separate RocksDB instance explicitly for DistSQL processors.
   265  
   266  * By default this instance uses a `tmp_storage` location under the
   267    first store's directory, overridable by a command line flag. It is
   268    deleted and recreated from scratch whenever the node starts.
   269  
   270  * Disable write-ahead logging (`write_option.disableWAL`) to lower
   271    write amplification.
   272  
* Processors carve out temporary keyspaces in this RocksDB instance (a
  key-encoding sketch follows this list).
   274  
   275  * Use range deletion tombstones for rapidly deleting a processor's
   276    entire keyspace when it is done.
   277  
   278  * Experiment with compaction settings (including the extreme of
   279    turning it off completely), since, fundamentally, we need to write
   280    data once and read it back once.
   281  
   282  * This is an ephemeral RocksDB instance, and some cloud providers
   283    provide non-persistent disks. We want to eventually support
   284    configuring nodes to use non-persistent disks for this use
   285    case.
   286  
* `storage/engine` is currently tailored almost exclusively for use by
  the `storage` package. This implementation should reuse as much of
  the existing `engine` code as possible, modifying (and generalizing)
  the package where appropriate. Eventually, other users who also
  need a non-persistent node-local key-value store can use it as well.
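
To make the keyspace-carving and range-deletion bullets concrete, here
is a minimal sketch of one possible key encoding. The `tempEngine`
interface and every other name here are hypothetical; this is not the
actual `storage/engine` API.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// tempEngine is a hypothetical, minimal view of the ephemeral store: ordered
// writes plus a range deletion used to reclaim a processor's keyspace.
type tempEngine interface {
	Put(key, value []byte) error
	ClearRange(start, end []byte) error // e.g. a RocksDB range deletion tombstone
}

// tempKey scopes a processor's data under a unique processor ID prefix, so
// that concurrent processors never collide.
func tempKey(processorID uint64, suffix []byte) []byte {
	key := make([]byte, 8, 8+len(suffix))
	binary.BigEndian.PutUint64(key, processorID)
	return append(key, suffix...)
}

// keyspaceBounds returns the [start, end) span covering everything written
// by one processor.
func keyspaceBounds(processorID uint64) (start, end []byte) {
	return tempKey(processorID, nil), tempKey(processorID+1, nil)
}

// cleanup drops a processor's entire temporary keyspace in one operation.
func cleanup(eng tempEngine, processorID uint64) error {
	start, end := keyspaceBounds(processorID)
	return eng.ClearRange(start, end)
}

func main() {
	start, end := keyspaceBounds(42)
	fmt.Printf("processor 42 owns keys in [%x, %x)\n", start, end)
}
```

Prefixing every key with the processor ID keeps concurrent processors
from colliding and lets cleanup be a single range deletion over the
processor's span.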
   292  
   293  ## Potential optimizations: planning subsequent processors with additional orderings
   294  
Using RocksDB to store spilled data means that the on-disk data is
sorted by key. Thus, for example, if a HashJoiner evicts to disk, we
can recover an extra ordering that was not expected during
planning. Currently, there is no way for a DistSQL stream to indicate
that tuples are being delivered with additional orderings not
specified in the physical plan, but if we could send this information
in the producer metadata, downstream processors could potentially take
advantage of the additional sorting.
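
As a sketch of what such a hint could look like, the producer metadata
might carry the extra ordering as a list of column/direction pairs.
This message is purely hypothetical and does not exist in DistSQL
today.

```go
package distsqlmeta

// Direction of an ordering on a single column.
type Direction int

const (
	Ascending Direction = iota
	Descending
)

// ColumnOrdering describes an ordering over output columns as
// (column index, direction) pairs.
type ColumnOrdering []struct {
	ColIdx    int
	Direction Direction
}

// ProducerOrderingHint is a hypothetical piece of producer metadata that a
// processor could attach to its stream when, for example, spilling through
// RocksDB left its output sorted by a key not promised in the physical
// plan. Downstream processors would be free to ignore it.
type ProducerOrderingHint struct {
	// ExtraOrdering is an ordering the output satisfies in addition to the
	// ordering declared in the physical plan.
	ExtraOrdering ColumnOrdering
}
```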