github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/tech-notes/sql-principles.md (about)

     1  # Guiding principles for the SQL middle-end and back-end in CockroachDB
     2  
     3  This tech nodes outlines some general guiding principles for future
     4  work, during the period 2017-2019.
     5  
     6  Reminder from "[The SQL layer in CockroachDB](sql.md)" - the SQL layer
     7  in CockroachDB contains 5 main component groups, including:
     8  
     9  - the SQL **front-end**, responsible for parsing,
    10    desugaring, free simplifications and semantic analysis; this
    11    comprises the two blocks "Parser" and "Expression analysis" in the
    12    overview diagram.
    13  
    14  - the SQL **middle-end**, responsible for logical planning and
    15    optimization; this comprises the two blocks "(plan constructors)"
    16    and "Logical optimizer" in the overview diagram.
    17  
    18  - the SQL **back-end**, of which there are actually two:
    19    - the distSQL back-end, itself comprised of its own physical
    20      planner, optimizer and runner, able to process any statement/query
    21      *for which a distSQL plan can be constructed*;
    22    - the **local runner**, responsible for everything else.
    23  
    24  This tech note focuses on the latter two, although the general
    25  principles introduced here may be applicable through the entire
    26  CockroachDB project.
    27  
    28  Original author: Raphael 'kena' Poss
    29  
    30  ## Overview
    31  
    32  CockroachDB must:
    33  
    34  1) catch up with and innovate beyond, over the course of five years,
    35  what hundreds of database experts have built in tens of projects over
    36  a period of thirty years.
    37  
    38  2) do so incrementally: function, orthogonality (new!), then performance.
    39  
    40  3) do so openly, transparently and dependably. Openness is about
    41  readiness to interact with a community beyond Cockroach Labs;
    42  transparency is about developing and sustaining trust of the community
    43  in the quality of the code; dependability is about delivering what we
    44  promised.
    45  
    46  ## Catching up and beyond
    47  
    48  The main challenge to the first goal is that published articles, books
    49  and documentation about the state of the art *describe* the work that
    50  has been performed by other SQL implementors but is insufficient to
    51  *reproduce* it.
    52  
    53  There are two obstacles really.
    54  
    55  The first obstacle is that the descriptions are largely
    56  incomplete. This is an historical artifact of the community where the
    57  work was performed: SQL engines have been mostly proprietary affairs,
    58  where good technology is still hidden from study by secrecy and
    59  software patents.
    60  
    61  Reproducing a SQL engine using database literature is like trying to
    62  [picture an elephant without ready access to
    63  one](https://www.google.com/search?tbm=isch&q=medieval+drawings+of+elephants). The
    64  experience is very different from e.g. what programming language
    65  experts are used to, where published literature often contains
    66  complete instructions sufficient to reproduce the results, or links to
    67  source code to achieve the same. These luxuries are simply absent from
    68  most significant published works on databases.
    69  
    70  The main issue that a reader of database literature has to contend
    71  with is *the lack of concrete description of the data structures used
    72  to represent the concepts described*. As Fred Brookes puts it in the
    73  Mythical Man-Month: *"show me your flowcharts and conceal your tables,
    74  and I shall continue to be mystified. Show me your tables, and I won’t
    75  usually need your flowcharts; they’ll be obvious"* later translated
    76  by Linus Torvalds: *"Bad programmers worry about the code. Good
    77  programmers worry about data structures and their relationships."*
    78  
    79  To bring CockroachDB to the state of the art and beyond, it will thus
    80  be insufficient to look at published artifacts and hope to be able to
    81  follow the example of good performers. A large amount of
    82  *reverse-engineering*, both of the literature and competing technology,
    83  will be needed as well.
    84  
    85  The second obstacle to catching up with the state of the art is that
    86  the architecture of CockroachDB fundamentally deviates from the
    87  assumptions that have driven SQL research in the past 30 years. To
    88  understand why this matters, recall that the main question that drives
    89  the design and implementation of a SQL optimization engine is the
    90  following:
    91  
    92  *what are the parameters that influence the execution latency, row
    93  throughput and CPU/memory usage of a query, and how to decide which
    94  SQL query plan combines those parameters in a way that minimizes
    95  latency, maximizes throughput and minimizes CPU/memory usage?*
    96  
    97  There are 3 moving pieces in this question:
    98  
    99  - which parameters influence the performance of a query?
   100  - for each candidate query plan, how to estimate performance for that plan?
   101  - when considering multiple candidates, how to decide the best plan(s)?
   102  
   103  Now, there has been a lot of research into the latter two of these
   104  three items. But there's one thing that has barely changed since the
   105  late 1960's: understanding of the low-level parameters that influence
   106  performance.
   107  
   108  Of these, we can make two groups: those parameters that are different
   109  in CockroachDB for no good reason, and for which we can work to
   110  converge with the state of the art; and those parameters that are
   111  really fundamentally different for a good reason.
   112  
   113  Some examples in the first group:
   114  
   115  - Traditional relational engines know how to perform every SQL
   116    operation in a fixed, small RAM budget. CockroachDB does not know
   117    how to work with restricted memory budgets. This causes RAM usage to
   118    remain a significant parameter to query planning in CockroachDB,
   119    whereas it is largely irrelevant in the state of the art. We will
   120    work on removing this parameter.
   121  - Traditional relational engines always assume up-to-date row counts
   122    in tables and cardinality estimates for indexed columns. These are
   123    the bread and butter of most costing functions to decide the best
   124    query plan candidate. CockroachDB currently maintains neither, but
   125    will soon.
   126  
   127  Some examples in the second group:
   128  
   129  | What happened in research before 1980, and remains since then. | What happens in CockroachDB |
   130  |---------------------------|-----------------------------|
   131  | Data is stored in equally sized pages; pages contain approximately equal numbers of rows. | Prefix compression for keys, Snappy compression for values. |
   132  | Parallelism, if at all used in query optimization, only appears in data scans. | CockroachDB already contains a moderately complex distributed execution engine. |
   133  | Parallelism, if at all used in query optimization, assumes a fixed number of processors. | CockroachDB node counts evolve dynamically, even possibly during query execution. |
   134  | The overall latency of a query is solely decided by the number of rows processed in memory and the number of page accesses on disk. | Network effects and implicit retries dominate. |
   135  | The costing estimates use a fixed ratio between disk and CPU performance. | Each node in a cluster can have a different CPU/disk performance ratio. Network performance evolves over time. |
   136  | Each area of contiguous storage contains data from at most one table or one index. Storage locality maps 1-to-1 to query locality. | [Interleaved tables.](../RFCS/20160624_sql_interleaved_tables.md) |
   137  | A logical range of values, in a SQL query, corresponds to a contiguous area of storage when the columns are indexed. | [Partitioned tables.](../RFCS/20170921_sql_partitioning.md) |
   138  
   139  For this second group, original research will be needed to construct,
   140  from the ground up, CockroachDB-specific replacements to the best
   141  practices available in the literature and most competing products.
   142  
   143  ## Function, orthogonality, then performance
   144  
   145  Regarding the relationship between function and performance: this is
   146  an axiomatic *modus operandi* of this particular engineering
   147  organization.
   148  
   149  - "Make it work correctly quickly, only then make it work correctly and increasingly fast."
   150    a.k.a. "Make a MVP early, then use adoption to drive further development."
   151  - Functional behavior and correctness (which outputs are produced for
   152    which inputs and why) comes before non-functional behavior (latency,
   153    throughput, jitter, resource costs).
   154  
   155  Reminder: the motivation to prioritize function over performance is
   156  that correctness would be much more difficult to obtain and
   157  demonstrate otherwise.
   158  
   159  I am adding
   160  [*orthogonality*](https://en.wikipedia.org/wiki/Orthogonality_(programming))
   161  in the mix, between function and performance.
   162  
   163  Orthogonality is a general design principle for a software project
   164  where the designers strive to evolve a small set of basic building
   165  blocks (and actively eliminating redundancy over time whenever it is
   166  discovered) while maximizing the number of their allowable
   167  combinations.
   168  
   169  From a management perspective, orthogonality is something that is
   170  essential, but invisible in the short term. Introducing a focus on
   171  orthogonality in a team tends to naturally push for modularity, DRY,
   172  reusability, good abstractions and creative discovery of new useful
   173  features for end-users, without having to train these individual
   174  traits separately. It is the means by which we ensure we can keep
   175  growing CockroachDB without making each future change incrementally
   176  expensive.
   177  
   178  ## On openness, transparency and dependability
   179  
   180  The challenges and goals set forth above create valid concerns for
   181  external observers:
   182  
   183  - will CockroachDB deliver on both performance and quality? Can we
   184    "catch up on 30 years of database research" without creating a messy
   185    product hacked together too fast?
   186  
   187  - at every point in time when the team decides to transition from a
   188    functional, correct feature to a faster implementation of the same,
   189    how is it guaranteed that the faster implementation is as correct as
   190    the base, functional one?
   191  
   192  These concerns are already highly relevant:
   193  
   194  - the code base is already hard to understand! I posit that it is so
   195    because it was hacked together too fast, without general guiding
   196    principles and directions (such as those proposed in this
   197    document). Openness, transparency and dependability will offer our
   198    community a way to understand CockroachDB above and beyond what is
   199    currently implemented at each point in time.
   200  
   201  - we have already made mistakes by introducing new bugs while
   202    "optimizing" the code to make it faster. Openness, transparency and
   203    dependability on the plan to implementation, and releasing
   204    intermediate functional features before they are optimized, will
   205    help building trust by establishing baselines for testing and validation.
   206  
   207  The nuance between "openness", "transparency" and "dependability"
   208  relates to the wants of the community:
   209  
   210  - transparency is our ability to communicate clearly about "why" and "when".
   211  - openness is our ability to answer "how" and "why not".
   212  - dependability is our ability to match what we announce we will do, and what we end up doing.