github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20151214_sql_column_families.md (about)

     1  - Feature Name: sql_column_families
     2  - Status: completed
     3  - Start Date: 2015-12-14
     4  - Authors: Daniel Harrison, Peter Mattis
     5  - RFC PR: [#6712](https://github.com/cockroachdb/cockroach/pull/6712)
     6  - Cockroach Issue:
     7  
     8  # Summary
     9  
    10  Store multiple columns for a row in a single value to amortize the
    11  overhead of keys and the fixed costs associated with values. The
    12  mapping of a column to a column family will be determined when a column
    13  is added to the table.
    14  
    15  # Motivation
    16  
    17  The SQL to KV mapping currently utilizes a key/value pair per
    18  non-primary key column. The key is structured as:
    19  
    20  ```
    21  /<tableID>/<indexID>/<primaryKeyColumns...>/<columnID>
    22  ```
    23  
    24  Keys also have an additional suffix imposed by the MVCC
    25  implementation: ~10 bytes of timestamp.
    26  
    27  The value is a proto (note, this is what is stored on disk, we also
    28  include a timestamp when transmitting over the network):
    29  
    30  ```
    31  message Value {
    32    optional bytes raw_bytes;
    33  }
    34  ```
    35  
    36  The raw_bytes field currently holds a 4 byte CRC followed by the bytes
    37  for a single "value" encoding (as opposed to the util/encoding "key"
    38  encodings) of the `<columnVal>`. The value encoding is a tag byte
    39  followed by a payload.
    40  
    41  Consider the following table:
    42  
    43  ```
    44  CREATE TABLE kv (
    45    k INT PRIMARY KEY,
    46    v INT
    47  )
    48  ```
    49  
    50  A row for this table logically has 16 bytes of data (`INTs` are 64-bit
    51  values). The current SQL to KV mapping will create 2 key/value pairs:
    52  a sentinel key for the row and the key/value for column `v`. These two
    53  keys imply an overhead of ~28 bytes (timestamp + checksum) when stored
    54  on disk and more than that when transmitted over the network (disk
    55  storage takes advantage of prefix compression of the keys). We can cut
    56  this overhead in half by storing only a single key/value for the
    57  row. The savings are even more significant as more columns are added
    58  to the table.
    59  
    60  Note that there is also per key overhead at the transaction
    61  level. Every key written within a transaction has a "write intent"
    62  associated with it and these intents need to be resolved when the
    63  transaction is committed. Reducing the number of keys per row reduces
    64  this overhead.
    65  
    66  # Detailed design
    67  
    68  The `ColumnDescriptor` proto will be extended with a `FamilyID` field.
    69  `FamilyIDs` will be allocated per-table using a new
    70  `TableDescriptor.NextFamilyID` field. `FamilyID(0)` will be always be
    71  written for a row and will take the place of the sentinel key. A
    72  `FamilyDescriptor` proto will be added along with a repeated field for
    73  them in `TableDescriptor`. It will contain the id, a user provided or
    74  autogenerated name, and future metadata.
    75  
    76  The structure for table keys will be changed to:
    77  
    78  ```
    79  /<tableID>/<indexID>/<primaryKeyColumns...>/<familyID>
    80  ```
    81  
    82  The value proto will remain almost the same. The same 4 byte CRC will
    83  be followed by a series of `<columnID>/<columnVal>` pairs where
    84  `<columnID>` is varint encoded and `<columnVal>` is encoded using
    85  (almost) the same value encoding as before. Unfortunately, the
    86  bytes, string, and decimal encodings do not self-delimit their length,
    87  so additional ones will be added with a length prefix. These new
    88  encodings will also be used by distributed SQL, which also needs self-
    89  delimiting value encodings. Similar to the existing convention, `NULL`
    90  column values will not be present in the value.
    91  
    92  For backwards compatibility, families with only a single column will use
    93  exactly the old encoding. Columns without a family will use the
    94  `<columnID>` as the `<familyID>`. Finally, the `/<familyID>` key suffix
    95  will be omitted when encoding `FamilyID(0)` to make the encoding
    96  identical to the previous sentinel key encoding.
    97  
    98  When a column is added to a table it will be assigned to a family. This
    99  assignment will be done with a set of heuristics, but will be
   100  override-able by the user at column creation using a SQL syntax
   101  extension.
   102  
   103  ## SQL Syntax Examples
   104  
   105  ```
   106  CREATE TABLE t (
   107    a INT PRIMARY KEY,
   108    b INT,
   109    c STRING,
   110    FAMILY primary (a, b),
   111    FAMILY (c)
   112  )
   113  ```
   114  
   115  `ALTER TABLE t ADD COLUMN d DECIMAL FAMILY d`
   116  
   117  ## Heuristics for Fitting Columns into Families
   118  
   119  - Fixed sized columns (`INT`, `DECIMAL` with precision, `FLOAT`, `BOOL`, `DATE`,
   120  `TIMESTAMP`, `INTERVAL`, `STRING` with length, `BYTES` with length) will be
   121  packed into a family, up to a threshold (initially 256 bytes).
   122  
   123  - `STRING` and `BYTES` columns without a length restriction (and `DECIMAL`
   124  without precision) get their own family.
   125  
   126  - If the declared length of a `STRING` or `BYTES` column is changed with an
   127  `ALTER COLUMN`, family assignments are unaffected.
   128  
   129  - Columns in the primary index are declared as family 0, but they're stored in
   130  the key, so they don't count against the byte limit.
   131  
   132  - Family 0 (the sentinel) always has at least one column (though it might be a
   133  primary index column stored in the key). Non-nullable and fixed sized columns
   134  are preferred.
   135  
   136  Note that these heuristics only apply at column creation, so there's no
   137  backwards compatibility issues in revisiting them at any time.
   138  
   139  # Drawbacks
   140  
   141  * We have to support the old format for backwards compatibility.
   142  Luckily, we can re-frame most of the complexity as a one column family
   143  optimization instead of pure technical debt.
   144  
   145  * `UPDATE` will now have to fetch the previous values of every column
   146  in a family being modified. However, we already have to scan before
   147  every `UPDATE` so, at worst, this is just returning more fields from
   148  the scan than before.
   149  
   150  # Alternatives
   151  
   152  * We could introduce a richer KV layer api that pushes knowledge of
   153  columns from the SQL layer down to the KV layer. This is not as
   154  radical as it sounds as it is essentially the Bigtable/HBase
   155  API. Specifically, we could enhance KV to know about rows and columns
   156  where columns are identified by integers but not restricted to a
   157  predefined schemas. This is actually a somewhat more limited version
   158  of the Bigtable API which allows columns to be arbitrary strings and
   159  also has the concept of column families. The upside to this approach
   160  is that the encoding of the data at the MVCC layer would be
   161  unchanged. We'd still have a key/value per column. But we'd get
   162  something akin to prefix compression at the network level where
   163  setting or retrieving multiple columns for a single row would only
   164  send the row key once. Additionally, we could likely get away with a
   165  single intent per row as opposed to an intent per column in the
   166  existing system. The downside to this approach is that it appears to
   167  be much more invasive than the column family change. Would every
   168  consumer of the KV api need to change?
   169  
   170  * Another alternative would be to omit the sentinel key when there is
   171  no non-NULL/non-primary-key column. For example, we could omit the
   172  sentinel key in the following table because we know there will always
   173  be one KV pair:
   174  
   175  ```
   176  CREATE TABLE kv (
   177    k INT PRIMARY KEY,
   178    v INT NOT NULL
   179  )
   180  ```
   181  
   182  # Unresolved questions
   183  
   184  * Some of the complexity around legacy compatibility could be reduced
   185  if `FamilyDescriptor`s could be backfilled for tables missing them on
   186  cluster upgrade. Is there a reasonable mechanism for this? Or should
   187  it be hidden as a TableDescriptor transformation in the
   188  `getAliasedTableLease` call?
   189  
   190  # Future Work
   191  
   192  * Allow changing of a column's family using ALTER COLUMN and schema changes.
   193  
   194    * Mark column as write-only in new family.
   195    * Write column to both old and new families for queries, reading it only from old family.
   196    * Backfill column data to new family.
   197    * Remove column from old family, mark readable in new family.
   198    * TODO(dan): Consider if we can instead write only to the new family
   199      and read from both old and new, preferring new. Maybe some issues
   200      with NULLs.
   201  
   202  * Add metadata to column families. @bdarnell mentions this would let us
   203  map some families to an alternative storage model (perhaps using rocksdb
   204  column families) and that this was very useful with bigtable column
   205  families/locality families. Once the schema changes are supported, it
   206  will also be needed to keep track of which column, if any, was used
   207  with the single column optimization.
   208  
   209  # Performance Experiments
   210  
   211  Note: This is from @petermattis's original doc, so it might be out of
   212  date. The conclusion is still valid though.
   213  
   214  For the above `kv` table, we can approximate the benefit of this
   215  change for a benchmark by not writing the sentinel key for each
   216  row. The numbers below show the delta for that change using the `kv`
   217  table structure described above (instead of the 1 column table
   218  currently used in the `{Insert,Scan}` benchmarks).
   219  
   220  ```
   221  name                   old time/op    new time/op    delta
   222  Insert1_Cockroach-8       983µs ± 1%     948µs ± 0%   -3.53%    (p=0.000 n=9+9)
   223  Insert10_Cockroach-8     1.72ms ± 1%    1.34ms ± 0%  -22.05%   (p=0.000 n=10+9)
   224  Insert100_Cockroach-8    8.52ms ± 1%    4.99ms ± 1%  -41.42%  (p=0.000 n=10+10)
   225  Scan1_Cockroach-8         348µs ± 1%     345µs ± 1%   -1.07%  (p=0.002 n=10+10)
   226  Scan10_Cockroach-8        464µs ± 1%     419µs ± 1%   -9.68%  (p=0.000 n=10+10)
   227  Scan100_Cockroach-8      1.33ms ± 1%    0.95ms ± 1%  -28.61%  (p=0.000 n=10+10)
   228  ```
   229  
   230  While the benefit is fairly small for single row insertions, this is
   231  only benchmarking the simplest of tables. We'd expect a bigger benefit
   232  for tables with more columns.