github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20151214_sql_column_families.md

github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20151214_sql_column_families.md (about)

1 - Feature Name: sql_column_families
2 - Status: completed
3 - Start Date: 2015-12-14
4 - Authors: Daniel Harrison, Peter Mattis
5 - RFC PR: [#6712](https://github.com/cockroachdb/cockroach/pull/6712)
6 - Cockroach Issue:
7
8 # Summary
9
10 Store multiple columns for a row in a single value to amortize the
11 overhead of keys and the fixed costs associated with values. The
12 mapping of a column to a column family will be determined when a column
13 is added to the table.
14
15 # Motivation
16
17 The SQL to KV mapping currently utilizes a key/value pair per
18 non-primary key column. The key is structured as:
19
20 ```
21 /<tableID>/<indexID>/<primaryKeyColumns...>/<columnID>
22 ```
23
24 Keys also have an additional suffix imposed by the MVCC
25 implementation: ~10 bytes of timestamp.
26
27 The value is a proto (note, this is what is stored on disk, we also
28 include a timestamp when transmitting over the network):
29
30 ```
31 message Value {
32 optional bytes raw_bytes;
33 }
34 ```
35
36 The raw_bytes field currently holds a 4 byte CRC followed by the bytes
37 for a single "value" encoding (as opposed to the util/encoding "key"
38 encodings) of the `<columnVal>`. The value encoding is a tag byte
39 followed by a payload.
40
41 Consider the following table:
42
43 ```
44 CREATE TABLE kv (
45 k INT PRIMARY KEY,
46 v INT
47 )
48 ```
49
50 A row for this table logically has 16 bytes of data (`INTs` are 64-bit
51 values). The current SQL to KV mapping will create 2 key/value pairs:
52 a sentinel key for the row and the key/value for column `v`. These two
53 keys imply an overhead of ~28 bytes (timestamp + checksum) when stored
54 on disk and more than that when transmitted over the network (disk
55 storage takes advantage of prefix compression of the keys). We can cut
56 this overhead in half by storing only a single key/value for the
57 row. The savings are even more significant as more columns are added
58 to the table.
59
60 Note that there is also per key overhead at the transaction
61 level. Every key written within a transaction has a "write intent"
62 associated with it and these intents need to be resolved when the
63 transaction is committed. Reducing the number of keys per row reduces
64 this overhead.
65
66 # Detailed design
67
68 The `ColumnDescriptor` proto will be extended with a `FamilyID` field.
69 `FamilyIDs` will be allocated per-table using a new
70 `TableDescriptor.NextFamilyID` field. `FamilyID(0)` will be always be
71 written for a row and will take the place of the sentinel key. A
72 `FamilyDescriptor` proto will be added along with a repeated field for
73 them in `TableDescriptor`. It will contain the id, a user provided or
74 autogenerated name, and future metadata.
75
76 The structure for table keys will be changed to:
77
78 ```
79 /<tableID>/<indexID>/<primaryKeyColumns...>/<familyID>
80 ```
81
82 The value proto will remain almost the same. The same 4 byte CRC will
83 be followed by a series of `<columnID>/<columnVal>` pairs where
84 `<columnID>` is varint encoded and `<columnVal>` is encoded using
85 (almost) the same value encoding as before. Unfortunately, the
86 bytes, string, and decimal encodings do not self-delimit their length,
87 so additional ones will be added with a length prefix. These new
88 encodings will also be used by distributed SQL, which also needs self-
89 delimiting value encodings. Similar to the existing convention, `NULL`
90 column values will not be present in the value.
91
92 For backwards compatibility, families with only a single column will use
93 exactly the old encoding. Columns without a family will use the
94 `<columnID>` as the `<familyID>`. Finally, the `/<familyID>` key suffix
95 will be omitted when encoding `FamilyID(0)` to make the encoding
96 identical to the previous sentinel key encoding.
97
98 When a column is added to a table it will be assigned to a family. This
99 assignment will be done with a set of heuristics, but will be
100 override-able by the user at column creation using a SQL syntax
101 extension.
102
103 ## SQL Syntax Examples
104
105 ```
106 CREATE TABLE t (
107 a INT PRIMARY KEY,
108 b INT,
109 c STRING,
110 FAMILY primary (a, b),
111 FAMILY (c)
112 )
113 ```
114
115 `ALTER TABLE t ADD COLUMN d DECIMAL FAMILY d`
116
117 ## Heuristics for Fitting Columns into Families
118
119 - Fixed sized columns (`INT`, `DECIMAL` with precision, `FLOAT`, `BOOL`, `DATE`,
120 `TIMESTAMP`, `INTERVAL`, `STRING` with length, `BYTES` with length) will be
121 packed into a family, up to a threshold (initially 256 bytes).
122
123 - `STRING` and `BYTES` columns without a length restriction (and `DECIMAL`
124 without precision) get their own family.
125
126 - If the declared length of a `STRING` or `BYTES` column is changed with an
127 `ALTER COLUMN`, family assignments are unaffected.
128
129 - Columns in the primary index are declared as family 0, but they're stored in
130 the key, so they don't count against the byte limit.
131
132 - Family 0 (the sentinel) always has at least one column (though it might be a
133 primary index column stored in the key). Non-nullable and fixed sized columns
134 are preferred.
135
136 Note that these heuristics only apply at column creation, so there's no
137 backwards compatibility issues in revisiting them at any time.
138
139 # Drawbacks
140
141 * We have to support the old format for backwards compatibility.
142 Luckily, we can re-frame most of the complexity as a one column family
143 optimization instead of pure technical debt.
144
145 * `UPDATE` will now have to fetch the previous values of every column
146 in a family being modified. However, we already have to scan before
147 every `UPDATE` so, at worst, this is just returning more fields from
148 the scan than before.
149
150 # Alternatives
151
152 * We could introduce a richer KV layer api that pushes knowledge of
153 columns from the SQL layer down to the KV layer. This is not as
154 radical as it sounds as it is essentially the Bigtable/HBase
155 API. Specifically, we could enhance KV to know about rows and columns
156 where columns are identified by integers but not restricted to a
157 predefined schemas. This is actually a somewhat more limited version
158 of the Bigtable API which allows columns to be arbitrary strings and
159 also has the concept of column families. The upside to this approach
160 is that the encoding of the data at the MVCC layer would be
161 unchanged. We'd still have a key/value per column. But we'd get
162 something akin to prefix compression at the network level where
163 setting or retrieving multiple columns for a single row would only
164 send the row key once. Additionally, we could likely get away with a
165 single intent per row as opposed to an intent per column in the
166 existing system. The downside to this approach is that it appears to
167 be much more invasive than the column family change. Would every
168 consumer of the KV api need to change?
169
170 * Another alternative would be to omit the sentinel key when there is
171 no non-NULL/non-primary-key column. For example, we could omit the
172 sentinel key in the following table because we know there will always
173 be one KV pair:
174
175 ```
176 CREATE TABLE kv (
177 k INT PRIMARY KEY,
178 v INT NOT NULL
179 )
180 ```
181
182 # Unresolved questions
183
184 * Some of the complexity around legacy compatibility could be reduced
185 if `FamilyDescriptor`s could be backfilled for tables missing them on
186 cluster upgrade. Is there a reasonable mechanism for this? Or should
187 it be hidden as a TableDescriptor transformation in the
188 `getAliasedTableLease` call?
189
190 # Future Work
191
192 * Allow changing of a column's family using ALTER COLUMN and schema changes.
193
194 * Mark column as write-only in new family.
195 * Write column to both old and new families for queries, reading it only from old family.
196 * Backfill column data to new family.
197 * Remove column from old family, mark readable in new family.
198 * TODO(dan): Consider if we can instead write only to the new family
199 and read from both old and new, preferring new. Maybe some issues
200 with NULLs.
201
202 * Add metadata to column families. @bdarnell mentions this would let us
203 map some families to an alternative storage model (perhaps using rocksdb
204 column families) and that this was very useful with bigtable column
205 families/locality families. Once the schema changes are supported, it
206 will also be needed to keep track of which column, if any, was used
207 with the single column optimization.
208
209 # Performance Experiments
210
211 Note: This is from @petermattis's original doc, so it might be out of
212 date. The conclusion is still valid though.
213
214 For the above `kv` table, we can approximate the benefit of this
215 change for a benchmark by not writing the sentinel key for each
216 row. The numbers below show the delta for that change using the `kv`
217 table structure described above (instead of the 1 column table
218 currently used in the `{Insert,Scan}` benchmarks).
219
220 ```
221 name old time/op new time/op delta
222 Insert1_Cockroach-8 983µs ± 1% 948µs ± 0% -3.53% (p=0.000 n=9+9)
223 Insert10_Cockroach-8 1.72ms ± 1% 1.34ms ± 0% -22.05% (p=0.000 n=10+9)
224 Insert100_Cockroach-8 8.52ms ± 1% 4.99ms ± 1% -41.42% (p=0.000 n=10+10)
225 Scan1_Cockroach-8 348µs ± 1% 345µs ± 1% -1.07% (p=0.002 n=10+10)
226 Scan10_Cockroach-8 464µs ± 1% 419µs ± 1% -9.68% (p=0.000 n=10+10)
227 Scan100_Cockroach-8 1.33ms ± 1% 0.95ms ± 1% -28.61% (p=0.000 n=10+10)
228 ```
229
230 While the benefit is fairly small for single row insertions, this is
231 only benchmarking the simplest of tables. We'd expect a bigger benefit
232 for tables with more columns.