github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20151214_sql_column_families.md (about) 1 - Feature Name: sql_column_families 2 - Status: completed 3 - Start Date: 2015-12-14 4 - Authors: Daniel Harrison, Peter Mattis 5 - RFC PR: [#6712](https://github.com/cockroachdb/cockroach/pull/6712) 6 - Cockroach Issue: 7 8 # Summary 9 10 Store multiple columns for a row in a single value to amortize the 11 overhead of keys and the fixed costs associated with values. The 12 mapping of a column to a column family will be determined when a column 13 is added to the table. 14 15 # Motivation 16 17 The SQL to KV mapping currently utilizes a key/value pair per 18 non-primary key column. The key is structured as: 19 20 ``` 21 /<tableID>/<indexID>/<primaryKeyColumns...>/<columnID> 22 ``` 23 24 Keys also have an additional suffix imposed by the MVCC 25 implementation: ~10 bytes of timestamp. 26 27 The value is a proto (note, this is what is stored on disk, we also 28 include a timestamp when transmitting over the network): 29 30 ``` 31 message Value { 32 optional bytes raw_bytes; 33 } 34 ``` 35 36 The raw_bytes field currently holds a 4 byte CRC followed by the bytes 37 for a single "value" encoding (as opposed to the util/encoding "key" 38 encodings) of the `<columnVal>`. The value encoding is a tag byte 39 followed by a payload. 40 41 Consider the following table: 42 43 ``` 44 CREATE TABLE kv ( 45 k INT PRIMARY KEY, 46 v INT 47 ) 48 ``` 49 50 A row for this table logically has 16 bytes of data (`INTs` are 64-bit 51 values). The current SQL to KV mapping will create 2 key/value pairs: 52 a sentinel key for the row and the key/value for column `v`. These two 53 keys imply an overhead of ~28 bytes (timestamp + checksum) when stored 54 on disk and more than that when transmitted over the network (disk 55 storage takes advantage of prefix compression of the keys). We can cut 56 this overhead in half by storing only a single key/value for the 57 row. The savings are even more significant as more columns are added 58 to the table. 59 60 Note that there is also per key overhead at the transaction 61 level. Every key written within a transaction has a "write intent" 62 associated with it and these intents need to be resolved when the 63 transaction is committed. Reducing the number of keys per row reduces 64 this overhead. 65 66 # Detailed design 67 68 The `ColumnDescriptor` proto will be extended with a `FamilyID` field. 69 `FamilyIDs` will be allocated per-table using a new 70 `TableDescriptor.NextFamilyID` field. `FamilyID(0)` will be always be 71 written for a row and will take the place of the sentinel key. A 72 `FamilyDescriptor` proto will be added along with a repeated field for 73 them in `TableDescriptor`. It will contain the id, a user provided or 74 autogenerated name, and future metadata. 75 76 The structure for table keys will be changed to: 77 78 ``` 79 /<tableID>/<indexID>/<primaryKeyColumns...>/<familyID> 80 ``` 81 82 The value proto will remain almost the same. The same 4 byte CRC will 83 be followed by a series of `<columnID>/<columnVal>` pairs where 84 `<columnID>` is varint encoded and `<columnVal>` is encoded using 85 (almost) the same value encoding as before. Unfortunately, the 86 bytes, string, and decimal encodings do not self-delimit their length, 87 so additional ones will be added with a length prefix. These new 88 encodings will also be used by distributed SQL, which also needs self- 89 delimiting value encodings. Similar to the existing convention, `NULL` 90 column values will not be present in the value. 91 92 For backwards compatibility, families with only a single column will use 93 exactly the old encoding. Columns without a family will use the 94 `<columnID>` as the `<familyID>`. Finally, the `/<familyID>` key suffix 95 will be omitted when encoding `FamilyID(0)` to make the encoding 96 identical to the previous sentinel key encoding. 97 98 When a column is added to a table it will be assigned to a family. This 99 assignment will be done with a set of heuristics, but will be 100 override-able by the user at column creation using a SQL syntax 101 extension. 102 103 ## SQL Syntax Examples 104 105 ``` 106 CREATE TABLE t ( 107 a INT PRIMARY KEY, 108 b INT, 109 c STRING, 110 FAMILY primary (a, b), 111 FAMILY (c) 112 ) 113 ``` 114 115 `ALTER TABLE t ADD COLUMN d DECIMAL FAMILY d` 116 117 ## Heuristics for Fitting Columns into Families 118 119 - Fixed sized columns (`INT`, `DECIMAL` with precision, `FLOAT`, `BOOL`, `DATE`, 120 `TIMESTAMP`, `INTERVAL`, `STRING` with length, `BYTES` with length) will be 121 packed into a family, up to a threshold (initially 256 bytes). 122 123 - `STRING` and `BYTES` columns without a length restriction (and `DECIMAL` 124 without precision) get their own family. 125 126 - If the declared length of a `STRING` or `BYTES` column is changed with an 127 `ALTER COLUMN`, family assignments are unaffected. 128 129 - Columns in the primary index are declared as family 0, but they're stored in 130 the key, so they don't count against the byte limit. 131 132 - Family 0 (the sentinel) always has at least one column (though it might be a 133 primary index column stored in the key). Non-nullable and fixed sized columns 134 are preferred. 135 136 Note that these heuristics only apply at column creation, so there's no 137 backwards compatibility issues in revisiting them at any time. 138 139 # Drawbacks 140 141 * We have to support the old format for backwards compatibility. 142 Luckily, we can re-frame most of the complexity as a one column family 143 optimization instead of pure technical debt. 144 145 * `UPDATE` will now have to fetch the previous values of every column 146 in a family being modified. However, we already have to scan before 147 every `UPDATE` so, at worst, this is just returning more fields from 148 the scan than before. 149 150 # Alternatives 151 152 * We could introduce a richer KV layer api that pushes knowledge of 153 columns from the SQL layer down to the KV layer. This is not as 154 radical as it sounds as it is essentially the Bigtable/HBase 155 API. Specifically, we could enhance KV to know about rows and columns 156 where columns are identified by integers but not restricted to a 157 predefined schemas. This is actually a somewhat more limited version 158 of the Bigtable API which allows columns to be arbitrary strings and 159 also has the concept of column families. The upside to this approach 160 is that the encoding of the data at the MVCC layer would be 161 unchanged. We'd still have a key/value per column. But we'd get 162 something akin to prefix compression at the network level where 163 setting or retrieving multiple columns for a single row would only 164 send the row key once. Additionally, we could likely get away with a 165 single intent per row as opposed to an intent per column in the 166 existing system. The downside to this approach is that it appears to 167 be much more invasive than the column family change. Would every 168 consumer of the KV api need to change? 169 170 * Another alternative would be to omit the sentinel key when there is 171 no non-NULL/non-primary-key column. For example, we could omit the 172 sentinel key in the following table because we know there will always 173 be one KV pair: 174 175 ``` 176 CREATE TABLE kv ( 177 k INT PRIMARY KEY, 178 v INT NOT NULL 179 ) 180 ``` 181 182 # Unresolved questions 183 184 * Some of the complexity around legacy compatibility could be reduced 185 if `FamilyDescriptor`s could be backfilled for tables missing them on 186 cluster upgrade. Is there a reasonable mechanism for this? Or should 187 it be hidden as a TableDescriptor transformation in the 188 `getAliasedTableLease` call? 189 190 # Future Work 191 192 * Allow changing of a column's family using ALTER COLUMN and schema changes. 193 194 * Mark column as write-only in new family. 195 * Write column to both old and new families for queries, reading it only from old family. 196 * Backfill column data to new family. 197 * Remove column from old family, mark readable in new family. 198 * TODO(dan): Consider if we can instead write only to the new family 199 and read from both old and new, preferring new. Maybe some issues 200 with NULLs. 201 202 * Add metadata to column families. @bdarnell mentions this would let us 203 map some families to an alternative storage model (perhaps using rocksdb 204 column families) and that this was very useful with bigtable column 205 families/locality families. Once the schema changes are supported, it 206 will also be needed to keep track of which column, if any, was used 207 with the single column optimization. 208 209 # Performance Experiments 210 211 Note: This is from @petermattis's original doc, so it might be out of 212 date. The conclusion is still valid though. 213 214 For the above `kv` table, we can approximate the benefit of this 215 change for a benchmark by not writing the sentinel key for each 216 row. The numbers below show the delta for that change using the `kv` 217 table structure described above (instead of the 1 column table 218 currently used in the `{Insert,Scan}` benchmarks). 219 220 ``` 221 name old time/op new time/op delta 222 Insert1_Cockroach-8 983µs ± 1% 948µs ± 0% -3.53% (p=0.000 n=9+9) 223 Insert10_Cockroach-8 1.72ms ± 1% 1.34ms ± 0% -22.05% (p=0.000 n=10+9) 224 Insert100_Cockroach-8 8.52ms ± 1% 4.99ms ± 1% -41.42% (p=0.000 n=10+10) 225 Scan1_Cockroach-8 348µs ± 1% 345µs ± 1% -1.07% (p=0.002 n=10+10) 226 Scan10_Cockroach-8 464µs ± 1% 419µs ± 1% -9.68% (p=0.000 n=10+10) 227 Scan100_Cockroach-8 1.33ms ± 1% 0.95ms ± 1% -28.61% (p=0.000 n=10+10) 228 ``` 229 230 While the benefit is fairly small for single row insertions, this is 231 only benchmarking the simplest of tables. We'd expect a bigger benefit 232 for tables with more columns.