- Feature Name: Enum Data Types in CockroachDB
- Status: accepted
- Start Date: 2020-03-31
- Authors: Rohan Yadav, Lucy Zhang, Andrew Werner, Jordan Lewis
- RFC PR: #47070
- Cockroach Issue: #24873

# Summary

This RFC proposes adding enum types to CockroachDB.

# Background

Enum types are a class of user-defined types where the values of the type are
constrained to a fixed set of user-specified values. The system then ensures
type safety over operations on the type. This includes ensuring that only
values that are members of the enum can be inserted into a column of the enum
type, and that enum values can only be compared to other values of the same
enum type. For example, consider an application that needs to store events and
the days of the week on which they happen. This application could use an enum
to represent the days of the week.

```sql
CREATE TYPE day AS ENUM ('monday', 'tuesday', 'wednesday', ...);
CREATE TABLE events (id INT, dayofweek day);
INSERT INTO events VALUES (1, 'monday');
```

# Overview

To implement enum types in CockroachDB, we have to touch many layers of the
system. In particular, we need to introduce a way of storing metadata about
enums durably in the database. We then need a way to cache this metadata so
that lookups on it are fast, as well as a way to invalidate this cache when
enum metadata changes. When enum metadata changes, we need to ensure that
these changes do not leave some nodes in the cluster unable to process enum
values that they encounter. Lastly, we need to define a physical layout for
enums and integrate enums into the type system and the SQL execution stack.

# Detailed Explanation

## Metadata Storage

Enums themselves are a special case of user-defined types. In order to lay the
groundwork for future work in this area, we propose storing metadata about an
enum in a new descriptor called a `TypeDescriptor`. This descriptor will be
added to the descriptor union alongside table and database descriptors. The
descriptor will store metadata about the type, including the parent database
and schema IDs, a unique ID for the type, and the name of the type. The
descriptor will also include information specific to the kind of type being
stored in the descriptor (as of now there would only be enums). For enums,
this information would include the mapping of the enum's values to their
physical representations. A proposal of the descriptor's contents is below:

```proto
message TypeDescriptor {
  // Used by all kinds of user-defined types.
  // Parent database and schema.
  uint32 parent_id = 1;
  uint32 parent_schema_id = 2;
  // ID and Postgres-compatible OID of the type.
  uint32 id = 3;
  uint32 oid = 4;
  // Visible name of the type.
  string name = 5;

  // Enum-specific fields.
  message EnumMember {
    bytes physical_encoding = 1;
    string name = 2;
  }
  repeated EnumMember members = 6;
}
```

These descriptors will be stored in the `system.descriptor` table and will use
the leasing and versioning system being built. There is ongoing work on
unifying the leasing interface so that components are easily shared across
different descriptor types, and we will take advantage of these systems once
they are available. The leasing system will enable caching and cache
invalidation of type descriptors. Until the leasing system is ready for
integration, we will first implement a prototype that either doesn't use a
cache or uses a simple incoherent cache for `TypeDescriptor` access.
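To make the interim approach concrete, here is a minimal sketch of what a
simple incoherent cache over `TypeDescriptor` access could look like. The
`TypeDescriptor` fields and the fetch callback are hypothetical placeholders
rather than existing CockroachDB APIs; entries are never invalidated, which is
exactly the incoherence that the leasing system will eventually remove.

```go
package typecache

import "sync"

// TypeDescriptor is a hypothetical, pared-down stand-in for the descriptor
// proto sketched above.
type TypeDescriptor struct {
	ID   uint32
	Name string
}

// Cache is a minimal sketch of an incoherent cache for TypeDescriptor
// access: entries are memoized forever, so callers may observe stale
// descriptors until the leasing system replaces this prototype.
type Cache struct {
	mu    sync.Mutex
	byID  map[uint32]*TypeDescriptor
	fetch func(id uint32) (*TypeDescriptor, error) // e.g. a KV read of system.descriptor
}

func NewCache(fetch func(id uint32) (*TypeDescriptor, error)) *Cache {
	return &Cache{byID: make(map[uint32]*TypeDescriptor), fetch: fetch}
}

// GetTypeDescriptor returns the cached descriptor for id, fetching and
// memoizing it on a miss.
func (c *Cache) GetTypeDescriptor(id uint32) (*TypeDescriptor, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if desc, ok := c.byID[id]; ok {
		return desc, nil
	}
	desc, err := c.fetch(id)
	if err != nil {
		return nil, err
	}
	c.byID[id] = desc
	return desc, nil
}
```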
## Name Resolution

Enums are scoped within a database and a schema. In Postgres, enums cannot be
accessed from other databases -- they can only be accessed from different
schemas in the same database. However, there is no core reason that
CockroachDB cannot support this. In fact, we might need to support references
to types across databases to be in line with the other cross-database
references that we currently support. The topic of cross-database references
has also come up in discussion about
[user defined schemas](https://github.com/cockroachdb/cockroach/pull/48276).
The direction that we take in allowing cross-database references versus
allowing only cross-schema references will follow whatever is decided in that
context.

Table and type names exist within the same namespace in Postgres. This means
that it is not possible to create a type and a table of the same name within
the same schema. Additionally, tables in Postgres are themselves types: each
table defines a record type in which each field is typed like the
corresponding table column. Therefore, we will store type namespace entries
along with table namespace entries in the `system.namespace` table. This
allows namespace conflicts between types and tables to be properly detected,
and lets us reuse a large amount of the name resolution logic that already
exists for table name lookup. This strategy will also allow the user-defined
types implementation to adapt to new features like user-defined schemas
without extra work.

## IDs and OIDs

All user-defined types will need a stable ID by which they can be uniquely
addressed within CockroachDB, as well as an OID that can be used for
Postgres-compatible operations. Importantly, these OIDs cannot conflict with
existing type OIDs. Our proposal is to treat these two concepts separately.
Since we propose to store `TypeDescriptor`s in the `system.descriptor` table,
the `TypeDescriptor`s will already have a stable ID assigned to them. For type
OIDs, we propose to use a separate KV-level counter and to check whether a
newly allocated OID conflicts with one of the statically known OIDs. These
type OIDs will also be stored in the `TypeDescriptor` for display purposes in
catalog tables like `pg_catalog.pg_type`.

This strategy was chosen because it can be extended to support interpreting
tables as types. Rather than creating separate type entries for tables, tables
can just be given a type OID and resolved wherever type names are requested.
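As a sketch of the proposed OID allocation, the snippet below draws candidate
values from a counter and skips any value that collides with a statically
known type OID. The `nextCounterValue` closure and the `staticTypeOIDs` set
are hypothetical stand-ins for the KV-level counter and the table of built-in
OIDs.

```go
package typeoid

import "fmt"

// allocateTypeOID returns an OID for a new user-defined type. Candidates come
// from a monotonic counter (nextCounterValue is a hypothetical stand-in for
// the proposed KV-level counter), and any candidate that collides with a
// statically known type OID is skipped.
func allocateTypeOID(
	nextCounterValue func() (uint32, error),
	staticTypeOIDs map[uint32]struct{},
) (uint32, error) {
	for {
		candidate, err := nextCounterValue()
		if err != nil {
			return 0, fmt.Errorf("allocating type OID: %w", err)
		}
		if _, clashes := staticTypeOIDs[candidate]; clashes {
			// The counter landed on a built-in OID; try the next value.
			continue
		}
		return candidate, nil
	}
}
```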
## Changing Enum Definitions

There are a few ways that enums can change over time:
* The name can change.
* The schema the enum is in can change.
* A new member can be added to the set of values.
* A member of the enum can be renamed.
* The enum can be dropped.

Renaming an enum or a value in an enum can be done with a write to the enum
descriptor, followed by waiting for all nodes to agree on the new value. There
are plans to lift operations on descriptor names off of the individual
descriptors, because such operations are common to all of them. That work
would involve moving draining names off of descriptors as well. It's possible
that this part of the implementation will be folded into or take advantage of
that effort.

The case of adding a new enum element is more difficult. The key difficulty
comes from ensuring that a node does not attempt to translate a physical
layout that it does not yet know about into a user-facing representation of
the enum. If we naively add the new enum value to the enum metadata, it is
possible that another node reads a newly written enum value from disk and does
not know how to decode it. Consider the following sequence of events:
* Node 1 adds a new enum element `foo` to its enum descriptor and blocks on
  `WaitForOneVersion`.
* Node 2 receives the new enum descriptor update and writes a value with `foo`.
* Node 3 tries to read the value `foo` before receiving the update to its enum
  descriptor.

In order to avoid these situations, we propose an extension of the strategy
used for performing online schema changes. As a reminder, when a new schema
object is added to a table, it moves through a series of states before
becoming usable. As the object moves through these states, the types of
operations that are allowed on the object change. Between each state, we
require that all nodes in the cluster agree on the new version of the schema
object. For more details, refer to the
[online schema changes RFC](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20151014_online_schema_change.md).
We propose a similar state progression for adding new elements to an enum
type:
1. When a new value is added to an enum, it is instead placed into a
   "read only" state.
2. After all nodes agree on the "read only" state, the new enum value is
   promoted into the set of writable values in the enum.

This process ensures that all nodes know about all potential enum values
before those values have a chance to be written. This approach has the
drawback of not being able to add an enum value and then insert that value in
the same transaction. This drawback is similar to our existing limitation of
not being able to add a column and insert into it in the same transaction.

This enum schema change will be implemented with a new job, rather than trying
to build off of the existing table schema changer. While conceptually similar
to a table schema change, there is not much implementation to share. This new
job will (see the sketch below):
1. Collect all "read only" enum values and wait for one version in the cluster.
2. Transition these values to "public", and then wait for one version in the
   cluster.

A rollback of this job can simply remove the "read only" values. Additionally,
enums don't really need a concept of mutations like tables have. The members
of an enum in the enum's `TypeDescriptor` can be tagged with whether the
member is "read only" or public.
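To make the two steps concrete, here is a minimal sketch of what the body of
such a job could look like. The pared-down descriptor type and the
`writeDescriptor` and `waitForOneVersion` callbacks are hypothetical
placeholders for the descriptor-update and leasing primitives, not existing
CockroachDB APIs.

```go
package enumjob

import "context"

// memberState mirrors the "read only" -> public progression described above.
type memberState int

const (
	readOnly memberState = iota
	public
)

type enumMember struct {
	name  string
	state memberState
}

// typeDesc is a hypothetical, pared-down view of the enum's TypeDescriptor.
type typeDesc struct {
	id      uint32
	members []enumMember
}

// promoteReadOnlyMembers sketches the proposed enum schema change job: wait
// until every node has seen the "read only" members, then flip them to public
// and wait again, so no node can write a value that another node cannot
// decode.
func promoteReadOnlyMembers(
	ctx context.Context,
	desc *typeDesc,
	writeDescriptor func(context.Context, *typeDesc) error,
	waitForOneVersion func(ctx context.Context, descID uint32) error,
) error {
	// Step 1: all nodes must know about the "read only" members before any
	// of them becomes writable.
	if err := waitForOneVersion(ctx, desc.id); err != nil {
		return err
	}
	// Step 2: promote the members and publish the new descriptor version.
	for i := range desc.members {
		if desc.members[i].state == readOnly {
			desc.members[i].state = public
		}
	}
	if err := writeDescriptor(ctx, desc); err != nil {
		return err
	}
	return waitForOneVersion(ctx, desc.id)
}
```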
In Postgres, if an enum is dropped without `CASCADE`, the operation will not
succeed if any tables use the enum. If an enum is dropped with `CASCADE`, all
dependent columns are dropped as well. If the database that an enum was
created in is dropped, then the enum is dropped as well. In order to maintain
this information, the descriptors that represent an enum need to hold
back-references to the tables and columns that use them. We expect the
descriptor leasing system being developed to manage invalidation of cached
enums when enums are destroyed in these cases.

## Physical Layout

At first, it may seem that a valid implementation of enum values is to map
each value to an integer and store these integers on disk. This implementation
seems like it would supply all the ordering guarantees needed of enums.
However, Postgres allows adding a new enum value and specifying its order with
respect to an existing value of the enum. This looks like:
```sql
CREATE TYPE t AS ENUM ('v1', 'v2');
ALTER TYPE t ADD VALUE 'v1.5' AFTER 'v1';
```
This adds the value `v1.5` to the enum `t` and orders it after the value `v1`.
Using plain integers as the backing values for enums would not allow us to
handle this sort of case. Postgres implements this feature by storing a sort
order for each enum value as a float. When a new value is added like this,
Postgres takes the sort orders of the values that the new value is being
inserted between, and creates a float that bisects that range. Concretely, if
`v1` had a sort order of `1.5` and `v2` had a sort order of `2.0`, then `v1.5`
would be inserted with a sort order of `1.75`. However, once the floating
point precision limit has been reached, Postgres rewrites all sort orders to
integral values. Postgres can do this because it doesn't require a stable disk
encoding for enums. In our case, we need a stable encoding in order to store
data on disk if an enum is used in an index, and we cannot afford to rewrite
all tables using an enum when the enum's sort order changes.

We propose a different strategy that is related to this idea of bisecting
ranges, but doesn't suffer from problems due to floating point precision. The
general idea is to use byte arrays to hold the sort order of our enums, and to
reserve some bytes in the arrays to create the ordering that we need. In
particular, we reserve the minimum byte (`0`) and have a maximum allowed byte;
in practice this will be `255`. An example of the encoding scheme is below.

Assume we start with 3 elements (`a`, `b`, `c`), and let the maximum byte
value be 3. The sort order byte arrays for each element would be:
```
a 1/
b 2/
c 3/
```
To add an element after `b`, we can create a new key that sits in the middle
of the range between `b` and `c`:
```
a 1/
b 2/
d 2/2/
c 3/
```
Now let's add more values before `d`. The first one is easy:
```
a 1/
b 2/
e 2/1/
d 2/2/
c 3/
```
The tricky case is adding a value before `e`. Because we reserved the minimum
byte, we can append it and then bisect the range again:
```
a 1/
b 2/
f 2/0/2/
e 2/1/
d 2/2/
c 3/
```
This strategy can be extended indefinitely as long as the pattern of reserving
the minimum byte is followed. A prototype of the exact algorithm is included
as part of the RFC PR, and a minimal sketch is shown below.

This sort order byte array will be the physical layout and identifier of the
enum value. We expect that for small enums only a byte or two will be needed
to hold all the values, and that our compression strategies at the storage
layer will compress this data well.
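The exact algorithm lives in the referenced prototype; the following is only a
minimal sketch of the bisection idea under the assumptions above: byte `0` is
reserved as a minimum sentinel, there is a fixed maximum byte, and no key ends
in the reserved byte.

```go
package enumenc

// keyBetween returns a byte string that sorts strictly between lo and hi.
// It assumes lo sorts before hi, that byte 0 is reserved as a minimum
// sentinel (no key ends in 0), and that maxByte is the largest byte value in
// use. A nil lo means "no lower bound" and a nil hi means "no upper bound".
func keyBetween(lo, hi []byte, maxByte byte) []byte {
	var res []byte
	// onLo and onHi track whether the result built so far is still a prefix
	// of lo and hi respectively, and is therefore still constrained by them.
	onLo, onHi := true, true
	for i := 0; ; i++ {
		l := byte(0) // reserved minimum sentinel
		if onLo && i < len(lo) {
			l = lo[i]
		}
		h := int(maxByte) + 1 // effectively unbounded above
		if onHi && i < len(hi) {
			h = int(hi[i])
		}
		if int(l)+1 < h {
			// There is room for a byte strictly between l and h: bisect.
			res = append(res, byte((int(l)+h)/2))
			return res
		}
		// No room at this position: copy the lower bound byte and descend
		// one position deeper, as in the examples above.
		res = append(res, l)
		onLo = onLo && i < len(lo)
		onHi = onHi && i < len(hi) && int(l) == h
	}
}
```

With a maximum byte of 3, `keyBetween([]byte{2}, []byte{2, 1}, 3)` returns
`2/0/2/`, matching the insertion of `f` before `e` in the example above.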
Since the common case of adding members to an enum is to add a member at the
beginning or end of the set of values, we can adjust the algorithm slightly to
better handle this case. When generating a new key byte where one of the
endpoints is the minimum or maximum element, the algorithm can add or subtract
a small constant from the existing key rather than bisecting the range. This
allows many more elements to be added at the beginning or end of the range
without increasing the number of bytes used to store the enum. The algorithm
is implemented in [this PR](https://github.com/cockroachdb/cockroach/pull/47939).

## Parsing

Currently, the CockroachDB grammar is not equipped to handle type names that
are qualified, due to changes made in the past that separated the parsing of
object and type identifiers. Some of these changes will have to be reverted or
adapted in order to allow types to have qualifications again. The work to
allow the parser to recognize qualified type names has been done in
[this PR](https://github.com/cockroachdb/cockroach/pull/47216).

## Type System Changes

The type system of CockroachDB currently assumes that anywhere a type is
present in an AST, that type is statically known. In code, this means that
every AST object that holds a type (like a `CastExpr` or `ColumnDef`) holds a
`*types.T`, which is constructed at parse time. As part of implementing
user-defined types, the type system must be taught that not all types are
statically known. The general idea is to change the types in AST nodes to a
new interface representing an unresolved type reference. These type references
can then be resolved into `*types.T` through type resolution. Additionally, we
must enforce that types are only accessed after type checking, when all type
references have been resolved. A prototype of this approach can be found in
[this PR](https://github.com/cockroachdb/cockroach/pull/47386).

After the process of type resolution, enums need a `types.T` for interaction
with other components of the system. We will introduce a new type family for
enums, and the `types.T` for an enum will contain the stable ID of the
`TypeDescriptor` that backs the type. The `types.T` will also contain extra
fields for an enum, like the mapping of names to physical values. Importantly,
these extra fields will not be serialized as part of the proto. Instead, when
a type is resolved, the returned `*types.T` will be hydrated to populate these
fields.

A potential option was to avoid using a `TypeDescriptor` and instead just
extend the `types.T` proto to contain the necessary fields for user-defined
types. However, this is not feasible because `types.T` protos are stored on
disk in various descriptors. It is too expensive to update every descriptor
that contains a type each time the type is altered.

A new `Datum`, `DEnum`, will be introduced to represent values of enums at
runtime. A `DEnum` will store the physical representation of the enum value as
well as the hydrated `*types.T` of its type. The extra fields in the
`*types.T` that hold information about enum values will be used for datum
operations, without the need to thread ID resolution capabilities into the
evaluation of operations on datums.
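To illustrate how the hydrated type carries enough information for datum
operations, here is a minimal sketch of what a `DEnum` datum could look like.
The field and method names are hypothetical placeholders, not the eventual
definitions in the tree and types packages.

```go
package enumdatum

import "bytes"

// hydratedEnumType is a hypothetical stand-in for the extra, non-serialized
// fields that a resolved *types.T would be hydrated with.
type hydratedEnumType struct {
	typeDescID   uint32
	logicalNames []string // e.g. "monday", "tuesday", ...
	physicalReps [][]byte // sort order byte arrays, aligned with logicalNames
}

// dEnum sketches the proposed DEnum datum: the physical representation of the
// value plus a pointer to the hydrated type metadata.
type dEnum struct {
	physicalRep []byte
	typ         *hydratedEnumType
}

// compare orders two enum datums of the same type by their physical
// representation, which is constructed to match the user-defined ordering.
func (d dEnum) compare(other dEnum) int {
	return bytes.Compare(d.physicalRep, other.physicalRep)
}

// logicalName maps the physical representation back to the user-visible name
// using only the hydrated type, without any descriptor lookups.
func (d dEnum) logicalName() (string, bool) {
	for i, rep := range d.typ.physicalReps {
		if bytes.Equal(rep, d.physicalRep) {
			return d.typ.logicalNames[i], true
		}
	}
	return "", false
}
```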
When a user-defined type is created in Postgres, Postgres automatically
creates an alias for an array of the new type. For example, if a user creates
a type `days`, the system also creates the type `_days` as an alias for
`days[]`. This alias tracks changes made to the referenced type as the type
moves through schemas and is dropped. It is unclear whether supporting this
sort of type aliasing will be a significant lift on top of the enum work.

## Semantic Analysis Changes

The optimizer will need to be taught about the check constraint implied by a
column being of an enum type. Additionally, it will need to be taught how to
convert enum values from their input string representation into their `Datum`
physical representation.

The `Catalog` used by the optimizer will need to be extended to support
resolution of types. The way that the catalog represents user-defined types is
important for invalidation of cached plans: if a type is updated, all plans
containing data sources that use the type need to be invalidated.

## DistSQL

The gateway node that plans a SQL query has access to all resolved type
information for the query. Remote nodes on which different parts of the query
are planned need access to this information in order to correctly execute the
query. In particular, these nodes need to hydrate their `*types.T` containers
with metadata, and they need to parse and type check serialized expressions.
The hydration of `*types.T` objects can be done at operator initialization.
The trickier problem is type checking serialized expressions -- we don't want
to pay the cost of name resolution again. Our proposed solution is similar to
what is currently done for the serialization of some Postgres OID types like
`regclass`.

Assume we have an enum `mytype` with ID `10`, and that `'hello'` is a value of
`mytype`. Rather than serializing `'hello'` as `'hello':::mytype`, we will
introduce a new `crdb_internal` builtin
`crdb_internal.create_user_defined_type(str, uint32)` that returns the input
string cast to the type with the input ID. So `'hello'` would get serialized
as `crdb_internal.create_user_defined_type('hello', 10)`, which allows us to
avoid name resolution on remote nodes.
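A minimal sketch of that serialization, assuming the builtin described above
exists; `quoteSQLString` is a simplified stand-in for proper SQL
string-literal escaping, which the real implementation would take from the
existing lexer utilities.

```go
package distsqlser

import (
	"fmt"
	"strings"
)

// quoteSQLString is a simplified stand-in for SQL string-literal escaping.
func quoteSQLString(s string) string {
	return "'" + strings.ReplaceAll(s, "'", "''") + "'"
}

// serializeEnumDatum renders an enum value in the proposed wire form: a call
// to the new builtin that casts the logical string to the type with the given
// stable ID, so remote nodes never need to resolve the type name.
func serializeEnumDatum(logicalValue string, typeID uint32) string {
	return fmt.Sprintf(
		"crdb_internal.create_user_defined_type(%s, %d)",
		quoteSQLString(logicalValue), typeID,
	)
}
```

For example, `serializeEnumDatum("hello", 10)` produces
`crdb_internal.create_user_defined_type('hello', 10)`, the form shown above.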
# Alternatives

## Namespacing and Metadata Storage

During discussion of the RFC, some alternatives were debated, in particular
the ideas of using a separate namespace table for types and/or a separate
descriptor table for metadata storage. The benefit of a separate namespace
table is that it could make future work on interpreting tables as types more
straightforward. However, using a separate namespace table complicates
existing name resolution and conflict detection strategies. A separate
descriptor table allows scans over all tables or all types to avoid touching
descriptors of other kinds, which is a performance improvement for catalog
table operations. However, this problem is somewhat orthogonal to this work,
and would be better solved by building some sort of indexing structure on the
`system.descriptor` table. Using the existing namespace table allows most of
the existing name resolution code to be used directly, and using the same
descriptor table allows leasing primitives to be built on only one system
table.

## Overall Alternative

One alternative approach to the proposed physical layout was to store just an
enum ID on disk, and to store ordering and representation information in a
separate lookup table. Operations that join on or render enum values would
then produce a join against this reference table. This allows enum data to be
changed easily, but results in a variety of complexity during planning.

# Unresolved questions

It is unclear what interactions will arise between this work and the
planned/ongoing work on user defined schemas.