- Feature Name: Enum Data Types in CockroachDB
- Status: accepted
- Start Date: 2020-03-31
- Authors: Rohan Yadav, Lucy Zhang, Andrew Werner, Jordan Lewis
- RFC PR: #47070
- Cockroach Issue: #24873

# Summary

This RFC proposes adding enum types to CockroachDB.

# Background

Enum types are a class of user-defined types where the values of the type are
constrained to a fixed set of user-specified values. The system then ensures
type safety over operations on the type. This includes ensuring that only
values that are members of the enum can be inserted into a column of the enum
type, and that enum values can only be compared to other values of the same
enum type. For example, consider an application that needs to store events and
the days of the week on which they happen. This application could use an enum
to represent the days of the week.

```sql
CREATE TYPE day AS ENUM ('monday', 'tuesday', 'wednesday', ...);
CREATE TABLE events (id INT, dayofweek day);
INSERT INTO events VALUES (1, 'monday');
```

# Overview

To implement enum types in CockroachDB, we have to touch many layers of the
system. In particular, we need to introduce a way of storing metadata about
enums durably in the database. We then need a way to cache this metadata so
that lookups on it are fast, as well as a way to invalidate this cache when
enum metadata changes. When enum metadata changes, we need to ensure that
these changes do not leave some nodes in the cluster unable to process enum
values that they encounter. Lastly, we need to define a physical layout for
enums and integrate enums into the type system and the SQL execution stack.

# Detailed Explanation

## Metadata Storage

Enums themselves are a special case of user-defined types. In order to lay the
groundwork for future work in this area, we propose storing metadata about an
enum in a new descriptor called a `TypeDescriptor`. This descriptor will be
added to the descriptor union alongside table and database descriptors. The
descriptor will store metadata about the type, including the parent database
and schema IDs, a unique ID for the type, and the name of the type. The
descriptor will also include information specific to the kind of type being
stored in the descriptor (as of now there would only be enums). For enums,
this information would include the mapping of the enum's values to their
physical representations. A proposal of the descriptor's contents is below:

```proto
message TypeDescriptor {
  // Used by all kinds of user-defined types.
  // Parent database and schema.
  uint32 parent_id = 1;
  uint32 parent_schema_id = 2;
  // ID and Postgres-compatible OID of the type.
  uint32 id = 3;
  uint32 oid = 4;
  // Visible name of the type.
  string name = 5;

  // Enum-specific fields.
  message EnumMember {
    bytes physical_encoding = 1;
    string name = 2;
  }
  repeated EnumMember members = 6;
}
```

These descriptors will be stored in the `system.descriptor` table and will use
the leasing and versioning system being built. There is ongoing work on
unifying the leasing interface so that components are easily shared across
different descriptor types, and we will take advantage of these systems once
they are available. The leasing system will enable caching and cache
invalidation of type descriptors. Until the leasing system is ready for
integration, we will first implement a prototype that either doesn't use a
cache or uses a simple incoherent cache for `TypeDescriptor` access.
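To make the interim approach concrete, here is a minimal sketch of what a
simple incoherent cache over `TypeDescriptor` access could look like. The
`TypeDescriptor` fields and the fetch callback are hypothetical placeholders
rather than existing CockroachDB APIs; entries are never invalidated, which is
exactly the incoherence that the leasing system will eventually remove.

```go
package typecache

import "sync"

// TypeDescriptor is a hypothetical, pared-down stand-in for the descriptor
// proto sketched above.
type TypeDescriptor struct {
	ID   uint32
	Name string
}

// Cache is a minimal sketch of an incoherent cache for TypeDescriptor
// access: entries are memoized forever, so callers may observe stale
// descriptors until the leasing system replaces this prototype.
type Cache struct {
	mu    sync.Mutex
	byID  map[uint32]*TypeDescriptor
	fetch func(id uint32) (*TypeDescriptor, error) // e.g. a KV read of system.descriptor
}

func NewCache(fetch func(id uint32) (*TypeDescriptor, error)) *Cache {
	return &Cache{byID: make(map[uint32]*TypeDescriptor), fetch: fetch}
}

// GetTypeDescriptor returns the cached descriptor for id, fetching and
// memoizing it on a miss.
func (c *Cache) GetTypeDescriptor(id uint32) (*TypeDescriptor, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if desc, ok := c.byID[id]; ok {
		return desc, nil
	}
	desc, err := c.fetch(id)
	if err != nil {
		return nil, err
	}
	c.byID[id] = desc
	return desc, nil
}
```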
## Name Resolution

Enums are scoped within a database and a schema. In Postgres, enums cannot be
accessed from other databases -- they can only be accessed from different
schemas in the same database. However, there is no core reason that
CockroachDB cannot support this. In fact, we might need to support references
to types across databases to be in line with the other cross-database
references that we currently support. The topic of cross-database references
has also come up in discussion about
[user defined schemas](https://github.com/cockroachdb/cockroach/pull/48276).
The direction that we take in allowing cross-database references versus
allowing only cross-schema references will follow whatever is decided in that
context.

Table and type names exist within the same namespace in Postgres. This means
that it is not possible to create a type and a table of the same name within
the same schema. Additionally, tables in Postgres are themselves types: each
table defines a record type in which each field is typed like the
corresponding table column. Therefore, we will store type namespace entries
along with table namespace entries in the `system.namespace` table. This
allows namespace conflicts between types and tables to be properly detected,
and lets us reuse a large amount of the name resolution logic that already
exists for table name lookup. This strategy will also allow the user-defined
types implementation to adapt to new features like user-defined schemas
without extra work.

## IDs and OIDs

All user-defined types will need a stable ID by which they can be uniquely
addressed within CockroachDB, as well as an OID that can be used for
Postgres-compatible operations. Importantly, these OIDs cannot conflict with
existing type OIDs. Our proposal is to treat these two concepts separately.
Since we propose to store `TypeDescriptor`s in the `system.descriptor` table,
the `TypeDescriptor`s will already have a stable ID assigned to them. For type
OIDs, we propose to use a separate KV-level counter and to check whether a
newly allocated OID conflicts with one of the statically known OIDs. These
type OIDs will also be stored in the `TypeDescriptor` for display purposes in
catalog tables like `pg_catalog.pg_type`.

This strategy was chosen because it can be extended to support interpreting
tables as types. Rather than creating separate type entries for tables, tables
can just be given a type OID and resolved wherever type names are requested.
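As a sketch of the proposed OID allocation, the snippet below draws candidate
values from a counter and skips any value that collides with a statically
known type OID. The `nextCounterValue` closure and the `staticTypeOIDs` set
are hypothetical stand-ins for the KV-level counter and the table of built-in
OIDs.

```go
package typeoid

import "fmt"

// allocateTypeOID returns an OID for a new user-defined type. Candidates come
// from a monotonic counter (nextCounterValue is a hypothetical stand-in for
// the proposed KV-level counter), and any candidate that collides with a
// statically known type OID is skipped.
func allocateTypeOID(
	nextCounterValue func() (uint32, error),
	staticTypeOIDs map[uint32]struct{},
) (uint32, error) {
	for {
		candidate, err := nextCounterValue()
		if err != nil {
			return 0, fmt.Errorf("allocating type OID: %w", err)
		}
		if _, clashes := staticTypeOIDs[candidate]; clashes {
			// The counter landed on a built-in OID; try the next value.
			continue
		}
		return candidate, nil
	}
}
```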
## Changing Enum Definitions

There are a few ways that enums can change over time:
* The name can change.
* The schema the enum is in can change.
* A new member can be added to the set of values.
* A member of the enum can be renamed.
* The enum can be dropped.

Renaming an enum or a value in an enum can be done with a write to the enum
descriptor, followed by waiting for all nodes to agree on the new value. There
are plans to lift operations on descriptor names off of the individual
descriptors, because such operations are common to all of them. That work
would involve moving draining names off of descriptors as well. It's possible
that this part of the implementation will be folded into or take advantage of
that effort.

The case of adding a new enum element is more difficult. The key difficulty
comes from ensuring that a node does not attempt to translate a physical
layout that it does not yet know about into a user-facing representation of
the enum. If we naively add the new enum value to the enum metadata, it is
possible that another node reads a newly written enum value from disk and does
not know how to decode it. Consider the following sequence of events:
* Node 1 adds a new enum element `foo` to its enum descriptor and blocks on
  `WaitForOneVersion`.
* Node 2 receives the new enum descriptor update and writes a value with `foo`.
* Node 3 tries to read the value `foo` before receiving the update to its enum
  descriptor.

In order to avoid these situations, we propose an extension of the strategy
used for performing online schema changes. As a reminder, when a new schema
object is added to a table, it moves through a series of states before
becoming usable. As the object moves through these states, the types of
operations that are allowed on the object change. Between each state, we
require that all nodes in the cluster agree on the new version of the schema
object. For more details, refer to the
[online schema changes RFC](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20151014_online_schema_change.md).
We propose a similar state progression for adding new elements to an enum
type:
1. When a new value is added to an enum, it is instead placed into a
   "read only" state.
2. After all nodes agree on the "read only" state, the new enum value is
   promoted into the set of writable values in the enum.

This process ensures that all nodes know about all potential enum values
before those values have a chance to be written. This approach has the
drawback of not being able to add an enum value and then insert that value in
the same transaction. This drawback is similar to our existing limitation of
not being able to add a column and insert into it in the same transaction.

This enum schema change will be implemented with a new job, rather than trying
to build off of the existing table schema changer. While conceptually similar
to a table schema change, there is not much implementation to share. This new
job will (see the sketch below):
1. Collect all "read only" enum values and wait for one version in the cluster.
2. Transition these values to "public", and then wait for one version in the
   cluster.

A rollback of this job can simply remove the "read only" values. Additionally,
enums don't really need a concept of mutations like tables have. The members
of an enum in the enum's `TypeDescriptor` can be tagged with whether the
member is "read only" or public.
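To make the two steps concrete, here is a minimal sketch of what the body of
such a job could look like. The pared-down descriptor type and the
`writeDescriptor` and `waitForOneVersion` callbacks are hypothetical
placeholders for the descriptor-update and leasing primitives, not existing
CockroachDB APIs.

```go
package enumjob

import "context"

// memberState mirrors the "read only" -> public progression described above.
type memberState int

const (
	readOnly memberState = iota
	public
)

type enumMember struct {
	name  string
	state memberState
}

// typeDesc is a hypothetical, pared-down view of the enum's TypeDescriptor.
type typeDesc struct {
	id      uint32
	members []enumMember
}

// promoteReadOnlyMembers sketches the proposed enum schema change job: wait
// until every node has seen the "read only" members, then flip them to public
// and wait again, so no node can write a value that another node cannot
// decode.
func promoteReadOnlyMembers(
	ctx context.Context,
	desc *typeDesc,
	writeDescriptor func(context.Context, *typeDesc) error,
	waitForOneVersion func(ctx context.Context, descID uint32) error,
) error {
	// Step 1: all nodes must know about the "read only" members before any
	// of them becomes writable.
	if err := waitForOneVersion(ctx, desc.id); err != nil {
		return err
	}
	// Step 2: promote the members and publish the new descriptor version.
	for i := range desc.members {
		if desc.members[i].state == readOnly {
			desc.members[i].state = public
		}
	}
	if err := writeDescriptor(ctx, desc); err != nil {
		return err
	}
	return waitForOneVersion(ctx, desc.id)
}
```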
In Postgres, if an enum is dropped without `CASCADE`, the operation will not
succeed if any tables use the enum. If an enum is dropped with `CASCADE`, all
dependent columns are dropped as well. If the database that an enum was
created in is dropped, then the enum is dropped as well. In order to maintain
this information, the descriptors that represent an enum need to hold
back-references to the tables and columns that use them. We expect the
descriptor leasing system being developed to manage invalidation of cached
enums when enums are destroyed in these cases.

## Physical Layout

At first, it may seem that a valid implementation of enum values is to map
each value to an integer and store these integers on disk. This implementation
seems like it would supply all the ordering guarantees needed of enums.
However, Postgres allows adding a new enum value and specifying its order with
respect to an existing value of the enum. This looks like:
```sql
CREATE TYPE t AS ENUM ('v1', 'v2');
ALTER TYPE t ADD VALUE 'v1.5' AFTER 'v1';
```
This adds the value `v1.5` to the enum `t` and orders it after the value `v1`.
Using plain integers as the backing values for enums would not allow us to
handle this sort of case. Postgres implements this feature by storing a sort
order for each enum value as a float. When a new value is added like this,
Postgres takes the sort orders of the values that the new value is being
inserted between, and creates a float that bisects that range. Concretely, if
`v1` had a sort order of `1.5` and `v2` had a sort order of `2.0`, then `v1.5`
would be inserted with a sort order of `1.75`. However, once the floating
point precision limit has been reached, Postgres rewrites all sort orders to
integral values. Postgres can do this because it doesn't require a stable disk
encoding for enums. In our case, we need a stable encoding in order to store
data on disk if an enum is used in an index, and we cannot afford to rewrite
all tables using an enum when the enum's sort order changes.

We propose a different strategy that is related to this idea of bisecting
ranges, but doesn't suffer from problems due to floating point precision. The
general idea is to use byte arrays to hold the sort order of our enums, and to
reserve some bytes in the arrays to create the ordering that we need. In
particular, we reserve the minimum byte (`0`) and have a maximum allowed byte;
in practice this will be `255`. An example of the encoding scheme is below.

Assume we start with 3 elements (`a`, `b`, `c`), and let the maximum byte
value be 3. The sort order byte arrays for each element would be:
```
a 1/
b 2/
c 3/
```
To add an element after `b`, we can create a new key that sits in the middle
of the range between `b` and `c`:
```
a 1/
b 2/
d 2/2/
c 3/
```
Now let's add more values before `d`. The first one is easy:
```
a 1/
b 2/
e 2/1/
d 2/2/
c 3/
```
The tricky case is adding a value before `e`. Because we reserved the minimum
byte, we can append it and then bisect the range again:
```
a 1/
b 2/
f 2/0/2/
e 2/1/
d 2/2/
c 3/
```
This strategy can be extended indefinitely as long as the pattern of reserving
the minimum byte is followed. A prototype of the exact algorithm is included
as part of the RFC PR, and a minimal sketch is shown below.

This sort order byte array will be the physical layout and identifier of the
enum value. We expect that for small enums only a byte or two will be needed
to hold all the values, and that our compression strategies at the storage
layer will compress this data well.
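The exact algorithm lives in the referenced prototype; the following is only a
minimal sketch of the bisection idea under the assumptions above: byte `0` is
reserved as a minimum sentinel, there is a fixed maximum byte, and no key ends
in the reserved byte.

```go
package enumenc

// keyBetween returns a byte string that sorts strictly between lo and hi.
// It assumes lo sorts before hi, that byte 0 is reserved as a minimum
// sentinel (no key ends in 0), and that maxByte is the largest byte value in
// use. A nil lo means "no lower bound" and a nil hi means "no upper bound".
func keyBetween(lo, hi []byte, maxByte byte) []byte {
	var res []byte
	// onLo and onHi track whether the result built so far is still a prefix
	// of lo and hi respectively, and is therefore still constrained by them.
	onLo, onHi := true, true
	for i := 0; ; i++ {
		l := byte(0) // reserved minimum sentinel
		if onLo && i < len(lo) {
			l = lo[i]
		}
		h := int(maxByte) + 1 // effectively unbounded above
		if onHi && i < len(hi) {
			h = int(hi[i])
		}
		if int(l)+1 < h {
			// There is room for a byte strictly between l and h: bisect.
			res = append(res, byte((int(l)+h)/2))
			return res
		}
		// No room at this position: copy the lower bound byte and descend
		// one position deeper, as in the examples above.
		res = append(res, l)
		onLo = onLo && i < len(lo)
		onHi = onHi && i < len(hi) && int(l) == h
	}
}
```

With a maximum byte of 3, `keyBetween([]byte{2}, []byte{2, 1}, 3)` returns
`2/0/2/`, matching the insertion of `f` before `e` in the example above.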
Since the common case of adding members to an enum is to add a member at the
beginning or end of the set of values, we can adjust the algorithm slightly to
better handle this case. When generating a new key byte where one of the
endpoints is the minimum or maximum element, the algorithm can add or subtract
a small constant from the existing key rather than bisecting the range. This
allows many more elements to be added at the beginning or end of the range
without increasing the number of bytes used to store the enum. The algorithm
is implemented in [this PR](https://github.com/cockroachdb/cockroach/pull/47939).

## Parsing

Currently, the CockroachDB grammar is not equipped to handle type names that
are qualified, due to changes made in the past that separated the parsing of
object and type identifiers. Some of these changes will have to be reverted or
adapted in order to allow types to have qualifications again. The work to
allow the parser to recognize qualified type names has been done in
[this PR](https://github.com/cockroachdb/cockroach/pull/47216).

## Type System Changes

The type system of CockroachDB currently assumes that anywhere a type is
present in an AST, that type is statically known. In code, this means that
every AST object that holds a type (like a `CastExpr` or `ColumnDef`) holds a
`*types.T`, which is constructed at parse time. As part of implementing
user-defined types, the type system must be taught that not all types are
statically known. The general idea is to change the types in AST nodes to a
new interface representing an unresolved type reference. These type references
can then be resolved into `*types.T` through type resolution. Additionally, we
must enforce that types are only accessed after type checking, when all type
references have been resolved. A prototype of this approach can be found in
[this PR](https://github.com/cockroachdb/cockroach/pull/47386).

After the process of type resolution, enums need a `types.T` for interaction
with other components of the system. We will introduce a new type family for
enums, and the `types.T` for an enum will contain the stable ID of the
`TypeDescriptor` that backs the type. The `types.T` will also contain extra
fields for an enum, like the mapping of names to physical values. Importantly,
these extra fields will not be serialized as part of the proto. Instead, when
a type is resolved, the returned `*types.T` will be hydrated to populate these
fields.

A potential option was to avoid using a `TypeDescriptor` and instead just
extend the `types.T` proto to contain the necessary fields for user-defined
types. However, this is not feasible because `types.T` protos are stored on
disk in various descriptors. It is too expensive to update every descriptor
that contains a type each time the type is altered.

A new `Datum`, `DEnum`, will be introduced to represent values of enums at
runtime. A `DEnum` will store the physical representation of the enum value as
well as the hydrated `*types.T` of its type. The extra fields in the
`*types.T` that hold information about enum values will be used for datum
operations, without the need to thread ID resolution capabilities into the
evaluation of operations on datums.
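To illustrate how the hydrated type carries enough information for datum
operations, here is a minimal sketch of what a `DEnum` datum could look like.
The field and method names are hypothetical placeholders, not the eventual
definitions in the tree and types packages.

```go
package enumdatum

import "bytes"

// hydratedEnumType is a hypothetical stand-in for the extra, non-serialized
// fields that a resolved *types.T would be hydrated with.
type hydratedEnumType struct {
	typeDescID   uint32
	logicalNames []string // e.g. "monday", "tuesday", ...
	physicalReps [][]byte // sort order byte arrays, aligned with logicalNames
}

// dEnum sketches the proposed DEnum datum: the physical representation of the
// value plus a pointer to the hydrated type metadata.
type dEnum struct {
	physicalRep []byte
	typ         *hydratedEnumType
}

// compare orders two enum datums of the same type by their physical
// representation, which is constructed to match the user-defined ordering.
func (d dEnum) compare(other dEnum) int {
	return bytes.Compare(d.physicalRep, other.physicalRep)
}

// logicalName maps the physical representation back to the user-visible name
// using only the hydrated type, without any descriptor lookups.
func (d dEnum) logicalName() (string, bool) {
	for i, rep := range d.typ.physicalReps {
		if bytes.Equal(rep, d.physicalRep) {
			return d.typ.logicalNames[i], true
		}
	}
	return "", false
}
```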
When a user-defined type is created in Postgres, Postgres automatically
creates an alias for an array of the new type. For example, if a user creates
a type `days`, the system also creates the type `_days` as an alias for
`days[]`. This alias tracks changes made to the referenced type as the type
moves through schemas and is dropped. It is unclear whether supporting this
sort of type aliasing will be a significant lift on top of the enum work.

## Semantic Analysis Changes

The optimizer will need to be taught about the check constraint implied by a
column being of an enum type. Additionally, it will need to be taught how to
convert enum values from their input string representation into their `Datum`
physical representation.

The `Catalog` used by the optimizer will need to be extended to support
resolution of types. The way that the catalog represents user-defined types is
important for invalidation of cached plans: if a type is updated, all plans
containing data sources that use the type need to be invalidated.

## DistSQL

The gateway node that plans a SQL query has access to all resolved type
information for the query. Remote nodes on which different parts of the query
are planned need access to this information in order to correctly execute the
query. In particular, these nodes need to hydrate their `*types.T` containers
with metadata, and they need to parse and type check serialized expressions.
The hydration of `*types.T` objects can be done at operator initialization.
The trickier problem is type checking serialized expressions -- we don't want
to pay the cost of name resolution again. Our proposed solution is similar to
what is currently done for the serialization of some Postgres OID types like
`regclass`.

Assume we have an enum `mytype` with ID `10`, and that `'hello'` is a value of
`mytype`. Rather than serializing `'hello'` as `'hello':::mytype`, we will
introduce a new `crdb_internal` builtin
`crdb_internal.create_user_defined_type(str, uint32)` that returns the input
string cast to the type with the input ID. So `'hello'` would get serialized
as `crdb_internal.create_user_defined_type('hello', 10)`, which allows us to
avoid name resolution on remote nodes.
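A minimal sketch of that serialization, assuming the builtin described above
exists; `quoteSQLString` is a simplified stand-in for proper SQL
string-literal escaping, which the real implementation would take from the
existing lexer utilities.

```go
package distsqlser

import (
	"fmt"
	"strings"
)

// quoteSQLString is a simplified stand-in for SQL string-literal escaping.
func quoteSQLString(s string) string {
	return "'" + strings.ReplaceAll(s, "'", "''") + "'"
}

// serializeEnumDatum renders an enum value in the proposed wire form: a call
// to the new builtin that casts the logical string to the type with the given
// stable ID, so remote nodes never need to resolve the type name.
func serializeEnumDatum(logicalValue string, typeID uint32) string {
	return fmt.Sprintf(
		"crdb_internal.create_user_defined_type(%s, %d)",
		quoteSQLString(logicalValue), typeID,
	)
}
```

For example, `serializeEnumDatum("hello", 10)` produces
`crdb_internal.create_user_defined_type('hello', 10)`, the form shown above.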
# Alternatives

## Namespacing and Metadata Storage

During discussion of the RFC, some alternatives were debated, in particular
the ideas of using a separate namespace table for types and/or a separate
descriptor table for metadata storage. The benefit of a separate namespace
table is that it could make future work on interpreting tables as types more
straightforward. However, using a separate namespace table complicates
existing name resolution and conflict detection strategies. A separate
descriptor table allows scans over all tables or all types to avoid touching
descriptors of other kinds, which is a performance improvement for catalog
table operations. However, this problem is somewhat orthogonal to this work,
and would be better solved by building some sort of indexing structure on the
`system.descriptor` table. Using the existing namespace table allows most of
the existing name resolution code to be used directly, and using the same
descriptor table allows leasing primitives to be built on only one system
table.

## Overall Alternative

One alternative approach to the proposed physical layout was to store just an
enum ID on disk, and to store ordering and representation information in a
separate lookup table. Operations that join on or render enum values would
then produce a join against this reference table. This allows enum data to be
changed easily, but results in a variety of complexity during planning.

# Unresolved questions

It is unclear what interactions will arise between this work and the
planned/ongoing work on user defined schemas.