github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20160426_fk.md (about) 1 - Feature Name: Foreign Keys 2 - Status: completed 3 - Start Date: 2016-04-20 4 - Authors: David Taylor 5 - RFC PR: [#6309](https://github.com/cockroachdb/cockroach/pull/6309) 6 - Cockroach Issue: [#2968](https://github.com/cockroachdb/cockroach/issues/2968) 7 8 # Summary 9 Support REFERENCES / FOREIGN KEY constraints to enforce referential integrity 10 between tables. 11 12 Specifically, if a column (the "referencing" or "child" column) declares that it 13 `REFERENCES` a another column, often in another table, (the "referenced" or 14 "parent" column), then: 15 * a value `X` may only be inserted into the child column if `X` also exists in 16 the parent column. 17 * a value `X` in a child column may only be updated to a value `Y` if `Y` 18 exists in the parent column. 19 * a value `X` in the parent column may only be changed or deleted if `X` does 20 not exist in the child column. 21 22 In the default (`RESTRICT`) case, operations that would violate one of the above 23 rules fail. Alternative behaviors can be specified though e.g. `CASCADE`. 24 25 Note: For simplicity, this specification primarily discusses references as being 26 between a referencing and referenced _column_, but it can also be between 27 matching _lists of columns_, in which case the values mentioned above are tuples 28 of values, but the semantics are the same. 29 30 # Motivation 31 One of the major advantages of a transactional database is the ability to 32 maintain consistency between separate but related collections. 33 Server-enforced referential integrity takes this even further, by preventing 34 even misbehaving applications from creating inconsistency between collections. 35 36 # Detailed design 37 38 ## Indexes 39 Enforcement of foreign key relationships primarily consists of looking for the 40 presence (or absence) of specific values in specific columns, making indexes on 41 those columns highly desirable. While some databases encourage but do not 42 require indexes on both referenced and referencing columns, making them 43 mandatory makes the implementation easier to reason about. 44 45 Values in a referenced column must be unique, otherwise it would be ambiguous 46 which instance of a duplicated value was the one referenced and thus subject to 47 the foreign key restrictions. In practice, uniqueness enforcement is implemented 48 via an index, so the requirement of an index on the referenced column was already 49 implied by the uniqueness requirement. 50 51 Allowing the referencing column to be unindexed has obvious performance 52 drawbacks (operations on the referenced table would need to perform table-scans). 53 Implicitly adding an index however can also easily lead to unpleasant surprises for 54 operators if it changed performance characteristics without notice, and it is 55 unclear what would be expected if they constraint were dropped -- leave an 56 orphaned index no one explicitly created, or implicitly drop it, possibly 57 causing serious changes in query execution? 58 59 Requiring indexes exist ensures that lookups are performant and makes the 60 implementation simpler, since it can rely on their presence, while requiring they 61 be explicitly created by operators, avoids any surprising or unexpected changes. 62 63 ## Tracking FK Relationships 64 When a column references another column, the indexes on both that column and on 65 referenced column are annotated to reflect that relationship: 66 67 ```proto 68 message ForeignKeyRef { 69 optional uint32 table = 1 70 optional uint32 index = 2 71 optional bool unvalidated = 3 72 optional string constraint_name = 4 73 } 74 75 message IndexDescriptor { 76 ... 77 optional TableAndIndexID foreign_key = 8; 78 repeated TableAndIndexID referenced_by = 9; 79 ``` 80 81 When and how these index descriptors are modified is discussed below. 82 83 On insert or update, if the descriptor for the index on the written column 84 indicates that it references another table, the presence of a matching value in 85 that table is checked. 86 87 On delete or update, for each index mentioned in the `referenced_by` in the 88 descriptors of the indexes on the written table, the presence of a referencing 89 value is checked. 90 91 Batched inserts or modifications of many rows could potentially be validated via 92 a single range lookup or even a JOIN with the referenced table as a performance 93 optimization. 94 95 ## Creating and Modifying FK constraints 96 There are three rough classes of modification to FK constraints: 97 * Creating a new table containing a column with a constraint. 98 * Adding a new column with a constraint to an existing table. 99 * Adding a new constraint to an existing column. 100 101 In all three cases, the operation succeeds iff there are no existing 102 violations and no write can later succeed that would create a violation. 103 104 In the cases that add a new element (a new table or new column in an existing 105 table) that references an existing column, the new element must only become 106 public (i.e. accept reads/writes) once referential integrity is ensured, meaning 107 all nodes are aware of it and have started applying integrity checks to 108 operations on the referenced column. Specifically, the steps to add a new element 109 are: 110 1. Add the new element and its index in a non-public state. 111 * If adding a column to an existing table, wait for backfill of default values, then scan for violations. 112 1. Wait for all nodes to be aware of the new (hidden) element. 113 1. Update the referenced parent table, noting the reference from the (hidden) element. 114 1. Wait for all nodes to be aware of the new version of the parent table. 115 1. Update the new element to be public. 116 117 When adding a new constraint between *existing* columns, the process changes 118 slightly, in that the constraint itself is the one with an extra, intermediate 119 state: 120 1. Update the child table descriptor to include the outbound reference, 121 but mark the reference indicating it is not validated. 122 * Whether or not unvalidated constraints should be considered public is an open question. 123 1. Update the referenced parent table, noting the reference. 124 1. Wait for all nodes to be aware of the new versions of the both tables. 125 1. Scan to find existing violations. 126 1. Update the child table to mark the constraint as validated. 127 128 ### Concerns Handling Related Edits to Multiple Tables 129 Unlike the process described in the F1 paper, Cockroach leases individual table 130 descriptors instead of whole schemas. For isolated changes on individual tables 131 this difference is minor, but it complicates FK changes, which make related 132 alterations to multiple tables at once: the system ensures that at-most-two 133 versions of a particular table are active at the same time in the cluster, but 134 does not, directly, offer guarantees about separate tables -- thus the process 135 for making related changes to multiple tables must at times wait until all nodes 136 are using a particular version of one descriptor before initiating a change to 137 another. 138 139 Transactional modification of multiple table descriptors makes modifying FK 140 relationships much more difficult, since it needs to be denormalized to both 141 of the referencing and referenced table. Managing updates to a single record 142 per-database, e.g. a list of all the FK relationships or even just a single 143 record with all the table descriptors embedded in it, might be easier to reason 144 about, but would require substantial refactoring for DB-level leasing. 145 See discussion on #7508. 146 147 # Drawbacks 148 The addition of the checks before scanning for existing violations means those 149 checks may reject some writes but later be reverted when the scan finds a 150 violation, and the rejected writes would mention a constraint that was never 151 (successfully) created. (more discussion in Alternatives). 152 153 # Future Work 154 ## When to Evaluate References 155 In the simplest case, which will be implemented first, as each row is written, 156 any columns it references or is referenced by are checked for violations. This 157 however means that other writes in the same statement may or may not be visible 158 to those checks. 159 160 The `DEFERRED` keyword allows delaying checks until the transaction is committed, 161 thus including the effects of any other statements in the transaction (and is 162 thus required for handling circularly dependent values). 163 164 Even without `DEFERRED`, Postgres allows referencing a value inserted in the 165 same statement, even if it appears later in the statement, meaning it must stage 166 all the writes for a statement before attempting validation. Implementation of 167 this behavior would incur additional runtime overhead though, as checks and/or 168 writes would need to be buffered. 169 170 Feedback from early users of foreign-key constraints, using the simpler row-by-row 171 validation, can guide which, if either, of these behaviors we should implement. 172 173 ## `CASCADE` and other behaviors 174 ORM compatibility and user feedback will likely guide which, if any, of the 175 configurable behaviors like `CASCASE`, `SET NULL`, or `SET DEFAULT` we want to 176 implement as enhancements to the basic, `RESTRICT`-only implementation.