github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20151009_table_descriptor_lease.md (about) 1 - Feature Name: table_descriptor_lease 2 - Status: completed 3 - Start Date: 2015-10-09 4 - RFC PR: [#2810](https://github.com/cockroachdb/cockroach/pull/2036) 5 - Cockroach Issue: [#2036](https://github.com/cockroachdb/cockroach/issues/2036) 6 7 # Summary 8 9 Implement a table descriptor lease mechanism to allow safe usage of 10 cached table descriptors. 11 12 # Motivation 13 14 Table descriptors contain the schema for a single table and are 15 utilized by nearly every SQL operation. Fast access to the table 16 descriptor is critical for good performance. Reading the table 17 descriptor from the KV store on every operation adds significant 18 latency. 19 20 Table descriptors are currently distributed to every node in the 21 cluster via gossipping of the system config (see 22 [schema_gossip](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20150720_schema_gossip.md)). Unfortunately, 23 it is not safe to use these gossipped table descriptors in almost any 24 circumstance. Consider the statements: 25 26 ```sql 27 CREATE TABLE test (key INT PRIMARY KEY, value INT); 28 CREATE INDEX foo ON test (value); 29 INSERT INTO test VALUES (1, 2); 30 ``` 31 32 Depending on when the gossip of the schema change is received, the 33 `INSERT` statement may either see no cached descriptor for the `test` 34 table (which is safe because it will fall back to reading from the KV 35 store), the descriptor as it looks after the `CREATE TABLE` statement 36 or the descriptor as it looks after the `CREATE INDEX` statement. If 37 the `INSERT` statement does not see the `CREATE INDEX` statement then 38 it will merrily insert the values into `test` without creating the 39 entry for the `foo` index, thus leaving the table data in an 40 inconsistent state. 41 42 It is easy to show similar problematic sequences of operations for 43 `DELETE`, `UPDATE` and `SELECT`. Fortunately, a solution exists in the 44 literature: break up the schema change into discrete steps for which 45 any two consecutive versions of the schema are safe for concurrent 46 use. For example, to add an index to a table we would perform the 47 following steps: 48 49 * Add the index to the table descriptor but mark it as delete-only: 50 update and delete operations which would delete entries for the 51 index do so, but insert and update operations which would add new 52 elements and read operations do not use the index. 53 * Wait for the table descriptor change to propagate to all nodes in 54 the cluster and all uses of the previous version to finish. 55 * Mark the index as write-only: insert, update and delete operations 56 would add entries for the index, but read operations would not use 57 the index. 58 * Wait for the table descriptor change to propagate to all nodes in 59 the cluster and all uses of the previous version to finish. 60 * Backfill the index. 61 * Mark the index in the table descriptor as read-write. 62 * Wait for the table descriptor change to propagate to all nodes in 63 the cluster and all uses of the previous version to finish. 64 65 This RFC is focused on how to wait for the table descriptor change to 66 propagate to all nodes in the cluster. More accurately, it is focused 67 on how to determine when a previous version of the descriptor is no 68 longer in use for read-write operations. 69 70 Separately, we implement another lease mechanism not discussed here 71 called a schema change lease. When a schema change is to be performed, 72 the operation first acquires a schema change lease for the table. 73 The schema change lease ensures that only the lease holder can execute 74 the state machine for a schema change and update the table descriptor. 75 76 Before using a table descriptor for a DML operation (i.e. `SELECT`, 77 `INSERT`, `UPDATE`, `DELETE`, etc), the operation needs to obtain a 78 table lease for the descriptor. The lease will be guaranteed valid for 79 a significant period of time (on the order of minutes). When the operation completes it will release the table lease. 80 81 # Detailed design 82 83 The design maintains two invariant: 84 * Two safe versions: A transaction at a particular timestamp is allowed to use 85 one of two versions of a table descriptor. 86 * Two leased version: There can be valid leases on at most the 2 latest 87 versions of a table in the cluster at any time. Leases are usually granted 88 on the latest version. 89 90 Table descriptors will be extended with a version number that is 91 incremented on every change to the descriptor: 92 93 ```proto 94 message TableDescriptor { 95 ... 96 optional uint32 id; 97 optional uint32 version; 98 optional util.hlc.Timestamp modification_time; 99 ... 100 } 101 ``` 102 103 A table descriptor at a version `v` has a validity period spanning from 104 its `ModificationTime` until the `ModificationTime` of the table descriptor at 105 version `v + 2`: [`ModificationTime`, `ModificationTime[v+2]`). A transaction 106 at time `T` can safely use one of two table descriptors versions: the two 107 versions with highest `ModificationTime` less than or equal to `T`. 108 Once a table descriptor at `v` has been written, the validity period of the 109 table descriptor at `v - 2` is fixed. A node can cache a copy of 110 `v-2` along with its fixed validity window and use it for 111 transactions whose timestamps fall within its validity window. 112 113 Leases are needed because the validity window of the latest versions 114 (`v` and `v - 1` ) are unknown (`v + 1` hasn't yet been written). 115 Since table descriptors can be written at any time, this design is 116 about defining a frontier for an undefined validity window, 117 and guaranteeing that the frontier lies within the as of yet 118 to be defined validity window. We call such a validity window 119 a temporary validity window for the version. The use of a cached copy 120 of a table descriptor is allowed while enforcing the temporary 121 validity window. 122 123 Leases will be tied to a specific version of the descriptor. Lease 124 state will be stored in a new `system.lease` table: 125 126 ```sql 127 CREATE TABLE system.lease ( 128 DescID INT, 129 Version INT, 130 NodeID INT, 131 # Expiration is a wall time in microseconds. This is a 132 # microsecond rounded timestamp produced from the timestamp 133 # of the transaction inserting a new row. It would ideally 134 # be an hlc.timestamp but cannot be changed now without much 135 # difficulty. 136 Expiration TIMESTAMP, 137 PRIMARY KEY (DescID, Version, Expiration, NodeID) 138 ) 139 ``` 140 141 Entries in the lease table will be added and removed as leases are 142 acquired and released. A background goroutine running on the lease holder 143 for the system range will periodically delete expired leases 144 (not yet implemented). 145 146 Leases will be granted for a duration measured in minutes (we'll 147 assume 5m for the rest of this doc, though experimentation may tune 148 this number). A node will acquire a lease before using it in an 149 operation and may release the lease when the last local operation 150 completes that was using the lease and a new version of the descriptor 151 exists. When a new version exists all new transactions use a new lease 152 on the new version even when the older lease is still in use by older 153 transactions. 154 155 The lease holder of the range containing a table descriptor will gossip the 156 most recent version of that table descriptor using the gossip key 157 `"table-<descID>"` and the value will be the version number. The 158 gossiping of the most recent table versions allows nodes to 159 asynchronously discover when a new table version is written. But note 160 that it is not necessary for correctness as the protocol for acquiring 161 a lease ensures that only the two recent versions of a descriptor can 162 have a new lease granted on it. 163 164 All timestamps discussed in this document are references to database 165 timestamps. 166 167 ## Lease acquisition 168 169 Lease acquisition will perform the following steps in a transaction with 170 timestamp `L`: 171 172 * `SELECT descriptor FROM system.descriptor WHERE id = <descID>` 173 * `lease.expiration = L + lease_duration` rounded to microseconds (SQL DTimestamp). 174 * `INSERT INTO system.lease VALUES (<desc.ID>, <desc.version>, <nodeID>,<lease.expiration>)` 175 176 The `lease` is used by transactions that fall within its temporary 177 validity window. Nodes will maintain a map from `<descID, version>` to 178 a lease: `<TableDescriptor, expiration, localRefCount>`. The local 179 reference count will be incremented when a transaction first uses a table 180 and decremented when the transaction commits/aborts. When the node 181 discovers a new version of the table, either via gossip or by 182 acquiring a lease and discovering the version has changed it can 183 release the lease on the old version when there are no more local 184 references: 185 186 * `DELETE FROM system.lease WHERE (DescID, Version, NodeID, Expiration) = (<descID>, <version>, <nodeID>, <lease.expiration>)` 187 188 ## Incrementing the version 189 190 A schema change operation needs to ensure that there is only one version of 191 a descriptor in use in the cluster before incrementing the version of the 192 descriptor. The operation will perform the following steps transactionally 193 using timestamp `SC`: 194 195 * `SELECT descriptor FROM system.descriptor WHERE id = <descID>` 196 * Set `desc.ModificationTime` to `SC` 197 * `SELECT COUNT(DISTINCT version) FROM system.lease WHERE descID = <descID> AND version = <prevVersion> AND expiration > DTimestamp(SC)` == 0 198 * Perform the edits to the descriptor. 199 * Increment `desc.Version`. 200 * `UPDATE system.descriptor WHERE id = <descID> SET descriptor = <desc>` 201 202 The schema change operation only scans the leases with the previous 203 version so as to not cause a lot of aborted transactions trying to 204 acquire leases on the new version of the table. The above schema change 205 transaction is retried in a loop until it suceeds. 206 207 Note that the updating of the table descriptor will cause the table 208 version to be gossiped alerting nodes to the new version and causing 209 them to release leases on the old version. The expectation is that 210 nodes will fairly quickly transition to using the new version and 211 release all leases to the old version allowing another step in the 212 schema change operation to take place. 213 214 When a node holding leases dies permanently or becomes unresponsive 215 (e.g. detached from the network) schema change operations will have to 216 wait for any leases that node held to expire. This will result in an 217 inability to perform more than one schema modification step to the 218 descriptors referred to by those leases. The maximum amount of time 219 this condition can exist is the lease duration (5m). 220 221 As described above, leases will be retained for the lifetime of a 222 transaction. In a multi-statement transaction we need to ensure that 223 the same version of each table is used throughout the transaction. To 224 do this we will add the descriptor IDs and versions to the transaction 225 structure. When a node receives a SQL statement within a 226 multi-statement transaction, it will attempt to use the table version 227 specified. 228 229 While we normally acquire a lease at the latest version, occasionally a 230 transaction might require a lease on a previous version because it falls 231 before the validity window of the latest version: 232 233 A table descriptor at version `v - 1` can be read using timestamp 234 `ModificationTime - 1ns` where `ModificationTime` is for table descriptor 235 at version `v`. Note that this method can be used to read a table descriptor 236 at any version. 237 238 A lease can be acquired on a previous version using a transaction at 239 timestamp `P` by running the following: 240 * `SELECT descriptor FROM system.descriptor WHERE id = <descID>` 241 * check that `desc.version == v + 1` 242 * `lease.expiration = P + lease_duration` rounded to microseconds (DTimestamp). 243 * `INSERT INTO system.lease VALUES (<desc.ID>, <v>, <nodeID>, <lease.expiration>)` 244 245 ## Correctness 246 247 It is valuable to consider various scenarios to check lease usage correctness. 248 Assume `SC` is the timestamp of the latest schema change descriptor modification 249 time, while `L` is the timestamp of a transaction that has acquired a lease: 250 * `L < SC`: The lease will contain a table descriptor with the previous version 251 and will write a row in the lease table using timestamp `L` which will be seen 252 by the schema change which uses a timestamp `SC`. As long as the lease is not 253 released another schema change cannot use a `timestamp <= lease.expiration` 254 * `L > SC`: The lease will read the version of the table descriptor written by the 255 schema change. 256 * `L == SC`: If the lease reads the descriptor first, the schema change will see 257 the read in the read timestamp cache and will get restarted. If the schema 258 change writes the descriptor at the new version first, the lease will read the new 259 descriptor and create a lease with the new version. 260 261 Temporary validity window for a leased table descriptor is either one of: 262 1. `[ModificationTime, D)`: where `D` is the timestamp of the lease 263 release transaction. Since a transaction with timestamp `T` using a lease 264 and a release transaction originate on the same node, the release follows 265 the last transaction using the lease, `T < D` is always true. 266 2. `[ModificationTime, hlc.Timestamp(lease.expiration))` is valid 267 because the actual stored table version during the window is guaranteed 268 to be at most off by 1. 269 270 Note that two transactions with timestamp T~1~ and T~2~ using versions 271 `v` and `v+2` respectively that touch some data in common, are guaranteed 272 to have a serial execution order T~1~ < T~2~. This property is important 273 when we want to positively guarantee that one transaction sees the effect 274 of another. 275 276 Note that the real time at which a transaction commits will be different 277 from the wall time in its database timestamp. On an idle range, transactions 278 may be allowed to commit with timestamps far in the past (although 279 the read timestamp cache will not permit writing with a very old timestamp). 280 The expiration of a table descriptor lease does not imply that all 281 transactions using that lease have finished. Even if a transaction commits 282 later in time, CRDB guarantees serializability of transactions thereby 283 sometimes aborting old transactions that attempting to write using an 284 old timestamp. 285 286 Examples of transactions that need serial execution that use 287 version `v` and `v+2`: 288 1. A transaction attempts to DELETE a row using a descriptor without 289 an index, and commits after the row is being acted on by an UPDATE 290 seeing an index in the WRITE_ONLY state. The DELETE is guaranteed 291 to see the UPDATE and be aborted, or the UPDATE sees the delete tombstone. 292 2. A transaction attempts to run a DELETE on a table in the DELETE_ONLY 293 state and the transaction commits during the backfill. The DELETE is 294 guaranteed to be seen by the backfill, or aborted. 295 3. A transaction attempts to run an UPDATE on a table with an index in 296 the WRITE_ONLY state and the transaction commits when the index is readable 297 via a SELECT. The UPDATE is either guaranteed to be seen by the SELECT, 298 or be aborted. 299 300 # Accommodating schema changes within transactions 301 302 A node acquires a lease on a table descriptor using a transaction created 303 for this purpose (instead of using the transaction that triggered the 304 lease acquisition), and the transaction triggering the lease acquisition 305 must take further precautions to prevent hitting a deadlock with 306 the node's lease acquiring transaction. A transaction that runs a 307 CREATE TABLE followed by other operations on the table will hit 308 a deadlock situation where the table descriptor hasn't 309 yet been committed while the node is trying to acquire a lease 310 on the table descriptor using a separate transaction. The commands 311 following the CREATE TABLE trying to acquire a table lease 312 will block on their own transaction that has written out a 313 new uncommitted table. 314 315 A similar situation happens when a table exists but a node 316 has no lease on the table, and a transaction runs a schema change 317 that modifies the table without incrementing the version, and 318 subsequently runs other commands referencing the table. 319 Care has to be taken to first acquire a table lease before running 320 the transaction. While it is possible to acquire the lease in this 321 way before running an ALTER TABLE it is not possible to do the same 322 in the CREATE TABLE case. 323 324 Commands within a transaction would like to see the schema 325 changes made within the transaction reducing the chance of 326 user surprise. Both this requirement and the deadlock 327 prevention requirement discussed above are solved through a 328 solution where table descriptors modified within a transaction 329 are cached specifically for the use of the transaction, with the 330 transaction not needing a lease for the table. 331 332 # Drawbacks 333 334 * The lack of a central authority for a lease places additional stress 335 on the correct implementation of the transactions to acquire a lease 336 and publish a new version of a descriptor. 337 338 # Alternatives 339 340 * Earlier versions of this proposal utilized a centralized lease 341 service. Such a service has some conceptual niceties (a single 342 authority for managing the lease state of a table), yet introduces 343 another service that must be squeezed into the system. Such a lease 344 service would undoubtedly store state in the KV layer as well. Given 345 that the KV layer provides robust transactions the benefit is 346 smaller than it might otherwise have been. 347 348 * We could use an existing lock service such as etcd or Zookeeper. The 349 primary benefit would be the builtin watch functionality, but we can 350 get some of that functionality from gossip. We would still need the 351 logic for local reference counts. 352 353 * Keeping track of local references to descriptor versions in order to 354 early release leases adds complexity. We could just wait for leases 355 to expire, though that would cause a 3-step schema modification to 356 take at least 10m to complete. 357 358 # Unresolved questions 359 360 * Gossip currently introduces a 2s/hop delay in transmitting gossip 361 info. It would be nice to figure out how to introduce some sort of 362 "high-priority" flag for gossipping of descriptor version info to 363 reduce the latency in notifying nodes of a new descriptor version. 364 365 * This RFC doesn't address the full complexity of table descriptor 366 schema changes. For example, when adding an index the node 367 performing the operation might die while backfilling the index 368 data. We'll almost certainly want to parallelize the backfilling 369 operation. And we'll want to disallow dropping an index that is 370 currently being added. These issues are considered outside of the 371 scope of this RFC.