github.com/cockroachdb/cockroach@v20.2.0-alpha.1+incompatible/docs/RFCS/20151009_table_descriptor_lease.md (about)

     1  - Feature Name: table_descriptor_lease
     2  - Status: completed
     3  - Start Date: 2015-10-09
     4  - RFC PR: [#2810](https://github.com/cockroachdb/cockroach/pull/2036)
     5  - Cockroach Issue: [#2036](https://github.com/cockroachdb/cockroach/issues/2036)
     6  
     7  # Summary
     8  
     9  Implement a table descriptor lease mechanism to allow safe usage of
    10  cached table descriptors.
    11  
    12  # Motivation
    13  
    14  Table descriptors contain the schema for a single table and are
    15  utilized by nearly every SQL operation. Fast access to the table
    16  descriptor is critical for good performance. Reading the table
    17  descriptor from the KV store on every operation adds significant
    18  latency.
    19  
    20  Table descriptors are currently distributed to every node in the
    21  cluster via gossipping of the system config (see
    22  [schema_gossip](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20150720_schema_gossip.md)). Unfortunately,
    23  it is not safe to use these gossipped table descriptors in almost any
    24  circumstance. Consider the statements:
    25  
    26  ```sql
    27  CREATE TABLE test (key INT PRIMARY KEY, value INT);
    28  CREATE INDEX foo ON test (value);
    29  INSERT INTO test VALUES (1, 2);
    30  ```
    31  
    32  Depending on when the gossip of the schema change is received, the
    33  `INSERT` statement may either see no cached descriptor for the `test`
    34  table (which is safe because it will fall back to reading from the KV
    35  store), the descriptor as it looks after the `CREATE TABLE` statement
    36  or the descriptor as it looks after the `CREATE INDEX` statement. If
    37  the `INSERT` statement does not see the `CREATE INDEX` statement then
    38  it will merrily insert the values into `test` without creating the
    39  entry for the `foo` index, thus leaving the table data in an
    40  inconsistent state.
    41  
    42  It is easy to show similar problematic sequences of operations for
    43  `DELETE`, `UPDATE` and `SELECT`. Fortunately, a solution exists in the
    44  literature: break up the schema change into discrete steps for which
    45  any two consecutive versions of the schema are safe for concurrent
    46  use. For example, to add an index to a table we would perform the
    47  following steps:
    48  
    49  * Add the index to the table descriptor but mark it as delete-only:
    50    update and delete operations which would delete entries for the
    51    index do so, but insert and update operations which would add new
    52    elements and read operations do not use the index.
    53  * Wait for the table descriptor change to propagate to all nodes in
    54    the cluster and all uses of the previous version to finish.
    55  * Mark the index as write-only: insert, update and delete operations
    56    would add entries for the index, but read operations would not use
    57    the index.
    58  * Wait for the table descriptor change to propagate to all nodes in
    59    the cluster and all uses of the previous version to finish.
    60  * Backfill the index.
    61  * Mark the index in the table descriptor as read-write.
    62  * Wait for the table descriptor change to propagate to all nodes in
    63    the cluster and all uses of the previous version to finish.
    64  
    65  This RFC is focused on how to wait for the table descriptor change to
    66  propagate to all nodes in the cluster. More accurately, it is focused
    67  on how to determine when a previous version of the descriptor is no
    68  longer in use for read-write operations.
    69  
    70  Separately, we implement another lease mechanism not discussed here
    71  called a schema change lease. When a schema change is to be performed,
    72  the operation first acquires a schema change lease for the table.
    73  The schema change lease ensures that only the lease holder can execute
    74  the state machine for a schema change and update the table descriptor.
    75  
    76  Before using a table descriptor for a DML operation (i.e. `SELECT`,
    77  `INSERT`, `UPDATE`, `DELETE`, etc), the operation needs to obtain a
    78  table lease for the descriptor. The lease will be guaranteed valid for
    79  a significant period of time (on the order of minutes). When the operation completes it will release the table lease.
    80  
    81  # Detailed design
    82  
    83  The design maintains two invariant:
    84  * Two safe versions: A transaction at a particular timestamp is allowed to use
    85  one of two versions of a table descriptor.
    86  * Two leased version: There can be valid leases on at most the 2 latest
    87  versions of a table in the cluster at any time. Leases are usually granted
    88  on the latest version.
    89  
    90  Table descriptors will be extended with a version number that is
    91  incremented on every change to the descriptor:
    92  
    93  ```proto
    94  message TableDescriptor {
    95    ...
    96    optional uint32 id;
    97    optional uint32 version;
    98    optional util.hlc.Timestamp modification_time;
    99    ...
   100  }
   101  ```
   102  
   103  A table descriptor at a version `v` has a validity period spanning from
   104  its `ModificationTime` until the `ModificationTime` of the table descriptor at
   105  version `v + 2`: [`ModificationTime`, `ModificationTime[v+2]`). A transaction
   106  at time `T` can safely use one of two table descriptors versions: the two
   107  versions with highest `ModificationTime` less than or equal to `T`.
   108  Once a table descriptor at `v` has been written, the validity period of the
   109  table descriptor at `v - 2` is fixed. A node can cache a copy of
   110  `v-2` along with its fixed validity window and use it for
   111  transactions whose timestamps fall within its validity window.
   112  
   113  Leases are needed because the validity window of the latest versions
   114  (`v` and `v - 1` ) are unknown (`v + 1` hasn't yet been written).
   115  Since table descriptors can be written at any time, this design is
   116  about defining a frontier for an undefined validity window,
   117  and guaranteeing that the frontier lies within the as of yet
   118  to be defined validity window. We call such a validity window
   119  a temporary validity window for the version. The use of a cached copy
   120  of a table descriptor is allowed while enforcing the temporary
   121  validity window.
   122  
   123  Leases will be tied to a specific version of the descriptor. Lease
   124  state will be stored in a new `system.lease` table:
   125  
   126  ```sql
   127  CREATE TABLE system.lease (
   128    DescID     INT,
   129    Version    INT,
   130    NodeID     INT,
   131    # Expiration is a wall time in microseconds. This is a
   132    # microsecond rounded timestamp produced from the timestamp
   133    # of the transaction inserting a new row. It would ideally
   134    # be an hlc.timestamp but cannot be changed now without much
   135    # difficulty.
   136    Expiration TIMESTAMP,
   137    PRIMARY KEY (DescID, Version, Expiration, NodeID)
   138  )
   139  ```
   140  
   141  Entries in the lease table will be added and removed as leases are
   142  acquired and released. A background goroutine running on the lease holder
   143  for the system range will periodically delete expired leases
   144   (not yet implemented).
   145  
   146  Leases will be granted for a duration measured in minutes (we'll
   147  assume 5m for the rest of this doc, though experimentation may tune
   148  this number). A node will acquire a lease before using it in an
   149  operation and may release the lease when the last local operation
   150  completes that was using the lease and a new version of the descriptor
   151  exists. When a new version exists all new transactions use a new lease
   152  on the new version even when the older lease is still in use by older
   153  transactions.
   154  
   155  The lease holder of the range containing a table descriptor will gossip the
   156  most recent version of that table descriptor using the gossip key
   157  `"table-<descID>"` and the value will be the version number. The
   158  gossiping of the most recent table versions allows nodes to
   159  asynchronously discover when a new table version is written. But note
   160  that it is not necessary for correctness as the protocol for acquiring
   161  a lease ensures that only the two recent versions of a descriptor can
   162  have a new lease granted on it.
   163  
   164  All timestamps discussed in this document are references to database
   165  timestamps.
   166  
   167  ## Lease acquisition
   168  
   169  Lease acquisition will perform the following steps in a transaction with
   170  timestamp `L`:
   171  
   172  * `SELECT descriptor FROM system.descriptor WHERE id = <descID>`
   173  * `lease.expiration = L + lease_duration` rounded to microseconds (SQL DTimestamp).
   174  * `INSERT INTO system.lease VALUES (<desc.ID>, <desc.version>, <nodeID>,<lease.expiration>)`
   175  
   176  The `lease` is used by transactions that fall within its temporary
   177  validity window. Nodes will maintain a map from `<descID, version>` to
   178  a lease: `<TableDescriptor, expiration, localRefCount>`. The local
   179  reference count will be incremented when a transaction first uses a table
   180  and decremented when the transaction commits/aborts. When the node
   181  discovers a new version of the table, either via gossip or by
   182  acquiring a lease and discovering the version has changed it can
   183  release the lease on the old version when there are no more local
   184  references:
   185  
   186  * `DELETE FROM system.lease WHERE (DescID, Version, NodeID, Expiration) = (<descID>, <version>, <nodeID>, <lease.expiration>)`
   187  
   188  ## Incrementing the version
   189  
   190  A schema change operation needs to ensure that there is only one version of
   191  a descriptor in use in the cluster before incrementing the version of the
   192  descriptor. The operation will perform the following steps transactionally
   193  using timestamp `SC`:
   194  
   195  * `SELECT descriptor FROM system.descriptor WHERE id = <descID>`
   196  * Set `desc.ModificationTime` to `SC`
   197  * `SELECT COUNT(DISTINCT version) FROM system.lease WHERE descID = <descID> AND version = <prevVersion> AND expiration > DTimestamp(SC)` == 0
   198  * Perform the edits to the descriptor.
   199  * Increment `desc.Version`.
   200  * `UPDATE system.descriptor WHERE id = <descID> SET descriptor = <desc>`
   201  
   202  The schema change operation only scans the leases with the previous
   203  version so as to not cause a lot of aborted transactions trying to
   204  acquire leases on the new version of the table. The above schema change
   205  transaction is retried in a loop until it suceeds.
   206  
   207  Note that the updating of the table descriptor will cause the table
   208  version to be gossiped alerting nodes to the new version and causing
   209  them to release leases on the old version. The expectation is that
   210  nodes will fairly quickly transition to using the new version and
   211  release all leases to the old version allowing another step in the
   212  schema change operation to take place.
   213  
   214  When a node holding leases dies permanently or becomes unresponsive
   215  (e.g. detached from the network) schema change operations will have to
   216  wait for any leases that node held to expire. This will result in an
   217  inability to perform more than one schema modification step to the
   218  descriptors referred to by those leases. The maximum amount of time
   219  this condition can exist is the lease duration (5m).
   220  
   221  As described above, leases will be retained for the lifetime of a
   222  transaction. In a multi-statement transaction we need to ensure that
   223  the same version of each table is used throughout the transaction. To
   224  do this we will add the descriptor IDs and versions to the transaction
   225  structure. When a node receives a SQL statement within a
   226  multi-statement transaction, it will attempt to use the table version
   227  specified.
   228  
   229  While we normally acquire a lease at the latest version, occasionally a
   230  transaction might require a lease on a previous version because it falls
   231  before the validity window of the latest version:
   232  
   233  A table descriptor at version `v - 1` can be read using timestamp
   234  `ModificationTime - 1ns` where `ModificationTime` is for table descriptor
   235  at version `v`. Note that this method can be used to read a table descriptor
   236  at any version.
   237  
   238  A lease can be acquired on a previous version using a transaction at
   239  timestamp `P` by running the following:
   240  * `SELECT descriptor FROM system.descriptor WHERE id = <descID>`
   241  * check that `desc.version == v + 1`
   242  * `lease.expiration = P + lease_duration` rounded to microseconds (DTimestamp).
   243  * `INSERT INTO system.lease VALUES (<desc.ID>, <v>, <nodeID>, <lease.expiration>)`
   244  
   245  ## Correctness
   246  
   247  It is valuable to consider various scenarios to check lease usage correctness.
   248  Assume `SC` is the timestamp of the latest schema change descriptor modification
   249  time, while `L` is the timestamp of a transaction that has acquired a lease:
   250  * `L < SC`: The lease will contain a table descriptor with the previous version
   251  and will write a row in the lease table using timestamp `L` which will be seen
   252  by the schema change which uses a timestamp `SC`. As long as the lease is not
   253  released another schema change cannot use a `timestamp <= lease.expiration`
   254  * `L > SC`: The lease will read the version of the table descriptor written by the
   255  schema change.
   256  * `L == SC`: If the lease reads the descriptor first, the schema change will see
   257  the read in the read timestamp cache and will get restarted. If the schema
   258  change writes the descriptor at the new version first, the lease will read the new
   259  descriptor and create a lease with the new version.
   260  
   261  Temporary validity window for a leased table descriptor is either one of:
   262  1. `[ModificationTime, D)`: where `D` is the timestamp of the lease
   263  release transaction. Since a transaction with timestamp `T` using a lease
   264  and a release transaction originate on the same node, the release follows
   265  the last transaction using the lease, `T < D` is always true.
   266  2. `[ModificationTime, hlc.Timestamp(lease.expiration))` is valid
   267  because the actual stored table version during the window is guaranteed
   268  to be at most off by 1.
   269  
   270  Note that two transactions with timestamp T~1~ and T~2~ using versions
   271  `v` and `v+2` respectively that touch some data in common, are guaranteed
   272  to have a serial execution order T~1~ < T~2~. This property is important
   273  when we want to positively guarantee that one transaction sees the effect
   274  of another.
   275  
   276  Note that the real time at which a transaction commits will be different
   277  from the wall time in its database timestamp. On an idle range, transactions
   278  may be allowed to commit with timestamps far in the past (although
   279  the read timestamp cache will not permit writing with a very old timestamp).
   280  The expiration of a table descriptor lease does not imply that all
   281  transactions using that lease have finished. Even if a transaction commits
   282  later in time, CRDB guarantees serializability of transactions thereby
   283  sometimes aborting old transactions that attempting to write using an
   284  old timestamp.
   285  
   286  Examples of transactions that need serial execution that use
   287  version `v` and `v+2`:
   288  1. A transaction attempts to DELETE a row using a descriptor without
   289  an index, and commits after the row is being acted on by an UPDATE
   290  seeing an index in the WRITE_ONLY state. The DELETE is guaranteed
   291  to see the UPDATE and be aborted, or the UPDATE sees the delete tombstone.
   292  2. A transaction attempts to run a DELETE on a table in the DELETE_ONLY
   293  state and the transaction commits during the backfill. The DELETE is
   294  guaranteed to be seen by the backfill, or aborted.
   295  3. A transaction attempts to run an UPDATE on a table with an index in
   296  the WRITE_ONLY state and the transaction commits when the index is readable
   297  via a SELECT. The UPDATE is either guaranteed to be seen by the SELECT,
   298  or be aborted.
   299  
   300  # Accommodating schema changes within transactions
   301  
   302  A node acquires a lease on a table descriptor using a transaction created
   303  for this purpose (instead of using the transaction that triggered the
   304  lease acquisition), and the transaction triggering the lease acquisition
   305  must take further precautions to prevent hitting a deadlock with
   306  the node's lease acquiring transaction. A transaction that runs a
   307  CREATE TABLE followed by other operations on the table will hit
   308  a deadlock situation where the table descriptor hasn't
   309  yet been committed while the node is trying to acquire a lease
   310  on the table descriptor using a separate transaction. The commands
   311  following the CREATE TABLE trying to acquire a table lease
   312  will block on their own transaction that has written out a
   313  new uncommitted table.
   314  
   315  A similar situation happens when a table exists but a node
   316  has no lease on the table, and a transaction runs a schema change
   317  that modifies the table without incrementing the version, and
   318  subsequently runs other commands referencing the table.
   319  Care has to be taken to first acquire a table lease before running
   320  the transaction. While it is possible to acquire the lease in this
   321  way before running an ALTER TABLE it is not possible to do the same
   322  in the CREATE TABLE case.
   323  
   324  Commands within a transaction would like to see the schema
   325  changes made within the transaction reducing the chance of
   326  user surprise. Both this requirement and the deadlock
   327  prevention requirement discussed above are solved through a
   328  solution where table descriptors modified within a transaction
   329  are cached specifically for the use of the transaction, with the
   330  transaction not needing a lease for the table.
   331  
   332  # Drawbacks
   333  
   334  * The lack of a central authority for a lease places additional stress
   335    on the correct implementation of the transactions to acquire a lease
   336    and publish a new version of a descriptor.
   337  
   338  # Alternatives
   339  
   340  * Earlier versions of this proposal utilized a centralized lease
   341    service. Such a service has some conceptual niceties (a single
   342    authority for managing the lease state of a table), yet introduces
   343    another service that must be squeezed into the system. Such a lease
   344    service would undoubtedly store state in the KV layer as well. Given
   345    that the KV layer provides robust transactions the benefit is
   346    smaller than it might otherwise have been.
   347  
   348  * We could use an existing lock service such as etcd or Zookeeper. The
   349    primary benefit would be the builtin watch functionality, but we can
   350    get some of that functionality from gossip. We would still need the
   351    logic for local reference counts.
   352  
   353  * Keeping track of local references to descriptor versions in order to
   354    early release leases adds complexity. We could just wait for leases
   355    to expire, though that would cause a 3-step schema modification to
   356    take at least 10m to complete.
   357  
   358  # Unresolved questions
   359  
   360  * Gossip currently introduces a 2s/hop delay in transmitting gossip
   361    info. It would be nice to figure out how to introduce some sort of
   362    "high-priority" flag for gossipping of descriptor version info to
   363    reduce the latency in notifying nodes of a new descriptor version.
   364  
   365  * This RFC doesn't address the full complexity of table descriptor
   366    schema changes. For example, when adding an index the node
   367    performing the operation might die while backfilling the index
   368    data. We'll almost certainly want to parallelize the backfilling
   369    operation. And we'll want to disallow dropping an index that is
   370    currently being added. These issues are considered outside of the
   371    scope of this RFC.