- Feature Name: Full Cluster Backup/Restore
- Status: completed
- Start Date: 2019-12-02
- Authors: Paul Bardea
- RFC PR: #[42887](https://github.com/cockroachdb/cockroach/pull/42887)
- Cockroach Issue:
  #[44814](https://github.com/cockroachdb/cockroach/issues/44814)

# Summary

Users should be able to `BACKUP` and `RESTORE` all relevant information stored
in their cluster - namely, the relevant metadata stored in system tables.

# Motivation

Currently, only user data can be backed up and restored - along with very
limited metadata (table statistics, if requested). There is no mechanism for a
user to easily restore their entire cluster as it appeared at the time of a
backup.

# Guide-level explanation

This RFC builds on the original [Backup &
Restore](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20160720_backup_restore.md)
functionality and extends it so that a backup can capture all logical data
stored in the cluster. A new syntax is introduced to perform a full cluster
backup and restore: `BACKUP TO [...]` and `RESTORE FULL CLUSTER FROM [...]`.

Additionally, incremental cluster backups are supported:
```sql
> BACKUP TO 'nodelocal:///cluster-backup/1';
> BACKUP TO 'nodelocal:///cluster-backup/2' INCREMENTAL FROM 'nodelocal:///cluster-backup/1';
```
A user can create an incremental cluster backup by providing a full cluster
backup to build on and, optionally, additional incremental backups (as is the
case for non-cluster backups); every backup listed must itself be a full
cluster backup. Incremental cluster backups can be restored in the usual way:
`RESTORE FROM 'nodelocal:///cluster-backup/1', 'nodelocal:///cluster-backup/2'`.

A full cluster RESTORE can only be performed on a fresh cluster with no user
data. Some of the data in the system tables may already be set (for example,
the `cluster.organization` setting must be set in order to even use this
feature). However, it should be noted that this data will be overwritten by
performing a full cluster RESTORE.
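
For example, a fresh cluster might contain only the enterprise settings needed
to run RESTORE in the first place; a sketch of such a setup is below (the
license value is a placeholder, and any settings written here are later
overwritten by the restored `system.settings`):

```sql
-- Settings typically present on an otherwise fresh cluster before a full
-- cluster restore. The license value below is a placeholder, not a real key.
SET CLUSTER SETTING cluster.organization = 'Example Org';
SET CLUSTER SETTING enterprise.license = 'crl-0-<placeholder>';

-- These rows live in system.settings and are overwritten when the restore
-- replaces the contents of that table.
RESTORE FULL CLUSTER FROM 'nodelocal:///cluster-backup/1';
```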

A full cluster BACKUP/RESTORE could be thought of as performing the following steps:
```sql
/* Full Cluster Backup.
   There are no semantics to back up all user tables, so assume the user databases are: database_a, database_b, [...].
   BACKUP also does not currently support backing up entire databases and individual tables in the same statement, but only a subset of the system tables should be backed up. */
BACKUP DATABASE database_a, database_b, [...], system TO 'nodelocal:///cluster-backup/1';

/* Full Cluster Restore */
CREATE DATABASE crdb_system_temporary;
RESTORE system.* FROM 'nodelocal:///cluster-backup/1' WITH into_db='crdb_system_temporary';

/* Restore the user data. */
RESTORE DATABASE database_a, database_b, [...] FROM 'nodelocal:///cluster-backup/1';

/* Restore the system tables. */
BEGIN;
DELETE FROM system.users WHERE true;
INSERT INTO system.users (SELECT * FROM crdb_system_temporary.users);
COMMIT;

BEGIN;
DELETE FROM system.settings WHERE true;
INSERT INTO system.settings (SELECT * FROM crdb_system_temporary.settings);
COMMIT;

[...]
```

Not all system tables should be included in a backup since some information
relates to the physical properties of a cluster. The existing system tables
have been audited below. New system tables will need to be added to the list
of system tables that should be included in a backup. This will initially be a
list of system table names maintained inside the `backupccl` package.

### System Tables
| Table Name | Description | Included | Notes |
|---|---|---|---|
| namespace | Provides the relationship between parentID <-> descriptor name <-> descriptor ID | No | This information should be generated by the restoring cluster |
| descriptor | Maps ID <-> descriptor proto | No | New descriptors should be made for every RESTOREd table. |
| users | Stores the users in the cluster. | Yes | |
| zones | Stores the zone config protos | Yes | |
| settings | Stores all the cluster settings | Yes | |
| leases | Table leases | No | Leases held in the old cluster are no longer relevant. |
| eventlog | A log of a variety of events (schema changes, node additions, etc.) | No | Most events are not node-specific and would be useful to back up. This may produce confusing output if restored into a cluster with a different number of nodes. See Future work. |
| rangelog | Range-level events. | No | Ranges on the old and new cluster will not match. |
| ui | A set of KV pairs used by the UI | Yes | |
| jobs | A list of all jobs that are running or have run. | Yes | |
| web\_sessions |  | No | This could eventually be moved into the backup. Unclear. |
| table\_statistics | | Yes | This information is currently backed up in the BACKUP manifest to BACKUP and RESTORE table statistics on a per-table level. |
| locations | Stores information about the localities. | Yes |  |
| role\_members | Contains role-role and user-role relationships for RBAC | Yes |  |
| comments | Stores up to 1 string comment per object ID | Yes | |
| replication\_* | | No | Replication stats should be regenerated when the data is RESTORED. |
| reports\_meta | | No | Same as replication\_*. |
| protectedts\_* | As proposed by the [protected timestamp RFC](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20191009_gc_protected_timestamps.md) | No | RESTORE only restores a snapshot of the data in the backup, not the entire MVCC history. |

There is no information in the system ranges that should be included in a
CLUSTER backup, since it all relates to properties of the ranges/nodes.

# Reference-level explanation

This RFC assumes that a cluster restore will occur on a fresh cluster with no
user data. This allows the data to be restored _exactly_ as it appeared in the
backup. Otherwise, it would be necessary to re-key the user tables as well as
the **content** inside the system tables which references table IDs/KV spans
(such as zone configs). Note: this implies that the behavior of any
interaction with the restoring cluster is undefined until the restore
succeeds. This may be extended in the future; see the Future work section for
more details.

Additionally, incremental cluster backups and their restoration are supported
using the same syntax as the existing `BACKUP`. In addition to checking that
the previous backups cover the necessary span of the keyspace and time, a
check must be added to `backupPlanHook` to verify that every backup that this
incremental backup builds upon is also a cluster backup. Additionally, a full
cluster restore should only be permitted from full cluster backups. Therefore
it is necessary to add a flag to the backup manifest (`BackupDescriptor`)
indicating whether or not a given backup is a full cluster backup. The primary
reason for this bit is to ensure that full cluster restore can only restore
full cluster backup files.
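
As an illustration of these checks (the URIs and database name are
placeholders), the planner would accept an incremental cluster backup whose
base is itself a cluster backup and reject one whose base only covered
specific targets:

```sql
-- Accepted: the base at .../cluster-backup/1 was created by a full cluster
-- BACKUP, so the incremental backup can also be a cluster backup.
BACKUP TO 'nodelocal:///cluster-backup/2'
    INCREMENTAL FROM 'nodelocal:///cluster-backup/1';

-- Rejected by the proposed check: the base backup only covered a single
-- database, so its manifest is not marked as a full cluster backup.
BACKUP DATABASE bank TO 'nodelocal:///bank-backup/1';
BACKUP TO 'nodelocal:///cluster-backup/3'
    INCREMENTAL FROM 'nodelocal:///bank-backup/1';
```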

## Interfaces

Users will mainly interact with this new feature through the new syntax
introduced: `BACKUP FULL CLUSTER TO [...]` and `RESTORE FULL CLUSTER FROM
[...]`.

Additionally, incremental cluster backups are supported:
```sql
> BACKUP FULL CLUSTER TO 'nodelocal:///cluster-backup/1';
> BACKUP FULL CLUSTER TO 'nodelocal:///cluster-backup/2' INCREMENTAL FROM 'nodelocal:///cluster-backup/1';
```

These backups can be restored: `RESTORE FULL CLUSTER FROM
'nodelocal:///cluster-backup/1', 'nodelocal:///cluster-backup/2'`. Every backup
listed must be a full-cluster backup.

This new syntax introduces a new target, FULL CLUSTER, which can be used
instead of specifying particular databases/tables to be restored. Replacing
the usual list of targets with the new FULL CLUSTER target should not result
in any UX surprises.
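
A sketch of the difference between the existing targeted form and the new
cluster-level form (the database and URI names are illustrative):

```sql
-- Existing, targeted restore: the user lists databases or tables.
RESTORE DATABASE bank FROM 'nodelocal:///bank-backup/1';

-- New, cluster-level restore: FULL CLUSTER replaces the target list and
-- restores the user data plus the system tables included in the backup.
RESTORE FULL CLUSTER FROM 'nodelocal:///cluster-backup/1';
```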

A user can then examine a full cluster backup using the `SHOW BACKUP` command
(`start_time` and `end_time` are omitted from this example for brevity):
```sql
root@:26258/default_db> SHOW BACKUP 'nodelocal://1/full-cluster';
  database_name | table_name | size_bytes | rows | is_full_cluster |
+---------------+------------+------------+------+-----------------+
  some_user_db  | foo        |          0 |    0 |      true       |
  system        | zones      |        252 |    0 |      true       |
  system        | users      |         99 |    0 |      true       |
  ...
```

This command shows the user what type of metadata is stored in the backup.
Since users must specify only full cluster backups to build incremental
backups, this allows users to inspect a backup to check what cluster
information is stored.

With regards to user-visible errors introduced by this feature, users can
expect to see an error when:
- They create a full cluster incremental backup on top of a non-full cluster
  backup.
- They perform a full cluster restore in a cluster with existing user data
  (there may be table/database ID collisions, which will not be handled). As
  described, a check will be performed ensuring that no user tables/databases
  have been created (practically, this means ensuring that no descriptors
  exist with an ID greater than or equal to `MinNonPredefinedUserDescID`);
  see the sketch below.
- They attempt to perform a full cluster restore from a non-full cluster
  backup.

Note: it is expected that users will be able to perform a non-full cluster
RESTORE from a full-cluster BACKUP.
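
A rough, manual approximation of the fresh-cluster check is sketched below.
The literal `50` is only an illustrative stand-in for
`MinNonPredefinedUserDescID`, whose actual value is defined in the `keys`
package:

```sql
-- List any descriptors in the user ID range. On a cluster that is eligible
-- for a full cluster restore, this query should return no rows.
-- (50 is an illustrative placeholder for MinNonPredefinedUserDescID.)
SELECT id, name FROM system.namespace WHERE id >= 50;
```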

## Detailed design

## Backup

The first difference between a full cluster backup and a regular (non-full
cluster) backup is that a full cluster backup includes all user tables in the
backup. This can be accomplished by including all tables -- as defined by
enumerating the descriptors table -- except for the set explicitly excluded as
defined above.

Additionally, all OFFLINE tables need to be included in a BACKUP (they are not
today). This ensures that in-progress jobs are able to continue after a full
cluster restoration. See the Jobs section below.
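
For reference, offline descriptors are visible through `crdb_internal`; a
sketch of a query showing which tables this affects (assuming the `state`
column exposed there) is:

```sql
-- Tables that are mid-schema-change or mid-restore appear as OFFLINE and,
-- under this proposal, must still be written to the backup.
SELECT database_name, name, state
FROM crdb_internal.tables
WHERE state = 'OFFLINE';
```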

Finally, the backup manifest (`BackupDescriptor` protobuf) needs to be
augmented with an enum specifying the amount of cluster information stored in
the backup. An enum `DescriptorCoverage` will be added to the
`BackupDescriptor` with two options: `RequestedDescriptors`, which is the
default and is what existing backups will have going forward, and
`AllDescriptors` for full cluster backups. This enum is required to prevent a
full cluster restore from being performed from a non-full cluster backup file.
In particular, this requirement exists because full cluster RESTORE guarantees
that the entire cluster has been RESTOREd (so we need the entire cluster to be
in the backup file).

## Restore

Upon a full cluster restore, the order in which data is restored becomes
relevant. In particular, `system.zones` must be restored prior to restoring
the user data in order to ensure that the user data is placed in the
appropriate locality, where applicable. The user data will then be restored,
and finally the rest of the system tables.

First, a check is performed to ensure that no user data exists in this
cluster. This is achieved by ensuring that no descriptors exist with an ID
greater than or equal to `MinNonPredefinedUserDescID`. This check also ensures
that no other full cluster restore is in progress, since a full cluster
restore creates the `crdb_system_temporary` database in the user descriptor ID
space. Then the `DescIDGenerator` needs to be restored. This key is used to
determine the value of the next descriptor ID (such as during the creation of
a table or database) and is incremented whenever a descriptor is created. Let
`MaxID` be the maximum descriptor ID found in the backup; the
`DescIDGenerator` should then be set to `MaxID + 1` so that new descriptors
can be created after the restore with correct IDs.
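
A sketch of a post-restore sanity check for the ID counter (the table name is
illustrative): any descriptor created after the restore should receive an ID
strictly greater than every ID that came from the backup.

```sql
-- Largest descriptor ID that came over from the backup.
SELECT max(id) AS max_restored_id FROM system.descriptor;

-- A table created after the restore should be assigned an ID above
-- max_restored_id, since DescIDGenerator was bumped to MaxID + 1.
CREATE TABLE post_restore_check (k INT PRIMARY KEY);
SELECT id FROM system.namespace WHERE name = 'post_restore_check';
```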

System tables cannot be restored in the same way as user data tables since
they occupy a fixed keyspace (and thus cannot be re-keyed as we do today for
new tables). First, we restore the system tables into a temporary database.
The `DescIDGenerator` key must be updated prior to creating this temporary
database to ensure that it does not conflict with a user table that needs to
be restored (thus the ID of this database will be `MaxID + 1` and the
`DescIDGenerator` will be incremented again).
```sql
CREATE DATABASE crdb_system_temporary;
RESTORE system.* FROM 'nodelocal://1/full-backup/1' WITH into_db='crdb_system_temporary';
```

Then, in an internal executor, execute:
```sql
BEGIN;
DELETE FROM system.zones WHERE true;
INSERT INTO system.zones (SELECT * FROM crdb_system_temporary.zones);
COMMIT;
```

Before restoring the user data, we need to ensure that all the user table and
database descriptors are created with the same IDs as they had in the backup.
This differs from the current implementation, which generates a new ID for
these items. This allows for a potential future optimization to skip the no-op
rekeying. User tables can then be restored normally.

Finally, to restore the remainder of the system tables, perform a transaction
similar to the one listed above, but rather than only restoring the zones
table, restore the rest of the system tables. It is preferable to restore all
of the remaining system tables in one transaction in order to ensure atomicity
across the restoration of all the system tables. However, there may be a
limitation based on the maximum transaction size, in which case the
possibility of restoring the system tables one by one could be investigated.
That said, the maximum size of a transaction is quite large and is _not_
expected to cause issues.

Once the remaining system tables have been restored, the temporary
`crdb_system_temporary` database is deleted.
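
Putting these last two steps together, the tail end of the restore job roughly
corresponds to the following (the table list is abbreviated, not exhaustive):

```sql
-- Swap in the remaining system tables in a single transaction.
BEGIN;
DELETE FROM system.users WHERE true;
INSERT INTO system.users (SELECT * FROM crdb_system_temporary.users);
DELETE FROM system.settings WHERE true;
INSERT INTO system.settings (SELECT * FROM crdb_system_temporary.settings);
-- ... the rest of the system tables included in the backup ...
COMMIT;

-- Clean up the staging database.
DROP DATABASE crdb_system_temporary CASCADE;
```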

### Jobs

During a cluster backup, a job may be in progress. The state for these jobs is
persisted in the user data and in the `system.jobs` table. These jobs will be
restored into a running state, and nobody will hold a lease on them, so they
should be adopted and then continued. For a job to be able to be continued,
all OFFLINE tables need to be included in the BACKUP.
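
As a sketch (not part of the proposal), an operator could confirm that
restored jobs have been picked up again with something like:

```sql
-- Jobs restored from the backup should show up again and return to a
-- running state once a node adopts them.
SHOW JOBS;
SELECT id, status FROM system.jobs WHERE status IN ('pending', 'running');
```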

## Locality Awareness

The current implementation of locality-aware BACKUPs should continue to work
with cluster backup without further work. BACKUP for the system tables will
operate just as it does for the user-data tables, and the relevant
leaseholders will back up to the appropriate locality.
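
For reference, a locality-aware cluster backup would look the same as it does
for table backups today, e.g. (URIs and locality values are illustrative):

```sql
-- Each node backs up the ranges it holds leases for to the URI whose
-- COCKROACH_LOCALITY matches its locality; 'default' catches everything else.
BACKUP TO (
    'nodelocal:///backups/default?COCKROACH_LOCALITY=default',
    'nodelocal:///backups/east?COCKROACH_LOCALITY=region%3Dus-east',
    'nodelocal:///backups/west?COCKROACH_LOCALITY=region%3Dus-west'
);
```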

## Failure Modes

### General Restore Failure/Cancellation

The happy path for a full cluster restore is when the restore is started and
all nodes remain available until the restore is complete.

A non-cluster restore creates the tables for the user data at the start of the
restore. These tables are in an OFFLINE state - inaccessible to the user[1].
If there is a failure during the restore, these tables are marked as DROP and
will be removed. Full cluster restore can recover the user-data tables that it
restored in this way as well. The difficulty lies in handling the system
tables that have already been restored. In practice this will likely only be
the `system.zones` table, since the remainder of the system tables will be
restored in a single transaction near the end of the job; however, the general
case is considered.

Since a full cluster restore must be run on a fresh cluster, the first
iteration of full cluster restore could require the cluster to be destroyed if
the restore fails. This can likely be improved as detailed in the Future Work
section.

# Alternatives Considered

## System Table Restorations

### AddSSTable and TimeBoundedDelete

One reasonable question is why CockroachDB doesn't load the system data the
same way as the user-data tables. One difficulty that this would present is
that user-data restoration happens on new tables, but the system tables in the
new cluster already have data in them. This method would directly ingest the
SSTables for the system table spans, then issue a `TimeBoundedDelete`. Such a
command does not yet exist, but it could be implemented by leveraging
`engine.MVCCClearTimeRange`, similarly to the `batcheval.RevertRange` command.
This leaves the possibility of a potentially dirty state in the system tables.
Additionally, the keys in the SSTs would need to have their timestamps updated
to some time greater than the start time of the restore.

The reason we take the approach of loading SSTs directly into the storage
layer for user data is that we typically expect a large volume of data.
Additionally, we can ensure that this data is not needed or accessed by the
user while it is being loaded. Since the size of the system tables is expected
to be much smaller than the size of the user-data tables, this approach offers
no advantages here and is more complex.

## Cluster Info Metadata

### Only Look at Backup Contents
Instead of marking a particular backup as "full cluster", the system tables
that it holds could be examined. This would allow previous backups that
included the system table information to be restored via a full cluster
restore. One problem with this approach is that if new system tables are added
to the list of expected system tables, a mapping between version numbers and
the tables expected to be included in each version would need to be
maintained. This problem is avoided by marking backups as full cluster, since
we can then assume that all system tables included in those backups are safe
to restore (and override the existing ones).

# Future Work

- As mentioned, it may be possible to enable cluster restoration on a non-new
  cluster - however, this does raise further complications. Since it seems
  that the vast majority of use cases for this feature are to restore a
  cluster exactly as it was in the backup, there is little motivation to
  generalize cluster restoration in this way. In particular, this would
  require the contents of the system tables to be re-keyed (in addition to the
  user-data KVs themselves). This would require each system table to provide a
  way to map each of its rows to an updated row based on a re-keyer.

- One large remaining piece of work is how to handle the case where the
  metadata of the restoring cluster does not match that of the cluster in the
  backup. For example, if the cluster on which the BACKUP was performed has a
  given set of localities which do not exist on the cluster that is being
  restored, there is currently no way for the user to map the localities from
  the backed up cluster to the values they should be changed to in the
  restoring cluster. Currently, since all BACKUP and RESTORE interactions
  happen at the CLI, a major difficulty is providing a powerful enough
  interface for the user to provide these mappings.

- Include `system.eventlog` in a full cluster backup. One reason for not doing
  this initially is that some event logs may be nonsensical if the table is
  restored in a cluster with a different number of nodes.

- Additionally, one further improvement would be to allow the restoration of a
  set of tables/databases with their respective configuration. This requires
  that the RESTORE process find which rows in the system tables are applicable
  to the given database/table. This also implies that we'd need to add the
  ability to rewrite values in other tables. This is out of scope of this RFC.

- A potential improvement to ensure that the cluster is in a fresh state would
  be to mark a cluster for restoration at creation time (similar to `cockroach
  init`). This would also prevent any operations from interfering with the
  restore.

- A more graceful failure mode could be implemented which ensures that the
  cluster's state is healthy in the case of a failed full cluster RESTORE.

- The initial implementation will not consider what happens if there is a
  failure in the middle of the backup; it will clean up the data following the
  normal backup procedures. In the case that there is a failure while updating
  the system tables, the cluster should be started up again. Since we enforce
  that the cluster we are restoring to has no user data, this is acceptable.

# Drawbacks

Due to the restriction that a full cluster restore can only be performed on a
newly created cluster, some users may be surprised when they try to perform a
full cluster restore and this assumption is violated.


[1] Offline tables can, however, be referenced when setting zone configs. See
#40285.