- Feature Name: Full Cluster Backup/Restore
- Status: completed
- Start Date: 2019-12-02
- Authors: Paul Bardea
- RFC PR: [#42887](https://github.com/cockroachdb/cockroach/pull/42887)
- Cockroach Issue: [#44814](https://github.com/cockroachdb/cockroach/issues/44814)

# Summary

Users should be able to `BACKUP` and `RESTORE` all relevant information stored
in their cluster - namely relevant information stored in system tables.

# Motivation

Currently, only user data can be backed up and restored, along with very
limited metadata (table statistics, if requested). There is no mechanism for a
user to easily restore their entire cluster as it appeared at the time of a
backup.

# Guide-level explanation

This RFC builds on the original [Backup &
Restore](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20160720_backup_restore.md)
functionality and extends it to include all logical data stored in the
cluster. A new syntax is introduced to perform a full cluster backup and
restore: `BACKUP TO [...]` and `RESTORE FULL CLUSTER FROM [...]`.

Additionally, incremental cluster backups are supported:
```sql
> BACKUP TO 'nodelocal:///cluster-backup/1';
> BACKUP TO 'nodelocal:///cluster-backup/2' INCREMENTAL FROM 'nodelocal:///cluster-backup/1';
```
A user can create an incremental cluster backup, but they must also provide a
full cluster backup and, optionally, additional incremental backups (as is the
case for non-cluster backups); every backup listed must itself be a full
cluster backup. Incremental cluster backups can be restored in the usual way:
`RESTORE FROM 'nodelocal:///cluster-backup/1', 'nodelocal:///cluster-backup/2'`.

A full cluster RESTORE can only be performed on a fresh cluster with no user
data. Some of the data in the system tables may already be set (for example,
the `cluster.organization` setting must be set in order to even use this
feature), but it should be noted that this data will be modified by performing
a full cluster RESTORE.

A full cluster BACKUP/RESTORE can be thought of as performing the following
steps:
```sql
/* Full Cluster Backup */
/* There is no syntax to back up all user tables; assume all user databases
   are: database_a, database_b, [...]. The current BACKUP also does not
   support backing up entire databases together with individual tables, but
   only a subset of the system tables should be backed up. */
BACKUP DATABASE database_a, database_b, [...], system TO 'nodelocal:///cluster-backup/1';

/* Full Cluster Restore */
CREATE DATABASE crdb_system_temporary;
RESTORE system.* FROM 'nodelocal:///cluster-backup/1' WITH into_db='crdb_system_temporary';

/* Restore the user data. */
RESTORE DATABASE database_a, database_b, [...] FROM 'nodelocal:///cluster-backup/1';

/* Restore the system tables. */
BEGIN;
DELETE FROM system.users WHERE true;
INSERT INTO system.users (SELECT * FROM crdb_system_temporary.users);
COMMIT;

BEGIN;
DELETE FROM system.settings WHERE true;
INSERT INTO system.settings (SELECT * FROM crdb_system_temporary.settings);
COMMIT;

[...]
```

Not all system tables should be included in a backup, since some information
relates to the physical properties of a cluster. The existing system tables
have been audited below. New system tables will need to be added to the list
of system tables that should be included in a backup. This will initially be a
list of names of system tables maintained inside the `backupccl` package.
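
As a rough sketch of what that list could look like (the variable name and
layout here are illustrative, not the actual `backupccl` code), the allowlist
simply mirrors the "Included" column of the audit table below:

```go
package backupccl

// fullClusterSystemTables is a sketch of the allowlist of system tables whose
// rows are copied into a full cluster backup. It mirrors the "Included"
// column of the audit table below; tables describing physical cluster state
// (leases, rangelog, etc.) are deliberately omitted.
var fullClusterSystemTables = []string{
	"users",
	"zones",
	"settings",
	"ui",
	"jobs",
	"table_statistics",
	"locations",
	"role_members",
	"comments",
}
```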

### System Tables

| Table Name | Description | Included | Notes |
|---|---|---|---|
| namespace | Provides the relationship between parentID <-> descriptor name <-> descriptor ID | No | This information should be generated by the restoring cluster. |
| descriptor | Maps ID <-> descriptor proto | No | New descriptors should be made for every RESTOREd table. |
| users | Stores the users in the cluster. | Yes | |
| zones | Stores the zone config protos. | Yes | |
| settings | Stores all the cluster settings. | Yes | |
| leases | Table leases. | No | Leases held in the old cluster are no longer relevant. |
| eventlog | A log of a variety of events (schema changes, node additions, etc.). | No | Most events are not node-specific and would be useful to back up, but this may produce confusing output if restored into a cluster with a different number of nodes. See Future work. |
| rangelog | Range-level events. | No | Ranges on the old and new cluster will not match. |
| ui | A set of KV pairs used by the UI. | Yes | |
| jobs | A list of all jobs that are running or have run. | Yes | |
| web\_sessions | | No | This could eventually be moved into the backup. Unclear. |
| table\_statistics | | Yes | This information is currently backed up in the BACKUP manifest in order to BACKUP and RESTORE table statistics on a per-table level. |
| locations | Stores information about the localities. | Yes | |
| role\_members | Contains role-role and user-role relationships for RBAC. | Yes | |
| comments | Stores up to 1 string comment per object ID. | Yes | |
| replication\_* | | No | Replication stats should be regenerated when the data is RESTOREd. |
| reports\_meta | | No | Same as above. |
| protectedts\_* | As proposed by the [protected timestamp RFC](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20191009_gc_protected_timestamps.md). | No | RESTORE only restores a snapshot of the data in the backup, not the entire MVCC history. |

There is no information in the system ranges that should be included in a
cluster backup since it all relates to properties of the ranges/nodes.

# Reference-level explanation

This RFC assumes that a cluster restore will occur on a fresh cluster with no
user data. This allows the data to be restored _exactly_ as it appeared in the
backup. Otherwise, it would be necessary to re-key the user tables as well as
the **content** inside the system tables which references table IDs/KV spans
(such as zone configs). Note: this implies that behavior resulting from
interactions with the restoring cluster is undefined until the restore
succeeds. This may be extended in the future; see the Future work section for
more details.

Additionally, incremental cluster backups and restores are supported using the
same syntax as the existing `BACKUP`. In addition to checking that the
previous backups cover the necessary span of the keyspace and time, a check
must be added to `backupPlanHook` to verify that every backup that this
incremental backup builds upon is also a cluster backup. Similarly, a full
cluster restore should only be permitted on full cluster backups. It is
therefore necessary to add a flag to the backup manifest (`BackupDescriptor`)
indicating whether or not a given backup is a full cluster backup. The primary
reason for this bit is to ensure that a full cluster restore can only restore
full cluster backup files.
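
A minimal sketch of that planning-time check, assuming the manifest carries a
simple full-cluster marker (the type and field names below are stand-ins, not
the actual `BackupDescriptor` fields):

```go
package backupccl

import "errors"

// backupManifest stands in for the backup manifest (BackupDescriptor); only
// the hypothetical full-cluster marker is shown.
type backupManifest struct {
	IsFullCluster bool
}

// checkFullClusterChain verifies that every backup an incremental full
// cluster backup builds upon is itself a full cluster backup. The same kind
// of check gates a full cluster RESTORE.
func checkFullClusterChain(prevBackups []backupManifest) error {
	for _, m := range prevBackups {
		if !m.IsFullCluster {
			return errors.New(
				"a full cluster backup may only be incremental on top of other full cluster backups")
		}
	}
	return nil
}
```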

## Interfaces

Users will mainly interact with this new feature through the newly introduced
syntax: `BACKUP FULL CLUSTER TO [...]` and `RESTORE FULL CLUSTER FROM [...]`.

Additionally, incremental cluster backups are supported:
```sql
> BACKUP FULL CLUSTER TO 'nodelocal:///cluster-backup/1';
> BACKUP FULL CLUSTER TO 'nodelocal:///cluster-backup/2' INCREMENTAL FROM 'nodelocal:///cluster-backup/1';
```

These backups can be restored with `RESTORE FULL CLUSTER FROM
'nodelocal:///cluster-backup/1', 'nodelocal:///cluster-backup/2'`. Every backup
listed must be a full cluster backup.

This syntax introduces a new target, FULL CLUSTER, which can be used instead
of specifying particular databases/tables to be backed up or restored.
Replacing the usual targets with the new FULL CLUSTER target should not result
in any UX surprises.

A user can then examine a full cluster backup using the `SHOW BACKUP` command
(`start_time` and `end_time` are omitted from this example for brevity):
```sql
root@:26258/default_db> SHOW BACKUP 'nodelocal://1/full-cluster';
  database_name | table_name | size_bytes | rows | is_full_cluster |
+---------------+------------+------------+------+-----------------+
  some_user_db  | foo        |          0 |    0 | true            |
  system        | zones      |        252 |    0 | true            |
  system        | users      |         99 |    0 | true            |
...
```

This command shows the user what type of metadata is stored in the backup.
Since users may only build incremental backups on top of full cluster backups,
this allows them to inspect a backup and check what cluster information it
stores.

With regard to user-visible errors introduced by this feature, users can
expect to see an error when:
- They create a full cluster incremental backup on top of a non-full cluster
  backup.
- They perform a full cluster restore in a cluster with existing user data
  (there may be table/database ID collisions, which will not be handled). As
  described, a check will be performed ensuring that no user tables/databases
  have been created (practically, this means ensuring that no descriptors
  exist with an ID greater than or equal to `MinNonPredefinedUserDescID`).
- They attempt to perform a full cluster restore from a non-full cluster
  backup.

Note: it is expected that users will be able to perform a non-full cluster
RESTORE from a full cluster BACKUP.

## Detailed design

## Backup

The first difference between a full cluster backup and a regular (non-full
cluster) backup is that a full cluster backup includes all user tables in the
backup. This can be accomplished by including all tables -- as defined by
enumerating the descriptors table -- except for the set explicitly excluded as
defined above.

Additionally, all OFFLINE tables need to be included in a BACKUP (they are not
today). This ensures that in-progress jobs are able to continue after a full
cluster restoration. See the Jobs section below.
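
For illustration, target resolution for a full cluster backup could look
roughly like the following sketch (the descriptor type and helper below are
simplified stand-ins for the real descriptor plumbing, not actual `backupccl`
code):

```go
package backupccl

// descriptorInfo is a simplified stand-in for a table or database descriptor.
type descriptorInfo struct {
	ID       uint32
	Name     string
	IsSystem bool
}

// fullClusterTargets sketches how the descriptors for a full cluster backup
// could be chosen: every user descriptor is kept (including OFFLINE tables,
// which regular backups skip today, so that in-progress jobs can resume after
// a restore), while system tables are filtered through the allowlist.
func fullClusterTargets(all []descriptorInfo, systemAllowlist map[string]bool) []descriptorInfo {
	var targets []descriptorInfo
	for _, desc := range all {
		if desc.IsSystem {
			if systemAllowlist[desc.Name] {
				targets = append(targets, desc)
			}
			continue
		}
		targets = append(targets, desc)
	}
	return targets
}
```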

Finally, the backup manifest (`BackupDescriptor` protobuf) needs to be
augmented with an enum specifying the amount of cluster information stored in
the backup. An enum `DescriptorCoverage` will be added to the
`BackupDescriptor` with two options: `RequestedDescriptors`, which is the
default and is what existing backups will have going forward, and
`AllDescriptors` for full cluster backups. This enum is required to prevent a
full cluster restore from being performed on a non-full cluster backup file.
In particular, this requirement exists because a full cluster RESTORE
guarantees that the entire cluster has been RESTOREd (so we need the entire
cluster to be in the backup file).

## Restore

Upon a full cluster restore, the order in which data is restored becomes
relevant. In particular, `system.zones` must be restored prior to restoring
the user data in order to ensure that the user data is placed in the
appropriate locality, if applicable. The user data is then restored, and
finally the rest of the system tables.

First, a check is performed to ensure that no user data exists in the cluster.
This is achieved by ensuring that no descriptors exist with an ID greater than
or equal to `MinNonPredefinedUserDescID`. This check also ensures that no
other full cluster restore is in progress, as a full cluster restore would
create a `crdb_system_temporary` database in the user descriptor ID space.
Then the `DescIDGenerator` needs to be restored. This key determines the value
of the next descriptor ID (such as during the creation of a table or database)
and is incremented whenever a descriptor is created. Let `MaxID` be the
maximum descriptor ID found in the backup; the `DescIDGenerator` should then
be set to `MaxID + 1` so that descriptors created after the restore get
correct IDs.

System tables cannot be restored in the same way as user data tables since
they occupy a fixed keyspace (and thus cannot be re-keyed as we do today for
new tables). First we restore the system tables into a temporary database. The
`DescIDGenerator` key must be updated prior to creating this temporary
database to ensure no conflicts with a user table that needs to be restored
(thus the ID of this database will be `MaxID + 1` and the `DescIDGenerator`
will be incremented again).
```sql
CREATE DATABASE crdb_system_temporary;
RESTORE system.* FROM 'nodelocal://1/full-backup/1' WITH into_db='crdb_system_temporary';
```

Then, in an internal executor, execute:
```sql
BEGIN;
DELETE FROM system.zones WHERE true;
INSERT INTO system.zones (SELECT * FROM crdb_system_temporary.zones);
COMMIT;
```

Before restoring the user data, we need to ensure that all the user table and
database descriptors are created with the same IDs as they had in the backup.
This differs from the current implementation, which generates a new ID for
these items. This allows for a potential future optimization to skip the no-op
rekeying. User tables can then be restored normally.

Finally, to restore the remainder of the system tables, perform a transaction
similar to the one listed above, but rather than restoring only the zones
table, restore the rest of the system tables. It is preferable to restore all
of the remaining system tables in one transaction in order to ensure atomicity
across the restoration of all the system tables. However, there may be a
limitation based on the maximum transaction size, in which case the
possibility of restoring the system tables one by one could be investigated.
That said, the maximum size of a transaction is quite large and is _not_
expected to cause issues.
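
As a sketch of that final step, the `system.zones` transaction above can be
generalized over the remaining allowlisted system tables (`execFn` below is a
stand-in for however the restore job issues internal SQL; the real
implementation would run these statements through the internal executor):

```go
package backupccl

import "fmt"

// restoreSystemTablesFromTemp sketches the final phase of a full cluster
// restore: inside one transaction, each remaining allowlisted system table
// has its existing rows deleted and replaced with the rows previously
// restored into the crdb_system_temporary database.
func restoreSystemTablesFromTemp(execFn func(stmt string) error, tables []string) error {
	stmts := []string{"BEGIN"}
	for _, tbl := range tables {
		stmts = append(stmts,
			fmt.Sprintf("DELETE FROM system.%[1]s WHERE true", tbl),
			fmt.Sprintf("INSERT INTO system.%[1]s (SELECT * FROM crdb_system_temporary.%[1]s)", tbl),
		)
	}
	stmts = append(stmts, "COMMIT")
	for _, stmt := range stmts {
		if err := execFn(stmt); err != nil {
			return err
		}
	}
	return nil
}
```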

Finally, the temporary `crdb_system_temporary` database is deleted.

### Jobs

During a cluster backup, a job may be in progress. The state for these jobs is
persisted in the user data and in the `system.jobs` table. These jobs will be
restored in a running state with nobody holding a lease on them, so they
should be adopted and then continued. For a job to be able to continue, all
OFFLINE tables need to be included in the BACKUP.

## Locality Awareness

The current implementation of locality-aware BACKUPs should continue to work
with cluster backup without further work. BACKUP of the system tables will
operate just as it does for the user-data tables, and the relevant
leaseholders will back up to the appropriate locality.

## Failure Modes

### General Restore Failure/Cancellation

The happy path for a full cluster restore is when the restore is started and
all nodes remain available until the restore is complete.

A non-cluster restore creates the tables for the user data at the start of the
restore. These tables are in an OFFLINE state - inaccessible to the user[1].
If there is a failure during the restore, these tables are marked as DROP and
will be removed. Full cluster restore can recover the user-data tables that it
restored in the same way. The difficulty lies in handling the system tables
that it has already restored. This will likely only be the `system.zones`
table, since the remainder of the system tables will be restored in a single
transaction near the end of the job; however, the general case is considered.

Since a full cluster restore must be run on a fresh cluster, the first
iteration of full cluster restore could require the cluster to be destroyed if
the restore fails. This can likely be improved, as detailed in the Future Work
section.

# Alternatives Considered

## System Table Restorations

### AddSSTable and TimeBoundedDelete

One reasonable question is why CockroachDB doesn't load the system data the
same way as the user-data tables. One difficulty this would present is that
user-data restoration happens on new tables, but the system tables in the new
cluster already have data in them. This method would directly ingest the
SSTables for the system table spans, then issue a `TimeBoundedDelete`. Such a
command does not yet exist, but could be implemented by leveraging
`engine.MVCCClearTimeRange`, similarly to the `batcheval.RevertRange` command.
This leaves the possibility of a potentially dirty state in the system tables.
Additionally, the keys in the SSTs would need to have their timestamps updated
to some time greater than the start time of the restore.

The reason we take the approach of loading SSTs directly into the storage
layer for user data is that we typically expect a large volume of data.
Additionally, we can ensure that this data is not needed or accessed by the
user while it is being loaded. Since the size of the system tables is expected
to be much smaller than the size of the user-data tables, there are no
advantages to this approach and it is more complex.

## Cluster Info Metadata

### Only Look at Backup Contents

Instead of marking a particular backup as "full cluster", the system tables
that it holds could be examined. This would allow previous backups that
included the system table information to be restored via a full cluster
restore. One problem with this approach is that if new system tables are added
to the list of expected system tables, a mapping between version numbers and
the tables expected to be included in each version would need to be
maintained. This problem is avoided by marking backups as full cluster, since
we can then assume that all system tables included in those backups are safe
to restore (and override the existing ones).

# Future Work

- As mentioned, it may be possible to enable cluster restoration on a non-new
  cluster - however, this raises further complications. Since the vast
  majority of use cases for this feature are to restore a cluster exactly as
  it was in the backup, there is little motivation to generalize cluster
  restoration in this way. In particular, this would require the contents of
  the system tables to be re-keyed (in addition to the user-data KVs
  themselves). Each system table would need to provide a way to map each of
  its rows to an updated row based on a re-keyer.

- One large remaining piece of work is how to handle the case where the
  metadata of the restoring cluster does not match that of the one in the
  backup. For example, if the cluster on which the BACKUP was performed has a
  given set of localities which do not exist on the cluster that is being
  restored, there is currently no way for the user to map the localities from
  the backed-up cluster to the values they should be changed to in the
  restoring cluster. Since all BACKUP and RESTORE interactions currently
  happen at the CLI, a major difficulty is providing a powerful enough
  interface for the user to provide these mappings.

- Include `system.eventlog` in a full cluster backup. One reason for not doing
  this initially is that some event logs may be nonsensical if the table is
  restored into a cluster with a different number of nodes.

- A further improvement would be to allow the restoration of a set of
  tables/databases with their respective configuration. This requires that the
  RESTORE process find which rows in the system tables are applicable to the
  given database/table. This also implies that we'd need to add the ability to
  rewrite values in other tables. This is out of scope for this RFC.

- A potential improvement to ensure that the cluster is in a fresh state would
  be to mark a cluster for restoration at creation time (similar to `cockroach
  init`). This would also prevent any operations from interfering with the
  restore.

- A more graceful failure mode could be implemented which ensures that the
  cluster's state is guaranteed to be healthy in the case of a failed full
  cluster RESTORE.

- The initial implementation will not consider what happens if there is a
  failure in the middle of the backup.
  It will clean up the data following the normal backup procedures. In the
  case of a failure while updating the system tables, the cluster should be
  started up again. Since we enforce that the cluster being restored into has
  no user data, this is acceptable.

# Drawbacks

Due to the restriction that a full cluster restore can only be performed on a
newly created cluster, some users may be surprised when trying to perform a
full cluster restore when this assumption is violated.

[1] OFFLINE tables can, however, be referenced when setting zone configs. See
#40285.