- Feature Name: Full Cluster Backup/Restore
- Status: completed
- Start Date: 2019-12-02
- Authors: Paul Bardea
- RFC PR: #[42887](https://github.com/cockroachdb/cockroach/pull/42887)
- Cockroach Issue:
  #[44814](https://github.com/cockroachdb/cockroach/issues/44814)

# Summary

Users should be able to `BACKUP` and `RESTORE` all relevant information stored
in their cluster - namely, the relevant metadata stored in system tables.

# Motivation

Currently, only user data can be backed up and restored - along with very
limited metadata (table statistics, if requested). There is no mechanism for a
user to easily restore their entire cluster as it appeared at the time of a
backup.

# Guide-level explanation

This RFC builds on the original [Backup &
Restore](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20160720_backup_restore.md)
functionality and extends it so that a backup can capture all logical data
stored in the cluster. A new syntax is introduced to perform a full cluster
backup and restore: `BACKUP TO [...]` and `RESTORE FULL CLUSTER FROM [...]`.

Additionally, incremental cluster backups are supported:
```sql
> BACKUP TO 'nodelocal:///cluster-backup/1';
> BACKUP TO 'nodelocal:///cluster-backup/2' INCREMENTAL FROM 'nodelocal:///cluster-backup/1';
```
A user can create an incremental cluster backup by providing a full cluster
backup to build on and, optionally, additional incremental backups (as is the
case for non-cluster backups); every backup listed must itself be a full
cluster backup. Incremental cluster backups can be restored in the usual way:
`RESTORE FROM 'nodelocal:///cluster-backup/1', 'nodelocal:///cluster-backup/2'`.

A full cluster RESTORE can only be performed on a fresh cluster with no user
data. Some of the data in the system tables may already be set (for example,
the `cluster.organization` setting must be set in order to even use this
feature). However, it should be noted that this data will be overwritten by
performing a full cluster RESTORE.
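
For example, a fresh cluster might contain only the enterprise settings needed
to run RESTORE in the first place; a sketch of such a setup is below (the
license value is a placeholder, and any settings written here are later
overwritten by the restored `system.settings`):

```sql
-- Settings typically present on an otherwise fresh cluster before a full
-- cluster restore. The license value below is a placeholder, not a real key.
SET CLUSTER SETTING cluster.organization = 'Example Org';
SET CLUSTER SETTING enterprise.license = 'crl-0-<placeholder>';

-- These rows live in system.settings and are overwritten when the restore
-- replaces the contents of that table.
RESTORE FULL CLUSTER FROM 'nodelocal:///cluster-backup/1';
```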

A full cluster BACKUP/RESTORE could be thought of as performing the following steps:
```sql
/* Full Cluster Backup.
   There are no semantics to back up all user tables, so assume the user databases are: database_a, database_b, [...].
   BACKUP also does not currently support backing up entire databases and individual tables in the same statement, but only a subset of the system tables should be backed up. */
BACKUP DATABASE database_a, database_b, [...], system TO 'nodelocal:///cluster-backup/1';

/* Full Cluster Restore */
CREATE DATABASE crdb_system_temporary;
RESTORE system.* FROM 'nodelocal:///cluster-backup/1' WITH into_db='crdb_system_temporary';

/* Restore the user data. */
RESTORE DATABASE database_a, database_b, [...] FROM 'nodelocal:///cluster-backup/1';

/* Restore the system tables. */
BEGIN;
DELETE FROM system.users WHERE true;
INSERT INTO system.users (SELECT * FROM crdb_system_temporary.users);
COMMIT;

BEGIN;
DELETE FROM system.settings WHERE true;
INSERT INTO system.settings (SELECT * FROM crdb_system_temporary.settings);
COMMIT;

[...]
```

Not all system tables should be included in a backup since some information
relates to the physical properties of a cluster. The existing system tables
have been audited below. New system tables will need to be added to the list
of system tables that should be included in a backup. This will initially be a
list of system table names maintained inside the `backupccl` package.

### System Tables
| Table Name | Description | Included | Notes |
|---|---|---|---|
| namespace | Provides the relationship between parentID <-> descriptor name <-> descriptor ID | No | This information should be generated by the restoring cluster |
| descriptor | Maps ID <-> descriptor proto | No | New descriptors should be made for every RESTOREd table. |
| users | Stores the users in the cluster. | Yes | |
| zones | Stores the zone config protos | Yes | |
| settings | Stores all the cluster settings | Yes | |
| leases | Table leases | No | Leases held in the old cluster are no longer relevant. |
| eventlog | A log of a variety of events (schema changes, node additions, etc.) | No | Most events are not node-specific and would be useful to back up. This may produce confusing output if restored into a cluster with a different number of nodes. See Future work. |
| rangelog | Range-level events. | No | Ranges on the old and new cluster will not match. |
| ui | A set of KV pairs used by the UI | Yes | |
| jobs | A list of all jobs that are running or have run. | Yes | |
| web\_sessions |  | No | This could eventually be moved into the backup. Unclear. |
| table\_statistics | | Yes | This information is currently backed up in the BACKUP manifest to BACKUP and RESTORE table statistics on a per-table level. |
| locations | Stores information about the localities. | Yes |  |
| role\_members | Contains role-role and user-role relationships for RBAC | Yes |  |
| comments | Stores up to 1 string comment per object ID | Yes | |
| replication\_* | | No | Replication stats should be regenerated when the data is RESTORED. |
| reports\_meta | | No | Same as replication\_*. |
| protectedts\_* | As proposed by the [protected timestamp RFC](https://github.com/cockroachdb/cockroach/blob/master/docs/RFCS/20191009_gc_protected_timestamps.md) | No | RESTORE only restores a snapshot of the data in the backup, not the entire MVCC history. |

There is no information in the system ranges that should be included in a
CLUSTER backup, since it all relates to properties of the ranges/nodes.

# Reference-level explanation

This RFC assumes that a cluster restore will occur on a fresh cluster with no
user data. This allows the data to be restored _exactly_ as it appeared in the
backup. Otherwise, it would be necessary to re-key the user tables as well as
the **content** inside the system tables which references table IDs/KV spans
(such as zone configs). Note: this implies that the behavior of any
interaction with the restoring cluster is undefined until the restore
succeeds. This may be extended in the future; see the Future work section for
more details.

Additionally, incremental cluster backups and their restoration are supported
using the same syntax as the existing `BACKUP`. In addition to checking that
the previous backups cover the necessary span of the keyspace and time, a
check must be added to `backupPlanHook` to verify that every backup that this
incremental backup builds upon is also a cluster backup. Additionally, a full
cluster restore should only be permitted from full cluster backups. Therefore
it is necessary to add a flag to the backup manifest (`BackupDescriptor`)
indicating whether or not a given backup is a full cluster backup. The primary
reason for this bit is to ensure that full cluster restore can only restore
full cluster backup files.
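
As an illustration of these checks (the URIs and database name are
placeholders), the planner would accept an incremental cluster backup whose
base is itself a cluster backup and reject one whose base only covered
specific targets:

```sql
-- Accepted: the base at .../cluster-backup/1 was created by a full cluster
-- BACKUP, so the incremental backup can also be a cluster backup.
BACKUP TO 'nodelocal:///cluster-backup/2'
    INCREMENTAL FROM 'nodelocal:///cluster-backup/1';

-- Rejected by the proposed check: the base backup only covered a single
-- database, so its manifest is not marked as a full cluster backup.
BACKUP DATABASE bank TO 'nodelocal:///bank-backup/1';
BACKUP TO 'nodelocal:///cluster-backup/3'
    INCREMENTAL FROM 'nodelocal:///bank-backup/1';
```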

## Interfaces

Users will mainly interact with this new feature through the new syntax
introduced: `BACKUP FULL CLUSTER TO [...]` and `RESTORE FULL CLUSTER FROM
[...]`.

Additionally, incremental cluster backups are supported:
```sql
> BACKUP FULL CLUSTER TO 'nodelocal:///cluster-backup/1';
> BACKUP FULL CLUSTER TO 'nodelocal:///cluster-backup/2' INCREMENTAL FROM 'nodelocal:///cluster-backup/1';
```

These backups can be restored: `RESTORE FULL CLUSTER FROM
'nodelocal:///cluster-backup/1', 'nodelocal:///cluster-backup/2'`. Every backup
listed must be a full-cluster backup.

This new syntax introduces a new target, FULL CLUSTER, which can be used
instead of specifying particular databases/tables to be restored. Replacing
the usual list of targets with the new FULL CLUSTER target should not result
in any UX surprises.
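
A sketch of the difference between the existing targeted form and the new
cluster-level form (the database and URI names are illustrative):

```sql
-- Existing, targeted restore: the user lists databases or tables.
RESTORE DATABASE bank FROM 'nodelocal:///bank-backup/1';

-- New, cluster-level restore: FULL CLUSTER replaces the target list and
-- restores the user data plus the system tables included in the backup.
RESTORE FULL CLUSTER FROM 'nodelocal:///cluster-backup/1';
```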

A user can then examine a full cluster backup using the `SHOW BACKUP` command
(`start_time` and `end_time` are omitted from this example for brevity):
```sql
root@:26258/default_db> SHOW BACKUP 'nodelocal://1/full-cluster';
  database_name | table_name | size_bytes | rows | is_full_cluster |
+---------------+------------+------------+------+-----------------+
  some_user_db  | foo        |          0 |    0 |      true       |
  system        | zones      |        252 |    0 |      true       |
  system        | users      |         99 |    0 |      true       |
  ...
```

This command shows the user what type of metadata is stored in the backup.
Since users must specify only full cluster backups to build incremental
backups, this allows users to inspect a backup to check what cluster
information is stored.

With regards to user-visible errors introduced by this feature, users can
expect to see an error when:
- They create a full cluster incremental backup on top of a non-full cluster
  backup.
- They perform a full cluster restore in a cluster with existing user data
  (there may be table/database ID collisions, which will not be handled). As
  described, a check will be performed ensuring that no user tables/databases
  have been created (practically, this means ensuring that no descriptors
  exist with an ID greater than or equal to `MinNonPredefinedUserDescID`);
  see the sketch below.
- They attempt to perform a full cluster restore from a non-full cluster
  backup.

Note: it is expected that users will be able to perform a non-full cluster
RESTORE from a full-cluster BACKUP.
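
A rough, manual approximation of the fresh-cluster check is sketched below.
The literal `50` is only an illustrative stand-in for
`MinNonPredefinedUserDescID`, whose actual value is defined in the `keys`
package:

```sql
-- List any descriptors in the user ID range. On a cluster that is eligible
-- for a full cluster restore, this query should return no rows.
-- (50 is an illustrative placeholder for MinNonPredefinedUserDescID.)
SELECT id, name FROM system.namespace WHERE id >= 50;
```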

## Detailed design

## Backup

The first difference between a full cluster backup and a regular (non-full
cluster) backup is that a full cluster backup includes all user tables in the
backup. This can be accomplished by including all tables -- as defined by
enumerating the descriptors table -- except for the set explicitly excluded as
defined above.

Additionally, all OFFLINE tables need to be included in a BACKUP (they are not
today). This ensures that in-progress jobs are able to continue after a full
cluster restoration. See the Jobs section below.
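
For reference, offline descriptors are visible through `crdb_internal`; a
sketch of a query showing which tables this affects (assuming the `state`
column exposed there) is:

```sql
-- Tables that are mid-schema-change or mid-restore appear as OFFLINE and,
-- under this proposal, must still be written to the backup.
SELECT database_name, name, state
FROM crdb_internal.tables
WHERE state = 'OFFLINE';
```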

Finally, the backup manifest (`BackupDescriptor` protobuf) needs to be
augmented with an enum specifying the amount of cluster information stored in
the backup. An enum `DescriptorCoverage` will be added to the
`BackupDescriptor` with two options: `RequestedDescriptors`, which is the
default and is what existing backups will have going forward, and
`AllDescriptors` for full cluster backups. This enum is required to prevent a
full cluster restore from being performed from a non-full cluster backup file.
In particular, this requirement exists because full cluster RESTORE guarantees
that the entire cluster has been RESTOREd (so we need the entire cluster to be
in the backup file).

## Restore

Upon a full cluster restore, the order in which data is restored becomes
relevant. In particular, `system.zones` must be restored prior to restoring
the user data in order to ensure that the user data is placed in the
appropriate locality, where applicable. The user data will then be restored,
and finally the rest of the system tables.

First, a check is performed to ensure that no user data exists in this
cluster. This is achieved by ensuring that no descriptors exist with an ID
greater than or equal to `MinNonPredefinedUserDescID`. This check also ensures
that no other full cluster restore is in progress, since a full cluster
restore creates the `crdb_system_temporary` database in the user descriptor ID
space. Then the `DescIDGenerator` needs to be restored. This key is used to
determine the value of the next descriptor ID (such as during the creation of
a table or database) and is incremented whenever a descriptor is created. Let
`MaxID` be the maximum descriptor ID found in the backup; the
`DescIDGenerator` should then be set to `MaxID + 1` so that new descriptors
can be created after the restore with correct IDs.
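
A sketch of a post-restore sanity check for the ID counter (the table name is
illustrative): any descriptor created after the restore should receive an ID
strictly greater than every ID that came from the backup.

```sql
-- Largest descriptor ID that came over from the backup.
SELECT max(id) AS max_restored_id FROM system.descriptor;

-- A table created after the restore should be assigned an ID above
-- max_restored_id, since DescIDGenerator was bumped to MaxID + 1.
CREATE TABLE post_restore_check (k INT PRIMARY KEY);
SELECT id FROM system.namespace WHERE name = 'post_restore_check';
```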

System tables cannot be restored in the same way as user data tables since
they occupy a fixed keyspace (and thus cannot be re-keyed as we do today for
new tables). First, we restore the system tables into a temporary database.
The `DescIDGenerator` key must be updated prior to creating this temporary
database to ensure that it does not conflict with a user table that needs to
be restored (thus the ID of this database will be `MaxID + 1` and the
`DescIDGenerator` will be incremented again).
```sql
CREATE DATABASE crdb_system_temporary;
RESTORE system.* FROM 'nodelocal://1/full-backup/1' WITH into_db='crdb_system_temporary';
```

Then, in an internal executor, execute:
```sql
BEGIN;
DELETE FROM system.zones WHERE true;
INSERT INTO system.zones (SELECT * FROM crdb_system_temporary.zones);
COMMIT;
```

Before restoring the user data, we need to ensure that all the user table and
database descriptors are created with the same IDs as they had in the backup.
This differs from the current implementation, which generates a new ID for
these items. This allows for a potential future optimization to skip the no-op
rekeying. User tables can then be restored normally.

Finally, to restore the remainder of the system tables, perform a transaction
similar to the one listed above, but rather than only restoring the zones
table, restore the rest of the system tables. It is preferable to restore all
of the remaining system tables in one transaction in order to ensure atomicity
across the restoration of all the system tables. However, there may be a
limitation based on the maximum transaction size, in which case the
possibility of restoring the system tables one by one could be investigated.
That said, the maximum size of a transaction is quite large and is _not_
expected to cause issues.

Once the remaining system tables have been restored, the temporary
`crdb_system_temporary` database is deleted.
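
Putting these last two steps together, the tail end of the restore job roughly
corresponds to the following (the table list is abbreviated, not exhaustive):

```sql
-- Swap in the remaining system tables in a single transaction.
BEGIN;
DELETE FROM system.users WHERE true;
INSERT INTO system.users (SELECT * FROM crdb_system_temporary.users);
DELETE FROM system.settings WHERE true;
INSERT INTO system.settings (SELECT * FROM crdb_system_temporary.settings);
-- ... the rest of the system tables included in the backup ...
COMMIT;

-- Clean up the staging database.
DROP DATABASE crdb_system_temporary CASCADE;
```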

### Jobs

During a cluster backup, a job may be in progress. The state for these jobs is
persisted in the user data and in the `system.jobs` table. These jobs will be
restored into a running state, and nobody will hold a lease on them, so they
should be adopted and then continued. For a job to be able to be continued,
all OFFLINE tables need to be included in the BACKUP.
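
As a sketch (not part of the proposal), an operator could confirm that
restored jobs have been picked up again with something like:

```sql
-- Jobs restored from the backup should show up again and return to a
-- running state once a node adopts them.
SHOW JOBS;
SELECT id, status FROM system.jobs WHERE status IN ('pending', 'running');
```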

## Locality Awareness

The current implementation of locality-aware BACKUPs should continue to work
with cluster backup without further work. BACKUP for the system tables will
operate just as it does for the user-data tables, and the relevant
leaseholders will back up to the appropriate locality.
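
For reference, a locality-aware cluster backup would look the same as it does
for table backups today, e.g. (URIs and locality values are illustrative):

```sql
-- Each node backs up the ranges it holds leases for to the URI whose
-- COCKROACH_LOCALITY matches its locality; 'default' catches everything else.
BACKUP TO (
    'nodelocal:///backups/default?COCKROACH_LOCALITY=default',
    'nodelocal:///backups/east?COCKROACH_LOCALITY=region%3Dus-east',
    'nodelocal:///backups/west?COCKROACH_LOCALITY=region%3Dus-west'
);
```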

## Failure Modes

### General Restore Failure/Cancellation

The happy path for a full cluster restore is when the restore is started and
all nodes remain available until the restore is complete.

A non-cluster restore creates the tables for the user data at the start of the
restore. These tables are in an OFFLINE state - inaccessible to the user[1].
If there is a failure during the restore, these tables are marked as DROP and
will be removed. Full cluster restore can recover the user-data tables that it
restored in this way as well. The difficulty lies in handling the system
tables that have already been restored. In practice this will likely only be
the `system.zones` table, since the remainder of the system tables will be
restored in a single transaction near the end of the job; however, the general
case is considered.

Since a full cluster restore must be run on a fresh cluster, the first
iteration of full cluster restore could require the cluster to be destroyed if
the restore fails. This can likely be improved as detailed in the Future Work
section.

# Alternatives Considered

## System Table Restorations

### AddSSTable and TimeBoundedDelete

One reasonable question is why CockroachDB doesn't load the system data the
same way as the user-data tables. One difficulty that this would present is
that user-data restoration happens on new tables, but the system tables in the
new cluster already have data in them. This method would directly ingest the
SSTables for the system table spans, then issue a `TimeBoundedDelete`. Such a
command does not yet exist, but it could be implemented by leveraging
`engine.MVCCClearTimeRange`, similarly to the `batcheval.RevertRange` command.
This leaves the possibility of a potentially dirty state in the system tables.
Additionally, the keys in the SSTs would need to have their timestamps updated
to some time greater than the start time of the restore.

The reason we take the approach of loading SSTs directly into the storage
layer for user data is that we typically expect a large volume of data.
Additionally, we can ensure that this data is not needed or accessed by the
user while it is being loaded. Since the size of the system tables is expected
to be much smaller than the size of the user-data tables, this approach offers
no advantages here and is more complex.

## Cluster Info Metadata

### Only Look at Backup Contents
Instead of marking a particular backup as "full cluster", the system tables
that it holds could be examined. This would allow previous backups that
included the system table information to be restored via a full cluster
restore. One problem with this approach is that if new system tables are added
to the list of expected system tables, a mapping between version numbers and
the tables expected to be included in each version would need to be
maintained. This problem is avoided by marking backups as full cluster, since
we can then assume that all system tables included in those backups are safe
to restore (and override the existing ones).

# Future Work

- As mentioned, it may be possible to enable cluster restoration on a non-new
  cluster - however, this does raise further complications. Since it seems
  that the vast majority of use cases for this feature are to restore a
  cluster exactly as it was in the backup, there is little motivation to
  generalize cluster restoration in this way. In particular, this would
  require the contents of the system tables to be re-keyed (in addition to the
  user-data KVs themselves). This would require each system table to provide a
  way to map each of its rows to an updated row based on a re-keyer.

- One large remaining piece of work is how to handle the case where the
  metadata of the restoring cluster does not match that of the cluster in the
  backup. For example, if the cluster on which the BACKUP was performed has a
  given set of localities which do not exist on the cluster that is being
  restored, there is currently no way for the user to map the localities from
  the backed up cluster to the values they should be changed to in the
  restoring cluster. Currently, since all BACKUP and RESTORE interactions
  happen at the CLI, a major difficulty is providing a powerful enough
  interface for the user to provide these mappings.

- Include `system.eventlog` in a full cluster backup. One reason for not doing
  this initially is that some event logs may be nonsensical if the table is
  restored in a cluster with a different number of nodes.

- Additionally, one further improvement would be to allow the restoration of a
  set of tables/databases with their respective configuration. This requires
  that the RESTORE process find which rows in the system tables are applicable
  to the given database/table. This also implies that we'd need to add the
  ability to rewrite values in other tables. This is out of scope of this RFC.

- A potential improvement to ensure that the cluster is in a fresh state would
  be to mark a cluster for restoration at creation time (similar to `cockroach
  init`). This would also prevent any operations from interfering with the
  restore.

- A more graceful failure mode could be implemented which ensures that the
  cluster's state is healthy in the case of a failed full cluster RESTORE.

- The initial implementation will not consider what happens if there is a
  failure in the middle of the backup; it will clean up the data following the
  normal backup procedures. In the case that there is a failure while updating
  the system tables, the cluster should be started up again. Since we enforce
  that the cluster we are restoring to has no user data, this is acceptable.

# Drawbacks

Due to the restriction that a full cluster restore can only be performed on a
newly created cluster, some users may be surprised when they try to perform a
full cluster restore and this assumption is violated.


[1] Offline tables can, however, be referenced when setting zone configs. See
#40285.