- Feature Name: System Jobs
- Status: completed
- Start Date: 2017-02-13
- Authors: Nikhil Benesch
- RFC PR: [#13656]
- Cockroach Issue: [#12555]

# Summary

Add a system table to track the progress of backups, restores, and schema changes.

# Motivation

When performing a schema change, only one bit of progress information is available: whether the change has completed, indicated by whether the query has returned. Similarly, when performing a backup or restore, status is only reported after the backup or restore has completed. Given that a full backup of a 2TB database takes on the order of several hours to complete, the lack of progress information is a serious pain point for users.

Additionally, while each node runs a schema change daemon that can restart pending schema changes if the coordinating node dies, the same is not true for backups and restores. If the coordinating node for a backup or restore job dies, the job will abort, even if the individual workers were otherwise successful.

This RFC proposes a new system table, `system.jobs`, that tracks the status of these long-running backup, restore, and schema change "jobs." This table will directly expose the desired progress information via `SELECT` queries over the table, enable an API endpoint to expose the same progress information in the admin UI, and enable an internal daemon that periodically adopts and resumes all types of orphaned jobs. The table will also serve as a convenient place for schema changes to store checkpoint state; currently, checkpoints are stored on the affected table descriptor, which must be gossiped on every write.

# Detailed design

## Prior art

Adding a `system.jobs` table has been proposed several times, but never successfully implemented. [@vivekmenezes]'s initial attempt in [#7073] was abandoned since, at the time, there was no way to safely add a new system table to an existing cluster. Several months later, after [@a-robinson] and [@danhhz] built the necessary cluster migration framework, @a-robinson submitted a migration to add a `system.jobs` table in [#11722] as an example of using the new framework. That PR was rejected because the table's schema hadn't been fully considered. This RFC describes a revised `system.jobs` schema for us to thoroughly vet before we proceed with a final implementation.

## The proposal

Within this RFC, the term "job" or "long-running job" refers only to backups, restores, and schema changes. Other types of long-running queries, like slow `SELECT` statements, are explicitly out of scope. A "job-creating query," then, is any query of one of the following types:

- `ALTER TABLE`, which creates a schema change job
- `BACKUP`, which creates a backup job
- `RESTORE`, which creates a restore job

To track the progress of these jobs, the following table will be injected into the `system` database using the [cluster migration framework]:

```sql
CREATE TABLE system.jobs (
    id INT DEFAULT unique_rowid() PRIMARY KEY,
    status STRING NOT NULL,
    created TIMESTAMP NOT NULL DEFAULT now(),
    payload BYTES,
    INDEX (status, created)
)
```

Each job is identified by a unique `id`, which is assigned when the job is created. Currently, this ID serves only to identify the job to the user, but future SQL commands to e.g. abort running jobs will need this ID to unambiguously specify the target job.

The `status` column represents a state machine with `pending`, `running`, `succeeded`, and `failed` states. Jobs are created in the `pending` state when the job-creating query is accepted, move to the `running` state once work on the job has actually begun, then move to a final state of `succeeded` or `failed`. The `pending` state warrants additional explanation: it tracks jobs that are enqueued but not yet performing work. Schema changes, for example, will sit in the `pending` state until all prior schema change jobs have completed.
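Sketched in Go, the `status` column amounts to the following small state machine. The transition table is illustrative rather than part of the proposal; in particular, the RFC does not say whether a job can move directly from `pending` to `failed`, which the sketch assumes is possible for jobs rejected before any work begins.

```go
// Status is the value stored in the status column of system.jobs.
type Status string

const (
	StatusPending   Status = "pending"   // enqueued, not yet performing work
	StatusRunning   Status = "running"   // work has begun
	StatusSucceeded Status = "succeeded" // terminal
	StatusFailed    Status = "failed"    // terminal
)

// validTransitions encodes the legal status changes described above.
// Assumption: a job may fail directly from pending, e.g. if it is
// rejected before any work starts.
var validTransitions = map[Status][]Status{
	StatusPending: {StatusRunning, StatusFailed},
	StatusRunning: {StatusSucceeded, StatusFailed},
}
```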
The `created` field, unsurprisingly, is set to the current timestamp at the time the record is created.

The admin UI job status page is expected to display jobs ordered first by their status, then by their creation time. To make this query efficient, the table has a secondary index on `(status, created)`.

We want to avoid future schema changes to `system.jobs` if at all possible. Every schema change requires a cluster migration, and every cluster migration introduces node startup time overhead, plus some risk and complexity. To that end, any field not required by an index is stashed in the `payload` column, which stores a protobuf that can be evolved per the standard protobuf forwards-compatibility support. The proposed message definition for the `payload` follows.

```protobuf
message BackupJobPayload {
  // Intentionally unspecified.
}

message RestoreJobPayload {
  // Intentionally unspecified.
}

message SchemaChangeJobPayload {
  uint32 mutation_id = 1;
  repeated roachpb.Span resume_spans = 2;
}

message JobLease {
  uint32 node_id = 1;
  int64 expires = 2;
}

message JobPayload {
  string description = 1;
  string creator = 2;
  int64 started = 4;
  int64 finished = 5;
  int64 modified = 6;
  repeated uint32 descriptor_ids = 7;
  float fraction_completed = 8;
  string error = 9;
  oneof details {
    BackupJobPayload backup_details = 10;
    RestoreJobPayload restore_details = 11;
    SchemaChangeJobPayload schema_change_details = 12;
  }
  JobLease lease = 13;
}
```

The `description` field stores the text of the job-creating query for display in the UI. Schema changes will store the query verbatim; backups and restores, which may have sensitive cloud storage credentials specified in the query, will store a sanitized version of the query.

The `creator` field records the user who launched the job.

Four fields track the timing of the job. The `created` column records when the job record is created, the `started` field records when the job switches from `pending` to `running`, and the `finished` field records when the job switches from `running` to its final state of `succeeded` or `failed`. The `modified` field is updated whenever the job record is updated and can be used to detect when a job has stalled.

The repeated `descriptor_ids` field stores the IDs of the databases or tables affected by the job. For backups and restores, the field will contain the ID of every table targeted. For schema changes, it will contain the ID of the one database (`ALTER DATABASE ...`) or table (`ALTER TABLE ...`) under modification. Future long-running jobs which don't operate on databases or tables can simply leave this field empty.

The `fraction_completed` field is periodically updated from 0.0 to 1.0 while the job is `running`. Jobs in the `succeeded` state will always have a `fraction_completed` of 1.0, while jobs in the `failed` state may have any `fraction_completed` value. This value is stored as a float instead of an integer to avoid needing to choose a fixed denominator for the fraction (e.g. 100 or 1000).

The `error` field stores the reason for failure, if any. This is the same error message that is reported to the user through the normal query failure path, but is recorded in the table for posterity.

The type of job can be determined by reflection on the `details` oneof, which stores additional details relevant to a specific job type. The `SchemaChangeJobPayload`, for example, stores the ID of the underlying mutation and checkpoint status to resume an in-progress backfill if the original coordinator dies. `BackupJobPayload` and `RestoreJobPayload` are currently empty and exist only to allow reflection on the `details` oneof.
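In the generated Go code, that reflection amounts to a type switch on the oneof wrapper. A minimal sketch, assuming the `JobPayload_*` wrapper structs a Go protobuf compiler would emit for the `details` oneof and hypothetical per-type resume functions:

```go
// resumeJob dispatches on the details oneof to resume the right kind of
// work. The resumeBackup/resumeRestore/resumeSchemaChange functions are
// hypothetical; only the dispatch pattern is the point here.
func resumeJob(payload *JobPayload) error {
	switch d := payload.Details.(type) {
	case *JobPayload_BackupDetails:
		return resumeBackup(payload, d.BackupDetails)
	case *JobPayload_RestoreDetails:
		return resumeRestore(payload, d.RestoreDetails)
	case *JobPayload_SchemaChangeDetails:
		return resumeSchemaChange(payload, d.SchemaChangeDetails)
	default:
		return fmt.Errorf("job has unknown details type %T", d)
	}
}
```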
Finally, the `lease` field tracks whether the job has a live coordinator. The field stores the node ID of the current coordinator in `lease.node_id` and when that lease expires in `lease.expires`. Each node will run a daemon to scan for running jobs whose leases have expired and attempt to become the new coordinator. (See the next section for a proposed lease acquisition scheme.) Schema changes have an existing daemon that does exactly this, but the daemon currently stores the lease information on the table descriptor. The daemon will be adjusted to store lease information here instead and extended to support backup and restore jobs.

Several alternative divisions of fields between the schema and the protobuf were considered; see [Alternatives](#alternatives) for more discussion.

## Expected queries

To help evaluate the schema design, a selection of SQL queries expected to be run against the `system.jobs` table follows. Most of these queries will be executed by the database internals, though some are expected to be run manually by users monitoring job progress.

To create a new job:

```sql
-- {} is imaginary syntax for a protobuf literal.
INSERT
INTO system.jobs (status, payload)
VALUES ('pending', {
  description = 'BACKUP foodb TO barstorage',
  creator = 'root',
  modified = now(),
  descriptor_ids = [50],
  backup_details = {}
})
RETURNING id
```

To mark a job as running:

```sql
-- {...old, col = 'new-value'} is imaginary syntax for a protobuf literal that
-- has the same values as old, except col is updated to 'new-value'.
UPDATE system.jobs
SET status = 'running',
    payload = {
      ...payload,
      started = now(),
      modified = now(),
      fraction_completed = 0.0,
      lease = { node_id = 1, expires = now() + JobLeaseDuration }
    }
WHERE id = ?
```

To update the status of a running job:

```sql
UPDATE system.jobs
SET payload = {...payload, modified = now(), fraction_completed = 0.2442}
WHERE id = ?
```
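Since the protobuf-literal syntax above is imaginary, in practice each of these updates is a read-modify-write issued from Go by the coordinator. A minimal sketch in the same pseudocode style as the lease-acquisition example below, with hypothetical `decode`/`encode` protobuf helpers:

```go
// updateProgress refreshes a running job's progress fields. Because SQL
// cannot yet address individual protobuf fields, the payload is read,
// decoded, mutated, re-encoded, and written back whole.
func updateProgress(id int64, fractionCompleted float32) {
	job := db.QueryRow("SELECT payload FROM system.jobs WHERE id = ?", id)
	payload := decode(job.payload)
	payload.FractionCompleted = fractionCompleted
	payload.Modified = time.Now().UnixNano() // modified is an int64 in the payload
	db.Exec("UPDATE system.jobs SET payload = ? WHERE id = ?", encode(payload), id)
}
```

A real implementation would want to perform the read and write in a single transaction, or compare-and-swap against the old payload bytes as the lease-acquisition example below does, to avoid clobbering a concurrent lease change.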
To take over an expired lease:

```go
// maybeAcquireAbandonedJob scans for running jobs whose leases have expired
// and attempts to take one over. It returns (0, nil) if every running job
// still has an active lease. decode and encode are hypothetical protobuf
// (de)serialization helpers, and lease.expires is shown as a time.Time for
// clarity even though the proto stores an int64.
func maybeAcquireAbandonedJob() (int64, *JobPayload) {
	jobs := db.Query("SELECT id, payload FROM system.jobs WHERE status = 'running'")
	for _, job := range jobs {
		payload := decode(job.payload)
		if payload.Lease.Expires.Add(MaxClockOffset).Before(time.Now()) {
			// The lease has expired, even allowing for clock offset. Attempt to
			// install ourselves as the new coordinator with a compare-and-swap
			// against the old payload bytes.
			payload.Lease = &JobLease{NodeID: nodeID, Expires: time.Now().Add(JobLeaseDuration)}
			if db.Exec(
				"UPDATE system.jobs SET payload = ? WHERE id = ? AND payload = ?",
				encode(payload), job.id, job.payload,
			).RowsAffected() == 1 {
				// Acquired the lease on this job.
				return job.id, payload
			}
			// Another node got the lease. Try the next job.
		}
		// This job still has an active lease. Try the next job.
	}
	return 0, nil
}
```
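One plausible shape for the per-node daemon built around this function is sketched below; `adoptInterval`, the goroutine-per-job structure, and `resumeJob` (the oneof dispatch sketched earlier) are illustrative, not part of the proposal:

```go
// jobDaemon runs on every node. On each tick it tries to adopt one
// abandoned job and, if it wins the lease, resumes the work. A real
// coordinator must also keep extending lease.expires before it passes,
// or another node's daemon will take the job over in turn.
func jobDaemon(stop <-chan struct{}) {
	ticker := time.NewTicker(adoptInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if _, payload := maybeAcquireAbandonedJob(); payload != nil {
				go resumeJob(payload)
			}
		case <-stop:
			return
		}
	}
}
```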
To mark a job as successful:

```sql
UPDATE system.jobs
SET status = 'succeeded',
    payload = {...payload, modified = now()}
WHERE id = ?
```

To mark a job as failed:

```sql
UPDATE system.jobs
SET status = 'failed',
    payload = {...payload, modified = now(), error = 's3.aws.amazon.com: host unreachable'}
WHERE id = ?
```

To find queued or running jobs (e.g., for the default "System jobs" admin view):

```sql
SELECT * FROM system.jobs WHERE status IN ('pending', 'running') ORDER BY created;
```

To get the status of a specific job (e.g., a user in the SQL CLI):

```sql
SELECT status FROM system.jobs WHERE id = ?;
```

# Drawbacks

- Requiring the job coordinator to periodically issue `UPDATE system.jobs SET payload = {...payload, fraction_completed = ?}` queries to update the progress of running jobs is somewhat unsatisfying. Ideally, `fraction_completed` would be computed only when the record is read, but that design would introduce significant implementation complexity.

- Users cannot retrieve fields stored in the protobuf from SQL directly, but several fields that might be useful to users, like `fraction_completed` and `creator`, are stored within the protobuf. We can solve this by introducing a special syntax, like `SHOW JOBS`, if the need arises. Additionally, support for reaching into protobuf columns from a SQL query is planned.

  Note that at least one current customer has requested the ability to query job status from SQL directly. Even without a `SHOW JOBS` command, basic status information (i.e., `pending`, `running`, `succeeded`, or `failed`) is available directly through SQL under this proposal.

# Alternatives

## Wider protobufs

To further minimize the chances that we'll need to modify the `system.jobs` schema, we could instead stuff all the data into the `payload` protobuf:

```sql
CREATE TABLE system.jobs (
    id INT DEFAULT unique_rowid() PRIMARY KEY,
    payload BYTES
)
```

This allows for complete flexibility in adjusting the schema, but prevents essentially all useful SQL queries and indices over the table until protobuf columns are natively supported.

## Narrow protobufs

We could instead make nearly all fields directly queryable by widening the `system.jobs` table to include some (or all) of the fields proposed to be stored in the `payload` protobuf. Following is an example where all but the job-specific fields are pulled out of `payload`.

```sql
CREATE TABLE system.jobs (
    id INT DEFAULT unique_rowid() PRIMARY KEY,
    status STRING NOT NULL,
    description STRING NOT NULL,
    creator STRING NOT NULL,
    nodeID INT,
    created TIMESTAMP NOT NULL DEFAULT now(),
    started TIMESTAMP,
    finished TIMESTAMP,
    modified TIMESTAMP NOT NULL DEFAULT now(),
    descriptors INT[],
    fractionCompleted FLOAT,
    INDEX (status, created)
)
```

The `payload` type then simplifies to the definition below, where the job-specific message types are defined as above.

```protobuf
message JobPayload {
  oneof details {
    BackupJobPayload backup_details = 1;
    RestoreJobPayload restore_details = 2;
    SchemaChangeJobPayload schema_change_details = 3;
  }
}
```

This alternative poses a significant risk if we need to adjust the schema, but unlocks or simplifies some useful SQL queries. Additionally, none of the `UPDATE` queries in the [Expected queries](#expected-queries) section would need to modify the protobuf in this alternative.
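For example, the progress update, which under the main proposal requires the protobuf round trip sketched earlier, collapses to a single in-place statement (Go pseudocode as above):

```go
// updateProgress under the widened schema: with real columns, no protobuf
// decode/encode round trip or compare-and-swap on opaque bytes is needed.
func updateProgress(id int64, fractionCompleted float32) {
	db.Exec(
		"UPDATE system.jobs SET fractionCompleted = ?, modified = now() WHERE id = ?",
		fractionCompleted, id,
	)
}
```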
## Tracking job update history

Also considered was a schema capable of recording every change made to a job. Each job, then, would consist of a collection of records in the `system.jobs` table, one row per update. The table schema would include a timestamp on every row, and the primary key would expand to `(id, timestamp)`:

```sql
CREATE TABLE system.jobs (
    id INT DEFAULT unique_rowid(),
    timestamp TIMESTAMP NOT NULL DEFAULT now(),
    status STRING NOT NULL,
    payload BYTES,
    PRIMARY KEY (id, timestamp)
)
```

The `created`, `started`, and `finished` fields could then be derived from the `timestamp` of the first record in each state, so the protobuf would simplify to:

```protobuf
message JobPayload {
  string description = 1;
  string creator = 2;
  float fraction_completed = 3;
  uint32 node_id = 4;
  repeated uint32 descriptor_ids = 5;
  string error = 6;
  oneof details {
    BackupJobPayload backup_details = 7;
    RestoreJobPayload restore_details = 8;
    SchemaChangeJobPayload schema_change_details = 9;
  }
}
```

The first entry into the table for a given job would include the immutable facts about the job, like the `description` and the `creator`. Future updates to a job would only include the updated fields in the protobuf. A running job would update `fraction_completed` in the usual case, for example, and would update `node_id` if the coordinating node changed.

Protobufs elide omitted fields, so the space requirement of such a scheme is a modest several dozen kilobytes per job, assuming each job is updated several thousand times. Unfortunately, this design drastically complicates the queries necessary to retrieve information from the table. For example, the admin UI would need something like the following to display the list of pending and running jobs:

```sql
SELECT
  latest.id AS id,
  latest.status AS status,
  latest.timestamp AS updated,
  latest.payload AS latestPayload,
  initial.payload AS initialPayload,
  initial.timestamp AS created
FROM (
  SELECT jobs.id, jobs.timestamp, jobs.status, jobs.payload
  FROM (SELECT id, max(timestamp) AS timestamp FROM jobs GROUP BY id) AS latest
  JOIN jobs ON jobs.id = latest.id AND jobs.timestamp = latest.timestamp
) AS latest
JOIN jobs AS initial ON initial.id = latest.id AND initial.status = 'pending'
WHERE latest.status IN ('pending', 'running')
ORDER BY initial.timestamp
```

The above query could be simplified if we instead reproduced the entire record with every update, but that would significantly increase the space requirement. In short, this alternative either complicates the implementation or incurs significant space overhead for no clear win, since we don't currently have a specific compelling use case for a full update history.

# Unresolved questions and future work

- How does a user get a job ID via SQL? Job-creating queries currently block until the job completes; this behavior is consistent with e.g. `ALTER ...` queries in other databases. This means, however, that a job-creating query cannot return a job ID immediately. Users will need to search the `system.jobs` table manually for the record that matches the query they ran. The answer to this question is unlikely to influence the design of the schema itself, since how we communicate the job ID to the user is orthogonal to how we keep track of job progress.

- All system log tables, including `system.jobs`, will eventually need garbage collection to prune old entries from the table, likely with a user-configurable timeframe.

[#7073]: https://github.com/cockroachdb/cockroach/pull/7073
[#11722]: https://github.com/cockroachdb/cockroach/pull/11722
[#12555]: https://github.com/cockroachdb/cockroach/issues/12555
[#13656]: https://github.com/cockroachdb/cockroach/pull/13656
[@a-robinson]: https://github.com/a-robinson
[@danhhz]: https://github.com/danhhz
[@vivekmenezes]: https://github.com/vivekmenezes
[cluster migration framework]: https://github.com/cockroachdb/cockroach/pull/11658