- Feature Name: System Jobs
- Status: completed
- Start Date: 2017-02-13
- Authors: Nikhil Benesch
- RFC PR: [#13656]
- Cockroach Issue: [#12555]

# Summary

Add a system table to track the progress of backups, restores, and schema changes.

# Motivation

When performing a schema change, only one bit of progress information is available: whether the change has completed, indicated by whether the query has returned. Similarly, when performing a backup or restore, status is only reported after the backup or restore has completed. Given that a full backup of a 2TB database takes on the order of several hours to complete, the lack of progress information is a serious pain point for users.

Additionally, while each node runs a schema change daemon that can restart pending schema changes if the coordinating node dies, the same is not true for backups and restores. If the coordinating node for a backup or restore job dies, the job will abort, even if the individual workers were otherwise successful.

This RFC proposes a new system table, `system.jobs`, that tracks the status of these long-running backup, restore, and schema change "jobs." This table will directly expose the desired progress information via `SELECT` queries over the table, enable an API endpoint to expose the same progress information in the admin UI, and enable an internal daemon that periodically adopts and resumes all types of orphaned jobs. The table will also serve as a convenient place for schema changes to store checkpoint state; currently, checkpoints are stored on the affected table descriptor, which must be gossiped on every write.

# Detailed design

## Prior art

Adding a `system.jobs` table has been proposed several times, but never successfully implemented. [@vivekmenezes]'s initial attempt in [#7073] was abandoned since, at the time, there was no way to safely add a new system table to an existing cluster. Several months later, after [@a-robinson] and [@danhhz] built the necessary cluster migration framework, @a-robinson submitted a migration to add a `system.jobs` table in [#11722] as an example of using the new framework. That PR was rejected because the table's schema hadn't been fully considered. This RFC describes a revised `system.jobs` schema for us to thoroughly vet before we proceed with a final implementation.

## The proposal

Within this RFC, the term "job" or "long-running job" refers only to backups, restores, and schema changes. Other types of long-running queries, like slow `SELECT` statements, are explicitly out of scope. A "job-creating query," then, is any query of one of the following types:

- `ALTER TABLE`, which creates a schema change job
- `BACKUP`, which creates a backup job
- `RESTORE`, which creates a restore job

To track the progress of these jobs, the following table will be injected into the `system` database using the [cluster migration framework]:

```sql
CREATE TABLE system.jobs (
    id INT DEFAULT unique_rowid() PRIMARY KEY,
    status STRING NOT NULL,
    created TIMESTAMP NOT NULL DEFAULT now(),
    payload BYTES,
    INDEX (status, created)
)
```

Each job is identified by a unique `id`, which is assigned when the job is created. Currently, this ID serves only to identify the job to the user, but future SQL commands to e.g. abort running jobs will need this ID to unambiguously specify the target job.

The `status` column represents a state machine with `pending`, `running`, `succeeded`, and `failed` states. Jobs are created in the `pending` state when the job-creating query is accepted, move to the `running` state once work on the job has actually begun, then move to a final state of `succeeded` or `failed`. The `pending` state warrants additional explanation: it tracks jobs that are enqueued but not yet performing work. Schema changes, for example, will sit in the `pending` state until all prior schema change jobs have completed.
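Sketched in Go, the `status` column amounts to the following small state machine. The transition table is illustrative rather than part of the proposal; in particular, the RFC does not say whether a job can move directly from `pending` to `failed`, which the sketch assumes is possible for jobs rejected before any work begins.

```go
// Status is the value stored in the status column of system.jobs.
type Status string

const (
	StatusPending   Status = "pending"   // enqueued, not yet performing work
	StatusRunning   Status = "running"   // work has begun
	StatusSucceeded Status = "succeeded" // terminal
	StatusFailed    Status = "failed"    // terminal
)

// validTransitions encodes the legal status changes described above.
// Assumption: a job may fail directly from pending, e.g. if it is
// rejected before any work starts.
var validTransitions = map[Status][]Status{
	StatusPending: {StatusRunning, StatusFailed},
	StatusRunning: {StatusSucceeded, StatusFailed},
}
```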
The `created` field, unsurprisingly, is set to the current timestamp at the time the record is created.

The admin UI job status page is expected to display jobs ordered first by their status, then by their creation time. To make this query efficient, the table has a secondary index on `(status, created)`.

We want to avoid future schema changes to `system.jobs` if at all possible. Every schema change requires a cluster migration, and every cluster migration introduces node startup time overhead, plus some risk and complexity. To that end, any field not required by an index is stashed in the `payload` column, which stores a protobuf that can be evolved per the standard protobuf forwards-compatibility support. The proposed message definition for the `payload` follows.

```protobuf
message BackupJobPayload {
  // Intentionally unspecified.
}

message RestoreJobPayload {
  // Intentionally unspecified.
}

message SchemaChangeJobPayload {
  uint32 mutation_id = 1;
  repeated roachpb.Span resume_spans = 2;
}

message JobLease {
  uint32 node_id = 1;
  int64 expires = 2;
}

message JobPayload {
  string description = 1;
  string creator = 2;
  int64 started = 4;
  int64 finished = 5;
  int64 modified = 6;
  repeated uint32 descriptor_ids = 7;
  float fraction_completed = 8;
  string error = 9;
  oneof details {
    BackupJobPayload backup_details = 10;
    RestoreJobPayload restore_details = 11;
    SchemaChangeJobPayload schema_change_details = 12;
  }
  JobLease lease = 13;
}
```

The `description` field stores the text of the job-creating query for display in the UI. Schema changes will store the query verbatim; backups and restores, which may have sensitive cloud storage credentials specified in the query, will store a sanitized version of the query.

The `creator` field records the user who launched the job.

Four fields track the timing of the job. The `created` column records when the job record is created, the `started` field records when the job switches from `pending` to `running`, and the `finished` field records when the job switches from `running` to its final state of `succeeded` or `failed`. The `modified` field is updated whenever the job record is updated and can be used to detect when a job has stalled.

The repeated `descriptor_ids` field stores the IDs of the databases or tables affected by the job. For backups and restores, the field will contain the ID of every table targeted. For schema changes, it will contain the ID of the one database (`ALTER DATABASE ...`) or table (`ALTER TABLE ...`) under modification. Future long-running jobs which don't operate on databases or tables can simply leave this field empty.

The `fraction_completed` field is periodically updated from 0.0 to 1.0 while the job is `running`. Jobs in the `succeeded` state will always have a `fraction_completed` of 1.0, while jobs in the `failed` state may have any `fraction_completed` value. This value is stored as a float instead of an integer to avoid needing to choose a fixed denominator for the fraction (e.g. 100 or 1000).

The `error` field stores the reason for failure, if any. This is the same error message that is reported to the user through the normal query failure path, but is recorded in the table for posterity.

The type of job can be determined by reflection on the `details` oneof, which stores additional details relevant to a specific job type. The `SchemaChangeJobPayload`, for example, stores the ID of the underlying mutation and checkpoint status to resume an in-progress backfill if the original coordinator dies. `BackupJobPayload` and `RestoreJobPayload` are currently empty and exist only to allow reflection on the `details` oneof.
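In the generated Go code, that reflection amounts to a type switch on the oneof wrapper. A minimal sketch, assuming the `JobPayload_*` wrapper structs a Go protobuf compiler would emit for the `details` oneof and hypothetical per-type resume functions:

```go
// resumeJob dispatches on the details oneof to resume the right kind of
// work. The resumeBackup/resumeRestore/resumeSchemaChange functions are
// hypothetical; only the dispatch pattern is the point here.
func resumeJob(payload *JobPayload) error {
	switch d := payload.Details.(type) {
	case *JobPayload_BackupDetails:
		return resumeBackup(payload, d.BackupDetails)
	case *JobPayload_RestoreDetails:
		return resumeRestore(payload, d.RestoreDetails)
	case *JobPayload_SchemaChangeDetails:
		return resumeSchemaChange(payload, d.SchemaChangeDetails)
	default:
		return fmt.Errorf("job has unknown details type %T", d)
	}
}
```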
Finally, the `lease` field tracks whether the job has a live coordinator. The field stores the node ID of the current coordinator in `lease.node_id` and when that lease expires in `lease.expires`. Each node will run a daemon to scan for running jobs whose leases have expired and attempt to become the new coordinator. (See the next section for a proposed lease acquisition scheme.) Schema changes have an existing daemon that does exactly this, but the daemon currently stores the lease information on the table descriptor. The daemon will be adjusted to store lease information here instead and extended to support backup and restore jobs.

Several alternative divisions of fields between the schema and the protobuf were considered; see [Alternatives](#alternatives) for more discussion.

## Expected queries

To help evaluate the schema design, a selection of SQL queries expected to be run against the `system.jobs` table follows. Most of these queries will be executed by the database internals, though some are expected to be run manually by users monitoring job progress.

To create a new job:

```sql
-- {} is imaginary syntax for a protobuf literal.
INSERT
INTO system.jobs (status, payload)
VALUES ('pending', {
  description = 'BACKUP foodb TO barstorage',
  creator = 'root',
  modified = now(),
  descriptor_ids = [50],
  backup_details = {}
})
RETURNING id
```

To mark a job as running:

```sql
-- {...old, col = 'new-value'} is imaginary syntax for a protobuf literal that
-- has the same values as old, except col is updated to 'new-value'.
UPDATE system.jobs
SET status = 'running',
    payload = {
      ...payload,
      started = now(),
      modified = now(),
      fraction_completed = 0.0,
      lease = { node_id = 1, expires = now() + JobLeaseDuration }
    }
WHERE id = ?
```

To update the status of a running job:

```sql
UPDATE system.jobs
SET payload = {...payload, modified = now(), fraction_completed = 0.2442}
WHERE id = ?
```
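Since the protobuf-literal syntax above is imaginary, in practice each of these updates is a read-modify-write issued from Go by the coordinator. A minimal sketch in the same pseudocode style as the lease-acquisition example below, with hypothetical `decode`/`encode` protobuf helpers:

```go
// updateProgress refreshes a running job's progress fields. Because SQL
// cannot yet address individual protobuf fields, the payload is read,
// decoded, mutated, re-encoded, and written back whole.
func updateProgress(id int64, fractionCompleted float32) {
	job := db.QueryRow("SELECT payload FROM system.jobs WHERE id = ?", id)
	payload := decode(job.payload)
	payload.FractionCompleted = fractionCompleted
	payload.Modified = time.Now().UnixNano() // modified is an int64 in the payload
	db.Exec("UPDATE system.jobs SET payload = ? WHERE id = ?", encode(payload), id)
}
```

A real implementation would want to perform the read and write in a single transaction, or compare-and-swap against the old payload bytes as the lease-acquisition example below does, to avoid clobbering a concurrent lease change.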
To take over an expired lease:

```go
// maybeAcquireAbandonedJob scans for running jobs whose leases have expired
// and attempts to take one over. It returns (0, nil) if every running job
// still has an active lease. decode and encode are hypothetical protobuf
// (de)serialization helpers, and lease.expires is shown as a time.Time for
// clarity even though the proto stores an int64.
func maybeAcquireAbandonedJob() (int64, *JobPayload) {
	jobs := db.Query("SELECT id, payload FROM system.jobs WHERE status = 'running'")
	for _, job := range jobs {
		payload := decode(job.payload)
		if payload.Lease.Expires.Add(MaxClockOffset).Before(time.Now()) {
			// The lease has expired, even allowing for clock offset. Attempt to
			// install ourselves as the new coordinator with a compare-and-swap
			// against the old payload bytes.
			payload.Lease = &JobLease{NodeID: nodeID, Expires: time.Now().Add(JobLeaseDuration)}
			if db.Exec(
				"UPDATE system.jobs SET payload = ? WHERE id = ? AND payload = ?",
				encode(payload), job.id, job.payload,
			).RowsAffected() == 1 {
				// Acquired the lease on this job.
				return job.id, payload
			}
			// Another node got the lease. Try the next job.
		}
		// This job still has an active lease. Try the next job.
	}
	return 0, nil
}
```
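One plausible shape for the per-node daemon built around this function is sketched below; `adoptInterval`, the goroutine-per-job structure, and `resumeJob` (the oneof dispatch sketched earlier) are illustrative, not part of the proposal:

```go
// jobDaemon runs on every node. On each tick it tries to adopt one
// abandoned job and, if it wins the lease, resumes the work. A real
// coordinator must also keep extending lease.expires before it passes,
// or another node's daemon will take the job over in turn.
func jobDaemon(stop <-chan struct{}) {
	ticker := time.NewTicker(adoptInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if _, payload := maybeAcquireAbandonedJob(); payload != nil {
				go resumeJob(payload)
			}
		case <-stop:
			return
		}
	}
}
```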
To mark a job as successful:

```sql
UPDATE system.jobs
SET status = 'succeeded',
    payload = {...payload, modified = now()}
WHERE id = ?
```

To mark a job as failed:

```sql
UPDATE system.jobs
SET status = 'failed',
    payload = {...payload, modified = now(), error = 's3.aws.amazon.com: host unreachable'}
WHERE id = ?
```

To find queued or running jobs (e.g., for the default "System jobs" admin view):

```sql
SELECT * FROM system.jobs WHERE status IN ('pending', 'running') ORDER BY created;
```

To get the status of a specific job (e.g., a user in the SQL CLI):

```sql
SELECT status FROM system.jobs WHERE id = ?;
```

# Drawbacks

- Requiring the job coordinator to periodically issue `UPDATE system.jobs SET payload = {...payload, fraction_completed = ?}` queries to update the progress of running jobs is somewhat unsatisfying. Ideally, `fraction_completed` would be computed only when the record is read, but that design would introduce significant implementation complexity.

- Users cannot retrieve fields stored in the protobuf from SQL directly, but several fields that might be useful to users, like `fraction_completed` and `creator`, are stored within the protobuf. We can solve this by introducing a special syntax, like `SHOW JOBS`, if the need arises. Additionally, support for reaching into protobuf columns from a SQL query is planned.

  Note that at least one current customer has requested the ability to query job status from SQL directly. Even without a `SHOW JOBS` command, basic status information (i.e., `pending`, `running`, `succeeded`, or `failed`) is available directly through SQL under this proposal.

# Alternatives

## Wider protobufs

To further minimize the chances that we'll need to modify the `system.jobs` schema, we could instead stuff all the data into the `payload` protobuf:

```sql
CREATE TABLE system.jobs (
    id INT DEFAULT unique_rowid() PRIMARY KEY,
    payload BYTES
)
```

This allows for complete flexibility in adjusting the schema, but prevents essentially all useful SQL queries and indices over the table until protobuf columns are natively supported.

## Narrow protobufs

We could instead make nearly all fields directly queryable by widening the `system.jobs` table to include some (or all) of the fields proposed to be stored in the `payload` protobuf. Following is an example where all but the job-specific fields are pulled out of `payload`.

```sql
CREATE TABLE system.jobs (
    id INT DEFAULT unique_rowid() PRIMARY KEY,
    status STRING NOT NULL,
    description STRING NOT NULL,
    creator STRING NOT NULL,
    nodeID INT,
    created TIMESTAMP NOT NULL DEFAULT now(),
    started TIMESTAMP,
    finished TIMESTAMP,
    modified TIMESTAMP NOT NULL DEFAULT now(),
    descriptors INT[],
    fractionCompleted FLOAT,
    INDEX (status, created)
)
```

The `payload` type then simplifies to the definition below, where the job-specific message types are defined as above.

```protobuf
message JobPayload {
  oneof details {
    BackupJobPayload backup_details = 1;
    RestoreJobPayload restore_details = 2;
    SchemaChangeJobPayload schema_change_details = 3;
  }
}
```

This alternative poses a significant risk if we need to adjust the schema, but unlocks or simplifies some useful SQL queries. Additionally, none of the `UPDATE` queries in the [Expected queries](#expected-queries) section would need to modify the protobuf in this alternative.
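For example, the progress update, which under the main proposal requires the protobuf round trip sketched earlier, collapses to a single in-place statement (Go pseudocode as above):

```go
// updateProgress under the widened schema: with real columns, no protobuf
// decode/encode round trip or compare-and-swap on opaque bytes is needed.
func updateProgress(id int64, fractionCompleted float32) {
	db.Exec(
		"UPDATE system.jobs SET fractionCompleted = ?, modified = now() WHERE id = ?",
		fractionCompleted, id,
	)
}
```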
## Tracking job update history

Also considered was a schema capable of recording every change made to a job. Each job, then, would consist of a collection of records in the `system.jobs` table, one row per update. The table schema would include a timestamp on every row, and the primary key would expand to `(id, timestamp)`:

```sql
CREATE TABLE system.jobs (
    id INT DEFAULT unique_rowid(),
    timestamp TIMESTAMP NOT NULL DEFAULT now(),
    status STRING NOT NULL,
    payload BYTES,
    PRIMARY KEY (id, timestamp)
)
```

The `created`, `started`, and `finished` fields could then be derived from the `timestamp` of the first record in each state, so the protobuf would simplify to:

```protobuf
message JobPayload {
  string description = 1;
  string creator = 2;
  float fraction_completed = 3;
  uint32 node_id = 4;
  repeated uint32 descriptor_ids = 5;
  string error = 6;
  oneof details {
    BackupJobPayload backup_details = 7;
    RestoreJobPayload restore_details = 8;
    SchemaChangeJobPayload schema_change_details = 9;
  }
}
```

The first entry into the table for a given job would include the immutable facts about the job, like the `description` and the `creator`. Future updates to a job would only include the updated fields in the protobuf. A running job would update `fraction_completed` in the usual case, for example, and would update `node_id` if the coordinating node changed.

Protobufs elide omitted fields, so the space requirement of such a scheme is a modest several dozen kilobytes per job, assuming each job is updated several thousand times. Unfortunately, this design drastically complicates the queries necessary to retrieve information from the table. For example, the admin UI would need something like the following to display the list of pending and running jobs:

```sql
SELECT
  latest.id AS id,
  latest.status AS status,
  latest.timestamp AS updated,
  latest.payload AS latestPayload,
  initial.payload AS initialPayload,
  initial.timestamp AS created
FROM (
  SELECT jobs.id, jobs.timestamp, jobs.status, jobs.payload
  FROM (SELECT id, max(timestamp) AS timestamp FROM jobs GROUP BY id) AS latest
  JOIN jobs ON jobs.id = latest.id AND jobs.timestamp = latest.timestamp
) AS latest
JOIN jobs AS initial ON initial.id = latest.id AND initial.status = 'pending'
WHERE latest.status IN ('pending', 'running')
ORDER BY initial.timestamp
```

The above query could be simplified if we instead reproduced the entire record with every update, but that would significantly increase the space requirement. In short, this alternative either complicates the implementation or incurs significant space overhead for no clear win, since we don't currently have a specific compelling use case for a full update history.

# Unresolved questions and future work

- How does a user get a job ID via SQL? Job-creating queries currently block until the job completes; this behavior is consistent with e.g. `ALTER ...` queries in other databases. This means, however, that a job-creating query cannot return a job ID immediately. Users will need to search the `system.jobs` table manually for the record that matches the query they ran. The answer to this question is unlikely to influence the design of the schema itself, since how we communicate the job ID to the user is orthogonal to how we keep track of job progress.

- All system log tables, including `system.jobs`, will eventually need garbage collection to prune old entries from the table, likely with a user-configurable timeframe.

[#7073]: https://github.com/cockroachdb/cockroach/pull/7073
[#11722]: https://github.com/cockroachdb/cockroach/pull/11722
[#12555]: https://github.com/cockroachdb/cockroach/issues/12555
[#13656]: https://github.com/cockroachdb/cockroach/pull/13656
[@a-robinson]: https://github.com/a-robinson
[@danhhz]: https://github.com/danhhz
[@vivekmenezes]: https://github.com/vivekmenezes
[cluster migration framework]: https://github.com/cockroachdb/cockroach/pull/11658