- Feature Name: Scheduled Jobs
- Status: Ready for review.
- Start Date: 2020-04-14
- Authors: Yevgeniy Miretkiy, David Taylor
- RFC PR: 48131
- Cockroach Issue: 47223


# Summary

Scheduled jobs adds the ability to execute scheduled, periodic jobs in CockroachDB.

# Motivation

Database administrators need to execute certain jobs periodically or at fixed times:
  * Run daily or hourly backups
  * Run an import job daily (for example, to add data to a development database)
    or refresh externally provided datasets
  * Produce monthly reports
  * Export data to reporting or analytics pipelines

Users have also expressed a desire to execute a one-off job
(for example: an administrator creates a temporary database for a user for one week;
one week from now, drop that database).

Of course, these types of activities are already possible using external systems.

However, the external-cron solution has a few deficiencies:
  * Our "make easy" ethos means that, wherever possible, we bundle dependencies
    (config, orchestration, etc.) into the system.  Ideally no separate services
    should be required to operate CockroachDB, including with our recommended
    cadence of periodic backups.
  * It is hard to maintain an external cron mechanism, particularly in an HA deployment or
    in a shared environment (e.g. CockroachDB Cloud, or an on-prem shared database
    where the administrator may want to control how often each database gets backed up).
  * It is more difficult to have the external system take into account
    the current conditions of the database (load, for example) when
    scheduling jobs.

# Detailed design

## System Requirements

The primary motivation for adding a scheduled jobs system is to support the important
case of _making backups easy_.  While backups are the primary
driver, we want scheduled jobs to be flexible enough to support other use cases.

The scheduled jobs goals are:
* Support running different types of work:
    * Backup tasks
    * Import tasks
* Support flexible execution policies:
    * Periodic execution
    * Configurable error handling (e.g. whether or not to start the next run if the previous one failed)
* Support user/role delegation when running jobs and tasks (V2, probably)
* Scheduled jobs must be resilient in a distributed environment
    * Ensure the system spawns only one instance of the task
    * Must be resilient to node restarts/crashes
* Be scalable (particularly for virtualized cloud environments)

The non-goals are:
* Scheduled jobs cannot guarantee that the job will run at
exactly the time specified in the job spec, though the system will try to do so
if possible.
* The scheduled jobs system is not responsible for actually _executing_ the job itself.
That is still handled like any other job by our jobs system.
It is responsible for pushing work into that system at the designated time and
with the configured recurrence policy.


## Scheduled Job

A scheduled job is a task that runs at some time in the future.
The task may be periodic, or it may be a one-off task.

The task itself is treated as a black box: an opaque protocol message which is interpreted
by the task runner -- an executor -- for this particular task.

The task runner itself is something which plans and executes the task -- for example,
a task runner may create a "job" record (an entry in `system.jobs`) for execution by the existing
jobs framework.

The initial implementation will provide one task runner type, which evaluates SQL statements.
We will only support SQL statements that can be executed by the system jobs framework.
Currently, cockroach supports a limited set of such statements (`BACKUP`, `RESTORE`, etc.),
but we plan on evaluating and possibly adding support for running any SQL statement,
wrapped as a system job, in future work.

### SQL Interface with Scheduled Jobs

It should be possible to interact with the scheduled job system via SQL.

We plan on *eventually* supporting the following statements:
  * PAUSE/RESUME SCHEDULE `id`: pauses/resumes the specified schedule.
  * PAUSE JOBS FOR SCHEDULE: pauses any running jobs that were created by this schedule.
  * SHOW SCHEDULES

These statements are syntactic sugar (`SHOW SCHEDULES` is more or less identical to a `SELECT *`).
We will most likely defer work on these statements until later versions of this system.

Also, note that the actual creation of the scheduled job record is outside the scope of this
RFC. This RFC describes a low-level scheduling sub-system.

We plan on implementing other, higher-level abstractions on top of scheduled jobs.
For example, a "backup schedules system" will concern itself with all aspects of periodic backup
schedules.  That system will address its SQL/UI needs
(for example, "CREATE PERIODIC BACKUP ON foo RUNNING@daily") and those higher-level statements
will be responsible for creating the appropriate scheduled jobs records.

We may want to disable direct access (even for root/super users) to the underlying
`system.scheduled_jobs` table, and only provide accessors via those "higher-level" statements
(this is similar to how `jobs` uses virtual tables/show statements).

## Scheduled Job Representation

The system stores scheduled job related information in a new `system.scheduled_jobs` table.
In addition, we add new indexed columns to the `system.jobs` table to correlate
system jobs to their schedule.

#### `system.scheduled_jobs` Table
The scheduled jobs system stores job definitions in the `system.scheduled_jobs` table:

```sql
CREATE TABLE system.scheduled_jobs (
    schedule_id       INT DEFAULT unique_rowid() PRIMARY KEY,
    schedule_name     STRING NOT NULL,
    created           TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    owner             STRING,
    next_run          TIMESTAMPTZ,
    schedule_expr     STRING,
    schedule_details  BYTES,
    executor_type     STRING NOT NULL,
    execution_args    BYTES NOT NULL,
    schedule_changes  BYTES,

    INDEX             (next_run),
    FAMILY "sched"    (schedule_id, next_run),
    FAMILY "other"    (... all other fields...)
)
```

* `schedule_id` is a unique id for this scheduled job
* `schedule_name` is a descriptive name (used when showing this schedule in the UI)
* `created` is the scheduled job creation timestamp
* `owner` is the user who owns (created) this schedule (and, in V1, the user to execute the job as)
* `next_run` is the next scheduled run of this job.
If `next_run` is `NULL`, then the job does not run (it's paused).
* `schedule_expr` is the job schedule specified in crontab format; it may be empty
* `schedule_details` is a serialized `ScheduleDetails` protocol message.
* `executor_type` is a name identifying the executor to use when executing this schedule
* `execution_args` is a serialized `ExecutionArguments` protocol message
* `schedule_changes` is a serialized `ScheduleChangeInfo` protocol message.

The scheduled jobs system uses the `crontab` format to specify the job schedule.  We believe that, even
though this format is not the easiest to use/parse, the crontab format is, nonetheless,
well-established and very flexible.  We will attempt to use existing crontab parsing
packages if possible.

The initial version of scheduled jobs will use the UTC timezone for all schedules.
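
As a rough illustration, the following sketch computes `next_run` from a crontab
expression in UTC. It assumes the widely used `robfig/cron/v3` package purely for
illustration; the actual choice of parsing package is left open by this RFC.

```go
package scheduler // illustrative sketch only

import (
	"time"

	"github.com/robfig/cron/v3"
)

// nextRunFromExpr computes the next execution time, in UTC, for a schedule
// given its crontab expression (e.g. "0 2 * * *" for 02:00 daily).
func nextRunFromExpr(expr string, now time.Time) (time.Time, error) {
	// ParseStandard accepts the standard 5-field crontab format.
	sched, err := cron.ParseStandard(expr)
	if err != nil {
		return time.Time{}, err
	}
	return sched.Next(now.UTC()), nil
}
```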

### Modifications to the `system.jobs` Table

We will modify `system.jobs` to add two columns (along with an index):

```sql
ALTER TABLE system.jobs ADD COLUMN created_by_type STRING;
ALTER TABLE system.jobs ADD COLUMN created_by_id INT;
CREATE INDEX creator_idx ON system.jobs(created_by_type, created_by_id) STORING(status);
```

The purpose of `created_by_type` is to describe which system
created the job, while `created_by_id` is an id specific to the system
which created the job.

The scheduled job system will use `scheduled` as its value for `created_by_type`.

The purpose of these two columns is to enable the scheduled job system to determine
which schedule triggered a particular job and to take action based on the job
execution result (i.e. failed job handling).
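
For illustration, here is a minimal client-level sketch of the kind of lookup these
columns enable: given a schedule ID, list the jobs it spawned and their statuses
(served by the covering index above). Names here are illustrative only; the real
implementation would use the internal executor rather than `database/sql`.

```go
package scheduler // illustrative sketch only

import (
	"context"
	"database/sql"
)

// jobStatusesForSchedule returns the status of every job spawned by the given
// schedule, using the (created_by_type, created_by_id) index STORING (status).
func jobStatusesForSchedule(
	ctx context.Context, db *sql.DB, scheduleID int64,
) (map[int64]string, error) {
	rows, err := db.QueryContext(ctx,
		`SELECT id, status
		   FROM system.jobs
		  WHERE created_by_type = 'scheduled' AND created_by_id = $1`,
		scheduleID)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	statuses := make(map[int64]string)
	for rows.Next() {
		var id int64
		var status string
		if err := rows.Scan(&id, &status); err != nil {
			return nil, err
		}
		statuses[id] = status
	}
	return statuses, rows.Err()
}
```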

These two columns will also be useful for other systems that create jobs and will
provide additional context to operators.  For example, a complex schema change
operation that creates multiple jobs can use these columns to annotate
jobs with useful information (which job spawned some other job, etc.).


#### Alternative: Introduce a new `system.scheduled_job_runs` table

We considered an alternative implementation that does not add columns
to the `system.jobs` table.  Namely, we considered adding a third table,
`system.scheduled_job_runs`, to establish the schedule/job run relation:

```sql
CREATE TABLE system.scheduled_job_runs (
  job_id              INT NOT NULL PRIMARY KEY REFERENCES system.jobs (id) ON DELETE CASCADE,
  schedule_id         INT NOT NULL,
  termination_handled BOOL DEFAULT false,
  INDEX (schedule_id)
)
```

The `job_id` and `schedule_id` columns establish the schedule to job relation.
The `termination_handled` column is used to implement efficient polling,
as well as to assist in the implementation of SQL statements related
to scheduled jobs (e.g. `PAUSE JOBS FOR SCHEDULE 123`).

`job_id` has a foreign key constraint on the `system.jobs` PK to ensure
we clean up `system.scheduled_job_runs` when we clean up `system.jobs`.
However, we do not have an FK constraint on `schedule_id`, in order
to preserve run history even if we delete the schedule.

However, after much deliberation, we decided that this alternative
has the following deficiencies:
  * We burn 2 system table IDs instead of 1.  System table IDs are a precious
  commodity; we only have 13 left before we run out and need to
  come up with alternatives.  Rework and changes to cluster creation/bootstrapping
  are a bit outside the scope of this RFC.
  * More complex migration.  Even though we do not add columns to the `system.jobs`
  table, we still need to modify its descriptor to indicate the foreign key relationship.
  * Interdependence between system tables is probably not a great thing to have
  in general.  For example, this would complicate backup and restore, which would
  now have to deal with restoring the system tables in the right order.
  * An extra table is a bit less efficient, since we would effectively be duplicating
  state already available in the `system.jobs` table.

The benefit of this solution is that we would not be tying the `system.jobs` schema
in any visible way to scheduled jobs.  However, by introducing the FK constraint, we
are tying things together anyway.  The alternative to that is to forego any sort of
FK constraint at the extra cost of having to write and maintain code to clean up
old job run records.  This would complicate debugging and visibility into these
systems, since deletions from system jobs and scheduled job runs would happen under
different schedules/policies.

We believe the two-column addition to be the better alternative.


### Scheduled Job Protocol Messages

The `ScheduleDetails` describes how to schedule the job:
```proto
message ScheduleDetails {
  // WaitBehavior describes how to handle previously started
  // jobs that have not completed yet.
  enum WaitBehavior {
    // Wait for the previous run to complete
    // before starting the next one.
    WAIT = 0;
    // Do not wait for the previous run to complete.
    NO_WAIT = 1;
    // If the previous run is still running, skip this run
    // and advance schedule to the next recurrence.
    SKIP = 2;
  }

  // ErrorHandlingBehavior describes how to handle failed job runs.
  enum ErrorHandlingBehavior {
    // By default, failed jobs will run again, based on their schedule.
    RETRY_SCHED = 0;
    // Retry failed jobs soon.
    RETRY_SOON = 1;
    // Stop running this schedule.
    PAUSE_SCHED = 2;
  }

  // How to handle running jobs started by this schedule.
  WaitBehavior wait = 1;

  // How to handle failed jobs started by this schedule.
  ErrorHandlingBehavior on_error = 2;
}
```

The default scheduling policy is:
  * Wait for the outstanding job to finish before starting the next one
    (`wait == WaitBehavior.WAIT`)
  * If there were any errors in the execution, continue attempting
    to execute the scheduled job (`on_error == RETRY_SCHED`)
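
Assuming standard protoc-gen-go naming for the generated Go types (the exact package
location and generated names below are assumptions for illustration, not something this
RFC fixes), the default policy above would be constructed roughly as:

```go
package scheduler // illustrative sketch only

// The import path below is an assumption about where the generated types would live.
import "github.com/cockroachdb/cockroach/pkg/jobs/jobspb"

// defaultScheduleDetails returns the default policy described above.
func defaultScheduleDetails() jobspb.ScheduleDetails {
	return jobspb.ScheduleDetails{
		Wait:    jobspb.ScheduleDetails_WAIT,
		OnError: jobspb.ScheduleDetails_RETRY_SCHED,
	}
}
```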

Note: `ScheduleDetails` does not include information on when to run the schedule.
This data is available in `system.scheduled_jobs.schedule_expr`.

In subsequent iterations of this system, we can extend the `ScheduleDetails`
protocol message to support more use cases.  Some future enhancements might include:
  * Do not spawn a new job if more than N instances of this job are still running
    (instead of restricting to 1)
  * max_retries: pause the scheduled job if more than X runs resulted in an error.
  * Add execution time limits (kill a job if it runs longer than X)
  * Skip execution if we are too far behind (for example, if the cluster was down,
    we may want to skip all "old" runs and reschedule them to continue based on their
    schedule settings).

The scheduled job system treats the work it needs to execute as a black box.
`ExecutionArguments` describes the work to be executed:

```proto
message ExecutionArguments {
  google.protobuf.Any args = 1;
}
```

`ExecutionArguments` is a protocol message containing opaque `args` which are
arguments specific to different executor implementations (see
[Scheduled Job Executors](#scheduled-job-executors) below).


Finally, whenever we perform changes to a scheduled job (e.g. pause it), we want
to record this information for debugging purposes in the following protocol message:

```proto
// ScheduleChangeInfo describes the reasons for schedule changes.
message ScheduleChangeInfo {
  message Change {
    google.protobuf.Timestamp time = 1;
    string reason = 2;
  }
  repeated Change changes = 1 [(gogoproto.nullable) = false];
}
```

#### Protocol Messages vs Columns: side note
It would be nice to put the schedule in its own protocol message so that we may extend
it in the future (e.g. add a timezone).
However, the use of protocol buffers directly in the database has proven to be difficult
from the UI perspective (the operator may want to view the schedule) as well as for
debuggability.  If native protobuf support improves, we may replace some
of the `system.scheduled_jobs` fields with protocol messages.
See [#47534](https://github.com/cockroachdb/cockroach/issues/47534).

## Scheduled Job Daemon
Each node in the cluster runs a scheduled job execution daemon responsible
for finding and executing periodic jobs (see
[Scheduled Job Execution](#scheduled-job-execution) below).

Each node in the cluster will wait some time (between 2-5 minutes) before performing
the initial scan, to let the cluster initialize itself after a restart.

The scheduling daemon is controlled via cluster settings:
  * `server.job_scheduler.enable`: the main knob to enable/disable the scheduling daemon
  * `server.job_scheduler.pace`: how often to scan the `system.scheduled_jobs` table; default: 1 minute.
    * The daemon will apply a small amount of jitter to this setting (10-20%); see the sketch after this list.
  * `server.job_scheduler.max_started_jobs_per_iteration`: the maximum number of schedules the daemon
  will execute per iteration.  Default: 10 (note: this is per node in the cluster)
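
A minimal sketch of the jitter mentioned above (the helper name and the exact jitter
fraction are illustrative only):

```go
package scheduler // illustrative sketch only

import (
	"math/rand"
	"time"
)

// jitteredPace perturbs the configured scan interval by up to ±frac (e.g. 0.15),
// so that nodes do not all scan system.scheduled_jobs at the same instant.
func jitteredPace(pace time.Duration, frac float64) time.Duration {
	delta := (rand.Float64()*2 - 1) * frac * float64(pace)
	return pace + time.Duration(delta)
}
```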

The `server.job_scheduler.max_started_jobs_per_iteration` setting (in conjunction with
`server.job_scheduler.pace`) acts as a safety valve to avoid starting too many jobs
at the same time -- we want to smooth out the number of jobs we start over time.
In addition, the default schedule setting prevents starting a new instance of the job
if the previous one has not completed yet; this effectively limits how many jobs
can be started by the scheduled job system.

We will rely on `system.jobs` to properly manage the load and not start job execution
if doing so would be detrimental to cluster health.  The `system.jobs` system may, for example,
decide to limit the number of jobs that can be started by any particular user.  As mentioned
above, the default scheduling policy will prevent more jobs from being scheduled.

The load balancing in `system.jobs` has not been implemented yet
([#48825](https://github.com/cockroachdb/cockroach/issues/48825)).

#### Scheduled Job Execution
At a high level, the scheduled job daemon periodically performs the following steps:
  1. Starts a transaction
  1. Polls `system.scheduled_jobs` to find an eligible scheduled job
  1. Updates `next_run` according to the cron expression, setting it to the next computed time,
     or to NULL if there is no next run.
     * This update acquires an intent-based lock on this row and, in effect, exclusively
     locks this row against other dispatchers
  1. Executes the job
      * Job execution means that we *queue* the job to be executed by the system jobs framework.
      * Job execution is done by the job executor configured for this schedule
      * Job execution is subject to cluster conditions (e.g. cluster settings)
      * Future versions will also take into consideration cluster-wide conditions
      (such as load), and may decide to reschedule the job instead of executing it.
  1. Commits the transaction

The daemon will apply a small amount of jitter to its timing to reduce the
probability of conflicting transactions.

The above steps run inside a single transaction to ensure that
only one node starts running a job. Any errors encountered during
execution are handled based on the `ScheduleDetails` configuration.

The important thing to keep in mind is that since we plan on executing
inside the transaction (at least for the initial version), the actual
statement must complete fairly quickly.  To that end, we plan on
modifying the relevant statements (`BACKUP`, `IMPORT`, etc.) to add a variant of these
statements that only queues a job in their transaction
([#47539](https://github.com/cockroachdb/cockroach/issues/47539)).


#### Scheduled Job Executors

Once the determination is made to execute the job, the actual execution
is controlled by an implementation of the `ScheduledJobExecutor` interface.

We plan on supporting multiple types of executors for running periodic jobs, and
those implementations need to register themselves with the scheduled jobs system:

```go
type ScheduledJobExecutorFactory = func(...) (ScheduledJobExecutor, error)

func RegisterScheduledJobExecutorFactory(name string, fact ScheduledJobExecutorFactory) error
```

Executors are identified by name, which is stored in the
`system.scheduled_jobs.executor_type` column.

The initial implementation will come with just one executor, `inline`,
which simply runs the SQL statement via an InternalExecutor
in the same transaction used to read/update the schedule.

Since the `inline` executor would not be suitable for long-running queries, a follow-up can
add additional executor types:
  * SqlJob executor: creates a job to asynchronously execute the SQL statement arguments.
  * backup executor: a special purpose backup executor.

The `ScheduledJobExecutor` interface is defined as follows:
```go
type ScheduledJobExecutor interface {
  ExecuteJob(ctx context.Context, args ExecutionArguments) error
  NotifyJobTermination(ctx context.Context, scheduleID int64, md *jobspb.JobMetadata, txn *kv.Txn) error
}
```

`ExecuteJob`, as the name implies, executes the job.

`NotifyJobTermination` is invoked whenever a job completes.  The executor
has a chance to examine the job's termination status and take action appropriate
for this executor.

The `jobspb.JobMetadata` contains all the relevant job information:
the job ID, the job completion status, as well as an opaque payload
describing the job itself.  The transaction passed to this method
can be used to commit any changes necessary for this particular executor
(for example, a backup-specific executor may want to record some history data in its own table).
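
To make the interface concrete, here is a sketch of what a hypothetical executor
implementation could look like, assumed to live in the same package as the interface
above. The executor name and the logging behavior are purely illustrative.

```go
import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/jobs/jobspb"
	"github.com/cockroachdb/cockroach/pkg/kv"
	"github.com/cockroachdb/cockroach/pkg/util/log"
)

// loggingExecutor is a hypothetical executor that only records what it was
// asked to do; it illustrates the shape of a ScheduledJobExecutor.
type loggingExecutor struct{}

func (e *loggingExecutor) ExecuteJob(ctx context.Context, args ExecutionArguments) error {
	// A real executor would queue the work as a system job here.
	log.Infof(ctx, "would queue a job for args: %v", args)
	return nil
}

func (e *loggingExecutor) NotifyJobTermination(
	ctx context.Context, scheduleID int64, md *jobspb.JobMetadata, txn *kv.Txn,
) error {
	// A real executor could use txn to persist its own bookkeeping; we only log.
	log.Infof(ctx, "schedule %d: job terminated: %+v", scheduleID, md)
	return nil
}

// The implementation would then register itself (e.g. under the name "logging")
// via RegisterScheduledJobExecutorFactory, so that schedules can reference it
// through the executor_type column.
```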


#### Schedule Polling
Our initial MVP version will use a simple polling mechanism where each node picks
the next runnable job and attempts to execute it.

Inside the transaction, we select the next eligible job to run.  Because the chosen
design records the schedule/job relationship in `system.jobs` (via the `created_by_type`
and `created_by_id` columns), the count of still-running jobs for a schedule can be
computed from that table:

```sql
SELECT
  S.*,
  (SELECT count(*)
   FROM system.jobs J
   WHERE J.created_by_type = 'scheduled'
     AND J.created_by_id = S.schedule_id
     AND J.status NOT IN ('succeeded', 'failed', 'canceled')
  ) AS num_running
FROM system.scheduled_jobs S
WHERE S.next_run < current_timestamp()
```

If `num_running > 0` and the schedule's wait policy is `WAIT`, then we need to
wait for the previous invocation to complete.  To do this, we simply advance
the `next_run` time by some amount and commit the transaction.

Once the job execution policy constraints have been satisfied,
we can compute the next time this job is supposed to run.
If the schedule is set, compute `next_run` based on the schedule.
If not, this is a one-off job: set `next_run` to `NULL` (and update
the change info to indicate that the schedule completed).

Finally, execute the statement and commit the transaction.
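
A client-level sketch of this claim-and-start transaction (the real daemon runs inside
the node using the internal executor; helper and parameter names here are illustrative
only, and the actual job queueing is elided):

```go
package scheduler // illustrative sketch only

import (
	"context"
	"database/sql"
	"time"
)

// claimAndStartOne finds one due schedule, advances its next_run, and (conceptually)
// queues its job, all inside a single transaction.
func claimAndStartOne(
	ctx context.Context, db *sql.DB,
	nextRun func(expr string, now time.Time) (time.Time, error),
) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op if the transaction commits

	// 1. Find an eligible schedule.
	var id int64
	var expr sql.NullString
	if err := tx.QueryRowContext(ctx,
		`SELECT schedule_id, schedule_expr
		   FROM system.scheduled_jobs
		  WHERE next_run < current_timestamp()
		  LIMIT 1`).Scan(&id, &expr); err != nil {
		return err // sql.ErrNoRows means nothing is due this round.
	}

	// 2. Advance next_run (NULL for one-off schedules). The write intent on this
	//    row effectively locks it against other nodes claiming the same schedule.
	var next interface{}
	if expr.Valid && expr.String != "" {
		t, err := nextRun(expr.String, time.Now().UTC())
		if err != nil {
			return err
		}
		next = t
	}
	if _, err := tx.ExecContext(ctx,
		`UPDATE system.scheduled_jobs SET next_run = $1 WHERE schedule_id = $2`,
		next, id); err != nil {
		return err
	}

	// 3. Execute: the executor configured for this schedule queues the actual work
	//    as a system job (omitted here).

	// 4. Commit, making the claim and the queued job visible atomically.
	return tx.Commit()
}
```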

We have multiple avenues for improving scalability and performance in future releases,
driven by customer needs (Cockroach Cloud, etc.):
  * Nodes may pick a random runnable job (instead of the oldest) to reduce contention
  * If polling becomes problematic (scale), we can explore using another mechanism
  (leasing/locks) to coordinate which node has exclusive dispatch responsibility,
  and then allow that node to load the schedule into a more efficient structure
  like a timer wheel that is populated via polling or a rangefeed.
  * We can add a coordinator job responsible for finding lots of work to do and
    spawning these jobs
  * Loading some (or all) scheduled state into memory to avoid rescans

#### Job Completion Handling

Recall that the scheduled job framework is responsible for spawning the
system job to do the actual work.

The schedule executor needs to know when the job completes.
We will defer the implementation of a generic job completion
notification mechanism to later versions.

Since backup is the only user of scheduled jobs at this point,
we will modify the backup resumer to call `NotifyJobTermination` before
it completes.


#### Pause/Resume
Pausing the schedule involves setting `next_run` to NULL (and updating the
change info appropriately, e.g. reason="paused by an operator").

Un-pausing the schedule involves computing the `next_run` time based on the schedule
(and recording the change info).
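
A minimal sketch of both operations (helper names are illustrative; recording the
change reason into `schedule_changes` is omitted, and `nextRun` would be something
like the crontab helper sketched earlier):

```go
package scheduler // illustrative sketch only

import (
	"context"
	"database/sql"
	"time"
)

// pauseSchedule clears next_run so the daemon stops picking the schedule up.
func pauseSchedule(ctx context.Context, tx *sql.Tx, id int64) error {
	_, err := tx.ExecContext(ctx,
		`UPDATE system.scheduled_jobs SET next_run = NULL WHERE schedule_id = $1`, id)
	return err
}

// resumeSchedule recomputes next_run from the stored crontab expression.
func resumeSchedule(
	ctx context.Context, tx *sql.Tx, id int64,
	nextRun func(expr string, now time.Time) (time.Time, error),
) error {
	var expr string
	if err := tx.QueryRowContext(ctx,
		`SELECT schedule_expr FROM system.scheduled_jobs WHERE schedule_id = $1`, id,
	).Scan(&expr); err != nil {
		return err
	}
	next, err := nextRun(expr, time.Now().UTC())
	if err != nil {
		return err
	}
	_, err = tx.ExecContext(ctx,
		`UPDATE system.scheduled_jobs SET next_run = $1 WHERE schedule_id = $2`, next, id)
	return err
}
```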

#### One-Off Jobs
It is possible to use scheduled jobs to execute a one-off job.  A one-off job is represented
by having an entry in the `system.scheduled_jobs` table that has a `next_run` set, but does not
have a `schedule_expr`.

## Monitoring and Logging

Since the scheduled jobs system is responsible for running unattended jobs, it is important that
we provide visibility into the health of those jobs.

Schedule changes are reflected in the `system.scheduled_jobs.schedule_changes`
column.  However, once the work is handed off to the system jobs framework, the
logging and visibility into the actual job will be the responsibility of that system
(see [#47212](https://github.com/cockroachdb/cockroach/issues/47212)).

The full details of the monitoring integration are outside the scope
of this document; however, at the very least, we will export the following
metrics:
   * number of paused scheduled jobs
   * number of jobs started

## Future work

* Support user/role delegation: run a job as another user
* Support job prioritization
* Support scheduling virtualization
* Define a new runnable job that wraps an arbitrary block of SQL statements,
  so that the scheduled job system could then create such jobs, effectively allowing
  the execution of any SQL statement as a scheduled job
* Save terminal job state information into a "history" table upon job completion.