- Feature Name: Scheduled Jobs
- Status: Ready for review.
- Start Date: 2020-04-14
- Authors: Yevgeniy Miretkiy, David Taylor
- RFC PR: 48131
- Cockroach Issue: 47223


# Summary

Scheduled jobs adds the ability to execute scheduled, periodic jobs in CockroachDB.

# Motivation

Database administrators need to execute certain jobs periodically or at fixed times:
  * Run daily or hourly backups
  * Run an import job daily (for example, to add data to a development database)
    or refresh externally provided datasets
  * Produce monthly reports
  * Export data to reporting or analytics pipelines

Users have also expressed a desire to execute a one-off job
(for example: an administrator creates a temporary database for a user for one week;
one week from now, drop that database).

Of course, these types of activities are already possible using external systems.

However, the external-cron solution has a few deficiencies:
  * Our "make easy" ethos means that, wherever possible, we bundle dependencies
    (config, orchestration, etc.) into the system.  Ideally no separate services
    should be required to operate CockroachDB, including with our recommended
    cadence of periodic backups.
  * It is hard to maintain an external cron mechanism, particularly in an HA deployment or
    in a shared environment (e.g. CockroachDB Cloud, or an on-prem shared database
    where the administrator may want to control how often each database gets backed up).
  * It is more difficult to have the external system take into account
    the current conditions of the database (load, for example) when
    scheduling jobs.

# Detailed design

## System Requirements

The primary motivation for adding a scheduled jobs system is to support the important
case of _making backups easy_.  While backups are the primary
driver, we want scheduled jobs to be flexible enough to support other use cases.

The scheduled jobs goals are:
* Support running different types of work:
    * Backup tasks
    * Import tasks
* Support flexible execution policies:
    * Periodic execution
    * Configurable error handling (e.g. whether or not to start the next run if the previous one failed)
* Support user/role delegation when running jobs and tasks (V2, probably)
* Scheduled jobs must be resilient in a distributed environment
    * Ensure the system spawns only one instance of the task
    * Must be resilient to node restarts/crashes
* Be scalable (particularly for virtualized cloud environments)

The non-goals are:
* Scheduled jobs cannot guarantee that the job will run at
exactly the time specified in the job spec, though the system will try to do so
if possible.
* The scheduled jobs system is not responsible for actually _executing_ the job itself.
That is still handled like any other job by our jobs system.
It is responsible for pushing work into that system at the designated time and
with the configured recurrence policy.


## Scheduled Job

A scheduled job is a task that runs at some time in the future.
The task may be periodic, or it may be a one-off task.

The task itself is treated as a black box: an opaque protocol message which is interpreted
by the task runner -- an executor -- for this particular task.

The task runner itself is something which plans and executes the task -- for example,
a task runner may create a "job" record (an entry in `system.jobs`) for execution by the existing
jobs framework.

The initial implementation will provide one task runner type, which evaluates SQL statements.
We will only support SQL statements that can be executed by the system jobs framework.
Currently, cockroach supports a limited set of such statements (`BACKUP`, `RESTORE`, etc.),
but we plan on evaluating and possibly adding support for running any SQL statement,
wrapped as a system job, in future work.

### SQL Interface with Scheduled Jobs

It should be possible to interact with the scheduled job system via SQL.

We plan on *eventually* supporting the following statements:
  * PAUSE/RESUME SCHEDULE `id`: pauses/resumes the specified schedule.
  * PAUSE JOBS FOR SCHEDULE: pauses any running jobs that were created by this schedule.
  * SHOW SCHEDULES

These statements are syntactic sugar (`SHOW SCHEDULES` is more or less identical to a `SELECT *`).
We will most likely defer work on these statements until later versions of this system.

Also, note that the actual creation of the scheduled job record is outside the scope of this
RFC. This RFC describes a low-level scheduling sub-system.

We plan on implementing other, higher-level abstractions on top of scheduled jobs.
For example, a "backup schedules system" will concern itself with all aspects of periodic backup
schedules.  That system will address its SQL/UI needs
(for example, "CREATE PERIODIC BACKUP ON foo RUNNING@daily") and those higher-level statements
will be responsible for creating the appropriate scheduled jobs records.

We may want to disable direct access (even for root/super users) to the underlying
`system.scheduled_jobs` table, and only provide accessors via those "higher-level" statements
(this is similar to how `jobs` uses virtual tables/show statements).

## Scheduled Job Representation

The system stores scheduled job related information in a new `system.scheduled_jobs` table.
In addition, we add new indexed columns to the `system.jobs` table to correlate
system jobs to their schedule.

#### `system.scheduled_jobs` Table
The scheduled jobs system stores job definitions in the `system.scheduled_jobs` table:

```sql
CREATE TABLE system.scheduled_jobs (
    schedule_id       INT DEFAULT unique_rowid() PRIMARY KEY,
    schedule_name     STRING NOT NULL,
    created           TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    owner             STRING,
    next_run          TIMESTAMPTZ,
    schedule_expr     STRING,
    schedule_details  BYTES,
    executor_type     STRING NOT NULL,
    execution_args    BYTES NOT NULL,
    schedule_changes  BYTES,

    INDEX             (next_run),
    FAMILY "sched"    (schedule_id, next_run),
    FAMILY "other"    (... all other fields...)
)
```

* `schedule_id` is a unique id for this scheduled job
* `schedule_name` is a descriptive name (used when showing this schedule in the UI)
* `created` is the scheduled job creation timestamp
* `owner` is the user who owns (created) this schedule (and, in V1, the user to execute the job as)
* `next_run` is the next scheduled run of this job.
If `next_run` is `NULL`, then the job does not run (it's paused).
* `schedule_expr` is the job schedule specified in crontab format; it may be empty
* `schedule_details` is a serialized `ScheduleDetails` protocol message.
* `executor_type` is a name identifying the executor to use when executing this schedule
* `execution_args` is a serialized `ExecutionArguments` protocol message
* `schedule_changes` is a serialized `ScheduleChangeInfo` protocol message.

The scheduled jobs system uses the `crontab` format to specify the job schedule.  We believe that, even
though this format is not the easiest to use/parse, the crontab format is, nonetheless,
well-established and very flexible.  We will attempt to use existing crontab parsing
packages if possible.

The initial version of scheduled jobs will use the UTC timezone for all schedules.
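
As a rough illustration, the following sketch computes `next_run` from a crontab
expression in UTC. It assumes the widely used `robfig/cron/v3` package purely for
illustration; the actual choice of parsing package is left open by this RFC.

```go
package scheduler // illustrative sketch only

import (
	"time"

	"github.com/robfig/cron/v3"
)

// nextRunFromExpr computes the next execution time, in UTC, for a schedule
// given its crontab expression (e.g. "0 2 * * *" for 02:00 daily).
func nextRunFromExpr(expr string, now time.Time) (time.Time, error) {
	// ParseStandard accepts the standard 5-field crontab format.
	sched, err := cron.ParseStandard(expr)
	if err != nil {
		return time.Time{}, err
	}
	return sched.Next(now.UTC()), nil
}
```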

### Modifications to the `system.jobs` Table

We will modify `system.jobs` to add two columns (along with an index):

```sql
ALTER TABLE system.jobs ADD COLUMN created_by_type STRING;
ALTER TABLE system.jobs ADD COLUMN created_by_id INT;
CREATE INDEX creator_idx ON system.jobs(created_by_type, created_by_id) STORING(status);
```

The purpose of `created_by_type` is to describe which system
created the job, while `created_by_id` is an id specific to the system
which created the job.

The scheduled job system will use `scheduled` as its value for `created_by_type`.

The purpose of these two columns is to enable the scheduled job system to determine
which schedule triggered a particular job and to take action based on the job
execution result (i.e. failed job handling).
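
For illustration, here is a minimal client-level sketch of the kind of lookup these
columns enable: given a schedule ID, list the jobs it spawned and their statuses
(served by the covering index above). Names here are illustrative only; the real
implementation would use the internal executor rather than `database/sql`.

```go
package scheduler // illustrative sketch only

import (
	"context"
	"database/sql"
)

// jobStatusesForSchedule returns the status of every job spawned by the given
// schedule, using the (created_by_type, created_by_id) index STORING (status).
func jobStatusesForSchedule(
	ctx context.Context, db *sql.DB, scheduleID int64,
) (map[int64]string, error) {
	rows, err := db.QueryContext(ctx,
		`SELECT id, status
		   FROM system.jobs
		  WHERE created_by_type = 'scheduled' AND created_by_id = $1`,
		scheduleID)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	statuses := make(map[int64]string)
	for rows.Next() {
		var id int64
		var status string
		if err := rows.Scan(&id, &status); err != nil {
			return nil, err
		}
		statuses[id] = status
	}
	return statuses, rows.Err()
}
```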

These two columns will also be useful for other systems that create jobs and will
provide additional context to operators.  For example, a complex schema change
operation that creates multiple jobs can use these columns to annotate
jobs with useful information (which job spawned some other job, etc.).


#### Alternative: Introduce a new `system.scheduled_job_runs` table

We considered an alternative implementation that does not add columns
to the `system.jobs` table.  Namely, we considered adding a third table,
`system.scheduled_job_runs`, to establish the schedule/job run relation:

```sql
CREATE TABLE system.scheduled_job_runs (
  job_id              INT NOT NULL PRIMARY KEY REFERENCES system.jobs (id) ON DELETE CASCADE,
  schedule_id         INT NOT NULL,
  termination_handled BOOL DEFAULT false,
  INDEX (schedule_id)
)
```

The `job_id` and `schedule_id` columns establish the schedule to job relation.
The `termination_handled` column is used to implement efficient polling,
as well as to assist in the implementation of SQL statements related
to scheduled jobs (e.g. `PAUSE JOBS FOR SCHEDULE 123`).

`job_id` has a foreign key constraint on the `system.jobs` PK to ensure
we clean up `system.scheduled_job_runs` when we clean up `system.jobs`.
However, we do not have an FK constraint on `schedule_id`, in order
to preserve run history even if we delete the schedule.

However, after much deliberation, we decided that this alternative
has the following deficiencies:
  * We burn 2 system table IDs instead of 1.  System table IDs are a precious
  commodity; we only have 13 left before we run out and need to
  come up with alternatives.  Rework and changes to cluster creation/bootstrapping
  are a bit outside the scope of this RFC.
  * More complex migration.  Even though we do not add columns to the `system.jobs`
  table, we still need to modify its descriptor to indicate the foreign key relationship.
  * Interdependence between system tables is probably not a great thing to have
  in general.  For example, this would complicate backup and restore, which would
  now have to deal with restoring the system tables in the right order.
  * An extra table is a bit less efficient, since we would effectively be duplicating
  state already available in the `system.jobs` table.

The benefit of this solution is that we would not be tying the `system.jobs` schema
in any visible way to scheduled jobs.  However, by introducing the FK constraint, we
are tying things together anyway.  The alternative to that is to forego any sort of
FK constraint at the extra cost of having to write and maintain code to clean up
old job run records.  This would complicate debugging and visibility into these
systems, since deletions from system jobs and scheduled job runs would happen under
different schedules/policies.

We believe the two-column addition to be the better alternative.


### Scheduled Job Protocol Messages

The `ScheduleDetails` describes how to schedule the job:
```proto
message ScheduleDetails {
  // WaitBehavior describes how to handle previously started
  // jobs that have not completed yet.
  enum WaitBehavior {
    // Wait for the previous run to complete
    // before starting the next one.
    WAIT = 0;
    // Do not wait for the previous run to complete.
    NO_WAIT = 1;
    // If the previous run is still running, skip this run
    // and advance schedule to the next recurrence.
    SKIP = 2;
  }

  // ErrorHandlingBehavior describes how to handle failed job runs.
  enum ErrorHandlingBehavior {
    // By default, failed jobs will run again, based on their schedule.
    RETRY_SCHED = 0;
    // Retry failed jobs soon.
    RETRY_SOON = 1;
    // Stop running this schedule.
    PAUSE_SCHED = 2;
  }

  // How to handle running jobs started by this schedule.
  WaitBehavior wait = 1;

  // How to handle failed jobs started by this schedule.
  ErrorHandlingBehavior on_error = 2;
}
```

The default scheduling policy is:
  * Wait for the outstanding job to finish before starting the next one
    (`wait == WaitBehavior.WAIT`)
  * If there were any errors in the execution, continue attempting
    to execute the scheduled job (`on_error == RETRY_SCHED`)
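
Assuming standard protoc-gen-go naming for the generated Go types (the exact package
location and generated names below are assumptions for illustration, not something this
RFC fixes), the default policy above would be constructed roughly as:

```go
package scheduler // illustrative sketch only

// The import path below is an assumption about where the generated types would live.
import "github.com/cockroachdb/cockroach/pkg/jobs/jobspb"

// defaultScheduleDetails returns the default policy described above.
func defaultScheduleDetails() jobspb.ScheduleDetails {
	return jobspb.ScheduleDetails{
		Wait:    jobspb.ScheduleDetails_WAIT,
		OnError: jobspb.ScheduleDetails_RETRY_SCHED,
	}
}
```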

Note: `ScheduleDetails` does not include information on when to run the schedule.
This data is available in `system.scheduled_jobs.schedule_expr`.

In subsequent iterations of this system, we can extend the `ScheduleDetails`
protocol message to support more use cases.  Some future enhancements might include:
  * Do not spawn a new job if more than N instances of this job are still running
    (instead of restricting to 1)
  * max_retries: pause the scheduled job if more than X runs resulted in an error.
  * Add execution time limits (kill a job if it runs longer than X)
  * Skip execution if we are too far behind (for example, if the cluster was down,
    we may want to skip all "old" runs and reschedule them to continue based on their
    schedule settings).

The scheduled job system treats the work it needs to execute as a black box.
`ExecutionArguments` describes the work to be executed:

```proto
message ExecutionArguments {
  google.protobuf.Any args = 1;
}
```

`ExecutionArguments` is a protocol message containing opaque `args` which are
arguments specific to different executor implementations (see
[Scheduled Job Executors](#scheduled-job-executors) below).


Finally, whenever we perform changes to a scheduled job (e.g. pause it), we want
to record this information for debugging purposes in the following protocol message:

```proto
// ScheduleChangeInfo describes the reasons for schedule changes.
message ScheduleChangeInfo {
  message Change {
    google.protobuf.Timestamp time = 1;
    string reason = 2;
  }
  repeated Change changes = 1 [(gogoproto.nullable) = false];
}
```

#### Protocol Messages vs Columns: side note
It would be nice to put the schedule in its own protocol message so that we may extend
it in the future (e.g. add a timezone).
However, the use of protocol buffers directly in the database has proven to be difficult
from the UI perspective (the operator may want to view the schedule) as well as for
debuggability.  If native protobuf support improves, we may replace some
of the `system.scheduled_jobs` fields with protocol messages.
See [#47534](https://github.com/cockroachdb/cockroach/issues/47534).

## Scheduled Job Daemon
Each node in the cluster runs a scheduled job execution daemon responsible
for finding and executing periodic jobs (see
[Scheduled Job Execution](#scheduled-job-execution) below).

Each node in the cluster will wait some time (between 2-5 minutes) before performing
the initial scan, to let the cluster initialize itself after a restart.

The scheduling daemon is controlled via cluster settings:
  * `server.job_scheduler.enable`: the main knob to enable/disable the scheduling daemon
  * `server.job_scheduler.pace`: how often to scan the `system.scheduled_jobs` table; default: 1 minute.
    * The daemon will apply a small amount of jitter to this setting (10-20%); see the sketch after this list.
  * `server.job_scheduler.max_started_jobs_per_iteration`: the maximum number of schedules the daemon
  will execute per iteration.  Default: 10 (note: this is per node in the cluster)
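
A minimal sketch of the jitter mentioned above (the helper name and the exact jitter
fraction are illustrative only):

```go
package scheduler // illustrative sketch only

import (
	"math/rand"
	"time"
)

// jitteredPace perturbs the configured scan interval by up to ±frac (e.g. 0.15),
// so that nodes do not all scan system.scheduled_jobs at the same instant.
func jitteredPace(pace time.Duration, frac float64) time.Duration {
	delta := (rand.Float64()*2 - 1) * frac * float64(pace)
	return pace + time.Duration(delta)
}
```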

The `server.job_scheduler.max_started_jobs_per_iteration` setting (in conjunction with
`server.job_scheduler.pace`) acts as a safety valve to avoid starting too many jobs
at the same time -- we want to smooth out the number of jobs we start over time.
In addition, the default schedule setting prevents starting a new instance of the job
if the previous one has not completed yet; this effectively limits how many jobs
can be started by the scheduled job system.

We will rely on `system.jobs` to properly manage the load and not start job execution
if doing so would be detrimental to cluster health.  The `system.jobs` system may, for example,
decide to limit the number of jobs that can be started by any particular user.  As mentioned
above, the default scheduling policy will prevent more jobs from being scheduled.

The load balancing in `system.jobs` has not been implemented yet
([#48825](https://github.com/cockroachdb/cockroach/issues/48825)).

#### Scheduled Job Execution
At a high level, the scheduled job daemon periodically performs the following steps:
  1. Starts a transaction
  1. Polls `system.scheduled_jobs` to find an eligible scheduled job
  1. Updates `next_run` according to the cron expression, setting it to the next computed time,
     or to NULL if there is no next run.
     * This update acquires an intent-based lock on this row and, in effect, exclusively
     locks this row against other dispatchers
  1. Executes the job
      * Job execution means that we *queue* the job to be executed by the system jobs framework.
      * Job execution is done by the job executor configured for this schedule
      * Job execution is subject to cluster conditions (e.g. cluster settings)
      * Future versions will also take into consideration cluster-wide conditions
      (such as load), and may decide to reschedule the job instead of executing it.
  1. Commits the transaction

The daemon will apply a small amount of jitter to its timing to reduce the
probability of conflicting transactions.

The above steps run inside a single transaction to ensure that
only one node starts running a job. Any errors encountered during
execution are handled based on the `ScheduleDetails` configuration.

The important thing to keep in mind is that since we plan on executing
inside the transaction (at least for the initial version), the actual
statement must complete fairly quickly.  To that end, we plan on
modifying the relevant statements (`BACKUP`, `IMPORT`, etc.) to add a variant of these
statements that only queues a job in their transaction
([#47539](https://github.com/cockroachdb/cockroach/issues/47539)).


#### Scheduled Job Executors

Once the determination is made to execute the job, the actual execution
is controlled by an implementation of the `ScheduledJobExecutor` interface.

We plan on supporting multiple types of executors for running periodic jobs, and
those implementations need to register themselves with the scheduled jobs system:

```go
type ScheduledJobExecutorFactory = func(...) (ScheduledJobExecutor, error)

func RegisterScheduledJobExecutorFactory(name string, fact ScheduledJobExecutorFactory) error
```

Executors are identified by name, which is stored in the
`system.scheduled_jobs.executor_type` column.

The initial implementation will come with just one executor, `inline`,
which simply runs the SQL statement via an InternalExecutor
in the same transaction used to read/update the schedule.

Since the `inline` executor would not be suitable for long-running queries, a follow-up can
add additional executor types:
  * SqlJob executor: creates a job to asynchronously execute the SQL statement arguments.
  * backup executor: a special purpose backup executor.

The `ScheduledJobExecutor` interface is defined as follows:
```go
type ScheduledJobExecutor interface {
  ExecuteJob(ctx context.Context, args ExecutionArguments) error
  NotifyJobTermination(ctx context.Context, scheduleID int64, md *jobspb.JobMetadata, txn *kv.Txn) error
}
```

`ExecuteJob`, as the name implies, executes the job.

`NotifyJobTermination` is invoked whenever a job completes.  The executor
has a chance to examine the job's termination status and take action appropriate
for this executor.

The `jobspb.JobMetadata` contains all the relevant job information:
the job ID, the job completion status, as well as an opaque payload
describing the job itself.  The transaction passed to this method
can be used to commit any changes necessary for this particular executor
(for example, a backup-specific executor may want to record some history data in its own table).
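
To make the interface concrete, here is a sketch of what a hypothetical executor
implementation could look like, assumed to live in the same package as the interface
above. The executor name and the logging behavior are purely illustrative.

```go
import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/jobs/jobspb"
	"github.com/cockroachdb/cockroach/pkg/kv"
	"github.com/cockroachdb/cockroach/pkg/util/log"
)

// loggingExecutor is a hypothetical executor that only records what it was
// asked to do; it illustrates the shape of a ScheduledJobExecutor.
type loggingExecutor struct{}

func (e *loggingExecutor) ExecuteJob(ctx context.Context, args ExecutionArguments) error {
	// A real executor would queue the work as a system job here.
	log.Infof(ctx, "would queue a job for args: %v", args)
	return nil
}

func (e *loggingExecutor) NotifyJobTermination(
	ctx context.Context, scheduleID int64, md *jobspb.JobMetadata, txn *kv.Txn,
) error {
	// A real executor could use txn to persist its own bookkeeping; we only log.
	log.Infof(ctx, "schedule %d: job terminated: %+v", scheduleID, md)
	return nil
}

// The implementation would then register itself (e.g. under the name "logging")
// via RegisterScheduledJobExecutorFactory, so that schedules can reference it
// through the executor_type column.
```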


#### Schedule Polling
Our initial MVP version will use a simple polling mechanism where each node picks
the next runnable job and attempts to execute it.

Inside the transaction, we select the next eligible job to run.  Because the chosen
design records the schedule/job relationship in `system.jobs` (via the `created_by_type`
and `created_by_id` columns), the count of still-running jobs for a schedule can be
computed from that table:

```sql
SELECT
  S.*,
  (SELECT count(*)
   FROM system.jobs J
   WHERE J.created_by_type = 'scheduled'
     AND J.created_by_id = S.schedule_id
     AND J.status NOT IN ('succeeded', 'failed', 'canceled')
  ) AS num_running
FROM system.scheduled_jobs S
WHERE S.next_run < current_timestamp()
```

If `num_running > 0` and the schedule's wait policy is `WAIT`, then we need to
wait for the previous invocation to complete.  To do this, we simply advance
the `next_run` time by some amount and commit the transaction.

Once the job execution policy constraints have been satisfied,
we can compute the next time this job is supposed to run.
If the schedule is set, compute `next_run` based on the schedule.
If not, this is a one-off job: set `next_run` to `NULL` (and update
the change info to indicate that the schedule completed).

Finally, execute the statement and commit the transaction.
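
A client-level sketch of this claim-and-start transaction (the real daemon runs inside
the node using the internal executor; helper and parameter names here are illustrative
only, and the actual job queueing is elided):

```go
package scheduler // illustrative sketch only

import (
	"context"
	"database/sql"
	"time"
)

// claimAndStartOne finds one due schedule, advances its next_run, and (conceptually)
// queues its job, all inside a single transaction.
func claimAndStartOne(
	ctx context.Context, db *sql.DB,
	nextRun func(expr string, now time.Time) (time.Time, error),
) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback() // no-op if the transaction commits

	// 1. Find an eligible schedule.
	var id int64
	var expr sql.NullString
	if err := tx.QueryRowContext(ctx,
		`SELECT schedule_id, schedule_expr
		   FROM system.scheduled_jobs
		  WHERE next_run < current_timestamp()
		  LIMIT 1`).Scan(&id, &expr); err != nil {
		return err // sql.ErrNoRows means nothing is due this round.
	}

	// 2. Advance next_run (NULL for one-off schedules). The write intent on this
	//    row effectively locks it against other nodes claiming the same schedule.
	var next interface{}
	if expr.Valid && expr.String != "" {
		t, err := nextRun(expr.String, time.Now().UTC())
		if err != nil {
			return err
		}
		next = t
	}
	if _, err := tx.ExecContext(ctx,
		`UPDATE system.scheduled_jobs SET next_run = $1 WHERE schedule_id = $2`,
		next, id); err != nil {
		return err
	}

	// 3. Execute: the executor configured for this schedule queues the actual work
	//    as a system job (omitted here).

	// 4. Commit, making the claim and the queued job visible atomically.
	return tx.Commit()
}
```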

We have multiple avenues for improving scalability and performance in future releases,
driven by customer needs (Cockroach Cloud, etc.):
  * Nodes may pick a random runnable job (instead of the oldest) to reduce contention
  * If polling becomes problematic (scale), we can explore using another mechanism
  (leasing/locks) to coordinate which node has exclusive dispatch responsibility,
  and then allow that node to load the schedule into a more efficient structure
  like a timer wheel that is populated via polling or a rangefeed.
  * We can add a coordinator job responsible for finding lots of work to do and
    spawning these jobs
  * Loading some (or all) scheduled state into memory to avoid rescans

#### Job Completion Handling

Recall that the scheduled job framework is responsible for spawning the
system job to do the actual work.

The schedule executor needs to know when the job completes.
We will defer the implementation of a generic job completion
notification mechanism to later versions.

Since backup is the only user of scheduled jobs at this point,
we will modify the backup resumer to call `NotifyJobTermination` before
it completes.


#### Pause/Resume
Pausing the schedule involves setting `next_run` to NULL (and updating the
change info appropriately, e.g. reason="paused by an operator").

Un-pausing the schedule involves computing the `next_run` time based on the schedule
(and recording the change info).
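
A minimal sketch of both operations (helper names are illustrative; recording the
change reason into `schedule_changes` is omitted, and `nextRun` would be something
like the crontab helper sketched earlier):

```go
package scheduler // illustrative sketch only

import (
	"context"
	"database/sql"
	"time"
)

// pauseSchedule clears next_run so the daemon stops picking the schedule up.
func pauseSchedule(ctx context.Context, tx *sql.Tx, id int64) error {
	_, err := tx.ExecContext(ctx,
		`UPDATE system.scheduled_jobs SET next_run = NULL WHERE schedule_id = $1`, id)
	return err
}

// resumeSchedule recomputes next_run from the stored crontab expression.
func resumeSchedule(
	ctx context.Context, tx *sql.Tx, id int64,
	nextRun func(expr string, now time.Time) (time.Time, error),
) error {
	var expr string
	if err := tx.QueryRowContext(ctx,
		`SELECT schedule_expr FROM system.scheduled_jobs WHERE schedule_id = $1`, id,
	).Scan(&expr); err != nil {
		return err
	}
	next, err := nextRun(expr, time.Now().UTC())
	if err != nil {
		return err
	}
	_, err = tx.ExecContext(ctx,
		`UPDATE system.scheduled_jobs SET next_run = $1 WHERE schedule_id = $2`, next, id)
	return err
}
```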

#### One-Off Jobs
It is possible to use scheduled jobs to execute a one-off job.  A one-off job is represented
by having an entry in the `system.scheduled_jobs` table that has a `next_run` set, but does not
have a `schedule_expr`.

## Monitoring and Logging

Since the scheduled jobs system is responsible for running unattended jobs, it is important that
we provide visibility into the health of those jobs.

Schedule changes are reflected in the `system.scheduled_jobs.schedule_changes`
column.  However, once the work is handed off to the system jobs framework, the
logging and visibility into the actual job will be the responsibility of that system
(see [#47212](https://github.com/cockroachdb/cockroach/issues/47212)).

The full details of the monitoring integration are outside the scope
of this document; however, at the very least, we will export the following
metrics:
   * number of paused scheduled jobs
   * number of jobs started

## Future work

* Support user/role delegation: run a job as another user
* Support job prioritization
* Support scheduling virtualization
* Define a new runnable job that wraps an arbitrary block of SQL statements,
  so that the scheduled job system could then create such jobs, effectively allowing
  the execution of any SQL statement as a scheduled job
* Save terminal job state information into a "history" table upon job completion.