# Online DDL Scheduler

The DDL scheduler is a control plane that runs on a `PRIMARY` vttablet, as part of the state manager. It is responsible for identifying new migration requests, choosing and executing the next migration, reviewing running migrations, cleaning up after completion, and so on.

This document explains the general logic behind `onlineddl.Executor` and, in particular, the scheduling aspect.

## OnlineDDL & VTTablet state manager

`onlineddl.Executor` runs on `PRIMARY` tablets. It `Open`s when a tablet is promoted to `PRIMARY`, and `Close`s when the tablet changes its type away from `PRIMARY`. It only operates while in the open state.

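The open/closed gate can be pictured with a minimal sketch (illustrative only; the field and method bodies are simplified and are not the actual `onlineddl.Executor` implementation):

```go
package onlineddl

import "sync"

// Executor is a simplified stand-in for the real onlineddl.Executor.
type Executor struct {
	mu     sync.Mutex
	isOpen bool
}

// Open is called when the tablet transitions to PRIMARY.
func (e *Executor) Open() {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.isOpen = true
}

// Close is called when the tablet transitions away from PRIMARY.
func (e *Executor) Close() {
	e.mu.Lock()
	defer e.mu.Unlock()
	e.isOpen = false
}

// onTick does nothing unless the executor is open.
func (e *Executor) onTick() {
	e.mu.Lock()
	defer e.mu.Unlock()
	if !e.isOpen {
		return
	}
	// ... scheduling checks run here ...
}
```
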
## General operations

The scheduler:

- Identifies queued migrations
- Picks next migration to run
- Executes a migration
- Follows up on migration progress
- Identifies completion or failure
- Cleans up artifacts
- Identifies stale (rogue) migrations that need to be marked as failed
- Identifies migrations started by another tablet
- Possibly auto-retries migrations

The executor also receives requests from the tablet's query engine/executor to:

- Submit a new migration
- Cancel a migration
- Retry a migration

It also responds on the following API endpoint:

- `/schema-migration/report-status`: called by `gh-ost` and `pt-online-schema-change` to report liveness, completion or failure.

# The scheduler

This section breaks down the scheduler logic.

## Migration states & transitions

A migration can be in any one of these states:

- `queued`: a migration is submitted
- `ready`: a migration is picked from the queue to run
- `running`: a migration was started. The scheduler periodically checks that it is making progress.
- `complete`: a migration completed successfully
- `failed`: a migration failed for whatever reason. It may have run for a while, or it may have been marked as `failed` before even running.
- `cancelled`: a _pending_ migration was cancelled

A migration is said to be _pending_ if we expect it to run and complete. Pending migrations are those in the `queued`, `ready` and `running` states.

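As a rough illustration, the states and the _pending_ predicate could be modeled as follows (a sketch only; the actual type and constant names in the Vitess codebase differ):

```go
package onlineddl

// MigrationStatus enumerates the scheduler states described above.
// Illustrative sketch; not the actual Vitess definitions.
type MigrationStatus string

const (
	StatusQueued    MigrationStatus = "queued"
	StatusReady     MigrationStatus = "ready"
	StatusRunning   MigrationStatus = "running"
	StatusComplete  MigrationStatus = "complete"
	StatusFailed    MigrationStatus = "failed"
	StatusCancelled MigrationStatus = "cancelled"
)

// IsPending reports whether the migration is still expected to run and complete.
func (s MigrationStatus) IsPending() bool {
	switch s {
	case StatusQueued, StatusReady, StatusRunning:
		return true
	default:
		return false
	}
}
```
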
Some possible state transitions are:

- `queued -> ready -> running -> complete`: the ideal flow where everything just works
- `queued -> ready -> running -> failed`: a migration breaks
- `queued -> cancelled`: a migration is cancelled by the user before it is taken out of the queue
- `queued -> ready -> cancelled`: a migration is cancelled by the user before running
- `queued -> ready -> running -> failed`: a running migration is cancelled by the user and forcefully terminated, causing it to enter the `failed` state
- `queued -> ready -> running -> failed -> running`: a failed migration was _retried_
- `queued -> ... cancelled -> queued -> ready -> running`: a cancelled migration was _retried_ (irrespective of whether it was running at the time of cancellation)
- `queued -> ready -> cancelled -> queued -> ready -> running -> failed -> running -> failed -> running -> complete`: a combined flow that shows we can retry multiple times

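Viewed as a state machine, the legal transitions sketched by these flows can be captured in an adjacency map (again illustrative, not the scheduler's actual validation code):

```go
// validTransitions is an illustrative adjacency map of the flows above.
// A retried cancelled migration re-enters at "queued"; a retried failed
// migration goes back to "running".
var validTransitions = map[MigrationStatus][]MigrationStatus{
	StatusQueued:    {StatusReady, StatusCancelled},
	StatusReady:     {StatusRunning, StatusCancelled},
	StatusRunning:   {StatusComplete, StatusFailed},
	StatusFailed:    {StatusRunning},  // retry
	StatusCancelled: {StatusQueued},   // retry re-queues the migration
}
```
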
## General logic

The scheduler works by periodically sampling the known migrations. Normally there's a once-per-minute tick that kicks off a series of checks. You may imagine a state machine that advances once per minute. However, some steps, such as:

- Submission of a new migration
- Migration execution start
- Migration execution completion
- `Open()` state
- Test suite scenario

will kick off a burst of additional ticks. This is done to speed up the progress of the state machine. For example, if a new migration is submitted, there's a good chance it will be clear to execute, so an increase in ticks will start the migration within a few seconds rather than one minute later.

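The tick-plus-burst pattern could be sketched roughly like this (a simplified illustration, not the scheduler's actual timer code; the burst size and pacing are made up):

```go
package onlineddl

import (
	"context"
	"time"
)

// runTicker drives the scheduler: one tick per minute by default, plus a
// burst of extra ticks whenever something is pushed onto the nudge channel
// (e.g. right after a migration is submitted). Illustrative sketch only.
func runTicker(ctx context.Context, onTick func(), nudge <-chan struct{}) {
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			onTick()
		case <-nudge:
			// Burst: a few quick ticks so a freshly submitted migration
			// can start within seconds instead of waiting a full minute.
			for i := 0; i < 5; i++ {
				onTick()
				time.Sleep(time.Second)
			}
		}
	}
}
```
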
By default, Vitess schedules all migrations to run sequentially. Only a single migration is expected to run at any given time. However, there are cases for concurrent execution of migrations, and the user may request concurrent execution via the `--allow-concurrent` flag in `ddl_strategy`. Some migrations are eligible to run concurrently, other migrations are eligible to run specific phases concurrently, and some do not allow concurrency at all. See the user guides for up-to-date information.

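A very small sketch of the gating decision, under the simplifying assumption that eligibility boils down to the `--allow-concurrent` flag (the real rules are strategy- and phase-dependent):

```go
// canStartNow sketches the concurrency gate: with nothing running, any
// migration may start; otherwise only migrations explicitly flagged with
// --allow-concurrent are considered, subject to further checks omitted here.
func canStartNow(allowConcurrent bool, runningCount int) bool {
	if runningCount == 0 {
		return true
	}
	return allowConcurrent
}
```
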
## Who runs the migration

Some migrations are executed by the scheduler itself, some by a sub-process, and some implicitly by VReplication, as follows:

- `CREATE TABLE` migrations are executed by the scheduler.
- `DROP TABLE` migrations are executed by the scheduler.
- `ALTER TABLE` migrations depend on `ddl_strategy`:
  - `vitess`/`online`: the scheduler configures, creates and starts a VReplication stream. From that point on, the tablet manager's VReplication logic takes ownership of the execution. The scheduler periodically checks progress. The scheduler identifies an end-of-migration scenario and finalizes the cut-over and termination of the VReplication stream. It is possible for a VReplication migration to span multiple tablets: if the tablet goes down, the migration is not lost and continues on another tablet, as described below.
  - `gh-ost`: the executor runs `gh-ost` via `os/exec`. It runs the entire flow within a single function. Thus, `gh-ost` completes within the same lifetime as the scheduler (and the tablet process in which it operates). To clarify, if the tablet goes down, then the migration is deemed lost.
  - `pt-osc`: the executor runs `pt-online-schema-change` via `os/exec`. It runs the entire flow within a single function. Thus, `pt-online-schema-change` completes within the same lifetime as the scheduler (and the tablet process in which it operates). To clarify, if the tablet goes down, then the migration is deemed lost.

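The delegation described above can be sketched as a dispatch on `ddl_strategy` (illustrative only; the helper functions are hypothetical stand-ins for the real execution paths):

```go
package onlineddl

import "fmt"

// Hypothetical stand-ins for the real execution paths, present only to make
// the sketch self-contained.
func startVReplicationMigration(ddl string) error { return nil }
func execGhost(ddl string) error                  { return nil }
func execPTOSC(ddl string) error                  { return nil }

// startMigration sketches how an ALTER TABLE migration is delegated based on
// its ddl_strategy.
func startMigration(strategy, ddl string) error {
	switch strategy {
	case "vitess", "online":
		// Configure and start a VReplication stream; the tablet manager's
		// VReplication logic owns the copy/apply work from here on.
		return startVReplicationMigration(ddl)
	case "gh-ost":
		// Run gh-ost as a sub-process; the migration lives and dies with
		// this tablet.
		return execGhost(ddl)
	case "pt-osc":
		// Run pt-online-schema-change as a sub-process; same lifetime
		// constraint as gh-ost.
		return execPTOSC(ddl)
	default:
		return fmt.Errorf("unsupported ddl_strategy: %q", strategy)
	}
}
```
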
## Stale migrations

The scheduler maintains a _liveness_ timestamp for running migrations:

- `vitess`/`online` migrations are based on VReplication, which reports a last-updated timestamp and a transaction timestamp. The scheduler infers migration liveness from these and from the stream status.
- `gh-ost` migrations report liveness via `/schema-migration/report-status`.
- `pt-osc` does not report liveness. The scheduler actively checks for liveness by looking up the `pt-online-schema-change` process.

One way or another, we expect at most (roughly) a 1-minute interval between a running migration's liveness reports. When a migration is expected to be running but has not reported liveness for 10 minutes, it is considered _stale_.

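The staleness rule reduces to a timestamp comparison, roughly as in the sketch below (the 10-minute constant mirrors the text; everything else is illustrative):

```go
package onlineddl

import "time"

// staleMigrationThreshold mirrors the 10-minute rule described above.
const staleMigrationThreshold = 10 * time.Minute

// isStale reports whether a supposedly running migration has gone too long
// without a liveness report. Illustrative sketch.
func isStale(lastLiveness, now time.Time) bool {
	return now.Sub(lastLiveness) > staleMigrationThreshold
}
```
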
A stale migration can happen for various reasons. Perhaps a `pt-osc` process turned into a zombie, or a `gh-ost` process got blocked on a lock.

When the scheduler finds a stale migration, it:

- Considers it to be broken and removes it from the internal bookkeeping of running migrations.
- Takes steps to forcefully terminate it, just in case it still happens to be running:
  - For a `gh-ost` migration, it touches the panic flag file.
  - For `pt-osc`, it `kill`s the process, if any.
  - For `online`, it stops and deletes the stream.

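The forceful-termination step might be sketched per strategy as follows (the helper names are hypothetical; the real code goes through the executor's own process and stream bookkeeping):

```go
// Hypothetical stand-ins for the actual termination actions.
func touchGhostPanicFlagFile(uuid string)         {}
func killPTOSCProcess(uuid string)                {}
func stopAndDeleteVReplicationStream(uuid string) {}

// terminateStaleMigration sketches the cleanup applied to a stale migration,
// in case it still happens to be running somewhere.
func terminateStaleMigration(strategy, uuid string) {
	switch strategy {
	case "gh-ost":
		touchGhostPanicFlagFile(uuid) // gh-ost aborts when its panic flag file appears
	case "pt-osc":
		killPTOSCProcess(uuid) // kill the pt-online-schema-change process, if any
	case "vitess", "online":
		stopAndDeleteVReplicationStream(uuid)
	}
}
```
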
## Failed tablet migrations

A specially handled scenario is one where a migration is running and the owning (primary) tablet fails.

For `gh-ost` and `pt-osc` migrations, it's impossible to resume the migration from the exact point of failure. The scheduler will attempt a full retry of the migration. This means throwing away the previous migration's artifacts (ghost tables) and starting anew.

To avoid a cascading failure scenario, a migration is only auto-retried _once_. If a second tablet failure takes place, it's up to the user to retry the failed migration.

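The retry-once guard could be expressed as simply as this (illustrative; the counter name is made up):

```go
// shouldAutoRetry sketches the decision taken when the scheduler finds a
// migration that was started by a tablet which has since failed: retry it
// automatically, but at most once.
func shouldAutoRetry(tabletFailureRetries int) bool {
	return tabletFailureRetries < 1
}
```
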
## Cross tablet VReplication migrations

VReplication is more capable than `gh-ost` and `pt-osc`, since it tracks its state transactionally in the same database server as the migration/ghost table. This means a stream can automatically recover after e.g. a failover. The new `PRIMARY` tablet has all the information it needs in `_vt.vreplication` and `_vt.copy_state` to keep running the stream.

The scheduler supports this. It is able to identify a stream that was started by a previous tablet, and to take ownership of such a stream. Because VReplication recovers/resumes a stream independently of the scheduler, the scheduler will then implicitly find that the stream is _running_ and be able to assert its _liveness_.

The result is that if a tablet fails mid-`online` migration, the new `PRIMARY` tablet will auto-resume the migration _from the point of interruption_. This happens whether it is the same tablet that recovers as `PRIMARY` or a new tablet that is promoted to `PRIMARY`. A migration can survive multiple tablet failures. It is only limited by VReplication's capabilities.