vitess.io/vitess@v0.16.2/doc/design-docs/LongRunningJobs.md

# Long Running Tasks

Currently, in Vitess, long running tasks are used in a couple of places:

* `vtworker` runs tasks that can take a long time, as they can deal with a lot
  of data.

* `vttablet` runs backups and restores that can also take a long time.

In both cases, we have a streaming RPC API that starts the job, and streams some
status back while it is running. If the streaming RPC is interrupted, the
behavior differs: `vtworker` and `vttablet backup` will interrupt their
jobs, whereas `vttablet restore` will try to keep going if it is past a point of
no return for restores.

This document proposes a different, but common design for both use cases.

## RPC Model

We introduce a model with three different RPCs:

* Starting the job is a synchronous non-streaming RPC that just starts the
  job. It returns an identifier for the job.

* Getting the job status is a streaming RPC. It takes the job identifier, and
  streams the current status back. This job status can contain log entries
  (regular messages) or progress (percentage, N/M completion, ...).

* Canceling a running job is its own synchronous non-streaming RPC, and takes
  the job identifier.

For simplicity, let's try to make these the same API for `vtworker` and
`vttablet`. Usually, a single destination can only run a single job, but let's
not assume that in the API. If a destination process cannot run a job, it should
return the usual `RESOURCE_EXHAUSTED` canonical error code.

These RPCs should be grouped in a new API service. Let's describe it as usual in
`jobdata.proto` and `jobservice.proto`. The current `vtworkerdata.proto` and
`vtworkerservice.proto` will eventually be removed and replaced by the new
service.

Let's use the usual `repeated string args` to describe the job. `vtworker`
already uses that.

So the proposed proto definitions:

``` proto

// in jobdata.proto

syntax = "proto3";

// logutil.Event is defined in logutil.proto.
import "logutil.proto";

message StartRequest {
  repeated string args = 1;
}

message StartResponse {
  string uid = 1;
}

message StatusRequest {
  string uid = 1;
}

// Progress describes the current progress of the task.
// Note the fields here match the Progress and ProgressMessage from the Node
// display of workflows.
message Progress {
  // percentage can be 0-100 if known, or -1 if unknown.
  // (protobuf has no int8 type; int32 is the smallest signed integer.)
  int32 percentage = 1;

  // message can be empty if percentage is set.
  string message = 2;
}

// FinalStatus describes the end result of a job.
message FinalStatus {
  // error is empty if the job was successful.
  string error = 1;
}

// StatusResponse can have any of its fields set.
message StatusResponse {
  // event is optional, used for logging.
  logutil.Event event = 1;

  // progress is optional, used to indicate progress.
  Progress progress = 2;

  // If final_status is set, the job has terminated and this is the last
  // StatusResponse for it.
  FinalStatus final_status = 3;
}

message CancelRequest {
  string uid = 1;
}

message CancelResponse {
}

// in jobservice.proto

syntax = "proto3";

import "jobdata.proto";

service Job {
  rpc Start (StartRequest) returns (StartResponse) {};

  rpc Status (StatusRequest) returns (stream StatusResponse) {};

  rpc Cancel (CancelRequest) returns (CancelResponse) {};
}
```

## Integration with Current Components

### vtworker

This design is very simple to implement within vtworker. At first, only the
logging part needs to be hooked up; progress reporting can be added later.

vtworker will only support running one job like this at a time.

### vttablet

This is also somewhat easy to implement within vttablet. Only `Backup` and
`Restore` will be changed to use this.

vttablet will only support running one job like this at a time. It will also
take the ActionLock, so no other tablet actions can run at the same time (as we
do now).

### vtctld Workflows Integration

The link here is also very straightforward:

* When successfully starting a remote job, the address of the remote worker and
  the UID of the job can be checkpointed.

* After that, the workflow can just connect and update its status and logs when
  receiving an update.

* If the workflow is aborted and reloaded somewhere else (vtctld restart), it
  can reconnect to the running job easily.

* Canceling the job is also easy: just call the `Cancel` RPC.
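
The checkpointing step in the list above could look roughly like the following Go sketch. It assumes the workflow stores its checkpoint as an opaque byte blob and uses JSON purely for illustration; `JobCheckpoint`, `Save`, `Load` and the sample address are hypothetical and not part of the workflow framework.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// JobCheckpoint is what a workflow needs to find a remote job again
// (hypothetical type).
type JobCheckpoint struct {
	Address string `json:"address"` // address of the remote worker
	UID     string `json:"uid"`     // identifier returned by Start
}

// Save serializes the checkpoint so it can be persisted with the workflow.
func Save(cp JobCheckpoint) ([]byte, error) {
	return json.Marshal(cp)
}

// Load restores a checkpoint and rejects one that cannot be used to reconnect.
func Load(data []byte) (JobCheckpoint, error) {
	var cp JobCheckpoint
	if err := json.Unmarshal(data, &cp); err != nil {
		return JobCheckpoint{}, err
	}
	if cp.Address == "" || cp.UID == "" {
		return JobCheckpoint{}, fmt.Errorf("incomplete checkpoint: %q", data)
	}
	return cp, nil
}

func main() {
	// Right after a successful Start: persist where the job lives.
	data, err := Save(JobCheckpoint{Address: "worker-1:15033", UID: "job-1"})
	if err != nil {
		panic(err)
	}

	// After a vtctld restart: reload, then call Status (or Cancel) with cp.UID
	// against cp.Address.
	cp, err := Load(data)
	fmt.Println(cp.Address, cp.UID, err)
}
```

Since the address and the UID are all the workflow needs to call `Status` or `Cancel` again, nothing else has to survive a vtctld restart.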

### Comments

Both vtworker and vttablet could remember the last N jobs they ran, and their
status. Then, when a workflow tries to reconnect to a finished job, they just
stream a single `StatusResponse` with a `final_status` field.
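
One way to remember the last N jobs is a small bounded history keyed by job identifier, sketched below in Go. This is only an illustration under stated assumptions: `History`, `Record` and `Lookup` are hypothetical names, `FinalStatus` is a hand-written stand-in for the generated message, and the sketch is not safe for concurrent use without an added lock.

```go
package main

import "fmt"

// FinalStatus is a stand-in for the proposed message of the same name.
type FinalStatus struct {
	Error string // empty if the job was successful
}

// History remembers the final status of the last N finished jobs
// (hypothetical type, not safe for concurrent use).
type History struct {
	limit int
	order []string // uids, oldest first
	final map[string]FinalStatus
}

// NewHistory returns a history that keeps at most limit jobs.
func NewHistory(limit int) *History {
	return &History{limit: limit, final: make(map[string]FinalStatus)}
}

// Record stores the final status of a finished job, evicting the oldest
// entries once more than limit jobs are remembered.
func (h *History) Record(uid string, fs FinalStatus) {
	if _, ok := h.final[uid]; !ok {
		h.order = append(h.order, uid)
	}
	h.final[uid] = fs
	for len(h.order) > h.limit {
		delete(h.final, h.order[0])
		h.order = h.order[1:]
	}
}

// Lookup returns the final status for uid, if it is still remembered. A Status
// handler could answer a late reconnect with a single response built from it.
func (h *History) Lookup(uid string) (FinalStatus, bool) {
	fs, ok := h.final[uid]
	return fs, ok
}

func main() {
	h := NewHistory(2)
	h.Record("job-1", FinalStatus{})
	h.Record("job-2", FinalStatus{Error: "canceled"})
	h.Record("job-3", FinalStatus{})

	_, ok := h.Lookup("job-1") // evicted: only the last two jobs are kept
	fs, _ := h.Lookup("job-2")
	fmt.Println(ok, fs.Error)
}
```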