# Long Running Tasks

Currently, in Vitess, long running tasks are used in a couple of places:

* `vtworker` runs tasks that can take a long time, as they can deal with a lot
  of data.

* `vttablet` runs backups and restores that can also take a long time.

In both cases, we have a streaming RPC API that starts the job, and streams some
status back while it is running. If the streaming RPC is interrupted, the
behavior is different: `vtworker` and `vttablet backup` will interrupt their
jobs, whereas `vttablet restore` will try to keep going if it is past a point of
no return for restores.

This document proposes a different, but common design for both use cases.

## RPC Model

We introduce a model with three different RPCs:

* Starting the job is a synchronous non-streaming RPC that just starts the
  job. It returns an identifier for the job.

* Getting the job status is a streaming RPC. It takes the job identifier, and
  streams the current status back. This job status can contain log entries
  (regular messages) or progress (percentage, N/M completion, ...).

* Canceling a running job is its own synchronous non-streaming RPC, and takes
  the job identifier.

For simplicity, let's try to make these the same API for `vtworker` and
`vttablet`. Usually, a single destination can only run a single job, but let's
not assume that in the API. If a destination process cannot run a job, it should
return the usual `RESOURCE_EXHAUSTED` canonical error code.

These RPCs should be grouped in a new API service. Let's describe it as usual in
`jobdata.proto` and `jobservice.proto`. The current `vtworkerdata.proto` and
`vtworkerservice.proto` will eventually be removed and replaced by the new
service.

Let's use the usual `repeated string args` to describe the job. `vtworker`
already uses that.

The proposed proto definitions are:

```proto
// in jobdata.proto

message StartRequest {
  repeated string args = 1;
}

message StartResponse {
  string uid = 1;
}

message StatusRequest {
  string uid = 1;
}

// Progress describes the current progress of the task.
// Note the fields here match the Progress and ProgressMessage from the Node
// display of workflows.
message Progress {
  // percentage can be 0-100 if known, or -1 if unknown.
  int32 percentage = 1;

  // message can be empty if percentage is set.
  string message = 2;
}

// FinalStatus describes the end result of a job.
message FinalStatus {
  // error is empty if the job was successful.
  string error = 1;
}

// StatusResponse can have any of its fields set.
message StatusResponse {
  // event is optional, used for logging.
  logutil.Event event = 1;

  // progress is optional, used to indicate progress.
  Progress progress = 2;

  // If final_status is set, this is the last StatusResponse for this job:
  // the job has terminated.
  FinalStatus final_status = 3;
}

message CancelRequest {
  string uid = 1;
}

message CancelResponse {
}

// in jobservice.proto

service Job {
  rpc Start (StartRequest) returns (StartResponse) {};

  rpc Status (StatusRequest) returns (stream StatusResponse) {};

  rpc Cancel (CancelRequest) returns (CancelResponse) {};
}
```

## Integration with Current Components

### vtworker

This design is very simple to implement within vtworker. At first, we don't need
to link the progress in, just the logging part.

vtworker will only support running one job like this at a time.

### vttablet

This is also somewhat easy to implement within vttablet. Only `Backup` and
`Restore` will be changed to use this.

vttablet will only support running one job like this at a time.
It will also take the ActionLock, so no other tablet actions can run at the
same time (as we do now).

### vtctld Workflows Integration

The link here is also very straightforward:

* When successfully starting a remote job, the address of the remote worker and
  the UID of the job can be checkpointed.

* After that, the workflow can just connect and update its status and logs when
  receiving an update.

* If the workflow is aborted and reloaded somewhere else (vtctld restart), it
  can reconnect to the running job easily.

* Canceling the job is also easy: just call the `Cancel` RPC.

### Comments

Both vtworker and vttablet could remember the last N jobs they ran, and their
status. So when a workflow tries to reconnect to a finished job, they just
stream a single `StatusResponse` with a `final_status` field.
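
One way to remember the last N jobs is a small bounded history keyed by job
UID. The Go sketch below is only an illustration under that assumption:
`History`, `NewHistory`, `Record`, and `Lookup` are hypothetical names, and
`FinalStatus` is a hand-written mirror of the proposed proto message rather
than generated code.

```go
package main

import "sync"

// FinalStatus mirrors the proposed FinalStatus message: Error is empty when
// the job was successful.
type FinalStatus struct {
	Error string
}

// History is a hypothetical bounded record of the most recently finished jobs.
type History struct {
	mu    sync.Mutex
	limit int
	order []string // uids of remembered jobs, oldest first
	final map[string]FinalStatus
}

// NewHistory returns a History that remembers at most limit finished jobs.
func NewHistory(limit int) *History {
	return &History{limit: limit, final: make(map[string]FinalStatus)}
}

// Record stores the final status of a finished job and evicts the oldest
// entries once more than limit jobs are remembered.
func (h *History) Record(uid string, st FinalStatus) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if _, ok := h.final[uid]; !ok {
		h.order = append(h.order, uid)
	}
	h.final[uid] = st
	for len(h.order) > h.limit {
		delete(h.final, h.order[0])
		h.order = h.order[1:]
	}
}

// Lookup returns the final status of a finished job if it is still remembered.
func (h *History) Lookup(uid string) (FinalStatus, bool) {
	h.mu.Lock()
	defer h.mu.Unlock()
	st, ok := h.final[uid]
	return st, ok
}
```

A `Status` handler could call `Lookup` first and, on a hit, send one
`StatusResponse` with `final_status` set before closing the stream.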