---
layout: post
title: JOB
permalink: /docs/cli/job
redirect_from:
 - /cli/job.md/
 - /docs/cli/job.md/
---

# Introduction, background, definitions

Batch operations that run asynchronously and may take seconds (or minutes, or hours) to execute are called eXtended actions (xactions).

Internally, `xaction` is an abstraction at the root of the inheritance hierarchy that also contains specific user-visible jobs: `copy-bucket`, `evict-objects`, and more.

> For the most recently updated list of all supported jobs and their respective compile-time properties, see the [source](https://github.com/NVIDIA/aistore/blob/main/xact/api.go#L108).

**All jobs run asynchronously, have start and stop times, and common generic statistics**

Further, every job kind has its own display name, access permissions, scope (bucket and/or global), and a number of boolean properties. Examples include:

| Property | Description |
| --- | --- |
| `Startable` | true if user can start this job via generic job-start API |
| `RefreshCap` | the system must refresh capacity stats upon the job's completion |

Many kinds of jobs can be manually started via the generic job API (which is in turn utilized by the `ais start` command - see next).

Notable exceptions include electing a new primary and listing objects in a given bucket - in both of those cases, there's a separate, more convenient and intuitive API that does the job, so to speak.

> Job starting, stopping (i.e., aborting), and monitoring commands all have equivalent *shorter* versions. For instance, `ais job start download` can be expressed as `ais start download`, while `ais wait copy-bucket Z8WkHxwIrr` is the same as `ais wait Z8WkHxwIrr`.

The rest of this document covers starting, stopping, and otherwise managing job kinds and specific job instances.
For [job monitoring](/docs/cli/show.md#ais-show-job), please use the `ais show job` command and its numerous subcommands and options.

* [`ais show job`](/docs/cli/show.md#ais-show-job)

### See also

- [static descriptors (source code)](https://github.com/NVIDIA/aistore/blob/main/xact/api.go#L108)
- [`xact` package README](/xact/README.md)
- [`batch jobs`](/docs/batch.md)
- [CLI: `dsort` (distributed shuffle)](/docs/cli/dsort.md)
- [CLI: `download` from any remote source](/docs/cli/download.md)
- [built-in `rebalance`](/docs/rebalance.md)

# `ais job` command

Has the following static completions aka subcommands:

```console
$ ais job <TAB-TAB>
start   stop    wait    rm      show
```

and further:

```console
$ ais job --help
NAME:
   ais job - monitor, query, start/stop and manage jobs and eXtended actions (xactions)

USAGE:
   ais job command [command options] [arguments...]

COMMANDS:
   start  run batch job
   stop   terminate a single batch job or multiple jobs (press <TAB-TAB> to select, '--help' for options)
   wait   wait for a specific batch job to complete (press <TAB-TAB> to select, '--help' for options)
   rm     cleanup finished jobs
   show   show running and finished jobs ('--all' for all, or press <TAB-TAB> to select, '--help' for options)

OPTIONS:
   --help, -h  show help
```

Notice, though, that `start`, `stop`, and `wait` (verbs) have shorter versions, e.g.:

* `ais start` is a built-in alias for `ais job start`, and so on.

> For all configured pre-built and user-defined aliases (aka "shortcuts"), run `ais alias` or `ais alias --help`

## Table of Contents
- [Start job](#start-job)
- [Stop job](#stop-job)
- [Show job statistics](#show-job-statistics)
  - [Show extended statistics](#show-extended-statistics)
- [Wait for job](#wait-for-job)
- [Distributed Sort](#distributed-sort)
- [Downloader](#downloader)

## Start job

`ais start <JOB_NAME> [arguments...]`

Start a job of the given kind. Some jobs require additional arguments, such as a bucket name.

Note: `job start download|dsort` have slightly different options. Please see their documentation for more:
* [`job start download`](download.md#start-download-job)
* [`job start dsort`](dsort.md#start-dsort-job)

### Examples

#### Start cluster-wide LRU

Starts the LRU xaction on all nodes.

```console
$ ais start lru
Started "lru" xaction.
```

An administrator may choose to run LRU on a subset of buckets. To do so, pass the `--buckets` flag with a comma-separated list of the buckets to process, for instance `--buckets bck1,gcp://bck2`.
Additionally, the `--force` (`-f`) option can be used to override the bucket's `lru.enabled` property.

**Note:** To ensure safety, the force flag (`-f`) only works when a list of buckets is provided.

```console
$ ais start lru --buckets ais://buck1,aws://buck2 -f
```

## Stop job

`ais stop [NAME] [JOB_ID] [NODE_ID] [BUCKET]`

Stop a single job or multiple jobs.

### Examples stopping a single job:

* `ais stop download JOB_ID`
* `ais stop JOB_ID`
* `ais stop dsort JOB_ID`

### Examples stopping multiple jobs:

* `ais stop download --all` # stop all downloads
* `ais stop copy-bucket ais://abc --all` # stop all `copy-bucket` jobs where the destination bucket is ais://abc
* `ais stop resilver t[rt2erGhbr]` # ask target t[rt2erGhbr] to stop resilvering

and more.

Note: `job stop download|dsort` have slightly different options. Please see their documentation for more:
* [`job stop download`](download.md#stop-download-job)
* [`job stop dsort`](dsort.md#stop-dsort-job)

### More Examples

#### Stop cluster-wide LRU

Stops currently running LRU eviction.

```console
$ ais stop lru
Stopped LRU eviction.
```

## Show job statistics

`ais show job [NAME] [JOB_ID] [NODE_ID] [BUCKET]`

You can show jobs by any combination of the optional (filtering) arguments: NAME, JOB_ID, etc.

Use the `--all` option to include finished (or aborted) jobs.

As usual, press `<TAB-TAB>` to select and see `--help` for details.

> `job show download|dsort` have slightly different options. Please see their documentation for more:
* [`job show download`](download.md#show-download-jobs-and-job-status)
* [`job show dsort`](dsort.md#show-dsort-jobs-and-job-status)

### Show extended statistics

All jobs show the number of processed objects (column `OBJECTS`) and the total size of the data (column `BYTES`).
Both values are cumulative over the job's entire lifetime.
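
The compact table includes a `NODE` column, so a job that runs on several nodes may appear as several rows, and a cluster-wide total is then the sum of the per-node `OBJECTS` values. The sketch below reads `ais show job` output on stdin and adds up that column for one job ID. `sum_objects` is a hypothetical helper name, not part of the AIS CLI, and the field positions are an assumption based on the sample output in the Examples section (every column populated, plain integer counts).

```bash
# sum_objects JOB_ID: hypothetical helper, not part of the AIS CLI.
# Reads the compact `ais show job` table on stdin and sums OBJECTS
# (assumed to be field 5) over all rows whose ID (field 2) matches.
sum_objects() {
  awk -v id="$1" '
    NR > 1 && $2 == id { total += $5 }   # NR > 1 skips the header row
    END { print total + 0 }              # prints 0 when nothing matched
  '
}
```

Possible usage: `ais show job --all | sum_objects FXjl0NWGOU`. If a column such as `BUCKET` can be empty, the field positions shift and this sketch no longer applies; the `--json` output is probably a sturdier starting point in that case.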

Certain kinds of supported jobs provide extended statistics, including:

#### Show EC Encoding Statistics

The output contains a few extra columns:

- `ERRORS` - the total number of objects EC failed to encode
- `QUEUE` - the average length of the working queue: the average number of objects waiting in the queue when a new EC encode request is received. Values close to `0` mean that every object was processed immediately after the request had been received
- `AVG TIME` - the average total processing time for an object: from the moment the object is put into the working queue to the moment the last encoded slice is sent to another target
- `ENC TIME` - the average amount of time spent on encoding an object.

The extended statistics may hint at the likely bottleneck:

- high values in `QUEUE` - EC is congested and does not have time to process all incoming requests
- low values in `QUEUE` and `ENC TIME`, but high ones in `AVG TIME` may mean that the network is slow and a lot of time is spent on sending the encoded slices
- low values in `QUEUE`, and `ENC TIME` close to `AVG TIME` may mean that the local hardware (either local drives or CPUs) is overloaded.

#### Show EC Restoring Statistics

Show information about EC restore requests.

The output contains a few extra columns:

- `ERRORS` - the total number of objects EC failed to restore
- `QUEUE` - the average length of the working queue: the average number of objects waiting in the queue when a new EC restore request is received.
Values close to `0` mean that every object was processed immediately after the request had been received
- `AVG TIME` - the average total processing time for an object: from the moment the object is put into the working queue to the moment the last encoded slice is sent to another target

### Options

| Flag | Type | Description | Default |
| --- | --- | --- | --- |
| `--json` | `bool` | Output details in JSON format | `false` |
| `--all` | `bool` | If set, additionally displays old, finished xactions | `false` |
| `--active` | `bool` | If set, displays only running xactions | `false` |
| `--verbose` `-v` | `bool` | If set, displays all xaction statistics including extended ones. If more than one xaction is to be displayed, the flag is ignored. | `false` |

Certain extended actions have additional CLI. In particular, rebalance stats can also be displayed using the following command:

`ais show rebalance`

Display details about the most recent rebalance xaction.

| Flag | Type | Description | Default |
| --- | --- | --- | --- |
| `--refresh` | `duration` | Refresh interval - time duration between reports. The usual unit suffixes are supported and include `m` (for minutes), `s` (seconds), `ms` (milliseconds). Ctrl-C to stop monitoring. | ` ` |
| `--all` | `bool` | If set, show all rebalance xactions | `false` |

The output of this command differs from the generic xaction output.
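
For scripting beyond what `--all` and `--active` select, the generic `ais show job` table (not the rebalance view) can be filtered on its last column. The following is a minimal sketch, assuming `STATE` is always the last whitespace-separated field and `ID` the second, as in the sample output in the Examples section. `job_ids_by_state` is a hypothetical helper name, not an AIS command.

```bash
# job_ids_by_state STATE: hypothetical helper, not part of the AIS CLI.
# Reads the compact `ais show job` table on stdin and prints the unique
# job IDs (assumed field 2) of rows whose last field equals STATE.
job_ids_by_state() {
  awk -v state="$1" 'NR > 1 && $NF == state { print $2 }' | sort -u
}
```

Possible usage: `ais show job --all | job_ids_by_state Aborted`. `Aborted` is the only state value that appears in this document; any other state name should be checked against real output first.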

### Examples

Default compact tabular view:

```console
$ ais show job --all
NODE         ID          KIND    BUCKET                   OBJECTS  BYTES    START           END             STATE
zXZXt8084    FXjl0NWGOU  ec-put  TESTAISBUCKET-ec-mpaths  5        4.56MiB  12-02 13:04:50  12-02 13:04:50  Aborted
```

Verbose tabular view:

```console
$ ais show job FXjl0NWGOU --verbose
PROPERTY             VALUE
.aborted             true
.bck                 ais://TESTAISBUCKET-ec-mpaths
.end                 12-02 13:04:50
.id                  FXjl0NWGOU
.kind                ec-put
.start               12-02 13:04:50
ec.delete.err.n      0
ec.delete.n          0
ec.delete.time       0s
ec.encode.err.n      0
ec.encode.n          5
ec.encode.size       4.56MiB
ec.encode.time       16.964552ms
ec.obj.process.time  17.142239ms
ec.queue.len.n       0
in.obj.n             0
in.obj.size          0
is_idle              true
loc.obj.n            5
loc.obj.size         4.56MiB
out.obj.n            0
out.obj.size         0
```

## Wait for job

`ais wait [NAME] [JOB_ID] [NODE_ID] [BUCKET]`

Wait for the specified job to finish.

> `job wait download|dsort` have slightly different options. Please see their documentation for more:
* [`job wait download`](download.md#wait-for-download-job)
* [`job wait dsort`](dsort.md#wait-for-dsort-job)

### Options

| Flag | Type | Description | Default |
| --- | --- | --- | --- |
| `--refresh` | `duration` | Refresh interval - time duration between reports. The usual unit suffixes are supported and include `m` (for minutes), `s` (seconds), `ms` (milliseconds) | ` ` |

## Distributed Sort

`ais start dsort` or `ais job start dsort`

Run [dSort](/docs/dsort.md).
[Further reference for this command can be found here.](dsort.md)

## Downloader

`ais start download` or `ais job start download`

Run the AIS [Downloader](/docs/README.md).
[Further reference for this command can be found here.](downloader.md)