---
layout: post
title: JOB
permalink: /docs/cli/job
redirect_from:
 - /cli/job.md/
 - /docs/cli/job.md/
---

# Introduction, background, definitions

Batch operations that run asynchronously and may take seconds, minutes, or even hours to execute are called eXtended actions (xactions).

Internally, `xaction` is an abstraction at the root of the inheritance hierarchy that also contains specific user-visible jobs: `copy-bucket`, `evict-objects`, and more.

> For the most recently updated list of all supported jobs and their respective compile-time properties, see the [source](https://github.com/NVIDIA/aistore/blob/main/xact/api.go#L108).

**All jobs run asynchronously, have start and stop times, and common generic statistics.**

Further, every job kind has its own display name, access permissions, scope (bucket and/or global), and a number of boolean properties, for example:

| Property | Description |
| --- | --- |
| `Startable` | true if the user can start this job via the generic job-start API |
| `RefreshCap` | the system must refresh capacity stats upon the job's completion |

Many kinds of jobs can be started manually via the generic job API (which, in turn, is used by the `ais start` command - see next).

Notable exceptions include electing a new primary and listing objects in a given bucket: in both cases, there's a separate, more convenient and intuitive API that does the job, so to speak.

> Job starting, stopping (i.e., aborting), and monitoring commands all have equivalent *shorter* versions. For instance, `ais job start download` can be expressed as `ais start download`, while `ais wait copy-bucket Z8WkHxwIrr` is the same as `ais wait Z8WkHxwIrr`.

The rest of this document covers starting, stopping, and otherwise managing job kinds and specific job instances. For [job monitoring](/docs/cli/show.md#ais-show-job), use the `ais show job` command and its numerous subcommands and options.

* [`ais show job`](/docs/cli/show.md#ais-show-job)

### See also

- [static descriptors (source code)](https://github.com/NVIDIA/aistore/blob/main/xact/api.go#L108)
- [`xact` package README](/xact/README.md)
- [`batch jobs`](/docs/batch.md)
- [CLI: `dsort` (distributed shuffle)](/docs/cli/dsort.md)
- [CLI: `download` from any remote source](/docs/cli/download.md)
- [built-in `rebalance`](/docs/rebalance.md)

# `ais job` command

Has the following static completions, aka subcommands:

```console
$ ais job <TAB-TAB>
start   stop    wait    rm     show
```
and further:

```console
$ ais job --help
NAME:
   ais job - monitor, query, start/stop and manage jobs and eXtended actions (xactions)

USAGE:
   ais job command [command options] [arguments...]

COMMANDS:
   start  run batch job
   stop   terminate a single batch job or multiple jobs (press <TAB-TAB> to select, '--help' for options)
   wait   wait for a specific batch job to complete (press <TAB-TAB> to select, '--help' for options)
   rm     cleanup finished jobs
   show   show running and finished jobs ('--all' for all, or press <TAB-TAB> to select, '--help' for options)

OPTIONS:
   --help, -h  show help
```

Note, though, that the `start`, `stop`, and `wait` verbs have shorter versions, e.g.:

* `ais start` is a built-in alias for `ais job start`, and so on.

> For all configured pre-built and user-defined aliases (aka "shortcuts"), run `ais alias` or `ais alias --help`

## Table of Contents
- [Start job](#start-job)
- [Stop job](#stop-job)
- [Show job statistics](#show-job-statistics)
  - [Show extended statistics](#show-extended-statistics)
- [Wait for job](#wait-for-job)
- [Distributed Sort](#distributed-sort)
- [Downloader](#downloader)

## Start job

`ais start <JOB_NAME> [arguments...]`

Start a job of the given kind. Some jobs require additional arguments, such as a bucket name.

Note: `job start download` and `job start dsort` have slightly different options. See their documentation for details:
* [`job start download`](download.md#start-download-job)
* [`job start dsort`](dsort.md#start-dsort-job)

### Examples

#### Start cluster-wide LRU

Starts the LRU xaction on all nodes.

```console
$ ais start lru
Started "lru" xaction.
```
An administrator may choose to run LRU on a subset of buckets by passing the `--buckets` flag with a comma-separated list of buckets, for instance `--buckets bck1,gcp://bck2`.
Additionally, the `--force` (`-f`) option can be used to override the bucket's `lru.enabled` property.

**Note:** To ensure safety, the force flag (`-f`) only works when a list of buckets is provided.

```console
$ ais start lru --buckets ais://buck1,aws://buck2 -f
```
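When the bucket list comes from a script, the command line can be assembled programmatically. The following is a minimal sketch, not part of the CLI: `lru_cmd` is a hypothetical helper that only prints the command it would run, and its refusal of `-f` without buckets is the helper's own policy mirroring the note above. Only `--buckets` and `-f` are taken from this document.

```bash
#!/usr/bin/env bash
# lru_cmd: hypothetical helper that prints an `ais start lru` command line
# without running it. Only `--buckets` and `-f` come from the text above.
#
# usage: lru_cmd [-f] [BUCKET...]
lru_cmd() {
  local force=0
  if [ "${1:-}" = "-f" ]; then
    force=1
    shift
  fi

  if [ "$#" -eq 0 ]; then
    # Helper policy (assumption): refuse -f without an explicit bucket list.
    if [ "$force" -eq 1 ]; then
      echo "lru_cmd: -f requires at least one bucket" >&2
      return 1
    fi
    echo "ais start lru"
    return 0
  fi

  # Join the remaining arguments with commas: b1,b2,...
  local IFS=,
  local buckets="$*"
  if [ "$force" -eq 1 ]; then
    echo "ais start lru --buckets $buckets -f"
  else
    echo "ais start lru --buckets $buckets"
  fi
}

lru_cmd                              # prints: ais start lru
lru_cmd -f ais://buck1 aws://buck2   # prints: ais start lru --buckets ais://buck1,aws://buck2 -f
```

Printing the command rather than executing it keeps the sketch safe to try on a machine without a running cluster.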

## Stop job

`ais stop [NAME] [JOB_ID] [NODE_ID] [BUCKET]`

Stop a single job or multiple jobs.

### Examples stopping a single job:

* `ais stop download JOB_ID`
* `ais stop JOB_ID`
* `ais stop dsort JOB_ID`

### Examples stopping multiple jobs:

* `ais stop download --all` - stop all downloads
* `ais stop copy-bucket ais://abc --all` - stop all `copy-bucket` jobs where the destination bucket is `ais://abc`
* `ais stop resilver t[rt2erGhbr]` - ask target `t[rt2erGhbr]` to stop resilvering

and more.

Note: `job stop download` and `job stop dsort` have slightly different options. See their documentation for details:
* [`job stop download`](download.md#stop-download-job)
* [`job stop dsort`](dsort.md#stop-dsort-job)

### More Examples

#### Stop cluster-wide LRU

Stops currently running LRU eviction.

```console
$ ais stop lru
Stopped LRU eviction.
```

## Show job statistics

`ais show job [NAME] [JOB_ID] [NODE_ID] [BUCKET]`

You can show jobs by any combination of the optional (filtering) arguments: NAME, JOB_ID, etc.

Use the `--all` option to include finished (or aborted) jobs.

As usual, press `<TAB-TAB>` to select and see `--help` for details.

> `job show download` and `job show dsort` have slightly different options. See their documentation for details:
* [`job show download`](download.md#show-download-jobs-and-job-status)
* [`job show dsort`](dsort.md#show-dsort-jobs-and-job-status)

### Show extended statistics

All jobs show the number of processed objects (column `OBJECTS`) and the total size of the data (column `BYTES`).
Both values are cumulative over the job's entire lifetime.

Certain kinds of supported jobs provide extended statistics, including:

#### Show EC Encoding Statistics

The output contains a few extra columns:

- `ERRORS` - the total number of objects EC failed to encode
- `QUEUE` - the average length of the working queue: the average number of objects waiting in the queue when a new EC encode request is received. Values close to `0` mean that every object was processed immediately after its request was received
- `AVG TIME` - the average total processing time for an object: from the moment the object is put into the working queue to the moment the last encoded slice is sent to another target
- `ENC TIME` - the average amount of time spent encoding an object

The extended statistics can hint at the likely bottleneck:

- high values in `QUEUE` - EC is congested and cannot keep up with incoming requests
- low values in `QUEUE` and `ENC TIME` but high values in `AVG TIME` - the network may be slow, with a lot of time spent sending the encoded slices
- low values in `QUEUE`, and `ENC TIME` close to `AVG TIME` - the local hardware (drives or CPUs) may be overloaded
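
The rules of thumb above can be sketched as a small shell helper. `ec_hint` is a hypothetical name, and the cutoffs (a queue length of 1, and `ENC TIME`/`AVG TIME` ratios of 0.8 and 0.2) are arbitrary assumptions chosen for illustration, not values defined by AIS. Both times must be given in the same unit.

```bash
#!/usr/bin/env bash
# ec_hint: hypothetical helper that maps the rules of thumb above to a message.
# usage: ec_hint QUEUE ENC_TIME AVG_TIME   (both times in the same unit)
ec_hint() {
  awk -v q="$1" -v enc="$2" -v avg="$3" 'BEGIN {
    # Assumed cutoffs, for illustration only.
    queue_high  = 1      # average queue length treated as "high"
    close_ratio = 0.8    # ENC/AVG at or above this counts as "close"
    low_ratio   = 0.2    # ENC/AVG at or below this counts as "low"

    ratio = (avg > 0) ? enc / avg : 0
    if (q >= queue_high) {
      print "EC is congested"
    } else if (avg > 0 && ratio >= close_ratio) {
      print "local drives or CPUs may be overloaded"
    } else if (avg > 0 && ratio <= low_ratio) {
      print "network may be slow"
    } else {
      print "no obvious bottleneck"
    }
  }'
}

ec_hint 12 5 40   # prints: EC is congested
ec_hint 0 4 50    # prints: network may be slow
ec_hint 0 45 50   # prints: local drives or CPUs may be overloaded
```

The ratio test ignores absolute values, so very short jobs can trigger the "overloaded" hint even when nothing is wrong; treat the result as a starting point for investigation.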

#### Show EC Restoring Statistics

Show information about EC restore requests.

The output contains a few extra columns:

- `ERRORS` - the total number of objects EC failed to restore
- `QUEUE` - the average length of the working queue: the average number of objects waiting in the queue when a new EC request is received. Values close to `0` mean that every object was processed immediately after its request was received
- `AVG TIME` - the average total processing time for an object: from the moment the object is put into the working queue to the moment the last encoded slice is sent to another target

### Options

| Flag | Type | Description | Default |
| --- | --- | --- | --- |
| `--json` | `bool` | Output details in JSON format | `false` |
| `--all` | `bool` | If set, additionally displays old, finished xactions | `false` |
| `--active` | `bool` | If set, displays only running xactions | `false` |
| `--verbose` `-v` | `bool` | If set, displays all xaction statistics including extended ones. If the number of xactions to display is greater than one, the flag is ignored. | `false` |

Certain extended actions have additional CLI commands. In particular, rebalance stats can also be displayed using the following command:

`ais show rebalance`

Display details about the most recent rebalance xaction.

| Flag | Type | Description | Default |
| --- | --- | --- | --- |
| `--refresh` | `duration` | Refresh interval - time duration between reports. The usual unit suffixes are supported and include `m` (for minutes), `s` (seconds), `ms` (milliseconds). Ctrl-C to stop monitoring. | ` ` |
| `--all` | `bool` | If set, show all rebalance xactions | `false` |

The output of this command differs from the generic xaction output.

### Examples

Default compact tabular view:

```console
$ ais show job --all
NODE             ID              KIND    BUCKET                          OBJECTS         BYTES           START           END             STATE
zXZXt8084        FXjl0NWGOU      ec-put  TESTAISBUCKET-ec-mpaths         5               4.56MiB         12-02 13:04:50  12-02 13:04:50  Aborted
```
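The compact view can be filtered with standard text tools. The sketch below assumes the whitespace-separated layout shown in the sample, with `ID` and `KIND` as the second and third columns and `STATE` as the last one; `jobs_in_state` is a hypothetical helper name, and the layout may differ between CLI versions.

```bash
#!/usr/bin/env bash
# jobs_in_state: hypothetical helper; reads the compact table on stdin and
# prints "ID KIND" for every row whose last column (STATE) equals $1.
# Assumes whitespace-separated columns laid out as in the sample above.
jobs_in_state() {
  awk -v state="$1" 'NR > 1 && $NF == state { print $2, $3 }'
}

# Sample copied from the output above.
jobs_in_state Aborted <<'EOF'
NODE             ID              KIND    BUCKET                          OBJECTS         BYTES           START           END             STATE
zXZXt8084        FXjl0NWGOU      ec-put  TESTAISBUCKET-ec-mpaths         5               4.56MiB         12-02 13:04:50  12-02 13:04:50  Aborted
EOF
# prints: FXjl0NWGOU ec-put
```

Against a live cluster, the same helper would be fed from a pipe: `ais show job --all | jobs_in_state Aborted`.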

Verbose tabular view:

```console
$ ais show job FXjl0NWGOU --verbose
PROPERTY                 VALUE
.aborted                 true
.bck                     ais://TESTAISBUCKET-ec-mpaths
.end                     12-02 13:04:50
.id                      FXjl0NWGOU
.kind                    ec-put
.start                   12-02 13:04:50
ec.delete.err.n          0
ec.delete.n              0
ec.delete.time           0s
ec.encode.err.n          0
ec.encode.n              5
ec.encode.size           4.56MiB
ec.encode.time           16.964552ms
ec.obj.process.time      17.142239ms
ec.queue.len.n           0
in.obj.n                 0
in.obj.size              0
is_idle                  true
loc.obj.n                5
loc.obj.size             4.56MiB
out.obj.n                0
out.obj.size             0
```
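A single property can be pulled out of the verbose view in a script. This is a minimal sketch that assumes the two-column `PROPERTY VALUE` layout shown above; `job_prop` is a hypothetical helper, not an `ais` command.

```bash
#!/usr/bin/env bash
# job_prop: hypothetical helper; reads the verbose view on stdin and prints
# the value of the property named $1. Values may contain spaces (e.g. `.end`).
job_prop() {
  awk -v name="$1" '$1 == name {
    sub(/^[^ \t]+[ \t]+/, "")   # drop the property name and the padding
    print
    exit
  }'
}

# Abbreviated sample copied from the output above.
job_prop ec.encode.n <<'EOF'
PROPERTY                 VALUE
.end                     12-02 13:04:50
ec.encode.n              5
ec.encode.size           4.56MiB
EOF
# prints: 5
```

Against a live cluster, the input would come from `ais show job JOB_ID --verbose` instead of the here-document.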

## Wait for job

`ais wait [NAME] [JOB_ID] [NODE_ID] [BUCKET]`

Wait for the specified job to finish.

> `job wait download` and `job wait dsort` have slightly different options. See their documentation for details:
* [`job wait download`](download.md#wait-for-download-job)
* [`job wait dsort`](dsort.md#wait-for-dsort-job)

### Options

| Flag | Type | Description | Default |
| --- | --- | --- | --- |
| `--refresh` | `duration` | Refresh interval - time duration between reports. The usual unit suffixes are supported and include `m` (for minutes), `s` (seconds), `ms` (milliseconds) | ` ` |
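
To wait for several jobs in a row, the command can be wrapped in a loop. The sketch below makes two assumptions that this document does not confirm: that `ais wait` returns a non-zero exit status on failure, and that overriding the binary through an `AIS_BIN` variable is acceptable (the variable and the `wait_all` helper are hypothetical, introduced here to allow a dry run).

```bash
#!/usr/bin/env bash
# wait_all: hypothetical helper that waits for several jobs, one after another.
# AIS_BIN is an assumed override for the CLI binary (handy for dry runs).
#
# usage: wait_all REFRESH JOB_ID...
wait_all() {
  local refresh="$1"
  shift
  local id
  for id in "$@"; do
    # Stop at the first failure; relies on a non-zero exit status from the CLI.
    "${AIS_BIN:-ais}" wait "$id" --refresh "$refresh" || return 1
  done
}

# Dry run: substitute `echo` for the CLI to see the commands that would run.
AIS_BIN=echo wait_all 10s FXjl0NWGOU Z8WkHxwIrr
# prints:
#   wait FXjl0NWGOU --refresh 10s
#   wait Z8WkHxwIrr --refresh 10s
```

Without `AIS_BIN` set, the helper calls `ais` from `PATH`.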

## Distributed Sort

`ais job start dsort` or `ais start dsort`

Run [dSort](/docs/dsort.md).
[Further reference for this command can be found here.](dsort.md)

## Downloader

`ais job start download` or `ais start download`

Run the AIS [Downloader](/docs/README.md).
[Further reference for this command can be found here.](download.md)