---
layout: post
title: ETL
permalink: /docs/cli/etl
redirect_from:
 - /cli/etl.md/
 - /docs/cli/etl.md/
---

# CLI Reference for ETLs

This section documents ETL management operations with `ais etl`. But first, note:

> As with [global rebalance](/docs/rebalance.md), [dSort](/docs/dsort.md), and [download](/docs/download.md), all ETL management commands can also be executed via `ais job` and `ais show` - the commands that, by definition, support all AIS *xactions*, including AIS-ETL.

For background on AIS-ETL, getting-started steps, working examples, and tutorials, please refer to:

* [ETL documentation](/docs/etl.md)

## Table of Contents

- [Init ETL with spec](#init-etl-with-spec)
- [Init ETL with code](#init-etl-with-code)
- [List ETLs](#list-etls)
- [View ETL Logs](#view-etl-logs)
- [Stop ETL](#stop-etl)
- [Start ETL](#start-etl)
- [Transform object on-the-fly with given ETL](#transform-object-on-the-fly-with-given-etl)
- [Transform a bucket offline with the given ETL](#transform-a-bucket-offline-with-the-given-etl)

## Init ETL with spec

`ais etl init spec --from-file=SPEC_FILE --name=ETL_NAME [--comm-type=COMMUNICATION_TYPE] [--wait-timeout=TIMEOUT] [--arg-type=ARGUMENT_TYPE]` or `ais start etl init`

Initializes an ETL from a Pod YAML specification file. The `--name` parameter assigns a user-defined unique name to the ETL (see [ETL name specifications](/docs/etl.md#etl-name-specifications) for the rules on valid ETL names).

### Example

Initialize an ETL that computes the MD5 of the object.

```console
$ cat spec.yaml
apiVersion: v1
kind: Pod
metadata:
  name: transformer-md5
spec:
  containers:
    - name: server
      image: aistore/transformer_md5:latest
      ports:
        - name: default
          containerPort: 80
      command: ['/code/server.py', '--listen', '0.0.0.0', '--port', '80']
$ ais etl init spec --from-file=spec.yaml --name=transformer-md5 --comm-type=hpull:// --wait-timeout=1m
transformer-md5
```

## Init ETL with code

`ais etl init code --name=ETL_NAME --from-file=CODE_FILE --runtime=RUNTIME [--chunk-size=NUM_OF_BYTES] [--transform=TRANSFORM_FUNC] [--before=BEFORE_FUNC] [--after=AFTER_FUNC] [--deps-file=DEPS_FILE] [--comm-type=COMMUNICATION_TYPE] [--wait-timeout=TIMEOUT] [--arg-type=ARGUMENT_TYPE]`

Initializes an ETL from the provided `CODE_FILE`, which must contain a transformation function named `transform(input_bytes)` or `transform(input_bytes, context)`. The file may also define two optional functions:

- `before(context)`: executed prior to the transform function; it is supposed to initialize all the variables needed by `transform(input_bytes, context)`.
- `after(context)`: executed after the transform function; it consolidates the results and returns the transformed `output_bytes` to the user.

The `--name` parameter assigns a user-defined unique name to the ETL (see [ETL name specifications](/docs/etl.md#etl-name-specifications) for the rules on valid ETL names).

Depending on the communication type used, there are multiple ways to initialize the `transform(input_bytes, context)`, `before(context)`, and `after(context)` functions. Check [ETL Init Code Docs](/docs/etl.md#init-code-request) for more info.

All available runtimes are listed [here](/docs/etl.md#runtimes).

Note:
- Default value of `--transform` is "transform".

### Example

Initialize an ETL with code that computes the MD5 of the object.

```console
$ cat code.py
import hashlib

def transform(input_bytes):
    md5 = hashlib.md5()
    md5.update(input_bytes)
    return md5.hexdigest().encode()

$ ais etl init code --from-file=code.py --runtime=python3.11v2 --name=transformer-md5 --comm-type hpull

transformer-md5
```

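Before initializing the ETL, it can help to check the transform function locally. The following is a minimal sketch that makes no assumptions about the AIS runtime: it calls the same `transform(input_bytes)` on sample bytes (the `sample` value is made up for illustration) and compares the result with an MD5 digest computed directly.

```python
import hashlib


def transform(input_bytes):
    # Same function as in code.py above.
    md5 = hashlib.md5()
    md5.update(input_bytes)
    return md5.hexdigest().encode()


# Hypothetical sample payload, used only for this local check.
sample = b"hello"
result = transform(sample)

# The function returns the hex digest as bytes, not as str.
assert isinstance(result, bytes)
assert result == hashlib.md5(sample).hexdigest().encode()
print(result.decode())
```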
The same ETL with `before(context)` and `after(context)` functions, using streaming (`--chunk-size` > 0):

```console
$ cat code.py
import hashlib
def before(context):
    context["before"] = hashlib.md5()
    return context

def transform(input_bytes, context):
    context["before"].update(input_bytes)

def after(context):
    return context["before"].hexdigest().encode()

$ ais etl init code --name=etl-md5 --from-file=code.py --runtime=python3.11v2 --chunk-size=32768 --before=before --after=after --comm-type hpull
```

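To see how the three functions fit together, the sketch below drives them locally over fixed-size chunks. The driver `run_streaming` is hypothetical and written only for illustration. It assumes that `before` is called once, `transform` once per chunk, and `after` once at the end, which is how the description above reads but is not taken from the AIS runtime itself.

```python
import hashlib


def before(context):
    context["before"] = hashlib.md5()
    return context


def transform(input_bytes, context):
    context["before"].update(input_bytes)


def after(context):
    return context["before"].hexdigest().encode()


def run_streaming(data, chunk_size):
    """Hypothetical local driver: NOT the AIS runtime, just an illustration."""
    context = before({})
    # Feed the payload to transform() one chunk at a time.
    for offset in range(0, len(data), chunk_size):
        transform(data[offset:offset + chunk_size], context)
    return after(context)


data = b"x" * 100_000
digest = run_streaming(data, chunk_size=32768)

# Hashing chunk by chunk must match hashing the whole payload at once.
assert digest == hashlib.md5(data).hexdigest().encode()
```

Chunked hashing yields the same digest as hashing the whole object at once, which is why the MD5 state is kept in `context` rather than in a variable local to `transform`.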
## List ETLs

`ais etl show` or, same, `ais job show etl`

Lists all available ETLs.

## View ETL Logs

`ais etl view-logs ETL_NAME [TARGET_ID]`

Outputs the logs produced by the given ETL.
Optionally, pass `TARGET_ID` to retrieve the logs from one particular target.

## Stop ETL

`ais etl stop ETL_NAME` or, same, `ais stop etl`

Stops the ETL with the specified name.

## Start ETL

`ais etl start ETL_NAME` or, same, `ais start etl`

Starts the ETL with the specified name.

## Transform object on-the-fly with given ETL

`ais etl object ETL_NAME BUCKET/OBJECT_NAME OUTPUT`

Gets the object, transformed on the fly by the ETL named `ETL_NAME`. As the examples below show, `OUTPUT` is a local file name or `-` for STDOUT.

### Examples

#### Transform object to STDOUT

Run the `transformer-md5` ETL (computes MD5 of the object) on the `shards/shard-0.tar` object and print the output to STDOUT.

```console
$ ais etl object transformer-md5 ais://shards/shard-0.tar -
393c6706efb128fbc442d3f7d084a426
```

#### Transform object to output file

Run the `transformer-md5` ETL (computes MD5 of the object) on the `shards/shard-0.tar` object and save the output to the `output.txt` file.

```console
$ ais etl object transformer-md5 ais://shards/shard-0.tar output.txt
$ cat output.txt
393c6706efb128fbc442d3f7d084a426
```

## Transform a bucket offline with the given ETL

`ais etl bucket ETL_NAME SRC_BUCKET DST_BUCKET`

Transform all or selected objects and put them into another bucket.

| Flag | Type | Description |
| --- | --- | --- |
| `--list` | `string` | Comma-separated list of object names, e.g., 'obj1,obj2' |
| `--template` | `string` | Template for matching object names, e.g., 'obj-{000..100}.tar' |
| `--ext` | `string` | Mapping from old to new extensions of transformed objects, e.g., `{jpg:txt}` or `"{ in1 : out1, in2 : out2 }"` |
| `--prefix` | `string` | Prefix added to every new object name |
| `--wait` | `bool` | Wait until operation is finished |
| `--requests-timeout` | `duration` | Timeout for a single object transformation |
| `--dry-run` | `bool` | Don't actually transform the bucket, only display what would happen |

Flags `--list` and `--template` are mutually exclusive. If neither of them is set, the command transforms the whole bucket.

### Examples

#### Transform bucket with ETL

Transform every object from `src_bucket` with ETL and put the new objects into `dst_bucket`.

```console
$ ais etl bucket transformer-md5 ais://src_bucket ais://dst_bucket
MMi9l8Z11
$ ais wait xaction MMi9l8Z11
```

#### Transform bucket with ETL and wait for completion

The same as above, but wait for the ETL bucket job to finish.

```console
$ ais etl bucket transformer-md5 ais://src_bucket ais://dst_bucket --wait
```

#### Transform selected objects in bucket with ETL

Transform objects `shard-10.tar`, `shard-11.tar`, and `shard-12.tar` from `src_bucket` with ETL and put the new objects into `dst_bucket`.

```console
$ ais etl bucket transformer-md5 ais://src_bucket ais://dst_bucket --template "shard-{10..12}.tar"
```

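For intuition about which names a `--template` value such as `shard-{10..12}.tar` selects, here is a small sketch. The helper `expand_template` is hypothetical and handles only a single `{start..end}` range with optional zero padding; the template syntax supported by AIS itself may be richer.

```python
import re


def expand_template(template):
    """Expand one {start..end} range; hypothetical helper, not the AIS implementation."""
    match = re.search(r"\{(\d+)\.\.(\d+)\}", template)
    if match is None:
        # No range in the template: it names exactly one object.
        return [template]
    start, end = match.group(1), match.group(2)
    width = len(start)  # keep zero padding, e.g. {000..100}
    prefix, suffix = template[:match.start()], template[match.end():]
    return [f"{prefix}{i:0{width}d}{suffix}" for i in range(int(start), int(end) + 1)]


print(expand_template("shard-{10..12}.tar"))
```

Under these assumptions, the template from the example above expands to exactly the three object names listed in its description.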
#### Transform bucket with ETL and additional parameters

Transform the whole bucket and wait for completion, but add the `etl-` prefix to the new object names and map extensions: objects with the `.in1` extension get the `.out1` extension, and objects with the `.in2` extension get the `.out2` extension.

```console
$ ais ls ais://src_bucket --props=name
NAME
obj1.in1
obj2.in2
(...)
$ ais etl bucket transformer-md5 ais://src_bucket ais://dst_bucket --ext="{in1:out1, in2:out2}" --prefix="etl-" --wait
$ ais ls ais://dst_bucket --props=name
NAME
etl-obj1.out1
etl-obj2.out2
(...)
```

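The renaming in the listing above can be sketched as follows. The helper `dst_name` is hypothetical: it adds the prefix and swaps the extension according to the mapping, which reproduces the names in the listing, but it is not the AIS implementation. In particular, leaving names with unmapped extensions unchanged (apart from the prefix) is an assumption.

```python
import os


def dst_name(name, ext_map, prefix=""):
    """Hypothetical helper mirroring the --ext and --prefix example; not AIS code."""
    base, ext = os.path.splitext(name)      # "obj1.in1" -> ("obj1", ".in1")
    new_ext = ext_map.get(ext.lstrip("."))  # look up "in1" in the mapping
    if new_ext is not None:
        name = f"{base}.{new_ext}"
    return prefix + name


ext_map = {"in1": "out1", "in2": "out2"}
for src in ("obj1.in1", "obj2.in2"):
    print(src, "->", dst_name(src, ext_map, prefix="etl-"))
```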
#### Transform bucket with ETL but with dry-run

A dry run does not perform any actions; it only shows what would be transformed if the bucket were actually transformed.
This is useful for preparing the actual run.

```console
$ ais ls ais://src_bucket --props=name,size
NAME        SIZE
obj1.in1    10MiB
obj2.in2    10MiB
(...)
$ ais etl bucket transformer-md5 ais://src_bucket ais://dst_bucket --dry-run --wait
[DRY RUN] No modifications on the cluster
2 objects (20MiB) would have been put into bucket ais://dst_bucket
```