---
layout: post
title: ETL
permalink: /docs/etl
redirect_from:
 - /etl.md/
 - /docs/etl.md/
---

**ETL** stands for **E**xtract, **T**ransform, **L**oad. More specifically:

* **E**xtract - data from different original formats and/or multiple sources;
* **T**ransform - to the unified common format optimized for subsequent computation (e.g., training a deep learning model);
* **L**oad - transformed data into a new destination - e.g., a storage system that supports high-performance computing over large-scale datasets.

The latter can be AIStore (AIS). The system is designed from the ground up to support all three stages of ETL pre- (or post-) processing. With AIS, you can execute custom transformations in two ways:

1. **Inline**: datasets are transformed on the fly - the data is read and streamed in a transformed format directly to computing clients.
2. **Offline**: the transformed output is stored as a new dataset, which AIS makes accessible for any future computations.

> Implementation-wise, *offline* transformations of any kind, on the one hand, and copying datasets, on the other, are closely related - the latter being, effectively, a *no-op* offline transformation.

Most notably, AIS always runs transformations locally - *close to data*. Running *close to data* has always been one of the cornerstone design principles whereby, in a deployed cluster, each AIStore target proportionally contributes to the resulting cumulative bandwidth - the bandwidth that, in turn, scales linearly with each added target.

This was the principle behind *distributed shuffle* (code-named [dSort](/docs/dsort.md)).
And this is exactly how we have more recently implemented **AIS-ETL** - the ETL service provided by AIStore. Find more information on the architecture and implementation of `ais-etl` [here](/ext/etl/README.md).

Technically, the service supports running user-provided ETL containers **and** custom Python scripts within the storage cluster.

**Note:** AIS-ETL (service) requires [Kubernetes](https://kubernetes.io).

## Table of Contents

- [Getting Started](#getting-started-with-etl-in-aistore)
- [Inline ETL example](#inline-etl-example)
- [Offline ETL example](#offline-etl-example)
- [Extract, Transform, and Load using User-Defined Functions](#extract-transform-and-load-using-user-defined-functions)
- [Extract, Transform, and Load using Custom Containers](#extract-transform-and-load-using-custom-containers)
- [*init code* request](#init-code-request)
  - [`hpush://` communication](#hpush-communication)
  - [`io://` communication](#io-communication)
  - [Runtimes](#runtimes)
  - [Argument Types](#argument-types)
- [*init spec* request](#init-spec-request)
  - [Requirements](#requirements)
  - [Specification YAML](#specification-yaml)
  - [Required or additional fields](#required-or-additional-fields)
  - [Forbidden fields](#forbidden-fields)
  - [Communication Mechanisms](#communication-mechanisms)
  - [Argument Types](#argument-types-1)
- [Transforming objects](#transforming-objects)
- [API Reference](#api-reference)
- [ETL name specifications](#etl-name-specifications)
- [References](#references)

## Getting Started with ETL in AIStore

To begin using ETLs in AIStore, you'll need to run AIStore within a Kubernetes cluster. There are several ways to achieve this, each suited for different purposes:

1. **AIStore Development with Native Kubernetes (minikube)**:
   - Folder: [deploy/dev/k8s](/deploy/dev/k8s)
   - Intended for: AIStore development using native Kubernetes provided by [minikube](https://minikube.sigs.k8s.io/docs)
   - How to use: Run minikube and deploy the AIS cluster on it using the carefully documented steps available [here](/deploy/dev/k8s/README.md).
   - Documentation: [README](/deploy/dev/k8s/README.md)

2. **Production Deployment with Kubernetes**:
   - Folder: [deploy/prod/k8s](/deploy/prod/k8s)
   - Intended for: Production use
   - How to use: Utilize the Dockerfiles in this folder to build AIS images for production deployment. For this purpose, there is a separate dedicated [repository](https://github.com/NVIDIA/ais-k8s) that contains corresponding tools, scripts, and documentation.
   - Documentation: [AIS/K8s Operator and Deployment Playbooks](https://github.com/NVIDIA/ais-k8s)

To verify that your deployment is correctly set up, execute the following [CLI](/docs/cli.md) command:

```console
$ ais etl show
```

If you receive an empty response without any errors, your AIStore cluster is now ready to run ETL tasks.

## Inline ETL example

To follow this and subsequent examples, make sure you have the [AIS CLI](/docs/cli.md) installed on your system.

```console
# Prerequisites:
# - AIStore must be running on Kubernetes.
# - The AIS CLI (command-line interface) must be installed.

# Step 1: Create a new bucket
$ ais bucket create ais://src

# Step 2: Show existing ETLs
$ ais etl show

# Step 3: Create a temporary file and add it to the bucket
$ echo "hello world" > text.txt
$ ais put text.txt ais://src

# Step 4: Download a spec file to initialize the ETL process
# In this example, we are using the MD5 transformer as a sample ETL.
$ curl -s https://raw.githubusercontent.com/NVIDIA/ais-etl/master/transformers/md5/pod.yaml -o md5_spec.yaml

# Step 5: Initialize the ETL process
$ export COMMUNICATION_TYPE="hpull://"
$ ais etl init spec --from-file md5_spec.yaml --name etl-md5 --comm-type $COMMUNICATION_TYPE

# Step 6: Check if the ETL is running
$ ais etl show

# Step 7: Run the transformation
# Replace 'ais://src/text.txt' with the appropriate source object location and '-' (stdout)
# with the desired destination (e.g., a local file path or another AIS bucket).
$ ais etl object etl-md5 ais://src/text.txt -
```
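
The same flow can be scripted with the [Python SDK](https://github.com/NVIDIA/aistore/blob/main/python/aistore/sdk/README.md#etls). The sketch below is a minimal approximation of Steps 4-7, assuming a locally reachable cluster endpoint and the SDK's `etl().init_spec()` and `object().get()` methods:

```python
# A minimal sketch of the inline example above, using the Python SDK.
from aistore import Client

client = Client("http://localhost:8080")  # assumption: your cluster endpoint

# Initialize the ETL from the downloaded MD5 pod spec (Steps 4-5).
with open("md5_spec.yaml") as f:
    template = f.read()
client.etl("etl-md5").init_spec(template=template, communication_type="hpull")

# Inline transform: read the object through the ETL (Step 7).
reader = client.bucket("src").object("text.txt").get(etl_name="etl-md5")
print(reader.read_all())
```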

## Offline ETL example
```console
$ # Download the ImageNet validation dataset
$ wget -O imagenet.tar https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar --no-check-certificate

$ # Extract it
$ mkdir imagenet && tar -C imagenet -xvf imagenet.tar >/dev/null

$ # Check that the dataset was extracted correctly
$ cd imagenet
$ ls | head -5

$ # Add the entire dataset to a bucket
$ ais bucket create ais://imagenet
$ ais put . ais://imagenet -r

$ # Check that the dataset is available in the bucket
$ ais ls ais://imagenet | head -5

$ # Create a custom transformation using torchvision.
$ # The `code.py` file contains the Python code for the transformation function,
$ # and `deps.txt` lists the dependencies required to run `code.py`.
$ cat code.py
import io
from PIL import Image
from torchvision import transforms as T

preprocess = T.Compose(
    [
        T.Resize(256),
        T.CenterCrop(224),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        T.ToPILImage(),
    ]
)

# Define the transform function
def transform(data: bytes) -> bytes:
    image = Image.open(io.BytesIO(data))
    processed = preprocess(image)
    buf = io.BytesIO()
    processed.save(buf, format='JPEG')
    return buf.getvalue()

$ cat deps.txt
torch==2.0.1
torchvision==0.15.2

$ ais etl init code --name etl-torchvision --from-file code.py --deps-file deps.txt --runtime python3.11v2 --comm-type hpull

$ # Perform an offline transformation
$ ais etl bucket etl-torchvision ais://imagenet ais://imagenet-transformed --ext="{JPEG:JPEG}"

$ # Check that the transformed dataset is available in the destination bucket
$ ais ls ais://imagenet-transformed | head -5

$ # Verify one of the images by downloading its content
$ ais object get ais://imagenet-transformed/ILSVRC2012_val_00050000.JPEG test.JPEG
```
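
For reference, a rough Python SDK equivalent of the init-code and bucket-to-bucket steps above might look as follows (a sketch assuming the SDK's `etl().init_code()` and `bucket.transform()` methods, and the `transform` function from `code.py` defined in the same script):

```python
# A sketch of the offline example above, using the Python SDK.
from aistore import Client

client = Client("http://localhost:8080")  # assumption: your cluster endpoint

# Register the torchvision transform as an ETL (mirrors `ais etl init code`).
client.etl("etl-torchvision").init_code(
    transform=transform,                     # the function defined in code.py
    dependencies=["torch==2.0.1", "torchvision==0.15.2"],
    runtime="python3.11v2",
    communication_type="hpull",
)

# Offline transform: write results into a new bucket (mirrors `ais etl bucket`).
dst = client.bucket("imagenet-transformed")
client.bucket("imagenet").transform(
    etl_name="etl-torchvision", to_bck=dst, ext={"JPEG": "JPEG"}
)
```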

## Extract, Transform, and Load using User-Defined Functions

1. To perform Extract, Transform, and Load (ETL) using user-defined functions, send the transform function in the [**init code** request](#init-code-request) to an AIStore endpoint.

2. Upon receiving the **init code** request, the AIStore proxy broadcasts the request to all AIStore targets in the cluster.

3. When an AIStore target receives the **init code**, it initiates the execution of the container **locally** on the target's machine (also known as a [Kubernetes Node](https://kubernetes.io/docs/concepts/architecture/nodes/)).

## Extract, Transform, and Load using Custom Containers

1. To perform Extract, Transform, and Load (ETL) using custom containers, execute the [**init spec** API](#init-spec-request) against an AIStore endpoint.

   > The request contains a YAML spec and ultimately triggers the creation of [Kubernetes Pods](https://kubernetes.io/docs/concepts/workloads/pods/pod/) that run the user's ETL logic inside.

2. Upon receiving the **init spec** request, the AIStore proxy broadcasts the request to all AIStore targets in the cluster.

3. When a target receives the **init spec**, it starts the user container **locally** on the target's machine (also known as a [Kubernetes Node](https://kubernetes.io/docs/concepts/architecture/nodes/)).

## *init code* request

You can write a custom `transform` function that takes input object bytes as a parameter and returns output bytes (the transformed object's content).
You can then use the *init code* request to execute this `transform` on the entire distributed dataset.

In effect, the *init code* capability allows a user to skip writing a Dockerfile and building a custom ETL container altogether.

> If you are familiar with [FaaS](https://en.wikipedia.org/wiki/Function_as_a_service), then you will probably find this type of ETL initialization the most intuitive.

For a detailed step-by-step tutorial on *init code* requests, please see the [Python SDK ETL Tutorial](https://github.com/NVIDIA/aistore/blob/main/python/examples/sdk/sdk-etl-tutorial.ipynb) and [Python SDK ETL Examples](https://github.com/NVIDIA/aistore/tree/main/python/examples/ais-etl).

The `init_code` request currently supports two communication types:

### `hpush://` communication

With the `hpush` communication type, the user defines a function that takes bytes as a parameter, processes them, and returns bytes, e.g., [ETL to calculate MD5 of an object](https://github.com/NVIDIA/aistore/blob/main/python/examples/ais-etl/etl_md5_hpush.py):

```python
def transform(input_bytes: bytes) -> bytes
```
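
For instance, a complete function of this shape, computing an object's MD5 (in the spirit of the linked example), could be as simple as:

```python
import hashlib

def transform(input_bytes: bytes) -> bytes:
    # Return the object's MD5 hex digest as the transformed content.
    return hashlib.md5(input_bytes).hexdigest().encode()
```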

You can also stream objects through `transform()` by setting the `CHUNK_SIZE` parameter (`CHUNK_SIZE` > 0),
e.g., [ETL to calculate MD5 of an object with streaming](https://github.com/NVIDIA/aistore/blob/main/python/examples/ais-etl/etl_md5_hpush_streaming.py) and [ETL to transform images using torchvision](https://github.com/NVIDIA/aistore/blob/main/python/examples/ais-etl/etl_torchvision_hpush.py).

> **Note:**
> - If the function uses external dependencies, the user can provide an optional dependencies file, or list them in the `etl().init()` call of the Python SDK. These requirements will be installed on the machine executing the `transform` function and will be available to it.

### `io://` communication

With the `io://` communication type, the user defines a `transform()` function that reads bytes from [`sys.stdin`](https://docs.python.org/3/library/sys.html#sys.stdin), transforms them, and writes the resulting bytes to [`sys.stdout`](https://docs.python.org/3/library/sys.html#sys.stdout):

```python
import sys

def transform() -> None:
    input_bytes = sys.stdin.buffer.read()
    # Replace this identity step with the actual transformation:
    output_bytes = input_bytes
    sys.stdout.buffer.write(output_bytes)
```

For example: [ETL to calculate MD5 of an object](https://github.com/NVIDIA/aistore/blob/main/python/examples/ais-etl/etl_md5_io.py) and [ETL to transform images using torchvision](https://github.com/NVIDIA/aistore/blob/main/python/examples/ais-etl/etl_torchvision_io.py).

### Runtimes

AIS-ETL provides several *runtimes* out of the box.
Each *runtime* determines the programming language of your custom `transform` function and the set of pre-installed packages and tools that your `transform` can utilize.

Currently, the following runtimes are supported:

| Name | Description |
| --- | --- |
| `python3.8v2` | `python:3.8` is used to run the code. |
| `python3.10v2` | `python:3.10` is used to run the code. |
| `python3.11v2` | `python:3.11` is used to run the code. |

More *runtimes* will be added in the future, with plans to support the most popular ETL toolchains.
Still, since the number of supported *runtimes* will always remain somewhat limited, there's always a second way: build your own ETL container and deploy it via the [*init spec* request](#init-spec-request).

### Argument Types

The AIStore `etl init code` command provides two `arg_type` parameter options that specify how an object is passed between AIStore and the ETL container:

| Parameter Value | Description |
|-----------------|-------------|
| "" (empty string) | The default option: the object's content is passed as bytes to the transformation function. When initializing ETLs, the `arg_type` parameter can be omitted entirely. |
| "url" | The URL of the object to be transformed is passed to the user-defined transform function. Note that this option is limited to `--comm-type=hpull`. In this scenario, the user is responsible for implementing the logic to fetch the object from the bucket based on the URL received as a parameter. |
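
For illustration, a transform that works with `arg_type="url"` might look like the sketch below (a hypothetical example; the `requests` dependency would need to be listed in the dependencies file):

```python
import hashlib
import requests

def transform(object_url: str) -> bytes:
    # With arg_type="url" (hpull only), the function receives the object's URL
    # and must fetch the content itself before transforming it.
    resp = requests.get(object_url)
    resp.raise_for_status()
    return hashlib.md5(resp.content).hexdigest().encode()
```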

## *init spec* request

The *init spec* request covers all, even the most sophisticated, cases of ETL initialization.
It allows running any Docker image that implements certain requirements for communication with the cluster.
The *init spec* request requires writing a Pod specification that follows the requirements below.

For a detailed step-by-step tutorial on the *init spec* request, please see the [MD5 ETL playbook](/docs/tutorials/etl/compute_md5.md).

#### Requirements

A custom ETL container is expected to satisfy the following requirements:

1. Start a web server that supports at least one of the listed [communication mechanisms](#communication-mechanisms).
2. The server can listen on any port, but the port must be specified in the Pod spec with `containerPort` - the cluster must know how to contact the Pod.
3. AIS target(s) may send requests in parallel to the web server inside the ETL container - any synchronization, therefore, must be done on the server side.

#### Specification YAML

The specification of an ETL should be in the form of a YAML file.
It is required to follow the Kubernetes [Pod template format](https://kubernetes.io/docs/concepts/workloads/pods/#pod-templates)
and contain all the fields necessary to start the Pod.

#### Required or additional fields

| Path | Required | Description | Default |
| --- | --- | --- | --- |
| `metadata.annotations.communication_type` | `false` | [Communication type](#communication-mechanisms) of an ETL. | `hpush://` |
| `metadata.annotations.wait_timeout` | `false` | Timeout on ETL Pods starting on target machines. | infinity |
| `spec.containers` | `true` | Containers running inside a Pod; exactly one is required. | - |
| `spec.containers[0].image` | `true` | Docker image of the ETL container. | - |
| `spec.containers[0].ports` | `true` (except `io://` communication type) | Ports exposed by the container; at least one is expected. | - |
| `spec.containers[0].ports[0].Name` | `true` | Name of the first port; must be `default`. | - |
| `spec.containers[0].ports[0].containerPort` | `true` | Port on which the cluster will contact the container. | - |
| `spec.containers[0].readinessProbe` | `true` (except `io://` communication type) | Readiness probe of the container. | - |
| `spec.containers[0].readinessProbe.timeoutSeconds` | `false` | Timeout for a readiness probe, in seconds. | `5` |
| `spec.containers[0].readinessProbe.periodSeconds` | `false` | Period between readiness probe requests, in seconds. | `10` |
| `spec.containers[0].readinessProbe.httpGet.Path` | `true` | Path for the HTTP readiness probe. | - |
| `spec.containers[0].readinessProbe.httpGet.Port` | `true` | Port for the HTTP readiness probe; must reference the `default` port. | - |

#### Forbidden fields

| Path | Reason |
| --- | --- |
| `spec.affinity.nodeAffinity` | Used by AIStore to colocate ETL containers with targets. |

#### Communication Mechanisms

AIS currently supports 3 (three) distinct target ⇔ container communication mechanisms to facilitate inline (on the fly) or offline transformation.
Users can choose and specify (via the YAML spec) any of the following:

| Name | Value | Description |
|---|---|---|
| **post** | `hpush://` | A target issues a POST request to its ETL container, with the request body containing the object to transform. The target then forwards the ETL container's response to the user. |
| **reverse proxy** | `hrev://` | A target uses a [reverse proxy](https://en.wikipedia.org/wiki/Reverse_proxy) to send a (GET) request to the cluster via the ETL container. The ETL container should make a GET request to the target, transform the received bytes, and return the result to the target. |
| **redirect** | `hpull://` | A target uses an [HTTP redirect](https://developer.mozilla.org/en-US/docs/Web/HTTP/Redirections) to send a (GET) request to the cluster via the ETL container. The ETL container should make a GET request to the target, transform the received bytes, and return the result to the user. |
| **input/output** | `io://` | A target remotely runs the binary or the code, sends the data to its standard input, and expects the transformed bytes on its standard output. |

> The ETL container will have the `AIS_TARGET_URL` environment variable set to the URL of its corresponding target.
> To make a request for a given object, append `<bucket-name>/<object-name>` to `AIS_TARGET_URL`, e.g., `requests.get(env("AIS_TARGET_URL") + "/" + bucket_name + "/" + object_name)`.
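
To make the `hpush://` contract concrete, below is a minimal, illustrative web server written against the Python standard library. It is a sketch, not a production server: it assumes `hpush` semantics as described above, a `containerPort` of 80, and an HTTP readiness probe that GETs any path:

```python
import hashlib
from http.server import BaseHTTPRequestHandler, HTTPServer

class ETLHandler(BaseHTTPRequestHandler):
    def _reply(self, body: bytes) -> None:
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    # hpush: the target POSTs the object; respond with the transformed bytes.
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        data = self.rfile.read(length)
        self._reply(hashlib.md5(data).hexdigest().encode())

    # Answer the HTTP readiness probe (path must match the Pod spec).
    def do_GET(self):
        self._reply(b"Running")

if __name__ == "__main__":
    # The port must match `containerPort` in the Pod spec.
    HTTPServer(("0.0.0.0", 80), ETLHandler).serve_forever()
```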

#### Argument Types

The AIStore `etl init spec` command provides three `arg_type` parameter options that specify how an object is passed between AIStore and the ETL container:

| Parameter Value | Description |
|-----------------|-------------|
| "" (empty string) | The default option: the object's content is passed as bytes to the transformation function. When initializing ETLs, the `arg_type` parameter can be omitted entirely. |
| "url" | Pass the URL of the object to be transformed to the user-defined transform function. Note that this option is limited to `--comm-type=hpull`. In this scenario, the user is responsible for implementing the logic to fetch the object from the bucket based on the URL received as a parameter. |
| "fqn" | Pass a fully-qualified name (FQN) of the locally stored object. The user is responsible for opening, reading, transforming, and closing the corresponding file. |
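
For example, an `arg_type="fqn"` transform might be sketched as follows (illustrative only):

```python
import hashlib

def transform(fqn: str) -> bytes:
    # With arg_type="fqn", the function receives the object's local path
    # and is responsible for opening and reading the file itself.
    with open(fqn, "rb") as f:
        return hashlib.md5(f.read()).hexdigest().encode()
```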

## Transforming objects

AIStore supports both *inline* transformation of selected objects and *offline* transformation of an entire bucket.

There are several ways to run ETL transformations:
- HTTP RESTful APIs, described in the [API Reference section](#api-reference) of this document
- [ETL CLI](/docs/cli/etl.md)
- [Python SDK](https://github.com/NVIDIA/aistore/blob/main/python/aistore/sdk/README.md#etls)
- [AIS Loader](/docs/aisloader.md)

## API Reference

This section describes how to interact with ETLs via the RESTful API.

> `G` - denotes a (`hostname:port`) address of a **gateway** (any gateway in a given AIS cluster)

| Operation | Description | HTTP action | Example |
| --- | --- | --- | --- |
| Init spec ETL | Initializes ETL based on POD `spec` template. Returns `ETL_NAME`. | PUT /v1/etl | `curl -X PUT 'http://G/v1/etl' '{"spec": "...", "id": "..."}'` |
| Init code ETL | Initializes ETL based on the provided source code. Returns `ETL_NAME`. | PUT /v1/etl | `curl -X PUT 'http://G/v1/etl' '{"code": "...", "dependencies": "...", "runtime": "python3.11v2", "id": "..."}'` |
| List ETLs | Lists all running ETLs. | GET /v1/etl | `curl -L -X GET 'http://G/v1/etl'` |
| View ETL init spec/code | Views the spec/code of the ETL with a given `ETL_NAME`. | GET /v1/etl/ETL_NAME | `curl -L -X GET 'http://G/v1/etl/ETL_NAME'` |
| Transform object | Transforms an object based on ETL with `ETL_NAME`. | GET /v1/objects/<bucket>/<objname>?etl_name=ETL_NAME | `curl -L -X GET 'http://G/v1/objects/shards/shard01.tar?etl_name=ETL_NAME' -o transformed_shard01.tar` |
| Transform bucket | Transforms all objects in a bucket and puts them into the destination bucket. | POST {"action": "etl-bck"} /v1/buckets/from-name | `curl -i -X POST -H 'Content-Type: application/json' -d '{"action": "etl-bck", "name": "to-name", "value":{"ext":"destext", "prefix":"prefix", "suffix": "suffix"}}' 'http://G/v1/buckets/from-name'` |
| Dry-run transform bucket | Accumulates in xaction stats how many objects and bytes would be created, without actually creating them. | POST {"action": "etl-bck"} /v1/buckets/from-name | `curl -i -X POST -H 'Content-Type: application/json' -d '{"action": "etl-bck", "name": "to-name", "value":{"ext":"destext", "dry_run": true}}' 'http://G/v1/buckets/from-name'` |
| Stop ETL | Stops ETL with the given `ETL_NAME`. | POST /v1/etl/ETL_NAME/stop | `curl -X POST 'http://G/v1/etl/ETL_NAME/stop'` |
| Delete ETL | Deletes the ETL spec/code with the given `ETL_NAME`. | DELETE /v1/etl/ETL_NAME | `curl -X DELETE 'http://G/v1/etl/ETL_NAME'` |
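
For convenience, here are the "transform object" and "transform bucket" calls above expressed with Python's `requests` library (a sketch; `AIS_GW` stands in for the gateway `G`):

```python
import requests

AIS_GW = "http://localhost:8080"  # assumption: any gateway in the cluster

# Inline: GET an object through the ETL (follow target redirects, as `curl -L` does).
resp = requests.get(
    f"{AIS_GW}/v1/objects/shards/shard01.tar",
    params={"etl_name": "ETL_NAME"},
    allow_redirects=True,
)
resp.raise_for_status()
transformed = resp.content

# Offline: transform all objects in bucket `from-name` into bucket `to-name`.
resp = requests.post(
    f"{AIS_GW}/v1/buckets/from-name",
    json={
        "action": "etl-bck",
        "name": "to-name",
        "value": {"ext": "destext", "prefix": "prefix", "suffix": "suffix"},
    },
)
resp.raise_for_status()
```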

## ETL name specifications

Every initialized ETL has a unique, user-defined `ETL_NAME` associated with it, used for running transforms/computation on data or stopping the ETL.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: compute-md5
(...)
```

When initializing an ETL from spec/code, a valid and unique user-defined `ETL_NAME` should be assigned using the `--name` CLI parameter, as shown below.

```console
$ ais etl init code --name=etl-md5 --from-file=code.py --runtime=python3.11v2 --deps-file=deps.txt --comm-type hpull
or
$ ais etl init spec --name=etl-md5 --from-file=spec.yaml --comm-type hpull
```

A valid `ETL_NAME` must satisfy the following specifications:
1. Starts with a letter ('A'-'Z' or 'a'-'z').
2. Contains only letters, digits, underscores ('_'), and hyphens ('-'); no other special characters.
3. Is longer than 5 and shorter than 21 characters.
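
These rules can be captured in a one-line check; the regular expression below is a sketch that encodes them (a leading letter followed by 5-19 letters, digits, underscores, or hyphens):

```python
import re

# Starts with a letter; then letters, digits, '_' or '-'; total length 6-20.
ETL_NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_-]{5,19}$")

def is_valid_etl_name(name: str) -> bool:
    return ETL_NAME_RE.fullmatch(name) is not None

assert is_valid_etl_name("etl-md5")
assert not is_valid_etl_name("1bad-name")  # must start with a letter
```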

## References

* For technical blogs with in-depth background and working real-life examples, see:
  - [ETL: Introduction](https://aiatscale.org/blog/2021/10/21/ais-etl-1)
  - [AIStore SDK & ETL: Transform an image dataset with AIS SDK and load into PyTorch](https://aiatscale.org/blog/2023/04/03/transform-images-with-python-sdk)
  - [ETL: Using WebDataset to train on a sharded dataset](https://aiatscale.org/blog/2021/10/29/ais-etl-3)
* For step-by-step tutorials, see:
  - [Compute the MD5 of the object](/docs/tutorials/etl/compute_md5.md)
* For a quick CLI introduction and reference, see [ETL CLI](/docs/cli/etl.md)
* For initializing ETLs with the AIStore Python SDK, see:
  - [Python SDK ETL Usage Docs](https://github.com/NVIDIA/aistore/blob/main/python/aistore/sdk/README.md#etls)
  - [Python SDK ETL Examples](https://github.com/NVIDIA/aistore/tree/main/python/examples/ais-etl)
  - [Python SDK ETL Tutorial](https://github.com/NVIDIA/aistore/blob/main/python/examples/sdk/sdk-etl-tutorial.ipynb)