---
layout: post
title: ETL
permalink: /docs/etl
redirect_from:
 - /etl.md/
 - /docs/etl.md/
---

**ETL** stands for **E**xtract, **T**ransform, **L**oad. More specifically:

* **E**xtract - data from different original formats and/or multiple sources;
* **T**ransform - to the unified common format optimized for subsequent computation (e.g., training a deep learning model);
* **L**oad - transformed data into a new destination - e.g., a storage system that supports high-performance computing over large-scale datasets.

The latter can be AIStore (AIS). The system is designed from the ground up to support all three stages of ETL pre- (or post-) processing. With AIS, you can run custom transformations in two ways:

1. **Inline**: datasets are transformed on the fly, with the data read and streamed in a transformed format directly to computing clients.
2. **Offline**: the transformed output is stored as a new dataset, which AIS makes accessible for any future computations.

> Implementation-wise, *offline* transformations of any kind, on the one hand, and copying datasets, on the other, are closely related - the latter being, effectively, a *no-op* offline transformation.

Most notably, AIS always runs transformations locally - *close to data*. Running *close to data* has always been one of the cornerstone design principles: in a deployed cluster, each AIStore target proportionally contributes to the resulting cumulative bandwidth - bandwidth that, in turn, scales linearly with each added target.

This was the principle behind *distributed shuffle* (code-named [dSort](/docs/dsort.md)).
And this is exactly how we have more recently implemented **AIS-ETL** - the ETL service provided by AIStore. Find more information on the architecture and implementation of `ais-etl` [here](/ext/etl/README.md).

Technically, the service supports running user-provided ETL containers **and** custom Python scripts within the storage cluster.

**Note:** AIS-ETL (service) requires [Kubernetes](https://kubernetes.io).

## Table of Contents

- [Getting Started](#getting-started-with-etl-in-aistore)
- [Inline ETL example](#inline-etl-example)
- [Offline ETL example](#offline-etl-example)
- [Extract, Transform, and Load using User-Defined Functions](#extract-transform-and-load-using-user-defined-functions)
- [Extract, Transform, and Load using Custom Containers](#extract-transform-and-load-using-custom-containers)
- [*init code* request](#init-code-request)
  - [`hpush://` communication](#hpush-communication)
  - [`io://` communication](#io-communication)
  - [Runtimes](#runtimes)
  - [Argument Types](#argument-types)
- [*init spec* request](#init-spec-request)
  - [Requirements](#requirements)
  - [Specification YAML](#specification-yaml)
  - [Required or additional fields](#required-or-additional-fields)
  - [Forbidden fields](#forbidden-fields)
  - [Communication Mechanisms](#communication-mechanisms)
  - [Argument Types](#argument-types-1)
- [Transforming objects](#transforming-objects)
- [API Reference](#api-reference)
- [ETL name specifications](#etl-name-specifications)

## Getting Started with ETL in AIStore

To begin using ETLs in AIStore, you'll need to run AIStore within a Kubernetes cluster.
There are several ways to achieve this, each suited for different purposes:

1. **AIStore Development with Native Kubernetes (minikube)**:
   - Folder: [deploy/dev/k8s](/deploy/dev/k8s)
   - Intended for: AIStore development using native Kubernetes provided by [minikube](https://minikube.sigs.k8s.io/docs)
   - How to use: Run minikube and deploy the AIS cluster on it by following the steps documented [here](/deploy/dev/k8s/README.md).
   - Documentation: [README](/deploy/dev/k8s/README.md)

2. **Production Deployment with Kubernetes**:
   - Folder: [deploy/prod/k8s](/deploy/prod/k8s)
   - Intended for: Production use
   - How to use: Use the Dockerfiles in this folder to build AIS images for production deployment. For this purpose, there is a separate dedicated [repository](https://github.com/NVIDIA/ais-k8s) that contains the corresponding tools, scripts, and documentation.
   - Documentation: [AIS/K8s Operator and Deployment Playbooks](https://github.com/NVIDIA/ais-k8s)

To verify that your deployment is correctly set up, execute the following [CLI](/docs/cli.md) command:

```console
$ ais etl show
```

If you receive an empty response without any errors, your AIStore cluster is ready to run ETL tasks.

## Inline ETL example

To follow this and subsequent examples, make sure you have the [AIS CLI](/docs/cli.md) installed on your system.

```console
# Prerequisites:
# AIStore must be running on Kubernetes, and the AIS CLI must be installed.
# Make sure you have a running AIStore cluster before proceeding.

# Step 1: Create a new bucket
$ ais bucket create ais://src

# Step 2: Show existing ETLs
$ ais etl show

# Step 3: Create a temporary file and add it to the bucket
$ echo "hello world" > text.txt
$ ais put text.txt ais://src

# Step 4: Download a spec file to initialize the ETL process
# In this example, we are using the MD5 transformer as a sample ETL.
$ curl -s https://raw.githubusercontent.com/NVIDIA/ais-etl/master/transformers/md5/pod.yaml -o md5_spec.yaml

# Step 5: Initialize the ETL
$ export COMMUNICATION_TYPE="hpull://"
$ ais etl init spec --from-file md5_spec.yaml --name etl-md5 --comm-type $COMMUNICATION_TYPE

# Step 6: Check if the ETL is running
$ ais etl show

# Step 7: Run the transformation
# Replace 'ais://src/text.txt' with the appropriate source object location and '-' with the desired destination (e.g., a local file path or another AIS bucket).
$ ais etl object etl-md5 ais://src/text.txt -
```

## Offline ETL example

```console
$ # Download the ImageNet dataset
$ wget -O imagenet.tar https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar --no-check-certificate

$ # Extract ImageNet
$ mkdir imagenet && tar -C imagenet -xvf imagenet.tar >/dev/null

$ # Check that the dataset was extracted correctly
$ cd imagenet
$ ls | head -5

$ # Add the entire dataset to a bucket
$ ais bucket create ais://imagenet
$ ais put . ais://imagenet -r

$ # Check that the dataset is available in the bucket
$ ais ls ais://imagenet | head -5

$ # Create a custom transformation using torchvision
$ # The `code.py` file contains the Python code for the transform function, and `deps.txt` lists the dependencies required to run `code.py`
$ cat code.py
import io
from PIL import Image
from torchvision import transforms as T

preprocess = T.Compose(
    [
        T.Resize(256),
        T.CenterCrop(224),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        T.ToPILImage(),
    ]
)

# Define the transform function
def transform(data: bytes) -> bytes:
    image = Image.open(io.BytesIO(data))
    processed = preprocess(image)
    buf = io.BytesIO()
    processed.save(buf, format='JPEG')
    return buf.getvalue()

$ cat deps.txt
torch==2.0.1
torchvision==0.15.2

$ ais etl init code --name etl-torchvision --from-file code.py --deps-file deps.txt --runtime python3.11v2 --comm-type hpull

$ # Perform an offline transformation
$ ais etl bucket etl-torchvision ais://imagenet ais://imagenet-transformed --ext="{JPEG:JPEG}"

$ # Check that the transformed dataset is available in the bucket
$ ais ls ais://imagenet-transformed | head -5

$ # Verify one of the images by downloading its content
$ ais object get ais://imagenet-transformed/ILSVRC2012_val_00050000.JPEG test.JPEG
```

## Extract, Transform, and Load using User-Defined Functions

1. To perform ETL using user-defined functions, send the transform function in the [**init code** request](#init-code-request) to an AIStore endpoint.

2. Upon receiving the **init code** request, the AIStore proxy broadcasts the request to all AIStore targets in the cluster.

3. When an AIStore target receives the **init code**, it starts executing the container **locally** on the target's machine (also known as a [Kubernetes Node](https://kubernetes.io/docs/concepts/architecture/nodes/)).

## Extract, Transform, and Load using Custom Containers

1. To perform ETL using custom containers, execute the [**init spec** API](#init-spec-request) against an AIStore endpoint.

> The request contains a YAML spec and ultimately triggers the creation of [Kubernetes Pods](https://kubernetes.io/docs/concepts/workloads/pods/pod/) that run the user's ETL logic inside.

2. Upon receiving the **init spec** request, the AIStore proxy broadcasts the request to all AIStore targets in the cluster.

3. When a target receives the **init spec**, it starts the user container **locally** on the target's machine (also known as a [Kubernetes Node](https://kubernetes.io/docs/concepts/architecture/nodes/)).

## *init code* request

You can write a custom `transform` function that takes input object bytes as a parameter and returns output bytes (the transformed object's content).
You can then use the *init code* request to execute this `transform` on the entire distributed dataset.

In effect, the *init code* capability lets a user skip writing a Dockerfile and building a custom ETL container entirely.

> If you are familiar with [FaaS](https://en.wikipedia.org/wiki/Function_as_a_service), you will probably find this type of ETL initialization the most intuitive.
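For illustration, here is a minimal sketch of shipping such a function to the cluster via the Python SDK; the endpoint, ETL name, and exact `init_code` signature are assumptions that may differ across SDK versions:

```python
import hashlib

from aistore import Client

client = Client("http://ais-gateway:51080")  # placeholder endpoint

def transform(input_bytes: bytes) -> bytes:
    # Replace the object's content with the MD5 hex digest of its bytes.
    return hashlib.md5(input_bytes).hexdigest().encode()

# Send the *init code* request; each target will then run `transform` locally.
client.etl("etl-md5-user").init_code(transform=transform)
```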
For a detailed step-by-step tutorial on *init code* requests, please see the [Python SDK ETL Tutorial](https://github.com/NVIDIA/aistore/blob/main/python/examples/sdk/sdk-etl-tutorial.ipynb) and the [Python SDK ETL Examples](https://github.com/NVIDIA/aistore/tree/main/python/examples/ais-etl).

The `init_code` request currently supports two communication types:

### `hpush://` communication

With the `hpush` communication type, the user defines a function that takes bytes as a parameter, processes them, and returns bytes - e.g., [ETL to calculate MD5 of an object](https://github.com/NVIDIA/aistore/blob/main/python/examples/ais-etl/etl_md5_hpush.py):

```python
def transform(input_bytes: bytes) -> bytes
```

You can also stream objects in `transform()` by setting the `CHUNK_SIZE` parameter (`CHUNK_SIZE` > 0).

See, e.g., [ETL to calculate MD5 of an object with streaming](https://github.com/NVIDIA/aistore/blob/main/python/examples/ais-etl/etl_md5_hpush_streaming.py) and [ETL to transform images using torchvision](https://github.com/NVIDIA/aistore/blob/main/python/examples/ais-etl/etl_torchvision_hpush.py).

> **Note:**
>- If the function uses external dependencies, a user can provide an optional dependencies file (or list them in the `etl().init_code()` call of the Python SDK). These requirements will be installed on the machine executing the `transform` function and will be available to it.

### `io://` communication

With the `io://` communication type, the user defines a `transform()` function that reads bytes from [`sys.stdin`](https://docs.python.org/3/library/sys.html#sys.stdin), transforms them, and writes the resulting bytes to [`sys.stdout`](https://docs.python.org/3/library/sys.html#sys.stdout):

```python
import sys

def transform() -> None:
    input_bytes = sys.stdin.buffer.read()
    output_bytes = process(input_bytes)  # `process` is your custom transformation
    sys.stdout.buffer.write(output_bytes)
```

See, e.g., [ETL to calculate MD5 of an object](https://github.com/NVIDIA/aistore/blob/main/python/examples/ais-etl/etl_md5_io.py) and [ETL to transform images using torchvision](https://github.com/NVIDIA/aistore/blob/main/python/examples/ais-etl/etl_torchvision_io.py).

### Runtimes

AIS-ETL provides several *runtimes* out of the box.
Each *runtime* determines the programming language of your custom `transform` function and the set of pre-installed packages and tools that your `transform` can utilize.

Currently, the following runtimes are supported:

| Name | Description |
| --- | --- |
| `python3.8v2` | `python:3.8` is used to run the code. |
| `python3.10v2` | `python:3.10` is used to run the code. |
| `python3.11v2` | `python:3.11` is used to run the code. |

More *runtimes* will be added in the future, with plans to support the most popular ETL toolchains.
Still, since the number of supported *runtimes* will always remain somewhat limited, there is a second way: build your own ETL container and deploy it via the [*init spec* request](#init-spec-request).

### Argument Types

The AIStore `etl init code` command provides two `arg_type` parameter options that specify how an object is passed between AIStore and the ETL container. These options are utilized as follows:

| Parameter Value | Description |
|-----------------|-------------|
| `""` (empty string) | The default option: the object is passed as bytes. When initializing ETLs, the `arg_type` parameter can be omitted entirely, in which case the object's bytes are passed to the transform function. |
| `"url"` | The URL of the object to be transformed is passed to the user-defined transform function. Note that this option is limited to `--comm-type=hpull`. In this scenario, the user is responsible for implementing the logic to fetch the object from the bucket based on the URL received as a parameter. |
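For illustration, a sketch of a transform written for `arg_type="url"` (hence `--comm-type=hpull`); the function body is an assumption that simply hashes the fetched content:

```python
import hashlib

import requests

def transform(url: str) -> bytes:
    # With arg_type="url", AIStore passes the object's URL instead of its bytes;
    # fetching the content is the transform's responsibility.
    response = requests.get(url)
    response.raise_for_status()
    return hashlib.md5(response.content).hexdigest().encode()
```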
## *init spec* request

The *init spec* request covers all, even the most sophisticated, cases of ETL initialization.
It allows running any Docker image that implements certain requirements for communication with the cluster.
The *init spec* request requires a Pod specification that follows the requirements below.

For a detailed step-by-step tutorial on the *init spec* request, please see the [MD5 ETL playbook](/docs/tutorials/etl/compute_md5.md).

#### Requirements

A custom ETL container is expected to satisfy the following requirements:

1. Start a web server that supports at least one of the listed [communication mechanisms](#communication-mechanisms).
2. The server can listen on any port, but the port must be specified in the Pod spec with `containerPort` - the cluster must know how to contact the Pod.
3. AIS target(s) may send requests in parallel to the web server inside the ETL container - any synchronization, therefore, must be done on the server side.

#### Specification YAML

The specification of an ETL should be in the form of a YAML file.
It is required to follow the Kubernetes [Pod template format](https://kubernetes.io/docs/concepts/workloads/pods/#pod-templates) and contain all fields necessary to start the Pod.

#### Required or additional fields

| Path | Required | Description | Default |
| --- | --- | --- | --- |
| `metadata.annotations.communication_type` | `false` | [Communication type](#communication-mechanisms) of the ETL. | `hpush://` |
| `metadata.annotations.wait_timeout` | `false` | Timeout on ETL Pods starting on target machines. | infinity |
| `spec.containers` | `true` | Containers running inside the Pod; exactly one is required. | - |
| `spec.containers[0].image` | `true` | Docker image of the ETL container. | - |
| `spec.containers[0].ports` | `true` (except for the `io://` communication type) | Ports exposed by the container; at least one is expected. | - |
| `spec.containers[0].ports[0].Name` | `true` | Name of the container's first port; must be `default`. | - |
| `spec.containers[0].ports[0].containerPort` | `true` | Port on which the cluster will contact the container. | - |
| `spec.containers[0].readinessProbe` | `true` (except for the `io://` communication type) | Readiness probe of the container. | - |
| `spec.containers[0].readinessProbe.timeoutSeconds` | `false` | Timeout for a readiness probe, in seconds. | `5` |
| `spec.containers[0].readinessProbe.periodSeconds` | `false` | Period between readiness probe requests, in seconds. | `10` |
| `spec.containers[0].readinessProbe.httpGet.Path` | `true` | Path for HTTP readiness probe requests. | - |
| `spec.containers[0].readinessProbe.httpGet.Port` | `true` | Port for HTTP readiness probe requests; must be `default`. | - |
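Putting the required fields together, below is a minimal, illustrative Pod spec sketch; the image, entrypoint, readiness path, and port number are placeholders, not a published transformer:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: transformer-example
  annotations:
    communication_type: "hpush://"
    wait_timeout: 5m
spec:
  containers:
    - name: server
      image: docker.io/example/etl-server:latest  # placeholder image
      command: ["./etl_server", "--port=8000"]    # placeholder entrypoint
      ports:
        - name: default        # the first port must be named `default`
          containerPort: 8000  # the port the cluster contacts the container on
      readinessProbe:
        httpGet:
          path: /health        # placeholder readiness path
          port: default
```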
#### Forbidden fields

| Path | Reason |
| --- | --- |
| `spec.affinity.nodeAffinity` | Used by AIStore to colocate ETL containers with targets. |

#### Communication Mechanisms

AIS currently supports three distinct target ⇔ container communication mechanisms to facilitate inline or offline transformations.
Users can choose and specify (via YAML spec) any of the following:

| Name | Value | Description |
|---|---|---|
| **post** | `hpush://` | A target issues a POST request to its ETL container with the body containing the requested object. After finishing the request, the target forwards the response from the ETL container to the user. |
| **reverse proxy** | `hrev://` | A target uses a [reverse proxy](https://en.wikipedia.org/wiki/Reverse_proxy) to forward a (GET) request to its ETL container. The ETL container should make a GET request to the target, transform the bytes, and return the result to the target. |
| **redirect** | `hpull://` | A target uses an [HTTP redirect](https://developer.mozilla.org/en-US/docs/Web/HTTP/Redirections) to send a (GET) request to its ETL container. The ETL container should make a GET request to the target, transform the bytes, and return the result to the user. |
| **input/output** | `io://` | A target runs the binary or code remotely, sends the data to its standard input, and expects the transformed bytes on its standard output. |

> The ETL container will have the `AIS_TARGET_URL` environment variable set to the URL of its corresponding target.
> To make a request for a given object, append `<bucket-name>/<object-name>` to `AIS_TARGET_URL`, e.g., `requests.get(env("AIS_TARGET_URL") + "/" + bucket_name + "/" + object_name)`.

#### Argument Types

The AIStore `etl init spec` command provides three `arg_type` parameter options that specify how an object is passed between AIStore and the ETL container. These options are utilized as follows:

| Parameter Value | Description |
|-----------------|-------------|
| `""` (empty string) | The default option: the object is passed as bytes. When initializing ETLs, the `arg_type` parameter can be omitted entirely, in which case the object's bytes are passed to the transform function. |
| `"url"` | The URL of the object to be transformed is passed to the user-defined transform function. Note that this option is limited to `--comm-type=hpull`. In this scenario, the user is responsible for implementing the logic to fetch the object from the bucket based on the URL received as a parameter. |
| `"fqn"` | A fully-qualified name (FQN) of the locally stored object is passed. The user is responsible for opening, reading, transforming, and closing the corresponding file. |

## Transforming objects

AIStore supports both *inline* transformation of selected objects and *offline* transformation of an entire bucket.

There are several ways to run ETL transformations:
- HTTP RESTful APIs, described in the [API Reference](#api-reference) section of this document
- [ETL CLI](/docs/cli/etl.md)
- [Python SDK](https://github.com/NVIDIA/aistore/blob/main/python/aistore/sdk/README.md#etls)
- [AIS Loader](/docs/aisloader.md)
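To make the two modes concrete, here is a hedged Python SDK sketch; the endpoint, bucket names, and exact method signatures are assumptions that may vary across SDK versions:

```python
from aistore import Client

client = Client("http://ais-gateway:51080")  # placeholder endpoint

# Inline: GET a single object through the ETL; bytes are transformed on the fly.
reader = client.bucket("src").object("text.txt").get(etl_name="etl-md5")
print(reader.read_all())

# Offline: transform every object in ais://src, writing the results to ais://dst.
job_id = client.bucket("src").transform(etl_name="etl-md5", to_bck=client.bucket("dst"))
```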
## API Reference

This section describes how to interact with ETLs via the RESTful API.

> `G` - denotes a (`hostname:port`) address of a **gateway** (any gateway in a given AIS cluster)

| Operation | Description | HTTP action | Example |
| --- | --- | --- | --- |
| Init spec ETL | Initializes ETL based on POD `spec` template. Returns `ETL_NAME`. | PUT /v1/etl | `curl -X PUT 'http://G/v1/etl' '{"spec": "...", "id": "..."}'` |
| Init code ETL | Initializes ETL based on the provided source code. Returns `ETL_NAME`. | PUT /v1/etl | `curl -X PUT 'http://G/v1/etl' '{"code": "...", "dependencies": "...", "runtime": "python3", "id": "..."}'` |
| List ETLs | Lists all running ETLs. | GET /v1/etl | `curl -L -X GET 'http://G/v1/etl'` |
| View ETL init spec/code | Views the spec/code of the ETL with the given `ETL_NAME`. | GET /v1/etl/ETL_NAME | `curl -L -X GET 'http://G/v1/etl/ETL_NAME'` |
| Transform object | Transforms an object with the ETL named `ETL_NAME`. | GET /v1/objects/<bucket>/<objname>?etl_name=ETL_NAME | `curl -L -X GET 'http://G/v1/objects/shards/shard01.tar?etl_name=ETL_NAME' -o transformed_shard01.tar` |
| Transform bucket | Transforms all objects in a bucket and puts them into the destination bucket. | POST {"action": "etl-bck"} /v1/buckets/from-name | `curl -i -X POST -H 'Content-Type: application/json' -d '{"action": "etl-bck", "name": "to-name", "value":{"ext":"destext", "prefix":"prefix", "suffix": "suffix"}}' 'http://G/v1/buckets/from-name'` |
| Dry-run transform bucket | Accumulates in xaction stats how many objects and bytes would be created, without actually creating them. | POST {"action": "etl-bck"} /v1/buckets/from-name | `curl -i -X POST -H 'Content-Type: application/json' -d '{"action": "etl-bck", "name": "to-name", "value":{"ext":"destext", "dry_run": true}}' 'http://G/v1/buckets/from-name'` |
| Stop ETL | Stops the ETL with the given `ETL_NAME`. | POST /v1/etl/ETL_NAME/stop | `curl -X POST 'http://G/v1/etl/ETL_NAME/stop'` |
| Delete ETL | Deletes the ETL spec/code with the given `ETL_NAME`. | DELETE /v1/etl/ETL_NAME | `curl -X DELETE 'http://G/v1/etl/ETL_NAME'` |

## ETL name specifications

Every initialized ETL has a unique, user-defined `ETL_NAME` associated with it, used for running transforms/computations on data or for stopping the ETL.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: compute-md5
(...)
```

When initializing an ETL from spec/code, a valid and unique user-defined `ETL_NAME` should be assigned using the `--name` CLI parameter, as shown below.

```console
$ ais etl init code --name=etl-md5 --from-file=code.py --runtime=python3 --deps-file=deps.txt --comm-type hpull
or
$ ais etl init spec --name=etl-md5 --from-file=spec.yaml --comm-type hpull
```

A valid `ETL_NAME` must satisfy the following specifications:
1. Starts with a letter ('A'-'Z' or 'a'-'z').
2. Can contain letters, digits, underscores ('_'), and hyphens ('-').
3. Has a length greater than 5 and less than 21 characters.
4. Contains no special characters other than underscore and hyphen.
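Expressed as a single regular expression - an illustrative sketch derived from the rules above, not taken from AIStore's source:

```python
import re

# A leading letter followed by 5-19 letters, digits, underscores, or hyphens,
# for a total length of 6-20 characters.
ETL_NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_-]{5,19}$")

assert ETL_NAME_RE.match("etl-md5-demo")
assert not ETL_NAME_RE.match("md5")          # too short
assert not ETL_NAME_RE.match("9transform")   # must start with a letter
```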
## References

* For technical blogs with in-depth background and working real-life examples, see:
  - [ETL: Introduction](https://aiatscale.org/blog/2021/10/21/ais-etl-1)
  - [AIStore SDK & ETL: Transform an image dataset with AIS SDK and load into PyTorch](https://aiatscale.org/blog/2023/04/03/transform-images-with-python-sdk)
  - [ETL: Using WebDataset to train on a sharded dataset](https://aiatscale.org/blog/2021/10/29/ais-etl-3)
* For step-by-step tutorials, see:
  - [Compute the MD5 of the object](/docs/tutorials/etl/compute_md5.md)
* For a quick CLI introduction and reference, see [ETL CLI](/docs/cli/etl.md)
* For initializing ETLs with the AIStore Python SDK, see:
  - [Python SDK ETL Usage Docs](https://github.com/NVIDIA/aistore/blob/main/python/aistore/sdk/README.md#etls)
  - [Python SDK ETL Examples](https://github.com/NVIDIA/aistore/tree/main/python/examples/ais-etl)
  - [Python SDK ETL Tutorial](https://github.com/NVIDIA/aistore/blob/main/python/examples/sdk/sdk-etl-tutorial.ipynb)