---
layout: post
title: ETL WEBDATASET
permalink: /tutorials/etl/etl-webdataset
redirect_from:
 - /tutorials/etl/etl_webdataset.md/
 - /docs/tutorials/etl/etl_webdataset.md/
---

# WebDataset ImageNet preprocessing with ETL

In this example, we will see how to use ETL to preprocess the images of ImageNet using [WebDataset](https://github.com/webdataset/webdataset), a PyTorch Dataset implementation providing efficient access to datasets stored in POSIX tar archives.

`Note: ETL is still in development so some steps may not work exactly as written below.`

## Overview

This tutorial consists of the following steps:
1. Prepare the AIStore cluster.
2. Prepare the dataset.
3. Prepare the WebDataset-based transform code (ETL).
4. Transform the dataset online on the AIStore cluster with ETL.

## Prerequisites

* AIStore cluster deployed on Kubernetes. We recommend following one of the guides below.
  * [Deploy AIStore on a local Kubernetes cluster](https://github.com/NVIDIA/ais-k8s/blob/master/operator/README.md)
  * [Deploy AIStore on the cloud](https://github.com/NVIDIA/ais-k8s/blob/master/terraform/README.md)

## Prepare dataset

Before we start writing code, let's put an example tarball with ImageNet images into AIStore.
The tarball we will be using is [imagenet-train-000999.tar](https://storage.googleapis.com/nvdata-imagenet/imagenet-train-000999.tar), which is already in a WebDataset-friendly format.

```console
$ tar -tvf imagenet-train-000999.tar | head -n 5
-r--r--r-- bigdata/bigdata      3 2020-06-25 11:11 0196495.cls
-r--r--r-- bigdata/bigdata 109671 2020-06-25 11:11 0196495.jpg
-r--r--r-- bigdata/bigdata      3 2020-06-25 11:11 0877232.cls
-r--r--r-- bigdata/bigdata 104484 2020-06-25 11:11 0877232.jpg
-r--r--r-- bigdata/bigdata      3 2020-06-25 11:11 0600062.cls
$ ais create ais://imagenet
"ais://imagenet" bucket created
$ ais put imagenet-train-000999.tar ais://imagenet
PUT "imagenet-train-000999.tar" into bucket "ais://imagenet"
```

## Prepare ETL code

Now that we have the ImageNet shard prepared, we need the ETL code that will do the transformation.
Here we will use the `io://` communicator type with the `python3.11v2` runtime to install the `torchvision` and `webdataset` packages.

Our transform code will look like this (`code.py`):
```python
# -*- Python -*-

# Perform imagenet-style augmentation and normalization on the shards
# on stdin, returning a new dataset on stdout.

import sys
from torchvision import transforms
import webdataset as wds

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

augment = transforms.Compose(
    [
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        normalize,
    ]
)

# Read the incoming shard from stdin ("-") and decode images to PIL objects.
dataset = wds.WebDataset("-").decode("pil")

# Write the transformed shard to stdout ("-").
sink = wds.TarWriter("-")
for sample in dataset:
    print(sample.get("__key__"), file=sys.stderr)
    sample["npy"] = augment(sample.pop("jpg")).numpy().astype("float16")
    sink.write(sample)
```

The idea here is that we unpack the tarball, process each image, and save it as a numpy array into the transformed output tarball.
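
Since the `io://` communicator streams each shard through the transform's standard input and output, you can sanity-check `code.py` locally before initializing the ETL. A minimal sketch, assuming `torchvision` and `webdataset` are installed in your local Python environment (`preprocessed-local.tar` is just an illustrative output name):

```console
$ pip3 install torchvision webdataset
$ python3 code.py < imagenet-train-000999.tar > preprocessed-local.tar
$ tar -tf preprocessed-local.tar | head -n 4
```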
To make sure the code runs, we need to specify its dependencies (`deps.txt`):
```
git+https://github.com/tmbdev/webdataset.git
torchvision==0.15.1
PyYAML==5.4.1
```

## Transform dataset

Now we can initialize the ETL:
```console
$ ais etl init code --from-file=code.py --deps-file=deps.txt --runtime=python3.11v2 --comm-type="io://"
f90r81wR0
$ ais etl object f90r81wR0 imagenet/imagenet-train-000999.tar preprocessed-train.tar
$ tar -tvf preprocessed-train.tar | head -n 6
-r--r--r-- bigdata/bigdata      3 2021-07-20 23:52 0196495.cls
-r--r--r-- bigdata/bigdata 301184 2021-07-20 23:52 0196495.npy
-r--r--r-- bigdata/bigdata      3 2021-07-20 23:52 0877232.cls
-r--r--r-- bigdata/bigdata 301184 2021-07-20 23:52 0877232.npy
-r--r--r-- bigdata/bigdata      3 2021-07-20 23:52 0600062.cls
-r--r--r-- bigdata/bigdata 301184 2021-07-20 23:52 0600062.npy
```

As expected, the new tarball contains the transformed images stored as numpy arrays in `.npy` format: each `float16` array of shape `(3, 224, 224)` takes 3 × 224 × 224 × 2 = 301056 bytes of data plus a 128-byte `.npy` header, for `301184` bytes per entry.
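
Once transformed, the shard can be consumed directly with WebDataset in a PyTorch training loop. Below is a minimal sketch, not part of the original tutorial, assuming the transformed tarball has been fetched locally as `preprocessed-train.tar`; webdataset's default decoders load `.npy` members as numpy arrays and `.cls` members as integers:

```python
import numpy as np
import webdataset as wds

# Default decoders: ".npy" -> numpy array, ".cls" -> int.
dataset = wds.WebDataset("preprocessed-train.tar").decode().to_tuple("npy", "cls")

for image, label in dataset:
    # Each image is the float16 (3, 224, 224) array produced by the ETL.
    assert image.shape == (3, 224, 224) and image.dtype == np.float16
    print(label, image.shape)
    break
```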