---
layout: post
title: "AIStore & ETL: Using WebDataset to train on a sharded dataset (post #3)"
date: Oct 29, 2021 (Revised March 31, 2023)
author: Prashanth Dintyala, Janusz Marcinkiewicz, Alex Aizman, Aaron Wilson
categories: aistore etl pytorch webdataset python
---

**Deprecated** -- `WDTransform` is no longer included as part of the AIS client, so this post remains for educational purposes only. ETL is in active development, and additional transformation tools will be covered in future posts.

## Background

In our [previous post](https://aiatscale.org/blog/2021/10/22/ais-etl-2), we built a basic [PyTorch](https://pytorch.org/) data loader and used it to load transformed images from [AIStore](https://github.com/NVIDIA/aistore) (AIS). Specifically, we used it to run [torchvision](https://pytorch.org/vision/stable/index.html) transformations on images from the [Oxford-IIIT Pet Dataset](https://www.robots.ox.ac.uk/~vgg/data/pets/).

Now we'll look further into training that involves **sharded** datasets. We will utilize [WebDataset](https://github.com/webdataset/webdataset), an [iterable](https://pytorch.org/docs/stable/data.html#iterable-style-datasets) PyTorch dataset that provides a number of important [features](https://pytorch.org/blog/efficient-pytorch-io-library-for-large-datasets-many-files-many-gpus/#benefits). For demonstration purposes, we'll be using [ImageNet](https://www.image-net.org/) - a sharded version of ImageNet, to be precise, where the original images are assembled into `.tar` archives (aka shards).

> For background on WebDataset and AIStore, and on the benefits of *sharding* very large datasets, please see [Efficient PyTorch I/O library for Large Datasets, Many Files, Many GPUs](https://pytorch.org/blog/efficient-pytorch-io-library-for-large-datasets-many-files-many-gpus).

On the Python side, we'll use the [AIS Python client](https://github.com/NVIDIA/ais-etl/tree/post-3/package) - a thin layer on top of the AIStore API/SDK that provides easy operations on remote datasets. We'll be using it to offload image transformations to the AIS cluster.

The remainder of this text is structured as follows:

* introduce sharded ImageNet;
* load a single shard and apply assorted `torchvision` transformations;
* run the same exact transformation in the cluster (in other words, *offload* this specific ETL to AIS);
* operate on multiple ([brace-expansion](https://www.linuxjournal.com/content/bash-brace-expansion) defined) shards.

The first step, though, is to install the required dependencies (e.g., from your Jupyter notebook), as follows:

```jupyter
! pip install webdataset aistore torch torchvision matplotlib
```

## The Dataset

The pre-sharded ImageNet is stored in a Google Cloud bucket that we'll call `sharded-imagenet`. Shards can be inspected with the [`ais`](https://aiatscale.org/docs/cli) command-line tool - on average, in our case, any given shard contains about 1000 original (`.jpg`) ImageNet images and their corresponding (`.cls`) classes:

```jupyter
! ais get gcp://sharded-imagenet/imagenet-train-000000.tar - | tar -tvf - | head -n 10

-r--r--r-- bigdata/bigdata      3 2020-06-25 17:41 0911032.cls
-r--r--r-- bigdata/bigdata  92227 2020-06-25 17:41 0911032.jpg
-r--r--r-- bigdata/bigdata      3 2020-06-25 17:41 1203092.cls
-r--r--r-- bigdata/bigdata  15163 2020-06-25 17:41 1203092.jpg
-r--r--r-- bigdata/bigdata      3 2020-06-25 17:41 0403282.cls
-r--r--r-- bigdata/bigdata 139179 2020-06-25 17:41 0403282.jpg
-r--r--r-- bigdata/bigdata      3 2020-06-25 17:41 0267084.cls
-r--r--r-- bigdata/bigdata 200458 2020-06-25 17:41 0267084.jpg
-r--r--r-- bigdata/bigdata      3 2020-06-25 17:41 1026057.cls
-r--r--r-- bigdata/bigdata 159009 2020-06-25 17:41 1026057.jpg
```

Thus, in terms of its internal structure, this dataset is identical to the one we used in the [previous article](https://aiatscale.org/blog/2021/10/22/ais-etl-2), with one distinct difference: shards (formatted as `.tar` files).
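WebDataset reconstructs training samples from a shard by grouping files that share a basename - which is how `0911032.jpg` and `0911032.cls` above become one sample. The following standard-library sketch (not part of the original post) illustrates that grouping; the local path `/tmp/shard.tar` is hypothetical and assumes the shard was first downloaded, e.g., via `ais get`:

```python
import tarfile

# Sketch: group shard members by basename, the way WebDataset assembles
# samples - so that "0911032.jpg" and "0911032.cls" end up in one record.
# Assumes a local copy of the shard at /tmp/shard.tar (hypothetical path).
samples = {}
with tarfile.open("/tmp/shard.tar") as tar:
    for member in tar.getmembers():
        if not member.isfile():
            continue
        basename, ext = member.name.rsplit(".", 1)
        samples.setdefault(basename, {})[ext] = tar.extractfile(member).read()

key, sample = next(iter(samples.items()))
print(key, sorted(sample))  # e.g.: 0911032 ['cls', 'jpg']
```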
Further, we assume (and require) that AIStore can "see" this GCP bucket. Covering the corresponding AIStore configuration would be outside the scope of this post, but the main point is that AIS *self-populates* on demand. When getting user data from any [remote location](https://github.com/NVIDIA/aistore/blob/main/docs/providers.md), AIS always stores it (i.e., the data), acting simultaneously as a fast-cache tier and reliable-and-scalable high-performance storage.

## Client-side transformation with WebDataset, and with AIStore acting as a traditional (dumb) storage

Next in the plan is to have WebDataset run transformations on the client side. Eventually, we'll move this entire ETL onto AIS, but first let's go over the conventional way of doing things:

```python
from torchvision import transforms
import webdataset as wds

from aistore.sdk import Client

client = Client("http://aistore-sample-proxy:51080")  # AIS IP address or hostname

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

# PyTorch transform to apply (compare with the [previous post](https://aiatscale.org/blog/2021/10/22/ais-etl-2))
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    normalize,
])

# use the AIS client to get the HTTP URL for the shard
shard_url = client.object_url("gcp://sharded-imagenet", "imagenet-train-000000.tar")

dataset = (
    wds.WebDataset(shard_url, handler=wds.handlers.warn_and_continue)
    .decode("pil")
    .to_tuple("jpg;png;jpeg cls", handler=wds.handlers.warn_and_continue)
    .map_tuple(train_transform, lambda x: x)
)

loader = wds.WebLoader(
    dataset,
    batch_size=64,
    shuffle=False,
    num_workers=1,
)
```

#### Comments to the code above

* The [AIS Python client](https://github.com/NVIDIA/ais-etl/tree/post-3/package) helps with WebDataset data loader initialization. In this case, WebDataset loads a single `.tar` shard with `jpg` images and transforms each image in the batch.
* `.decode("pil")` decodes the images into PIL objects, as expected by the torchvision transforms (for details, see the [WebDataset docs](https://webdataset.github.io/webdataset/decoding) on decoding).
* `.map_tuple` does the actual heavy lifting, applying the torchvision transforms.
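As a quick sanity check - a sketch, not part of the original pipeline - we can pull a single batch from `loader` and inspect its shape:

```python
# Sketch: fetch one batch; with the transforms above, each image should
# come out as a 3x224x224 tensor, stacked into batches of 64.
images, classes = next(iter(loader))
print(images.shape)  # expected: torch.Size([64, 3, 224, 224])
print(classes[:4])   # the corresponding class labels
```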
Let's visually compare the original images with their loaded-and-transformed counterparts:

```python
from utils import display_shard_images, display_loader_images
```

```python
display_shard_images(client, "gcp://sharded-imagenet", "imagenet-train-000000.tar", objects=4)
```

![example training images](/assets/wd_aistore/output_8_0.png)

```python
display_loader_images(loader, objects=4)
```

![transformed training images](/assets/wd_aistore/output_9_0.png)

> Source code for `display_shard_images` and `display_loader_images` can be found [here](/assets/wd_aistore/utils.py).

## WDTransform with ETL in the cluster (deprecated)

We are now ready to create a data loader that relies on AIStore for image transformations. For this, we introduce the `WDTransform` class, whose full source is available as part of the [AIS Python client](https://github.com/NVIDIA/ais-etl/tree/post-3/package):

```python
from aistore.client.transform import WDTransform

train_etl = WDTransform(client, wd_transform, transform_name="imagenet-train", verbose=True)
```

In our example, `WDTransform` takes the following transform function as input:

```python
def wd_transform(sample):
    # `train_transform` was declared in the previous section.
    sample["npy"] = train_transform(sample.pop("jpg")).permute(1, 2, 0).numpy().astype("float32")
    return sample
```

The function above returns a transformed NumPy array after applying precisely the same `torchvision` transformations that we described in the [previous section](#client-side-transformation-with-webdataset-and-with-aistore-acting-as-a-traditional-dumb-storage). The input is a dictionary containing a single training tuple: an image (key="jpg") and its corresponding class (key="cls").
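To preview what the function consumes and produces, here is a minimal client-side sketch (reusing `wds`, `shard_url`, and `wd_transform` from the sections above) that applies the transform to one decoded sample:

```python
# Sketch: run wd_transform locally on a single sample to preview the output
# that AIS nodes will produce for each image in the shard.
raw = wds.WebDataset(shard_url).decode("pil")
sample = next(iter(raw))          # dict with "jpg" (PIL image) and "cls" keys
out = wd_transform(dict(sample))  # "jpg" is replaced with a transformed "npy"
print(out["npy"].shape, out["npy"].dtype)  # expected: (224, 224, 3) float32
```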
Putting it all together, here is the code that loads a single transformed shard, with AIS nodes carrying out the actual transformations:

```python
# NOTE:
# The AIS Python client handles ETL initialization on the AIStore cluster - if a given
# named ETL does not exist in the cluster, the client will simply initialize it on the fly:
etl_object_url = client.object_url("gcp://sharded-imagenet", "imagenet-train-000000.tar", transform_id=train_etl.uuid)

to_tensor = transforms.Compose([transforms.ToTensor()])
etl_dataset = (
    wds.WebDataset(etl_object_url, handler=wds.handlers.warn_and_continue)
    .decode("rgb")
    .to_tuple("npy cls", handler=wds.handlers.warn_and_continue)
    .map_tuple(to_tensor, lambda x: x)
)

etl_loader = wds.WebLoader(
    etl_dataset,
    batch_size=64,
    shuffle=False,
    num_workers=1,
)
```

#### Discussion

* We transform all images from a randomly selected shard (e.g., `imagenet-train-000000.tar` above).
* The [AIS Python client](https://github.com/NVIDIA/ais-etl/tree/post-3/package) handles ETL initialization in the cluster.
* The data loader (`etl_dataset`) is almost identical to the WebDataset loader from the previous section. The one major difference, of course, is that it is AIS that runs the transformations.
* The client-side part of the training pipeline handles already-transformed images (represented as NumPy arrays) - which is exactly why we've set the decoder to "rgb" and added the `to_tensor` (NumPy array to PyTorch tensor) conversion.

When we run this snippet in the notebook, we will first see:

```
Initializing ETL...
ETL imagenet-train successfully initialized
```

indicating that our (user-defined) ETL is now ready to execute.

To further confirm that it works, we call `display_loader_images` again:

```python
display_loader_images(etl_loader, objects=4)
```

![ETL transformed training images](/assets/wd_aistore/output_17_0.png)

Indeed, the results of loading `imagenet-train-000000.tar` look virtually identical to the post-transform images we saw in the previous section.

## Iterating through multiple shards

Practically the only difference between single-shard and multi-shard operation is that, for the latter, we specify a *list* or a *range* of shards to operate upon. AIStore supports multiple list/range formats; in this section, we'll show just one that utilizes the familiar Bash brace-expansion notation.

> For instance, the brace expansion "imagenet-val-{000000..000005}.tar" translates into a range of up to 6 (six) shards named accordingly.
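To see what such a template expands to, here's a short sketch using the [`braceexpand`](https://pypi.org/project/braceexpand/) package - a WebDataset dependency that implements this very notation:

```python
from braceexpand import braceexpand

# Sketch: expand the same template client-side to list the shard names.
names = list(braceexpand("imagenet-val-{000000..000005}.tar"))
print(len(names))            # 6
print(names[0], names[-1])   # imagenet-val-000000.tar imagenet-val-000005.tar
```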
For this purpose, we'll keep using `WDTransform` - this time with a validation dataset and a transform function that looks as follows:

```python
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize,
])

def wd_val_transform(sample):
    sample["npy"] = val_transform(sample.pop("jpg")).permute(1, 2, 0).numpy().astype("float32")
    return sample

val_etl = WDTransform(client, wd_val_transform, transform_name="imagenet-val", verbose=True)
```

The following code iterates over an arbitrary range (e.g., `{000000..000005}`), with the `torchvision` transforms performed by AIS nodes - in parallel, and concurrently for all shards in the batch:

```python
# Loading multiple shards using a template.
val_objects = "imagenet-val-{000000..000005}.tar"
val_urls = client.expand_object_urls("gcp://sharded-imagenet", transform_id=val_etl.uuid, template=val_objects)

val_dataset = (
    wds.WebDataset(val_urls, handler=wds.handlers.warn_and_continue)
    .decode("rgb")
    .to_tuple("npy cls", handler=wds.handlers.warn_and_continue)
    .map_tuple(to_tensor, lambda x: x)
)

val_loader = wds.WebLoader(
    val_dataset,
    batch_size=64,
    shuffle=False,
    num_workers=1,
)
```

> Compare this code with the single-shard example from the [previous section](#client-side-transformation-with-webdataset-and-with-aistore-acting-as-a-traditional-dumb-storage).

Now, as before, to make sure that our validation data loader works, we display a few random images:

```python
display_loader_images(val_loader, objects=4)
```

![transformed validation images](/assets/wd_aistore/output_21_0.png)

## Remarks

So far, we have shown how to use WebDataset and AIStore together to offload compute- and I/O-intensive transformations to a dedicated cluster.

Overall, the topic - large-scale inline and offline ETL - is vast, and we've barely scratched the surface. The hope, though, is that this text provides at least a few useful insights.

The complete end-to-end PyTorch training example that we have used here is [available](/examples/etl-imagenet-wd/pytorch_wd.py).

Other references include:

1. AIStore & ETL:
    - [Introduction](https://aiatscale.org/blog/2021/10/21/ais-etl-1)
    - [Using AIS/PyTorch connector to transform ImageNet](https://aiatscale.org/blog/2021/10/22/ais-etl-2)
2. GitHub:
    - [AIStore](https://github.com/NVIDIA/aistore)
    - [AIS/Kubernetes Operator, AIS on bare-metal, Deployment Playbooks, Helm](https://github.com/NVIDIA/ais-k8s)
    - [AIS-ETL containers and specs](https://github.com/NVIDIA/ais-etl)
3. Documentation, blogs, videos:
    - [https://aiatscale.org](https://aiatscale.org/docs)
    - [https://github.com/NVIDIA/aistore/tree/main/docs](https://github.com/NVIDIA/aistore/tree/main/docs)