---
layout: post
title:  "AIStore & ETL: Using WebDataset to train on a sharded dataset (post #3)"
date:   Oct 29, 2021 (Revised March 31, 2023)
author: Prashanth Dintyala, Janusz Marcinkiewicz, Alex Aizman, Aaron Wilson
categories: aistore etl pytorch webdataset python
---

**Deprecated** -- `WDTransform` is no longer included as part of the AIS client, so this post remains for educational purposes only. ETL is in development, and additional transformation tools will be included in future posts.

## Background

In our [previous post](https://aiatscale.org/blog/2021/10/22/ais-etl-2), we built a basic [PyTorch](https://pytorch.org/) data loader and used it to load transformed images from [AIStore](https://github.com/NVIDIA/aistore) (AIS). We used the latter to run [torchvision](https://pytorch.org/vision/stable/index.html) transformations of the [Oxford-IIIT Pet Dataset](https://www.robots.ox.ac.uk/~vgg/data/pets/) images.

Now, we'll be looking further into training that involves **sharded** datasets. We will utilize [WebDataset](https://github.com/webdataset/webdataset), an [iterable](https://pytorch.org/docs/stable/data.html#iterable-style-datasets) PyTorch dataset that provides a number of important [features](https://pytorch.org/blog/efficient-pytorch-io-library-for-large-datasets-many-files-many-gpus/#benefits). For demonstration purposes, we'll be using [ImageNet](https://www.image-net.org/) - a sharded version of ImageNet, to be precise, where the original images are assembled into `.tar` archives (aka shards).

> For background on WebDataset and AIStore, and the benefits of *sharding* very large datasets, please see [Efficient PyTorch I/O library for Large Datasets, Many Files, Many GPUs](https://pytorch.org/blog/efficient-pytorch-io-library-for-large-datasets-many-files-many-gpus).

On the Python side, we'll use the [AIS Python client](https://github.com/NVIDIA/ais-etl/tree/post-3/package). The client is a thin layer on top of the AIStore API/SDK that provides easy operations on remote datasets. We'll be using it to offload image transformations to the AIS cluster.

The remainder of this text is structured as follows:

* introduce sharded ImageNet;
* load a single shard and apply assorted `torchvision` transformations;
* run the same exact transformation in the cluster (in other words, *offload* this specific ETL to AIS);
* operate on multiple ([brace-expansion](https://www.linuxjournal.com/content/bash-brace-expansion) defined) shards.

The first step, though, is to install the required dependencies (e.g., from your Jupyter notebook), as follows:

```jupyter
! pip install webdataset aistore torch torchvision matplotlib
```

## The Dataset

The pre-sharded ImageNet is stored in a Google Cloud bucket that we'll call `sharded-imagenet`. Shards can be inspected with the [`ais`](https://aiatscale.org/docs/cli) command-line tool - on average, in our case, any given shard will contain about 1000 original (`.jpg`) ImageNet images and their corresponding (`.cls`) classes:

```jupyter
! ais get gcp://sharded-imagenet/imagenet-train-000000.tar - | tar -tvf - | head -n 10

    -r--r--r-- bigdata/bigdata      3 2020-06-25 17:41 0911032.cls
    -r--r--r-- bigdata/bigdata  92227 2020-06-25 17:41 0911032.jpg
    -r--r--r-- bigdata/bigdata      3 2020-06-25 17:41 1203092.cls
    -r--r--r-- bigdata/bigdata  15163 2020-06-25 17:41 1203092.jpg
    -r--r--r-- bigdata/bigdata      3 2020-06-25 17:41 0403282.cls
    -r--r--r-- bigdata/bigdata 139179 2020-06-25 17:41 0403282.jpg
    -r--r--r-- bigdata/bigdata      3 2020-06-25 17:41 0267084.cls
    -r--r--r-- bigdata/bigdata 200458 2020-06-25 17:41 0267084.jpg
    -r--r--r-- bigdata/bigdata      3 2020-06-25 17:41 1026057.cls
    -r--r--r-- bigdata/bigdata 159009 2020-06-25 17:41 1026057.jpg
```

Thus, in terms of its internal structure, this dataset is identical to what we've had in the [previous article](https://aiatscale.org/blog/2021/10/22/ais-etl-2), with one distinct difference: the images are now packaged into shards (formatted as `.tar` files).

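To make the shard layout concrete, here's a minimal sketch (standard library only) that groups a shard's members into per-sample dictionaries keyed by basename. It assumes a local copy of the shard, downloaded beforehand, e.g., with `ais get` as shown above:

```python
import tarfile
from collections import defaultdict

samples = defaultdict(dict)
# assumes the shard was downloaded locally, e.g.:
#   ais get gcp://sharded-imagenet/imagenet-train-000000.tar imagenet-train-000000.tar
with tarfile.open("imagenet-train-000000.tar") as tar:
    for member in tar.getmembers():
        key, ext = member.name.rsplit(".", 1)
        samples[key][ext] = tar.extractfile(member).read()

# each sample now maps extensions to raw bytes, e.g. (values hypothetical):
#   samples["0911032"] -> {"cls": b"153", "jpg": b"\xff\xd8..."}
```

This basename-based grouping is exactly the convention WebDataset relies on when it reassembles `.jpg`/`.cls` pairs into training samples.
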
Further, we assume (and require) that AIStore can "see" this GCP bucket. Covering the corresponding AIStore configuration would be outside the scope of this post; the main point is that AIS *self-populates* on demand. When getting user data from any [remote location](https://github.com/NVIDIA/aistore/blob/main/docs/providers.md), AIS always stores it (i.e., the data), acting simultaneously as a fast-cache tier and a high-performance, reliable-and-scalable storage.

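For example, after the `ais get` above, the shard is already stored (cached) in the cluster. Assuming the `--cached` flag of [`ais ls`](https://aiatscale.org/docs/cli), we can check which objects are now present in AIS:

```jupyter
! ais ls gcp://sharded-imagenet --cached | head -n 5
```
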
## Client-side transformation with WebDataset, and with AIStore acting as a traditional (dumb) storage

Next in the plan is to have WebDataset run transformations on the client side. Eventually, we'll move this entire ETL code onto AIS. But first, let's go over the conventional way of doing things:

```python
from torchvision import transforms
import webdataset as wds

from aistore.sdk import Client

client = Client("http://aistore-sample-proxy:51080") # AIS IP address or hostname

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

# PyTorch transform to apply (compare with the previous post:
# https://aiatscale.org/blog/2021/10/22/ais-etl-2)
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    normalize,
])

# use the AIS client to get the HTTP URL for the shard
shard_url = client.object_url("gcp://sharded-imagenet", "imagenet-train-000000.tar")

dataset = (
    wds.WebDataset(shard_url, handler=wds.handlers.warn_and_continue)
        .decode("pil")
        .to_tuple("jpg;png;jpeg cls", handler=wds.handlers.warn_and_continue)
        .map_tuple(train_transform, lambda x: x)
)

loader = wds.WebLoader(
    dataset,
    batch_size=64,
    shuffle=False,
    num_workers=1,
)
```

#### Comments on the code above

* The [AIS Python client](https://github.com/NVIDIA/ais-etl/tree/post-3/package) helps with WebDataset data loader initialization. In this case, WebDataset loads a single `.tar` shard with `jpg` images and transforms each image in the batch.
* `.decode("pil")` indicates torchvision data augmentation (for details, see the [WebDataset docs](https://webdataset.github.io/webdataset/decoding) on decoding).
* `.map_tuple` does the actual heavy lifting, applying the torchvision transforms (a quick sanity check follows below).

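As that sanity check, we can pull a single batch and inspect it. With `batch_size=64` and `RandomResizedCrop(224)`, a full batch should come out as `64x3x224x224`:

```python
# fetch one batch from the loader and verify the transformed tensor shape
images, classes = next(iter(loader))
print(images.shape)  # expected: torch.Size([64, 3, 224, 224])
print(classes[:4])   # the corresponding class labels
```
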
Let's visually compare original images with their loaded-and-transformed counterparts:

```python
from utils import display_shard_images, display_loader_images
```

```python
display_shard_images(client, "gcp://sharded-imagenet", "imagenet-train-000000.tar", objects=4)
```

![example training images](/assets/wd_aistore/output_8_0.png)

```python
display_loader_images(loader, objects=4)
```

![transformed training images](/assets/wd_aistore/output_9_0.png)

> Source code for `display_shard_images` and `display_loader_images` can be found [here](/assets/wd_aistore/utils.py).

## WDTransform with ETL in the cluster (deprecated)

We are now ready to create a data loader that relies on AIStore for image transformations. For this, we introduce the `WDTransform` class, the full source for which is available as part of the [AIS Python client](https://github.com/NVIDIA/ais-etl/tree/post-3/package):

```python
from aistore.client.transform import WDTransform

train_etl = WDTransform(client, wd_transform, transform_name="imagenet-train", verbose=True)
```

In our example, `WDTransform` takes the following `transform_func` as input:

```python
def wd_transform(sample):
    # `train_transform` was declared in the previous section.
    sample["npy"] = train_transform(sample.pop("jpg")).permute(1, 2, 0).numpy().astype("float32")
    return sample
```

The function above returns a transformed NumPy array, after applying precisely the same `torchvision` transformations that we described in the [previous section](#client-side-transformation-with-webdataset-and-with-aistore-acting-as-a-traditional-dumb-storage). The input is a dictionary containing a single training tuple: an image (key="jpg") and the corresponding class (key="cls").

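To see the function in action without a cluster, we can feed it a synthetic sample. This is a minimal sketch: the key and class value are made up, and the image is a blank placeholder:

```python
from PIL import Image

# hypothetical sample, shaped the way WebDataset presents shard contents
sample = {"__key__": "0911032", "cls": b"0", "jpg": Image.new("RGB", (500, 375))}

out = wd_transform(sample)
print(out["npy"].shape, out["npy"].dtype)  # (224, 224, 3) float32
```
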
Putting it all together, here is the code that loads a single transformed shard, with AIS nodes carrying out the actual transformations:

```python
# NOTE:
#  The AIS Python client handles initialization of ETL on the AIStore cluster - if a given named ETL
#  does not exist in the cluster, the client will simply initialize it on the fly:
etl_object_url = client.object_url("gcp://sharded-imagenet", "imagenet-train-000000.tar", transform_id=train_etl.uuid)

to_tensor = transforms.Compose([transforms.ToTensor()])
etl_dataset = (
    wds.WebDataset(etl_object_url, handler=wds.handlers.warn_and_continue)
        .decode("rgb")
        .to_tuple("npy cls", handler=wds.handlers.warn_and_continue)
        .map_tuple(to_tensor, lambda x: x)
)

etl_loader = wds.WebLoader(
    etl_dataset,
    batch_size=64,
    shuffle=False,
    num_workers=1,
)
```

#### Discussion

* We transform all images from a randomly selected shard (e.g., `imagenet-train-000000.tar` above).
* The [AIS Python client](https://github.com/NVIDIA/ais-etl/tree/post-3/package) handles ETL initialization in the cluster.
* The data loader (`etl_dataset`) is very similar, almost identical, to the WebDataset loader from the previous section. One major difference, of course, is that it is AIS that runs the transformations.
* The client-side part of the training pipeline handles already-transformed images (represented as NumPy arrays).
* Which is exactly why we've set the decoder to "rgb" and added the `to_tensor` (NumPy array to PyTorch tensor) conversion.

When we run this snippet in the notebook, we will first see:

```
    Initializing ETL...
    ETL imagenet-train successfully initialized
```

indicating that our (user-defined) ETL is now ready to execute.

To further confirm that it does work, we again call `display_loader_images`:

```python
display_loader_images(etl_loader, objects=4)
```

![ETL transformed training images](/assets/wd_aistore/output_17_0.png)

So, indeed, the results of loading `imagenet-train-000000.tar` look virtually identical to the post-transform images we saw in the previous section.

## Iterating through multiple shards

The one and practically only difference between single-shard and multi-shard operation is that, for the latter, we specify a *list* or a *range* of shards (to operate upon). AIStore supports multiple list/range formats; in this section, we'll show just one that utilizes the familiar Bash brace-expansion notation.

> For instance, the brace expansion "imagenet-val-{000000..000005}.tar" translates as a range of up to 6 (six) shards named accordingly.

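Under the hood, this is ordinary brace expansion. The [braceexpand](https://github.com/trendels/braceexpand) package, which WebDataset itself depends on, can be used to preview what such a template expands to:

```python
from braceexpand import braceexpand

names = list(braceexpand("imagenet-val-{000000..000005}.tar"))
print(len(names))           # 6
print(names[0], names[-1])  # imagenet-val-000000.tar imagenet-val-000005.tar
```
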
For this purpose, we'll keep using `WDTransform`, but this time with a validation dataset and a transformation function that looks as follows:

```python
val_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize,
])

def wd_val_transform(sample):
    sample["npy"] = val_transform(sample.pop("jpg")).permute(1, 2, 0).numpy().astype("float32")
    return sample

val_etl = WDTransform(client, wd_val_transform, transform_name="imagenet-val", verbose=True)
```

Here is the code to iterate over an arbitrary range (e.g., `{000000..000005}`), with the `torchvision` transformations performed by AIS nodes - in parallel and concurrently for all shards in a batch:

```python
# Loading multiple shards using a template.
val_objects = "imagenet-val-{000000..000005}.tar"
val_urls = client.expand_object_urls("gcp://sharded-imagenet", transform_id=val_etl.uuid, template=val_objects)

val_dataset = (
    wds.WebDataset(val_urls, handler=wds.handlers.warn_and_continue)
        .decode("rgb")
        .to_tuple("npy cls", handler=wds.handlers.warn_and_continue)
        .map_tuple(to_tensor, lambda x: x)
)

val_loader = wds.WebLoader(
    val_dataset,
    batch_size=64,
    shuffle=False,
    num_workers=1,
)
```

> Compare this code with the single-shard example from the [previous section](#client-side-transformation-with-webdataset-and-with-aistore-acting-as-a-traditional-dumb-storage).

Now, as before, to make sure that our validation data loader does work, we display a few random images:

```python
display_loader_images(val_loader, objects=4)
```

![transformed validation images](/assets/wd_aistore/output_21_0.png)

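Finally, a minimal sketch of how such a validation loader might be consumed. The `resnet18` below is an arbitrary, untrained stand-in - the complete training example linked below does this properly:

```python
import torch
from torchvision.models import resnet18

model = resnet18(num_classes=1000)  # untrained placeholder model
model.eval()

with torch.no_grad():
    for images, classes in val_loader:
        logits = model(images)
        preds = logits.argmax(dim=1)
        # ... accumulate accuracy/loss here ...
        break  # one batch is enough for this sketch
```
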
## Remarks

So far, we have shown how to use (WebDataset + AIStore) to offload compute- and I/O-intensive transformations to a dedicated cluster.

Overall, the topic - large-scale inline and offline ETL - is vast, and we've barely scratched the surface. The hope, though, is that this text provides at least a few useful insights.

The complete end-to-end PyTorch training example that we have used here is [available](/examples/etl-imagenet-wd/pytorch_wd.py).

Other references include:

1. AIStore & ETL:
    - [Introduction](https://aiatscale.org/blog/2021/10/21/ais-etl-1)
    - [Using AIS/PyTorch connector to transform ImageNet](https://aiatscale.org/blog/2021/10/22/ais-etl-2)
2. GitHub:
    - [AIStore](https://github.com/NVIDIA/aistore)
    - [AIS/Kubernetes Operator, AIS on bare-metal, Deployment Playbooks, Helm](https://github.com/NVIDIA/ais-k8s)
    - [AIS-ETL containers and specs](https://github.com/NVIDIA/ais-etl)
3. Documentation, blogs, videos:
    - [https://aiatscale.org/docs](https://aiatscale.org/docs)
    - [https://github.com/NVIDIA/aistore/tree/main/docs](https://github.com/NVIDIA/aistore/tree/main/docs)