
---
layout: post
title: "AIStore with WebDataset Part 3 -- Building a Pipeline for Model Training"
date: June 09, 2023
author: Aaron Wilson
categories: aistore etl pytorch python webdataset
---

In the previous posts ([pt1](https://aiatscale.org/blog/2023/05/05/aisio-transforms-with-webdataset-pt-1), [pt2](https://aiatscale.org/blog/2023/05/11/aisio-transforms-with-webdataset-pt-2)), we discussed converting a dataset into shards of samples in the [WebDataset format](https://github.com/webdataset/webdataset#the-webdataset-format) and creating a function to transform these shards using [AIStore ETL](https://github.com/NVIDIA/aistore/blob/main/docs/etl.md).
For the next step of model training, we must continuously fetch transformed samples from these shards.
This post will demonstrate how to use the WebDataset library and PyTorch to generate a DataLoader for the last stage of the pipeline.
The final pipeline will transform, decode, shuffle, and batch samples on demand for model training.

---

## Datasets, DataPipes, DataLoaders

First, it's important to understand the difference between each of the PyTorch types:

- [Dataset](https://pytorch.org/docs/stable/data.html#dataset-types) -- Datasets provide access to the data and can be either map-style or iterable-style (see the sketch after this list).
- [TorchData DataPipes](https://pytorch.org/data/main/torchdata.datapipes.iter.html) -- The beta TorchData library provides DataPipes, which PyTorch calls "a renaming and repurposing of the PyTorch Dataset for composed usage" (see [TorchData Github - What are DataPipes?](https://github.com/pytorch/data#what-are-datapipes)). A subclass of Dataset, DataPipes are a newer implementation designed for more flexibility in composing pipelines and are compatible with the newer [DataLoader2](https://pytorch.org/data/main/dataloader2.html).
- [DataLoader](https://pytorch.org/docs/stable/data.html#module-torch.utils.data) -- The DataLoader manages interactions with a Dataset, scheduling workers to fetch new samples as needed and performing cross-dataset operations such as shuffling and batching. It is the final step in the pipeline and ultimately provides the arrays of input data to the model for training.
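
To make the map-style vs. iterable-style distinction concrete, here is a minimal sketch (the toy classes are invented for illustration and are not part of the example code):

```python
from torch.utils.data import Dataset, IterableDataset


class MapStyleSquares(Dataset):
    """Map-style: random access to any sample by index, with a known length."""

    def __len__(self):
        return 10

    def __getitem__(self, idx):
        return idx**2


class IterableSquares(IterableDataset):
    """Iterable-style: samples are produced only by iterating in order,
    a natural fit for streaming shards over a network."""

    def __iter__(self):
        return (idx**2 for idx in range(10))
```

The WebDataset and AIS classes used below are all iterable-style: samples stream sequentially out of tar shards rather than being looked up by index.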

Both WebDataset and AIStore provide their own implementations of these tools:

- The AIStore `AISSourceLister` is a TorchData `IterDataPipe` that provides the URLs to access each of the provided AIS resources. `AISFileLister` and `AISFileLoader` are also available to load objects directly. In this example, we'll use `AISSourceLister` to allow WebDataset to handle reading the objects.
- The WebDataset library's [WebDataset class](https://github.com/webdataset/webdataset#webdataset) is an implementation of a PyTorch `IterableDataset`.
- WebDataset's [WebLoader class](https://github.com/webdataset/webdataset#dataloader) wraps the PyTorch `DataLoader` class, providing an easy way to extend functionality with built-in methods such as `shuffle`, `batch`, and `decode`.

Below, we'll go through an example of tying each of these utilities together. The full example code can be found [here](/python/examples/aisio-pytorch/pytorch_webdataset.py).

---

## Creating an Iterable Dataset

In [part 1](https://aiatscale.org/blog/2023/05/05/aisio-transforms-with-webdataset-pt-1) of this series, we uploaded shards of training samples in the WebDataset format.
In [part 2](https://aiatscale.org/blog/2023/05/11/aisio-transforms-with-webdataset-pt-2), we created an AIStore ETL process in our cluster to transform these shards on the server side.

Now that we have the data and the transform function in place, we can use the `AISSourceLister` Iterable DataPipe to retrieve the URLs to the data we want in AIStore, with the ETL transform applied:

1. Define the sources in AIS. In this case, we'll just use every object inside the `images` bucket created in part 1.
2. Apply an existing ETL if desired. When the provided sources are read, the ETL will be applied to the objects inline.
3. Since WebDataset expects a dictionary of sources, apply a simple function to transform each entry into a dictionary.
4. Finally, shuffle the sources to avoid any bias from the order of the shards.

```python
sources = AISSourceLister(ais_sources=[bucket], etl_name=etl_name).map(lambda source_url: {"url": source_url})\
    .shuffle()
```

We can now initialize `wds.WebDataset` by providing our DataPipe of dictionaries containing object URLs.
WebDataset will then handle fetching the objects and interpreting each individual record inside the object tars.

```python
dataset = wds.WebDataset(sources)
```

After this step, we have an iterable dataset over the individual samples and can use any of the WebDataset built-in functions to modify them (see [the WebDataset README](https://github.com/webdataset/webdataset)).
Here we'll shuffle the samples in each shard, decode the image files to tensors, and convert to batches of tuples.
Since we expect to use multiple dataset workers, each operating in its own subprocess, we'll batch the samples here to reduce the overhead of unpacking individual samples in the main process.

Full `WebDataset` creation code:

```python
import webdataset as wds

from aistore.pytorch import AISSourceLister

# `client`, `bucket_name`, and `etl_name` are set up as in parts 1 and 2


def create_dataset() -> wds.WebDataset:
    bucket = client.bucket(bucket_name)
    # Get a list of urls for each object in AIS, with ETL applied, converted to the format WebDataset expects
    sources = AISSourceLister(ais_sources=[bucket], etl_name=etl_name).map(lambda source_url: {"url": source_url})\
        .shuffle()
    # Load the shuffled list of transformed shards into the WebDataset pipeline
    dataset = wds.WebDataset(sources)
    # Shuffle samples and apply the built-in WebDataset decoder for image files
    dataset = dataset.shuffle(size=1000).decode("torchrgb")
    # Return an iterator over samples as tuples in batches of 16
    return dataset.to_tuple("cls", "image.jpg", "trimap.png").batched(16)
```

![WebDataset](/assets/aisio_inline_wdataset/WebDataset.jpg)

---

## Creating a DataLoader

With a dataset defined, we can now use the WebDataset `WebLoader` class, introduced above, to manage retrieving data from the dataset.
`WebLoader` allows us to parallelize loading from the dataset using multiple workers, then shuffle and batch the samples for training.
Here we create a `WebLoader` with 4 dataset workers.
Each of those workers will execute the process defined in the dataset: fetch an object from AIS, parse it into samples, and return a batch.

Note that `batch_size` in the `WebLoader` below is set to `None` because each dataset worker does its own batching and yields a batch of 16 samples.
We can then use the WebLoader `FluidInterface` functionality to first *unbatch* the minibatches of samples from each of our dataset workers, shuffle across a defined number of samples (`1000` in the example below), and then *rebatch* into batches of the size we actually want to provide to the model.

```python
def create_dataloader(dataset):
    loader = wds.WebLoader(dataset, num_workers=4, batch_size=None)
    return loader.unbatched().shuffle(1000).batched(64)
```

A simplified version of this pipeline with 3 dataset workers, a dataset batch size of 3, and a final batch size of 4 would look like this:

![WebLoader](/assets/aisio_inline_wdataset/WebLoader.jpg)

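To see concretely what the unbatch/shuffle/rebatch step does, here is a plain-Python toy version of it (no AIS or WebDataset involved), using the same numbers as the figure: 3 workers, a worker batch size of 3, and a final batch size of 4.

```python
import random

# Three "dataset workers", each yielding one minibatch of 3 samples
worker_batches = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]

# Unbatch: flatten the worker minibatches into a single stream of samples
samples = [s for batch in worker_batches for s in batch]

# Shuffle: WebLoader shuffles within a sliding buffer; the toy version shuffles everything
random.shuffle(samples)

# Rebatch into the final batch size of 4
final_batches = [samples[i:i + 4] for i in range(0, len(samples), 4)]
print(final_batches)  # e.g. [[5, 0, 7, 2], [8, 3, 1, 6], [4]]
```
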
---

## Results

Finally, we can inspect the results generated by the DataLoader, ready for model training.
Note that none of the pipeline actually runs until the DataLoader requests the next batch of samples.

```python
def view_data(dataloader):
    # Get the first batch
    batch = next(iter(dataloader))
    classes, images, trimaps = batch
    # Result is a set of tensors with the first dimension being the batch size
    print(classes.shape, images.shape, trimaps.shape)
    # View the first image and trimap in the first batch
    # (show_image_tensor is defined in the full example code)
    show_image_tensor(images[0])
    show_image_tensor(trimaps[0])
```
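
Putting it all together, a minimal driver for the whole pipeline might look like this (assuming `client`, `bucket_name`, and `etl_name` are configured as in parts 1 and 2):

```python
dataset = create_dataset()               # WebDataset over the ETL-transformed shards
dataloader = create_dataloader(dataset)  # parallel loading, shuffle, and rebatch
view_data(dataloader)                    # fetch and display the first batch
```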
## References

1. GitHub:
    - [AIStore](https://github.com/NVIDIA/aistore)
    - [WebDataset Library](https://github.com/webdataset/webdataset)
2. Documentation, blogs, videos:
    - https://aiatscale.org
    - https://github.com/NVIDIA/aistore/tree/main/docs
    - PyTorch intro to Datasets and DataLoaders: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
    - Discussion on Datasets, DataPipes, DataLoaders: https://sebastianraschka.com/blog/2022/datapipes.html
3. Full code example:
    - [PyTorch Pipelines With WebDataset Example](/python/examples/aisio-pytorch/pytorch_webdataset.py)
4. Dataset:
    - [The Oxford-IIIT Pet Dataset](https://www.robots.ox.ac.uk/~vgg/data/pets/)