---
layout: post
title: "AIStore with WebDataset Part 3 -- Building a Pipeline for Model Training"
date: June 09, 2023
author: Aaron Wilson
categories: aistore etl pytorch python webdataset
---

In the previous posts ([pt1](https://aiatscale.org/blog/2023/05/05/aisio-transforms-with-webdataset-pt-1), [pt2](https://aiatscale.org/blog/2023/05/11/aisio-transforms-with-webdataset-pt-2)), we discussed converting a dataset into shards of samples in the [WebDataset format](https://github.com/webdataset/webdataset#the-webdataset-format) and creating a function to transform these shards using [AIStore ETL](https://github.com/NVIDIA/aistore/blob/main/docs/etl.md).
For the next step, model training, we must continuously fetch transformed samples from these shards.
This post demonstrates how to use the WebDataset library and PyTorch to generate a DataLoader for the last stage of the pipeline.
The final pipeline will transform, decode, shuffle, and batch samples on demand for model training.

---

## Datasets, DataPipes, DataLoaders

First, it's important to understand the difference between each of the PyTorch types:

- [Dataset](https://pytorch.org/docs/stable/data.html#dataset-types) -- Datasets provide access to the data and can be either map-style or iterable-style.
- [TorchData DataPipes](https://pytorch.org/data/main/torchdata.datapipes.iter.html) -- The beta TorchData library provides DataPipes, which PyTorch calls "a renaming and repurposing of the PyTorch Dataset for composed usage" (see [TorchData GitHub - What are DataPipes?](https://github.com/pytorch/data#what-are-datapipes)). A subclass of Dataset, DataPipes are a newer implementation designed for more flexibility in composing pipelines and are compatible with the newer [DataLoader2](https://pytorch.org/data/main/dataloader2.html).
- [DataLoader](https://pytorch.org/docs/stable/data.html#module-torch.utils.data) -- The DataLoader manages interactions with a Dataset, scheduling workers to fetch new samples as needed and performing cross-dataset operations such as shuffling and batching. It is the final step in the pipeline and ultimately provides the arrays of input data to the model for training.

Both WebDataset and AIStore provide their own implementations of these tools:

- The AIStore `AISSourceLister` is a TorchData `IterDataPipe` that provides the URLs to access each of the provided AIS resources. `AISFileLister` and `AISFileLoader` are also available to load objects directly. In this example, we'll use `AISSourceLister` and let WebDataset handle reading the objects.
- The webdataset library's [WebDataset class](https://github.com/webdataset/webdataset#webdataset) is an implementation of a PyTorch `IterableDataset`.
- The webdataset library's [WebLoader class](https://github.com/webdataset/webdataset#dataloader) wraps the PyTorch `DataLoader` class, providing an easy way to extend its functionality with built-in methods such as `shuffle`, `batch`, and `decode`.

Below, we'll walk through an example that ties each of these utilities together. The full example code can be found [here](/python/examples/aisio-pytorch/pytorch_webdataset.py).
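The snippets in the following sections reference a `client`, `bucket_name`, and `etl_name` that the full example script sets up beforehand. A minimal sketch of that setup might look like the following; the endpoint and ETL name are placeholders (the `images` bucket is the one created in part 1, and import paths follow the AIStore Python SDK at the time of writing):

```python
import webdataset as wds

from aistore.pytorch import AISSourceLister
from aistore.sdk import Client

AIS_ENDPOINT = "http://localhost:8080"  # placeholder -- your cluster's endpoint
bucket_name = "images"  # bucket of WebDataset shards created in part 1
etl_name = "wd-transform"  # placeholder -- name of the ETL initialized in part 2

# Client for communicating with the AIStore cluster
client = Client(AIS_ENDPOINT)
```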
---

## Creating an Iterable Dataset

In [part 1](https://aiatscale.org/blog/2023/05/05/aisio-transforms-with-webdataset-pt-1) of this series, we uploaded shards of training samples in the WebDataset format.
In [part 2](https://aiatscale.org/blog/2023/05/11/aisio-transforms-with-webdataset-pt-2) we created an AIStore ETL process in our cluster to transform these shards on the server side.

Now that we have the data and the transform function in place, we can use the `AISSourceLister` Iterable DataPipe to retrieve the URLs to the data we want in AIStore, with the ETL transform applied:

1. Define the sources in AIS. In this case, we'll just use every object inside the `images` bucket created in part 1.
2. Apply an existing ETL if desired. When the provided sources are read, the ETL will be applied to the objects inline.
3. Since WebDataset expects each source to be a dictionary, apply a simple function to transform each entry into one.
4. Finally, shuffle the sources to avoid any bias from the order of the shards.

```python
sources = AISSourceLister(ais_sources=[bucket], etl_name=etl_name).map(lambda source_url: {"url": source_url})\
    .shuffle()
```

We can now initialize `wds.WebDataset` with our DataPipe of dictionaries pointing to the object URLs.
WebDataset will then handle fetching the objects and interpreting each individual record inside the object tars.

```python
dataset = wds.WebDataset(sources)
```

After this step, we have an iterable dataset over the individual samples and can use any of the WebDataset built-in functions to modify them (see [the WebDataset README](https://github.com/webdataset/webdataset)).
Here we'll shuffle the samples in each shard, decode the image files to tensors, and convert to batches of tuples.
Since we expect to use multiple dataset workers, each operating in its own subprocess, we batch the samples here to reduce the overhead of unpacking individual samples in the main process.

Full `WebDataset` creation code:

```python
def create_dataset() -> wds.WebDataset:
    bucket = client.bucket(bucket_name)
    # Get a list of URLs for each object in AIS, with ETL applied, converted to the format WebDataset expects
    sources = AISSourceLister(ais_sources=[bucket], etl_name=etl_name).map(lambda source_url: {"url": source_url})\
        .shuffle()
    # Load the shuffled list of transformed shards into the WebDataset pipeline
    dataset = wds.WebDataset(sources)
    # Shuffle samples and apply the built-in webdataset decoder for image files
    dataset = dataset.shuffle(size=1000).decode("torchrgb")
    # Return an iterator over samples as tuples in batches
    return dataset.to_tuple("cls", "image.jpg", "trimap.png").batched(16)
```

![WebDataset](/assets/aisio_inline_wdataset/WebDataset.jpg)
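As a quick, illustrative sanity check (not part of the original example), we can pull a single worker-side batch directly from the dataset before wiring up a loader. This assumes the cluster, bucket, and ETL from the earlier parts are up and reachable:

```python
# Illustrative only -- assumes the AIS cluster, "images" bucket, and ETL are running
dataset = create_dataset()
classes, images, trimaps = next(iter(dataset))
# Each dataset-level batch holds 16 samples; the first dimension is the batch size,
# e.g. images.shape -> torch.Size([16, 3, H, W]) for decoded RGB tensors
print(classes.shape, images.shape, trimaps.shape)
```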
---

## Creating a DataLoader

With a dataset defined, we can now use the webdataset `WebLoader` class, introduced above, to manage retrieving data from the dataset.
`WebLoader` allows us to parallelize loading from the dataset using multiple workers, then shuffle and batch the samples for training.
Here we create a `WebLoader` with 4 dataset workers.
Each of those workers will execute the process defined in the dataset: fetch an object from AIS, parse it into samples, and return a batch.

Note that `batch_size` in the `WebLoader` below is set to `None` because each dataset worker does its own batching, yielding batches of 16 samples.
We can then use the WebLoader `FluidInterface` functionality to first *unbatch* the minibatches of samples from each of our dataset workers, shuffle across a defined number of samples (`1000` in the example below), and then *rebatch* into batches of the size we actually want to provide to the model.

```python
def create_dataloader(dataset):
    loader = wds.WebLoader(dataset, num_workers=4, batch_size=None)
    return loader.unbatched().shuffle(1000).batched(64)
```

A simplified version of this pipeline with 3 dataset workers, a dataset batch size of 3, and a final batch size of 4 would look like this:

![WebLoader](/assets/aisio_inline_wdataset/WebLoader.jpg)

---

## Results

Finally, we can inspect the results generated by the DataLoader, ready for model training.
Note that none of the pipeline actually runs until the DataLoader requests the next batch of samples.

```python
def view_data(dataloader):
    # Get the first batch
    batch = next(iter(dataloader))
    classes, images, trimaps = batch
    # The result is a set of tensors with the first dimension being the batch size
    print(classes.shape, images.shape, trimaps.shape)
    # View the first images in the first batch
    show_image_tensor(images[0])
    show_image_tensor(trimaps[0])
```

## References

1. GitHub:
    - [AIStore](https://github.com/NVIDIA/aistore)
    - [WebDataset Library](https://github.com/webdataset/webdataset)
2. Documentation, blogs, videos:
    - https://aiatscale.org
    - https://github.com/NVIDIA/aistore/tree/main/docs
    - PyTorch intro to Datasets and DataLoaders: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html
    - Discussion of Datasets, DataPipes, and DataLoaders: https://sebastianraschka.com/blog/2022/datapipes.html
3. Full code example:
    - [PyTorch Pipelines With WebDataset Example](/python/examples/aisio-pytorch/pytorch_webdataset.py)
4. Dataset:
    - [The Oxford-IIIT Pet Dataset](https://www.robots.ox.ac.uk/~vgg/data/pets/)