---
layout: post
title: "AIStore with WebDataset Part 1 -- Storing WebDataset format in AIS"
date: May 05, 2023
authors: Aaron Wilson
categories: aistore etl pytorch python webdataset
---
## Motivation - High-Performance AI Storage

Training AI models is expensive, so it's important to keep GPUs fed with all the data they need as fast as they can consume it. WebDataset and AIStore each address different parts of this problem individually:

- [WebDataset](https://github.com/webdataset/webdataset) is a convenient format and set of related tools for storing data consisting of multiple related files (e.g. an image and a class). It allows for vastly faster i/o by enforcing sequential reads and writes of related data.

- [AIStore](https://github.com/NVIDIA/aistore) provides fast, scalable storage along with powerful features (including co-located ETL) to enhance AI workloads.

The obvious question for developers is how to combine WebDataset with AIStore to create a high performance data pipeline and maximize the features of both toolsets.

In this series of posts, we'll show how to effectively pair WebDataset with AIStore and how to use AIStore's inline object transformation and PyTorch DataPipelines to prepare a WebDataset for model training. The series will consist of 3 posts demonstrating the following tasks:

1. How to convert a generic dataset to the WebDataset format and store in AIS
2. How to create and apply an ETL for on-the-fly sample transformations
3. How to set up a PyTorch DataPipeline

For background it may be useful to view the [previous post](https://aiatscale.org/blog/2023/04/03/transform-images-with-python-sdk) demonstrating basic ETL operations using the AIStore Python SDK.

All code used below can be found here: [Inline WebDataset Transform Example](https://github.com/NVIDIA/aistore/blob/main/docs/examples/aisio_webdataset/load_webdataset_example.py)

---
## Software

For this example we will be using:

- Python 3.10
- [WebDataset Python Library v0.2.48](https://pypi.org/project/webdataset/0.2.48/)
- [AIStore Python SDK v1.2.2](https://pypi.org/project/aistore/)
- [AIStore Cluster v3.17](https://github.com/NVIDIA/aistore) -- Running in Kubernetes (see [here](https://github.com/NVIDIA/aistore/blob/main/deploy/dev/k8s/README.md) for minikube deployment or [here](https://github.com/NVIDIA/aistore/blob/main/docs/getting_started.md#kubernetes-deployments) for more advanced options)

---
## The Dataset

As in the [previous ETL example](https://aiatscale.org/blog/2023/04/03/transform-images-with-python-sdk), we will use the `Oxford-IIIT Pet Dataset`: [https://www.robots.ox.ac.uk/~vgg/data/pets/](https://www.robots.ox.ac.uk/~vgg/data/pets/).

This dataset is notably NOT in WebDataset format -- the class info, metadata, and images are all separated.

Below is a structured view of the dataset. `list.txt` contains a mapping from each filename to the class and breed info. The `.png` files in this dataset are trimap metadata files -- special image files where each pixel is marked as either foreground, background, or unlabeled.

```
.
├── annotations
│   ├── list.txt
│   ├── README
│   ├── test.txt
│   ├── trainval.txt
│   ├── trimaps
│   │   └── PetName_SampleNumber.png
│   └── xmls
└── images
    └── PetName_SampleNumber.jpg
```
---
## Converting to WebDataset format

There are several ways you could [convert a dataset to the WebDataset format](https://github.com/webdataset/webdataset#creating-a-webdataset), but in this case we will do it via Python using the WebDataset `ShardWriter`.
Notice the callback function that uploads each completed shard to the AIS bucket and then deletes the local copy. For more info on creating WebDatasets [check out the video here](https://www.youtube.com/watch?v=v_PacO-3OGQ).

The full code is available [here](https://github.com/NVIDIA/aistore/blob/main/docs/examples/aisio_webdataset/load_webdataset_example.py) but the key points are shown below:

```python
import os

import webdataset as wds


def load_data(bucket, sample_generator):

    def upload_shard(filename):
        # Upload the finished shard to AIS, then remove the local tar file
        bucket.object(filename).put_file(filename)
        os.unlink(filename)

    # Writes data as tar to disk, uses callback function "post" to upload to AIS and delete
    with wds.ShardWriter("samples-%02d.tar", maxcount=400, post=upload_shard) as writer:
        for sample in sample_generator:
            writer.write(sample)
```

`sample_generator` used above is simply a generator that iterates over the local dataset and yields individual samples in the format below:

```python
{
    "__key__": "sample_%04d" % index,
    "image.jpg": image_data,
    "cls": pet_class,
    "trimap.png": trimap_data
}
```

Note that in the code provided, the sample generator will ignore any records that are incomplete or missing files.

The conversion and resulting structure are shown in the image below, with multiple shards each containing samples grouped by the same base name:

![WebDataset Formatted Dataset in AIStore](/assets/aisio_inline_wdataset/dataset-conversion.jpg)

## dSort and Data Shuffling

The above code results in a series of shards in AIStore, but these shards are in the order of the original dataset and are not shuffled in any way, which is likely not what we want for training.

The AIStore feature [dSort](https://aiatscale.org/docs/cli/dsort) can help. It supports shuffling and sorting shards of WebDataset samples, keeping the records (image.jpg, cls, and trimap.png) intact in the output shard.
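
To make "keeping the records intact" concrete, here is a minimal in-memory sketch of a record-level shuffle. It is only an illustration under stated assumptions, not how dSort is implemented: `shuffle_records` and `make_record` are hypothetical helpers, the byte strings are placeholders for real file contents, and shards are regrouped by record count rather than by size.

```python
import random


def shuffle_records(shards, records_per_shard, seed=None):
    """Shuffle whole records across shards without splitting any record."""
    # Flatten every shard into one list; each record stays a single dict,
    # so its image, class, and trimap always move together.
    records = [record for shard in shards for record in shard]
    random.Random(seed).shuffle(records)
    # Regroup the shuffled records into new shards of a fixed record count.
    return [
        records[i:i + records_per_shard]
        for i in range(0, len(records), records_per_shard)
    ]


def make_record(index):
    # Hypothetical stand-in for one WebDataset sample (placeholder bytes).
    return {
        "__key__": "sample_%04d" % index,
        "image.jpg": b"jpeg-bytes-%d" % index,
        "cls": b"%d" % (index % 37),
        "trimap.png": b"png-bytes-%d" % index,
    }


# Three shards in original dataset order, four records each.
ordered_shards = [
    [make_record(i) for i in range(start, start + 4)]
    for start in (0, 4, 8)
]
shuffled_shards = shuffle_records(ordered_shards, records_per_shard=4, seed=42)

for shard in shuffled_shards:
    print([record["__key__"] for record in shard])
```

Each printed list mixes keys from different original shards, while every record still carries all of its files. The real dSort job does the equivalent work inside the cluster, across storage targets, and sizes its output shards by bytes.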

Below is a visualization of the shuffling process:

![dSort Shuffle Visualization](/assets/aisio_inline_wdataset/dsort-shuffle.jpg)

We need to write a dSort spec file to define the job to shuffle and shard our samples. The following spec loads from each of the existing shards, shuffles the records, and creates new shuffled shards of roughly 100MB each (the final shard may be smaller). An existing output bucket can also be defined with the `output_bck` key if desired (other options are listed [in the docs](https://aiatscale.org/docs/cli/dsort)).

```json
{
    "extension": ".tar",
    "bck": {"name": "images"},
    "input_format": "samples-{00..18}",
    "output_format": "shuffled-shard-%d",
    "output_shard_size": "100MB",
    "description": "Create new shuffled shards",
    "algorithm": {
        "kind": "shuffle"
    }
}
```

Use the [AIS CLI](https://aiatscale.org/docs/cli) to start the dSort job (no support yet in the AIS Python SDK):

```bash
ais start dsort -f dsort_shuffle_samples.json
```

Wait for the job to complete, replacing `YourSortJobID` with the ID of your dSort job:

```bash
ais wait YourSortJobID
```

Now we can see the output shards as defined in the dSort job spec above, each containing a random set of the data samples.

```bash
ais bucket ls ais://images -prefix shuffled
```

Or with the Python SDK:

```python
import os
from aistore import Client

client = Client(os.getenv("AIS_ENDPOINT"))
objects = client.bucket("images").list_all_objects(prefix="shuffled")
print([entry.name for entry in objects])
```
Output of the CLI command:
```
shuffled-shard-0.tar     102.36MiB
shuffled-shard-1.tar     102.44MiB
shuffled-shard-2.tar     102.40MiB
shuffled-shard-3.tar     102.45MiB
shuffled-shard-4.tar     102.36MiB
shuffled-shard-5.tar     102.40MiB
shuffled-shard-6.tar     102.49MiB
shuffled-shard-7.tar     74.84MiB
```

Finally we have our result: WebDataset-formatted, shuffled shards stored in AIS and ready for use!

In future posts, we'll show how to run transformations on this data and load it for model training.

---
## References

1. GitHub:
    - [AIStore](https://github.com/NVIDIA/aistore)
    - [Local Kubernetes Deployment](https://github.com/NVIDIA/aistore/blob/main/deploy/dev/k8s/README.md)
    - [AIS/Kubernetes Operator, AIS on bare-metal, Deployment Playbooks, Helm](https://github.com/NVIDIA/ais-k8s)
    - [WebDataset Library](https://github.com/webdataset/webdataset)
2. Documentation, blogs, videos:
    - https://aiatscale.org
    - https://github.com/NVIDIA/aistore/tree/main/docs
3. Full code example
    - [Inline WebDataset Transform Example](https://github.com/NVIDIA/aistore/blob/main/docs/examples/aisio_webdataset/load_webdataset_example.py)
4. Dataset
    - [The Oxford-IIIT Pet Dataset](https://www.robots.ox.ac.uk/~vgg/data/pets/)