---
layout: post
title: "AIStore with WebDataset Part 2 -- Transforming WebDataset Shards in AIS"
date: May 11, 2023
author: Aaron Wilson
categories: aistore etl pytorch python webdataset
---

In the [previous post](https://aiatscale.org/blog/2023/05/05/aisio-transforms-with-webdataset-pt-1) we converted a dataset to the WebDataset format and stored it in a bucket in AIStore.

This post will demonstrate AIStore's ability to efficiently apply custom transformations to the dataset on the storage cluster. We'll do this using [AIS ETL](https://github.com/NVIDIA/aistore/blob/main/docs/etl.md).

All code used below can be found here: [WebDataset ETL Example](https://github.com/NVIDIA/aistore/blob/main/docs/examples/aisio_webdataset/etl_webdataset.py)

---
## Motivation for AIS ETL

AIS ETL takes advantage of processing power available on the storage cluster, as even an otherwise heavily utilized cluster will be idle, CPU-wise, over 90% of the time (depending, of course, on the specific workloads).
Performing transformations close to the data maximizes efficiency and, depending on the transforms applied, can reduce the amount of network traffic.
This makes it a much better option than pulling all the required data for training and doing both preprocessing and training on the same system.

In this demo, the ETL on AIStore is relatively lightweight, but offloading the pre-training computation to the AIS cluster could be much more important with a more intensive transform.
For more advanced workloads such as audio or video transcoding or other computer-vision tasks, GPU-accelerated transformations may be desired. While it is beyond the scope of this article, such a setup can be achieved with the right hardware (e.g.
[Nvidia DGX](https://www.nvidia.com/en-us/data-center/dgx-platform/)), containers with GPU access ([NVIDIA Container Toolkit](https://github.com/NVIDIA/nvidia-docker)), and the AIS [init_spec option](https://github.com/NVIDIA/aistore/blob/main/docs/etl.md#init-spec-request).

---
## WebDataset-compatible Transforms

We start with a bucket in AIStore filled with objects, where each object is a shard of multiple samples (the output of [part 1](https://aiatscale.org/blog/2023/05/05/aisio-transforms-with-webdataset-pt-1)). For our ETL to be able to operate on these objects, we need to write a function that can parse this WebDataset-formatted shard and perform the transform on each sample inside.

Below is a diagram and some simple example code for an ETL that parses these tar files and transforms all image files found inside (without creating any residual files). Since each object in AIS is a shard of multiple records, the first step is to load it from the URL as a WebDataset object. With this done, the WebDataset library makes it easy to iterate over each record, transform individual components, and then write out the result as a complete transformed shard.

![WebDataset ETL](/assets/aisio_inline_wdataset/wd_etl.jpg)

All functions must be included in the `wd_etl` function when using `init_code` as shown, since that is the code that is packaged and sent to run on the cluster in the ETL container. For more flexibility when defining the ETL container, check out the `init_spec` option.
```python
def wd_etl(object_url):
    def img_to_bytes(img):
        buf = io.BytesIO()
        img = img.convert("RGB")
        img.save(buf, format="JPEG")
        return buf.getvalue()

    def process_trimap(trimap_bytes):
        image = Image.open(io.BytesIO(trimap_bytes))
        preprocessing = torchvision.transforms.Compose(
            [
                torchvision.transforms.CenterCrop(350),
                torchvision.transforms.Lambda(img_to_bytes),
            ]
        )
        return preprocessing(image)

    def process_image(image_bytes):
        image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
        preprocessing = torchvision.transforms.Compose(
            [
                torchvision.transforms.CenterCrop(350),
                torchvision.transforms.ToTensor(),
                # Means and stds from ImageNet
                torchvision.transforms.Normalize(
                    mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
                ),
                torchvision.transforms.ToPILImage(),
                torchvision.transforms.Lambda(img_to_bytes),
            ]
        )
        return preprocessing(image)

    # Initialize a WebDataset object from the internal object URL in AIS
    dataset = wds.WebDataset(object_url)
    # Map the files for each individual sample to the appropriate processing function
    processed_shard = dataset.map_dict(**{"image.jpg": process_image, "trimap.png": process_trimap})

    # Write the output to a memory buffer and return the value
    buffer = io.BytesIO()
    with wds.TarWriter(fileobj=buffer) as dst:
        for sample in processed_shard:
            dst.write(sample)
    return buffer.getvalue()
```

Now we must initialize the above function as an ETL process on the AIS cluster.

Notice the ETL defined above takes an object URL as input, so in this case we need to use the `hpull` communication type along with the `transform_url` option. This allows us to initialize WebDataset (using its own reader) with the internal URL for the object. If `transform_url` is not set to `True`, the ETL will default to sending the object bytes as the argument to the user-provided transform function.
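For contrast, when `transform_url` is left unset, the user function receives the raw object bytes and must return the transformed bytes. Below is a minimal standard-library sketch of that bytes-in/bytes-out shape; the `upper_txt_etl` name and the uppercasing transform are hypothetical, chosen only to illustrate the signature:

```python
import io
import tarfile


def upper_txt_etl(object_bytes: bytes) -> bytes:
    # Bytes-in/bytes-out transform: the whole object (a tar shard)
    # arrives as bytes, and the returned bytes replace it.
    out_buf = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(object_bytes)) as src, \
            tarfile.open(fileobj=out_buf, mode="w") as dst:
        for member in src:
            data = src.extractfile(member).read()
            if member.name.endswith(".txt"):
                data = data.upper()  # the per-file transform goes here
            info = tarfile.TarInfo(name=member.name)
            info.size = len(data)
            dst.addfile(info, io.BytesIO(data))
    return out_buf.getvalue()
```

For the WebDataset case above, the URL-based variant is preferable because WebDataset can stream the shard with its own reader rather than first materializing the whole object in memory.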
Since both the WebDataset and PyTorch libraries try to import each other, we need to use the `preimported_modules` option to first import one before running the transform function:

```python
def create_wd_etl(client):
    client.etl(etl_name).init_code(
        transform=wd_etl,
        preimported_modules=["torch"],
        dependencies=["webdataset", "pillow", "torch", "torchvision"],
        communication_type="hpull",
        transform_url=True,
    )
```

## Transforming Objects

With the ETL created, we can use it to perform either an inline transformation to stream the results of a single object or an offline transformation to process objects within the cluster. The diagrams below are a simplified version of the full process (for more info on the inner workings, see [here](https://storagetarget.com/2021/04/02/integrated-storage-stack-for-training-inference-and-transformations/)). The primary difference between the two approaches is simple: inline performs the transformation as part of the initial GET request, while offline stores the results for a later request.
### Inline (single object):
```python
single_object = client.bucket(bucket_name).object("samples-00.tar")
# Get object contents with ETL applied
processed_shard = single_object.get(etl_name=etl_name).read_all()
```

![Inline Transform](/assets/aisio_inline_wdataset/inline_etl_sequence.jpg)

### Offline (bucket-to-bucket):
```python
dest_bucket = client.bucket("processed-samples").create(exist_ok=True)
# Transform the entire bucket, placing the output in the destination bucket
transform_job = client.bucket(bucket_name).transform(to_bck=dest_bucket, etl_name=etl_name)
client.job(transform_job).wait(verbose=True)
processed_shard = dest_bucket.object("samples-00.tar").get().read_all()
```

![Offline Transform](/assets/aisio_inline_wdataset/offline_etl_sequence.jpg)

---
## Conclusion

This post demonstrates how to use WebDataset and AIS ETL to run custom transformations on the cluster, close to the data.
However, both offline and inline transformations have major drawbacks if you have a very large dataset on which to train.
Offline transformations require far more storage than necessary, as the entire output must be stored at once.
Individual results from inline transformations can be inefficient to compute and difficult to shuffle and batch.
[PyTorch DataLoaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) can help overcome this.
In the next post, we'll show how to put ETL to use when training on a dataset by performing inline transformations on each object with a custom PyTorch DataLoader.

---
## References

1. GitHub:
   - [AIStore](https://github.com/NVIDIA/aistore)
   - [Local Kubernetes Deployment](https://github.com/NVIDIA/aistore/blob/main/deploy/dev/k8s/README.md)
   - [AIS/Kubernetes Operator, AIS on bare-metal, Deployment Playbooks, Helm](https://github.com/NVIDIA/ais-k8s)
   - [AIS-ETL containers and specs](https://github.com/NVIDIA/ais-etl)
   - [WebDataset Library](https://github.com/webdataset/webdataset)
2. Documentation, blogs, videos:
   - https://aiatscale.org
   - https://github.com/NVIDIA/aistore/tree/main/docs
   - [AIStore with WebDataset Part 1 -- Storing WebDataset format in AIS](https://aiatscale.org/blog/2023/05/05/aisio-transforms-with-webdataset-pt-1)
3. Full code example
   - [WebDataset ETL Example](https://github.com/NVIDIA/aistore/blob/main/docs/examples/aisio_webdataset/etl_webdataset.py)
4. Dataset
   - [The Oxford-IIIT Pet Dataset](https://www.robots.ox.ac.uk/~vgg/data/pets/)
5. PyTorch
   - [DataLoaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html)