
---
layout: post
title:  "AIStore with WebDataset Part 2 -- Transforming WebDataset Shards in AIS"
date:   May 11, 2023
author: Aaron Wilson
categories: aistore etl pytorch python webdataset
---

In the [previous post](https://aiatscale.org/blog/2023/05/05/aisio-transforms-with-webdataset-pt-1), we converted a dataset to the WebDataset format and stored it in a bucket in AIStore.

This post demonstrates AIStore's ability to efficiently apply custom transformations to that dataset on the storage cluster. We'll do this using [AIS ETL](https://github.com/NVIDIA/aistore/blob/main/docs/etl.md).

All code used below can be found here: [WebDataset ETL Example](https://github.com/NVIDIA/aistore/blob/main/docs/examples/aisio_webdataset/etl_webdataset.py)

---
## Motivation for AIS ETL

AIS ETL takes advantage of the processing power available on the storage cluster: even an otherwise heavily utilized cluster will typically be idle, CPU-wise, over 90% of the time (depending, of course, on the specific workloads).
Performing transformations close to the data maximizes efficiency and, depending on the transforms applied, can reduce network traffic.
This makes it a much better option than pulling all the required data and then doing both preprocessing and training on the same system.

In this demo, the ETL on AIStore is relatively lightweight, but offloading pre-training computation to the AIS cluster becomes far more valuable with more intensive transforms.
For more advanced workloads, such as audio or video transcoding or other computer-vision tasks, GPU-accelerated transformations may be desired. While beyond the scope of this article, such a setup can be achieved with the right hardware (e.g. [NVIDIA DGX](https://www.nvidia.com/en-us/data-center/dgx-platform/)), containers with GPU access ([NVIDIA Container Toolkit](https://github.com/NVIDIA/nvidia-docker)), and the AIS [init_spec option](https://github.com/NVIDIA/aistore/blob/main/docs/etl.md#init-spec-request).

---
## WebDataset-compatible Transforms

We start with a bucket in AIStore where each object is a shard containing multiple samples (the output of [part 1](https://aiatscale.org/blog/2023/05/05/aisio-transforms-with-webdataset-pt-1)). For our ETL to operate on these objects, we need a function that can parse a WebDataset-formatted shard and transform each sample inside it.

Below is a diagram and some simple example code for an ETL that parses these tar files and transforms every image file inside (without creating any residual files). Since each object in AIS is a shard of multiple samples, the first step is to load it from its URL as a WebDataset object. From there, the WebDataset library makes it easy to iterate over each sample, transform individual components, and write out the result as a complete transformed shard.

![WebDataset ETL](/assets/aisio_inline_wdataset/wd_etl.jpg)

When using `init_code` as shown, all helper functions must be defined inside the `wd_etl` function, since that is the code packaged and sent to run in the ETL container on the cluster. For more flexibility in defining the ETL container, check out the `init_spec` option.
```python
# Imports used by the transform; these packages are installed in the
# ETL container via the `dependencies` option of `init_code`.
import io

import torchvision
import webdataset as wds
from PIL import Image


def wd_etl(object_url):
    def img_to_bytes(img):
        buf = io.BytesIO()
        img = img.convert("RGB")
        img.save(buf, format="JPEG")
        return buf.getvalue()

    def process_trimap(trimap_bytes):
        image = Image.open(io.BytesIO(trimap_bytes))
        preprocessing = torchvision.transforms.Compose(
            [
                torchvision.transforms.CenterCrop(350),
                torchvision.transforms.Lambda(img_to_bytes),
            ]
        )
        return preprocessing(image)

    def process_image(image_bytes):
        image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
        preprocessing = torchvision.transforms.Compose(
            [
                torchvision.transforms.CenterCrop(350),
                torchvision.transforms.ToTensor(),
                # Means and stds from ImageNet
                torchvision.transforms.Normalize(
                    mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
                ),
                torchvision.transforms.ToPILImage(),
                torchvision.transforms.Lambda(img_to_bytes),
            ]
        )
        return preprocessing(image)

    # Initialize a WebDataset object from the internal object URL in AIS
    dataset = wds.WebDataset(object_url)
    # Map the files of each individual sample to the appropriate processing function
    processed_shard = dataset.map_dict(**{"image.jpg": process_image, "trimap.png": process_trimap})

    # Write the output to a memory buffer and return the value
    buffer = io.BytesIO()
    with wds.TarWriter(fileobj=buffer) as dst:
        for sample in processed_shard:
            dst.write(sample)
    return buffer.getvalue()
```
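To make the shard-in/shard-out contract above concrete without the WebDataset and torchvision dependencies, here is a minimal, self-contained sketch using only Python's standard `tarfile` module: read a tar shard from memory, transform each member's bytes, and write a new tar shard back to a buffer. The helper name `transform_shard` and the toy payloads are illustrative only; the actual ETL should use the WebDataset API as shown above.

```python
import io
import tarfile

def transform_shard(shard_bytes: bytes, transform) -> bytes:
    """Read a tar shard from memory, apply `transform` to each member's
    bytes, and return a new tar shard -- no intermediate files."""
    out_buf = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(shard_bytes)) as src, \
         tarfile.open(fileobj=out_buf, mode="w") as dst:
        for member in src.getmembers():
            data = transform(src.extractfile(member).read())
            info = tarfile.TarInfo(name=member.name)
            info.size = len(data)
            dst.addfile(info, io.BytesIO(data))
    return out_buf.getvalue()

# Build a tiny one-sample shard in memory to run the sketch against
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tw:
    for name, payload in [("sample-0.image.jpg", b"abc"), ("sample-0.trimap.png", b"xyz")]:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tw.addfile(info, io.BytesIO(payload))

result = transform_shard(buf.getvalue(), bytes.upper)
with tarfile.open(fileobj=io.BytesIO(result)) as tr:
    print({m.name: tr.extractfile(m).read() for m in tr.getmembers()})
    # {'sample-0.image.jpg': b'ABC', 'sample-0.trimap.png': b'XYZ'}
```

The same pattern -- bytes in, transformed tar bytes out, everything in memory -- is what `wd_etl` implements with `wds.WebDataset` and `wds.TarWriter`.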

Now we need to initialize the above function as an ETL process on the AIS cluster.

Notice that the ETL defined above takes an object URL as input, so we will use the `hpull` communication type along with the `transform_url` option. This allows us to initialize WebDataset (using its own reader) with the internal URL for the object. If `transform_url` is not set to `True`, the ETL defaults to sending the object bytes as the argument to the user-provided transform function.

Since the WebDataset and PyTorch libraries each try to import the other, we need to use the `preimported_modules` option to import one of them before running the transform function:
```python
def create_wd_etl(client):
    client.etl(etl_name).init_code(
        transform=wd_etl,
        preimported_modules=["torch"],
        dependencies=["webdataset", "pillow", "torch", "torchvision"],
        communication_type="hpull",
        transform_url=True
    )
```
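The `transform_url` behavior described above can be sketched as a simple dispatch. This is an illustrative model, not actual AIS code: `run_transform` and the fake in-memory fetcher stand in for the ETL web server running inside the container.

```python
def run_transform(transform, object_url, fetch_object, transform_url=False):
    """Illustrative dispatch: with transform_url=True the user function
    receives the object's URL; otherwise it receives the object's bytes."""
    if transform_url:
        return transform(object_url)            # e.g. wd_etl(object_url)
    return transform(fetch_object(object_url))  # default: bytes in, bytes out

# Fake in-memory "cluster" standing in for the target's HTTP endpoint
fake_store = {"http://target:8080/samples-00.tar": b"tar bytes..."}

# Default: the transform sees raw object bytes
print(run_transform(lambda data: len(data), "http://target:8080/samples-00.tar",
                    fake_store.__getitem__))                      # 12
# transform_url=True: the transform sees the URL itself
print(run_transform(lambda url: url.rsplit("/", 1)[-1], "http://target:8080/samples-00.tar",
                    fake_store.__getitem__, transform_url=True))  # samples-00.tar
```

This is why `transform_url=True` pairs with `hpull`: the transform needs a URL it can read from, rather than pre-fetched bytes.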

## Transforming Objects

With the ETL created, we can perform either an inline transformation, streaming the transformed contents of a single object, or an offline transformation, processing objects in bulk within the cluster. The diagrams below are a simplified version of the full process (for more info on the inner workings, see [here](https://storagetarget.com/2021/04/02/integrated-storage-stack-for-training-inference-and-transformations/)). The primary difference between the two approaches is simple: an inline transformation runs as part of the initial GET request, while an offline transformation stores the results for later requests.

### Inline (single object):
```python
single_object = client.bucket(bucket_name).object("samples-00.tar")
# Get object contents with ETL applied
processed_shard = single_object.get(etl_name=etl_name).read_all()
```

![Inline Transform](/assets/aisio_inline_wdataset/inline_etl_sequence.jpg)

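The `processed_shard` returned by the inline GET is simply the bytes of a transformed tar archive. WebDataset's convention is that files sharing a basename belong to the same sample, with everything after the first dot acting as the component key. Below is a minimal, dependency-free sketch of that grouping using the standard `tarfile` module; in practice `wds.WebDataset` performs this grouping for you, and the payloads here are toy stand-ins for real shard contents.

```python
import io
import tarfile
from collections import defaultdict

def group_samples(shard_bytes: bytes) -> dict:
    """Group tar members into samples keyed by basename, one entry per
    extension -- a simplified version of WebDataset's sample grouping."""
    samples = defaultdict(dict)
    with tarfile.open(fileobj=io.BytesIO(shard_bytes)) as tf:
        for member in tf.getmembers():
            key, _, ext = member.name.partition(".")
            samples[key][ext] = tf.extractfile(member).read()
    return dict(samples)

# Build a small shard in memory to stand in for the ETL output
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tw:
    for name, payload in [
        ("sample-0.image.jpg", b"img0"), ("sample-0.trimap.png", b"map0"),
        ("sample-1.image.jpg", b"img1"), ("sample-1.trimap.png", b"map1"),
    ]:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tw.addfile(info, io.BytesIO(payload))

samples = group_samples(buf.getvalue())
print(sorted(samples))                   # ['sample-0', 'sample-1']
print(samples["sample-1"]["image.jpg"])  # b'img1'
```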
### Offline (bucket-to-bucket):
```python
dest_bucket = client.bucket("processed-samples").create(exist_ok=True)
# Transform the entire bucket, placing the output in the destination bucket
transform_job = client.bucket(bucket_name).transform(to_bck=dest_bucket, etl_name=etl_name)
client.job(transform_job).wait(verbose=True)
processed_shard = dest_bucket.object("samples-00.tar").get().read_all()
```

![Offline Transform](/assets/aisio_inline_wdataset/offline_etl_sequence.jpg)

---
## Conclusion

This post demonstrated how to use WebDataset and AIS ETL to run custom transformations on the cluster, close to the data.
However, both offline and inline transformations have drawbacks when training on a very large dataset.
Offline transformations require far more storage than necessary, since the entire transformed output must be stored at once.
Individual results from inline transformations can be inefficient to compute and difficult to shuffle and batch.
[PyTorch DataLoaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) can help overcome this.
In the next post, we'll show how to put ETL to use when training on a dataset by performing inline transformations on each object with a custom PyTorch DataLoader.

---
## References

1. GitHub:
    - [AIStore](https://github.com/NVIDIA/aistore)
    - [Local Kubernetes Deployment](https://github.com/NVIDIA/aistore/blob/main/deploy/dev/k8s/README.md)
    - [AIS/Kubernetes Operator, AIS on bare-metal, Deployment Playbooks, Helm](https://github.com/NVIDIA/ais-k8s)
    - [AIS-ETL containers and specs](https://github.com/NVIDIA/ais-etl)
    - [WebDataset Library](https://github.com/webdataset/webdataset)
2. Documentation, blogs, videos:
    - https://aiatscale.org
    - https://github.com/NVIDIA/aistore/tree/main/docs
    - [AIStore with WebDataset Part 1 -- Storing WebDataset format in AIS](https://aiatscale.org/blog/2023/05/05/aisio-transforms-with-webdataset-pt-1)
3. Full code example
    - [WebDataset ETL Example](https://github.com/NVIDIA/aistore/blob/main/docs/examples/aisio_webdataset/etl_webdataset.py)
4. Dataset
    - [The Oxford-IIIT Pet Dataset](https://www.robots.ox.ac.uk/~vgg/data/pets/)
5. PyTorch
    - [DataLoaders](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html)