---
layout: post
title:  "AIStore with WebDataset Part 1 -- Storing WebDataset format in AIS"
date:   May 05, 2023
authors: Aaron Wilson
categories: aistore etl pytorch python webdataset
---
## Motivation - High-Performance AI Storage

Training AI models is expensive, so it's important to keep GPUs fed with all the data they need as fast as they can consume it. WebDataset and AIStore each address different parts of this problem:

- [WebDataset](https://github.com/webdataset/webdataset) is a convenient format and set of related tools for storing data consisting of multiple related files (e.g. an image and a class). It allows for vastly faster I/O by enforcing sequential reads and writes of related data.

- [AIStore](https://github.com/NVIDIA/aistore) provides fast, scalable storage along with powerful features (including co-located ETL) to enhance AI workloads.

The obvious question for developers is how to combine WebDataset with AIStore to create a high-performance data pipeline and get the most out of both toolsets.

In this series of posts, we'll show how to effectively pair WebDataset with AIStore and how to use AIStore's inline object transformation and PyTorch DataPipelines to prepare a WebDataset for model training. The series will consist of 3 posts demonstrating the following tasks:

1. How to convert a generic dataset to the WebDataset format and store it in AIS
2. How to create and apply an ETL for on-the-fly sample transformations
3. How to set up a PyTorch DataPipeline

For background, it may be useful to view the [previous post](https://aiatscale.org/blog/2023/04/03/transform-images-with-python-sdk) demonstrating basic ETL operations using the AIStore Python SDK.

All code used below can be found here: [Inline WebDataset Transform Example](https://github.com/NVIDIA/aistore/blob/main/docs/examples/aisio_webdataset/load_webdataset_example.py)

---
## Software

For this example we will be using:

- Python 3.10
- [WebDataset Python Library v0.2.48](https://pypi.org/project/webdataset/0.2.48/)
- [AIStore Python SDK v1.2.2](https://pypi.org/project/aistore/)
- [AIStore Cluster v3.17](https://github.com/NVIDIA/aistore) -- Running in Kubernetes (see [here](https://github.com/NVIDIA/aistore/blob/main/deploy/dev/k8s/README.md) for minikube deployment or [here](https://github.com/NVIDIA/aistore/blob/main/docs/getting_started.md#kubernetes-deployments) for more advanced options)

---
## The Dataset

As in the [previous ETL example](https://aiatscale.org/blog/2023/04/03/transform-images-with-python-sdk), we will use the `Oxford-IIIT Pet Dataset`: [https://www.robots.ox.ac.uk/~vgg/data/pets/](https://www.robots.ox.ac.uk/~vgg/data/pets/).

This dataset is notably NOT in WebDataset format -- the class info, metadata, and images are all separated.

Below is a structured view of the dataset. `list.txt` contains a mapping from each filename to the class and breed info. The `.png` files in this dataset are trimap metadata files -- special image files where each pixel is marked as either foreground, background, or unlabeled.

```
.
├── annotations
│   ├── list.txt
│   ├── README
│   ├── test.txt
│   ├── trainval.txt
│   ├── trimaps
│   │   └── PetName_SampleNumber.png
│   └── xmls
└── images
    └── PetName_SampleNumber.jpg
```
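To make the `list.txt` mapping concrete, here is a minimal, hypothetical parser for its class records. The field names follow the dataset's README (`Image CLASS-ID SPECIES BREED-ID`); this helper is not part of the linked example code:

```python
# Sketch: build a filename -> class-id mapping from annotations/list.txt.
# Each non-comment line has the form "<Image> <CLASS-ID> <SPECIES> <BREED-ID>".

def parse_pet_classes(lines):
    """Map each image base name to its integer class id."""
    classes = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # skip header comments
            continue
        name, class_id, *_ = line.split()
        classes[name] = int(class_id)
    return classes

sample = [
    "#Image CLASS-ID SPECIES BREED-ID",
    "Abyssinian_100 1 1 1",
    "yorkshire_terrier_189 37 2 25",
]
print(parse_pet_classes(sample))
# -> {'Abyssinian_100': 1, 'yorkshire_terrier_189': 37}
```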
---
## Converting to WebDataset format

There are several ways you could [convert a dataset to the WebDataset format](https://github.com/webdataset/webdataset#creating-a-webdataset), but in this case we will do it via Python using the WebDataset `ShardWriter`. Notice the callback function provided to upload the results of each created shard to the AIS bucket. For more info on creating WebDatasets, [check out the video here](https://www.youtube.com/watch?v=v_PacO-3OGQ).

The full code is available [here](https://github.com/NVIDIA/aistore/blob/main/docs/examples/aisio_webdataset/load_webdataset_example.py), but the key points are shown below:

```python
import os

import webdataset as wds


def load_data(bucket, sample_generator):

    def upload_shard(filename):
        bucket.object(filename).put_file(filename)
        os.unlink(filename)

    # Writes each shard to disk as a tar, then uses the "post" callback
    # to upload it to AIS and delete the local copy
    with wds.ShardWriter("samples-%02d.tar", maxcount=400, post=upload_shard) as writer:
        for sample in sample_generator:
            writer.write(sample)
```

The `sample_generator` used above is simply a generator that iterates over the local dataset and yields individual samples in the format below:

```python
{
    "__key__": "sample_%04d" % index,
    "image.jpg": image_data,
    "cls": pet_class,
    "trimap.png": trimap_data
}
```

Note that in the code provided, the sample generator will ignore any records that are incomplete or missing files.
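As a sketch of what such a generator might look like (names and the input structure here are illustrative; see the linked example for the real implementation, which reads from the extracted dataset on disk):

```python
# Hypothetical sample generator: yields WebDataset-style dicts
# and skips incomplete records, as the full example does.

def generate_samples(records):
    """records: iterable of (image_bytes, trimap_bytes, pet_class) tuples."""
    for index, (image_data, trimap_data, pet_class) in enumerate(records):
        if not image_data or not trimap_data or pet_class is None:
            # Skip records that are incomplete or missing files
            continue
        yield {
            "__key__": "sample_%04d" % index,
            "image.jpg": image_data,
            "cls": pet_class,
            "trimap.png": trimap_data,
        }
```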

The conversion and resulting structure are shown in the image below, with multiple shards each containing samples grouped by the same base name:

![WebDataset Formatted Dataset in AIStore](/assets/aisio_inline_wdataset/dataset-conversion.jpg)

## dSort and Data Shuffling

The above code results in a series of shards in AIStore, but these shards follow the order of the original dataset and are not shuffled in any way -- likely not what we want for training.

The AIStore feature [dSort](https://aiatscale.org/docs/cli/dsort) can help. It supports shuffling and sorting shards of WebDataset samples, keeping the records (image.jpg, cls, and trimap.png) intact in the output shards.

Below is a visualization of the shuffling process:

![dSort Shuffle Visualization](/assets/aisio_inline_wdataset/dsort-shuffle.jpg)

We need to write a dSort spec file to define the job that will shuffle and re-shard our samples. The following spec loads from each of the existing shards, shuffles the records, and creates new shuffled shards of roughly 100MB each. An existing output bucket can also be defined with the `output_bck` key if desired (other options are listed [in the docs](https://aiatscale.org/docs/cli/dsort)).

```json
{
    "extension": ".tar",
    "bck": {"name": "images"},
    "input_format": "samples-{00..18}",
    "output_format": "shuffled-shard-%d",
    "output_shard_size": "100MB",
    "description": "Create new shuffled shards",
    "algorithm": {
        "kind": "shuffle"
    }
}
```
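If you prefer to stay in Python, the same spec can be generated with the standard library's `json` module. This is just a convenience for producing the spec file, not an SDK feature:

```python
import json

# Same dSort spec as above, expressed as a Python dict
dsort_spec = {
    "extension": ".tar",
    "bck": {"name": "images"},
    "input_format": "samples-{00..18}",
    "output_format": "shuffled-shard-%d",
    "output_shard_size": "100MB",
    "description": "Create new shuffled shards",
    "algorithm": {"kind": "shuffle"},
}

# Write it to the spec file passed to `ais start dsort -f ...`
with open("dsort_shuffle_samples.json", "w") as f:
    json.dump(dsort_spec, f, indent=4)
```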

Use the [AIS CLI](https://aiatscale.org/docs/cli) to start the dSort job (dSort is not yet supported in the AIS Python SDK):

```bash
ais start dsort -f dsort_shuffle_samples.json
```

Wait for the job to complete:

```bash
ais wait YourSortJobID
```

Now we can see the output shards as defined in the dSort job spec above, each containing a random subset of the data samples.

```bash
ais bucket ls ais://images --prefix shuffled
```

Or with the Python SDK:

```python
import os
from aistore import Client

client = Client(os.getenv("AIS_ENDPOINT"))
objects = client.bucket("images").list_all_objects(prefix="shuffled")
print([entry.name for entry in objects])
```

Output:

```
shuffled-shard-0.tar     102.36MiB
shuffled-shard-1.tar     102.44MiB
shuffled-shard-2.tar     102.40MiB
shuffled-shard-3.tar     102.45MiB
shuffled-shard-4.tar     102.36MiB
shuffled-shard-5.tar     102.40MiB
shuffled-shard-6.tar     102.49MiB
shuffled-shard-7.tar     74.84MiB
```

Finally, we have our result: WebDataset-formatted, shuffled shards stored in AIS and ready for use!

In future posts, we'll show how to run transformations on this data and load it for model training.

---
## References

1. GitHub:
    - [AIStore](https://github.com/NVIDIA/aistore)
    - [Local Kubernetes Deployment](https://github.com/NVIDIA/aistore/blob/main/deploy/dev/k8s/README.md)
    - [AIS/Kubernetes Operator, AIS on bare-metal, Deployment Playbooks, Helm](https://github.com/NVIDIA/ais-k8s)
    - [WebDataset Library](https://github.com/webdataset/webdataset)
2. Documentation, blogs, videos:
    - https://aiatscale.org
    - https://github.com/NVIDIA/aistore/tree/main/docs
3. Full code example:
    - [Inline WebDataset Transform Example](https://github.com/NVIDIA/aistore/blob/main/docs/examples/aisio_webdataset/load_webdataset_example.py)
4. Dataset:
    - [The Oxford-IIIT Pet Dataset](https://www.robots.ox.ac.uk/~vgg/data/pets/)