---
layout: post
title: ETL WEBDATASET
permalink: /tutorials/etl/etl-webdataset
redirect_from:
 - /tutorials/etl/etl_webdataset.md/
 - /docs/tutorials/etl/etl_webdataset.md/
---

# WebDataset ImageNet preprocessing with ETL

In this example, we will see how to use ETL to preprocess the images of ImageNet with [WebDataset](https://github.com/webdataset/webdataset), a PyTorch Dataset implementation that provides efficient access to datasets stored in POSIX tar archives.

`Note: ETL is still in development, so some steps may not work exactly as written below.`

## Overview

This tutorial consists of the following steps:
1. Prepare the AIStore cluster.
2. Prepare the dataset.
3. Prepare the WebDataset-based transform code (ETL).
4. Transform the dataset online on the AIStore cluster with ETL.

## Prerequisites

* AIStore cluster deployed on Kubernetes. We recommend following one of the guides below.
  * [Deploy AIStore on a local Kubernetes cluster](https://github.com/NVIDIA/ais-k8s/blob/master/operator/README.md)
  * [Deploy AIStore on the cloud](https://github.com/NVIDIA/ais-k8s/blob/master/terraform/README.md)

## Prepare dataset

Before we start writing code, let's put an example tarball with ImageNet images into AIStore.
The tarball we will be using is [imagenet-train-000999.tar](https://storage.googleapis.com/nvdata-imagenet/imagenet-train-000999.tar), which is already in a WebDataset-friendly format.

```console
$ tar -tvf imagenet-train-000999.tar | head -n 5
-r--r--r-- bigdata/bigdata      3 2020-06-25 11:11 0196495.cls
-r--r--r-- bigdata/bigdata 109671 2020-06-25 11:11 0196495.jpg
-r--r--r-- bigdata/bigdata      3 2020-06-25 11:11 0877232.cls
-r--r--r-- bigdata/bigdata 104484 2020-06-25 11:11 0877232.jpg
-r--r--r-- bigdata/bigdata      3 2020-06-25 11:11 0600062.cls
$ ais create ais://imagenet
"ais://imagenet" bucket created
$ ais put imagenet-train-000999.tar ais://imagenet
PUT "imagenet-train-000999.tar" into bucket "ais://imagenet"
```
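
Since the shard is already in WebDataset format, you can peek at it locally before involving the cluster. Below is a minimal sketch, assuming the `webdataset` package is installed locally (field decoding details may vary slightly between webdataset versions):

```python
# quick local inspection of the shard; assumes `pip install webdataset`
import webdataset as wds

# decode("pil") turns .jpg entries into PIL images; .cls entries decode to ints
dataset = wds.WebDataset("imagenet-train-000999.tar").decode("pil")

for i, sample in enumerate(dataset):
    print(sample["__key__"], sample["jpg"].size, sample["cls"])
    if i == 2:  # look at the first few samples only
        break
```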

## Prepare ETL code

Now that the ImageNet shard is in place, we need the ETL code that will do the transformation.
Here we will use the `io://` communicator type with the `python3.11v2` runtime to install the `torchvision` and `webdataset` packages.
With `io://`, the transformer reads each object's bytes from standard input and writes the transformed bytes to standard output.

Our transform code will look like this (`code.py`):
```python
# -*- Python -*-

# Perform imagenet-style augmentation and normalization on the shards
# on stdin, returning a new dataset on stdout.

import sys
from torchvision import transforms
import webdataset as wds

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

augment = transforms.Compose(
    [
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        normalize,
    ]
)

# "-" tells WebDataset to read the input shard from stdin;
# decode("pil") turns the .jpg entries into PIL images.
dataset = wds.WebDataset("-").decode("pil")

# "-" tells TarWriter to stream the output shard to stdout.
sink = wds.TarWriter("-")
for sample in dataset:
    print(sample.get("__key__"), file=sys.stderr)  # progress logging
    # replace the .jpg entry with an augmented float16 array, stored as .npy
    sample["npy"] = augment(sample.pop("jpg")).numpy().astype("float16")
    sink.write(sample)
sink.close()  # flush the tar end-of-archive blocks
```

The idea here is that we unpack the tarball, process each image, and save it as a numpy array in the transformed output tarball.

To make sure the code runs, we need to specify the required dependencies (`deps.txt`):
```
git+https://github.com/tmbdev/webdataset.git
torchvision==0.15.1
PyYAML==5.4.1
```
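
Because `io://` is just a stdin-to-stdout contract, the transform can also be dry-run locally before shipping it to the cluster. A sketch, assuming the dependencies from `deps.txt` are installed in your local Python environment:

```console
$ pip install -r deps.txt
$ python3 code.py < imagenet-train-000999.tar > preprocessed-local.tar
$ tar -tf preprocessed-local.tar | head -n 2
```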

## Transform dataset

Now we can build the ETL:
```console
$ ais etl init code --from-file=code.py --deps-file=deps.txt --runtime=python3.11v2 --comm-type="io://"
f90r81wR0
$ ais etl object f90r81wR0 ais://imagenet/imagenet-train-000999.tar preprocessed-train.tar
$ tar -tvf preprocessed-train.tar | head -n 6
-r--r--r-- bigdata/bigdata      3 2021-07-20 23:52 0196495.cls
-r--r--r-- bigdata/bigdata 301184 2021-07-20 23:52 0196495.npy
-r--r--r-- bigdata/bigdata      3 2021-07-20 23:52 0877232.cls
-r--r--r-- bigdata/bigdata 301184 2021-07-20 23:52 0877232.npy
-r--r--r-- bigdata/bigdata      3 2021-07-20 23:52 0600062.cls
-r--r--r-- bigdata/bigdata 301184 2021-07-20 23:52 0600062.npy
```

As expected, the new tarball contains the transformed images stored as numpy arrays (`.npy`), each occupying `301184` bytes: a `3×224×224` `float16` array (`301056` bytes) plus the `.npy` header.
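
To double-check, the transformed shard can be read back with WebDataset as well. A minimal sketch, assuming `webdataset` and `numpy` are installed and the default decoder handles `.npy` entries (true for recent webdataset versions):

```python
# read back the transformed shard and verify the array layout
import webdataset as wds

dataset = wds.WebDataset("preprocessed-train.tar").decode()

sample = next(iter(dataset))
arr = sample["npy"]
# ToTensor() produces CHW layout, hence (3, 224, 224); float16 from astype() above
print(sample["__key__"], arr.shape, arr.dtype)
```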