
---
layout: post
title:  "AIStore SDK & ETL: Transform an image dataset with AIS SDK and load into PyTorch"
date:   Apr 03, 2023
author: Aaron Wilson
categories: aistore etl pytorch python
---
     8  
With recent updates to the Python SDK, it's easier than ever to load data into AIS, transform it, and use it for training with PyTorch. In this post, we'll demonstrate how to do that with a small dataset of images.
    10  
In a previous series of posts, we transformed the ImageNet dataset using a mixture of CLI and SDK commands. For background, you can view those posts below, but note that much of their syntax is now out of date:

* [AIStore & ETL: Introduction](https://aiatscale.org/blog/2021/10/21/ais-etl-1)
* [AIStore & ETL: Using AIS/PyTorch connector to transform ImageNet (post #2)](https://aiatscale.org/blog/2021/10/22/ais-etl-2)
    15  
## Setup

As in the posts above, we'll assume that an instance of AIStore has already been deployed on Kubernetes. All the code below expects an `AIS_ENDPOINT` environment variable set to the cluster's endpoint.

> To set up a local Kubernetes cluster with Minikube, check out the [docs here](https://github.com/NVIDIA/aistore/tree/main/deploy/dev/k8s). For more advanced deployments, take a look at our dedicated [ais-k8s repository](https://github.com/NVIDIA/ais-k8s/).
    21  
We'll be using PyTorch's `torchvision` to transform [The Oxford-IIIT Pet Dataset](https://www.robots.ox.ac.uk/~vgg/data/pets/) - as illustrated:

![AIS-ETL Overview](/assets/ais_etl_series/ais-etl-overview.png)

To interact with the cluster, we'll be using the [AIS Python SDK](https://github.com/NVIDIA/aistore/tree/main/python). Set up your Python environment and install the following requirements:

```text
aistore
torchvision
torch
```
    33  
## The Dataset

For this demo we will be using the [Oxford-IIIT Pet Dataset](https://www.robots.ox.ac.uk/~vgg/data/pets/), since it is less than 1 GB. The [ImageNet Dataset](https://image-net.org/index.php) is another reasonable choice, but it involves much larger downloads.

Once downloaded, the dataset includes an `images` and an `annotations` folder. For this example we will focus on the `images` directory, which consists of variously sized `.jpg` images.
    39  
```python
import os
import io
import sys
from PIL import Image
from torchvision import transforms
import torch

from aistore.pytorch import AISDataset
from aistore.sdk import Client
from aistore.sdk.multiobj import ObjectRange

AISTORE_ENDPOINT = os.getenv("AIS_ENDPOINT", "http://192.168.49.2:8080")
client = Client(AISTORE_ENDPOINT)
bucket_name = "images"


def show_image(image_data):
    with Image.open(io.BytesIO(image_data)) as image:
        image.show()


def load_data():
    # First, let's create a bucket and put the data into AIS
    bucket = client.bucket(bucket_name).create()
    bucket.put_files("images/", pattern="*.jpg")
    # Show a random (non-transformed) image from the dataset
    image_data = bucket.object("Bengal_171.jpg").get().read_all()
    show_image(image_data)

load_data()
```
    72  
![example cat image](/assets/transform_images_sdk/Bengal_171.jpg)

The class for this image can also be found in the annotations data:

```console
Bengal_171 6 1 2

Translates to
Class: 6 (ID)
Species: 1 (cat)
Breed: 2 (Bengal)
```
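Annotation lines in that format are easy to turn into labels. A small helper (hypothetical, not part of the SDK or the dataset's tooling) might parse one line like so:

```python
def parse_annotation(line):
    """Parse one line of the Oxford-IIIT annotations list:
    '<image> <class_id> <species> <breed_id>'."""
    image, class_id, species, breed_id = line.split()
    return {
        "image": image,
        "class_id": int(class_id),
        # Species code in this dataset: 1 = cat, 2 = dog
        "species": "cat" if species == "1" else "dog",
        "breed_id": int(breed_id),
    }

print(parse_annotation("Bengal_171 6 1 2"))
# {'image': 'Bengal_171', 'class_id': 6, 'species': 'cat', 'breed_id': 2}
```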
    85  
## Transforming the data

Now that the data is in place, we need to define the transformation to apply before training. Below, we define transformation code that will be deployed in an ETL container on K8s. Once this code is deployed as an ETL in AIS, it can be applied to buckets or objects to transform them on the cluster.

```python
def etl():
    def img_to_bytes(img):
        buf = io.BytesIO()
        img = img.convert('RGB')
        img.save(buf, format='JPEG')
        return buf.getvalue()

    input_bytes = sys.stdin.buffer.read()
    image = Image.open(io.BytesIO(input_bytes)).convert('RGB')
    preprocessing = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        transforms.ToPILImage(),
        transforms.Lambda(img_to_bytes),
    ])
    processed_bytes = preprocessing(image)
    sys.stdout.buffer.write(processed_bytes)
```
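The `Normalize` step above maps each channel through `(value - mean) / std`. As a plain-Python sanity check of that arithmetic (the real transform operates channel-wise on tensors):

```python
# ImageNet channel statistics, as used in the pipeline above
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

def normalize_pixel(rgb):
    """Apply (value - mean) / std per channel to one RGB pixel.
    Assumes values already scaled to [0, 1] by ToTensor."""
    return [(v - m) / s for v, m, s in zip(rgb, MEAN, STD)]

# A pixel sitting exactly at the channel means normalizes to zero everywhere
print(normalize_pixel([0.485, 0.456, 0.406]))  # [0.0, 0.0, 0.0]
```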
   111  
## Initializing

We will use the `python3` (`python:3.10`) *runtime* and install the `torchvision` package to run the `etl` function above. When using the Python SDK's `init_code`, it automatically selects the current version of Python (if supported) as the runtime, for compatibility with the code passed in. To use a different runtime, check out the `init_spec` option.

> A [runtime](https://github.com/NVIDIA/ais-etl/tree/master/runtime) contains a predefined work environment in which the provided code/script will be run. A full list of supported runtimes can be found [here](https://github.com/NVIDIA/aistore/blob/main/docs/etl.md#runtimes).
   117  
   118  
```python
def create_etl():
    image_etl = client.etl("transform-images")
    image_etl.init_code(
        transform=etl,
        dependencies=["torchvision"],
        communication_type="io")
    return image_etl


image_etl = create_etl()
```
   129  
This initialization may take a few minutes to run, as it must download `torchvision` and all of its dependencies.

```python
def show_etl(etl):
    print(client.cluster().list_running_etls())
    print(etl.view())

show_etl(image_etl)
```
   139  
## Inline and Offline ETL

AIS supports both inline ETL (applied when getting objects) and offline ETL (bucket-to-bucket). For more info, see the [ETL docs here](https://github.com/NVIDIA/aistore/blob/main/docs/etl.md).
   143  
## Transforming a single object inline

With the ETL defined, we can use it when accessing our data.

```python
def get_with_etl(etl):
    transformed_data = client.bucket(bucket_name).object("Bengal_171.jpg").get(etl_name=etl.name).read_all()
    show_image(transformed_data)

get_with_etl(image_etl)
```

Post-transform image:

![example image transformed](/assets/transform_images_sdk/Transformed_Bengal.jpg)
   159  
## Transforming an entire bucket offline

Note that the job below may take a long time to run, depending on your machine and the images you are transforming. You can view all jobs with `client.cluster().list_running_jobs()`. If you'd like to run a shorter example, you can limit which images are transformed with the `prefix_filter` option of the `bucket.transform` function:

```python
def etl_bucket(etl):
    dest_bucket = client.bucket("transformed-images").create()
    transform_job = client.bucket(bucket_name).transform(etl_name=etl.name, to_bck=dest_bucket)
    client.job(transform_job).wait()
    print([entry.name for entry in dest_bucket.list_all_objects()])

etl_bucket(image_etl)
```
   173  
## Transforming multiple objects offline

We can also use the SDK's object group feature to transform a selection of objects with the defined ETL.

```python
def etl_group(etl):
    dest_bucket = client.bucket("transformed-selected-images").create()
    # Select a range of objects from the source bucket
    object_range = ObjectRange(min_index=0, max_index=100, prefix="Bengal_", suffix=".jpg")
    object_group = client.bucket(bucket_name).objects(obj_range=object_range)
    transform_job = object_group.transform(etl_name=etl.name, to_bck=dest_bucket)
    client.job(transform_job).wait_for_idle(timeout=300)
    print([entry.name for entry in dest_bucket.list_all_objects()])

etl_group(image_etl)
```
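For reference, the `ObjectRange` above matches names of the form `Bengal_0.jpg` through `Bengal_100.jpg`. A plain-Python sketch of that expansion (not the SDK's implementation) shows which names are selected:

```python
def expand_object_range(prefix, min_index, max_index, suffix):
    """Expand a name range like 'Bengal_{0..100}.jpg' into explicit object names."""
    return [f"{prefix}{i}{suffix}" for i in range(min_index, max_index + 1)]

names = expand_object_range("Bengal_", 0, 100, ".jpg")
print(names[0], names[-1], len(names))  # Bengal_0.jpg Bengal_100.jpg 101
```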
   190  
### AIS/PyTorch connector

In the steps above, we demonstrated a few ways to transform objects, but to use the results we need to load them into a PyTorch `Dataset` and `DataLoader`. In PyTorch, a dataset can be defined by inheriting [`torch.utils.data.Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset). Datasets can be fed into a `DataLoader` to handle batching, shuffling, etc. (see [`torch.utils.data.DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader)).

To implement inline ETL, transforming objects as we read them, you will need to create a custom PyTorch `Dataset` as described [by PyTorch here](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html). In the future, AIS will likely provide some of this functionality directly. For now, we will take the output of the offline (bucket-to-bucket) ETL described above and use the provided `AISDataset` to read the transformed results. More info on reading AIS data into PyTorch can be found [on the AIS blog here](https://aiatscale.org/blog/2022/07/12/aisio-pytorch).
   196  
```python
def create_dataloader():
    # Construct a dataset and dataloader to read data from the transformed bucket
    dataset = AISDataset(AISTORE_ENDPOINT, "ais://transformed-images")
    train_loader = torch.utils.data.DataLoader(dataset, shuffle=True)
    return train_loader

data_loader = create_dataloader()
```
   206  
This data loader can now be used with PyTorch to train a full model.
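Under the hood, the `DataLoader` handles shuffling and batching of the samples the dataset yields. Conceptually, this amounts to something like the following simplified pure-Python sketch (not PyTorch's actual implementation):

```python
import random

def iterate_batches(samples, batch_size=4, shuffle=True, seed=0):
    """Yield lists of up to batch_size samples, optionally shuffled first.
    The last batch may be smaller, as with DataLoader's default drop_last=False."""
    order = list(samples)
    if shuffle:
        random.Random(seed).shuffle(order)
    for i in range(0, len(order), batch_size):
        yield order[i:i + batch_size]

batches = list(iterate_batches(range(10), batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```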
   208  
Full code examples for each action above can be found [here](/examples/transform-images-sdk/transform_sdk.py).
   210  
## References

1. [AIStore & ETL: Introduction](https://aiatscale.org/blog/2021/10/21/ais-etl-1)
2. GitHub:
    - [AIStore](https://github.com/NVIDIA/aistore)
    - [Local Kubernetes Deployment](https://github.com/NVIDIA/aistore/blob/main/deploy/dev/k8s/README.md)
    - [AIS/Kubernetes Operator, AIS on bare-metal, Deployment Playbooks, Helm](https://github.com/NVIDIA/ais-k8s)
    - [AIS-ETL containers and specs](https://github.com/NVIDIA/ais-etl)
3. Documentation, blogs, videos:
    - https://aiatscale.org
    - https://github.com/NVIDIA/aistore/tree/main/docs
4. Deprecated training code samples:
    - [ImageNet PyTorch training with `aistore.pytorch.Dataset`](/examples/etl-imagenet-dataset/train_aistore.py)
5. Full code example:
    - [Transform Images With SDK](/examples/transform-images-sdk/transform_sdk.py)
6. Dataset:
    - [The Oxford-IIIT Pet Dataset](https://www.robots.ox.ac.uk/~vgg/data/pets/)