---
layout: post
title: "PyTorch: Loading Data from AIStore"
date: 2022-07-11 18:31:24 -0700
author: Abhishek Gaikwad
categories: aistore pytorch sdk python
---

# PyTorch: Loading Data from AIStore

Listing and loading data from AIS buckets (buckets that have no 3rd-party
backend) and from remote cloud buckets (buckets backed by a 3rd-party
cloud) using
[AISFileLister](https://pytorch.org/data/main/generated/torchdata.datapipes.iter.AISFileLister.html#aisfilelister)
and
[AISFileLoader](https://pytorch.org/data/main/generated/torchdata.datapipes.iter.AISFileLoader.html#torchdata.datapipes.iter.AISFileLoader).

[AIStore](https://github.com/NVIDIA/aistore) (AIS for short) fully supports
Amazon S3, Google Cloud, and Microsoft Azure backends, providing a
unified namespace across multiple connected backends and/or other AIS
clusters, and [more](https://github.com/NVIDIA/aistore#features).

In the following example, we use the [Caltech-256 Object Category
Dataset](https://authors.library.caltech.edu/7694/), which contains 256
object categories and a total of 30,607 images, stored in an AIS bucket,
and the [Microsoft COCO Dataset](https://cocodataset.org/#home), which
has 330K images with over 200K labels covering more than 1.5 million
object instances across 80 object categories, stored on Google Cloud.

``` {.python}
# Imports
import os
from IPython.display import Image

from torchdata.datapipes.iter import AISFileLister, AISFileLoader, Mapper
```


### Running the AIStore Cluster

[Getting started with
AIS](https://github.com/NVIDIA/aistore/blob/main/docs/getting_started.md)
takes only a few minutes (prerequisites boil down to having a Linux
machine with a disk) and can be done either by running a prebuilt [all-in-one
docker image](https://github.com/NVIDIA/aistore/tree/main/deploy) or
directly from source.

To keep this example simple, we will be running a [minimal standalone
docker
deployment](https://github.com/NVIDIA/aistore/blob/main/deploy/prod/docker/single/README.md)
of AIStore.

``` {.python}
# Running the AIStore cluster in a container on port 51080
# Note: the mounted path should have enough space to hold the dataset

! docker run -d \
    -p 51080:51080 \
    -v <path_to_gcp_config>.json:/credentials/gcp.json \
    -e GOOGLE_APPLICATION_CREDENTIALS="/credentials/gcp.json" \
    -e AWS_ACCESS_KEY_ID="AWSKEYIDEXAMPLE" \
    -e AWS_SECRET_ACCESS_KEY="AWSSECRETACCESSKEYEXAMPLE" \
    -e AWS_REGION="us-east-2" \
    -e AIS_BACKEND_PROVIDERS="gcp aws" \
    -v /disk0:/ais/disk0 \
    aistore/cluster-minimal:latest
```


To create a bucket and put objects (the dataset) into it, we will use the
[AIS
CLI](https://github.com/NVIDIA/aistore/blob/main/docs/cli.md). The
[Python
SDK](https://github.com/NVIDIA/aistore/tree/main/python/aistore) can also
be used for the same.

``` {.python}
! ais config cli set cluster.url=http://localhost:51080

# create bucket using AIS CLI
! ais create caltech256

# put the downloaded dataset in the created AIS bucket
! ais put -r -y <path_to_dataset> ais://caltech256/
```

> OUTPUT:<br>
> "ais://caltech256" created (see https://github.com/NVIDIA/aistore/blob/main/docs/bucket.md#default-bucket-properties)<br>
> Files to upload:<br>
    EXTENSION	 COUNT	 SIZE<br>
    		 1	 3.06KiB<br>
    .jpg	 30607	 1.08GiB<br>
    TOTAL	 30608	 1.08GiB<br>
    PUT 30608 objects to "ais://caltech256"<br>


### Preloaded dataset

The following assumes that the AIS cluster is running and that one of its
buckets contains the Caltech-256 dataset.

``` {.python}
# list of prefixes which contain data
image_prefix = ['ais://caltech256/']

# list all files starting with these prefixes on AIStore
dp_urls = AISFileLister(url="http://localhost:51080", source_datapipe=image_prefix)

# print the first 5 object urls
list(dp_urls)[:5]
```

> OUTPUT:<br>
    ['ais://caltech256/002.american-flag/002_0001.jpg',<br>
     'ais://caltech256/002.american-flag/002_0002.jpg',<br>
     'ais://caltech256/002.american-flag/002_0003.jpg',<br>
     'ais://caltech256/002.american-flag/002_0004.jpg',<br>
     'ais://caltech256/002.american-flag/002_0005.jpg']
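
As the listing shows, each object's parent directory encodes its category as
`<label>.<class-name>`. A quick standalone sketch (with hypothetical URL
values) of tallying images per category from such a listing:

``` {.python}
from collections import Counter

# a few listed object urls (hypothetical values, for illustration only)
urls = [
    "ais://caltech256/002.american-flag/002_0001.jpg",
    "ais://caltech256/002.american-flag/002_0002.jpg",
    "ais://caltech256/003.backpack/003_0001.jpg",
]

# the parent directory of each object encodes "<label>.<class-name>"
counts = Counter(url.rsplit("/", 2)[1] for url in urls)
print(counts)
```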


``` {.python}
# load data using AISFileLoader
dp_files = AISFileLoader(url="http://localhost:51080", source_datapipe=dp_urls)

# check the first object
url, img = next(iter(dp_files))

print(f"image url: {url}")

# view the image
# Image(data=img.read())
```

> OUTPUT:<br>
    image url: ais://caltech256/002.american-flag/002_0001.jpg

``` {.python}
def collate_sample(data):
    path, image = data
    # the parent directory, e.g. "002.american-flag", encodes label and class
    dir_name = os.path.split(os.path.dirname(path))[1]
    label_str, cls = dir_name.split(".")
    return {"path": path, "image": image, "label": int(label_str), "cls": cls}
```
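
To see what `collate_sample` extracts, here is a standalone sketch of the same
path parsing applied to one of the object URLs listed above:

``` {.python}
import os

# one object url from the bucket listing
path = "ais://caltech256/002.american-flag/002_0001.jpg"

# the parent directory, "002.american-flag", encodes label and class
dir_name = os.path.split(os.path.dirname(path))[1]
label_str, cls = dir_name.split(".")

print(int(label_str), cls)  # 2 american-flag
```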

``` {.python}
# pass the loaded files further down the pipeline
for _sample in Mapper(dp_files, collate_sample):
    pass
```

### Remote cloud buckets

AIStore supports multiple [remote
backends](https://aiatscale.org/docs/providers). With AIS, accessing
cloud buckets doesn't require any additional setup assuming, of course,
that you have the corresponding credentials.

For the following example, AIStore must be built with the `--gcp` build tag.

> `--gcp`, `--aws`, and a number of other [build tags](https://github.com/NVIDIA/aistore/blob/main/Makefile) are the mechanism we use to include optional libraries in the [build](https://github.com/NVIDIA/aistore/blob/main/docs/getting_started.md#build-make-and-development-tools).

``` {.python}
# list of prefixes which contain data
gcp_prefix = ['gcp://webdataset-testing/']

# list all files starting with these prefixes on AIStore
gcp_urls = AISFileLister(url="http://localhost:51080", source_datapipe=gcp_prefix)

# print the first 5 object urls
list(gcp_urls)[:5]
```

> OUTPUT:<br>
    ['gcp://webdataset-testing/coco-train2014-seg-000000.tar',<br>
     'gcp://webdataset-testing/coco-train2014-seg-000001.tar',<br>
     'gcp://webdataset-testing/coco-train2014-seg-000002.tar',<br>
     'gcp://webdataset-testing/coco-train2014-seg-000003.tar',<br>
     'gcp://webdataset-testing/coco-train2014-seg-000004.tar']

``` {.python}
dp_files = AISFileLoader(url="http://localhost:51080", source_datapipe=gcp_urls)
```

``` {.python}
for url, file in dp_files.load_from_tar():
    pass
```
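
Each COCO shard here is a plain tar archive, and `load_from_tar()` surfaces its
members one by one. A minimal stdlib sketch of that kind of iteration over a
synthetic in-memory shard (the member name and bytes below are made up, standing
in for one of the `coco-train2014-seg-*.tar` objects):

``` {.python}
import io
import tarfile

# build a tiny in-memory tar shard (hypothetical member name/contents)
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    payload = b"fake segmentation bytes"
    info = tarfile.TarInfo(name="000000000001.seg.png")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))
buf.seek(0)

# iterate the shard's members as (name, stream) pairs,
# much like load_from_tar() yields them
members = []
with tarfile.open(fileobj=buf, mode="r") as tar:
    for member in tar:
        stream = tar.extractfile(member)
        members.append((member.name, stream.read()))

print(members)
```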

### References

-   [AIStore](https://github.com/NVIDIA/aistore)
-   [AIStore Blog](https://aiatscale.org/blog)
-   [AIS CLI](https://github.com/NVIDIA/aistore/blob/main/docs/cli.md)
-   [AIStore Cloud Backend Providers](https://aiatscale.org/docs/providers)
-   [AIStore Documentation](https://aiatscale.org/docs)
-   [AIStore Python SDK](https://github.com/NVIDIA/aistore/tree/main/python/aistore)
-   [Caltech 256 Dataset](https://authors.library.caltech.edu/7694/)
-   [Getting started with AIStore](https://github.com/NVIDIA/aistore/blob/main/docs/getting_started.md)
-   [Microsoft COCO Dataset](https://cocodataset.org/#home)