---
layout: post
title: "PyTorch: Loading Data from AIStore"
date: 2022-07-11 18:31:24 -0700
author: Abhishek Gaikwad
categories: aistore pytorch sdk python
---

# PyTorch: Loading Data from AIStore

This post shows how to list and load data from AIS buckets (buckets that are not 3rd party backend-based) and remote cloud buckets (3rd party backend-based cloud buckets) using [AISFileLister](https://pytorch.org/data/main/generated/torchdata.datapipes.iter.AISFileLister.html#aisfilelister) and [AISFileLoader](https://pytorch.org/data/main/generated/torchdata.datapipes.iter.AISFileLoader.html#torchdata.datapipes.iter.AISFileLoader).

[AIStore](https://github.com/NVIDIA/aistore) (AIS for short) fully supports Amazon S3, Google Cloud, and Microsoft Azure backends, providing a unified namespace across multiple connected backends and/or other AIS clusters, and [more](https://github.com/NVIDIA/aistore#features).

In the examples below, we use the [Caltech-256 Object Category Dataset](https://authors.library.caltech.edu/7694/), which contains 256 object categories and a total of 30,607 images stored on an AIS bucket, and the [Microsoft COCO Dataset](https://cocodataset.org/#home), which has 330K images with over 200K labels of more than 1.5 million object instances across 80 object categories, stored on Google Cloud.

``` {.python}
# Imports
import os
from IPython.display import Image

from torchdata.datapipes.iter import AISFileLister, AISFileLoader, Mapper
```

### Running the AIStore Cluster

[Getting started with AIS](https://github.com/NVIDIA/aistore/blob/main/docs/getting_started.md) will take only a few minutes (prerequisites boil down to having a Linux machine with a disk) and can be done either by running a prebuilt [all-in-one docker image](https://github.com/NVIDIA/aistore/tree/main/deploy) or directly from the open-source code.

To keep this example simple, we will be running a [minimal standalone docker deployment](https://github.com/NVIDIA/aistore/blob/main/deploy/prod/docker/single/README.md) of AIStore.

``` {.python}
# Running the AIStore cluster in a container on port 51080
# Note: the mounted path should have enough space to hold the dataset

! docker run -d \
    -p 51080:51080 \
    -v <path_to_gcp_config>.json:/credentials/gcp.json \
    -e GOOGLE_APPLICATION_CREDENTIALS="/credentials/gcp.json" \
    -e AWS_ACCESS_KEY_ID="AWSKEYIDEXAMPLE" \
    -e AWS_SECRET_ACCESS_KEY="AWSSECRETEACCESSKEYEXAMPLE" \
    -e AWS_REGION="us-east-2" \
    -e AIS_BACKEND_PROVIDERS="gcp aws" \
    -v /disk0:/ais/disk0 \
    aistore/cluster-minimal:latest
```

To create a bucket and put the downloaded dataset into it, I am going to use the [AIS CLI](https://github.com/NVIDIA/aistore/blob/main/docs/cli.md). The same can also be done with the [Python SDK](https://github.com/NVIDIA/aistore/tree/main/python/aistore) (a sketch follows the CLI output below).

``` {.python}
! ais config cli set cluster.url=http://localhost:51080

# create bucket using AIS CLI
! ais create caltech256

# put the downloaded dataset in the created AIS bucket
! ais put -r -y <path_to_dataset> ais://caltech256/
```

> OUTPUT:<br>
> "ais://caltech256" created (see https://github.com/NVIDIA/aistore/blob/main/docs/bucket.md#default-bucket-properties)<br>
> Files to upload:<br>
> EXTENSION    COUNT    SIZE<br>
>              1        3.06KiB<br>
> .jpg         30607    1.08GiB<br>
> TOTAL        30608    1.08GiB<br>
> PUT 30608 objects to "ais://caltech256"<br>
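For reference, here is a minimal sketch of the same two steps done from Python. It assumes a recent layout of the aistore SDK (`Client`, `Bucket.create()`, `Bucket.put_files()`); the exact module path and method names have changed across SDK releases, so treat it as illustrative rather than exact:

``` {.python}
# Hypothetical Python SDK equivalent of the CLI steps above.
# Assumes a recent aistore SDK; names may differ in older releases.
from aistore.sdk import Client

client = Client("http://localhost:51080")

# create the bucket
bucket = client.bucket("caltech256")
bucket.create()

# recursively upload the downloaded dataset
# (<path_to_dataset> is the same placeholder used with the CLI)
bucket.put_files("<path_to_dataset>", recursive=True)
```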
### Preloaded dataset

The following assumes that the AIS cluster is running and that one of its buckets contains the Caltech-256 dataset.

``` {.python}
# list of prefixes which contain data
image_prefix = ['ais://caltech256/']

# listing all files starting with these prefixes on AIStore
dp_urls = AISFileLister(url="http://localhost:51080", source_datapipe=image_prefix)

# list the first 5 object urls
list(dp_urls)[:5]
```

> OUTPUT:<br>
> ['ais://caltech256/002.american-flag/002_0001.jpg',<br>
> 'ais://caltech256/002.american-flag/002_0002.jpg',<br>
> 'ais://caltech256/002.american-flag/002_0003.jpg',<br>
> 'ais://caltech256/002.american-flag/002_0004.jpg',<br>
> 'ais://caltech256/002.american-flag/002_0005.jpg']

``` {.python}
# loading data using AISFileLoader
dp_files = AISFileLoader(url="http://localhost:51080", source_datapipe=dp_urls)

# check the first object
url, img = next(iter(dp_files))

print(f"image url: {url}")

# view the image
# Image(data=img.read())
```

> OUTPUT:<br>
> image url: ais://caltech256/002.american-flag/002_0001.jpg

``` {.python}
def collate_sample(data):
    path, image = data
    # Caltech-256 directories are named "<label>.<class-name>", e.g. "002.american-flag"
    dir = os.path.split(os.path.dirname(path))[1]
    label_str, cls = dir.split(".")
    return {"path": path, "image": image, "label": int(label_str), "cls": cls}
```

``` {.python}
# passing it further down the pipeline
for _sample in Mapper(dp_files, collate_sample):
    pass
```
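In practice, the mapped samples would feed a training loop rather than being discarded. The sketch below is not part of the original notebook: it shows one way to decode the image bytes with PIL, convert them to tensors, and batch them with a standard PyTorch `DataLoader`. The `transform_sample` helper, the `torchvision`/PIL dependencies, the 224x224 resize, and the batch size are all illustrative choices:

``` {.python}
# Illustrative continuation of the pipeline above.
import io

from PIL import Image as PILImage          # avoid clashing with IPython.display.Image
from torch.utils.data import DataLoader
from torchvision import transforms

to_tensor = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])

def transform_sample(data):
    # reuse collate_sample to get {"path", "image", "label", "cls"}
    sample = collate_sample(data)
    # decode the raw JPEG bytes streamed from AIStore
    img = PILImage.open(io.BytesIO(sample["image"].read())).convert("RGB")
    return to_tensor(img), sample["label"]

dp_train = Mapper(dp_files, transform_sample)

# torchdata iterable datapipes plug directly into torch.utils.data.DataLoader
loader = DataLoader(dp_train, batch_size=16, num_workers=0)
images, labels = next(iter(loader))
print(images.shape, labels[:4])
```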
### Remote cloud buckets

AIStore supports multiple [remote backends](https://aiatscale.org/docs/providers). With AIS, accessing cloud buckets doesn't require any additional setup assuming, of course, that you have the corresponding credentials (to access cloud buckets).

For the following example, AIStore must be built with the `--gcp` build tag.

> `--gcp`, `--aws`, and a number of other [build tags](https://github.com/NVIDIA/aistore/blob/main/Makefile) are the mechanism we use to include optional libraries in the [build](https://github.com/NVIDIA/aistore/blob/main/docs/getting_started.md#build-make-and-development-tools).

``` {.python}
# list of prefixes which contain data
gcp_prefix = ['gcp://webdataset-testing/']

# listing all files starting with these prefixes on AIStore
gcp_urls = AISFileLister(url="http://localhost:51080", source_datapipe=gcp_prefix)

# list the first 5 object urls
list(gcp_urls)[:5]
```

> OUTPUT:<br>
> ['gcp://webdataset-testing/coco-train2014-seg-000000.tar',<br>
> 'gcp://webdataset-testing/coco-train2014-seg-000001.tar',<br>
> 'gcp://webdataset-testing/coco-train2014-seg-000002.tar',<br>
> 'gcp://webdataset-testing/coco-train2014-seg-000003.tar',<br>
> 'gcp://webdataset-testing/coco-train2014-seg-000004.tar']

``` {.python}
# loading the tar shards using AISFileLoader
dp_files = AISFileLoader(url="http://localhost:51080", source_datapipe=gcp_urls)
```

``` {.python}
# iterating over the members of each tar archive
for url, file in dp_files.load_from_tar():
    pass
```

### References

- [AIStore](https://github.com/NVIDIA/aistore)
- [AIStore Blog](https://aiatscale.org/blog)
- [AIS CLI](https://github.com/NVIDIA/aistore/blob/main/docs/cli.md)
- [AIStore Cloud Backend Providers](https://aiatscale.org/docs/providers)
- [AIStore Documentation](https://aiatscale.org/docs)
- [AIStore Python SDK](https://github.com/NVIDIA/aistore/tree/main/python/aistore)
- [Caltech 256 Dataset](https://authors.library.caltech.edu/7694/)
- [Getting started with AIStore](https://github.com/NVIDIA/aistore/blob/main/docs/getting_started.md)
- [Microsoft COCO Dataset](https://cocodataset.org/#home)