github.com/NVIDIA/aistore@v1.3.23-0.20240517131212-7df6609be51d/python/examples/aisio-pytorch/aisio_pytorch_example.ipynb

     1  {
     2   "cells": [
     3    {
     4     "cell_type": "markdown",
     5     "id": "085cf314",
     6     "metadata": {},
     7     "source": [
     8      "# PyTorch: Loading Data from AIStore \n",
     9      "\n",
    10      "This notebook shows how to list and load data from AIS buckets (buckets that are not backed by a 3rd-party backend) and from remote cloud buckets (buckets backed by a 3rd-party cloud backend) using [AISFileLister](https://pytorch.org/data/main/generated/torchdata.datapipes.iter.AISFileLister.html#aisfilelister) and [AISFileLoader](https://pytorch.org/data/main/generated/torchdata.datapipes.iter.AISFileLoader.html#torchdata.datapipes.iter.AISFileLoader).\n",
    11      "\n",
    12      "In the following example, we use two datasets: the [Caltech-256 Object Category Dataset](https://authors.library.caltech.edu/7694/), which contains 256 object categories and a total of 30607 images and is stored in an AIS bucket, and the [Microsoft COCO Dataset](https://cocodataset.org/#home), which has 330K images (more than 200K of them labeled) and 1.5 million object instances across 80 object categories and is stored on Google Cloud."
    13     ]
    14    },
    15    {
    16     "cell_type": "code",
    17     "execution_count": null,
    18     "id": "0e9e03de",
    19     "metadata": {},
    20     "outputs": [],
    21     "source": [
    22      "# Imports\n",
    23      "import os\n",
    24      "from IPython.display import Image\n",
    25      "\n",
    26      "from torchdata.datapipes.iter import AISFileLister, AISFileLoader, Mapper"
    27     ]
    28    },
    29    {
    30     "cell_type": "markdown",
    31     "id": "c42580f7",
    32     "metadata": {},
    33     "source": [
    34      "### Running the AIStore Cluster\n",
    35      "\n",
    36      "[AIStore](https://github.com/NVIDIA/aistore) (AIS for short) is a highly available, lightweight object storage system that focuses specifically on petascale deep learning. As reliable, redundant storage, AIS supports n-way mirroring and erasure coding. It is more than a storage system, though: it can also shuffle user datasets and run custom extract-transform-load workloads.\n",
    37      "\n",
    38      "AIS is an elastic cluster that can grow and shrink at runtime and can be deployed ad hoc, with or without Kubernetes, anywhere from a single Linux machine to a bare-metal cluster of any size. AIS fully supports Amazon S3, Google Cloud, and Microsoft Azure backends, provides a unified namespace across multiple connected backends and/or other AIS clusters, and offers [more](https://github.com/NVIDIA/aistore#features).\n",
    39      "\n",
    40      "[Getting started with AIS](https://github.com/NVIDIA/aistore/blob/master/docs/getting_started.md) takes only a few minutes (the prerequisites boil down to a Linux machine with a disk) and can be done either by running a prebuilt [all-in-one docker image](https://github.com/NVIDIA/aistore/tree/master/deploy) or by building directly from the open-source code.\n",
    41      "\n",
    42      "To keep this example simple, we will be running a [minimal standalone docker deployment](https://github.com/NVIDIA/aistore/blob/master/deploy/prod/docker/single/README.md) of AIStore."
    43     ]
    44    },
    45    {
    46     "cell_type": "code",
    47     "execution_count": null,
    48     "id": "51204353",
    49     "metadata": {},
    50     "outputs": [],
    51     "source": [
    52      "# Running the AIStore cluster in a container on port 51080\n",
    53      "# Note: The mounted path should have enough space to load the dataset\n",
    54      "\n",
    55      "! docker run -d \\\n",
    56      "    -p 51080:51080 \\\n",
    57      "    -v <path_to_gcp_config>.json:/credentials/gcp.json \\\n",
    58      "    -e GOOGLE_APPLICATION_CREDENTIALS=\"/credentials/gcp.json\" \\\n",
    59      "    -e AWS_ACCESS_KEY_ID=\"AWSKEYIDEXAMPLE\" \\\n",
    60      "    -e AWS_SECRET_ACCESS_KEY=\"AWSSECRETACCESSKEYEXAMPLE\" \\\n",
    61      "    -e AWS_REGION=\"us-east-2\" \\\n",
    62      "    -e AIS_BACKEND_PROVIDERS=\"gcp aws\" \\\n",
    63      "    -v /disk0:/ais/disk0 \\\n",
    64      "    aistore/cluster-minimal:latest\n"
    65     ]
    66    },
    67    {
    68     "cell_type": "markdown",
    69     "id": "3b067695",
    70     "metadata": {},
    71     "source": [
    72      "To create the bucket and put the objects (the dataset) in it, we will use the [AIS CLI](https://github.com/NVIDIA/aistore/blob/master/docs/cli.md). The [Python SDK](https://github.com/NVIDIA/aistore/tree/master/python/aistore) can be used to do the same."
    73     ]
    74    },
    75    {
    76     "cell_type": "code",
    77     "execution_count": null,
    78     "id": "730e1053",
    79     "metadata": {},
    80     "outputs": [],
    81     "source": [
    82      "! ais config cli set cluster.url=http://localhost:51080\n",
    83      "\n",
    84      "# create bucket using AIS CLI\n",
    85      "! ais bucket create caltech256\n",
    86      "\n",
    87      "# put the downloaded dataset in the created AIS bucket\n",
    88      "! ais object put -r -y <path_to_dataset> ais://caltech256/"
    89     ]
    90    },
    91    {
    92     "cell_type": "markdown",
    93     "id": "b24bf6a8",
    94     "metadata": {},
    95     "source": [
    96      "### Preloaded dataset\n",
    97      "\n",
    98      "The following assumes that an AIS cluster is running and that one of its buckets contains the Caltech-256 dataset."
    99     ]
   100    },
   101    {
   102     "cell_type": "code",
   103     "execution_count": null,
   104     "id": "f26495b1",
   105     "metadata": {},
   106     "outputs": [],
   107     "source": [
   108      "# list of prefixes which contain data\n",
   109      "image_prefix = [\"ais://caltech256/\"]\n",
   110      "\n",
   111      "# Listing all files starting with these prefixes on AIStore\n",
   112      "dp_urls = AISFileLister(url=\"http://localhost:51080\", source_datapipe=image_prefix)\n",
   113      "\n",
   114      "# list first 5 obj urls\n",
   115      "list(dp_urls)[:5]"
   116     ]
   117    },
   118    {
   119     "cell_type": "code",
   120     "execution_count": null,
   121     "id": "eb311250",
   122     "metadata": {},
   123     "outputs": [],
   124     "source": [
   125      "# loading data using AISFileLoader\n",
   126      "dp_files = AISFileLoader(url=\"http://localhost:51080\", source_datapipe=dp_urls)\n",
   127      "\n",
   128      "# check the first obj\n",
   129      "url, img = next(iter(dp_files))\n",
   130      "\n",
   131      "print(f\"image url: {url}\")\n",
   132      "Image(data=img.read())"
   133     ]
   134    },
   135    {
   136     "cell_type": "code",
   137     "execution_count": null,
   138     "id": "dd521f6a",
   139     "metadata": {},
   140     "outputs": [],
   141     "source": [
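   142      "def collate_sample(data):\n",
   143      "    path, image = data  # (object URL, file stream) pair from AISFileLoader\n",
   144      "    dir_name = os.path.basename(os.path.dirname(path))  # e.g. \"<label>.<class name>\"\n",
   145      "    label_str, cls = dir_name.split(\".\", 1)  # split on the first dot only\n",
   146      "    return {\"path\": path, \"image\": image, \"label\": int(label_str), \"cls\": cls}"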
   147     ]
   148    },
   149    {
   150     "cell_type": "code",
   151     "execution_count": null,
   152     "id": "39737c3f",
   153     "metadata": {},
   154     "outputs": [],
   155     "source": [
   156      "# passing it further down the pipeline\n",
   157      "for _sample in Mapper(dp_files, collate_sample):\n",
   158      "    pass"
   159     ]
   160    },
   161    {
   162     "cell_type": "markdown",
   163     "id": "9044a1cd",
   164     "metadata": {},
   165     "source": [
   166      "### Remote cloud buckets\n",
   167      "\n",
   168      "AIStore supports multiple [remote backends](https://aiatscale.org/docs/providers). With AIS, accessing cloud buckets requires no additional setup, provided that you have the corresponding credentials.\n",
   169      "\n",
   170      "For the following example, AIStore must be built and linked with the remote cloud provider backend that hosts the dataset."
   171     ]
   172    },
   173    {
   174     "cell_type": "code",
   175     "execution_count": null,
   176     "id": "2cd03757",
   177     "metadata": {},
   178     "outputs": [],
   179     "source": [
   180      "# list of prefixes which contain data\n",
   181      "gcp_prefix = [\"gcp://webdataset-testing/\"]\n",
   182      "\n",
   183      "# Listing all files starting with these prefixes on AIStore\n",
   184      "gcp_urls = AISFileLister(url=\"http://localhost:51080\", source_datapipe=gcp_prefix)\n",
   185      "\n",
   186      "# list first 5 obj urls\n",
   187      "list(gcp_urls)[:5]"
   188     ]
   189    },
   190    {
   191     "cell_type": "code",
   192     "execution_count": null,
   193     "id": "bccce8e6",
   194     "metadata": {},
   195     "outputs": [],
   196     "source": [
   197      "dp_files = AISFileLoader(url=\"http://localhost:51080\", source_datapipe=gcp_urls)"
   198     ]
   199    },
   200    {
   201     "cell_type": "code",
   202     "execution_count": null,
   203     "id": "ce89bc91",
   204     "metadata": {},
   205     "outputs": [],
   206     "source": [
   207      "for url, file in dp_files.load_from_tar():\n",
   208      "    pass"
   209     ]
   210    },
   211    {
   212     "cell_type": "markdown",
   213     "id": "be29de09",
   214     "metadata": {},
   215     "source": [
   216      "### References\n",
   217      "- [AIStore](https://github.com/NVIDIA/aistore)\n",
   218      "- [AIStore Blog](https://aiatscale.org/blog)\n",
   219      "- [AIS CLI](https://github.com/NVIDIA/aistore/blob/master/docs/cli.md)\n",
   220      "- [AIStore Cloud Backend Providers](https://aiatscale.org/docs/providers)\n",
   221      "- [AIStore Documentation](https://aiatscale.org/docs)\n",
   222      "- [AIStore Python SDK](https://github.com/NVIDIA/aistore/tree/master/python/aistore)\n",
   223      "- [Caltech 256 Dataset](https://authors.library.caltech.edu/7694/)\n",
   224      "- [Getting started with AIStore](https://github.com/NVIDIA/aistore/blob/master/docs/getting_started.md)\n",
   225      "- [Microsoft COCO Dataset](https://cocodataset.org/#home)\n"
   226     ]
   227    }
   228   ],
   229   "metadata": {
   230    "kernelspec": {
   231     "display_name": "Python 3 (ipykernel)",
   232     "language": "python",
   233     "name": "python3"
   234    },
   235    "language_info": {
   236     "codemirror_mode": {
   237      "name": "ipython",
   238      "version": 3
   239     },
   240     "file_extension": ".py",
   241     "mimetype": "text/x-python",
   242     "name": "python",
   243     "nbconvert_exporter": "python",
   244     "pygments_lexer": "ipython3",
   245     "version": "3.8.10"
   246    }
   247   },
   248   "nbformat": 4,
   249   "nbformat_minor": 5
   250  }