github.com/NVIDIA/aistore@v1.3.23-0.20240517131212-7df6609be51d/python/examples/aisio-pytorch/dataset_example.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# PyTorch: Creating Datasets from AIS Backend\n",
    "Efficient data handling is crucial for training machine learning models effectively. This guide explores how to leverage AIStore (AIS), a scalable object storage solution, to create and manage datasets directly within PyTorch. We'll cover the integration of AIS with PyTorch through two custom dataset classes: AISDataset for map-style datasets and AISIterDataset for iterable datasets. These classes provide seamless access to data stored in AIS and support a variety of machine learning workflows. For details, refer to the [README](https://github.com/NVIDIA/aistore/tree/main/python/aistore/pytorch).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Imports\n",
    "import os\n",
    "import torch\n",
    "from torch.utils.data import DataLoader\n",
    "from aistore.pytorch.dataset import AISDataset, AISIterDataset\n",
    "from aistore.sdk import Client"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Set up the client and create a bucket"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "ais_url = os.getenv(\"AIS_ENDPOINT\", \"http://localhost:8080\")\n",
    "client = Client(ais_url)\n",
    "bucket = client.bucket(\"my-bck\").create(exist_ok=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Create some objects in the bucket"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "object_names = [f\"example_obj_{i}\" for i in range(10)]\n",
    "for name in object_names:\n",
    "    bucket.object(name).put_content(f\"{name} - object content\".encode(\"utf-8\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Creating a Map-Style Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "map_dataset = AISDataset(client_url=ais_url, urls_list='ais://my-bck')\n",
    "\n",
    "for i in range(len(map_dataset)):  # len() returns the total number of objects\n",
    "    print(map_dataset[i])  # each item holds the object URL and the object's content as a byte array\n",
    "\n",
    "# Create a DataLoader from the dataset\n",
    "map_data_loader = DataLoader(map_dataset, batch_size=10, num_workers=2)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Creating an Iterable-Style Dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "iter_dataset = AISIterDataset(client_url=ais_url, ais_source_list=bucket)\n",
    "for sample in iter_dataset:\n",
    "    print(sample)  # each sample holds the object URL and the object's content as a byte array\n",
    "\n",
    "# Create a DataLoader from the dataset\n",
    "iter_data_loader = DataLoader(iter_dataset, batch_size=10, num_workers=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Note:** We can also pass an `etl_name` (of an ETL that already exists in our cluster) to a dataset to apply that ETL to the objects. For example: `AISDataset(client_url=ais_url, urls_list='ais://my-bck', etl_name=your_etl_name)`"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "my-python3-kernel",
   "language": "python",
   "name": "my-python3-kernel"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
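The map-style loop in the notebook relies on PyTorch's standard map-style dataset protocol: a dataset implements `__len__` and `__getitem__`. As a rough illustration of what AISDataset provides on top of an AIS bucket, here is a minimal in-memory stand-in; `InMemoryDataset` and its sample data are hypothetical, not part of the AIS SDK, and it serves `(name, bytes)` pairs from a dict instead of fetching objects from a cluster.

```python
class InMemoryDataset:
    """Hypothetical stand-in illustrating the map-style dataset protocol."""

    def __init__(self, objects):
        # Keep a stable ordering of object names so integer indexing works.
        self._names = sorted(objects)
        self._objects = objects

    def __len__(self):
        # Total number of samples, as used by `range(len(dataset))`.
        return len(self._names)

    def __getitem__(self, index):
        # Return one sample: the object's name and its raw bytes.
        name = self._names[index]
        return name, self._objects[name]


# Mirror the objects the notebook puts into the bucket.
objects = {
    f"example_obj_{i}": f"example_obj_{i} - object content".encode("utf-8")
    for i in range(10)
}
dataset = InMemoryDataset(objects)
print(len(dataset))  # 10
print(dataset[0])    # ('example_obj_0', b'example_obj_0 - object content')
```

Because it implements this protocol, such a dataset can be handed directly to `torch.utils.data.DataLoader`, just as `map_dataset` is above.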