
# AIS Plugin for PyTorch

## PyTorch Dataset and DataLoader for AIS

The AIS plugin is a PyTorch dataset library for accessing datasets stored on AIStore.

PyTorch comes with powerful data loading capabilities, but loading data in PyTorch is fairly complex. One of the best ways to handle it is to start small and then add complexity as you need it. Below are some of the ways to import data stored on AIS into PyTorch.
     9  
### PyTorch DataLoader

The PyTorch DataLoader class gives you an iterable over a Dataset. It can be used to shuffle, batch, and parallelize operations over your data.

### PyTorch Dataset

To create a DataLoader, you first have to create a Dataset, which is a class for reading samples into memory. Most of the logic of the DataLoader resides in the Dataset.

PyTorch offers two styles of Dataset class: map-style and iterable-style.

**Note:** Both datasets can be initialized with a `urls_list` parameter and/or an `ais_source_list` parameter that defines which objects to reference in AIS.
`urls_list` can be a single prefix URL or a list of prefixes, e.g. `"ais://bucket1/file-"` or `["aws://bucket2/train/", "ais://bucket3/train/"]`.
Likewise, `ais_source_list` can be a single [AISSource](https://github.com/NVIDIA/aistore/blob/main/python/aistore/sdk/ais_source.py) object or a list of AISSource objects, e.g. `Client.bucket()` or `[Client.bucket().objects(), Client.bucket()]`.
    23  
#### ***Map-style Dataset***

A map-style dataset in PyTorch implements the `__getitem__()` and `__len__()` functions and provides the user a map from indices/keys to data samples.
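As a generic illustration of that protocol (not AIS-specific; the class name and sample values below are made up), any class exposing these two methods behaves as a map-style dataset — in real code you would subclass `torch.utils.data.Dataset`, which requires exactly these methods:

```python
# Minimal sketch of the map-style protocol; in practice, subclass
# torch.utils.data.Dataset, which requires only these two methods.
class ToySamples:
    def __init__(self, samples):
        self._samples = samples

    def __len__(self):
        # Total number of samples; used by len(dataset) and by samplers
        return len(self._samples)

    def __getitem__(self, index):
        # Map an integer index to a single (label, data) sample
        return self._samples[index]

dataset = ToySamples([("label-0", b"img-0"), ("label-1", b"img-1")])
print(len(dataset))   # 2
print(dataset[1])     # ('label-1', b'img-1')
```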
    27  
For example, we can access the i-th label and its corresponding image with `dataset[i]` from a bucket in AIStore.

```python
from aistore.pytorch.dataset import AISDataset

dataset = AISDataset(client_url="http://ais-gateway-url:8080", urls_list="ais://bucket1/")

for i in range(len(dataset)):  # len() returns the total number of objects
    print(dataset[i])  # prints the object name and byte array of the object
```
    39  
#### ***Iterable-style Dataset***

Iterable-style datasets in PyTorch are tailored for scenarios where data needs to be processed as a stream and direct indexing is either not feasible or inefficient. Such datasets inherit from IterableDataset and override the `__iter__()` method, providing an iterator over the data samples.
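As a generic sketch (not AIS-specific; the class name and samples below are made up), an iterable-style dataset only needs an `__iter__()` that yields samples one at a time — a real implementation would subclass `torch.utils.data.IterableDataset`:

```python
# Minimal sketch of the iterable-style protocol; a real dataset would
# subclass torch.utils.data.IterableDataset. The generator here stands in
# for objects fetched lazily from a remote source such as an AIS bucket.
class ToyStream:
    def __init__(self, n):
        self._n = n

    def __iter__(self):
        # Yield samples one at a time instead of supporting random access
        for i in range(self._n):
            yield f"sample-{i}"

stream = ToyStream(3)
print(list(stream))  # ['sample-0', 'sample-1', 'sample-2']
```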
    44  
These datasets are particularly useful when dealing with large data streams that cannot be loaded entirely into memory, or when data is continuously generated or fetched from external sources like databases, remote servers, or live data feeds.

We have extended support for iterable-style datasets to AIStore (AIS) backends, enabling efficient streaming of data directly from AIS buckets. This approach is ideal for training models on large datasets stored in AIS, minimizing memory overhead and facilitating seamless data ingestion from AIS's distributed object storage.

Here's how you can use an iterable-style dataset with AIStore:
    50  
```python
import os

from aistore.pytorch.dataset import AISIterDataset
from aistore.sdk import Client

ais_url = os.getenv("AIS_ENDPOINT", "http://localhost:8080")
client = Client(ais_url)
bucket = client.bucket("my-bck").create(exist_ok=True)

# Create objects in our bucket
object_names = [f"example_obj_{i}" for i in range(10)]
for name in object_names:
    bucket.object(name).put_content("object content".encode("utf-8"))

# Create an object group referencing the objects above
my_objects = bucket.objects(obj_names=object_names)

# Initialize the dataset with the AIS client URL and the data source
dataset = AISIterDataset(client_url=ais_url, ais_source_list=my_objects)

# Iterate over the dataset to fetch data samples as a stream
for data_sample in dataset:
    print(data_sample)  # each iteration yields a data sample (object name and byte array)
```
    75  
    76  
**Creating a DataLoader from an AISDataset**
```python
import torch

from aistore.pytorch.dataset import AISDataset

# args holds training options parsed elsewhere (e.g. via argparse)
train_loader = torch.utils.data.DataLoader(
    AISDataset(
        "http://ais-gateway-url:8080", urls_list=["ais://dataset1/train/", "ais://dataset2/train/"]
    ),
    batch_size=args.batch_size, shuffle=True,
    num_workers=args.workers, pin_memory=True,
)
```
    89  
## AIS IO Datapipe

### AIS File Lister

Iterable DataPipe that lists files from AIS backends with the given URL prefixes. Acceptable prefixes include, but are not limited to: `ais://bucket-name`, `ais://bucket-name/`, `ais://bucket-name/folder`, `ais://bucket-name/folder/`, `ais://bucket-name/prefix`.

**Note:**
1) This function also supports files from multiple backends (`aws://..`, `gcp://..`, etc.)
2) Input *must* be a list; direct URLs are not supported.
3) `length` is -1 by default; all calls to `len()` are invalid because not all items are iterated at the start.
4) This internally uses the [AIStore Python SDK](https://github.com/NVIDIA/aistore/tree/main/python).
   101  
### AIS File Loader

Iterable DataPipe that loads files from AIS backends with the given list of URLs (no prefixes allowed). It iterates over all files in BytesIO format and returns a tuple (url, BytesIO).

**Note:**
1) This function also supports files from multiple backends (`aws://..`, `gcp://..`, etc.)
2) Input *must* be a list; direct URLs are not supported.
3) This internally uses the [AIStore Python SDK](https://github.com/NVIDIA/aistore/blob/main/python/aistore/sdk).
   109  
### Example
```python
from aistore.pytorch.aisio import AISFileListerIterDataPipe, AISFileLoaderIterDataPipe

prefixes = ['ais://bucket1/train/', 'aws://bucket2/train/']

list_of_files = AISFileListerIterDataPipe(url='http://ais-gateway-url:8080', source_datapipe=prefixes)

files = AISFileLoaderIterDataPipe(url='http://ais-gateway-url:8080', source_datapipe=list_of_files)
```
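The file loader yields `(url, BytesIO)` tuples; a minimal consumption sketch, using an in-memory stand-in since no live cluster is assumed here (the URLs and payloads below are made up):

```python
from io import BytesIO

# Stand-in for what AISFileLoaderIterDataPipe yields: (url, BytesIO) tuples
loaded = [
    ("ais://bucket1/train/obj-0", BytesIO(b"payload-0")),
    ("ais://bucket1/train/obj-1", BytesIO(b"payload-1")),
]

for url, stream in loaded:
    # .read() drains the stream into raw bytes, ready for decoding
    payload = stream.read()
    print(url, payload)
```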
   120  
For a more in-depth example, see [here](https://github.com/NVIDIA/aistore/blob/main/python/examples/aisio_pytorch_example.ipynb).