# Dataset Module for Python SDK

The Dataset Module in the Python SDK provides tools to create and manage datasets for data science and machine learning projects. It simplifies data transformation, loading, and storage. Datasets are stored in the [WebDataset](https://github.com/webdataset/webdataset) format, and a number of different [backend providers](https://aiatscale.org/docs/providers) are supported.

WebDataset was chosen because it is built on TAR files, which AIStore manages efficiently: reading, writing, listing, and appending to TAR archives are all supported. For more details on AIStore's handling of TAR files, see:

- [AIStore Archive Operations](https://aiatscale.org/docs/archive)
- [AIStore and TAR Append](https://aiatscale.org/blog/2021/08/10/tar-append)
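
To make the TAR-based layout concrete, here is a minimal sketch of the WebDataset convention using only the Python standard library: files that share a base name (the part before the first dot) form one sample, and the extension names the field. The helper names `build_shard` and `group_samples` and the sample keys are hypothetical and are not part of the AIStore SDK or the WebDataset library.

```python
import io
import tarfile
from collections import defaultdict


def build_shard(samples):
    """Pack {key: {extension: bytes}} into an in-memory TAR archive (one shard)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for key, fields in samples.items():
            for ext, payload in fields.items():
                # Members of the same sample share a base name, e.g. 000001.jpg and 000001.cls
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def group_samples(shard_bytes):
    """Read a shard back and regroup its members into samples by base name."""
    grouped = defaultdict(dict)
    with tarfile.open(fileobj=io.BytesIO(shard_bytes), mode="r") as tar:
        for member in tar.getmembers():
            key, _, ext = member.name.partition(".")
            grouped[key][ext] = tar.extractfile(member).read()
    return dict(grouped)


# Placeholder payloads; a real shard would hold encoded image bytes.
samples = {
    "000001": {"jpg": b"<jpeg bytes>", "cls": b"cat"},
    "000002": {"jpg": b"<jpeg bytes>", "cls": b"dog"},
}
shard = build_shard(samples)
restored = group_samples(shard)
```

Because a shard is an ordinary TAR archive, the archive operations linked above apply to it directly.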

## Features

- **[Writing Datasets](#writing-datasets)**:
  - Write datasets to a bucket in the WebDataset format. This format is well suited to large-scale machine learning datasets because of its efficiency and flexibility.

We plan to extend the dataset module with the following features:

- **Dataset Reader**: Efficient reading and manipulation of datasets stored in AIStore directly through the Python SDK.
- **ETL Operations**: Integration of [ETL](https://aiatscale.org/docs/etl) capabilities to support complex data transformations and processing during data preparation.
- **Dsort**: Integration of [Dsort](https://github.com/NVIDIA/aistore/blob/main/docs/dsort.md) to provide high-performance, scalable sorting of data within AIStore.

## Writing Datasets

The `write_dataset` function writes datasets as shards directly into a bucket using the WebDataset `ShardWriter`. Here's how you can use the `write_dataset` function:

```python
import os
from pathlib import Path

from aistore.sdk import Client
from aistore.sdk.dataset.dataset_config import DatasetConfig
from aistore.sdk.dataset.data_attribute import DataAttribute
from aistore.sdk.dataset.label_attribute import LabelAttribute

# Connect to the cluster; falls back to a local deployment if AIS_ENDPOINT is unset
ais_url = os.getenv("AIS_ENDPOINT", "http://localhost:8080")
client = Client(ais_url)
bucket = client.bucket("my-bck").create(exist_ok=True)

# Describe the dataset: images are the primary attribute, class labels are secondary.
# `your_class_lookup_fn` is a placeholder; replace it with your own callable.
dataset_config = DatasetConfig(
    primary_attribute=DataAttribute(path=Path("your/image/directory"), file_type="jpg", name="image"),
    secondary_attributes=[
        LabelAttribute(name="cls", label_identifier=your_class_lookup_fn)
    ],
)

# Write the dataset to the bucket as WebDataset shards
bucket.write_dataset(config=dataset_config, pattern="img_dataset")
```
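
The `your_class_lookup_fn` placeholder above is left for you to supply. As a rough sketch, assuming the callable receives a sample's file name as a string and returns its label (this README does not spell out the exact contract, so check it against the SDK source), a lookup based on a file-name prefix could look like the following. The `CLASS_BY_PREFIX` mapping and the `cat_0001.jpg` naming scheme are hypothetical.

```python
from pathlib import Path

# Hypothetical mapping from a file-name prefix to a class label
CLASS_BY_PREFIX = {"cat": "0", "dog": "1"}


def your_class_lookup_fn(filename: str):
    """Return the label for names like 'cat_0001' or 'cat_0001.jpg', or None if unknown."""
    stem = Path(filename).stem          # drop any directory and the final extension
    prefix = stem.split("_", 1)[0]      # text before the first underscore
    return CLASS_BY_PREFIX.get(prefix)
```

Returning `None` for unknown names keeps the function total; how missing labels are treated when writing shards depends on the SDK and is not covered here.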

**Note:** This is a beta feature and is still in development.

## References

- [WebDataset](https://github.com/webdataset/webdataset)
- [AIStore Archive Operations](https://aiatscale.org/docs/archive)
- [AIStore and TAR Append](https://aiatscale.org/blog/2021/08/10/tar-append)