# Dataset Module for Python SDK

The Dataset Module in the Python SDK provides tools to create and manage datasets for data science and machine learning projects. It simplifies data transformation, loading, and storage, giving users a streamlined workflow. Datasets are stored in the [WebDataset](https://github.com/webdataset/webdataset) format, with support for a number of different [backend providers](https://aiatscale.org/docs/providers).

WebDataset is the chosen format because it fits AIStore's ability to manage TAR files efficiently. This choice improves performance and flexibility for operations such as reading, writing, listing, and appending to TAR archives. For more details on AIStore's handling of TAR files, see:

- [AIStore Archive Operations](https://aiatscale.org/docs/archive)
- [AIStore and TAR Append](https://aiatscale.org/blog/2021/08/10/tar-append)

## Features

- **[Writing Datasets](#writing-datasets)**:
  - Write datasets to a bucket in the WebDataset format. This format is particularly suited for large-scale machine learning datasets due to its efficiency and flexibility.

In the future, we plan to expand the dataset module with the following features:

- **Dataset Reader**: Enabling efficient reading and manipulation of datasets stored in AIStore directly through the Python SDK.
- **ETL Operations**: Integrating [ETL](https://aiatscale.org/docs/etl) capabilities to support complex data transformations and processing during data preparation.
- **Dsort**: Implementing [Dsort](https://github.com/NVIDIA/aistore/blob/main/docs/dsort.md) to provide high-performance, scalable sorting of data within AIStore.
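
For readers new to the WebDataset format: a shard is an ordinary TAR archive in which files that share a basename (for example `sample_000.jpg` and `sample_000.cls`) together form one sample. The sketch below illustrates that layout using only the Python standard library. It does not use the AIStore SDK or the `webdataset` package, and the helper names `build_shard` and `read_shard`, as well as the sample data, are hypothetical and exist only for illustration. It also assumes flat member names with no directories.

```python
import io
import tarfile
from collections import defaultdict


def build_shard(samples):
    """Pack {key: {extension: bytes}} into an in-memory TAR shard."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for key, fields in samples.items():
            for ext, payload in fields.items():
                # Each field of a sample becomes one TAR member named "<key>.<ext>".
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def read_shard(shard_bytes):
    """Group the members of a TAR shard back into samples by basename."""
    samples = defaultdict(dict)
    with tarfile.open(fileobj=io.BytesIO(shard_bytes), mode="r") as tar:
        for member in tar.getmembers():
            # Everything before the first dot is the sample key; the rest is the field name.
            key, _, ext = member.name.partition(".")
            samples[key][ext] = tar.extractfile(member).read()
    return dict(samples)


# Hypothetical sample data: placeholder image bytes plus a class label.
shard = build_shard({
    "sample_000": {"jpg": b"fake-jpeg-0", "cls": b"cat"},
    "sample_001": {"jpg": b"fake-jpeg-1", "cls": b"dog"},
})
restored = read_shard(shard)
print(sorted(restored))
print(restored["sample_000"]["cls"])
```

Because each shard is a plain TAR file, the archive operations linked above (listing, reading, appending) apply to it directly.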

## Writing Datasets

The `write_dataset` function writes datasets as shards directly into a bucket using the WebDataset `ShardWriter`. The following example shows how to use it:

```python
import os
from pathlib import Path

from aistore.sdk import Client
from aistore.sdk.dataset.dataset_config import DatasetConfig
from aistore.sdk.dataset.data_attribute import DataAttribute
from aistore.sdk.dataset.label_attribute import LabelAttribute

# Connect to the cluster; fall back to a local endpoint if AIS_ENDPOINT is unset
ais_url = os.getenv("AIS_ENDPOINT", "http://localhost:8080")
client = Client(ais_url)
bucket = client.bucket("my-bck").create(exist_ok=True)

# `your_class_lookup_fn` is a placeholder: supply your own callable
# that determines the class label for each sample
dataset_config = DatasetConfig(
    primary_attribute=DataAttribute(
        path=Path("your/image/directory"), file_type="jpg", name="image"
    ),
    secondary_attributes=[
        LabelAttribute(name="cls", label_identifier=your_class_lookup_fn)
    ],
)

# Write the dataset to the bucket as WebDataset shards
bucket.write_dataset(config=dataset_config, pattern="img_dataset")
```

**Note:** This is a beta feature and is still in development.

## References

- [WebDataset](https://github.com/webdataset/webdataset)
- [AIStore Archive Operations](https://aiatscale.org/docs/archive)
- [AIStore and TAR Append](https://aiatscale.org/blog/2021/08/10/tar-append)