---
layout: post
title:  "Python SDK: Getting Started"
date:   Jul 20, 2022
author: Ryan Koo
categories: aistore python sdk
---

# Python SDK: Getting Started

Python has established itself as the language of choice among data scientists and machine learning developers. Much of Python's recent popularity in the field can be attributed to its general *ease-of-use*, especially with the popular machine learning framework [PyTorch](https://pytorch.org/), which provides a Python-first API.

[AIStore Python SDK](https://github.com/NVIDIA/aistore/tree/main/python/aistore) is a project which includes a growing library of client-side APIs to easily access and utilize AIStore clusters, objects, and buckets, as well as a number of tools for AIStore usage/integration with PyTorch.

The [AIStore Python API](https://aiatscale.org/docs/python-api) is essentially a Python port of AIStore's [Go APIs](https://github.com/NVIDIA/aistore/tree/main/api). The two are quite similar in terms of functionality: both make simple [HTTP requests](https://aiatscale.org/docs/http-api#api-reference) to an AIStore endpoint, through which they read and write an AIStore instance's data and metadata. The API provides convenient and flexible ways (similar to those provided by the [CLI](https://aiatscale.org/docs/cli)) to move data (as objects) in and out of buckets on AIStore, manage AIStore clusters, and much more.
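
Under the hood, each of these API calls translates to one such HTTP request. As a rough illustration (a minimal sketch using the `requests` library; the bucket and object names are hypothetical), getting an object boils down to:

```python
import requests

# GET an object via AIStore's REST API; roughly what
# client.bucket("mybucket").object("myobject").get() does under the hood
resp = requests.get("http://localhost:51080/v1/objects/mybucket/myobject")
resp.raise_for_status()
data = resp.content  # raw object bytes
```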

This technical blog will demonstrate a few of the ways the Python API provided in the [Python SDK](https://github.com/NVIDIA/aistore/tree/main/python/aistore) can be used with a running AIStore instance to manage and utilize data.

## Getting Started

### Installing & Deploying AIStore

The latest AIStore release can be easily installed either with Anaconda or `pip`:

```console
$ conda install aistore
```

```console
$ pip install aistore
```

> Note that only Python 3.x (version 3.6 or later) is currently supported for AIStore.
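
To quickly verify the installation (a minimal sketch; this assumes the `aistore` package exposes a `__version__` attribute):

```python
import aistore

# Print the installed SDK version
print(aistore.__version__)
```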

While there are a number of options available for deploying AIStore, as demonstrated [here](https://github.com/NVIDIA/aistore/blob/main/docs/getting_started.md), for the sake of simplicity we will be using AIStore's [minimal standalone docker deployment](https://github.com/NVIDIA/aistore/blob/main/deploy/prod/docker/single/README.md):

```console
# Deploying the AIStore cluster in a container on port 51080
docker run -d \
    -p 51080:51080 \
    -v /disk0:/ais/disk0 \
    aistore/cluster-minimal:latest
```

### Moving Data To AIStore

Let's say we want to move a copy of the [TinyImageNet](https://paperswithcode.com/dataset/tiny-imagenet) dataset from our local filesystem to a bucket on our running instance of AIStore.

First, we import the Python API and initialize the client to the running instance of AIStore:

```python
from aistore import Client

client = Client("http://localhost:51080")
```

Before moving any data into AIStore, we can first check that AIStore is fully deployed and ready:

```python
client.cluster().is_aistore_running()
```
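
Since a freshly started container may take a few moments to come up, a simple poll (a minimal sketch; the timeout and interval here are arbitrary) can wait until the cluster reports ready:

```python
import time

# Poll the cluster for up to ~30 seconds, checking once per second
for _ in range(30):
    if client.cluster().is_aistore_running():
        break
    time.sleep(1)
else:
    raise RuntimeError("AIStore did not become ready in time")
```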

Once AIStore is verified as running, moving the dataset to a bucket on AIStore in its *compressed* format is as easy as:

```python
BUCKET_NAME = "tinyimagenet_compressed"
COMPRESSED_TINYIMAGENET = "~/Datasets/tinyimagenet-compressed.zip"
OBJECT_NAME = "tinyimagenet-compressed.zip"

# Create a new bucket [BUCKET_NAME] to store dataset
client.bucket(BUCKET_NAME).create()

# Verify bucket creation operation
client.cluster().list_buckets()

# Put dataset [COMPRESSED_TINYIMAGENET] in bucket [BUCKET_NAME] as object with name [OBJECT_NAME]
client.bucket(BUCKET_NAME).object(OBJECT_NAME).put(COMPRESSED_TINYIMAGENET)

# Verify object put operation
client.bucket(BUCKET_NAME).list_objects().get_entries()
```

Say we now want to instead move an *uncompressed* version of TinyImageNet to AIStore. The uncompressed format of TinyImageNet consists of several sub-directories which divide the dataset's many image samples into separate sets (train, validation, test) as well as separate classes (based on numbers mapped to image labels).

As opposed to traditional file storage systems, which operate on the concept of multi-level directories and sub-directories, object storage systems such as AIStore maintain a *strict* two-level hierarchy of *buckets* and *objects*. However, we can still maintain a "symbolic" directory structure by encoding it into object names. For example, a local file `train/n01443537/img_001.JPEG` (a hypothetical path) can be stored as an object whose name is that same relative path.

We can move the dataset to an AIStore bucket while preserving the directory-based structure of the dataset by using the bucket `put_files` method along with the `recursive` option:

```python
BUCKET_NAME = "tinyimagenet_uncompressed"
TINYIMAGENET_DIR = "<local-path-to-dataset>/tinyimagenet/"

# Create a new bucket [BUCKET_NAME] to store dataset
bucket = client.bucket(BUCKET_NAME).create()

# Put all files under [TINYIMAGENET_DIR], including those in sub-directories
bucket.put_files(TINYIMAGENET_DIR, recursive=True)

# Verify object put operations
bucket.list_objects().get_entries()
```

### Getting Data From AIStore

Getting the *compressed* TinyImageNet dataset from AIStore bucket `ais://tinyimagenet_compressed` is as easy as:

```python
BUCKET_NAME = "tinyimagenet_compressed"
OBJECT_NAME = "tinyimagenet-compressed.zip"

# Get object [OBJECT_NAME] from bucket [BUCKET_NAME]
client.bucket(BUCKET_NAME).object(OBJECT_NAME).get()
```

If we want to get the *uncompressed* TinyImageNet from AIStore bucket `ais://tinyimagenet_uncompressed`, we can easily do that with [Bucket.list_objects()](https://aiatscale.org/docs/python-api#bucket.Bucket.list_objects) and [Object.get()](https://aiatscale.org/docs/python-api#object.Object.get):

```python
BUCKET_NAME = "tinyimagenet_uncompressed"

# List all objects in bucket [BUCKET_NAME]
TINYIMAGENET_UNCOMPRESSED = client.bucket(BUCKET_NAME).list_objects().get_entries()

for entry in TINYIMAGENET_UNCOMPRESSED:
    # Get object [entry.name] from bucket [BUCKET_NAME]
    client.bucket(BUCKET_NAME).object(entry.name).get()
```
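
Note that [Object.get()](https://aiatscale.org/docs/python-api#object.Object.get) returns the object's content as a stream. To keep local copies, we can read each stream and write it out, mirroring the objects' "symbolic" directory structure on disk (a minimal sketch; it assumes the returned stream exposes a `read_all()` method):

```python
import os

BUCKET_NAME = "tinyimagenet_uncompressed"

for entry in client.bucket(BUCKET_NAME).list_objects().get_entries():
    # Recreate the object's "symbolic" directory structure locally
    local_path = os.path.join("tinyimagenet", entry.name)
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    with open(local_path, "wb") as f:
        f.write(client.bucket(BUCKET_NAME).object(entry.name).get().read_all())
```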

We can also pick a *specific* section of the uncompressed dataset and get only those objects. By supplying a `prefix` to our [Bucket.list_objects()](https://aiatscale.org/docs/python-api#bucket.Bucket.list_objects) call, we can make use of the *symbolic* file system and list only the contents of our desired directory:

```python
BUCKET_NAME = "tinyimagenet_uncompressed"

# List only objects with prefix "validation/" in bucket [tinyimagenet_uncompressed]
TINYIMAGENET_UNCOMPRESSED_VAL = client.bucket(BUCKET_NAME).list_objects(prefix="validation/").get_entries()

for entry in TINYIMAGENET_UNCOMPRESSED_VAL:
    # Get objects with prefix "validation/" from bucket [tinyimagenet_uncompressed]
    client.bucket(BUCKET_NAME).object(entry.name).get()
```

### External Cloud Storage Providers

AIStore also supports third-party remote backends, including Amazon S3, Google Cloud, and Microsoft Azure.

> For exact definitions and related capabilities, please see [terminology](https://aiatscale.org/docs/overview#terminology).

We shut down the previous instance of AIStore and re-deploy it with AWS S3 and GCP backends attached:

```console
# Similarly deploying AIStore cluster in a container on port 51080, but with GCP and AWS backends attached
docker run -d \
    -p 51080:51080 \
    -v <path_to_gcp_config>.json:/credentials/gcp.json \
    -e GOOGLE_APPLICATION_CREDENTIALS="/credentials/gcp.json" \
    -e AWS_ACCESS_KEY_ID="AWSKEYIDEXAMPLE" \
    -e AWS_SECRET_ACCESS_KEY="AWSSECRETACCESSKEYEXAMPLE" \
    -e AWS_REGION="us-east-2" \
    -e AIS_BACKEND_PROVIDERS="gcp aws" \
    -v /disk0:/ais/disk0 \
    aistore/cluster-minimal:latest
```
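
Once the cluster is back up, we can check that the remote backends are reachable by listing their buckets (a quick sketch; it assumes `list_buckets` accepts the same `provider` argument used throughout the examples below):

```python
# List buckets visible through the GCP and AWS backends, respectively
client.cluster().list_buckets(provider="gcp")
client.cluster().list_buckets(provider="aws")
```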

> Deploying an AIStore cluster with third-party cloud backends simply *imports/copies the buckets and objects from the provided third-party backends to AIStore*. The client-side APIs themselves do **not** interact with the actual external backends at any point; they interact only with the copies of those external cloud storage buckets residing in the AIStore cluster.

The [Object.get()](https://aiatscale.org/docs/python-api#object.Object.get) method works with external cloud storage buckets as well. We can use it in a similar fashion as shown previously to get either a compressed or uncompressed version of the dataset from, for example, `gcp://tinyimagenet_compressed` and `gcp://tinyimagenet_uncompressed`:

```python
# Getting compressed TinyImageNet dataset from [gcp://tinyimagenet_compressed]
BUCKET_NAME = "tinyimagenet_compressed"
OBJECT_NAME = "tinyimagenet-compressed.zip"
client.bucket(BUCKET_NAME, provider="gcp").object(OBJECT_NAME).get()


# Getting uncompressed TinyImageNet dataset from [gcp://tinyimagenet_uncompressed]
BUCKET_NAME = "tinyimagenet_uncompressed"
TINYIMAGENET_UNCOMPRESSED = client.bucket(BUCKET_NAME, provider="gcp").list_objects().get_entries()
for entry in TINYIMAGENET_UNCOMPRESSED:
    client.bucket(BUCKET_NAME, provider="gcp").object(entry.name).get()


# Getting only objects with prefix "validation/" from bucket [gcp://tinyimagenet_uncompressed]
TINYIMAGENET_UNCOMPRESSED_VAL = client.bucket(BUCKET_NAME, provider="gcp").list_objects(prefix="validation/").get_entries()
for entry in TINYIMAGENET_UNCOMPRESSED_VAL:
    client.bucket(BUCKET_NAME, provider="gcp").object(entry.name).get()
```

> Note the added argument `provider` supplied to [`Client.bucket()`](https://aiatscale.org/docs/python-api#api.Client.bucket) in the examples shown above.

We can instead choose to *copy* the contents of an external cloud storage bucket on AIStore to a native AIStore (`ais://`) bucket with [`Bucket.copy()`](https://aiatscale.org/docs/python-api#bucket.Bucket.copy) as well:

```python
# Copy bucket [gcp://tinyimagenet_uncompressed] and its objects to new bucket [ais://tinyimagenet_validationset]
FROM_BUCKET = "tinyimagenet_uncompressed"
TO_BUCKET = "tinyimagenet_validationset"
client.bucket(FROM_BUCKET, provider="gcp").copy(TO_BUCKET)

# Evict external cloud storage bucket [gcp://tinyimagenet_uncompressed] if no longer needed (to free space on the cluster)
client.bucket(FROM_BUCKET, provider="gcp").evict()
```
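
As with the earlier put operations, we can verify the copy by listing the objects now residing in the destination bucket:

```python
# Verify the bucket copy operation
client.bucket(TO_BUCKET).list_objects().get_entries()
```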

Eviction of a cloud storage bucket removes any instance of the bucket (and its objects) from the AIStore cluster. Eviction does **not** delete or otherwise affect the actual cloud storage bucket (in AWS S3, GCP, or Azure).


## PyTorch

PyTorch provides built-in [tools](https://github.com/pytorch/data/tree/main/torchdata/datapipes/iter/load#aistore-io-datapipe) for AIStore integration, allowing machine learning developers to easily use AIStore as a viable storage option with PyTorch. In fact, the data-loading classes [`AISFileLister`](https://pytorch.org/data/main/generated/torchdata.datapipes.iter.AISFileLister.html#aisfilelister) and [`AISFileLoader`](https://pytorch.org/data/main/generated/torchdata.datapipes.iter.AISFileLoader.html#torchdata.datapipes.iter.AISFileLoader) found in PyTorch's [`aisio.py`](https://github.com/pytorch/data/blob/main/torchdata/datapipes/iter/load/aisio.py) make use of several of the client-side APIs referenced in this article.
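
For instance, the two datapipes can be chained to list and then load objects from a bucket like the ones used above (a brief sketch; the bucket URL below is hypothetical):

```python
from torchdata.datapipes.iter import AISFileLister, AISFileLoader

AIS_URL = "http://localhost:51080"

# List object URLs under the given bucket/prefix
lister = AISFileLister(url=AIS_URL, source_datapipe=["ais://tinyimagenet_uncompressed/validation/"])

# Load each listed object as a (url, stream) pair
loader = AISFileLoader(url=AIS_URL, source_datapipe=lister)

for url, stream in loader:
    data = stream.read()  # raw bytes of the object
```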

For more information on data loading from AIStore with PyTorch, please refer to this [article](https://aiatscale.org/blog/2022/07/12/aisio-pytorch).


## More Examples & Resources

For more examples, please refer to the [AIStore Python SDK](https://github.com/NVIDIA/aistore/tree/main/python/aistore) documentation and try out the [SDK tutorial (Jupyter Notebook)](https://github.com/NVIDIA/aistore/blob/main/python/aistore/sdk-tutorial.ipynb).

For information on specific API usage, please refer to the [API reference](https://aiatscale.org/docs/python-api).


## References

* [AIStore GitHub](https://github.com/NVIDIA/aistore)
* [AIStore Go API](https://github.com/NVIDIA/aistore/tree/main/api)
* [AIStore Python SDK](https://github.com/NVIDIA/aistore/tree/main/python/aistore)
* [Documentation](https://aiatscale.org/docs)
* [Official AIStore PIP Package](https://pypi.org/project/aistore/)