---
layout: post
title:  "AIStore & ETL: Introduction (post #1)"
date:   Oct 21, 2021
author: Alex Aizman, Janusz Marcinkiewicz, Prashanth Dintyala
categories: aistore etl pytorch
---

[AIStore](https://github.com/NVIDIA/aistore) (AIS) is a reliable, lightweight storage cluster that deploys anywhere, runs user containers and functions, and scales linearly with no limitations. Its development has been inspired by the unique demands of deep-learning apps - particularly, the need to pre-process, post-process, or otherwise reformat and augment raw data in order to uncover its hidden value and purpose.

> Custom data-transforming operations, usually organized into some sort of input and output pipelines that run before, during, and/or after ML training, have become so pervasive and so widely used that, *in summa* - and notwithstanding the broad vagueness of the definition and the lack of [inline citations](https://en.wikipedia.org/wiki/Extract,_transform,_load) - they have come to be widely known as ETL.

The long and short of it is that we believe ETL must run **only and exclusively inside** the storage system - a system that has been built to support just that. The only question is - *how*? How do we *offload* I/O-intensive data-transforming pipelines *to the server*? And how do we make that happen without changing the client-side code?

This text opens a series of blog posts in which we intend to start answering those questions with simple usage examples and code snippets that anyone can run. For starters, we'll use an **ImageNet**-derived dataset that has already been pre-**sharded**.

> [ImageNet](https://www.image-net.org/download.php) is, without argument, the most popular deep-learning dataset of the last decade. It is, simultaneously, a textbook example of a small-file dataset.

The motivation to convert (small-file) samples into larger shards - shards that would optimally contain batches of the original samples - exists and only grows with the size of the dataset in question. ImageNet, once [inflated](https://arxiv.org/abs/2001.01858) beyond the capacity of a typical server, makes a good case in point. More on that [here](https://eng.uber.com/scaling-hdfs/) and [here](http://www.acadpubl.eu/hub/2018-119-15/2/301.pdf).

AIS, on the other hand, has been built to conveniently support pre-sharding, post-sharding, de-sharding and, generally, handling serialized archives transparently from the user's perspective. There are currently 3 (three) equally supported [archival formats](https://github.com/NVIDIA/aistore/releases) that training apps can read and write without paying attention to the formats' specifics. Implementation-wise, archiving is realized as an asynchronous multi-object batch operation that gathers arbitrary ranges of samples from across the cluster and combines them into archival shards in a streaming fashion. `APPEND` (to an existing archive, aka shard) is also supported, and much more.

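For a sense of how this looks in practice, here is a minimal sketch of the multi-object archiving flow using the aistore Python SDK. The endpoint, bucket name, and object-name template below are made up for illustration, and the exact SDK calls may differ across releases:

```python
# A minimal sketch, not a definitive recipe: assumes an AIS gateway at
# http://localhost:8080; bucket and object names are illustrative only.
from aistore.sdk import Client

client = Client("http://localhost:8080")
bucket = client.bucket("imagenet")

# Gather a range of samples from across the cluster and serialize them
# into a single TAR shard - an asynchronous multi-object batch operation.
job_id = bucket.objects(obj_template="train-{000000..000999}.jpg").archive(
    archive_name="shard-000.tar"
)

# Wait for the batch job to finish.
client.job(job_id).wait()
```
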
## TL;DR

This post is the first in an upcoming mini-series. We'll gradually introduce AIStore - an open-source, immediately-deployable-anywhere, scalable specialized storage - and the tooling around it that assists AI researchers and data scientists.

Many of these tools are early-stage and developing as we speak. One example is `aistore.pytorch.Dataset` - a subclass of the familiar [`torch.utils.data.Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) that allows running arbitrary `torchvision`-based transforms *on the server* - i.e., inside (and by) your AIStore cluster.

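To make that concrete, here's a hedged sketch of the intended usage. Since the tooling is still developing, the constructor arguments shown below (`client_url`, `urls_list`, `etl_name`) are illustrative placeholders rather than a finalized API:

```python
# A sketch only: aistore.pytorch.Dataset is under active development,
# and the argument names below are illustrative, not a stable API.
from torch.utils.data import DataLoader
from aistore.pytorch import Dataset  # subclass of torch.utils.data.Dataset

train_set = Dataset(
    client_url="http://aistore-gateway:8080",  # AIS proxy (gateway) endpoint
    urls_list="ais://imagenet-shards",         # the pre-sharded dataset
    etl_name="img-transform",                  # torchvision-based transform that
)                                              # runs inside the AIStore cluster

# From here on, it's business-as-usual PyTorch: the DataLoader iterates over
# samples that AIStore has already transformed server-side.
loader = DataLoader(train_set, batch_size=64, num_workers=4)
```
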
Schematically, at a comfortably high level, the resulting picture will look as follows:

![AIS-ETL Block Diagram](/assets/ais_etl_series/block-diag-1.png)

Here we have a Kubernetes cluster that runs an AIS cluster (with the [AIS/K8s Operator](https://github.com/NVIDIA/ais-k8s/tree/master/operator), along with many other implementation details, not shown).

Each K8s node contains an AIS target (that has *disks*) and, optionally, an AIS proxy (aka gateway) responsible for the control plane - specifically, for routing I/O requests.

In addition, there's a locally running ETL container - local in the sense that the *transforming* data flows between it and its peer AIS target within a given K8s node. To accommodate a variety of ETL containers and their respective *runtimes*, we support multiple communication mechanisms (currently 4, to be precise) - more about all of that in our next post.

## References

1. **High Performance I/O For Large Scale Deep Learning**, https://arxiv.org/abs/2001.01858
2. **Efficient PyTorch I/O library for Large Datasets, Many Files, Many GPUs**, https://pytorch.org/blog/efficient-pytorch-io-library-for-large-datasets-many-files-many-gpus
3. **AIS ETL: Getting Started, Tutorial, Inline and Offline examples, Kubernetes deployment**, https://github.com/NVIDIA/aistore/blob/main/docs/etl.md
4. **GitHub open source**:
   - [AIStore](https://github.com/NVIDIA/aistore)
   - [AIS/Kubernetes Operator, AIS on bare-metal, Deployment Playbooks, Helm](https://github.com/NVIDIA/ais-k8s)
   - [AIS-ETL containers and specs](https://github.com/NVIDIA/ais-etl)
5. **AI-at-Scale** documentation and blogs, https://aiatscale.org