---
layout: post
title: "AIStore & ETL: Introduction (post #1)"
date: Oct 21, 2021
author: Alex Aizman, Janusz Marcinkiewicz, Prashanth Dintyala
categories: aistore etl pytorch
---

[AIStore](https://github.com/NVIDIA/aistore) (AIS) is a reliable, lightweight storage cluster that deploys anywhere, runs user containers and functions, and scales linearly with no limitations. The development has been inspired by the unique demands of deep-learning apps - particularly, the need to pre-process, post-process, or otherwise reformat and augment raw data in order to uncover its hidden value and purpose.

> Custom data-transforming operations, usually organized into input and output pipelines that run before, during, and/or after ML training, have become so pervasive and so widely used that, *in summa* - and notwithstanding the broad vagueness of the definition and the lack of [inline citations](https://en.wikipedia.org/wiki/Extract,_transform,_load) - they came to be widely known as ETL.

The long and short of it is that we believe ETL must run **only and exclusively inside** a storage system - a system that has been built to support just that. The only question is - *how*? How to *offload* I/O-intensive data-transforming pipelines *to the server*? And how to make it happen without changing the client-side code?

This text opens a series of blog posts where we intend to start answering those questions with simple usage examples and code snippets that anyone can run. For starters, we'll use an **ImageNet**-derived dataset that has already been pre-**sharded**.

> [ImageNet](https://www.image-net.org/download.php) is, without argument, the most popular deep-learning dataset of the last decade. It is, simultaneously, a textbook example of a small-file dataset.

The motivation to convert (small-file) samples into larger shards - each optimally containing a batch of original samples - exists, and it grows with the size of the dataset in question. ImageNet, once [inflated](https://arxiv.org/abs/2001.01858) beyond the capacity of a typical server, makes a good case in point. More on that [here](https://eng.uber.com/scaling-hdfs/) and [here](http://www.acadpubl.eu/hub/2018-119-15/2/301.pdf).

AIS, on the other hand, has been built to conveniently support pre-sharding, post-sharding, de-sharding, and generally handling serialized archives transparently from the user's perspective. There are currently 3 (three) equally supported [archival formats](https://github.com/NVIDIA/aistore/releases), whereby training apps can read and write archived data without paying attention to the underlying serialization. Implementation-wise, archiving is realized as an asynchronous multi-object batch operation that gathers arbitrary ranges of samples from across the cluster and combines them into archival shards in a streaming fashion. `APPEND` (to an existing archive, aka shard) is also supported, and much more.

## TL;DR

This post is the first in an upcoming mini-series. In it, we'll gradually introduce AIStore - an open-source, immediately-deployable-anywhere, scalable specialized storage - and the tooling around it that assists AI researchers and data scientists.

Many of these tools are early-stage - developing as we speak. One example is `aistore.pytorch.Dataset` - a subclass of the familiar [`torch.utils.data.Dataset`](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) that allows running arbitrary `torchvision`-based transforms *on the server*, i.e., inside (and by) your AIStore cluster.
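To give a feel for the client side of that idea, here's a minimal sketch using nothing beyond stock PyTorch and HTTP: a custom `torch.utils.data.Dataset` whose `__getitem__` GETs each object from an AIS gateway, with the transform already applied inside the cluster. The endpoint layout and the `etl` query parameter below are illustrative assumptions, not the actual `aistore.pytorch.Dataset` interface - see the [AIS ETL docs](https://github.com/NVIDIA/aistore/blob/main/docs/etl.md) for the real thing.

```python
import io
from typing import List

import requests
import torch
from torch.utils.data import Dataset


class RemoteTransformedDataset(Dataset):
    """Sketch of a dataset whose transforms run server-side.

    Each __getitem__ issues a GET that the AIS gateway redirects to the
    target owning the object; the ETL container co-located with that
    target transforms the bytes before they ever leave the node.

    NOTE: the URL layout and the `etl` parameter are assumptions made
    for illustration only, not the actual aistore.pytorch API.
    """

    def __init__(self, gateway: str, bucket: str, names: List[str], etl_id: str):
        self.gateway = gateway  # e.g., "http://ais-gateway:51080" (hypothetical)
        self.bucket = bucket    # e.g., "imagenet-train"
        self.names = names      # object names within the bucket
        self.etl_id = etl_id    # ID of an already-started ETL container

    def __len__(self) -> int:
        return len(self.names)

    def __getitem__(self, idx: int) -> torch.Tensor:
        url = f"{self.gateway}/v1/objects/{self.bucket}/{self.names[idx]}"
        resp = requests.get(url, params={"etl": self.etl_id})
        resp.raise_for_status()
        # Assume the server-side transform emits a serialized tensor.
        return torch.load(io.BytesIO(resp.content))
```

Nothing about the training loop changes: the dataset plugs into a regular `torch.utils.data.DataLoader`, while the I/O-heavy transforming happens on the nodes that already store the data.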
Schematically, at a comfortably high level, the resulting picture will look as follows:

![AIS-ETL Block Diagram](/assets/ais_etl_series/block-diag-1.png)

Here we have a Kubernetes cluster that runs an AIS cluster (with the [AIS/K8s Operator](https://github.com/NVIDIA/ais-k8s/tree/master/operator), along with many other implementation details, not shown).

Each K8s node contains an AIS target (that has *disks*) and, optionally, an AIS proxy (aka gateway) responsible for the control plane - specifically, for routing I/O requests.

In addition, there's a locally running ETL container - local as far as the *transforming* data flow between itself and its peer AIS target within a given K8s node. We support multiple communication mechanisms (currently 4, to be precise) to accommodate a variety of ETL containers and their respective *runtimes* - more about all of that in our next post.

## References

1. **High Performance I/O For Large Scale Deep Learning**, https://arxiv.org/abs/2001.01858
2. **Efficient PyTorch I/O library for Large Datasets, Many Files, Many GPUs**, https://pytorch.org/blog/efficient-pytorch-io-library-for-large-datasets-many-files-many-gpus
3. **AIS ETL: Getting Started, Tutorial, Inline and Offline examples, Kubernetes deployment**, https://github.com/NVIDIA/aistore/blob/main/docs/etl.md
4. **GitHub open source**:
   - [AIStore](https://github.com/NVIDIA/aistore)
   - [AIS/Kubernetes Operator, AIS on bare-metal, Deployment Playbooks, Helm](https://github.com/NVIDIA/ais-k8s)
   - [AIS-ETL containers and specs](https://github.com/NVIDIA/ais-etl)
5. **AI-at-Scale** documentation and blogs, https://aiatscale.org