github.com/NVIDIA/aistore@v1.3.23-0.20240517131212-7df6609be51d/docs/_posts/2021-12-15-whats-new-in-v3.8.md

---
layout: post
title:  "What's new in AIS v3.8"
date:   Dec 15, 2021
author: Alex Aizman
categories: aistore
---

AIStore v3.8 is a significant upgrade delivering [long-awaited features, stabilization fixes, and performance improvements](https://github.com/NVIDIA/aistore/releases/tag/3.8). There's also the cumulative effect of continuous functional and stress testing combined with (continuous) refactoring to optimize and reinforce the codebase.

In other words, v3.8 marks a *milestone* that includes:

## ETL

AIS-ETL is designed around the idea of running custom *transforming* containers directly on AIS target nodes. A typical flow includes the following steps:

1. User initiates ETL workload by executing one of the documented API calls
   and providing either the corresponding Docker image or a *transforming function* (e.g. Python script);
2. AIS gateway coordinates the deployment of ETL containers (aka K8s pods) on AIS targets: one container per target;
3. Each target creates a local `communicator` instance for the specified `communication type`.

Prior to 3.8, [supported communication types](/docs/etl.md) were all HTTP-based. For instance, the existing ["hpull://"](/docs/etl.md#communication-mechanisms) facilitates HTTP-redirect type communication, with the AIS target redirecting original read requests to the local ETL container. Version 3.8 adds a non-HTTP communicator (denoted as "io://") and removes the requirement to wrap your custom transforming logic into some sort of HTTP processing.

The new "io://" communicator acts as a simple executor of external commands *by* the ETL container. On its end, the AIS target captures the resulting standard output (containing transformed bytes) and standard error. This may not be the most performant solution, but it is certainly the easiest one to implement.
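
The "io://" idea of running an external command and capturing its output can be illustrated with a minimal Python sketch. This is not AIS code: `run_transform` and `UPPERCASE_CMD` are hypothetical names, and the sketch assumes the object bytes are handed to the command on standard input (the text above only states that standard output and standard error are captured).

```python
import subprocess
import sys
from typing import List, Tuple


def run_transform(cmd: List[str], payload: bytes) -> Tuple[bytes, bytes]:
    """Run an external command, feed it `payload`, and capture stdout/stderr."""
    # Assumption: the bytes to transform are passed via standard input.
    proc = subprocess.run(cmd, input=payload, capture_output=True, check=True)
    return proc.stdout, proc.stderr


# Hypothetical transform: upper-case whatever arrives on standard input.
UPPERCASE_CMD = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read().upper())",
]

if __name__ == "__main__":
    transformed, errors = run_transform(UPPERCASE_CMD, b"hello ais")
    print(transformed)
```

The point of the sketch is that the transforming logic itself needs no HTTP handling at all: it only reads bytes and writes bytes.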

Additionally, v3.8 integrates ETL (jobs) with [xactions](/docs/batch.md), thus providing consistency in terms of starting/stopping and managing/monitoring. All existing APIs and [CLIs](/docs/cli/job.md) that are common for all [xactions](/docs/batch.md) are supported out of the box.

Finally, v3.8 introduces persistent ETL metadata as a new replicated-versioned-and-protected metadata type. The implementation leverages the existing mechanism to keep clustered nodes in-sync with added, removed, and updated ETL specifications. The ultimate objective is to be able to run an arbitrary mix of inline and offline ETLs while simultaneously viewing and *editing* their (persistent) specs.

Further reading:
- [Using AIS/PyTorch connector to transform ImageNet](https://aiatscale.org/blog/2021/10/22/ais-etl-2)
- [Using WebDataset to train on a sharded dataset](https://aiatscale.org/blog/2021/10/29/ais-etl-3)

## Storage cleanup

Cleanup, as the name implies, is tasked with safely removing already deleted objects (that we keep for a while to support future [undeletion](https://en.wikipedia.org/wiki/Undeletion)). Also subject to cleanup are:

* workfiles resulting from interrupted workloads
* unfinished erasure-coded slices
* misplaced replicas left behind during global rebalancing

and similar. In short, all sorts of "artifacts" of distributed migration, replication, and erasure coding.

Like LRU-based cluster-wide eviction, cleanup runs automatically or [administratively](/docs/cli/storage.md). Cleanup triggers automatically when used capacity exceeds 65% (or the configured threshold) of the total. But note:

> Automatic cleanup always runs _prior_ to automatic LRU eviction, so that the latter takes into account the updated used and available percentages.

> LRU eviction is separately configured on a per-bucket basis with cluster-wide inheritable defaults set as follows: enabled for Cloud buckets, disabled for AIS buckets that have no remote backend.
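
The ordering described above (cleanup first, LRU second, with LRU seeing the updated capacity) can be sketched roughly as follows. The 65% default comes from the text; everything else, including the names `Capacity` and `manage_space` and the 90% LRU threshold, is a hypothetical placeholder rather than actual AIS configuration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Capacity:
    total: int  # bytes
    used: int   # bytes

    def used_pct(self) -> float:
        return 100.0 * self.used / self.total


def manage_space(cap: Capacity, cleanup_reclaims: int, lru_reclaims: int,
                 cleanup_pct: float = 65.0, lru_pct: float = 90.0) -> List[str]:
    """Return the actions taken, in order. `lru_pct` is a made-up threshold."""
    actions = []
    if cap.used_pct() > cleanup_pct:
        cap.used -= min(cleanup_reclaims, cap.used)
        actions.append("cleanup")
    # LRU looks at the capacity *after* cleanup has freed space.
    if cap.used_pct() > lru_pct:
        cap.used -= min(lru_reclaims, cap.used)
        actions.append("lru")
    return actions
```

For instance, under these assumed numbers, a target at 95% used capacity where cleanup reclaims 10% would skip LRU eviction entirely, because LRU is evaluated against the post-cleanup 85%.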

## Custom object metadata

AIS now differentiates between:

* its own system metadata (size, access time, checksum, number of copies, etc.)
* Cloud object metadata (source, version, MD5, ETag), and
* custom metadata comprising user-defined key/values

All metadata from all sources is now preserved and checksum-protected, stored persistently and maintained across all intra-cluster migrations and replications. There's also an improved check for local <=> remote equality in the context of cold GETs and [downloads](/docs/downloader.md). The check takes into account size, version (if available), ETag (if available), and checksum(s), all of the above.
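
A local <=> remote equality check of the kind described above might look like the sketch below. It is illustrative only: `ObjAttrs` and `same_object` are hypothetical names, and treating an attribute that is missing on either side as "no contradiction" is an assumption about how "if available" is interpreted, not a statement of AIS behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ObjAttrs:
    size: int
    version: Optional[str] = None
    etag: Optional[str] = None
    checksum: Optional[str] = None  # value format is illustrative


def _agree(a: Optional[str], b: Optional[str]) -> bool:
    """Attributes missing on either side cannot contradict each other."""
    return a is None or b is None or a == b


def same_object(local: ObjAttrs, remote: ObjAttrs) -> bool:
    # Size is always known; the rest is compared only when both sides have it.
    return (
        local.size == remote.size
        and _agree(local.version, remote.version)
        and _agree(local.etag, remote.etag)
        and _agree(local.checksum, remote.checksum)
    )
```

In this sketch, a mismatch on any single available attribute is enough to declare the two copies different, which would be the conservative choice for deciding whether a cold GET has to re-fetch the object.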

## Volume

A multi-disk volume in AIS is a collection of [mountpaths](/docs/overview.md#terminology). The corresponding metadata (called VMD) is versioned, persistent, and protected (i.e., checksummed and replicated). Version 3.8 reinforces the AIS volume (function) in the presence of unlikely but nevertheless critical *scenarios* that include the usual:

* faulted drives, degraded drives, missing (unmounted or detached) drives
* old, missing, or corrupted VMD instances

At startup, an AIS target performs a mini-bootstrapping sequence to load VMD and cross-check it against both its other stored replicas and the persistent configuration. At runtime, there's a revised, amended, and fully-supported capability to gracefully detach and attach mountpaths.

In fact, any mountpath can be temporarily disabled and (re)enabled, permanently detached and later re-attached. As long as there's enough space on the remaining mountpaths to carry out volume resilvering, all 4 (four) verbs can be used at any time.

> Needless to say, it'd make sense _not_ to power cycle the target during resilvering.
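
The four verbs and the "enough space to resilver" precondition can be modeled with a toy in-memory volume. This is a deliberately simplified sketch, not the AIS implementation: `Volume` and `Mountpath` are hypothetical classes, data is reduced to byte counters, and only disable/detach move data (a real resilver would also rebalance after enable/attach).

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Mountpath:
    capacity: int  # bytes
    used: int      # bytes
    enabled: bool = True


class Volume:
    def __init__(self) -> None:
        self.mountpaths: Dict[str, Mountpath] = {}

    def attach(self, path: str, capacity: int) -> None:
        self.mountpaths[path] = Mountpath(capacity=capacity, used=0)

    def enable(self, path: str) -> None:
        self.mountpaths[path].enabled = True

    def _free_elsewhere(self, path: str) -> int:
        return sum(m.capacity - m.used
                   for p, m in self.mountpaths.items()
                   if p != path and m.enabled)

    def _resilver_from(self, path: str) -> None:
        """Move data off `path` onto the remaining enabled mountpaths."""
        src = self.mountpaths[path]
        if src.used > self._free_elsewhere(path):
            raise RuntimeError("not enough space to resilver " + path)
        for p, m in self.mountpaths.items():
            if p == path or not m.enabled:
                continue
            moved = min(src.used, m.capacity - m.used)
            m.used += moved
            src.used -= moved

    def disable(self, path: str) -> None:
        self._resilver_from(path)
        self.mountpaths[path].enabled = False

    def detach(self, path: str) -> None:
        self._resilver_from(path)
        del self.mountpaths[path]
```

In this model, a detach or disable is refused up front when the remaining mountpaths cannot absorb the data, which mirrors the precondition stated above.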

## Easy URL

The feature codenamed "easy URL" is a simple alternative mapping of the AIS API to handle URL paths that look as follows:

| URL Path | Cloud |
| --- | --- |
| /gs/mybucket/myobject | Google Cloud buckets |
| /az/mybucket/myobject | Azure Blob Storage |
| /ais/mybucket/myobject | AIS |

In other words, easy URL is a convenience that allows reading, writing, deleting, and listing as follows:

```console
# Example: GET
$ curl -L -X GET 'http://aistore/gs/my-google-bucket/abc-train-0001.tar'

# Example: PUT
$ curl -L -X PUT 'http://aistore/gs/my-google-bucket/abc-train-9999.tar' -T /tmp/9999.tar

# Example: LIST
$ curl -L -X GET 'http://aistore/gs/my-google-bucket'
```
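
To make the path mapping from the table concrete, here is a small Python sketch that splits an easy URL path into provider, bucket, and object name. `parse_easy_url` and `PROVIDERS` are hypothetical names for illustration; the actual routing inside AIS may well be organized differently.

```python
from typing import Tuple

# Prefixes taken from the table above; the dictionary values are just labels.
PROVIDERS = {"gs": "Google Cloud", "az": "Azure Blob Storage", "ais": "AIS"}


def parse_easy_url(path: str) -> Tuple[str, str, str]:
    """Split '/<provider>/<bucket>[/<object>]' into its three parts."""
    # maxsplit=2 keeps any further slashes inside the object name.
    parts = path.lstrip("/").split("/", 2)
    if len(parts) < 2 or parts[0] not in PROVIDERS or not parts[1]:
        raise ValueError("not an easy URL path: " + path)
    provider, bucket = parts[0], parts[1]
    objname = parts[2] if len(parts) == 3 else ""
    return provider, bucket, objname
```

An empty object name corresponds to the LIST example, where the path ends at the bucket.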

Note, however:

> There's a reason that Amazon S3 is missing from the list (above) that includes GCP and Azure. That's because AIS provides a full [S3 compatibility](/docs/s3compat.md) layer via its "/s3" endpoint. [S3 compatibility](/docs/s3compat.md) shall not be confused with a simple alternative ("easy URL") mapping of HTTP requests.

## TL;DR

Other v3.8 additions include:

- target *standby* mode !4688, !4689, !4691
- amended and improved performance monitoring !4792, !4793, !4794, !4798, !4800, !4810, !4812
- ais targets with no disks !4825
- Kubernetes Operator [v0.9](https://github.com/NVIDIA/ais-k8s/releases/tag/v0.9)
- and more.

Some of those might be described later in a separate posting.