github.com/NVIDIA/aistore@v1.3.23-0.20240517131212-7df6609be51d/docs/archive.md

---
layout: post
title: ARCHIVE
permalink: /docs/archive
redirect_from:
 - /archive.md/
 - /docs/archive.md/
---

Training neural networks on very large datasets is not easy (an understatement).

One of the many associated challenges is the so-called [small-file problem](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=%22small+file+problem%22) - a problem that gets progressively worse under continuous random access to the entirety of an underlying dataset (which often also tends to double in size annually).

One way to address the small-file problem is to provide some form of *serialization* or *sharding* that still allows running **unmodified** clients and apps.

Sharding is exactly the approach that we took in AIStore. Archiving or sharding, in this context, means using TAR, for instance, to combine small files into .tar formatted shards.

> While I/O performance was always the primary motivation, the fact that a sharded dataset is, effectively, a backup of the original one must be considered an important added bonus.
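
The sharding idea described above can be sketched with nothing but the Python standard library. This is a minimal illustration and not AIS code: the helper names `make_shards` and `read_shard`, the `shard-NNNNNN.tar` naming scheme, and the fixed files-per-shard policy are all assumptions made for the example.

```python
import io
import tarfile


def make_shards(samples, files_per_shard):
    """Group (name, payload) pairs into in-memory .tar shards.

    Returns a dict mapping (hypothetical) shard names to tar bytes.
    """
    shards = {}
    for start in range(0, len(samples), files_per_shard):
        buf = io.BytesIO()
        with tarfile.open(fileobj=buf, mode="w") as tar:
            for name, payload in samples[start:start + files_per_shard]:
                info = tarfile.TarInfo(name=name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
        shard_name = f"shard-{start // files_per_shard:06d}.tar"
        shards[shard_name] = buf.getvalue()
    return shards


def read_shard(shard_bytes):
    """Read a shard sequentially and return {member name: payload}."""
    out = {}
    with tarfile.open(fileobj=io.BytesIO(shard_bytes), mode="r") as tar:
        for member in tar:
            out[member.name] = tar.extractfile(member).read()
    return out


# Ten small "files" become three shards of at most four files each.
samples = [(f"sample-{i:04d}.txt", f"payload {i}".encode()) for i in range(10)]
shards = make_shards(samples, files_per_shard=4)
print(sorted(shards))
```

A reader then fetches a few large shards and iterates over their members sequentially, instead of issuing one random read per small file.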

Today AIS equally supports the following formats: TAR, TGZ (TAR.GZ), TAR.LZ4, and ZIP, where:

* TAR is a well-known format first introduced in Unix V7 circa 1979, with specific formatting flavors including USTAR, PAX, and GNU TAR (all three are equally supported);
* TGZ (aka TAR.GZ) and TAR.LZ4 add, respectively, gzip and lz4 compression to tar files (aka tarballs);
* and ZIP is [PKWARE ZIP](https://www.pkware.com/appnote), first introduced in 1989.
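
To make the formats listed above concrete, here is a small sketch that writes the same two files as USTAR, PAX, and GNU tarballs, as a gzip-compressed TGZ, and as a ZIP, and then reads each one back. It uses only the Python standard library, so TAR.LZ4 is left out (lz4 would need a third-party package). The helper names and the archive names are illustrative assumptions and not part of AIS.

```python
import io
import tarfile
import zipfile

FILES = {"a.txt": b"alpha", "dir/b.txt": b"bravo"}


def write_tar(files, mode="w", fmt=tarfile.PAX_FORMAT):
    """Serialize files into a tarball; mode "w:gz" adds gzip compression."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode=mode, format=fmt) as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_tar(blob):
    """Read any tar flavor; mode "r" detects compression transparently."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r") as tar:
        return {m.name: tar.extractfile(m).read() for m in tar if m.isfile()}


def write_zip(files):
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


def read_zip(blob):
    with zipfile.ZipFile(io.BytesIO(blob)) as zf:
        return {name: zf.read(name) for name in zf.namelist()}


archives = {
    "ustar.tar": write_tar(FILES, fmt=tarfile.USTAR_FORMAT),
    "pax.tar": write_tar(FILES, fmt=tarfile.PAX_FORMAT),
    "gnu.tar": write_tar(FILES, fmt=tarfile.GNU_FORMAT),
    "data.tgz": write_tar(FILES, mode="w:gz"),
    "data.zip": write_zip(FILES),
}

# Every format round-trips to the same name -> payload mapping.
for name, blob in archives.items():
    reader = read_zip if name.endswith(".zip") else read_tar
    assert reader(blob) == FILES
```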

AIS can natively read, write, append(**), and list archives.

All sharding formats are equally supported across the entire set of AIS APIs. For instance, the `list-objects` API supports "opening" objects formatted as one of the supported archival types and including the contents of archived directories in the generated result sets. Clients can also run concurrent multi-object (source bucket => destination bucket) transactions to generate new archives en masse from [selected](/docs/batch.md) subsets of files, and more.
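
The notion of "opening" archives while listing can be illustrated with a purely local sketch. It does not call AIS and does not reproduce the actual `list-objects` output: the in-memory `bucket` dict, the `expand_archives` flag, and the `archive-name/member-name` naming are assumptions chosen only to show the idea.

```python
import io
import tarfile
import zipfile


def make_tar(files):
    """Build an in-memory tarball from a {name: payload} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_objects(bucket, expand_archives=False):
    """List object names; optionally add the members of each archive."""
    names = []
    for obj_name in sorted(bucket):
        names.append(obj_name)
        if not expand_archives:
            continue
        blob = io.BytesIO(bucket[obj_name])
        if obj_name.endswith((".tar", ".tgz", ".tar.gz")):
            with tarfile.open(fileobj=blob, mode="r") as tar:
                names += [f"{obj_name}/{m.name}" for m in tar.getmembers()]
        elif obj_name.endswith(".zip"):
            with zipfile.ZipFile(blob) as zf:
                names += [f"{obj_name}/{n}" for n in zf.namelist()]
    return names


# A hypothetical bucket: one plain object and one TAR shard.
bucket = {
    "readme.txt": b"plain object",
    "shard-0.tar": make_tar({"img/0.jpg": b"\xff\xd8", "img/1.jpg": b"\xff\xd8"}),
}
print(list_objects(bucket))
print(list_objects(bucket, expand_archives=True))
```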

APPEND to existing archives is also provided but limited to [TAR only](https://aiatscale.org/blog/2021/08/10/tar-append).

> With the possible exception of TAR, none of the listed sharding/archiving formats was ever designed to be append-able - that is, not if we are actually talking about *appending* and not some sort of extract-all-create-new emulation (which will certainly hurt performance in several well-documented ways).
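
For a rough picture of what in-place TAR append looks like, the sketch below uses the append mode (`"a"`) of Python's standard `tarfile` module on an uncompressed tarball. This illustrates the general technique only; it is not how AIS implements APPEND, and the file and member names are made up for the example.

```python
import io
import os
import tarfile
import tempfile


def add_member(tar, name, data):
    """Add one in-memory file to an open tar archive."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "shard.tar")

    # Create a shard with a single member.
    with tarfile.open(path, mode="w") as tar:
        add_member(tar, "a.txt", b"alpha")
    with open(path, "rb") as f:
        before = f.read()

    # Mode "a" locates the end-of-archive marker and writes new members
    # there, without rewriting the members that are already stored.
    with tarfile.open(path, mode="a") as tar:
        add_member(tar, "b.txt", b"bravo")
    with open(path, "rb") as f:
        after = f.read()

    with tarfile.open(path, mode="r") as tar:
        names = tar.getnames()

# The first member (512-byte header + 512-byte padded data) is untouched.
assert after[:1024] == before[:1024]
print(names)
```

The standard library offers this mode for plain tarballs only; its gzip-compressed counterpart has no append mode, which is consistent with the point made above about formats that were never designed to be append-able.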

See also:

* [CLI examples](/docs/cli/archive.md)
* [More CLI examples](/docs/cli/object.md)
* [API](/docs/http_api.md)