---
layout: post
title: ARCHIVE
permalink: /docs/archive
redirect_from:
 - /archive.md/
 - /docs/archive.md/
---

Training neural networks on very large datasets is not easy (an understatement).

One of the many associated challenges is the so-called [small-file problem](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=%22small+file+problem%22) - a problem that gets progressively worse given continuous random access to the entirety of an underlying dataset (that often also has a tendency to annually double in size).

One way to address the small-file problem involves providing some sort of *serialization* or *sharding* that allows running **unmodified** clients and apps.

Sharding is exactly the approach that we took in AIStore. Archiving or sharding, in this context, means utilizing TAR, for instance, to combine small files into .tar formatted shards.

> While I/O performance was always the primary motivation, the fact that a sharded dataset is, effectively, a backup of the original one must be considered an important added bonus.

Today AIS equally supports the following formats: TAR, TGZ (TAR.GZ), TAR.LZ4, and ZIP, where:

* TAR is a well-known format first introduced in Unix V7 circa 1979 with specific formatting flavors including USTAR, PAX, and GNU TAR (all three are equally supported);
* TGZ (aka TAR.GZ) and TAR.LZ4 add, respectively, gzip and lz4 compression to tar files (aka tarballs);
* and ZIP is [PKWARE ZIP](https://www.pkware.com/appnote) first introduced in 1989.

AIS can natively read, write, append (TAR only), and list archives.

All sharding formats are equally supported across the entire set of AIS APIs. For instance, the `list-objects` API supports "opening" objects formatted as one of the supported archival types and including the contents of archived directories in the generated result sets.
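
To make the idea of sharding and of "opening" a shard more concrete, here is a minimal local sketch using Python's standard `tarfile` module. It is not the AIS API and does not talk to an AIS cluster: the helper names `make_shard` and `list_shard`, as well as the sample file names, are hypothetical and exist only for illustration.

```python
import io
import tarfile


def make_shard(files: dict[str, bytes], fmt: int = tarfile.PAX_FORMAT) -> bytes:
    """Pack small in-memory files into a single uncompressed .tar shard.

    Hypothetical helper: it only illustrates sharding in general,
    not how AIS builds shards internally.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)  # size must be set before adding the payload
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_shard(shard: bytes) -> list[str]:
    """Return the names of regular files archived in a .tar shard."""
    with tarfile.open(fileobj=io.BytesIO(shard), mode="r:") as tar:
        return [m.name for m in tar.getmembers() if m.isfile()]


# Made-up sample data standing in for many small dataset files.
samples = {
    "train/0001.jpg": b"fake image bytes",
    "train/0001.cls": b"7",
    "train/0002.jpg": b"more fake image bytes",
}

shard = make_shard(samples)
names = list_shard(shard)
print(names)
```

The `fmt` parameter accepts `tarfile.USTAR_FORMAT`, `tarfile.GNU_FORMAT`, or `tarfile.PAX_FORMAT`, which correspond to the three TAR flavors mentioned above. The reader side does not need to know which flavor was used to write the shard.
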
Clients can run concurrent multi-object (source bucket => destination bucket) transactions to generate new archives en masse from [selected](/docs/batch.md) subsets of files, and more.

APPEND to existing archives is also provided but limited to [TAR only](https://aiatscale.org/blog/2021/08/10/tar-append).

> With the possible exception of TAR, none of the listed sharding/archiving formats was ever designed to be append-able - that is, not if we are actually talking about *appending* and not some sort of extract-all-create-new type emulation (that will certainly break the performance in several well-documented ways).

See also:

* [CLI examples](/docs/cli/archive.md)
* [More CLI examples](/docs/cli/object.md)
* [API](/docs/http_api.md)