---
layout: post
title: BUCKET
permalink: /docs/cli/archive
redirect_from:
 - /cli/archive.md/
 - /docs/cli/archive.md/
---

# When objects are called _shards_

In this document:
* commands to read, write, extract, and list *archives* - objects formatted as `TAR`, `TGZ` (or `TAR.GZ`), `ZIP`, or `TAR.LZ4`.

For the most recently updated list of supported archival formats, please refer to [this source](https://github.com/NVIDIA/aistore/blob/main/cmn/archive/mime.go).

The corresponding subset of CLI commands starts with `ais archive`, from where you can `<TAB-TAB>` to the actual (reading, writing, etc.) operation.

```console
$ ais archive get --help

NAME:
   ais archive get - get a shard and extract its content; get an archived file;
   write the content locally with destination options including: filename, directory, STDOUT ('-'), or '/dev/null' (discard);
   assorted options further include:
   - '--prefix' to get multiple shards in one shot (empty prefix for the entire bucket);
   - '--progress' and '--refresh' to watch progress bar;
   - '-v' to produce verbose output when getting multiple objects.
   'ais archive get' examples:
   - ais://abc/trunk-0123.tar.lz4 /tmp/out - get and extract entire shard to /tmp/out/trunk/*
   - ais://abc/trunk-0123.tar.lz4 --archpath file45.jpeg /tmp/out - extract one named file
   - ais://abc/trunk-0123.tar.lz4/file45.jpeg /tmp/out - same as above (and note that '--archpath' is implied)
   - ais://abc/trunk-0123.tar.lz4/file45 /tmp/out/file456.new - same as above, with destination explicitly (re)named
   'ais archive get' multi-selection examples:
   - ais://abc/trunk-0123.tar 111.tar --archregx=jpeg --archmode=suffix - return 111.tar with all *.jpeg files from a given shard
   - ais://abc/trunk-0123.tar 222.tar --archregx=file45 --archmode=wdskey - return 222.tar with all file45.* files --/--
   - ais://abc/trunk-0123.tar 333.tar --archregx=subdir/ --archmode=prefix - 333.tar with all subdir/* files --/--

USAGE:
   ais archive get [command options] BUCKET[/SHARD_NAME] [OUT_FILE|OUT_DIR|-]

OPTIONS:
   --checksum           validate checksum
   --yes, -y            assume 'yes' to all questions
   --latest             check in-cluster metadata and, possibly, GET, download, prefetch, or copy the latest object version
                        from the associated remote bucket:
                        - provides operation-level control over object versioning (and version synchronization)
                          without requiring to change bucket configuration
                        - the latter can be done using 'ais bucket props set BUCKET versioning'
                        - see also: 'ais ls --check-versions', 'ais cp', 'ais prefetch', 'ais get'
   --refresh value      time interval for continuous monitoring; can be also used to update progress bar (at a given interval);
                        valid time units: ns, us (or µs), ms, s (default), m, h
   --progress           show progress bar(s) and progress of execution in real time
   --blob-download      utilize built-in blob-downloader (and the corresponding alternative datapath) to read very large remote objects
   --chunk-size value   chunk size in IEC or SI units, or "raw" bytes (e.g.: 4mb, 1MiB, 1048576, 128k; see '--units')
   --num-workers value  number of concurrent blob-downloading workers (readers); system default when omitted or zero (default: 0)
   --archpath value     extract the specified file from an object ("shard") formatted as: .tar, .tgz or .tar.gz, .zip, .tar.lz4;
                        see also: '--archregx'
   --archmime value     expected format (mime type) of an object ("shard") formatted as: .tar, .tgz or .tar.gz, .zip, .tar.lz4;
                        especially usable for shards with non-standard extensions
   --archregx value     string that specifies prefix, suffix, substring, WebDataset key, _or_ a general-purpose regular expression
                        to select possibly multiple matching archived files from a given shard;
                        is used in combination with '--archmode' ("matching mode") option
   --archmode value     enumerated "matching mode" that tells aistore how to handle '--archregx', one of:
                          * regexp - general purpose regular expression;
                          * prefix - matching filename starts with;
                          * suffix - matching filename ends with;
                          * substr - matching filename contains;
                          * wdskey - WebDataset key
                        example:
                          given a shard containing (subdir/aaa.jpg, subdir/aaa.json, subdir/bbb.jpg, subdir/bbb.json, ...)
                          and wdskey=subdir/aaa, aistore will match and return (subdir/aaa.jpg, subdir/aaa.json)
   --extract, -x        extract all files from archive(s)
   --inventory          list objects using _bucket inventory_ (docs/s3inventory.md); requires s3:// backend; will provide significant performance
                        boost when used with very large s3 buckets; e.g. usage:
                          1) 'ais ls s3://abc --inventory'
                          2) 'ais ls s3://abc --inventory --paged --prefix=subdir/'
                        (see also: docs/s3inventory.md)
   --inv-name value     bucket inventory name (optional; system default name is '.inventory')
   --inv-id value       bucket inventory ID (optional; by default, we use bucket name as the bucket's inventory ID)
   --prefix value       get objects that start with the specified prefix, e.g.:
                        '--prefix a/b/c' - get objects from the virtual directory a/b/c and objects from the virtual directory
                        a/b that have their names (relative to this directory) starting with 'c';
                        '--prefix ""' - get entire bucket (all objects)
   --cached             get only in-cluster objects - only those objects from a remote bucket that are present ("cached")
   --archive            list archived content (see docs/archive.md for details)
   --limit value        maximum number of object names to display (0 - unlimited; see also '--max-pages')
                        e.g.: 'ais ls gs://abc --limit 1234 --cached --props size,custom' (default: 0)
   --units value        show statistics and/or parse command-line specified sizes using one of the following _units of measurement_:
                        iec - IEC format, e.g.: KiB, MiB, GiB (default)
                        si  - SI (metric) format, e.g.: KB, MB, GB
                        raw - do not convert to (or from) human-readable format
   --verbose, -v        verbose output
   --silent             server-side flag, an indication for aistore _not_ to log assorted errors (e.g., HEAD(object) failures)
   --help, -h           show help
```

## Table of Contents
- [Archive files and directories](#archive-files-and-directories)
- [Append files and directories to an existing archive](#append-files-and-directories-to-an-existing-archive)
- [Archive multiple objects](#archive-multiple-objects)
- [List archived content](#list-archived-content)
- [Get archived content](#get-archived-content)
- [Get archived content: multiple selection](#get-archived-content-multiple-selection)
- [Generate shards](#generate-shards)

## Archive files and directories

Archive multiple files.

```console
$ ais archive put --help
NAME:
   ais archive put - archive a file, a directory, or multiple files and/or directories as
   (.tar, .tgz or .tar.gz, .zip, .tar.lz4)-formatted object - aka "shard".
   Both APPEND (to an existing shard) and PUT (a new version of the shard) are supported.
   Examples:
   - 'local-filename bucket/shard-00123.tar.lz4 --append --archpath name-in-archive' - append file to a given shard,
      optionally, rename it (inside archive) as specified;
   - 'local-filename bucket/shard-00123.tar.lz4 --append-or-put --archpath name-in-archive' - append file to a given shard if exists,
      otherwise, create a new shard (and name it shard-00123.tar.lz4, as specified);
   - 'src-dir bucket/shard-99999.zip -put' - one directory; iff the destination .zip doesn't exist create a new one;
   - '"sys, docs" ais://dst/CCC.tar --dry-run -y -r --archpath ggg/' - dry-run to recursively archive two directories.
   Tips:
   - use '--dry-run' option if in doubt;
   - to archive objects from a ais:// or remote bucket, run 'ais archive bucket', see --help for details.

USAGE:
   ais archive put [command options] [-|FILE|DIRECTORY[/PATTERN]] BUCKET/SHARD_NAME
```

The operation accepts either an explicitly defined *list* or template-defined *range* of file names (to archive).

**NOTE:**

* `ais archive put` works with locally accessible (source) files and shall _not_ be confused with `ais archive bucket` command (below).

Also, note that `ais put` command with its `--archpath` option provides an alternative way to archive multiple objects.

For the most recently updated list of supported archival formats, please see:

* [this source](https://github.com/NVIDIA/aistore/blob/main/cmn/archive/mime.go).

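
The extensions named on this page (`.tar`, `.tgz` or `.tar.gz`, `.zip`, `.tar.lz4`) are what tell a shard's format apart. As a rough illustration only, the following Python sketch maps a shard name to one of those formats. The helper name `shard_format` and the returned labels are hypothetical and are not part of aistore; the authoritative list lives in the source linked above.

```python
from typing import Optional

# Hypothetical mapping, built only from the extensions listed in this document.
SUPPORTED_EXTENSIONS = {
    ".tar": "tar",
    ".tgz": "tgz",
    ".tar.gz": "tgz",   # same format as .tgz
    ".zip": "zip",
    ".tar.lz4": "tar.lz4",
}


def shard_format(name: str) -> Optional[str]:
    """Return the archive format implied by the shard name, or None."""
    # Check longer extensions first so that multi-part suffixes are seen whole.
    for ext in sorted(SUPPORTED_EXTENSIONS, key=len, reverse=True):
        if name.endswith(ext):
            return SUPPORTED_EXTENSIONS[ext]
    return None


print(shard_format("shard-00123.tar.lz4"))
print(shard_format("README.md"))
```

A name that yields `None` here corresponds to the "non-standard extension" case, for which the help text above points to the `--archmime` option.
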
## Append files and directories to an existing archive

APPEND operation provides for appending files to existing archives (shards). As such, APPEND is a variation of PUT (above) with **two additional boolean flags**:

| Name | Description |
| --- | --- |
| `--append` | add newly archived content to the destination object ("archive", "shard") that **must** exist |
| `--append-or-put` | **if** destination object ("archive", "shard") exists append to it, otherwise archive a new one |

### Example 1: add file to archive

#### step 1. create archive (by archiving a given source dir)

```console
$ ais archive put sys ais://nnn/sys.tar.lz4
Warning: multi-file 'archive put' operation requires either '--append' or '--append-or-put' option
Proceed to execute 'archive put --append-or-put'? [Y/N]: y
Files to upload:
EXTENSION        COUNT   SIZE
.go              11      17.46KiB
TOTAL            11      17.46KiB
APPEND 11 files (one directory, non-recursive) => ais://nnn/sys.tar.lz4? [Y/N]: y
Done
```

#### step 2. add a single file to existing archive

```console
$ ais archive put README.md ais://nnn/sys.tar.lz4 --archpath=docs/README --append
APPEND README.md to ais://nnn/sys.tar.lz4 as "docs/README"
```

#### step 3. list entire bucket with an `--archive` option to show all archived entries

```console
$ ais ls ais://nnn --archive
NAME                             SIZE
sys.tar.lz4                      16.84KiB
sys.tar.lz4/api_linux.go         1.07KiB
sys.tar.lz4/cpu.go               1.07KiB
sys.tar.lz4/cpu_darwin.go        802B
sys.tar.lz4/cpu_linux.go         2.14KiB
sys.tar.lz4/docs/README          13.85KiB
sys.tar.lz4/mem.go               1.16KiB
sys.tar.lz4/mem_darwin.go        2.04KiB
sys.tar.lz4/mem_linux.go         2.81KiB
sys.tar.lz4/proc.go              784B
sys.tar.lz4/proc_darwin.go       369B
sys.tar.lz4/proc_linux.go        1.40KiB
sys.tar.lz4/sys_test.go          3.88KiB
Listed: 13 names
```

Alternatively, use regex to select:

```console
$ ais ls ais://nnn --archive --regex docs
NAME                             SIZE
sys.tar.lz4/docs/README          13.85KiB
```

### Example 2: use `--template` flag to add source files

Generally, the `--template` option combines (an optional) prefix and/or one or more ranges (e.g., bash brace expansions).

In this case, the template we use is a simple prefix with no ranges.

```console
$ ls -l /tmp/w
total 32
-rw-r--r-- 1 root root 14180 Dec 11 18:18 111
-rw-r--r-- 1 root root 14180 Dec 11 18:18 222

$ ais archive put ais://nnn/shard-001.tar --template /tmp/w/ --append
Files to upload:
EXTENSION        COUNT   SIZE
                 2       27.70KiB
TOTAL            2       27.70KiB
APPEND 2 files (one directory, non-recursive) => ais://nnn/shard-001.tar? [Y/N]: y
Done
$ ais ls ais://nnn/shard-001.tar --archive
NAME                                             SIZE
shard-001.tar                                    37.50KiB
shard-001.tar/111                                13.85KiB
shard-001.tar/222                                13.85KiB
shard-001.tar/23ed44d8bf3952a35484-1.test        1.00KiB
shard-001.tar/452938788ebb87807043-4.test        1.00KiB
shard-001.tar/7925bc9b5eb1daa12ed0-2.test        1.00KiB
shard-001.tar/8264574b49bd188a4b27-0.test        1.00KiB
shard-001.tar/f1f25e52c5edd768e0ec-3.test        1.00KiB
```

### Example 3: add file to archive

In this example, we assume that `arch.tar` already exists.

```console
# contents _before_:
$ ais archive ls ais://abc/arch.tar
NAME                SIZE
arch.tar            4.5KiB
arch.tar/obj1       1.0KiB
arch.tar/obj2       1.0KiB

# add file to existing archive:
$ ais archive put /tmp/obj1.bin ais://abc/arch.tar --archpath bin/obj1
APPEND "/tmp/obj1.bin" to object "ais://abc/arch.tar[/bin/obj1]"

# contents _after_:
$ ais archive ls ais://abc/arch.tar
NAME                SIZE
arch.tar            6KiB
arch.tar/bin/obj1   2.KiB
arch.tar/obj1       1.0KiB
arch.tar/obj2       1.0KiB
```

### Example 4: add file to archive

```console
# contents _before_:

$ ais archive ls ais://nnn/shard-2.tar
NAME                                             SIZE
shard-2.tar                                      5.50KiB
shard-2.tar/0379f37cbb0415e7eaea-3.test          1.00KiB
shard-2.tar/504c563d14852368575b-5.test          1.00KiB
shard-2.tar/c7bcb7014568b5e7d13b-4.test          1.00KiB

# append and note that `--archpath` can specify a fully qualified destination name

$ ais archive put LICENSE ais://nnn/shard-2.tar --archpath shard-2.tar/license.test
APPEND "/go/src/github.com/NVIDIA/aistore/LICENSE" to "ais://nnn/shard-2.tar[/shard-2.tar/license.test]"

# contents _after_:
$ ais archive ls ais://nnn/shard-2.tar
NAME                                             SIZE
shard-2.tar                                      7.50KiB
shard-2.tar/0379f37cbb0415e7eaea-3.test          1.00KiB
shard-2.tar/504c563d14852368575b-5.test          1.00KiB
shard-2.tar/c7bcb7014568b5e7d13b-4.test          1.00KiB
shard-2.tar/license.test                         1.05KiB
```

## Archive multiple objects

This is yet another archive-**creating** operation that:

1. takes in multiple objects from a given **source bucket**, and
2. archives them all as a shard in the specified destination bucket,

where:

* source and destination buckets may not necessarily be different;
* both `--list` and `--template` options are supported;
* supported archival formats include `.tar`, `.tar.gz` (or, same, `.tgz`), `.zip`, and `.tar.lz4`; more extensions may be added in the future;
* archiving is carried out asynchronously, in parallel by all AIS targets.

As such, `ais archive bucket` is one of the supported [multi-object operations](/docs/cli/object.md#operations-on-lists-and-ranges).

**NOTE:**

* `ais archive bucket` multi-object bucket-to-bucket archiving shall _not_ be confused with `ais archive put` command - the latter is used to archive multiple source **files** from a local (or locally accessible) source **directory**.

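
The `--template` ranges used in this section (for instance `obj-{0..9}`, which selects `obj-0` through `obj-9`) can be pictured with a small Python sketch. It assumes bash-like semantics: inclusive bounds, an optional third `..step` field, and zero padding taken from the width of the start value. The function name `expand_template` is hypothetical, and aistore's own template parser may differ in details.

```python
import itertools
import re
from typing import List

# One brace range: {start..end} or {start..end..step}
_RANGE = re.compile(r"\{(\d+)\.\.(\d+)(?:\.\.(\d+))?\}")


def expand_template(template: str) -> List[str]:
    """Expand every {start..end[..step]} range found in the template."""
    literals = []   # literal text preceding each range
    choices = []    # one list of formatted numbers per range
    pos = 0
    for m in _RANGE.finditer(template):
        literals.append(template[pos:m.start()])
        start, end, step = m.group(1), m.group(2), m.group(3)
        # Assumption: keep the start value's width when it is zero-padded.
        width = len(start) if start.startswith("0") else 0
        numbers = range(int(start), int(end) + 1, int(step or 1))
        choices.append([str(n).zfill(width) for n in numbers])
        pos = m.end()
    tail = template[pos:]
    # The cartesian product covers templates with more than one range.
    return [
        "".join(lit + val for lit, val in zip(literals, combo)) + tail
        for combo in itertools.product(*choices)
    ]


print(expand_template("obj-{0..9}"))
print(expand_template("prefix-{0010..0013..2}-gap-{1..2}-suffix"))
```

A template without any range, such as `dir/subdir/`, comes back unchanged, which corresponds to the "simple prefix" case.
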
```console
$ ais archive bucket --help
NAME:
   ais archive bucket - archive multiple objects from SRC_BUCKET as (.tar, .tgz or .tar.gz, .zip, .tar.lz4)-formatted shard

USAGE:
   ais archive bucket [command options] SRC_BUCKET DST_BUCKET/SHARD_NAME

OPTIONS:
   --template value   template to match object or file names; may contain prefix (that could be empty) with zero or more ranges
                      (with optional steps and gaps), e.g.:
                      --template "" # (an empty or '*' template matches everything)
                      --template 'dir/subdir/'
                      --template 'shard-{1000..9999}.tar'
                      --template "prefix-{0010..0013..2}-gap-{1..2}-suffix"
                      and similarly, when specifying files and directories:
                      --template '/home/dir/subdir/'
                      --template "/abc/prefix-{0010..9999..2}-suffix"
   --list value       comma-separated list of object or file names, e.g.:
                      --list 'o1,o2,o3'
                      --list "abc/1.tar, abc/1.cls, abc/1.jpeg"
                      or, when listing files and/or directories:
                      --list "/home/docs, /home/abc/1.tar, /home/abc/1.jpeg"
   --dry-run          preview the results without really running the action
   --include-src-bck  prefix the names of archived files with the source bucket name
   --append-or-put    if destination object ("archive", "shard") exists append to it, otherwise archive a new one
   --cont-on-err      keep running archiving xaction in presence of errors in any given multi-object transaction
   --wait             wait for an asynchronous operation to finish (optionally, use '--timeout' to limit the waiting time)
   --help, -h         show help
```

### Examples

1. Archive a list of objects from a given bucket:

```console
$ ais archive bucket ais://bck/arch.tar --list obj1,obj2
Archiving "ais://bck/arch.tar" ...
```

Resulting `ais://bck/arch.tar` contains objects `ais://bck/obj1` and `ais://bck/obj2`.

2. Archive objects from a different bucket, use template (range):

```console
$ ais archive bucket ais://src ais://dst/arch.tar --template "obj-{0..9}"

Archiving "ais://dst/arch.tar" ...
```

`ais://dst/arch.tar` now contains 10 objects from bucket `ais://src`: `ais://src/obj-0`, `ais://src/obj-1` ... `ais://src/obj-9`.

3. Archive 3 objects and then append 2 more:

```console
$ ais archive bucket ais://bck/arch1.tar --template "obj{1..3}"
Archived "ais://bck/arch1.tar" ...
$ ais archive ls ais://bck/arch1.tar
NAME                     SIZE
arch1.tar                31.00KiB
arch1.tar/obj1           9.26KiB
arch1.tar/obj2           9.26KiB
arch1.tar/obj3           9.26KiB

$ ais archive bucket ais://bck/arch1.tar --template "obj{4..5}" --append
Archived "ais://bck/arch1.tar"

$ ais archive ls ais://bck/arch1.tar
NAME                     SIZE
arch1.tar                51.00KiB
arch1.tar/obj1           9.26KiB
arch1.tar/obj2           9.26KiB
arch1.tar/obj3           9.26KiB
arch1.tar/obj4           9.26KiB
arch1.tar/obj5           9.26KiB
```

## List archived content

```console
NAME:
   ais archive ls - list archived content (supported formats: .tar, .tgz or .tar.gz, .zip, .tar.lz4)

USAGE:
   ais archive ls [command options] BUCKET[/SHARD_NAME]
```

List archived content as a tree with archive ("shard") name as a root and archived files as leaves.
Filenames are always sorted alphabetically.

### Options

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `--props` | `string` | Comma-separated properties to return with object names | `"size"` |
| `--all` | `bool` | Show all objects, including misplaced, duplicated, etc. | `false` |

### Examples

```console
$ ais archive ls ais://bck/arch.tar
NAME                SIZE
arch.tar            4.5KiB
arch.tar/obj1       1.0KiB
arch.tar/obj2       1.0KiB
```

### Example: use '--prefix' that crosses shard boundary

For starters, we recursively archive all aistore docs:

```console
$ ais put docs ais://nnn/A.tar --archive -r
```

To list a virtual subdirectory _inside_ this newly created shard (e.g.):

```console
$ ais archive ls ais://nnn --prefix "A.tar/tutorials"
NAME                                             SIZE
A.tar/tutorials/README.md                        561B
A.tar/tutorials/etl/compute_md5.md               8.28KiB
A.tar/tutorials/etl/etl_imagenet_pytorch.md      4.16KiB
A.tar/tutorials/etl/etl_webdataset.md            3.97KiB
Listed: 4 names
```

or, same:

```console
$ ais ls ais://nnn --prefix "A.tar/tutorials" --archive
NAME                                             SIZE
A.tar/tutorials/README.md                        561B
A.tar/tutorials/etl/compute_md5.md               8.28KiB
A.tar/tutorials/etl/etl_imagenet_pytorch.md      4.16KiB
A.tar/tutorials/etl/etl_webdataset.md            3.97KiB
Listed: 4 names
```

## Get archived content

```console
$ ais get --help

   ais get - (alias for "object get") get an object, a shard, an archived file, or a range of bytes from all of the above;
   write the content locally with destination options including: filename, directory, STDOUT ('-'), or '/dev/null' (discard);
   assorted options further include:
   - '--prefix' to get multiple objects in one shot (empty prefix for the entire bucket);
   - '--extract' or '--archpath' to extract archived content;
   - '--progress' and '--refresh' to watch progress bar;
   - '-v' to produce verbose output when getting multiple objects.

USAGE:
   ais get [command options] BUCKET[/OBJECT_NAME] [OUT_FILE|OUT_DIR|-]

OPTIONS:
   --offset value    object read offset; must be used together with '--length'; default formatting: IEC (use '--units' to override)
   --checksum        validate checksum
   --yes, -y         assume 'yes' to all questions
   --refresh value   interval for continuous monitoring;
                     valid time units: ns, us (or µs), ms, s (default), m, h
   --progress        show progress bar(s) and progress of execution in real time
   --archpath value  extract the specified file from an archive (shard)
   --extract, -x     extract all files from archive(s)
   --prefix value    get objects that start with the specified prefix, e.g.:
                     '--prefix a/b/c' - get objects from the virtual directory a/b/c and objects from the virtual directory
                     a/b that have their names (relative to this directory) starting with c;
                     '--prefix ""' - get entire bucket
   --cached          get only those objects from a remote bucket that are present ("cached") in AIS
   --archive         list archived content (see docs/archive.md for details)
   --limit value     limit object name count (0 - unlimited) (default: 0)
   --units value     show statistics and/or parse command-line specified sizes using one of the following _units of measurement_:
                     iec - IEC format, e.g.: KiB, MiB, GiB (default)
                     si  - SI (metric) format, e.g.: KB, MB, GB
                     raw - do not convert to (or from) human-readable format
   --verbose, -v     verbose output when getting multiple objects
   --help, -h        show help
```

### Example: extract one file

```console
$ ais archive get ais://dst/A.tar.gz /tmp/w --archpath 111.ext1
GET 111.ext1 from ais://dst/A.tar.gz as "/tmp/w/111.ext1" (12.56KiB)

$ ls /tmp/w
111.ext1
```

Alternatively, use fully qualified name:

```console
$ ais archive get ais://dst/A.tar.gz/111.ext1 /tmp/w
```

### Example: extract one file using its fully-qualified name

```console
$ ais archive get ais://nnn/A.tar/tutorials/README.md /tmp/out
```

### Example: extract all files from a single shard

Let's say, we have a certain shard in a certain bucket:

```console
$ ais ls ais://dst --archive
NAME                     SIZE
A.tar.gz                 5.18KiB
A.tar.gz/111.ext1        12.56KiB
A.tar.gz/222.ext1        12.56KiB
A.tar.gz/333.ext2        12.56KiB
```

We can then go ahead to GET and extract it to local directory, e.g.:

```console
$ ais archive get ais://dst/A.tar.gz /tmp/www --extract
GET A.tar.gz from ais://dst as "/tmp/www/A.tar.gz" (5.18KiB) and extract to /tmp/www/A/

$ ls /tmp/www/A
111.ext1  222.ext1  333.ext2
```

But here's an alternative syntax to achieve the same:

```console
$ ais get ais://dst --archive --prefix A.tar.gz /tmp/www
```

or even:

```console
$ ais get ais://dst --archive --prefix A.tar.gz /tmp/www --progress --refresh 1 -y

GET 51 objects from ais://dst/tmp/ggg (total size 1.08MiB)
Objects:       51/51 [==============================================================] 100 %
Total size:    1.08 MiB / 1.08 MiB [==============================================================] 100 %
```

The difference is that:

* in the first case we ask for a specific shard,
* while in the second (and third) we filter bucket's content using a certain prefix
* and the fact (the convention) that archived filenames are prefixed with their parent (shard) name.

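
The naming convention in the last bullet can be reproduced locally. The Python sketch below builds two small in-memory TAR files, lists every archived file as `SHARD/FILE`, and then applies a plain string prefix, which is why a prefix such as `A.tar` (or one that reaches inside the shard) selects archived files. The dictionary standing in for a bucket and the helper names are hypothetical; nothing here talks to an AIS cluster.

```python
import io
import tarfile
from typing import Dict, List


def make_tar(member_names: List[str]) -> bytes:
    """Build a small in-memory TAR whose members hold placeholder bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in member_names:
            data = b"x" * 16
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_archived(bucket: Dict[str, bytes], prefix: str = "") -> List[str]:
    """List shards and their files as 'SHARD/FILE', filtered by prefix."""
    names = []
    for shard_name in sorted(bucket):
        entries = [shard_name]
        with tarfile.open(fileobj=io.BytesIO(bucket[shard_name])) as tar:
            entries += sorted(shard_name + "/" + m.name for m in tar.getmembers())
        # The prefix is matched against the combined name, shard part included.
        names += [n for n in entries if n.startswith(prefix)]
    return names


# Hypothetical stand-in for a bucket: shard name -> shard bytes
bucket = {
    "A.tar": make_tar(["111.ext1", "222.ext1", "333.ext2"]),
    "B.tar": make_tar(["444.ext1"]),
}

print(list_archived(bucket, prefix="A.tar"))
print(list_archived(bucket, prefix="A.tar/3"))
```
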
### Example: extract all files from all shards (with a given prefix)

Let's say, there's a bucket `ais://dst` with a virtual directory `abc/` that in turn contains:

```console
$ ais ls ais://dst
NAME             SIZE
A.tar.gz         5.18KiB
B.tar.lz4        247.88KiB
C.tar.zip        4.15KiB
D.tar            2.00KiB
```

Next, we GET and extract them all in the respective sub-directories (note `--verbose` option):

```console
$ ais archive get ais://dst /tmp/w --prefix "" --extract -v

GET 4 objects from ais://dst to /tmp/w (total size 259.21KiB) [Y/N]: y
GET D.tar from ais://dst as "/tmp/w/D.tar" (2.00KiB) and extract as /tmp/w/D
GET A.tar.gz from ais://dst as "/tmp/w/A.tar.gz" (5.18KiB) and extract as /tmp/w/A
GET C.tar.zip from ais://dst as "/tmp/w/C.tar.zip" (4.15KiB) and extract as /tmp/w/C
GET B.tar.lz4 from ais://dst as "/tmp/w/B.tar.lz4" (247.88KiB) and extract as /tmp/w/B
```

### Example: use '--prefix' that crosses shard boundary

For starters, we recursively archive all aistore docs:

```console
$ ais put docs ais://nnn/A.tar --archive -r
```

To list a virtual subdirectory _inside_ this newly created shard (e.g.):

```console
$ ais archive ls ais://nnn --prefix A.tar/tutorials
NAME                                             SIZE
A.tar/tutorials/README.md                        561B
A.tar/tutorials/etl/compute_md5.md               8.28KiB
A.tar/tutorials/etl/etl_imagenet_pytorch.md      4.16KiB
A.tar/tutorials/etl/etl_webdataset.md            3.97KiB
Listed: 4 names
```

Now, extract matching files _from_ the bucket to /tmp/out:

```console
$ ais archive get ais://nnn --prefix A.tar/tutorials /tmp/out
GET 6 objects from ais://nnn/tmp/out (total size 17.81MiB) [Y/N]: y

$ ls -al /tmp/out/tutorials/
total 20
drwxr-x--- 4 root root 4096 May 13 20:05 ./
drwxr-xr-x 3 root root 4096 May 13 20:05 ../
drwxr-x--- 2 root root 4096 May 13 20:05 etl/
-rw-r--r-- 1 root root  561 May 13 20:05 README.md
drwxr-x--- 2 root root 4096 May 13 20:05 various/
```

## Get archived content: multiple selection

Generally, both single and multi-selection from a given source shard are realized using one of the following 4 (four) options:

```console
   --archpath value  extract the specified file from an object ("shard") formatted as: .tar, .tgz or .tar.gz, .zip, .tar.lz4;
                     see also: '--archregx'
   --archmime value  expected format (mime type) of an object ("shard") formatted as: .tar, .tgz or .tar.gz, .zip, .tar.lz4;
                     especially usable for shards with non-standard extensions
   --archregx value  string that specifies prefix, suffix, substring, WebDataset key, _or_ a general-purpose regular expression
                     to select possibly multiple matching archived files from a given shard;
                     is used in combination with '--archmode' ("matching mode") option
   --archmode value  enumerated "matching mode" that tells aistore how to handle '--archregx', one of:
                       * regexp - general purpose regular expression;
                       * prefix - matching filename starts with;
                       * suffix - matching filename ends with;
                       * substr - matching filename contains;
                       * wdskey - WebDataset key
                     example:
                       given a shard containing (subdir/aaa.jpg, subdir/aaa.json, subdir/bbb.jpg, subdir/bbb.json, ...)
                       and wdskey=subdir/aaa, aistore will match and return (subdir/aaa.jpg, subdir/aaa.json)
```

In particular, the '--archregx' and '--archmode' pair defines multiple selection, which is demonstrated in the following examples.

> But first, note that in all multi-selection cases, the result is (currently) invariably formatted as .TAR (that contains the aforementioned selection).

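
To make the five matching modes concrete, here is a minimal Python sketch that filters a list of archived filenames the way the option descriptions read. It is not aistore's implementation. Two details are assumptions: `regexp` is treated as an unanchored search, and the WebDataset key is taken to be the path with everything after the first dot of the base name removed. The function names `wds_key` and `select` are hypothetical.

```python
import posixpath
import re
from typing import List


def wds_key(name: str) -> str:
    """Assumed WebDataset key: the path minus all extensions of the base name."""
    directory, base = posixpath.split(name)
    stem = base.split(".", 1)[0]
    return posixpath.join(directory, stem)


def select(names: List[str], archregx: str, archmode: str) -> List[str]:
    """Return the archived filenames that match archregx under archmode."""
    matchers = {
        "regexp": lambda n: re.search(archregx, n) is not None,
        "prefix": lambda n: n.startswith(archregx),
        "suffix": lambda n: n.endswith(archregx),
        "substr": lambda n: archregx in n,
        "wdskey": lambda n: wds_key(n) == archregx,
    }
    match = matchers[archmode]  # raises KeyError for an unknown mode
    return [n for n in names if match(n)]


# The shard content used in the '--archmode' help example
shard = ["subdir/aaa.jpg", "subdir/aaa.json", "subdir/bbb.jpg", "subdir/bbb.json"]

print(select(shard, "subdir/aaa", "wdskey"))
print(select(shard, ".json", "suffix"))
```

With `wdskey=subdir/aaa` the sketch returns `subdir/aaa.jpg` and `subdir/aaa.json`, which agrees with the help text quoted above.
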
### Example: suffix match

Select all `*.jpeg` files from a given shard and return them all as 111.tar:

```console
$ ais archive get ais://abc/trunk-0123.tar 111.tar --archregx=jpeg --archmode=suffix
```

### Example: [WebDataset](https://github.com/webdataset/webdataset) key

Select all files that have a given [WebDataset](https://github.com/webdataset/webdataset) key; return the result as 222.tar:

```console
$ ais archive get ais://abc/trunk-0123.tar 222.tar --archregx=file45 --archmode=wdskey
```

### Example: prefix match

Similar to the above except that in this case '--archregx' value specifies virtual subdirectory inside a given named shard:

```console
$ ais archive get ais://abc/trunk-0123.tar 333.tar --archregx=subdir/ --archmode=prefix
```

## Generate shards

`ais archive gen-shards "BUCKET/TEMPLATE.EXT"`

Put randomly generated shards that can be used for dSort testing.
The `TEMPLATE` must be bash-like brace expansion (see examples) and `.EXT` must be one of: `.tar`, `.tar.gz`.

**Warning**: Remember to always quote the argument (`"..."`); otherwise, the brace expansion will happen in the terminal.

### Options

| Flag | Type | Description | Default |
| --- | --- | --- | --- |
| `--fsize` | `string` | Single file size inside the shard, can end with size suffix (k, MB, GiB, ...) | `1024` (`1KB`) |
| `--fcount` | `int` | Number of files inside single shard | `5` |
| `--fext` | `string` | Comma-separated list of file extensions (default ".test"), e.g.: --fext '.mp3,.json,.cls' | `.test` |
| `--cleanup` | `bool` | When set, the old bucket will be deleted and created again | `false` |
| `--conc` | `int` | Limits number of concurrent `PUT` requests and number of concurrent shards created | `10` |

### Examples

#### Generate shards with varying numbers of files and file sizes

Generate 10 shards each containing 100 files of size 256KB and put them inside `ais://dsort-testing` bucket (creates it if it does not exist).
Shards will be named: `shard-0.tar`, `shard-1.tar`, ..., `shard-9.tar`.

```console
$ ais archive gen-shards "ais://dsort-testing/shard-{0..9}.tar" --fsize 262144 --fcount 100
Shards created: 10/10 [==============================================================] 100 %
$ ais ls ais://dsort-testing
NAME            SIZE            VERSION
shard-0.tar     25.05MiB        1
shard-1.tar     25.05MiB        1
shard-2.tar     25.05MiB        1
shard-3.tar     25.05MiB        1
shard-4.tar     25.05MiB        1
shard-5.tar     25.05MiB        1
shard-6.tar     25.05MiB        1
shard-7.tar     25.05MiB        1
shard-8.tar     25.05MiB        1
shard-9.tar     25.05MiB        1
```

#### Generate shards using custom naming template

Generate 100 shards, each containing 5 files of size 256KB, and put them inside the `dsort-testing` bucket.
Shards will be compressed and named: `super_shard_000_last.tgz`, `super_shard_001_last.tgz`, ..., `super_shard_099_last.tgz`

```console
$ ais archive gen-shards "ais://dsort-testing/super_shard_{000..099}_last.tar" --fsize 262144 --cleanup
Shards created: 100/100 [==============================================================] 100 %
$ ais ls ais://dsort-testing
NAME                            SIZE            VERSION
super_shard_000_last.tgz        1.25MiB         1
super_shard_001_last.tgz        1.25MiB         1
super_shard_002_last.tgz        1.25MiB         1
super_shard_003_last.tgz        1.25MiB         1
super_shard_004_last.tgz        1.25MiB         1
super_shard_005_last.tgz        1.25MiB         1
super_shard_006_last.tgz        1.25MiB         1
super_shard_007_last.tgz        1.25MiB         1
...
```

#### Multi-extension example

```console
$ ais archive gen-shards 'ais://nnn/shard-{01..99}.tar' --fext ".mp3, .json, .cls"

$ ais archive ls ais://nnn | head -n 20
NAME                                             SIZE
shard-01.tar                                     23.50KiB
shard-01.tar/541701ae863f76d0f7e0-0.cls          1.00KiB
shard-01.tar/541701ae863f76d0f7e0-0.json         1.00KiB
shard-01.tar/541701ae863f76d0f7e0-0.mp3          1.00KiB
shard-01.tar/8f8c5fa2934c90138833-1.cls          1.00KiB
shard-01.tar/8f8c5fa2934c90138833-1.json         1.00KiB
shard-01.tar/8f8c5fa2934c90138833-1.mp3          1.00KiB
shard-01.tar/9a42bd12d810d890ea86-3.cls          1.00KiB
shard-01.tar/9a42bd12d810d890ea86-3.json         1.00KiB
shard-01.tar/9a42bd12d810d890ea86-3.mp3          1.00KiB
shard-01.tar/c5bd7c7a34e12ebf3ad3-2.cls          1.00KiB
shard-01.tar/c5bd7c7a34e12ebf3ad3-2.json         1.00KiB
shard-01.tar/c5bd7c7a34e12ebf3ad3-2.mp3          1.00KiB
shard-01.tar/f13522533ecafbad4fe5-4.cls          1.00KiB
shard-01.tar/f13522533ecafbad4fe5-4.json         1.00KiB
shard-01.tar/f13522533ecafbad4fe5-4.mp3          1.00KiB
shard-02.tar                                     23.50KiB
shard-02.tar/095e6ae644ff4fd1778b-7.cls          1.00KiB
shard-02.tar/095e6ae644ff4fd1778b-7.json         1.00KiB
...
```
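
The `.tar` shard sizes in the listings above can be reproduced with a short calculation. The Python sketch below assumes the plain TAR layout: one 512-byte header per file, file data padded up to a multiple of 512 bytes, and two zero-filled 512-byte blocks marking the end of the archive. It further assumes short file names (no extended headers) and no extra record padding. This is an estimate that happens to agree with the listed sizes, not a description of how aistore writes shards, and `tar_size` is a hypothetical helper.

```python
BLOCK = 512  # TAR block size in bytes


def tar_size(fcount: int, fsize: int) -> int:
    """Estimate the size of an uncompressed TAR with fcount files of fsize bytes."""
    padded = -(-fsize // BLOCK) * BLOCK   # file data rounded up to whole blocks
    per_file = BLOCK + padded             # one header block plus the data blocks
    return fcount * per_file + 2 * BLOCK  # two zero blocks end the archive


# 5 files x 3 extensions of 1KiB each, as in the multi-extension example
print(tar_size(15, 1024) / 1024)          # size in KiB

# 100 files of 256KiB each, as in the first gen-shards example
print(round(tar_size(100, 262144) / 2**20, 2))  # size in MiB
```

Under these assumptions the two results are 23.5 KiB and 25.05 MiB, in line with `shard-01.tar` and `shard-0.tar` in the listings above.
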