
---
layout: post
title: Blob Downloader
permalink: /docs/blob_downloader
redirect_from:
 - /blob_downloader.md/
 - /docs/blob_downloader.md/
---

## Background

AIStore supports multiple ways to populate itself with existing datasets, including (but not limited to):

* **on demand**, often during the first epoch;
* **copy** an entire bucket or its selected virtual subdirectories;
* **copy** multiple matching objects;
* **archive** multiple objects;
* **prefetch** a remote bucket or parts thereof;
* **download** raw http(s)-addressable directories, including (but not limited to) Cloud storages;
* **promote** NFS or SMB shares accessible by one, multiple, or all AIS target nodes.

> The on-demand "way" is perhaps the most popular, whereby users simply start running their workloads against a [remote bucket](providers.md) with the AIS cluster positioned as an intermediate fast tier.

But there's more. In particular, v3.22 introduces a special facility to download very large remote objects, a.k.a. BLOBs.

We call this new facility:

## Blob Downloader

AIS blob downloader features multiple concurrent workers - chunk readers - that run in parallel and, well, read fixed-size chunks from the remote object.

Users can control (or tune) the number of workers and the chunk size, among other configurable tunables. The tunables themselves are backed by system defaults - in particular:

| Name | Comment |
| --- | --- |
| default chunk size  | 2 MiB |
| minimum chunk size  | 32 KiB |
| maximum chunk size  | 16 MiB |
| default number of workers | 4 |

In addition to massively parallel reading (**), blob downloader also:

* stores and _finalizes_ (checksums, replicates, erasure-codes - as per bucket configuration) the downloaded object;
* optionally (**), concurrently transmits the loaded content to the requesting user.

> (**) assuming sufficient and _not_ rate-limited network bandwidth

> (**) see the [GET](#2-get-via-blob-downloader) section below

## Flavors

For users, blob downloader is currently(**) available in 3 distinct flavors:

| Name | Go API | CLI |
| --- | --- | --- |
| 1. `blob-download` job | [api.BlobDownload](https://github.com/NVIDIA/aistore/blob/main/api/blob.go) | `ais blob-download`  |
| 2. `GET` request | [api.GetObject](https://github.com/NVIDIA/aistore/blob/main/api/object.go) and friends  | `ais get`  |
| 3. `prefetch` job | [api.Prefetch](https://github.com/NVIDIA/aistore/blob/main/api/multiobj.go) | `ais prefetch`  |

> (**) There's a plan to integrate blob downloader with the [Internet Downloader](downloader.md) and, generally, all supported mechanisms that one way or another read remote objects and files.

> (**) At the time of this writing, none of the above is supported (yet) in our [Python SDK](https://github.com/NVIDIA/aistore/tree/main/python/aistore/sdk).

The rest of this text covers each of the 3 "flavors" separately, providing additional details, insights, and context.

## 1. Usage

To put some of the blob downloader's functionality into immediate perspective, let's see some CLI:

```console
$ ais blob-download --help
NAME:
   ais blob-download - run a job to download large object(s) from remote storage to aistore cluster, e.g.:
     - 'blob-download s3://ab/largefile --chunk-size=2mb --progress'       - download one blob at a given chunk size
     - 'blob-download s3://ab --list "f1, f2" --num-workers=4 --progress'  - use 4 concurrent readers to download each of the 2 blobs
   When _not_ using '--progress' option, run 'ais show job' to monitor.

USAGE:
   ais blob-download [command options] BUCKET/OBJECT_NAME

OPTIONS:
   --refresh value      interval for continuous monitoring;
                        valid time units: ns, us (or µs), ms, s (default), m, h
   --progress           show progress bar(s) and progress of execution in real time
   --list value         comma-separated list of object or file names, e.g.:
                        --list 'o1,o2,o3'
                        --list "abc/1.tar, abc/1.cls, abc/1.jpeg"
                        or, when listing files and/or directories:
                        --list "/home/docs, /home/abc/1.tar, /home/abc/1.jpeg"
   --chunk-size value   chunk size in IEC or SI units, or "raw" bytes (e.g.: 4mb, 1MiB, 1048576, 128k; see '--units')
   --num-workers value  number of concurrent blob-downloading workers (readers); system default when omitted or zero (default: 0)
   --wait               wait for an asynchronous operation to finish (optionally, use '--timeout' to limit the waiting time)
   --timeout value      maximum time to wait for a job to finish; if omitted: wait forever or until Ctrl-C;
                        valid time units: ns, us (or µs), ms, s (default), m, h
   --latest             check in-cluster metadata and, possibly, GET, download, prefetch, or copy the latest object version
                        from the associated remote bucket:
                        - provides operation-level control over object versioning (and version synchronization)
                          without requiring to change bucket configuration
                        - the latter can be done using 'ais bucket props set BUCKET versioning'
                        - see also: 'ais ls --check-versions', 'ais cp', 'ais prefetch', 'ais get'
```

## 2. GET via blob downloader

Some of the common use cases boil down to the following:

* the user "knows" the size of an object to be read (or downloaded) from remote (cold) storage;
* there's also an idea of a certain size _threshold_ beyond which the latency of the operation becomes prohibitive.

Thus, when the size in question exceeds the _threshold_, there's a motivation to speed things up.

To meet this motivation, AIS now supports `GET` requests with additional (and optional) HTTP headers:

| Header | Values (examples) | Comments |
| --- | --- | --- |
| `ais-blob-download` | "true", ""  | NOTE: to engage blob downloader, this HTTP header must be present and must be "true" (or "y", "yes", "on" - case-insensitive) |
| `ais-blob-chunk` | "1mb", "1234567", "128KiB"  | see [system defaults](#blob-downloader) above |
| `ais-blob-workers` | "3", "7", "16"  | ditto |

* HTTP headers that AIStore recognizes and supports are always prefixed with "ais-". For the most recently updated list (of headers), please see [the source](https://github.com/NVIDIA/aistore/blob/main/api/apc/headers.go).

## 3. Prefetch remote buckets w/ blob size threshold

`Prefetch` is another batch operation, one of the supported job types that can be invoked via a Go or Python call, or the command line.

The idea of a size threshold applies here as well, with the only difference being the _scope_: a single object in [GET](#2-get-via-blob-downloader) vs. all matching objects in `prefetch`.

> The `prefetch` operation supports multi-object selection via the usual `--list`, `--template`, and `--prefix` options.

But first things first, let's see an example.

```console
$ ais ls s3://abc
NAME             SIZE            CACHED
aisloader        39.30MiB        no
largefile        5.76GiB         no
smallfile        100.00MiB       no
```

Given the bucket (above), we now run `prefetch` with a 1MB size threshold:

```console
$ ais prefetch s3://abc --blob-threshold 1mb
prefetch-objects[E-w0gjdm1z]: prefetch entire bucket s3://abc. To monitor the progress, run 'ais show job E-w0gjdm1z'
```

But notice that the `prefetch` stats do not move:

```console
$ ais show job E-w0gjdm1z
NODE             ID              KIND                 BUCKET     OBJECTS      BYTES       START        END     STATE
CAHt8081         E-w0gjdm1z      prefetch-listrange   s3://abc   -            -           10:08:24     -       Running
```

And that is because it is the blob downloader that actually does all the work behind the scenes:

```console
$ ais show job blob-download
blob-download[lP3Lpe5jJ]
NODE             ID              KIND            BUCKET         OBJECTS      BYTES        START        END     STATE
CAHt8081         lP3Lpe5jJ       blob-download   s3://abc       -            20.00MiB     10:08:25     -       Running
```

Shortly thereafter, that work results in:

```console
$ ais ls s3://abc
NAME             SIZE            CACHED
aisloader        39.30MiB        yes
largefile        5.76GiB         yes
smallfile        100.00MiB       yes
```