---
layout: post
title: DOWNLOADER
permalink: /docs/downloader
redirect_from:
 - /downloader.md/
 - /docs/downloader.md/
---

## Why Downloader?

It is probably not much of an exaggeration to say that the majority of popular AI datasets are available on the Internet and in public remote buckets.
Those datasets are often growing in size, thus continuously providing a wealth of information to research and analyze.

It is, therefore, appropriate to ask a follow-up question: how does one efficiently work with those datasets?
And what happens if the dataset in question is *larger* than the capacity of a single host?
What happens if it is large enough to require a cluster of storage servers?

> The often-cited paper [Revisiting Unreasonable Effectiveness of Data in Deep Learning Era](https://arxiv.org/abs/1707.02968) lists a good number of those large and very popular datasets, as well as the reasons to utilize them for training.

Meet **Internet Downloader** - an integrated part of AIStore.
An AIS cluster can be easily deployed on any commodity hardware, and the AIS **downloader** can then be used to quickly populate AIS buckets with content from a given location.

## Features

> By way of background, AIStore supports a number of [3rd party Backend providers](/docs/providers.md) and utilizes the providers' SDKs to access the corresponding backends.
> For Amazon S3, that would be the `aws-sdk-go` SDK, for Azure - `azure-storage-blob-go`, and so on.
> Each SDK can be **conditionally linked** into the AIS executable - the decision (to link or not to link) is made prior to deployment.

This has a certain implication for the Downloader.
Namely:

The downloadable source can be either an Internet link (or links) or a remote bucket accessible via the corresponding backend implementation.
You can, for instance, download a Google Cloud bucket via its Internet location, which would look something like: `https://www.googleapis.com/storage/.../bucket-name/...`.

However, when downloading a remote bucket (**any** remote bucket), it is always **preferable** to have the corresponding SDK linked in.
The Downloader will then detect the SDK's presence at runtime and use the wider range of options available via that SDK.

Other supported features include:

* Can download a single file (object), a range, an entire bucket, **and** a virtual directory in a given remote bucket.
* Easy to use with the [command line interface](/docs/cli/download.md).
* Versioning and checksum support allows for optimally downloading the same source location multiple times to *incrementally* update the AIS destination with source changes (if any).

The rest of this document describes these and other capabilities in greater detail and illustrates them with examples.

## Example

Downloading jobs run asynchronously; you can monitor the progress of each specific job.
The following example runs two jobs, each downloading 10 objects (gzipped tarballs in this case) from a given Google Cloud bucket:

```console
$ ais start download "gs://lpr-imagenet/train-{0001..0010}.tgz" ais://imagenet
5JjIuGemR
Run `ais show job download 5JjIuGemR` to monitor the progress of downloading.
$ ais start download "gs://lpr-imagenet/train-{0011..0020}.tgz" ais://imagenet
H9OjbW5FH
Run `ais show job download H9OjbW5FH` to monitor the progress of downloading.
$ ais show job download
JOB ID           STATUS          ERRORS  DESCRIPTION
5JjIuGemR        Finished        0       https://storage.googleapis.com/lpr-imagenet/imagenet_train-{0001..0010}.tgz -> ais://imagenet
H9OjbW5FH        Finished        0       https://storage.googleapis.com/lpr-imagenet/imagenet_train-{0011..0020}.tgz -> ais://imagenet
```

For more examples see: [Downloader CLI](/docs/cli/download.md)

## Request to download

AIS Downloader supports 4 (four) request types:

* **Single** - download a single object.
* **Multi** - download multiple objects provided as a JSON map (string -> string) or a list of strings.
* **Range** - download multiple objects based on a given naming pattern.
* **Backend** - given an optional prefix and optional suffix, download matching objects from the specified remote bucket.

> Prior to downloading, make sure the destination bucket already exists.
> To create a bucket using the AIS CLI, run `ais create`, for instance:
>
> ```console
> $ ais create imagenet
> ```
>
> Also, see [AIS API](/docs/http_api.md) for details on how to create, destroy, and list storage buckets.
> For Python-based clients, a better starting point could be [here](/docs/overview.md#python-client).

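As a quick illustration of the Python route mentioned in the note above, here is a minimal sketch that creates the destination bucket with the AIStore Python SDK (this assumes the `aistore` package is installed and an AIS proxy is listening on `localhost:8080`, as in the curl examples throughout this document):

```python
# Minimal sketch, assuming the `aistore` Python package is installed and
# an AIS proxy is reachable at http://localhost:8080; the bucket name is
# illustrative.
from aistore.sdk import Client

client = Client("http://localhost:8080")

# Create the destination bucket before submitting download jobs to it.
client.bucket("imagenet").create()
```
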
The rest of this document is structured around supported *types of downloading jobs* and can serve as an API reference for the Downloader.

## Table of Contents

- [Single (object) download](#single-download)
- [Multi (object) download](#multi-download)
- [Range (object) download](#range-download)
- [Backend download](#backend-download)
- [Aborting](#aborting)
- [Status (of the download)](#status)
- [List of downloads](#list-of-downloads)
- [Remove from list](#remove-from-list)

## Single Download

The request described below downloads a *single* object and is considered the most basic.
On success, this request returns an *id* that can then be used to check the status of, or abort, the download job.

### Request JSON Parameters

Name | Type | Description | Optional?
------------ | ------------- | ------------- | -------------
`bucket.name` | `string` | Bucket where the downloaded object is saved to. | No |
`bucket.provider` | `string` | Determines the provider of the bucket. By default, locality is determined automatically. | Yes |
`bucket.namespace` | `string` | Determines the namespace of the bucket. | Yes |
`description` | `string` | Description for the download request. | Yes |
`timeout` | `string` | Timeout for request to external resource. | Yes |
`limits.connections` | `int` | Number of concurrent connections each target can make. | Yes |
`limits.bytes_per_hour` | `int` | Number of bytes the cluster can download in one hour. | Yes |
`link` | `string` | URL the object is downloaded from. | No |
`object_name` | `string` | Name of the object the download is saved as. If no `object_name` is provided, the name will be the last element in the URL's path. | Yes |

### Sample Request

#### Single object download

```bash
$ curl -Li -H 'Content-Type: application/json' -d '{
  "type": "single",
  "bucket": {"name": "ubuntu"},
  "object_name": "ubuntu.iso",
  "link": "http://releases.ubuntu.com/18.04.1/ubuntu-18.04.1-desktop-amd64.iso"
}' -X POST 'http://localhost:8080/v1/download'
```

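The same request can be issued from any HTTP client. Below is a minimal Python sketch using the `requests` library; the endpoint, bucket, and link mirror the curl example above, and the response body (which carries the job id) is simply printed as-is:

```python
# Sketch: submit a "single" download job to the documented /v1/download
# endpoint; proxy address and payload mirror the curl example above.
import requests

payload = {
    "type": "single",
    "bucket": {"name": "ubuntu"},
    "object_name": "ubuntu.iso",
    "link": "http://releases.ubuntu.com/18.04.1/ubuntu-18.04.1-desktop-amd64.iso",
}

resp = requests.post("http://localhost:8080/v1/download", json=payload)
resp.raise_for_status()

# The response carries the job id; keep it to check status or abort later.
print(resp.status_code, resp.text)
```
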
## Multi Download

A *multi* object download requires either a map or a list in the JSON body:
* **Map** - each entry maps a `custom_object_name` (key) to an `external_link` (value). This format allows object names to be specified explicitly instead of relying on the automatic naming used by the *list* format.
* **List** - each entry is an `external_link` to a resource. Object names are created from the base of the link.

On success, this request returns an *id* that can then be used to check the status of, or abort, the download job.

### Request JSON Parameters

Name | Type | Description | Optional?
------------ | ------------- | ------------- | -------------
`bucket.name` | `string` | Bucket where the downloaded object is saved to. | No |
`bucket.provider` | `string` | Determines the provider of the bucket. By default, locality is determined automatically. | Yes |
`bucket.namespace` | `string` | Determines the namespace of the bucket. | Yes |
`description` | `string` | Description for the download request. | Yes |
`timeout` | `string` | Timeout for request to external resource. | Yes |
`limits.connections` | `int` | Number of concurrent connections each target can make. | Yes |
`limits.bytes_per_hour` | `int` | Number of bytes the cluster can download in one hour. | Yes |
`objects` | `array` or `map` | The payload with the objects to download. | No |

### Sample Request

#### Multi Download using object map

```bash
$ curl -Li -H 'Content-Type: application/json' -d '{
  "type": "multi",
  "bucket": {"name": "ubuntu"},
  "objects": {
    "train-labels.gz": "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz",
    "t10k-labels-idx1.gz": "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz",
    "train-images.gz": "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz"
  }
}' -X POST 'http://localhost:8080/v1/download'
```

#### Multi Download using object list

```bash
$ curl -Li -H 'Content-Type: application/json' -d '{
  "type": "multi",
  "bucket": {"name": "ubuntu"},
  "objects": [
    "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz",
    "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz",
    "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz"
  ]
}' -X POST 'http://localhost:8080/v1/download'
```

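When the desired object names are not simply the last element of each URL, the map form is the one to use. A small Python sketch along those lines (the renaming rule is purely illustrative; the endpoint and `objects` map format follow the samples above):

```python
# Sketch: derive custom object names from a list of links and submit them
# as a "multi" download using the map form of "objects".
from urllib.parse import urlparse
import requests

links = [
    "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz",
    "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz",
]

# Illustrative renaming rule: drop the "-idxN-ubyte" part of the file name.
objects = {}
for link in links:
    base = urlparse(link).path.rsplit("/", 1)[-1]  # e.g. "train-labels-idx1-ubyte.gz"
    objects[base.replace("-idx1-ubyte", "").replace("-idx3-ubyte", "")] = link

resp = requests.post(
    "http://localhost:8080/v1/download",
    json={"type": "multi", "bucket": {"name": "ubuntu"}, "objects": objects},
)
print(resp.status_code, resp.text)
```
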
## Range Download

A *range* download retrieves (in one shot) multiple objects while expecting (and relying upon) a certain commonly used naming convention.
On success, this request returns an *id* that can then be used to check the status of, or abort, the download job.

Namely, the *range* download expects the object name to consist of prefix + index + suffix, as described below:

### Range Format

Consider a website `randomwebsite.com/some_dir/` that contains the following files:
- `object1log.txt`
- `object2log.txt`
- `object3log.txt`
- ...
- `object1000log.txt`

To populate AIStore with objects in the range from `object200log.txt` to `object300log.txt` (101 objects total), use the *range* download.

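The corresponding template uses Bash-style brace expansion. As a rough illustration only (the actual expansion is performed by the Downloader itself), here is what a template such as `randomwebsite.com/some_dir/object{200..300}log.txt` enumerates:

```python
# Illustration: enumerate the links described by a template with a single
# {start..end[..step]} range, as in the samples below. The Downloader expands
# such templates internally; this sketch just makes the convention explicit.
import re

def expand(template: str):
    m = re.search(r"\{(\d+)\.\.(\d+)(?:\.\.(\d+))?\}", template)
    start, end, step = int(m.group(1)), int(m.group(2)), int(m.group(3) or 1)
    width = len(m.group(1))  # preserve zero-padding, e.g. {0001..0010}
    for i in range(start, end + 1, step):
        yield template[:m.start()] + str(i).zfill(width) + template[m.end():]

links = list(expand("randomwebsite.com/some_dir/object{200..300}log.txt"))
print(len(links))            # 101
print(links[0], links[-1])   # object200log.txt ... object300log.txt
```
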
### Request JSON Parameters

Name | Type | Description | Optional?
------------ | ------------- | ------------- | -------------
`bucket.name` | `string` | Bucket where the downloaded object is saved to. | No |
`bucket.provider` | `string` | Determines the provider of the bucket. By default, locality is determined automatically. | Yes |
`bucket.namespace` | `string` | Determines the namespace of the bucket. | Yes |
`description` | `string` | Description for the download request. | Yes |
`timeout` | `string` | Timeout for request to external resource. | Yes |
`limits.connections` | `int` | Number of concurrent connections each target can make. | Yes |
`limits.bytes_per_hour` | `int` | Number of bytes the cluster can download in one hour. | Yes |
`subdir` | `string` | Subdirectory in the `bucket` where the downloaded objects are saved to. | Yes |
`template` | `string` | Bash template describing names of the objects in the URL. | No |

### Sample Request

#### Download a (range) list of objects

```bash
$ curl -Lig -H 'Content-Type: application/json' -d '{
  "type": "range",
  "bucket": {"name": "test"},
  "template": "randomwebsite.com/some_dir/object{200..300}log.txt"
}' -X POST 'http://localhost:8080/v1/download'
```

#### Download a (range) list of objects into a subdirectory inside a bucket

```bash
$ curl -Lig -H 'Content-Type: application/json' -d '{
  "type": "range",
  "bucket": {"name": "test"},
  "template": "randomwebsite.com/some_dir/object{200..300}log.txt",
  "subdir": "some/subdir/"
}' -X POST 'http://localhost:8080/v1/download'
```

#### Download a (range) list of objects, selecting every tenth object

```bash
$ curl -Lig -H 'Content-Type: application/json' -d '{
  "type": "range",
  "bucket": {"name": "test"},
  "template": "randomwebsite.com/some_dir/object{1..1000..10}log.txt"
}' -X POST 'http://localhost:8080/v1/download'
```

**Tip:** use the `-g` option in curl to turn off its URL globbing parser - this allows using `{` and `}` without escaping them.

## Backend download

A *backend* download prefetches multiple objects whose names match the provided prefix and suffix and which are contained in a given remote bucket.

### Request JSON Parameters

Name | Type | Description | Optional?
------------ | ------------- | ------------- | -------------
`bucket.name` | `string` | Bucket where the downloaded object is saved to. | No |
`bucket.provider` | `string` | Determines the provider of the bucket. | Yes |
`bucket.namespace` | `string` | Determines the namespace of the bucket. | Yes |
`description` | `string` | Description for the download request. | Yes |
`sync` | `bool` | Synchronizes the remote bucket: downloads new or updated objects (regular download) + checks and deletes cached objects if they are no longer present in the remote bucket. | Yes |
`prefix` | `string` | Prefix of the object names to download. | Yes |
`suffix` | `string` | Suffix of the object names to download. | Yes |

### Sample Request

#### Download objects from a remote bucket

```bash
$ curl -Liv -H 'Content-Type: application/json' -d '{
  "type": "backend",
  "bucket": {"name": "lpr-vision", "provider": "gcp"},
  "prefix": "imagenet/imagenet_train-",
  "suffix": ".tgz"
}' -X POST 'http://localhost:8080/v1/download'
```

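The `sync` flag documented above can be added to the same request to keep the in-cluster bucket aligned with the remote one. A minimal Python sketch (bucket, prefix, and suffix are taken from the curl example; whether `sync` is appropriate depends on how the remote bucket changes over time):

```python
# Sketch: "backend" download that also synchronizes the in-cluster bucket,
# i.e., deletes cached objects no longer present in the remote bucket.
# Endpoint and payload fields follow the tables and examples in this document.
import requests

payload = {
    "type": "backend",
    "bucket": {"name": "lpr-vision", "provider": "gcp"},
    "prefix": "imagenet/imagenet_train-",
    "suffix": ".tgz",
    "sync": True,
}

resp = requests.post("http://localhost:8080/v1/download", json=payload)
print(resp.status_code, resp.text)
```
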
## Aborting

Any download request can be aborted at any time by making a `DELETE` request to `/v1/download/abort` with the `id` returned upon job creation.

### Request JSON Parameters

Name | Type | Description | Optional?
------------ | ------------- | ------------- | -------------
`id` | `string` | Unique identifier of the download job, returned upon job creation. | No |

### Sample Request

#### Abort download

```console
$ curl -Li -H 'Content-Type: application/json' -d '{"id": "5JjIuGemR"}' -X DELETE 'http://localhost:8080/v1/download/abort'
```

## Status

The status of any download request can be queried at any time using a `GET` request with the `id` returned upon job creation.

### Request JSON Parameters

Name | Type | Description | Optional?
------------ | ------------- | ------------- | -------------
`id` | `string` | Unique identifier of the download job, returned upon job creation. | No |

### Sample Request

#### Get download status

```console
$ curl -Li -H 'Content-Type: application/json' -d '{"id": "5JjIuGemR"}' -X GET 'http://localhost:8080/v1/download'
```

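Since download jobs run asynchronously, a client will typically poll this endpoint until the job finishes. A minimal Python sketch (the job id is the one from the examples above; the response format is not reproduced here, so the body is printed verbatim):

```python
# Sketch: periodically poll the status endpoint for a given job id and
# print whatever the proxy returns (progress, error counts, etc.).
import time
import requests

job_id = "5JjIuGemR"  # id returned when the job was created

for _ in range(10):   # poll up to 10 times, once per second
    resp = requests.get("http://localhost:8080/v1/download", json={"id": job_id})
    print(resp.status_code, resp.text)
    time.sleep(1)
```
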
## List of Downloads

The list of all download requests can be queried at any time. Note that this uses the same syntax as [Status](#status), except that the `id` parameter is empty.

### Request Parameters

Name | Type | Description | Optional?
------------ | ------------- | ------------- | -------------
`regex` | `string` | Regex to match against the description of download requests. | Yes |

### Sample Requests

#### Get list of all downloads

```console
$ curl -Li -X GET 'http://localhost:8080/v1/download'
```

#### Get list of downloads with description starting with a digit

```console
$ curl -Li -H 'Content-Type: application/json' -d '{"regex": "^[0-9]"}' -X GET 'http://localhost:8080/v1/download'
```

## Remove from List

Any aborted or finished download request can be removed from the [list of downloads](#list-of-downloads) by making a `DELETE` request to `/v1/download/remove` with the `id` returned upon job creation.

### Request JSON Parameters

Name | Type | Description | Optional?
------------ | ------------- | ------------- | -------------
`id` | `string` | Unique identifier of the download job, returned upon job creation. | No |

### Sample Request

#### Remove download job from the list

```console
$ curl -Li -H 'Content-Type: application/json' -d '{"id": "5JjIuGemR"}' -X DELETE 'http://localhost:8080/v1/download/remove'
```