---
layout: post
title: DOWNLOADER
permalink: /docs/downloader
redirect_from:
 - /downloader.md/
 - /docs/downloader.md/
---

## Why Downloader?

It probably won't be much of an exaggeration to say that the majority of popular AI datasets are available on the Internet and in public remote buckets.
Those datasets often keep growing in size, continuously providing a wealth of information to research and analyze.

It is, therefore, appropriate to ask a follow-up question: how to efficiently work with those datasets?
And what happens if the dataset in question is *larger* than the capacity of a single host?
What happens if it is large enough to require a cluster of storage servers?

> The often-cited paper [Revisiting Unreasonable Effectiveness of Data in Deep Learning Era](https://arxiv.org/abs/1707.02968) lists a good number of those large and very popular datasets, as well as the reasons to utilize them for training.

Meet **Internet Downloader** - an integrated part of AIStore.
An AIS cluster can be easily deployed on any commodity hardware, and the AIS **downloader** can then be used to quickly populate AIS buckets with any contents from a given location.

## Features

> By way of background, AIStore supports a number of [3rd party Backend providers](/docs/providers.md) and utilizes the providers' SDKs to access the corresponding backends.
> For Amazon S3, that would be `aws-sdk-go` SDK, for Azure - `azure-storage-blob-go`, and so on.
> Each SDK can be **conditionally linked** into the AIS executable - the decision (to link or not to link) is made prior to deployment.

This has a certain implication for the Downloader.
Namely:

Downloadable source can be either an Internet link (or links) or a remote bucket accessible via the corresponding backend implementation.
You can, for instance, download a Google Cloud bucket via its Internet location that would look something like: `https://www.googleapis.com/storage/.../bucket-name/...`.

However, when downloading a remote bucket (**any** remote bucket), it is always **preferable** to have the corresponding SDK linked-in.
Downloader will then detect the SDK "presence" at runtime and use a wider range of options available via this SDK.

Other supported features include:

* Can download a single file (object), a range, an entire bucket, **and** a virtual directory in a given remote bucket.
* Easy to use with [command line interface](/docs/cli/download.md).
* Versioning and checksum support allows for an optimal download of the same source location multiple times to *incrementally* update AIS destination with source changes (if any).

The rest of this document describes these and other capabilities in greater detail and illustrates them with examples.

## Example

Downloading jobs run asynchronously; you can monitor the progress of each specific job.
The following example runs two jobs, each downloading 10 objects (gzipped tarballs in this case) from a given Google Cloud bucket:

```console
$ ais start download "gs://lpr-imagenet/train-{0001..0010}.tgz" ais://imagenet
5JjIuGemR
Run `ais show job download 5JjIuGemR` to monitor the progress of downloading.
$ ais start download "gs://lpr-imagenet/train-{0011..0020}.tgz" ais://imagenet
H9OjbW5FH
Run `ais show job download H9OjbW5FH` to monitor the progress of downloading.
$ ais show job download
JOB ID       STATUS      ERRORS   DESCRIPTION
5JjIuGemR    Finished    0        https://storage.googleapis.com/lpr-imagenet/imagenet_train-{0001..0010}.tgz -> ais://imagenet
H9OjbW5FH    Finished    0        https://storage.googleapis.com/lpr-imagenet/imagenet_train-{0011..0020}.tgz -> ais://imagenet
```

For more examples see: [Downloader CLI](/docs/cli/download.md)

## Request to download

AIS Downloader supports 4 (four) request types:

* **Single** - download a single object.
* **Multi** - download multiple objects provided by JSON map (string -> string) or list of strings.
* **Range** - download multiple objects based on a given naming pattern.
* **Backend** - given optional prefix and optional suffix, download matching objects from the specified remote bucket.

> Prior to downloading, make sure the destination bucket already exists.
> To create a bucket using AIS CLI, run `ais create`, for instance:
>
> ```console
> $ ais create imagenet
> ```
>
> Also, see [AIS API](/docs/http_api.md) for details on how to create, destroy, and list storage buckets.
> For Python-based clients, a better starting point could be [here](/docs/overview.md#python-client).

The rest of this document is structured around supported *types of downloading jobs* and can serve as an API reference for the Downloader.

## Table of Contents

- [Single (object) download](#single-download)
- [Multi (object) download](#multi-download)
- [Range (object) download](#range-download)
- [Backend download](#backend-download)
- [Aborting](#aborting)
- [Status (of the download)](#status)
- [List of downloads](#list-of-downloads)
- [Remove from list](#remove-from-list)

## Single Download

The request (described below) downloads a *single* object and is considered the most basic.
On success, this request returns an *id* that can then be used to check the status of, or abort, the download job.
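
As a rough illustration of this request, the following Python sketch assembles a single-download body and prepares the corresponding `POST` without sending it. The helper names (`default_object_name`, `single_download_body`, `build_post`) are hypothetical, and the endpoint is assumed to be the one used by the curl samples in this document. The default-name helper only mirrors the documented rule that a missing `object_name` falls back to the last element of the URL's path; how the server treats edge cases such as query strings is not covered here.

```python
import json
import posixpath
import urllib.request
from urllib.parse import urlparse

AIS_ENDPOINT = "http://localhost:8080"  # assumption: endpoint used in the curl samples


def default_object_name(link):
    """Last element of the URL's path (the documented fallback name)."""
    return posixpath.basename(urlparse(link).path)


def single_download_body(bucket, link, object_name=None):
    """Build the JSON body for a 'single' download request."""
    body = {"type": "single", "bucket": {"name": bucket}, "link": link}
    if object_name is not None:
        body["object_name"] = object_name
    return body


def build_post(body):
    """Prepare (but do not send) the POST request to /v1/download."""
    return urllib.request.Request(
        AIS_ENDPOINT + "/v1/download",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


link = "http://releases.ubuntu.com/18.04.1/ubuntu-18.04.1-desktop-amd64.iso"
request = build_post(single_download_body("ubuntu", link, "ubuntu.iso"))
print(request.get_method(), request.full_url)
print(default_object_name(link))
```

Passing the prepared request to `urllib.request.urlopen` would submit the job; the shape of the response that carries the job *id* is not specified in this document, so the sketch stops short of parsing it.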

### Request JSON Parameters

Name | Type | Description | Optional?
------------ | ------------- | ------------- | -------------
`bucket.name` | `string` | Bucket where the downloaded object is saved to. | No |
`bucket.provider` | `string` | Determines the provider of the bucket. By default, locality is determined automatically. | Yes |
`bucket.namespace` | `string` | Determines the namespace of the bucket. | Yes |
`description` | `string` | Description for the download request. | Yes |
`timeout` | `string` | Timeout for request to external resource. | Yes |
`limits.connections` | `int` | Number of concurrent connections each target can make. | Yes |
`limits.bytes_per_hour` | `int` | Number of bytes the cluster can download in one hour. | Yes |
`link` | `string` | URL of where the object is downloaded from. | No |
`object_name` | `string` | Name of the object the download is saved as. If `object_name` is not provided, the name will be the last element in the URL's path. | Yes |

### Sample Request

#### Single object download

```bash
$ curl -Li -H 'Content-Type: application/json' -d '{
  "type": "single",
  "bucket": {"name": "ubuntu"},
  "object_name": "ubuntu.iso",
  "link": "http://releases.ubuntu.com/18.04.1/ubuntu-18.04.1-desktop-amd64.iso"
}' -X POST 'http://localhost:8080/v1/download'
```

## Multi Download

A *multi* object download requires either a map or a list in the JSON body:
* **Map** - each entry maps `custom_object_name` (key) -> `external_link` (value). This format lets you choose object names explicitly instead of relying on the automatic naming used by the *list* format.
* **List** - each entry is an `external_link` to a resource. Object names are created from the base of the link.

On success, this request returns an *id* that can then be used to check the status of, or abort, the download job.
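
The difference between the two formats can be sketched in Python. The helpers below are hypothetical and send nothing; `list_to_map` assumes that "the base of the link" means the last element of the URL's path, and it rejects links whose bases coincide, on the assumption that such links would otherwise target the same object name.

```python
import posixpath
from urllib.parse import urlparse


def list_to_map(links):
    """Convert the list format into an explicit map of object name -> link."""
    objects = {}
    for link in links:
        name = posixpath.basename(urlparse(link).path)
        if not name:
            raise ValueError(f"cannot derive an object name from {link!r}")
        if name in objects:
            raise ValueError(f"{name!r} is derived from more than one link")
        objects[name] = link
    return objects


def multi_download_body(bucket, objects):
    """Build the JSON body for a 'multi' download; objects is a dict or a list."""
    if not isinstance(objects, (dict, list)) or not objects:
        raise ValueError("objects must be a non-empty map or list")
    return {"type": "multi", "bucket": {"name": bucket}, "objects": objects}


links = [
    "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz",
    "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz",
]
as_map = list_to_map(links)
body = multi_download_body("ubuntu", as_map)
print(sorted(as_map))
```

Converting a list to a map on the client side is one way to see the resulting object names before submitting the job, and to rename any of them by editing the keys.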

### Request JSON Parameters

Name | Type | Description | Optional?
------------ | ------------- | ------------- | -------------
`bucket.name` | `string` | Bucket where the downloaded object is saved to. | No |
`bucket.provider` | `string` | Determines the provider of the bucket. By default, locality is determined automatically. | Yes |
`bucket.namespace` | `string` | Determines the namespace of the bucket. | Yes |
`description` | `string` | Description for the download request. | Yes |
`timeout` | `string` | Timeout for request to external resource. | Yes |
`limits.connections` | `int` | Number of concurrent connections each target can make. | Yes |
`limits.bytes_per_hour` | `int` | Number of bytes the cluster can download in one hour. | Yes |
`objects` | `array` or `map` | The payload with the objects to download. | No |

### Sample Request

#### Multi Download using object map

```bash
$ curl -Li -H 'Content-Type: application/json' -d '{
  "type": "multi",
  "bucket": {"name": "ubuntu"},
  "objects": {
    "train-labels.gz": "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz",
    "t10k-labels-idx1.gz": "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz",
    "train-images.gz": "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz"
  }
}' -X POST 'http://localhost:8080/v1/download'
```

#### Multi Download using object list

```bash
$ curl -Li -H 'Content-Type: application/json' -d '{
  "type": "multi",
  "bucket": {"name": "ubuntu"},
  "objects": [
    "http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz",
    "http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz",
    "http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz"
  ]
}' -X POST 'http://localhost:8080/v1/download'
```

## Range Download

A *range* download retrieves (in one shot) multiple objects while expecting (and
relying upon) a commonly used naming convention.
On success, this request returns an *id* that can then be used to check the status of, or abort, the download job.

Namely, the *range* download expects the object name to consist of prefix + index + suffix, as described below:

### Range Format

Consider a website named `randomwebsite.com/some_dir/` that contains the following files:
- `object1log.txt`
- `object2log.txt`
- `object3log.txt`
- ...
- `object1000log.txt`

To populate AIStore with objects in the range from `object200log.txt` to `object300log.txt` (101 objects total), use the *range* download.

### Request JSON Parameters

Name | Type | Description | Optional?
------------ | ------------- | ------------- | -------------
`bucket.name` | `string` | Bucket where the downloaded object is saved to. | No |
`bucket.provider` | `string` | Determines the provider of the bucket. By default, locality is determined automatically. | Yes |
`bucket.namespace` | `string` | Determines the namespace of the bucket. | Yes |
`description` | `string` | Description for the download request. | Yes |
`timeout` | `string` | Timeout for request to external resource. | Yes |
`limits.connections` | `int` | Number of concurrent connections each target can make. | Yes |
`limits.bytes_per_hour` | `int` | Number of bytes the cluster can download in one hour. | Yes |
`subdir` | `string` | Subdirectory in the `bucket` where the downloaded objects are saved to. | Yes |
`template` | `string` | Bash template describing names of the objects in the URL. | No |

### Sample Request

#### Download a (range) list of objects

```bash
$ curl -Lig -H 'Content-Type: application/json' -d '{
  "type": "range",
  "bucket": {"name": "test"},
  "template": "randomwebsite.com/some_dir/object{200..300}log.txt"
}' -X POST 'http://localhost:8080/v1/download'
```

#### Download a (range) list of objects into a subdirectory inside a bucket

```bash
$ curl -Lig -H 'Content-Type: application/json' -d '{
  "type": "range",
  "bucket": {"name": "test"},
  "template": "randomwebsite.com/some_dir/object{200..300}log.txt",
  "subdir": "some/subdir/"
}' -X POST 'http://localhost:8080/v1/download'
```

#### Download a (range) list of objects, selecting every tenth object

```bash
$ curl -Lig -H 'Content-Type: application/json' -d '{
  "type": "range",
  "bucket": {"name": "test"},
  "template": "randomwebsite.com/some_dir/object{1..1000..10}log.txt"
}' -X POST 'http://localhost:8080/v1/download'
```

**Tip:** use curl's `-g` option to turn off the URL globbing parser, which lets you use `{` and `}` without escaping them.

## Backend Download

A *backend* download prefetches multiple objects whose names match the provided prefix and suffix and that are contained in a given remote bucket.

### Request JSON Parameters

Name | Type | Description | Optional?
------------ | ------------- | ------------- | -------------
`bucket.name` | `string` | Bucket where the downloaded object is saved to. | No |
`bucket.provider` | `string` | Determines the provider of the bucket. | Yes |
`bucket.namespace` | `string` | Determines the namespace of the bucket. | Yes |
`description` | `string` | Description for the download request. | Yes |
`sync` | `bool` | Synchronizes the remote bucket: downloads new or updated objects (regular download) + checks and deletes cached objects if they are no longer present in the remote bucket. | Yes |
`prefix` | `string` | Prefix of the object names to download. | Yes |
`suffix` | `string` | Suffix of the object names to download. | Yes |

### Sample Request

#### Download objects from a remote bucket

```bash
$ curl -Liv -H 'Content-Type: application/json' -d '{
  "type": "backend",
  "bucket": {"name": "lpr-vision", "provider": "gcp"},
  "prefix": "imagenet/imagenet_train-",
  "suffix": ".tgz"
}' -X POST 'http://localhost:8080/v1/download'
```

## Aborting

Any download request can be aborted at any time by making a `DELETE` request to `/v1/download/abort` with the `id` that was returned upon job creation.

### Request JSON Parameters

Name | Type | Description | Optional?
------------ | ------------- | ------------- | -------------
`id` | `string` | Unique identifier of download job returned upon job creation. | No |

### Sample Request

#### Abort download

```console
$ curl -Li -H 'Content-Type: application/json' -d '{"id": "5JjIuGemR"}' -X DELETE 'http://localhost:8080/v1/download/abort'
```

## Status

The status of any download request can be queried at any time using a `GET` request with the `id` that was returned upon job creation.

### Request JSON Parameters

Name | Type | Description | Optional?
------------ | ------------- | ------------- | -------------
`id` | `string` | Unique identifier of download job returned upon job creation. | No |

### Sample Request

#### Get download status

```console
$ curl -Li -H 'Content-Type: application/json' -d '{"id": "5JjIuGemR"}' -X GET 'http://localhost:8080/v1/download'
```

## List of Downloads

The list of all download requests can be queried at any time. Note that this has the same syntax as [Status](#status) except the `id` parameter is empty.

### Request Parameters

Name | Type | Description | Optional?
------------ | ------------- | ------------- | -------------
`regex` | `string` | Regex for the description of download requests. | Yes |

### Sample Requests

#### Get list of all downloads

```console
$ curl -Li -X GET 'http://localhost:8080/v1/download'
```

#### Get list of downloads with description starting with a digit

```console
$ curl -Li -H 'Content-Type: application/json' -d '{"regex": "^[0-9]"}' -X GET 'http://localhost:8080/v1/download'
```

## Remove from List

Any aborted or finished download request can be removed from the [list of downloads](#list-of-downloads) by making a `DELETE` request to `/v1/download/remove` with the `id` that was returned upon job creation.

### Request JSON Parameters

Name | Type | Description | Optional?
------------ | ------------- | ------------- | -------------
`id` | `string` | Unique identifier of download job returned upon job creation. | No |

### Sample Request

#### Remove download job from the list

```console
$ curl -Li -H 'Content-Type: application/json' -d '{"id": "5JjIuGemR"}' -X DELETE 'http://localhost:8080/v1/download/remove'
```
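
The job-management requests (status, list, abort, remove) differ only in HTTP method, path, and JSON body, which the following Python sketch tries to make explicit. It only prepares the requests and sends nothing; the function names are hypothetical, and the endpoint is assumed to be the one used by the curl samples.

```python
import json
import urllib.request

AIS_ENDPOINT = "http://localhost:8080"  # assumption: endpoint used in the curl samples


def _request(method, path, payload=None):
    """Prepare a request with an optional JSON body; nothing is sent here."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    headers = {} if payload is None else {"Content-Type": "application/json"}
    return urllib.request.Request(
        AIS_ENDPOINT + path, data=data, headers=headers, method=method
    )


def status_request(job_id):
    """GET /v1/download with the job id in the body."""
    return _request("GET", "/v1/download", {"id": job_id})


def list_request(regex=None):
    """GET /v1/download; the body is omitted when no regex is given."""
    return _request("GET", "/v1/download", None if regex is None else {"regex": regex})


def abort_request(job_id):
    """DELETE /v1/download/abort with the job id in the body."""
    return _request("DELETE", "/v1/download/abort", {"id": job_id})


def remove_request(job_id):
    """DELETE /v1/download/remove with the job id in the body."""
    return _request("DELETE", "/v1/download/remove", {"id": job_id})


for req in (
    status_request("5JjIuGemR"),
    list_request("^[0-9]"),
    abort_request("5JjIuGemR"),
    remove_request("5JjIuGemR"),
):
    print(req.get_method(), req.full_url, req.data)
```

Keeping the four builders side by side shows that status and list share one path and are told apart by the body alone, while abort and remove share the `DELETE` method and are told apart by the path.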