---
layout: post
title: BUCKET
permalink: /docs/bucket
redirect_from:
 - /bucket.md/
 - /docs/bucket.md/
---

# Table of Contents

- [Bucket](#bucket)
  - [Default Bucket Properties](#default-bucket-properties)
  - [Inherited Bucket Properties and LRU](#inherited-bucket-properties-and-lru)
  - [Backend Provider](#backend-provider)
- [List Buckets](#list-buckets)
- [AIS Bucket](#ais-bucket)
  - [CLI: create, rename, and destroy ais bucket](#cli-create-rename-and-destroy-ais-bucket)
  - [CLI: specifying and listing remote buckets](#cli-specifying-and-listing-remote-buckets)
  - [CLI: working with remote AIS cluster](#cli-working-with-remote-ais-cluster)
- [Remote Bucket](#remote-bucket)
  - [Public Cloud Buckets](#public-cloud-buckets)
  - [Remote AIS cluster](#remote-ais-cluster)
  - [Public HTTP(S) Dataset](#public-https-dataset)
  - [Prefetch/Evict Objects](#prefetchevict-objects)
  - [Evict Remote Bucket](#evict-remote-bucket)
  - [Out of band updates](/docs/out_of_band.md)
- [Backend Bucket](#backend-bucket)
  - [AIS bucket as a reference](#ais-bucket-as-a-reference)
- [Bucket Properties](#bucket-properties)
  - [CLI examples: listing and setting bucket properties](#cli-examples-listing-and-setting-bucket-properties)
- [Bucket Access Attributes](#bucket-access-attributes)
- [AWS-specific configuration](#aws-specific-configuration)
- [List Objects](#list-objects)
  - [Options](#options)
  - [Results](#results)

# Bucket

AIStore uses the popular and well-known bucket abstraction, originally (likely) introduced by Amazon S3.

Similar to S3, an AIS bucket is a _container for objects_.

> An object, in turn, is a file **and** the metadata that describes that object, which normally includes: checksum, version, references to copies (replicas), size, last access time, source bucket (if the object's origin is a Cloud bucket), custom user-defined attributes, and more.

AIS is a flat `<bucket-name>/<object-name>` storage hierarchy where named buckets store user datasets.

In addition, each AIS bucket is a point of applying (per-bucket) management policies: checksumming, versioning, erasure coding, mirroring, LRU eviction, checksum and/or version validation, and more.

AIS buckets *contain* user data, performing the same function as, for instance:

* [Amazon S3 buckets](https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingBucket.html)
* [Google Cloud (GCP) buckets](https://cloud.google.com/storage/docs/key-terms#buckets)
* [Microsoft Azure Blob containers](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction)

In addition, AIS supports multiple storage **backends** including itself:

![Supported Backends](images/supported-backends.png)

But there's more.

AIStore supports vendor-specific configuration on a per-bucket basis. For instance, any bucket _backed_ by an AWS S3 bucket (**) can be configured to use alternative:

* named AWS profiles (with alternative credentials and/or region)
* s3 endpoints

(**) Terminology-wise, when we say "s3 bucket" or "google cloud bucket" we in fact reference a bucket in an AIS cluster that is either:

* (A) denoted with the respective `s3:` or `gs:` protocol schema, or
* (B) a differently named AIS (that is, `ais://`) bucket that has its `backend_bck` property referencing the s3 (or google cloud) bucket in question.

> For examples and **usage**, grep docs for `backend_bck` or see [AWS profiles and alternative s3 endpoints](/docs/cli/aws_profile_endpoint.md).

All the [supported storage services](storage_svcs.md) equally apply to all storage backends with only a few exceptions. The following table summarizes them.

| Kind | Description | Supported Storage Services |
| --- | --- | --- |
| AIS buckets | Buckets that are **not** 3rd party backend-based. AIS buckets store user objects and support user-specified bucket properties (e.g., 3 copies). Unlike remote buckets, ais buckets can be created through the [RESTful API](http_api.md). Similar to remote buckets, ais buckets are distributed and balanced, content-wise, across the entire AIS cluster. | [Checksumming](storage_svcs.md#checksumming), [LRU (advanced usage)](storage_svcs.md#lru-for-local-buckets), [Erasure Coding](storage_svcs.md#erasure-coding), [Local Mirroring and Load Balancing](storage_svcs.md#local-mirroring-and-load-balancing) |
| remote buckets | When AIS is deployed as [fast tier](providers.md), buckets in the cloud storage can be viewed and accessed through the [RESTful API](http_api.md) in AIS, in the exact same way as ais buckets. When this happens, AIS creates local instances of said buckets, which then serve as a cache. These are referred to as **3rd party backend-based buckets**. | [Checksumming](storage_svcs.md#checksumming), [LRU](storage_svcs.md#lru), [Erasure Coding](storage_svcs.md#erasure-coding), [Local mirroring and load balancing](storage_svcs.md#local-mirroring-and-load-balancing) |

3rd party backend-based and AIS buckets support the same API with a few documented exceptions. Remote buckets can be *evicted* from AIS. AIS buckets are the only buckets that can be created, renamed, and deleted via the [RESTful API](http_api.md).

## Default Bucket Properties

By default, created buckets inherit their properties from the cluster-wide global [configuration](configuration.md).
Similar to other types of cluster-wide metadata, global configuration (also referred to as "cluster configuration")
is protected (versioned, checksummed) and replicated across the entire cluster.

**Important**:

* Bucket properties can be changed at any time via `api.SetBucketProps`.
* In addition, `api.CreateBucket` allows specifying (non-default) properties at bucket creation time.
* Inherited defaults include (but are not limited to) checksum, LRU, versioning, n-way mirroring, and erasure-coding configurations.
* By default, LRU is disabled for AIS (`ais://`) buckets.

The bucket creation operation allows overriding the **inherited defaults**, which include:

| Configuration section | References |
| --- | --- |
| Backend | [Backend Provider](#backend-provider) |
| Checksum | [Supported Checksums and Brief Theory of Operations](checksum.md) |
| LRU | [Storage Services: LRU](storage_svcs.md#lru) |
| N-way mirror | [Storage Services: n-way mirror](storage_svcs.md#n-way-mirror) |
| Versioning | --- |
| Access | [Bucket Access Attributes](#bucket-access-attributes) |
| Erasure Coding | [Storage Services: erasure coding](storage_svcs.md#erasure-coding) |
| Metadata Persistence | --- |

Example specifying (non-default) bucket properties at creation time:

```console
$ ais create ais://abc --props="mirror.enabled=true mirror.copies=4"

# or, same using JSON:
$ ais create ais://abc --props='{"mirror": {"enabled": true, "copies": 4}}'
```

## Inherited Bucket Properties and LRU

1. [LRU](storage_svcs.md#lru) eviction triggers automatically when the percentage of used capacity exceeds the configured ("high") watermark `space.highwm`. The latter is part of bucket configuration and one of the many bucket properties that can be individually configured.
2. By default, `space.highwm` = `90%` of total storage space.
3. Another important knob is `lru.enabled`, which defines whether a given bucket can be subject to LRU eviction in the first place.
4. By default, these two and all the other knobs are [inherited](#default-bucket-properties) by a newly created bucket from the [default (global, cluster-wide) configuration](configuration.md#cluster-and-node-configuration).
5. However, those inherited defaults can be changed - [overridden](#default-bucket-properties) - both at bucket creation time and at any later time.

Going back to [LRU](storage_svcs.md#lru), it can be disabled (or enabled) on a per-bucket basis.

Prior to version 3.8, [LRU](storage_svcs.md#lru) eviction **was by default globally enabled**. Starting with v3.8, [LRU](storage_svcs.md#lru) is enabled by default **only for remote buckets**.

> AIS buckets that have remote backends are, by definition, remote buckets. See [next section](#backend-provider) for details.

In summary, starting with v3.8, a newly created AIS bucket inherits default configuration that makes the bucket *non-evictable*.

Useful CLI commands include:

```console
# CLI to conveniently _toggle_ LRU eviction on and off on a per-bucket basis:
$ ais bucket lru ...

# Reset bucket properties to cluster-wide defaults:
$ ais bucket props reset ...

# Evict any given bucket based on a user-defined _template_.
# The command is one of the many supported _multi-object_ operations that run asynchronously
# and handle arbitrary (list, range, prefix)-defined templates.
$ ais bucket evict ...
```

See also:

* [CLI: Operations on Lists and Ranges](/docs/cli/object.md#operations-on-lists-and-ranges)
* [api.CreateBucket() and api.SetBucketProps()](/api/bucket.go)
* [RESTful API](http_api.md)
* [CLI: listing and setting bucket properties](#cli-examples-listing-and-setting-bucket-properties)
* [CLI documentation and many more examples](cli/bucket.md)

## Backend Provider

[Backend Provider](providers.md) is an abstraction and, simultaneously, an API-supported option that makes it possible to delineate between "remote" and "local" buckets with respect to a given (any given) AIS cluster.
For the complete definition and details, please refer to the [backend provider document](providers.md).

Backend provider is realized as an optional parameter in the GET, PUT, APPEND, DELETE and [Range/List](batch.md) operations with supported enumerated values that include:
* `ais` - for AIS buckets
* `aws` or `s3` - for Amazon S3 buckets
* `azure` or `az` - for Microsoft Azure Blob Storage buckets
* `gcp` or `gs` - for Google Cloud Storage buckets
* `ht` - for HTTP(S) based datasets

For API reference, please refer [to the RESTful API and examples](http_api.md).
The rest of this document serves to further explain features and concepts specific to storage buckets.
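
To illustrate how the provider aliases listed above relate to one another, here is a minimal Python sketch that splits a `provider://bucket/object` string and maps each alias to a single name. It is not AIStore code: the function name `split_bucket_uri`, the choice of canonical names (`aws`, `gcp`, `azure`), and the `ais` default for names without a scheme are assumptions made for this example.

```python
# Aliases taken from the list above. The canonical name chosen for each
# pair is an assumption of this sketch, not a statement about AIStore internals.
PROVIDER_ALIASES = {
    "ais": "ais",
    "aws": "aws", "s3": "aws",
    "azure": "azure", "az": "azure",
    "gcp": "gcp", "gs": "gcp",
    "ht": "ht",
}


def split_bucket_uri(uri, default_provider="ais"):
    """Split 'provider://bucket/object' into (provider, bucket, object_name).

    Hypothetical helper: a name without a scheme is treated as an AIS bucket.
    """
    scheme, sep, rest = uri.partition("://")
    if not sep:  # no "://" present, e.g. "abc/obj"
        scheme, rest = default_provider, uri
    provider = PROVIDER_ALIASES.get(scheme.lower())
    if provider is None:
        raise ValueError(f"unknown backend provider: {scheme!r}")
    bucket, _, object_name = rest.partition("/")
    return provider, bucket, object_name


print(split_bucket_uri("s3://abc/o1"))
print(split_bucket_uri("gs://xyz"))
```

With this mapping, `s3://abc` and `aws://abc` resolve to the same provider and bucket, which mirrors how the CLI examples in this document accept `gs://` on input and print `gcp://` on output.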

# List Buckets

To list all buckets, both _present_ in the cluster and remote, simply run:

* `ais ls --all`

Other useful variations of the command include:

* `ais ls s3` - list only those s3 buckets that are _present_ in the cluster
* `ais ls gs` - GCP buckets
* `ais ls ais` - list _all_ AIS buckets
* `ais ls ais://@ --all` - list _all_ remote AIS buckets (i.e., buckets in all remote AIS clusters currently attached)

And more:

* `ais ls s3: --all --regex abc` - list _all_ s3 buckets that match a given regex ("abc", in the example)
* `ais ls gs: --summary` - report usage statistics: numbers of objects and total sizes

## See also

* `ais ls --help`
* [CLI: `ais ls`](/docs/cli/bucket.md)

# AIS Bucket

AIS buckets are AIStore's own distributed buckets that are not associated with any 3rd party Cloud.

The [RESTful API](http_api.md) can be used to create, copy, rename, and destroy ais buckets.

New ais buckets must be given a unique name that does not duplicate any existing ais bucket.

If you are going to use an AIS bucket as an S3-compatible one, consider changing the bucket's checksum to `MD5`.
For details, see [S3 compatibility](s3compat.md#s3-compatibility).

## CLI: create, rename, and destroy ais bucket

To create an ais bucket with the name `yt8m`, rename it to `yt8m_extended` and delete it, run:

```console
$ ais create ais://yt8m
$ ais bucket mv ais://yt8m ais://yt8m_extended
$ ais bucket rm ais://yt8m_extended
```

Please note that renaming a bucket is not an instant operation, especially if the bucket contains data. Follow the `rename` command tips to monitor when the operation completes.

## CLI: specifying and listing remote buckets

To list absolutely _all_ buckets that your AIS cluster has access to, run `ais ls --all`.

To list all remote (and only remote) buckets, use: `ais ls @`. For example:

```console
$ ais ls @

AIS Buckets (1)
  ais://@U-0MEX8oYt/abc
GCP Buckets (7)
  gcp://lpr-foo
  gcp://lpr-bar
  ... (another 5 buckets omitted)
```

This example assumes that there's a remote AIS cluster identified by its UUID `U-0MEX8oYt` and previously [attached](#cli-working-with-remote-ais-cluster) to the "local" one.

Notice the naming notation for remote AIS buckets: the `@` prefix in the full bucket name indicates the remote cluster's UUID.

> Complete bucket naming specification includes bucket name, backend provider and namespace (which in turn includes UUID and optional sub-name, etc.). The spec can be found in this [source](/cmn/bucket.go).

And here are CLI examples of listing buckets by a given provider:

### List Google buckets:
```console
$ ais ls gs://
# or, same:
$ ais ls gs:

GCP Buckets (7)
  gcp://lpr-foo
  gcp://lpr-bar
  ...
```

### List AIS buckets:
```console
$ ais ls ais://
# or, same:
$ ais ls ais:
```

### List remote AIS buckets:
```console
$ ais ls ais://@
```

## CLI: working with remote AIS cluster

AIS clusters can be attached to each other, thus forming a global (and globally accessible) namespace of all individually hosted datasets. For background and details on AIS multi-clustering, please refer to this [document](providers.md#remote-ais-cluster).
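
As a rough sketch of the `@` naming convention used in this section, the following Python function separates the remote cluster reference (UUID or alias) from the bucket name. The function `parse_remote_ais` is hypothetical and handles only the simple forms shown in this document; the complete naming spec (namespaces with optional sub-names) lives in the AIStore source and is not reproduced here.

```python
def parse_remote_ais(name):
    """Return (cluster_ref, bucket) for an 'ais://...' bucket name.

    cluster_ref is None for a bucket in the local cluster, '' for a bare
    'ais://@' (any remote cluster), and the UUID or alias otherwise.
    Hypothetical helper for illustration only.
    """
    prefix = "ais://"
    if not name.startswith(prefix):
        raise ValueError("not an ais:// bucket name")
    rest = name[len(prefix):]
    if not rest.startswith("@"):
        # No '@': the bucket belongs to the local cluster.
        return None, rest.partition("/")[0]
    cluster_ref, _, tail = rest[1:].partition("/")
    bucket = tail.partition("/")[0]
    return cluster_ref, bucket


print(parse_remote_ais("ais://@U-0MEX8oYt/abc"))
print(parse_remote_ais("ais://@"))
print(parse_remote_ais("ais://abc"))
```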

The following example creates an attachment between two clusters, lists all remote buckets, and then lists objects in one of those remote buckets (see comments inline):

```console
$ # Attach remote AIS cluster and assign it an alias `teamZ` (for convenience and for future reference):
$ ais cluster attach teamZ=http://cluster.ais.org:51080
Remote cluster (teamZ=http://cluster.ais.org:51080) successfully attached
$
$ # The cluster at http://cluster.ais.org:51080 is now persistently attached:
$ ais show remote-cluster
UUID      URL                            Alias     Primary      Smap   Targets  Online
MCBgkFqp  http://cluster.ais.org:51080   teamZ     p[primary]   v317   10       yes
$
$ # List all buckets in all remote clusters
$ # Notice the syntax: by convention, we use `@` to prefix remote cluster UUIDs, and so
$ # `ais://@` translates as "AIS backend provider, any remote cluster"
$
$ ais ls ais://@
AIS Buckets (4)
  ais://@MCBgkFqp/imagenet
  ais://@MCBgkFqp/coco
  ais://@MCBgkFqp/imagenet-augmented
  ais://@MCBgkFqp/imagenet-inflated
$
$ # List all buckets in the remote cluster with UUID = MCBgkFqp
$ # Notice the syntax: `ais://@some-string` translates as "remote AIS cluster with alias or UUID equal some-string"
$
$ ais ls ais://@MCBgkFqp
AIS Buckets (4)
  ais://@MCBgkFqp/imagenet
  ais://@MCBgkFqp/coco
  ais://@MCBgkFqp/imagenet-augmented
  ais://@MCBgkFqp/imagenet-inflated
$
$ # List all buckets with name matching the regex pattern "tes*"
$ ais ls --regex "tes*"
AWS Buckets (3)
  aws://test1
  aws://test2
  aws://test2
$
$ # We can conveniently keep using our previously selected alias for the remote cluster -
$ # The following lists selected remote bucket using the cluster's alias:
$ ais ls ais://@teamZ/imagenet-augmented
NAME              SIZE
train-001.tgz     153.52KiB
train-002.tgz     136.44KiB
...
$
$ # The same, but this time using the cluster's UUID:
$ ais ls ais://@MCBgkFqp/imagenet-augmented
NAME              SIZE
train-001.tgz     153.52KiB
train-002.tgz     136.44KiB
...
```

# Remote Bucket

Remote buckets are buckets that use 3rd party storage (AWS/GCP/Azure or HDFS) when AIS is deployed as [fast tier](overview.md#fast-tier).
Any reference to "Cloud buckets" refers to remote buckets that use a public cloud bucket as their backend (i.e. AWS/GCP/Azure, but not HDFS).

> By default, AIS does not keep track of the remote buckets in its configuration map. However, if users modify the properties of the remote bucket, AIS will then keep track.

## Public Cloud Buckets

Public Google Storage supports limited access to its data.
If an AIS cluster is deployed with Google Cloud enabled (Google Storage is selected as the 3rd party backend provider when [deploying an AIS cluster](/docs/getting_started.md#local-playground)), it allows a few operations without providing credentials:
HEAD a bucket, list a bucket's content, GET an object, and HEAD an object.
The example shows accessing a private GCP bucket and a public GCP one without user authorization.

```console
$ # Listing objects of a private bucket
$ ais ls gs://ais-ic
Bucket "gcp://ais-ic" does not exist
$
$ # Listing a public bucket
$ ais ls gs://pub-images --limit 3
NAME                          SIZE
images-shard.ipynb            101.94KiB
images-train-000000.tar       964.77MiB
images-train-000001.tar       964.74MiB
```

Even if an AIS cluster is deployed without Cloud support, it is still possible to access public GCP and AWS buckets.
Run downloader to copy data from a public Cloud bucket to an AIS bucket and then use the AIS bucket.
The following example shows how to download data from public Google storage:

```console
$ ais create ais://images
"ais://images" bucket created
$ ais start download "gs://pub-images/images-train-{000000..000001}.tar" ais://images/
Z8WkHxwIrr
Run `ais show job download Z8WkHxwIrr` to monitor the progress of downloading.
$ ais wait download Z8WkHxwIrr # or, same: ais wait Z8WkHxwIrr
$ ais ls ais://images
NAME                          SIZE
images-train-000000.tar       964.77MiB
images-train-000001.tar       964.74MiB
```

> Job starting, stopping (i.e., aborting), and monitoring commands all have equivalent *shorter* versions. For instance `ais job start download` can be expressed as `ais start download`, while `ais wait copy-bucket Z8WkHxwIrr` is the same as `ais wait Z8WkHxwIrr`.

## Remote AIS cluster

An AIS cluster can be *attached* to another one, which provides immediate capability for one cluster to "see" and transparently access the other's buckets and objects.

The functionality is termed [global namespace](providers.md#remote-ais-cluster) and is further described in the [backend providers](providers.md) readme.

To support global namespace, bucket names include the `@`-prefixed cluster UUID. For remote AIS clusters, remote UUIDs and remote aliases can be used interchangeably.

For example, `ais://@remais/abc` translates as: AIS backend provider, remote cluster with the alias `remais`, bucket `abc`.

Examples of working with a remote AIS cluster (as well as easy-to-use scripts) can be found at:

* [readme for developers](development.md)
* [working with remote AIS cluster](#cli-working-with-remote-ais-cluster)

## Public HTTP(S) Dataset

It is standard in the machine learning community to publish datasets in the public domain so that everyone can access them.
AIStore has integrated tools like [downloader](/docs/downloader.md) which can help in downloading those large datasets straight into a given AIS bucket.
However, sometimes using such tools is not a feasible solution.

For such cases, AIStore can act as a reverse proxy when accessing **any** URL.
This enables downloading any HTTP(S) based content into an AIStore cluster.
Assuming that the proxy is listening on `localhost:8080`, one can use it as a reverse proxy to download the `http://storage.googleapis.com/minikube/minikube-0.6.iso.sha256` object into the AIS cluster:

```console
$ curl -sL --max-redirs 3 -x localhost:8080 --noproxy "$(curl -s localhost:8080/v1/cluster?what=target_ips)" \
  -X GET "http://storage.googleapis.com/minikube/minikube-0.6.iso.sha256" \
  > /dev/null
```

Alternatively, an object can also be downloaded using the `get` and `cat` CLI commands.
```console
$ ais get http://storage.googleapis.com/minikube/minikube-0.7.iso.sha256 minikube-0.7.iso.sha256
```

This will cache the object inside the AIStore cluster.
We can confirm this by listing available buckets and checking the content:

```console
$ ais ls
AIS Buckets (1)
  ais://local-bck
AWS Buckets (1)
  aws://ais-test
HTTP(S) Buckets (1)
  ht://ZDdhNTYxZTkyMzhkNjk3NA (http://storage.googleapis.com/minikube/)
$ ais ls ht://ZDdhNTYxZTkyMzhkNjk3NA
NAME                                 SIZE
minikube-0.6.iso.sha256              65B
```

Now, when the object is accessed again, it will be served from the AIStore cluster and will **not** be re-downloaded from the HTTP(S) source.

Under the hood, AIStore remembers the object's source URL and associates the bucket with this URL.
In our example, bucket `ht://ZDdhNTYxZTkyMzhkNjk3NA` will be associated with the `http://storage.googleapis.com/minikube/` URL.
Therefore, we can interchangeably use the associated URL for listing the bucket as shown below.

```console
$ ais ls http://storage.googleapis.com/minikube
NAME                                  SIZE
minikube-0.6.iso.sha256               65B
```

> Note that only the last part (`minikube-0.6.iso.sha256`) of the URL is treated as the object name.

This association between the bucket and the URL allows downloading content without providing the URL again:

```console
$ ais object cat ht://ZDdhNTYxZTkyMzhkNjk3NA/minikube-0.7.iso.sha256 > /dev/null # cache another object
$ ais ls ht://ZDdhNTYxZTkyMzhkNjk3NA
NAME                     SIZE
minikube-0.6.iso.sha256  65B
minikube-0.7.iso.sha256  65B
```

## Prefetch/Evict Objects

Objects within remote buckets are automatically fetched into storage targets when accessed through AIS and are evicted based on the monitored capacity and configurable high/low watermarks when [LRU](storage_svcs.md#lru) is enabled.

The [RESTful API](http_api.md) can be used to manually fetch a group of objects from the remote bucket (called prefetch) into storage targets or to remove them from AIS (called evict).

Objects are prefetched or evicted using [List/Range Operations](batch.md#listrange-operations).

For example, to use a [list operation](batch.md#list) to prefetch 'o1', 'o2', and 'o3' from the Amazon S3 remote bucket `abc`, run:

```console
$ ais start prefetch aws://abc --list o1,o2,o3
```

To use a [range operation](batch.md#range) to evict from AIS the 1000th to 2000th objects in the remote bucket `abc`, whose names begin with the prefix `__tst/test-`, run:

```console
$ ais bucket evict aws://abc --template "__tst/test-{1000..2000}"
```

### See also

* [Operations on Lists and Ranges](/docs/cli/object.md#operations-on-lists-and-ranges)

## Evict Remote Bucket

This is the `ais bucket evict` command, but most of the time we'll be using its `ais evict` alias:

```console
$ ais evict --help
NAME:
   ais evict - (alias for "bucket evict") evict one remote bucket, multiple remote buckets, or
   selected objects in a given remote bucket or buckets, e.g.:
     - 'evict gs://abc'                                          - evict entire bucket (all gs://abc objects in aistore);
     - 'evict gs:'                                               - evict all GCP buckets from the cluster;
     - 'evict gs://abc --template images/'                       - evict all objects from the virtual subdirectory "images";
     - 'evict gs://abc/images/'                                  - same as above;
     - 'evict gs://abc --template "shard-{0000..9999}.tar.lz4"'  - evict the matching range (prefix + brace expansion);
     - 'evict "gs://abc/shard-{0000..9999}.tar.lz4"'             - same as above (notice double quotes)

USAGE:
   ais evict [command options] BUCKET[/OBJECT_NAME_or_TEMPLATE] [BUCKET[/OBJECT_NAME_or_TEMPLATE] ...]

OPTIONS:
   --list value         comma-separated list of object or file names, e.g.:
                        --list 'o1,o2,o3'
                        --list "abc/1.tar, abc/1.cls, abc/1.jpeg"
                        or, when listing files and/or directories:
                        --list "/home/docs, /home/abc/1.tar, /home/abc/1.jpeg"
   --template value     template to match object or file names; may contain prefix (that could be empty) with zero or more ranges
                        (with optional steps and gaps), e.g.:
                        --template "" # (an empty or '*' template matches eveything)
                        --template 'dir/subdir/'
                        --template 'shard-{1000..9999}.tar'
                        --template "prefix-{0010..0013..2}-gap-{1..2}-suffix"
                        and similarly, when specifying files and directories:
                        --template '/home/dir/subdir/'
                        --template "/abc/prefix-{0010..9999..2}-suffix"
   --wait               wait for an asynchronous operation to finish (optionally, use '--timeout' to limit the waiting time)
   --timeout value      maximum time to wait for a job to finish; if omitted: wait forever or until Ctrl-C;
                        valid time units: ns, us (or µs), ms, s (default), m, h
   --progress           show progress bar(s) and progress of execution in real time
   --refresh value      interval for continuous monitoring;
                        valid time units: ns, us (or µs), ms, s (default), m, h
   --keep-md            keep bucket metadata
   --prefix value       select objects that have names starting with the specified prefix, e.g.:
                        '--prefix a/b/c'   - matches names 'a/b/c/d', 'a/b/cdef', and similar;
                        '--prefix a/b/c/'  - only matches objects from the virtual directory a/b/c/
   --dry-run            preview the results without really running the action
   --verbose, -v        verbose output
   --non-verbose, --nv  non-verbose (quiet) output, minimized reporting
   --help, -h           show help
```

Note the usage examples above. You can always run the command with the `--help` option to see the most recently updated inline help.
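
To make the `--template` brace ranges from the help text easier to reason about, here is a small Python sketch that expands `{start..end}` and `{start..end..step}` ranges into object names. It is an approximation written for this document, not the AIStore parser: the inclusive upper bound and the rule that zero padding follows the width of the `start` value are assumptions inferred from the examples above, and `expand_template` is a hypothetical name.

```python
import itertools
import re

# Matches {start..end} and {start..end..step}, e.g. {0010..0013..2}.
_RANGE = re.compile(r"\{(\d+)\.\.(\d+)(?:\.\.(\d+))?\}")


def expand_template(template):
    """Yield every name produced by the brace ranges in `template`.

    Assumptions of this sketch: the end of each range is inclusive, and
    numbers are zero padded to the width of the range's start value.
    """
    literals, choices, pos = [], [], 0
    for m in _RANGE.finditer(template):
        literals.append(template[pos:m.start()])  # text before this range
        start, end = m.group(1), m.group(2)
        step = int(m.group(3) or 1)
        width = len(start)
        choices.append(
            [f"{n:0{width}d}" for n in range(int(start), int(end) + 1, step)]
        )
        pos = m.end()
    tail = template[pos:]
    # Cartesian product over all ranges, leftmost range varying slowest.
    for combo in itertools.product(*choices):
        yield "".join(lit + num for lit, num in zip(literals, combo)) + tail


for name in expand_template("prefix-{0010..0013..2}-gap-{1..2}-suffix"):
    print(name)
```

Under these assumptions, a template without any ranges (such as `dir/subdir/`) expands to itself and acts as a plain prefix, while `__tst/test-{1000..2000}` yields 1001 names.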

Once there is a request to access the bucket, or a request to change the bucket's properties (see `set bucket props` in [REST API](http_api.md)), the AIS cluster starts keeping track of the bucket.

In an evict bucket operation, AIS will remove all traces of the remote bucket within the AIS cluster. This effectively resets the AIS cluster to the point before any requests to the bucket have been made. This does not affect the objects stored within the remote bucket.

For example, to evict the `abc` remote bucket from the AIS cluster, run:

```console
$ ais bucket evict aws://abc
```

Note: When an HDFS bucket is evicted, AIS will only delete objects stored in the cluster. AIS will retain the bucket's metadata to allow the bucket to re-register later.
This behavior can be applied to other remote buckets by using the `--keep-md` flag with `ais bucket evict`.

### See also

* [Operations on Lists and Ranges](/docs/cli/object.md#operations-on-lists-and-ranges)

# Backend Bucket

So far, we have covered AIS and remote buckets. These abstractions are sufficient for almost all use cases. But there are times when we would like to download objects from an existing remote bucket and then make use of the features available only for AIS buckets.

One way of accomplishing that could be:
1. Prefetch cloud objects.
2. Create AIS bucket.
3. Use the bucket-copying [API](http_api.md) or [CLI](/docs/cli/bucket.md) to copy over the objects from the remote bucket to the newly created AIS bucket.

However, the extra copying involved may prove to be time and/or space consuming. Hence the AIS-supported capability to establish an **ad-hoc** 1-to-1 relationship between a given AIS bucket and an existing cloud (*backend*) bucket.

> As an aside, the term "backend" - something that is on the back, usually far (or farther) away - is often used for data redundancy, data caching, and/or data sharing.
AIS *backend bucket* makes it possible to achieve all of the above.

For example:

```console
$ ais create abc
"abc" bucket created
$ ais bucket props set ais://abc backend_bck=gcp://xyz
Bucket props successfully updated
```

After that, you can access all objects from `gcp://xyz` via `ais://abc`. **On-demand persistent caching** (from `gcp://xyz`) then becomes automatically available, as well as **all other AIS-supported storage services** configurable on a per-bucket basis.

For example:

```console
$ ais ls gcp://xyz
NAME         SIZE      VERSION
shard-0.tar  2.50KiB   1
shard-1.tar  2.50KiB   1
$ ais ls ais://abc
NAME         SIZE      VERSION
shard-0.tar  2.50KiB   1
shard-1.tar  2.50KiB   1
$ ais get ais://abc/shard-0.tar /dev/null # cache/prefetch cloud object
"shard-0.tar" has the size 2.50KiB (2560 B)
$ ais ls ais://abc --cached
NAME         SIZE      VERSION
shard-0.tar  2.50KiB   1
$ ais bucket props set ais://abc backend_bck=none # disconnect backend bucket
Bucket props successfully updated
$ ais ls ais://abc
NAME         SIZE      VERSION
shard-0.tar  2.50KiB   1
```

For more examples please refer to [CLI docs](/docs/cli/bucket.md#connectdisconnect-ais-bucket-tofrom-cloud-bucket).

## AIS bucket as a reference

Stated differently, an aistore bucket can itself serve as a reference to another bucket. E.g., you could have, say, `ais://llm-latest` always point to whatever is the latest result of a data prep service.

```console
### create an arbitrary bucket (say, `ais://llm-latest`) and always use it to reference the latest augmented results

$ ais create ais://llm-latest
$ ais bucket props set ais://llm-latest backend_bck=gs://llm-augmented-2023-12-04

### next day, when the data prep service produces a new derivative:

$ ais bucket props set ais://llm-latest backend_bck=gs://llm-augmented-2023-12-05

### and keep using the same static name, etc.
```

Caching-wise, when you walk `ais://llm-latest` (or any other aistore bucket with a remote backend), aistore will make sure to perform remote (cold) GETs to update itself when and if required, etc.

> In re "cold GET" vs "warm GET" performance, see [AIStore as a Fast Tier Storage](https://aiatscale.org/blog/2023/11/27/aistore-fast-tier) blog.

# Bucket Properties

The full list of bucket properties is:

| Bucket Property | JSON | Description | Fields |
| --- | --- | --- | --- |
| Provider | `provider` | "ais", "aws", "azure", "gcp", or "ht" | `"provider": "ais"/"aws"/"azure"/"gcp"/"ht"` |
| Cksum | `checksum` | Please refer to [Supported Checksums and Brief Theory of Operations](checksum.md) | |
| LRU | `lru` | Configuration for [LRU](storage_svcs.md#lru). `space.lowwm` and `space.highwm` are the used-capacity low and high watermarks (% of total local storage capacity), respectively. If `space.out_of_space` is exceeded, the target starts failing new PUTs and keeps failing them until its local used capacity gets back below `space.highwm`. `dont_evict_time` denotes the period of time during which eviction of an object is forbidden [atime, atime + `dont_evict_time`]. `capacity_upd_time` denotes the frequency at which AIStore updates local capacity utilization. `enabled`: LRU will only run when set to true. | `"lru": {"dont_evict_time": "120m", "capacity_upd_time": "10m", "enabled": bool }`. Note: `space.*` are cluster level properties. |
| Mirror | `mirror` | Configuration for [Mirroring](storage_svcs.md#n-way-mirror). `copies` represents the number of local copies. `burst_buffer` represents channel buffer size. `enabled`: local copies will only be generated when set to true. | `"mirror": { "copies": int64, "burst_buffer": int64, "enabled": bool }` |
| EC | `ec` | Configuration for [erasure coding](storage_svcs.md#erasure-coding). `objsize_limit` is the size threshold: objects smaller than this are replicated instead of EC'ed. `data_slices` represents the number of data slices. `parity_slices` represents the number of parity slices/replicas. `enabled` indicates whether EC is enabled. | `"ec": { "objsize_limit": int64, "data_slices": int, "parity_slices": int, "enabled": bool }` |
| Versioning | `versioning` | Configuration for object versioning support, where `enabled` indicates whether object versioning is enabled for a bucket. For a remote bucket, versioning must be enabled in the corresponding backend (e.g. Amazon S3). `validate_warm_get` determines whether the object's version is checked. | `"versioning": { "enabled": true, "validate_warm_get": false }` |
| AccessAttrs | `access` | Bucket access [attributes](#bucket-access-attributes). Default value is 0 - full access | `"access": "0"` |
| BID | `bid` | Readonly property: unique bucket ID | `"bid": "10e45"` |
| Created | `created` | Readonly property: bucket creation date, in nanoseconds (Unix time) | `"created": "1546300800000000000"` |

## CLI examples: listing and setting bucket properties

### List bucket properties

```console
$ ais show bucket mybucket
...
$
$ # Or, the same to get output in a (raw) JSON form:
$ ais show bucket mybucket --json
...
```

### Enable erasure coding on a bucket

```console
$ ais bucket props set mybucket ec.enabled=true
```

### Enable object versioning and then list updated bucket properties

```console
$ ais bucket props set mybucket versioning.enabled=true
$ ais show bucket mybucket
...
```

# Bucket Access Attributes

Bucket access is controlled by a single 64-bit `access` value in the [Bucket Properties structure](/cmn/api.go), whose bits map to allowed (or denied) operations as follows:

| Operation | Bit Mask |
| --- | --- |
| GET | 0x1 |
| HEAD | 0x2 |
| PUT, APPEND | 0x4 |
| Cold GET | 0x8 |
| DELETE | 0x10 |

For instance, to make bucket `abc` read-only, execute the following [AIS CLI](/docs/cli.md) command:

```console
$ ais bucket props set abc 'access=ro'
```

The same expressed via `curl` will look as follows:

```console
$ curl -i -X PATCH -H 'Content-Type: application/json' -d '{"action": "set-bprops", "value": {"access": 18446744073709551587}}' http://localhost:8080/v1/buckets/abc
```

> `18446744073709551587 = 0xffffffffffffffe3 = 0xffffffffffffffff ^ (4|8|16)`

# AWS-specific configuration

AIStore supports AWS-specific configuration on a per s3 bucket basis. Any bucket that is backed by an AWS S3 bucket (**) can be configured to use alternative:

* named AWS profiles (with alternative credentials and/or region)
* alternative s3 endpoints

For background and usage examples, please see [CLI: AWS-specific bucket configuration](/docs/cli/aws_profile_endpoint.md).

# List Objects

> Note: some of the following content **may be outdated**. For the most recent updates, please check [`ais ls`](https://github.com/NVIDIA/aistore/blob/main/docs/cli/bucket.md#list-objects) CLI.

ListObjects API returns a page of object names and, optionally, their properties (including sizes, access time, checksums, and more), in addition to a token that serves as a cursor, or a marker for the *next* page retrieval.

> Go [ListObjects](https://github.com/NVIDIA/aistore/blob/main/api/bucket.go) API

When a cluster is rebalancing, the returned list of objects can be incomplete because objects are being migrated.
The returned [result](#results) has a non-zero `flags` value (the least significant bit is set to `1`) to indicate that the list was generated when the cluster was unstable.
To get the correct list, either re-request the list after the rebalance ends or read the list with [the option](#options) `SelectMisplaced` enabled.
In the latter case, the list may contain duplicated entries.

## Options

The properties-and-options specifier must be a JSON-encoded structure, for instance `{"props": "size"}` (see examples).
An empty structure `{}` results in getting just the names of the objects (from the specified bucket) with no other metadata.

| Property/Option | Description | Value |
| --- | --- | --- |
| `uuid` | ID of the list objects operation | After the initial request to list objects, the `uuid` is returned and should be used for subsequent requests. The ID ensures integrity across subsequent requests. |
| `pagesize` | The maximum number of object names returned in response | For AIS buckets the default value is `10000`. For remote buckets this value varies, as each provider has its own maximum page size. |
| `props` | The properties of the object to return | A comma-separated string containing any combination of: `name,size,version,checksum,atime,location,copies,ec,status` (if not specified, props are set to `name,size,version,checksum,atime`). <sup id="a1">[1](#ft1)</sup> |
| `prefix` | The prefix which all returned objects must have | For example, `prefix = "my/directory/structure/"` will include object `object_name = "my/directory/structure/object1.txt"` but will not include `object_name = "my/directory/object2.txt"` |
| `start_after` | Name of the object after which the listing should start | For example, `start_after = "baa"` will include object `object_name = "caa"` but will include neither `object_name = "ba"` nor `object_name = "aab"`. |
| `continuation_token` | The token identifying the next page to retrieve | Returned in the `ContinuationToken` field from a call to ListObjects that does not retrieve all keys. When the last key is retrieved, `ContinuationToken` will be the empty string. |
| `time_format` | The standard by which times should be formatted | Any of the following [golang time constants](http://golang.org/pkg/time/#pkg-constants): RFC822, Stamp, StampMilli, RFC822Z, RFC1123, RFC1123Z, RFC3339. The default is RFC822. |
| `flags` | Advanced filter options | A bit field of [ListObjsMsg extended flags](/cmn/api.go). |

ListObjsMsg extended flags:

| Name | Value | Description |
| --- | --- | --- |
| `SelectCached` | `1` | For remote buckets only: return only objects that are cached on AIS drives, i.e. objects that can be read without accessing the Cloud |
| `SelectMisplaced` | `2` | Include objects that are on an incorrect target or mountpath |
| `SelectDeleted` | `4` | Include objects marked as deleted |
| `SelectArchDir` | `8` | If an object is an archive, include its contents in the object list |
| `SelectOnlyNames` | `16` | Do not retrieve object attributes, for faster bucket listing. In this mode, all fields of the response, except object names and statuses, are empty |

We say that "an object is cached" to indicate two separate things:

* The object was originally downloaded from a remote bucket, a bucket in a remote AIS cluster, or an HTTP(s) based dataset;
* The object is stored in the AIS cluster.

In other words, the term "cached" is simply a **shortcut** to indicate the object's immediate availability without the need to go and check the object's original location. Being "cached" does not have any implications for the object's persistence: "cached" objects, similar to those objects that originated in a given AIS cluster, are stored with arbitrary (per bucket configurable) levels of redundancy, etc. In short, the same storage policies apply to "cached" and "non-cached".

Note that the list generated with the `SelectMisplaced` option may have duplicated entries.
E.g., after a rebalance the list can contain two entries for the same object:
a misplaced one (from the original location) and the real one (from the new location).

<a name="ft1">1</a>) The objects that exist in the Cloud but are not present in the AIStore cache will have their atime property empty (`""`). The atime (access time) property is supported for the objects that are present in the AIStore cache. [↩](#a1)

## Results

The result may contain all bucket objects (if the bucket is small) or only the current page. The struct includes the following fields:

| Field | JSON Value | Description |
| --- | --- | --- |
| UUID | `uuid` | Unique ID of the listing operation. Pass it to all subsequent list requests to read the next page of objects. If UUID is empty, the server starts listing objects from the first page |
| Entries | `entries` | A page of objects and their properties |
| ContinuationToken | `continuation_token` | The token to request the next page of objects. Empty value means that it is the last page |
| Flags | `flags` | Extra information - a bit-mask field. `0x0001` bit indicates that a rebalance was running at the time the list was generated |
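
The paging contract described in the Options and Results tables can be sketched as a small client loop. The Python below is a minimal illustration, not the AIS API: `fetch_page` is a hypothetical callable standing in for whatever transport sends the JSON message, and `make_fake_server` is an in-memory stub invented only so the sketch runs without a cluster. The JSON field names and flag values are taken from the tables above.

```python
# Request flags, copied from the "ListObjsMsg extended flags" table.
SELECT_CACHED = 1
SELECT_MISPLACED = 2
SELECT_DELETED = 4
SELECT_ARCH_DIR = 8
SELECT_ONLY_NAMES = 16

# Result flag: bit 0x0001 means a rebalance was running (see the Results table).
REBALANCE_RUNNING = 0x0001


def list_all(fetch_page, props="name,size", flags=0):
    """Collect every page by echoing back `uuid` and `continuation_token`.

    `fetch_page` is a hypothetical callable: request dict in, result dict out.
    Returns (entries, unstable), where `unstable` is True if any page
    reported that a rebalance was in progress.
    """
    msg = {"props": props, "flags": flags, "uuid": "", "continuation_token": ""}
    entries, unstable = [], False
    while True:
        page = fetch_page(dict(msg))
        entries.extend(page["entries"])
        unstable = unstable or bool(page.get("flags", 0) & REBALANCE_RUNNING)
        token = page.get("continuation_token", "")
        if not token:  # an empty token marks the last page
            return entries, unstable
        msg["uuid"] = page["uuid"]
        msg["continuation_token"] = token


def make_fake_server(names, pagesize):
    """Illustrative in-memory stand-in for a cluster (not real AIS behavior)."""
    def fetch_page(msg):
        start = int(msg["continuation_token"] or 0)
        nxt = start + pagesize
        return {
            "uuid": msg["uuid"] or "demo-uuid",
            "entries": [{"name": n} for n in names[start:nxt]],
            "continuation_token": str(nxt) if nxt < len(names) else "",
            "flags": 0,
        }
    return fetch_page


fetch = make_fake_server(["a", "b", "c", "d", "e"], pagesize=2)
entries, unstable = list_all(fetch, flags=SELECT_CACHED | SELECT_ONLY_NAMES)
print([e["name"] for e in entries], unstable)  # ['a', 'b', 'c', 'd', 'e'] False
```

If `unstable` comes back true, the guidance above applies: re-request the list after the rebalance ends, or list again with `SelectMisplaced` set and de-duplicate the entries.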