github.com/NVIDIA/aistore@v1.3.23-0.20240517131212-7df6609be51d/docs/_posts/2021-12-07-cp-files-to-ais.md

---
layout: post
title:  "Copying existing file datasets in two easy steps"
date:   Dec 7, 2021
author: Alex Aizman
categories: aistore migration replication
---

AIStore supports [numerous ways](https://github.com/NVIDIA/aistore/blob/main/docs/overview.md#existing-datasets) to copy, download, or otherwise transfer existing datasets. Much depends on *where* the dataset in question is, and whether we can access this location using some sort of HTTP-based interface. I'll put more references below, but in this post let's talk about datasets that already reside *on premises*.

> The term *on premises* here covers a very wide spectrum of use cases, ranging from high-end commercial (and, possibly, distributed) filers to your own Linux or Mac machine (where you may or may not want to run AIStore itself).

Ultimately, the only precondition is that there is a *directory* you can access that contains files to migrate or copy. It turns out that **everything else** can be done in two easy steps:

1. Run a local HTTP server.
2. [Prefetch](https://github.com/NVIDIA/aistore/blob/main/docs/cli/object.md#operations-on-lists-and-ranges) or [download](https://github.com/NVIDIA/aistore/blob/main/docs/downloader.md) the files.

Implementation-wise, both Step 1 and Step 2 have multiple variations, and we'll consider at least some of them below. But first, let's take a look at an example:

```bash
# Let's assume the files we want to copy are located under /tmp/abc:
$ cd /tmp
$ ls abc
hello-world
shard-001.tar
...
shard-999.tar

# Step 1. run local http server
# =============================
$ python3 -m http.server --bind 0.0.0.0 51061

# from another terminal, use AIS CLI to make sure the files are readable
$ ais get http://localhost:51061/abc/hello-world

# Step 2. get all files in the range 'shard-{001..999}.tar'
# =========================================================

# keep using AIS CLI to list HTTP buckets
# (and note that AIS will create one on the fly after the very first successful `GET`)
$ ais ls ht://
ht://ZDE1YzE0NzhiNWFkMQ

# run a batch `prefetch` job to load the objects whose names match the bash-expansion template
$ ais start prefetch ht://ZDE1YzE0NzhiNWFkMQ --template 'shard-{001..999}.tar'
```

Here we run Python's own `http.server` to listen on port `51061` and serve the files from the directory that we have previously `cd`-ed into (`/tmp`, in the example).

Of course, the port, the directory, and the filenames above are all randomly chosen for purely **illustrative purposes**. The main point the example is trying to make is that HTTP connectivity of any kind immediately opens up a number of easy ways to migrate or replicate any data that exists in files.

As for the aforementioned *implementation variations*, they include, for instance, running a Go-based HTTP server (`htserver.go`) instead of Python's:

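```go
package main

import (
	"log"
	"net/http"
)

func main() {
	// Serve everything under /tmp on port 52062.
	// ListenAndServe returns only on failure, so log the error and exit.
	log.Fatal(http.ListenAndServe(":52062", http.FileServer(http.Dir("/tmp"))))
}
```
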
and then using the AIS [downloader](https://github.com/NVIDIA/aistore/blob/main/docs/downloader.md) extension instead of the multi-object `prefetch` that we used above:

```bash
# Step 1. run local http server
# =============================
$ go run htserver.go

# Step 2. download 10 files named shard-{001..010}.tar
# ====================================================

# `hostname` below indicates the hostname or IP address of the machine where
# we are running `go run htserver.go`;
# also note that the destination bucket `ais://abc` will be created if it doesn't already exist
$ ais start download "http://hostname:52062/abc/shard-{001..010}.tar" ais://abc
GUsQcjEPY
Run `ais show job download GUsQcjEPY --progress` to monitor the progress.

# list objects in the bucket we have just created:
$ ais ls ais://abc
NAME             SIZE
shard-001.tar    151.13KiB
shard-002.tar    147.98KiB
shard-003.tar    101.45KiB
shard-004.tar    150.37KiB
shard-005.tar    146.00KiB
shard-006.tar    130.70KiB
shard-007.tar    129.04KiB
shard-008.tar    157.53KiB
shard-009.tar    161.32KiB
shard-010.tar    124.11KiB
```

Other than a different (albeit still arbitrary) listening port and a user-selected destination bucket, the minor differences include explicitly naming the directory from which we want to serve files, and an attempt to indicate that if *it* works for `localhost`, it'll work for any valid `hostname` or IP address, as long as the latter is reachable over HTTP.

## References

* [Using AIS Downloader](https://github.com/NVIDIA/aistore/blob/main/docs/cli/download.md)
* [Multi-object operations](https://github.com/NVIDIA/aistore/blob/main/docs/cli/object.md#operations-on-lists-and-ranges)
* [Promote files and directories](https://github.com/NVIDIA/aistore/blob/main/docs/cli/object.md#promote-files-and-directories)