---
layout: post
title: "Copying existing file datasets in two easy steps"
date: Dec 7, 2021
author: Alex Aizman
categories: aistore migration replication
---

AIStore supports [numerous ways](https://github.com/NVIDIA/aistore/blob/main/docs/overview.md#existing-datasets) to copy, download, or otherwise transfer existing datasets. Much depends on *where* the dataset in question resides, and whether we can access this location using some sort of HTTP-based interface. I'll put more references below. But in this post, let's talk about datasets that already reside *on premises*.

> The term *on premises* here includes a super-wide spectrum of use cases ranging from commercial high-end (and, possibly, distributed) filers to your own Linux or Mac (where you may, or may not, want to run AIStore itself, etc.).

Ultimately, the only precondition is that there is a *directory* you can access that contains files to migrate or copy. It turns out that **everything else** can be done in two easy steps:

1. Run a local HTTP server.
2. [Prefetch](https://github.com/NVIDIA/aistore/blob/main/docs/cli/object.md#operations-on-lists-and-ranges) or [download](https://github.com/NVIDIA/aistore/blob/main/docs/downloader.md) the files.

Implementation-wise, both Step 1 and Step 2 have multiple variations, and we'll consider at least some of them below. But first, let's take a look at an example:

```bash
# Let's assume the files we want to copy are located under /tmp/abc:
$ cd /tmp
$ ls abc
hello-world
shard-001.tar
...
shard-999.tar

# Step 1. run local http server
# =============================
$ python3 -m http.server --bind 0.0.0.0 51061

# use AIS CLI to make sure the files are readable
$ ais get http://localhost:51061/abc/hello-world

# Step 2. get all files in the range 'shard-{001..999}.tar'
# =========================================================

# keep using AIS CLI to list HTTP buckets
# (and note that AIS will create one on the fly after the very first successful `GET`)
$ ais ls ht://
ht://ZDE1YzE0NzhiNWFkMQ

# run batch `prefetch` job to load bash-expansion templated names from this bucket
$ ais start prefetch ht://ZDE1YzE0NzhiNWFkMQ --template 'shard-{001..999}.tar'
```

Here we run Python's own `http.server` to listen on port `51061` and serve the files from the directory that we have previously `cd`-ed into (`/tmp`, in the example).

Of course, the port, the directory, and the filenames above are all arbitrarily chosen for purely **illustrative purposes**. The main point the example is trying to make is that HTTP connectivity of any kind immediately opens up a number of easy ways to migrate or replicate any data that exists in files.

As for the aforementioned *implementation variations*, they include running, for instance, a Go-based HTTP server instead of Python's (`htserver.go`):

```go
package main

import (
	"log"
	"net/http"
)

func main() {
	// Serve files from /tmp on port 52062; ListenAndServe returns only on error.
	log.Fatal(http.ListenAndServe(":52062", http.FileServer(http.Dir("/tmp"))))
}
```

and then using the AIS [downloader](https://github.com/NVIDIA/aistore/blob/main/docs/downloader.md) extension instead of the multi-object `prefetch` that we have used above:

```bash
# Step 1. run local http server
# =============================
$ go run htserver.go

# Step 2. download 10 files named shard-{001..010}.tar
# ====================================================

# `hostname` below indicates the hostname or IP address of the machine where
# we are running `go run htserver.go`;
# also note that the destination bucket `ais://abc` will be created if it doesn't already exist
$ ais start download "http://hostname:52062/abc/shard-{001..010}.tar" ais://abc
GUsQcjEPY
Run `ais show job download GUsQcjEPY --progress` to monitor the progress.

# list objects in the bucket we have just created:
$ ais ls ais://abc
NAME             SIZE
shard-001.tar    151.13KiB
shard-002.tar    147.98KiB
shard-003.tar    101.45KiB
shard-004.tar    150.37KiB
shard-005.tar    146.00KiB
shard-006.tar    130.70KiB
shard-007.tar    129.04KiB
shard-008.tar    157.53KiB
shard-009.tar    161.32KiB
shard-010.tar    124.11KiB
```

Other than a different, albeit still arbitrary, listening port and a user-selected destination bucket, minor differences include explicitly naming the directory from which we want to serve files, and an attempt to indicate that if *it* works for `localhost`, it'll work for any valid `hostname` or IP address, for as long as the latter is reachable over HTTP.

## References

* [Using AIS Downloader](https://github.com/NVIDIA/aistore/blob/main/docs/cli/download.md)
* [Multi-object operations](https://github.com/NVIDIA/aistore/blob/main/docs/cli/object.md#operations-on-lists-and-ranges)
* [Promote files and directories](https://github.com/NVIDIA/aistore/blob/main/docs/cli/object.md#promote-files-and-directories)
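
As a postscript to the examples above: before starting a large prefetch or download job, it may be worth confirming that every file in the range is actually being served. The following is only a minimal sketch under stated assumptions, not part of AIS or its CLI. The helpers `shardURLs` and `missing` are hypothetical names introduced here for illustration, and the demo serves a throwaway temporary directory through Go's `httptest` package rather than a real dataset.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"os"
	"path/filepath"
)

// shardURLs (hypothetical helper) lists the URLs that a range such as
// 'shard-{001..010}.tar' stands for: zero-padded numbers from first to last.
func shardURLs(base string, first, last, width int) []string {
	if last < first {
		return nil
	}
	urls := make([]string, 0, last-first+1)
	for i := first; i <= last; i++ {
		urls = append(urls, fmt.Sprintf("%s/shard-%0*d.tar", base, width, i))
	}
	return urls
}

// missing (hypothetical helper) returns the URLs that fail a HEAD request
// or answer with anything other than 200 OK.
func missing(urls []string) []string {
	var bad []string
	for _, u := range urls {
		resp, err := http.Head(u)
		if err != nil {
			bad = append(bad, u)
			continue
		}
		resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			bad = append(bad, u)
		}
	}
	return bad
}

func main() {
	// Self-contained demo: create a temporary directory with two shards.
	dir, err := os.MkdirTemp("", "abc")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	for _, name := range []string{"shard-001.tar", "shard-002.tar"} {
		if err := os.WriteFile(filepath.Join(dir, name), []byte("demo"), 0o644); err != nil {
			panic(err)
		}
	}

	// Serve that directory on an ephemeral local port.
	srv := httptest.NewServer(http.FileServer(http.Dir(dir)))
	defer srv.Close()

	// Check three shards; only the third one does not exist.
	for _, u := range missing(shardURLs(srv.URL, 1, 3, 3)) {
		fmt.Println("not reachable:", u)
	}
}
```

Against a real server, one would presumably pass the same base URL used in the download command (for instance, `http://hostname:52062/abc`) instead of the test server's address.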