---
title: Copying data to/from lakeFS
description:
parent: How-To
redirect_from:
  - /integrations/distcp.html
  - /integrations/rclone.html
---

# Copying data to/from lakeFS

{% include toc.html %}

## Using DistCp

Apache Hadoop [DistCp](https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html){:target="_blank"} (distributed copy) is a tool for large inter/intra-cluster copying. You can easily use it with your lakeFS repositories.

**Note**

In the following examples, we set AWS credentials on the command line for clarity. In production, you should set these properties using one of Hadoop's standard ways of [Authenticating with S3](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Authenticating_with_S3){:target="_blank"}.
{: .note}

### Between lakeFS repositories

You can use DistCp to copy data between two different lakeFS repositories. Replace the access key pair with your lakeFS access key pair:

```bash
hadoop distcp \
  -Dfs.s3a.path.style.access=true \
  -Dfs.s3a.access.key="AKIAIOSFODNN7EXAMPLE" \
  -Dfs.s3a.secret.key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
  -Dfs.s3a.endpoint="https://lakefs.example.com" \
  "s3a://example-repo-1/main/example-file.parquet" \
  "s3a://example-repo-2/main/example-file.parquet"
```

### Between S3 buckets and lakeFS

To copy data between an S3 bucket and a lakeFS repository, use Hadoop's [per-bucket configuration](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Configuring_different_S3_buckets_with_Per-Bucket_Configuration){:target="_blank"}.
In the following examples, replace the first access key pair with your lakeFS key pair, and the second one with your AWS IAM key pair:

#### From S3 to lakeFS

```bash
hadoop distcp \
  -Dfs.s3a.path.style.access=true \
  -Dfs.s3a.bucket.example-repo.access.key="AKIAIOSFODNN7EXAMPLE" \
  -Dfs.s3a.bucket.example-repo.secret.key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
  -Dfs.s3a.bucket.example-repo.endpoint="https://lakefs.example.com" \
  -Dfs.s3a.bucket.example-bucket.access.key="AKIAIOSFODNN3EXAMPLE" \
  -Dfs.s3a.bucket.example-bucket.secret.key="wJalrXUtnFEMI/K3MDENG/bPxRfiCYEXAMPLEKEY" \
  "s3a://example-bucket/example-file.parquet" \
  "s3a://example-repo/main/example-file.parquet"
```

#### From lakeFS to S3

```bash
hadoop distcp \
  -Dfs.s3a.path.style.access=true \
  -Dfs.s3a.bucket.example-repo.access.key="AKIAIOSFODNN7EXAMPLE" \
  -Dfs.s3a.bucket.example-repo.secret.key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
  -Dfs.s3a.bucket.example-repo.endpoint="https://lakefs.example.com" \
  -Dfs.s3a.bucket.example-bucket.access.key="AKIAIOSFODNN3EXAMPLE" \
  -Dfs.s3a.bucket.example-bucket.secret.key="wJalrXUtnFEMI/K3MDENG/bPxRfiCYEXAMPLEKEY" \
  "s3a://example-repo/main/myfile" \
  "s3a://example-bucket/myfile"
```
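DistCp also copies whole directory trees, which is handy for copying an entire prefix from one branch to another within the same repository. The following is a minimal sketch, assuming a hypothetical `dev` branch and `collections/` prefix in `example-repo`; it uses DistCp's standard `-update` flag so only new or changed files are transferred:

```bash
# Copy an entire prefix from main to dev; -update transfers only
# files that are missing or differ at the destination.
hadoop distcp \
  -update \
  -Dfs.s3a.path.style.access=true \
  -Dfs.s3a.access.key="AKIAIOSFODNN7EXAMPLE" \
  -Dfs.s3a.secret.key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
  -Dfs.s3a.endpoint="https://lakefs.example.com" \
  "s3a://example-repo/main/collections/" \
  "s3a://example-repo/dev/collections/"
```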
## Using Rclone

[Rclone](https://rclone.org/){:target="_blank"} is a command-line program to sync files and directories between cloud providers.
To use it with lakeFS, create an Rclone remote as described below and then use it as you would any other Rclone remote.

### Creating a remote for lakeFS in Rclone

To add the remote to Rclone, choose one of the following options:

#### Option 1: Add an entry in your Rclone configuration file

* Find the path to your Rclone configuration file and copy it for the next step.

  ```shell
  rclone config file
  # output:
  # Configuration file is stored at:
  # /home/myuser/.config/rclone/rclone.conf
  ```

* If your lakeFS access key is already set in an AWS profile or environment variables, run the following command, replacing the endpoint property with your lakeFS endpoint:

  ```shell
  cat <<EOT >> /home/myuser/.config/rclone/rclone.conf
  [lakefs]
  type = s3
  provider = Other
  endpoint = https://lakefs.example.com
  no_check_bucket = true
  EOT
  ```

* Otherwise, also include your lakeFS access key pair in the Rclone configuration file:

  ```shell
  cat <<EOT >> /home/myuser/.config/rclone/rclone.conf
  [lakefs]
  type = s3
  provider = Other
  env_auth = false
  access_key_id = AKIAIOSFODNN7EXAMPLE
  secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
  endpoint = https://lakefs.example.com
  no_check_bucket = true
  EOT
  ```

#### Option 2: Use the Rclone interactive config command

Run this command and follow the instructions:

```shell
rclone config
```

Choose AWS S3 as your type of storage, and enter your lakeFS endpoint as your S3 endpoint.
You will have to choose whether you use your environment for authentication (recommended),
or enter the lakeFS access key pair into the Rclone configuration. Select "Edit advanced
config" and accept defaults for all values except `no_check_bucket`:

```
If set, don't attempt to check the bucket exists or create it.

This can be useful when trying to minimize the number of transactions
Rclone carries out, if you know the bucket exists already.

This might also be needed if the user you're using doesn't have bucket
creation permissions. Before v1.52.0, this would have passed silently
due to a bug.

Enter a boolean value (true or false). Press Enter for the default ("false").
no_check_bucket> yes
```

### Syncing S3 and lakeFS

```shell
rclone sync mys3remote:mybucket/path/ lakefs:example-repo/main/path
```

### Syncing a local directory and lakeFS

```shell
rclone sync /home/myuser/path/ lakefs:example-repo/main/path
```
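After a sync, you can sanity-check the destination by listing it through the same remote. A minimal sketch, reusing the `[lakefs]` remote and the placeholder paths from the examples above:

```shell
# List the objects (and their sizes) now present under the destination path
rclone ls lakefs:example-repo/main/path
```

This works because lakeFS exposes branch contents through its S3 gateway, so ordinary listing commands such as `rclone ls` and `rclone lsd` behave as they would against any S3 bucket.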