
---
title: Copying data to/from lakeFS
description:
parent: How-To
redirect_from:
  - /integrations/distcp.html
  - /integrations/rclone.html
---

# Copying data to/from lakeFS

{% include toc.html %}

## Using DistCp

Apache Hadoop [DistCp](https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html){:target="_blank"} (distributed copy) is a tool used for large inter/intra-cluster copying. You can easily use it with your lakeFS repositories.

**Note**

In the following examples, we set AWS credentials on the command line for clarity. In production, you should set these properties using one of Hadoop's standard ways of [Authenticating with S3](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Authenticating_with_S3){:target="_blank"}.
{: .note}
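
For example, one of those standard mechanisms is the S3A default credential provider chain, which also reads the usual AWS environment variables. A minimal sketch with hypothetical values (the lakeFS endpoint still has to be supplied as a Hadoop property):

```bash
# Picked up by S3A's default credential provider chain, so the
# -Dfs.s3a.access.key / -Dfs.s3a.secret.key flags can be omitted.
export AWS_ACCESS_KEY_ID="AKIAIOSFODNN7EXAMPLE"
export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
```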

### Between lakeFS repositories

You can use DistCp to copy between two different lakeFS repositories. Replace the access key pair with your lakeFS access key pair:

```bash
hadoop distcp \
  -Dfs.s3a.path.style.access=true \
  -Dfs.s3a.access.key="AKIAIOSFODNN7EXAMPLE" \
  -Dfs.s3a.secret.key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
  -Dfs.s3a.endpoint="https://lakefs.example.com" \
  "s3a://example-repo-1/main/example-file.parquet" \
  "s3a://example-repo-2/main/example-file.parquet"
```
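
To verify the copy, you can list the destination branch with the same Hadoop properties. A sketch, assuming the credentials and endpoint above:

```bash
hadoop fs \
  -Dfs.s3a.path.style.access=true \
  -Dfs.s3a.access.key="AKIAIOSFODNN7EXAMPLE" \
  -Dfs.s3a.secret.key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
  -Dfs.s3a.endpoint="https://lakefs.example.com" \
  -ls "s3a://example-repo-2/main/"
```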

### Between S3 buckets and lakeFS

To copy data from an S3 bucket to a lakeFS repository, use Hadoop's [per-bucket configuration](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Configuring_different_S3_buckets_with_Per-Bucket_Configuration){:target="_blank"}.
In the following examples, replace the first access key pair with your lakeFS key pair, and the second one with your AWS IAM key pair:

#### From S3 to lakeFS

```bash
hadoop distcp \
  -Dfs.s3a.path.style.access=true \
  -Dfs.s3a.bucket.example-repo.access.key="AKIAIOSFODNN7EXAMPLE" \
  -Dfs.s3a.bucket.example-repo.secret.key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
  -Dfs.s3a.bucket.example-repo.endpoint="https://lakefs.example.com" \
  -Dfs.s3a.bucket.example-bucket.access.key="AKIAIOSFODNN3EXAMPLE" \
  -Dfs.s3a.bucket.example-bucket.secret.key="wJalrXUtnFEMI/K3MDENG/bPxRfiCYEXAMPLEKEY" \
  "s3a://example-bucket/example-file.parquet" \
  "s3a://example-repo/main/example-file.parquet"
```

#### From lakeFS to S3

```bash
hadoop distcp \
  -Dfs.s3a.path.style.access=true \
  -Dfs.s3a.bucket.example-repo.access.key="AKIAIOSFODNN7EXAMPLE" \
  -Dfs.s3a.bucket.example-repo.secret.key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
  -Dfs.s3a.bucket.example-repo.endpoint="https://lakefs.example.com" \
  -Dfs.s3a.bucket.example-bucket.access.key="AKIAIOSFODNN3EXAMPLE" \
  -Dfs.s3a.bucket.example-bucket.secret.key="wJalrXUtnFEMI/K3MDENG/bPxRfiCYEXAMPLEKEY" \
  "s3a://example-repo/main/myfile" \
  "s3a://example-bucket/myfile"
```
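
DistCp's usual flags work here too. For instance, `-update` copies only files that are missing from or differ at the destination, which suits whole prefixes. A sketch with a hypothetical `datasets/` prefix, reusing the per-bucket properties above:

```bash
hadoop distcp \
  -Dfs.s3a.path.style.access=true \
  -Dfs.s3a.bucket.example-repo.access.key="AKIAIOSFODNN7EXAMPLE" \
  -Dfs.s3a.bucket.example-repo.secret.key="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY" \
  -Dfs.s3a.bucket.example-repo.endpoint="https://lakefs.example.com" \
  -Dfs.s3a.bucket.example-bucket.access.key="AKIAIOSFODNN3EXAMPLE" \
  -Dfs.s3a.bucket.example-bucket.secret.key="wJalrXUtnFEMI/K3MDENG/bPxRfiCYEXAMPLEKEY" \
  -update \
  "s3a://example-bucket/datasets/" \
  "s3a://example-repo/main/datasets/"
```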

## Using Rclone

[Rclone](https://rclone.org/){:target="_blank"} is a command line program to sync files and directories between cloud providers.
To use it with lakeFS, create an Rclone remote as described below and then use it as you would any other Rclone remote.

### Creating a remote for lakeFS in Rclone

To add the remote to Rclone, choose one of the following options:

#### Option 1: Add an entry in your Rclone configuration file
*   Find the path to your Rclone configuration file and copy it for the next step.

    ```shell
    rclone config file
    # output:
    # Configuration file is stored at:
    # /home/myuser/.config/rclone/rclone.conf
    ```

*   If your lakeFS access key is already set in an AWS profile or environment variables, run the following command, replacing the endpoint property with your lakeFS endpoint (`env_auth = true` tells Rclone to use those credentials):

    ```shell
    cat <<EOT >> /home/myuser/.config/rclone/rclone.conf
    [lakefs]
    type = s3
    provider = Other
    env_auth = true
    endpoint = https://lakefs.example.com
    no_check_bucket = true
    EOT
    ```

*   Otherwise, also include your lakeFS access key pair in the Rclone configuration file:

    ```shell
    cat <<EOT >> /home/myuser/.config/rclone/rclone.conf
    [lakefs]
    type = s3
    provider = Other
    env_auth = false
    access_key_id = AKIAIOSFODNN7EXAMPLE
    secret_access_key = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
    endpoint = https://lakefs.example.com
    no_check_bucket = true
    EOT
    ```

#### Option 2: Use the Rclone interactive config command

Run this command and follow the instructions:
```shell
rclone config
```
Choose AWS S3 as your type of storage, and enter your lakeFS endpoint as your S3 endpoint.
You will have to choose whether you use your environment for authentication (recommended),
or enter the lakeFS access key pair into the Rclone configuration. Select "Edit advanced
config" and accept defaults for all values except `no_check_bucket`:
```
If set, don't attempt to check the bucket exists or create it.

This can be useful when trying to minimize the number of transactions
Rclone carries out, if you know the bucket exists already.

This might also be needed if the user you're using doesn't have bucket
creation permissions. Before v1.52.0, this would have passed silently
due to a bug.

Enter a boolean value (true or false). Press Enter for the default ("false").
no_check_bucket> true
```
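
Whichever option you used, you can sanity-check the new remote by listing your repositories through the lakeFS S3 gateway. A sketch, assuming the remote is named `lakefs` as above:

```shell
rclone lsd lakefs:
```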

### Syncing S3 and lakeFS

Here, `mys3remote` is an existing Rclone remote pointing at your S3 bucket:

```shell
rclone sync mys3remote:mybucket/path/ lakefs:example-repo/main/path
```
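
The direction is up to you. For instance, to preview an export from lakeFS back to S3 without copying anything, Rclone's standard `--dry-run` flag works as usual (a sketch with the same remotes):

```shell
rclone sync --dry-run lakefs:example-repo/main/path mys3remote:mybucket/path/
```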

### Syncing a local directory and lakeFS

```shell
rclone sync /home/myuser/path/ lakefs:example-repo/main/path
```
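
To confirm what landed on the branch, you can list the synced path (a sketch using the remote defined above):

```shell
rclone ls lakefs:example-repo/main/path
```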