github.com/treeverse/lakefs@v1.24.1-0.20240520134607-95648127bfb0/docs/howto/export.md

     1  ---
     2  title: Export Data
     3  description: Use the lakeFS Spark client or RClone inside Docker to export a lakeFS commit to the object store.
     4  parent: How-To
     5  redirect_from: 
     6    - /reference/export.html
     7  ---
     8  
     9  # Exporting Data
    10  The export operation copies all data from a given lakeFS commit to
    11  a designated object store location.
    12  
For instance, the contents of `lakefs://example/main` might be exported to
`s3://company-bucket/example/latest`. Clients entirely unaware of lakeFS could use that
base URL to access the latest files on `main`. Clients aware of lakeFS can continue to use
the lakeFS S3 endpoint to access repository files at `s3://example/main`, as well as
other committed and uncommitted versions.
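
For example, once an export completes, a client with no lakeFS awareness could read the copy straight from S3. A minimal illustration, assuming the AWS CLI is configured with credentials for `company-bucket`:

```shell
# List the latest exported files without going through lakeFS
aws s3 ls s3://company-bucket/example/latest/
```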
    18  
    19  Possible use-cases:
    20  1. External consumers of data don't have access to your lakeFS installation.
    21  1. Some data pipelines in the organization are not fully migrated to lakeFS.
    22  1. You want to experiment with lakeFS as a side-by-side installation first.
1. You want to create copies of your data lake in other regions (taking read pricing into account).
    24  
    25  {% include toc.html %}
    26  
    27  ## Exporting Data With Spark 
    28  
    29  ### Using spark-submit
You can run the exporter's main class in three different modes:
    31  
1. Export all the objects from branch `example-branch` in the `example-repo` repository to the S3 location `s3://example-bucket/prefix/`:
    33  
    34     ```shell
    35     .... example-repo s3://example-bucket/prefix/ --branch=example-branch
    36     ```
    37  
    38  
1. Export all the objects from commit `c805e49bafb841a0875f49cd555b397340bbd9b8` in the `example-repo` repository to the S3 location `s3://example-bucket/prefix/`:
    40  
    41     ```shell
    42     .... example-repo s3://example-bucket/prefix/ --commit_id=c805e49bafb841a0875f49cd555b397340bbd9b8
    43     ```
    44  
1. Export only the diff between branch `example-branch` and commit `c805e49bafb841a0875f49cd555b397340bbd9b8`
   in the `example-repo` repository to the S3 location `s3://example-bucket/prefix/`:
    47  
    48     ```shell
    49     .... example-repo s3://example-bucket/prefix/ --branch=example-branch --prev_commit_id=c805e49bafb841a0875f49cd555b397340bbd9b8
    50     ```
    51  
    52  The complete `spark-submit` command would look as follows:
    53  
    54  ```shell
    55  spark-submit --conf spark.hadoop.lakefs.api.url=https://<LAKEFS_ENDPOINT>/api/v1 \
    56    --conf spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY_ID> \
    57    --conf spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_ACCESS_KEY> \
    58    --packages io.lakefs:lakefs-spark-client_2.12:0.11.0 \
    59    --class io.treeverse.clients.Main export-app example-repo s3://example-bucket/prefix \
    60    --branch=example-branch
    61  ```
    62  
The command assumes that the Spark cluster has permissions to write to `s3://example-bucket/prefix`.
If it does not, add the `spark.hadoop.fs.s3a.access.key` and `spark.hadoop.fs.s3a.secret.key` configurations with the proper credentials, as in the example below.
    65  {: .note}
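
For example, a sketch assuming static S3 credentials passed directly to the job (your cluster may instead rely on instance profiles or another S3A credentials provider); only the two `fs.s3a` flags differ from the command above:

```shell
spark-submit --conf spark.hadoop.lakefs.api.url=https://<LAKEFS_ENDPOINT>/api/v1 \
  --conf spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY_ID> \
  --conf spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_ACCESS_KEY> \
  --conf spark.hadoop.fs.s3a.access.key=<AWS_ACCESS_KEY_ID> \
  --conf spark.hadoop.fs.s3a.secret.key=<AWS_SECRET_ACCESS_KEY> \
  --packages io.lakefs:lakefs-spark-client_2.12:0.11.0 \
  --class io.treeverse.clients.Main export-app example-repo s3://example-bucket/prefix \
  --branch=example-branch
```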
    66  
    67  #### Networking
    68  
Spark export communicates with the lakeFS server. Very large repositories
may require increasing the read timeout. If you run into timeout errors
during communication from the Spark job to lakeFS, consider increasing these
timeouts (an example invocation follows the list):
    73  
    74  * Add `-c spark.hadoop.lakefs.api.read.timeout_seconds=TIMEOUT_IN_SECONDS`
    75    (default 10) to allow lakeFS more time to respond to requests.
    76  * Add `-c
    77    spark.hadoop.lakefs.api.connection.timeout_seconds=TIMEOUT_IN_SECONDS`
    78    (default 10) to wait longer for lakeFS to accept connections.
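
For example, a `spark-submit` invocation with both timeouts raised to 60 seconds (an illustrative value; pick one that suits your repository size) adds the two flags to the command shown earlier:

```shell
spark-submit --conf spark.hadoop.lakefs.api.url=https://<LAKEFS_ENDPOINT>/api/v1 \
  --conf spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY_ID> \
  --conf spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_ACCESS_KEY> \
  -c spark.hadoop.lakefs.api.read.timeout_seconds=60 \
  -c spark.hadoop.lakefs.api.connection.timeout_seconds=60 \
  --packages io.lakefs:lakefs-spark-client_2.12:0.11.0 \
  --class io.treeverse.clients.Main export-app example-repo s3://example-bucket/prefix \
  --branch=example-branch
```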
    79  
    80  ### Using custom code (Notebook/Spark)
    81  
Set up the lakeFS Spark metadata client with the endpoint and credentials, as described in the [Spark client documentation]({% link reference/spark-client.md %}).
    83  
    84  The client exposes the `Exporter` object with three export options:
    85  
    86  <ol><li>
Export *all* the objects at the HEAD of a given branch. This does not include
files that were uploaded to that branch but not committed.
    89  
    90  <div class="tabs">
    91    <ul>
    92    <li><a href="#export-head-scala">Scala</a></li>
    93    </ul>
    94    <div markdown="1" id="export-head-scala">
    95  ```scala
    96  exportAllFromBranch(branch: String)
    97  ```
    98    </div>
    99  </div>
   100  </li>
<li>Export all the objects from a commit:
   102  
   103  <div class="tabs">
   104    <ul>
   105    <li><a href="#export-commit-scala">Scala</a></li>
   106    </ul>
   107    <div markdown="1" id="export-commit-scala">
   108  ```scala
   109  exportAllFromCommit(commitID: String)
   110  ```
   111    </div>
   112  </div>
   113  </li>
   114  <li>Export just the diff between a commit and the HEAD of a branch.
   115  
   This is an ideal option for continuous exports of a branch, since it copies only the objects
   that have changed since the previous commit.
   118  
   119  <div class="tabs">
   120    <ul>
   121    <li><a href="#export-diffs-scala">Scala</a></li>
   122    </ul>
   123    <div markdown="1" id="export-diffs-scala">
   124  ```scala
   125  exportFrom(branch: String, prevCommitID: String)
   126  ```
   127    </div>
   128  </div>
   129  </li>
   130  </ol>
   131  
   132  ## Success/Failure Indications
   133  
When the Spark export operation ends, an additional status file is written to the root
of the object storage destination.
If all files were exported successfully, its path will be of the form `EXPORT_<commitID>_<ISO-8601-time-UTC>_SUCCESS`.
On failure, the path will be of the form `EXPORT_<commitID>_<ISO-8601-time-UTC>_FAILURE`, and the file will contain a log of the failed file operations.
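
For example, a minimal Scala sketch (illustrative only, not part of the client API) that checks for a success marker under the export root using the Hadoop filesystem API. It assumes a `SparkSession` named `spark` and object-store credentials already configured; use the scheme/connector (e.g. `s3a`) appropriate for your cluster:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import java.net.URI

val rootLocation = "s3://company-bucket/example/latest"
val fs = FileSystem.get(new URI(rootLocation), spark.sparkContext.hadoopConfiguration)

// Look for EXPORT_<commitID>_<time>_SUCCESS markers written at the export root
val succeeded = Option(fs.globStatus(new Path(rootLocation, "EXPORT_*_SUCCESS")))
  .exists(_.nonEmpty)
println(if (succeeded) "a successful export marker was found" else "no success marker found")
```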
   138  
   139  ## Export Rounds (Spark success files)
Some files should be exported before others; e.g., a Spark `_SUCCESS` file exported before the other files under
the same prefix might falsely signal that the data there is complete.

To handle this, a single export may consist of several *rounds*.
A failing round stops the export of all files in the subsequent rounds.

By default, lakeFS uses the `SparkFilter`, which splits each export into 2 rounds:
the first round exports all files except Spark `_SUCCESS` files, and the second round exports the `_SUCCESS` files.
You may override this behavior by passing a custom `filter` to the Exporter, as in the sketch below.
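
A hypothetical sketch of such a filter (the trait and constructor names below are assumptions for illustration; check the Spark client's source for the exact filter interface and how the `Exporter` accepts it):

```scala
// Hypothetical: assumes a filter interface that assigns each object key to a round,
// where lower-numbered rounds are exported first.
class ManifestsLastFilter extends KeyFilter {
  override def roundForKey(key: String): Int =
    if (key.endsWith("_SUCCESS") || key.endsWith(".manifest")) 2 else 1
}

// Hypothetical: assumes the Exporter can be constructed with a custom filter.
val exporter = new Exporter(spark, apiClient, repo, rootLocation, new ManifestsLastFilter)
```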
   149  
   150  ## Example
   151  
   152  <ol><li>First configure the `Exporter` instance:
   153  
   154  <div class="tabs">
   155    <ul>
   156      <li><a href="#export-custom-setup-scala">Scala</a></li>
   157    </ul>
   158    <div markdown="1" id="export-custom-setup-scala">
   159  ```scala
   160  import io.treeverse.clients.{ApiClient, Exporter}
   161  import org.apache.spark.sql.SparkSession
   162  
   163  val endpoint = "http://<LAKEFS_ENDPOINT>/api/v1"
   164  val accessKey = "<LAKEFS_ACCESS_KEY_ID>"
   165  val secretKey = "<LAKEFS_SECRET_ACCESS_KEY>"
   166  
   167  val repo = "example-repo"
   168  
   169  val spark = SparkSession.builder().appName("I can export").master("local").getOrCreate()
   170  val sc = spark.sparkContext
   171  sc.hadoopConfiguration.set("lakefs.api.url", endpoint)
   172  sc.hadoopConfiguration.set("lakefs.api.access_key", accessKey)
   173  sc.hadoopConfiguration.set("lakefs.api.secret_key", secretKey)
   174  
   175  // Add any required spark context configuration for s3
   176  val rootLocation = "s3://company-bucket/example/latest"
   177  
   178  val apiClient = new ApiClient(endpoint, accessKey, secretKey)
   179  val exporter = new Exporter(spark, apiClient, repo, rootLocation)
   180  ```
   181    </div>
   182  </div></li>
<li>Now you can export all objects from the `main` branch to `s3://company-bucket/example/latest`:
   184  
   185  <div class="tabs">
   186    <ul>
   187    <li><a href="#export-custom-branch-scala">Scala</a></li>
   188    </ul>
   189    <div markdown="1" id="export-custom-branch-scala">
   190  ```scala
   191  val branch = "main"
   192  exporter.exportAllFromBranch(branch)
   193  ```
   194    </div>
   195  </div></li>
<li>Assuming a previous successful export on commit `f3c450d8cd0e84ac67e7bc1c5dcde9bef82d8ba7`,
you can alternatively export just the difference between the `main` branch and that commit:
   198  
   199  <div class="tabs">
   200    <ul>
   201      <li><a href="#export-custom-diffs-scala">Scala</a></li>
   202    </ul>
   203    <div markdown="1" id="export-custom-diffs-scala">
   204  ```scala
   205  val branch = "main"
   206  val commit = "f3c450d8cd0e84ac67e7bc1c5dcde9bef82d8ba7"
   207  exporter.exportFrom(branch, commit)
   208  ```
   209    </div>
   210  </div></li></ol>
   211  
   212  ## Exporting Data with Docker
   213  
This option is recommended if Spark is not part of your toolset.
It does not distribute the work across machines, so it may be slower.
Using this method, you can export data from lakeFS to S3 with the same export options as the Spark export:
   217  
1. Export all objects from branch `example-branch` in the `example-repo` repository to the S3 location `s3://destination-bucket/prefix/`:
   219  
   220     ```shell
   221     .... example-repo s3://destination-bucket/prefix/ --branch="example-branch"
   222     ```
   223  
   224  
1. Export all objects from commit `c805e49bafb841a0875f49cd555b397340bbd9b8` in the `example-repo` repository to the S3 location `s3://destination-bucket/prefix/`:
   226  
   227     ```shell
   228     .... example-repo s3://destination-bucket/prefix/ --commit_id=c805e49bafb841a0875f49cd555b397340bbd9b8
   229     ```
   230  
   231  
1. Export only the diff between branch `example-branch` and commit `c805e49bafb841a0875f49cd555b397340bbd9b8`
   in the `example-repo` repository to the S3 location `s3://destination-bucket/prefix/`:
   234  
   235     ```shell
   236     .... example-repo s3://destination-bucket/prefix/ --branch="example-branch" --prev_commit_id=c805e49bafb841a0875f49cd555b397340bbd9b8
   237     ```
   238  
   239  You will need to add the relevant environment variables.
   240  The complete `docker run` command would look like:
   241  
```shell
docker run \
    -e LAKEFS_ACCESS_KEY_ID=XXX -e LAKEFS_SECRET_ACCESS_KEY=YYY \
    -e LAKEFS_ENDPOINT=https://<LAKEFS_ENDPOINT>/ \
    -e AWS_ACCESS_KEY_ID=XXX -e AWS_SECRET_ACCESS_KEY=YYY \
    treeverse/lakefs-rclone-export:latest \
        example-repo \
        s3://destination-bucket/prefix/ \
        --branch="example-branch"
```
   252  
**Note:** This feature uses [rclone](https://rclone.org/){: target="_blank"},
and specifically [rclone sync](https://rclone.org/commands/rclone_sync/){: target="_blank"}, which makes the destination identical to the exported source and may delete other objects found there. Therefore, the S3 destination location must be dedicated to the lakeFS export.
   255  {: .note}