---
title: Export Data
description: Use the lakeFS Spark client or RClone inside Docker to export a lakeFS commit to the object store.
parent: How-To
redirect_from:
  - /reference/export.html
---

# Exporting Data

The export operation copies all data from a given lakeFS commit to
a designated object store location.

For instance, the contents of `lakefs://example/main` might be exported to
`s3://company-bucket/example/latest`. Clients entirely unaware of lakeFS can then use that
base URL to access the latest files on `main`. Clients aware of lakeFS can continue to use
the lakeFS S3 endpoint to access repository files on `s3://example/main`, as well as
other committed versions and uncommitted data.

Possible use cases:
1. External consumers of data don't have access to your lakeFS installation.
1. Some data pipelines in the organization are not fully migrated to lakeFS.
1. You want to experiment with lakeFS as a side-by-side installation first.
1. You want to create copies of your data lake in other regions (taking read pricing into account).

{% include toc.html %}

## Exporting Data With Spark

### Using spark-submit

You can run the export application in three different modes:

1. Export all the objects from branch `example-branch` on the `example-repo` repository to the S3 location `s3://example-bucket/prefix/`:

   ```shell
   .... example-repo s3://example-bucket/prefix/ --branch=example-branch
   ```

1. Export all the objects from commit `c805e49bafb841a0875f49cd555b397340bbd9b8` on the `example-repo` repository to the S3 location `s3://example-bucket/prefix/`:

   ```shell
   .... example-repo s3://example-bucket/prefix/ --commit_id=c805e49bafb841a0875f49cd555b397340bbd9b8
   ```

1. Export only the diff between branch `example-branch` and commit `c805e49bafb841a0875f49cd555b397340bbd9b8`
   on the `example-repo` repository to the S3 location `s3://example-bucket/prefix/`:

   ```shell
   .... example-repo s3://example-bucket/prefix/ --branch=example-branch --prev_commit_id=c805e49bafb841a0875f49cd555b397340bbd9b8
   ```

The complete `spark-submit` command would look as follows:

```shell
spark-submit --conf spark.hadoop.lakefs.api.url=https://<LAKEFS_ENDPOINT>/api/v1 \
    --conf spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY_ID> \
    --conf spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_ACCESS_KEY> \
    --packages io.lakefs:lakefs-spark-client_2.12:0.11.0 \
    --class io.treeverse.clients.Main export-app example-repo s3://example-bucket/prefix \
    --branch=example-branch
```

The command assumes that the Spark cluster has permissions to write to `s3://example-bucket/prefix`.
Otherwise, add `spark.hadoop.fs.s3a.access.key` and `spark.hadoop.fs.s3a.secret.key` with the proper credentials.
{: .note}

#### Networking

Spark export communicates with the lakeFS server. Very large repositories
may require increasing a read timeout. If you run into timeout errors
during communication from the Spark job to lakeFS, consider increasing these
timeouts:

* Add `-c spark.hadoop.lakefs.api.read.timeout_seconds=TIMEOUT_IN_SECONDS`
  (default 10) to allow lakeFS more time to respond to requests.
* Add `-c spark.hadoop.lakefs.api.connection.timeout_seconds=TIMEOUT_IN_SECONDS`
  (default 10) to wait longer for lakeFS to accept connections.
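
For instance, the branch-export command above could be rerun with both timeouts raised. This is only a sketch: the 30-second value below is an illustration, not a recommended setting, and should be tuned to your environment:

```shell
spark-submit --conf spark.hadoop.lakefs.api.url=https://<LAKEFS_ENDPOINT>/api/v1 \
    --conf spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY_ID> \
    --conf spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_ACCESS_KEY> \
    --conf spark.hadoop.lakefs.api.read.timeout_seconds=30 \
    --conf spark.hadoop.lakefs.api.connection.timeout_seconds=30 \
    --packages io.lakefs:lakefs-spark-client_2.12:0.11.0 \
    --class io.treeverse.clients.Main export-app example-repo s3://example-bucket/prefix \
    --branch=example-branch
```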
### Using custom code (Notebook/Spark)

Set up the lakeFS Spark metadata client with the endpoint and credentials, as instructed on the [Spark client page]({% link reference/spark-client.md %}).

The client exposes the `Exporter` object with three export options:

<ol><li>
Export *all* the objects at the HEAD of a given branch. Does not include
files that were added to that branch but were not committed.

<div class="tabs">
<ul>
  <li><a href="#export-head-scala">Scala</a></li>
</ul>
<div markdown="1" id="export-head-scala">
```scala
exportAllFromBranch(branch: String)
```
</div>
</div>
</li>
<li>Export *all* objects from a commit:

<div class="tabs">
<ul>
  <li><a href="#export-commit-scala">Scala</a></li>
</ul>
<div markdown="1" id="export-commit-scala">
```scala
exportAllFromCommit(commitID: String)
```
</div>
</div>
</li>
<li>Export just the diff between a commit and the HEAD of a branch.

This is an ideal option for continuous exports of a branch, since it exports only the files
that have changed since the previous commit.

<div class="tabs">
<ul>
  <li><a href="#export-diffs-scala">Scala</a></li>
</ul>
<div markdown="1" id="export-diffs-scala">
```scala
exportFrom(branch: String, prevCommitID: String)
```
</div>
</div>
</li>
</ol>

## Success/Failure Indications

When the Spark export operation ends, an additional status file is written to the root
of the object storage destination.
If all files were exported successfully, the file path will be of the form `EXPORT_<commitID>_<ISO-8601-time-UTC>_SUCCESS`.
On failure, the form will be `EXPORT_<commitID>_<ISO-8601-time-UTC>_FAILURE`, and the file will include a log of the failed file operations.

## Export Rounds (Spark success files)

Some files should be exported before others; e.g., a Spark `_SUCCESS` file exported before other files under
the same prefix might send the wrong indication.

The export operation may contain several *rounds* within the same export.
A failing round stops the export of all files in subsequent rounds.

By default, lakeFS uses the `SparkFilter` and performs two rounds for each export.
The first round exports all non-Spark `_SUCCESS` files; the second round exports all Spark `_SUCCESS` files.
You may override the default behavior by passing a custom `filter` to the Exporter.
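
Because a failed round aborts the remaining rounds, consumers should check for the status marker described above before reading an export. A minimal sketch, assuming the destination used earlier on this page and that the AWS CLI is available (neither assumption is part of the exporter itself):

```shell
# Look for the EXPORT_<commitID>_<ISO-8601-time-UTC>_SUCCESS / _FAILURE marker
# at the root of the export destination before consuming the data.
aws s3 ls s3://example-bucket/prefix/ | grep 'EXPORT_'
```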
## Example

<ol><li>First configure the `Exporter` instance:

<div class="tabs">
<ul>
  <li><a href="#export-custom-setup-scala">Scala</a></li>
</ul>
<div markdown="1" id="export-custom-setup-scala">
```scala
import io.treeverse.clients.{ApiClient, Exporter}
import org.apache.spark.sql.SparkSession

val endpoint = "http://<LAKEFS_ENDPOINT>/api/v1"
val accessKey = "<LAKEFS_ACCESS_KEY_ID>"
val secretKey = "<LAKEFS_SECRET_ACCESS_KEY>"

val repo = "example-repo"

val spark = SparkSession.builder().appName("I can export").master("local").getOrCreate()
val sc = spark.sparkContext
sc.hadoopConfiguration.set("lakefs.api.url", endpoint)
sc.hadoopConfiguration.set("lakefs.api.access_key", accessKey)
sc.hadoopConfiguration.set("lakefs.api.secret_key", secretKey)

// Add any required Spark context configuration for S3
val rootLocation = "s3://company-bucket/example/latest"

val apiClient = new ApiClient(endpoint, accessKey, secretKey)
val exporter = new Exporter(spark, apiClient, repo, rootLocation)
```
</div>
</div></li>
<li>Now you can export all objects from the `main` branch to `s3://company-bucket/example/latest`:

<div class="tabs">
<ul>
  <li><a href="#export-custom-branch-scala">Scala</a></li>
</ul>
<div markdown="1" id="export-custom-branch-scala">
```scala
val branch = "main"
exporter.exportAllFromBranch(branch)
```
</div>
</div></li>
<li>Assuming a previous successful export on commit `f3c450d8cd0e84ac67e7bc1c5dcde9bef82d8ba7`,
you can alternatively export just the difference between the `main` branch and that commit:

<div class="tabs">
<ul>
  <li><a href="#export-custom-diffs-scala">Scala</a></li>
</ul>
<div markdown="1" id="export-custom-diffs-scala">
```scala
val branch = "main"
val commit = "f3c450d8cd0e84ac67e7bc1c5dcde9bef82d8ba7"
exporter.exportFrom(branch, commit)
```
</div>
</div></li></ol>

## Exporting Data with Docker

This option is recommended if Spark is not part of your toolset.
It doesn't support distribution across machines, so it may perform more slowly.
Using this method, you can export data from lakeFS to S3 with the same export options as the Spark export:

1. Export all objects from branch `example-branch` on the `example-repo` repository to the S3 location `s3://destination-bucket/prefix/`:

   ```shell
   .... example-repo s3://destination-bucket/prefix/ --branch="example-branch"
   ```

1. Export all objects from commit `c805e49bafb841a0875f49cd555b397340bbd9b8` on the `example-repo` repository to the S3 location `s3://destination-bucket/prefix/`:

   ```shell
   .... example-repo s3://destination-bucket/prefix/ --commit_id=c805e49bafb841a0875f49cd555b397340bbd9b8
   ```

1. Export only the diff between branch `example-branch` and commit `c805e49bafb841a0875f49cd555b397340bbd9b8`
   on the `example-repo` repository to the S3 location `s3://destination-bucket/prefix/`:

   ```shell
   .... example-repo s3://destination-bucket/prefix/ --branch="example-branch" --prev_commit_id=c805e49bafb841a0875f49cd555b397340bbd9b8
   ```

You will need to add the relevant environment variables.
The complete `docker run` command would look like:

```shell
docker run \
    -e LAKEFS_ACCESS_KEY_ID=XXX -e LAKEFS_SECRET_ACCESS_KEY=YYY \
    -e LAKEFS_ENDPOINT=https://<LAKEFS_ENDPOINT>/ \
    -e AWS_ACCESS_KEY_ID=XXX -e AWS_SECRET_ACCESS_KEY=YYY \
    treeverse/lakefs-rclone-export:latest \
    example-repo \
    s3://destination-bucket/prefix/ \
    --branch="example-branch"
```

**Note:** This feature uses [rclone](https://rclone.org/){: target="_blank"},
and specifically [rclone sync](https://rclone.org/commands/rclone_sync/){: target="_blank"}, which updates the destination to match the source and may delete objects already there. Therefore, the S3 destination location must be dedicated to the lakeFS export.
{: .note}
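
As with the Spark exporter, the other modes only change the trailing flags. For instance, assuming `--commit_id` is passed to the container exactly like `--branch` above, exporting a single commit would look like:

```shell
docker run \
    -e LAKEFS_ACCESS_KEY_ID=XXX -e LAKEFS_SECRET_ACCESS_KEY=YYY \
    -e LAKEFS_ENDPOINT=https://<LAKEFS_ENDPOINT>/ \
    -e AWS_ACCESS_KEY_ID=XXX -e AWS_SECRET_ACCESS_KEY=YYY \
    treeverse/lakefs-rclone-export:latest \
    example-repo \
    s3://destination-bucket/prefix/ \
    --commit_id=c805e49bafb841a0875f49cd555b397340bbd9b8
```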