# Export

## Requirements
The functionality currently provided by the [lakeFS Export utility](https://docs.lakefs.io/reference/export.html), but without a dependency on Apache Spark.
This includes:
1. Export all data from a given lakeFS reference (commit or branch) to a designated object store location.
1. Export only the diff between the HEAD of a branch and a commit reference (useful for continuous exports).
1. Success/failure indications on the exported objects.


## Non-Requirements
1. Support for distributing the export across machines.
1. Built-in support inside lakeFS.


## Possible Solutions external to lakeFS

### Spark client stand-alone

Create a stand-alone Docker container that runs the Spark client.

Users who do not already work with Docker will need to install it, learn it, and add it to their toolset.

Usage example:

```shell
docker run lakefs-export --conf spark.hadoop.lakefs.api.url=https://<LAKEFS_ENDPOINT>/api/v1 \
  --conf spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY_ID> \
  --conf spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_ACCESS_KEY> \
  --packages io.lakefs:lakefs-spark-client_2.12:0.1.0 \
  --class io.treeverse.clients.Main export-app example-repo s3://example-bucket/prefix \
  --branch=example-branch
```

Pros:
- Utilizes the existing Spark client and provides the same functionality.

Cons:
- Requires creating and maintaining the stand-alone container.
- Still requires the user to understand Spark indirectly, for example, to read the logs and errors that a stand-alone Spark run produces.

### Using the Java client
Reimplement the Spark client's export functionality by creating a new Exporter class that doesn't use Spark.
This class can be part of the Java client, or a new "export" client that uses the Java client.

Usage will be through Java code (similar to the way custom code is used with the Spark client).

Users who do not already work with Java will need to install it, learn it, and add it to their toolset.

Usage example:

```java
public class Export {
    public static void main(String[] args) {
        // Proposed API: an Exporter that uses the lakeFS Java client instead of Spark
        Exporter exporter = new Exporter(apiClient, repoName, rootLocation);
        // Export everything reachable from the branch head
        exporter.exportAllFromBranch(branchName);
        // Export only the diff between the branch head and a commit reference
        exporter.exportFrom(branchName, commitRef);
    }
}
```

Pros:
- An intermediate solution that directly addresses the issue.

Cons:
- Requires understanding how Spark is used in the current implementation in order to rewrite its functions; this is not necessarily possible in a naive way.
- Requires developing and maintaining a second export implementation.

### Using Rclone
[Rclone](https://rclone.org/) is a command-line program to sync files and directories between cloud providers.
Users can get the export functionality from Rclone's [copy](https://rclone.org/commands/rclone_copy/), [sync](https://rclone.org/commands/rclone_sync/), and [check](https://rclone.org/commands/rclone_check/) commands.

The copy command copies files from lakeFS to a designated object store location.
The sync command can export only the diff between a specific branch and a commit reference, since sync makes the source and destination identical (modifying the destination only).
The check command can serve as the success/failure indication on the exported objects, since it checks whether the files in the source and destination match, and logs a report of the files that don't.
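For the commands below to work, both lakeFS and the target bucket must be configured as Rclone remotes. One possible setup, assuming lakeFS is accessed through its S3-compatible gateway and that the remotes are named `lakefs` and `s3` (the names, keys, and endpoints here are illustrative):

```shell
# Remote for lakeFS, speaking the S3 protocol to the lakeFS S3 gateway
rclone config create lakefs s3 \
  provider Other \
  access_key_id <LAKEFS_ACCESS_KEY_ID> \
  secret_access_key <LAKEFS_SECRET_ACCESS_KEY> \
  endpoint https://<LAKEFS_ENDPOINT> \
  no_check_bucket true

# Remote for the destination bucket on S3
rclone config create s3 s3 \
  provider AWS \
  access_key_id <S3_ACCESS_KEY_ID> \
  secret_access_key <S3_SECRET_ACCESS_KEY>
```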
Usage example:

```shell
rclone copy lakefs:example-repo/main/ s3:example-bucket/prefix

rclone sync lakefs:example-repo/main/ s3:example-bucket/prefix

rclone check lakefs:example-repo/main/ s3:example-bucket/prefix
```

Pros:
- Rclone provides the functionality of exporting data from lakeFS to a designated object store location.
- Simple to use.
- Doesn't require maintaining a new feature.

Cons:
- Requires a designated object store location that contains no data other than the data associated with the lakeFS branch, because Rclone's sync command deletes every destination file that doesn't exist in the branch.


## Chosen solution

Wrap Rclone with a Python script that matches the behavior of the new export functionality to that of the Spark client.
The script will run in a Docker container that includes all the necessary dependencies.
The relevant Docker image will be published to Docker Hub.
This solution utilizes Rclone and its features, is easy to implement, and achieves the required behavior.

For example:
```shell
docker run \
  -e LAKEFS_ACCESS_KEY=XXX -e LAKEFS_SECRET_KEY=YYY \
  -e LAKEFS_ENDPOINT=https://<LAKEFS_ENDPOINT>/ \
  -e S3_ACCESS_KEY=XXX -e S3_SECRET_KEY=YYY \
  -it lakefs-export repo-name s3://some-bucket/some_path/ --branch="branch-name"
```

The script will call:
```shell
rclone sync lakefs:example-repo/main/ s3:example-bucket/prefix
rclone check lakefs:example-repo/main/ s3:example-bucket/prefix
```
and will then add a success/failure file according to the result of the check command.
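The core flow the Python script needs to implement can be sketched in shell as follows; the remote names, paths, and the `EXPORT_SUCCESS`/`EXPORT_FAILURE` marker-file convention are illustrative, not a final interface:

```shell
#!/bin/sh
# Sketch of the wrapper's logic: sync, verify, then record the result
# next to the exported data. All names here are placeholders.
SRC="lakefs:example-repo/main/"
DST="s3:example-bucket/prefix"

# Make the destination identical to the branch head
rclone sync "$SRC" "$DST"

# Compare source and destination; check exits non-zero on any mismatch
if rclone check "$SRC" "$DST"; then
    MARKER=EXPORT_SUCCESS
else
    MARKER=EXPORT_FAILURE
fi

# Upload an empty marker object alongside the exported files
touch "$MARKER"
rclone copy "$MARKER" "$DST"
```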