# Export

## Requirements
Provide the functionality currently offered by the [lakeFS Export utility](https://docs.lakefs.io/reference/export.html), but without a dependency on Apache Spark.
This includes:
1. Export all data from a given lakeFS reference (commit or branch) to a designated object store location.
1. Export only the diff between the HEAD of a branch and a commit reference (useful for continuous exports).
1. Success/failure indications on the exported objects.


## Non-Requirements
1. Support distribution across machines.
1. Running the export from within lakeFS itself.


## Possible Solutions external to lakeFS

### Spark client stand-alone

Create a stand-alone Docker container that runs the Spark client.

To use this option, users will need Docker in their toolset: they will have to install it if necessary and learn how to work with it.

Usage example:

```shell
docker run lakefs-export --conf spark.hadoop.lakefs.api.url=https://<LAKEFS_ENDPOINT>/api/v1 \
--conf spark.hadoop.lakefs.api.access_key=<LAKEFS_ACCESS_KEY_ID> \
--conf spark.hadoop.lakefs.api.secret_key=<LAKEFS_SECRET_ACCESS_KEY> \
--packages io.lakefs:lakefs-spark-client_2.12:0.1.0 \
--class io.treeverse.clients.Main export-app example-repo s3://example-bucket/prefix \
--branch=example-branch
```

Pros:
- Utilizes the existing Spark client and provides the same functionality.

Cons:
- Requires creating and maintaining the standalone container.
- Still requires the user to understand Spark indirectly, for example, knowing how to read logs or errors that arise from running Spark stand-alone.

### Using the Java client
Reimplement the Spark client's export functionality by creating a new Exporter class that doesn't use Spark.
This class can be part of the Java client, or of a new "Export" client that uses the Java client.

Usage will be from Java code (in a similar way to using custom code with the Spark client).

To use this option, users will need Java in their toolset: they will have to install it if necessary and learn how to work with it.

Usage example:

```java
class Export {
    public static void main(String[] args) {
        // apiClient, repoName, and rootLocation are assumed to be provided
        // by the surrounding application code.
        Exporter exporter = new Exporter(apiClient, repoName, rootLocation);
        // Export everything reachable from the branch HEAD.
        exporter.exportAllFromBranch(branchName);
        // Export only the diff between the branch and the given commit.
        exporter.exportFrom(branchName, commitRef);
    }
}
```

Pros:
- An intermediate solution that satisfies the requirements of the issue.

Cons:
- Requires examining how Spark is used in the current implementation (in order to rewrite its functions); this is not necessarily possible in a naive way.
- Requires developing and maintaining a second export implementation.

### Using Rclone
[Rclone](https://rclone.org/) is a command-line program to sync files and directories between cloud providers.
Users can gain the export functionality by using Rclone's [copy](https://rclone.org/commands/rclone_copy/), [sync](https://rclone.org/commands/rclone_sync/), and [check](https://rclone.org/commands/rclone_check/) commands.

The copy command can be used to copy files from lakeFS to a designated object store location.
The sync command can be used to export only the diff between a specific branch and a commit reference, since sync makes the source and the destination identical (modifying the destination only).
The check command can be used as a success/failure indication on the exported objects, since it checks whether the files in the source and destination match, and logs a report of the files that don't.

Usage example:

```shell
rclone copy source:lakefs:example-repo/main/ dest:s3://example-bucket/prefix

rclone sync source:lakefs:example-repo/main/ dest:s3://example-bucket/prefix

rclone check source:lakefs:example-repo/main/ dest:s3://example-bucket/prefix
```
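
The remote names in the commands above are illustrative. As a minimal sketch of the setup they assume, the following Python snippet creates two rclone remotes, pointing rclone's S3 backend at the lakeFS S3-compatible gateway for the source; the remote names ("lakefs", "s3") and all endpoint and credential values are placeholders, not a final interface:

```python
import subprocess

# Remote "lakefs": rclone's S3 backend pointed at the lakeFS S3 gateway.
subprocess.run([
    "rclone", "config", "create", "lakefs", "s3",
    "provider", "Other",
    "access_key_id", "<LAKEFS_ACCESS_KEY_ID>",
    "secret_access_key", "<LAKEFS_SECRET_ACCESS_KEY>",
    "endpoint", "https://<LAKEFS_ENDPOINT>",
], check=True)

# Remote "s3": the destination object store.
subprocess.run([
    "rclone", "config", "create", "s3", "s3",
    "provider", "AWS",
    "access_key_id", "<S3_ACCESS_KEY>",
    "secret_access_key", "<S3_SECRET_KEY>",
], check=True)
```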

Pros:
- Rclone provides the functionality of exporting data from lakeFS to a designated object store location.
- Simple to use.
- Doesn't require maintaining a new feature.

Cons:
- Requires a designated object store location that contains no data other than the data associated with the lakeFS branch. That is because Rclone's sync command deletes any destination files that don't exist in the source.


## Chosen solution

Wrap Rclone with a Python script to match the behavior of the new export functionality to that of the Spark client.
The script will run inside a Docker container that includes all the necessary dependencies.
The relevant Docker image will be published to Docker Hub.
This solution utilizes Rclone and its features; it is easy to implement and achieves the required behavior.

For example:
```shell
docker run -e LAKEFS_ACCESS_KEY=XXX -e LAKEFS_SECRET_KEY=YYY \
  -e LAKEFS_ENDPOINT=https://<LAKEFS_ENDPOINT>/ \
  -e S3_ACCESS_KEY=XXX -e S3_SECRET_KEY=YYY \
  -it lakefs-export repo-name s3://some-bucket/some_path/ --branch="branch-name"
```

The script will call:
```shell
rclone sync source:lakefs:example-repo/main/ dest:s3://example-bucket/prefix
rclone check source:lakefs:example-repo/main/ dest:s3://example-bucket/prefix
```
It will then add a success/failure file according to the result of the check command, as sketched below.
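
A minimal sketch of how such a script might look, assuming the rclone remotes "lakefs" and "s3" from the earlier sketch; the argument names and the `_SUCCESS`/`_FAILURE` marker names are assumptions, not a final interface:

```python
import argparse
import subprocess
import sys


def rclone(*args: str) -> int:
    """Run an rclone subcommand and return its exit code."""
    return subprocess.run(["rclone", *args]).returncode


def main() -> None:
    parser = argparse.ArgumentParser(description="Export a lakeFS branch to an object store location")
    parser.add_argument("repo", help="lakeFS repository name")
    parser.add_argument("dest", help="destination, e.g. s3://example-bucket/prefix/")
    parser.add_argument("--branch", required=True, help="branch to export")
    args = parser.parse_args()

    source = f"lakefs:{args.repo}/{args.branch}/"
    dest = f"s3:{args.dest.removeprefix('s3://')}"
    if not dest.endswith("/"):
        dest += "/"

    # Make the destination identical to the branch HEAD (deletes extra files).
    if rclone("sync", source, dest) != 0:
        sys.exit("rclone sync failed")

    # check returns non-zero if the source and destination don't match.
    marker = "_SUCCESS" if rclone("check", source, dest) == 0 else "_FAILURE"

    # Write an empty success/failure marker object to the destination.
    subprocess.run(["rclone", "touch", dest + marker], check=True)


if __name__ == "__main__":
    main()
```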