# Delta Table Differ

## User Story

Jenny the data engineer runs a new retention-and-column-remover ETL over multiple Delta Lake tables. To test the ETL before running it on production, she uses lakeFS (obviously): she branches out to a dedicated branch `exp1` and runs the ETL pointed at it.
The output is not what Jenny planned... The Delta table is missing multiple columns and the number of rows just got bigger!
She would like to debug the ETL run. In other words, she wants to see the Delta table history changes applied in branch `exp1` since the commit that she branched from.

---

## Goals and scope

1. For the MVP, support only Delta table diff.
2. The system should be open to different data comparison implementations.
3. The diff itself will consist of metadata changes only (computed using the operation history of the two tables) and, optionally, the change in the number of rows.
4. The diff will be a "two-dots" diff (like `git log branch1..branch2`), i.e. showing the log changes that happened in the topic branch but not in the base branch.
5. UI: GUI only.
6. Reduce user friction as much as possible.

---

## Non-Goals and off-scope

1. The Delta diff will be limited to the available Delta Log entries (the JSON files), which are, by default, retained
for [30 days](https://docs.databricks.com/delta/history.html#configure-data-retention-for-time-travel).
2. Log compaction isn't supported.
3. Delta table diff will only be supported if the table stayed at the same path: if a Delta table at path
`repo/branch1/delta/table` was moved in a different branch to `repo/branch2/delta/different/location/table`,
then it won't be "diffable".

---

## High-Level design

### Considered architectures

##### Diff as part of the lakeFS server

Implement the diff functionality within lakeFS.  
Delta doesn't provide a Go SDK, so we'd need to implement the Delta data model and interfaces ourselves, which is not something we aim to do.
In addition, this couples future diff implementations to Go (which might lead to similar problems).

##### Diff as a standalone RESTful server

The diff will be implemented in one of the languages with an available Delta Lake SDK and will be called over REST from lakeFS to get the diff results.
Users will run and manage the server and add its communication details to the lakeFS configuration file.  
The problem with this approach is the friction it adds for users. We intend to make the diff experience as seamless as possible, yet this solution adds the overhead of managing another server alongside lakeFS.

##### Diff as bundled Rust/WebAssembly in lakeFS

Using the `delta-rs` package to build the differ, and bundling it into the lakeFS binary using `cgo` (to communicate with Rust's FFI) or as a WebAssembly package that runs on a WebAssembly runtime/engine in Go.  
Many languages can be compiled to WebAssembly, which is a great benefit, yet the engines needed to run a WebAssembly runtime from Go are implemented in other languages (not Go), are unstable, and add compilation complexity and maintenance burden.

##### Diff as an external binary executable

We can trigger a diff binary from lakeFS using the `os/exec` package and get the output generated by that binary.
This is almost where we want to be, except that we'd need to validate that the lakeFS server and the executable use the same "API version" to communicate, so that the binary's output matches what lakeFS expects. In addition, we'd like to minimize the amount of deserialization code needed to interpret the returned DTO (for maintainability and extensibility).

### Chosen architecture

#### The Microkernel/Plugin Architecture

The Microkernel/Plugin architecture is composed of two entities: a single "core system" and multiple "plugins".  
In our case, the lakeFS server acts as the core system, and the different diff implementations, including the Delta diff implementation, act as plugins.  
We'll use `gRPC` as the transport protocol, which makes the language of choice almost immaterial (thanks to protobufs as the transferred data format),
as long as the plugin binary is self-contained (in theory a plugin could also depend on a system runtime, but the cost would be an added requirement for running lakeFS: runtime support).

#### Plugin system considerations

* The plugin system should be flexible enough not to impose a language restriction, so that in the future lakeFS users can take part in plugin creation (not restricted to diffs).
* The plugin system should support multiple OSs.
* The plugin system should make it easy to write plugins.

#### Hashicorp Plugin system

Hashicorp's battle-tested [`go-plugin`](https://github.com/hashicorp/go-plugin) system is a plugin system over RPC (used in Terraform, Vault, Consul, and Packer).
Currently, it's only designed to work over a local network.  
Plugins can be written and consumed in almost every major language. This is achieved by supporting [gRPC](http://www.grpc.io/) as the communication protocol.
* The plugin system works by launching subprocesses and communicating over RPC (both `net/rpc` and `gRPC` are supported).
  A single connection is made between any plugin and the core process. For gRPC-based plugins, the HTTP/2 protocol handles [connection multiplexing](https://freecontent.manning.com/animation-http-1-1-vs-http-2-vs-http-2-with-push/).
* The plugin and core system are separate processes, which means that a crash of a plugin won't crash the core system (lakeFS in our case).

![Microkernel architecture overview](../open/diagrams/microkernel-overview.png)
[(excalidraw file)](../open/diagrams/microkernel-overview.excalidraw)
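
To make the core/plugin contract concrete, here is a minimal sketch of how the core system could launch and consume a gRPC plugin with `go-plugin`. The `proto` package, its `Differ` service, and all other names are hypothetical stand-ins for the protobuf contract this design implies; only the `go-plugin` API calls themselves are real:

```go
package diff

import (
	"context"
	"os/exec"

	"github.com/hashicorp/go-plugin"
	"google.golang.org/grpc"

	// Hypothetical stubs generated from a shared diff.proto definition.
	proto "github.com/treeverse/lakefs/pkg/plugins/diff/proto"
)

// Handshake guards against lakeFS executing an arbitrary binary as a plugin:
// the subprocess must echo the magic cookie back before any RPC happens.
var Handshake = plugin.HandshakeConfig{
	ProtocolVersion:  1,
	MagicCookieKey:   "LAKEFS_DIFF_PLUGIN",
	MagicCookieValue: "diff",
}

// GRPCPlugin tells go-plugin how to serve the differ (plugin side) and how
// to consume it (core side) over the single gRPC connection.
type GRPCPlugin struct {
	plugin.Plugin
	Impl proto.DifferServer // set on the plugin side only
}

func (p *GRPCPlugin) GRPCServer(_ *plugin.GRPCBroker, s *grpc.Server) error {
	proto.RegisterDifferServer(s, p.Impl)
	return nil
}

func (p *GRPCPlugin) GRPCClient(_ context.Context, _ *plugin.GRPCBroker, c *grpc.ClientConn) (interface{}, error) {
	return proto.NewDifferClient(c), nil
}

// NewDifferClient is the core-system side: it launches the plugin binary as
// a subprocess, performs the handshake, and dispenses a typed gRPC client.
// The returned closer kills the subprocess.
func NewDifferClient(binaryPath string) (proto.DifferClient, func(), error) {
	client := plugin.NewClient(&plugin.ClientConfig{
		HandshakeConfig:  Handshake,
		Plugins:          map[string]plugin.Plugin{"differ": &GRPCPlugin{}},
		Cmd:              exec.Command(binaryPath),
		AllowedProtocols: []plugin.Protocol{plugin.ProtocolGRPC},
	})
	rpcClient, err := client.Client() // starts the subprocess
	if err != nil {
		client.Kill()
		return nil, nil, err
	}
	raw, err := rpcClient.Dispense("differ")
	if err != nil {
		client.Kill()
		return nil, nil, err
	}
	return raw.(proto.DifferClient), client.Kill, nil
}
```

The plugin side is symmetric: the binary only needs to speak the same handshake and serve the `Differ` gRPC service, which is what makes the plugin's implementation language immaterial.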

---

## Implementation

![Delta Diff flow](../open/diagrams/delta-diff-flow.png)
[(excalidraw file)](../open/diagrams/delta-diff-flow.excalidraw)

#### DiffService

The DiffService will be an internal component in lakeFS which will serve as the core system.  
To discover which diff plugins are available, we'll use a new manifest section and environment variables:

```yaml
diff:
  <diff_type>:
    plugin: <plugin_name_1>
plugins:
  <plugin_name_1>: <location of the 'plugin_name_1' plugin - full path to the binary to execute>
  <plugin_name_2>: <location of the 'plugin_name_2' plugin - full path to the binary to execute>
```

1. `LAKEFS_DIFF_{DIFF_TYPE}_PLUGIN`
   The DiffService will use the plugin name provided by this env var to load the binary and perform the diff. For instance,
   `LAKEFS_DIFF_DELTA_PLUGIN=delta_diff_plugin` will search for the binary path under the manifest path
   `plugins.delta_diff_plugin` or the environment variable `LAKEFS_PLUGINS_DELTA_DIFF_PLUGIN`.
   If a given plugin's path is not found in the manifest or env var (e.g. `delta_diff_plugin` in the above example),
   a binary of a plugin with the same name will be searched for under `~/.lakefs/plugins` (as the
   [plugins' default location](https://github.com/hashicorp/terraform/blob/main/plugins.go)).
2. `LAKEFS_PLUGINS_{PLUGIN_NAME}`
   This environment variable (and corresponding manifest path) will provide the path to the binary of a plugin named
   {PLUGIN_NAME}, for example, `LAKEFS_PLUGINS_DELTA_DIFF_PLUGIN=/path/to/delta/diff/binary`.
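
For example, a concrete configuration for the Delta plugin could look as follows (the binary path is illustrative; it follows the packaging layout described under [Package](#package)):

```yaml
diff:
  delta:
    plugin: delta_diff_plugin
plugins:
  delta_diff_plugin: /path/to/plugins/diff/delta_v1
```

or, equivalently, via environment variables: `LAKEFS_DIFF_DELTA_PLUGIN=delta_diff_plugin` and `LAKEFS_PLUGINS_DELTA_DIFF_PLUGIN=/path/to/plugins/diff/delta_v1`.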

The **type** of diff will be sent as part of the request to lakeFS, as specified in the [API](#api) section.
The communication between the DiffService and the plugins, as explained above, will go through the `go-plugin` package (`gRPC`).

#### Delta Diff Plugin

Implemented using [delta-rs](https://github.com/delta-io/delta-rs) (Rust), this plugin will perform the diff operation using table paths provided by the DiffService through a `gRPC` call.  
To query the Delta table from lakeFS, the plugin will construct an S3 client (a constraint imposed by the `delta-rs` package) and send requests to lakeFS's S3 gateway.  

* The objects requested from lakeFS will only be the `_delta_log` JSON files (the commit entries, e.g. `_delta_log/00000000000000000003.json`), as they are the files that construct a table's history.
* We shall limit the number of returned objects to 1000 (each entry is ~500 bytes).<sup>[2](#future-plans)</sup>

_The diff algorithm_ (a Go sketch follows the list):
1. Get the table version of the base common ancestor.
2. Run a variation of the Delta [HISTORY command](https://docs.delta.io/latest/delta-utility.html#history-schema)<sup>[3](#future-plans)</sup>
(this variation starts from a given version number) on both table paths, starting from the version retrieved above.
    - If the left table doesn't include the base version (from point 1), start from the oldest version greater than the base.
3. Operate on the [returned "commitInfo" entry vector](https://github.com/delta-io/delta-rs/blob/d444cdf7588503c1ebfceec90d4d2fadbd50a703/rust/src/delta.rs#L910),
starting from the **oldest** version of each history vector that is greater than the base table version:
    1. If one table's version is greater than the other's, advance the smaller table's version until it reaches the same version as the other table.
    2. While there are more versions available && the answer vector's size < 1001:
        1. Create a hash for each entry based on the fields: `timestamp`, `operation`, `operationParameters`, and `operationMetrics` values.
        2. Compare the hashes of the versions.
        3. If they **aren't equal**, add the "left" table's entry to the returned history list.
        4. Advance one version forward in both vectors.
4. Return the history list.
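
Below is a minimal sketch of steps 3-4, assuming both histories were already fetched and are sorted by ascending version. The actual plugin will be written in Rust on top of `delta-rs`; Go is used here for consistency with this document's other examples, and the `CommitInfo` struct is a hypothetical mirror of the Delta `commitInfo` fields used for hashing:

```go
package deltadiff

import (
	"crypto/sha256"
	"fmt"
	"io"
	"sort"
)

// CommitInfo mirrors the Delta "commitInfo" fields that take part in the
// comparison (hypothetical struct; field names follow the Delta spec).
type CommitInfo struct {
	Version             int64
	Timestamp           int64
	Operation           string
	OperationParameters map[string]string
	OperationMetrics    map[string]string
}

// hashEntry fingerprints an entry by the fields listed in step 3.2.1:
// timestamp, operation, operationParameters, and operationMetrics.
func hashEntry(e CommitInfo) [sha256.Size]byte {
	h := sha256.New()
	fmt.Fprintf(h, "%d|%s", e.Timestamp, e.Operation)
	writeSortedMap(h, e.OperationParameters)
	writeSortedMap(h, e.OperationMetrics)
	var sum [sha256.Size]byte
	copy(sum[:], h.Sum(nil))
	return sum
}

// writeSortedMap serializes a map in sorted key order so the hash is
// deterministic regardless of Go's map iteration order.
func writeSortedMap(w io.Writer, m map[string]string) {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Fprintf(w, "|%s=%s", k, m[k])
	}
}

// DiffHistories implements steps 3-4: align both histories on a common
// version (step 3.1), then walk them in lockstep collecting the left entries
// whose hashes differ (step 3.2), capped at 1000 results.
func DiffHistories(left, right []CommitInfo) []CommitInfo {
	i, j := 0, 0
	// Step 3.1: advance the history that lags behind.
	for i < len(left) && j < len(right) && left[i].Version != right[j].Version {
		if left[i].Version < right[j].Version {
			i++
		} else {
			j++
		}
	}
	var result []CommitInfo
	// Step 3.2: compare version-aligned entries while versions remain.
	for i < len(left) && j < len(right) && len(result) < 1001 {
		if hashEntry(left[i]) != hashEntry(right[j]) {
			result = append(result, left[i])
		}
		i++
		j++
	}
	return result // step 4
}
```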

### Authentication<sup>[4](#future-plans)</sup>

The `delta-rs` package constructs an S3 client which will be used to communicate back to lakeFS (through the S3 gateway).  
For the S3 client to communicate with lakeFS (or S3 in general), it needs to pass an AWS Access Key ID and Secret Access Key.  
Since applicative credentials are not obligatory, the user who sent the diff request might not have any, and even if they do,
they cannot send them through the GUI (which is the UI we chose to implement this feature for in the MVP).
To overcome this, we'll use special diff credentials as follows (see the sketch after this list):
1. A user makes a Delta diff request.
2. The DiffService checks if the user has "diff credentials" in the DB:
    1. If there are such credentials, it will use them.
    2. If there aren't, it will generate the credentials and save them: `{AKIA: DIFF-<>, SAK: <>}`. The `DIFF` prefix will be used to identify "diff credentials".
3. The DiffService will pass the credentials to the Delta Diff plugin as part of the `gRPC` call.
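
A sketch of step 2, assuming a hypothetical `AuthService` interface standing in for lakeFS's credential store (the real lakeFS interfaces may differ):

```go
package diff

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"strings"
)

// diffCredentialsPrefix marks access key IDs as diff-only credentials.
const diffCredentialsPrefix = "DIFF-"

type Credentials struct {
	AccessKeyID     string
	SecretAccessKey string
}

// AuthService is a hypothetical stand-in for lakeFS's credential store.
type AuthService interface {
	ListUserCredentials(ctx context.Context, userID string) ([]Credentials, error)
	AddCredentials(ctx context.Context, userID, accessKeyID, secretAccessKey string) (Credentials, error)
}

// GetOrCreateDiffCredentials implements step 2: reuse the user's existing
// diff credentials if present, otherwise mint and persist a new pair of the
// form {AKIA: DIFF-<random>, SAK: <random>}.
func GetOrCreateDiffCredentials(ctx context.Context, auth AuthService, userID string) (Credentials, error) {
	creds, err := auth.ListUserCredentials(ctx, userID)
	if err != nil {
		return Credentials{}, err
	}
	for _, c := range creds {
		if strings.HasPrefix(c.AccessKeyID, diffCredentialsPrefix) {
			return c, nil // step 2.1: diff credentials already exist
		}
	}
	// Step 2.2: generate and save new diff credentials.
	return auth.AddCredentials(ctx, userID, diffCredentialsPrefix+randomHex(10), randomHex(20))
}

// randomHex returns a hex string of 2n characters from n random bytes.
func randomHex(n int) string {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		panic(err) // a crypto/rand failure is unrecoverable
	}
	return hex.EncodeToString(b)
}
```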

### API

- GET `/repositories/repo/{repo}/otf/refs/{left_ref}/diff/{right_ref}?table_path={path}&type={diff_type}`
    - Tagged as experimental
      - **Response**:  
          The response includes the type of diff (`changed`, `created`, or `dropped`) and an array of operations
          from different versions of the specified table format.  
          It has a general structure that can accommodate the operation structs of different table formats.
        - `DiffType`:
          - description: the type of change
          - type: string
        - `Results`:
          - description: an array of differences
          - type: array[
            - id:
              - type: string
              - description: the version/snapshot/transaction id of the operation.
            - timestamp (epoch):
              - type: long
              - description: the operation's timestamp.
            - operation:
              - type: string
              - description: the operation's name.
            - operationContent:
              - type: map
              - description: operation content specific to the implemented table format.
            - operationType:
              - type: string
              - description: the type of the performed operation: created, updated, or deleted
            ]

            **Delta lake response example**:

            ```json
            {
                "table_diff_type": "changed",
                "results": [
                    {
                        "id": "1",
                        "timestamp": 1515491537026,
                        "operation_type": "update",
                        "operation": "INSERT",
                        "operation_content": {
                            "operation_parameters": {
                                "mode": "Append",
                                "partitionBy": "[]"
                            }
                        }
                    },
                    ...
                ]
            }
            ```
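
Assuming the standard lakeFS API base path and basic authentication with the user's key pair (both assumptions, since this design only specifies the route itself), a call could look like:

```sh
curl -u "$LAKEFS_ACCESS_KEY_ID:$LAKEFS_SECRET_ACCESS_KEY" \
    "https://lakefs.example.com/api/v1/repositories/repo/my-repo/otf/refs/main/diff/exp1?table_path=tables/my-delta-table&type=delta"
```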

---

### Build & Package

#### Build

The Rust diff implementation will be located within the lakeFS repo. That way, the protobuf definitions can easily be shared between the
"core" and "plugin" components. We will use `cargo` to build the artifacts for the different architectures.

###### lakeFS development

The different plugins will have their own targets in the `makefile`, which will not be included as a (sub-)dependency
of the `all` target. That way we won't force lakeFS developers to include Rust (or any other plugin runtime)
in their stack.
Developers who want to develop new plugins will have to include the necessary runtime in their environment.

* _Contribution guide_: this also means updating the contribution guide to explain how to update the makefile when a new plugin is added.

###### Docker

The Dockerfile will build the plugins and will include Rust in the build stage.

#### Package

1. We will package the binary within the released lakeFS + lakectl archive. It will be located under a `plugins` directory:
`plugins/diff/delta_v1`.
2. We will also update the Docker image to include the binary. The Docker image is what clients should
use to get the "out-of-the-box" experience.

---

## Metrics

### Applicative Metrics

- Diff runtime
- The total number of versions read (per diff) - this is the number of Delta Log (JSON) files that the plugin read from lakeFS.
This is essentially the size of a Delta log. We can use it later to optimize reading and to understand the average size of a
Delta log.

### Business Statistics

- The number of unique requests by installation ID - gives a lower bound on the number of Delta Lake users.

---

## Future plans

1. Add a separate call to the plugin that merely checks whether the plugin is willing to try to run on a location.
This allows a better UX (graying out the button until a valid path is typed in), and will probably be helpful for performing
auto-detection.
2. Support pagination.
3. The `delta-rs` HISTORY command is inefficient, as it sends multiple requests to get all the Delta log entries.
We might want to hold this information in our own model to make it easier for the diff (and possibly merge) to work.
4. This design's authentication method is temporary. It lists two auth requirements that we might need to implement:
   1. Support tags/policies for credentials.
   2. Allow temporary credentials.