
# Delta Lake catalog exporter

## Introduction

The Delta Lake table format manages its catalog of data files through a log-based system. This log, as the name implies,
contains a sequence of deltas representing the changes applied to the table. The set of files that collectively represents
the table at a given log entry is constructed by reapplying the changes stored in the log files one by one, starting
from the last checkpoint (a file that summarizes all changes up to that point) and progressing to the latest log entry.
Each log entry records actions such as `add` or `remove`, which add or remove data (Parquet) files and ultimately shape
the structure of the table.
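
As an illustration of this replay, here is a minimal Go sketch; the type and function names are invented for this
document and are not the actual Delta Lake or lakeFS code. It reconstructs the live file set from a checkpoint plus the
log entries recorded after it:

```go
package deltaexport

// logAction is a simplified stand-in for a Delta log action: a real entry may
// carry several actions (commitInfo, add, remove, metaData, ...).
type logAction struct {
	Add    string // data file path added by this action, if any
	Remove string // data file path removed by this action, if any
}

// replayLog sketches how a table's current file set is derived from the log:
// start from the files listed in the last checkpoint, then apply each later
// action, where "add" inserts a data file and "remove" drops one.
func replayLog(checkpointFiles []string, actions []logAction) []string {
	live := make(map[string]struct{}, len(checkpointFiles))
	for _, f := range checkpointFiles {
		live[f] = struct{}{}
	}
	for _, a := range actions {
		if a.Add != "" {
			live[a.Add] = struct{}{}
		}
		if a.Remove != "" {
			delete(live, a.Remove)
		}
	}
	files := make([]string, 0, len(live))
	for f := range live {
		files = append(files, f)
	}
	return files
}
```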

In order to make Delta Lake tables accessible to external users, we aim to export the Delta Lake log to an external
location, enabling these users to read tables backed by lakeFS and Delta Lake.

---

## Proposed Solution

Following the [catalog exports issue](https://github.com/treeverse/lakeFS/issues/6461), the Delta Lake log will be
exported to the `${storageNamespace}/_lakefs/exported/${ref}/${commitId}/${tableName}/_delta_log/` path, which resides
within the user's designated storage bucket.  
Within the `_delta_log` directory, you will find the following components:
1. The last checkpoint (or the initial log entry if no checkpoint has been established yet).
2. All the log entries that have been recorded since that last checkpoint.
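
For illustration, for a hypothetical repository whose storage namespace is `s3://example-bucket/my-repo`, exporting a
table named `my_table` from the `main` ref might produce a layout along these lines (bucket, table, and file names are
placeholders; the log file names follow Delta Lake's zero-padded version naming):

```
s3://example-bucket/my-repo/_lakefs/exported/main/<commitId>/my_table/_delta_log/
    00000000000000000010.checkpoint.parquet    <- last checkpoint
    00000000000000000011.json                  <- log entries recorded since
    00000000000000000012.json                     that checkpoint
```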

Notably, the exported log entries will mirror those present in lakeFS, with one key distinction: instead of using
relative logical paths, they will contain absolute physical paths:

#### lakeFS-backed Delta Log entry:
```json
{ "commitInfo": {
  "timestamp": 1699199369960,
  "operation": "WRITE",
  "operationParameters": {
    "mode": "Overwrite",
    "partitionBy": "[]"
  },
  "readVersion": 2,
  ...
  }
}
{ "add": {
  "path": "part-00000-72b765fd-a97b-4386-b92c-cc582a7ca176-c000.snappy.parquet",
  ...
  }
}
{ "remove": {
  "path": "part-00000-56e72a31-0078-459d-a577-ef2c5d3dc0f9-c000.snappy.parquet",
  ...
  }
}
```

#### Exported Delta Log entry:
```json
{ "commitInfo": {
  "timestamp": 1699199369960,
  "operation": "WRITE",
  "operationParameters": {
    "mode": "Overwrite",
    "partitionBy": "[]"
  },
  "readVersion": 2,
  ...
  }
}
{ "add": {
  "path": "s3://my-bucket/my-path/data/gk3l4p7nl532qibsgkv0/cl3rj1fnl532qibsglr0",
  ...
  }
}
{ "remove": {
  "path": "s3://my-bucket/my-path/data/gk899p7jl532qibsgkv8/zxcrhuvnl532qibshouy",
  ...
  }
}
```
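
To make the distinction concrete, the following Go sketch shows the kind of per-entry transformation involved. The
function name, signature, and the `resolve` callback are illustrative only (not the actual exporter or lakeFS API);
`resolve` stands in for whatever maps a logical, table-relative path to the absolute physical address of the backing
object:

```go
package deltaexport

import "encoding/json"

// rewriteEntryPaths rewrites the "path" field of every "add" or "remove"
// action found in a single JSON log entry line, replacing the table-relative
// logical path with the absolute physical address returned by resolve.
// Entries without add/remove actions (e.g. commitInfo-only) pass through.
func rewriteEntryPaths(line []byte, resolve func(logicalPath string) (string, error)) ([]byte, error) {
	var entry map[string]json.RawMessage
	if err := json.Unmarshal(line, &entry); err != nil {
		return nil, err
	}
	for _, key := range []string{"add", "remove"} {
		raw, ok := entry[key]
		if !ok {
			continue
		}
		var action map[string]any
		if err := json.Unmarshal(raw, &action); err != nil {
			return nil, err
		}
		logical, _ := action["path"].(string)
		physical, err := resolve(logical)
		if err != nil {
			return nil, err
		}
		action["path"] = physical
		rewritten, err := json.Marshal(action)
		if err != nil {
			return nil, err
		}
		entry[key] = rewritten
	}
	return json.Marshal(entry)
}
```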

---

We will use the [delta-go](https://github.com/csimplestring/delta-go) package to read the Delta Lake log since the last
checkpoint (or from the first log entry, if no checkpoint exists) and to generate the new `_delta_log` directory with log
entries as described above. The directory and log files will be written to
`${storageNamespace}/_lakefs/exported/${ref}/${commitId}/${tableName}/_delta_log/`.
The `tableName` will be fetched from the hook's configuration.  
The Delta Lake table can then be read from `${storageNamespace}/_lakefs/exported/${ref}/${commitId}/${tableName}`.
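
The overall hook flow could be outlined roughly as follows. This is a sketch under the assumptions above: the function
parameters merely stand in for whatever delta-go and the object-store client actually expose (they are not real APIs),
and handling of the checkpoint file itself is omitted:

```go
package deltaexport

import (
	"context"
	"fmt"
)

// exportDeltaLog outlines the exporter flow: read the JSON log entries
// recorded since the last checkpoint, rewrite their paths to absolute
// physical addresses, and write the results under the export prefix.
// entriesSinceLastCheckpoint stands in for reading the log via delta-go,
// rewriteEntry for the logical-to-physical translation shown earlier, and
// putObject for writing an object to the destination bucket.
func exportDeltaLog(
	ctx context.Context,
	storageNamespace, ref, commitID, tableName string,
	entriesSinceLastCheckpoint func(ctx context.Context) (map[string][]byte, error),
	rewriteEntry func(entry []byte) ([]byte, error),
	putObject func(ctx context.Context, key string, data []byte) error,
) error {
	prefix := fmt.Sprintf("%s/_lakefs/exported/%s/%s/%s/_delta_log",
		storageNamespace, ref, commitID, tableName)

	// Log entries recorded since the last checkpoint, keyed by file name.
	entries, err := entriesSinceLastCheckpoint(ctx)
	if err != nil {
		return fmt.Errorf("read delta log: %w", err)
	}
	for name, entry := range entries {
		// Replace relative logical paths with absolute physical paths.
		rewritten, err := rewriteEntry(entry)
		if err != nil {
			return fmt.Errorf("rewrite %s: %w", name, err)
		}
		if err := putObject(ctx, prefix+"/"+name, rewritten); err != nil {
			return fmt.Errorf("write %s: %w", name, err)
		}
	}
	return nil
}
```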