github.com/treeverse/lakefs@v1.24.1-0.20240520134607-95648127bfb0/design/open/delta-catalog-exporter.md (about) 1 # Delta Lake catalog exporter 2 3 ## Introduction 4 5 The Delta Lake table format manages its catalog of used files through a log-based system. This log, as the name implies, 6 contains a sequence of deltas representing changes applied to the table. The list of files that collectively represent 7 the table at a specific log entry is constructed by reapplying the changes stored in the log files one by one, starting 8 from the last checkpoint (a file that summarizes all changes up to that point) and progressing to the latest log entry. 9 Each log entry contains either an `add` or `remove` action, which adds or removes a data (parquet) file, ultimately 10 shaping the structure of the table. 11 12 In order to make Delta Lake tables accessible to external users, we aim to export the Delta Lake log to an external 13 location, enabling these users to read tables backed by lakeFS and Delta Lake. 14 15 --- 16 17 ## Proposed Solution 18 19 Following the [catalog exports issue](https://github.com/treeverse/lakeFS/issues/6461), the Delta Lake log will be 20 exported to the `${storageNamespace}/_lakefs/exported/${ref}/${commitId}/${tableName}/_delta_log/` path, which resides 21 within the user's designated storage bucket. 22 Within the `_delta_log` directory, you will find the following components: 23 1. The last checkpoint (or the initial log entry if no checkpoint has been established yet). 24 2. All the log entries that have been recorded since that last checkpoint. 25 26 Notably, the log entries will mirror those present in lakeFS, with one key distinction: Instead of utilizing relative 27 logical paths, they will include absolute physical paths: 28 29 #### lakeFS-backed Delta Log entry: 30 ```json 31 { "commitInfo": { 32 "timestamp": 1699199369960, 33 "operation": "WRITE", 34 "operationParameters": { 35 "mode": "Overwrite", 36 "partitionBy": "[]" 37 }, 38 "readVersion": 2, 39 ... 40 } 41 { "add": { 42 "path":"part-00000-72b765fd-a97b-4386-b92c-cc582a7ca176-c000.snappy.parquet", 43 ... 44 } 45 } 46 { "remove": { 47 "path":"part-00000-56e72a31-0078-459d-a577-ef2c5d3dc0f9-c000.snappy.parquet", 48 ... 49 } 50 } 51 ``` 52 53 #### Exported Delta Log entry: 54 ```json 55 { "commitInfo": { 56 "timestamp": 1699199369960, 57 "operation": "WRITE", 58 "operationParameters": { 59 "mode": "Overwrite", 60 "partitionBy": "[]" 61 }, 62 "readVersion": 2, 63 ... 64 } 65 { "add": { 66 "path":"s3://my-bucket/my-path/data/gk3l4p7nl532qibsgkv0/cl3rj1fnl532qibsglr0", 67 ... 68 } 69 } 70 { "remove": { 71 "path":"s3://my-bucket/my-path/data/gk899p7jl532qibsgkv8/zxcrhuvnl532qibshouy", 72 ... 73 } 74 } 75 ``` 76 77 --- 78 79 We shall use the [delta-go](https://github.com/csimplestring/delta-go) package to read the Delta Lake log since the last 80 checkpoint (or first entry if none found) and generate the new `_delta_log` directory with log entries as described 81 above. The directory and log files will be written to `${storageNamespace}/_lakefs/exported/${ref}/${commitId}/${tableName}/_delta_log/`. 82 The `tableName` will be fetched from the hook's configurations. 83 The Delta Lake table can now be read from `${storageNamespace}/_lakefs/exported/${ref}/${commitId}/${tableName}`.