## Existing

We have the following forms of table definitions. They are ordered here
roughly chronologically by initial implementation, adjusted for mutual
dependencies.

### `lakectl metastore`

This command supports *Hive* and *Glue* metastores.

A Hive metastore is stored off lakeFS on a Hive Metastore Server (HMS).
`lakectl` allows updating the table stored on the HMS to match branch
states, and can drive HMS merges and copies. All metadata is stored on the
HMS; the `lakectl metastore` command calls out to an HMS URI that is
supplied by configuration or by flags.

A Glue or similar metastore is stored off lakeFS. lakeFS exports "symlink"
objects that give the catalog direct access to the underlying data on S3.
All important table metadata is still stored on an external Glue catalog.

### Delta Lake and Delta-diff

Given a path, Delta describes its own schema, so no further catalog is
necessary. The current lakeFS Delta Lake implementation is unaware of the
list of tables stored in the repository.

Delta diff has no explicit concept of tables because it needs none. The web
UI activates Delta diff (only) when a folder contains a `_delta_log/`
subfolder.

It would be possible to list all folders holding Delta Lake tables by
listing with the separator `_delta_log/`. (A simple API change could add a
"prefixes-only" mode to the lakeFS list-objects API, if needed.)

### Iceberg Catalog

lakeFS Iceberg support is a catalog that locates Iceberg files. This allows
it to translate the schema component of a table identifier to a ref. The
lakeFS Iceberg catalog is an Iceberg HadoopCatalog wrapper: it supports
access through paths. In a sense, it does not support a true catalog
concept of "all tables".

Currently only HadoopCatalog-like behavior is supported, but it is not
clear what _other_ catalogs would work.

### Delta Sharing server

The Delta Sharing server is available on lakeFS Cloud. It is used to
connect lakeFS to Databricks Unity Catalog, along with similar support in
other products. The Delta Sharing protocol requires each Delta Sharing
server to be a catalog for the tables that it manages, so lakeFS Delta
Sharing defines tables.

The lakeFS Delta Sharing server defines tables using YAML files under the
`_lakefs_tables/` prefix. Each table is defined by its own YAML file. The
server translates the schema of a table identifier to a ref. Because paths
are specified as object keys, these definitions support lakeFS ref
semantics exactly: branching out, tagging, ref expressions, and even
merging successfully define the same table across schemas.

lakeFS Delta Sharing supports two table types:

* **Delta Lake**: These describe their own schema. A `delta` table needs no
  information beyond a table name and its path.
* **"Hive"**: These are partitioned directories of Parquet or possibly other
  files. Despite the name, no Hive metastore is involved; rather the lakeFS
  Delta Sharing server defines similar semantics. A `hive` table needs a
  name and a path, but also a list of partition columns and a schema for all
  fields (an illustrative sketch follows this list). Table schemas are
  defined using [this][delta-sharing-schema-schema], "a subset of Spark
  SQL's JSON Schema representation". Currently Hive-style tables do not
  support column range metadata.
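
To make this concrete, a `hive` table definition under `_lakefs_tables/`
might look roughly like the sketch below. The file name, table name, path,
partition column, and fields are all made up for illustration, and the
exact field layout should be checked against the Delta Sharing server
documentation rather than taken from here.

```yaml
# _lakefs_tables/animals.yaml -- hypothetical table; illustrative layout only
name: animals
type: hive
path: tables/animals            # prefix of the table's data within the ref
partition_columns: ['year']
schema:
  type: struct                  # subset of Spark SQL's JSON schema representation
  fields:
    - name: year
      type: integer
      nullable: false
      metadata: {}
    - name: name
      type: string
      nullable: true
      metadata: {}
```

A `delta` definition is smaller still: because Delta Lake tables describe
their own schema, it needs only `name`, `type: delta`, and `path`.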

## Analysis

We currently support three ways of defining tables:

* Implicitly, by discovering that a given path holds a table. This is our
  current Delta Lake and Delta-diff support.
* Externally, by deferring to an external catalog. This is our metastore
  support (HMS is the catalog). Our Iceberg support _might_ become partly
  external if we added support to defer table definitions to some other
  catalog.
* Internally, by effectively serving the catalog ourselves. This is our
  lakeFS Delta Sharing (the Delta Sharing protocol does not call this a
  "catalog"). Our Iceberg support is also currently internal.

### Implicit definitions

Implicit definitions give users the _easiest_ path. However they have
multiple limitations:

* They require that the table be discoverable by examining only object
  metadata, so only "modern" table formats are supported: Iceberg and
  Delta. Formats such as prefixes containing Parquet files _cannot_ support
  all operations:
  - It is not possible to report table metadata efficiently.
  - It may not be possible to list all tables.
  - It will _not_ be possible to support named tables; all tables must be
    defined by paths.

Using implicit definitions necessarily leaks through the table abstraction.
There is no useful general case here. Users will find it easy to request
functionality that relies on implicit definitions.

#### Possible future features

Define an implicit "Parquet folder" as a folder that holds many `.parquet`
files, or a folder that holds many `<key>=<value>` subfolders with some
`.parquet` files under them. Then we could add to the GUI a display of a
sample DuckDB query when looking at a Parquet folder, for instance
something along the lines of
`SELECT * FROM read_parquet('<folder>/*.parquet')`.

### External definitions

External definitions in catalogs may be the easiest way to give _some_
users a fully-functional experience, for _some_ catalogs. Their principal
limitations:

* Risk of fragmentation: If there are many popular external catalogs we
  will need to support many of them. Different catalogs support different
  use-cases, so we should expect to have to support multiple catalogs.
* Limitations due to the external definition: Because existing catalog
  implementations were designed without lakeFS in mind, we will not be able
  to add lakeFS features to all of them. For instance, the HMS protocol is
  Thrift-based and does not support refs-as-schemas.

Iceberg support might become an external definition once it supports
deferring to an external catalog. It does not yet support wrapping an
external catalog because of these risks.

#### Possible future features

Support more catalog types. Each catalog adds implementation effort.

Support catalog wrappers that work with multiple external catalogs. For
instance, wrapping additional catalogs under the Iceberg catalog would make
it an external definition.

### Internal definitions

An internal definition should allow us to support _all_ lakeFS features
that systems can handle. Because it reduces external dependencies, it is
probably the _fastest_ way for us to implement "table" features on top of
lakeFS -- we have seen this in the lakeFS Delta Sharing server. The Iceberg
catalog wrapper is currently internal, although it does not even support
defining tables except via Spark definitions (see the sketch below).
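
To make "via Spark definitions" concrete, the following sketch shows
roughly how a table is defined today through the lakeFS Iceberg catalog
wrapper from Spark. The catalog name, repository, namespace, and table are
made-up examples; the configuration keys and the
`io.lakefs.iceberg.LakeFSCatalog` class name are recalled from the lakeFS
Iceberg integration's documented Spark setup, so verify them against that
documentation before relying on them.

```python
# Illustrative sketch only: assumes the Iceberg and lakeFS Iceberg jars are on
# the Spark classpath, and that the configuration keys below match the lakeFS
# Iceberg integration docs.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Register an Iceberg Spark catalog named "lakefs", backed by the
    # HadoopCatalog-style lakeFS wrapper.
    .config("spark.sql.catalog.lakefs", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lakefs.catalog-impl", "io.lakefs.iceberg.LakeFSCatalog")
    .config("spark.sql.catalog.lakefs.warehouse", "lakefs://example-repo")
    .config("spark.sql.catalog.lakefs.cache-enabled", "false")
    .getOrCreate()
)

# The table is "defined" only by creating it through Spark; the ref ("main"
# here) appears as the first namespace level, so branching the repository
# exposes the same table under the new ref.
spark.sql(
    "CREATE TABLE lakefs.main.analytics.events (id BIGINT, ts TIMESTAMP) USING iceberg"
)
spark.sql("SELECT count(*) FROM lakefs.main.analytics.events").show()
```

The point is that the definition lives entirely in Iceberg metadata under
the path Spark writes to; lakeFS itself keeps no separate record of the
table, which is why there is no true "all tables" concept.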

The principal limitations of internal definitions:

* Possibly unacceptable cost to users: If users are deeply invested in
  existing catalogs they may be reluctant or unable to transition to lakeFS
  catalogs. Again, wrappers might mitigate this.
* Risk of fragmentation: There may be different base catalog interfaces for
  different use-cases. For instance, the Iceberg catalog implementation is
  a particular _kind_ of Spark catalog, and will not be suitable for
  straight Delta Lake tables. This fragmentation is _considerably less_
  than the fragmentation expected for external definitions, of course.
* Becoming opinionated: lakeFS is so far unopinionated as regards catalogs;
  it defines none and has difficulty integrating with most. As soon as we
  define catalogs we run the risk of becoming opinionated in ways that our
  users might prefer to avoid.

One way to decrease this risk is to implement lakeFS table definitions as a
separate lakeFS layer, and catalogs as yet other layers. Users must still
be able to use lakeFS without these extras.

#### Possible future features

Support Iceberg catalogs for multiple implementations: general Spark and
Hadoop, the non-Java ecosystem, etc.

Support translation from existing (external) catalogs.

Support repository / ref scanning to define useful draft table definitions.

## Questions for our design partners

The key questions to answer are:

* _How many tables have users already defined?_ (Affects whether we need to
  support external tables.)

  Follow-on:
  - _Which existing catalogs do users have?_
  - _Can we automatically derive **internal** definitions from **external**
    definitions for these catalogs?_

* _How often do users define new tables?_ (Affects going to implicit or
  explicit internal/external definitions.)

* _How much user effort is required to define a new table?_ (Affects going
  to implicit or explicit internal/external definitions.)

  Follow-on:
  - _How far can we automate this process?_ (Affects going to external or
    internal explicit definitions.)
  - _Do users already have tools that automate this process?_

* _How much do users need external catalogs?_ (Affects whether we need to
  support external tables.)
* _How many external catalog types do our users use?_ (Affects whether we
  need to support external tables; many different types will dilute our
  effort.)

## Preferred route

This route requires support from design partners on the "Questions for our
design partners" above. I propose it as a sample guide for how we might
proceed if all questions resolve in a certain way.

* Define the current lakeFS Delta Sharing tables schema as the lakeFS
  tables schema.
* Possible extensions:
  - If there is demand for an efficient `hive` format, add support for
    storing index metadata.
* Alternatives for Spark catalog support:
  - Write a Spark catalog implementation.
  - Write translators from lakeFS tables to some popular external catalogs,
    for users who need to continue to use those.
* Write or adapt existing helper tools to help draft lakeFS table schemas
  from object listings (a sketch follows below).


[delta-sharing-schema-schema]: https://github.com/delta-io/delta-sharing/blob/main/PROTOCOL.md#schema-object
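
As an illustration of the last bullet, a drafting helper could start out as
simple as the sketch below. It is hypothetical (no such tool exists in
lakeFS today): it guesses partition columns from `<key>=<value>` path
segments in an object listing and emits a draft definition in the
`_lakefs_tables/` YAML format described earlier, leaving the column schema
for the user to fill in.

```python
# Hypothetical helper: draft a `_lakefs_tables/` definition from an object
# listing. Not an existing lakeFS tool; field layout follows the Delta
# Sharing table format described above.
import re
import yaml  # PyYAML

PARTITION_RE = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)=[^/]+$")


def draft_table(name: str, path: str, keys: list[str]) -> str:
    """Guess partition columns from the listing and emit a draft YAML definition."""
    partitions: list[str] = []
    for key in keys:
        if not key.startswith(path):
            continue
        # Examine each directory segment below the table path, e.g. "year=2024".
        for segment in key[len(path):].strip("/").split("/")[:-1]:
            match = PARTITION_RE.match(segment)
            if match and match.group(1) not in partitions:
                partitions.append(match.group(1))
    definition = {
        "name": name,
        "type": "hive",
        "path": path.rstrip("/"),
        "partition_columns": partitions,
        # Schema fields must be filled in by hand (or by reading a sample file).
        "schema": {"type": "struct", "fields": []},
    }
    return yaml.safe_dump(definition, sort_keys=False)


if __name__ == "__main__":
    # Keys as they might be returned by a lakeFS list-objects call over the prefix.
    print(draft_table(
        "events",
        "tables/events/",
        ["tables/events/year=2024/month=05/part-00000.parquet"],
    ))
```

Feeding such a helper an object listing from a branch would produce a
starting point that the user reviews, completes, and commits under
`_lakefs_tables/`.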