## Existing

We have these forms of table definitions.  I order them roughly by
chronological order of initial implementation, adjusted for mutual
dependencies.

### `lakectl metastore`

This command supports *Hive* and *Glue* metastores.

A Hive metastore is stored off lakeFS on a Hive Metastore Server (HMS).
`lakectl` allows updating the table stored on the HMS to match branch
states, and can drive HMS merges and copies.  All metadata is stored on the
HMS; the `lakectl metastore` command calls out to an HMS URI that is
supplied by configuration or by flags.
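
For example, making a branch's state visible as a new HMS table might look
roughly like this (a sketch based on the documented `lakectl metastore copy`
command; exact flags vary by version):

```
lakectl metastore copy \
  --from-schema default --from-table my_table \
  --to-schema default --to-table my_table_feature1 \
  --to-branch feature1
```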

A Glue or similar metastore is also stored off lakeFS.  lakeFS exports
"symlink" objects that give external engines direct access to the
underlying objects on S3.  All important table metadata is still stored on
an external Glue catalog.

### Delta Lake and Delta-diff

Given a path, Delta describes its own schema, so no further catalog is
necessary.  The current lakeFS Delta Lake implementation is unaware of the
list of tables stored in the repository.

Delta diff has no explicit concept of tables because it needs none.  The
web UI activates Delta diff (only) when a folder contains a `_delta_log/`
subfolder.

It would be possible to list all folders holding Delta Lake tables by
listing with the delimiter `_delta_log/`, as sketched below.  (A simple API
change could add a "prefixes-only" mode to the lakeFS list-objects API, if
needed.)

### Iceberg Catalog

lakeFS Iceberg support is a catalog that locates Iceberg files.  This
allows it to translate the schema component of a table identifier to a
ref.  The lakeFS Iceberg catalog is an Iceberg HadoopCatalog wrapper: it
supports access through paths.  In a sense, it does not support a true
catalog concept of "all tables".

Currently only HadoopCatalog-like behavior is supported, but it is not
clear what _other_ catalogs would work.
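
For illustration, Spark wiring for this catalog looks roughly like the
following (class and property names as in the lakeFS Iceberg integration at
the time of writing; treat them as indicative):

```
spark.sql.catalog.lakefs=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.lakefs.catalog-impl=io.lakefs.iceberg.LakeFSCatalog
spark.sql.catalog.lakefs.warehouse=lakefs://example-repo
spark.sql.catalog.lakefs.cache-enabled=false
```

A table is then addressed through its ref, e.g. `SELECT * FROM
lakefs.main.db1.table1`, which is exactly the path-style access described
above.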

### Delta Sharing server

The Delta Sharing server is available on lakeFS Cloud.  It is used to
connect lakeFS to Databricks Unity Catalog, along with similar support in
other products.  The Delta Sharing protocol requires each Delta Sharing
server to be a catalog for the tables that it manages.  So lakeFS Delta
Sharing defines tables.

The lakeFS Delta Sharing server defines tables by using YAML files under
the `_lakefs_tables/` prefix.  Each table is defined by its own YAML file.
The server translates the schema component of a table identifier to a ref.
Because these definitions specify paths through object keys, they support
lakeFS ref semantics exactly: branching out, tagging, and ref expressions
all work, and even a successful merge defines the same table across
schemas.

lakeFS Delta Sharing supports two table types (sample definition files
appear after this list):

* **Delta Lake**: These describe their own schema.  A `delta` table needs no
  information beyond a table name and its path.
* **"Hive"**: These are partitioned directories of Parquet or possibly other
  files.  Despite the name, no Hive metastore is involved; rather the lakeFS
  Delta Sharing server defines similar semantics.  A `hive` table needs a
  name and a path, but also a list of partition columns and a schema for all
  fields.  Table schemas are defined using
  [a subset of Spark SQL's JSON schema representation][delta-sharing-schema-schema].
  Currently Hive-style tables do not support column-range metadata.
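
For illustration, definition files under `_lakefs_tables/` might look
roughly like this (a sketch following the lakeFS table-spec conventions;
names and paths are placeholders):

```yaml
# _lakefs_tables/events.yaml -- a `delta` table: a name and a path suffice.
name: events
type: delta
path: tables/events/
```

```yaml
# _lakefs_tables/animals.yaml -- a `hive`-style table: partition columns
# and a full schema (a subset of Spark SQL's JSON representation) are
# required.
name: animals
type: hive
path: tables/animals/
partition_columns: [year]
schema:
  type: struct
  fields:
    - name: year
      type: integer
      nullable: false
      metadata: {}
    - name: name
      type: string
      nullable: true
      metadata: {}
```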

## Analysis

We currently support these three ways of defining tables:

* Implicitly, by discovering that a given path holds a table.  This is our
  current Delta Lake and Delta-diff support.
* Externally, by deferring to an external catalog.  This is our metastore
  support (HMS is the catalog).  Our Iceberg support _might_ become partly
  external, if we added support to defer table definitions to some other
  catalog.
* Internally, by effectively serving the catalog.  This is our lakeFS Delta
  Sharing (the Delta Sharing protocol does not call this a "catalog").  Our
  Iceberg support is also currently internal.

### Implicit definitions

Implicit definitions give users the _easiest_ path.  However they have
multiple limitations:

* They require that the table be discoverable by examining only object
  metadata, so they are supported only by "modern" table formats: Iceberg
  and Delta.  Formats such as prefixes containing Parquet files _cannot_
  support all operations:
  - It is not possible to report table metadata efficiently.
  - It may not be possible to list all tables.
  - It will _not_ be possible to support named tables; all tables must be
    defined by paths.

Using implicit definitions necessarily leaks through the table abstraction.
There is no useful general case here.  Users will find it easy to request
functionality that uses implicit definitions.

#### Possible future features

Define an implicit "Parquet folder" as a folder that holds many `.parquet`
files, or a folder that holds many `<key>=<value>` subfolders with some
`.parquet` files under them.  Then we could add to the GUI a display of a
sample DuckDB query when looking at a Parquet folder:
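
For example (a hypothetical display; repository, branch, and path are
placeholders, and DuckDB is assumed to read through the lakeFS S3 gateway):

```sql
-- Sample query for exploring a Parquet folder on branch `main`:
SELECT *
FROM read_parquet('s3://example-repo/main/sales/year=2024/*.parquet')
LIMIT 100;
```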

### External definitions

External definitions in catalogs may be the easiest way to give _some_
users a fully-functional experience, for _some_ catalogs.  Their principal
limitations:

* Risk of fragmentation: If there are many popular external catalogs we
  will need to support many of them.  Different catalogs support different
  use-cases, so we should expect to have to support multiple catalogs.
* Limitations due to the external definition: Because existing catalog
  implementations were designed without lakeFS in mind, we will not be able
  to add lakeFS features to all of them.  For instance the HMS protocol is
  Thrift-based and does not support refs-as-schemas.

Iceberg support might become an external definition once it supports
deferring to an external catalog.  It does not yet support wrapping an
external catalog because of these risks.

#### Possible future features

Support more catalog types.  Each catalog adds implementation effort.

Support catalog wrappers that work with multiple external catalogs.  For
instance, wrapping additional catalogs under the Iceberg catalog, making it
an external definition.

### Internal definition

An internal definition should allow us to support _all_ lakeFS features
that systems can handle.  Because it reduces external dependencies, it is
probably the _fastest_ way for us to implement "table" features on top of
lakeFS -- we have seen this in the lakeFS Delta Sharing server.  The
Iceberg catalog wrapper is currently internal, although it does not even
support defining tables except via Spark definitions.

Its principal limitations:

* Possibly unacceptable cost to users: If users are deeply invested in
  existing catalogs they may be reluctant or unable to transition to lakeFS
  catalogs.  Again, wrappers might mitigate this.
* Risk of fragmentation: There may be different base catalog interfaces for
  different use-cases.  For instance, the Iceberg catalog implementation is
  a particular _kind_ of Spark catalog, and will not be suitable for
  straight Delta Lake tables.  This fragmentation is _considerably less_
  than the fragmentation expected for external definitions, of course.
* Becoming opinionated: lakeFS is so far unopinionated as regards catalogs:
  it defines none and has difficulty integrating with most.  As soon as we
  define catalogs we run the risk of becoming opinionated in ways that our
  users might prefer to avoid.

  One way to decrease this risk is to implement lakeFS table definitions as
  a separate lakeFS layer, and catalogs as yet other layers.  Users must
  still be able to use lakeFS without these extras.

#### Possible future features

Support Iceberg catalogs for multiple implementations: general Spark and
Hadoop, the non-Java ecosystem, etc.

Support translation from existing (external) catalogs.

Support repository / ref scanning to define useful draft table definitions.

## Questions for our design partners

The key questions to answer are:

* _How many tables have users already defined?_  (Affects whether we need
  to support external tables.)

  Follow-on:
  - _Which existing catalogs do users have?_
  - _Can we automatically derive **internal** definitions from **external**
    definitions for these catalogs?_

* _How often do users define new tables?_  (Affects going to implicit or
  explicit internal/external definitions.)

* _How much user effort is required to define a new table?_  (Affects going
  to implicit or explicit internal/external definitions.)

  Follow-on:
  - _How far can we automate this process?_  (Affects going to external or
    internal explicit definitions.)
  - _Do users already have tools that automate this process?_

* _How much do users need external catalogs?_  (Affects whether we need to
  support external tables.)
* _How many external catalog types do our users use?_  (Affects whether we
  need to support external tables; many different types will dilute our
  effort.)

## Preferred route

This route requires input from our design partners on the "Questions for
our design partners" above.  I propose it as a sample guide for how we
might proceed if all questions resolve in a certain way.

* Define the current lakeFS Delta Sharing tables schema as the lakeFS
  tables schema.
* Possible extensions:
  - If there is demand for an efficient `hive` format, add support for
    storing index metadata.
* Alternatives for Spark catalog support:
  - Write a Spark catalog implementation.
  - Write translators from lakeFS tables to some popular external catalogs,
    for users who need to continue to use those.
* Write or adapt existing helper tools to help draft lakeFS table schemas
  from object listings.

[delta-sharing-schema-schema]: https://github.com/delta-io/delta-sharing/blob/main/PROTOCOL.md#schema-object