---
title: Glue / Hive metastore
description: This section explains how to query data from lakeFS branches in services backed by Glue/Hive Metastore.
parent: Integrations
redirect_from: /using/glue_hive_metastore.html
---

# Using lakeFS with the Glue/Hive Metastore

{: .warning }
**Deprecated Feature:** Having heard the feedback from the community, we are planning to replace the manual steps below with an automated process.
You can read more about it [here]({% link howto/catalog_exports.md %}).

{% include toc_2-3.html %}

## About Glue / Hive Metastore

This section explains how Glue/Hive Metastore works with lakeFS.

Glue and Hive Metastore store metadata related to Hive and other services (such as Spark and Trino).
They contain metadata such as a table's location, its columns, its partitions, and more.

### Without lakeFS
To query the table `my_table`, Spark will:
* Request the metadata from the Hive Metastore (steps 1,2),
* Use the location from the metadata to access the data in S3 (steps 3,4).
![metastore with S3]({{ site.baseurl }}/assets/img/metastore-S3.svg)


### With lakeFS
When using lakeFS, the flow stays exactly the same. Note that the location of the table `my_table` now contains the branch name: `s3://example/main/path/to/table`.
![metastore with lakeFS]({{ site.baseurl }}/assets/img/metastore-lakefs.svg)
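
To check which path a query engine will read from, you can inspect the table metadata. The following is a minimal sketch, assuming a Hive-compatible SQL client and the example names used above:

```sql
-- The Location field in the output should show the branch-qualified path,
-- e.g. s3://example/main/path/to/table (or its s3a:// equivalent).
DESCRIBE FORMATTED my_table;
```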



## Managing Tables With lakeFS Branches
### Motivation
When creating a table in the Glue/Hive Metastore (using a client such as Spark, Hive, or Presto), we specify the table location.
Consider the table `my_table` that was created with the location `s3://example/main/path/to/table`.
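
For illustration only, `my_table` might have been created with a statement like this (a hypothetical sketch: the columns are made up, and only the `LOCATION` matters for this discussion):

```sql
-- Hypothetical DDL: the LOCATION points at the `main` branch of the
-- `example` repository (the s3a:// form of the path referenced above).
CREATE EXTERNAL TABLE `my_table`(
    `id` int,
    `value` string)
STORED AS ORC
LOCATION
    's3a://example/main/path/to/table';
```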

Suppose you created a new branch called `DEV` with `main` as the source branch.
The data from `s3://example/main/path/to/table` is now accessible in `s3://example/DEV/path/to/table`.
However, the table metadata is not managed by lakeFS, so no table is pointing to `s3://example/DEV/path/to/table` yet.

To address this, lakeFS introduces `lakectl metastore` commands. The case above can be handled using the copy command: it creates a copy of `my_table` with data located in `s3://example/DEV/path/to/table`. Note that this is a fast, metadata-only operation.


### Configurations
The `lakectl metastore` commands can run on Glue or Hive metastore.
Add the following to the lakectl configuration file (by default `~/.lakectl.yaml`):

#### Hive

``` yaml
metastore:
  type: hive
  hive:
    uri: hive-metastore:9083
```

#### Glue

``` yaml
metastore:
  type: glue
  glue:
    catalog_id: 123456789012
    region: us-east-1
    profile: default # optional, implies using a credentials file
    credentials:
      access_key_id: AKIAIOSFODNN7EXAMPLE
      secret_access_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```

**Note:** It's recommended to set type and catalog-id/metastore-uri in the lakectl configuration file.
{: .note .pb-3 }

### Suggested Model

For simplicity, we recommend creating a schema for each branch. That way, you can use the same table name across different schemas.

For example, after creating the branch `example_branch`, also create a schema named `example_branch`.
For a table named `my_table` under the schema `main`, create a table with the same name under the schema `example_branch`. You now have two `my_table` tables: one in the `main` schema and one in the branch schema.

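Depending on your query engine and setup, you may need to create the per-branch schema yourself. The following is a minimal sketch, assuming a Hive-compatible SQL client and the branch name used above:

```sql
-- Create a schema that mirrors the new branch, so that the per-branch
-- table copies can live under it (e.g. example_branch.my_table).
CREATE SCHEMA IF NOT EXISTS example_branch;
```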

### Commands

Metastore tools support three commands: `copy`, `diff`, and `create-symlink`.
`copy` and `diff` work on both Glue and Hive.
`create-symlink` works only on Glue.


**Note:** If `to-schema` or `to-table` are not specified, the destination branch and source table names will be used, as per the [suggested model](#suggested-model).
{: .note .pb-3 }

**Note:** Metastore commands can only run on tables whose data is located in lakeFS. They should not be used with tables stored outside of lakeFS.
{: .note .pb-3 }


#### Copy

The `copy` command creates a copy of a table pointing to the specified branch.
If the destination table already exists, the command merges the changes into it.

Example:

Suppose we created the table `inventory` on branch `main`, in schema `default`:
```sql
CREATE EXTERNAL TABLE `inventory`(
        `inv_item_sk` int,
        `inv_warehouse_sk` int,
        `inv_quantity_on_hand` int)
    PARTITIONED BY (
        `inv_date_sk` int) STORED AS ORC
    LOCATION
        's3a://my_repo/main/path/to/table';
```

We create a new lakeFS branch `example_branch`:

```shell
lakectl branch create lakefs://my_repo/example_branch --source lakefs://my_repo/main
```

The data from `s3://my_repo/main/path/to/table` is now accessible in `s3://my_repo/example_branch/path/to/table`.
To query the data in `s3://my_repo/example_branch/path/to/table`, create a copy of the table `inventory` in the schema `example_branch`, pointing to the new branch:

```bash
lakectl metastore copy --from-schema default --from-table inventory --to-schema example_branch --to-table inventory --to-branch example_branch
```

After running this command, query the table `example_branch.inventory` to get the data from `s3://my_repo/example_branch/path/to/table`.
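
For example, a minimal query sketch, assuming a Hive- or Spark-SQL client:

```sql
-- Both tables share the same name; only the schema, and therefore the
-- branch the data is read from, differs.
SELECT * FROM example_branch.inventory LIMIT 10;
```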

#### Copy Partition

After adding a partition to the branch table, you may want to copy the partition to the main table.
For example, for the new partition `2020-08-01`, run the following to copy the partition to the main table:

```bash
lakectl metastore copy --type hive --from-schema example_branch --from-table inventory --to-schema default --to-table inventory --to-branch main -p 2020-08-01
```

For a table partitioned by more than one column, specify the partition flag for every column. For example, for the partition `(year='2020',month='08',day='01')`:

```bash
lakectl metastore copy --from-schema example_branch --from-table inventory --to-schema default --to-branch main -p 2020 -p 08 -p 01
```
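
To confirm the partitions were copied, you can list them on the destination table. A minimal sketch, assuming a Hive-compatible SQL client:

```sql
-- Partitions copied from the branch table should now appear on main's table.
SHOW PARTITIONS default.inventory;
```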

#### Diff

The `diff` command provides a two-way diff between two tables.
It shows added (`+`), removed (`-`), and changed (`~`) partitions and columns.


Example:

Suppose you made some changes to the copied table `inventory` in schema `example_branch`, and now want to view the changes before merging back into `inventory` in schema `default`.

Hive:
```bash
lakectl metastore diff --type hive --address thrift://hive-metastore:9083 --from-schema example_branch --from-table inventory --to-schema default --to-table inventory
```

The output will look like this:

```
Columns are identical
Partitions
- 2020-07-04
+ 2020-07-05
+ 2020-07-06
~ 2020-07-08
```