github.com/treeverse/lakefs@v1.24.1-0.20240520134607-95648127bfb0/docs/integrations/glue_hive_metastore.md (about) 1 --- 2 title: Glue / Hive metastore 3 description: This section explains how to query data from lakeFS branches in services backed by Glue/Hive Metastore. 4 parent: Integrations 5 redirect_from: /using/glue_hive_metastore.html 6 --- 7 8 # Using lakeFS with the Glue/Hive Metastore 9 10 {: .warning } 11 **Deprecated Feature:** Having heard the feedback from the community, we are planning to replace the below manual steps with an automated process. 12 You can read more about it [here]({% link howto/catalog_exports.md %}). 13 14 {% include toc_2-3.html %} 15 16 ## About Glue / Hive Metastore 17 18 This part explains about how Glue/Hive Metastore work with lakeFS. 19 20 Glue and Hive Metastore store metadata related to Hive and other services (such as Spark and Trino). 21 They contain metadata such as the location of the table, information about columns, partitions and many more. 22 23 ### Without lakeFS 24 To query the table `my_table`, Spark will: 25 * Request the metadata from Hive metastore (steps 1,2), 26 * Use the location from the metadata to access the data in S3 (steps 3,4). 27  28 29 30 ### With lakeFS 31 When using lakeFS, the flow stays exactly the same. Note that the location of the table `my_table` now contains the branch `s3://example/main/path/to/table` 32  33 34 35 36 ## Managing Tables With lakeFS Branches 37 ### Motivation 38 When creating a table in Glue/Hive metastore (using a client such as Spark, Hive, Presto), we specify the table location. 39 Consider the table `my_table` that was created with the location `s3://example/main/path/to/table`. 40 41 Suppose you created a new branch called `DEV` with `main` as the source branch. 42 The data from `s3://example/main/path/to/table` is now accessible in `s3://example/DEV/path/to/table`. 43 The metadata is not managed in lakeFS, meaning you don't have any table pointing to `s3://example/DEV/path/to/table`. 44 45 To address this, lakeFS introduces `lakectl metastore` commands. The case above can be handled using the copy command: it creates a copy of `my_table` with data located in `s3://example/DEV/path/to/table`. Note that this is a fast, metadata-only operation. 46 47 48 ### Configurations 49 The `lakectl metastore` commands can run on Glue or Hive metastore. 50 Add the following to the lakectl configuration file (by default `~/.lakectl.yaml`): 51 52 #### Hive 53 54 ``` yaml 55 metastore: 56 type: hive 57 hive: 58 uri: hive-metastore:9083 59 ``` 60 61 #### Glue 62 63 ``` yaml 64 metastore: 65 type: glue 66 glue: 67 catalog_id: 123456789012 68 region: us-east-1 69 profile: default # optional, implies using a credentials file 70 credentials: 71 access_key_id: AKIAIOSFODNN7EXAMPLE 72 secret_access_key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY 73 ``` 74 75 **Note:** It's recommended to set type and catalog-id/metastore-uri in the lakectl configuration file. 76 {: .note .pb-3 } 77 78 ### Suggested Model 79 80 For simplicity, we recommend creating a schema for each branch. That way, you can use the same table name across different schemas. 81 82 For example: 83 after creating branch `example_branch`, also create a schema named `example_branch`. 84 For a table named `my_table` under the schema `main`, create a new table under the same name and under the schema `example_branch`. You now have two `my_table`, one in the main schema and one in the branch schema. 85 86 87 ### Commands 88 89 Metastore tools support three commands: `copy`, `diff`, and `create-symlink`. 90 copy and diff can work both on Glue and on Hive. 91 create-symlink works only on Glue. 92 93 94 **Note:** If `to-schema` or `to-table` are not specified, the destination branch and source table names will be used as per the [suggested model](#suggested-model). 95 {: .note .pb-3 } 96 97 **Note:** Metastore commands can only run on tables located in lakeFS. You should not use tables that aren't located in lakeFS. 98 {: .note .pb-3 } 99 100 #### Copy 101 102 The `copy` command creates a copy of a table pointing to the defined branch. 103 In case the destination table already exists, the command will only merge the changes. 104 105 Example: 106 107 Suppose we created the table `inventory` on branch `main` on schema `default`. 108 ```sql 109 CREATE EXTERNAL TABLE `inventory`( 110 `inv_item_sk` int, 111 `inv_warehouse_sk` int, 112 `inv_quantity_on_hand` int) 113 PARTITIONED BY ( 114 `inv_date_sk` int) STORED AS ORC 115 LOCATION 116 's3a://my_repo/main/path/to/table'; 117 ``` 118 119 We create a new lakeFS branch `example_branch`: 120 121 ```shell 122 lakectl branch create lakefs://my_repo/example_branch --source lakefs://my_repo/main 123 ``` 124 125 The data from `s3://my_repo/main/path/to/table` is now accessible in `s3://my_repo/DEV/path/to/table`. 126 To query the data in `s3://my_repo/DEV/path/to/table`, you would like to create a copy of the table `inventory` in schema `example_branch` pointing to the new branch. 127 128 ```bash 129 lakectl metastore copy --from-schema default --from-table inventory --to-schema example_branch --to-table inventory --to-branch example_branch 130 ``` 131 132 After running this command, query the table `example_branch.inventory` to get the data from `s3://my_repo/DEV/path/to/table` 133 134 #### Copy Partition 135 136 After adding a partition to the branch table, you may want to copy the partition to the main table. 137 For example, for the new partition `2020-08-01`, run the following to copy the partition to the main table: 138 139 ```bash 140 lakectl metastore copy --type hive --from-schema example_branch --from-table inventory --to-schema default --to-table inventory --to-branch main -p 2020-08-01 141 ``` 142 143 For a table partitioned by more than one column, specify the partition flag for every column. For example, for the partition `(year='2020',month='08',day='01')`: 144 145 ```bash 146 lakectl metastore copy --from-schema example_branch --from-table branch_inventory --to-schema default --to-branch main -p 2020 -p 08 -p 01 147 ``` 148 149 #### Diff 150 151 Provides a two-way diff between two tables. 152 Shows added`+` , removed`-` and changed`~` partitions and columns. 153 154 155 Example: 156 157 Suppose you made some changes on the copied table `inventory` on schema `example_branch` and now want to view the changes before merging back to `inventory` on schema `default`. 158 159 Hive: 160 ```bash 161 lakectl metastore diff --type hive --address thrift://hive-metastore:9083 --from-schema example_branch --from-table branch --to-schema default --to-table inventory 162 ``` 163 164 The output will look like this: 165 166 ``` 167 Columns are identical 168 Partitions 169 - 2020-07-04 170 + 2020-07-05 171 + 2020-07-06 172 ~ 2020-07-08 173 ```