
---
title: Presto / Trino
description: This section explains how you can start using lakeFS with Presto and Trino, open-source distributed SQL query engines.
parent: Integrations
redirect_from:
    - /integrations/presto.html
    - /using/presto.html
---

# Using lakeFS with Presto/Trino

[Presto](https://prestodb.io){:target="_blank"} and [Trino](https://trino.io){:target="_blank"} are distributed SQL query engines designed to query large data sets distributed over one or more heterogeneous data sources.

Querying data in lakeFS from Presto/Trino is similar to querying data in S3 from Presto/Trino. It is done using the [Presto Hive connector](https://prestodb.io/docs/current/connector/hive.html){:target="_blank"} or [Trino Hive connector](https://trino.io/docs/current/connector/hive.html){:target="_blank"}.

{% include toc_2-3.html %}

{: .note-title}
> Credentials
>
> In the following examples, we set AWS credentials at runtime for clarity. In production, these properties should be set using one of Hadoop's standard ways of [Authenticating with S3](https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#Authenticating_with_S3){:target="_blank"}.

## Configuration

### Configure the Hive connector

Create `/etc/catalog/hive.properties` with the following contents to mount the `hive-hadoop2` connector as the Hive catalog, replacing `example.net:9083` with the correct host and port for your Hive Metastore Thrift service:
```properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://example.net:9083
```

Add the lakeFS configuration to `/etc/catalog/hive.properties` using the corresponding S3 configuration properties:
```properties
hive.s3.aws-access-key=AKIAIOSFODNN7EXAMPLE
hive.s3.aws-secret-key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
hive.s3.endpoint=https://lakefs.example.com
hive.s3.path-style-access=true
```
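
After restarting Presto/Trino so the catalog configuration is picked up, a quick sanity check can confirm that the catalog reaches the Hive Metastore. A minimal example, assuming the catalog file above is mounted as `hive`:
```sql
-- List the schemas visible through the newly configured catalog
SHOW SCHEMAS FROM hive;
```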

### Configure Hive

Presto/Trino uses the Hive Metastore Service (HMS), or a compatible implementation such as the AWS Glue Data Catalog, to write data to S3.
If you are using the Hive Metastore, you will need to configure Hive as well.

Add the following to the `hive-site.xml` configuration:
```xml
<configuration>
    ...
    <property>
        <name>fs.s3a.access.key</name>
        <value>AKIAIOSFODNN7EXAMPLE</value>
    </property>
    <property>
        <name>fs.s3a.secret.key</name>
        <value>wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY</value>
    </property>
    <property>
        <name>fs.s3a.endpoint</name>
        <value>https://lakefs.example.com</value>
    </property>
    <property>
        <name>fs.s3a.path.style.access</name>
        <value>true</value>
    </property>
</configuration>
```

## Examples

Here are some examples based on the [Presto Hive connector examples](https://prestodb.io/docs/current/connector/hive.html#examples){:target="_blank"} and the [Trino Hive connector examples](https://trino.io/docs/current/connector/hive.html#examples){:target="_blank"}.

### Example with schema

Create a new schema named `main` that will store tables in a lakeFS repository named `example`, branch `main`:
```sql
CREATE SCHEMA main
WITH (location = 's3a://example/main')
```
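
Because the schema location points at a repository and branch, additional branches can be exposed as separate schemas in the same way. For example, assuming a branch named `example_branch` exists in the `example` repository (this schema is also used by the table-copy example below):
```sql
-- Schema backed by the example_branch branch of the example repository
CREATE SCHEMA example_branch
WITH (location = 's3a://example/example_branch')
```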

Create a new Hive table named `page_views` in the `main` schema, stored using the ORC file format,
partitioned by date and country, and bucketed by user into `50` buckets (note that Hive requires the partition columns to be the last columns in the table):
```sql
CREATE TABLE main.page_views (
  view_time timestamp,
  user_id bigint,
  page_url varchar,
  ds date,
  country varchar
)
WITH (
  format = 'ORC',
  partitioned_by = ARRAY['ds', 'country'],
  bucketed_by = ARRAY['user_id'],
  bucket_count = 50
)
```
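
Once the table exists, reads and writes go through the catalog as usual, and the data is stored under the branch backing the `main` schema. A brief illustration with made-up values:
```sql
-- Write a sample row (values are illustrative only)
INSERT INTO main.page_views
VALUES (TIMESTAMP '2024-03-01 10:00:00', 1, 'https://example.com/home', DATE '2024-03-01', 'US');

-- Query a single partition
SELECT country, count(*) AS views
FROM main.page_views
WHERE ds = DATE '2024-03-01'
GROUP BY country;
```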

### Example with external table

Create an external Hive table named `request_logs` that points at existing data in lakeFS:

```sql
CREATE TABLE main.request_logs (
  request_time timestamp,
  url varchar,
  ip varchar,
  user_agent varchar
)
WITH (
  format = 'TEXTFILE',
  external_location = 's3a://example/main/data/logs/'
)
```
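
The external table can then be queried like any other table; Presto/Trino reads the files directly from the branch path in lakeFS. For example, a hypothetical aggregation over the logs:
```sql
-- Top requested URLs in the external request_logs table
SELECT url, count(*) AS hits
FROM main.request_logs
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```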

### Example of copying a table with [metastore tools](glue_hive_metastore.md)

{: .warning }
**Deprecated Feature:** Having heard the feedback from the community, we are planning to replace the below manual steps with an automated process.
You can read more about it [here](https://github.com/treeverse/lakeFS/issues/6461).

Copy the created table `page_views` on schema `main` to schema `example_branch` with location `s3a://example/example_branch/page_views/`:
```shell
lakectl metastore copy --from-schema main --from-table page_views --to-branch example_branch
```
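
After the copy completes, the table is registered under the target schema, so the branch's copy can be queried independently of `main`. For example, assuming the `example_branch` schema shown earlier exists:
```sql
-- Query the copied table on the example_branch schema
SELECT count(*) FROM example_branch.page_views;
```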