github.com/treeverse/lakefs@v1.24.1-0.20240520134607-95648127bfb0/docs/integrations/athena.md (about)

---
title: Amazon Athena
description: This section shows how you can start querying data from lakeFS using Amazon Athena.
parent: Integrations
redirect_from: /using/athena.html
---
     7  
# Using lakeFS with Amazon Athena

{: .warning }
**Deprecated Feature:** Having heard the feedback from the community, we are planning to replace the manual steps below with an automated process.
You can read more about it [here](https://github.com/treeverse/lakeFS/issues/6461).

[Amazon Athena](https://aws.amazon.com/athena/) is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL.
{:.pb-5 }
    16  
Amazon Athena works directly on top of S3 and can't access lakeFS. Tables created using Athena aren't readable by lakeFS.
However, tables stored in lakeFS (that were created with [Glue/Hive](glue_hive_metastore.md)) can be queried by Athena.

To support querying data from lakeFS with Amazon Athena, we will use `create-symlink`, one of the [metastore commands](glue_hive_metastore.md) in [lakectl]({% link reference/cli.md %}).
`create-symlink` receives a source table, a destination table, and the table location, and performs two actions:
1. It creates partitioned directories with symlink files in the underlying S3 bucket.
1. It creates a table in the Glue catalog with the symlink format type, whose location points to the created symlinks.
    24  
**Note**
The `.lakectl.yaml` file should be configured with the proper Hive/Glue credentials. See the [configuration section](glue_hive_metastore.md#configurations) for more information.
{: .note }
    28  
`create-symlink` receives a table in Glue or Hive pointing to lakeFS and creates a copy of the table in Glue.
The table data will use the `SymlinkTextInputFormat`, which will point to the lakeFS repository's storage namespace. You will be able to query your data with Athena without copying any data. However, the symlink table will only show the data that existed at the time of
the copy. If the table changes in lakeFS, you need to run `create-symlink` again for your changes to be reflected in Athena.
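Conceptually, the symlink layout that `create-symlink` produces can be sketched as follows. This is a hypothetical Python sketch, not the actual lakectl implementation: given a mapping from a Hive-style partition path to the physical S3 addresses of its objects, it writes one `symlink.txt` per partition directory.

```python
import os

def write_symlink_files(base_dir, partitions):
    # Hypothetical sketch of the layout `create-symlink` produces; not the
    # actual lakectl implementation. `partitions` maps a Hive-style partition
    # path (e.g. "year=2021/month=12") to the physical S3 addresses of the
    # objects in that partition.
    for partition, addresses in partitions.items():
        partition_dir = os.path.join(base_dir, partition)
        os.makedirs(partition_dir, exist_ok=True)
        # One symlink.txt per partition; each line is one data object.
        with open(os.path.join(partition_dir, "symlink.txt"), "w") as f:
            f.write("\n".join(addresses) + "\n")

write_symlink_files("symlinks/example/main/my_table", {
    "year=2021/month=12": [
        "s3://my-bucket/my-repo-prefix/5bdc62da516944b49889770d98274227",
        "s3://my-bucket/my-repo-prefix/64262fbf3d6347a79ead641d2b2baee6",
    ],
})
```

A reader such as `SymlinkTextInputFormat` then treats each line of a `symlink.txt` as the location of one data file, which is how Athena reaches the lakeFS-managed objects without any data being copied.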
    32  
### Example:

Let's assume that some time ago, we created a Hive table `my_table`, stored in the lakeFS repo `example` under branch `main`, using the command:
```sql
CREATE EXTERNAL TABLE `my_table`(
   `id` bigint,
   `key` string
)
PARTITIONED BY (`year` int, `month` int)
STORED AS PARQUET
LOCATION 's3a://example/main/my_table';
```
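For orientation, a partitioned Hive table lays its data out in `key=value` directories derived from the `PARTITIONED BY` columns. A small Python sketch of that path construction (the helper name is illustrative):

```python
def partition_path(table_location, **partition_values):
    # Build the Hive-style partition directory for a set of partition values,
    # e.g. year=2021/month=12 appended to the table's location.
    parts = [f"{key}={value}" for key, value in partition_values.items()]
    return "/".join([table_location.rstrip("/")] + parts)

path = partition_path("s3a://example/main/my_table", year=2021, month=12)
# Hive stores this partition's files under .../my_table/year=2021/month=12
```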
    45  
The repository `example` has the storage namespace `s3://my-bucket/my-repo-prefix/`.
After inserting some data into it, the object structure under `lakefs://example/main/my_table` looks as follows:

![lakefs_table.png]({{ site.baseurl }}/assets/img/lakefs_table.png)

To query that table with Athena, use the `create-symlink` command as follows:
    52  
```shell
lakectl metastore create-symlink \
  --repo example \
  --branch main \
  --path my_table \
  --from-client-type hive \
  --from-schema default \
  --from-table my_table \
  --to-schema default \
  --to-table my_table
```
    64  
The command will generate two notable outputs:

1. For each partition, the command will create a symlink file:

```shell
aws s3 ls s3://my-bucket/my-repo-prefix/my_table/ --recursive
2021-11-23 17:46:29          0 my-repo-prefix/my_table/symlinks/example/main/my_table/year=2021/month=11/symlink.txt
2021-11-23 17:46:29         60 my-repo-prefix/my_table/symlinks/example/main/my_table/year=2021/month=12/symlink.txt
2021-11-23 17:46:30         60 my-repo-prefix/my_table/symlinks/example/main/my_table/year=2022/month=1/symlink.txt
```
    75  
Example contents of a symlink file, where each line represents a single object of the specific partition:

```text
s3://my-bucket/my-repo-prefix/5bdc62da516944b49889770d98274227
s3://my-bucket/my-repo-prefix/64262fbf3d6347a79ead641d2b2baee6
s3://my-bucket/my-repo-prefix/64486c8de6484de69f12d7d26804c93e
s3://my-bucket/my-repo-prefix/b0165d5c5b13473d8a0f460eece9eb26
```
    84  
1. A Glue table pointing to the symlink directory structure:
    86  
```shell
aws glue get-table --name my_table --database-name default

{
  "Table": {
    "Name": "my_table",
    "DatabaseName": "default",
    "Owner": "anonymous",
    "CreateTime": "2021-11-23T17:46:30+02:00",
    "UpdateTime": "2021-11-23T17:46:30+02:00",
    "LastAccessTime": "1970-01-01T02:00:00+02:00",
    "Retention": 0,
    "StorageDescriptor": {
      "Columns": [
        {
          "Name": "id",
          "Type": "bigint",
          "Comment": ""
        },
        {
          "Name": "key",
          "Type": "string",
          "Comment": ""
        }
      ],
      "Location": "s3://my-bucket/my-repo-prefix/symlinks/example/main/my_table",
      "InputFormat": "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
      "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
      "Compressed": false,
      "NumberOfBuckets": -1,
      "SerdeInfo": {
        "Name": "default",
        "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
        "Parameters": {
          "serialization.format": "1"
        }
      },
      "StoredAsSubDirectories": false
    },
    "PartitionKeys": [
      {
        "Name": "year",
        "Type": "int",
        "Comment": ""
      },
      {
        "Name": "month",
        "Type": "int",
        "Comment": ""
      }
    ],
    "ViewOriginalText": "",
    "ViewExpandedText": "",
    "TableType": "EXTERNAL_TABLE",
    "Parameters": {
      "EXTERNAL": "TRUE",
      "bucketing_version": "2",
      "transient_lastDdlTime": "1637681750"
    },
    "CreatedBy": "arn:aws:iam::************:user/********",
    "IsRegisteredWithLakeFormation": false,
    "CatalogId": "*********"
  }
}
```
   152  
You can now safely use Athena to query `my_table`.
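As a sanity check after running `create-symlink`, you can confirm from the `get-table` output that the new table really is a symlink table: its `InputFormat` should be `SymlinkTextInputFormat`, and its `Location` should point at the symlink tree rather than at the data objects themselves. A small hypothetical Python check (the `is_symlink_table` helper is illustrative, not part of lakectl or the AWS CLI):

```python
import json

# Trimmed, illustrative `aws glue get-table` output.
get_table_output = json.loads("""
{
  "Table": {
    "Name": "my_table",
    "StorageDescriptor": {
      "Location": "s3://my-bucket/my-repo-prefix/symlinks/example/main/my_table",
      "InputFormat": "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat"
    }
  }
}
""")

def is_symlink_table(output):
    # A symlink table reads symlink files (SymlinkTextInputFormat) and its
    # location points at the symlink tree, not at the data objects themselves.
    sd = output["Table"]["StorageDescriptor"]
    return (sd["InputFormat"] == "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat"
            and "/symlinks/" in sd["Location"])
```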