---
title: AWS
grand_parent: How-To
parent: Install lakeFS
description: How to deploy and set up a production-suitable lakeFS environment on AWS
redirect_from:
   - /deploying-aws/index.html
   - /deploying-aws/install.html
   - /deploying-aws/db.html
   - /deploying-aws/lb_dns.html
   - /setup/storage/s3.html
   - /deploy/aws.html
next:  ["Import data into your installation", "/howto/import.html"]
---

# Deploy lakeFS on AWS

{: .tip }
> The instructions given here are for a self-managed deployment of lakeFS on AWS.
> 
> For a hosted lakeFS service with guaranteed SLAs, try [lakeFS Cloud](https://lakefs.cloud).

When you deploy lakeFS on AWS, these are the options available to you:

![](/assets/img/deploy/deploy-on-aws.excalidraw.png)

This guide walks you through the available options and how to configure them, finishing with configuring and running lakeFS itself and creating your first repository.

{% include toc.html %}

⏰ Expected deployment time: 25 min
{: .note }

## Grant lakeFS permissions to DynamoDB

By default, lakeFS will create the required DynamoDB table if it does not already exist. You'll have to give the IAM role used by lakeFS the following permissions:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ListAndDescribe",
            "Effect": "Allow",
            "Action": [
                "dynamodb:List*",
                "dynamodb:DescribeReservedCapacity*",
                "dynamodb:DescribeLimits",
                "dynamodb:DescribeTimeToLive"
            ],
            "Resource": "*"
        },
        {
            "Sid": "kvstore",
            "Effect": "Allow",
            "Action": [
                "dynamodb:BatchGet*",
                "dynamodb:DescribeTable",
                "dynamodb:Get*",
                "dynamodb:Query",
                "dynamodb:Scan",
                "dynamodb:BatchWrite*",
                "dynamodb:CreateTable",
                "dynamodb:Delete*",
                "dynamodb:Update*",
                "dynamodb:PutItem"
            ],
            "Resource": "arn:aws:dynamodb:*:*:table/kvstore"
        }
    ]
}
```
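
For example, you can attach the policy above to the lakeFS role as an inline policy with the AWS CLI. This is only a sketch: the role name, policy name, and file name below are placeholders.

```bash
# Hypothetical example: attach the DynamoDB permissions above as an inline policy.
# "lakefs-role" and "lakefs-dynamodb" are placeholder names; dynamodb-policy.json
# is the JSON document shown above, saved locally.
aws iam put-role-policy \
  --role-name lakefs-role \
  --policy-name lakefs-dynamodb \
  --policy-document file://dynamodb-policy.json
```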

💡 You can also use lakeFS with PostgreSQL instead of DynamoDB! See the [configuration reference]({% link reference/configuration.md %}) for more information.
{: .note }
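
If you go the PostgreSQL route, the `database` section of the configuration changes accordingly. A minimal sketch, assuming a reachable PostgreSQL endpoint (the connection string and encryption key below are placeholders):

```bash
# Hypothetical sketch: a minimal lakeFS config.yaml backed by PostgreSQL instead of DynamoDB.
# Replace the connection string and encryption key with your own values.
cat > config.yaml <<'EOF'
database:
  type: "postgres"
  postgres:
    connection_string: "postgres://user:password@pg.example.internal:5432/lakefs"

auth:
  encrypt:
    secret_key: "[ENCRYPTION_SECRET_KEY]"

blockstore:
  type: s3
EOF
```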

## Run the lakeFS server

<div class="tabs">
  <ul>
    <li><a href="#ec2">EC2</a></li>
    <li><a href="#eks">EKS</a></li>
  </ul>
  <div markdown="1" id="ec2">

Connect to your EC2 instance using SSH:

1. Create a `config.yaml` on your EC2 instance, with the following parameters:

   ```yaml
   ---
   database:
     type: "dynamodb"

   auth:
     encrypt:
       # replace this with a randomly-generated string. Make sure to keep it safe!
       secret_key: "[ENCRYPTION_SECRET_KEY]"

   blockstore:
     type: s3
   ```
1. [Download the binary][downloads] to run on the EC2 instance.
1. Run the `lakefs` binary on the EC2 instance:

   ```sh
   lakefs --config config.yaml run
   ```

**Note:** It's preferable to run the binary as a service using systemd or your operating system's facilities.
{: .note }
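
For example, a minimal systemd setup might look like the following sketch. The unit name, binary path, config path, and service user are assumptions; adjust them to your instance.

```bash
# Hypothetical sketch: run lakeFS as a systemd service.
# Paths, user, and unit name are assumptions - adjust to your environment.
sudo tee /etc/systemd/system/lakefs.service > /dev/null <<'EOF'
[Unit]
Description=lakeFS server
After=network-online.target

[Service]
ExecStart=/usr/local/bin/lakefs --config /etc/lakefs/config.yaml run
Restart=on-failure
User=lakefs

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now lakefs
```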

### Advanced: Deploying lakeFS behind an AWS Application Load Balancer

1. Your security groups should allow the load balancer to access the lakeFS server.
1. Create a target group with a listener for port 8000 (see the CLI sketch below).
1. Set up TLS termination using the domain names you wish to use (e.g., `lakefs.example.com` and potentially `s3.lakefs.example.com`, `*.s3.lakefs.example.com` if using [virtual-host addressing](https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html)).
1. Configure the health check to use the exposed `/_health` URL.
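
As a rough sketch of those steps with the AWS CLI (all bracketed values are placeholders for your environment):

```bash
# Hypothetical sketch: create a target group that health-checks /_health on port 8000,
# register the lakeFS instance, and forward an HTTPS listener to it.
# All bracketed values are placeholders.
aws elbv2 create-target-group \
  --name lakefs-tg \
  --protocol HTTP --port 8000 \
  --vpc-id "[VPC_ID]" \
  --health-check-path /_health

aws elbv2 register-targets \
  --target-group-arn "[TARGET_GROUP_ARN]" \
  --targets Id="[INSTANCE_ID]"

aws elbv2 create-listener \
  --load-balancer-arn "[LOAD_BALANCER_ARN]" \
  --protocol HTTPS --port 443 \
  --certificates CertificateArn="[ACM_CERTIFICATE_ARN]" \
  --default-actions Type=forward,TargetGroupArn="[TARGET_GROUP_ARN]"
```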

  </div>
  <div markdown="1" id="eks">

You can install lakeFS on Kubernetes using a [Helm chart](https://github.com/treeverse/charts/tree/master/charts/lakefs).

To install lakeFS with Helm:

1. Copy the Helm values file relevant for S3:

   ```yaml
   secrets:
      # replace this with a randomly-generated string
      authEncryptSecretKey: [ENCRYPTION_SECRET_KEY]
   lakefsConfig: |
       database:
         type: dynamodb
       blockstore:
         type: s3
   ```
1. Fill in the missing values and save the file as `conf-values.yaml`. For more configuration options, see our Helm chart [README](https://github.com/treeverse/charts/blob/master/charts/lakefs/README.md#custom-configuration){:target="_blank"}.

   The `lakefsConfig` parameter is the lakeFS configuration documented [here]({% link reference/configuration.md %}) but without sensitive information.
   Sensitive information like `databaseConnectionString` is given through separate parameters, and the chart will inject it into Kubernetes secrets.
   {: .note }

1. In the directory where you created `conf-values.yaml`, run the following commands:

   ```bash
   # Add the lakeFS repository
   helm repo add lakefs https://charts.lakefs.io
   # Deploy lakeFS
   helm install my-lakefs lakefs/lakefs -f conf-values.yaml
   ```

   *my-lakefs* is the [Helm Release](https://helm.sh/docs/intro/using_helm/#three-big-concepts) name.
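
To sanity-check the release, you can port-forward to the lakeFS deployment and query its health endpoint. This is only a sketch: the deployment name below assumes the *my-lakefs* release name and default chart naming.

```bash
# Hypothetical check: confirm the pods are running and lakeFS answers on /_health.
# "my-lakefs" is the release name from the helm install command above.
kubectl get pods
kubectl port-forward deploy/my-lakefs 8000:8000 &
curl -i http://localhost:8000/_health
```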

⚠️ Make sure the Kubernetes nodes have access to all buckets/containers that you intend to use with lakeFS.
If you can't provide such access, configure lakeFS with an AWS key-pair, as sketched below.
{: .note .note-warning }
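
If you do configure an AWS key-pair, the credentials go under `blockstore.s3.credentials` in the lakeFS configuration. A hedged sketch of a `conf-values.yaml` written that way (the key values are placeholders; treat them as secrets in practice):

```bash
# Hypothetical sketch: configure lakeFS with an explicit AWS key-pair instead of node roles.
# The access key values are placeholders - keep the real ones out of version control.
cat > conf-values.yaml <<'EOF'
secrets:
   authEncryptSecretKey: [ENCRYPTION_SECRET_KEY]
lakefsConfig: |
    database:
      type: dynamodb
    blockstore:
      type: s3
      s3:
        credentials:
          access_key_id: [AWS_ACCESS_KEY_ID]
          secret_access_key: [AWS_SECRET_ACCESS_KEY]
EOF
```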

### Load balancing

To configure a load balancer to direct requests to the lakeFS servers, you can use the `LoadBalancer` Service type or a Kubernetes Ingress.
By default, lakeFS operates on port 8000 and exposes a `/_health` endpoint that you can use for health checks.

💡 The NGINX Ingress Controller by default limits the client body size to 1 MiB.
Some clients use bigger chunks to upload objects - for example, multipart upload to lakeFS using the [S3 Gateway][s3-gateway] or
a simple PUT request using the [OpenAPI Server][openapi].
Check out the NGINX [documentation](https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/#custom-max-body-size) on increasing the limit, or see an example of NGINX configuration with [MinIO](https://docs.min.io/docs/setup-nginx-proxy-with-minio.html).
{: .note }
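
For instance, a hedged Ingress sketch that routes to the lakeFS Service and raises the NGINX body-size limit could look like this. The host, service name, and service port are placeholders for your environment.

```bash
# Hypothetical sketch: an NGINX Ingress for lakeFS with a raised client body size limit.
# Host, service name, and port are placeholders; "0" disables the body-size limit.
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: lakefs
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
spec:
  ingressClassName: nginx
  rules:
    - host: lakefs.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-lakefs
                port:
                  number: 80
EOF
```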

  </div>
</div>

## Prepare your S3 bucket

1. Take note of the bucket name you want to use with lakeFS.
2. Use the following as your bucket policy, filling in the placeholders:

   <div class="tabs">
      <ul>
         <li><a href="#bucket-policy-standard">Standard Permissions</a></li>
         <li><a href="#bucket-policy-express">Standard Permissions (with s3express)</a></li>
         <li><a href="#bucket-policy-minimal">Minimal Permissions (Advanced)</a></li>
      </ul>
      <div markdown="1" id="bucket-policy-standard">

      ```json
   {
      "Id": "lakeFSPolicy",
      "Version": "2012-10-17",
      "Statement": [
         {
            "Sid": "lakeFSObjects",
            "Action": [
               "s3:GetObject",
               "s3:PutObject",
               "s3:AbortMultipartUpload",
               "s3:ListMultipartUploadParts"
            ],
            "Effect": "Allow",
            "Resource": ["arn:aws:s3:::[BUCKET_NAME_AND_PREFIX]/*"],
            "Principal": {
               "AWS": ["arn:aws:iam::[ACCOUNT_ID]:role/[IAM_ROLE]"]
            }
         },
         {
            "Sid": "lakeFSBucket",
            "Action": [
               "s3:ListBucket",
               "s3:GetBucketLocation",
               "s3:ListBucketMultipartUploads"
            ],
            "Effect": "Allow",
            "Resource": ["arn:aws:s3:::[BUCKET]"],
            "Principal": {
               "AWS": ["arn:aws:iam::[ACCOUNT_ID]:role/[IAM_ROLE]"]
            }
         }
      ]
   }
      ```

      * Replace `[BUCKET]`, `[ACCOUNT_ID]` and `[IAM_ROLE]` with values relevant to your environment.
      * `[BUCKET_NAME_AND_PREFIX]` can be the bucket name. If you want to minimize the bucket policy permissions, use the bucket name together with a prefix (e.g. `example-bucket/a/b/c`).
        This way, lakeFS will be able to create repositories only under this specific path (see: [Storage Namespace][understand-repository]).
      * lakeFS will try to assume the role `[IAM_ROLE]`.
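
      Once the placeholders are filled in, you can apply the policy with the AWS CLI. This is a hedged example; the bucket name and file name are placeholders.

      ```bash
      # Hypothetical example: apply the bucket policy above, saved locally as policy.json.
      # "example-bucket" is a placeholder bucket name.
      aws s3api put-bucket-policy \
        --bucket example-bucket \
        --policy file://policy.json
      ```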
   </div>
   <div markdown="1" id="bucket-policy-express">

      To use an S3 Express One Zone _directory bucket_, use the following policy. Note the `lakeFSDirectoryBucket` statement, which is specifically required for using a directory bucket.

      ```json
   {
      "Id": "lakeFSPolicy",
      "Version": "2012-10-17",
      "Statement": [
         {
            "Sid": "lakeFSObjects",
            "Action": [
               "s3:GetObject",
               "s3:PutObject",
               "s3:AbortMultipartUpload",
               "s3:ListMultipartUploadParts"
            ],
            "Effect": "Allow",
            "Resource": ["arn:aws:s3:::[BUCKET_NAME_AND_PREFIX]/*"],
            "Principal": {
               "AWS": ["arn:aws:iam::[ACCOUNT_ID]:role/[IAM_ROLE]"]
            }
         },
         {
            "Sid": "lakeFSBucket",
            "Action": [
               "s3:ListBucket",
               "s3:GetBucketLocation",
               "s3:ListBucketMultipartUploads"
            ],
            "Effect": "Allow",
            "Resource": ["arn:aws:s3:::[BUCKET]"],
            "Principal": {
               "AWS": ["arn:aws:iam::[ACCOUNT_ID]:role/[IAM_ROLE]"]
            }
         },
         {
            "Sid": "lakeFSDirectoryBucket",
            "Action": [
               "s3express:CreateSession"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:s3express:[REGION]:[ACCOUNT_ID]:bucket/[BUCKET_NAME]"
         }
      ]
   }
      ```

      * Replace `[BUCKET_NAME]`, `[ACCOUNT_ID]` and `[IAM_ROLE]` with values relevant to your environment.
      * `[BUCKET_NAME_AND_PREFIX]` can be the bucket name. If you want to minimize the bucket policy permissions, use the bucket name together with a prefix (e.g. `example-bucket/a/b/c`).
        This way, lakeFS will be able to create repositories only under this specific path (see: [Storage Namespace][understand-repository]).
      * lakeFS will try to assume the role `[IAM_ROLE]`.
   </div>
   <div markdown="1" id="bucket-policy-minimal">
   If needed, lakeFS can operate without accessing the data itself. This set of permissions is useful if you are using [presigned URLs mode][presigned-url] or the [lakeFS Hadoop FileSystem Spark integration][integration-hadoopfs].
   Since this FileSystem performs many operations directly on the storage, lakeFS requires fewer permissions, resulting in increased security.

   lakeFS always requires permissions to access the `_lakefs` prefix under your storage namespace, in which metadata
   is stored ([learn more][understand-commits]).

   By setting this policy **without** presign mode, you'll be able to perform only metadata operations through lakeFS, meaning that you'll **not** be able
   to use lakeFS to upload or download objects. Specifically, you won't be able to:
      * Upload objects using the lakeFS GUI (**works with presign mode**)
      * Upload objects through Spark using the S3 gateway
      * Run `lakectl fs` commands (unless using **presign mode** with the `--pre-sign` flag)
      * Use [Actions and Hooks](/howto/hooks/)

   ```json
   {
     "Id": "[POLICY_ID]",
     "Version": "2012-10-17",
     "Statement": [
       {
         "Sid": "lakeFSObjects",
         "Action": [
           "s3:GetObject",
           "s3:PutObject"
         ],
         "Effect": "Allow",
         "Resource": [
           "arn:aws:s3:::[STORAGE_NAMESPACE]/_lakefs/*"
         ],
         "Principal": {
           "AWS": ["arn:aws:iam::[ACCOUNT_ID]:role/[IAM_ROLE]"]
         }
       },
       {
         "Sid": "lakeFSBucket",
         "Action": [
           "s3:ListBucket",
           "s3:GetBucketLocation"
         ],
         "Effect": "Allow",
         "Resource": ["arn:aws:s3:::[BUCKET]"],
         "Principal": {
           "AWS": ["arn:aws:iam::[ACCOUNT_ID]:role/[IAM_ROLE]"]
         }
       }
     ]
   }
   ```

   We can use [presigned URLs mode][presigned-url] without allowing the lakeFS server direct access to the data.
   We can achieve this by using [condition keys](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_condition-keys.html) such as [aws:referer](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_condition-keys.html#condition-keys-referer), [aws:SourceVpc](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_condition-keys.html#condition-keys-sourcevpc) and [aws:SourceIp](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_condition-keys.html#condition-keys-sourceip).

   For example, assume the following scenario:
   - lakeFS is deployed outside the company (i.e., lakeFS Cloud or another VPC, **not** `vpc-123`)
   - We don't want lakeFS to access the data directly, so we use presigned URLs; the lakeFS role still needs permission to sign the URLs.
   - We want to allow access from the internal company VPC: `vpc-123`.

   ```json
   {
      "Sid": "allowLakeFSRoleFromCompanyOnly",
      "Effect": "Allow",
      "Principal": {
         "AWS": "arn:aws:iam::[ACCOUNT_ID]:role/[IAM_ROLE]"
      },
      "Action": [
         "s3:GetObject",
         "s3:PutObject"
      ],
      "Resource": [
         "arn:aws:s3:::[BUCKET]/*"
      ],
      "Condition": {
         "StringEquals": {
            "aws:SourceVpc": "vpc-123"
         }
      }
   }
   ```
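
   With such a policy in place, object data flows through pre-signed URLs rather than through the lakeFS server. As a hedged example (the repository, branch, and file paths are placeholders), `lakectl fs` commands then work when passed the `--pre-sign` flag:

   ```bash
   # Hypothetical example: upload and download objects using pre-signed URLs.
   # Repository, branch, and file paths are placeholders.
   lakectl fs upload --pre-sign -s ./local-file.csv lakefs://example-repo/main/datasets/file.csv
   lakectl fs download --pre-sign lakefs://example-repo/main/datasets/file.csv ./file.csv
   ```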

   </div>
   </div>

#### S3 Storage Tier Classes

lakeFS currently supports the following S3 Storage Classes:

1. [S3 Standard](https://aws.amazon.com/s3/storage-classes/#General_purpose) - The default AWS S3 storage tier. Fully supported.
2. [S3 Express One-Zone](https://aws.amazon.com/s3/storage-classes/express-one-zone/) - Fully supported.
3. [S3 Glacier Instant Retrieval](https://aws.amazon.com/s3/storage-classes/glacier/instant-retrieval/) - Supported with limitations: currently, pre-signed URLs are not supported when using Instant Retrieval. The outstanding feature request [can be tracked here](https://github.com/treeverse/lakeFS/issues/7784).

Other storage classes are currently unsupported - either because they have not been tested with lakeFS or because they cannot be supported.

If you need lakeFS to support a storage tier that isn't currently on the supported list, please [open an issue on GitHub](https://github.com/treeverse/lakeFS/issues).

### Alternative: use an AWS user

lakeFS can authenticate with your AWS account using an AWS user's access key and secret. To allow this, change the policy's Principal accordingly:

```json
 "Principal": {
   "AWS": ["arn:aws:iam::<ACCOUNT_ID>:user/<IAM_USER>"]
 }
```
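
A hedged sketch of that flow with the AWS CLI (the user name is a placeholder): create an access key for the user, then supply the returned key-pair to lakeFS, for example via `blockstore.s3.credentials.access_key_id` and `blockstore.s3.credentials.secret_access_key` in `config.yaml`, or via the standard AWS environment variables.

```bash
# Hypothetical example: create an access key for the IAM user that lakeFS will use.
# "lakefs-user" is a placeholder; the command prints the access key ID and secret,
# which you then set in the lakeFS blockstore credentials configuration.
aws iam create-access-key --user-name lakefs-user
```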

{% include_relative includes/setup.md %}

[downloads]:  {% link index.md %}#downloads
[openapi]:  {% link understand/architecture.md %}#openapi-server
[s3-gateway]:  {% link understand/architecture.md %}#s3-gateway
[understand-repository]:  {% link understand/model.md %}#repository
[integration-hadoopfs]:  {% link integrations/spark.md %}#lakefs-hadoop-filesystem
[understand-commits]:  {% link understand/how/versioning-internals.md %}#constructing-a-consistent-view-of-the-keyspace-ie-a-commit
[presigned-url]:  {% link reference/security/presigned-url.md %}#