---
title: AWS
grand_parent: How-To
parent: Install lakeFS
description: How to deploy and set up a production-suitable lakeFS environment on AWS
redirect_from:
  - /deploying-aws/index.html
  - /deploying-aws/install.html
  - /deploying-aws/db.html
  - /deploying-aws/lb_dns.html
  - /setup/storage/s3.html
  - /deploy/aws.html
next: ["Import data into your installation", "/howto/import.html"]
---

# Deploy lakeFS on AWS

{: .tip }
> The instructions given here are for a self-managed deployment of lakeFS on AWS.
>
> For a hosted lakeFS service with guaranteed SLAs, try [lakeFS Cloud](https://lakefs.cloud)

When you deploy lakeFS on AWS, you can choose between several options for each component: the key-value store backing lakeFS metadata (DynamoDB or PostgreSQL), the compute running the lakeFS server (EC2 or EKS), and S3 as the underlying object store.

This guide walks you through the options available and how to configure them, finishing with configuring and running lakeFS itself and creating your first repository.

{% include toc.html %}

⏰ Expected deployment time: 25 min
{: .note }

## Grant lakeFS permissions to DynamoDB

By default, lakeFS will create the required DynamoDB table if it does not already exist. You'll have to give the IAM role used by lakeFS the following permissions:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListAndDescribe",
      "Effect": "Allow",
      "Action": [
        "dynamodb:List*",
        "dynamodb:DescribeReservedCapacity*",
        "dynamodb:DescribeLimits",
        "dynamodb:DescribeTimeToLive"
      ],
      "Resource": "*"
    },
    {
      "Sid": "kvstore",
      "Effect": "Allow",
      "Action": [
        "dynamodb:BatchGet*",
        "dynamodb:DescribeTable",
        "dynamodb:Get*",
        "dynamodb:Query",
        "dynamodb:Scan",
        "dynamodb:BatchWrite*",
        "dynamodb:CreateTable",
        "dynamodb:Delete*",
        "dynamodb:Update*",
        "dynamodb:PutItem"
      ],
      "Resource": "arn:aws:dynamodb:*:*:table/kvstore"
    }
  ]
}
```

💡 You can also use lakeFS with PostgreSQL instead of DynamoDB! See the [configuration reference]({% link reference/configuration.md %}) for more information.
{: .note }

## Run the lakeFS server

<div class="tabs">
<ul>
  <li><a href="#ec2">EC2</a></li>
  <li><a href="#eks">EKS</a></li>
</ul>
<div markdown="1" id="ec2">

Connect to your EC2 instance using SSH:

1. Create a `config.yaml` on your EC2 instance, with the following parameters:

   ```yaml
   ---
   database:
     type: "dynamodb"

   auth:
     encrypt:
       # replace this with a randomly-generated string. Make sure to keep it safe!
       secret_key: "[ENCRYPTION_SECRET_KEY]"

   blockstore:
     type: s3
   ```
1. [Download the binary][downloads] to run on the EC2 instance.
1. Run the `lakefs` binary on the EC2 instance:

   ```sh
   lakefs --config config.yaml run
   ```

**Note:** It's preferable to run the binary as a service using systemd or your operating system's facilities.
{: .note }

### Advanced: Deploying lakeFS behind an AWS Application Load Balancer

1. Your security groups should allow the load balancer to access the lakeFS server.
1. Create a target group with a listener for port 8000.
1. Set up TLS termination using the domain names you wish to use (e.g., `lakefs.example.com` and potentially `s3.lakefs.example.com`, `*.s3.lakefs.example.com` if using [virtual-host addressing](https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html)).
1. Configure the health-check to use the exposed `/_health` URL (you can verify it responds as shown below).
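
Before registering the instance with the target group, it can help to confirm the health endpoint answers locally. A quick check, assuming lakeFS listens on its default port, 8000:

```sh
# Run on the EC2 instance; a healthy server returns HTTP 200
curl -i http://localhost:8000/_health
```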
</div>
<div markdown="1" id="eks">

You can install lakeFS on Kubernetes using a [Helm chart](https://github.com/treeverse/charts/tree/master/charts/lakefs).

To install lakeFS with Helm:

1. Copy the Helm values file relevant for S3:

   ```yaml
   secrets:
     # replace this with a randomly-generated string
     authEncryptSecretKey: [ENCRYPTION_SECRET_KEY]
   lakefsConfig: |
     database:
       type: dynamodb
     blockstore:
       type: s3
   ```
1. Fill in the missing values and save the file as `conf-values.yaml`. For more configuration options, see our Helm chart [README](https://github.com/treeverse/charts/blob/master/charts/lakefs/README.md#custom-configuration){:target="_blank"}.

   The `lakefsConfig` parameter is the lakeFS configuration documented [here]({% link reference/configuration.md %}), but without sensitive information.
   Sensitive information like `databaseConnectionString` is given through separate parameters, and the chart will inject it into Kubernetes secrets.
   {: .note }

1. In the directory where you created `conf-values.yaml`, run the following commands:

   ```bash
   # Add the lakeFS repository
   helm repo add lakefs https://charts.lakefs.io
   # Deploy lakeFS
   helm install my-lakefs lakefs/lakefs -f conf-values.yaml
   ```

   *my-lakefs* is the [Helm Release](https://helm.sh/docs/intro/using_helm/#three-big-concepts) name.

⚠️ Make sure the Kubernetes nodes have access to all buckets/containers you intend to use with lakeFS.
If you can't provide such access, configure lakeFS with an AWS key-pair.
{: .note .note-warning }

### Load balancing

To configure a load balancer to direct requests to the lakeFS servers, you can use the `LoadBalancer` Service type or a Kubernetes Ingress.
By default, lakeFS operates on port 8000 and exposes a `/_health` endpoint that you can use for health checks.

💡 The NGINX Ingress Controller by default limits the client body size to 1 MiB.
Some clients use bigger chunks to upload objects, for example multipart upload to lakeFS using the [S3 Gateway][s3-gateway] or a simple PUT request using the [OpenAPI Server][openapi].
Check out the NGINX [documentation](https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/#custom-max-body-size) for increasing the limit, or see an example of an NGINX configuration with [MinIO](https://docs.min.io/docs/setup-nginx-proxy-with-minio.html). An illustrative Ingress sketch follows.
{: .note }
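
For example, an Ingress for the NGINX Ingress Controller can raise the body-size limit with the `proxy-body-size` annotation. This is a minimal sketch: the hostname is illustrative, and the Service name and port depend on your Helm release name and chart values.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: lakefs
  annotations:
    # "0" disables the client body size check; set a concrete limit if you prefer
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
spec:
  ingressClassName: nginx
  rules:
    - host: lakefs.example.com       # replace with your domain
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-lakefs      # the Service created for your Helm release
                port:
                  number: 80         # match your chart's service port
```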
</div>
</div>

## Prepare your S3 bucket

1. Take note of the bucket name you want to use with lakeFS.
2. Use the following as your bucket policy, filling in the placeholders:

<div class="tabs">
<ul>
  <li><a href="#bucket-policy-standard">Standard Permissions</a></li>
  <li><a href="#bucket-policy-express">Standard Permissions (with s3express)</a></li>
  <li><a href="#bucket-policy-minimal">Minimal Permissions (Advanced)</a></li>
</ul>
<div markdown="1" id="bucket-policy-standard">

```json
{
  "Id": "lakeFSPolicy",
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "lakeFSObjects",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::[BUCKET_NAME_AND_PREFIX]/*"],
      "Principal": {
        "AWS": ["arn:aws:iam::[ACCOUNT_ID]:role/[IAM_ROLE]"]
      }
    },
    {
      "Sid": "lakeFSBucket",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:ListBucketMultipartUploads"
      ],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::[BUCKET]"],
      "Principal": {
        "AWS": ["arn:aws:iam::[ACCOUNT_ID]:role/[IAM_ROLE]"]
      }
    }
  ]
}
```

* Replace `[BUCKET]`, `[ACCOUNT_ID]` and `[IAM_ROLE]` with values relevant to your environment.
* `[BUCKET_NAME_AND_PREFIX]` can be the bucket name. If you want to minimize the bucket policy permissions, use the bucket name together with a prefix (e.g. `example-bucket/a/b/c`).
  This way, lakeFS will be able to create repositories only under this specific path (see: [Storage Namespace][understand-repository]).
* lakeFS will try to assume the role `[IAM_ROLE]`.

</div>
<div markdown="1" id="bucket-policy-express">

To use an S3 Express One Zone _directory bucket_, use the following policy. Note the `lakeFSDirectoryBucket` statement, which is specifically required for using a directory bucket.

```json
{
  "Id": "lakeFSPolicy",
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "lakeFSObjects",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::[BUCKET_NAME_AND_PREFIX]/*"],
      "Principal": {
        "AWS": ["arn:aws:iam::[ACCOUNT_ID]:role/[IAM_ROLE]"]
      }
    },
    {
      "Sid": "lakeFSBucket",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:ListBucketMultipartUploads"
      ],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::[BUCKET]"],
      "Principal": {
        "AWS": ["arn:aws:iam::[ACCOUNT_ID]:role/[IAM_ROLE]"]
      }
    },
    {
      "Sid": "lakeFSDirectoryBucket",
      "Action": [
        "s3express:CreateSession"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:s3express:[REGION]:[ACCOUNT_ID]:bucket/[BUCKET_NAME]"
    }
  ]
}
```

* Replace `[BUCKET]`, `[BUCKET_NAME]`, `[REGION]`, `[ACCOUNT_ID]` and `[IAM_ROLE]` with values relevant to your environment.
* `[BUCKET_NAME_AND_PREFIX]` can be the bucket name. If you want to minimize the bucket policy permissions, use the bucket name together with a prefix (e.g. `example-bucket/a/b/c`).
  This way, lakeFS will be able to create repositories only under this specific path (see: [Storage Namespace][understand-repository]).
* lakeFS will try to assume the role `[IAM_ROLE]`.
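
Once the placeholders are filled in, one way to attach the policy is with the AWS CLI. The bucket and file names below are illustrative:

```sh
# Attach the policy saved locally as bucket-policy.json to your bucket
aws s3api put-bucket-policy --bucket example-bucket --policy file://bucket-policy.json
```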
</div>
<div markdown="1" id="bucket-policy-minimal">

If required, lakeFS can operate without accessing the data itself. This policy is useful if you are using [presigned URLs mode][presigned-url] or the [lakeFS Hadoop FileSystem Spark integration][integration-hadoopfs].
Since these perform many operations directly on the storage, lakeFS requires less permissive permissions, resulting in increased security.

lakeFS always requires permissions to access the `_lakefs` prefix under your storage namespace, in which metadata
is stored ([learn more][understand-commits]).

By setting this policy **without** presign mode, you'll be able to perform only metadata operations through lakeFS, meaning that you'll **not** be able
to use lakeFS to upload or download objects. Specifically, you won't be able to:

* Upload objects using the lakeFS GUI (**works with presign mode**)
* Upload objects through Spark using the S3 gateway
* Run `lakectl fs` commands (unless using **presign mode** with the `--pre-sign` flag)
* Use [Actions and Hooks](/howto/hooks/)

```json
{
  "Id": "[POLICY_ID]",
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "lakeFSObjects",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::[STORAGE_NAMESPACE]/_lakefs/*"
      ],
      "Principal": {
        "AWS": ["arn:aws:iam::[ACCOUNT_ID]:role/[IAM_ROLE]"]
      }
    },
    {
      "Sid": "lakeFSBucket",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation"
      ],
      "Effect": "Allow",
      "Resource": ["arn:aws:s3:::[BUCKET]"],
      "Principal": {
        "AWS": ["arn:aws:iam::[ACCOUNT_ID]:role/[IAM_ROLE]"]
      }
    }
  ]
}
```

We can use [presigned URLs mode][presigned-url] without allowing the lakeFS server direct access to the data.
We can achieve this by using [condition keys](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_condition-keys.html) such as [aws:referer](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_condition-keys.html#condition-keys-referer), [aws:SourceVpc](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_condition-keys.html#condition-keys-sourcevpc) and [aws:SourceIp](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_condition-keys.html#condition-keys-sourceip).

For example, assume the following scenario:

- lakeFS is deployed outside the company (e.g., lakeFS Cloud or another VPC, **not** `vpc-123`).
- We don't want lakeFS to be able to access the data, so we use presigned URLs; the lakeFS role still needs to be able to sign the URLs.
- We want to allow access from the internal company VPC: `vpc-123`.

```json
{
  "Sid": "allowLakeFSRoleFromCompanyOnly",
  "Effect": "Allow",
  "Principal": {
    "AWS": "arn:aws:iam::[ACCOUNT_ID]:role/[IAM_ROLE]"
  },
  "Action": [
    "s3:GetObject",
    "s3:PutObject"
  ],
  "Resource": [
    "arn:aws:s3:::[BUCKET]/*"
  ],
  "Condition": {
    "StringEquals": {
      "aws:SourceVpc": "vpc-123"
    }
  }
}
```
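
With a policy like this in place, clients move data through pre-signed URLs rather than through the lakeFS server. For example, with `lakectl` (the repository, branch, and file names here are illustrative):

```sh
# Upload via a pre-signed URL instead of streaming through the lakeFS server
lakectl fs upload --pre-sign -s ./data.parquet lakefs://example-repo/main/data.parquet

# Download the same object, also via a pre-signed URL
lakectl fs download --pre-sign lakefs://example-repo/main/data.parquet ./data.parquet
```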
</div>
</div>

### S3 Storage Tier Classes

lakeFS currently supports the following S3 storage classes:

1. [S3 Standard](https://aws.amazon.com/s3/storage-classes/#General_purpose) - The default AWS S3 storage tier. Fully supported.
2. [S3 Express One Zone](https://aws.amazon.com/s3/storage-classes/express-one-zone/) - Fully supported.
3. [S3 Glacier Instant Retrieval](https://aws.amazon.com/s3/storage-classes/glacier/instant-retrieval/) - Supported with limitations: currently, pre-signed URLs are not supported when using Instant Retrieval. The outstanding feature request can be [tracked here](https://github.com/treeverse/lakeFS/issues/7784).

Other storage classes are currently unsupported, either because they have not been tested with lakeFS or because they cannot be supported.

If you need lakeFS to support a storage tier that isn't currently on the supported list, please [open an issue on GitHub](https://github.com/treeverse/lakeFS/issues).

### Alternative: use an AWS user

lakeFS can authenticate with your AWS account using an AWS user with an access key and secret. To allow this, change the policy's `Principal` accordingly:

```json
"Principal": {
  "AWS": ["arn:aws:iam::<ACCOUNT_ID>:user/<IAM_USER>"]
}
```

{% include_relative includes/setup.md %}

[downloads]: {% link index.md %}#downloads
[openapi]: {% link understand/architecture.md %}#openapi-server
[s3-gateway]: {% link understand/architecture.md %}#s3-gateway
[understand-repository]: {% link understand/model.md %}#repository
[integration-hadoopfs]: {% link integrations/spark.md %}#lakefs-hadoop-filesystem
[understand-commits]: {% link understand/how/versioning-internals.md %}#constructing-a-consistent-view-of-the-keyspace-ie-a-commit
[presigned-url]: {% link reference/security/presigned-url.md %}#