---
title: Sizing Guide
parent: How-To
description: This section provides a detailed sizing guide for deploying lakeFS.
redirect_from:
  - /architecture/sizing-guide.html
  - /understand/sizing-guide.html
---

# Sizing guide

Note: For a scalable managed lakeFS service with guaranteed SLAs, try [lakeFS Cloud](https://lakefs.cloud)
{: .note }

{% include toc.html %}

## System Requirements

### Operating Systems and ISA

lakeFS can run on macOS and Linux. Windows binaries are available but not rigorously tested -
we don't recommend deploying lakeFS to production on Windows.
Both x86_64 and arm64 architectures are supported on macOS and Linux.

### Memory and CPU requirements

lakeFS servers require a minimum of 512 MB of RAM and 1 CPU core.
For high throughput, additional CPUs help scale requests across different cores.
"Expensive" operations such as large diff or commit operations can take advantage of multiple cores.

### Network

If using the data APIs such as the [S3 Gateway][s3-gateway],
lakeFS requires enough network bandwidth to support the planned concurrent upload/download operations.
For most cloud providers, more powerful machines (i.e., more expensive and usually containing more CPU cores) also provide increased network bandwidth.

If using only the metadata APIs (for example, when using only the Hadoop/Spark clients), network bandwidth requirements are minimal,
at roughly 1 KB per request.

### Disk

lakeFS greatly benefits from fast local disks.
A lakeFS instance doesn't require strong durability guarantees from the underlying storage,
as the disk is only ever used as a local caching layer for lakeFS metadata, not for long-term storage.
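This local metadata cache is controlled from the lakeFS configuration file. A minimal sketch pointing the cache at a fast local NVMe mount - key names are taken from the lakeFS configuration reference at the time of writing, and the path is an example; verify both against your lakeFS version:

```yaml
committed:
  local_cache:
    # Place the cache on the fastest disk available - an ephemeral NVMe
    # instance store works well, since the cache is safe to lose.
    dir: /mnt/nvme0/lakefs-cache
    # Cache size in bytes; size per the guidance in this section.
    size_bytes: 10737418240
```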
lakeFS is designed to work with [ephemeral disks](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-instance-store.html){: target="_blank" } -
these are usually NVMe-based and tied to the machine's lifecycle.
Using ephemeral disks, lakeFS can provide a very high throughput/cost ratio -
probably the best achievable on a public cloud - so we recommend using them.

A local cache of at least 512 MiB should be provided.
For large installations (managing >100 concurrently active branches, with >100M objects per commit),
we recommend allocating at least 10 GiB. Since the cache sits in front of relatively slow storage (the object store),
it should be big enough to hold all commit metadata for actively referenced commits -
see [Important metrics](#important-metrics) below to understand how to size it.

### lakeFS KV Store

lakeFS uses a key-value database to manage branch references, authentication and authorization information,
and to keep track of currently uncommitted data across branches.
Please refer to the relevant driver tab for best practices, requirements, and benchmarks.

#### Storage

The dataset stored in the metadata store is relatively modest, as most metadata is pushed down into the object store.
Required storage is mostly a factor of the amount of uncommitted writes across all branches at any given point in time:
in the range of 150 MiB per every 100,000 uncommitted writes.

We recommend starting at 10 GiB for a production deployment, as it will likely be more than enough.

<div class="tabs">
<ul>
  <li><a href="#postgres-ram">PostgreSQL</a></li>
  <li><a href="#dynamodb-ram">DynamoDB</a></li>
</ul>
<div markdown="1" id="postgres-ram">

**RAM**

Since the data size is small, it's recommended to provide enough memory to hold the vast majority of that data in RAM.
Cloud providers save you the need to tune this parameter - it is set to a fixed percentage of the chosen instance's available RAM (25% on AWS RDS, 30% on Google Cloud SQL).
Check with your selected cloud provider for configuration and provisioning information for your database.
For self-managed database instances, follow these best practices:

Ideally, configure the [shared_buffers](https://www.postgresql.org/docs/current/runtime-config-resource.html#GUC-SHARED-BUFFERS){: target="_blank" }
of your PostgreSQL instances to be large enough to contain the currently active dataset.
Pick a database instance with enough RAM to accommodate this buffer size - roughly 4x the size given for `shared_buffers`.
For example, if an installation has ~500,000 uncommitted writes at any given time, it would require about 750 MiB of `shared_buffers`,
which in turn would require about 3 GiB of RAM.

**CPU**

PostgreSQL CPU cores help scale concurrent requests. One CPU core for every 5,000 requests/second is ideal.
</div>
<div markdown="1" id="dynamodb-ram">

lakeFS will create a DynamoDB table for you, defaulting to the on-demand capacity mode. There is no need to specify how much read and write throughput you expect your application to perform, as DynamoDB instantly accommodates your workloads as they ramp up or down.

You can customize the table settings to use provisioned capacity, which allows you to manage and optimize costs by allocating read/write capacity in advance (see [Benchmarks](#benchmarks)).

**Notes**:
* Using DynamoDB on-demand capacity might generate unwanted costs if the table is abused. If you'd like to cap your costs, change the table to use provisioned capacity instead.
* lakeFS doesn't manage the DynamoDB table's lifecycle. Table creation is included only to help evaluate the system with minimal effort; any change to the table beyond its creation must be handled manually or by third-party tools.

**RAM**

Managed by AWS.

**CPU**

Managed by AWS.
</div>
</div>

## Scaling factors

Scaling lakeFS, like most data systems, moves across two axes: throughput of requests (amount per given timeframe) and latency (time to complete a single request).

### Understanding latency and throughput considerations

Most lakeFS operations are designed to be very low in latency.
Assuming a well-tuned local disk cache (see [Disk](#disk) above),
most critical path operations
(writing objects, requesting objects, deleting objects) are designed to complete in **<25ms at p90**.
Listing objects obviously requires accessing more data, but should always be on par with what the underlying object store can provide,
and in most cases, it's actually faster.
In the worst case, for a directory listing with 1,000 common prefixes returned, expect a latency of **75ms at p90**.

Managing branches (creating, listing, and deleting them) involves constant-time operations, generally taking **<30ms at p90**.

Committing and merging can take longer, as they are proportional to the amount of **changes** introduced.
This is what makes lakeFS optimal for large Data Lakes -
the amount of changes introduced per commit usually stays relatively stable while the entire dataset grows over time.
This means lakeFS provides predictable performance:
committing 100 changes will take roughly the same amount of time whether the resulting commit contains 500 or 500 million objects.

See [Data Model]({% link understand/how/versioning-internals.md %}) for more information.
Scaling throughput depends very much on the number of CPU cores available to lakeFS.
In many cases, it's easier to scale lakeFS across a fleet of smaller cloud instances (or containers)
than to scale up with machines that have many cores. In fact, lakeFS works well in both cases:
most critical path operations scale very well across machines.

## Benchmarks

<div class="tabs">
<ul>
  <li><a href="#postgres-ram">PostgreSQL</a></li>
  <li><a href="#dynamodb-ram">DynamoDB</a></li>
</ul>
<div markdown="1" id="postgres-ram">

### PostgreSQL

All benchmarks below were measured using 2 x [c5ad.4xlarge](https://aws.amazon.com/ec2/instance-types/c5/){: target="_blank" } instances
on [AWS us-east-1](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-available-regions).
Similar results can be achieved on Google Cloud using a `c2-standard-16` machine type with an attached [local SSD](https://cloud.google.com/compute/docs/disks/local-ssd).
On Azure, you can use a `Standard_F16s_v2` virtual machine.

The PostgreSQL instance used was a [db.m6g.2xlarge](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.DBInstanceClass.html){: target="_blank" }
(8 vCPUs, 32 GB RAM). Equivalent machines on Google Cloud or Azure should yield similar results.

The example repository we tested against contains the metadata of a large lakeFS installation,
where each commit contains **~180,000,000** objects (representing ~7.5 Petabytes of data).

All tests are reproducible using the [lakectl abuse command][lakectl-abuse],
so use it to properly size and tune your setup. Each test is accompanied by the `lakectl abuse` command that generated it.

### Random reads

This test generates random read requests to lakeFS in a given commit.
Paths are requested randomly from a file containing a set of preconfigured (and existing) paths.

**command executed:**

```shell
lakectl abuse random-read \
    --from-file randomly_selected_paths.txt \
    --amount 500000 \
    --parallelism 128 \
    lakefs://example-repo/<commit hash>
```

**Note** lakeFS version <= v0.33.1 uses '@' (instead of '/') as the separator between repository and commit hash.

**Result Histogram (raw):**

```
Histogram (ms):
1       0
2       0
5       37945
7       179727
10      296964
15      399682
25      477502
50      499625
75      499998
100     499998
250     500000
350     500000
500     500000
750     500000
1000    500000
5000    500000
min     3
max     222
total   500000
```

So 50% of all requests took <10ms, while 99.9% of them took <50ms.

**throughput:**

The average throughput during the experiment was **10851.69 requests/second**.

### Random Writes

This test generates random write requests to a given lakeFS branch.
All paths are pre-generated and don't overwrite each other (as overwrites are relatively rare in a Data Lake setup).

**command executed:**

```shell
lakectl abuse random-write \
    --amount 500000 \
    --parallelism 64 \
    lakefs://example-repo/main
```

**Note** lakeFS version <= v0.33.1 uses '@' (instead of '/') as the separator between repository and branch.

**Result Histogram (raw):**

```
Histogram (ms):
1       0
2       0
5       30715
7       219647
10      455807
15      498144
25      499535
50      499742
75      499784
100     499802
250     500000
350     500000
500     500000
750     500000
1000    500000
5000    500000
min     3
max     233
total   500000
```

So, 50% of all requests took <10ms, while 99.9% of them took <25ms.

**throughput:**

The average throughput during the experiment was **7595.46 requests/second**.
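The percentile summaries above can be read directly off the raw histogram output, since each bucket holds a cumulative count. As a small illustration (a hypothetical helper, not part of `lakectl`; the bucket counts are copied from the random-read run above), the following converts the counts into cumulative percentages:

```shell
#!/bin/sh
# Convert a raw `lakectl abuse` histogram (cumulative counts per latency
# bucket, in ms) into cumulative percentages.
cat > /tmp/histogram.txt <<'EOF'
5 37945
7 179727
10 296964
15 399682
25 477502
50 499625
total 500000
EOF

awk '
  $1 == "total" { total = $2; next }
  { bucket[NR] = $1; count[NR] = $2; n = NR }
  END {
    for (i = 1; i <= n; i++)
      printf "<=%sms: %.1f%%\n", bucket[i], 100 * count[i] / total
  }
' /tmp/histogram.txt
```

At the 10 ms bucket this prints `<=10ms: 59.4%`, confirming that the median (50th percentile) falls under 10 ms, as stated above.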
### Branch creation

This test creates branches from a given reference.

**command executed:**

```shell
lakectl abuse create-branches \
    --amount 500000 \
    --branch-prefix "benchmark-" \
    --parallelism 256 \
    lakefs://example-repo/<commit hash>
```

**Note** lakeFS version <= v0.33.1 uses '@' (instead of '/') as the separator between repository and commit hash.

**Result Histogram (raw):**

```
Histogram (ms):
1       0
2       1
5       5901
7       39835
10      135863
15      270201
25      399895
50      484932
75      497180
100     499303
250     499996
350     500000
500     500000
750     500000
1000    500000
5000    500000
min     2
max     304
total   500000
```

So, 50% of all requests took <15ms, while 99.9% of them took <100ms.

**throughput:**

The average throughput during the experiment was **7069.03 requests/second**.
</div>
<div markdown="1" id="dynamodb-ram">

### DynamoDB

All benchmarks below were measured using an m5.xlarge instance on AWS us-east-1.

The DynamoDB table used was provisioned with 500/1000 read/write capacity units.

The example repository we tested against contains the metadata of a large lakeFS installation, where each commit contains ~100,000,000 objects (representing ~3.5 Petabytes of data).

All tests are reproducible using the [lakectl abuse command][lakectl-abuse], so use it to properly size and tune your setup. Each test is accompanied by the `lakectl abuse` command that generated it.

### Random reads

This test generates random read requests to lakeFS in a given commit. Paths are requested randomly from a file containing a set of preconfigured (and existing) paths.
**command executed:**

```shell
lakectl abuse random-read \
    --from-file randomly_selected_paths.txt \
    --amount 500000 \
    --parallelism 128 \
    lakefs://example-repo/<commit hash>
```

**Result Histogram (raw): Provisioned read capacity units = 1000,
provisioned write capacity units = 1000**

```
Histogram (ms):
1       0
2       0
5       0
7       0
10      0
15      0
25      122
50      47364
75      344489
100     460404
250     497912
350     498016
500     498045
750     498111
1000    498176
5000    499478
min     18
max     52272
total   500000
```

**Result Histogram (raw): Provisioned read capacity units = 500,
provisioned write capacity units = 500**

```
Histogram (ms):
1       0
2       0
5       0
7       0
10      0
15      1
25      2672
50      239661
75      420171
100     470146
250     486603
350     486715
500     486789
750     487443
1000    488113
5000    493201
min     14
max     648085
total   499998
```

### Random Writes

This test generates random write requests to a given lakeFS branch.
All paths are pre-generated and don't overwrite each other (as overwrites are relatively rare in a Data Lake setup).
**command executed:**

```shell
lakectl abuse random-write \
    --amount 500000 \
    --parallelism 64 \
    lakefs://example-repo/main
```

**Result Histogram (raw): Provisioned read capacity units = 1000,
provisioned write capacity units = 1000**

```
Histogram (ms):
1       0
2       0
5       0
7       0
10      0
15      0
25      24
50      239852
75      458504
100     485225
250     493687
350     493872
500     493960
750     496239
1000    499194
5000    500000
min     23
max     4437
total   500000
```

**Result Histogram (raw): Provisioned read capacity units = 500,
provisioned write capacity units = 500**

```
Histogram (ms):
1       0
2       0
5       0
7       0
10      0
15      0
25      174
50      266460
75      462641
100     484486
250     490633
350     490856
500     490984
750     492973
1000    495605
5000    498920
min     21
max     50157
total   500000
```

### Branch creation

This test creates branches from a given reference.
**command executed:**

```shell
lakectl abuse create-branches \
    --amount 500000 \
    --branch-prefix "benchmark-" \
    --parallelism 256 \
    lakefs://example-repo/<commit hash>
```

**Result Histogram (raw): Provisioned read capacity units = 1000,
provisioned write capacity units = 1000**

```
Histogram (ms):
1       0
2       0
5       0
7       0
10      0
15      0
25      0
50      628
75      26153
100     58099
250     216160
350     307078
500     406165
750     422898
1000    431332
5000    475848
min     41
max     430725
total   490054
```

**Result Histogram (raw): Provisioned read capacity units = 500,
provisioned write capacity units = 500**

```
Histogram (ms):
1       0
2       0
5       0
7       0
10      0
15      0
25      0
50      3132
75      155570
100     292745
250     384224
350     397258
500     431141
750     441360
1000    445597
5000    469538
min     39
max     760626
total   497520
```

</div>
</div>

## Important metrics

lakeFS exposes metrics using the [Prometheus protocol](https://prometheus.io/docs/introduction/overview/){: target="_blank" }.
Every lakeFS instance exposes a `/metrics` endpoint that can be used to extract them.

Here are a few notable metrics to keep track of when sizing lakeFS:

`api_requests_total` - Tracks throughput of API requests over time.

`api_request_duration_seconds` - Histogram of latency per operation type.

`gateway_request_duration_seconds` - Histogram of latency per [S3 Gateway][s3-gateway] operation.

<div class="tabs">
<ul>
  <li><a href="#postgres">PostgreSQL</a></li>
  <li><a href="#dynamodb">DynamoDB</a></li>
</ul>
<div markdown="1" id="dynamodb">

`dynamo_request_duration_seconds` - Time spent performing DynamoDB requests.

`dynamo_consumed_capacity_total` - The capacity units consumed, by operation.
`dynamo_failures_total` - The total number of errors encountered while working against the KV store.

</div>
</div>

## Reference architectures

Below are a few example architectures for lakeFS deployments.

### Reference Architecture: Data Science/Research environment

**Use case:** Manage machine learning or algorithm development.
Use lakeFS branches to achieve both isolation and reproducibility of experiments.
The data managed by lakeFS includes structured tabular data,
as well as unstructured sensor and image data used for training.
Assume a team of 20-50 researchers, with a dataset size of 500 TiB across 20M objects.

**Environment:** lakeFS will be deployed on Kubernetes,
managed by [AWS EKS](https://aws.amazon.com/eks/){: target="_blank" },
with PostgreSQL on [AWS RDS Aurora](https://aws.amazon.com/rds/aurora/postgresql-features/){: target="_blank" }.

**Sizing:** Since most of the work is done by humans (vs. automated pipelines), most experiments tend to be small in scale,
reading and writing tens to thousands of objects.
The expected number of branches active in parallel is relatively low, around 1-2 per user,
each representing a small amount of uncommitted changes at any given point in time.
Let's assume 5,000 uncommitted writes per branch, or ~500k in total.

To support the expected throughput, a single moderate lakeFS instance should be more than enough,
since requests per second would be on the order of tens to hundreds.
For high availability, we'll deploy 2 pods with 1 CPU core and 1 GiB of RAM each.

The PostgreSQL instance is expected to hold a very small dataset
(at 500k uncommitted writes, the expected dataset size is `150 MiB (for 100k records) * 5 = 750 MiB`).
To hold this in RAM, we'll need about 3 GiB, so a very moderate Aurora instance such as `db.t3.large` (2 vCPUs, 8 GB RAM) will be more than enough.
An equivalent database instance on GCP or Azure should give similar results.

<img src="{{ site.baseurl }}/assets/img/reference_arch1.png" alt="ML and Research lakeFS reference architecture"/>

### Reference Architecture: Automated Production Pipelines

**Use case:** Manage multiple concurrent data pipelines using Apache Spark and Airflow.
Airflow DAGs start by creating a branch, for isolation and for CI/CD.
The data managed by lakeFS is structured, tabular data. The total dataset size is 10 PiB, spanning 500M objects.
The expected throughput is 10k reads/second + 2k writes/second across 100 concurrent branches.

**Environment:** lakeFS will be deployed on Kubernetes,
managed by [AWS EKS](https://aws.amazon.com/eks/){: target="_blank" },
with PostgreSQL on [AWS RDS](https://aws.amazon.com/rds/aurora/postgresql-features/){: target="_blank" }.

**Sizing:** Data pipelines tend to be bursty in nature:
reading in many objects concurrently, doing some calculation or aggregation, and then writing many objects concurrently.
The expected number of branches active in parallel is high,
with many Airflow DAGs running per day, each representing a moderate amount of uncommitted changes at any given point in time.
Let's assume 1,000 uncommitted writes/branch * 2,500 branches = ~2.5M records.

To support the expected throughput, looking at the benchmarking numbers above,
we're doing roughly 625 requests/core, so 24 cores should cover our peak traffic. We can deploy `6 * 4 CPU core pods`.

On to the PostgreSQL instance: at 2.5M records, the expected dataset size is `150 MiB (for 100k records) * 25 = 3750 MiB`.
To hold this in RAM, we'll need at least 15 GiB, so we'll go with a `db.r5.xlarge` (4 vCPUs, 32 GB RAM) Aurora instance.
An equivalent database instance on GCP or Azure should give similar results.
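The sizing arithmetic above can be sketched as a quick back-of-the-envelope script. This is an illustration only - the inputs and the ~625 requests/core figure are the assumptions stated in this scenario, not measurements of your deployment:

```shell
#!/bin/sh
# Back-of-the-envelope sizing for the automated-pipelines scenario above.
UNCOMMITTED_PER_BRANCH=1000   # assumed uncommitted writes per branch
BRANCHES=2500                 # assumed active branches
PEAK_RPS=12000                # 10k reads/second + 2k writes/second
RPS_PER_CORE=625              # rough figure from the benchmarks above

RECORDS=$((UNCOMMITTED_PER_BRANCH * BRANCHES))   # ~2.5M uncommitted records
KV_MIB=$((RECORDS / 100000 * 150))               # ~150 MiB per 100k records
RAM_MIB=$((KV_MIB * 4))                          # x4 rule over shared_buffers
CORES=$(( (PEAK_RPS + RPS_PER_CORE - 1) / RPS_PER_CORE ))

echo "KV dataset: ${KV_MIB} MiB, DB RAM: >=${RAM_MIB} MiB, lakeFS cores: >=${CORES}"
```

This yields 3750 MiB of KV data, ~15 GiB of database RAM, and a minimum of 20 cores at the stated rate; the architecture above rounds the cores up to 24 (6 pods x 4 cores each) for burst headroom.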
<img src="{{ site.baseurl }}/assets/img/reference_arch2.png" alt="Automated pipelines lakeFS reference architecture"/>

[s3-gateway]: {% link understand/architecture.md %}#s3-gateway
[lakectl-abuse]: {% link reference/cli.md %}#lakectl-abuse