
---
title: Sizing Guide
parent: How-To
description: This section provides a detailed sizing guide for deploying lakeFS.
redirect_from:
    - /architecture/sizing-guide.html
    - /understand/sizing-guide.html
---
# Sizing guide

Note: For a scalable managed lakeFS service with guaranteed SLAs, try [lakeFS Cloud](https://lakefs.cloud)
{: .note }

{% include toc.html %}

## System Requirements

### Operating Systems and ISA
lakeFS can run on macOS and Linux. Windows binaries are available but not rigorously tested -
we don't recommend deploying lakeFS to production on Windows.
Both x86_64 and arm64 architectures are supported on macOS and Linux.

### Memory and CPU requirements
lakeFS servers require a minimum of 512 MB of RAM and 1 CPU core.
For high throughput, additional CPU cores help scale requests across different cores.
"Expensive" operations such as large diffs or commits can take advantage of multiple cores.

### Network
If using the data APIs such as the [S3 Gateway][s3-gateway],
lakeFS requires enough network bandwidth to support the planned concurrent upload/download operations.
For most cloud providers, more powerful machines (i.e., more expensive and usually containing more CPU cores) also provide increased network bandwidth.

If using only the metadata APIs (for example, only using the Hadoop/Spark clients), network bandwidth requirements are minimal:
roughly 1 KB per request.

### Disk
lakeFS greatly benefits from fast local disks.
A lakeFS instance doesn't require any strong durability guarantees from the underlying storage,
as the disk is only ever used as a local caching layer for lakeFS metadata and not for long-term storage.
lakeFS is designed to work with [ephemeral disks](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-instance-store.html){: target="_blank" } -
these are usually NVMe-based and tied to the machine's lifecycle.
Ephemeral disks give lakeFS a very high throughput/cost ratio - likely the best achievable on a public cloud - so we recommend them.

A local cache of at least 512 MiB should be provided.
For large installations (managing >100 concurrently active branches, with >100M objects per commit),
we recommend allocating at least 10 GiB. Since the cache is a layer over a relatively slow storage (the object store),
it should be big enough to hold all commit metadata for actively referenced commits.
See [Important metrics](#important-metrics) below to understand how to size it.

### lakeFS KV Store

lakeFS uses a key-value database to manage branch references, authentication and authorization information,
and to keep track of currently uncommitted data across branches.
Please refer to the relevant driver tab for best practices, requirements and benchmarks.

#### Storage
The dataset stored in the metadata store is relatively modest, as most metadata is pushed down into the object store.
Required storage is mostly a factor of the amount of uncommitted writes across all branches at any given point in time:
roughly 150 MiB per 100,000 uncommitted writes.

We recommend starting at 10 GiB for a production deployment, as it will likely be more than enough.

<div class="tabs">
  <ul>
    <li><a href="#postgres-ram">PostgreSQL</a></li>
    <li><a href="#dynamodb-ram">DynamoDB</a></li>
  </ul>
<div markdown="1" id="postgres-ram">

**RAM**  
Since the data size is small, it's recommended to provide enough memory to hold the vast majority of that data in RAM.
Cloud providers save you the need to tune this parameter - it is set to a fixed percentage of the chosen instance's available RAM (25% on AWS RDS, 30% on Google Cloud SQL).
Check with your selected cloud provider for configuration and provisioning information for your database.
For self-managed database instances, follow these best practices:

Ideally, configure the [shared_buffers](https://www.postgresql.org/docs/current/runtime-config-resource.html#GUC-SHARED-BUFFERS){: target="_blank" }
of your PostgreSQL instances to be large enough to contain the currently active dataset.
Pick a database instance with enough RAM to accommodate this buffer size, at roughly 4x the size given for `shared_buffers`.
For example, an installation with ~500,000 uncommitted writes at any given time would require about 750 MiB of `shared_buffers`,
which in turn would require about 3 GiB of RAM.
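
This rule of thumb can be sketched as a quick calculation. A minimal sketch, assuming linear scaling at ~150 MiB of active data per 100,000 uncommitted writes and the rough 4x RAM-to-`shared_buffers` factor from above:

```python
def postgres_ram_mib(uncommitted_writes):
    """Return (shared_buffers, total RAM) estimates in MiB, assuming
    ~150 MiB of active data per 100,000 uncommitted writes and total
    instance RAM at roughly 4x shared_buffers (rule of thumb, not a
    hard requirement)."""
    shared_buffers = 150 * uncommitted_writes / 100_000
    return shared_buffers, shared_buffers * 4

buf, ram = postgres_ram_mib(500_000)
print(buf, ram)  # 750.0 3000.0 -> ~750 MiB shared_buffers, ~3 GiB RAM
```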

**CPU**  
PostgreSQL CPU cores help scale concurrent requests. 1 CPU core for every 5,000 requests/second is ideal.
</div>
<div markdown="1" id="dynamodb-ram">
lakeFS will create a DynamoDB table for you, which defaults to the on-demand capacity mode. There is no need to specify how much read and write throughput you expect your application to perform, as DynamoDB instantly accommodates your workloads as they ramp up or down.

You can customize the table settings to provisioned capacity, which allows you to manage and optimize your costs by allocating read/write capacity in advance (see [Benchmarks](#benchmarks)).

**Notes**:
* Using DynamoDB on-demand capacity might generate unwanted costs if the table is abused. If you'd like to cap your costs, make sure to change the table to use provisioned capacity instead.
* lakeFS doesn't manage the DynamoDB table's lifecycle. We've included the table creation to help evaluate the system with minimal effort; any change to the table beyond its creation will need to be handled manually or by 3rd-party tools.

**RAM**  
Managed by AWS.

**CPU**  
Managed by AWS.
</div>
</div>

## Scaling factors

Scaling lakeFS, like most data systems, moves across two axes: throughput of requests (amount per given timeframe) and latency (time to complete a single request).

### Understanding latency and throughput considerations

Most lakeFS operations are designed to be very low in latency.
Assuming a well-tuned local disk cache (see [Disk](#disk) above),
most critical path operations
(writing objects, requesting objects, deleting objects) are designed to complete in **<25ms at p90**.
Listing objects obviously requires accessing more data, but should always be on par with what the underlying object store can provide,
and in most cases, it's actually faster.
In the worst case, for a directory listing with 1,000 common prefixes returned, expect a latency of **75ms at p90**.

Managing branches (creating, listing and deleting them) involves constant-time operations, generally taking **<30ms at p90**.

Committing and merging can take longer, as they are proportional to the amount of **changes** introduced.
This is what makes lakeFS optimal for large Data Lakes -
the amount of changes introduced per commit usually stays relatively stable while the entire data set usually grows over time.
This means lakeFS will provide predictable performance:
committing 100 changes will take roughly the same amount of time whether the resulting commit contains 500 or 500 million objects.

See [Data Model]({% link understand/how/versioning-internals.md %}) for more information.

Scaling throughput depends very much on the amount of CPU cores available to lakeFS.
In many cases, it's easier to scale lakeFS across a fleet of smaller cloud instances (or containers)
than to scale up with machines that have many cores. In fact, lakeFS works well in both cases:
most critical path operations scale very well across machines.

## Benchmarks

<div class="tabs">
  <ul>
    <li><a href="#postgres-ram">PostgreSQL</a></li>
    <li><a href="#dynamodb-ram">DynamoDB</a></li>
  </ul>
<div markdown="1" id="postgres-ram">

### PostgreSQL
All benchmarks below were measured using 2 x [c5ad.4xlarge](https://aws.amazon.com/ec2/instance-types/c5/){: target="_blank" } instances
on [AWS us-east-1](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html#concepts-available-regions).
Similar results can be achieved on Google Cloud using a `c2-standard-16` machine type with an attached [local SSD](https://cloud.google.com/compute/docs/disks/local-ssd).
On Azure, you can use a `Standard_F16s_v2` virtual machine.

The PostgreSQL instance that was used is a [db.m6g.2xlarge](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Concepts.DBInstanceClass.html){: target="_blank" }
(8 vCPUs, 32 GB RAM). Equivalent machines on Google Cloud or Azure should yield similar results.

The example repository we tested against contains the metadata of a large lakeFS installation,
where each commit contains **~180,000,000** objects (representing ~7.5 Petabytes of data).

All tests are reproducible using the [lakectl abuse command][lakectl-abuse],
so use it to properly size and tune your setup. All tests are accompanied by the relevant `lakectl abuse` command that generated them.

### Random reads

This test generates random read requests to lakeFS in a given commit.
Paths are requested randomly from a file containing a set of preconfigured (and existing) paths.

**command executed:**

```shell
lakectl abuse random-read \
    --from-file randomly_selected_paths.txt \
    --amount 500000 \
    --parallelism 128 \
    lakefs://example-repo/<commit hash>
```

**Note:** lakeFS versions <= v0.33.1 use '@' (instead of '/') as the separator between repository and commit hash.

**Result Histogram (raw):**

```
Histogram (ms):
1	0
2	0
5	37945
7	179727
10	296964
15	399682
25	477502
50	499625
75	499998
100	499998
250	500000
350	500000
500	500000
750	500000
1000	500000
5000	500000
min	3
max	222
total	500000
```

So, 50% of all requests took <10ms, while 99.9% of them took <50ms.
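
A short sketch of how these percentile claims follow from the cumulative histogram (bucket counts copied from the output above; the histogram is cumulative, i.e. each bucket counts requests that completed within that many milliseconds):

```python
# hist[ms] = number of requests that completed within `ms` milliseconds,
# taken from the `lakectl abuse random-read` output above.
hist = {1: 0, 2: 0, 5: 37945, 7: 179727, 10: 296964, 15: 399682,
        25: 477502, 50: 499625, 75: 499998, 100: 499998, 250: 500000}
TOTAL = 500_000

def fraction_within(ms):
    """Fraction of all requests that completed within `ms` milliseconds."""
    return hist[ms] / TOTAL

print(fraction_within(10))          # 0.593928 -> more than 50% within 10ms
assert fraction_within(50) > 0.999  # 99.9% of requests within 50ms
```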

**throughput:**

The average throughput during the experiment was **10851.69 requests/second**.

### Random Writes

This test generates random write requests to a given lakeFS branch.
All the paths are pre-generated and don't overwrite each other (as overwrites are relatively rare in a Data Lake setup).

**command executed:**

```shell
lakectl abuse random-write \
    --amount 500000 \
    --parallelism 64 \
    lakefs://example-repo/main
```

**Note:** lakeFS versions <= v0.33.1 use '@' (instead of '/') as the separator between repository and branch.

**Result Histogram (raw):**

```
Histogram (ms):
1	0
2	0
5	30715
7	219647
10	455807
15	498144
25	499535
50	499742
75	499784
100	499802
250	500000
350	500000
500	500000
750	500000
1000	500000
5000	500000
min	3
max	233
total	500000
```

So, 50% of all requests took <10ms, while 99.9% of them took <25ms.

**throughput:**

The average throughput during the experiment was **7595.46 requests/second**.

### Branch creation

This test creates branches from a given reference.

**command executed:**

```shell
lakectl abuse create-branches \
    --amount 500000 \
    --branch-prefix "benchmark-" \
    --parallelism 256 \
    lakefs://example-repo/<commit hash>
```

**Note:** lakeFS versions <= v0.33.1 use '@' (instead of '/') as the separator between repository and commit hash.

**Result Histogram (raw):**

```
Histogram (ms):
1	0
2	1
5	5901
7	39835
10	135863
15	270201
25	399895
50	484932
75	497180
100	499303
250	499996
350	500000
500	500000
750	500000
1000	500000
5000	500000
min	2
max	304
total	500000
```

So, 50% of all requests took <15ms, while 99.8% of them took <100ms.

**throughput:**

The average throughput during the experiment was **7069.03 requests/second**.
</div>
<div markdown="1" id="dynamodb-ram">

### DynamoDB
All benchmarks below were measured using an m5.xlarge instance on AWS us-east-1.

The DynamoDB table that was used was provisioned with 500/1000 read/write capacity units.

The example repository we tested against contains the metadata of a large lakeFS installation, where each commit contains ~100,000,000 objects (representing ~3.5 Petabytes of data).

All tests are reproducible using the [lakectl abuse command][lakectl-abuse], so use it to properly size and tune your setup. All tests are accompanied by the relevant `lakectl abuse` command that generated them.

### Random reads

This test generates random read requests to lakeFS in a given commit.
Paths are requested randomly from a file containing a set of preconfigured (and existing) paths.

**command executed:**

```shell
lakectl abuse random-read \
    --from-file randomly_selected_paths.txt \
    --amount 500000 \
    --parallelism 128 \
    lakefs://example-repo/<commit hash>
```

**Result Histogram (raw): Provisioned read capacity units = 1000, provisioned write capacity units = 1000**

```
Histogram (ms):
1	0
2	0
5	0
7	0
10	0
15	0
25	122
50	47364
75	344489
100	460404
250	497912
350	498016
500	498045
750	498111
1000	498176
5000	499478
min	18
max	52272
total	500000
```

**Result Histogram (raw): Provisioned read capacity units = 500, provisioned write capacity units = 500**

```
Histogram (ms):
1	0
2	0
5	0
7	0
10	0
15	1
25	2672
50	239661
75	420171
100	470146
250	486603
350	486715
500	486789
750	487443
1000	488113
5000	493201
min	14
max	648085
total	499998
```

### Random Writes

This test generates random write requests to a given lakeFS branch.
All the paths are pre-generated and don't overwrite each other (as overwrites are relatively rare in a Data Lake setup).

**command executed:**

```shell
lakectl abuse random-write \
    --amount 500000 \
    --parallelism 64 \
    lakefs://example-repo/main
```

**Result Histogram (raw): Provisioned read capacity units = 1000, provisioned write capacity units = 1000**

```
Histogram (ms):
1	0
2	0
5	0
7	0
10	0
15	0
25	24
50	239852
75	458504
100	485225
250	493687
350	493872
500	493960
750	496239
1000	499194
5000	500000
min	23
max	4437
total	500000
```

**Result Histogram (raw): Provisioned read capacity units = 500, provisioned write capacity units = 500**

```
Histogram (ms):
1	0
2	0
5	0
7	0
10	0
15	0
25	174
50	266460
75	462641
100	484486
250	490633
350	490856
500	490984
750	492973
1000	495605
5000	498920
min	21
max	50157
total	500000
```

### Branch creation

This test creates branches from a given reference.

**command executed:**

```shell
lakectl abuse create-branches \
    --amount 500000 \
    --branch-prefix "benchmark-" \
    --parallelism 256 \
    lakefs://example-repo/<commit hash>
```

**Result Histogram (raw): Provisioned read capacity units = 1000, provisioned write capacity units = 1000**

```
Histogram (ms):
1	0
2	0
5	0
7	0
10	0
15	0
25	0
50	628
75	26153
100	58099
250	216160
350	307078
500	406165
750	422898
1000	431332
5000	475848
min	41
max	430725
total	490054
```

**Result Histogram (raw): Provisioned read capacity units = 500, provisioned write capacity units = 500**

```
Histogram (ms):
1	0
2	0
5	0
7	0
10	0
15	0
25	0
50	3132
75	155570
100	292745
250	384224
350	397258
500	431141
750	441360
1000	445597
5000	469538
min	39
max	760626
total	497520
```

</div>
</div>

## Important metrics

lakeFS exposes metrics using the [Prometheus protocol](https://prometheus.io/docs/introduction/overview/){: target="_blank" }.
Every lakeFS instance exposes a `/metrics` endpoint that can be scraped to collect them.

Here are a few notable metrics to keep track of when sizing lakeFS:

`api_requests_total` - Tracks throughput of API requests over time.

`api_request_duration_seconds` - Histogram of latency per operation type.

`gateway_request_duration_seconds` - Histogram of latency per [S3 Gateway][s3-gateway] operation.

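As a minimal sketch of consuming these metrics outside of Prometheus, the snippet below sums a counter across its label sets from the text format served at `/metrics`. The sample payload is illustrative, not actual lakeFS output:

```python
# Illustrative sample of the Prometheus text exposition format;
# real lakeFS output will carry different labels and values.
SAMPLE = """# TYPE api_requests_total counter
api_requests_total{code="200",method="get"} 12345
api_requests_total{code="404",method="get"} 67
"""

def counter_total(payload, name):
    """Sum all samples of a counter metric across its label sets."""
    total = 0.0
    for line in payload.splitlines():
        # A sample line is "<name>{labels} <value>" or "<name> <value>".
        if line.startswith(name + "{") or line.startswith(name + " "):
            total += float(line.rsplit(" ", 1)[1])
    return total

print(counter_total(SAMPLE, "api_requests_total"))  # 12412.0
```
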
<div class="tabs">
  <ul>
    <li><a href="#postgres">PostgreSQL</a></li>
    <li><a href="#dynamodb">DynamoDB</a></li>
  </ul>
<div markdown="1" id="dynamodb">

`dynamo_request_duration_seconds` - Time spent doing DynamoDB requests.

`dynamo_consumed_capacity_total` - The capacity units consumed, by operation.

`dynamo_failures_total` - The total number of errors encountered while working with the KV store.

</div>
</div>

## Reference architectures

Below are a few **example architectures for lakeFS deployment**.

### Reference Architecture: Data Science/Research environment

**Use case:** Manage machine learning or algorithm development.
Use lakeFS branches to achieve both isolation and reproducibility of experiments.
Data managed by lakeFS includes both structured tabular data
and unstructured sensor and image data used for training.
Assume a team of 20-50 researchers, with a dataset size of 500 TiB across 20M objects.

**Environment:** lakeFS will be deployed on Kubernetes,
managed by [AWS EKS](https://aws.amazon.com/eks/){: target="_blank" },
with PostgreSQL on [AWS RDS Aurora](https://aws.amazon.com/rds/aurora/postgresql-features/){: target="_blank" }.

**Sizing:** Since most of the work is done by humans (vs. automated pipelines), most experiments tend to be small in scale,
reading and writing 10s to 1000s of objects.
The expected amount of branches active in parallel is relatively low, around 1-2 per user,
each representing a small amount of uncommitted changes at any given point in time.
Let's assume ~100 active branches with 5,000 uncommitted writes per branch = ~500k uncommitted records.

To support the expected throughput, a single moderate lakeFS instance should be more than enough,
since requests per second would be on the order of 10s to 100s.
For high availability, we'll deploy 2 pods with 1 CPU core and 1 GiB of RAM each.

The PostgreSQL instance is expected to hold a very small dataset:
at ~500k uncommitted records, the expected dataset size is `150 MiB (per 100k records) * 5 = 750 MiB`.
To ensure we have enough RAM to hold this, we'll need about 3 GiB, so a very moderate Aurora instance such as `db.t3.large` (2 vCPUs, 8 GB RAM) will be more than enough.
An equivalent database instance on GCP or Azure should give similar results.
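
The arithmetic above, sketched end to end (the ~100 active branches figure is an assumption derived from 20-50 users with 1-2 branches each):

```python
BRANCHES = 100             # assumed: 20-50 researchers x 1-2 branches each
WRITES_PER_BRANCH = 5_000  # uncommitted writes per branch
MIB_PER_100K = 150         # KV dataset rule of thumb from the Storage section

records = BRANCHES * WRITES_PER_BRANCH
dataset_mib = MIB_PER_100K * records / 100_000
ram_mib = dataset_mib * 4  # total RAM at roughly 4x shared_buffers

print(records, dataset_mib, ram_mib)  # 500000 750.0 3000.0 -> ~3 GiB RAM
```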

<img src="{{ site.baseurl }}/assets/img/reference_arch1.png" alt="ML and Research lakeFS reference architecture"/>


### Reference Architecture: Automated Production Pipelines

**Use case:** Manage multiple concurrent data pipelines using Apache Spark and Airflow.
Airflow DAGs start by creating a branch, for isolation and for CI/CD.
Data managed by lakeFS is structured, tabular data. The total dataset size is 10 PiB, spanning 500M objects.
The expected throughput is 10k reads/second + 2k writes/second across 100 concurrent branches.

**Environment:** lakeFS will be deployed on Kubernetes,
managed by [AWS EKS](https://aws.amazon.com/eks/){: target="_blank" },
with PostgreSQL on [AWS RDS](https://aws.amazon.com/rds/aurora/postgresql-features/){: target="_blank" }.

**Sizing:** Data pipelines tend to be bursty in nature:
reading in a lot of objects concurrently, doing some calculation or aggregation, and then writing many objects concurrently.
The expected amount of branches active in parallel is high,
with many Airflow DAGs running per day, each representing a moderate amount of uncommitted changes at any given point in time.
Let's assume 1,000 uncommitted writes/branch * 2,500 branches = ~2.5M records.

To support the expected throughput, looking at the benchmark numbers above,
we're doing roughly 625 requests/core, so 24 cores should cover our peak traffic. We can deploy `6 * 4 CPU core pods`.

On to the PostgreSQL instance - at ~2.5M records, the expected dataset size is `150 MiB (per 100k records) * 25 = 3,750 MiB`.
To ensure we have enough RAM to hold this, we'll need at least 15 GiB, so we'll go with a `db.r5.xlarge` (4 vCPUs, 32 GB RAM) Aurora instance.
An equivalent database instance on GCP or Azure should give similar results.
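
Again as a quick sketch (the 625 requests/second/core figure is the rough per-core estimate derived from the benchmarks above):

```python
READS_PER_SEC = 10_000
WRITES_PER_SEC = 2_000
REQS_PER_CORE = 625        # rough per-core throughput from the benchmarks

cores_needed = (READS_PER_SEC + WRITES_PER_SEC) / REQS_PER_CORE
print(cores_needed)        # 19.2 -> rounded up to 6 pods x 4 cores = 24

records = 1_000 * 2_500    # uncommitted writes/branch * active branches
dataset_mib = 150 * records / 100_000
print(records, dataset_mib, dataset_mib * 4)  # 2500000 3750.0 15000.0
```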

<img src="{{ site.baseurl }}/assets/img/reference_arch2.png" alt="Automated pipelines lakeFS reference architecture"/>

[s3-gateway]:  {% link understand/architecture.md %}#s3-gateway
[lakectl-abuse]:  {% link reference/cli.md %}#lakectl-abuse