---
title: Architecture
parent: Understanding lakeFS
description: lakeFS architecture overview. Learn more about lakeFS components, including its S3 API gateway.
redirect_from:
  - /architecture/index.html
  - /architecture/overview.html
---

# lakeFS Architecture

lakeFS is distributed as a single binary encapsulating several logical services.

The server itself is stateless, meaning you can easily add more instances to handle a bigger load.

{% include toc_2-3.html %}

### Object Storage

lakeFS stores data in object stores. Supported stores include:

- AWS S3
- Google Cloud Storage
- Azure Blob Storage
- MinIO
- NetApp StorageGRID
- Ceph

### Metadata Storage

In addition, a key-value store is used for metadata. Supported databases include PostgreSQL, DynamoDB, and CosmosDB; instructions for deploying such a database on AWS can be found [here][dynamodb-permissions].

Additional information on the data format can be found in [Versioning internals](./how/versioning-internals.md) and [Internal database structure](./how/kv.md).

### Load Balancing

lakeFS is accessed over HTTP.
It exposes a frontend UI, an [OpenAPI server](#openapi-server), and an S3-compatible service (see [S3 Gateway](#s3-gateway) below).
lakeFS serves all three endpoints on a single port, so for most use cases a single load balancer pointing to the lakeFS server(s) will suffice.

## lakeFS Components

### S3 Gateway

The S3 Gateway is the layer in lakeFS responsible for compatibility with S3. It implements a compatible subset of the S3 API to ensure most data systems can use lakeFS as a drop-in replacement for S3.

See the [S3 API Reference]({% link reference/s3.md %}) section for information on supported API operations.
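For illustration, here is what working with a repository through the S3 Gateway might look like from Python. This is a minimal sketch, assuming a hypothetical endpoint, credentials, repository, and branch: a lakeFS repository is addressed as an S3 bucket, and the branch name is the first component of the object key.

```python
import boto3

# Point a standard S3 client at the lakeFS endpoint instead of AWS,
# using lakeFS access-key credentials (all values here are hypothetical).
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",
    aws_access_key_id="AKIAIOSFODNN7EXAMPLE",
    aws_secret_access_key="...",
)

# The repository is addressed as the bucket; the branch is the first
# component of the object key.
s3.put_object(
    Bucket="example-repo",
    Key="main/datasets/events/records.parquet",
    Body=b"...",
)

# List objects on the same branch.
response = s3.list_objects_v2(Bucket="example-repo", Prefix="main/datasets/")
for obj in response.get("Contents", []):
    print(obj["Key"])
```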
### OpenAPI Server

The Swagger ([OpenAPI](https://swagger.io/docs/specification/basic-structure/){:target="_blank"}) server exposes the full set of lakeFS operations (see [Reference]({% link reference/api.md %})). This includes basic CRUD operations against repositories and objects, as well as versioning-related operations such as branching, merging, committing, and reverting changes to data.

### Storage Adapter

The Storage Adapter is an abstraction layer for communicating with any underlying object store.
Its implementations allow compatibility with many types of underlying storage such as S3, GCS, Azure Blob Storage, or non-production usages such as the local storage adapter.

See the [roadmap][roadmap] for information on future plans for storage compatibility.

### Graveler

The Graveler handles lakeFS versioning by translating lakeFS addresses to the actual stored objects.
To learn about the data model used to store lakeFS metadata, see the [versioning internals page]({% link understand/how/versioning-internals.md %}).

### Authentication & Authorization Service

The Auth service handles the creation, management, and validation of user credentials and [RBAC policies](https://en.wikipedia.org/wiki/Role-based_access_control){:target="_blank"}.

The credential scheme and request-signing logic are compatible with AWS IAM (both [SIGv2](https://docs.aws.amazon.com/general/latest/gr/signature-version-2.html) and [SIGv4](https://docs.aws.amazon.com/general/latest/gr/signature-version-4.html)).

Currently, the Auth service manages its own database of users and credentials and doesn't use IAM in any way.

### Hooks Engine

The Hooks Engine enables CI/CD for data by triggering user-defined [Actions][data-quality-gates] that run during commit/merge.

### UI

The UI layer is a simple browser-based client that uses the OpenAPI server. It allows management, exploration, and data access to repositories, branches, commits, and objects in the system.

## Applications

As a rule of thumb, lakeFS supports any S3-compatible application. This means that many common data applications work with lakeFS out of the box.
Check out our [integrations]({% link integrations/index.md %}) to learn more.

## lakeFS Clients

Some data applications benefit from deeper integrations with lakeFS to support different use cases or enhanced functionality provided by lakeFS clients.

### OpenAPI Generated SDKs

The OpenAPI specification can be used to generate lakeFS clients for many programming languages.
For example, the [Python lakefs-client](https://pypi.org/project/lakefs-client/) and the [Java client](https://search.maven.org/artifact/io.lakefs/api-client) are published with every new lakeFS release. A minimal sketch of calling the API directly appears at the end of this page.

### lakectl

[lakectl]({% link reference/cli.md %}) is a CLI tool that enables lakeFS operations using the lakeFS API from your preferred terminal.

### Spark Metadata Client

The lakeFS [Spark Metadata Client]({% link reference/spark-client.md %}) makes it easy to perform operations related to lakeFS metadata at scale. Examples include [garbage collection]({% link howto/garbage-collection/index.md %}) and [exporting data from lakeFS]({% link howto/export.md %}).

### lakeFS Hadoop FileSystem

Thanks to the [S3 Gateway](#s3-gateway), it's possible to interact with lakeFS using Hadoop's S3AFileSystem,
but due to limitations of the S3 API, doing so requires reading and writing data objects through the lakeFS server.
Using [lakeFSFileSystem][hadoopfs] improves the performance of Spark ETL jobs by executing metadata operations on the lakeFS server
and performing all data operations directly against the same underlying object store that lakeFS uses. A configuration sketch appears below.

[data-quality-gates]: {% link understand/use_cases/cicd_for_data.md %}#using-hooks-as-data-quality-gates
[dynamodb-permissions]: {% link howto/deploy/aws.md %}#grant-dynamodb-permissions-to-lakefs
[roadmap]: {% link project/index.md %}#roadmap
[hadoopfs]: {% link integrations/spark.md %}#lakefs-hadoop-filesystem
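To make the last point concrete, here is a minimal sketch of configuring the lakeFS Hadoop FileSystem in PySpark. The endpoint, credentials, repository, and branch names are hypothetical, and the lakeFS Hadoop FileSystem assembly jar is assumed to be on the classpath; see the [Spark integration guide][hadoopfs] for the authoritative setup.

```python
from pyspark.sql import SparkSession

# Hypothetical values throughout; adjust the endpoint, credentials, and
# paths to your deployment.
spark = (
    SparkSession.builder
    .appName("lakefs-hadoopfs-sketch")
    # Route lakefs:// URIs through lakeFSFileSystem: metadata operations go
    # to the lakeFS API, data operations go directly to the object store.
    .config("spark.hadoop.fs.lakefs.impl", "io.lakefs.LakeFSFileSystem")
    .config("spark.hadoop.fs.lakefs.endpoint", "https://lakefs.example.com/api/v1")
    .config("spark.hadoop.fs.lakefs.access.key", "AKIAIOSFODNN7EXAMPLE")
    .config("spark.hadoop.fs.lakefs.secret.key", "...")
    # Credentials for the underlying object store, used for direct data I/O.
    .config("spark.hadoop.fs.s3a.access.key", "...")
    .config("spark.hadoop.fs.s3a.secret.key", "...")
    .getOrCreate()
)

# Paths take the form lakefs://<repository>/<branch-or-ref>/<path>
df = spark.read.parquet("lakefs://example-repo/main/datasets/events/")
df.write.parquet("lakefs://example-repo/my-feature-branch/datasets/events/")
```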
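Finally, the OpenAPI server described earlier can also be called directly over HTTP, which is what the generated SDKs do under the hood. Below is a minimal sketch with a hypothetical endpoint and credentials; the `/api/v1/repositories` path and basic-auth scheme follow the published API reference.

```python
import requests

# lakeFS API credentials are sent as HTTP basic auth (access key ID and
# secret access key); all values here are hypothetical.
resp = requests.get(
    "https://lakefs.example.com/api/v1/repositories",
    auth=("AKIAIOSFODNN7EXAMPLE", "..."),
)
resp.raise_for_status()

# List endpoints return paginated results under a "results" key.
for repo in resp.json()["results"]:
    print(repo["id"])
```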