---
title: Azure
grand_parent: How-To
parent: Install lakeFS
description: How to deploy and set up a production-suitable lakeFS environment on Microsoft Azure
redirect_from:
   - /setup/storage/blob.html
   - /deploy/azure.html
next:  ["Import data into your installation", "/howto/import.html"]
---

# Deploy lakeFS on Azure

{: .tip }
> The instructions given here are for a self-managed deployment of lakeFS on Azure.
>
> For a hosted lakeFS service with guaranteed SLAs, try [lakeFS Cloud](https://lakefs.cloud)
When you deploy lakeFS on Azure, these are the options available to you:

![](/assets/img/deploy/deploy-on-azure.excalidraw.png)

This guide walks you through the available options and how to configure them, finishing with configuring and running lakeFS itself and creating your first repository.

{% include toc.html %}

⏰ Expected deployment time: 25 min
{: .note }

## 1. Object Storage

lakeFS supports the following [Azure Storage](https://learn.microsoft.com/en-us/azure/storage/common/storage-introduction) types:

1. [Azure Blob Storage](https://azure.microsoft.com/en-gb/products/storage/blobs)
2. [Azure Data Lake Storage Gen2](https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction) ([HNS](https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace))

Data Lake Storage Gen1 is not supported.

## 2. Authentication Method

lakeFS supports two ways to authenticate with Azure.

<div class="tabs">
  <ul>
    <li><a href="#iba">Identity Based Authentication (recommended)</a></li>
    <li><a href="#sac">Storage Account Credentials</a></li>
  </ul>

<div markdown="1" id="iba">

lakeFS uses environment variables to determine the credentials to use for authentication. The following authentication methods are supported:

1. Managed Service Identity (MSI)
1. Service Principal RBAC
1. Azure CLI

For deployments inside the Azure ecosystem, it is recommended to use a managed identity.

More information on authentication methods and environment variables can be found [here](https://learn.microsoft.com/en-us/azure/developer/go/azure-sdk-authentication).
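
With identity-based authentication, the `blockstore` section of the lakeFS configuration carries no credentials at all; they are resolved from the environment at runtime. A minimal sketch, matching the configuration used later in this guide:

```yaml
blockstore:
  type: azure
  azure:
    # no storage_account / storage_access_key here - credentials come from the
    # environment (MSI, Service Principal environment variables, or Azure CLI)
```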

### How to Create a Service Principal for a Resource Group

It is recommended to create a resource group that consists of all the resources lakeFS should have access to.

Using a resource group allows services to be added to or removed from the group dynamically, effectively granting or revoking lakeFS's access to these resources without requiring any configuration changes in lakeFS or providing it with any additional credentials.

The minimal role required for the service principal is "Storage Blob Data Contributor".

The following Azure CLI command creates a service principal for a resource group called "lakeFS" with permission to access (read/write/delete)
Blob Storage resources in the resource group and with an expiry of 5 years:

``` shell
az ad sp create-for-rbac \
  --role "Storage Blob Data Contributor" \
  --scopes /subscriptions/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/resourceGroups/lakeFS --years 5

Creating 'Storage Blob Data Contributor' role assignment under scope '/subscriptions/947382ea-681a-4541-99ab-b718960c6289/resourceGroups/lakeFS'
The output includes credentials that you must protect. Be sure that you do not include these credentials in your code or check the credentials into your source control. For more information, see https://aka.ms/azadsp-cli
{
  "appId": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
  "displayName": "azure-cli-2023-01-30-06-18-30",
  "password": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
  "tenant": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
}
```

The command output should be used to populate the following environment variables:

```
AZURE_CLIENT_ID      =  $appId
AZURE_TENANT_ID      =  $tenant
AZURE_CLIENT_SECRET  =  $password
```
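
For example, on a VM you might export these before starting lakeFS (the bracketed values are placeholders for the fields from the command output above):

```sh
export AZURE_CLIENT_ID="[appId]"
export AZURE_TENANT_ID="[tenant]"
export AZURE_CLIENT_SECRET="[password]"
```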

**Note:** Service Principal credentials have an expiry date and lakeFS will lose access to resources unless credentials are renewed on time.
{: .note }

**Note:** It is possible to provide both account-based credentials and environment variables to lakeFS. In that case, lakeFS will use
the account credentials for any access to data located in the given account, and will try to use the identity credentials for any data located outside the given account.
{: .note }

</div>

<div markdown="2" id="sac">

Storage account credentials can be set directly in the lakeFS configuration using the following parameters:

* `blockstore.azure.storage_account`
* `blockstore.azure.storage_access_key`

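A minimal sketch of the corresponding lakeFS configuration, with placeholder values:

```yaml
blockstore:
  type: azure
  azure:
    storage_account: "[YOUR_STORAGE_ACCOUNT]"
    storage_access_key: "[YOUR_ACCESS_KEY]"
```
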
### Limitations

Please note that using this authentication method limits lakeFS to the scope of the given storage account.

Specifically, **the following operations will not work**:

1. Import of data from different storage accounts
1. Copy/Read/Write of data that was imported from a different storage account
1. Create pre-signed URL for data that was imported from a different storage account

</div>
</div>

## 3. K/V Store

lakeFS stores metadata in a database for its versioning engine. This is done via a Key-Value interface that can be implemented on any DB engine, and lakeFS comes with several built-in driver implementations (you can read more about it [here](https://docs.lakefs.io/understand/how/kv.html)). The database used doesn't _have_ to be a dedicated K/V database.

<div class="tabs">
  <ul>
    <li><a href="#cosmosdb">CosmosDB (Beta)</a></li>
    <li><a href="#postgres">PostgreSQL</a></li>
  </ul>
  <div markdown="1" id="cosmosdb">

[CosmosDB](https://azure.microsoft.com/en-us/products/cosmos-db/) is a managed database service provided by Azure. lakeFS supports [CosmosDB For NoSQL](https://learn.microsoft.com/en-GB/azure/cosmos-db/nosql/) as a database backend.

1. Follow the official [Azure documentation](https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/how-to-create-account?tabs=azure-cli)
   on how to create a CosmosDB account for NoSQL and connect to it.
1. Once your CosmosDB account is set up, you can create a Database for
   lakeFS. For lakeFS ACID guarantees in single-region deployments, make sure
   to select [Bounded staleness consistency](https://learn.microsoft.com/en-us/azure/cosmos-db/consistency-levels#bounded-staleness-consistency).
1. Create a new container in the database and set `partitionKey` as the
   Partition key (case sensitive).
1. Pass the endpoint, database name and container name to lakeFS as
   described in the [configuration guide][config-reference-azure-block]
   (see the sketch after this list).
   You can either pass the CosmosDB account's read-write key to lakeFS, or
   use a managed identity to authenticate to CosmosDB, as described
   [earlier](#iba).

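A rough sketch of what the resulting database section of the lakeFS configuration might look like. The key names below are assumptions for illustration only; the [configuration guide][config-reference-azure-block] is the authoritative reference:

```yaml
database:
  type: "cosmosdb"
  cosmosdb:
    # CosmosDB account endpoint, database and container created above (placeholders)
    endpoint: "https://[YOUR_ACCOUNT].documents.azure.com:443/"
    database: "[YOUR_DATABASE]"
    container: "[YOUR_CONTAINER]"
    # account read-write key; omit when authenticating with a managed identity
    key: "[YOUR_READ_WRITE_KEY]"
```
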
A note on CosmosDB capacity modes: lakeFS's usage of CosmosDB is still in its
early days and has not been battle-tested. Both capacity modes, Provisioned
and Serverless, have been tested for some workloads and passed with flying
colors. The Provisioned mode was configured with 400-4000 RU/s.
{: .note }

</div>
  <div markdown="2" id="postgres">

Below we show you how to create a database on Azure Database for PostgreSQL, but you can use any PostgreSQL database as long as it's accessible by your lakeFS installation.

If you already have a database, take note of the connection string and skip to the [next step](#run-the-lakefs-server).

1. Follow the official [Azure documentation](https://docs.microsoft.com/en-us/azure/postgresql/quickstart-create-server-database-portal){: target="_blank" } on how to create a PostgreSQL instance and connect to it.
   Make sure that you're using PostgreSQL version >= 11.
1. Once your Azure Database for PostgreSQL server is set up and the server is in the _Available_ state, take note of the endpoint and username.
   ![Azure postgres Connection String]({{ site.baseurl }}/assets/img/azure_postgres_conn.png)
1. Make sure your Access control roles allow you to connect to the database instance.

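The connection string you will pass to lakeFS in the next step is a standard PostgreSQL URI. An illustrative example (server name, user, and password are placeholders, not values from this guide):

```
postgres://[USERNAME]:[PASSWORD]@[SERVER_NAME].postgres.database.azure.com:5432/postgres?sslmode=require
```
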
</div>
</div>

## 4. Run the lakeFS server

Now that you've chosen and configured object storage, a K/V store, and authentication, you're ready to configure and run lakeFS. There are three different ways you can run lakeFS:

<div class="tabs">
  <ul>
    <li><a href="#vm">Azure VM</a></li>
    <li><a href="#docker">Docker</a></li>
    <li><a href="#aks">Azure Kubernetes Service (AKS)</a></li>
  </ul>
  <div markdown="1" id="vm">

Connect to your VM instance using SSH:

1. Create a `config.yaml` on your VM, with the following parameters:

   ```yaml
   ---
   database:
     type: "postgres"
     postgres:
       connection_string: "[DATABASE_CONNECTION_STRING]"

   auth:
     encrypt:
       # replace this with a randomly-generated string. Make sure to keep it safe!
       secret_key: "[ENCRYPTION_SECRET_KEY]"

   blockstore:
     type: azure
     azure:
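       # If you authenticate with storage account credentials (rather than an identity
       # resolved from the environment), uncomment and fill in the following values:
       # storage_account: "[YOUR_STORAGE_ACCOUNT]"
       # storage_access_key: "[YOUR_ACCESS_KEY]"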
   ```
1. [Download the binary][downloads] to run on the VM.
1. Run the `lakefs` binary:

   ```sh
   lakefs --config config.yaml run
   ```

**Note:** It's preferable to run the binary as a service using systemd or your operating system's facilities.
{: .note }
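
For example, a minimal systemd unit might look like the following. The paths (`/usr/local/bin/lakefs`, `/etc/lakefs/config.yaml`) and the `lakefs` user are illustrative assumptions - adjust them to your setup:

```
[Unit]
Description=lakeFS server
After=network.target

[Service]
# assumed binary and configuration locations - not values from this guide
ExecStart=/usr/local/bin/lakefs --config /etc/lakefs/config.yaml run
User=lakefs
Restart=on-failure

[Install]
WantedBy=multi-user.target
```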

</div>
<div markdown="2" id="docker">

To support container-based environments, you can configure lakeFS using environment variables. Here is a `docker run`
command to demonstrate starting lakeFS using Docker:

```sh
docker run \
  --name lakefs \
  -p 8000:8000 \
  -e LAKEFS_DATABASE_TYPE="postgres" \
  -e LAKEFS_DATABASE_POSTGRES_CONNECTION_STRING="[DATABASE_CONNECTION_STRING]" \
  -e LAKEFS_AUTH_ENCRYPT_SECRET_KEY="[ENCRYPTION_SECRET_KEY]" \
  -e LAKEFS_BLOCKSTORE_TYPE="azure" \
  -e LAKEFS_BLOCKSTORE_AZURE_STORAGE_ACCOUNT="[YOUR_STORAGE_ACCOUNT]" \
  -e LAKEFS_BLOCKSTORE_AZURE_STORAGE_ACCESS_KEY="[YOUR_ACCESS_KEY]" \
  treeverse/lakefs:latest run
```

See the [reference][config-envariables] for a complete list of environment variables.

</div>
<div markdown="2" id="aks">

You can install lakeFS on Kubernetes using a [Helm chart](https://github.com/treeverse/charts/tree/master/charts/lakefs).

To install lakeFS with Helm:

1. Copy the Helm values file relevant for Azure Blob:

   ```yaml
   secrets:
       # replace this with the connection string of the database you created in a previous step:
       databaseConnectionString: [DATABASE_CONNECTION_STRING]
       # replace this with a randomly-generated string
       authEncryptSecretKey: [ENCRYPTION_SECRET_KEY]
   lakefsConfig: |
       blockstore:
         type: azure
         azure:
           # If you chose to authenticate via access key, uncomment the following lines and insert the values from the previous step
           # storage_account: [your storage account]
           # storage_access_key: [your access key]
   ```
1. Fill in the missing values and save the file as `conf-values.yaml`. For more configuration options, see our Helm chart [README](https://github.com/treeverse/charts/blob/master/charts/lakefs/README.md#custom-configuration){:target="_blank"}.

   The `lakefsConfig` parameter is the lakeFS configuration documented [here](https://docs.lakefs.io/reference/configuration.html) but without sensitive information.
   Sensitive information like `databaseConnectionString` is given through separate parameters, and the chart will inject it into Kubernetes secrets.
   {: .note }

1. In the directory where you created `conf-values.yaml`, run the following commands:

   ```bash
   # Add the lakeFS repository
   helm repo add lakefs https://charts.lakefs.io
   # Deploy lakeFS
   helm install my-lakefs lakefs/lakefs -f conf-values.yaml
   ```

   *my-lakefs* is the [Helm Release](https://helm.sh/docs/intro/using_helm/#three-big-concepts) name.

### Load balancing

To configure a load balancer to direct requests to the lakeFS servers, you can use the `LoadBalancer` Service type or a Kubernetes Ingress.
By default, lakeFS operates on port 8000 and exposes a `/_health` endpoint that you can use for health checks.

💡 The NGINX Ingress Controller by default limits the client body size to 1 MiB.
Some clients use bigger chunks to upload objects - for example, multipart upload to lakeFS using the [S3-compatible Gateway][s3-gateway] or
a simple PUT request using the [OpenAPI Server][openapi].
Check out the NGINX [documentation](https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/#custom-max-body-size) for increasing the limit, or an example of an NGINX configuration with [MinIO](https://docs.min.io/docs/setup-nginx-proxy-with-minio.html).
{: .note }
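
If you route traffic through the NGINX Ingress Controller, the body-size limit can be raised or disabled with the `proxy-body-size` annotation. A minimal sketch; the Ingress name, host, and the `my-lakefs` Service name are illustrative assumptions based on the Helm release above:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: lakefs
  annotations:
    # "0" disables the client body size limit (the default is 1m)
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
spec:
  ingressClassName: nginx
  rules:
    - host: lakefs.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-lakefs
                port:
                  number: 8000
```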

</div>
</div>

{% include_relative includes/setup.md %}

[config-envariables]:  {% link reference/configuration.md %}#using-environment-variables
[config-reference-azure-block]:  {% link reference/configuration.md %}#example-azure-blob-storage
[downloads]:  {% link index.md %}#downloads
[openapi]:  {% link understand/architecture.md %}#openapi-server
[s3-gateway]:  {% link understand/architecture.md %}#s3-gateway
[understand-repository]:  {% link understand/model.md %}#repository
[integration-hadoopfs]:  {% link integrations/spark.md %}#lakefs-hadoop-filesystem
[understand-commits]:  {% link understand/how/versioning-internals.md %}#constructing-a-consistent-view-of-the-keyspace-ie-a-commit