---
title: Azure
grand_parent: How-To
parent: Install lakeFS
description: How to deploy and set up a production-suitable lakeFS environment on Microsoft Azure
redirect_from:
  - /setup/storage/blob.html
  - /deploy/azure.html
next: ["Import data into your installation", "/howto/import.html"]
---

# Deploy lakeFS on Azure

{: .tip }
> The instructions given here are for a self-managed deployment of lakeFS on Azure.
>
> For a hosted lakeFS service with guaranteed SLAs, try [lakeFS Cloud](https://lakefs.cloud)

When you deploy lakeFS on Azure, you choose an object store, an authentication method, a K/V store, and a way to run the lakeFS server.

This guide walks you through the options available and how to configure them, finishing with configuring and running lakeFS itself and creating your first repository.

{% include toc.html %}

⏰ Expected deployment time: 25 min
{: .note }

## 1. Object Storage

lakeFS supports the following [Azure Storage](https://learn.microsoft.com/en-us/azure/storage/common/storage-introduction) types:

1. [Azure Blob Storage](https://azure.microsoft.com/en-gb/products/storage/blobs)
2. [Azure Data Lake Storage Gen2](https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-introduction) ([HNS](https://learn.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-namespace))

Azure Data Lake Storage Gen1 is not supported.

## 2. Authentication Method

lakeFS supports two ways to authenticate with Azure.

<div class="tabs">
<ul>
  <li><a href="#iba">Identity Based Authentication (recommended)</a></li>
  <li><a href="#sac">Storage Account Credentials</a></li>
</ul>

<div markdown="1" id="iba">

lakeFS determines the credentials to use for authentication from environment variables. The following authentication methods are supported:

1. Managed Service Identity (MSI)
1. Service Principal RBAC
1. Azure CLI

For deployments inside the Azure ecosystem, it is recommended to use a managed identity.

More information on authentication methods and environment variables can be found [here](https://learn.microsoft.com/en-us/azure/developer/go/azure-sdk-authentication).

### How to Create a Service Principal for a Resource Group

It is recommended to create a resource group that contains all the resources lakeFS should have access to.

Using a resource group allows you to add or remove services from the group dynamically, effectively granting or revoking lakeFS's access to these resources without changing lakeFS's configuration or providing lakeFS with any additional credentials.

The minimal role required for the service principal is "Storage Blob Data Contributor".

The following Azure CLI command creates a service principal for a resource group called "lakeFS", with permission to access (read/write/delete) Blob Storage resources in the resource group and with an expiry of 5 years:

``` shell
az ad sp create-for-rbac \
    --role "Storage Blob Data Contributor" \
    --scopes /subscriptions/XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX/resourceGroups/lakeFS --years 5

Creating 'Storage Blob Data Contributor' role assignment under scope '/subscriptions/947382ea-681a-4541-99ab-b718960c6289/resourceGroups/lakeFS'
The output includes credentials that you must protect. Be sure that you do not include these credentials in your code or check the credentials into your source control. For more information, see https://aka.ms/azadsp-cli
{
  "appId": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
  "displayName": "azure-cli-2023-01-30-06-18-30",
  "password": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX",
  "tenant": "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"
}
```

The command output should be used to populate the following environment variables:

```
AZURE_CLIENT_ID      = $appId
AZURE_TENANT_ID      = $tenant
AZURE_CLIENT_SECRET  = $password
```
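For example, on the machine that will run lakeFS, you might export these variables before starting the server. This is a minimal sketch; the values are placeholders standing in for the fields of the command output above:

``` shell
export AZURE_CLIENT_ID="XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"      # the appId field
export AZURE_TENANT_ID="XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"      # the tenant field
export AZURE_CLIENT_SECRET="XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"  # the password field
```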
**Note:** Service Principal credentials have an expiry date, and lakeFS will lose access to resources unless the credentials are renewed on time.
{: .note }

**Note:** It is possible to provide lakeFS with both storage account credentials and identity environment variables. In that case, lakeFS will use the account credentials for any access to data located in the given account, and will try to use the identity credentials for any data located outside the given account.
{: .note }

</div>

<div markdown="2" id="sac">

Storage account credentials can be set directly in the lakeFS configuration using the following parameters:

* `blockstore.azure.storage_account`
* `blockstore.azure.storage_access_key`

### Limitations

Please note that using this authentication method limits lakeFS to the scope of the given storage account.

Specifically, **the following operations will not work**:

1. Importing data from different storage accounts
1. Copying/reading/writing data that was imported from a different storage account
1. Creating a pre-signed URL for data that was imported from a different storage account

</div>
</div>

## 3. K/V Store

lakeFS stores metadata in a database for its versioning engine. It does so via a key-value interface that can be implemented on top of any database engine, and lakeFS comes with several built-in driver implementations (you can read more about it [here](https://docs.lakefs.io/understand/how/kv.html)). The database used doesn't _have_ to be a dedicated K/V database.

<div class="tabs">
<ul>
  <li><a href="#cosmosdb">CosmosDB (Beta)</a></li>
  <li><a href="#postgres">PostgreSQL</a></li>
</ul>
<div markdown="1" id="cosmosdb">

[CosmosDB](https://azure.microsoft.com/en-us/products/cosmos-db/) is a managed database service provided by Azure. lakeFS supports [CosmosDB For NoSQL](https://learn.microsoft.com/en-GB/azure/cosmos-db/nosql/) as a database backend.

1. Follow the official [Azure documentation](https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/how-to-create-account?tabs=azure-cli)
   on how to create a CosmosDB account for NoSQL and connect to it.
1. Once your CosmosDB account is set up, you can create a database for
   lakeFS. To preserve lakeFS's ACID guarantees, make sure to select [bounded
   staleness consistency](https://learn.microsoft.com/en-us/azure/cosmos-db/consistency-levels#bounded-staleness-consistency)
   for single-region deployments.
1. Create a new container in the database and set
   `partitionKey` as the partition key (case sensitive).
1. Pass the endpoint, database name and container name to lakeFS as
   described in the [configuration guide][config-reference-azure-block]; a sample
   configuration is sketched below.
   You can either pass the CosmosDB account's read-write key to lakeFS, or
   use a managed identity to authenticate to CosmosDB, as described
   [earlier](#iba).
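As a minimal sketch, assuming the `database.cosmosdb.*` parameter names from the [configuration guide][config-reference-azure-block] and hypothetical resource names, the relevant lakeFS configuration section might look like this:

```yaml
database:
  type: "cosmosdb"
  cosmosdb:
    endpoint: "https://my-account.documents.azure.com:443/"  # your CosmosDB account endpoint
    database: "lakefs-db"          # the database you created for lakeFS
    container: "lakefs-container"  # the container whose partition key is "partitionKey"
    # Omit the key to authenticate with a managed identity instead:
    # key: "[COSMOSDB_READ_WRITE_KEY]"
```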
A note on CosmosDB capacity modes: lakeFS usage of CosmosDB is still in its
early days and has not been battle-tested. Both capacity modes, Provisioned
and Serverless, have been tested for some workloads and passed with flying
colors. The Provisioned mode was configured with 400-4000 RU/s.
{: .note }

</div>
<div markdown="2" id="postgres">

Below we show you how to create a database on Azure Database for PostgreSQL, but you can use any PostgreSQL database as long as it's accessible by your lakeFS installation.

If you already have a database, take note of the connection string and skip to the [next step](#run-the-lakefs-server).

1. Follow the official [Azure documentation](https://docs.microsoft.com/en-us/azure/postgresql/quickstart-create-server-database-portal){: target="_blank" } on how to create a PostgreSQL instance and connect to it.
   Make sure that you're using PostgreSQL version >= 11.
1. Once your Azure Database for PostgreSQL server is set up and the server is in the _Available_ state, take note of the endpoint and username.
1. Make sure your access control roles allow you to connect to the database instance.

</div>
</div>

## 4. Run the lakeFS server

Now that you've chosen and configured object storage, a K/V store, and authentication, you're ready to configure and run lakeFS. There are three different ways you can run lakeFS:

<div class="tabs">
<ul>
  <li><a href="#vm">Azure VM</a></li>
  <li><a href="#docker">Docker</a></li>
  <li><a href="#aks">Azure Kubernetes Service (AKS)</a></li>
</ul>
<div markdown="1" id="vm">

Connect to your VM instance using SSH:

1. Create a `config.yaml` on your VM, with the following parameters:

   ```yaml
   ---
   database:
     type: "postgres"
     postgres:
       connection_string: "[DATABASE_CONNECTION_STRING]"

   auth:
     encrypt:
       # replace this with a randomly-generated string. Make sure to keep it safe!
       secret_key: "[ENCRYPTION_SECRET_KEY]"

   blockstore:
     type: azure
     azure:
       # With identity-based authentication, no further settings are required here.
       # If you chose storage account credentials, uncomment and fill in the following:
       # storage_account: "[YOUR_STORAGE_ACCOUNT]"
       # storage_access_key: "[YOUR_ACCESS_KEY]"
   ```
1. [Download the binary][downloads] to run on the VM.
1. Run the `lakefs` binary:

   ```sh
   lakefs --config config.yaml run
   ```

**Note:** It's preferable to run the binary as a service using systemd or your operating system's facilities.
{: .note }

</div>
<div markdown="2" id="docker">

To support container-based environments, you can configure lakeFS using environment variables. Here is a `docker run`
command to demonstrate starting lakeFS using Docker:

```sh
docker run \
  --name lakefs \
  -p 8000:8000 \
  -e LAKEFS_DATABASE_TYPE="postgres" \
  -e LAKEFS_DATABASE_POSTGRES_CONNECTION_STRING="[DATABASE_CONNECTION_STRING]" \
  -e LAKEFS_AUTH_ENCRYPT_SECRET_KEY="[ENCRYPTION_SECRET_KEY]" \
  -e LAKEFS_BLOCKSTORE_TYPE="azure" \
  -e LAKEFS_BLOCKSTORE_AZURE_STORAGE_ACCOUNT="[YOUR_STORAGE_ACCOUNT]" \
  -e LAKEFS_BLOCKSTORE_AZURE_STORAGE_ACCESS_KEY="[YOUR_ACCESS_KEY]" \
  treeverse/lakefs:latest run
```

See the [reference][config-envariables] for a complete list of environment variables.
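If you're using identity-based authentication instead of storage account credentials, one approach (a sketch, assuming the service principal variables from step 2 are already exported on the host) is to drop the two storage account variables and forward the Azure identity variables into the container:

```sh
docker run \
  --name lakefs \
  -p 8000:8000 \
  -e LAKEFS_DATABASE_TYPE="postgres" \
  -e LAKEFS_DATABASE_POSTGRES_CONNECTION_STRING="[DATABASE_CONNECTION_STRING]" \
  -e LAKEFS_AUTH_ENCRYPT_SECRET_KEY="[ENCRYPTION_SECRET_KEY]" \
  -e LAKEFS_BLOCKSTORE_TYPE="azure" \
  -e AZURE_CLIENT_ID \
  -e AZURE_TENANT_ID \
  -e AZURE_CLIENT_SECRET \
  treeverse/lakefs:latest run
```

Passing `-e VAR` with no value tells Docker to forward the variable from the host environment into the container.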
</div>
<div markdown="2" id="aks">

You can install lakeFS on Kubernetes using a [Helm chart](https://github.com/treeverse/charts/tree/master/charts/lakefs).

To install lakeFS with Helm:

1. Copy the Helm values file relevant for Azure Blob:

   ```yaml
   secrets:
     # replace this with the connection string of the database you created in a previous step:
     databaseConnectionString: [DATABASE_CONNECTION_STRING]
     # replace this with a randomly-generated string
     authEncryptSecretKey: [ENCRYPTION_SECRET_KEY]
   lakefsConfig: |
     blockstore:
       type: azure
       azure:
       #  If you chose to authenticate via access key, uncomment the following rows and insert the values from the previous step
       #  storage_account: [your storage account]
       #  storage_access_key: [your access key]
   ```
1. Fill in the missing values and save the file as `conf-values.yaml`. For more configuration options, see our Helm chart [README](https://github.com/treeverse/charts/blob/master/charts/lakefs/README.md#custom-configuration){:target="_blank"}.

   The `lakefsConfig` parameter is the lakeFS configuration documented [here](https://docs.lakefs.io/reference/configuration.html), but without sensitive information.
   Sensitive information like `databaseConnectionString` is given through separate parameters, and the chart will inject it into Kubernetes secrets.
   {: .note }

1. In the directory where you created `conf-values.yaml`, run the following commands:

   ```bash
   # Add the lakeFS repository
   helm repo add lakefs https://charts.lakefs.io
   # Deploy lakeFS
   helm install my-lakefs lakefs/lakefs -f conf-values.yaml
   ```

   *my-lakefs* is the [Helm Release](https://helm.sh/docs/intro/using_helm/#three-big-concepts) name.

### Load balancing

To configure a load balancer to direct requests to the lakeFS servers, you can use the `LoadBalancer` Service type or a Kubernetes Ingress.
By default, lakeFS operates on port 8000 and exposes a `/_health` endpoint that you can use for health checks.

💡 The NGINX Ingress Controller by default limits the client body size to 1 MiB.
Some clients use bigger chunks to upload objects - for example, multipart upload to lakeFS using the [S3-compatible Gateway][s3-gateway] or
a simple PUT request using the [OpenAPI Server][openapi].
Check out the NGINX [documentation](https://kubernetes.github.io/ingress-nginx/user-guide/nginx-configuration/annotations/#custom-max-body-size) for increasing the limit, or see an example of NGINX configuration with [MinIO](https://docs.min.io/docs/setup-nginx-proxy-with-minio.html).
{: .note }

</div>
</div>

{% include_relative includes/setup.md %}

[config-envariables]: {% link reference/configuration.md %}#using-environment-variables
[config-reference-azure-block]: {% link reference/configuration.md %}#example-azure-blob-storage
[downloads]: {% link index.md %}#downloads
[openapi]: {% link understand/architecture.md %}#openapi-server
[s3-gateway]: {% link understand/architecture.md %}#s3-gateway
[understand-repository]: {% link understand/model.md %}#repository
[integration-hadoopfs]: {% link integrations/spark.md %}#lakefs-hadoop-filesystem
[understand-commits]: {% link understand/how/versioning-internals.md %}#constructing-a-consistent-view-of-the-keyspace-ie-a-commit