---
title: Apache Airflow
description: Easily build reproducible data pipelines with Airflow and lakeFS using commits, without modifying the code or logic of your job.
parent: Integrations
redirect_from: /using/airflow.html
---

# Using lakeFS with Apache Airflow

[Apache Airflow](https://airflow.apache.org/) is a platform that allows users to programmatically author, schedule, and monitor workflows.

To run Airflow with lakeFS, follow the steps below.

## Create a lakeFS connection on Airflow

To access the lakeFS server and authenticate with it, create a new [Airflow
Connection](https://airflow.apache.org/docs/apache-airflow/stable/howto/connection.html)
of type HTTP and reference it from your DAG. You can do that using the Airflow UI
or the CLI. Here’s an example Airflow command that does just that:

```bash
airflow connections add conn_lakefs --conn-type=HTTP --conn-host=http://<LAKEFS_ENDPOINT> \
    --conn-extra='{"access_key_id":"<LAKEFS_ACCESS_KEY_ID>","secret_access_key":"<LAKEFS_SECRET_ACCESS_KEY>"}'
```

## Install the lakeFS Airflow package

Use `pip` to install [the package](https://pypi.org/project/airflow-provider-lakefs/):

```bash
pip install airflow-provider-lakefs
```

## Use the package

### Operators

The package exposes several operators to interact with a lakeFS server:

1. `CreateBranchOperator` creates a new lakeFS branch from the source branch (`main` by default).

   ```python
   task_create_branch = CreateBranchOperator(
       task_id='create_branch',
       repo='example-repo',
       branch='example-branch',
       source_branch='main'
   )
   ```
1. `CommitOperator` commits uncommitted changes to a branch.

   ```python
   task_commit = CommitOperator(
       task_id='commit',
       repo='example-repo',
       branch='example-branch',
       msg='committing to lakeFS using airflow!',
       metadata={'committed_from': 'airflow-operator'}
   )
   ```
1. `MergeOperator` merges two lakeFS branches.

   ```python
   task_merge = MergeOperator(
       task_id='merge_branches',
       source_ref='example-branch',
       destination_branch='main',
       msg='merging job outputs',
       metadata={'committer': 'airflow-operator'}
   )
   ```

### Sensors

The package also provides sensors for synchronizing a running DAG with external operations:

1. `CommitSensor` waits until a commit has been applied to the branch.

   ```python
   task_sense_commit = CommitSensor(
       repo='example-repo',
       branch='example-branch',
       task_id='sense_commit'
   )
   ```
1. `FileSensor` waits until a given file is present on a branch.

   ```python
   task_sense_file = FileSensor(
       task_id='sense_file',
       repo='example-repo',
       branch='example-branch',
       path='file/to/sense'
   )
   ```

### Example

This [example DAG](https://github.com/treeverse/airflow-provider-lakeFS/blob/main/lakefs_provider/example_dags/lakefs-dag.py)
in the airflow-provider-lakeFS repository shows how to use all of these.

### Performing other operations

Sometimes an operation might not be supported by airflow-provider-lakeFS yet. You can access lakeFS directly by using:

- SimpleHttpOperator to send [API requests]({% link reference/api.md %}) to lakeFS.
- BashOperator with [lakectl]({% link reference/cli.md %}) commands.
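For instance, here is a minimal sketch of listing a repository's branches through the lakeFS API with SimpleHttpOperator. It assumes an HTTP connection whose login and password hold the lakeFS access key ID and secret access key, so that basic authentication is applied (note that the connection created above stores the keys in the `extra` field instead, which SimpleHttpOperator does not read):

```python
from airflow.providers.http.operators.http import SimpleHttpOperator

# Sketch: list the branches of a repository via the lakeFS REST API.
# Assumes conn_lakefs keeps the lakeFS access key ID and secret access key
# as the connection's login and password, so basic auth is applied.
task_list_branches = SimpleHttpOperator(
    task_id='list_branches',
    http_conn_id='conn_lakefs',
    method='GET',
    endpoint='api/v1/repositories/example-repo/branches',
    log_response=True,
)
```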
For example, deleting a branch using BashOperator:

```python
task_delete_branch = BashOperator(
    task_id='delete_branch',
    bash_command='lakectl branch delete lakefs://example-repo/example-branch',
    dag=dag,
)
```

**Note:** lakeFS versions <= v0.33.1 use '@' (instead of '/') as the separator between repository and branch.
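To tie things together, here is a rough sketch of how the tasks defined above might be wired into a DAG. The operator and sensor instantiations from the Operators and Sensors sections are assumed to sit inside the `with` block; their imports come from the `lakefs_provider` package (see the example DAG linked above for the exact module paths):

```python
from datetime import datetime

from airflow import DAG

with DAG(
    dag_id='lakefs_example',
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,
) as dag:
    # The CreateBranchOperator, FileSensor, CommitOperator, and MergeOperator
    # instances shown above go here.
    # Branch out, wait for the job's output, commit it, then merge back to main.
    task_create_branch >> task_sense_file >> task_commit >> task_merge
```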