github.com/treeverse/lakefs@v1.24.1-0.20240520134607-95648127bfb0/docs/integrations/airflow.md

---
title: Apache Airflow
description: Easily build reproducible data pipelines with Airflow and lakeFS using commits, without modifying the code or logic of your job.
parent: Integrations
redirect_from: /using/airflow.html
---

# Using lakeFS with Apache Airflow

[Apache Airflow](https://airflow.apache.org/) is a platform that allows users to programmatically author, schedule, and monitor workflows.

To use lakeFS from Airflow, create a connection, install the provider package, and then use its operators and sensors in your DAGs.

## Create a lakeFS connection on Airflow

To access the lakeFS server and authenticate with it, create a new [Airflow
Connection](https://airflow.apache.org/docs/apache-airflow/stable/howto/connection.html)
of type HTTP and reference it from your DAG by its connection ID. You can create
the connection using the Airflow UI or the CLI. Here's an example Airflow command
that creates a connection with the ID `conn_lakefs`:

```bash
airflow connections add conn_lakefs --conn-type=HTTP --conn-host=http://<LAKEFS_ENDPOINT> \
    --conn-extra='{"access_key_id":"<LAKEFS_ACCESS_KEY_ID>","secret_access_key":"<LAKEFS_SECRET_ACCESS_KEY>"}'
```

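If you script the connection setup, hand-writing the JSON for `--conn-extra` is easy to get wrong when a secret contains quotes or backslashes. The following is a minimal sketch that builds the same command as above with Python's standard library. The helper name `build_connection_command` is hypothetical; it uses only the flags shown in the command above.

```python
import json
import shlex


def build_connection_command(conn_id, endpoint, access_key_id, secret_access_key):
    """Return the `airflow connections add` command as an argument list.

    Hypothetical helper for illustration; it is not part of Airflow or lakeFS.
    """
    # json.dumps escapes quotes and backslashes inside the credentials.
    extra = json.dumps(
        {"access_key_id": access_key_id, "secret_access_key": secret_access_key}
    )
    return [
        "airflow", "connections", "add", conn_id,
        "--conn-type=HTTP",
        f"--conn-host={endpoint}",
        f"--conn-extra={extra}",
    ]


if __name__ == "__main__":
    argv = build_connection_command(
        "conn_lakefs",
        "http://<LAKEFS_ENDPOINT>",
        "<LAKEFS_ACCESS_KEY_ID>",
        "<LAKEFS_SECRET_ACCESS_KEY>",
    )
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(argv))
```

On a machine where the Airflow CLI is installed, the returned list can be passed to `subprocess.run(argv, check=True)`.
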
## Install the lakeFS Airflow package

You can use `pip` to install [the package](https://pypi.org/project/airflow-provider-lakefs/):

```bash
pip install airflow-provider-lakefs
```

## Use the package

### Operators

The package exposes several operators to interact with a lakeFS server:

1. `CreateBranchOperator` creates a new lakeFS branch from the source branch (`main` by default).

   ```python
   task_create_branch = CreateBranchOperator(
       task_id='create_branch',
       repo='example-repo',
       branch='example-branch',
       source_branch='main'
   )
   ```
1. `CommitOperator` commits uncommitted changes to a branch.

   ```python
   task_commit = CommitOperator(
       task_id='commit',
       repo='example-repo',
       branch='example-branch',
       msg='committing to lakeFS using airflow!',
       metadata={'committed_from': 'airflow-operator'}
   )
   ```
1. `MergeOperator` merges a source reference into a destination branch.

   ```python
   task_merge = MergeOperator(
       task_id='merge_branches',
       repo='example-repo',
       source_ref='example-branch',
       destination_branch='main',
       msg='merging job outputs',
       metadata={'committer': 'airflow-operator'}
   )
   ```

### Sensors

The package also provides sensors that synchronize a running DAG with external operations:

1. `CommitSensor` waits until a commit has been applied to the branch.

   ```python
   task_sense_commit = CommitSensor(
       task_id='sense_commit',
       repo='example-repo',
       branch='example-branch'
   )
   ```
1. `FileSensor` waits until a given file is present on a branch.

   ```python
   task_sense_file = FileSensor(
       task_id='sense_file',
       repo='example-repo',
       branch='example-branch',
       path="file/to/sense"
   )
   ```

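Conceptually, a sensor polls for a condition at a fixed interval until the condition holds or a timeout expires. The sketch below illustrates that pattern for "wait for a new commit" in plain Python. It is not the provider's implementation: `get_head_commit` is a hypothetical callable standing in for a lookup of the branch's head commit ID, and the parameter names are assumptions that mirror Airflow's `poke_interval` and `timeout` sensor settings.

```python
import time
from typing import Callable


def wait_for_new_commit(
    get_head_commit: Callable[[], str],
    poke_interval: float = 1.0,
    timeout: float = 60.0,
) -> str:
    """Block until get_head_commit() returns a new value, then return it.

    Illustrative only: get_head_commit is a hypothetical stand-in for
    asking lakeFS which commit a branch currently points to.
    """
    # Remember the commit the branch pointed to when we started waiting.
    baseline = get_head_commit()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        current = get_head_commit()
        if current != baseline:
            return current
        # Nothing new yet; wait before polling again.
        time.sleep(poke_interval)
    raise TimeoutError("no new commit observed before the timeout")
```

Passing the lookup in as a callable keeps the polling logic independent of how the commit ID is fetched, which also makes it easy to exercise with a fake.
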
### Example

This [example DAG](https://github.com/treeverse/airflow-provider-lakeFS/blob/main/lakefs_provider/example_dags/lakefs-dag.py)
in the airflow-provider-lakeFS repository shows how to use all of these operators and sensors.

### Performing other operations

Sometimes airflow-provider-lakeFS does not yet provide an operator for the operation you need. In that case, you can access lakeFS directly by using:

- `SimpleHttpOperator` to send [API requests]({% link reference/api.md %}) to lakeFS.
- `BashOperator` with [lakectl]({% link reference/cli.md %}) commands.

  For example, deleting a branch using `BashOperator`:

  ```python
  from airflow.operators.bash import BashOperator

  # `dag` is the DAG object this task belongs to.
  task_delete_branch = BashOperator(
      task_id='delete_branch',
      bash_command='lakectl branch delete lakefs://example-repo/example-branch',
      dag=dag,
  )
  ```

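As an illustration of the API route, the sketch below builds the HTTP request for deleting a branch without sending it. It assumes the `DELETE /api/v1/repositories/{repository}/branches/{branch}` endpoint and HTTP Basic authentication with the access key pair; check both against the API reference for your lakeFS version. The helper name `build_delete_branch_request` is hypothetical.

```python
import base64
import urllib.request
from urllib.parse import quote


def build_delete_branch_request(endpoint, repo, branch, access_key_id, secret_access_key):
    """Build (but do not send) a DELETE request for a lakeFS branch.

    Assumptions: the /api/v1 path layout and Basic authentication.
    """
    # Percent-encode path segments so names with special characters stay intact.
    url = (
        f"{endpoint.rstrip('/')}/api/v1/repositories/"
        f"{quote(repo, safe='')}/branches/{quote(branch, safe='')}"
    )
    credentials = f"{access_key_id}:{secret_access_key}".encode()
    token = base64.b64encode(credentials).decode()
    return urllib.request.Request(
        url,
        method="DELETE",
        headers={"Authorization": f"Basic {token}"},
    )


if __name__ == "__main__":
    request = build_delete_branch_request(
        "http://<LAKEFS_ENDPOINT>", "example-repo", "example-branch",
        "<LAKEFS_ACCESS_KEY_ID>", "<LAKEFS_SECRET_ACCESS_KEY>",
    )
    print(request.get_method(), request.full_url)
```

In a DAG, the same method and path would be supplied to `SimpleHttpOperator` through its `method` and `endpoint` arguments, with the host taken from the Airflow connection.
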
**Note:** lakeFS versions v0.33.1 and earlier use `@` (instead of `/`) as the separator between repository and branch in `lakefs://` URIs.
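
If the same DAG code has to run against servers on either side of that version boundary, the separator can be chosen programmatically. The following is a small sketch under stated assumptions: the helper names `parse_version` and `lakefs_branch_uri` are hypothetical, and version strings are assumed to be plain `MAJOR.MINOR.PATCH` with an optional leading `v` (pre-release suffixes are not handled).

```python
def parse_version(version: str) -> tuple:
    """Turn 'v0.33.1' or '0.33.1' into (0, 33, 1) for comparison."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def lakefs_branch_uri(repo: str, branch: str, server_version: str) -> str:
    """Return a lakefs:// branch URI with the separator the given version expects."""
    # Versions up to and including v0.33.1 separate repository and branch with '@'.
    separator = "@" if parse_version(server_version) <= (0, 33, 1) else "/"
    return f"lakefs://{repo}{separator}{branch}"


if __name__ == "__main__":
    print(lakefs_branch_uri("example-repo", "example-branch", "v0.33.1"))
    print(lakefs_branch_uri("example-repo", "example-branch", "v1.24.1"))
```
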