# Metadata for Airflow integration

## About

This is a design to integrate metadata from Airflow into lakeFS.  It
consists of metadata integration and UI for using it.

## Metadata integration

This section defines the metadata that Airflow operators and/or hooks may
place on commits to associate them with the Airflow run that generated them.

### Stability

Commit _metadata_, like diamonds, is forever: it is not generally possible to
modify commit metadata.  Even if we were willing to rewrite commits and
could do so safely, it would still break commit digests as stable
identifiers.

### Forwards compatibility

Obviously we cannot guarantee any future metadata compatibility.  However,
_where current implementation cost is unaffected_, this design is _forwards
compatible_:

* Airflow metadata added today should remain usable.
* If all goes well, we will eventually add more generic UI support for
  metadata from other non-Airflow systems.  Metadata added today should
  ideally work there without carving out special cases.  So it needs to be
  self-contained -- for instance, an "Airflow" label should be derivable
  from the metadata.

### Source of truth

A [wise woman once wrote](https://twitter.com/mipsytipsy/status/998084191488126976):

> You either have one source of truth, or multiple sources of lies.

There is a natural tension, and thereby a continuum, between these two
extremes:

* **Normalized data.**  lakeFS commits hold only enough to identify an
  Airflow run: its URL.  Airflow holds all other run metadata, and is
  required to make use of the URL.
* **Denormalized data.**  lakeFS commits hold all the information that
  lakeFS and its users need about an Airflow run.  All run metadata is
  copied to lakeFS commit metadata.

We always need the normalized data pointer: it is the only real identifier
of the run.[^1] The lakeFS UI can offer the best integration when lakeFS is
the source of truth.  Initially we shall allow users to select what metadata
to copy.  The default will be _all_.  For velocity we will avoid adding any
metadata that is in any way difficult to produce in the lakeFS Airflow
commit operator.  We may revisit this decision after feedback from users.

We identify runs by:

* Their Airflow run URL;
* The `task_id`;
* The task `try_number`.

This yields an identifying URL
`https://{AIRFLOW_HOST}/api/v1/dags/{dag_id}/dagRuns/{dag_run_id}/taskInstances/{task_id}/`.
However this URL is _not_ good for displaying in the GUI; there we give the
URL of the well-known DAG run GUI.
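
As an illustration only, the identifying URL could be assembled from the template above roughly as follows.  The function name `identifying_url` and the host `airflow.example.com` are hypothetical, and percent-encoding each path segment is an assumption of this sketch (run IDs contain `:` and `+`).  The template has no slot for `try_number`, so the sketch leaves it out.

```python
from urllib.parse import quote


def identifying_url(airflow_host: str, dag_id: str, dag_run_id: str, task_id: str) -> str:
    """Fill in the identifying-URL template for one task instance (hypothetical helper)."""
    # Percent-encode every path segment; safe="" also encodes "/" so that an
    # unusual ID cannot change the path structure.
    dag, run, task = (quote(part, safe="") for part in (dag_id, dag_run_id, task_id))
    return f"https://{airflow_host}/api/v1/dags/{dag}/dagRuns/{run}/taskInstances/{task}/"


if __name__ == "__main__":
    print(
        identifying_url(
            "airflow.example.com",  # assumed host, for illustration
            "big_data_dag",
            "scheduled__2023-04-13T05:40:00+00:00",
            "commit",
        )
    )
```

Whether the stored identifier should be percent-encoded or kept verbatim is a design choice this document does not settle; the sketch picks encoding so that the result is always a valid URL.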

[^1]: Naively the run ID would be sufficient.  But that imposes an
    assumption of a single Airflow instance.

### Scope

We add metadata for these items to each commit:

* DAG run.  The URL of the DAG run in Airflow that generated this commit.
  This is the URL of [Get a DAG run][get-dag-run] in the REST API.
* Optionally, additional DAG identifiers such as:
  * Git commit digest
  * Airflow system identifier
  * Airflow logical ("execution") date
  * Notes (an Airflow metadata concept)

  We allow all items that can be retrieved by [Get a DAG run][get-dag-run]
  in the REST API, and copy terminology from there.

  All optional items must be exactly that: optional.

### Naming

We define a simple naming scheme that allows some future-compatibility.  All
items use the prefix "`::lakefs::Airflow::`", staking a claim to the
"`::lakefs::`" prefix on commit metadata.

lakeFS supports only string metadata.  All plain suffixes are assumed to be
plain strings.  If a suffix ends in `[...]`, that encodes a _type_.  We
initially define the types that are needed to integrate Airflow.

We add these suffixes to encode how to access the DAG run:

* `url[url:id]`: URL that identifies the DAG run.
* `url[url:ui]`: URL to access the Airflow UI for the run.

The REST API suggests the following suffixes:

* `dag_run_id`: string
* `dag_id`: string
* `logical_date[iso8601]`: ISO 8601

  `execution_date` is deprecated by Airflow and equal to `logical_date`, so
  it is _not_ included.
* `data_interval_start[iso8601]`: ISO 8601
* `data_interval_end[iso8601]`: ISO 8601
* `last_scheduling_decision[iso8601]`: ISO 8601
* `run_type`: string
* `external_trigger[boolean]`: boolean encoded as "`true`" or "`false`"
* `note`: string

So we might end up with the following metadata:

| key                                                    | value                                |
|:-------------------------------------------------------|:-------------------------------------|
| `::lakefs::Airflow::dag_run_id`                        | scheduled__2023-04-13T05:40:00+00:00 |
| `::lakefs::Airflow::dag_id`                            | big_data_dag                         |
| `::lakefs::Airflow::logical_date[iso8601]`             | 2023-04-14T01:02:03+00:00            |
| `::lakefs::Airflow::data_interval_start[iso8601]`      | 2023-04-14T00:11:22+00:00            |
| `::lakefs::Airflow::data_interval_end[iso8601]`        | 2023-04-14T00:22:25+00:00            |
| `::lakefs::Airflow::last_scheduling_decision[iso8601]` | 2023-04-14T01:02:01+00:00            |
| `::lakefs::Airflow::run_type`                          | scheduled                            |
| `::lakefs::Airflow::external_trigger[boolean]`         | false                                |
| `::lakefs::Airflow::note`                              | This time for sure!                  |
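
A minimal sketch of how a commit operator might produce such metadata is shown here.  It assumes that `dag_run` is the decoded JSON object returned by the REST API (so timestamps are already ISO 8601 strings) and that the two URLs are computed elsewhere; the function name `airflow_commit_metadata` and its parameters are hypothetical, not part of any existing operator.

```python
PREFIX = "::lakefs::Airflow::"

# Suffixes from the naming scheme, grouped by their encoded type.
STRING_FIELDS = ("dag_run_id", "dag_id", "run_type", "note")
ISO8601_FIELDS = (
    "logical_date",
    "data_interval_start",
    "data_interval_end",
    "last_scheduling_decision",
)
BOOLEAN_FIELDS = ("external_trigger",)


def airflow_commit_metadata(dag_run: dict, id_url: str, ui_url: str) -> dict:
    """Map a DAG run object to string-only commit metadata (hypothetical helper)."""
    metadata = {
        f"{PREFIX}url[url:id]": id_url,
        f"{PREFIX}url[url:ui]": ui_url,
    }
    # Optional items stay optional: absent or null fields are simply skipped.
    for field in STRING_FIELDS:
        if dag_run.get(field) is not None:
            metadata[f"{PREFIX}{field}"] = str(dag_run[field])
    for field in ISO8601_FIELDS:
        if dag_run.get(field) is not None:
            metadata[f"{PREFIX}{field}[iso8601]"] = str(dag_run[field])
    for field in BOOLEAN_FIELDS:
        if dag_run.get(field) is not None:
            metadata[f"{PREFIX}{field}[boolean]"] = "true" if dag_run[field] else "false"
    return metadata
```

Every value is a string because lakeFS supports only string metadata; the `[...]` suffix is what lets a reader recover the original type later.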

#### Changing Airflows

If the domain name of the Airflow system changes then URLs stop working.
The DAG run URL is still a useful _identifier_, but it cannot be used to
query for information.

In a second phase, we may offer a simple future-proof method to allow
changing URLs: Configure a number of "URL endpoints" on lakeFS.  They will
look like variable expansions ("`http://$my_airflow/path/to/dag`") in commit
metadata.  Configuring the current Airflow base URL in the lakeFS
configuration allows both uniqueness and forwards compatibility to work.

Users are not required to use these shortcuts, of course -- and small or
test installations might choose to avoid them entirely.
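
The expansion itself could be as simple as the sketch below.  It uses Python's `string.Template`, whose `$name` syntax happens to match the example; the helper name `expand_url` and the shape of the `endpoints` mapping (variable name to current host) are assumptions for illustration, not a decided configuration format.

```python
from string import Template


def expand_url(stored_url: str, endpoints: dict) -> str:
    """Expand configured URL endpoint variables in a stored URL (hypothetical helper)."""
    # safe_substitute leaves unknown $variables untouched instead of raising,
    # so metadata that names an endpoint which is not configured is returned
    # exactly as stored.
    return Template(stored_url).safe_substitute(endpoints)


if __name__ == "__main__":
    endpoints = {"my_airflow": "airflow.example.com"}  # assumed configuration
    print(expand_url("http://$my_airflow/path/to/dag", endpoints))
```

URLs that contain no variables pass through unchanged, which matches the statement that installations may avoid these shortcuts entirely.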

## lakeFS UI

This part integrates the Airflow UI into the lakeFS UI.  The lakeFS UI will
allow viewing the metadata associated with a commit.  When viewing a commit,
the Airflow UI URL will become a link.  We shall also open the DAG in a frame
above the commit metadata if this is available.

### Flow

When displaying a commit, and if enabled, scan its metadata for URLs of type
`[url:ui]`.  If one is found, expand config variables in it and render it as
a green button labelled "Open Airflow UI".
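
The scanning step might look like the following sketch.  It only finds and expands the URLs; rendering the button is left to the UI.  The name `find_ui_urls` and the `endpoints` mapping are hypothetical, and using `string.Template` for the expansion is an assumption of this sketch.

```python
from string import Template

UI_URL_TYPE = "[url:ui]"


def find_ui_urls(metadata: dict, endpoints: dict) -> list:
    """Return the expanded value of every metadata key typed [url:ui]."""
    return [
        # Unknown variables are left as-is rather than raising an error.
        Template(value).safe_substitute(endpoints)
        for key, value in sorted(metadata.items())
        if key.endswith(UI_URL_TYPE)
    ]


if __name__ == "__main__":
    commit_metadata = {
        "::lakefs::Airflow::dag_id": "big_data_dag",
        "::lakefs::Airflow::url[url:ui]": "http://$my_airflow/path/to/dag",
    }
    print(find_ui_urls(commit_metadata, {"my_airflow": "airflow.example.com"}))
```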

## Non-Airflow systems

Structured naming of metadata allows the UI to behave in a more generic
manner.  We can implement this shortly after implementing the original
Airflow-only UI.  It might optionally scan metadata for _all_ keys of the
form "`::lakefs::Product::property`", and use their types to display them
correctly.

For instance, merely adding a property

```conf
::lakefs::GitHub::url[url:ui]=https://github.com/apache/airflow/tree/d16e54d16e54
```

should be enough to link a particular commit on GitHub, and even name it "GitHub".
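
To make the generic behaviour concrete, here is a rough sketch of parsing such keys into product, property and type, and decoding the two non-string types defined so far.  The regular expression and the name `parse_entry` are assumptions of this sketch; the design does not fix which characters a product or property name may contain.

```python
import re
from datetime import datetime

# ::lakefs::<Product>::<property> with an optional trailing [type].
KEY_PATTERN = re.compile(
    r"^::lakefs::(?P<product>[^:\[\]]+)::(?P<property>[^:\[\]]+)"
    r"(?:\[(?P<type>[^\[\]]+)\])?$"
)


def parse_entry(key: str, value: str):
    """Return (product, property, type, decoded value), or None for other keys."""
    match = KEY_PATTERN.match(key)
    if match is None:
        return None
    type_ = match["type"]  # None for plain string suffixes
    if type_ == "boolean":
        decoded = value == "true"
    elif type_ == "iso8601":
        # fromisoformat accepts offsets such as +00:00; a trailing "Z" is
        # only accepted from Python 3.11 onwards.
        decoded = datetime.fromisoformat(value)
    else:
        decoded = value  # strings and URL types are kept as-is
    return match["product"], match["property"], type_, decoded
```

With this, the product name ("GitHub" in the example) is available as a label without any product-specific code, which is the self-contained property the forwards-compatibility section asks for.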

## Future

### Framing Airflow

In future we might render the UI in an [HTML iframe][mdn-iframe]:

```html
<iframe title="Airflow UI" src="https://path/to/airflow"></iframe>
```

Airflow can supposedly be framed in a UI.  By default this is allowed but it
can be [disabled by configuring][airflow-framing]:

```ini
[webserver]
x_frame_enabled = False
```

This is phrased as a security advantage: it prevents clickjacking.
Unfortunately there is no way around this -- it's an HTML + website feature.

But it turns out that even this is hard:

* Airflow used to [treat the configuration `X_FRAME_ENABLED` _in
  reverse_][airflow-reversed-x-frame-enabled].  So even configuring it on an
  older Airflow version (2.2.4 and below) will be confusing, and many
  upgraded installations may have it backwards!
* The Astronomer login flow does not appear to understand this variable, and
  I was unable to embed Astronomer Airflow into an iframe.

[get-dag-run]:  https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html#operation/get_dag_run
[airflow-framing]:  https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/security/webserver.html#rendering-airflow-ui-in-a-web-frame-from-another-site
[mdn-iframe]:  https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe
[airflow-reversed-x-frame-enabled]:  https://github.com/apache/airflow/blob/main/RELEASE_NOTES.rst#the-webserverx_frame_enabled-configuration-works-according-to-description-now-23222