# Metadata for Airflow integration

## About

This is a design to integrate metadata from Airflow into lakeFS. It
consists of metadata integration and a UI for using it.

## Metadata integration

This is the metadata that Airflow operators and/or hooks may place on commits
to associate them with the Airflow run that generated them.

### Stability

Commit _metadata_, like diamonds, is forever: it is not generally possible to
modify commit metadata. Even if we were willing to rewrite commits and
could do so safely, it would still break commit digests as stable
identifiers.

### Forwards compatibility

Obviously we cannot guarantee any future metadata compatibility. However,
_where current implementation cost is unaffected_, this design is _forwards
compatible_:

* Airflow metadata added today should remain usable.
* If all goes well, we will eventually add more generic UI support for
  metadata from other non-Airflow systems. Metadata added today should
  ideally work there without carving out special cases. So it needs to be
  self-contained -- for instance, an "Airflow" label should be derivable
  from the metadata.

### Source of truth

A [wise woman once wrote](https://twitter.com/mipsytipsy/status/998084191488126976):

> You either have one source of truth, or multiple sources of lies.

There is a natural tension, and thereby a continuum, between these two
extremes:

* **Normalized data.** lakeFS commits hold only enough to identify an
  Airflow run - its URL. Airflow holds all other run metadata, and is
  required to make use of the URL.
* **Denormalized data.** lakeFS commits hold all the information that
  lakeFS and its users need about an Airflow run. All run metadata is
  copied to lakeFS commit metadata.
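
To make the two extremes concrete, here is a minimal Python sketch of the
continuum between them. The helper `commit_metadata`, the host
`airflow.example.com` and the sample run are hypothetical; the key prefix and
the `url[url:id]` suffix follow the naming scheme defined later in this
document.

```python
# Sketch only: not the lakeFS Airflow provider's API.
PREFIX = "::lakefs::Airflow::"


def commit_metadata(run_url, run, copy=()):
    """Build commit metadata: the run URL plus any run fields chosen for copying."""
    # The pointer to the run is always present; it identifies the run.
    metadata = {PREFIX + "url[url:id]": run_url}
    for field in copy:
        metadata[PREFIX + field] = run[field]
    return metadata


# Hypothetical run fields, using values that appear elsewhere in this design.
run = {
    "dag_id": "big_data_dag",
    "run_type": "scheduled",
    "note": "This time for sure!",
}
run_url = (
    "https://airflow.example.com/api/v1/dags/big_data_dag"
    "/dagRuns/scheduled__2023-04-13T05:40:00+00:00"
)

normalized = commit_metadata(run_url, run)                     # pointer only
denormalized = commit_metadata(run_url, run, copy=run.keys())  # everything copied
```

Any subset passed as `copy` lands somewhere between the two extremes.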

We always need the normalized data pointer: it is the only real identifier
of the run.[^1] The lakeFS UI can offer the best integration when lakeFS is
the source of truth. Initially we shall allow users to select what metadata
to copy. The default will be _all_. For velocity we will avoid adding any
metadata that is in any way difficult to produce in the lakeFS Airflow
commit operator. We may revisit this decision after feedback from users.

We identify runs by:

* Their Airflow run URL;
* The `task_id`;
* The task `try_number`.

This yields an identifying URL
`https://{AIRFLOW_HOST}/api/v1/dags/{dag_id}/dagRuns/{dag_run_id}/taskInstances/{task_id}/`.
However this URL is _not_ good for displaying in the GUI; there we give the
URL of the well-known DAG run GUI.


[^1]: Naively the run ID would be sufficient. But that imposes an
    assumption of a single Airflow instance.

### Scope

We add to each commit metadata for these items:

* DAG run. The URL of the DAG run in Airflow that generated this commit.
  This is the URL of [Get a DAG run][get-dag-run] in the REST API.
* Optionally, additional DAG identifiers such as:
  * Git commit digest
  * Airflow system identifier
  * Airflow logical ("execution") date
  * Notes (an Airflow metadata concept)

We allow all items that can be retrieved by [Get a DAG run][get-dag-run]
in the REST API, and copy terminology from there.

All optional items must be exactly that: optional.

### Naming

We define a simple naming scheme that allows some future-compatibility. All
items use the prefix "`::lakefs::Airflow::`", staking a claim to the
"`::lakefs::`" prefix on commit metadata.

lakeFS supports only string metadata. All plain suffixes are assumed to be
plain strings. If a suffix ends in `[...]`, that encodes a _type_. We
initially define the types that are needed to integrate Airflow.
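
As an illustration of this scheme, the following Python sketch splits a
metadata key into product, property and type. The regular expression, the
`MetadataKey` tuple and the `parse_key` helper are assumptions made for this
example and are not part of lakeFS. Untyped suffixes are reported as
`"string"`, matching the rule that plain suffixes are plain strings.

```python
import re
from typing import NamedTuple, Optional

# Assumed key shape: "::lakefs::<Product>::<property>", optionally followed
# by "[<type>]".  This pattern is illustrative, not a lakeFS definition.
KEY_RE = re.compile(
    r"^::lakefs::(?P<product>[^:\[\]]+)::(?P<prop>[^:\[\]]+)"
    r"(?:\[(?P<type>[^\[\]]+)\])?$"
)


class MetadataKey(NamedTuple):
    product: str     # e.g. "Airflow"
    prop: str        # e.g. "logical_date"
    value_type: str  # e.g. "iso8601"; "string" when no [type] suffix is given


def parse_key(key: str) -> Optional[MetadataKey]:
    """Parse a structured metadata key, or return None for any other key."""
    match = KEY_RE.match(key)
    if match is None:
        return None
    return MetadataKey(match["product"], match["prop"], match["type"] or "string")
```

For example, `parse_key("::lakefs::Airflow::logical_date[iso8601]")` yields the
product `Airflow`, the property `logical_date` and the type `iso8601`, while a
key outside the "`::lakefs::`" prefix yields `None`. Because the product name
is part of the key, a label such as "Airflow" can be derived from the metadata
itself.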

We add these suffixes to encode how to access the DAG run:

* `url[url:id]`: URL that identifies the DAG run.
* `url[url:ui]`: URL to access the Airflow UI for the run.

The REST API suggests the following suffixes:

* `dag_run_id`: string
* `dag_id`: string
* `logical_date[iso8601]`: ISO 8601

  `execution_date` is deprecated by Airflow and equal to `logical_date`, so
  it is _not_ included.
* `data_interval_start[iso8601]`: ISO 8601
* `data_interval_end[iso8601]`: ISO 8601
* `last_scheduling_decision[iso8601]`: ISO 8601
* `run_type`: string
* `external_trigger[boolean]`: boolean encoded as "`true`" or "`false`"
* `note`: string

So we might end up with the following metadata:

| key                                                    | value                                |
|:-------------------------------------------------------|:-------------------------------------|
| `::lakefs::Airflow::dag_run_id`                        | scheduled__2023-04-13T05:40:00+00:00 |
| `::lakefs::Airflow::dag_id`                            | big_data_dag                         |
| `::lakefs::Airflow::logical_date[iso8601]`             | 2023-04-14T01:02:03+00:00            |
| `::lakefs::Airflow::data_interval_start[iso8601]`      | 2023-04-14T00:11:22+00:00            |
| `::lakefs::Airflow::data_interval_end[iso8601]`        | 2023-04-14T00:22:25+00:00            |
| `::lakefs::Airflow::last_scheduling_decision[iso8601]` | 2023-04-14T01:02:01+00:00            |
| `::lakefs::Airflow::run_type`                          | scheduled                            |
| `::lakefs::Airflow::external_trigger[boolean]`         | false                                |
| `::lakefs::Airflow::note`                              | This time for sure!                  |

#### Changing Airflows

If the domain name of the Airflow system changes then URLs stop working.
The DAG run URL is still a useful _identifier_, but it cannot be used to
query for information.

In a second phase, we may offer a simple future-proof method to allow
changing URLs: Configure a number of "URL endpoints" on lakeFS. They will
look like variable expansions ("`http://$my_airflow/path/to/dag`") in commit
metadata.
Configuring the current Airflow base URL in the lakeFS
configuration allows both uniqueness and forwards compatibility to work.

Users are not required to use these shortcuts, of course -- and small or
test installations might choose to avoid them entirely.

## lakeFS UI

This part integrates the Airflow UI into the lakeFS UI. The lakeFS UI will
allow viewing the associated metadata. When viewing a commit, the Airflow UI
URL will become a link. We shall also open the DAG in a frame above the
commit metadata if this is available.

### Flow

When displaying a commit, and if enabled, scan its metadata for URLs of type
`[url:ui]`. If one is found, expand config variables in it and render it as
a green button labelled "Open Airflow UI".

## Non-Airflow systems

Structured naming of metadata allows the UI to behave in a more generic
manner. We can implement this shortly after implementing the original
Airflow-only UI. It might optionally scan metadata for _all_ keys of the
form "`::lakefs::Product::property`", and use their types to display them
correctly.

For instance, merely adding a property
```conf
::lakefs::GitHub::url[url:ui]=https://github.com/apache/airflow/tree/d16e54d16e54
```
should be enough to link a particular commit on GitHub, and even name it "GitHub".

## Future

### Framing Airflow

In future we might render the UI in an [HTML iframe][mdn-iframe]:

```html
<iframe title="Airflow UI" src="https://path/to/airflow"></iframe>
```

Airflow can supposedly be framed in a UI. By default this is allowed, but it
can be [disabled in the configuration][airflow-framing]:

```ini
[webserver]
x_frame_enabled = False
```

This is phrased as a security advantage: it prevents clickjacking.
Unfortunately there is no way around this -- it's an HTML + website feature.

But it turns out that even this is hard:

* Airflow used to [treat the configuration `X_FRAME_ENABLED` _in
  reverse_][airflow-reversed-x-frame-enabled]. So even configuring it on an
  older Airflow version (2.2.4 and below) will be confusing, and many
  upgraded installations may have it backwards!
* The Astronomer login flow does not appear to understand this variable, and
  I was unable to embed Astronomer Airflow into an iframe.

[get-dag-run]: https://airflow.apache.org/docs/apache-airflow/stable/stable-rest-api-ref.html#operation/get_dag_run
[airflow-framing]: https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/security/webserver.html#rendering-airflow-ui-in-a-web-frame-from-another-site
[mdn-iframe]: https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe
[airflow-reversed-x-frame-enabled]: https://github.com/apache/airflow/blob/main/RELEASE_NOTES.rst#the-webserverx_frame_enabled-configuration-works-according-to-description-now-23222