# TimeToValue: Wizard MVP Design

## Description

This design document specifies a possible solution for the time-to-value [wizard MVP](https://github.com/treeverse/lakeFS/issues/3411).
The wizard will provide a quick and clear way to start interacting with lakeFS.
It will do that by allowing users to
1. Initialize a new repository with a given namespace (and name) with a ‘main’ default branch.
2. Import their data (write metadata) into the new repository’s main branch by specifying the data’s S3 bucket namespace.
3. Get custom Spark configurations and custom Hive metastore configurations to access lakeFS using the S3 Gateway (this is for the MVP).
4. Summarise all actions performed (or skipped) in a README file placed at the root of the initialized repository.

---

## System Overview

*(System overview diagram)*
[(excalidraw file)](diagrams/wizard-mvp.excalidraw)

### Wizard UI Component

The wizard UI component is responsible for the user’s Spark onboarding process. The process is as follows:
1. Create a repository (named as the user wishes) with a ‘main’ branch in it.
2. Import the user’s data to ‘main’ and display a progress bar (which will show a link to the required permissions). Only available on cloud-deployed lakeFS.
3. Generate Spark configurations: the wizard will return Spark configurations for users to use in their Spark `core-site.xml` file, [Databricks key-value format](https://docs.databricks.com/clusters/configure.html#spark-configuration), or [EMR JSON](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html) ([core-site](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html)). All three templates will be requested, each in a separate request.
   1. The wizard asks the user to enter the lakeFS endpoint (by default it will be https://`location.host`).
   2. [Figma Design](https://www.figma.com/file/haD9y599LzW6LvsYBI2xWU/Spark-use-case?node-id=31%3A200)
4. Generate a README file listing all actions performed, and commit it as a first commit (after the import) to the ‘main’ branch of the created repository.

### Templating Service

The [templating service](https://github.com/treeverse/lakeFS/pull/3373) is responsible for fetching, authenticating, and expanding the required templates and returning them to the client.

**Process**:
1. Get the template (the location should be specified in the incoming request). The file must be a valid [`html/template`](https://pkg.go.dev/html/template) (specified using the `.<type>.html.tt` suffix) or [`text/template`](https://pkg.go.dev/text/template) (specified using the `.<type>.tt` suffix) parsable template text.
2. Use the configured template functions to validate the user’s permissions to perform the required actions, and to generate credentials on the fly.
3. Expand the template with the config file and query-string parameters, and return it with the correct `Content-Type` header (inferred from the template). A sketch of this expansion step follows.
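To make the expansion step concrete, here is a minimal Go sketch, assuming a `text/template` template and a stand-in `new_credentials` function; all names and the credential values are illustrative, not the actual lakeFS implementation.

```go
package main

import (
	"net/url"
	"os"
	"text/template"
)

// Credentials is an illustrative stand-in for the key pair that the
// new_credentials template function generates on the fly.
type Credentials struct {
	Key    string
	Secret string
}

func main() {
	// Query-string parameters of the incoming request, e.g.
	// ?lakefs_url=https://my-lakefs.io&template_location=databricksConfig.props.tt
	query, _ := url.ParseQuery("lakefs_url=https://my-lakefs.io")

	funcs := template.FuncMap{
		// Hypothetical: the real function would also validate the caller's
		// permission to create credentials and abort the expansion otherwise.
		"new_credentials": func() (*Credentials, error) {
			return &Credentials{Key: "AKIA...", Secret: "..."}, nil
		},
	}

	// A *.props.tt file is parsed with text/template; a *.html.tt file
	// would use html/template for contextual escaping instead.
	tmpl := template.Must(template.New("databricksConfig.props.tt").Funcs(funcs).Parse(
		`{{with $creds := new_credentials}}spark.hadoop.fs.s3a.access_key={{$creds.Key}}
{{end}}spark.hadoop.fs.s3a.endpoint={{ .query.lakefs_url }}
`))

	// Expose the query parameters under .query, matching the example
	// templates later in this document.
	data := map[string]interface{}{
		"query": map[string]string{"lakefs_url": query.Get("lakefs_url")},
	}
	_ = tmpl.Execute(os.Stdout, data)
}
```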
### Wizard Templates

The following templates will be stored within the lakeFS binary and selected by the Wizard component for expansion:
* SparkEMR.conf.tt
* SparkDatabricks.conf.tt
* SparkXML.conf.tt
* README.md.tt
* MetastoreEMR.conf.tt
* MetastoreDatabricks.conf.tt
* MetastoreXML.conf.tt

---

## APIs

### Templating Service

- **Endpoint**: `/api/v1/templates`
- **Request**:
  - Method: `GET`
  - Parameters:
    - Template URL (`template_location`): `string` - URL of the template, retrieved from the query string. Must be relative (to a URL configured on the server).
    - Any other configuration required by the template: `string` - retrieved from the query string.
- **Response**:
  - Return value: the expanded template.
  - Headers:
    - `Content-Type` - the template's content type.
- **Errors**:
  1. *403 - Forbidden*: The requesting user is forbidden from accessing the configurations or functionality (like generating credentials).
  2. *400 - Bad Request*: The request is missing information necessary for the template’s expansion.
  3. *500 - Internal Server Error*: The lakeFS server cannot access some provided template locations.
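For illustration, a minimal sketch of how a client such as the wizard might call this endpoint; `fetchTemplate` is a hypothetical helper, and only the endpoint, parameters, and status codes above are taken from the spec.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
)

// fetchTemplate asks the templating service to expand one template and
// returns the body along with its Content-Type.
func fetchTemplate(base, location, lakefsURL string) (body []byte, contentType string, err error) {
	q := url.Values{}
	q.Set("template_location", location) // must be relative, e.g. "databricksConfig.props.tt"
	q.Set("lakefs_url", lakefsURL)       // extra parameter consumed by the template itself
	resp, err := http.Get(base + "/api/v1/templates?" + q.Encode())
	if err != nil {
		return nil, "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		// 403, 400, and 500 correspond to the error cases listed above.
		return nil, "", fmt.Errorf("templating service: %s", resp.Status)
	}
	body, err = io.ReadAll(resp.Body)
	// The Content-Type header tells the wizard how to present the snippet.
	return body, resp.Header.Get("Content-Type"), err
}

func main() {
	body, contentType, err := fetchTemplate("https://my-lakefs.io", "databricksConfig.props.tt", "https://my-lakefs.io")
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s:\n%s\n", contentType, body)
}
```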
---

## Possible Flows

### Example template

**Databricks Spark configurations**
Name: *databricksConfig.props.tt*
```properties
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
{{with $creds := new_credentials}}
spark.hadoop.fs.s3a.access_key={{$creds.Key}}
spark.hadoop.fs.s3a.secret_key={{$creds.Secret}}
{{end}}
spark.hadoop.fs.s3a.endpoint={{ .query.lakefs_url }}
spark.hadoop.fs.s3a.path.style.access=true
```

**Local Metastore configurations**
Name: *localMetastoreConfig.xml.tt*
```xml
<configuration>
    <property>
        <name>fs.s3a.path.style.access</name>
        <value>true</value>
    </property>
    <property>
        <name>fs.s3.impl</name>
        <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
    </property>
    <property>
        <name>fs.s3a.endpoint</name>
        <value>{{ .query.lakefs_url }}</value>
    </property>
    {{with $creds := new_credentials}}
    <property>
        <name>fs.s3a.access.key</name>
        <value>{{$creds.Key}}</value>
    </property>
    <property>
        <name>fs.s3a.secret.key</name>
        <value>{{$creds.Secret}}</value>
    </property>
    {{end}}
</configuration>
```

### Happy flow - All steps

1. The user clicks ‘Create Repository’, then clicks ‘Spark Quickstart’.
2. The wizard shows an input textbox for the user to type the repo name. The user types ‘spark-repo’.
3. The wizard creates a repo named ‘spark-repo’ with a default ‘main’ branch.
4. The wizard asks the user whether they want to import existing data to lakeFS. The user specifies the location of the bucket (after validating that the lakeFS role has the right permissions and that the bucket has the correct policy) and clicks ‘OK’.
   * An object counter will show the progress of the import process and signal once it’s done.
5. The wizard asks the user for their lakeFS endpoint (showing a default placeholder pointing to the current URL).
6. The wizard sends a GET request to the templating service with a query string of the format:
   ```
   ?lakefs_url=https://my-lakefs.io&template_location=databricksConfig.props.tt
   ```
7. The templating service fetches each requested template from its provided location, expands the `.query.lakefs_url` parameter, and returns a response of the format:
   ```properties
   spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
   spark.hadoop.fs.s3a.access_key=ACCESSKEYDONTTELL
   spark.hadoop.fs.s3a.secret_key=SECRETKEYDONTTELL
   spark.hadoop.fs.s3a.endpoint=https://my-lakefs.io
   spark.hadoop.fs.s3a.path.style.access=true
   ```
   Each returned template is accompanied by a `Content-Type` header describing how it should be presented.
8. The wizard presents each configuration in a separate snippet view for the user to copy and paste into their configuration files.
9. The wizard sends a GET request to the templating service, this time to generate a README.md file, with a query string of the format:
   ```
   ?lakefs_url=https://my-lakefs.io&template_location=README.md.tt
   ```
10. The returned README file describes the steps taken and the configurations generated (without secrets), along with some commands explaining how to connect Hive Metastore to lakeFS:
    ```markdown
    1. Created a repository "spark-repo" and branch "main".
    2. Imported data from <S3 location>.
    3. Generated the following configurations:
       <Spark configurations with hidden credentials>
       <Metastore configurations with hidden credentials>
    4. Instructions to configure Hive Metastore with lakeFS.
    5. Generated this README file and committed it.
    ```
11. The wizard uploads and commits the README file to the ‘main’ branch.

### Happy flow - Spark template only

1. The user clicks ‘Create Repository’, then clicks ‘Spark Quickstart’.
2. The wizard shows an input textbox for the user to type the repo name. The user types ‘spark-repo’.
3. The wizard creates a repo named ‘spark-repo’ with a default ‘main’ branch.
4. The wizard asks the user whether they want to import existing data to lakeFS. The user skips this step using the skip button.
5. The wizard asks the user for their lakeFS endpoint (showing a default placeholder pointing to the current URL).
6. The wizard sends a GET request to the templating service with a query string of the format:
   ```
   ?lakefs_url=https://my-lakefs.io&template_location=databricksConfig.props.tt
   ```
7. The templating service fetches each requested template from its provided location, expands the `.query.lakefs_url` parameter, and returns a response of the format:
   ```properties
   spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
   spark.hadoop.fs.s3a.access_key=ACCESSKEYDONTTELL
   spark.hadoop.fs.s3a.secret_key=SECRETKEYDONTTELL
   spark.hadoop.fs.s3a.endpoint=https://my-lakefs.io
   spark.hadoop.fs.s3a.path.style.access=true
   ```
   Each returned template is accompanied by a `Content-Type` header describing how it should be presented.
8. The wizard presents each configuration in a separate snippet view for the user to copy and paste into their configuration files.
9. The wizard sends a GET request to the templating service, this time to generate a README.md file, with a query string of the format:
   ```
   ?lakefs_url=https://my-lakefs.io&template_location=README.md.tt
   ```
10. The returned README file describes the steps taken and the configurations generated (without secrets), along with some commands explaining how to connect Hive Metastore to lakeFS:
    ```markdown
    1. Created a repository "spark-repo" and branch "main".
    2. Generated the following configurations:
       <Spark configurations with hidden secrets>
    3. Instructions to configure Hive Metastore with lakeFS.
    4. Generated this README file and committed it.
    ```
11. The wizard uploads and commits the README file to the ‘main’ branch.
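Since README.md.tt itself is not shown in this document, here is a hedged sketch of what it could look like, following the same template conventions as the examples above; the `repo` and `import_location` parameters are hypothetical names, not part of the spec, and the numbering is simplified.

```markdown
# Welcome to {{ .query.repo }}

1. Created a repository "{{ .query.repo }}" and branch "main".
{{if .query.import_location}}2. Imported data from {{ .query.import_location }}.
{{end}}3. Generated Spark and Metastore configurations (credentials hidden).
4. Instructions to configure Hive Metastore with lakeFS.
5. Generated this README file and committed it.
```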
### Sad flow - No import permissions

1. The user clicks ‘Create Repository’, then clicks ‘Spark Quickstart’.
2. The wizard shows an input textbox for the user to enter the repo name. The user types ‘spark-repo’.
3. The wizard creates a repo named ‘spark-repo’ with a default ‘main’ branch.
4. The wizard asks the user whether they want to import existing data to lakeFS. The user specifies the location of the bucket and clicks ‘OK’.
   1. The import functionality fails, as there are no permissions to access the given storage.
   2. The wizard shows an error message like “Please verify your lakeFS server and storage have the required permissions” and links to the docs listing the needed permissions.
5. The flow continues as above.
   - The generated README will not include the import step.

### Sad flow - No credential generation permissions

1. The user clicks ‘Create Repository’, then clicks ‘Spark Quickstart’.
2. The wizard shows an input textbox for the user to enter the repo name. The user types ‘spark-repo’.
3. The wizard creates a repo named ‘spark-repo’ with a default ‘main’ branch.
4. The wizard asks the user whether they want to import existing data to lakeFS. The user skips this step using the skip button.
5. The wizard sends a GET request to the templating service with a query string of the format:
   ```
   ?lakefs_url=<url>&template_location=databricksConfig.props.tt
   ```
6. The templating service fetches the template from the provided location but fails to generate the user’s credentials, as the requesting user doesn’t have the required permissions.
7. The templating service returns **‘403 Forbidden’** to the wizard.
8. The wizard prompts a message saying that the user doesn’t have the permissions required for generating credentials.
9. The flow continues as described above.

### Sad flow - Missing template properties

1. The user clicks ‘Create Repository’, then clicks ‘Spark Quickstart’.
2. The wizard shows an input textbox for the user to enter the repo name. The user types ‘spark-repo’.
3. The wizard creates a repo named ‘spark-repo’ with a default ‘main’ branch.
4. The wizard asks the user whether they want to import existing data to lakeFS. The user skips this step using the skip button.
5. The wizard sends a GET request to the templating service with a query string of the format:
   ```
   ?template_location=databricksConfig.props.tt
   ```
6. The templating service fails to satisfy the `lakefs_url` template property and returns **‘400 Bad Request: error code 1’** to the wizard.
7. The wizard prompts a message saying that some required information was not specified and that the user should make sure they entered everything along the way.
8. The flow continues as described above.

### Sad flow - No fetching permissions

1. The user clicks ‘Create Repository’, then clicks ‘Spark Quickstart’.
2. The wizard shows an input textbox for the user to enter the repo name. The user types ‘spark-repo’.
3. The wizard creates a repo named ‘spark-repo’ with a default ‘main’ branch.
4. The wizard asks the user whether they want to import existing data to lakeFS. The user skips this step using the skip button.
5. The wizard sends a GET request to the templating service with a query string of the format:
   ```
   ?lakefs_url=<url>&template_location=databricksConfig.props.tt
   ```
6. The templating service tries to fetch the template from the provided location and fails, as the server doesn’t have sufficient permissions.
7. The templating service returns **‘500 Internal Server Error’** to the wizard.
8. The wizard prompts a message saying that the server could not access the requested template.
9. The flow continues as described above.
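These failure modes map directly onto the error codes in the API section. A minimal sketch of that mapping, assuming hypothetical sentinel errors (lakeFS's actual error values will differ):

```go
package templating

import (
	"errors"
	"net/http"
)

// Hypothetical sentinel errors for the failure modes in the sad flows.
var (
	ErrGenerateCredentials = errors.New("not allowed to generate credentials") // 403
	ErrMissingParam        = errors.New("missing template parameter")          // 400
	ErrFetchTemplate       = errors.New("cannot access template location")     // 500
)

// writeExpandError translates a template-expansion failure into the HTTP
// status codes listed in the API section.
func writeExpandError(w http.ResponseWriter, err error) {
	switch {
	case errors.Is(err, ErrGenerateCredentials):
		http.Error(w, err.Error(), http.StatusForbidden)
	case errors.Is(err, ErrMissingParam):
		http.Error(w, err.Error(), http.StatusBadRequest)
	default: // includes ErrFetchTemplate and anything unexpected
		http.Error(w, err.Error(), http.StatusInternalServerError)
	}
}
```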
---

## Monitoring

### Operative Metrics

1. Templating service was called
   ```json
   {
     "class": "templating_service",
     "name": "calling_service",
     "value": "<service name>"
   }
   ```
2. Templating service - status 200
   ```json
   {
     "class": "templating_service",
     "name": "successful_call",
     "value": "<service name>"
   }
   ```
3. Templating service - status 500 - no access to the provided template location
   ```json
   {
     "class": "templating_service",
     "name": "no_access",
     "value": "<service name>"
   }
   ```
4. Templating service - status 5xx - general
   ```json
   {
     "class": "templating_service",
     "name": "5xx",
     "value": "<service name>"
   }
   ```
5. Templating service - status 4xx - general
   ```json
   {
     "class": "templating_service",
     "name": "4xx",
     "value": "<service name>"
   }
   ```

### BI Metrics

Sent directly from the GUI Wizard.

1. Wizard GUI - Quickstart started
   ```json
   {
     "class": "spark_wizard",
     "name": "quickstart_start",
     "value": 1
   }
   ```
2. Wizard GUI - Import data requested
   ```json
   {
     "class": "spark_wizard",
     "name": "import_data",
     "value": 1
   }
   ```
3. Wizard GUI - Spark config generated
   ```json
   {
     "class": "spark_wizard",
     "name": "generate_spark_template",
     "value": 1
   }
   ```
4. Wizard GUI - Quickstart ended
   ```json
   {
     "class": "spark_wizard",
     "name": "quickstart_end",
     "value": 1
   }
   ```
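All of these events share one shape, differing only in whether `value` carries a service name (operative metrics) or a count (BI metrics). A sketch of that shared shape, with hypothetical names (`MetricEvent`, `Encode`) rather than lakeFS's actual reporting types:

```go
package metrics

import "encoding/json"

// MetricEvent is an illustrative shape for the events listed above.
type MetricEvent struct {
	Class string `json:"class"`
	Name  string `json:"name"`
	// Operative metrics carry the calling service's name (a string);
	// BI metrics carry a count (a number).
	Value interface{} `json:"value"`
}

// Encode renders an event as one of the JSON payloads shown above, e.g.
// Encode(MetricEvent{Class: "templating_service", Name: "calling_service", Value: "wizard"}).
func Encode(e MetricEvent) ([]byte, error) {
	return json.Marshal(e)
}
```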