# Generated Documentation and Embedded Help Texts

Many parts of the `parser` package include special consideration for generation
or other production of user-facing documentation. This includes interactive help
messages, generated documentation of the set of available functions, or diagrams
of the various expressions.

Generated documentation is produced and maintained at compile time, while the
interactive, contextual help is returned at runtime.

We equip the generated parser with the ability to report contextual
help in two circumstances:

- when the user explicitly requests help with the HELPTOKEN (current syntax: standalone "`??`")
- when the user makes a grammatical mistake (e.g. `INSERT sometable INTO(x, y) ...`)

We use the `docgen` tool to produce the generated documentation files that are
then included in the broader (handwritten) published documentation.

# Help texts embedded in the grammar

The help is embedded in the grammar using special markers in
yacc comments, for example:

```
// %Help: HELPKEY - shortdescription
// %Category: SomeCat
// %Text: whatever until next %marker at start of line, or non-comment.
// %SeeAlso: whatever until next %marker at start of line, or non-comment.
// %End (optional)
```

The "HELPKEY" becomes the map key in the generated Go map.

These texts are extracted automatically by `help.awk` and converted
into a Go data structure in `help_messages.go`.

# Support in the parser

## Primary mechanism - LALR error recovery

The primary mechanism is leveraging error recovery in LALR parsers
using the special `error` token [1] [2]: when an unexpected token is
encountered, the LALR parser will pop tokens on the stack until the
prefix matches a grammar rule with the special "`error`" token (if
any). If such a rule exists, its action is used to reduce and the
erroneous tokens are discarded.

**This mechanism is used both when the user makes a mistake, and when
the user inserts the HELPTOKEN in the middle of a statement.** When
present in the middle of a statement, HELPTOKEN is considered an error
and triggers the error recovery.

We use this for contextual help by providing `error` rules that
generate a contextual help text during LALR error recovery.

For example:

```
backup_stmt:
  BACKUP targets TO string_or_placeholder opt_as_of_clause opt_incremental opt_with_options
  {
    $$.val = &Backup{Targets: $2.targetList(), To: $4.expr(), IncrementalFrom: $6.exprs(), AsOf: $5.asOfClause(), Options: $7.kvOptions()}
  }
| BACKUP error { return helpWith(sqllex, `BACKUP`) }
```

In this example, the grammar specifies that if the BACKUP keyword is
followed by some input tokens such that the first (valid) grammar rule
doesn't apply, the parser will "recover from the error" by
backtracking up until the point it only sees `BACKUP` on the stack
followed by non-parsable tokens, at which point it takes the `error`
rule and executes its action.

The action is `return helpWith(...)`. What this does is:

- halts parsing (the generated parser executes all actions
  in a big loop; a `return` interrupts this loop);
- makes the parser return with an error (the `helpWith`
  function returns non-zero);
- extends the parsing error message with a help
  text; this help text can subsequently be exploited in a client to
  display the help message in a friendly manner.

### Code generation

Since the pattern "`{ return helpWith(sqllex, ...) }`" is common, we also implement
a shorthand syntax based on comments, for example:

```
backup_stmt:
  ...
| BACKUP error // SHOW HELP: BACKUP
```

The special comment syntax "`SHOW HELP: XXXX`" is substituted by means
of an auxiliary script (`replace_help_rules.awk`) into the form
explained above.

## Secondary mechanism - explicit help token

The mechanism described above works both when the user makes a
grammatical error and when they place the HELPTOKEN in the middle of a
statement, rendering it invalid.

However, for contextual help this is not sufficient: what happens if
the user requests HELPTOKEN *at a position in the grammar where
everything before is a complete, valid SQL input*?

For example: `DELETE FROM foo ??`

When encountering this input, the LALR parser will see `DELETE FROM
foo` first, then *reduce* using the DELETE action because everything
up to this point is a valid DELETE statement. When the HELPTOKEN is
encountered, the statement has already been completed *and the LALR
parser doesn't 'know' any more that it was in the context of a DELETE
statement*.

Suppose we try to place an `error`-based recovery rule at the top level:

```
stmt:
  alter_stmt
| backup_stmt
| ...
| delete_stmt
| ...
| error { ??? }
```

This wouldn't work: the code inside the `error` action cannot
"observe" the tokens seen so far, so there would be no way to know
whether the error should be about DELETE, or instead about ALTER,
BACKUP, etc.

So in order to handle HELPTOKEN after a valid statement, we must place
it in a rule where the context is still available, that is *before the
statement's grammar rule is reduced.*

Where would that be? Suppose we had a simple statement rule:

```
somesimplestmt:
  SIMPLE DO SOMETHING { $$ = reduce(...) }
| SIMPLE error { help ... }
```

We could extend it with:

```
somesimplestmt:
  SIMPLE DO SOMETHING { $$ = reduce(...) }
| SIMPLE DO SOMETHING HELPTOKEN { help ... }
| SIMPLE error { help ... }
```

The following alternative also works:

```
somesimplestmt:
  SIMPLE DO SOMETHING { $$ = reduce(...) }
| SIMPLE DO SOMETHING error { help ... }
| SIMPLE error { help ... }
```

That is all fine and dandy, but in SQL we have statements with many alternate forms, for example:

```
alter_rename_table_stmt:
  ALTER TABLE relation_expr RENAME TO qualified_name { ... }
| ALTER TABLE IF EXISTS relation_expr RENAME TO qualified_name { ... }
| ALTER TABLE relation_expr RENAME opt_column name TO name { ... }
| ALTER TABLE IF EXISTS relation_expr RENAME opt_column name TO name { ... }
```

To add complementary handling of the help token at the end of valid statements we could, but would hate to, duplicate all the rules:

```
alter_rename_table_stmt:
  ALTER TABLE relation_expr RENAME TO qualified_name { ... }
| ALTER TABLE relation_expr RENAME TO qualified_name HELPTOKEN { help ... }
| ALTER TABLE IF EXISTS relation_expr RENAME TO qualified_name { ... }
| ALTER TABLE IF EXISTS relation_expr RENAME TO qualified_name HELPTOKEN { help ... }
| ALTER TABLE relation_expr RENAME opt_column name TO name { ... }
| ALTER TABLE relation_expr RENAME opt_column name TO name HELPTOKEN { help ... }
| ALTER TABLE IF EXISTS relation_expr RENAME opt_column name TO name { ... }
| ALTER TABLE IF EXISTS relation_expr RENAME opt_column name TO name HELPTOKEN { help ... }
```

This duplication is horrendous (not to mention hard to maintain), so
instead we should attempt to factor the help token *in a context where
it is still known that we are dealing just with that statement*.

The following works:

```
alter_rename_table_stmt:
  real_alter_rename_table_stmt { $$ = $1 }
| real_alter_rename_table_stmt HELPTOKEN { help ... }

real_alter_rename_table_stmt:
  ALTER TABLE relation_expr RENAME TO qualified_name { ... }
| ALTER TABLE IF EXISTS relation_expr RENAME TO qualified_name { ... }
| ALTER TABLE relation_expr RENAME opt_column name TO name { ... }
| ALTER TABLE IF EXISTS relation_expr RENAME opt_column name TO name { ... }
```

Or does it? Without anything else, yacc complains with a "shift/reduce
conflict". The reason is an ambiguity: when the parsing
stack contains everything sufficient to match a
`real_alter_rename_table_stmt`, there is a choice between *reducing*
the simple form `alter_rename_table_stmt:
real_alter_rename_table_stmt`, or *shifting* into the more complex
form `alter_rename_table_stmt: real_alter_rename_table_stmt
HELPTOKEN`.

This is another form of the textbook situation when yacc is used to
parse if-else statements in a programming language: the rule `stmt: IF
cond THEN body | IF cond THEN body ELSE body` is ambiguous (and yields
a shift/reduce conflict) for exactly the same reason.

The solution here is also straight out of a textbook: one simply
informs yacc of the relative priority between the two candidate
rules. In this case, when faced with a neutral choice, we encourage
yacc to shift. The particular mechanism is to tell yacc that one rule
has a *higher priority* than another.

It just so happens, however, that the yacc language only allows us to
set relative priorities of *tokens*, not rules. And here we have a
problem: of the two rules that need to be prioritized, only one has a
token to work with (the one with HELPTOKEN). Which token should we
prioritize for the other?

Conveniently, yacc knows about this trouble and offers us an awkward
but working solution: we can tell it "use for this rule the same
priority level as an existing token, even though the token is not part
of the rule". The syntax for this is `rule %prec TOKEN`.

We can then use this as follows:

```
alter_rename_table_stmt:
  real_alter_rename_table_stmt %prec LOWTOKEN { $$ = $1 }
| real_alter_rename_table_stmt HELPTOKEN %prec HIGHTOKEN { help ... }
```

We could create two new pseudo-tokens for this (called `LOWTOKEN` and
`HIGHTOKEN`); conveniently, however, we can also reuse otherwise valid
tokens that have known relative priorities. We settled in our case on
`VALUES` (low priority) and `UMINUS` (high priority).

### Code generation

With the latter mechanism presented above, the pattern

```
rule:
  somerule %prec VALUES
| somerule HELPTOKEN %prec UMINUS { help ... }
```

becomes very common, so we automate it with the following special syntax:

```
rule:
  somerule // EXTEND WITH HELP: XXX
```

The code replacement in `replace_help_rules.awk` expands this to
the form above automatically.

# Generated Documentation

Documentation of the SQL functions and operators is generated by the `docgen`
utility, using `make generate PKG=./docs/...`. The markdown-formatted files are
kept in `docs/generated/sql` and should be re-generated whenever the
functions/operators they document change. If regenerating produces a
diff, a CI failure is expected.

# References

1. https://www.gnu.org/software/bison/manual/html_node/Error-Recovery.html
2. http://stackoverflow.com/questions/9796608/error-handling-in-yacc
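
As an illustration of the two comment shorthands described in the "Code generation" sections (`SHOW HELP: XXX` and `EXTEND WITH HELP: XXX`), the following Go program sketches the kind of textual expansion involved. It is not the actual `replace_help_rules.awk` script: the function name `expandHelpRules`, the regular expressions, the exact whitespace of the output, and the rule name `real_backup_stmt` are all assumptions made for this example. It also assumes that the abbreviated `{ help ... }` action expands to the same `return helpWith(sqllex, ...)` call shown for the `error` rules.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Hypothetical patterns for the two shorthands; the real awk script
// may recognize them differently.
var (
	showHelpRE   = regexp.MustCompile(`^(.*?)\s*// SHOW HELP: (.+)$`)
	extendHelpRE = regexp.MustCompile(`^(\s*\|?\s*)(.*?)\s*// EXTEND WITH HELP: (.+)$`)
)

// expandHelpRules rewrites the comment shorthands line by line and
// leaves every other line of the grammar untouched.
func expandHelpRules(src string) string {
	var out []string
	for _, line := range strings.Split(src, "\n") {
		if m := showHelpRE.FindStringSubmatch(line); m != nil {
			// "X error // SHOW HELP: KEY" becomes an explicit action.
			key := strings.TrimSpace(m[2])
			out = append(out, fmt.Sprintf("%s { return helpWith(sqllex, `%s`) }", m[1], key))
			continue
		}
		if m := extendHelpRE.FindStringSubmatch(line); m != nil {
			// "rule // EXTEND WITH HELP: KEY" becomes two alternatives:
			// a low-priority plain form and a high-priority HELPTOKEN form.
			lead, rule, key := m[1], m[2], strings.TrimSpace(m[3])
			indent := lead[:len(lead)-len(strings.TrimLeft(lead, " \t"))]
			out = append(out,
				fmt.Sprintf("%s%s %%prec VALUES", lead, rule),
				fmt.Sprintf("%s| %s HELPTOKEN %%prec UMINUS { return helpWith(sqllex, `%s`) }", indent, rule, key))
			continue
		}
		out = append(out, line)
	}
	return strings.Join(out, "\n")
}

func main() {
	// real_backup_stmt is a made-up rule name used only for this demo.
	grammar := strings.Join([]string{
		"backup_stmt:",
		"  real_backup_stmt // EXTEND WITH HELP: BACKUP",
		"  | BACKUP error // SHOW HELP: BACKUP",
	}, "\n")
	fmt.Println(expandHelpRules(grammar))
}
```

Working line by line keeps the sketch close in spirit to an awk filter: each shorthand is self-contained on its own line, so no knowledge of the surrounding grammar is needed to expand it.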