# Generated Documentation and Embedded Help Texts

Many parts of the `parser` package include special consideration for generation
or other production of user-facing documentation. This includes interactive help
messages, generated documentation of the set of available functions, or diagrams
of the various expressions.

Generated documentation is produced and maintained at compile time, while the
interactive, contextual help is returned at runtime.

We equip the generated parser with the ability to report contextual
help in two circumstances:

- when the user explicitly requests help with the HELPTOKEN (current syntax: standalone "`??`")
- when the user makes a grammatical mistake (e.g. `INSERT sometable INTO(x, y) ...`)

We use the `docgen` tool to produce the generated documentation files that are
then included in the broader (handwritten) published documentation.

# Help texts embedded in the grammar

The help is embedded in the grammar using special markers in
yacc comments, for example:

```
// %Help: HELPKEY - shortdescription
// %Category: SomeCat
// %Text: whatever until next %marker at start of line, or non-comment.
// %SeeAlso: whatever until next %marker at start of line, or non-comment.
// %End (optional)
```

The "HELPKEY" becomes the map key in the generated Go map.

These texts are extracted automatically by `help.awk` and converted
into a Go data structure in `help_messages.go`.

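As a rough illustration of the extraction step, the sketch below parses the markers from a grammar fragment into a Go map keyed by HELPKEY. It is written in Go for readability only: the real extraction is done by `help.awk`, and the `helpEntry` type, its fields, `extractHelp`, and the sample input are hypothetical, not the structure actually emitted into `help_messages.go`.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// helpEntry is a hypothetical container for one help block.
type helpEntry struct {
	Short    string
	Category string
	Text     string
	SeeAlso  string
}

// appendLine adds a continuation line to a multi-line field.
func appendLine(dst *string, s string) {
	if *dst == "" {
		*dst = s
		return
	}
	*dst += "\n" + s
}

// extractHelp collects "// %Marker: value" comments. A block starts at %Help
// and ends at %End or at the first non-comment line.
func extractHelp(src string) map[string]helpEntry {
	out := map[string]helpEntry{}
	var key, field string
	var cur helpEntry
	flush := func() {
		if key != "" {
			out[key] = cur
		}
		key, field, cur = "", "", helpEntry{}
	}
	sc := bufio.NewScanner(strings.NewReader(src))
	for sc.Scan() {
		line := sc.Text()
		if !strings.HasPrefix(line, "//") {
			flush() // a non-comment line closes the current block
			continue
		}
		body := strings.TrimPrefix(strings.TrimPrefix(line, "//"), " ")
		if !strings.HasPrefix(body, "%") {
			// Continuation of the most recent multi-line field, if any.
			switch field {
			case "Text":
				appendLine(&cur.Text, body)
			case "SeeAlso":
				appendLine(&cur.SeeAlso, body)
			}
			continue
		}
		name, val, _ := strings.Cut(body[1:], ":")
		val = strings.TrimSpace(val)
		switch strings.Fields(name + " ")[0] {
		case "Help":
			flush()
			k, short, _ := strings.Cut(val, " - ")
			key, cur.Short = strings.TrimSpace(k), strings.TrimSpace(short)
		case "Category":
			cur.Category, field = val, ""
		case "Text":
			cur.Text, field = val, "Text"
		case "SeeAlso":
			cur.SeeAlso, field = val, "SeeAlso"
		case "End":
			flush()
		}
	}
	flush()
	return out
}

func main() {
	// Made-up sample input, for illustration only.
	src := "// %Help: FOO - do foo\n// %Category: Misc\n// %Text: line one\n// line two\n// %End\nfoo_stmt:\n"
	fmt.Printf("%+v\n", extractHelp(src)["FOO"])
}
```

Keying the map by HELPKEY mirrors the statement above that the key becomes the map key in the generated Go map; everything else about the layout is an assumption of this sketch.
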
# Support in the parser

## Primary mechanism - LALR error recovery

The primary mechanism leverages error recovery in LALR parsers
using the special `error` token [1] [2]: when an unexpected token is
encountered, the LALR parser will pop tokens off the stack until the
prefix matches a grammar rule with the special "`error`" token (if
any). If such a rule exists, its action is used to reduce and the
erroneous tokens are discarded.

**This mechanism is used both when the user makes a mistake, and when
the user inserts the HELPTOKEN in the middle of a statement.** When
present in the middle of a statement, HELPTOKEN is considered an error
and triggers the error recovery.

We use this for contextual help by providing `error` rules that
generate a contextual help text during LALR error recovery.

For example:

```
backup_stmt:
  BACKUP targets TO string_or_placeholder opt_as_of_clause opt_incremental opt_with_options
  {
    $$.val = &Backup{Targets: $2.targetList(), To: $4.expr(), IncrementalFrom: $6.exprs(), AsOf: $5.asOfClause(), Options: $7.kvOptions()}
  }
| BACKUP error { return helpWith(sqllex, `BACKUP`) }
```

In this example, the grammar specifies that if the BACKUP keyword is
followed by some input tokens such that the first (valid) grammar rule
doesn't apply, the parser will "recover from the error" by
backtracking until the point where it only sees `BACKUP` on the stack
followed by non-parsable tokens, at which point it takes the `error`
rule and executes its action.

The action is `return helpWith(...)`. What this does is:

- halts parsing (the generated parser executes all actions
  in a big loop; a `return` interrupts this loop);
- makes the parser return with an error (the `helpWith`
  function returns non-zero);
- extends the parsing error message with a help
  text; this help text can subsequently be exploited in a client to
  display the help message in a friendly manner.

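The recovery and the three effects of the action listed above can be modeled with a toy loop. This is a deliberately simplified sketch, not the generated parser: a real LALR parser works on states rather than raw symbols, and the `lexer` type, the `errorRules` and `expected` tables, the sample help text, and this body of `helpWith` are all hypothetical stand-ins.

```go
package main

import "fmt"

// lexer is a hypothetical stand-in for the lexer state passed to actions.
type lexer struct {
	lastError string
}

// helpMessages is made-up sample data standing in for the generated map.
var helpMessages = map[string]string{
	"BACKUP": "BACKUP - (illustrative help text)",
}

// errorRules maps a symbol that has a "<symbol> error" alternative to the
// help key used by that alternative's action.
var errorRules = map[string]string{"BACKUP": "BACKUP"}

// expected is the single valid statement of this toy grammar.
var expected = []string{"BACKUP", "targets", "TO", "location"}

// helpWith attaches the help text to the error and returns non-zero so that
// the caller reports a failed parse.
func helpWith(lex *lexer, key string) int {
	lex.lastError = "syntax error\nhelp: " + helpMessages[key]
	return 1
}

// parse shifts tokens while they match the toy grammar. On an unexpected
// token it pops the stack until the top symbol has an error rule, then runs
// that rule's action; the return statement ends the parse loop.
func parse(lex *lexer, tokens []string) int {
	var stack []string
	for _, tok := range tokens {
		if len(stack) < len(expected) && tok == expected[len(stack)] {
			stack = append(stack, tok) // shift
			continue
		}
		for len(stack) > 0 {
			if key, ok := errorRules[stack[len(stack)-1]]; ok {
				return helpWith(lex, key)
			}
			stack = stack[:len(stack)-1] // pop and keep looking
		}
		lex.lastError = "syntax error" // no error rule applies
		return 1
	}
	if len(stack) != len(expected) {
		lex.lastError = "syntax error: unexpected end of input"
		return 1
	}
	return 0
}

func main() {
	lex := &lexer{}
	rc := parse(lex, []string{"BACKUP", "targets", "??"})
	fmt.Println(rc)
	fmt.Println(lex.lastError)
}
```

In this model an unexpected token after `BACKUP targets` pops `targets`, finds the error rule attached to `BACKUP`, and returns with the help text appended to the error, which is the behavior a client would rely on to display the help.
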
### Code generation

Since the pattern "`{ return helpWith(sqllex, ...) }`" is common, we also implement
a shorthand syntax based on comments, for example:

```
backup_stmt:
   ...
| BACKUP error // SHOW HELP: BACKUP
```

The special comment syntax "`SHOW HELP: XXXX`" is substituted by means
of an auxiliary script (`replace_help_rules.awk`) into the form
explained above.

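The substitution can be pictured as a one-line textual rewrite. The sketch below is written in Go for illustration; the actual script is `replace_help_rules.awk`, and the regular expression and the `expandShowHelp` name are assumptions of this sketch rather than a transcription of that script.

```go
package main

import (
	"fmt"
	"regexp"
)

// showHelpRe matches the shorthand comment at the end of a grammar line and
// captures the help key.
var showHelpRe = regexp.MustCompile(`// SHOW HELP: (.+?)\s*$`)

// expandShowHelp replaces the shorthand comment with the explicit action.
// Lines without the shorthand are returned unchanged.
func expandShowHelp(line string) string {
	return showHelpRe.ReplaceAllString(line, "{ return helpWith(sqllex, `${1}`) }")
}

func main() {
	fmt.Println(expandShowHelp("| BACKUP error // SHOW HELP: BACKUP"))
}
```

Keeping the shorthand in a comment means the unexpanded grammar is still valid yacc input, which is presumably why a comment-based syntax was chosen.
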
## Secondary mechanism - explicit help token

The mechanism described above works both when the user makes a
grammatical error and when they place the HELPTOKEN in the middle of a
statement, rendering it invalid.

However, for contextual help this is not sufficient: what happens if
the user places the HELPTOKEN *at a position in the grammar where
everything before is a complete, valid SQL input*?

For example: `DELETE FROM foo ??`

When encountering this input, the LALR parser will see `DELETE FROM
foo` first, then *reduce* using the DELETE action because everything
up to this point is a valid DELETE statement. When the HELPTOKEN is
encountered, the statement has already been completed *and the LALR
parser doesn't 'know' any more that it was in the context of a DELETE
statement*.

Suppose we try to place an `error`-based recovery rule at the top level:

```
stmt:
  alter_stmt
| backup_stmt
| ...
| delete_stmt
| ...
| error { ??? }
```

This wouldn't work: the code inside the `error` action cannot
"observe" the tokens seen so far, and there would be no way to know
whether the error should be about DELETE, or instead about ALTER,
BACKUP, etc.

So in order to handle HELPTOKEN after a valid statement, we must place
it in a rule where the context is still available, that is *before the
statement's grammar rule is reduced.*

Where would that be? Suppose we had a simple statement rule:

```
somesimplestmt:
  SIMPLE DO SOMETHING { $$ = reduce(...) }
| SIMPLE error { help ... }
```

We could extend it with:

```
somesimplestmt:
  SIMPLE DO SOMETHING { $$ = reduce(...) }
| SIMPLE DO SOMETHING HELPTOKEN { help ... }
| SIMPLE error { help ... }
```

The following alternative also works:

```
somesimplestmt:
  SIMPLE DO SOMETHING { $$ = reduce(...) }
| SIMPLE DO SOMETHING error { help ... }
| SIMPLE error { help ... }
```

That is all fine and dandy, but in SQL we have statements with many alternate forms, for example:

```
alter_rename_table_stmt:
  ALTER TABLE relation_expr RENAME TO qualified_name { ... }
| ALTER TABLE IF EXISTS relation_expr RENAME TO qualified_name { ... }
| ALTER TABLE relation_expr RENAME opt_column name TO name { ... }
| ALTER TABLE IF EXISTS relation_expr RENAME opt_column name TO name { ... }
```

To add complementary handling of the help token at the end of valid statements we could, but would hate to, duplicate all the rules:

```
alter_rename_table_stmt:
  ALTER TABLE relation_expr RENAME TO qualified_name { ... }
| ALTER TABLE relation_expr RENAME TO qualified_name HELPTOKEN { help ... }
| ALTER TABLE IF EXISTS relation_expr RENAME TO qualified_name { ... }
| ALTER TABLE IF EXISTS relation_expr RENAME TO qualified_name HELPTOKEN { help ... }
| ALTER TABLE relation_expr RENAME opt_column name TO name { ... }
| ALTER TABLE relation_expr RENAME opt_column name TO name HELPTOKEN { help ... }
| ALTER TABLE IF EXISTS relation_expr RENAME opt_column name TO name { ... }
| ALTER TABLE IF EXISTS relation_expr RENAME opt_column name TO name HELPTOKEN { help ... }
```

This duplication is horrendous (not to mention hard to maintain), so
instead we should attempt to factor the help token *in a context where
it is still known that we are dealing just with that statement*.

The following works:

```
alter_rename_table_stmt:
  real_alter_rename_table_stmt { $$ = $1 }
| real_alter_rename_table_stmt HELPTOKEN { help ... }

real_alter_rename_table_stmt:
  ALTER TABLE relation_expr RENAME TO qualified_name { ... }
| ALTER TABLE IF EXISTS relation_expr RENAME TO qualified_name { ... }
| ALTER TABLE relation_expr RENAME opt_column name TO name { ... }
| ALTER TABLE IF EXISTS relation_expr RENAME opt_column name TO name { ... }
```

Or does it? Without anything else, yacc complains with a "shift/reduce
conflict". The conflict comes from an ambiguity: when the parsing
stack contains everything needed to match a
`real_alter_rename_table_stmt`, there is a choice between *reducing*
the simple form `alter_rename_table_stmt:
real_alter_rename_table_stmt`, or *shifting* into the more complex
form `alter_rename_table_stmt: real_alter_rename_table_stmt
HELPTOKEN`.

This is another form of the textbook situation that arises when yacc is
used to parse if-else statements in a programming language: the rule `stmt: IF
cond THEN body | IF cond THEN body ELSE body` is ambiguous (and yields
a shift/reduce conflict) for exactly the same reason.

The solution here is also straight out of a textbook: one simply
informs yacc of the relative priority between the two candidate
rules. In this case, when faced with a neutral choice, we encourage
yacc to shift. The particular mechanism is to tell yacc that one rule
has a *higher priority* than another.

It just so happens, however, that the yacc language only allows us to
set relative priorities of *tokens*, not rules. And here we have a
problem: of the two rules that need to be prioritized, only one has a
token to work with (the one with HELPTOKEN). Which token should we
prioritize for the other?

Conveniently, yacc knows about this trouble and offers us an awkward
but working solution: we can tell it "use for this rule the same
priority level as an existing token, even though the token is not part
of the rule". The syntax for this is `rule %prec TOKEN`.

We can then use this as follows:

```
alter_rename_table_stmt:
  real_alter_rename_table_stmt           %prec LOWTOKEN { $$ = $1 }
| real_alter_rename_table_stmt HELPTOKEN %prec HIGHTOKEN { help ... }
```

We could create two new pseudo-tokens for this (called `LOWTOKEN` and
`HIGHTOKEN`); conveniently, however, we can also reuse otherwise valid
tokens that have known relative priorities. In our case we settled on
`VALUES` (low priority) and `UMINUS` (high priority).

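For readers unfamiliar with how precedence settles a shift/reduce conflict, the sketch below encodes the rule as generally documented for yacc-style generators: the precedence of the rule that could be reduced is compared with that of the lookahead token, and associativity breaks ties. The numeric levels are invented for illustration, and the assumption that HELPTOKEN sits above `VALUES` in the grammar's precedence declarations is not confirmed by this document.

```go
package main

import "fmt"

// assoc is the associativity declared for a precedence level.
type assoc int

const (
	left assoc = iota
	right
	nonassoc
)

// resolve decides a shift/reduce conflict from the precedence of the rule
// that could be reduced and the precedence and associativity of the
// lookahead token. Higher numbers bind tighter.
func resolve(rulePrec, tokenPrec int, tokenAssoc assoc) string {
	switch {
	case tokenPrec > rulePrec:
		return "shift"
	case tokenPrec < rulePrec:
		return "reduce"
	case tokenAssoc == right:
		return "shift"
	case tokenAssoc == left:
		return "reduce"
	default:
		return "error" // equal precedence on a non-associative level
	}
}

func main() {
	// Hypothetical levels: the short rule borrows the low VALUES level via
	// %prec, and the lookahead HELPTOKEN is assumed to be declared higher.
	const valuesLevel, helpTokenLevel = 1, 10
	fmt.Println(resolve(valuesLevel, helpTokenLevel, right))
}
```

Under these assumptions the parser keeps shifting when it sees the help token after a complete statement, which is the outcome the `%prec` annotations are meant to encourage.
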
### Code generation

With the latter mechanism presented above the pattern

```
rule:
  somerule           %prec VALUES
| somerule HELPTOKEN %prec UMINUS { help ...}
```

becomes super common, so we automate it with the following special syntax:

```
rule:
  somerule // EXTEND WITH HELP: XXX
```

And the code replacement in `replace_help_rules.awk` expands this to
the form above automatically.

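This expansion can likewise be sketched as a textual rewrite that turns one annotated alternative into two. The sketch is in Go for illustration only; the real work is done by `replace_help_rules.awk`, and the regular expression, the `expandExtendWithHelp` name, and the exact spacing and action text of the output are assumptions.

```go
package main

import (
	"fmt"
	"regexp"
)

// extendRe captures the leading indentation (and optional "|"), the rule
// body, and the help key of an annotated alternative.
var extendRe = regexp.MustCompile(`^(\s*\|?\s*)(.*?)\s*// EXTEND WITH HELP: (.+?)\s*$`)

// expandExtendWithHelp returns the two alternatives that replace one
// annotated line. Lines without the annotation are returned unchanged.
func expandExtendWithHelp(line string) []string {
	m := extendRe.FindStringSubmatch(line)
	if m == nil {
		return []string{line}
	}
	lead, rule, key := m[1], m[2], m[3]
	return []string{
		// Low priority on the plain form, so that yacc prefers to shift.
		lead + rule + " %prec VALUES",
		// High priority on the form that ends with the help token.
		"| " + rule + " HELPTOKEN %prec UMINUS { return helpWith(sqllex, `" + key + "`) }",
	}
}

func main() {
	for _, l := range expandExtendWithHelp("  somerule // EXTEND WITH HELP: XXX") {
		fmt.Println(l)
	}
}
```
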
# Generated Documentation

Documentation of the SQL functions and operators is generated by the `docgen`
utility, using `make generate PKG=./docs/...`. The markdown-formatted files are
kept in `docs/generated/sql` and should be regenerated whenever the
functions/operators they document change. If regenerating produces a
diff, CI is expected to fail.

# References

1. https://www.gnu.org/software/bison/manual/html_node/Error-Recovery.html
2. http://stackoverflow.com/questions/9796608/error-handling-in-yacc