     1  /*
     2  Package pattern implements a simple language for pattern matching Go ASTs.
     3  
     4  # Design decisions and trade-offs
     5  
     6  The language is designed specifically for the task of filtering ASTs
     7  to simplify the implementation of analyses in staticcheck.
     8  It is also intended to be trivial to parse and execute.
     9  
    10  To that end, we make certain decisions that make the language more
    11  suited to its task, while making certain queries infeasible.
    12  
    13  Furthermore, it is fully expected that the majority of analyses will still require ordinary Go code
    14  to further process the filtered AST, to make use of type information and to enforce complex invariants.
    15  It is not our goal to design a scripting language for writing entire checks in.
    16  
    17  # The language
    18  
    19  At its core, patterns are a representation of Go ASTs, allowing for the use of placeholders to enable pattern matching.
    20  Their syntax is inspired by LISP and Haskell, but unlike LISP, the core unit of patterns isn't the list, but the node.
    21  There is a fixed set of nodes, identified by name, and with the exception of the Or node, all nodes have a fixed number of arguments.
    22  In addition to nodes, there are atoms, which represent basic units such as strings or the nil value.
    23  
    24  Pattern matching is implemented via bindings, represented by the Binding node.
    25  A Binding can match nodes and associate them with names, to later recall the nodes.
    26  This allows for expressing "this node must be equal to that node" constraints.
    27  
    28  To simplify writing and reading patterns, a small amount of additional syntax exists on top of nodes and atoms.
    29  This additional syntax doesn't add any new features of its own; it simply provides shortcuts for creating nodes and atoms.
    30  
    31  To show an example of a pattern, first consider this snippet of Go code:
    32  
    33  	if x := fn(); x != nil {
    34  		for _, v := range x {
    35  			println(v, x)
    36  		}
    37  	}
    38  
    39  The corresponding AST expressed as an idiomatic pattern would look as follows:
    40  
    41  	(IfStmt
    42  		(AssignStmt (Ident "x") ":=" (CallExpr (Ident "fn") []))
    43  		(BinaryExpr (Ident "x") "!=" (Ident "nil"))
    44  		(RangeStmt
    45  			(Ident "_") (Ident "v") ":=" (Ident "x")
    46  			(CallExpr (Ident "println") [(Ident "v") (Ident "x")]))
    47  		nil)
    48  
    49  Two things are worth noting about this representation.
    50  First, the [el1 el2 ...] syntax is a short-hand for creating lists.
    51  It is a short-hand for el1:el2:[], which itself is a short-hand for (List el1 (List el2 (List nil nil))).
    52  Second, note the absence of a lot of lists in places that normally accept lists.
    53  For example, an assignment assigns a number of right-hand sides to a number of left-hand sides, yet our AssignStmt is lacking any form of list.
    54  This is due to the fact that a single node can match a list of exactly one element.
    55  Thus, the two following forms have identical matching behavior:
    56  
    57  	(AssignStmt (Ident "x") ":=" (CallExpr (Ident "fn") []))
    58  	(AssignStmt [(Ident "x")] ":=" [(CallExpr (Ident "fn") [])])
    59  
    60  This section serves as an overview of the language's syntax.
    61  More in-depth explanations of the matching behavior as well as an exhaustive list of node types follow in the coming sections.
    62  
    63  # Pattern matching
    64  
    65  TODO: write about pattern matching.
    66  
    67  Pattern matching is inspired by Haskell's syntax, but is much, much simpler and more naive.
    68  
    69  # Node types
    70  
    71  The language contains two kinds of nodes: those that map to nodes in the AST, and those that implement additional logic.
    72  
    73  Nodes that map directly to AST nodes are named identically to the types in the go/ast package.
    74  What follows is an exhaustive list of these nodes:
    75  
    76  	(ArrayType len elt)
    77  	(AssignStmt lhs tok rhs)
    78  	(BasicLit kind value)
    79  	(BinaryExpr x op y)
    80  	(BranchStmt tok label)
    81  	(CallExpr fun args)
    82  	(CaseClause list body)
    83  	(ChanType dir value)
    84  	(CommClause comm body)
    85  	(CompositeLit type elts)
    86  	(DeferStmt call)
    87  	(Ellipsis elt)
    88  	(EmptyStmt)
    89  	(Field names type tag)
    90  	(ForStmt init cond post body)
    91  	(FuncDecl recv name type body)
    92  	(FuncLit type body)
    93  	(FuncType params results)
    94  	(GenDecl specs)
    95  	(GoStmt call)
    96  	(Ident name)
    97  	(IfStmt init cond body else)
    98  	(ImportSpec name path)
    99  	(IncDecStmt x tok)
   100  	(IndexExpr x index)
   101  	(InterfaceType methods)
   102  	(KeyValueExpr key value)
   103  	(MapType key value)
   104  	(RangeStmt key value tok x body)
   105  	(ReturnStmt results)
   106  	(SelectStmt body)
   107  	(SelectorExpr x sel)
   108  	(SendStmt chan value)
   109  	(SliceExpr x low high max)
   110  	(StarExpr x)
   111  	(StructType fields)
   112  	(SwitchStmt init tag body)
   113  	(TypeAssertExpr)
   114  	(TypeSpec name type)
   115  	(TypeSwitchStmt init assign body)
   116  	(UnaryExpr op x)
   117  	(ValueSpec names type values)
   118  
   119  Additionally, there are the String, Token and nil atoms.
   120  Strings are double-quoted string literals, as in (Ident "someName").
   121  Tokens are also represented as double-quoted string literals, but are converted to token.Token values in contexts that require tokens,
   122  such as in (BinaryExpr x "<" y), where "<" is transparently converted to token.LSS during matching.
   123  The keyword 'nil' denotes the nil value, which represents the absence of any value.
   124  
   125  We also define the (List head tail) node, which is used to represent sequences of elements as a singly linked list.
   126  The head is a single element, and the tail is the remainder of the list.
   127  For example,
   128  
   129  	(List "foo" (List "bar" (List "baz" (List nil nil))))
   130  
   131  represents a list of three elements, "foo", "bar" and "baz". There is dedicated syntax for writing lists, which looks as follows:
   132  
   133  	["foo" "bar" "baz"]
   134  
   135  This syntax is itself syntactic sugar for the following form:
   136  
   137  	"foo":"bar":"baz":[]
   138  
   139  This form is of particular interest for pattern matching, as it allows matching on the head and tail. For example,
   140  
   141  	"foo":"bar":_
   142  
   143  would match any list with at least two elements, where the first two elements are "foo" and "bar". This is equivalent to writing
   144  
   145  	(List "foo" (List "bar" _))
   146  
   147  Note that it is not possible to match from the end of the list.
   148  That is, there is no way to express a query such as "a list of any length where the last element is foo".
   149  
   150  Note that unlike in LISP, nil and empty lists are distinct from one another.
   151  In patterns, with respect to lists, nil is akin to Go's untyped nil.
   152  It will match a nil ast.Node, but it will not match a nil []ast.Expr. Nil will, however, match pointers to named types such as *ast.Ident.
   153  Similarly, lists are akin to Go's slices.
   154  An empty list will match both a nil and an empty []ast.Expr, but it will not match a nil ast.Node.
   155  
   156  Due to the difference between nil and empty lists, an empty list is represented as (List nil nil), i.e. a list with no head or tail.
   157  Similarly, a list of one element is represented as (List el (List nil nil)). Unlike in LISP, it cannot be represented by (List el nil).
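
To make the linked-list representation concrete, here is a minimal sketch in plain Go.
The cell type and the fromSlice and hasPrefix helpers are hypothetical and exist only for this illustration; they model lists of strings and are not the package's List node.

```go
package main

import "fmt"

// cell is a hypothetical stand-in for the List node: a head element and the
// remainder of the list. The empty list is a cell with neither head nor
// tail, mirroring (List nil nil), and is distinct from a nil pointer.
type cell struct {
	head *string
	tail *cell
}

// empty reports whether c is the empty list, as opposed to nil.
func (c *cell) empty() bool {
	return c != nil && c.head == nil && c.tail == nil
}

// fromSlice builds el1:el2:[] from a Go slice.
func fromSlice(elems []string) *cell {
	list := &cell{} // (List nil nil)
	for i := len(elems) - 1; i >= 0; i-- {
		list = &cell{head: &elems[i], tail: list}
	}
	return list
}

// hasPrefix reports whether list starts with the given elements, which is
// what a pattern such as "foo":"bar":_ asks for.
func hasPrefix(list *cell, prefix ...string) bool {
	for _, want := range prefix {
		if list == nil || list.head == nil || *list.head != want {
			return false
		}
		list = list.tail
	}
	return true
}

func main() {
	list := fromSlice([]string{"foo", "bar", "baz"})
	fmt.Println(hasPrefix(list, "foo", "bar"))
	fmt.Println(hasPrefix(fromSlice([]string{"foo"}), "foo", "bar"))
	fmt.Println(fromSlice(nil).empty())
}
```

Because each cell only knows its head and its tail, walking from the front is natural, while matching from the end of the list would require a different representation.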
   158  
   159  Finally, there are nodes that implement special logic or matching behavior.
   160  
   161  (Any) matches any value. The underscore (_) maps to this node, making the following two forms equivalent:
   162  
   163  	(Ident _)
   164  	(Ident (Any))
   165  
   166  (Builtin name) matches a built-in identifier or function by name.
   167  This is a type-aware variant of (Ident name).
   168  Instead of only comparing the name, it resolves the object behind the name and makes sure it's a pre-declared identifier.
   169  
   170  For example, in the following piece of code
   171  
   172  	func fn() {
   173  		println(true)
   174  		true := false
   175  		println(true)
   176  	}
   177  
   178  the pattern
   179  
   180  	(Builtin "true")
   181  
   182  will match exactly once, on the first use of 'true' in the function.
   183  Subsequent occurrences of 'true' no longer refer to the pre-declared identifier.
   184  
   185  (Object name) matches an identifier by name, but yields the
   186  types.Object it refers to.
   187  
   188  (Symbol name) matches ast.Idents and ast.SelectorExprs that refer to a symbol with a given fully qualified name.
   189  For example, "net/url.PathEscape" matches the PathEscape function in the net/url package,
   190  and "(net/url.EscapeError).Error" refers to the Error method on the net/url.EscapeError type,
   191  either on an instance of the type, or on the type itself.
   192  
   193  For example, the following patterns match the following lines of code:
   194  
   195  	(CallExpr (Symbol "fmt.Println") _) // pattern 1
   196  	(CallExpr (Symbol "(net/url.EscapeError).Error") _) // pattern 2
   197  
   198  	fmt.Println("hello, world") // matches pattern 1
   199  	var x url.EscapeError
   200  	x.Error() // matches pattern 2
   201  	(url.EscapeError).Error(x) // also matches pattern 2
   202  
   203  (Binding name node) creates or uses a binding.
   204  Bindings work like variable assignments: they allow later parts of a pattern to refer to nodes that have already been matched.
   205  As an example, bindings are necessary to match self-assignment of the form "x = x",
   206  since we need to express that the right-hand side is identical to the left-hand side.
   207  
   208  If a binding's node is not nil, the matcher will attempt to match a node according to the pattern.
   209  If a binding's node is nil, the binding will either recall an existing value, or match the Any node.
   210  It is an error to provide a non-nil node to a binding that has already been bound.
   211  
   212  Referring back to the earlier example, the following pattern will match self-assignment of idents:
   213  
   214  	(AssignStmt (Binding "lhs" (Ident _)) "=" (Binding "lhs" nil))
   215  
   216  Because bindings are a crucial component of pattern matching, there is special syntax for creating and recalling bindings.
   217  Lower-case names refer to bindings. If standing on its own, the name "foo" will be equivalent to (Binding "foo" nil).
   218  If a name is followed by an at-sign (@) then it will create a binding for the node that follows.
   219  Together, this allows us to rewrite the earlier example as follows:
   220  
   221  	(AssignStmt lhs@(Ident _) "=" lhs)
   222  
   223  (Or nodes...) is a variadic node that tries matching each node until one succeeds. For example, the following pattern matches all idents of name "foo" or "bar":
   224  
   225  	(Ident (Or "foo" "bar"))
   226  
   227  We could also have written
   228  
   229  	(Or (Ident "foo") (Ident "bar"))
   230  
   231  and achieved the same result. We can also mix different kinds of nodes:
   232  
   233  	(Or (Ident "foo") (CallExpr (Ident "bar") _))
   234  
   235  When using bindings inside of nodes used inside Or, all or none of the bindings will be bound.
   236  That is, partially matched nodes that ultimately failed to match will not produce any bindings observable outside of the matching attempt.
   237  We can thus write
   238  
   239  	(Or (Ident name) (CallExpr name _))
   240  
   241  and 'name' will either be a String if the first option matched, or an Ident or SelectorExpr if the second option matched.
   242  
   243  (Not node) negates a match.
   244  For example, (Not (Ident _)) will match all nodes that aren't identifiers.
   246  
   247  ChanDir(0)
   248  
   249  # Automatic unnesting of AST nodes
   250  
   251  The Go AST has several types of nodes that wrap other nodes.
   252  To simplify matching, we automatically unwrap some of these nodes.
   253  
   254  These nodes are ExprStmt (for using expressions in a statement context),
   255  ParenExpr (for parenthesized expressions),
   256  DeclStmt (for declarations in a statement context),
   257  and LabeledStmt (for labeled statements).
   258  
   259  Thus, the query
   260  
   261  	(FuncLit _ [(CallExpr _ _)])
   262  
   263  will match a function literal containing a single function call,
   264  even though in the actual Go AST, the CallExpr is nested inside an ExprStmt,
   265  as function bodies are made up of sequences of statements.
   266  
   267  On the flip-side, there is no way to specifically match these wrapper nodes.
   268  For example, there is no way of searching for unnecessary parentheses, like in the following piece of Go code:
   269  
   270  	((x)) += 2
   271  */
   272  package pattern