/*
Package pattern implements a simple language for pattern matching Go ASTs.

# Design decisions and trade-offs

The language is designed specifically for the task of filtering ASTs
to simplify the implementation of analyses in staticcheck.
It is also intended to be trivial to parse and execute.

To that end, we make certain decisions that make the language more
suited to its task, while making certain queries infeasible.

Furthermore, it is fully expected that the majority of analyses will still require ordinary Go code
to further process the filtered AST, to make use of type information and to enforce complex invariants.
It is not our goal to design a scripting language for writing entire checks in.

# The language

At its core, patterns are a representation of Go ASTs, allowing for the use of placeholders to enable pattern matching.
Their syntax is inspired by LISP and Haskell, but unlike LISP, the core unit of patterns isn't the list, but the node.
There is a fixed set of nodes, identified by name, and with the exception of the Or node, all nodes have a fixed number of arguments.
In addition to nodes, there are atoms, which represent basic units such as strings or the nil value.

Pattern matching is implemented via bindings, represented by the Binding node.
A Binding can match nodes and associate them with names, to later recall the nodes.
This allows for expressing "this node must be equal to that node" constraints.

To simplify writing and reading patterns, a small amount of additional syntax exists on top of nodes and atoms.
This additional syntax doesn't add any new features of its own, it simply provides shortcuts to creating nodes and atoms.

To show an example of a pattern, first consider this snippet of Go code:

	if x := fn(); x != nil {
		for _, v := range x {
			println(v, x)
		}
	}

The corresponding AST expressed as an idiomatic pattern would look as follows:

	(IfStmt
		(AssignStmt (Ident "x") ":=" (CallExpr (Ident "fn") []))
		(BinaryExpr (Ident "x") "!=" (Ident "nil"))
		(RangeStmt
			(Ident "_") (Ident "v") ":=" (Ident "x")
			(CallExpr (Ident "println") [(Ident "v") (Ident "x")]))
		nil)

Two things are worth noting about this representation.
First, the [el1 el2 ...] syntax is a short-hand for creating lists.
It is a short-hand for el1:el2:[], which itself is a short-hand for (List el1 (List el2 (List nil nil))).
Second, note the absence of a lot of lists in places that normally accept lists.
For example, assignment assigns a number of right-hands to a number of left-hands, yet our AssignStmt is lacking any form of list.
This is due to the fact that a single node can match a list of exactly one element.
Thus, the two following forms have identical matching behavior:

	(AssignStmt (Ident "x") ":=" (CallExpr (Ident "fn") []))
	(AssignStmt [(Ident "x")] ":=" [(CallExpr (Ident "fn") [])])

This section serves as an overview of the language's syntax.
More in-depth explanations of the matching behavior as well as an exhaustive list of node types follow in the coming sections.

# Pattern matching

# TODO write about pattern matching

  - inspired by haskell syntax, but much, much simpler and naive

# Node types

The language contains two kinds of nodes: those that map to nodes in the AST, and those that implement additional logic.

Nodes that map directly to AST nodes are named identically to the types in the go/ast package.
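
To see the go/ast type names that these pattern nodes mirror, it can help to dump the node types of the earlier Go snippet.
The following is a minimal standalone sketch that uses only the standard library; it is not part of this package,
and the names snippet and nodeTypeNames are made up for illustration.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// snippet wraps the example from the text in a function so that it parses
// as a complete file. The wrapper (package p, func f) is an assumption of
// this sketch.
const snippet = `package p

func f() {
	if x := fn(); x != nil {
		for _, v := range x {
			println(v, x)
		}
	}
}
`

// nodeTypeNames parses src and returns the concrete go/ast type of every
// node, in the depth-first order used by ast.Inspect.
func nodeTypeNames(src string) ([]string, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, "snippet.go", src, 0)
	if err != nil {
		return nil, err
	}
	var names []string
	ast.Inspect(file, func(n ast.Node) bool {
		if n != nil {
			names = append(names, fmt.Sprintf("%T", n))
		}
		return true
	})
	return names, nil
}

func main() {
	names, err := nodeTypeNames(snippet)
	if err != nil {
		panic(err)
	}
	for _, name := range names {
		fmt.Println(name)
	}
}
```

Besides types such as *ast.IfStmt and *ast.RangeStmt, the listing includes wrapper types such as *ast.ExprStmt, which the pattern shown earlier does not spell out.
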
What follows is an exhaustive list of these nodes:

	(ArrayType len elt)
	(AssignStmt lhs tok rhs)
	(BasicLit kind value)
	(BinaryExpr x op y)
	(BranchStmt tok label)
	(CallExpr fun args)
	(CaseClause list body)
	(ChanType dir value)
	(CommClause comm body)
	(CompositeLit type elts)
	(DeferStmt call)
	(Ellipsis elt)
	(EmptyStmt)
	(Field names type tag)
	(ForStmt init cond post body)
	(FuncDecl recv name type body)
	(FuncLit type body)
	(FuncType params results)
	(GenDecl specs)
	(GoStmt call)
	(Ident name)
	(IfStmt init cond body else)
	(ImportSpec name path)
	(IncDecStmt x tok)
	(IndexExpr x index)
	(InterfaceType methods)
	(KeyValueExpr key value)
	(MapType key value)
	(RangeStmt key value tok x body)
	(ReturnStmt results)
	(SelectStmt body)
	(SelectorExpr x sel)
	(SendStmt chan value)
	(SliceExpr x low high max)
	(StarExpr x)
	(StructType fields)
	(SwitchStmt init tag body)
	(TypeAssertExpr)
	(TypeSpec name type)
	(TypeSwitchStmt init assign body)
	(UnaryExpr op x)
	(ValueSpec names type values)

Additionally, there are the String, Token and nil atoms.
Strings are double-quoted string literals, as in (Ident "someName").
Tokens are also represented as double-quoted string literals, but are converted to token.Token values in contexts that require tokens,
such as in (BinaryExpr x "<" y), where "<" is transparently converted to token.LSS during matching.
The keyword 'nil' denotes the nil value, which represents the absence of any value.

We also define the (List head tail) node, which is used to represent sequences of elements as a singly linked list.
The head is a single element, and the tail is the remainder of the list.
For example,

	(List "foo" (List "bar" (List "baz" (List nil nil))))

represents a list of three elements, "foo", "bar" and "baz".
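
As an illustration only, the nested form can be generated mechanically from a flat sequence.
The sketch below uses a hypothetical helper, consList, which is not part of this package;
it merely renders a slice of strings in the (List head tail) notation, ending the list with (List nil nil).

```go
package main

import (
	"fmt"
	"strconv"
)

// consList renders elems in the (List head tail) notation described above.
// An empty slice becomes (List nil nil), the terminator of every list.
// It is a hypothetical helper for illustration, not part of package pattern.
func consList(elems []string) string {
	if len(elems) == 0 {
		return "(List nil nil)"
	}
	return "(List " + strconv.Quote(elems[0]) + " " + consList(elems[1:]) + ")"
}

func main() {
	fmt.Println(consList([]string{"foo", "bar", "baz"}))
	// prints (List "foo" (List "bar" (List "baz" (List nil nil))))
}
```
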
There is dedicated syntax for writing lists, which looks as follows:

	["foo" "bar" "baz"]

This syntax is itself syntactic sugar for the following form:

	"foo":"bar":"baz":[]

This form is of particular interest for pattern matching, as it allows matching on the head and tail. For example,

	"foo":"bar":_

would match any list with at least two elements, where the first two elements are "foo" and "bar". This is equivalent to writing

	(List "foo" (List "bar" _))

Note that it is not possible to match from the end of the list.
That is, there is no way to express a query such as "a list of any length where the last element is foo".

Note that unlike in LISP, nil and empty lists are distinct from one another.
In patterns, with respect to lists, nil is akin to Go's untyped nil.
It will match a nil ast.Node, but it will not match a nil []ast.Expr. Nil will, however, match pointers to named types such as *ast.Ident.
Similarly, lists are akin to Go's slices.
An empty list will match both a nil and an empty []ast.Expr, but it will not match a nil ast.Node.

Due to the difference between nil and empty lists, an empty list is represented as (List nil nil), i.e. a list with no head or tail.
Similarly, a list of one element is represented as (List el (List nil nil)). Unlike in LISP, it cannot be represented by (List el nil).

Finally, there are nodes that implement special logic or matching behavior.

(Any) matches any value. The underscore (_) maps to this node, making the following two forms equivalent:

	(Ident _)
	(Ident (Any))

(Builtin name) matches a built-in identifier or function by name.
This is a type-aware variant of (Ident name).
Instead of only comparing the name, it resolves the object behind the name and makes sure it's a pre-declared identifier.

For example, in the following piece of code

	func fn() {
		println(true)
		true := false
		println(true)
	}

the pattern

	(Builtin "true")

will match exactly once, on the first use of 'true' in the function.
Subsequent occurrences of 'true' no longer refer to the pre-declared identifier.

(Object name) matches an identifier by name, but yields the
types.Object it refers to.

(Symbol name) matches ast.Idents and ast.SelectorExprs that refer to a symbol with a given fully qualified name.
For example, "net/url.PathEscape" matches the PathEscape function in the net/url package,
and "(net/url.EscapeError).Error" refers to the Error method on the net/url.EscapeError type,
either on an instance of the type, or on the type itself.

For example, the following patterns match the following lines of code:

	(CallExpr (Symbol "fmt.Println") _) // pattern 1
	(CallExpr (Symbol "(net/url.EscapeError).Error") _) // pattern 2

	fmt.Println("hello, world") // matches pattern 1
	var x url.EscapeError
	x.Error() // matches pattern 2
	(url.EscapeError).Error(x) // also matches pattern 2

(Binding name node) creates or uses a binding.
Bindings work like variable assignments, allowing us to refer to already matched nodes.
As an example, bindings are necessary to match self-assignment of the form "x = x",
since we need to express that the right-hand side is identical to the left-hand side.

If a binding's node is not nil, the matcher will attempt to match a node according to the pattern.
If a binding's node is nil, the binding will either recall an existing value, or match the Any node.
It is an error to provide a non-nil node to a binding that has already been bound.

Referring back to the earlier example, the following pattern will match self-assignment of idents:

	(AssignStmt (Binding "lhs" (Ident _)) "=" (Binding "lhs" nil))

Because bindings are a crucial component of pattern matching, there is special syntax for creating and recalling bindings.
Lower-case names refer to bindings. If standing on its own, the name "foo" will be equivalent to (Binding "foo" nil).
If a name is followed by an at-sign (@) then it will create a binding for the node that follows.
Together, this allows us to rewrite the earlier example as follows:

	(AssignStmt lhs@(Ident _) "=" lhs)

(Or nodes...) is a variadic node that tries matching each node until one succeeds. For example, the following pattern matches all idents of name "foo" or "bar":

	(Ident (Or "foo" "bar"))

We could also have written

	(Or (Ident "foo") (Ident "bar"))

and achieved the same result. We can also mix different kinds of nodes:

	(Or (Ident "foo") (CallExpr (Ident "bar") _))

When using bindings inside of nodes used inside Or, all or none of the bindings will be bound.
That is, partially matched nodes that ultimately failed to match will not produce any bindings observable outside of the matching attempt.
We can thus write

	(Or (Ident name) (CallExpr name _))

and 'name' will either be a String if the first option matched, or an Ident or SelectorExpr if the second option matched.

(Not node)

The Not node negates a match. For example, (Not (Ident _)) will match all nodes that aren't identifiers.

ChanDir(0)

# Automatic unnesting of AST nodes

The Go AST has several types of nodes that wrap other nodes.
To simplify matching, we automatically unwrap some of these nodes.

These nodes are ExprStmt (for using expressions in a statement context),
ParenExpr (for parenthesized expressions),
DeclStmt (for declarations in a statement context),
and LabeledStmt (for labeled statements).

Thus, the query

	(FuncLit _ [(CallExpr _ _)])

will match a function literal containing a single function call,
even though in the actual Go AST, the CallExpr is nested inside an ExprStmt,
as function bodies are made up of sequences of statements.

On the flip-side, there is no way to specifically match these wrapper nodes.
For example, there is no way of searching for unnecessary parentheses, like in the following piece of Go code:

	((x)) += 2
*/
package pattern