Lexical Scanning in Go
GTUG Sydney
30 Aug 2011

Rob Pike
r@golang.org


* Video

A video of this talk was recorded at the Go Sydney Meetup.

.link https://www.youtube.com/watch?v=HxaD_trXwRE Watch the talk on YouTube


* Structural mismatch

Many programming problems realign one data structure to fit another structure.

- breaking text into lines
- "blocking" and "deblocking"
- packet assembly and disassembly
- parsing
- lexing

* Sometimes hard

The pieces on either side have independent state, lookahead, buffers, ...
Can be messy to do well.

Coroutines were invented to solve this problem!
They enable us to write the two pieces independently.

Let's look at this topic in the context of a lexer.


* A new template system

Wanted to replace the old Go template package.
It had many problems:

- inflexible
- inexpressive
- code too fragile

* A new template system

Key change was re-engineering with a true lexer, parser, and evaluator.
Has arbitrary text plus actions in `{{` `}}`.

.code lex/snippets /Evaluation/,/Control.structures/

* Today we focus on the lexer

Must tokenize:

- the stuff outside actions
- action delimiters
- identifiers
- numeric constants
- string constants
- and others

* Lex items

Two things identify each lexed item:

- its type
- its value; a string is all we need

.code lex/lex1.oldgo /item.represents/,/^}/

* Lex type

The type is just an integer constant.
We use `iota` to define the values.

.code lex/lex1.oldgo /itemType.identifies/,/type/
.code lex/lex1.oldgo /const/,/itemEOF/

* Lex type values (continued)

.code lex/lex1.oldgo /itemElse/,/^\)/

* Printing a lex item

`Printf` has a convention making it easy to print any type: just define a `String()` method:

.code lex/lex1.oldgo /func.*item.*String/,/^}/

* How to tokenize?

Many approaches available:

- use a tool such as lex or ragel
- use regular expressions
- use states, actions, and a switch statement

* Tools

Nothing wrong with using a tool but:

- hard to get good errors (can be very important)
- tend to require learning another language
- result can be large, even slow
- often a poor fit
- but lexing is easy to do yourself!

* Regular expressions

Blogged about this last week.

- overkill
- slow
- can explore the state space too much
- misuse of a dynamic engine to ask static questions

* Let's write our own

It's easy!

Plus, most programming languages lex pretty much the same tokens, so once we learn how, it's trivial to adapt the lexer for the next purpose.

- an argument both for and against tools

* State machine

Many people will tell you to write a switch statement,
something like this:

.code lex/snippets /One/,/^}/

* State machines are forgetful

Boring and repetitive and error-prone, but anyway:

Why switch?

After each action, you know where you want to be;
the new state is the result of the action.

But we throw the info away and recompute it from the state.

(A consequence of returning to the caller.)

A tool can compile that out, but so can we.

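The snippet referenced above lives in `lex/snippets` and is not reproduced in this copy. As a rough, hypothetical sketch of the switch-driven shape being discussed (the state names and the `scan` function are mine, not the talk's), it might look like this:

    // A hypothetical switch-based scanner for text with {{ }} actions.
    package main

    import (
        "fmt"
        "strings"
    )

    type state int

    const (
        inText   state = iota // scanning plain text
        inAction              // scanning inside {{ }}
        done
    )

    // scan re-dispatches on an explicit state variable every iteration,
    // even though each case already knows which state comes next.
    func scan(input string) {
        st, pos := inText, 0
        for st != done {
            switch st {
            case inText:
                if i := strings.Index(input[pos:], "{{"); i >= 0 {
                    fmt.Printf("text: %q\n", input[pos:pos+i])
                    pos += i + len("{{")
                    st = inAction // we know the next state right here...
                } else {
                    fmt.Printf("text: %q\n", input[pos:])
                    st = done
                }
            case inAction:
                if i := strings.Index(input[pos:], "}}"); i >= 0 {
                    fmt.Printf("action: %q\n", input[pos:pos+i])
                    pos += i + len("}}")
                    st = inText // ...but throw it away and re-switch anyway
                } else {
                    st = done // unterminated action; a real lexer would report an error
                }
            }
        }
    }

    func main() { scan("Some text {{.Name}} more text.") }

Each case knows exactly which state comes next when it finishes, yet the loop discards that knowledge and re-dispatches on `st` every time around; the state-function approach on the next slides keeps it instead.
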
* What is a state? An action?

State represents where we are and what we expect.

Action represents what we do.

Actions result in a new state.

* State function

Let's put them together: a state function.

Executes an action, returns the next state—as a state function.

Recursive definition but simple and clear.

.code lex/lex1.oldgo /stateFn/,/type/

* The run loop

Our state machine is trivial:
just run until the state goes to `nil`, representing "done".

.code lex/snippets /run.lexes/,/^}/

* The concurrent step

How do we make tokens available to the client?
Tokens can emerge at times that are inconvenient to stop to return to the caller.

Use concurrency:
Run the state machine as a goroutine,
emit values on a channel.

* The lexer type

Here is the `lexer` type. Notice the channel of items; ignore the rest for now.

.code lex/lex1.oldgo /lexer.holds/,/^}/

* Starting the lexer

A `lexer` initializes itself to lex a string and launches the state machine as a goroutine, returning the lexer itself and a channel of items.

The API will change; don't worry about it now.

.code lex/lex1.oldgo /func.lex/,/^}/

* The real run routine

Here's the real state machine run function, which runs as a goroutine.

.code lex/lex1.oldgo /run.lexes/,/^}/

* The token emitter

A token is a type and a value, but (yay Go) the value can just be sliced from the input string.
The `lexer` remembers where it is in the input and the emit routine just lobs that substring to the caller as the token's value.

.code lex/lex1.oldgo /input.*scanned/,/pos.*position/
.code lex/lex1.oldgo /emit.passes/,/^}/

* Starting the machine

As the `lexer` begins, it's looking for plain text, so the initial state is the function `lexText`.
It absorbs plain text until a "left meta" is encountered.

.code lex/lex1.oldgo /run.lexes/,/^}/
.code lex/lex1.oldgo /leftMeta/

* lexText

.code lex/lex1.oldgo /^func.lexText/,/^}/

* lexLeftMeta

A trivial state function.
When we get here, we know there's a `leftMeta` in the input.

.code lex/lex1.oldgo /^func.lexLeftMeta/,/^}/

* lexInsideAction

.code lex/lex1.oldgo /^func.lexInsideAction/,/itemPipe/

* More of lexInsideAction

This will give you the flavor.

.code lex/lex1.oldgo /case.*"/,/lexRawQuote/
.code lex/lex1.oldgo /case.*9/,/lexIdentifier/

* The next function

.code lex/lex1.oldgo /next.returns.the/,/^}/

* Some lexing helpers

.code lex/lex1.oldgo /ignore.skips/,/^}/
.code lex/lex1.oldgo /backup.steps/,/^}/

* The peek function

.code lex/lex1.oldgo /peek.returns.but/,/^}/

* The accept functions

.code lex/lex1.oldgo /accept.consumes/,/^}/
.code lex/lex1.oldgo /acceptRun.consumes/,/^}/

* Lexing a number, including floating point

.code lex/lex1.oldgo /^func.lexNumber/,/imaginary/

* Lexing a number, continued

This is more accepting than it should be, but not by much. The caller must call `Atof` to validate.

.code lex/lex1.oldgo /Is.it.imaginary/,/^}/

* Errors

Easy to handle: emit the bad token and shut down the machine.

.code lex/lex1.oldgo /error.returns/,/^}/

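The `.code` directives above pull the real implementation out of `lex/lex1.oldgo`, which this text-only copy does not include. As a rough recap of how the pieces fit together, here is a stripped-down, hypothetical sketch in the same spirit: the names mirror the talk's, but the token set is reduced and `lexAction` simply grabs everything through `}}` instead of tokenizing identifiers, numbers, and strings the way `lexInsideAction` does.

    // A stripped-down, runnable sketch; names mirror the talk's,
    // but this is not the code in lex/lex1.oldgo.
    package main

    import (
        "fmt"
        "strings"
    )

    type itemType int

    const (
        itemError  itemType = iota // error occurred; value is the message
        itemText                   // plain text outside actions
        itemAction                 // an entire {{...}} action, undissected
        itemEOF
    )

    type item struct {
        typ itemType
        val string
    }

    // stateFn represents the state of the scanner as a function
    // that executes an action and returns the next state.
    type stateFn func(*lexer) stateFn

    type lexer struct {
        input string    // the string being scanned
        start int       // start position of the current item
        pos   int       // current position in the input
        items chan item // channel of scanned items
    }

    // emit passes the pending input slice back to the caller as an item.
    func (l *lexer) emit(t itemType) {
        l.items <- item{t, l.input[l.start:l.pos]}
        l.start = l.pos
    }

    // errorf emits an error item and stops the machine by returning nil as the next state.
    func (l *lexer) errorf(format string, args ...interface{}) stateFn {
        l.items <- item{itemError, fmt.Sprintf(format, args...)}
        return nil
    }

    // lexText scans plain text until the next "{{" or end of input.
    func lexText(l *lexer) stateFn {
        if i := strings.Index(l.input[l.pos:], "{{"); i >= 0 {
            l.pos += i
            if l.pos > l.start {
                l.emit(itemText)
            }
            return lexAction
        }
        l.pos = len(l.input)
        if l.pos > l.start {
            l.emit(itemText)
        }
        l.emit(itemEOF)
        return nil
    }

    // lexAction is drastically simplified: it emits everything from "{{" through "}}"
    // as a single item instead of tokenizing the action's contents.
    func lexAction(l *lexer) stateFn {
        i := strings.Index(l.input[l.pos+len("{{"):], "}}")
        if i < 0 {
            return l.errorf("unclosed action")
        }
        l.pos += len("{{") + i + len("}}")
        l.emit(itemAction)
        return lexText
    }

    // run drives the machine: call the current state function until it returns nil.
    func (l *lexer) run() {
        for state := lexText; state != nil; {
            state = state(l)
        }
        close(l.items) // no more tokens will be delivered
    }

    // lex launches the state machine as a goroutine and returns the channel of items.
    func lex(input string) chan item {
        l := &lexer{input: input, items: make(chan item)}
        go l.run()
        return l.items
    }

    func main() {
        for it := range lex("Some text {{.Name}} more text.") {
            fmt.Printf("%d %q\n", it.typ, it.val)
        }
    }

The caller (here just `main`, standing in for the parser) receives items over the channel as the goroutine produces them, and the `range` loop ends when `run` closes `items`.
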
* Summary

Concurrency makes the lexer easy to design.

Goroutines allow lexer and caller (parser) each to run at its own rate, as clean sequential code.

Channels give us a clean way to emit tokens.

* A problem

Can't run a goroutine to completion during initialization.
Forbidden by the language specification.
(Raises awful issues about order of init, best avoided.)

That means we can't lex & parse a template during init.

The goroutine is a problem....

_(Note:_This_restriction_was_lifted_in_Go_version_1_but_the_discussion_is_still_interesting.)_

* Design vs. implementation

…but it's not necessary anyway.

The work is done by the design; now we just adjust the API.

We can change the API to hide the channel, provide a function to get the next token, and rewrite the run function.

It's easy.

* A new API

Hide the channel and buffer it slightly, turning it into a ring buffer.

.code lex/r59-lex.go /lex.creates.a.new/,/^}/

* A function for the next item

Traditional lexer API: return next item.
Includes the modified state machine runner.

.code lex/r59-lex.go /nextItem.returns/,/^}/

* That's it

We now have a traditional API for a lexer with a simple, concurrent implementation under the covers.

Even though the implementation is no longer truly concurrent, it still has all the advantages of concurrent design.

We wouldn't have such a clean, efficient design if we hadn't thought about the problem in a concurrent way, without worrying about "restart".

Model completely removes concerns about "structural mismatch".

* Concurrency is a design approach

Concurrency is not about parallelism.

(Although it can enable parallelism).

Concurrency is a way to design a program by decomposing it into independently executing pieces.

The result can be clean, efficient, and very adaptable.

* Conclusion

Lexers are fun.

Concurrency is fun.

Go is fun.

* For more information

Go: [[http://golang.org]]

New templates: [[http://golang.org/pkg/exp/template/]]

(Next release will move them out of experimental.)