
Lexical Scanning in Go
GTUG Sydney
30 Aug 2011

Rob Pike
r@golang.org


* Video

A video of this talk was recorded at the Go Sydney Meetup.

.link https://www.youtube.com/watch?v=HxaD_trXwRE Watch the talk on YouTube


* Structural mismatch

Many programming problems realign one data structure to fit another structure.

- breaking text into lines
- "blocking" and "deblocking"
- packet assembly and disassembly
- parsing
- lexing

* Sometimes hard

The pieces on either side have independent state, lookahead, buffers, ...
Can be messy to do well.

Coroutines were invented to solve this problem!
They enable us to write the two pieces independently.

Let's look at this topic in the context of a lexer.


* A new template system

Wanted to replace the old Go template package.
It had many problems:

- inflexible
- inexpressive
- code too fragile

* A new template system

Key change was re-engineering with a true lexer, parser, and evaluator.
Has arbitrary text plus actions in `{{` `}}`.

.code lex/snippets /Evaluation/,/Control.structures/

* Today we focus on the lexer

Must tokenize:

- the stuff outside actions
- action delimiters
- identifiers
- numeric constants
- string constants
- and others

* Lex items

Two things identify each lexed item:

- its type
- its value; a string is all we need

.code lex/lex1.oldgo /item.represents/,/^}/

* Lex type

The type is just an integer constant.
We use `iota` to define the values.

.code lex/lex1.oldgo /itemType.identifies/,/type/
.code lex/lex1.oldgo /const/,/itemEOF/

* Lex type values (continued)

.code lex/lex1.oldgo /itemElse/,/^\)/

* Printing a lex item

`Printf` has a convention making it easy to print any type: just define a `String()` method:

.code lex/lex1.oldgo /func.*item.*String/,/^}/
    90  
* How to tokenize?

Many approaches available:

- use a tool such as lex or ragel
- use regular expressions
- use states, actions, and a switch statement

* Tools

Nothing wrong with using a tool but:

- hard to get good errors (can be very important)
- tend to require learning another language
- result can be large, even slow
- often a poor fit
- but lexing is easy to do yourself!

* Regular expressions

Blogged about this last week.

- overkill
- slow
- can explore the state space too much
- misuse of a dynamic engine to ask static questions
   117  
* Let's write our own

It's easy!

Plus, most programming languages lex pretty much the same tokens, so once we learn how, it's trivial to adapt the lexer for the next purpose.

- an argument both for and against tools

* State machine

Many people will tell you to write a switch statement,
something like this:

.code lex/snippets /One/,/^}/
   132  
* State machines are forgetful

Boring and repetitive and error-prone, but anyway:

Why switch?

After each action, you know where you want to be;
the new state is the result of the action.

But we throw the info away and recompute it from the state.

(A consequence of returning to the caller.)

A tool can compile that out, but so can we.

* What is a state? An action?

State represents where we are and what we expect.

Action represents what we do.

Actions result in a new state.

* State function

Let's put them together: a state function.

Executes an action, returns the next state—as a state function.

Recursive definition but simple and clear.

.code lex/lex1.oldgo /stateFn/,/type/

* The run loop

Our state machine is trivial:
just run until the state goes to `nil`, representing "done".

.code lex/snippets /run.lexes/,/^}/
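The snippet files aren't shown here; a self-contained sketch of the `stateFn` type and the trivial run loop the slides describe, with a toy `step` state added so it runs (the toy state is an assumption, not from the talk):

```go
package main

import "fmt"

// lexer is a minimal stand-in for the real lexer type shown later.
type lexer struct {
	input string
	pos   int
}

// stateFn represents the state of the scanner as a function
// that executes an action and returns the next state.
type stateFn func(*lexer) stateFn

// run drives the state machine: keep calling the current state
// until it returns nil, which represents "done".
func run(l *lexer, start stateFn) {
	for state := start; state != nil; {
		state = state(l)
	}
}

// step is a toy state: consume one byte per step, stop at end of input.
func step(l *lexer) stateFn {
	if l.pos >= len(l.input) {
		return nil // done
	}
	l.pos++
	return step // the action itself determines the next state
}

func main() {
	l := &lexer{input: "abc"}
	run(l, step)
	fmt.Println(l.pos) // 3
}
```

Note how the next state is simply the return value of the action: nothing is thrown away and recomputed, which is the whole point of the pattern.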
   172  
* The concurrent step

How do we make tokens available to the client?
Tokens can emerge at times when it is inconvenient to stop and return to the caller.

Use concurrency:
Run the state machine as a goroutine,
emit values on a channel.
   181  
* The lexer type

Here is the `lexer` type. Notice the channel of items; ignore the rest for now.

.code lex/lex1.oldgo /lexer.holds/,/^}/

* Starting the lexer

A `lexer` initializes itself to lex a string and launches the state machine as a goroutine, returning the lexer itself and a channel of items.

The API will change, don't worry about it now.

.code lex/lex1.oldgo /func.lex/,/^}/

* The real run routine

Here's the real state machine run function, which runs as a goroutine.

.code lex/lex1.oldgo /run.lexes/,/^}/
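Since `lex1.oldgo` isn't reproduced here, a sketch of the `lexer` type, the `lex` constructor, and the goroutine-based `run` method as the slides describe them, in Go 1 style. The `lexText` stand-in state is an assumption added so the sketch is runnable; the real initial state appears later in the talk:

```go
package main

import "fmt"

type itemType int

const (
	itemText itemType = iota
	itemEOF
)

type item struct {
	typ itemType
	val string
}

type stateFn func(*lexer) stateFn

// lexer holds the state of the scanner.
type lexer struct {
	name  string    // used only for error reports
	input string    // the string being scanned
	start int       // start position of the current item
	pos   int       // current position in the input
	items chan item // channel of scanned items
}

// lex initializes itself to lex a string and launches the state
// machine as a goroutine, returning the lexer and a channel of items.
func lex(name, input string) (*lexer, chan item) {
	l := &lexer{
		name:  name,
		input: input,
		items: make(chan item),
	}
	go l.run()
	return l, l.items
}

// run lexes the input by executing state functions until the state
// is nil, then closes the channel so the client's range loop ends.
func (l *lexer) run() {
	for state := lexText; state != nil; {
		state = state(l)
	}
	close(l.items)
}

// lexText is a stand-in initial state for this sketch: it emits the
// whole input as one text item, then EOF, then stops the machine.
func lexText(l *lexer) stateFn {
	l.pos = len(l.input)
	l.items <- item{itemText, l.input[l.start:l.pos]}
	l.items <- item{itemEOF, ""}
	return nil
}

func main() {
	_, items := lex("demo", "hello")
	for it := range items {
		fmt.Println(it.typ, it.val)
	}
}
```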
   201  
* The token emitter

A token is a type and a value, but (yay Go) the value can just be sliced from the input string.
The `lexer` remembers where it is in the input and the emit routine just lobs that substring to the caller as the token's value.

.code lex/lex1.oldgo /input.*scanned/,/pos.*position/
.code lex/lex1.oldgo /emit.passes/,/^}/
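A sketch of the emit routine the slide describes: the token's value is simply the slice of the input between `start` and `pos`, so no copying is needed. The buffered channel in `main` is only for this standalone demo; in the talk the channel is unbuffered and read by the client:

```go
package main

import "fmt"

type itemType int

const itemText itemType = iota

type item struct {
	typ itemType
	val string
}

// The lexer tracks the input and two indexes into it:
// start marks the beginning of the current item,
// pos marks how far we have scanned.
type lexer struct {
	input string
	start int
	pos   int
	items chan item
}

// emit passes an item back to the client: the value is the
// substring input[start:pos], and start advances past it.
func (l *lexer) emit(t itemType) {
	l.items <- item{t, l.input[l.start:l.pos]}
	l.start = l.pos
}

func main() {
	l := &lexer{input: "hello, world", items: make(chan item, 1)}
	l.pos = 5 // pretend the state machine has scanned "hello"
	l.emit(itemText)
	fmt.Println(<-l.items) // {0 hello}
}
```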
   209  
* Starting the machine

As the `lexer` begins it's looking for plain text, so the initial state is the function `lexText`.
It absorbs plain text until a "left meta" is encountered.

.code lex/lex1.oldgo /run.lexes/,/^}/
.code lex/lex1.oldgo /leftMeta/

* lexText

.code lex/lex1.oldgo /^func.lexText/,/^}/

* lexLeftMeta

A trivial state function.
When we get here, we know there's a `leftMeta` in the input.

.code lex/lex1.oldgo /^func.lexLeftMeta/,/^}/
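Since the included file isn't shown, a sketch of `lexText` and `lexLeftMeta` in Go 1 style, with just enough of the lexer around them to run. Ending `lexLeftMeta` with `nil` instead of `lexInsideAction` is a simplification for this sketch:

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

const leftMeta = "{{"
const eof = -1

type itemType int

const (
	itemText itemType = iota
	itemLeftMeta
	itemEOF
)

type item struct {
	typ itemType
	val string
}

type stateFn func(*lexer) stateFn

type lexer struct {
	input string
	start int
	pos   int
	width int
	items chan item
}

// next returns the next rune in the input, or eof.
func (l *lexer) next() rune {
	if l.pos >= len(l.input) {
		l.width = 0
		return eof
	}
	r, w := utf8.DecodeRuneInString(l.input[l.pos:])
	l.width = w
	l.pos += w
	return r
}

func (l *lexer) emit(t itemType) {
	l.items <- item{t, l.input[l.start:l.pos]}
	l.start = l.pos
}

// lexText absorbs plain text until a left metacharacter appears,
// emitting any accumulated text before handing off.
func lexText(l *lexer) stateFn {
	for {
		if strings.HasPrefix(l.input[l.pos:], leftMeta) {
			if l.pos > l.start {
				l.emit(itemText)
			}
			return lexLeftMeta // next state
		}
		if l.next() == eof {
			break
		}
	}
	// Reached EOF: flush any remaining text, then stop.
	if l.pos > l.start {
		l.emit(itemText)
	}
	l.emit(itemEOF)
	return nil
}

// lexLeftMeta is trivial: the delimiter is known to be present.
func lexLeftMeta(l *lexer) stateFn {
	l.pos += len(leftMeta)
	l.emit(itemLeftMeta)
	return nil // would be lexInsideAction in the real lexer
}

func main() {
	l := &lexer{input: "abc{{", items: make(chan item, 3)}
	for state := stateFn(lexText); state != nil; {
		state = state(l)
	}
	close(l.items)
	for it := range l.items {
		fmt.Println(it.typ, it.val) // 0 abc, then 1 {{
	}
}
```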
   228  
* lexInsideAction

.code lex/lex1.oldgo /^func.lexInsideAction/,/itemPipe/

* More of lexInsideAction

This will give you the flavor.

.code lex/lex1.oldgo /case.*"/,/lexRawQuote/
.code lex/lex1.oldgo /case.*9/,/lexIdentifier/

* The next function

.code lex/lex1.oldgo /next.returns.the/,/^}/

* Some lexing helpers

.code lex/lex1.oldgo /ignore.skips/,/^}/
.code lex/lex1.oldgo /backup.steps/,/^}/

* The peek function

.code lex/lex1.oldgo /peek.returns.but/,/^}/

* The accept functions

.code lex/lex1.oldgo /accept.consumes/,/^}/
.code lex/lex1.oldgo /acceptRun.consumes/,/^}/
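The helpers above are included from `lex1.oldgo`; a sketch of `next`, `ignore`, `backup`, `peek`, `accept`, and `acceptRun` in Go 1 style (the original pre-Go-1 code used `int` for runes and `strings.IndexRune`; `strings.ContainsRune` is a modern equivalent):

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

const eof = -1

type lexer struct {
	input string
	start int
	pos   int
	width int // width of the last rune read, so we can back up
}

// next returns the next rune in the input, or eof.
func (l *lexer) next() rune {
	if l.pos >= len(l.input) {
		l.width = 0
		return eof
	}
	r, w := utf8.DecodeRuneInString(l.input[l.pos:])
	l.width = w
	l.pos += w
	return r
}

// ignore skips over the pending input before this point.
func (l *lexer) ignore() {
	l.start = l.pos
}

// backup steps back one rune. Can be called only once per call of next.
func (l *lexer) backup() {
	l.pos -= l.width
}

// peek returns but does not consume the next rune in the input.
func (l *lexer) peek() rune {
	r := l.next()
	l.backup()
	return r
}

// accept consumes the next rune if it's from the valid set.
func (l *lexer) accept(valid string) bool {
	if strings.ContainsRune(valid, l.next()) {
		return true
	}
	l.backup()
	return false
}

// acceptRun consumes a run of runes from the valid set.
func (l *lexer) acceptRun(valid string) {
	for strings.ContainsRune(valid, l.next()) {
	}
	l.backup()
}

func main() {
	l := &lexer{input: "+1234x"}
	l.accept("+-")                      // optional sign
	l.acceptRun("0123456789")           // digits
	fmt.Println(l.input[l.start:l.pos]) // +1234
	fmt.Println(string(l.peek()))       // x
}
```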
   257  
* Lexing a number, including floating point

.code lex/lex1.oldgo /^func.lexNumber/,/imaginary/

* Lexing a number, continued

This is more accepting than it should be, but not by much. Caller must call `Atof` to validate.

.code lex/lex1.oldgo /Is.it.imaginary/,/^}/
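A sketch of the number-scanning logic the slides describe: optional sign, optional hex prefix, digits, fraction, exponent, imaginary suffix. It is recast here as a plain method (an assumption; the real `lexNumber` is a state function that emits an item and returns the next state), so the sketch stands alone:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode/utf8"
)

const eof = -1

type lexer struct {
	input string
	start int
	pos   int
	width int
}

func (l *lexer) next() rune {
	if l.pos >= len(l.input) {
		l.width = 0
		return eof
	}
	r, w := utf8.DecodeRuneInString(l.input[l.pos:])
	l.width = w
	l.pos += w
	return r
}

func (l *lexer) backup() { l.pos -= l.width }

func (l *lexer) accept(valid string) bool {
	if strings.ContainsRune(valid, l.next()) {
		return true
	}
	l.backup()
	return false
}

func (l *lexer) acceptRun(valid string) {
	for strings.ContainsRune(valid, l.next()) {
	}
	l.backup()
}

// scanNumber follows the slide's lexNumber logic.
func (l *lexer) scanNumber() string {
	l.accept("+-") // optional leading sign
	digits := "0123456789"
	if l.accept("0") && l.accept("xX") { // is it hex?
		digits = "0123456789abcdefABCDEF"
	}
	l.acceptRun(digits)
	if l.accept(".") {
		l.acceptRun(digits)
	}
	if l.accept("eE") {
		l.accept("+-")
		l.acceptRun("0123456789")
	}
	l.accept("i") // is it imaginary?
	return l.input[l.start:l.pos]
}

func main() {
	l := &lexer{input: "-1.25e2,"}
	s := l.scanNumber()
	fmt.Println(s) // -1.25e2
	// More accepting than it should be: the caller validates.
	f, err := strconv.ParseFloat(s, 64)
	fmt.Println(f, err) // -125 <nil>
}
```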
   267  
* Errors

Easy to handle: emit the bad token and shut down the machine.

.code lex/lex1.oldgo /error.returns/,/^}/
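A sketch of the error helper the slide describes: it emits an error item whose value is the formatted message, then returns `nil` so the run loop terminates:

```go
package main

import "fmt"

type itemType int

const itemError itemType = iota

type item struct {
	typ itemType
	val string
}

type stateFn func(*lexer) stateFn

type lexer struct {
	input string
	items chan item
}

// errorf emits an error token and terminates the scan by
// returning nil as the next state, which stops the run loop.
func (l *lexer) errorf(format string, args ...interface{}) stateFn {
	l.items <- item{itemError, fmt.Sprintf(format, args...)}
	return nil
}

func main() {
	l := &lexer{items: make(chan item, 1)}
	next := l.errorf("bad number syntax: %q", "1.2.3")
	fmt.Println(next == nil)     // true: the machine shuts down
	fmt.Println((<-l.items).val) // bad number syntax: "1.2.3"
}
```

A state function can thus report an error with `return l.errorf(...)`, which both tells the client what went wrong and halts the machine in one step.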
   273  
* Summary

Concurrency makes the lexer easy to design.

Goroutines allow lexer and caller (parser) each to run at its own rate, as clean sequential code.

Channels give us a clean way to emit tokens.

* A problem

Can't run a goroutine to completion during initialization.
Forbidden by the language specification.
(Raises awful issues about order of init, best avoided.)

That means we can't lex & parse a template during init.

The goroutine is a problem....

_(Note:_This_restriction_was_lifted_in_Go_version_1_but_the_discussion_is_still_interesting.)_

* Design vs. implementation

…but it's not necessary anyway.

The work is done by the design; now we just adjust the API.

We can change the API to hide the channel, provide a function to get the next token, and rewrite the run function.

It's easy.

* A new API

Hide the channel and buffer it slightly, turning it into a ring buffer.

.code lex/r59-lex.go /lex.creates.a.new/,/^}/

* A function for the next item

Traditional lexer API: return next item.
Includes the modified state machine runner.

.code lex/r59-lex.go /nextItem.returns/,/^}/
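The `r59-lex.go` file isn't shown here; a sketch of the pattern the slides describe: the channel is hidden and slightly buffered, the state is saved on the lexer between calls, and `nextItem` pumps the machine synchronously until an item is available. The `lexText`/`lexEOF` stand-in states are assumptions added so the sketch runs:

```go
package main

import "fmt"

type itemType int

const (
	itemText itemType = iota
	itemEOF
)

type item struct {
	typ itemType
	val string
}

type stateFn func(*lexer) stateFn

type lexer struct {
	input string
	start int
	pos   int
	state stateFn  // current state, saved between calls to nextItem
	items chan item // small buffer holds pending items
}

// lex sets the initial state but launches no goroutine; the
// channel's buffer replaces the concurrency.
func lex(input string) *lexer {
	return &lexer{
		input: input,
		state: lexText,
		items: make(chan item, 2), // two items sufficient here
	}
}

// nextItem returns the next item from the input: if one is
// buffered, take it; otherwise run the state machine one step.
func (l *lexer) nextItem() item {
	for {
		select {
		case it := <-l.items:
			return it
		default:
			l.state = l.state(l)
		}
	}
}

// lexText is a stand-in state: emit the whole input as text.
func lexText(l *lexer) stateFn {
	l.pos = len(l.input)
	l.items <- item{itemText, l.input[l.start:l.pos]}
	l.start = l.pos
	return lexEOF
}

// lexEOF keeps emitting EOF items if asked again.
func lexEOF(l *lexer) stateFn {
	l.items <- item{itemEOF, ""}
	return lexEOF
}

func main() {
	l := lex("hello")
	fmt.Println(l.nextItem().val)            // hello
	fmt.Println(l.nextItem().typ == itemEOF) // true
}
```

The caller now sees an ordinary blocking call, yet the state-function design is untouched: the machine just runs on the caller's goroutine instead of its own.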
   316  
* That's it

We now have a traditional API for a lexer with a simple, concurrent implementation under the covers.

Even though the implementation is no longer truly concurrent, it still has all the advantages of concurrent design.

We wouldn't have such a clean, efficient design if we hadn't thought about the problem in a concurrent way, without worrying about "restart".

Model completely removes concerns about "structural mismatch".

* Concurrency is a design approach

Concurrency is not about parallelism.

(Although it can enable parallelism).

Concurrency is a way to design a program by decomposing it into independently executing pieces.

The result can be clean, efficient, and very adaptable.

* Conclusion

Lexers are fun.

Concurrency is fun.

Go is fun.

* For more information

Go: [[http://golang.org]]

New templates: http://golang.org/pkg/exp/template/

(Next release will move them out of experimental.)