= Nex =

Nex is a lexer similar to Lex/Flex that:

- generates Go code instead of C code
- integrates with Go's yacc instead of YACC/Bison
- supports UTF-8
- supports nested _structural regular expressions_.

See http://doc.cat-v.org/bell_labs/structural_regexps/se.pdf[Structural
Regular Expressions] by Rob Pike. I wrote this code to get acquainted with Go
and also to explore some of the ideas in the paper. Also, I've always been
meaning to implement algorithms I learned from a compilers course I took many
years ago. Back then, we never coded them; merely understanding the theory was
enough to pass the exam.

Go has a less general http://golang.org/pkg/scanner/[scanner package],
but it is especially suited for tokenizing Go code.

== Installation ==

  $ export GOPATH=/tmp/go
  $ go get github.com/blynn/nex

== Example ==

http://flex.sourceforge.net/manual/Simple-Examples.html[One simple example in
the Flex manual] is a scanner that counts characters and lines. The program is
similar in Nex:

------------------------------------------
/\n/{ nLines++; nChars++ }
/./{ nChars++ }
//
package main
import ("fmt";"os")
func main() {
  var nLines, nChars int
  NN_FUN(NewLexer(os.Stdin))
  fmt.Printf("%d %d\n", nLines, nChars)
}
------------------------------------------

The syntax resembles Awk more than Flex: each regex must be delimited. An empty
regex terminates the rules section and signifies the presence of user code,
which is printed on standard output with `NN_FUN` replaced by the generated
scanner.

Name the above example `lc.nex`. Then compile and run it by typing:

 $ nex -r -s lc.nex

The program runs on standard input and output. For example:

 $ nex -r -s lc.nex < /usr/share/dict/words
 99171 938587

To generate Go code for a scanner without compiling and running it, type:

 $ nex -s < lc.nex  # Prints code on standard output.

or:

 $ nex -s lc.nex  # Writes code to lc.nn.go

The `NN_FUN` macro is primitive, but I was unable to think of another way to
achieve an Awk-esque feel. Purists unable to tolerate text substitution will
need more code:

------------------------------------------
/\n/{ lval.l++; lval.c++ }
/./{ lval.c++ }
//
package main
import ("fmt";"os")
type yySymType struct { l, c int }
func main() {
  v := new(yySymType)
  NewLexer(os.Stdin).Lex(v)
  fmt.Printf("%d %d\n", v.l, v.c)
}
------------------------------------------

and must run `nex` without the `-s` option:

 $ nex lc.nex

We could avoid defining a struct by using globals instead, but even then we
need a throwaway definition of `yySymType`.

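The globals variant might look like this (a sketch, not from the original manual; the empty `yySymType` exists only to satisfy `Lex`, and the file is compiled with `nex lc.nex` as above):

------------------------------------------
/\n/{ nLines++; nChars++ }
/./{ nChars++ }
//
package main
import ("fmt";"os")
type yySymType struct{}
var nLines, nChars int
func main() {
  NewLexer(os.Stdin).Lex(new(yySymType))
  fmt.Printf("%d %d\n", nLines, nChars)
}
------------------------------------------
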
The `yy` prefix can be changed with the `-p` option. When using yacc, it must
be given the same prefix:

 $ nex -p YY lc.nex && go tool yacc -p YY && go run lc.nn.go y.go

== Toy Pascal ==

The Flex manual also exhibits a http://flex.sourceforge.net/manual/Simple-Examples.html[scanner for a toy Pascal-like language],
though last I checked, its comment regex was a little buggy. Here is a
modified Nex version, without string-to-number conversions:

------------------------------------------
/[0-9]+/          { println("An integer:", txt()) }
/[0-9]+\.[0-9]*/  { println("A float:", txt()) }
/if|then|begin|end|procedure|function/
                  { println("A keyword:", txt()) }
/[a-z][a-z0-9]*/  { println("An identifier:", txt()) }
/\+|-|\*|\//      { println("An operator:", txt()) }
/[ \t\n]+/        { /* eat up whitespace */ }
/./               { println("Unrecognized character:", txt()) }
/{[^\{\}\n]*}/    { /* eat up one-line comments */ }
//
package main
import "os"
func main() {
  lex := NewLexer(os.Stdin)
  txt := func() string { return lex.Text() }
  NN_FUN(lex)
}
------------------------------------------

Enough simple examples! Let us see what nesting can do.

== Peter into silicon ==

In ``Structural Regular Expressions'', Pike imagines a newline-agnostic Awk
that operates on matched text, rather than on the whole line containing a
match, and writes code converting an input array of characters into
descriptions of rectangles. For example, given an input such as:

------------------------------------------
    #######
   #########
  ####  #####
 ####    ####   #
 ####      #####
####        ###
########   #####
#### #########
#### #  # ####
## #  ###   ##
###    #  ###
###    ##
 ##   #
  #   ####
  # #
##   #   ##
------------------------------------------

we wish to produce something like:

------------------------------------------
rect 5 12 1 2
rect 4 13 2 3
rect 3 7 3 4
rect 9 14 3 4
...
rect 10 12 16 17
------------------------------------------

With Nex, we don't have to imagine: such programs are real. Below are practical
Nex programs that strongly resemble their theoretical counterparts.
The one-character-at-a-time variant:

------------------------------------------
/ /{ x++ }
/#/{ println("rect", x, x+1, y, y+1); x++ }
/\n/{ x=1; y++ }
//
package main
import "os"
func main() {
  x, y := 1, 1
  NN_FUN(NewLexer(os.Stdin))
}
------------------------------------------

The one-run-at-a-time variant:

------------------------------------------
/ +/{ x+=len(txt()) }
/#+/{ println("rect", x, x+len(txt()), y, y+1); x+=len(txt()) }
/\n/{ x=1; y++ }
//
package main
import "os"
func main() {
  x, y := 1, 1
  lex := NewLexer(os.Stdin)
  txt := func() string { return lex.Text() }
  NN_FUN(lex)
}
------------------------------------------

The programs are more verbose than Awk because Go is the backend.

== Rob but not robot ==

Pike demonstrates how nesting structural expressions leads to a few simple text
editor commands to print all lines containing "rob" but not "robot". Though Nex
fails to separate looping from matching, a corresponding program is bearable:

------------------------------------------
/[^\n]*\n/ < { isrobot = false; isrob = false }
  /robot/    { isrobot = true }
  /rob/      { isrob = true }
>            { if isrob && !isrobot { fmt.Print(lex.Text()) } }
//
package main
import ("fmt";"os")
func main() {
  var isrobot, isrob bool
  lex := NewLexer(os.Stdin)
  NN_FUN(lex)
}
------------------------------------------

The "<" and ">" delimit nested expressions, and work as follows.
On reading a line, we find it matches the first regex, so we execute the code
immediately following the opening "<".

Then it's as if we run Nex again, except we focus only on the patterns and
actions up to the closing ">", with the matched line as the entire input. Thus
we look for occurrences of "rob" and "robot" in just the matched line and set
flags accordingly.

After the line ends, we execute the code following the closing ">" and return
to our original state, scanning for more lines.

== Word count ==

We can simultaneously count lines, words, and characters with Nex thanks to
nesting:

------------------------------------------
/[^\n]*\n/ < {}
  /[^ \t\r\n]*/ < {}
    /./  { nChars++ }
  >      { nWords++ }
  /./    { nChars++ }
>        { nLines++ }
//
package main
import ("fmt";"os")
func main() {
  var nLines, nWords, nChars int
  NN_FUN(NewLexer(os.Stdin))
  fmt.Printf("%d %d %d\n", nLines, nWords, nChars)
}
------------------------------------------

The first regex matches entire lines: each line is passed to the first level
of nested regexes. Within this level, the first regex matches words in the
line: each word is passed to the second level of nested regexes. Within
the second level, a regex causes every character of the word to be counted.

Lastly, we also count whitespace characters, a task performed by the second
regex of the first level of nested regexes. We could remove this rule
to count only non-whitespace characters.

== UTF-8 ==

The following Nex program converts Eastern Arabic numerals to the digits used
in the Western world, and also Chinese phrases for numbers (the analog of
something like "one-hundred and fifty-three") into digits.

------------------------------------------
/[零一二三四五六七八九十百千]+/ { fmt.Print(zhToInt(txt())) }
/[٠-٩]/ {
  // The above character class might show up right-to-left in a browser.
  // The equivalent of 0 should be on the left, and the equivalent of 9 should
  // be on the right.
  //
  // The Eastern Arabic numerals are ٠١٢٣٤٥٦٧٨٩.
  fmt.Print([]rune(txt())[0] - rune('٠'))
}
/./ { fmt.Print(txt()) }
//
package main
import ("fmt";"os")
func zhToInt(s string) int {
  n := 0
  prev := 0
  f := func(m int) {
    if prev == 0 { prev = 1 }
    n += m * prev
    prev = 0
  }
loop:
  for _, c := range s {
    for m, v := range []rune("一二三四五六七八九") {
      if v == c {
        prev = m + 1
        continue loop
      }
    }
    switch c {
    case '零':
    case '十': f(10)
    case '百': f(100)
    case '千': f(1000)
    }
  }
  n += prev
  return n
}
func main() {
  lex := NewLexer(os.Stdin)
  txt := func() string { return lex.Text() }
  NN_FUN(lex)
}
------------------------------------------

== nex and Go's yacc ==

The parser generated by `go tool yacc` exports so little that it's easiest to
keep the lexer and the parser in the same package.

Here's a yacc file based on the
http://dinosaur.compilertools.net/bison/bison_5.html[reverse-Polish-notation
calculator example from the Bison manual]:

------------------------------------------
%{
package main
import "fmt"
%}

%union {
  n int
}

%token NUM
%%
input:    /* empty */
       | input line
;

line:     '\n'
       | exp '\n'      { fmt.Println($1.n); }
;

exp:     NUM           { $$.n = $1.n;        }
       | exp exp '+'   { $$.n = $1.n + $2.n; }
       | exp exp '-'   { $$.n = $1.n - $2.n; }
       | exp exp '*'   { $$.n = $1.n * $2.n; }
       | exp exp '/'   { $$.n = $1.n / $2.n; }
       /* Unary minus    */
       | exp 'n'       { $$.n = -$1.n;       }
;
%%
------------------------------------------

We must import `fmt` even if we don't use it, since code generated by yacc
needs it. Also, the `%union` is mandatory; it generates `yySymType`.

Call the above `rp.y`. Then a suitable lexer, say `rp.nex`, might be:

------------------------------------------
/[ \t]/  { /* Skip blanks and tabs. */ }
/[0-9]+/ { lval.n,_ = strconv.Atoi(yylex.Text()); return NUM }
/./ { return int(yylex.Text()[0]) }
//
package main
import ("os";"strconv")
func main() {
  yyParse(NewLexer(os.Stdin))
}
------------------------------------------

Compile the two with:

 $ nex rp.nex && go tool yacc rp.y && go build y.go rp.nn.go

For brevity, we work in the `main` package. In a larger project we might want
to write a package that exports a function wrapped around `yyParse()`. This is
fine, provided the parser and the lexer are both in the same package.

Alternatively, we could use yacc's `-p` option to change the prefix from `yy`
to one that begins with an uppercase letter.

== Matching the beginning and end of input ==

We can simulate awk's BEGIN and END blocks with a regex that matches the entire
input:

------------------------------------------
/.*/ < { println("BEGIN") }
  /a/  { println("a") }
>      { println("END") }
//
package main
import "os"
func main() {
  NN_FUN(NewLexer(os.Stdin))
}
------------------------------------------

However, this causes Nex to read the entire input into memory. To solve
this problem, Nex supports the following syntax:

------------------------------------------
<      { println("BEGIN") }
  /a/  { println("a") }
>      { println("END") }
package main
import "os"
func main() {
  NN_FUN(NewLexer(os.Stdin))
}
------------------------------------------

In other words, if a bare '<' appears as the first pattern, then its action is
executed before reading the input. The last pattern must be a bare '>', and its
action is executed on end of input.

Additionally, no empty regex is needed to mark the beginning of the Go program.
(Fortunately, an empty regex is also a Go comment, so there's no harm done if
present.)

== Matching Nuances ==

Among rules in the same scope, the longest matching pattern takes precedence.
In the event of a tie, the first pattern wins.

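For instance (an illustrative sketch, not from the manual), with the rules below the input `iffy` triggers the identifier action, since `[a-z]+` matches four characters while `if` matches only two; on the exact input `if`, both rules match two characters and the earlier one wins:

------------------------------------------
/if/      { println("keyword") }
/[a-z]+/  { println("identifier") }
//
package main
import "os"
func main() { NN_FUN(NewLexer(os.Stdin)) }
------------------------------------------
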
Unanchored patterns never match the empty string. For example,

  /(foo)*/ {}

matches "foo" and "foofoo", but not "".

Anchored patterns can match the empty string at most once; after the match, the
null strings at the start or end are "used up" and will not match again.

Internally, this is implemented by omitting the very first check of whether the
current state is accepting when running the DFA corresponding to the regex. An
alternative would be to simply ignore matches of length 0, but I chose to allow
anchored empty matches in case there turn out to be applications for them.
I'm open to changing this behaviour.

== Contributing and Testing ==

Check out this repo (or a clone) into a directory with the following structure:

  mkdir -p nex/src
  cd nex/src
  git clone https://github.com/blynn/nex.git

The Makefile will put the binary into, e.g., `nex/bin`.

== Reference ==

  func NewLexer(in io.Reader) *Lexer

  // NewLexerWithInit creates a new Lexer object, runs the given callback on it,
  // then returns it.
  func NewLexerWithInit(in io.Reader, initFun func(*Lexer)) *Lexer

  // Lex runs the lexer. Always returns 0.
  // When the -s option is given, this function is not generated;
  // instead, the NN_FUN macro runs the lexer.
  func (yylex *Lexer) Lex(lval *yySymType) int

  // Text returns the matched text.
  func (yylex *Lexer) Text() string

  // Line returns the current line number.
  // The first line is 0.
  func (yylex *Lexer) Line() int

  // Column returns the current column number.
  // The first column is 0.
  func (yylex *Lexer) Column() int