= Nex =

Nex is a lexer similar to Lex/Flex that:

 - generates Go code instead of C code
 - integrates with Go's yacc instead of YACC/Bison
 - supports UTF-8
 - supports nested _structural regular expressions_.

See http://doc.cat-v.org/bell_labs/structural_regexps/se.pdf[Structural
Regular Expressions] by Rob Pike. I wrote this code to get acquainted with Go
and also to explore some of the ideas in the paper. Also, I've always been
meaning to implement algorithms I learned from a compilers course I took many
years ago. Back then, we never coded them; merely understanding the theory was
enough to pass the exam.

Go has a less general http://golang.org/pkg/scanner/[scanner package],
but it is especially suited for tokenizing Go code.

== Installation ==

  $ export GOPATH=/tmp/go
  $ go get github.com/blynn/nex

== Example ==

http://flex.sourceforge.net/manual/Simple-Examples.html[One simple example in
the Flex manual] is a scanner that counts characters and lines. The program is
similar in Nex:

------------------------------------------
/\n/{ nLines++; nChars++ }
/./{ nChars++ }
//
package main
import ("fmt";"os")
func main() {
  var nLines, nChars int
  NN_FUN(NewLexer(os.Stdin))
  fmt.Printf("%d %d\n", nLines, nChars)
}
------------------------------------------

The syntax resembles Awk more than Flex: each regex must be delimited. An empty
regex terminates the rules section and signifies the presence of user code,
which is printed on standard output with `NN_FUN` replaced by the generated
scanner.

Name the above example `lc.nex`. Then compile and run it by typing:

  $ nex -r -s lc.nex

The program runs on standard input and output.
For example:

  $ nex -r -s lc.nex < /usr/share/dict/words
  99171 938587

To generate Go code for a scanner without compiling and running it, type:

  $ nex -s < lc.nex  # Prints code on standard output.

or:

  $ nex -s lc.nex  # Writes code to lc.nn.go

The `NN_FUN` macro is primitive, but I was unable to think of another way to
achieve an Awk-esque feel. Purists unable to tolerate text substitution will
need more code:

------------------------------------------
/\n/{ lval.l++; lval.c++ }
/./{ lval.c++ }
//
package main
import ("fmt";"os")
type yySymType struct { l, c int }
func main() {
  v := new(yySymType)
  NewLexer(os.Stdin).Lex(v)
  fmt.Printf("%d %d\n", v.l, v.c)
}
------------------------------------------

and must run `nex` without the `-s` option:

  $ nex lc.nex

We could avoid defining a struct by using globals instead, but even then we
would need a throwaway definition of `yySymType`.

The `yy` prefix can be changed with the `-p` option. When using yacc, supply
the same prefix via its `-p` option:

  $ nex -p YY lc.nex && go tool yacc -p YY && go run lc.nn.go y.go

== Toy Pascal ==

The Flex manual also exhibits a
http://flex.sourceforge.net/manual/Simple-Examples.html[scanner for a toy
Pascal-like language], though last I checked, its comment regex was a little
buggy.
Here is a
modified Nex version, without string-to-number conversions:

------------------------------------------
/[0-9]+/ { println("An integer:", txt()) }
/[0-9]+\.[0-9]*/ { println("A float:", txt()) }
/if|then|begin|end|procedure|function/
  { println("A keyword:", txt()) }
/[a-z][a-z0-9]*/ { println("An identifier:", txt()) }
/\+|-|\*|\// { println("An operator:", txt()) }
/[ \t\n]+/ { /* eat up whitespace */ }
/./ { println("Unrecognized character:", txt()) }
/{[^\{\}\n]*}/ { /* eat up one-line comments */ }
//
package main
import "os"
func main() {
  lex := NewLexer(os.Stdin)
  txt := func() string { return lex.Text() }
  NN_FUN(lex)
}
------------------------------------------

Enough simple examples! Let us see what nesting can do.

== Peter into silicon ==

In ``Structural Regular Expressions'', Pike imagines a newline-agnostic Awk
that operates on matched text, rather than on the whole line containing a
match, and writes code converting an input array of characters into
descriptions of rectangles. For example, given an input such as:

------------------------------------------
    #######
   #########
  ####  #####
  ####   ####  #
  ####  #####
  ####   ###
   ########   #####
    ####     #########
    ####    # #   ####
     ##    #  ###   ##
      ###  #    ###
       ###      ##
        ##       #
         #      ####
          #    #
         ## #   ##
------------------------------------------

we wish to produce something like:

------------------------------------------
rect 5 12 1 2
rect 4 13 2 3
rect 3 7 3 4
rect 9 14 3 4
...
rect 10 12 16 17
------------------------------------------

With Nex, we don't have to imagine: such programs are real. Below are practical
Nex programs that strongly resemble their theoretical counterparts.
The one-character-at-a-time variant:

------------------------------------------
/ /{ x++ }
/#/{ println("rect", x, x+1, y, y+1); x++ }
/\n/{ x=1; y++ }
//
package main
import "os"
func main() {
  x, y := 1, 1
  NN_FUN(NewLexer(os.Stdin))
}
------------------------------------------

The one-run-at-a-time variant:

------------------------------------------
/ +/{ x+=len(txt()) }
/#+/{ println("rect", x, x+len(txt()), y, y+1); x+=len(txt()) }
/\n/{ x=1; y++ }
//
package main
import "os"
func main() {
  x, y := 1, 1
  lex := NewLexer(os.Stdin)
  txt := func() string { return lex.Text() }
  NN_FUN(lex)
}
------------------------------------------

The programs are more verbose than Awk because Go is the backend.

== Rob but not robot ==

Pike demonstrates how nesting structural expressions leads to a few simple text
editor commands to print all lines containing "rob" but not "robot". Though Nex
fails to separate looping from matching, a corresponding program is bearable:

------------------------------------------
/[^\n]*\n/ < { isrobot = false; isrob = false }
/robot/ { isrobot = true }
/rob/ { isrob = true }
> { if isrob && !isrobot { fmt.Print(lex.Text()) } }
//
package main
import ("fmt";"os")
func main() {
  var isrobot, isrob bool
  lex := NewLexer(os.Stdin)
  NN_FUN(lex)
}
------------------------------------------

The "<" and ">" delimit nested expressions, and work as follows.
On reading a line, we find it matches the first regex, so we execute the code
immediately following the opening "<".

Then it's as if we run Nex again, except we focus only on the patterns and
actions up to the closing ">", with the matched line as the entire input.
Thus
we look for occurrences of "rob" and "robot" in just the matched line and set
flags accordingly.

After the line ends, we execute the code following the closing ">" and return
to our original state, scanning for more lines.

== Word count ==

We can simultaneously count lines, words, and characters with Nex thanks to
nesting:

------------------------------------------
/[^\n]*\n/ < {}
/[^ \t\r\n]*/ < {}
/./ { nChars++ }
> { nWords++ }
/./ { nChars++ }
> { nLines++ }
//
package main
import ("fmt";"os")
func main() {
  var nLines, nWords, nChars int
  NN_FUN(NewLexer(os.Stdin))
  fmt.Printf("%d %d %d\n", nLines, nWords, nChars)
}
------------------------------------------

The first regex matches entire lines: each line is passed to the first level
of nested regexes. Within this level, the first regex matches words in the
line: each word is passed to the second level of nested regexes. Within
the second level, a regex causes every character of the word to be counted.

Lastly, we also count whitespace characters, a task performed by the second
regex of the first level of nested regexes. We could remove this rule
to count only non-whitespace characters.

== UTF-8 ==

The following Nex program converts Eastern Arabic numerals to the digits used
in the Western world, and also converts Chinese phrases for numbers (the
analog of something like "one hundred and fifty-three") into digits.

------------------------------------------
/[零一二三四五六七八九十百千]+/ { fmt.Print(zhToInt(txt())) }
/[٠-٩]/ {
  // The above character class might show up right-to-left in a browser.
  // The equivalent of 0 should be on the left, and the equivalent of 9 should
  // be on the right.
  //
  // The Eastern Arabic numerals are ٠١٢٣٤٥٦٧٨٩.
  fmt.Print([]rune(txt())[0] - rune('٠'))
}
/./ { fmt.Print(txt()) }
//
package main
import ("fmt";"os")
func zhToInt(s string) int {
  n := 0
  prev := 0
  f := func(m int) {
    if 0 == prev { prev = 1 }
    n += m * prev
    prev = 0
  }
loop:
  for _, c := range s {
    for m, v := range []rune("一二三四五六七八九") {
      if v == c {
        prev = m+1
        continue loop
      }
    }
    switch c {
    case '零':
    case '十': f(10)
    case '百': f(100)
    case '千': f(1000)
    }
  }
  n += prev
  return n
}
func main() {
  lex := NewLexer(os.Stdin)
  txt := func() string { return lex.Text() }
  NN_FUN(lex)
}
------------------------------------------

== nex and Go's yacc ==

The parser generated by `go tool yacc` exports so little that it's easiest to
keep the lexer and the parser in the same package.

Here's a yacc file based on the
http://dinosaur.compilertools.net/bison/bison_5.html[reverse-Polish-notation
calculator example from the Bison manual]:

------------------------------------------
%{
package main
import "fmt"
%}

%union {
  n int
}

%token NUM
%%
input:  /* empty */
      | input line
;

line:   '\n'
      | exp '\n'  { fmt.Println($1.n); }
;

exp:    NUM           { $$.n = $1.n; }
      | exp exp '+'   { $$.n = $1.n + $2.n; }
      | exp exp '-'   { $$.n = $1.n - $2.n; }
      | exp exp '*'   { $$.n = $1.n * $2.n; }
      | exp exp '/'   { $$.n = $1.n / $2.n; }
      /* Unary minus */
      | exp 'n'       { $$.n = -$1.n; }
;
%%
------------------------------------------

We must import `fmt` even if we don't use it, since code generated by yacc
needs it. Also, the `%union` is mandatory; it generates `yySymType`.

Call the above `rp.y`. Then a suitable lexer, say `rp.nex`, might be:

------------------------------------------
/[ \t]/ { /* Skip blanks and tabs. */ }
/[0-9]*/ { lval.n,_ = strconv.Atoi(yylex.Text()); return NUM }
/./ { return int(yylex.Text()[0]) }
//
package main
import ("os";"strconv")
func main() {
  yyParse(NewLexer(os.Stdin))
}
------------------------------------------

Compile the two with:

  $ nex rp.nex && go tool yacc rp.y && go build y.go rp.nn.go

For brevity, we work in the `main` package. In a larger project we might want
to write a package that exports a function wrapped around `yyParse()`. This is
fine, provided the parser and the lexer are both in the same package.

Alternatively, we could use yacc's `-p` option to change the prefix from `yy`
to one that begins with an uppercase letter.

== Matching the beginning and end of input ==

We can simulate Awk's BEGIN and END blocks with a regex that matches the entire
input:

------------------------------------------
/.*/ < { println("BEGIN") }
/a/ { println("a") }
> { println("END") }
//
package main
import "os"
func main() {
  NN_FUN(NewLexer(os.Stdin))
}
------------------------------------------

However, this causes Nex to read the entire input into memory. To solve
this problem, Nex supports the following syntax:

------------------------------------------
< { println("BEGIN") }
/a/ { println("a") }
> { println("END") }
package main
import "os"
func main() {
  NN_FUN(NewLexer(os.Stdin))
}
------------------------------------------

In other words, if a bare '<' appears as the first pattern, then its action is
executed before reading the input. The last pattern must be a bare '>', and its
action is executed on end of input.

Additionally, no empty regex is needed to mark the beginning of the Go program.
(Fortunately, an empty regex is also a Go comment, so there's no harm done if
one is present.)
== Matching Nuances ==

Among rules in the same scope, the longest matching pattern takes precedence.
In the event of a tie, the first pattern wins.

Unanchored patterns never match the empty string. For example,

  /(foo)*/ {}

matches "foo" and "foofoo", but not "".

Anchored patterns can match the empty string at most once; after the match, the
start or end null strings are "used up" and will not match again.

Internally, this is implemented by omitting the very first check to see whether
the current state is accepting when running the DFA corresponding to the regex.
An alternative would be to simply ignore matches of length 0, but I chose to
allow anchored empty matches just in case there turn out to be applications for
them. I'm open to changing this behaviour.

== Contributing and Testing ==

Check out this repo (or a clone) into a directory with the following structure:

  mkdir -p nex/src
  cd nex/src
  git clone https://github.com/blynn/nex.git

The Makefile will put the binary into e.g. `nex/bin`.

== Reference ==

------------------------------------------
// NewLexer creates a new Lexer object reading from the given io.Reader.
func NewLexer(in io.Reader) *Lexer

// NewLexerWithInit creates a new Lexer object, runs the given callback on it,
// then returns it.
func NewLexerWithInit(in io.Reader, initFun func(*Lexer)) *Lexer

// Lex runs the lexer. Always returns 0.
// When the -s option is given, this function is not generated;
// instead, the NN_FUN macro runs the lexer.
func (yylex *Lexer) Lex(lval *yySymType) int

// Text returns the matched text.
func (yylex *Lexer) Text() string

// Line returns the current line number.
// The first line is 0.
func (yylex *Lexer) Line() int

// Column returns the current column number.
// The first column is 0.
func (yylex *Lexer) Column() int
------------------------------------------