go.starlark.net@v0.0.0-20231101134539-556fd59b42f6/doc/impl.md

# Starlark in Go: Implementation

This document (a work in progress) describes some of the design
choices of the Go implementation of Starlark.

  * [Scanner](#scanner)
  * [Parser](#parser)
  * [Resolver](#resolver)
  * [Evaluator](#evaluator)
    * [Data types](#data-types)
    * [Freezing](#freezing)
    * [Fail-fast iterators](#fail-fast-iterators)
  * [Testing](#testing)

## Scanner

The scanner is derived from Russ Cox's
[buildifier](https://github.com/bazelbuild/buildtools/tree/master/buildifier)
tool, which pretty-prints Bazel BUILD files.

Most of the work happens in `(*scanner).nextToken`.

## Parser

The parser is a hand-written recursive-descent parser. It uses the
technique of [precedence
climbing](http://www.engr.mun.ca/~theo/Misc/exp_parsing.htm#climbing)
to reduce the number of productions.
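
As an illustration of precedence climbing in general (not the parser's actual code), here is a minimal sketch that parses and evaluates integer expressions. The names `exprParser`, `prec`, `binary`, and `evalExpr` are hypothetical, error handling is omitted, and division by zero would panic. A single `binary` function serves every precedence level, which is how the technique avoids one grammar production per level.

```go
package main

// prec gives the precedence of each binary operator; higher binds tighter.
var prec = map[byte]int{'+': 1, '-': 1, '*': 2, '/': 2}

// exprParser is a hypothetical parser over digits, operators, and parentheses.
type exprParser struct {
	src string
	pos int
}

// peek skips spaces and returns the next byte, or 0 at end of input.
func (p *exprParser) peek() byte {
	for p.pos < len(p.src) && p.src[p.pos] == ' ' {
		p.pos++
	}
	if p.pos < len(p.src) {
		return p.src[p.pos]
	}
	return 0
}

// primary parses an integer literal or a parenthesized expression.
func (p *exprParser) primary() int {
	if p.peek() == '(' {
		p.pos++ // consume "("
		v := p.binary(1)
		if p.peek() == ')' {
			p.pos++ // consume ")"
		}
		return v
	}
	n := 0
	for p.pos < len(p.src) && '0' <= p.src[p.pos] && p.src[p.pos] <= '9' {
		n = n*10 + int(p.src[p.pos]-'0')
		p.pos++
	}
	return n
}

// binary parses operators whose precedence is at least minPrec.
func (p *exprParser) binary(minPrec int) int {
	lhs := p.primary()
	for {
		op := p.peek()
		opPrec, ok := prec[op]
		if !ok || opPrec < minPrec {
			return lhs
		}
		p.pos++                     // consume the operator
		rhs := p.binary(opPrec + 1) // +1 makes the operator left-associative
		lhs = apply(op, lhs, rhs)
	}
}

// apply evaluates one binary operation.
func apply(op byte, x, y int) int {
	switch op {
	case '+':
		return x + y
	case '-':
		return x - y
	case '*':
		return x * y
	}
	return x / y
}

// evalExpr parses and evaluates src in one pass.
func evalExpr(src string) int {
	p := &exprParser{src: src}
	return p.binary(1)
}
```
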

In some places the parser accepts a larger set of programs than are
strictly valid, leaving the task of rejecting them to the subsequent
resolver pass. For example, in the function call `f(a, b=c)` the
parser accepts any expression for `a` and `b`, even though `b` may
legally be only an identifier. For the parser to distinguish these
cases would require additional lookahead.

## Resolver

The resolver reports structural errors in the program, such as the use
of `break` and `continue` outside of a loop.

Starlark has stricter syntactic limitations than Python. For example,
it does not permit `for` loops or `if` statements at top level, nor
does it permit global variables to be bound more than once.
These limitations come from the Bazel project's desire to make it easy
to identify the sole statement that defines each global, permitting
accurate cross-reference documentation.

In addition, the resolver validates all variable names, classifying
them as references to universal, global, local, or free variables.
Local and free variables are mapped to a small integer, allowing the
evaluator to use an efficient (flat) representation for the
environment.
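
The idea of mapping names to small integers can be sketched as follows. This is a simplified illustration, not the resolver's real data structures: `localResolver`, `bind`, `frame`, and `newFrame` are hypothetical names, and values are plain `int`s. Name lookup happens once, at resolve time; at run time a variable access is just a slice index.

```go
package main

// localResolver is a hypothetical, much-simplified stand-in for the resolver:
// it assigns each distinct local variable name a small integer index.
type localResolver struct {
	index map[string]int
}

// bind returns the index for name, allocating the next free slot on first use.
func (r *localResolver) bind(name string) int {
	if i, ok := r.index[name]; ok {
		return i
	}
	if r.index == nil {
		r.index = make(map[string]int)
	}
	i := len(r.index)
	r.index[name] = i
	return i
}

// frame is the flat run-time environment: one slot per resolved local.
type frame struct {
	locals []int
}

// newFrame allocates a frame with exactly as many slots as the resolver bound.
func newFrame(r *localResolver) *frame {
	return &frame{locals: make([]int, len(r.index))}
}
```
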

Not all features of the Go implementation are "standard" (that is,
supported by Bazel's Java implementation), at least for now, so
non-standard features such as `set` are flag-controlled. The resolver
reports any uses of dialect features that have not been enabled.


## Evaluator

### Data types

<b>Integers:</b> Integers are represented using `big.Int`, an
arbitrary-precision integer. This representation was chosen because,
for many applications, Starlark must be able to handle without loss
protocol buffer values containing signed and unsigned 64-bit integers,
which requires 65 bits of precision.
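
The 65-bit figure follows from the combined range: the smallest `int64` is -2^63 and the largest `uint64` is 2^64-1, so one representation needs 64 magnitude bits plus a sign. The sketch below checks this with `math/big`; the helper names `extremes` and `bothFit` are made up for illustration.

```go
package main

import (
	"math"
	"math/big"
)

// extremes returns the smallest int64 and the largest uint64 as big.Int values.
func extremes() (lo, hi *big.Int) {
	lo = big.NewInt(math.MinInt64)
	hi = new(big.Int).SetUint64(math.MaxUint64)
	return lo, hi
}

// bothFit reports whether a single 64-bit Go integer type could hold both values.
func bothFit(lo, hi *big.Int) bool {
	return (lo.IsInt64() && hi.IsInt64()) || (lo.IsUint64() && hi.IsUint64())
}
```
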

Small integers (<256) are preallocated, but all other values require
memory allocation. Integer performance is relatively poor, but it
matters little for Bazel-like workloads, which depend much
more on lists of strings than on integers. (Recall that a typical loop
over a list in Starlark does not materialize the loop index as an `int`.)

An optimization worth trying would be to represent integers using
either an `int32` or `big.Int`, with the `big.Int` used only when
`int32` does not suffice. Using `int32`, not `int64`, for "small"
numbers would make it easier to detect overflow from operations like
`int32 * int32`, which would trigger the use of `big.Int`.
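
A minimal sketch of that proposed optimization, under the assumption of a hypothetical two-form type `smallOrBig` (not a type in this package): because the product of two `int32` values always fits in an `int64`, overflow detection reduces to a range check on the widened result.

```go
package main

import (
	"math"
	"math/big"
)

// smallOrBig is a hypothetical two-form integer.
// large is nil while the value fits in an int32.
type smallOrBig struct {
	small int32
	large *big.Int
}

// mul multiplies two small values. The int64 product of two int32 values
// cannot overflow, so a range check suffices to detect int32 overflow.
func mul(x, y int32) smallOrBig {
	p := int64(x) * int64(y)
	if math.MinInt32 <= p && p <= math.MaxInt32 {
		return smallOrBig{small: int32(p)}
	}
	return smallOrBig{large: big.NewInt(p)} // fall back to big.Int
}
```
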

<b>Floating point</b>:
Floating point numbers are represented using Go's `float64`.
Again, `float` support is required to support protocol buffers. The
existence of floating-point NaN and its infamous comparison behavior
(`NaN != NaN`) had many ramifications for the API, since we cannot
assume the result of an ordered comparison is either less than,
greater than, or equal: it may also fail.
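
To make the fourth outcome concrete, here is a small sketch of a three-way comparison that can fail. `threeWay` is a hypothetical helper, not part of the package's API; it only shows why a comparison result cannot be modeled as a plain -1/0/+1 value.

```go
package main

import "math"

// nan is a NaN value used to demonstrate the unordered case.
var nan = math.NaN()

// threeWay returns -1, 0, or +1 when x and y are ordered,
// and ok=false when either operand is NaN and no ordering exists.
func threeWay(x, y float64) (cmp int, ok bool) {
	switch {
	case x < y:
		return -1, true
	case x > y:
		return +1, true
	case x == y:
		return 0, true
	}
	return 0, false // all three tests are false only if an operand is NaN
}
```
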

<b>Strings</b>:

TODO: discuss UTF-8 and string.bytes method.

<b>Dictionaries and sets</b>:
Starlark dictionaries have predictable iteration order.
Furthermore, many Starlark values are hashable in Starlark even though
the Go values that represent them are not hashable in Go: big
integers, for example.
Consequently, we cannot use Go maps to implement Starlark's dictionary.

We use a simple hash table whose buckets are linked lists, each
element of which holds up to 8 key/value pairs. In a well-distributed
table the list should rarely exceed length 1. In addition, each
key/value item is part of a doubly-linked list that maintains the
insertion order of the elements for iteration.
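
The layout described above can be sketched roughly as follows. This is a simplified illustration rather than the package's hash table: keys are strings hashed with FNV (a stand-in for Starlark's own hashing), the table never grows or deletes, `nbuckets` must be positive, and all type and function names are hypothetical.

```go
package main

import "hash/fnv"

const bucketSize = 8

// entry is one key/value pair. The prev/next links thread all entries
// into a doubly-linked list in insertion order.
type entry struct {
	key        string
	value      int
	prev, next *entry
}

// bucket holds up to bucketSize entries; overflow chains to another bucket.
type bucket struct {
	entries [bucketSize]*entry
	n       int
	next    *bucket
}

// orderedTable is a fixed-size hash table with insertion-order iteration.
type orderedTable struct {
	buckets    []bucket
	head, tail *entry
}

func newOrderedTable(nbuckets int) *orderedTable {
	return &orderedTable{buckets: make([]bucket, nbuckets)}
}

func hashString(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// find returns the entry for key, or nil if it is absent.
func (t *orderedTable) find(key string) *entry {
	for b := &t.buckets[hashString(key)%uint32(len(t.buckets))]; b != nil; b = b.next {
		for _, e := range b.entries[:b.n] {
			if e.key == key {
				return e
			}
		}
	}
	return nil
}

// put inserts or updates key. New keys are appended to the insertion-order list.
func (t *orderedTable) put(key string, value int) {
	if e := t.find(key); e != nil {
		e.value = value
		return
	}
	e := &entry{key: key, value: value, prev: t.tail}
	if t.tail != nil {
		t.tail.next = e
	} else {
		t.head = e
	}
	t.tail = e
	b := &t.buckets[hashString(key)%uint32(len(t.buckets))]
	for b.n == bucketSize { // walk the chain to a bucket with a free slot
		if b.next == nil {
			b.next = new(bucket)
		}
		b = b.next
	}
	b.entries[b.n] = e
	b.n++
}

// keys returns the keys in insertion order.
func (t *orderedTable) keys() []string {
	var ks []string
	for e := t.head; e != nil; e = e.next {
		ks = append(ks, e.key)
	}
	return ks
}
```
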

<b>Struct:</b>
The `starlarkstruct` Go package provides a non-standard Starlark
extension data type, `struct`, that maps field identifiers to
arbitrary values. Fields are accessed using dot notation: `y = s.f`.
This data type is extensively used in Bazel, but its specification is
currently evolving.

Starlark has no `class` mechanism, nor equivalent of Python's
`namedtuple`, though it is likely that future versions will support
some way to define a record data type of several fields, with a
representation more efficient than a hash table.


### Freezing

All mutable values created during module initialization are _frozen_
upon its completion. It is this property that permits a Starlark module
to be referenced by two Starlark threads running concurrently (such as
the initialization threads of two other modules) without the
possibility of a data race.

The Go implementation supports freezing by storing an additional
"frozen" Boolean variable in each mutable object. Once this flag is set,
all subsequent attempts at mutation fail. Every value defines a
Freeze method that sets its own frozen flag if not already set, and
calls Freeze for each value that it contains.
For example, when a list is frozen, it freezes each of its elements;
when a dictionary is frozen, it freezes each of its keys and values;
and when a function value is frozen, it freezes each of the free
variables and parameter default values implicitly referenced by its closure.
Application-defined types must also follow this discipline.
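
The discipline might look like this for a list-like type. The sketch uses a hypothetical one-method `value` interface and a `listValue` type in place of the real `starlark.Value` and list; only the shape of `Freeze` is the point. Setting the flag before recursing means a second call returns immediately, which also lets the traversal terminate on cyclic structures.

```go
package main

import "errors"

// value is a hypothetical, minimal stand-in for the starlark.Value interface.
type value interface {
	Freeze()
}

// intValue is immutable, so freezing it is a no-op.
type intValue int

func (intValue) Freeze() {}

// listValue is a mutable container with its own frozen flag.
type listValue struct {
	frozen bool
	elems  []value
}

// Freeze sets the list's own flag, then freezes every element it contains.
func (l *listValue) Freeze() {
	if l.frozen {
		return // already frozen: nothing to do, and cycles stop here
	}
	l.frozen = true
	for _, e := range l.elems {
		e.Freeze()
	}
}

// Append fails once the list has been frozen.
func (l *listValue) Append(v value) error {
	if l.frozen {
		return errors.New("cannot append to frozen list")
	}
	l.elems = append(l.elems, v)
	return nil
}
```
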

The freeze mechanism in the Go implementation is finer grained than in
the Java implementation: in effect, the latter has one "frozen" flag
per module, and every value holds a reference to the frozen flag of
its module. This makes setting the frozen flag more efficient (a
simple bit flip, with no need to traverse the object graph) but coarser
grained. It also complicates the API slightly, because constructing a
list, say, requires a reference to the frozen flag it should use.
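
For contrast, the shared-flag design can be sketched in Go as below. This is only an illustration of the idea described above, not a rendering of the Java implementation; `moduleMutability`, `sharedFlagList`, and `newSharedFlagList` are hypothetical names. Freezing is a single assignment, and the constructor must be handed the flag.

```go
package main

import "errors"

// moduleMutability is a hypothetical flag shared by every value created
// while one module initializes.
type moduleMutability struct {
	frozen bool
}

// sharedFlagList refers to its module's flag instead of owning one.
type sharedFlagList struct {
	mu    *moduleMutability
	elems []int
}

// newSharedFlagList shows the API cost: construction needs the flag.
func newSharedFlagList(mu *moduleMutability) *sharedFlagList {
	return &sharedFlagList{mu: mu}
}

// Append fails once the owning module's flag has been set.
func (l *sharedFlagList) Append(x int) error {
	if l.mu.frozen {
		return errors.New("cannot append to frozen list")
	}
	l.elems = append(l.elems, x)
	return nil
}
```
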

The Go implementation would also permit the freeze operation to be
exposed to the program, for example as a built-in function.
This has proven valuable in writing tests of the freeze mechanism
itself, but is otherwise mostly a curiosity.


### Fail-fast iterators

In some languages (such as Go), a program may mutate a data structure
while iterating over it; for example, a range loop over a map may
delete map elements. In other languages (such as Java), iterators do
extra bookkeeping so that modification of the underlying collection
invalidates the iterator, and the next attempt to use it fails.
This often helps to detect subtle mistakes.

Starlark takes this a step further. Instead of mutation of the
collection invalidating the iterator, the act of iterating makes the
collection temporarily immutable, so that an attempt to, say, delete a
dict element while looping over the dict, will fail. The error is
reported against the delete operation, not the iteration.

This is implemented by having each mutable iterable value record a
counter of active iterators. Starting a loop increments this counter,
and completing a loop decrements it. A collection with a nonzero
counter behaves as if frozen. If the collection is actually frozen,
the counter bookkeeping is unnecessary. (Consequently, iterator
bookkeeping is needed only while objects are still mutable, before
they can have been published to another thread, and thus no
synchronization is necessary.)

A consequence of this design is that in the Go API, it is imperative
to call `Done` on each iterator once it is no longer needed.
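
The counter scheme can be sketched as follows, using hypothetical `countedList` and `countedIterator` types over plain `int`s rather than the package's real list and iterator. Forgetting `Done` would leave the counter nonzero, and the list would then reject every later mutation.

```go
package main

import "errors"

// countedList is a hypothetical list that tracks its active iterators.
type countedList struct {
	elems     []int
	frozen    bool
	itercount int // number of active iterators; not maintained once frozen
}

// countedIterator walks a countedList.
type countedIterator struct {
	l *countedList
	i int
}

// Iterate starts an iteration, making the list temporarily immutable.
func (l *countedList) Iterate() *countedIterator {
	if !l.frozen {
		l.itercount++
	}
	return &countedIterator{l: l}
}

// Next stores the next element in *p and reports whether there was one.
func (it *countedIterator) Next(p *int) bool {
	if it.i < len(it.l.elems) {
		*p = it.l.elems[it.i]
		it.i++
		return true
	}
	return false
}

// Done ends the iteration and must be called exactly once per iterator.
func (it *countedIterator) Done() {
	if !it.l.frozen {
		it.l.itercount--
	}
}

// Append fails if the list is frozen or is being iterated over.
func (l *countedList) Append(x int) error {
	if l.frozen {
		return errors.New("cannot append to frozen list")
	}
	if l.itercount > 0 {
		return errors.New("cannot append to list during iteration")
	}
	l.elems = append(l.elems, x)
	return nil
}
```
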

```
TODO
starlark.Value interface and subinterfaces
argument passing to builtins: UnpackArgs, UnpackPositionalArgs.
```

<b>Evaluation strategy:</b>
The evaluator uses a simple recursive tree walk, returning a value or
an error for each expression. We have experimented with just-in-time
compilation of syntax trees to bytecode, but two limitations in the
current Go compiler prevent this strategy from outperforming the
tree-walking evaluator.

First, the Go compiler does not generate a "computed goto" for a
switch statement ([Go issue
5496](https://github.com/golang/go/issues/5496)). A bytecode
interpreter's main loop is a for-loop around a switch statement with
dozens or hundreds of cases, and the speed with which each case can be
dispatched strongly affects overall performance.
Currently, a switch statement generates a binary tree of ordered
comparisons, requiring several branches instead of one.

Second, the Go compiler's escape analysis assumes that the underlying
array from a `make([]Value, n)` allocation always escapes
([Go issue 20533](https://github.com/golang/go/issues/20533)).
Because the bytecode interpreter's operand stack has a non-constant
length, it must be allocated with `make`. The resulting allocation
adds to the cost of each Starlark function call; this can be tolerated
by amortizing one very large stack allocation across many calls.
More problematic appears to be the cost of the additional GC write
barriers incurred by every VM operation: every intermediate result is
saved to the VM's operand stack, which is on the heap.
By contrast, intermediate results in the tree-walking evaluator are
never stored to the heap.
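
The shape of interpreter loop under discussion is roughly the following toy stack machine. It is not the experimental bytecode from the text: the opcodes and the `run` function are invented for illustration, values are plain `int`s, and there is no bounds or error checking. It shows the two features named above, a for-loop around a switch and an operand stack allocated with `make` because its length is not a constant.

```go
package main

// opcode is a hypothetical instruction code for this toy machine.
type opcode byte

const (
	opPush   opcode = iota // push the operand byte that follows
	opAdd                  // pop two values, push their sum
	opMul                  // pop two values, push their product
	opReturn               // return the value on top of the stack
)

// run executes code using an operand stack of at most maxStack values.
func run(code []byte, maxStack int) int {
	stack := make([]int, maxStack) // non-constant length, so make is required
	sp := 0                        // index of the next free stack slot
	for pc := 0; pc < len(code); {
		op := opcode(code[pc])
		pc++
		switch op { // every instruction is dispatched through this switch
		case opPush:
			stack[sp] = int(code[pc])
			pc++
			sp++
		case opAdd:
			stack[sp-2] += stack[sp-1]
			sp--
		case opMul:
			stack[sp-2] *= stack[sp-1]
			sp--
		case opReturn:
			return stack[sp-1]
		}
	}
	return 0 // fell off the end without an opReturn
}
```
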

```
TODO
frames, backtraces, errors.
threads
Print
Load
```

## Testing

```
TODO
starlarktest package
`assert` module
starlarkstruct
integration with Go testing.T
```


## TODO


```
Discuss practical separation of code and data.
```
   241  Discuss practical separation of code and data.
   242  ```