github.com/solo-io/cue@v0.4.7/doc/ref/impl.md

     1  # Implementing CUE
     2  
     3  
     4  > NOTE: this is a working document attempting to describe CUE in a way
     5  > relatable to existing graph unification systems. It is mostly
     6  > redundant to [the spec](./spec.md). Unless one is interested in
     7  > understanding how to implement CUE or how it relates to the existing
     8  > body of research, read the spec instead.
     9  
    10  
    11  CUE is modeled after typed feature structure and graph unification systems
    12  such as LKB.
    13  There is a wealth of research related to such systems and graph unification in
    14  general.
    15  This document describes the core semantics of CUE in a notation
    16  that allows relating it to this existing body of research.
    17  
    18  
    19  ## Background
    20  
    21  CUE was inspired by a formalism known as
    22  typed attribute structures [Carpenter 1992] or
    23  typed feature structures [Copestake 2002],
    24  which are used in linguistics to encode grammars and
    25  lexicons. Being able to effectively encode large amounts of data in a rigorous
    26  manner, this formalism seemed like a great fit for large-scale configuration.
    27  
    28  Although CUE configurations are specified as trees, not graphs, implementations
    29  can benefit from considering them as graphs when dealing with cycles,
    30  and effectively turning them into graphs when applying techniques like
    31  structure sharing.
    32  Dealing with cycles is well understood for typed attribute structures
    33  and as CUE configurations are formally closely related to them,
    34  we can benefit from this knowledge without reinventing the wheel.
    35  
    36  ## Formal Definition
    37  
    38  
    39  <!--
    40  The previous section is equivalent to the below text with the main difference
    41  that it is only defined for trees. Technically, structs are more akin to dags,
    42  but that is hard to explain at this point and also unnecessarily pedantic.
    43   We keep the definition closer to trees and will layer treatment
    44  of cycles on top of these definitions to achieve the same result (possibly
    45  without the benefits of structure sharing of a dag).
    46  
    47  A _field_ is a field name, or _label_, and a prototype.
    48  A _struct_ is a set of _fields_ with unique labels for each field.
    49  -->
    50  
    51  A CUE configuration can be defined in terms of constraints, which are
    52  analogous to typed attribute structures referred to above.
    53  
    54  ### Definition of basic values
    55  
    56  > A _basic value_ is any CUE value that is not a struct (or, by
    57  > extension, a list).
    58  > All basic values are partially ordered in a lattice, such that for any
    59  > basic value `a` and `b` there is a unique greatest lower bound
    60  > defined for the subsumption relation `a ⊑ b`.
    61  
    62  ```
    63  Basic values
    64  null
    65  true
    66  bool
    67  3.14
    68  string
    69  "Hello"
    70  >=0
    71  <8
    72  re("Hello .*!")
    73  ```
    74  
    75  The basic values correspond to their respective types defined earlier.
    76  
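The lattice of basic values can be illustrated with a deliberately tiny model. The following is a minimal sketch, not CUE's implementation: it covers only atoms and three type names, and omits bounds such as `>=0` and regular expressions. The names `Type`, `instance_of`, and `meet` are hypothetical and chosen for this example only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Type:
    """A hypothetical stand-in for a CUE basic type such as `string`."""
    name: str


TOP = Type("_")       # subsumes every basic value
BOTTOM = Type("_|_")  # the error value, subsumed by every basic value

# Assumed mapping from type names to Python classes, for illustration only.
_PYTHON_CLASSES = {"bool": (bool,), "string": (str,), "number": (int, float)}


def instance_of(atom, t):
    """Report whether a concrete atom is an instance of type t."""
    if t.name == "number" and isinstance(atom, bool):
        return False  # Python's bool subclasses int; a CUE bool is not a number
    return isinstance(atom, _PYTHON_CLASSES[t.name])


def meet(a, b):
    """Greatest lower bound of two basic values in this toy lattice."""
    if type(a) is type(b) and a == b:
        return a
    if a == TOP:
        return b
    if b == TOP:
        return a
    if a == BOTTOM or b == BOTTOM:
        return BOTTOM
    if isinstance(a, Type) and isinstance(b, Type):
        return BOTTOM  # distinct types share no instances in this model
    if isinstance(a, Type):
        return b if instance_of(b, a) else BOTTOM
    if isinstance(b, Type):
        return a if instance_of(a, b) else BOTTOM
    return BOTTOM  # two distinct atoms have no common instance
```

In this model `meet(Type("string"), "Hello")` yields the atom itself, while the meet of an atom with an unrelated type yields the error value.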
    77  Structs (and, by extension, lists) are represented by the abstract notion of
    78  a typed feature structure.
    79  Each node in a configuration, including the root node,
    80  is associated with a constraint.
    81  
    82  
    83  ### Definition of typed feature structures and substructures
    84  
    85  <!-- jba: This isn't adding understanding. I'd rather you omitted it and
    86     added a bit of rigor to the above spec. Or at a minimum, translate the
    87     formalism into the terms you use above.
    88  -->
    89  
    90  > A _typed feature structure_ defined for a finite set of labels `Label`
    91  > is a directed acyclic graph with labeled
    92  > arcs and values, represented by a tuple `C = <Q, q0, υ, δ>`, where
    93  >
    94  > 1. `Q` is the finite set of nodes,
    95  > 1. `q0 ∈ Q` is the root node,
    96  > 1. `υ: Q → T` is the total node typing function,
    97  >     for a finite set of possible terms `T`,
    98  > 1. `δ: Q × Label → Q` is the partial feature function,
    99  >
   100  > subject to the following conditions:
   101  >
   102  > 1. there is no node `q` or label `l` such that `δ(q, l) = q0` (root)
   103  > 2. for every node `q` in `Q` there is a path `π` (i.e. a sequence of
   104  >    members of Label) such that `δ(q0, π) = q` (unique root, correctness)
   105  > 3. there is no node `q` or non-empty path `π` such that `δ(q, π) = q` (no cycles)
   106  >
   107  > where `δ` is extended to be defined on paths as follows:
   108  >
   109  > 1. `δ(q, ϵ) = q`, where `ϵ` is the empty path
   110  > 2. `δ(q, l∙π) = δ(δ(q, l), π)`
   111  >
   112  > The _substructures_ of a typed feature structure are the
   113  > typed feature structures rooted at each node in the structure.
   114  >
   115  > The set of all possible typed feature structures for a given label
   116  > set is denoted as `𝒞`<sub>`Label`</sub>.
   117  >
   118  > The set of _terms_ for label set `Label` is recursively defined as
   119  >
   120  > 1. every basic value is a term: `P ⊆ T`
   121  > 1. every constraint in `𝒞`<sub>`Label`</sub> is a term: `𝒞`<sub>`Label`</sub>` ⊆ T`;
   122  >    a _reference_ may refer to any substructure of `C`.
   123  > 1. for every `n` values `t₁, ..., tₙ`, and every `n`-ary function symbol
   124  >    `f ∈ F_n`, the value `f(t₁,...,tₙ) ∈ T`.
   125  >
   126  
   127  
   128  This definition has been taken and modified from [Carpenter 1992]
   129  and [Copestake 2002].
   130  
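The tuple definition above can be made concrete with a small data structure. This is a minimal sketch under stated assumptions: nodes are plain strings, terms are arbitrary Python values, and the class name `TFS` and its methods `follow` and `is_valid` are hypothetical names introduced here. It assumes the feature function takes a node first and a label second, as in `δ(q, l)`.

```python
from dataclasses import dataclass


@dataclass
class TFS:
    nodes: set    # Q, the finite set of nodes
    root: str     # q0, the root node
    typing: dict  # υ: node -> term (total)
    delta: dict   # δ: (node, label) -> node (partial)

    def follow(self, q, path):
        """δ extended to paths: δ(q, ϵ) = q and δ(q, l∙π) = δ(δ(q, l), π)."""
        for label in path:
            q = self.delta.get((q, label))
            if q is None:
                return None  # δ is partial: the path does not exist
        return q

    def is_valid(self):
        """Check the three conditions: root, reachability, and no cycles."""
        if self.root in self.delta.values():
            return False  # some arc leads back to the root
        successors = {}
        for (q, _label), target in self.delta.items():
            successors.setdefault(q, []).append(target)
        state = {}  # node -> "open" while on the DFS stack, "done" afterwards

        def acyclic_from(q):
            if state.get(q) == "open":
                return False  # reached a node that is still on the stack
            if state.get(q) == "done":
                return True
            state[q] = "open"
            ok = all(acyclic_from(t) for t in successors.get(q, []))
            state[q] = "done"
            return ok

        # Every node must have been visited from the root.
        return acyclic_from(self.root) and set(state) == self.nodes


# A structure in which the paths a.b and c reach the same node q2.
tfs = TFS(
    nodes={"q0", "q1", "q2"},
    root="q0",
    typing={"q0": "{}", "q1": "{}", "q2": 1},
    delta={("q0", "a"): "q1", ("q1", "b"): "q2", ("q0", "c"): "q2"},
)
```

Because `q2` is reachable along two paths, this example is a dag rather than a tree, which is the kind of sharing the definition permits.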
   131  Without loss of generality, we will henceforth assume that the given set
   132  of labels is constant and denote `𝒞`<sub>`Label`</sub> as `𝒞`.
   133  
   134  In CUE configurations, the abstract constraints indicated by `υ`
   135  are CUE expressions.
   136  Literal structs can be treated as part of the original typed feature structure
   137  and do not need evaluation.
   138  Any other expression is evaluated and unified with existing values of that node.
   139  
   140  References in expressions refer to other nodes within `C` and represent
   141  a copy of the substructure `C'` of `C` rooted at these nodes.
   142  Any references occurring in terms assigned to nodes of `C'` are updated to
   143  point to the equivalent node in a copy of `C'`.
   144  <!-- TODO: define formally. Right now this is implied already by the
   145  definition of evaluation functions and unification: unifying
   146  the original TFS' structure of the constraint with the current node
   147  preserves the structure of the original graph by definition.
   148  This is getting very implicit, though.
   149  -->
   150  The functions defined by `F` correspond to the binary and unary operators
   151  and interpolation construct of CUE, as well as builtin functions.
   152  
   153  CUE allows duplicate labels within a struct, while the definition of
   154  typed feature structures does not.
   155  A duplicate label `l` with respective values `a` and `b` is represented in
   156  a constraint as a single label with term `&(a, b)`,
   157  the unification of `a` and `b`.
   158  Multiple labels may be recursively combined in any order.
   159  
   160  <!-- unnecessary, probably.
   161  #### Definition of evaluated value
   162  
   163  > A fully evaluated value, `T_evaluated ⊆ T` is a subset of `T` consisting
   164  > only of atoms, typed attribute structures and constraint functions.
   165  >
   166  > A value is called _ground_ if it is an atom or typed attribute structure.
   167  
   168  #### Unification of evaluated values
   169  
   170  > A fully evaluated value, `T_evaluated ⊆ T` is a subset of `T` consisting
   171  > only of atoms, typed attribute structures and constraint functions.
   172  >
   173  > A value is called _ground_ if it is an atom or typed attribute structure.
   174  -->
   175  
   176  ### Definition of subsumption and unification on typed attribute structures
   177  
   178  > For a given collection of constraints `𝒞`,
   179  > we define `π ≡`<sub>`C`</sub> `π'` to mean that typed feature structure `C ∈ 𝒞`
   180  > contains path equivalence between the paths `π` and `π'`
   181  > (i.e. `δ(q0, π) = δ(q0, π')`, where `q0` is the root node of `C`);
   182  > and `𝒫`<sub>`C`</sub>`(π) = c` to mean that
   183  > the typed feature structure at the path `π` in `C`
   184  > is `c` (i.e. `𝒫`<sub>`C`</sub>`(π) = c` if and only if `υ(δ(q0, π)) == c`,
   185  > where `q0` is the root node of `C`).
   186  > Subsumption is then defined as follows:
   187  > `C ∈ 𝒞` subsumes `C' ∈ 𝒞`, written `C' ⊑ C`, if and only if:
   188  >
   189  > - `π ≡`<sub>`C`</sub> `π'` implies `π ≡`<sub>`C'`</sub> `π'`
   190  > - `𝒫`<sub>`C`</sub>`(π) = c` implies `𝒫`<sub>`C'`</sub>`(π) = c'` and `c' ⊑ c`
   191  >
   192  > The unification of `C` and `C'`, denoted `C ⊓ C'`,
   193  > is the greatest lower bound of `C` and `C'` in `𝒞` ordered by subsumption.
   194  
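The two subsumption conditions can be checked directly on small, acyclic structures. The following is a minimal sketch, not an efficient algorithm: each structure is given as a hypothetical `(root, typing, delta)` triple, `all_paths` enumerates every path from the root, and the ordering on terms is passed in as an assumed `leq` function.

```python
def all_paths(root, delta):
    """Map every path (a tuple of labels) from root to the node it reaches.

    Terminates only for acyclic structures, as the definition requires.
    """
    found, todo = {(): root}, [((), root)]
    while todo:
        path, q = todo.pop()
        for (src, label), target in delta.items():
            if src == q:
                found[path + (label,)] = target
                todo.append((path + (label,), target))
    return found


def subsumes(c, c_prime, leq):
    """Report whether c subsumes c_prime, that is, c_prime ⊑ c."""
    p = all_paths(c[0], c[2])
    p_prime = all_paths(c_prime[0], c_prime[2])
    for path, q in p.items():
        if path not in p_prime:
            return False  # c_prime lacks a path that c defines
        # Path values: the term in c_prime must be at least as specific.
        if not leq(c_prime[1][p_prime[path]], c[1][q]):
            return False
        # Path equivalence: paths sharing a node in c must share one in c_prime.
        for other, q2 in p.items():
            if q2 == q and p_prime.get(other) != p_prime[path]:
                return False
    return True


def leq(lo, hi):
    """Assumed term ordering: "_" is top, otherwise terms must be equal."""
    return hi == "_" or (type(lo) is type(hi) and lo == hi)


# `a` and `b` are distinct nodes in `general` and one shared node in `shared`.
general = ("r", {"r": "{}", "x": 1, "y": 1}, {("r", "a"): "x", ("r", "b"): "y"})
shared = ("r", {"r": "{}", "x": 1}, {("r", "a"): "x", ("r", "b"): "x"})
```

Here `general` subsumes `shared`, but not the other way around: `shared` asserts a path equivalence between `a` and `b` that `general` does not contain.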
   195  <!-- jba: So what does this get you that you don't already have from the
   196  various "instance-of" definitions in the main spec? I thought those were
   197  sufficiently precise. Although I admit that references and cycles
   198  are still unclear to me. -->
   199  
   200  As with the subsumption relation for basic values,
   201  the subsumption relation for constraints determines the mutual placement
   202  of constraints within the partial order of all values.
   203  
   204  
   205  ### Evaluation function
   206  
   207  > The evaluation function is given by `E: T → 𝒞`.
   208  > The unification of two typed feature structures is evaluated as defined above.
   209  > All other functions are evaluated according to the definitions found
   210  > in [the spec](./spec.md).
   211  > An error is indicated by `_|_`.
   212  
   213  #### Definition of well-formedness
   214  
   215  > We say that a given typed feature structure `C = <Q, q0, υ, δ> ∈ 𝒞` is
   216  > a _well-formed_ typed feature structure if and only if for all nodes `q ∈ Q`,
   217  > the substructure `C'` rooted at `q`
   218  > is such that `E(υ(q)) ∈ 𝒞` and `C' = <Q', q, υ', δ'> ⊑ E(υ(q))`.
   219  
   220  <!-- Also, like Copestake, define appropriate features?
   221  Appropriate features are useful for detecting unused variables.
   222  
   223  Appropriate features could be introduced by distinguishing between:
   224  
   225  a: MyStruct // appropriate features are MyStruct
   226  a: {a : 1}
   227  
   228  and
   229  
   230  a: MyStruct & { a: 1 } // appropriate features are those of MyStruct + 'a'
   231  
   232  This is way too subtle, though.
   233  
   234  Alternatively: use Haskell's approach:
   235  
   236  #a: MyStruct // define a to be MyStruct any other features are allowed but
   237               // discarded from the model. Unused features are an error.
   238  
   239  Let's first try to see if we can get away with static usage analysis.
   240  A variant would be to define appropriate features unconditionally, but enforce
   241  them only for unused variables, with some looser definition of unused.
   242  -->
   243  
   244  The _evaluation_ of a CUE configuration represented by `C`
   245  is defined as the process of making `C` well-formed.
   246  
   247  <!--
   248  More abstractly, we can define this structure as the tuple
   249  `<≡, 𝒫>`, where
   250  
   251  - `≡ ⊆ Path × Path` where `π ≡ π'` if and only if `Δ(π, q0) = Δ(π', q0)` (path equivalence)
   252  - `P: Path → ℙ` is `υ(Δ(π, q))` (path value).
   253  
   254  A struct `a = <≡, 𝒫>` subsumes a struct `b = <≡', 𝒫'>`, or `a ⊑ b`,
   255  if and only if
   256  
   257  - `π ≡ π'` implies `π ≡' π'`, and
   258  - `𝒫(π) = v` implies `𝒫'(π) = v'` and `v' ⊑ v`
   259  -->
   260  
   261  ### References
   262  Theory:
   263  - [1992] Bob Carpenter, "The logic of typed feature structures.";
   264    Cambridge University Press, ISBN:0-521-41932-8
   265  - [2002] Ann Copestake, "Implementing Typed Feature Structure Grammars.";
   266    CSLI Publications, ISBN 1-57586-261-1
   267  
   268  Some graph unification algorithms:
   269  
   270  - [1985] Fernando C. N. Pereira, "A structure-sharing representation for
   271    unification-based grammar formalisms."; In Proc. of the 23rd Annual Meeting of
   272    the Association for Computational Linguistics. Chicago, IL
   273  - [1991] Hideto Tomabechi, "Quasi-destructive graph unifications."; In Proceedings
   274    of the 29th Annual Meeting of the ACL. Berkeley, CA
   275  - [1992] Hideto Tomabechi, "Quasi-destructive graph unifications with structure-
   276     sharing."; In Proceedings of the 15th International Conference on
   277     Computational Linguistics (COLING-92), Nantes, France.
   278  - [2001] Marcel van Lohuizen, "Memory-efficient and thread-safe
   279    quasi-destructive graph unification."; In Proceedings of the 38th Meeting of
   280    the Association for Computational Linguistics. Hong Kong, China.
   281  
   282  
   283  ## Implementation
   284  
   285  The _evaluation_ of a CUE configuration `C` is defined as the process of
   286  making `C` well-formed.
   287  
   288  
   289  This section does not define any operational semantics.
   290  As the unification operation is commutative, associative, and idempotent,
   291  implementations have a considerable amount of leeway in
   292  choosing an evaluation strategy.
   293  Although most algorithms for the unification of typed attribute structures
   294  that have been proposed are near `O(n)`, there can be considerable performance
   295  benefits to choosing one of the many proposed evaluation strategies over
   296  another.
   297  Implementations will need to be verified against the above formal definition.
   298  
   299  
   300  ### Constraint functions
   301  
   302  A _constraint function_ is a unary function `f` that, for any input `a`, only
   303  returns values that are instances of `a`. For instance, the constraint
   304  function `f` for `string` returns `"foo"` for `f("foo")` and `_|_` for `f(1)`.
   305  Constraint functions may take other constraint functions as arguments to
   306  produce a more restrictive constraint function.
   307  For instance, the constraint function `f` for `<=8` returns `5` for `f(5)`,
   308  `>=5 & <=8` for `f(>=5)`, and `_|_` for `f("foo")`.
   309  
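The two examples above can be sketched as callables. This is an illustrative model only, not how CUE represents bounds: the class names `AtMost`, `AtLeast`, and `Between`, the function `string_constraint`, and the `BOTTOM` sentinel are all hypothetical.

```python
from dataclasses import dataclass


class _Bottom:
    """Sentinel standing in for the error value `_|_`."""

    def __repr__(self):
        return "_|_"


BOTTOM = _Bottom()


@dataclass(frozen=True)
class AtLeast:
    """Data-only stand-in for the constraint `>=limit`."""
    limit: float


@dataclass(frozen=True)
class Between:
    """Data-only stand-in for the constraint `>=lo & <=hi`."""
    lo: float
    hi: float


@dataclass(frozen=True)
class AtMost:
    """The constraint function for `<=limit`."""
    limit: float

    def __call__(self, a):
        if isinstance(a, bool):
            return BOTTOM  # Python's bool subclasses int; exclude it explicitly
        if isinstance(a, (int, float)):
            return a if a <= self.limit else BOTTOM
        if isinstance(a, AtLeast):
            # Applying to another constraint yields a more restrictive one.
            if a.limit <= self.limit:
                return Between(a.limit, self.limit)
            return BOTTOM
        if isinstance(a, AtMost):
            return AtMost(min(a.limit, self.limit))
        return BOTTOM


def string_constraint(a):
    """The constraint function for `string`, restricted to concrete inputs."""
    return a if isinstance(a, str) else BOTTOM


f = AtMost(8)
```

With these definitions, `f(5)` returns `5`, `f(AtLeast(5))` returns `Between(5, 8)`, and `f("foo")` returns the error sentinel, mirroring the `<=8` example in the text.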
   310  
   311  Constraint functions play a special role in unification.
   312  The unification function `&(a, b)` is defined as
   313  
   314  - `a & b` if `a` and `b` are two atoms
   315  - `a & b` if `a` and `b` are two nodes representing structs
   316  - `a(b)` or `b(a)` if either `a` or `b` is a constraint function, respectively.
   317  
   318  Implementations are free to pick which constraint function is applied if
   319  both `a` and `b` are constraint functions, as the properties of unification
   320  will ensure this produces identical results.
   321  
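The case analysis for `&(a, b)` can be sketched as a small dispatch function. The following makes several simplifying assumptions that are not part of the definition: structs are modeled as plain Python dicts (trees without shared nodes), constraint functions are Python callables, and a conflicting field simply holds the error value rather than failing the whole struct. The names `unify`, `string_c`, and `BOTTOM` are hypothetical.

```python
BOTTOM = object()  # sentinel standing in for the error value `_|_`


def string_c(a):
    """Hypothetical constraint function for `string`."""
    if a is string_c:
        return string_c  # unifying the constraint with itself changes nothing
    return a if isinstance(a, str) else BOTTOM


def unify(a, b):
    """A sketch of `&(a, b)` following the three cases in the text."""
    if a is BOTTOM or b is BOTTOM:
        return BOTTOM
    # Constraint functions: apply one to the other. When both are constraint
    # functions, this sketch arbitrarily applies `a`.
    if callable(a):
        return a(b)
    if callable(b):
        return b(a)
    # Two struct nodes: unify field by field.
    if isinstance(a, dict) and isinstance(b, dict):
        out = dict(a)
        for label, value in b.items():
            out[label] = unify(out[label], value) if label in out else value
        return out
    # Two atoms: equal atoms unify with themselves, anything else is an error.
    return a if type(a) is type(b) and a == b else BOTTOM
```

For example, `unify({"a": string_c}, {"a": "x", "b": 2})` produces `{"a": "x", "b": 2}`: the shared label `a` is resolved by applying the constraint function to the atom, and `b` is carried over unchanged.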
   322  
   323  ### References
   324  
   325  A distinguishing feature of CUE's unification algorithm is the use of references.
   326  In conventional graph unification for typed feature structures, the structures
   327  that are unified into the existing graph are independent and pre-evaluated.
   328  In CUE, the typed feature structures indicated by references may still need to
   329  be evaluated.
   330  Some conventional evaluation strategies may not cope well with references that
   331  refer to each other.
   332  The simple solution is to deploy a breadth-first evaluation strategy, rather than
   333  the more traditional depth-first approach.
   334  Other approaches are possible, however, and implementations are free to choose
   335  which approach is deployed.
   336