# Implementing CUE


> NOTE: this is a working document attempting to describe CUE in a way
> relatable to existing graph unification systems. It is mostly
> redundant to [the spec](./spec.md). Unless one is interested in
> understanding how to implement CUE or how it relates to the existing
> body of research, read the spec instead.


CUE is modeled after typed feature structure and graph unification systems
such as LKB.
There is a wealth of research related to such systems and graph unification in
general.
This document describes the core semantics of CUE in a notation
that allows relating it to this existing body of research.


## Background

CUE was inspired by a formalism known as
typed attribute structures [Carpenter 1992] or
typed feature structures [Copestake 2002],
which are used in linguistics to encode grammars and
lexicons. Because it can encode large amounts of data effectively and
rigorously, this formalism seemed like a great fit for large-scale configuration.

Although CUE configurations are specified as trees, not graphs, implementations
can benefit from considering them as graphs when dealing with cycles,
and from effectively turning them into graphs when applying techniques like
structure sharing.
Dealing with cycles is well understood for typed attribute structures,
and as CUE configurations are formally closely related to them,
we can benefit from this knowledge without reinventing the wheel.

## Formal Definition


<!--
The previous section is equivalent to the below text with the main difference
that it is only defined for trees. Technically, structs are more akin to DAGs,
but that is hard to explain at this point and also unnecessarily pedantic.
We keep the definition closer to trees and will layer treatment
of cycles on top of these definitions to achieve the same result (possibly
without the benefits of structure sharing of a DAG).

A _field_ is a field name, or _label_, and a prototype.
A _struct_ is a set of _fields_ with unique labels for each field.
-->

A CUE configuration can be defined in terms of constraints, which are
analogous to the typed attribute structures referred to above.

### Definition of basic values

> A _basic value_ is any CUE value that is not a struct (or, by
> extension, a list).
> All basic values are partially ordered in a lattice, such that for any
> basic values `a` and `b` there is a unique greatest lower bound
> defined for the subsumption relation `a ⊑ b`.

```
Basic values
null
true
bool
3.14
string
"Hello"
>=0
<8
re("Hello .*!")
```

The basic values correspond to their respective types defined earlier.

Structs (and, by extension, lists) are represented by the abstract notion of
a typed feature structure.
Each node in a configuration, including the root node,
is associated with a constraint.


### Definition of typed feature structures and substructures

<!-- jba: This isn't adding understanding. I'd rather you omitted it and
added a bit of rigor to the above spec. Or at a minimum, translate the
formalism into the terms you use above.
-->

> A _typed feature structure_ defined for a finite set of labels `Label`
> is a directed acyclic graph with labeled
> arcs and values, represented by a tuple `C = <Q, q0, υ, δ>`, where
>
> 1. `Q` is the finite set of nodes,
> 1. `q0 ∈ Q` is the root node,
> 1. `υ: Q → T` is the total node typing function,
>    for a finite set of possible terms `T`,
> 1. `δ: Q × Label → Q` is the partial feature function,
>
> subject to the following conditions:
>
> 1. there is no node `q` or label `l` such that `δ(q, l) = q0` (root)
> 2. for every node `q` in `Q` there is a path `π` (i.e. a sequence of
>    members of `Label`) such that `δ(q0, π) = q` (unique root, correctness)
> 3. there is no node `q` or non-empty path `π` such that `δ(q, π) = q` (no cycles)
>
> where `δ` is extended to be defined on paths as follows:
>
> 1. `δ(q, ϵ) = q`, where `ϵ` is the empty path
> 2. `δ(q, l∙π) = δ(δ(q, l), π)`
>
> The _substructures_ of a typed feature structure are the
> typed feature structures rooted at each node in the structure.
>
> The set of all possible typed feature structures for a given label
> set is denoted as `𝒞`<sub>`Label`</sub>.
>
> The set of _terms_ for label set `Label` is recursively defined as
>
> 1. every basic value is a term: `P ⊆ T`, where `P` denotes the set of basic values.
> 1. every constraint in `𝒞`<sub>`Label`</sub> is a term: `𝒞`<sub>`Label`</sub>` ⊆ T`.
>    A _reference_ may refer to any substructure of `C`.
> 1. for every `n` values `t₁, ..., tₙ`, and every `n`-ary function symbol
>    `f ∈ F_n`, the value `f(t₁,...,tₙ) ∈ T`.
>


This definition has been adapted from [Carpenter 1992]
and [Copestake 2002].

Without loss of generality, we will henceforth assume that the given set
of labels is constant and denote `𝒞`<sub>`Label`</sub> as `𝒞`.

In CUE configurations, the abstract constraints implied by `υ`
are CUE expressions.
Literal structs can be treated as part of the original typed feature structure
and do not need evaluation.
Any other expression is evaluated and unified with the existing values of that node.

References in expressions refer to other nodes within `C` and represent
a copy of the substructure `C'` of `C` rooted at these nodes.
Any references occurring in terms assigned to nodes of `C'` are updated to
point to the equivalent node in a copy of `C'`.
<!-- TODO: define formally. Right now this is implied already by the
definition of evaluation functions and unification: unifying
the original TFS' structure of the constraint with the current node
preserves the structure of the original graph by definition.
This is getting very implicit, though.
-->
The functions defined by `F` correspond to the binary and unary operators
and interpolation construct of CUE, as well as builtin functions.

CUE allows duplicate labels within a struct, while the definition of
typed feature structures does not.
A duplicate label `l` with respective values `a` and `b` is represented in
a constraint as a single label with term `&(a, b)`,
the unification of `a` and `b`.
Multiple labels may be recursively combined in any order.

<!-- unnecessary, probably.
#### Definition of evaluated value

> A fully evaluated value, `T_evaluated ⊆ T` is a subset of `T` consisting
> only of atoms, typed attribute structures and constraint functions.
>
> A value is called _ground_ if it is an atom or typed attribute structure.

#### Unification of evaluated values

> A fully evaluated value, `T_evaluated ⊆ T` is a subset of `T` consisting
> only of atoms, typed attribute structures and constraint functions.
>
> A value is called _ground_ if it is an atom or typed attribute structure.
-->

### Definition of subsumption and unification on typed attribute structures

> For a given collection of constraints `𝒞`,
> we define `π ≡`<sub>`C`</sub> `π'` to mean that typed feature structure `C ∈ 𝒞`
> contains path equivalence between the paths `π` and `π'`
> (i.e. `δ(q0, π) = δ(q0, π')`, where `q0` is the root node of `C`);
> and `𝒫`<sub>`C`</sub>`(π) = c` to mean that
> the typed feature structure at the path `π` in `C`
> is `c` (i.e. `𝒫`<sub>`C`</sub>`(π) = c` if and only if `υ(δ(q0, π)) = c`,
> where `q0` is the root node of `C`).
> Subsumption is then defined as follows:
> `C ∈ 𝒞` subsumes `C' ∈ 𝒞`, written `C' ⊑ C`, if and only if:
>
> - `π ≡`<sub>`C`</sub> `π'` implies `π ≡`<sub>`C'`</sub> `π'`
> - `𝒫`<sub>`C`</sub>`(π) = c` implies `𝒫`<sub>`C'`</sub>`(π) = c'` and `c' ⊑ c`
>
> The unification of `C` and `C'`, denoted `C ⊓ C'`,
> is the greatest lower bound of `C` and `C'` in `𝒞` ordered by subsumption.

<!-- jba: So what does this get you that you don't already have from the
various "instance-of" definitions in the main spec? I thought those were
sufficiently precise. Although I admit that references and cycles
are still unclear to me. -->

As with the subsumption relation for basic values,
the subsumption relation for constraints determines the mutual placement
of constraints within the partial order of all values.


### Evaluation function

> The evaluation function is given by `E: T → 𝒞`.
> The unification of two typed feature structures is evaluated as defined above.
> All other functions are evaluated according to the definitions found earlier
> in this spec.
> An error is indicated by `_|_`.

#### Definition of well-formedness

> We say that a given typed feature structure `C = <Q, q0, υ, δ> ∈ 𝒞` is
> a _well-formed_ typed feature structure if and only if for all nodes `q ∈ Q`,
> the substructure `C'` rooted at `q`
> is such that `E(υ(q)) ∈ 𝒞` and `C' = <Q', q, υ', δ'> ⊑ E(υ(q))`.

<!-- Also, like Copestake, define appropriate features?
Appropriate features are useful for detecting unused variables.

Appropriate features could be introduced by distinguishing between:

    a: MyStruct // appropriate features are MyStruct
    a: {a : 1}

and

    a: MyStruct & { a: 1 } // appropriate features are those of MyStruct + 'a'

This is way too subtle, though.

Alternatively: use Haskell's approach:

    #a: MyStruct // define a to be MyStruct any other features are allowed but
                 // discarded from the model. Unused features are an error.

Let's first try to see if we can get away with static usage analysis.
A variant would be to define appropriate features unconditionally, but enforce
them only for unused variables, with some looser definition of unused.
-->

The _evaluation_ of a CUE configuration represented by `C`
is defined as the process of making `C` well-formed.

<!--
More abstractly, we can define this structure as the tuple
`<≡, 𝒫>`, where

- `≡ ⊆ Path × Path` where `π ≡ π'` if and only if `Δ(π, q0) = Δ(π', q0)` (path equivalence)
- `P: Path → ℙ` is `υ(Δ(π, q))` (path value).

A struct `a = <≡, 𝒫>` subsumes a struct `b = <≡', 𝒫'>`, or `a ⊑ b`,
if and only if

- `π ≡ π'` implies `π ≡' π'`, and
- `𝒫(π) = v` implies `𝒫'(π) = v'` and `v' ⊑ v`
-->

### References

Theory:

- [1992] Bob Carpenter, "The logic of typed feature structures.";
  Cambridge University Press, ISBN 0-521-41932-8
- [2002] Ann Copestake, "Implementing Typed Feature Structure Grammars.";
  CSLI Publications, ISBN 1-57586-261-1

Some graph unification algorithms:

- [1985] Fernando C. N. Pereira, "A structure-sharing representation for
  unification-based grammar formalisms."; In Proc. of the 23rd Annual Meeting of
  the Association for Computational Linguistics. Chicago, IL
- [1991] H. Tomabechi, "Quasi-destructive graph unifications."; In Proceedings
  of the 29th Annual Meeting of the ACL. Berkeley, CA
- [1992] Hideto Tomabechi, "Quasi-destructive graph unifications with
  structure-sharing."; In Proceedings of the 15th International Conference on
  Computational Linguistics (COLING-92), Nantes, France.
- [2001] Marcel van Lohuizen, "Memory-efficient and thread-safe
  quasi-destructive graph unification."; In Proceedings of the 38th Meeting of
  the Association for Computational Linguistics. Hong Kong, China.


## Implementation

The _evaluation_ of a CUE configuration `C` is defined as the process of
making `C` well-formed.


This section does not define any operational semantics.
As the unification operation is commutative, associative, and idempotent,
implementations have a considerable amount of leeway in
choosing an evaluation strategy.
Although most algorithms proposed for the unification of typed attribute
structures are near `O(n)`, choosing one evaluation strategy over another
can still yield considerable performance benefits.
Implementations will need to be verified against the above formal definition.


### Constraint functions

A _constraint function_ is a unary function `f` which for any input `a` only
returns values that are an instance of `a`. For instance, the constraint
function `f` for `string` returns `"foo"` for `f("foo")` and `_|_` for `f(1)`.
Constraint functions may take other constraint functions as arguments to
produce a more restrictive constraint function.
For instance, the constraint function `f` for `<=8` returns `5` for `f(5)`,
`>=5 & <=8` for `f(>=5)`, and `_|_` for `f("foo")`.


Constraint functions play a special role in unification.
The unification function `&(a, b)` is defined as

- `a & b` if `a` and `b` are two atoms
- `a & b` if `a` and `b` are two nodes, representing structs
- `a(b)` or `b(a)` if either `a` or `b` is a constraint function, respectively.

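
The dispatch rules above can be sketched in Go. This is only a minimal illustration under simplifying assumptions, not CUE's actual implementation: `Value`, `Kind`, `apply`, `Unify`, and the constructors `Int`, `Str`, `AtLeast`, and `AtMost` are hypothetical names. Only integer atoms, string atoms, and inclusive integer bounds are modeled. Struct nodes and type constraints such as `string` are left out.

```go
package main

import (
	"fmt"
	"math"
)

// Kind tags the hypothetical value variants used in this sketch.
type Kind int

const (
	Bottom   Kind = iota // _|_, the error value
	IntAtom              // a concrete integer such as 5
	StrAtom              // a concrete string such as "foo"
	IntRange             // a constraint function over integers, e.g. >=5 & <=8
)

// Value is a deliberately tiny stand-in for a CUE value.
// Lo and Hi are inclusive bounds and only meaningful for IntRange.
type Value struct {
	Kind   Kind
	Int    int
	Str    string
	Lo, Hi int
}

var bottom = Value{Kind: Bottom}

func Int(n int) Value     { return Value{Kind: IntAtom, Int: n} }
func Str(s string) Value  { return Value{Kind: StrAtom, Str: s} }
func AtLeast(n int) Value { return Value{Kind: IntRange, Lo: n, Hi: math.MaxInt} }
func AtMost(n int) Value  { return Value{Kind: IntRange, Lo: math.MinInt, Hi: n} }

// apply runs the constraint function f on v. It returns v itself,
// a narrower constraint function, or bottom.
func apply(f, v Value) Value {
	switch v.Kind {
	case IntAtom:
		if f.Lo <= v.Int && v.Int <= f.Hi {
			return v
		}
	case IntRange:
		lo, hi := f.Lo, f.Hi
		if v.Lo > lo {
			lo = v.Lo
		}
		if v.Hi < hi {
			hi = v.Hi
		}
		if lo <= hi {
			return Value{Kind: IntRange, Lo: lo, Hi: hi}
		}
	}
	return bottom
}

// Unify applies a constraint function to the other operand when one is
// present; otherwise two atoms unify only if they are equal.
func Unify(a, b Value) Value {
	switch {
	case a.Kind == IntRange:
		return apply(a, b) // a(b)
	case b.Kind == IntRange:
		return apply(b, a) // b(a)
	case a == b:
		return a
	}
	return bottom
}

// String renders a value in a CUE-like notation for printing.
func (v Value) String() string {
	switch v.Kind {
	case IntAtom:
		return fmt.Sprint(v.Int)
	case StrAtom:
		return fmt.Sprintf("%q", v.Str)
	case IntRange:
		return fmt.Sprintf("[%d, %d]", v.Lo, v.Hi)
	}
	return "_|_"
}

func main() {
	atMost8 := AtMost(8)
	fmt.Println(Unify(atMost8, Int(5)))     // 5
	fmt.Println(Unify(atMost8, AtLeast(5))) // [5, 8]
	fmt.Println(Unify(atMost8, Str("foo"))) // _|_
}
```

In this sketch, swapping the operands of `Unify` yields the same result, which mirrors the freedom implementations have in choosing which constraint function to apply.
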

Implementations are free to pick which constraint function is applied if
both `a` and `b` are constraint functions, as the properties of unification
ensure that either choice produces identical results.


### References

A distinguishing feature of CUE's unification algorithm is the use of references.
In conventional graph unification for typed feature structures, the structures
that are unified into the existing graph are independent and pre-evaluated.
In CUE, the typed feature structures indicated by references may still need to
be evaluated.
Some conventional evaluation strategies may not cope well with references that
refer to each other.
A simple solution is to use a breadth-first evaluation strategy, rather than
the more traditional depth-first approach.
Other approaches are possible, however, and implementations are free to choose
which approach is deployed.
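
As a rough illustration of why a queue-based strategy copes with mutually referring fields, the following Go sketch resolves a flat set of fields with a FIFO work queue instead of recursing into reference targets. It is one possible reading of "breadth-first" under strong simplifying assumptions, not CUE's algorithm: `Expr` and `Evaluate` are hypothetical names, fields hold only integer literals or references to sibling fields, and fields that never resolve are simply reported as pending. How an implementation should treat such cycles is outside the scope of this sketch.

```go
package main

import (
	"fmt"
	"sort"
)

// Expr is a hypothetical field expression: a literal integer, or a
// reference to another top-level field when Ref is non-empty.
type Expr struct {
	Lit int
	Ref string
}

// Evaluate resolves fields using a FIFO queue. A field whose reference
// target is not resolved yet is put back at the end of the queue instead of
// being evaluated recursively, so mutual references cannot cause unbounded
// recursion. It returns the resolved values and the names of the fields
// that could not be resolved.
func Evaluate(fields map[string]Expr) (map[string]int, []string) {
	resolved := make(map[string]int, len(fields))
	queue := make([]string, 0, len(fields))
	for name := range fields {
		queue = append(queue, name)
	}
	sort.Strings(queue) // deterministic processing order

	stalled := 0 // consecutive fields re-queued without any progress
	for len(queue) > 0 && stalled < len(queue) {
		name := queue[0]
		queue = queue[1:]
		e := fields[name]

		if e.Ref == "" { // literal: resolved immediately
			resolved[name] = e.Lit
			stalled = 0
			continue
		}
		if v, ok := resolved[e.Ref]; ok { // target already resolved: copy it
			resolved[name] = v
			stalled = 0
			continue
		}
		queue = append(queue, name) // target not ready yet: revisit later
		stalled++
	}
	return resolved, queue
}

func main() {
	fields := map[string]Expr{
		"a": {Ref: "b"},
		"b": {Ref: "c"},
		"c": {Lit: 3},
		"x": {Ref: "y"}, // x and y refer to each other
		"y": {Ref: "x"},
	}
	resolved, pending := Evaluate(fields)
	fmt.Println(resolved) // map[a:3 b:3 c:3]
	fmt.Println(pending)  // [x y]
}
```

The loop stops once every remaining field has been revisited without progress, which is how this sketch detects references that only refer to each other.
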