How do maybe/nullable/optional work?
====================================

(No, this document is not about things that we should "maybe" hack on.
It's about the feature we use to describe `nullable` and `optional` fields
in generated golang code.)

background
----------

You'll need to understand what the `nullable` and `optional` modifiers in IPLD schemas mean.
The https://specs.datamodel.io/ site has more content about that.

### how this works outside of schemas

The concepts of null and absent both appear in the core `Node` and `NodeAssembler` interfaces.
`Node` specifies `IsNull() bool` and `IsAbsent() bool` predicates;
and `NodeAssembler` specifies an `AssignNull` function.

There are also singleton values available called `datamodel.Null` and `datamodel.Absent`
which report true for `IsNull` and `IsAbsent`, respectively.
These singletons can be used by any function that needs to return a null or absence indicator.

There's really no reason for a package full of `Node` implementations to make its own types for these values,
since the singletons are always fine to use.
However, there's also nothing stopping a `Node` implementation from doing interesting
custom internal memory layouts to describe whether they contain nulls, etc --
and there's nothing particularly blessed about the `datamodel.Null` singleton.
Any value reporting `IsNull` to be `true` must be treated indistinguishably from `datamodel.Null`.

This indistinguishability is bidirectional.
For example, if you have some `myFancyNodeType`, and it answers `IsNull` as `true`,
and you insert this into a `basicnode.Map`, then ask for that value back from the map later...
you're very likely to get `datamodel.Null`, and not your concrete value of `myFancyNodeType` back again.
(This contract is important because some node implementations may compress
the concept of null into a bitmask, or otherwise similarly optimize things internally.)

#### null

The concept of "null" has a Kind in the IPLD Data Model.
It's implemented by the `datamodel.nullNode` type (which has no fields -- it's a "unit" type),
and is exposed as the `datamodel.Null` singleton value.

(More generally, a `datamodel.Node` can be null by having its `Kind()` method return `datamodel.Kind_Null`,
and having the `IsNull()` method return `true`.
However, most code prefers to return the `datamodel.Null` singleton value whenever it can.)

Null values can be easily produced: the `AssignNull()` method on `datamodel.NodeAssembler` produces nulls;
and many codecs have some concept of null, meaning deserialization can produce them.

Null values work essentially the same way in both the plain Data Model and when working with Schemas.

#### absent

There's also a concept of "absent".
"Absent" is separate and distinct from the concept of "null" -- null is still a _value_; absent just means _nothing there_.

(Those familiar with javascript might note that javascript also has concepts of "null" versus "undefined".
It's the same idea -- we just call it "absent" instead of "undefined".)

Absent is implemented by the `datamodel.absentNode` type (which has no fields -- it's a "unit" type),
and is exposed as the `datamodel.Absent` singleton value.

(More generally, a `datamodel.Node` can describe itself as containing "absent" by having the `IsAbsent()` method return `true`.
(The `Kind()` method still returns `datamodel.Kind_Null`, for lack of a better option.)
However, most code prefers to return the `datamodel.Absent` singleton value whenever it can.)

Absent values aren't really used at the Data Model level.
If you ask for a map key that isn't present in the map, the lookup method will return `nil` and `ErrNotExists`.

Absent values *do* show up at the Schema level, however.
Specifically, in structs: a struct can have a field which is `optional`,
and one of the values such an optional field may report itself as having is `datamodel.Absent`.
This represents when a value *wasn't present* in the serialized form of the struct,
even though the schema lets us know that it could be, and that it's part of the struct's type.
(Accordingly, no `ErrNotExists` is returned for a lookup of that field --
the field is always considered to _exist_... the value is just _absent_.)
Iterators will also return the field name key, together with `datamodel.Absent` as the value.

However, absent values can't really be *created*.
There's no such thing as an `AssignAbsent` method on the `datamodel.NodeAssembler` interface.
Codecs similarly can't produce absent as a value (obviously -- codecs work over `datamodel.NodeAssembler`, so how could they?).
Absent values are just produced by implication, when a field is defined, but its value isn't set.

Despite absent values not being used or produced at the Data Model level, we still have methods like `IsAbsent` specified
as part of the `datamodel.Node` interface so that it's possible to write code which is generic over
either plain Data Model or Schema data while using just that interface.

### the above is all regarding generic interfaces

As long as we're talking about the `datamodel.Node` _interface_,
we talk about the `datamodel.Null` and `datamodel.Absent` singletons, and their contracts in terms of the interface.

(Part of the reason this works is because an interface, in golang,
comes in two parts: a pointer to the typeinfo of the inhabitant value,
and a pointer to the value itself.
This means anywhere we have a `datamodel.Node` return type, we can toss `datamodel.Null`
or `datamodel.Absent` into it with no additional overhead!)

When we talk about concrete types, rather than the `datamodel.Node` _interface_ --
as we're going to, in codegen -- it's a different scenario.
We can't just return `datamodel.Null` pointers for a `genresult.Foo` value;
if `genresult.Foo` is a concrete type, that's just flat out a compile error.

So what shall we do?

We introduce the "maybe" types.



the maybe types
---------------

The general rule of "return `datamodel.Null` whenever you have a null value"
holds up only as long as our API is returning monomorphized `datamodel.Node` interfaces --
in that situation, `datamodel.Null` fits within `datamodel.Node`, and there's no trouble.

This doesn't hold up when we get to codegen.
Or rather, more specifically, it even holds up for codegen...
as long as we're still returning monomorphized `datamodel.Node` interfaces (and a decent amount of the API surface still does so).
The moment we want to return a concrete native type, it breaks.

We call methods created by codegen that use specific types
(e.g., methods that you _couldn't have_ without codegen)
"speciated" methods. And we do want them!

So we have to decide how to handle null and absent for these speciated methods.

### goals of the maybe types

There are a couple of things we want to accomplish with the maybe types:

- Be able to have speciated methods that return a specific type (for doc, editor autocomplete, etc purposes).
- Be able to have speciated methods that return a specific *concrete* type (i.e. not only do we want to be more specific than `datamodel.Node`, we don't want an interface _at all_ -- so that the compiler can do inlining and optimization and so forth).
- Make reading and writing code that uses speciated methods and handles nullable or optional fields reasonably ergonomic (and as always, this may vary by "taste").

And we'll consider one more, fourth, bonus goal:

- It would be nice if the maybe types could clearly express whether the type is `(null|value)` vs `(absent|value)` vs `(absent|null|value)`, because this would let the golang compiler help check more of our logical correctness in code written using optionals and nullables.

### there is only one type generated for each maybe

For every type generated, there is one maybe type also generated.
(At least this much is clearly necessary to satisfy the goals about "specific types".)

This means *we dropped the bonus goal* above.
Making `(null|value)` vs `(absent|value)` vs `(absent|null|value)` distinguishable to the golang compiler
would require three *additional* generated types (for obvious reasons) for each type specified by the Schema.
We decided that's simply too onerous.

(A different codegen project could certainly make a different choice here, though.)

### the symbol for maybe types

For some type named `T` generated into a package named `gen`...

- the main type symbol is `gen.T`;
- the maybe for that type is `gen.MaybeT`.

Beware that this may spell trouble if your schema contains any types
with names starting in "Maybe".
(You can use adjunct config to change symbols for those types, if necessary.)

(There are also internal symbols for the same things,
but these are prefixed in such a way as to make collision not a concern.)

### maybe types don't implement the full Node interface

The "maybe" types don't implement the full `datamodel.Node` interface.
They could have! They don't.

Arguments that went in favor of implementing `Node`:

- generally "seems fine".
- certainly makes sense to be able to call `IsNull` on it like any other Node.
- if in practice the maybe is embedded, we can return an internal pointer to it just fine, so there's no obvious runtime perf reason not to.

Arguments against:

- it's another type with a ton of methods. or two, or four.
  - may increase the binary size. possibly by a significant constant multiplier.
  - definitely increases the gsloc size, significantly.
- would it have a 'Type' accessor on it?
  - if so, what does it say?
- simply not sure how useful this is!
  - it seems to me one will often either be passing the MaybeT to other speciated functions, or fairly immediately de-maybing it.
    - if this is true, the number of times anyone wants to treat it as a Node in practice is near zero.
- does this imply the existence of a _MaybeT__Assembler type, as well?
  - binary and gsloc size still drifting up; this needs to justify itself and provide value to be worth it.
  - what would be the expected behavior of handing a _MaybeT__Assembler to something like unmarshalling?
    - if you have a null in the root, you can describe this with a kinded union, and probably would be better off for it.
    - if you have an absent value in the root of a document you're unmarshalling... what? That's called "EOF".
  - does a _MaybeT__Assembler show up usefully in the middle of a tree?
    - it does not! there's always a _P_ValueAssembler type involved there anyway (this is needed for parent state machine purposes), and it largely delegates to the _T__Assembler, but is already in a perfect position to add on the "maybe" semantics if the P type has them for its children.

The arguments against carried the day.
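
To make the trade-off concrete, here is a minimal sketch of a maybe type that carries a handful of predicates rather than the full `Node` method set. Everything in it is a stand-in written for illustration: the local `Maybe` enum only mimics `schema.Maybe`, `String` stands in for a generated scalar type, and the method names `Exists` and `Must` are assumptions, not a statement of what the generator emits.

```go
package main

import "fmt"

// Maybe is a local stand-in for schema.Maybe so the sketch compiles on its own.
type Maybe uint8

const (
	MaybeAbsent Maybe = iota
	MaybeNull
	MaybeValue
)

// String stands in for a generated scalar type (hypothetical).
type String struct{ x string }

// MaybeString has a few predicates and an unboxing method,
// instead of the dozens of methods a full Node would need.
type MaybeString struct {
	m Maybe
	v String
}

func (m MaybeString) IsNull() bool   { return m.m == MaybeNull }
func (m MaybeString) IsAbsent() bool { return m.m == MaybeAbsent }
func (m MaybeString) Exists() bool   { return m.m == MaybeValue }

// Must "de-maybes" the value, panicking if there is no value to return.
func (m MaybeString) Must() String {
	if !m.Exists() {
		panic("cannot unbox a maybe that holds no value")
	}
	return m.v
}

// describe shows the typical consumer pattern: check the state, then de-maybe.
func describe(m MaybeString) string {
	switch {
	case m.IsAbsent():
		return "absent"
	case m.IsNull():
		return "null"
	default:
		return "value: " + m.Must().x
	}
}

func main() {
	fmt.Println(describe(MaybeString{}))
	fmt.Println(describe(MaybeString{m: MaybeNull}))
	fmt.Println(describe(MaybeString{m: MaybeValue, v: String{"hi"}}))
}
```

The consumer in `describe` never needs the value to behave like a `Node`; it inspects the state and unboxes right away, which is the usage pattern the "arguments against" list anticipates.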

### the maybe type is embeddable

It's important that the "maybe" types be embeddable, for all the same reasons that
[we normally want embeddable types](./HACKME_memorylayout.md#embed-by-default).

It's interesting to consider the alternatives, though:

We could've bitpacked the isAbsent or isNull flags for a field into one word at the top of a struct, for example.
But, there are numerous drawbacks to this:

- the complexity of this is high.
- it would be exposed to anyone who writes additional code in-package, which is asking for errors.
- the only thing this buys us is *slightly* less resident memory size.
  - and long story short: if you look at how many other programming languages do this, pareto-wise, no one in the world at large appears to care.
- it only applies to structs! maps or lists would require yet more custom bitpacking of a different arrangement.

If someone wants to do another codegen project someday, or make PRs to this one, which does choose bitpacking,
it would probably be neat. It's just a lot of effort for a payout that doesn't seem to often be worth it.

(We also ended up using pointers to a field with a `schema.Maybe` type _heavily_
in the internals of our codegen outputs, in order to let child and parent assemblers coordinate.
Rebuilding this to work with a bitpacking alignment and yet still be composable enough to do its job... uufdah. Tricky.
It might be possible to use the current system in the assembler state, but flip it to bitpacking in the resulting immutable nodes,
and thereby get the best of both worlds. If you, the reader, are enthusiastic, feel free to explore it.)

### ...but the user is only exposed to the pointer form

This is the same story as for the main types: it's covered in
[unexported implementations, exported aliases](./HACKME_memorylayout.md#unexported-implementations-exported-aliases).

Generally, this "shielded" type means you can only have a MaybeT with valid contents,
because no one can ever produce the uninitialized "zero" value of the type.
This means there's no "invalid" state which can kick you in the shins at runtime,
and we generally regard that as a good thing.

It also just keeps things syntactically simple.
One always refers to "MaybeT"; never with a star.

### whether or not the maybe's inhabitant type is embedded is based on adjunct config

Although the maybe type itself is embeddable, its _inhabitant_ may be
either embedded in the maybe type or be a pointer, at your option.

This is clearest to explain in code. You can have either:

```go
type MaybeFoo struct {
	m schema.Maybe // enum bit for present|absent|null
	v Foo          // the inhabitant (here, embedded)
}
```

or:

```go
type MaybeFoo struct {
	m schema.Maybe // enum bit for present|absent|null
	v *Foo         // the inhabitant (here, a pointer!)
}
```

(Yes, we're talking about a one-character difference in the code.)

Which of these two forms is generated can be selected by adjunct config.
("Adjunct" config just means: it's not part of the schema; it's part of the
config for this codegen tool.)

There are advantages to each:

- the embedded form is ([as usual](./HACKME_memorylayout.md#embed-by-default)) faster for workloads where the value is usually present (it provokes fewer allocations).
- the pointer form may use less memory when the value is absent; it works for cyclic structures; and if assigning a whole subtree at once, it allows faster assignment.

Also, for cyclic structures, such as `type Foo {String:nullable Foo}`, or `type Bar struct{ recurse optional Bar }`, the pointer form is *required*.
(Otherwise... how big of a slab of memory would we be allocating? Infinite? Nope; compile error.)
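
The cyclic case can be sketched in a few lines of plain Go. This is an illustration only: the local `Maybe` enum is a stand-in for `schema.Maybe`, and `Bar`, `MaybeBar`, and the `depth` helper are hypothetical hand-written types, not generator output.

```go
package main

import "fmt"

// Maybe is a local stand-in for schema.Maybe so the sketch compiles on its own.
type Maybe uint8

const (
	MaybeAbsent Maybe = iota
	MaybeNull
	MaybeValue
)

// Bar mirrors the schema `type Bar struct { recurse optional Bar }`.
type Bar struct {
	recurse MaybeBar
}

// MaybeBar must use the pointer form. With `v Bar` here, Bar would contain
// MaybeBar, which would contain Bar again, and the Go compiler rejects that
// as an invalid recursive type.
type MaybeBar struct {
	m Maybe
	v *Bar
}

// depth counts how many Bar values are chained through present `recurse` fields.
func depth(b *Bar) int {
	n := 1
	for b.recurse.m == MaybeValue {
		b = b.recurse.v
		n++
	}
	return n
}

func main() {
	leaf := Bar{} // recurse is absent here (the zero value of the stand-in enum)
	mid := Bar{recurse: MaybeBar{m: MaybeValue, v: &leaf}}
	root := Bar{recurse: MaybeBar{m: MaybeValue, v: &mid}}
	fmt.Println(depth(&root))
}
```

Changing `v *Bar` to `v Bar` in this sketch makes the program fail to compile, which is the "compile error" the text refers to.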

By default, we generate the pointer form.
However, your application may experience significant performance improvements by selectively using the embed form.
Check it out and tune for what's right for your application.

(FUTURE: we should make more clever defaults: it's reasonable to default to embed form for any type that is of scalar kind.)



implementation detail notes
---------------------------

### how state machines and maybes work

Assemblers for recursive structures have state machines that are used to ensure
orderly transitions between each key and value assembly,
and that a complete entry has been assembled before the next entry or the finish.
(For example, you can't go key-then-key in a map,
nor start a value and then start another value before finishing the first one in a list,
nor finish a map when you've just inserted a key and no value, and so forth.)

One part of this is straightforward: we simply implement state machines,
using bog-standard patterns around a typed uint and logical transition guards
in all the relevant functions. Done and done. Except...

How do child assemblers signal to their parent that they've become finished?
Theoretically, easy; in practice, making it work efficiently
poses a bit of an implementation challenge.

One obvious solution is to put a callback field in assemblers, and have
the parent assembler supply the child assembler with a callback that can
update the parent's state machine when the child becomes finished.
This is logically correct, but practically, problematic and Not Fast:
it requires generating a closure of some kind which composes the function
pointer with the pointer to that parent assembler: and since this is two words
of memory, it implies an allocation and (unfortunately) a heap escape.
An allocation per child key and value in a recursive structure is unacceptable;
we want to set a _much_ higher bar for performance here.

So, we move on to less obvious solutions: we're all in the same package here,
so we can twiddle the bits of our neighboring structures quite directly, yes?
What if we just have assemblers contain pointers to a state machine uint,
and they do a fixed-value compare-and-swap when they're done?
This is terrifyingly direct and has no abstractions, yes indeed: but
we do generally assume control of all the code in this package for any of our
correctness constraints, so this is in-bounds (if admittedly uncomfortable).

Now let's combine that with one more concern: nullables. When an assembler
is not at the root of a document, it may need to accept null values.
We could do this by generating distinct assembler types for use in positions
where nulls are allowed; but though such an approach would work, it is bulky.
We'd much rather be able to reuse assembler types in either scenario.

So, let's have assemblers contain two pointers:
the already-familiar 'w' pointer, and also an 'm' pointer.
The 'm' pointer effectively communicates up whether the child has become finished
when it becomes either 'Maybe_Null' or 'Maybe_Value'.

We add a few new states to the 'm' value, and use it to hint in both directions:
assemblers will assume nulls are not an acceptable transition *unless* the 'm'
value comes initialized with a hint that we are in a situation where they work.

The costs here are "some": it's another pointer indirection and memory set.
However, compared to the alternatives, it's pretty good: versus an allocation
(in the callback approach), this is a huge win; and we're even pretty safe to
bet that that pointer indirection is going to land in a cache line already hot.
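
A stripped-down sketch of the two-pointer idea follows. It is an illustration under stated assumptions, not the generated code: the local `Maybe` enum stands in for `schema.Maybe`, the `allowNull` state is a hypothetical hint value, and `parent`, `stringAssembler`, `assembleField`, and `finished` are invented names. The check-then-set on `*a.m` is a plain (non-atomic) comparison and store.

```go
package main

import (
	"errors"
	"fmt"
)

// Maybe is a local stand-in for schema.Maybe so the sketch compiles on its own.
type Maybe uint8

const (
	MaybeAbsent Maybe = iota
	MaybeNull
	MaybeValue
	// allowNull is an assembly-only state (hypothetical value): the parent
	// sets it to hint that a null is acceptable in this position.
	allowNull
)

// stringAssembler is a child assembler. It holds only two pointers:
// w points at the parent's storage for the value, m at the parent's maybe slot.
type stringAssembler struct {
	w *string
	m *Maybe
}

// AssignNull is only a legal transition if the parent left the allowNull hint.
func (a stringAssembler) AssignNull() error {
	if *a.m != allowNull {
		return errors.New("null not allowed here")
	}
	*a.m = MaybeNull // this write is what signals "finished" to the parent
	return nil
}

// AssignString writes through w and then signals completion through m.
func (a stringAssembler) AssignString(s string) error {
	if *a.m == MaybeNull || *a.m == MaybeValue {
		return errors.New("already finished")
	}
	*a.w = s
	*a.m = MaybeValue
	return nil
}

// parent owns the storage for one field; no callback or closure is needed.
type parent struct {
	v string
	m Maybe
}

// assembleField hands out a child assembler wired to the parent's own fields.
func (p *parent) assembleField(nullable bool) stringAssembler {
	p.m = MaybeAbsent
	if nullable {
		p.m = allowNull
	}
	return stringAssembler{w: &p.v, m: &p.m}
}

// finished is how the parent learns the child is done: it just reads m.
func (p *parent) finished() bool {
	return p.m == MaybeNull || p.m == MaybeValue
}

func main() {
	var p parent
	a := p.assembleField(false)
	fmt.Println(a.AssignNull())                    // rejected: no allowNull hint
	fmt.Println(a.AssignString("x"), p.finished()) // accepted; parent sees completion

	var q parent
	b := q.assembleField(true)
	fmt.Println(b.AssignNull(), q.finished()) // accepted because of the hint
}
```

Because the child only carries pointers into memory the parent already owns, handing out the assembler allocates nothing, which is the point of preferring this over the callback approach.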

You can find the additional magic consts crammed into `schema.Maybe` fields
for this statekeeping during assembly defined in the "minima" file in codegen output.
They are named `midvalue` and `allowNull`.



this could have been different
------------------------------

There are many ways this design could've been different:

### we could have every maybe type implement Node

As already discussed above, it would cause a lot of extra boilerplate methods,
increasing both the generated code source size and binary size;
but on the plus side, it would've been in some ways arguably more consistent.

We didn't.

### we could've generated three maybes per type

Already discussed above.

We didn't.

### we could've designed schemas differently

A lot of the twists of the design originate from the fact that `optional`
and `nullable` are both rather special as well as very contextual in IPLD Schemas
(e.g., `optional` is only permitted in a very few special places in a schema).
If we had built a very different type system, maybe things would come out differently.

Some of this is explored in a few gists:

- https://gist.github.com/warpfork/9dd8b68deff2b90f96167c900ea31eec#dubious-soln-drop-nullable-completely-make-inline-anonymous-union-syntax-instead
- https://gist.github.com/warpfork/9dd8b68deff2b90f96167c900ea31eec#soln-change-how-schemas-regard-nullable-and-optional
- https://gist.github.com/warpfork/9dd8b68deff2b90f96167c900ea31eec#soln-support-absent-as-a-discriminator-in-kinded-unions

But suffice to say, that's a very big topic.

Optionals and nullables are the way they are because they seemed like useful
concepts for describing the structure of data which has serial forms;
how they map onto any particular programming language (such as Go) was a secondary concern.
This design for a golang library is trying to do its best within that.

### we could've done X with technique Y

Probably, yes :)

This is just one implementation of codegen for Golang for IPLD Schemas.
Competing implementations that make different choices are absolutely welcome :)