github.com/ipld/go-ipld-prime@v0.21.0/schema/gen/go/HACKME_maybe.md

How do maybe/nullable/optional work?
====================================

(No, this document is not about things that we should "maybe" hack on.
It's about the feature we use to describe `nullable` and `optional` fields
in generated golang code.)

background
----------

You'll need to understand what the `nullable` and `optional` modifiers in IPLD schemas mean.
The https://specs.datamodel.io/ site has more content about that.

### how this works outside of schemas

The core `Node` and `NodeAssembler` interfaces include concepts of both null and absent.
`Node` specifies `IsNull() bool` and `IsAbsent() bool` predicates;
and `NodeAssembler` specifies an `AssignNull` function.

There are also singleton values available called `datamodel.Null` and `datamodel.Absent`
which report true for `IsNull` and `IsAbsent`, respectively.
These singletons can be used by any function that needs to return a null or absence indicator.

There's really no reason for any package full of `Node` implementations to make its own types for these values,
since the singletons are always fine to use.
However, there's also nothing stopping a `Node` implementation from doing interesting
custom internal memory layouts to describe whether they contain nulls, etc --
and there's nothing particularly blessed about the `datamodel.Null` singleton.
Any value reporting `IsNull` to be `true` must be treated indistinguishably from `datamodel.Null`.

This indistinguishability is bidirectional.
For example, if you have some `myFancyNodeType`, and it answers `IsNull` as `true`,
and you insert this into a `basicnode.Map`, then ask for that value back from the map later...
you're very likely to get `datamodel.Null`, and not your concrete value of `myFancyNodeType` back again.
(This contract is important because some node implementations may compress
the concept of null into a bitmask, or otherwise similarly optimize things internally.)

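
To make the indistinguishability contract concrete, here is a minimal, self-contained sketch.
It does not use the real `datamodel` package: the `Node` interface below is a hypothetical, cut-down stand-in with only the two predicates,
and `store` is an invented helper that mimics a container normalizing any null it is handed.

```go
package main

import "fmt"

// Node is a hypothetical, cut-down stand-in for datamodel.Node,
// keeping only the two predicates discussed above.
type Node interface {
	IsNull() bool
	IsAbsent() bool
}

// nullNode and absentNode are field-less "unit" types.
type nullNode struct{}

func (nullNode) IsNull() bool   { return true }
func (nullNode) IsAbsent() bool { return false }

type absentNode struct{}

func (absentNode) IsNull() bool   { return false }
func (absentNode) IsAbsent() bool { return true }

// Null and Absent play the role of the datamodel singletons.
var (
	Null   Node = nullNode{}
	Absent Node = absentNode{}
)

// myFancyNode is a hypothetical implementation that tracks nullness itself.
type myFancyNode struct{ null bool }

func (n myFancyNode) IsNull() bool   { return n.null }
func (n myFancyNode) IsAbsent() bool { return false }

// store mimics a container that is free to replace any null it is given
// with the shared singleton rather than retaining the caller's concrete value.
func store(n Node) Node {
	if n.IsNull() {
		return Null
	}
	return n
}

func main() {
	got := store(myFancyNode{null: true})
	fmt.Println(got.IsNull(), got == Null) // true true
}
```

Code that only ever asks `IsNull()` can't tell the difference, which is exactly the point.
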
#### null

The concept of "null" has a Kind in the IPLD Data Model.
It's implemented by the `datamodel.nullNode` type (which has no fields -- it's a "unit" type),
and is exposed as the `datamodel.Null` singleton value.

(More generally, `datamodel.Node` can be null by having its `Kind()` method return `datamodel.Kind_Null`,
and having the `IsNull()` method return `true`.
However, most code prefers to return the `datamodel.Null` singleton value whenever it can.)

Null values can be easily produced: the `AssignNull()` method on `datamodel.NodeAssembler` produces nulls;
and many codecs have some concept of null, meaning deserialization can produce them.

Null values work essentially the same way in both the plain Data Model and when working with Schemas.

#### absent

There's also a concept of "absent".
"Absent" is separate and distinct from the concept of "null" -- null is still a _value_; absent just means _nothing there_.

(Those familiar with javascript might note that javascript also has concepts of "null" versus "undefined".
It's the same idea -- we just call it "absent" instead of "undefined".)

Absent is implemented by the `datamodel.absentNode` type (which has no fields -- it's a "unit" type),
and is exposed as the `datamodel.Absent` singleton value.

(More generally, a `datamodel.Node` can describe itself as containing "absent" by having the `IsAbsent()` method return `true`.
(The `Kind()` method still returns `datamodel.Kind_Null`, for lack of a better option.)
However, most code prefers to return the `datamodel.Absent` singleton value whenever it can.)

Absent values aren't really used at the Data Model level.
If you ask for a map key that isn't present in the map, the lookup method will return `nil` and `ErrNotExists`.

Absent values *do* show up at the Schema level, however.
Specifically, in structs: a struct can have a field which is `optional`,
and one of the values such an optional field may report itself as having is `datamodel.Absent`.
This represents when a value *wasn't present* in the serialized form of the struct,
even though the schema lets us know that it could be, and that it's part of the struct's type.
(Accordingly, no `ErrNotExists` is returned for a lookup of that field --
the field is always considered to _exist_... the value is just _absent_.)
Iterators will also return the field name key, together with `datamodel.Absent` as the value.

However, absent values can't really be *created*.
There's no such thing as an `AssignAbsent` method on the `datamodel.NodeAssembler` interface.
Codecs similarly can't produce absent as a value (obviously -- codecs work over `datamodel.NodeAssembler`, so how could they?).
Absent values are just produced by implication, when a field is defined, but its value isn't set.

Despite absent values not being used or produced at the Data Model, we still have methods like `IsAbsent` specified
as part of the `datamodel.Node` interface so that it's possible to write code which is generic over
either plain Data Model or Schema data while using just that interface.

### the above is all regarding generic interfaces

As long as we're talking about the `datamodel.Node` _interface_,
we talk about the `datamodel.Null` and `datamodel.Absent` singletons, and their contracts in terms of the interface.

(Part of the reason this works is because an interface, in golang,
comes in two parts: a pointer to the typeinfo of the inhabitant value,
and a pointer to the value itself.
This means anywhere we have a `datamodel.Node` return type, we can toss `datamodel.Null`
or `datamodel.Absent` into it with no additional overhead!)

When we talk about concrete types, rather than the `datamodel.Node` _interface_ --
as we're going to, in codegen -- it's a different scenario.
We can't just return `datamodel.Null` pointers for a `genresult.Foo` value;
if `genresult.Foo` is a concrete type, that's just flat out a compile error.

So what shall we do?

We introduce the "maybe" types.



the maybe types
---------------

The general rule of "return `datamodel.Null` whenever you have a null value"
holds up only as long as our API is returning monomorphized `datamodel.Node` interfaces --
in that situation, `datamodel.Null` fits within `datamodel.Node`, and there's no trouble.

This doesn't hold up when we get to codegen.
Or rather, more specifically, it even holds up for codegen...
as long as we're still returning monomorphized `datamodel.Node` interfaces (and a decent amount of the API surface still does so).
At the moment we want to return a concrete native type, it breaks.

We call methods created by codegen that use specific types
(e.g., methods that you _couldn't have_ without codegen)
"speciated" methods.  And we do want them!

So we have to decide how to handle null and absent for these speciated methods.

### goals of the maybe types

There are a couple of things we want to accomplish with the maybe types:

- Be able to have speciated methods that return a specific type (for doc, editor autocomplete, etc purposes).
- Be able to have speciated methods that return a specific *concrete* type (i.e. not only do we want to be more specific than `datamodel.Node`, we don't want an interface _at all_ -- so that the compiler can do inlining and optimization and so forth).
- Make reading and writing code that uses speciated methods and handles nullable or optional fields be reasonably ergonomic (and as always, this may vary by "taste").

And we'll consider a fourth, bonus goal:

- It would be nice if the maybe types can clearly discuss whether the type is `(null|value)` vs `(absent|value)` vs `(absent|null|value)`, because this would let the golang compiler help check more of our logical correctness in code written using optionals and nullables.

### there is only one type generated for each maybe

For every type generated, there is one maybe type also generated.
(At least this much is clearly necessary to satisfy the goals about "specific types".)

This means *we dropped the bonus goal* above.
Making `(null|value)` vs `(absent|value)` vs `(absent|null|value)` distinguishable to the golang compiler
would require three *additional* generated types (for obvious reasons) for each type specified by the Schema.
We decided that's simply too onerous.

(A different codegen project could certainly make a different choice here, though.)

### the symbol for maybe types

For some type named `T` generated into a package named `gen`...

- the main type symbol is `gen.T`;
- the maybe for that type is `gen.MaybeT`.

Beware that this may spell trouble if your schema contains any types
with names starting in "Maybe".
(You can use adjunct config to change symbols for those types, if necessary.)

(There are also internal symbols for the same things,
but these are prefixed in such a way as to make collision not a concern.)

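
As a rough illustration of what working with a `T` / `MaybeT` pair feels like, here is a self-contained sketch.
It is not generator output: `Maybe`, `String`, and `MaybeString` are simplified stand-ins,
the method names (`IsNull`, `IsAbsent`, `Exists`, `Must`) are assumptions about the kind of surface a maybe type offers,
and `describe` is a hypothetical piece of caller code.

```go
package main

import "fmt"

// Maybe is a stand-in for schema.Maybe: absent, null, or value.
type Maybe uint8

const (
	Maybe_Absent Maybe = iota
	Maybe_Null
	Maybe_Value
)

// String stands in for a generated type `gen.String`.
type String struct{ x string }

// MaybeString stands in for `gen.MaybeString`.
type MaybeString struct {
	m Maybe
	v String
}

func (m MaybeString) IsNull() bool   { return m.m == Maybe_Null }
func (m MaybeString) IsAbsent() bool { return m.m == Maybe_Absent }
func (m MaybeString) Exists() bool   { return m.m == Maybe_Value }

// Must unwraps the inhabitant, panicking if there is none.
func (m MaybeString) Must() String {
	if !m.Exists() {
		panic("MaybeString holds no value")
	}
	return m.v
}

// describe is ordinary caller code: it branches on the maybe,
// then "de-maybes" it to get at the concrete String.
func describe(m MaybeString) string {
	switch {
	case m.IsAbsent():
		return "absent"
	case m.IsNull():
		return "null"
	default:
		return "value: " + m.Must().x
	}
}

func main() {
	fmt.Println(describe(MaybeString{}))                                   // absent
	fmt.Println(describe(MaybeString{m: Maybe_Null}))                      // null
	fmt.Println(describe(MaybeString{m: Maybe_Value, v: String{x: "hi"}})) // value: hi
}
```

Note that a single `MaybeString` type covers all three states; nothing in the type tells the compiler whether null or absent is actually possible for a given field.
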
### maybe types don't implement the full Node interface

The "maybe" types don't implement the full `datamodel.Node` interface.
They could have!  They don't.

Arguments that went in favor of implementing `Node`:

- generally "seem fine"
- certainly makes sense to be able to 'IsNull' on it like any other Node.
- if in practice the maybe is embedded, we can return an internal pointer to it just fine, so there's no obvious runtime perf reason not to.

Arguments against:

- it's another type with a ton of methods.  or two, or four.
	- may increase the binary size.  possibly by a significant constant multiplier.
	- definitely increases the gsloc size, significantly.
- would it have a 'Type' accessor on it?
	- if so, what does it say?
- simply not sure how useful this is!
	- istm one will often either be passing the MaybeT to other speciated functions, or, fairly immediately de-maybing it.
		- if this is true, the number of times anyone wants to treat it as a Node in practice is near zero.
- does this imply the existence of a _MaybeT__Assembler type, as well?
	- binary and gsloc size still drifting up; this needs to justify itself and provide value to be worth it.
	- what would be the expected behavior of handing a _MaybeT__Assembler to something like unmarshalling?
		- if you have a null in the root, you can describe this with a kinded union, and probably would be better off for it.
		- if you have an absent value in the root of a document you're unmarshalling... what?  That's called "EOF".
	- does a _MaybeT__Assembler show up usefully in the middle of a tree?
		- it does not!  there's always a _P_ValueAssembler type involved there anyway (this is needed for parent state machine purposes), and it largely delegates to the _T__Assembler, but is already in a perfect position to add on the "maybe" semantics if the P type has them for its children.

The arguments against carried the day.

### the maybe type is embeddable

It's important that the "maybe" types be embeddable, for all the same reasons that
[we normally want embeddable types](./HACKME_memorylayout.md#embed-by-default).

It's interesting to consider the alternatives, though:

We could've bitpacked the isAbsent or isNull flags for a field into one word at the top of a struct, for example.
But, there are numerous drawbacks to this:

- the complexity of this is high.
- it would be exposed to anyone who writes additional code in-package, which is asking for errors.
- the only thing this buys us is *slightly* less resident memory size.
	- and long story short: if you look at how many other programming languages do this, pareto-wise, no one in the world at large appears to care.
- it only applies to structs!  maps or lists would require yet more custom bitpacking of a different arrangement.

If someone wants to do another codegen project someday, or make PRs to this one, which does choose bitpacking,
it would probably be neat.  It's just a lot of effort for a payout that doesn't seem to often be worth it.

(We also ended up using pointers to a field with a `schema.Maybe` type _heavily_
in the internals of our codegen outputs, in order to let child and parent assemblers coordinate.
Rebuilding this to work with a bitpacking alignment and yet still be composable enough to do its job... uufdah.  Tricky.
It might be possible to use the current system in the assembler state, but flip it to a bitpacked form in the resulting immutable nodes,
and thereby get the best of both worlds.  If you, the reader, are enthusiastic, feel free to explore it.)

### ...but the user is only exposed to the pointer form

This is the same story as for the main types: it's covered in
[unexported implementations, exported aliases](./HACKME_memorylayout.md#unexported-implementations-exported-aliases).

Generally, this "shielded" type means you can only have a MaybeT with valid contents,
because no one can ever produce the uninitialized "zero" value of the type.
This means there's no "invalid" state which can kick you in the shins at runtime,
and we generally regard that as a good thing.

It also just keeps things syntactically simple.
One always refers to "MaybeT"; never with a star.

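
The "unexported implementation, exported alias" shape can be sketched as follows.
The names `_Foo`, `_Foo__Maybe`, `maybeState`, and `newMaybeFoo` are hypothetical and chosen for illustration only.
Because this demo lives in a single `package main`, it can't actually show the shielding
(that only takes effect for code *outside* the generated package, which cannot name the unexported struct);
it just shows what the declarations and the use site look like.

```go
package main

import "fmt"

type maybeState uint8

const (
	stateAbsent maybeState = iota
	stateNull
	stateValue
)

// _Foo and _Foo__Maybe are the unexported implementation types.
type _Foo struct{ x int }

type _Foo__Maybe struct {
	m maybeState
	v _Foo
}

// Foo and MaybeFoo are the exported names; each is an alias for a pointer
// to the unexported struct. Outside the package, nobody can write the
// struct type's name, so nobody can conjure its zero value.
type Foo = *_Foo
type MaybeFoo = *_Foo__Maybe

func (m *_Foo__Maybe) Exists() bool { return m.m == stateValue }

// Must returns the inhabitant as a Foo (that is, a pointer into the maybe).
func (m *_Foo__Maybe) Must() Foo {
	if !m.Exists() {
		panic("MaybeFoo holds no value")
	}
	return &m.v
}

// newMaybeFoo is a hypothetical in-package constructor; in-package code is
// the only place a populated maybe can come from.
func newMaybeFoo(x int) MaybeFoo {
	return &_Foo__Maybe{m: stateValue, v: _Foo{x: x}}
}

func main() {
	var m MaybeFoo = newMaybeFoo(7) // note: no star at the use site
	fmt.Println(m.Exists(), m.Must().x) // true 7
}
```
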
### whether or not the maybe's inhabitant type is embedded is based on adjunct config

Although the maybe type itself is embeddable, its _inhabitant_ may be
either embedded in the maybe type or be a pointer, at your option.

This is clearest to explain in code: you can have either:

```go
type MaybeFoo struct {
	m schema.Maybe // enum bit for present|absent|null
	v Foo          // the inhabitant (here, embedded)
}
```

or:

```go
type MaybeFoo struct {
	m schema.Maybe // enum bit for present|absent|null
	v *Foo         // the inhabitant (here, a pointer!)
}
```

(Yes, we're talking about a one-character difference in the code.)

Which of these two forms is generated can be selected by adjunct config.
("Adjunct" config just means: it's not part of the schema; it's part of the
config for this codegen tool.)

There are advantages to each:

- the embedded form is ([as usual](./HACKME_memorylayout.md#embed-by-default)) faster for workloads where the value is usually present (it provokes fewer allocations).
- the pointer form may use less memory when the value is absent; it works for cyclic structures; and if assigning a whole subtree at once, it allows faster assignment.

Also, for cyclic structures, such as `type Foo {String:nullable Foo}`, or `type Bar struct{ recurse optional Bar }`, the pointer form is *required*.
(Otherwise... how big of a slab of memory would we be allocating?  Infinite?  Nope; compile error.)

By default, we generate the pointer form.
However, your application may experience significant performance improvements by selectively using the embed form.
Check it out and tune for what's right for your application.

(FUTURE: we should make more clever defaults: it's reasonable to default to embed form for any type that is of scalar kind.)



implementation detail notes
---------------------------

### how state machines and maybes work

Assemblers for recursive stuff have state machines that are used to ensure
orderly transitions between each key and value assembly,
and that a complete entry has been assembled before the next entry or the finish.
(For example, you can't go key-then-key in a map,
nor start a value and then start another value before finishing the first one in a list,
nor finish a map when you've just inserted a key and no value, and so forth.)

One part of this is straightforward: we simply implement state machines,
using bog-standard patterns around a typed uint and logical transition guards
in all the relevant functions.  Done and done.  Except...

How do child assemblers signal to their parent that they've become finished?
Theoretically, easy; in practice, to work efficiently...
This poses a bit of an implementation challenge.

One obvious solution is to put a callback field in assemblers, and have
the parent assembler supply the child assembler with a callback that can
update the parent's state machine when the child becomes finished.
This is logically correct, but practically, problematic and Not Fast:
it requires generating a closure of some kind which composes the function
pointer with the pointer to that parent assembler: and since this is two words
of memory, it implies an allocation and (unfortunately) a heap escape.
An allocation per child key and value in a recursive structure is unacceptable;
we want to set a _much_ higher bar for performance here.

So, we move on to less obvious solutions: we're all in the same package here,
so we can twiddle the bits of our neighboring structures quite directly, yes?
What if we just have assemblers contain pointers to a state machine uint,
and they do a fixed-value compare-and-swap when they're done?
This is terrifyingly direct and has no abstractions, yes indeed: but
we do generally assume control of all the code in this package for any of our
correctness constraints, so this is in-bounds (if admittedly uncomfortable).

Now let's combine that with one more concern: nullables.  When an assembler
is not at the root of a document, it may need to accept null values.
We could do this by generating distinct assembler types for use in positions
where nulls are allowed; but though such an approach would work, it is bulky.
We'd much rather be able to reuse assembler types in either scenario.

So, let's have assemblers contain two pointers:
the already-familiar 'w' pointer, and also an 'm' pointer.
The 'm' pointer effectively communicates up whether the child has become finished
when it becomes either 'Maybe_Null' or 'Maybe_Value'.

We add a few new states to the 'm' value, and use it to hint in both directions:
assemblers will assume nulls are not an acceptable transition *unless* the 'm'
value comes initialized with a hint that we are in a situation where they work.

The costs here are "some": it's another pointer indirection and memory set.
However, compared to the alternatives, it's pretty good: versus an allocation
(in the callback approach), this is a huge win; and we're even pretty safe to
bet that that pointer indirection is going to land in a cache line already hot.

You can find the additional magic consts crammed into `schema.Maybe` fields
for this statekeeping during assembly defined in the "minima" file in codegen output.
They are named `midvalue` and `allowNull`.



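
Here is a compact sketch of the two-pointer scheme: a parent list assembler, a reusable child assembler, and an 'm' value doing double duty as completion signal and null hint.
The names `w`, `m`, `midvalue`, and `allowNull` come from the text above; everything else
(`entry`, `listAssembler`, `childAssembler`, the numbering of the extra constants, exactly when each state is set, and the error messages) is an assumption made for illustration
and should not be read as what the generator actually emits.
The sketch also allocates each entry separately for simplicity, so it illustrates the signalling, not the allocation profile.

```go
package main

import (
	"errors"
	"fmt"
)

// Maybe is a local stand-in for schema.Maybe, with two extra assembly-only states.
type Maybe uint8

const (
	Maybe_Absent Maybe = iota
	Maybe_Null
	Maybe_Value
	midvalue  // assembly-only: a value has been started but not finished
	allowNull // assembly-only: hint from the parent that null is acceptable
)

// entry keeps a value and its maybe-state side by side, owned by the parent.
type entry struct {
	m Maybe
	v string
}

// childAssembler writes through two pointers into parent-owned memory.
type childAssembler struct {
	w *string // where the value goes
	m *Maybe  // where completion (and nullness) is reported
}

func (a *childAssembler) AssignString(s string) error {
	*a.w = s
	*a.m = Maybe_Value // the parent reads this to learn that we finished
	return nil
}

func (a *childAssembler) AssignNull() error {
	if *a.m != allowNull {
		return errors.New("null is not allowed in this position")
	}
	*a.m = Maybe_Null
	return nil
}

// listAssembler is the parent; it reuses one child assembler for every value,
// so no closure or callback is needed.
type listAssembler struct {
	entries  []*entry
	nullable bool
	child    childAssembler
}

// lastFinished is the transition guard: it reads the state the child wrote.
func (la *listAssembler) lastFinished() bool {
	if len(la.entries) == 0 {
		return true
	}
	m := la.entries[len(la.entries)-1].m
	return m == Maybe_Null || m == Maybe_Value
}

func (la *listAssembler) AssembleValue() (*childAssembler, error) {
	if !la.lastFinished() {
		return nil, errors.New("previous value is not finished")
	}
	e := &entry{m: midvalue}
	if la.nullable {
		e.m = allowNull // hint downward: null is fine here
	}
	la.entries = append(la.entries, e)
	la.child = childAssembler{w: &e.v, m: &e.m}
	return &la.child, nil
}

func main() {
	la := &listAssembler{nullable: true}
	a, _ := la.AssembleValue()
	_, err := la.AssembleValue()
	fmt.Println(err != nil) // true: the first value was still open
	_ = a.AssignNull()
	a, _ = la.AssembleValue()
	_ = a.AssignString("x")
	fmt.Println(len(la.entries), la.entries[0].m == Maybe_Null, la.entries[1].v) // 2 true x
}
```
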
this could have been different
------------------------------

There are many ways this design could've been different:

### we could have every maybe type implement Node

As already discussed above, it would cause a lot of extra boilerplate methods,
increasing both the generated code source size and binary size;
but on the plus side, it would've been in some ways arguably more consistent.

We didn't.

### we could've generated three maybes per type

Already discussed above.

We didn't.

### we could've designed schemas differently

A lot of the twists of the design originate from the fact that `optional`
and `nullable` are both rather special as well as very contextual in IPLD Schemas
(e.g., `optional` is only permitted in a very few special places in a schema).
If we had built a very different type system, maybe things would come out differently.

Some of this has some exploration in some gists:

- https://gist.github.com/warpfork/9dd8b68deff2b90f96167c900ea31eec#dubious-soln-drop-nullable-completely-make-inline-anonymous-union-syntax-instead
- https://gist.github.com/warpfork/9dd8b68deff2b90f96167c900ea31eec#soln-change-how-schemas-regard-nullable-and-optional
- https://gist.github.com/warpfork/9dd8b68deff2b90f96167c900ea31eec#soln-support-absent-as-a-discriminator-in-kinded-unions

But suffice to say, that's a very big topic.

Optionals and nullables are the way they are because they seemed like useful
concepts for describing the structure of data which has serial forms;
how they map onto any particular programming language (such as Go) was a secondary concern.
This design for a golang library is trying to do its best within that.

### we could've done X with technique Y

Probably, yes :)

This is just one implementation of codegen for Golang for IPLD Schemas.
Competing implementations that make different choices are absolutely welcome :)
