### absent values

The handling of absent values is still not consistent.

Currently:

- reading (via accessors or iterators) yields `datamodel.Absent` values for absent fields.
- putting those `datamodel.Absent` values via `NodeAssembler.AssignNode` will result in `ErrWrongKind`.
- *the recursive copies embedded in AssignNode methods don't handle absents either*.

The first two are defensible and consistent (if not necessarily ergonomic).
The third is an outright bug, and needs to be fixed.

How we fix it is not entirely settled.

- Option 1: keep the hostility to absent assignment
- Option 2: *require* explicit absent assignment
- Option 3: become indifferent to absent assignment when it's valid
- Option 4: don't yield values that are absent during iteration at all

Option 3 seems preferable (and the least user-hostile).
(Options 1 and 2 create work for end users;
Option 4 has questionable logical consistency.)

Updating the codegen to do Option 3 needs some work, though.

It's likely that the way to go about this would involve adding two more valid
bit states to the extended `schema.Maybe` values: one for `allowAbsent` (similar to
the existing `allowNull`), and another for both (for "nullable optional" fields).
Every NodeAssembler would then have to support those states, just as they each support `allowNull` now.
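
A minimal sketch of what those extra states could look like, not the generated code itself. It uses a toy `maybe` type rather than the real `schema.Maybe`. The numeric values, the names `allowAbsent` and `allowNullAndAbsent`, and the `fieldAssembler` type with its `assignNull` and `assignAbsent` methods are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// maybe is a toy stand-in for schema.Maybe; the numeric values are illustrative only.
type maybe uint8

const (
	maybeAbsent        maybe = iota // field has no value at all
	maybeNull                       // field is explicitly null
	maybeValue                      // field holds a value
	allowNull                       // extended state: the assembler may accept null
	allowAbsent                     // hypothetical: the assembler may accept absent
	allowNullAndAbsent              // hypothetical: "nullable optional" field
)

var errWrongKind = errors.New("wrong kind")

// fieldAssembler is a hypothetical child assembler that records its outcome in *m.
type fieldAssembler struct {
	m *maybe
}

// assignNull succeeds only if the field was declared nullable.
func (fa fieldAssembler) assignNull() error {
	switch *fa.m {
	case allowNull, allowNullAndAbsent:
		*fa.m = maybeNull
		return nil
	}
	return errWrongKind
}

// assignAbsent accepts absent when it is valid (Option 3) and rejects it otherwise.
func (fa fieldAssembler) assignAbsent() error {
	switch *fa.m {
	case allowAbsent, allowNullAndAbsent:
		*fa.m = maybeAbsent
		return nil
	}
	return errWrongKind
}

func main() {
	optional := allowAbsent
	nullable := allowNull
	fmt.Println(fieldAssembler{m: &optional}.assignAbsent()) // accepted
	fmt.Println(fieldAssembler{m: &nullable}.assignAbsent()) // rejected
}
```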

I think the above design is valid, but it is neither implemented nor tested yet.


### AssignNode optimality

The AssignNode methods we generate currently do pretty blithe things with large structures:
they iterate over the given node, and hurl entries into the assembler's AssembleKey and AssembleValue methods.

This isn't always optimal.
For any structure that is more efficient when fed info in an ideal order, we might want to take that into account.

Unions with representation mode "inline" are a stellar example of this:
if the discriminant key comes first, they can work *much, much* more efficiently.
By contrast, if the discriminant key shows up late in the object, it is necessary to
have buffered *all the other* data, then backtrack to handle it once the discriminant is found and parsed.
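
To make the cost concrete, here is a toy sketch of an assembler that has to buffer until the discriminant shows up. Everything in it is hypothetical: the `entry` and `inlineUnionAssembler` types, the `assignEntry` and `feed` helpers, and the use of `"type"` as the discriminant key. Real assemblers deal in nodes, not strings.

```go
package main

import "fmt"

// entry is one key/value pair as a streaming codec might deliver it (toy: strings only).
type entry struct{ key, value string }

// inlineUnionAssembler is a hypothetical assembler for a union with "inline" representation.
// It cannot route any field until it has seen the discriminant key.
type inlineUnionAssembler struct {
	discriminantKey string
	member          string  // chosen union member; empty until the discriminant arrives
	buffered        []entry // entries that arrived before the discriminant
	fields          []entry // entries delivered to the chosen member
	peakBuffered    int     // largest number of entries held back at once
}

func (a *inlineUnionAssembler) assignEntry(e entry) {
	switch {
	case e.key == a.discriminantKey:
		a.member = e.value
		// Backtrack: replay everything that had to wait.
		a.fields = append(a.fields, a.buffered...)
		a.buffered = nil
	case a.member == "":
		a.buffered = append(a.buffered, e)
		if len(a.buffered) > a.peakBuffered {
			a.peakBuffered = len(a.buffered)
		}
	default:
		a.fields = append(a.fields, e)
	}
}

// feed runs the entries through a fresh assembler and reports the peak buffer size.
func feed(entries []entry) int {
	a := &inlineUnionAssembler{discriminantKey: "type"}
	for _, e := range entries {
		a.assignEntry(e)
	}
	return a.peakBuffered
}

func main() {
	first := []entry{{"type", "circle"}, {"radius", "2"}, {"color", "red"}}
	last := []entry{{"radius", "2"}, {"color", "red"}, {"type", "circle"}}
	fmt.Println(feed(first), feed(last)) // prints "0 2"
}
```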

At best, this probably means iterating once, plucking out the discriminant entry,
and then *getting a new iterator* that starts from the beginning (which shifts
the buffer problem to the Node we're consuming data from).

Even more irritatingly: since NodeAssembler has to accept entries in any order
if it is to accept information as a stream from codecs, the NodeAssembler
*also* has to be ready to do the buffering work...

- TODO ouch: what are the ValueAssemblers going to yield for dealing with children?
- TODO we have to hand out dummy ValueAssembler types that buffer... a crazy amount of stuff. (Reinvent refmt.Tok?? argh.) Cannot avoid???
- TODO this means where errors arise from will be nuts: you can't say if anything is wrong until you figure out the discriminant. Then we replay everything? Your errors for deeper stuff will appear... uh... midway, from a random AssembleValue finishing that happens to be for the discriminant. That is not pleasant.

... let's leave that thought aside: suffice to say, some assemblers are *really*
not happy or performant if they have to accept things in unpleasant orderings.

So.

We should flip all this on its head. The AssignNode methods should lean in
on the knowledge they have about the structure they're building, and assume
that the Node we're copying content from supports random access:
pluck the fields that we care most about out first with direct lookups,
and only use iteration to cover the remaining data that the new structure
doesn't care about the ordering of.
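
A rough sketch of that lookup-first shape, under heavy simplification. The `node` and `unionAssembler` types, the `lookupByString`, `assignEntry` and `assignNode` names, and the `"type"` discriminant key are hypothetical stand-ins, not the generated code or the real `datamodel` interfaces.

```go
package main

import "fmt"

// node is a toy stand-in for a map-kind Node: ordered keys plus random access.
type node struct {
	keys   []string
	values map[string]string
}

func (n node) lookupByString(k string) (string, bool) {
	v, ok := n.values[k]
	return v, ok
}

// unionAssembler is a hypothetical inline-union assembler; it records the order it was fed.
type unionAssembler struct {
	discriminantKey string
	order           []string
}

func (a *unionAssembler) assignEntry(k, v string) {
	a.order = append(a.order, k)
}

// assignNode copies n into the assembler. The discriminant is plucked out first
// by direct lookup; iteration covers only the entries whose order does not matter.
func (a *unionAssembler) assignNode(n node) error {
	d, ok := n.lookupByString(a.discriminantKey)
	if !ok {
		return fmt.Errorf("missing discriminant key %q", a.discriminantKey)
	}
	a.assignEntry(a.discriminantKey, d)
	for _, k := range n.keys {
		if k == a.discriminantKey {
			continue // already handled above
		}
		a.assignEntry(k, n.values[k])
	}
	return nil
}

func main() {
	n := node{
		keys:   []string{"radius", "color", "type"},
		values: map[string]string{"radius": "2", "color": "red", "type": "circle"},
	}
	a := &unionAssembler{discriminantKey: "type"}
	if err := a.assignNode(n); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(a.order) // discriminant first, then the rest in source order
}
```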

Perhaps this only matters for certain styles of unions.


### sidenote about codec interfaces

Perhaps we should get used to the idea of codec packages offering two styles of methods:

- `UnmarshalIntoAssembler(io.Reader, datamodel.NodeAssembler) error`
	- this is for when you have opinions about what kind of in-memory format should be used
- `Unmarshal(io.Reader) (datamodel.Node, error)`
	- this is for when you want to let the codec pick.
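
As a sketch of how the two styles might sit side by side, here is a toy "codec" that decodes a reader's contents into a single string. The `Node` and `NodeAssembler` interfaces are cut-down local stand-ins for the `datamodel` ones, and `basicString`, `upperString` and `demo` are hypothetical names invented for this example.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// Node and NodeAssembler are toy stand-ins for datamodel.Node and datamodel.NodeAssembler.
type Node interface{ AsString() string }

type NodeAssembler interface{ AssignString(s string) error }

// basicString is the in-memory format this toy codec picks by default.
// For brevity it acts as both the node and its own assembler.
type basicString struct{ s string }

func (n *basicString) AsString() string            { return n.s }
func (n *basicString) AssignString(s string) error { n.s = s; return nil }

// upperString is a caller-chosen in-memory format: it stores text upper-cased.
type upperString struct{ s string }

func (n *upperString) AssignString(s string) error { n.s = strings.ToUpper(s); return nil }

// UnmarshalIntoAssembler decodes into whatever in-memory format the caller supplies.
func UnmarshalIntoAssembler(r io.Reader, na NodeAssembler) error {
	data, err := io.ReadAll(r)
	if err != nil {
		return err
	}
	return na.AssignString(strings.TrimSpace(string(data)))
}

// Unmarshal lets the codec pick the in-memory format. A real codec could
// return a node type specialized to its wire format here instead.
func Unmarshal(r io.Reader) (Node, error) {
	n := &basicString{}
	if err := UnmarshalIntoAssembler(r, n); err != nil {
		return nil, err
	}
	return n, nil
}

// demo decodes the same input both ways and returns both results.
func demo(input string) (string, string, error) {
	n, err := Unmarshal(strings.NewReader(input))
	if err != nil {
		return "", "", err
	}
	var u upperString
	if err := UnmarshalIntoAssembler(strings.NewReader(input), &u); err != nil {
		return "", "", err
	}
	return n.AsString(), u.s, nil
}

func main() {
	picked, custom, err := demo("hello\n")
	fmt.Println(picked, custom, err)
}
```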

We might actually end up preferring the latter in a fair number of cases.

Looking at the inline union ordering situation described above:
the best path through that (other than saying "don't fking use inline unions,
and if you do, put the discriminant in the first fking entry or gtfo") would probably be
to do a cbor (or whatever) unmarshal that produces half-deserialized skip-list nodes
(which are specialized to the cbor format rather than general purpose, but we want that in this story)...
and those can then claim to do random access, thereby letting them take on the "buffering".
This approach would let the serialization-specialized nodes take on the work,
rather than forcing the union's NodeAssembler to do the buffering at a higher level...
which is good because doing that buffering in a structured way at a higher level
is actually more work and causes more memory fragmentation and allocations.

Whew.

I have not worked out what this implies for multicodecs or other muxes that do compositions of codecs.


### enums of union keys

It's extremely common to have an enum whose members are the discriminant values of a union.

We should make a schema syntax for that.

We tend to generate such an enum in codegen anyway, for various purposes.
Might as well let people name it outright too, if they have the slightest desire to do so.

(Doesn't apply to kinded unions.)


### can reset methods be replaced with duff's device?

Yes. Well, sort of. Okay, no.

It's close! Assemblers were all written such that their zero values are ready to go.

However, there are a couple of situations where you *wouldn't* want to blithely zero everything:
for example, if an assembler has to do some allocations, but they're reusable,
you wouldn't want to turn those other objects into garbage by zeroing the pointer to them.
See the following section about new-alloc child assemblers for an example of this.
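
A small illustration of the difference, reading "duff's device" here as blanket zeroing of the whole struct (`*a = T{}`). The `listAssembler` type, its `scratch` buffer, and the `zero` and `reset` methods are hypothetical and exist only to show the effect.

```go
package main

import "fmt"

// listAssembler is a hypothetical assembler that owns a reusable scratch buffer.
type listAssembler struct {
	state   uint8
	scratch []byte // allocated once, meant to be reused across resets
}

// zero overwrites the whole struct with its zero value.
// The assembler is ready to go again, but the scratch buffer becomes garbage.
func (a *listAssembler) zero() { *a = listAssembler{} }

// reset clears state field by field and keeps the buffer's backing array.
func (a *listAssembler) reset() {
	a.state = 0
	a.scratch = a.scratch[:0]
}

func main() {
	a := &listAssembler{state: 3, scratch: make([]byte, 10, 64)}
	a.reset()
	fmt.Println(a.state, len(a.scratch), cap(a.scratch)) // 0 0 64: capacity kept
	a.zero()
	fmt.Println(a.state, len(a.scratch), cap(a.scratch)) // 0 0 0: buffer dropped
}
```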


### what's up with new-alloc child assemblers?

Mostly, child assemblers are embedded in the assembler for the type that contains them;
this is part of our allocation amortization strategy and important to performance.
However, it doesn't always apply.
Sometimes we *need* independently allocated assemblers, even when they're child assemblers:
recursive structures need this (otherwise, how big would the slab be? infinite? no; halt).
Sometimes we also just *want* them, somewhat more mildly: if a union is one of several things,
and some of them are uncommonly used but huuuuge, then maybe we'd rather allocate the child assemblers
individually on demand rather than pay a large resident memory cost to embed all the possibilities.
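
The recursive case can be sketched as follows. The `leafAssembler` and `treeAssembler` types, the `childAssembler` method, and the `allocs` counter are hypothetical; the point is only that a by-value field of the struct's own type would not compile, so the child has to live behind a pointer and be allocated on demand.

```go
package main

import "fmt"

// leafAssembler is a small child assembler, cheap to embed by value.
type leafAssembler struct{ state uint8 }

// treeAssembler is a hypothetical assembler for a recursive type
// (a tree whose children are themselves trees).
type treeAssembler struct {
	leaf   leafAssembler  // embedded: lives in the parent's slab, no extra allocation
	child  *treeAssembler // a by-value treeAssembler field here would be an invalid recursive type
	allocs int            // counts on-demand allocations, for the demo only
}

// childAssembler allocates the recursive child on first use and re-uses it afterwards.
func (a *treeAssembler) childAssembler() *treeAssembler {
	if a.child == nil {
		a.child = &treeAssembler{}
		a.allocs++
	}
	return a.child
}

func main() {
	root := &treeAssembler{}
	c1 := root.childAssembler()
	c2 := root.childAssembler()
	fmt.Println(c1 == c2, root.allocs) // true 1
}
```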

There are a couple of things to think about with these:

- resetting assemblers with a duff's device strategy wouldn't recursively reset these;
  it would just orphan them, while possibly leaving them pointing into some parts of memory in the parent slab ('cm' in particular comes to mind).
  This could be a significant correctness issue.
  - But whose responsibility is it to make this safe? Nilling 'w' proactively should also make this pretty innocuous, as one option (but we don't currently do this).

- if the parent assembler is being used in some highly reusable situation (e.g. it's a list value or map value),
  is the parent able to hold onto and re-use the child assembler? We probably usually still want to do this, even if it's in a separate piece of heap.
  - For unions, there's a question of whether we should hold onto each child assembler, or just the most recent; that's a choice we could make and tune.
    If the answer is "most recent only", we could even crank down the resident size by use of more interfaces instead of concrete types (at the cost of some other runtime performance debuffs, most likely).

We've chosen to discard the possibility of duff's device as an assembler resetting implementation.
As a result, we don't have to do proactive 'w'-nil'ing in places we might otherwise have to.
And union assemblers hold on to all child assembler types they've ever needed.
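
A toy sketch of that outcome: an explicit reset method, and a union assembler that keeps every child assembler it has allocated. The `memberAssembler` and `unionAssembler` types and the `assemblerFor` and `reset` methods are hypothetical. The map is also an assumption made to keep the example short; generated code would more plausibly use one concretely typed field per member.

```go
package main

import "fmt"

// memberAssembler is a hypothetical child assembler for one union member.
type memberAssembler struct {
	name  string
	state uint8
}

func (m *memberAssembler) reset() { m.state = 0 }

// unionAssembler holds on to every child assembler it has ever needed.
type unionAssembler struct {
	active   string
	children map[string]*memberAssembler // allocated on demand, never discarded
}

// assemblerFor returns the child assembler for a member, allocating it on first use.
func (u *unionAssembler) assemblerFor(member string) *memberAssembler {
	if u.children == nil {
		u.children = make(map[string]*memberAssembler)
	}
	ca, ok := u.children[member]
	if !ok {
		ca = &memberAssembler{name: member}
		u.children[member] = ca
	}
	u.active = member
	return ca
}

// reset is an explicit method rather than blanket zeroing:
// it clears state recursively but keeps the separately allocated children.
func (u *unionAssembler) reset() {
	u.active = ""
	for _, ca := range u.children {
		ca.reset()
	}
}

func main() {
	u := &unionAssembler{}
	foo := u.assemblerFor("Foo")
	foo.state = 5
	u.assemblerFor("Bar")
	u.reset()
	fmt.Println(len(u.children), u.assemblerFor("Foo") == foo, foo.state) // 2 true 0
}
```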