github.com/ipld/go-ipld-prime@v0.21.0/schema/gen/go/HACKME_wip.md


### absent values

The handling of absent values is still not consistent.

Currently:

- reading (via accessors or iterators) yields `datamodel.Absent` values for absent fields
- putting those `datamodel.Absent` values via `NodeAssembler.AssignNode` will result in `ErrWrongKind`.
- *the recursive copies embedded in AssignNode methods don't handle absents either*.

The first two are defensible and consistent (if not necessarily ergonomic).
The third is outright a bug, and needs to be fixed.

How we fix it is not entirely settled.

- Option 1: keep the hostility to absent assignment
- Option 2: *require* explicit absent assignment
- Option 3: become indifferent to absent assignment when it's valid
- Option 4: don't yield values that are absent during iteration at all

Option 3 seems the most preferable (and least user-hostile).
(Options 1 and 2 create work for end users;
Option 4 has questionable logical consistency.)

Updating the codegen to do Option 3 needs some work, though.

It's likely that the way to go about this would involve adding two more valid
bit states to the extended `schema.Maybe` values: one for allowAbsent (similar to
the existing allowNull), and another for both (for "nullable optional" fields).
Every NodeAssembler would then have to support that, just as they each support allowNull now.

I think the above design is valid, but it's neither implemented nor tested yet.
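
A minimal sketch of what the two extra states could look like. `Maybe` here is a local stand-in for `schema.Maybe`; the numeric values, the `allowNullOrAbsent` name, and the `acceptAbsent`/`acceptNull` helpers are assumptions made for illustration. Only allowNull and allowAbsent are named in the text above.

```go
package main

import "fmt"

// Maybe is a local stand-in for schema.Maybe (assumption: the real constants
// and their numeric values live in the schema package and may differ).
type Maybe uint8

const (
	MaybeAbsent Maybe = iota // final state: field is absent
	MaybeNull                // final state: field is explicitly null
	MaybeValue               // final state: field holds a value

	// Extended states, meaningful only while an assembler is in use.
	allowNull         // a null assignment is acceptable ("nullable")
	allowAbsent       // proposed: an absent assignment is acceptable ("optional")
	allowNullOrAbsent // proposed, hypothetical name: "nullable optional"
)

// acceptAbsent returns the state after assigning Absent to a slot in state m.
func acceptAbsent(m Maybe) (Maybe, error) {
	switch m {
	case allowAbsent, allowNullOrAbsent:
		return MaybeAbsent, nil
	}
	return m, fmt.Errorf("absent is not assignable in state %d", m)
}

// acceptNull returns the state after assigning Null to a slot in state m.
func acceptNull(m Maybe) (Maybe, error) {
	switch m {
	case allowNull, allowNullOrAbsent:
		return MaybeNull, nil
	}
	return m, fmt.Errorf("null is not assignable in state %d", m)
}

func main() {
	for _, m := range []Maybe{allowNull, allowAbsent, allowNullOrAbsent} {
		_, errAbsent := acceptAbsent(m)
		_, errNull := acceptNull(m)
		fmt.Printf("state %d: absent ok=%v, null ok=%v\n", m, errAbsent == nil, errNull == nil)
	}
}
```

Under this sketch, Option 3 amounts to every assembler consulting its slot's state before rejecting an absent, the same way it would for a null.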


### AssignNode optimality

The AssignNode methods we generate currently do pretty blithe things with large structures:
they iterate over the given node, and hurl entries into the assembler's AssembleKey and AssembleValue methods.

This isn't always optimal.
For any structure that is more efficient when fed info in an ideal order, we might want to take account of that.

Unions with representation mode "inline" are a stellar example of this:
if the discriminant key comes first, they can work *much, much* more efficiently.
By contrast, if the discriminant key shows up late in the object, it is necessary to
have buffered *all the other* data, then backtrack to handle it once the discriminant is found and parsed.

At best, this probably means iterating once, plucking out the discriminant entry,
and then *getting a new iterator* that starts from the beginning (which shifts
the buffer problem to the Node we're consuming data from).

Even more irritatingly: since NodeAssembler has to accept entries in any order
if it is to accept information streamingly from codecs, the NodeAssembler
*also* has to be ready to do the buffering work...

- TODO ouch: what is the ValueAssembler going to yield for dealing with children?
- TODO we have to hand out dummy ValueAssembler types that buffer... a crazy amount of stuff. (Reinvent refmt.Tok?? argh.) Cannot avoid???
- TODO this means where errors arise from will be nuts: you can't say if anything is wrong until you figure out the discriminant. Then we replay everything? Your errors for deeper stuff will appear... uh... midway, from a random AssembleValue finishing that happens to be for the discriminant. That is not pleasant.

... let's leave that thought aside: suffice to say, some assemblers are *really*
not happy or performant if they have to accept things in unpleasant orderings.

So.

We should flip all this on its head. The AssignNode methods should lean in
on the knowledge they have about the structure they're building, and assume
that the Node we're copying content from supports random access:
pluck the fields that we care most about out first with direct lookups,
and only use iteration to cover the remaining data that the new structure
doesn't care about the ordering of.

Perhaps this only matters for certain styles of unions.
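
A rough sketch of the "lookup first, iterate for the rest" idea. Everything here is a simplified stand-in: `srcNode` plays the role of a map-kind Node with random access, and `unionAssembler` is a hypothetical inline-union assembler that merely counts how many entries it would have had to buffer. None of these names come from the generated code.

```go
package main

import "fmt"

// entry and srcNode are simplified stand-ins for a map-kind datamodel.Node:
// entries have an iteration order, and keys can also be looked up directly.
type entry struct{ key, value string }

type srcNode struct {
	order []entry
	index map[string]string
}

func newSrcNode(entries ...entry) *srcNode {
	n := &srcNode{index: make(map[string]string, len(entries))}
	for _, e := range entries {
		n.order = append(n.order, e)
		n.index[e.key] = e.value
	}
	return n
}

// unionAssembler is a hypothetical inline-union assembler. It cannot place
// any field until it has seen the discriminant, so early fields are buffered.
type unionAssembler struct {
	discriminantKey string
	member          string  // set once the discriminant arrives
	fields          []entry // fields placed into the chosen member
	buffered        int     // entries that arrived before the discriminant
}

func (a *unionAssembler) assign(k, v string) {
	if k == a.discriminantKey {
		a.member = v
		return
	}
	if a.member == "" {
		a.buffered++ // a real assembler would have to buffer and replay this
	}
	a.fields = append(a.fields, entry{k, v})
}

// assignNaive feeds entries in whatever order the source iterates them.
func assignNaive(a *unionAssembler, n *srcNode) {
	for _, e := range n.order {
		a.assign(e.key, e.value)
	}
}

// assignDiscriminantFirst plucks the discriminant with a direct lookup,
// then iterates only for the remaining, order-insensitive entries.
func assignDiscriminantFirst(a *unionAssembler, n *srcNode) error {
	d, ok := n.index[a.discriminantKey]
	if !ok {
		return fmt.Errorf("missing discriminant key %q", a.discriminantKey)
	}
	a.assign(a.discriminantKey, d)
	for _, e := range n.order {
		if e.key != a.discriminantKey {
			a.assign(e.key, e.value)
		}
	}
	return nil
}

func main() {
	// The discriminant arrives last in iteration order.
	src := newSrcNode(entry{"x", "1"}, entry{"y", "2"}, entry{"type", "Foo"})

	naive := &unionAssembler{discriminantKey: "type"}
	assignNaive(naive, src)

	smart := &unionAssembler{discriminantKey: "type"}
	if err := assignDiscriminantFirst(smart, src); err != nil {
		panic(err)
	}
	fmt.Println("naive buffered:", naive.buffered, "discriminant-first buffered:", smart.buffered)
}
```

With the discriminant last in the source, the naive copy would have buffered both other entries, while the discriminant-first copy buffers none. The cost moves to the source node, which has to answer the direct lookup.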


### sidenote about codec interfaces

Perhaps we should get used to the idea of codec packages offering two styles of methods:

- `UnmarshalIntoAssembler(io.Reader, datamodel.NodeAssembler) error`
  - this is for when you have opinions about what kind of in-memory format should be used
- `Unmarshal(io.Reader) (datamodel.Node, error)`
  - this is for when you want to let the codec pick.

We might actually end up preferring the latter in a fair number of cases.

Looking at the inline union ordering situation described above:
the best path through that (other than saying "don't fking use inline unions,
and if you do, put the discriminant in the first fking entry or gtfo") would probably be
to do a cbor (or whatever) unmarshal that produces the half-deserialized skip-list nodes
(which are specialized to the cbor format rather than general purpose, but we want that in this story)...
and those can then claim to do random access, thereby letting them take on the "buffering".
This approach would let the serialization-specialized nodes take on the work,
rather than forcing the union's NodeAssembler to do the buffering at a higher level...
which is good because doing that buffering in a structured way at a higher level
is actually more work and causes more memory fragmentation and allocations.

Whew.

I have not worked out what this implies for multicodecs or other muxes that do compositions of codecs.


### enums of union keys

It's extremely common to have an enum whose members are the discriminant values of a union.

We should make a schema syntax for that.

We tend to generate such an enum in codegen anyway, for various purposes.
Might as well let people name it outright too, if they have the slightest desire to do so.

(Doesn't apply to kinded unions.)


### can reset methods be replaced with duff's device?

Yes. Well, sort of. Okay, no.

It's close! Assemblers were all written such that their zero values are ready to go.

However, there are a couple of situations where you *wouldn't* want to blithely zero everything:
for example, if an assembler has to do some allocations, but they're reusable,
you wouldn't want to turn those other objects into garbage by zeroing the pointer to them.
See the following section about new-alloc child assemblers for an example of this.
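
A small sketch of the difference, reading "duff's device" here as zeroing the whole struct in one assignment (an assumption about the intended meaning). `parentAssembler`, `childAssembler`, and their methods are hypothetical stand-ins, not generated code.

```go
package main

import "fmt"

// childAssembler stands in for an independently allocated child assembler
// that owns reusable scratch memory.
type childAssembler struct {
	buf []byte
}

func (c *childAssembler) reset() { c.buf = c.buf[:0] }

// parentAssembler stands in for a generated assembler; its zero value is
// ready to use.
type parentAssembler struct {
	state int
	child *childAssembler // allocated on demand, nil until first needed
}

func (p *parentAssembler) ensureChild() *childAssembler {
	if p.child == nil {
		p.child = &childAssembler{buf: make([]byte, 0, 64)}
	}
	return p.child
}

// zeroAll wipes the whole struct. The result is a valid assembler, but the
// child allocation is dropped and becomes garbage.
func (p *parentAssembler) zeroAll() { *p = parentAssembler{} }

// reset clears state field by field and keeps the child for reuse.
func (p *parentAssembler) reset() {
	p.state = 0
	if p.child != nil {
		p.child.reset()
	}
}

func main() {
	p := &parentAssembler{}
	c := p.ensureChild()
	c.buf = append(c.buf, "hello"...)
	p.state = 1

	p.reset()
	fmt.Println("after reset, child kept:", p.child == c, "len:", len(p.child.buf), "cap:", cap(p.child.buf))

	p.zeroAll()
	fmt.Println("after zeroAll, child kept:", p.child == c)
}
```

Both paths leave the parent usable, but only `reset` keeps the child's allocation around for the next round.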


### what's up with new-alloc child assemblers?

Mostly, child assemblers are embedded in the assembler for the type that contains them;
this is part of our allocation amortization strategy and important to performance.
However, it doesn't always apply:
Sometimes we *need* independently allocated assemblers, even when they're child assemblers:
recursive structures need this (otherwise, how big would the slab be? infinite? no; halt).
Sometimes we also just *want* them, somewhat more mildly: if a union is one of several things,
and some of them are uncommonly used but huuuuge, then maybe we'd rather allocate the child assemblers
individually on demand rather than pay a large resident memory cost to embed all the possibilities.

There are a couple of things to think about with these:

- resetting assemblers with a duff's device strategy wouldn't recursively reset these;
  it would just orphan them, possibly leaving them pointed into some parts of memory in the parent slab ('cm' in particular comes to mind).
  This could be a significant correctness issue.
  - But whose responsibility is it to make this safe? Nilling 'w' proactively should also make this pretty innocuous, as one option (but we don't currently do this).

- if the parent assembler is being used in some highly reusable situation (e.g. it's a list value or map value),
  is the parent able to hold onto and re-use the child assembler? We probably usually still want to do this, even if it's in a separate piece of heap.
  - For unions, there's a question of whether we should hold onto each child assembler, or just the most recent; that's a choice we could make and tune.
    If the answer is "most recent only", we could even crank down the resident size by use of more interfaces instead of concrete types (at the cost of some other runtime performance debuffs, most likely).

We've chosen to discard the possibility of duff's device as an assembler resetting implementation.
As a result, we don't have to do proactive 'w'-nil'ing in places we might otherwise have to.
And union assemblers hold on to all child assembler types they've ever needed.