github.com/ipld/go-ipld-prime@v0.21.0/schema/gen/go/HACKME_memorylayout.md

about memory layout
===================

Memory layout is important when designing a system for going fast.
It also shows up in exported types (whether or not they're pointers, etc).

For the most part, we try to hide these details;
or, failing that, at least make them appear consistent.
There's some deeper logic required to *pick* which way we do things, though.

This document was written to describe codegen and all of the tradeoffs here,
but much of it (particularly the details about embedding and internal pointers)
also strongly informed the design of the core NodeAssembler semantics,
and thus may also be useful reading to understand some of the forces that
shaped even the various un-typed node implementations.



Prerequisite understandings
---------------------------

The following headings contain brief summaries of information that's important
to know in order to understand how we designed the IPLD data structure
memory layouts (and how to tune them).

Most of these concepts are common to many programming languages, so you can
likely skim those sections if you know them. Others are fairly golang-specific.

### heap vs stack

The concept of heap vs stack in Golang is pretty similar to the concept
in most other languages with garbage collection, so we won't cover it
in great detail here.

The key concept to know: the *count* of allocations which are made on
the heap significantly affects performance. Allocations on the heap
consume CPU time both when made, and later, as part of GC.

The *size* of the allocations affects the total memory needed, but
does *not* significantly affect the speed of execution.

Allocations which are made on the stack are (familiarly) effectively free.

### escape analysis

"Escape analysis" refers to the efforts the compiler makes to figure out whether some
piece of memory can be kept on the stack or whether it must "escape" to the heap.
If escape analysis finds that some memory can be kept on the stack,
it will prefer to do so (and this is faster/preferable because it both means
allocation is simple and that no 'garbage' is generated to collect later).

Since whether things are allocated on the stack or the heap affects performance,
the concept of escape analysis is important. The details (fortunately) are not:
for the purposes of what we need to do in our IPLD data structures,
our goal with our code is to A) flunk out and escape to the heap
as soon as possible, but B) do that in one big chunk of memory at once
(because we'll be able to use [internal pointers](#internal-pointers)
thereafter).

One implication of escape analysis that's both useful and easy to note is that
whether you use a struct literal (`Foo{}`) or a pointer (`&Foo{}`)
*does not determine* whether that memory gets allocated on the heap or stack.
If you use a pointer, but the escape analysis can prove that the pointer
never escapes, the memory will still end up allocated on the stack.

Another way to think about this is: use pointers freely! By using pointers,
you're in effect giving the compiler *more* freedom to decide where memory resides;
in contrast, avoiding the use of pointers in method signatures, etc, will
give the compiler *less* choice about where the memory should reside,
and typically forces copying. Giving the compiler more freedom generally
has better results.

**pro-tip**: you can compile a program with the arguments `-gcflags "-m -m"` to
get lots of information about the escape analysis the compiler performs.

### embed vs pointer

Structs can be embedded -- e.g.
`type Foo struct { field Otherstruct }` --
or referenced by a pointer -- e.g. `type Foo struct { field *Otherstruct }`.

The difference is substantial.

When structs are embedded, the layout in memory of the larger struct is simply
a concatenation of the embedded structs. This means the amount of memory
that structure takes is the sum of the size of all of the embedded things;
and by the other side of the same coin, the *count* of allocations needed
(remember! the *count* affects performance more than the *size*, as we briefly
discussed in the [heap-vs-stack](#heap-vs-stack) section) is exactly *one*.

When pointers are used instead of embedding, the parent struct is typically
smaller (pointers are one word of memory, whereas the embedded thing can often
be larger), and null values can be used... but if fields are assigned some
value other than null, there's a very high likelihood that heap allocations
will start cropping up in the process of creating values to take pointers
to before then assigning the pointer field! (This can be subverted by
either [escape analysis](#escape-analysis) (though it's fairly uncommon),
or by [internal pointers](#internal-pointers) (which are going to turn out
very important, and will be discussed later)... but it's wise to default
to worrying about it until you can prove that one of the two will save you.)

When setting fields, another difference appears: a pointer field takes one
instruction (assuming the value already exists, and we're not invoking heap
allocation to get the pointer!) to assign,
whereas an embedded field generally signifies a memcopy, which
may take several instructions if the embedded value is large.

You can see how the choice between use of pointers and embeds results
in significantly different memory usage and performance characteristics!

(Quick mention in passing: "cache lines", etc, are also potential concerns that
can be addressed by embedding choices. However, it's probably wise to attend
to GC first. While cache alignment *can* be important, it's almost always going
to be a winning bet that GC will be a much higher impact concern.)

It is an unfortunate truth that whether or not a field can be null in Golang
and whether or not it's a pointer are two properties that are conflated --
you can't choose one independently of the other. (The reasoning for this is
based on intuitions around mechanical sympathy -- but it's worth mentioning that
a sufficiently smart compiler *could* address both the logical separation
and simultaneously solve for the mechanical sympathy concerns
in order to reach good performance in many cases; Golang just doesn't do so.)

### interfaces are two words and may cause implicit allocation

Interfaces in Golang are always two words in size. The first word is a pointer
to the type information for what the interface contains. The second word is
a pointer to the data itself.

This means if some data is assigned into an interface value, it *must* become
a pointer -- the compiler will do this implicitly; and this is the case even if
the type info in the first word retains a claim that the data is not a pointer.
In practice, this also almost guarantees that the data in question
will escape to the heap.

(This applies even to primitives that are one word in size! At least, as of
golang version 1.13 -- keep an eye on the `runtime.convT32` functions
if you want to look into this further; the `mallocgc` call is clear to see.
There's a special case inside `malloc` which causes zero values to get a
free pass (!), but in all other cases, allocation will occur.)

Knowing this, you can probably conclude a general rule of thumb: if your
application is going to put a value in an interface, and *especially* if it's
going to do that more than once, you're probably best off explicitly handling
it as a pointer rather than a value. Any other approach will be very likely to
provoke unnecessary copy behavior and/or multiple unnecessary heap allocations
as the value moves in and out of pointer form.

(Fun note: if attempting to probe this by microbenchmarking experiments, be
careful to avoid using zero values! Zero values get special treatment and avoid
allocations in ways that aren't general.)

### internal pointers

"Internal pointers" refers to any pointer taken to some position in a piece
of memory that was already allocated somewhere.

For example, given some `type Foo struct { a, b, c Otherstruct }`, the
values of `f := &Foo{}` and `b := &f.b` will be very related: they will
differ by the size of `Otherstruct`!

The main consequence of this is: using internal pointers can allow you to
construct large structures containing many pointers... *without* using a
correspondingly large *count of allocations*. This unlocks a lot of potential
choices for how to build data structures in memory while minimizing allocs!

Internal pointers are not without their tradeoffs, however: in particular,
internal pointers have an interesting relationship with garbage collection.
When there's an internal pointer to some field in a large struct, that pointer
will cause the *entire* containing struct to still be considered
referenced for garbage collection purposes -- that is, *it won't be collected*.
So, in our example above, keeping a reference to `&f.b` will in fact cause
memory of the size of *three* `Otherstruct`s to be uncollectable, not one.

You can find more information about internal pointers in this talk:
https://blog.golang.org/ismmkeynote

### inlining functions

Function inlining is an important compiler optimization.

Inlining optimizes in two regards: one, it can remove some of the overhead of
function calls; and two, it can enable *other* optimizations by getting the
relevant instruction blocks to be located together and thus rearrangeable.
(Inlining does increase the compiled binary size, so it's not all upside.)

Calling a function has some fixed overhead -- shuffling arguments from registers
into calling convention order on the stack; potentially growing the stack; etc.
While these overheads are small in practice... if the function is called many
(many) times, this overhead can still add up. Inlining can remove these costs!

More interestingly, function inlining can also enable *other* optimizations.
For example, a function that *would* have caused escape analysis to flunk
something out to the heap *if* that function were called on its own... can
potentially be inlined in such a way that, in its contextual usage,
the escape analysis flunking can actually disappear entirely.
Many other kinds of optimizations can similarly be enabled by inlining.
This makes designing library code to be inline-friendly a potentially
high-impact concern -- sometimes even more so than can be easily seen.

The exact mechanisms used by the compiler to determine what can (and should)
be inlined may vary significantly from version to version of the Go compiler,
which means one should be cautious of spending too much time in the details.
However, we *can* make useful choices around things that will predictably
obstruct inlining -- such as [virtual function calls](#virtual-function-calls).

Occasionally there are positive stories in teasing the inliner to do well,
such as https://blog.filippo.io/efficient-go-apis-with-the-inliner/ (but these
seem to generally require a lot of thought and probably aren't the first stop
on most optimization quests).

### virtual function calls

Function calls which are intermediated by interfaces are called "virtual"
function calls. (You may also encounter the term "v-table" in compiler
and runtime design literature -- this 'v' stands for "virtual".)

Virtual function calls generally can't be inlined. This can have significant
effects, as described in the [inlining functions](#inlining-functions) section --
it both means function call overhead can't be removed, and it can have cascading
consequences by making other potential optimizations unreachable.



Resultant Design Features
-------------------------

### concrete implementations

We generate a concrete type for each type in the schema.

Using a concrete type means methods on it are possible to inline.
This is important to us because most of the methods are "accessors" -- that is,
a style of function that has a small body and does little work -- and these
are precisely the sort of function where inlining can add up.

### natively-typed methods in addition to the general interface

We generate two sets of methods: **both** the general interface methods to
comply with Node and NodeBuilder interfaces, **and** also natively-typed
variants of the same methods (e.g. a `Lookup` method for maps that takes
the concrete type key and returns the concrete type value, rather than
taking and returning `Node` interfaces).

While both sets of methods can accomplish the same end goals, both are needed.
There are two distinct advantages to natively-typed methods;
and at the same time, the need for the general methods is system-critical.

Firstly, to programmers writing code that can use the concrete types, the
natively-typed methods provide more value in the form of compile-time type
checking, autocompletion and other tooling assist opportunities, and
less verbosity.

Secondly, natively-typed functions on concrete types can be higher performance:
since they're not [virtual function calls](#virtual-function-calls), we
can expect [inlining](#inlining-functions) to work. We might expect this to
be particularly consequential in builders and in accessor methods, since these
involve numerous calls to methods with small bodies -- precisely the sort of
situation that often substantially benefits from inlining.

At the same time, it goes without saying that we need the general Node and
NodeBuilder interfaces to be satisfied, so that we can write generic library
code such as reusable traversals, etc. It is not possible to satisfy both
needs with a single set of methods in the Golang typesystem; therefore,
we generate both.

### embed by default

Embedding structs amortizes the count of memory allocations.
This addresses what is typically our biggest concern.

The increase in size is generally not consequential. We expect most fields
end up filled anyway, so reserving that memory up front is reasonable.
(Indeed, unfilled fields are only possible for nullable or optional fields
which are implemented as embedded.)

If assigning whole sub-trees at once, assignment into embedded fields
incurs the cost of a memcopy (whereas by contrast, if fields were pointers,
assigning them would be cheap... it's just that we would've had to pay
a (possibly _extra_) allocation cost elsewhere earlier).
However, this is usually a worthy trade.
Linear memcpy in practice can be significantly cheaper than extra allocations
(especially if it's one long memcpy vs many allocations);
and if we assume a balance of use cases such as "unmarshal happens more often
than sub-tree-assignment", then it's pretty clear we should prioritize
allocation minimization for unmarshal rather than fret over sub-tree assignment.

### nodebuilders point to the concrete type

We generate NodeBuilder types which contain a pointer to the type they build.

This means we can hold onto the Node pointer when its building is completed,
and discard the NodeBuilder. (Or, reset and reuse the NodeBuilder.)
Garbage collection can apply to the NodeBuilder independently of the lifespan
of the Node it built.

This means a single NodeBuilder and its produced Node will require
**two** allocations -- one for the NodeBuilder, and a separate one for the Node.

(An alternative would be to embed the concrete Node value in the NodeBuilder,
and return a pointer to it when finalizing the creation of the Node;
however, due to the garbage collection semantics around
[internal pointers](#internal-pointers), such a design would cause the entirety
of the memory needed in the NodeBuilder to remain uncollectable as long as the
completed Node is reachable! This would be an unfortunate trade.)

While we pay two allocations for the Node and its Builder, we earn that back
in spades via our approach to recursion with
[NodeAssemblers](#nodeassemblers-accumulate-mutations), and specifically, how
[NodeAssemblers embed more NodeAssemblers](#nodeassemblers-embed-nodeassemblers).
Long story short: we pay two allocations, yes. But it's *fixed* at two,
no matter how large and complex the structure is.

### nodeassemblers accumulate mutations

The NodeBuilder type is only used at the root of construction of a value.
After that, recursion works with an interface called NodeAssembler instead.

A NodeAssembler is essentially the same thing as a NodeBuilder, except
_it doesn't return a Node_.

This means we can use the NodeAssembler interface to describe constructing
the data in the middle of some complex value, and we're not burdened by the
need to be able to return the finished product. (Sufficient state-keeping
and defensive checks to ensure we don't leak mutable references would not
come for free; reducing the number of points where we might need to do this
makes it possible to create a more efficient system overall.)

The documentation on the datamodel.NodeAssembler type gives some general
description of this.

NodeBuilder types end up being just a NodeAssembler embed, plus a few methods
for exposing the final results and optionally resetting the whole system.

### nodeassemblers embed nodeassemblers

In addition to each NodeAssembler containing a pointer to the value it modifies
(the same as [NodeBuilders](#nodebuilders-point-to-the-concrete-type))...
for assemblers that work with recursive structures, they also embed another
NodeAssembler for each of their child values.

This lets us amortize the allocations for all the *assemblers* in the same way
as embedding in the actual value structs lets us amortize allocations there.

The code for this gets a little complex, and the result also carries several
additional limitations on usage, but it does keep the allocations finite,
and thus makes the overall performance fast.

(To be more specific, for recursive types that are infinite (namely, maps and
lists; whereas structs and unions are finite), the NodeAssembler embeds
*one* NodeAssembler for all values. (Obviously, we can't embed an infinite
number of them, right?)
This leads to a restriction: you can't assemble
multiple children of an infinite recursive value simultaneously.)

### nullable and optional struct fields embed too

TODO intro

There is some chance of over-allocation in the event of nullable or optional
fields. We support tuning that via adjunct configuration to the code generator,
which allows you to opt in to using pointers for fields; choosing to do this
will of course cause you to lose out on alloc amortization features in exchange.

TODO also resolve the loops note, at bottom

### unexported implementations, exported aliases

Our concrete types are unexported. For those that need to be exported,
we export an alias to the pointer type.

This has an interesting set of effects:

- copy-by-value from outside the package becomes impossible;
- creating zero values from outside the package becomes impossible;
- and yet referring to the type for type assertions remains possible.

This addresses one downside to using [concrete implementations](#concrete-implementations):
if the concrete implementation is an exported symbol, it means any code external
to the package can produce Golang's natural "zero" for the type.
This is problematic because it's true even if the Golang "zero" value for the
type doesn't correspond to a valid value.
While keeping an unexported implementation and an exported interface makes
external fabrication of zero values impossible, it breaks inlining.
Exporting an alias of the pointer type, however, strikes both goals at once:
external fabrication of zero values is blocked, and yet inlining works.



Amusing Details and Edge Cases
------------------------------

### looped references

// whose job is it to detect this?
// the schema validator should check it...
// but something that breaks the cycle *there* doesn't necessarily do so for the emitted code! aggh!
// ... unless we go back to optional and nullable both making ptrs unconditionally.



Learning more (the hard way)
----------------------------

If this document doesn't provide enough information for you,
you've probably graduated to the point where doing experiments is next. :)

Prototypes and research examples can be found in the
`go-ipld-prime/_rsrch/` directories.
In particular, the "multihoisting" and "nodeassembler" packages are relevant,
containing research that led to the drafting of this doc,
as well as some partially-worked alternative interface drafts.
(You may have to search back through git history to find these directories;
they're removed after some time, when the lessons have been applied.)

Tests there include some benchmarks (self-explanatory);
some tests based on runtime memory stats inspection;
and some tests which are simply meant to be disassembled and read thusly.

Compiler flags can provide useful insights:

- `-gcflags '-S'` -- gives you an assembler dump.
  - read this to see for sure what's inlined and not.
  - easy to quickly skim for calls like `runtime.newobject`, etc.
  - often critically useful to ensure a benchmark hasn't optimized out the question you meant to ask it!
  - generally gives a ground truth which puts an end to guessing.
- `-gcflags '-m -m'` -- reports escape analysis and other decisions.
  - note the two m's -- not a typo: this gives you info in stack form,
    which is radically more informative than the single-m output.
- `-gcflags '-l'` -- disables inlining!
  - useful on benchmarks to quickly detect whether inlining is a major part of performance.

These flags can apply to any command like `go install`... as well as `go test`.

Profiling information collected from live systems in use is of course always
intensely useful... if you have any on hand. When handling this, be aware of
how data-dependent performance can be when handling serialization systems:
different workload content can very much lead to different hot spots.

Happy hunting.