
about memory layout
===================

Memory layout is important when designing a system for going fast.
It also shows up in exported types (whether or not they're pointers, etc).

For the most part, we try to hide these details;
or, failing that, at least make them appear consistent.
There's some deeper logic required to *pick* which way we do things, though.

This document was written to describe codegen and all of the tradeoffs here,
but much of it (particularly the details about embedding and internal pointers)
also strongly informed the design of the core NodeAssembler semantics,
and thus also may be useful reading to understand some of the forces that
shaped even the various un-typed node implementations.



Prerequisite understandings
---------------------------

The following headings contain brief summaries of information that's important
to know in order to understand how we designed the IPLD data structure
memory layouts (and how to tune them).

Most of these concepts are common to many programming languages, so you can
likely skim those sections if you know them.  Others are fairly golang-specific.

### heap vs stack

The concept of heap vs stack in Golang is pretty similar to the concept
in most other languages with garbage collection, so we won't cover it
in great detail here.

The key concept to know: the *count* of allocations which are made on
the heap significantly affects performance.  Allocations on the heap
consume CPU time both when made, and later, as part of GC.

The *size* of the allocations affects the total memory needed, but
does *not* significantly affect the speed of execution.

Allocations which are made on the stack are (familiarly) effectively free.

### escape analysis

"Escape analysis" refers to the efforts the compiler makes to figure out if some
piece of memory can be kept on the stack or if it must "escape" to the heap.
If escape analysis finds that some memory can be kept on the stack,
it will prefer to do so (and this is faster/preferable because it both means
allocation is simple and that no 'garbage' is generated to collect later).

Since whether things are allocated on the stack or the heap affects performance,
the concept of escape analysis is important.  The details (fortunately) are not:
for the purposes of what we need to do in our IPLD data structures,
our goal with our code is to A) flunk out and escape to heap
as soon as possible, but B) do that in one big chunk of memory at once
(because we'll be able to use [internal pointers](#internal-pointers)
thereafter).

One implication of escape analysis that's both useful and easy to note is that
whether you use a struct literal (`Foo{}`) or a pointer (`&Foo{}`)
*does not determine* whether that memory gets allocated on the heap or stack.
Even if you use a pointer, if escape analysis can prove that the pointer
never escapes, the memory will still end up allocated on the stack.

Another way to think about this is: use pointers freely!  By using pointers,
you're in effect giving the compiler *more* freedom to decide where memory resides;
in contrast, avoiding the use of pointers in method signatures, etc, will
give the compiler *less* choice about where the memory should reside,
and typically forces copying.  Giving the compiler more freedom generally
has better results.

**pro-tip**: you can compile a program with the arguments `-gcflags "-m -m"` to
get lots of information about the escape analysis the compiler performs.
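A minimal sketch of these two rules in action (the names here are illustrative, not from the codebase).  Both functions use `&Otherstruct{}`, but only the second one forces a heap allocation; running `go build -gcflags "-m -m"` on this file shows the compiler's reasoning for each:

```go
package main

import "fmt"

type Otherstruct struct{ A, B int }

// Stays on the stack: the pointer never leaves this function,
// so escape analysis can prove &Otherstruct{} needn't be heap-allocated.
func sumLocal() int {
	p := &Otherstruct{A: 1, B: 2}
	return p.A + p.B
}

// Escapes to the heap: the pointer is returned to the caller,
// so the memory must outlive this stack frame.
func makeEscaping() *Otherstruct {
	return &Otherstruct{A: 1, B: 2}
}

func main() {
	fmt.Println(sumLocal(), makeEscaping().A)
}
```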

### embed vs pointer

Structs can be embedded -- e.g. `type Foo struct { field Otherstruct }` --
or referenced by a pointer -- e.g. `type Foo struct { field *Otherstruct }`.

The difference is substantial.

When structs are embedded, the layout in memory of the larger struct is simply
a concatenation of the embedded structs.  This means the amount of memory
that structure takes is the sum of the size of all of the embedded things;
and by the other side of the same coin, the *count* of allocations needed
(remember! the *count* affects performance more than the *size*, as we briefly
discussed in the [heap-vs-stack](#heap-vs-stack) section) is exactly *one*.

When pointers are used instead of embedding, the parent struct is typically
smaller (pointers are one word of memory, whereas the embedded thing can often
be larger), and null values can be used... but if fields are assigned some
value other than null, there's a very high likelihood that heap allocations
will start cropping up in the process of creating values to take pointers
to before then assigning the pointer field!  (This can be subverted by
either [escape analysis](#escape-analysis) (though it's fairly uncommon),
or by [internal pointers](#internal-pointers) (which are going to turn out
very important, and will be discussed later)... but it's wise to default
to worrying about it until you can prove that one of the two will save you.)

When setting fields, another difference appears: a pointer field takes one
instruction (assuming the value already exists, and we're not invoking heap
allocation to get the pointer!) to assign,
whereas an embedded field generally signifies a memcopy, which
may take several instructions if the embedded value is large.

You can see how the choice between use of pointers and embeds results
in significantly different memory usage and performance characteristics!

(Quick mention in passing: "cache lines", etc, are also potential concerns that
can be addressed by embedding choices.  However, it's probably wise to attend
to GC first.  While cache alignment *can* be important, it's almost always going
to be a winning bet that GC will be a much higher impact concern.)

It is an unfortunate truth that whether or not a field can be null in Golang
and whether or not it's a pointer are two properties that are conflated --
you can't choose one independently of the other.  (The reasoning for this is
based on intuitions around mechanical sympathy -- but it's worth mentioning that
a sufficiently smart compiler *could* address both the logical separation
and simultaneously have the compiler solve for the mechanical sympathy concerns
in order to reach good performance in many cases; Golang just doesn't do so.)

### interfaces are two words and may cause implicit allocation

Interfaces in Golang are always two words in size.  The first word is a pointer
to the type information for what the interface contains.  The second word is
a pointer to the data itself.

This means if some data is assigned into an interface value, it *must* become
a pointer -- the compiler will do this implicitly; and this is the case even if
the type info in the first word retains a claim that the data is not a pointer.
In practice, this also almost guarantees that the data in question
will escape to the heap.

(This applies even to primitives that are one word in size!  At least, as of
golang version 1.13 -- keep an eye on the `runtime.convT32` functions
if you want to look into this further; the `mallocgc` call is clear to see.
There's a special case inside `malloc` which causes zero values to get a
free pass (!), but in all other cases, allocation will occur.)

Knowing this, you can probably conclude a general rule of thumb: if your
application is going to put a value in an interface, and *especially* if it's
going to do that more than once, you're probably best off explicitly handling
it as a pointer rather than a value.  Any other approach will be very likely to
provoke unnecessary copy behavior and/or multiple unnecessary heap allocations
as the value moves in and out of pointer form.

(Fun note: if attempting to probe this by microbenchmarking experiments, be
careful to avoid using zero values!  Zero values get special treatment and avoid
allocations in ways that aren't general.)
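This is observable with `testing.AllocsPerRun` (usable outside of tests).  One caveat when probing it yourself: beyond the zero-value free pass, newer runtimes also keep a static table of small integers, so use a value well outside that range to see the allocation.  The `box` helper here is illustrative:

```go
package main

import (
	"fmt"
	"testing"
)

var sink interface{} // package-level, so assignments can't be optimized away

// box forces a real runtime conversion of an int into an interface;
// go:noinline keeps the compiler from constant-folding the call away.
//
//go:noinline
func box(i int) interface{} { return i }

func main() {
	// Boxing a nonzero (and non-small) value heap-allocates.
	allocsNonzero := testing.AllocsPerRun(100, func() { sink = box(1000) })

	// The zero value gets a free pass inside the runtime: no allocation.
	allocsZero := testing.AllocsPerRun(100, func() { sink = box(0) })

	fmt.Println(allocsNonzero >= 1, allocsZero == 0)
}
```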

### internal pointers

"Internal pointers" refer to any pointer taken to some position in a piece
of memory that was already allocated somewhere.

For example, given some `type Foo struct { a, b, c Otherstruct }`, the
values of `f := &Foo{}` and `b := &f.b` will be very related: they will
differ by the size of `Otherstruct`!

The main consequence of this is: using internal pointers can allow you to
construct large structures containing many pointers... *without* using a
correspondingly large *count of allocations*.  This unlocks a lot of potential
choices for how to build data structures in memory while minimizing allocs!

Internal pointers are not without their tradeoffs, however: in particular,
internal pointers have an interesting relationship with garbage collection.
When there's an internal pointer to some field in a large struct, that pointer
will cause the *entire* containing struct to still be considered
referenced for garbage collection purposes -- that is, *it won't be collected*.
So, in our example above, keeping a reference to `&f.b` will in fact cause
memory of the size of *three* `Otherstruct`s to be uncollectable, not one.

You can find more information about internal pointers in this talk:
https://blog.golang.org/ismmkeynote

### inlining functions

Function inlining is an important compiler optimization.

Inlining optimizes in two regards: one, it can remove some of the overhead of
function calls; and two, it can enable *other* optimizations by getting the
relevant instruction blocks to be located together and thus rearrangeable.
(Inlining does increase the compiled binary size, so it's not all upside.)

Calling a function has some fixed overhead -- shuffling arguments from registers
into calling convention order on the stack; potentially growing the stack; etc.
While these overheads are small in practice... if the function is called many
(many) times, this overhead can still add up.  Inlining can remove these costs!

More interestingly, function inlining can also enable *other* optimizations.
For example, a function that *would* have caused escape analysis to flunk
something out to the heap *if* that function were compiled alone... can
potentially be inlined in such a way that in its contextual usage,
the escape analysis flunking can actually disappear entirely.
Many other kinds of optimizations can similarly be enabled by inlining.
This makes designing library code to be inline-friendly a potentially
high-impact concern -- sometimes even more so than can be easily seen.

The exact mechanisms used by the compiler to determine what can (and should)
be inlined may vary significantly from version to version of the Go compiler,
which means one should be cautious of spending too much time in the details.
However, we *can* make useful choices around things that will predictably
obstruct inlining -- such as [virtual function calls](#virtual-function-calls).
Occasionally there are positive stories in teasing the inliner to do well,
such as https://blog.filippo.io/efficient-go-apis-with-the-inliner/ (but these
seem to generally require a lot of thought and probably aren't the first stop
on most optimization quests).

### virtual function calls

Function calls which are intermediated by interfaces are called "virtual"
function calls.  (You may also encounter the term "v-table" in compiler
and runtime design literature -- this 'v' stands for "virtual".)

Virtual function calls generally can't be inlined.  This can have significant
effects, as described in the [inlining functions](#inlining-functions) section --
it both means function call overhead can't be removed, and it can have cascading
consequences by making other potential optimizations unreachable.
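A tiny illustration (names invented for this sketch).  Both calls compute the same thing; run `go build -gcflags '-m'` to see the compiler report the concrete call as inlinable while the interface-mediated one generally is not:

```go
package main

import "fmt"

type Adder interface{ Add(int) int }

type counter struct{ n int }

// A small accessor-style method: a prime candidate for inlining
// when called through the concrete type.
func (c counter) Add(x int) int { return c.n + x }

func main() {
	c := counter{n: 40}

	direct := c.Add(2) // concrete call: eligible for inlining

	var a Adder = c     // boxing into the interface (note: this also allocates)
	virtual := a.Add(2) // virtual call: dispatched at runtime, generally not inlined

	fmt.Println(direct, virtual) // same answers; only the cost differs
}
```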



Resultant Design Features
-------------------------

### concrete implementations

We generate a concrete type for each type in the schema.

Using a concrete type means methods on it are possible to inline.
This is important to us because most of the methods are "accessors" -- that is,
a style of function that has a small body and does little work -- and these
are precisely the sort of function where inlining can add up.

### natively-typed methods in addition to the general interface

We generate two sets of methods: **both** the general interface methods to
comply with Node and NodeBuilder interfaces, **and** also natively-typed
variants of the same methods (e.g. a `Lookup` method for maps that takes
the concrete type key and returns the concrete type value, rather than
taking and returning `Node` interfaces).
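A heavily simplified sketch of the shape of such a pair of methods -- the type names and the `LookupNode` method name here are invented for illustration and do not match the actual generated code:

```go
package main

import "fmt"

// Node stands in for the general interface (datamodel.Node in the real library).
type Node interface{ String() string }

type String struct{ x string }

func (s *String) String() string { return s.x }

type FriendsMap struct{ m map[string]*String }

// Natively-typed accessor: concrete key and value types, so callers get
// compile-time checking, and the call is eligible for inlining.
func (f *FriendsMap) Lookup(k *String) *String { return f.m[k.x] }

// General accessor in the Node-interface style: required so generic
// library code can traverse this type, but it's a virtual-call boundary.
func (f *FriendsMap) LookupNode(k Node) Node { return f.m[k.String()] }

func main() {
	f := &FriendsMap{m: map[string]*String{"a": {"alice"}}}
	fmt.Println(f.Lookup(&String{"a"}).x, f.LookupNode(&String{"a"}).String())
}
```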

While both sets of methods can accomplish the same end goals, both are needed.
There are two distinct advantages to natively-typed methods;
and at the same time, the need for the general methods is system critical.

Firstly, to programmers writing code that can use the concrete types, the
natively-typed methods provide more value in the form of compile-time type
checking, autocompletion and other tooling assist opportunities, and
less verbosity.

Secondly, natively-typed functions on concrete types can be higher performance:
since they're not [virtual function calls](#virtual-function-calls), we
can expect [inlining](#inlining-functions) to work.  We might expect this to
be particularly consequential in builders and in accessor methods, since these
involve numerous calls to methods with small bodies -- precisely the sort of
situation that often substantially benefits from inlining.

At the same time, it goes without saying that we need the general Node and
NodeBuilder interfaces to be satisfied, so that we can write generic library
code such as reusable traversals, etc.  It is not possible to satisfy both
needs with a single set of methods with the Golang typesystem; therefore,
we generate both.

### embed by default

Embedding structs amortizes the count of memory allocations.
This addresses what is typically our biggest concern.

The increase in size is generally not consequential.  We expect most fields
end up filled anyway, so reserving that memory up front is reasonable.
(Indeed, unfilled fields are only possible for nullable or optional fields
which are implemented as embedded.)

If assigning whole sub-trees at once, assignment into embedded fields
incurs the cost of a memcopy (whereas by contrast, if fields were pointers,
assigning them would be cheap... it's just that we would've had to pay
a (possibly _extra_) allocation cost elsewhere earlier.)
However, this is usually a worthy trade.
Linear memcpy in practice can be significantly cheaper than extra allocations
(especially if it's one long memcpy vs many allocations);
and if we assume a balance of use cases such as "unmarshal happens more often
than sub-tree-assignment", then it's pretty clear we should prioritize
allocation minimization for unmarshal rather than fret over sub-tree assignment.

### nodebuilders point to the concrete type

We generate NodeBuilder types which contain a pointer to the type they build.

This means we can hold onto the Node pointer when its building is completed,
and discard the NodeBuilder.  (Or, reset and reuse the NodeBuilder.)
Garbage collection can apply to the NodeBuilder independently of the lifespan
of the Node it built.

This means a single NodeBuilder and its produced Node will require
**two** allocations -- one for the NodeBuilder, and a separate one for the Node.

(An alternative would be to embed the concrete Node value in the NodeBuilder,
and return a pointer to it when finalizing the creation of the Node;
however, due to the garbage collection semantics around
[internal pointers](#internal-pointers), such a design would cause the entirety
of the memory needed in the NodeBuilder to remain uncollectable as long as the
completed Node is reachable!  This would be an unfortunate trade.)

While we pay two allocations for the Node and its Builder, we earn that back
in spades via our approach to recursion with
[NodeAssemblers](#nodeassemblers-accumulate-mutations), and specifically, how
[NodeAssemblers embed more NodeAssemblers](#nodeassemblers-embed-nodeassemblers).
Long story short: we pay two allocations, yes.  But it's *fixed* at two,
no matter how large and complex the structure is.

### nodeassemblers accumulate mutations

The NodeBuilder type is only used at the root of construction of a value.
After that, recursion works with an interface called NodeAssembler instead.

A NodeAssembler is essentially the same thing as a NodeBuilder, except
_it doesn't return a Node_.

This means we can use the NodeAssembler interface to describe constructing
the data in the middle of some complex value, and we're not burdened by the
need to be able to return the finished product.  (Sufficient state-keeping
and defensive checks to ensure we don't leak mutable references would not
come for free; reducing the number of points where we might need to do this
makes it possible to create a more efficient system overall.)

The documentation on the datamodel.NodeAssembler type gives some general
description of this.

NodeBuilder types end up being just a NodeAssembler embed, plus a few methods
for exposing the final results and optionally resetting the whole system.

### nodeassemblers embed nodeassemblers

In addition to each NodeAssembler containing a pointer to the value it modifies
(the same as [NodeBuilders](#nodebuilders-point-to-the-concrete-type))...
assemblers that work with recursive structures also embed another
NodeAssembler for each of their child values.

This lets us amortize the allocations for all the *assemblers* in the same way
as embedding in the actual value structs let us amortize allocations there.

The code for this gets a little complex, and the result also carries several
additional limitations on usage, but it does keep the allocations finite,
and thus makes the overall performance fast.

(To be more specific, for recursive types that are infinite (namely, maps and
lists; whereas structs and unions are finite), the NodeAssembler embeds
*one* NodeAssembler for all values.  (Obviously, we can't embed an infinite
number of them, right?)  This leads to a restriction: you can't assemble
multiple children of an infinite recursive value simultaneously.)
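A toy sketch of both shapes -- struct-like assemblers embedding one child assembler per field, and list-like assemblers embedding a single reused child.  All names are invented for illustration; the real generated code is far more involved:

```go
package main

import "fmt"

type intAssembler struct{ target *int }

func (a *intAssembler) AssignInt(v int) { *a.target = v }

// A struct type with two fields: its assembler embeds one child assembler
// per field, so the whole assembler tree is a single allocation.
type Pair struct{ X, Y int }

type pairAssembler struct {
	target *Pair
	xa, ya intAssembler // embedded, not pointers: no extra allocations
}

func newPairAssembler(p *Pair) *pairAssembler {
	a := &pairAssembler{target: p}
	a.xa.target = &p.X // internal pointers into the one Pair allocation
	a.ya.target = &p.Y
	return a
}

// A list assembler embeds just *one* child assembler and re-points it at
// each element in turn -- hence the "one child at a time" restriction.
type intListAssembler struct {
	target *[]int
	va     intAssembler
}

func (a *intListAssembler) AssembleValue() *intAssembler {
	*a.target = append(*a.target, 0)
	a.va.target = &(*a.target)[len(*a.target)-1]
	return &a.va
}

func main() {
	var p Pair
	pa := newPairAssembler(&p)
	pa.xa.AssignInt(1)
	pa.ya.AssignInt(2)

	var l []int
	la := &intListAssembler{target: &l}
	la.AssembleValue().AssignInt(7)
	la.AssembleValue().AssignInt(8)

	fmt.Println(p, l)
}
```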

### nullable and optional struct fields embed too

TODO intro

There is some chance of over-allocation in the event of nullable or optional
fields.  We support tuning that via adjunct configuration to the code generator
which allows you to opt in to using pointers for fields; choosing to do this
will of course cause you to lose out on alloc amortization features in exchange.

TODO also resolve the loops note, at bottom

### unexported implementations, exported aliases

Our concrete types are unexported.  For those that need to be exported,
we export an alias to the pointer type.

This has an interesting set of effects:

- copy-by-value from outside the package becomes impossible;
- creating zero values from outside the package becomes impossible;
- and yet referring to the type for type assertions remains possible.

This addresses one downside to using [concrete implementations](#concrete-implementations):
if the concrete implementation is an exported symbol, it means any code external
to the package can produce Golang's natural "zero" for the type.
This is problematic because it's true even if the Golang "zero" value for the
type doesn't correspond to a valid value.
While keeping an unexported implementation and an exported interface makes
external fabrication of zero values impossible, it breaks inlining.
Exporting an alias of the pointer type, however, strikes both goals at once:
external fabrication of zero values is blocked, and yet inlining works.
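The pattern looks like this (names invented; and since a single-file sketch can't actually demonstrate cross-package visibility, the "outsiders can't" claims hold for code importing such a package):

```go
package main

import "fmt"

// thing is the unexported concrete implementation.
type thing struct{ x int }

func (t *thing) X() int { return t.x }

// Thing is the exported alias of the *pointer* type: code outside the
// package can name it (for variables, type assertions, etc), but cannot
// construct a zero thing{} or copy one by value.
type Thing = *thing

// New is the only way external code obtains a Thing, so every Thing
// is a validly-constructed value.
func New(x int) Thing { return &thing{x: x} }

func main() {
	var t Thing = New(7)
	fmt.Println(t.X())
}
```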



Amusing Details and Edge Cases
------------------------------

### looped references

// whose job is it to detect this?
// the schema validator should check it...
// but something that breaks the cycle *there* doesn't necessarily do so for the emitted code!  aggh!
//  ... unless we go back to optional and nullable both making ptrs unconditionally.



Learning more (the hard way)
----------------------------

If this document doesn't provide enough information for you,
you've probably graduated to the point where doing experiments is next.  :)

Prototypes and research examples can be found in the
`go-ipld-prime/_rsrch/` directories.
In particular, the "multihoisting" and "nodeassembler" packages are relevant,
containing research that led to the drafting of this doc,
as well as some partially-worked alternative interface drafts.
(You may have to search back through git history to find these directories;
they're removed after some time, when the lessons have been applied.)

Tests there include some benchmarks (self-explanatory);
some tests based on runtime memory stats inspection;
and some tests which are simply meant to be disassembled and read thusly.

Compiler flags can provide useful insights:

- `-gcflags '-S'` -- gives you an assembly dump.
	- read this to see for sure what's inlined and not.
	- easy to quickly skim for calls like `runtime.newobject`, etc.
	- often critically useful to ensure a benchmark hasn't optimized out the question you meant to ask it!
	- generally gives a ground truth which puts an end to guessing.
- `-gcflags '-m -m'` -- reports escape analysis and other decisions.
	- note the two m's -- not a typo: this gives you info in stack form,
	  which is radically more informative than the single-m output.
- `-gcflags '-l'` -- disables inlining!
	- useful on benchmarks to quickly detect whether inlining is a major part of performance.

These flags can apply to any command like `go install`... as well as `go test`.

Profiling information collected from live systems in use is of course always
intensely useful... if you have any on hand.  When handling this, be aware of
how data-dependent performance can be when handling serialization systems:
different workload content can very much lead to different hot spots.

Happy hunting.