     1  # Go internal ABI specification
     2  
     3  Self-link: [go.dev/s/regabi](https://go.dev/s/regabi)
     4  
     5  This document describes Go’s internal application binary interface
     6  (ABI), known as ABIInternal.
     7  Go's ABI defines the layout of data in memory and the conventions for
     8  calling between Go functions.
     9  This ABI is *unstable* and will change between Go versions.
    10  If you’re writing assembly code, please instead refer to Go’s
    11  [assembly documentation](/doc/asm.html), which describes Go’s stable
    12  ABI, known as ABI0.
    13  
    14  All functions defined in Go source follow ABIInternal.
    15  However, ABIInternal and ABI0 functions are able to call each other
    16  through transparent *ABI wrappers*, described in the [internal calling
    17  convention proposal](https://golang.org/design/27539-internal-abi).
    18  
    19  Go uses a common ABI design across all architectures.
    20  We first describe the common ABI, and then cover per-architecture
    21  specifics.
    22  
    23  *Rationale*: For the reasoning behind using a common ABI across
    24  architectures instead of the platform ABI, see the [register-based Go
    25  calling convention proposal](https://golang.org/design/40724-register-calling).
    26  
    27  ## Memory layout
    28  
    29  Go's built-in types have the following sizes and alignments.
    30  Many, though not all, of these sizes are guaranteed by the [language
    31  specification](/doc/go_spec.html#Size_and_alignment_guarantees).
    32  Those that aren't guaranteed may change in future versions of Go (for
    33  example, we've considered changing the alignment of int64 on 32-bit architectures).
    34  
    35  | Type                        | 64-bit |       | 32-bit |       |
    36  |-----------------------------|--------|-------|--------|-------|
    37  |                             | Size   | Align | Size   | Align |
    38  | bool, uint8, int8           | 1      | 1     | 1      | 1     |
    39  | uint16, int16               | 2      | 2     | 2      | 2     |
    40  | uint32, int32               | 4      | 4     | 4      | 4     |
    41  | uint64, int64               | 8      | 8     | 8      | 4     |
    42  | int, uint                   | 8      | 8     | 4      | 4     |
    43  | float32                     | 4      | 4     | 4      | 4     |
    44  | float64                     | 8      | 8     | 8      | 4     |
    45  | complex64                   | 8      | 4     | 8      | 4     |
    46  | complex128                  | 16     | 8     | 16     | 4     |
    47  | uintptr, *T, unsafe.Pointer | 8      | 8     | 4      | 4     |
    48  
    49  The types `byte` and `rune` are aliases for `uint8` and `int32`,
    50  respectively, and hence have the same size and alignment as these
    51  types.
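
The table can be spot-checked on whatever platform the program is built for. The following is a minimal sketch; the `row` type and `builtinRows` helper are names invented for this illustration, and the printed values for `int64` alignment and pointer size depend on the target architecture.

```go
package main

import (
	"fmt"
	"unsafe"
)

// row records the size and alignment the compiler reports for one type.
type row struct {
	name        string
	size, align uintptr
}

// builtinRows queries the compiler for a few of the built-in types.
func builtinRows() []row {
	return []row{
		{"bool", unsafe.Sizeof(bool(false)), unsafe.Alignof(bool(false))},
		{"uint16", unsafe.Sizeof(uint16(0)), unsafe.Alignof(uint16(0))},
		{"int32", unsafe.Sizeof(int32(0)), unsafe.Alignof(int32(0))},
		{"int64", unsafe.Sizeof(int64(0)), unsafe.Alignof(int64(0))},
		{"complex64", unsafe.Sizeof(complex64(0)), unsafe.Alignof(complex64(0))},
		{"complex128", unsafe.Sizeof(complex128(0)), unsafe.Alignof(complex128(0))},
		{"*int", unsafe.Sizeof((*int)(nil)), unsafe.Alignof((*int)(nil))},
	}
}

func main() {
	for _, r := range builtinRows() {
		fmt.Printf("%-10s size=%d align=%d\n", r.name, r.size, r.align)
	}
}
```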
    52  
    53  The layout of `map`, `chan`, and `func` types is equivalent to `*T`.
    54  
    55  To describe the layout of the remaining composite types, we first
    56  define the layout of a *sequence* S of N fields with types
    57  t<sub>1</sub>, t<sub>2</sub>, ..., t<sub>N</sub>.
    58  We define the byte offset at which each field begins relative to a
    59  base address of 0, as well as the size and alignment of the sequence
    60  as follows:
    61  
    62  ```
    63  offset(S, i) = 0  if i = 1
    64               = align(offset(S, i-1) + sizeof(t_(i-1)), alignof(t_i))
    65  alignof(S)   = 1  if N = 0
    66               = max(alignof(t_i) | 1 <= i <= N)
    67  sizeof(S)    = 0  if N = 0
    68               = align(offset(S, N) + sizeof(t_N), alignof(S))
    69  ```
    70  
    71  Where sizeof(T) and alignof(T) are the size and alignment of type T,
    72  respectively, and align(x, y) rounds x up to a multiple of y.
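
The three definitions above translate directly into a short loop. This is an illustrative sketch, not the compiler's implementation; the `field` type and the `alignUp` and `layout` helpers are hypothetical names.

```go
package main

import "fmt"

// field describes one element of a sequence by its size and alignment.
type field struct{ size, align uintptr }

// alignUp rounds x up to a multiple of y (y must be positive).
func alignUp(x, y uintptr) uintptr { return (x + y - 1) / y * y }

// layout returns the offset of every field together with the size and
// alignment of the whole sequence, following offset, alignof, and sizeof.
func layout(fields []field) (offsets []uintptr, size, align uintptr) {
	align = 1       // alignof(S) = 1 if N = 0
	var end uintptr // offset(S, i-1) + sizeof(t_(i-1)); 0 before the first field
	for _, f := range fields {
		off := alignUp(end, f.align)
		offsets = append(offsets, off)
		end = off + f.size
		if f.align > align {
			align = f.align
		}
	}
	return offsets, alignUp(end, align), align
}

func main() {
	// uint8, uint64, uint16 with the 64-bit sizes and alignments from the table.
	offsets, size, align := layout([]field{{1, 1}, {8, 8}, {2, 2}})
	fmt.Println(offsets, size, align) // [0 8 16] 24 8
}
```

The uint64 field is pushed from offset 1 to offset 8 by its alignment, and the sequence is padded from 18 to 24 bytes so that its size is a multiple of its alignment.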
    73  
    74  The `interface{}` type is a sequence of 1. a pointer to the runtime type
    75  description for the interface's dynamic type and 2. an `unsafe.Pointer`
    76  data field.
    77  Any other interface type (besides the empty interface) is a sequence
    78  of 1. a pointer to the runtime "itab" that gives the method pointers and
    79  the type of the data field and 2. an `unsafe.Pointer` data field.
    80  An interface can be "direct" or "indirect" depending on the dynamic
    81  type: a direct interface stores the value directly in the data field,
    82  and an indirect interface stores a pointer to the value in the data
    83  field.
    84  An interface can only be direct if the value consists of a single
    85  pointer word.
    86  
    87  An array type `[N]T` is a sequence of N fields of type T.
    88  
    89  The slice type `[]T` is a sequence of a `*[cap]T` pointer to the slice
    90  backing store, an `int` giving the `len` of the slice, and an `int`
    91  giving the `cap` of the slice.
    92  
    93  The `string` type is a sequence of a `*[len]byte` pointer to the
    94  string backing store, and an `int` giving the `len` of the string.
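
As a quick check of the interface, slice, and string layouts described above, the sizes of these headers can be measured in pointer-sized words. This is a sketch only; `headerWords` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"unsafe"
)

// headerWords returns the sizes of a string, a slice, and an empty
// interface, each measured in pointer-sized words.
func headerWords() (str, slice, iface uintptr) {
	word := unsafe.Sizeof(uintptr(0))
	var (
		s string
		b []byte
		e interface{}
	)
	return unsafe.Sizeof(s) / word, unsafe.Sizeof(b) / word, unsafe.Sizeof(e) / word
}

func main() {
	fmt.Println(headerWords()) // 2 3 2
}
```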
    95  
    96  A struct type `struct { f1 t1; ...; fM tM }` is laid out as the
    97  sequence t1, ..., tM, tP, where tP is either:
    98  
    99  - Type `byte` if sizeof(tM) = 0 and any of sizeof(t*i*) ≠ 0.
   100  - Empty (size 0 and align 1) otherwise.
   101  
   102  The padding byte prevents creating a past-the-end pointer by taking
   103  the address of the final, empty fM field.
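
The effect of the padding byte is visible through `unsafe.Sizeof`. The struct types below are made up for this illustration, and the commented sizes are what we expect from the gc toolchain following the rule above; other implementations may differ.

```go
package main

import (
	"fmt"
	"unsafe"
)

// trailingEmpty ends in a zero-sized field, so a padding byte is appended
// and the size is rounded up to the struct's alignment.
type trailingEmpty struct {
	x uint32
	z struct{}
}

// leadingEmpty has its zero-sized field first, so no padding byte is needed.
type leadingEmpty struct {
	z struct{}
	x uint32
}

// allEmpty has only zero-sized fields, so it stays empty.
type allEmpty struct {
	a struct{}
	b struct{}
}

// sizes reports the sizes of the three example structs in declaration order.
func sizes() [3]uintptr {
	return [3]uintptr{
		unsafe.Sizeof(trailingEmpty{}),
		unsafe.Sizeof(leadingEmpty{}),
		unsafe.Sizeof(allEmpty{}),
	}
}

func main() {
	fmt.Println(sizes()) // [8 4 0]
}
```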
   104  
   105  Note that user-written assembly code should generally not depend on Go
   106  type layout and should instead use the constants defined in
   107  [`go_asm.h`](/doc/asm.html#data-offsets).
   108  
   109  ## Function call argument and result passing
   110  
   111  Function calls pass arguments and results using a combination of the
   112  stack and machine registers.
   113  Each argument or result is passed either entirely in registers or
   114  entirely on the stack.
   115  Because access to registers is generally faster than access to the
   116  stack, arguments and results are preferentially passed in registers.
   117  However, any argument or result that contains a non-trivial array or
   118  does not fit entirely in the remaining available registers is passed
   119  on the stack.
   120  
   121  Each architecture defines a sequence of integer registers and a
   122  sequence of floating-point registers.
   123  At a high level, arguments and results are recursively broken down
   124  into values of base types and these base values are assigned to
   125  registers from these sequences.
   126  
   127  Arguments and results can share the same registers, but do not share
   128  the same stack space.
   129  Beyond the arguments and results passed on the stack, the caller also
   130  reserves spill space on the stack for all register-based arguments
   131  (but does not populate this space).
   132  
   133  The receiver, arguments, and results of function or method F are
   134  assigned to registers or the stack using the following algorithm:
   135  
   136  1. Let NI and NFP be the length of integer and floating-point register
   137     sequences defined by the architecture.
   138     Let I and FP be 0; these are the indexes of the next integer and
   139     floating-point register.
   140     Let S, the type sequence defining the stack frame, be empty.
   141  1. If F is a method, assign F’s receiver.
   142  1. For each argument A of F, assign A.
   143  1. Add a pointer-alignment field to S. This has size 0 and the same
   144     alignment as `uintptr`.
   145  1. Reset I and FP to 0.
   146  1. For each result R of F, assign R.
   147  1. Add a pointer-alignment field to S.
   148  1. For each register-assigned receiver and argument of F, let T be its
   149     type and add T to the stack sequence S.
   150     This is the argument's (or receiver's) spill space and will be
   151     uninitialized at the call.
   152  1. Add a pointer-alignment field to S.
   153  
   154  Assigning a receiver, argument, or result V of underlying type T works
   155  as follows:
   156  
   157  1. Remember I and FP.
   158  1. If T has zero size, add T to the stack sequence S and return.
   159  1. Try to register-assign V.
   160  1. If step 3 failed, reset I and FP to the values from step 1, add T
   161     to the stack sequence S, and assign V to this field in S.
   162  
   163  Register-assignment of a value V of underlying type T works as follows:
   164  
   165  1. If T is a boolean or integral type that fits in an integer
   166     register, assign V to register I and increment I.
   167  1. If T is an integral type that fits in two integer registers, assign
   168     the least significant and most significant halves of V to registers
   169     I and I+1, respectively, and increment I by 2.
   170  1. If T is a floating-point type and can be represented without loss
   171     of precision in a floating-point register, assign V to register FP
   172     and increment FP.
   173  1. If T is a complex type, recursively register-assign its real and
   174     imaginary parts.
   175  1. If T is a pointer type, map type, channel type, or function type,
   176     assign V to register I and increment I.
   177  1. If T is a string type, interface type, or slice type, recursively
   178     register-assign V’s components (2 for strings and interfaces, 3 for
   179     slices).
   180  1. If T is a struct type, recursively register-assign each field of V.
   181  1. If T is an array type of length 0, do nothing.
   182  1. If T is an array type of length 1, recursively register-assign its
   183     one element.
   184  1. If T is an array type of length > 1, fail.
   185  1. If I > NI or FP > NFP, fail.
   186  1. If any recursive assignment above fails, fail.
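
The two assignment procedures above can be sketched over a simplified type model. Everything here is hypothetical and for illustration only: `typ` collapses booleans, integers, and pointer-shaped values into one kind, models strings, slices, interfaces, complex values, and two-register integers as structs of their components, and only reports register versus stack rather than building the stack sequence S.

```go
package main

import "fmt"

type kind int

const (
	kindInt    kind = iota // bool, integer, or pointer-shaped value filling one integer register
	kindFloat              // floating-point value filling one floating-point register
	kindStruct             // struct, or a built-in modelled as its components
	kindArray
)

// typ is a hypothetical, heavily simplified type description.
type typ struct {
	kind   kind
	fields []typ // kindStruct: field types in order
	elem   *typ  // kindArray: element type
	n      int   // kindArray: length
}

// zeroSize reports whether a value of type t occupies no bytes.
func (t typ) zeroSize() bool {
	switch t.kind {
	case kindStruct:
		for _, f := range t.fields {
			if !f.zeroSize() {
				return false
			}
		}
		return true
	case kindArray:
		return t.n == 0 || t.elem.zeroSize()
	}
	return false
}

// state holds the next free register indexes (i, fp) and the limits NI and NFP.
type state struct{ i, fp, ni, nfp int }

// regAssign tries to register-assign one value, reporting false on failure.
func (s *state) regAssign(t typ) bool {
	switch t.kind {
	case kindInt:
		s.i++
	case kindFloat:
		s.fp++
	case kindStruct:
		for _, f := range t.fields {
			if !s.regAssign(f) {
				return false
			}
		}
	case kindArray:
		if t.n > 1 || (t.n == 1 && !s.regAssign(*t.elem)) {
			return false
		}
	}
	return s.i <= s.ni && s.fp <= s.nfp
}

// assign reports whether the value lands in registers (true) or on the stack (false).
func (s *state) assign(t typ) bool {
	saved := *s
	if !t.zeroSize() && s.regAssign(t) {
		return true
	}
	*s = saved // undo any partial register assignment
	return false
}

// classify assigns the arguments, resets I and FP, then assigns the results.
func classify(ni, nfp int, args, results []typ) (inRegs, outRegs []bool) {
	s := &state{ni: ni, nfp: nfp}
	for _, a := range args {
		inRegs = append(inRegs, s.assign(a))
	}
	s.i, s.fp = 0, 0
	for _, r := range results {
		outRegs = append(outRegs, s.assign(r))
	}
	return inRegs, outRegs
}

func main() {
	word := typ{kind: kindInt}
	pair := typ{kind: kindArray, elem: &word, n: 2}
	str := typ{kind: kindStruct, fields: []typ{word, word}} // pointer + length
	res := typ{kind: kindStruct, fields: []typ{word, pair}}
	// Arguments (uint8, [2]uintptr, uint8), results (struct{uintptr; [2]uintptr}, string).
	fmt.Println(classify(10, 0, []typ{word, pair, word}, []typ{res, str}))
	// [true false true] [false true]
}
```

Note how the struct result first consumes an integer register for its leading field and then gives it back when its array field fails, which is the purpose of remembering I and FP before each attempt.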
   187  
   188  The above algorithm produces an assignment of each receiver, argument,
   189  and result to registers or to a field in the stack sequence.
   190  The final stack sequence looks like: stack-assigned receiver,
   191  stack-assigned arguments, pointer-alignment, stack-assigned results,
   192  pointer-alignment, spill space for each register-assigned argument,
   193  pointer-alignment.
   194  The following diagram shows what this stack frame looks like on the
   195  stack, using the typical convention where address 0 is at the bottom:
   196  
   197      +------------------------------+
   198      |             . . .            |
   199      | 2nd reg argument spill space |
   200      | 1st reg argument spill space |
   201      | <pointer-sized alignment>    |
   202      |             . . .            |
   203      | 2nd stack-assigned result    |
   204      | 1st stack-assigned result    |
   205      | <pointer-sized alignment>    |
   206      |             . . .            |
   207      | 2nd stack-assigned argument  |
   208      | 1st stack-assigned argument  |
   209      | stack-assigned receiver      |
   210      +------------------------------+ ↓ lower addresses
   211  
   212  To perform a call, the caller reserves space starting at the lowest
   213  address in its stack frame for the call stack frame, stores arguments
   214  in the registers and argument stack fields determined by the above
   215  algorithm, and performs the call.
   216  At the time of a call, spill space, result stack fields, and result
   217  registers are left uninitialized.
   218  Upon return, the callee must have stored results to all result
   219  registers and result stack fields determined by the above algorithm.
   220  
   221  There are no callee-save registers, so a call may overwrite any
   222  register that doesn’t have a fixed meaning, including argument
   223  registers.
   224  
   225  ### Example
   226  
   227  Consider the function `func f(a1 uint8, a2 [2]uintptr, a3 uint8) (r1
   228  struct { x uintptr; y [2]uintptr }, r2 string)` on a 64-bit
   229  architecture with hypothetical integer registers R0–R9.
   230  
   231  On entry, `a1` is assigned to `R0`, `a3` is assigned to `R1` and the
   232  stack frame is laid out in the following sequence:
   233  
   234      a2      [2]uintptr
   235      r1.x    uintptr
   236      r1.y    [2]uintptr
   237      a1Spill uint8
   238      a3Spill uint8
   239      _       [6]uint8  // alignment padding
   240  
   241  In the stack frame, only the `a2` field is initialized on entry; the
   242  rest of the frame is left uninitialized.
   243  
   244  On exit, `r2.base` is assigned to `R0`, `r2.len` is assigned to `R1`,
   245  and `r1.x` and `r1.y` are initialized in the stack frame.
   246  
   247  There are several things to note in this example.
   248  First, `a2` and `r1` are stack-assigned because they contain arrays.
   249  The other arguments and results are register-assigned.
   250  Result `r2` is decomposed into its components, which are individually
   251  register-assigned.
   252  On the stack, the stack-assigned arguments appear at lower addresses
   253  than the stack-assigned results, which appear at lower addresses than
   254  the argument spill area.
   255  Only arguments, not results, are assigned a spill area on the stack.
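
The byte offsets implied by this stack sequence can be reproduced by mirroring it as a Go struct. This is only a sketch: the `frame` type and its helpers are hypothetical, `uint64` stands in for `uintptr` so the numbers match a 64-bit target on any host, and the zero-sized pointer-alignment fields are omitted because they add no padding in this particular frame.

```go
package main

import (
	"fmt"
	"unsafe"
)

// frame mirrors the stack sequence of the example function f.
type frame struct {
	a2      [2]uint64 // stack-assigned argument
	r1x     uint64    // stack-assigned result r1.x
	r1y     [2]uint64 // stack-assigned result r1.y
	a1Spill uint8     // spill space for register-assigned a1
	a3Spill uint8     // spill space for register-assigned a3
	_       [6]uint8  // alignment padding
}

// offsets returns the byte offset of each named field in the frame.
func offsets() [5]uintptr {
	var f frame
	return [5]uintptr{
		unsafe.Offsetof(f.a2),
		unsafe.Offsetof(f.r1x),
		unsafe.Offsetof(f.r1y),
		unsafe.Offsetof(f.a1Spill),
		unsafe.Offsetof(f.a3Spill),
	}
}

// frameSize returns the total size of the call's stack frame in bytes.
func frameSize() uintptr { return unsafe.Sizeof(frame{}) }

func main() {
	fmt.Println(offsets(), frameSize()) // [0 16 24 40 41] 48
}
```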
   256  
   257  ### Rationale
   258  
   259  Each base value is assigned to its own register to optimize
   260  construction and access.
   261  An alternative would be to pack multiple sub-word values into
   262  registers, or to simply map an argument's in-memory layout to
   263  registers (this is common in C ABIs), but this typically adds cost to
   264  pack and unpack these values.
   265  Modern architectures have more than enough registers to pass all
   266  arguments and results this way for nearly all functions (see the
   267  appendix), so there’s little downside to spreading base values across
   268  registers.
   269  
   270  Arguments that can’t be fully assigned to registers are passed
   271  entirely on the stack in case the callee takes the address of that
   272  argument.
   273  If an argument could be split across the stack and registers and the
   274  callee took its address, it would need to be reconstructed in memory,
   275  a process that would be proportional to the size of the argument.
   276  
   277  Non-trivial arrays are always passed on the stack because indexing
   278  into an array typically requires a computed offset, which generally
   279  isn’t possible with registers.
   280  Arrays in general are rare in function signatures (only 0.7% of
   281  functions in the Go 1.15 standard library and 0.2% in kubelet).
   282  We considered allowing array fields to be passed on the stack while
   283  the rest of an argument’s fields are passed in registers, but this
   284  creates the same problems as other large structs if the callee takes
   285  the address of an argument, and would benefit <0.1% of functions in
   286  kubelet (and even these very little).
   287  
   288  We make exceptions for 0- and 1-element arrays because these don’t
   289  require computed offsets, and 1-element arrays are already decomposed
   290  in the compiler’s SSA representation.
   291  
   292  The ABI assignment algorithm above is equivalent to Go’s stack-based
   293  ABI0 calling convention if there are zero architecture registers.
   294  This is intended to ease the transition to the register-based internal
   295  ABI and make it easy for the compiler to generate either calling
   296  convention.
   297  An architecture may still define register meanings that aren’t
   298  compatible with ABI0, but these differences should be easy to account
   299  for in the compiler.
   300  
   301  The assignment algorithm assigns zero-sized values to the stack
   302  (assignment step 2) in order to support ABI0-equivalence.
   303  While these values take no space themselves, they do result in
   304  alignment padding on the stack in ABI0.
   305  Without this step, the internal ABI would register-assign zero-sized
   306  values even on architectures that provide no argument registers
   307  because they don't consume any registers, and hence not add alignment
   308  padding to the stack.
   309  
   310  The algorithm reserves spill space for arguments in the caller’s frame
   311  so that the compiler can generate a stack growth path that spills into
   312  this reserved space.
   313  If the callee has to grow the stack, it may not be able to reserve
   314  enough additional stack space in its own frame to spill these, which
   315  is why it’s important that the caller do so.
   316  These slots also act as the home location if these arguments need to
   317  be spilled for any other reason, which simplifies traceback printing.
   318  
   319  There are several options for how to lay out the argument spill space.
   320  We chose to lay out each argument according to its type's usual memory
   321  layout but to separate the spill space from the regular argument
   322  space.
   323  Using the usual memory layout simplifies the compiler because it
   324  already understands this layout.
   325  Also, if a function takes the address of a register-assigned argument,
   326  the compiler must spill that argument to memory in its usual memory
   327  layout and it's more convenient to use the argument spill space for
   328  this purpose.
   329  
   330  Alternatively, the spill space could be structured around argument
   331  registers.
   332  In this approach, the stack growth spill path would spill each
   333  argument register to a register-sized stack word.
   334  However, if the function takes the address of a register-assigned
   335  argument, the compiler would have to reconstruct it in memory layout
   336  elsewhere on the stack.
   337  
   338  The spill space could also be interleaved with the stack-assigned
   339  arguments so the arguments appear in order whether they are register-
   340  or stack-assigned.
   341  This would be close to ABI0, except that register-assigned arguments
   342  would be uninitialized on the stack and there's no need to reserve
   343  stack space for register-assigned results.
   344  We expect separating the spill space to perform better because of
   345  memory locality.
   346  Separating the space is also potentially simpler for `reflect` calls
   347  because this allows `reflect` to summarize the spill space as a single
   348  number.
   349  Finally, the long-term intent is to remove reserved spill slots
   350  entirely – allowing most functions to be called without any stack
   351  setup and easing the introduction of callee-save registers – and
   352  separating the spill space makes that transition easier.
   353  
   354  ## Closures
   355  
   356  A func value (e.g., `var x func()`) is a pointer to a closure object.
   357  A closure object begins with a pointer-sized program counter
   358  representing the entry point of the function, followed by zero or more
   359  bytes containing the closed-over environment.
   360  
   361  Closure calls follow the same conventions as static function and
   362  method calls, with one addition. Each architecture specifies a
   363  *closure context pointer* register and calls to closures store the
   364  address of the closure object in the closure context pointer register
   365  prior to the call.
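
The shape of a closure object and of a closure call can be modelled in ordinary Go. This is a conceptual sketch only: the `closure` type, `call`, and `addTo` are hypothetical, a Go func value stands in for the raw entry PC, and an explicit first parameter stands in for the closure context pointer register.

```go
package main

import "fmt"

// closure models a closure object: an entry point followed by the
// closed-over environment.
type closure struct {
	entry func(ctx *closure, delta int) int
	count int // closed-over variable
}

// call models a closure call: the address of the closure object travels
// alongside the ordinary arguments, as the context register would.
func call(c *closure, delta int) int { return c.entry(c, delta) }

// addTo is the function body; it reaches its environment through ctx.
func addTo(ctx *closure, delta int) int {
	ctx.count += delta
	return ctx.count
}

func main() {
	c := &closure{entry: addTo, count: 10}
	fmt.Println(call(c, 1), call(c, 2)) // 11 13
}
```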
   366  
   367  ## Software floating-point mode
   368  
   369  In "softfloat" mode, the ABI simply treats the hardware as having zero
   370  floating-point registers.
   371  As a result, any arguments containing floating-point values will be
   372  passed on the stack.
   373  
   374  *Rationale*: Softfloat mode is about compatibility over performance
   375  and is not commonly used.
   376  Hence, we keep the ABI as simple as possible in this case, rather than
   377  adding additional rules for passing floating-point values in integer
   378  registers.
   379  
   380  ## Architecture specifics
   381  
   382  This section describes per-architecture register mappings, as well as
   383  other per-architecture special cases.
   384  
   385  ### amd64 architecture
   386  
   387  The amd64 architecture uses the following sequence of 9 registers for
   388  integer arguments and results:
   389  
   390      RAX, RBX, RCX, RDI, RSI, R8, R9, R10, R11
   391  
   392  It uses X0 – X14 for floating-point arguments and results.
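
To make the two sequences concrete, the sketch below maps register indexes I and FP to amd64 register names for a hypothetical signature `func f(n int, s string, x float64)`. The helper names are invented for this illustration.

```go
package main

import "fmt"

// amd64IntRegs is the integer argument and result sequence listed above.
var amd64IntRegs = []string{"RAX", "RBX", "RCX", "RDI", "RSI", "R8", "R9", "R10", "R11"}

// intReg names the integer register with index i, or "" once the
// sequence is exhausted.
func intReg(i int) string {
	if i < 0 || i >= len(amd64IntRegs) {
		return ""
	}
	return amd64IntRegs[i]
}

// floatReg names the floating-point register with index fp (X0 through X14).
func floatReg(fp int) string {
	if fp < 0 || fp > 14 {
		return ""
	}
	return fmt.Sprintf("X%d", fp)
}

func main() {
	// n takes I=0, the two words of s take I=1 and I=2, and x takes FP=0.
	fmt.Println("n      ->", intReg(0))
	fmt.Println("s.base ->", intReg(1))
	fmt.Println("s.len  ->", intReg(2))
	fmt.Println("x      ->", floatReg(0))
}
```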
   393  
   394  *Rationale*: These sequences are chosen from the available registers
   395  to be relatively easy to remember.
   396  
   397  Registers R12 and R13 are permanent scratch registers.
   398  R15 is a scratch register except in dynamically linked binaries.
   399  
   400  *Rationale*: Some operations such as stack growth and reflection calls
   401  need dedicated scratch registers in order to manipulate call frames
   402  without corrupting arguments or results.
   403  
   404  Special-purpose registers are as follows:
   405  
   406  | Register | Call meaning | Return meaning | Body meaning |
   407  | --- | --- | --- | --- |
   408  | RSP | Stack pointer | Same | Same |
   409  | RBP | Frame pointer | Same | Same |
   410  | RDX | Closure context pointer | Scratch | Scratch |
   411  | R12 | Scratch | Scratch | Scratch |
   412  | R13 | Scratch | Scratch | Scratch |
   413  | R14 | Current goroutine | Same | Same |
   414  | R15 | GOT reference temporary if dynlink | Same | Same |
   415  | X15 | Zero value (*) | Same | Scratch |
   416  
   417  (*) Except on Plan 9, where X15 is a scratch register because SSE
   418  registers cannot be used in note handlers (so the compiler avoids
   419  using them except when absolutely necessary).
   420  
   421  *Rationale*: These register meanings are compatible with Go’s
   422  stack-based calling convention except for R14 and X15, which will have
   423  to be restored on transitions from ABI0 code to ABIInternal code.
   424  In ABI0, these are undefined, so transitions from ABIInternal to ABI0
   425  can ignore these registers.
   426  
   427  *Rationale*: For the current goroutine pointer, we chose a register
   428  that requires an additional REX byte.
   429  While this adds one byte to every function prologue, it is hardly ever
   430  accessed outside the function prologue and we expect making more
   431  single-byte registers available to be a net win.
   432  
   433  *Rationale*: We could allow R14 (the current goroutine pointer) to be
   434  a scratch register in function bodies because it can always be
   435  restored from TLS on amd64.
   436  However, we designate it as a fixed register for simplicity and for
   437  consistency with other architectures that may not have a copy of the
   438  current goroutine pointer in TLS.
   439  
   440  *Rationale*: We designate X15 as a fixed zero register because
   441  functions often have to bulk zero their stack frames, and this is more
   442  efficient with a designated zero register.
   443  
   444  *Implementation note*: Registers with fixed meaning at calls but not
   445  in function bodies must be initialized by "injected" calls such as
   446  signal-based panics.
   447  
   448  #### Stack layout
   449  
   450  The stack pointer, RSP, grows down and is always aligned to 8 bytes.
   451  
   452  The amd64 architecture does not use a link register.
   453  
   454  A function's stack frame is laid out as follows:
   455  
   456      +------------------------------+
   457      | return PC                    |
   458      | RBP on entry                 |
   459      | ... locals ...               |
   460      | ... outgoing arguments ...   |
   461      +------------------------------+ ↓ lower addresses
   462  
   463  The "return PC" is pushed as part of the standard amd64 `CALL`
   464  operation.
   465  On entry, a function subtracts from RSP to open its stack frame and
   466  saves the value of RBP directly below the return PC.
   467  A leaf function that does not require any stack space may omit the
   468  saved RBP.
   469  
   470  The Go ABI's use of RBP as a frame pointer register is compatible with
   471  amd64 platform conventions so that Go can inter-operate with platform
   472  debuggers and profilers.
   473  
   474  #### Flags
   475  
   476  The direction flag (D) is always cleared (set to the “forward”
   477  direction) at a call.
   478  The arithmetic status flags are treated like scratch registers and not
   479  preserved across calls.
   480  All other bits in RFLAGS are system flags.
   481  
   482  At function calls and returns, the CPU is in x87 mode (not MMX
   483  technology mode).
   484  
   485  *Rationale*: Go on amd64 does not use either the x87 registers or MMX
   486  registers. Hence, we follow the SysV platform conventions in order to
   487  simplify transitions to and from the C ABI.
   488  
   489  At calls, the MXCSR control bits are always set as follows:
   490  
   491  | Flag | Bit | Value | Meaning |
   492  | --- | --- | --- | --- |
   493  | FZ | 15 | 0 | Do not flush to zero |
   494  | RC | 14/13 | 0 (RN) | Round to nearest |
   495  | PM | 12 | 1 | Precision masked |
   496  | UM | 11 | 1 | Underflow masked |
   497  | OM | 10 | 1 | Overflow masked |
   498  | ZM | 9 | 1 | Divide-by-zero masked |
   499  | DM | 8 | 1 | Denormal operations masked |
   500  | IM | 7 | 1 | Invalid operations masked |
   501  | DAZ | 6 | 0 | Do not zero de-normals |
   502  
   503  The MXCSR status bits are callee-save.
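
Combining the control bits from the table gives a single register value. The sketch below computes it; the constant names are chosen for this illustration.

```go
package main

import "fmt"

// Exception mask bits of the MXCSR, using the bit positions from the table.
const (
	mxcsrIM = 1 << 7  // invalid operations masked
	mxcsrDM = 1 << 8  // denormal operations masked
	mxcsrZM = 1 << 9  // divide-by-zero masked
	mxcsrOM = 1 << 10 // overflow masked
	mxcsrUM = 1 << 11 // underflow masked
	mxcsrPM = 1 << 12 // precision masked
)

// goMXCSRControl is the control configuration at calls: all six exceptions
// masked, with RC, FZ, and DAZ left at zero.
const goMXCSRControl uint32 = mxcsrIM | mxcsrDM | mxcsrZM | mxcsrOM | mxcsrUM | mxcsrPM

func main() {
	fmt.Printf("%#x\n", goMXCSRControl) // 0x1f80
}
```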
   504  
   505  *Rationale*: Having a fixed MXCSR control configuration allows Go
   506  functions to use SSE operations without modifying or saving the MXCSR.
   507  Functions are allowed to modify it between calls (as long as they
   508  restore it), but as of this writing Go code never does.
   509  The above fixed configuration matches the process initialization
   510  control bits specified by the ELF AMD64 ABI.
   511  
   512  The x87 floating-point control word is not used by Go on amd64.
   513  
   514  ### arm64 architecture
   515  
   516  The arm64 architecture uses R0 – R15 for integer arguments and results.
   517  
   518  It uses F0 – F15 for floating-point arguments and results.
   519  
   520  *Rationale*: 16 integer registers and 16 floating-point registers are
   521  more than enough for passing arguments and results for practically all
   522  functions (see Appendix). While there are more registers available,
   523  using more registers provides little benefit. Additionally, it will add
   524  overhead on code paths where the number of arguments is not statically
   525  known (e.g. reflect call), and will consume more stack space when there
   526  is only limited stack space available to fit in the nosplit limit.
   527  
   528  Registers R16 and R17 are permanent scratch registers. They are also
   529  used as scratch registers by the linker (Go linker and external
   530  linker) in trampolines.
   531  
   532  Register R18 is reserved and never used. It is reserved for the OS
   533  on some platforms (e.g. macOS).
   534  
   535  Registers R19 – R25 are permanent scratch registers. In addition,
   536  R27 is a permanent scratch register used by the assembler when
   537  expanding instructions.
   538  
   539  Floating-point registers F16 – F31 are also permanent scratch
   540  registers.
   541  
   542  Special-purpose registers are as follows:
   543  
   544  | Register | Call meaning | Return meaning | Body meaning |
   545  | --- | --- | --- | --- |
   546  | RSP | Stack pointer | Same | Same |
   547  | R30 | Link register | Same | Scratch (non-leaf functions) |
   548  | R29 | Frame pointer | Same | Same |
   549  | R28 | Current goroutine | Same | Same |
   550  | R27 | Scratch | Scratch | Scratch |
   551  | R26 | Closure context pointer | Scratch | Scratch |
   552  | R18 | Reserved (not used) | Same | Same |
   553  | ZR  | Zero value | Same | Same |
   554  
   555  *Rationale*: These register meanings are compatible with Go’s
   556  stack-based calling convention.
   557  
   558  *Rationale*: The link register, R30, holds the function return
   559  address at the function entry. For functions that have frames
   560  (including most non-leaf functions), R30 is saved to stack in the
   561  function prologue and restored in the epilogue. Within the function
   562  body, R30 can be used as a scratch register.
   563  
   564  *Implementation note*: Registers with fixed meaning at calls but not
   565  in function bodies must be initialized by "injected" calls such as
   566  signal-based panics.
   567  
   568  #### Stack layout
   569  
   570  The stack pointer, RSP, grows down and is always aligned to 16 bytes.
   571  
   572  *Rationale*: The arm64 architecture requires the stack pointer to be
   573  16-byte aligned.
   574  
   575  A function's stack frame, after the frame is created, is laid out as
   576  follows:
   577  
   578      +------------------------------+
   579      | ... locals ...               |
   580      | ... outgoing arguments ...   |
   581      | return PC                    | ← RSP points to
   582      | frame pointer on entry       |
   583      +------------------------------+ ↓ lower addresses
   584  
   585  The "return PC" is loaded to the link register, R30, as part of the
   586  arm64 `CALL` operation.
   587  
   588  On entry, a function subtracts from RSP to open its stack frame, and
   589  saves the values of R30 and R29 at the bottom of the frame.
   590  Specifically, R30 is saved at 0(RSP) and R29 is saved at -8(RSP),
   591  after RSP is updated.
   592  
   593  A leaf function that does not require any stack space may omit the
   594  saved R30 and R29.
   595  
   596  The Go ABI's use of R29 as a frame pointer register is compatible with
   597  arm64 architecture requirements so that Go can inter-operate with platform
   598  debuggers and profilers.
   599  
   600  This stack layout is used by both register-based (ABIInternal) and
   601  stack-based (ABI0) calling conventions.
   602  
   603  #### Flags
   604  
   605  The arithmetic status flags (NZCV) are treated like scratch registers
   606  and not preserved across calls.
   607  All other bits in PSTATE are system flags and are not modified by Go.
   608  
   609  The floating-point status register (FPSR) is treated like a scratch
   610  register and is not preserved across calls.
   611  
   612  At calls, the floating-point control register (FPCR) bits are always
   613  set as follows:
   614  
   615  | Flag | Bit | Value | Meaning |
   616  | --- | --- | --- | --- |
   617  | DN  | 25 | 0 | Propagate NaN operands |
   618  | FZ  | 24 | 0 | Do not flush to zero |
   619  | RC  | 23/22 | 0 (RN) | Round to nearest, choose even if tied |
   620  | IDE | 15 | 0 | Denormal operations trap disabled |
   621  | IXE | 12 | 0 | Inexact trap disabled |
   622  | UFE | 11 | 0 | Underflow trap disabled |
   623  | OFE | 10 | 0 | Overflow trap disabled |
   624  | DZE | 9 | 0 | Divide-by-zero trap disabled |
   625  | IOE | 8 | 0 | Invalid operations trap disabled |
   626  | NEP | 2 | 0 | Scalar operations do not affect higher elements in vector registers |
   627  | AH  | 1 | 0 | No alternate handling of de-normal inputs |
   628  | FIZ | 0 | 0 | Do not zero de-normals |
   629  
   630  *Rationale*: Having a fixed FPCR control configuration allows Go
   631  functions to use floating-point and vector (SIMD) operations without
   632  modifying or saving the FPCR.
   633  Functions are allowed to modify it between calls (as long as they
   634  restore it), but as of this writing Go code never does.
   635  
   636  ### loong64 architecture
   637  
   638  The loong64 architecture uses R4 – R19 for integer arguments and results.
   639  
   640  It uses F0 – F15 for floating-point arguments and results.
   641  
   642  Registers R20 – R21, R23 – R28, R30 – R31, and F16 – F31 are permanent scratch registers.
   643  
   644  Register R2 is reserved and never used.
   645  
   646  Registers R20 and R21 are used by runtime.duffcopy and runtime.duffzero.
   647  
   648  Special-purpose registers used within Go generated code and Go assembly code
   649  are as follows:
   650  
   651  | Register | Call meaning | Return meaning | Body meaning |
   652  | --- | --- | --- | --- |
   653  | R0 | Zero value | Same | Same |
   654  | R1 | Link register | Link register | Scratch |
   655  | R3 | Stack pointer | Same | Same |
   656  | R20,R21 | Scratch | Scratch | Used by duffcopy, duffzero |
   657  | R22 | Current goroutine | Same | Same |
   658  | R29 | Closure context pointer | Same | Same |
   659  | R30, R31 | Used by the assembler | Same | Same |
   660  
   661  *Rationale*: These register meanings are compatible with Go’s stack-based
   662  calling convention.
   663  
   664  #### Stack layout
   665  
   666  The stack pointer, R3, grows down and is aligned to 8 bytes.
   667  
   668  A function's stack frame, after the frame is created, is laid out as
   669  follows:
   670  
   671      +------------------------------+
   672      | ... locals ...               |
   673      | ... outgoing arguments ...   |
   674      | return PC                    | ← R3 points to
   675      +------------------------------+ ↓ lower addresses
   676  
   677  This stack layout is used by both register-based (ABIInternal) and
   678  stack-based (ABI0) calling conventions.
   679  
   680  The "return PC" is loaded to the link register, R1, as part of the
   681  loong64 `JAL` operation.
   682  
   683  #### Flags
   684  All bits in CSR are system flags and are not modified by Go.
   685  
   686  ### ppc64 architecture
   687  
   688  The ppc64 architecture uses R3 – R10 and R14 – R17 for integer arguments
   689  and results.
   690  
   691  It uses F1 – F12 for floating-point arguments and results.
   692  
   693  Register R31 is a permanent scratch register in Go.
   694  
   695  Special-purpose registers used within Go generated code and Go
   696  assembly code are as follows:
   697  
   698  | Register | Call meaning | Return meaning | Body meaning |
   699  | --- | --- | --- | --- |
   700  | R0  | Zero value | Same | Same |
   701  | R1  | Stack pointer | Same | Same |
   702  | R2  | TOC register | Same | Same |
   703  | R11 | Closure context pointer | Scratch | Scratch |
   704  | R12 | Function address on indirect calls | Scratch | Scratch |
   705  | R13 | TLS pointer | Same | Same |
   706  | R20,R21 | Scratch | Scratch | Used by duffcopy, duffzero |
   707  | R30 | Current goroutine | Same | Same |
   708  | R31 | Scratch | Scratch | Scratch |
   709  | LR  | Link register | Link register | Scratch |
   710  *Rationale*: These register meanings are compatible with Go’s
   711  stack-based calling convention.
   712  
   713  The link register, LR, holds the function return
   714  address at the function entry and is set to the correct return
   715  address before exiting the function. It is also used
   716  in some cases as the function address when doing an indirect call.
   717  
   718  The register R2 contains the address of the TOC (table of contents) which
   719  contains data or code addresses used when generating position independent
   720  code. Non-Go code generated when using cgo contains TOC-relative addresses
   721  which depend on R2 holding a valid TOC. Go code compiled with -shared or
   722  -dynlink initializes and maintains R2 and uses it in some cases for
   723  function calls; Go code compiled without these options does not modify R2.
   724  
   725  When making a function call, R12 contains the function address, which the code
   726  at the beginning of the function uses to generate R2. R12 can be used for
   727  other purposes within the body of the function, such as trampoline generation.
   728  
   729  R20 and R21 are used by duffcopy and duffzero. Calls to these could be generated
   730  before arguments are saved, so R20 and R21 should not be used for register arguments.
   731  
   732  The Count register CTR can be used as the call target for some branch instructions.
   733  It holds the return address when preemption has occurred.
   734  
   735  On PPC64, a float32 becomes a float64 in the register when it is loaded. This
   736  differs from other platforms, and the internal implementation of reflection
   737  must account for it so that float32 arguments are passed correctly.
   738  
   739  Registers R18 – R29 and F13 – F31 are considered scratch registers.
   740  
   741  #### Stack layout
   742  
   743  The stack pointer, R1, grows down and is aligned to 8 bytes in Go, but the
   744  alignment changes to 16 bytes when calling cgo.
   745  
   746  A function's stack frame, after the frame is created, is laid out as
   747  follows:
   748  
   749      +------------------------------+
   750      | ... locals ...               |
   751      | ... outgoing arguments ...   |
   752      | 24  TOC register R2 save     | When compiled with -shared/-dynlink
   753      | 16  Unused in Go             | Not used in Go
   754      |  8  CR save                  | nonvolatile CR fields
   755      |  0  return PC                | ← R1 points to
   756      +------------------------------+ ↓ lower addresses
   757  
   758  The "return PC" is loaded to the link register, LR, as part of the
   759  ppc64 `BL` operations.
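
The fixed slots and the two alignment rules above can be sketched in Go as follows. The constant names and the `alignUp` helper are hypothetical, chosen only for illustration; the offsets are copied from the diagram, and the frame size of 40 bytes is an arbitrary example value.

```go
package main

import "fmt"

// Offsets of the fixed slots at the bottom of a ppc64 frame, relative to R1,
// as shown in the diagram. The constant names are illustrative only.
const (
	offReturnPC = 0
	offCRSave   = 8
	offUnused   = 16
	offTOCSave  = 24 // only used when compiled with -shared/-dynlink
)

// alignUp rounds n up to a multiple of align, which must be a power of two.
func alignUp(n, align uintptr) uintptr {
	return (n + align - 1) &^ (align - 1)
}

func main() {
	const frame = 40 // hypothetical frame size in bytes

	fmt.Println(alignUp(frame, 8))  // 8-byte alignment used within Go
	fmt.Println(alignUp(frame, 16)) // 16-byte alignment used when calling cgo
	fmt.Println(offReturnPC, offCRSave, offUnused, offTOCSave)
}
```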
   760  
   761  On entry to a non-leaf function, the stack frame size is subtracted from R1 to
   762  create its stack frame, and the value of LR is saved at the bottom of the frame.
   763  
   764  A leaf function that does not require any stack space does not modify R1 and
   765  does not save LR.
   766  
   767  *NOTE*: We might need to save the frame pointer on the stack as
   768  in the PPC64 ELF v2 ABI so Go can inter-operate with platform debuggers
   769  and profilers.
   770  
   771  This stack layout is used by both register-based (ABIInternal) and
   772  stack-based (ABI0) calling conventions.
   773  
   774  #### Flags
   775  
   776  The condition register consists of 8 condition code register fields
   777  CR0-CR7. Go generated code only sets and uses CR0, commonly set by
   778  compare instructions and used to determine the target of a conditional
   779  branch. The generated code does not set or use CR1-CR7.
   780  
   781  The floating point status and control register (FPSCR) is initialized
   782  to 0 by the kernel at startup of the Go program and not changed by
   783  the Go generated code.
   784  
   785  ### riscv64 architecture
   786  
   787  The riscv64 architecture uses X10 – X17, X8, X9, X18 – X23 for integer arguments
   788  and results.
   789  
   790  It uses F10 – F17, F8, F9, F18 – F23 for floating-point arguments and results.
   791  
   792  Special-purpose registers used within Go generated code and Go
   793  assembly code are as follows:
   794  
   795  | Register | Call meaning | Return meaning | Body meaning |
   796  | --- | --- | --- | --- |
   797  | X0  | Zero value | Same | Same |
   798  | X1  | Link register | Link register | Scratch |
   799  | X2  | Stack pointer | Same | Same |
   800  | X3  | Global pointer | Same | Used by dynamic linker |
   801  | X4  | TLS (thread pointer) | TLS | Scratch |
   802  | X24,X25 | Scratch | Scratch | Used by duffcopy, duffzero |
   803  | X26 | Closure context pointer | Scratch | Scratch |
   804  | X27 | Current goroutine | Same | Same |
   805  | X31 | Scratch | Scratch | Scratch |
   806  
   807  *Rationale*: These register meanings are compatible with Go’s
   808  stack-based calling convention. The context register will change from X20 to X26,
   809  and the duffcopy and duffzero registers will change to X24 and X25, before this register ABI is adopted.
   810  X10 – X17, X8, X9, X18 – X23 is the same order as A0 – A7, S0 – S7 in the platform ABI.
   811  F10 – F17, F8, F9, F18 – F23 is the same order as FA0 – FA7, FS0 – FS7 in the platform ABI.
   812  X8 – X23, F8 – F15 are used for compressed instructions (RVC), which will benefit code size in the future.
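
To make the non-contiguous ordering concrete, the sketch below lists the integer argument registers in the order given above and pairs each with the platform ABI name stated in the rationale. The helper names `regRange`, `intArgRegs`, and `platformNames` are hypothetical; the actual assignment logic lives in the compiler and is not reproduced here.

```go
package main

import "fmt"

// regRange appends prefix+lo through prefix+hi (inclusive) to dst.
func regRange(dst []string, prefix string, lo, hi int) []string {
	for i := lo; i <= hi; i++ {
		dst = append(dst, fmt.Sprintf("%s%d", prefix, i))
	}
	return dst
}

// intArgRegs returns the riscv64 integer argument/result registers in the
// order listed in this section: X10 – X17, X8, X9, X18 – X23.
func intArgRegs() []string {
	var r []string
	r = regRange(r, "X", 10, 17)
	r = regRange(r, "X", 8, 9)
	r = regRange(r, "X", 18, 23)
	return r
}

// platformNames returns the platform ABI names in the matching order:
// A0 – A7 followed by S0 – S7.
func platformNames() []string {
	var r []string
	r = regRange(r, "A", 0, 7)
	r = regRange(r, "S", 0, 7)
	return r
}

func main() {
	regs, names := intArgRegs(), platformNames()
	for i := range regs {
		fmt.Printf("slot %2d: %-3s (%s)\n", i, regs[i], names[i])
	}
}
```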
   813  
   814  #### Stack layout
   815  
   816  The stack pointer, X2, grows down and is aligned to 8 bytes.
   817  
   818  A function's stack frame, after the frame is created, is laid out as
   819  follows:
   820  
   821      +------------------------------+
   822      | ... locals ...               |
   823      | ... outgoing arguments ...   |
   824      | return PC                    | ← X2 points to
   825      +------------------------------+ ↓ lower addresses
   826  
   827  The "return PC" is loaded to the link register, X1, as part of the
   828  riscv64 `CALL` operation.
   829  
   830  #### Flags
   831  
   832  The riscv64 architecture has the Zicsr extension for control and status registers (CSRs),
   833  which are treated as scratch registers.
   834  All bits in CSR are system flags and are not modified by Go.
   835  
   836  ## Future directions
   837  
   838  ### Spill path improvements
   839  
   840  The ABI currently reserves spill space for argument registers so the
   841  compiler can statically generate an argument spill path before calling
   842  into `runtime.morestack` to grow the stack.
   843  This ensures there will be sufficient spill space even when the stack
   844  is nearly exhausted and keeps stack growth and stack scanning
   845  essentially unchanged from ABI0.
   846  
   847  However, this wastes stack space (the median wastage is 16 bytes per
   848  call), resulting in larger stacks and increased cache footprint.
   849  A better approach would be to reserve stack space only when spilling.
   850  One way to ensure enough space is available to spill would be for
   851  every function to ensure there is enough space for the function's own
   852  frame *as well as* the spill space of all functions it calls.
   853  For most functions, this would change the threshold for the prologue
   854  stack growth check.
   855  For `nosplit` functions, this would change the threshold used in the
   856  linker's static stack size check.
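
A minimal sketch of how such a threshold might be computed, assuming each function is summarized by its own frame size and the spill space its register arguments need. The `fn` type and `growthThreshold` are hypothetical. Taking the maximum over callees rather than the sum is also an assumption, based on only one callee being active at a time.

```go
package main

import "fmt"

// fn is a hypothetical per-function summary used only for this sketch.
type fn struct {
	frameSize int   // bytes for the function's own frame
	spillSize int   // bytes needed to spill its register arguments
	callees   []*fn // functions it calls directly
}

// growthThreshold returns the stack space a function would have to guarantee
// under the scheme described above: its own frame plus the largest spill
// area of any function it calls. Using max instead of sum is an assumption.
func growthThreshold(f *fn) int {
	maxSpill := 0
	for _, c := range f.callees {
		if c.spillSize > maxSpill {
			maxSpill = c.spillSize
		}
	}
	return f.frameSize + maxSpill
}

func main() {
	leafA := &fn{frameSize: 16, spillSize: 24}
	leafB := &fn{frameSize: 32, spillSize: 8}
	caller := &fn{frameSize: 64, spillSize: 16, callees: []*fn{leafA, leafB}}

	fmt.Println(growthThreshold(caller)) // 64 + max(24, 8)
	fmt.Println(growthThreshold(leafA))  // no callees, so just its own frame
}
```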
   857  
   858  Allocating spill space in the callee rather than the caller may also
   859  allow for faster reflection calls in the common case where a function
   860  takes only register arguments, since it would allow reflection to make
   861  these calls directly without allocating any frame.
   862  
   863  The statically-generated spill path also increases code size.
   864  It is possible to instead have a generic spill path in the runtime, as
   865  part of `morestack`.
   866  However, this complicates reserving the spill space, since spilling
   867  all possible register arguments would, in most cases, take
   868  significantly more space than spilling only those used by a particular
   869  function.
   870  Some options are to spill to a temporary space and copy back only the
   871  registers used by the function, or to grow the stack if necessary
   872  before spilling to it (using a temporary space if necessary), or to
   873  use a heap-allocated space if insufficient stack space is available.
   874  These options all add enough complexity that we will have to make this
   875  decision based on the actual code size growth caused by the static
   876  spill paths.
   877  
   878  ### Clobber sets
   879  
   880  As defined, the ABI does not use callee-save registers.
   881  This significantly simplifies the garbage collector and the compiler's
   882  register allocator, but at some performance cost.
   883  A potentially better balance for Go code would be to use *clobber
   884  sets*: for each function, the compiler records the set of registers it
   885  clobbers (including those clobbered by functions it calls) and any
   886  register not clobbered by function F can remain live across calls to
   887  F.
   888  
   889  This is generally a good fit for Go because Go's package DAG allows
   890  function metadata like the clobber set to flow up the call graph, even
   891  across package boundaries.
   892  Clobber sets would require relatively little change to the garbage
   893  collector, unlike general callee-save registers.
   894  One disadvantage of clobber sets over callee-save registers is that
   895  they don't help with indirect function calls or interface method
   896  calls, since static information isn't available in these cases.
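
The idea can be sketched as a bitset union over the call graph. Everything below is hypothetical: the `regSet` and `fn` types, the register numbering, and the choice to treat any indirect call as clobbering all registers. The sketch also assumes an acyclic call graph; recursive functions would need a fixed-point iteration instead.

```go
package main

import "fmt"

// regSet is a bitset of registers; bit i stands for register i.
type regSet uint32

// allRegs is the conservative answer when nothing is known statically.
const allRegs = ^regSet(0)

// fn is a hypothetical function summary for this sketch.
type fn struct {
	name     string
	own      regSet // registers clobbered by the function body itself
	callees  []*fn  // statically known callees
	indirect bool   // makes an indirect or interface method call
}

// clobbers returns the transitive clobber set of f, caching results in memo.
// It assumes the call graph is acyclic.
func clobbers(f *fn, memo map[*fn]regSet) regSet {
	if s, ok := memo[f]; ok {
		return s
	}
	s := f.own
	if f.indirect {
		s = allRegs // no static information about the callee
	}
	for _, c := range f.callees {
		s |= clobbers(c, memo)
	}
	memo[f] = s
	return s
}

func main() {
	leaf := &fn{name: "leaf", own: 1<<0 | 1<<1}
	mid := &fn{name: "mid", own: 1 << 2, callees: []*fn{leaf}}
	dyn := &fn{name: "dyn", own: 1 << 3, indirect: true}
	memo := map[*fn]regSet{}

	fmt.Printf("%s clobbers %#b\n", mid.name, clobbers(mid, memo))
	// Register 3 is outside mid's clobber set, so a caller could keep a
	// value live in it across a call to mid.
	fmt.Println(clobbers(mid, memo)&(1<<3) == 0)
	fmt.Println(clobbers(dyn, memo) == allRegs)
}
```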
   897  
   898  ### Large aggregates
   899  
   900  Go encourages passing composite values by value, and this simplifies
   901  reasoning about mutation and races.
   902  However, this comes at a performance cost for large composite values.
   903  It may be possible to instead transparently pass large composite
   904  values by reference and delay copying until it is actually necessary.
   905  
   906  ## Appendix: Register usage analysis
   907  
   908  To understand how the above design affects register usage,
   909  we
   910  [analyzed](https://github.com/aclements/go-misc/tree/master/abi) the
   911  ABI on a large code base: cmd/kubelet from
   912  [Kubernetes](https://github.com/kubernetes/kubernetes) at tag v1.18.8.
   913  
   914  The following table shows the impact of different numbers of available
   915  integer and floating-point registers on argument assignment:
   916  
   917  ```
   918  |      |        |       |      stack args |          spills |     stack total |
   919  | ints | floats | % fit | p50 | p95 | p99 | p50 | p95 | p99 | p50 | p95 | p99 |
   920  |    0 |      0 |  6.3% |  32 | 152 | 256 |   0 |   0 |   0 |  32 | 152 | 256 |
   921  |    0 |      8 |  6.4% |  32 | 152 | 256 |   0 |   0 |   0 |  32 | 152 | 256 |
   922  |    1 |      8 | 21.3% |  24 | 144 | 248 |   8 |   8 |   8 |  32 | 152 | 256 |
   923  |    2 |      8 | 38.9% |  16 | 128 | 224 |   8 |  16 |  16 |  24 | 136 | 240 |
   924  |    3 |      8 | 57.0% |   0 | 120 | 224 |  16 |  24 |  24 |  24 | 136 | 240 |
   925  |    4 |      8 | 73.0% |   0 | 120 | 216 |  16 |  32 |  32 |  24 | 136 | 232 |
   926  |    5 |      8 | 83.3% |   0 | 112 | 216 |  16 |  40 |  40 |  24 | 136 | 232 |
   927  |    6 |      8 | 87.5% |   0 | 112 | 208 |  16 |  48 |  48 |  24 | 136 | 232 |
   928  |    7 |      8 | 89.8% |   0 | 112 | 208 |  16 |  48 |  56 |  24 | 136 | 232 |
   929  |    8 |      8 | 91.3% |   0 | 112 | 200 |  16 |  56 |  64 |  24 | 136 | 232 |
   930  |    9 |      8 | 92.1% |   0 | 112 | 192 |  16 |  56 |  72 |  24 | 136 | 232 |
   931  |   10 |      8 | 92.6% |   0 | 104 | 192 |  16 |  56 |  72 |  24 | 136 | 232 |
   932  |   11 |      8 | 93.1% |   0 | 104 | 184 |  16 |  56 |  80 |  24 | 128 | 232 |
   933  |   12 |      8 | 93.4% |   0 | 104 | 176 |  16 |  56 |  88 |  24 | 128 | 232 |
   934  |   13 |      8 | 94.0% |   0 |  88 | 176 |  16 |  56 |  96 |  24 | 128 | 232 |
   935  |   14 |      8 | 94.4% |   0 |  80 | 152 |  16 |  64 | 104 |  24 | 128 | 232 |
   936  |   15 |      8 | 94.6% |   0 |  80 | 152 |  16 |  64 | 112 |  24 | 128 | 232 |
   937  |   16 |      8 | 94.9% |   0 |  16 | 152 |  16 |  64 | 112 |  24 | 128 | 232 |
   938  |    ∞ |      8 | 99.8% |   0 |   0 |   0 |  24 | 112 | 216 |  24 | 120 | 216 |
   939  ```
   940  
   941  The first two columns show the number of available integer and
   942  floating-point registers.
   943  The first row shows the results for 0 integer and 0 floating-point
   944  registers, which is equivalent to ABI0.
   945  We found that any reasonable number of floating-point registers has
   946  the same effect, so we fixed it at 8 for all other rows.
   947  
   948  The “% fit” column gives the fraction of functions where all arguments
   949  and results are register-assigned and no arguments are passed on the
   950  stack.
   951  The three “stack args” columns give the median, 95th and 99th
   952  percentile number of bytes of stack arguments.
   953  The “spills” columns likewise summarize the number of bytes in
   954  on-stack spill space.
   955  And “stack total” summarizes the sum of stack arguments and on-stack
   956  spill slots.
   957  Note that these are three different distributions; for example,
   958  there’s no single function that takes 0 stack argument bytes, 16 spill
   959  bytes, and 24 total stack bytes.
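
The following sketch shows why percentiles taken from three separate distributions need not add up. The byte counts are made up for illustration and are not taken from the kubelet data, and `median` is a hypothetical helper that assumes an odd number of samples.

```go
package main

import (
	"fmt"
	"sort"
)

// median returns the middle element of xs after sorting a copy.
// For simplicity it assumes len(xs) is odd.
func median(xs []int) int {
	s := append([]int(nil), xs...)
	sort.Ints(s)
	return s[len(s)/2]
}

func main() {
	// Hypothetical per-function byte counts for five functions.
	args := []int{0, 0, 0, 32, 24}
	spills := []int{8, 24, 16, 16, 32}

	total := make([]int, len(args))
	for i := range args {
		total[i] = args[i] + spills[i]
	}

	// The three medians come from different functions, so the median total
	// is not the sum of the other two medians.
	fmt.Println(median(args), median(spills), median(total))
}
```

With this made-up data the medians are 0, 16, and 24 bytes, yet no single function has that combination: the function with 24 total bytes has 0 stack argument bytes and 24 spill bytes.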
   960  
   961  From this, we can see that the fraction of functions that fit entirely
   962  in registers grows very slowly once it reaches about 90%, though
   963  curiously there is a small minority of functions that could benefit
   964  from a huge number of registers.
   965  Making 9 integer registers available on amd64 puts it in this realm.
   966  We also see that the stack space required for most functions is fairly
   967  small.
   968  While the increasing space required for spills largely balances out
   969  the decreasing space required for stack arguments as the number of
   970  available registers increases, there is a general reduction in the
   971  total stack space required with more available registers.
   972  This does, however, suggest that eliminating spill slots in the future
   973  would noticeably reduce stack requirements.