github.com/tetratelabs/wazero@v1.7.1/site/content/docs/how_the_optimizing_compiler_works/backend.md (about)

     1  +++
     2  title = "How the Optimizing Compiler Works: Back-End"
     3  layout = "single"
     4  +++
     5  
     6  In this section we will discuss the phases in the back-end of the optimizing
     7  compiler:
     8  
     9  - [Instruction Selection](#instruction-selection)
    10  - [Register Allocation](#register-allocation)
    11  - [Finalization and Encoding](#finalization-and-encoding)
    12  
Each section will include a brief explanation of the phase, references to the
code that implements the phase, and a description of the debug flags that can
be used to inspect that phase. Please note that, since the implementation of
the back-end is architecture-specific, the code might be different for each
architecture.
    18  
    19  ### Code
    20  
The high-level entry point to the back-end is the
`backend.Compiler.Compile(context.Context)` method. This method executes, in
turn, the following methods of the same type:
    24  
    25  - `backend.Compiler.Lower()` (instruction selection)
    26  - `backend.Compiler.RegAlloc()` (register allocation)
    27  - `backend.Compiler.Finalize(context.Context)` (finalization and encoding)
    28  
    29  ## Instruction Selection
    30  
    31  The instruction selection phase is responsible for mapping the higher-level SSA
    32  instructions to arch-specific instructions. Each SSA instruction is translated
    33  to one or more machine instructions.
    34  
Each target architecture comes with a different number of registers: some of
them are general purpose, while others might be specific to certain
instructions. In general, we can expect to have a set of registers for integer
computations, another set for floating point computations, a set for vector
(SIMD) computations, and some special-purpose registers (e.g. stack pointers,
program counters, status flags, etc.).
    41  
    42  In addition, some registers might be reserved by the Go runtime or the
    43  Operating System for specific purposes, so they should be handled with special
    44  care.
    45  
    46  At this point in the compilation process we do not want to deal with all that.
    47  Instead, we assume that we have a potentially infinite number of *virtual
    48  registers* of each type at our disposal. The next phase, the register
    49  allocation phase, will map these virtual registers to the actual registers of
    50  the target architecture.
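
As a rough illustration of the idea (not wazero's actual `regalloc.VReg` implementation), virtual registers can be thought of as identifiers handed out by a counter that never runs out. All names in this sketch (`vreg`, `regType`, `vregAllocator`) are hypothetical.

```go
package main

import "fmt"

// regType is a hypothetical register class (integer, float, vector).
type regType byte

const (
	regTypeInt regType = iota
	regTypeFloat
	regTypeVector
)

// vreg is a hypothetical virtual register: an ID plus a register class.
type vreg struct {
	id  uint32
	typ regType
}

// String renders the register with a trailing `?`, mimicking the notation
// used for virtual registers in the listings of this document.
func (v vreg) String() string {
	prefix := [...]string{"r", "f", "v"}[v.typ]
	return fmt.Sprintf("%s%d?", prefix, v.id)
}

// vregAllocator hands out fresh virtual registers on demand.
type vregAllocator struct{ next uint32 }

func (a *vregAllocator) alloc(t regType) vreg {
	v := vreg{id: a.next, typ: t}
	a.next++
	return v
}

func main() {
	var a vregAllocator
	fmt.Println(a.alloc(regTypeInt), a.alloc(regTypeInt), a.alloc(regTypeFloat))
}
```

Because instruction selection only ever asks for a fresh register, it does not need to know how many real registers the target has.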
    51  
    52  ### Operands and Addressing Modes
    53  
    54  As a rule of thumb, we want to map each `ssa.Value` to a virtual register, and
    55  then use that virtual register as one of the arguments of the machine
    56  instruction that we will generate. However, usually instructions are able to
    57  address more than just registers: an *operand* might be able to represent a
    58  memory address, or an immediate value (i.e. a constant value that is encoded as
    59  part of the instruction itself).
    60  
    61  For these reasons, instead of mapping each `ssa.Value` to a virtual register
    62  (`regalloc.VReg`), we map each `ssa.Value` to an architecture-specific
    63  `operand` type.
    64  
During lowering of an `ssa.Instruction`, each `ssa.Value` that is used as an
argument of the instruction is mapped to an `operand`. In the simplest case,
the `operand` is a virtual register; in other cases, it is a memory address or
an immediate value. Sometimes this makes it possible to replace several SSA
instructions with a single machine instruction, by folding the addressing mode
into the instruction itself.
    71  
    72  For instance, consider the following SSA instructions:
    73  
    74  ```
    75      v4:i32 = Const 0x9
    76      v6:i32 = Load v5, 0x4
    77      v7:i32 = Iadd v6, v4
    78  ```
    79  
In the `amd64` architecture (AT&T syntax), the `add` instruction adds the first
operand to the second operand, and assigns the result to the second operand. So
assuming that `v4`, `v5`, `v6`, and `v7` are mapped respectively to the virtual
registers `r4?`, `r5?`, `r6?`, and `r7?`, the lowering of the `Iadd`
instruction on `amd64` might look like this:
    85  
    86  ```asm
    ;; AT&T syntax
    add 4(%r5?), %r4?  ;; add the value at memory address [`r5?` + 4] to `r4?`
    mov %r4?, %r7?     ;; move the result from `r4?` to `r7?`
    90  ```
    91  
    92  Notice how the load from memory has been folded into an operand of the `add`
    93  instruction. This transformation is possible when the value produced by the
    94  instruction being folded is not referenced by other instructions and the
    95  instructions belong to the same `InstructionGroupID` (see [Front-End:
    96  Optimization](../frontend/#optimization)).
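
The folding condition described above can be sketched as follows. This is a simplified illustration, not the arch-specific `operand` type used by wazero: the `operand`, `value`, and `operandFor` names, as well as the `refCount` and `group` fields, are assumptions made for this example.

```go
package main

import "fmt"

// operandKind is a hypothetical tag for the three operand shapes.
type operandKind byte

const (
	operandReg operandKind = iota // a virtual register
	operandMem                    // a memory address: base register + offset
	operandImm                    // a constant encoded in the instruction
)

// operand is a simplified, hypothetical stand-in for the arch-specific type.
type operand struct {
	kind   operandKind
	reg    string // register name (operandReg) or base register (operandMem)
	offset int32  // displacement (operandMem)
	imm    int32  // constant (operandImm)
}

// String formats the operand in AT&T syntax.
func (o operand) String() string {
	switch o.kind {
	case operandMem:
		return fmt.Sprintf("%d(%%%s)", o.offset, o.reg)
	case operandImm:
		return fmt.Sprintf("$%d", o.imm)
	default:
		return "%" + o.reg
	}
}

// value describes the producer of an SSA value for the purpose of this sketch.
type value struct {
	reg      string // virtual register assigned to the value
	isLoad   bool   // produced by a Load instruction
	base     string // base register of the load
	offset   int32  // offset of the load
	refCount int    // number of instructions that use the value
	group    int    // instruction group of the producer
}

// operandFor folds a load into a memory operand when the loaded value has a
// single use and producer and consumer share the same instruction group;
// otherwise it falls back to the virtual register.
func operandFor(v value, consumerGroup int) operand {
	if v.isLoad && v.refCount == 1 && v.group == consumerGroup {
		return operand{kind: operandMem, reg: v.base, offset: v.offset}
	}
	return operand{kind: operandReg, reg: v.reg}
}

func main() {
	v6 := value{reg: "r6?", isLoad: true, base: "r5?", offset: 4, refCount: 1}
	fmt.Printf("add %s, %%r4?\n", operandFor(v6, 0)) // load folded into the add
	v6.refCount = 2
	fmt.Printf("add %s, %%r4?\n", operandFor(v6, 0)) // load must stay separate
}
```

When the value is used more than once, folding would duplicate the load, so the sketch keeps the virtual register instead.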
    97  
    98  ### Example
    99  
   100  At the end of the instruction selection phase, the basic blocks of our `abs`
   101  function will look as follows (for `arm64`):
   102  
   103  ```asm
   104  L1 (SSA Block: blk0):
   105  	mov x130?, x2
   106  	subs wzr, w130?, #0x0
   107  	b.ge L2
   108  L3 (SSA Block: blk1):
   109  	mov x136?, xzr
   110  	sub w134?, w136?, w130?
   111  	mov x135?, x134?
   112  	b L4
   113  L2 (SSA Block: blk2):
   114  	mov x135?, x130?
   115  L4 (SSA Block: blk3):
   116  	mov x0, x135?
   117  	ret
   118  ```
   119  
   120  Notice the introduction of the new identifiers `L1`, `L3`, `L2`, and `L4`.
   121  These are labels that are used to mark the beginning of each basic block, and
   122  they are the target for branching instructions such as `b` and `b.ge`.
   123  
   124  ### Code
   125  
`backend.Machine` is the interface to the backend. It has methods to
translate (lower) the IR to machine code. Again, as seen earlier in the
front-end, the term *lowering* is used to indicate translation from a
higher-level representation to a lower-level representation.
   130  
   131  `backend.Machine.LowerInstr(*ssa.Instruction)` is the method that translates an
   132  SSA instruction to machine code.  Machine-specific implementations of this
   133  method can be found in package `backend/isa/<arch>` where `<arch>` is either
   134  `amd64` or `arm64`.
   135  
   136  ### Debug Flags
   137  
   138  `wazevoapi.PrintSSAToBackendIRLowering` prints the basic blocks with the
   139  lowered arch-specific instructions.
   140  
   141  ## Register Allocation
   142  
   143  The register allocation phase is responsible for mapping the potentially
   144  infinite number of virtual registers to the real registers of the target
   145  architecture. Because the number of real registers is limited, the register
   146  allocation phase might need to "spill" some of the virtual registers to memory;
   147  that is, it might store their content, and then load them back into a register
   148  when they are needed.
   149  
   150  For a given function `f` the register allocation procedure
   151  `regalloc.Allocator.DoAllocation(f)` is implemented in sub-phases:
   152  
- `livenessAnalysis(f)` collects the "liveness" information for each virtual
  register. The algorithm is described in [Chapter 9.2 of The SSA
  Book][ssa-book].

- `alloc(f)` allocates registers for the given function. The algorithm is
  derived from [the Go compiler's allocator][go-regalloc].
   160  
At the end of the allocation procedure, we also record the set of registers
that are **clobbered** by the body of the function. A register is clobbered
if its value is overwritten by the function and the callee is responsible for
preserving it. This information is used in the finalization phase to determine
which registers need to be saved in the prologue and restored in the epilogue.
This step is not strictly related to register allocation in the textbook
sense, but it is a necessary step for the finalization phase.
   168  
   169  ### Liveness Analysis
   170  
   171  Intuitively, a variable or name binding can be considered _live_ at a certain
   172  point in a program, if its value will be used in the future.
   173  
   174  For instance:
   175  
   176  ```
   177  1| int f(int x) {
   178  2|   int y = 2 + x;
   179  3|   int z = x + y;
   180  4|   return z;
   181  5| }
   182  ```
   183  
Variables `x` and `y` are both live at line 3, because they are used in the
expression `x + y` on that line; variable `z` is live at line 4, because it is
used in the return statement. However, variables `x` and `y` can be considered
_not_ live at line 4 because they are not used anywhere after line 3.
   188  
   189  Statically, _liveness_ can be approximated by following paths backwards on the
   190  control-flow graph, connecting the uses of a given variable to its definitions
   191  (or its *unique* definition, assuming SSA form).
   192  
   193  In practice, while liveness is a property of each name binding at any point in
   194  the program, it is enough to keep track of liveness at the boundaries of basic
   195  blocks:
   196  
- the _live-in_ set for a given basic block is the set of all bindings that are
  live at the entry of that block.
- the _live-out_ set for a given basic block is the set of all bindings that
  are live at the exit of that block. A binding is live at the exit of a block
  if it is live at the entry of a successor.
   202  
   203  Because the CFG is a connected graph, it is enough to keep track of either
   204  live-in or live-out sets, and then propagate the liveness information backward
   205  or forward, respectively. In our case, we keep track of live-in sets per block;
   206  live-outs are derived from live-ins of the successor blocks when a block is
   207  allocated.
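
To make the live-in/live-out relationship concrete, here is a minimal sketch that computes live-in sets with a classic iterative backward data-flow pass. Note that this is a textbook formulation chosen for brevity, not the algorithm from the SSA Book that the implementation follows, and the `block` type and its fields are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// block is a hypothetical basic block: the variables it reads before
// defining them (uses), the variables it defines (defs), and its successors.
type block struct {
	uses, defs []string
	succs      []int
}

// liveIns computes the live-in set of every block by repeating, until
// nothing changes:
//
//	liveOut(b) = union of liveIn(s) for each successor s of b
//	liveIn(b)  = uses(b) + (liveOut(b) - defs(b))
func liveIns(blocks []block) []map[string]bool {
	in := make([]map[string]bool, len(blocks))
	for i := range in {
		in[i] = map[string]bool{}
	}
	for changed := true; changed; {
		changed = false
		// Visit blocks backwards so information flows from uses to definitions.
		for i := len(blocks) - 1; i >= 0; i-- {
			b := blocks[i]
			defs := map[string]bool{}
			for _, d := range b.defs {
				defs[d] = true
			}
			add := func(v string) {
				if !in[i][v] {
					in[i][v] = true
					changed = true
				}
			}
			for _, u := range b.uses {
				add(u)
			}
			for _, s := range b.succs {
				for v := range in[s] { // live-out of b, derived from successors
					if !defs[v] {
						add(v)
					}
				}
			}
		}
	}
	return in
}

// sorted returns the members of a set in a stable order for printing.
func sorted(set map[string]bool) []string {
	out := make([]string, 0, len(set))
	for v := range set {
		out = append(out, v)
	}
	sort.Strings(out)
	return out
}

func main() {
	// A diamond CFG: blk0 -> {blk1, blk2} -> blk3.
	blocks := []block{
		{defs: []string{"x"}, succs: []int{1, 2}},
		{uses: []string{"x"}, defs: []string{"y"}, succs: []int{3}},
		{uses: []string{"x"}, defs: []string{"y"}, succs: []int{3}},
		{uses: []string{"y"}},
	}
	for i, set := range liveIns(blocks) {
		fmt.Printf("blk%d live-in: %v\n", i, sorted(set))
	}
}
```

Only live-in sets are stored; the live-out set of a block is recomputed from its successors whenever it is needed, mirroring the choice described above.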
   208  
   209  ### Allocation
   210  
   211  We implemented a variant of the linear scan register allocation algorithm
   212  described in [the Go compiler's allocator][go-regalloc].
   213  
   214  Each basic block is allocated registers in a linear scan order, and the
   215  allocation state is propagated from a given basic block to its successors.
   216  Then, each block continues allocation from that initial state.
   217  
   218  #### Merge States
   219  
   220  Special care has to be taken when a block has multiple predecessors. We call
   221  this *fixing merge states*: for instance, consider the following:
   222  
   223  ```goat { width="30%" }
   224   .---.     .---.
   225  | BB0 |   | BB1 |
   226   '-+-'     '-+-'
   227     +----+----+
   228          |
   229          v
   230        .---.
   231       | BB2 |
   232        '---'
   233  ```
   234  
If the live-out set of a given block `BB0` is different from the live-out set
of a given block `BB1` and both are predecessors of a block `BB2`, then we need
to adjust `BB0` and `BB1` to ensure consistency with `BB2`. In practice,
abstract values in `BB0` and `BB1` might be passed to `BB2` either via registers
or via the stack; fixing merge states ensures that registers and stack are used
consistently to pass values across the involved states.
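
The following sketch illustrates what "fixing" a merge state could involve. It is a deliberately reduced model: the `state` and `location` types are hypothetical, and the function only reports which adjustments a predecessor needs, leaving out the ordering of moves so that they do not overwrite one another.

```go
package main

import (
	"fmt"
	"sort"
)

// location is where a value lives at a block boundary in this sketch:
// a real register name such as "x8", or "stack" for a spill slot.
type location string

// state maps each live virtual register to its location.
type state map[string]location

// fixups lists, for every value the successor expects, the adjustment the
// predecessor must make at its end so that both blocks agree.
func fixups(predEnd, succBegin state) []string {
	var out []string
	for v, want := range succBegin {
		have, ok := predEnd[v]
		switch {
		case !ok:
			// The value does not flow through this predecessor: nothing to fix.
		case have == want:
			// Already consistent.
		case want == "stack":
			out = append(out, fmt.Sprintf("store %s from %s to stack", v, have))
		case have == "stack":
			out = append(out, fmt.Sprintf("reload %s from stack into %s", v, want))
		default:
			out = append(out, fmt.Sprintf("move %s from %s to %s", v, have, want))
		}
	}
	sort.Strings(out) // map iteration order is random; sort for stable output
	return out
}

func main() {
	bb2Begin := state{"v1": "x8", "v2": "stack"}
	bb0End := state{"v1": "x8", "v2": "x9"}
	bb1End := state{"v1": "x10", "v2": "stack"}
	fmt.Println("BB0:", fixups(bb0End, bb2Begin))
	fmt.Println("BB1:", fixups(bb1End, bb2Begin))
}
```

In this model, `BB0` has to store `v2` to the stack and `BB1` has to move `v1` into `x8`, after which both predecessors agree with what `BB2` expects.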
   241  
   242  #### Spilling
   243  
If the register allocator cannot find a free register for a given (live)
virtual register, it needs to "spill" a value to the stack to get a free
register, *i.e.,* stash it temporarily on the stack. When the spilled virtual
register is used again later, we have to insert instructions to reload the
value into a real register.
   249  
As allocation proceeds, the procedure also records all the virtual registers
that transition to the "spilled" state, and inserts the reload instructions
when those registers are reused later.
   253  
   254  The spill instructions are actually inserted at the end of the register
   255  allocation, after all the allocations and the merge states have been fixed. At
   256  this point, all the other potential sources of instability have been resolved,
   257  and we know where all the reloads happen.
   258  
   259  We insert the spills in the block that is the lowest common ancestor of all the
   260  blocks that reload the value.
   261  
   262  #### Clobbered Registers
   263  
At the end of the allocation procedure, the `determineCalleeSavedRealRegs(f)`
method iterates over the set of allocated registers and compares them
to the architecture-specific set `CalleeSavedRegisters`. If a register
has been allocated, and it is present in this set, the register is marked as
"clobbered", i.e., we now know that the function will overwrite that value.
Thus, these registers have to be saved in the prologue and restored in the
epilogue.
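
In essence, this step is a set intersection. The sketch below shows the idea with plain strings as register names; the `clobbered` function and the contents of the callee-saved set are assumptions for illustration, since the real set is architecture-specific.

```go
package main

import (
	"fmt"
	"sort"
)

// clobbered returns the allocated registers that also belong to the
// callee-saved set, without duplicates and in sorted order.
func clobbered(allocated []string, calleeSaved map[string]bool) []string {
	var out []string
	seen := map[string]bool{}
	for _, r := range allocated {
		if calleeSaved[r] && !seen[r] {
			seen[r] = true
			out = append(out, r)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Hypothetical callee-saved set, used only for this example.
	calleeSaved := map[string]bool{"x19": true, "x20": true, "x21": true}
	fmt.Println(clobbered([]string{"x8", "x20", "x2", "x19", "x20"}, calleeSaved))
}
```

Registers such as `x8` and `x2` in this example are allocated but not callee-saved, so they do not need to be preserved by the prologue.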
   270  
   271  #### References
   272  
   273  Register allocation is a complex problem, possibly the most complicated
   274  part of the backend. The following references were used to implement the
   275  algorithm:
   276  
   277  - https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/lectures/17/Slides17.pdf
   278  - https://en.wikipedia.org/wiki/Chaitin%27s_algorithm
   279  - https://llvm.org/ProjectsWithLLVM/2004-Fall-CS426-LS.pdf
- https://pfalcon.github.io/ssabook/latest/book-full.pdf: Chapter 9 for liveness analysis.
   281  - https://github.com/golang/go/blob/release-branch.go1.21/src/cmd/compile/internal/ssa/regalloc.go
   282  
We suggest referring to them for a deeper dive into the topic.
   284  
   285  ### Example
   286  
   287  At the end of the register allocation phase, the basic blocks of our `abs`
   288  function look as follows (for `arm64`):
   289  
   290  ```asm
   291  L1 (SSA Block: blk0):
   292  	mov x2, x2
   293  	subs wzr, w2, #0x0
   294  	b.ge L2
   295  L3 (SSA Block: blk1):
   296  	mov x8, xzr
   297  	sub w8, w8, w2
   298  	mov x8, x8
   299  	b L4
   300  L2 (SSA Block: blk2):
   301  	mov x8, x2
   302  L4 (SSA Block: blk3):
   303  	mov x0, x8
   304  	ret
   305  ```
   306  
Notice how the virtual registers have all been replaced by real registers, i.e.
no register identifier is suffixed with `?`. This example is quite simple, and
it does not require any spill.
   310  
   311  ### Code
   312  
   313  The algorithm (`regalloc/regalloc.go`) can work on any ISA by implementing the
   314  interfaces in `regalloc/api.go`.
   315  
   316  Essentially:
   317  
   318  - each architecture exposes iteration over basic blocks of a function
   319    (`regalloc.Function` interface)
   320  - each arch-specific basic block exposes iteration over instructions
   321    (`regalloc.Block` interface)
   322  - each arch-specific instruction exposes the set of registers it defines and
   323    uses  (`regalloc.Instr` interface)
   324  
   325  By defining these interfaces, the register allocation algorithm can assign real
   326  registers to virtual registers without dealing specifically with the target
   327  architecture.
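
The separation can be sketched as follows. The interfaces here are heavily simplified, hypothetical counterparts of the ones in `regalloc/api.go` (the real method sets differ); the point is only that a pass written against the interfaces works for any "architecture" that implements them.

```go
package main

import "fmt"

// Simplified, hypothetical interfaces: not the real regalloc API.
type instr interface {
	defs() []string // registers written by the instruction
	uses() []string // registers read by the instruction
}

type block interface{ instrs() []instr }

type function interface{ blocks() []block }

// countVRegs is an arch-independent pass: it only relies on the interfaces.
// It counts distinct virtual registers, i.e. names ending in `?`.
func countVRegs(f function) int {
	seen := map[string]bool{}
	for _, b := range f.blocks() {
		for _, i := range b.instrs() {
			for _, regs := range [][]string{i.defs(), i.uses()} {
				for _, r := range regs {
					if len(r) > 0 && r[len(r)-1] == '?' {
						seen[r] = true
					}
				}
			}
		}
	}
	return len(seen)
}

// A toy "architecture" implementing the interfaces.
type toyInstr struct{ d, u []string }

func (i toyInstr) defs() []string { return i.d }
func (i toyInstr) uses() []string { return i.u }

type toyBlock []instr

func (b toyBlock) instrs() []instr { return b }

type toyFunc []block

func (f toyFunc) blocks() []block { return f }

func main() {
	f := toyFunc{toyBlock{
		toyInstr{d: []string{"x130?"}, u: []string{"x2"}},
		toyInstr{d: []string{"x135?"}, u: []string{"x130?"}},
	}}
	fmt.Println(countVRegs(f))
}
```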
   328  
In practice, each interface is usually implemented by instantiating a common
generic struct that comes already with an implementation of all or most of the
required methods. For instance, `regalloc.Function` is implemented by
`backend.RegAllocFunction[*arm64.instruction, *arm64.machine]`.
   333  
`backend/isa/<arch>/abi.go` (where `<arch>` is either `arm64` or `amd64`)
contains the instantiation of the `regalloc.RegisterInfo` struct, which
declares, among others:

- the set of registers that are available for allocation, excluding, for
  instance, those that might be reserved by the runtime or the OS
  (`AllocatableRegisters`)
- the registers that might be saved by the callee to the stack
  (`CalleeSavedRegisters`)
   342  
   343  ### Debug Flags
   344  
- `wazevoapi.RegAllocLoggingEnabled` enables detailed logging of the register
  allocation procedure.
   347  - `wazevoapi.PrintRegisterAllocated` prints the basic blocks with the register
   348    allocation result.
   349  
   350  ## Finalization and Encoding
   351  
   352  At the end of the register allocation phase, we have enough information to
   353  finally generate machine code (_encoding_). We are only missing the prologue
   354  and epilogue of the function.
   355  
   356  ### Prologue and Epilogue
   357  
   358  As usual, the **prologue** is executed before the main body of the function,
   359  and the **epilogue** is executed at the return. The prologue is responsible for
   360  setting up the stack frame, and the epilogue is responsible for cleaning up the
   361  stack frame and returning control to the caller.
   362  
Generally, this means, at the very least:

- saving the return address
- saving a base pointer to the stack; or, equivalently, the height of the stack
  at the beginning of the function
   367  
   368  For instance, on `amd64`, `RBP` is the base pointer, `RSP` is the stack
   369  pointer:
   370  
   371  ```goat {width="100%" height="250"}
   372                  (high address)                     (high address)
   373      RBP ----> +-----------------+                +-----------------+
   374                |      `...`      |                |      `...`      |
   375                |      ret Y      |                |      ret Y      |
   376                |      `...`      |                |      `...`      |
   377                |      ret 0      |                |      ret 0      |
   378                |      arg X      |                |      arg X      |
   379                |      `...`      |     ====>      |      `...`      |
   380                |      arg 1      |                |      arg 1      |
   381                |      arg 0      |                |      arg 0      |
   382                |   Return Addr   |                |   Return Addr   |
   383      RSP ----> +-----------------+                |    Caller_RBP   |
   384                   (low address)                   +-----------------+ <----- RSP, RBP
   385  ```
   386  
On `arm64`, by contrast, there is only a stack pointer `SP`:
   388  
   389  
   390  ```goat {width="100%" height="300"}
   391              (high address)                    (high address)
   392    SP ---> +-----------------+               +------------------+ <----+
   393            |      `...`      |               |      `...`       |      |
   394            |      ret Y      |               |      ret Y       |      |
   395            |      `...`      |               |      `...`       |      |
   396            |      ret 0      |               |      ret 0       |      |
   397            |      arg X      |               |      arg X       |      |  size_of_arg_ret.
   398            |      `...`      |     ====>     |      `...`       |      |
   399            |      arg 1      |               |      arg 1       |      |
   400            |      arg 0      |               |      arg 0       | <----+
   401            +-----------------+               |  size_of_arg_ret |
   402                                              |  return address  |
   403                                              +------------------+ <---- SP
   404               (low address)                     (low address)
   405  ```
   406  
   407  However, the prologue and epilogue might also be responsible for saving and
   408  restoring the state of registers that might be overwritten by the function
   409  ("clobbered"); and, if spilling occurs, prologue and epilogue are also
   410  responsible for reserving and releasing the space for the spilled values.
   411  
   412  For clarity, we make a distinction between the space reserved for the clobbered
   413  registers and the space reserved for the spilled values:
   414  
- Spill slots are used to temporarily store the values that need spilling as
  determined by the register allocator. This section must have a fixed height,
  but its contents will change over time, as registers are being spilled and
  reloaded.
   419  - Clobbered registers are, similarly, determined by the register allocator, but
   420    they are stashed in the prologue and then restored in the epilogue.
   421  
   422  The procedure happens after the register allocation phase because at
   423  this point we have collected enough information to know how much space we need
   424  to reserve, and which registers are clobbered.
   425  
   426  Regardless of the architecture, after allocating this space, the stack will
   427  look as follows:
   428  
   429  ```goat {height="350"}
   430      (high address)
   431    +-----------------+
   432    |      `...`      |
   433    |      ret Y      |
   434    |      `...`      |
   435    |      ret 0      |
   436    |      arg X      |
   437    |      `...`      |
   438    |      arg 1      |
   439    |      arg 0      |
   440    | (arch-specific) |
   441    +-----------------+
   442    |    clobbered M  |
   443    |   ............  |
   444    |    clobbered 1  |
   445    |    clobbered 0  |
   446    |   spill slot N  |
   447    |   ............  |
   448    |   spill slot 0  |
   449    +-----------------+
   450       (low address)
   451  ```
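
The size of the lower part of the frame can be derived from the two counts produced by the register allocator. The sketch below assumes 8-byte slots and a 16-byte stack alignment; these constants, as well as the `frameLayout` type and `layout` function, are assumptions for illustration, and the actual sizes are decided by each architecture's implementation.

```go
package main

import "fmt"

const (
	slotSize  = 8  // assumed size in bytes of one saved register or spill slot
	alignment = 16 // assumed stack alignment in bytes
)

// frameLayout describes the clobbered and spill areas of the diagram above.
type frameLayout struct {
	spillOffset     int64 // offset of spill slot 0 from the final stack pointer
	clobberedOffset int64 // offset of clobbered 0 from the final stack pointer
	size            int64 // total bytes reserved by the prologue, aligned
}

// layout places the spill slots at the lowest addresses, the clobbered
// registers right above them, and rounds the total up to the alignment.
func layout(numClobbered, numSpillSlots int64) frameLayout {
	spill := numSpillSlots * slotSize
	raw := spill + numClobbered*slotSize
	size := (raw + alignment - 1) / alignment * alignment
	return frameLayout{spillOffset: 0, clobberedOffset: spill, size: size}
}

func main() {
	l := layout(2, 3)
	fmt.Printf("spill slots at +%d, clobbered registers at +%d, frame size %d\n",
		l.spillOffset, l.clobberedOffset, l.size)
}
```

Both counts are only known once register allocation has finished, which is why the frame is laid out at this late stage.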
   452  
Note: the prologue might also introduce a check of the stack bounds. If there
is not enough space to allocate the stack frame, the function exits execution
and tries to grow the stack from the Go runtime.
   456  
   457  The epilogue simply reverses the operations of the prologue.
   458  
   459  ### Other Post-RegAlloc Logic
   460  
The `backend.Machine.PostRegAlloc` method is invoked after the register
allocation procedure; while its main role is to define the prologue and
epilogue of the function, it also serves as a hook to perform other
arch-specific duties that have to happen after the register allocation phase.
   465  
For instance, on `amd64`, the constraints for some instructions are hard to
express in a meaningful way for the register allocation procedure (for
instance, the `div` instruction implicitly uses registers `rdx` and `rax`).
Instead, they are lowered with ad-hoc logic as part of the implementation of
the `backend.Machine.PostRegAlloc` method.
   471  
   472  ### Encoding
   473  
   474  The final stage of the backend encodes the machine instructions into bytes and
   475  writes them to the target buffer. Before proceeding with the encoding, relative
   476  addresses in branching instructions or addressing modes are resolved.
   477  
   478  The procedure encodes the instructions in the order they appear in the
   479  function.
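
Resolving relative addresses before encoding can be done in two passes: one to find where each label lands, and one to turn symbolic branch targets into offsets. The sketch below assumes fixed-width 4-byte instructions (as on `arm64`); the `inst` type and `resolve` function are hypothetical and do not reflect the actual encoder.

```go
package main

import "fmt"

// inst is a hypothetical machine instruction. A non-empty label marks the
// start of a basic block; a non-empty target makes it a branch.
type inst struct {
	label, op, target string
}

const instSize = 4 // assume fixed-width instructions

// resolve returns, for each branch (keyed by instruction index), the byte
// offset of its target relative to the branch instruction itself.
func resolve(code []inst) map[int]int {
	// Pass 1: record the byte offset of every label.
	labels := map[string]int{}
	for i, in := range code {
		if in.label != "" {
			labels[in.label] = i * instSize
		}
	}
	// Pass 2: replace each symbolic target with a relative offset.
	rel := map[int]int{}
	for i, in := range code {
		if in.target != "" {
			rel[i] = labels[in.target] - i*instSize
		}
	}
	return rel
}

func main() {
	code := []inst{
		{label: "L1", op: "subs"},
		{op: "b.ge", target: "L2"},
		{label: "L3", op: "sub"},
		{op: "b", target: "L4"},
		{label: "L2", op: "mov"},
		{label: "L4", op: "ret"},
	}
	rel := resolve(code)
	for i, in := range code {
		if in.target != "" {
			fmt.Printf("%s %s -> %+d bytes\n", in.op, in.target, rel[i])
		}
	}
}
```

Two passes are needed because a forward branch such as `b.ge L2` refers to a label whose position is not yet known when the branch is first seen.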
   480  
   481  ### Code
   482  
   483  - The prologue and epilogue are set up as part of the
   484    `backend.Machine.PostRegAlloc` method.
   485  - The encoding is done by the `backend.Machine.Encode` method.
   486  
   487  ### Debug Flags
   488  
   489  - `wazevoapi.PrintFinalizedMachineCode` prints the assembly code of the
   490    function after the finalization phase.
   491  - `wazevoapi.printMachineCodeHexPerFunctionUnmodified` prints a hex
   492    representation of the function generated code as it is.
   493  - `wazevoapi.PrintMachineCodeHexPerFunctionDisassemblable` prints a hex
   494    representation of the function generated code that can be disassembled.
   495  
The reason for the distinction between the last two flags is that the generated
code in some cases might not be disassemblable. The
`PrintMachineCodeHexPerFunctionDisassemblable` flag prints a hex encoding of
the generated code that can be disassembled, but cannot be executed.
   500  
   501  <hr>
   502  
   503  * Previous Section: [Front-End](../frontend/)
   504  * Next Section: [Appendix: Trampolines](../appendix/)
   505  
   506  [ssa-book]: https://pfalcon.github.io/ssabook/latest/book-full.pdf
   507  [go-regalloc]: https://github.com/golang/go/blob/release-branch.go1.21/src/cmd/compile/internal/ssa/regalloc.go