+++
title = "How the Optimizing Compiler Works: Back-End"
layout = "single"
+++

In this section we will discuss the phases in the back-end of the optimizing
compiler:

- [Instruction Selection](#instruction-selection)
- [Register Allocation](#register-allocation)
- [Finalization and Encoding](#finalization-and-encoding)

Each section will include a brief explanation of the phase, references to the
code that implements the phase, and a description of the debug flags that can
be used to inspect that phase. Please note that, since the implementation of
the back-end is architecture-specific, the code might be different for each
architecture.

### Code

The higher-level entry-point to the back-end is the
`backend.Compiler.Compile(context.Context)` method. This method executes, in
turn, the following methods in the same type:

- `backend.Compiler.Lower()` (instruction selection)
- `backend.Compiler.RegAlloc()` (register allocation)
- `backend.Compiler.Finalize(context.Context)` (finalization and encoding)

## Instruction Selection

The instruction selection phase is responsible for mapping the higher-level SSA
instructions to arch-specific instructions. Each SSA instruction is translated
to one or more machine instructions.

Each target architecture comes with a different number of registers: some of
them are general purpose, others might be specific to certain instructions. In
general, we can expect to have a set of registers for integer computations,
another set for floating-point computations, a set for vector (SIMD)
computations, and some special-purpose registers (e.g. stack pointers,
program counters, status flags, etc.).

In addition, some registers might be reserved by the Go runtime or the
Operating System for specific purposes, so they should be handled with special
care.

At this point in the compilation process we do not want to deal with all that.
Instead, we assume that we have a potentially infinite number of *virtual
registers* of each type at our disposal. The next phase, the register
allocation phase, will map these virtual registers to the actual registers of
the target architecture.

### Operands and Addressing Modes

As a rule of thumb, we want to map each `ssa.Value` to a virtual register, and
then use that virtual register as one of the arguments of the machine
instruction that we will generate. However, instructions are usually able to
address more than just registers: an *operand* might also represent a
memory address, or an immediate value (i.e. a constant value that is encoded as
part of the instruction itself).

For these reasons, instead of mapping each `ssa.Value` to a virtual register
(`regalloc.VReg`), we map each `ssa.Value` to an architecture-specific
`operand` type.

During lowering of an `ssa.Instruction`, each `ssa.Value` that is used as an
argument of the instruction is turned into an `operand`: in the simplest case,
the `operand` wraps a virtual register; in other cases, it represents a memory
address or an immediate value. Sometimes this makes it possible to replace
several SSA instructions with a single machine instruction, by folding the
addressing mode into the instruction itself.
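To make the idea concrete, here is a minimal, hypothetical sketch of what such
an `operand` type might look like (the names below are illustrative only and do
not match the actual per-architecture types in wazero):

```go
package main

import "fmt"

// operandKind enumerates the ways a machine instruction can refer to a value.
type operandKind int

const (
	operandKindReg operandKind = iota // a virtual (or real) register
	operandKindImm                    // an immediate encoded into the instruction
	operandKindMem                    // a memory address: base register + offset
)

// operand is a small tagged union; only the fields matching kind are meaningful.
type operand struct {
	kind   operandKind
	reg    int   // register number (for reg operands, or the base of a mem operand)
	imm    int64 // immediate value (for imm operands)
	offset int32 // displacement (for mem operands)
}

func main() {
	// A load such as `v6:i32 = Load v5, 0x4` can be folded into its consumer as
	// the memory operand [r5? + 4] instead of being lowered to a separate load.
	folded := operand{kind: operandKindMem, reg: 5, offset: 4}
	fmt.Printf("[r%d? + %d]\n", folded.reg, folded.offset)
}
```

An instruction thus carries `operand`s rather than raw virtual registers, and
folding a load means turning it into a memory operand of its consumer.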
For instance, consider the following SSA instructions:

```
v4:i32 = Const 0x9
v6:i32 = Load v5, 0x4
v7:i32 = Iadd v6, v4
```

In the `amd64` architecture, the `add` instruction adds the second operand to
the first operand, and assigns the result to the second operand. So, assuming
that `v4`, `v5`, `v6`, and `v7` are mapped respectively to the virtual
registers `r4?`, `r5?`, `r6?`, and `r7?`, the lowering of the `Iadd`
instruction on `amd64` might look like this:

```asm
;; AT&T syntax
add $4(%r5?), %r4? ;; add the value at memory address [`r5?` + 4] to `r4?`
mov %r4?, %r7?     ;; move the result from `r4?` to `r7?`
```

Notice how the load from memory has been folded into an operand of the `add`
instruction. This transformation is possible when the value produced by the
instruction being folded is not referenced by other instructions and the
instructions belong to the same `InstructionGroupID` (see [Front-End:
Optimization](../frontend/#optimization)).

### Example

At the end of the instruction selection phase, the basic blocks of our `abs`
function will look as follows (for `arm64`):

```asm
L1 (SSA Block: blk0):
    mov x130?, x2
    subs wzr, w130?, #0x0
    b.ge L2
L3 (SSA Block: blk1):
    mov x136?, xzr
    sub w134?, w136?, w130?
    mov x135?, x134?
    b L4
L2 (SSA Block: blk2):
    mov x135?, x130?
L4 (SSA Block: blk3):
    mov x0, x135?
    ret
```

Notice the introduction of the new identifiers `L1`, `L3`, `L2`, and `L4`.
These are labels that mark the beginning of each basic block, and they are
the targets of branching instructions such as `b` and `b.ge`.

### Code

`backend.Machine` is the interface to the backend. It has methods to translate
(lower) the IR to machine code. Again, as seen earlier in the front-end, the
term *lowering* is used to indicate translation from a higher-level
representation to a lower-level representation.

`backend.Machine.LowerInstr(*ssa.Instruction)` is the method that translates an
SSA instruction to machine code. Machine-specific implementations of this
method can be found in package `backend/isa/<arch>`, where `<arch>` is either
`amd64` or `arm64`.

### Debug Flags

`wazevoapi.PrintSSAToBackendIRLowering` prints the basic blocks with the
lowered arch-specific instructions.

## Register Allocation

The register allocation phase is responsible for mapping the potentially
infinite number of virtual registers to the real registers of the target
architecture. Because the number of real registers is limited, the register
allocation phase might need to "spill" some of the virtual registers to memory;
that is, it might store their content to memory, and then load them back into a
register when they are needed.

For a given function `f`, the register allocation procedure
`regalloc.Allocator.DoAllocation(f)` is implemented in sub-phases:

- `livenessAnalysis(f)` collects the "liveness" information for each virtual
  register. The algorithm is described in [Chapter 9.2 of The SSA
  Book][ssa-book].

- `alloc(f)` allocates registers for the given function. The algorithm is
  derived from [the Go compiler's allocator][go-regalloc].

At the end of the allocation procedure, we also record the set of registers
that are **clobbered** by the body of the function. A register is clobbered if
its value is overwritten by the function and it is not saved by the callee.
This information is used in the finalization phase to determine which registers
need to be saved in the prologue and restored in the epilogue. Strictly
speaking, this is not part of register allocation in the textbook sense, but it
is a necessary step for the finalization phase.

### Liveness Analysis

Intuitively, a variable or name binding can be considered _live_ at a certain
point in a program if its value will be used in the future.

For instance:

```
1| int f(int x) {
2|   int y = 2 + x;
3|   int z = x + y;
4|   return z;
5| }
```

Variables `x` and `y` are both live at line 3, because they are used in the
expression `x + y` on line 3; variable `z` is live at line 4, because it is
used in the return statement. However, variables `x` and `y` can be considered
_not_ live at line 4, because they are not used anywhere after line 3.

Statically, _liveness_ can be approximated by following paths backwards on the
control-flow graph, connecting the uses of a given variable to its definitions
(or to its *unique* definition, assuming SSA form).

In practice, while liveness is a property of each name binding at any point in
the program, it is enough to keep track of liveness at the boundaries of basic
blocks:

- the _live-in_ set for a given basic block is the set of all bindings that are
  live at the entry of that block.
- the _live-out_ set for a given basic block is the set of all bindings that
  are live at the exit of that block. A binding is live at the exit of a block
  if it is live at the entry of a successor.

Because the CFG is a connected graph, it is enough to keep track of either
live-in or live-out sets, and then propagate the liveness information backward
or forward, respectively. In our case, we keep track of live-in sets per block;
live-outs are derived from the live-ins of the successor blocks when a block is
allocated.
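To make the data-flow formulation concrete, here is a small, self-contained
sketch of the classic backward liveness computation over a toy CFG. The toy
`block` type below is purely illustrative and is not the representation used by
`regalloc`:

```go
package main

import "fmt"

// block is a toy basic block: the virtual registers it uses (before any local
// definition), the ones it defines, and the indices of its successor blocks.
type block struct {
	uses, defs []string
	succs      []int
}

// liveIns computes the live-in set of every block with the classic backward
// data-flow equations, iterating until a fixed point is reached:
//
//	liveOut[b] = union of liveIn[s] over the successors s of b
//	liveIn[b]  = uses[b] ∪ (liveOut[b] \ defs[b])
func liveIns(blocks []block) []map[string]bool {
	in := make([]map[string]bool, len(blocks))
	for i := range in {
		in[i] = map[string]bool{}
	}
	for changed := true; changed; {
		changed = false
		for b := len(blocks) - 1; b >= 0; b-- { // backward order converges faster
			out := map[string]bool{}
			for _, s := range blocks[b].succs {
				for v := range in[s] {
					out[v] = true
				}
			}
			for _, d := range blocks[b].defs {
				delete(out, d)
			}
			for _, u := range blocks[b].uses {
				out[u] = true
			}
			for v := range out {
				if !in[b][v] {
					in[b][v] = true
					changed = true
				}
			}
		}
	}
	return in
}

func main() {
	// blk0 defines v1 and jumps to blk1; blk1 uses v1.
	blocks := []block{
		{defs: []string{"v1"}, succs: []int{1}},
		{uses: []string{"v1"}},
	}
	fmt.Println(liveIns(blocks)) // [map[] map[v1:true]]
}
```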
### Allocation

We implemented a variant of the linear scan register allocation algorithm
described in [the Go compiler's allocator][go-regalloc].

Each basic block is allocated registers in a linear scan order, and the
allocation state is propagated from a given basic block to its successors.
Then, each block continues allocation from that initial state.

#### Merge States

Special care has to be taken when a block has multiple predecessors. We call
this *fixing merge states*: for instance, consider the following:

```goat { width="30%" }
 .---.     .---.
| BB0 |   | BB1 |
 '-+-'     '-+-'
   +----+----+
        |
        v
      .---.
     | BB2 |
      '---'
```

If the live-out set of a given block `BB0` is different from the live-out set
of a given block `BB1`, and both are predecessors of a block `BB2`, then we
need to adjust `BB0` and `BB1` to ensure consistency with `BB2`. In practice,
abstract values in `BB0` and `BB1` might be passed to `BB2` either via registers
or via the stack; fixing merge states ensures that registers and the stack are
used consistently to pass values across the involved states.

#### Spilling

If the register allocator cannot find a free register for a given virtual
(live) register, it needs to "spill" the value to the stack to get a free
register, *i.e.,* stash it temporarily to the stack. When that virtual register
is reused later, we will have to insert instructions to reload the value into a
real register.
As the allocation proceeds, the procedure also records all the virtual
registers that transition to the "spilled" state, and inserts the reload
instructions when those registers are reused later.

The spill instructions are actually inserted at the end of register allocation,
after all the allocations have been made and the merge states have been fixed.
At this point, all the other potential sources of instability have been
resolved, and we know where all the reloads happen.

We insert the spills in the block that is the lowest common ancestor of all the
blocks that reload the value.

#### Clobbered Registers

At the end of the allocation procedure, the `determineCalleeSavedRealRegs(f)`
method iterates over the set of allocated registers and compares them against
the architecture-specific set `CalleeSavedRegisters`. If a register has been
allocated and it is present in this set, the register is marked as "clobbered",
i.e., we now know that the register allocator will overwrite its value. Thus,
these registers will have to be saved in the prologue and restored in the
epilogue.

#### References

Register allocation is a complex problem, possibly the most complicated part of
the backend. The following references were used to implement the algorithm:

- https://web.stanford.edu/class/archive/cs/cs143/cs143.1128/lectures/17/Slides17.pdf
- https://en.wikipedia.org/wiki/Chaitin%27s_algorithm
- https://llvm.org/ProjectsWithLLVM/2004-Fall-CS426-LS.pdf
- https://pfalcon.github.io/ssabook/latest/book-full.pdf: Chapter 9, for liveness analysis.
- https://github.com/golang/go/blob/release-branch.go1.21/src/cmd/compile/internal/ssa/regalloc.go

We suggest referring to them to dive deeper into the topic.

### Example

At the end of the register allocation phase, the basic blocks of our `abs`
function look as follows (for `arm64`):

```asm
L1 (SSA Block: blk0):
    mov x2, x2
    subs wzr, w2, #0x0
    b.ge L2
L3 (SSA Block: blk1):
    mov x8, xzr
    sub w8, w8, w2
    mov x8, x8
    b L4
L2 (SSA Block: blk2):
    mov x8, x2
L4 (SSA Block: blk3):
    mov x0, x8
    ret
```

Notice how the virtual registers have all been replaced by real registers, i.e.
no register identifier is suffixed with `?`. This example is quite simple, and
it does not require any spill.

### Code

The algorithm (`regalloc/regalloc.go`) can work on any ISA by implementing the
interfaces in `regalloc/api.go`.

Essentially:

- each architecture exposes iteration over the basic blocks of a function
  (`regalloc.Function` interface)
- each arch-specific basic block exposes iteration over instructions
  (`regalloc.Block` interface)
- each arch-specific instruction exposes the set of registers it defines and
  uses (`regalloc.Instr` interface)

By defining these interfaces, the register allocation algorithm can assign real
registers to virtual registers without dealing specifically with the target
architecture.

In practice, each interface is usually implemented by instantiating a common
generic struct that already comes with an implementation of all or most of the
required methods. For instance, `regalloc.Function` is implemented by
`backend.RegAllocFunction[*arm64.instruction, *arm64.machine]`.
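The following sketch illustrates this interface-driven design with deliberately
simplified stand-ins; the real `regalloc.Function`, `regalloc.Block`, and
`regalloc.Instr` interfaces have more methods and different signatures:

```go
package main

import "fmt"

// Simplified stand-ins for the regalloc interfaces: the allocator only needs
// to walk blocks and instructions and ask each instruction which virtual
// registers it defines and uses.
type Instr interface {
	Defs() []int // virtual registers written by the instruction
	Uses() []int // virtual registers read by the instruction
}

type Block interface {
	Instrs() []Instr
}

type Function interface {
	Blocks() []Block
}

// countUses is a toy architecture-independent pass: it works for any backend
// that implements the interfaces above.
func countUses(f Function) map[int]int {
	counts := map[int]int{}
	for _, b := range f.Blocks() {
		for _, i := range b.Instrs() {
			for _, v := range i.Uses() {
				counts[v]++
			}
		}
	}
	return counts
}

// A minimal fake backend used only to exercise countUses.
type fakeInstr struct{ defs, uses []int }

func (i fakeInstr) Defs() []int { return i.defs }
func (i fakeInstr) Uses() []int { return i.uses }

type fakeBlock []Instr

func (b fakeBlock) Instrs() []Instr { return b }

type fakeFunc []Block

func (f fakeFunc) Blocks() []Block { return f }

func main() {
	f := fakeFunc{fakeBlock{
		fakeInstr{defs: []int{1}},                    // v1 = ...
		fakeInstr{defs: []int{2}, uses: []int{1, 1}}, // v2 = v1 + v1
	}}
	fmt.Println(countUses(f)) // map[1:2]
}
```

Because such a pass only talks to the interfaces, the same code works for any
backend that implements them, which is how the allocator stays
architecture-independent.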
`backend/isa/<arch>/abi.go` (where `<arch>` is either `arm64` or `amd64`)
contains the instantiation of the `regalloc.RegisterInfo` struct, which
declares, among other things:

- the set of registers that are available for allocation, excluding, for
  instance, those that might be reserved by the runtime or the OS
  (`AllocatableRegisters`)
- the registers that might be saved by the callee to the stack
  (`CalleeSavedRegisters`)

### Debug Flags

- `wazevoapi.RegAllocLoggingEnabled` enables detailed logging of the register
  allocation procedure.
- `wazevoapi.PrintRegisterAllocated` prints the basic blocks with the register
  allocation result.

## Finalization and Encoding

At the end of the register allocation phase, we have enough information to
finally generate machine code (_encoding_). We are only missing the prologue
and epilogue of the function.

### Prologue and Epilogue

As usual, the **prologue** is executed before the main body of the function,
and the **epilogue** is executed at the return. The prologue is responsible for
setting up the stack frame, and the epilogue is responsible for cleaning up the
stack frame and returning control to the caller.

Generally, this means, at the very least:

- saving the return address
- saving a base pointer to the stack; or, equivalently, the height of the stack
  at the beginning of the function

For instance, on `amd64`, `RBP` is the base pointer, `RSP` is the stack
pointer:

```goat {width="100%" height="250"}
              (high address)                      (high address)
RBP ----> +-----------------+               +-----------------+
          |      `...`      |               |      `...`      |
          |      ret Y      |               |      ret Y      |
          |      `...`      |               |      `...`      |
          |      ret 0      |               |      ret 0      |
          |      arg X      |               |      arg X      |
          |      `...`      |     ====>     |      `...`      |
          |      arg 1      |               |      arg 1      |
          |      arg 0      |               |      arg 0      |
          |   Return Addr   |               |   Return Addr   |
RSP ----> +-----------------+               |   Caller_RBP    |
              (low address)                 +-----------------+ <----- RSP, RBP
                                                (low address)
```

While, on `arm64`, there is only a stack pointer `SP`:

```goat {width="100%" height="300"}
            (high address)                     (high address)
SP ---> +-----------------+             +------------------+ <----+
        |      `...`      |             |      `...`       |      |
        |      ret Y      |             |      ret Y       |      |
        |      `...`      |             |      `...`       |      |
        |      ret 0      |             |      ret 0       |      |
        |      arg X      |             |      arg X       |      | size_of_arg_ret
        |      `...`      |    ====>    |      `...`       |      |
        |      arg 1      |             |      arg 1       |      |
        |      arg 0      |             |      arg 0       | <----+
        +-----------------+             |  size_of_arg_ret |
            (low address)               |  return address  |
                                        +------------------+ <---- SP
                                            (low address)
```

However, the prologue and epilogue might also be responsible for saving and
restoring the state of registers that might be overwritten by the function
("clobbered"); and, if spilling occurs, the prologue and epilogue are also
responsible for reserving and releasing the space for the spilled values.

For clarity, we make a distinction between the space reserved for the clobbered
registers and the space reserved for the spilled values:

- Spill slots are used to temporarily store the values that need spilling, as
  determined by the register allocator. This section must have a fixed height,
  but its contents will change over time, as registers are spilled and
  reloaded.
- Clobbered registers are, similarly, determined by the register allocator, but
  they are stashed in the prologue and then restored in the epilogue.

This procedure happens after the register allocation phase because, at that
point, we have collected enough information to know how much space we need to
reserve, and which registers are clobbered.
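As a rough illustration of why this must wait until after register allocation,
the space the prologue reserves could be computed along these lines (a
simplified sketch with illustrative names; actual slot sizes, alignment, and
layout are arch-specific):

```go
package main

import "fmt"

// frameInfo stands for the information the register allocator hands over to
// the finalization phase. The names and the fixed 8-byte slot size are
// illustrative, not wazero's actual layout.
type frameInfo struct {
	spillSlots    int // number of spill slots required by the allocator
	clobberedRegs int // number of callee-saved registers that were clobbered
}

// frameSize computes the space the prologue must reserve for clobbered
// registers and spill slots, rounded up to a 16-byte boundary as most ABIs
// require for stack alignment.
func frameSize(fi frameInfo) int {
	size := 8 * (fi.clobberedRegs + fi.spillSlots)
	return (size + 15) &^ 15
}

func main() {
	fmt.Println(frameSize(frameInfo{spillSlots: 3, clobberedRegs: 2})) // 48
}
```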
Regardless of the architecture, after allocating this space, the stack will
look as follows:

```goat {height="350"}
    (high address)
+-----------------+
|      `...`      |
|      ret Y      |
|      `...`      |
|      ret 0      |
|      arg X      |
|      `...`      |
|      arg 1      |
|      arg 0      |
| (arch-specific) |
+-----------------+
|   clobbered M   |
|   ............  |
|   clobbered 1   |
|   clobbered 0   |
|   spill slot N  |
|   ............  |
|   spill slot 0  |
+-----------------+
    (low address)
```

Note: the prologue might also introduce a check of the stack bounds. If there
is not enough space to allocate the stack frame, the function will exit
execution and the stack will be grown from the Go runtime.

The epilogue simply reverses the operations of the prologue.

### Other Post-RegAlloc Logic

The `backend.Machine.PostRegAlloc` method is invoked after the register
allocation procedure; while its main role is to define the prologue and
epilogue of the function, it also serves as a hook to perform other
arch-specific duties that have to happen after the register allocation phase.

For instance, on `amd64`, the constraints for some instructions are hard to
express in a meaningful way for the register allocation procedure (for
instance, the `div` instruction implicitly uses the registers `rdx` and
`rax`). Instead, they are lowered with ad-hoc logic as part of the
implementation of the `backend.Machine.PostRegAlloc` method.

### Encoding

The final stage of the backend encodes the machine instructions into bytes and
writes them to the target buffer. Before proceeding with the encoding, relative
addresses in branching instructions and addressing modes are resolved.

The procedure encodes the instructions in the order they appear in the
function.

### Code

- The prologue and epilogue are set up as part of the
  `backend.Machine.PostRegAlloc` method.
- The encoding is done by the `backend.Machine.Encode` method.

### Debug Flags

- `wazevoapi.PrintFinalizedMachineCode` prints the assembly code of the
  function after the finalization phase.
- `wazevoapi.printMachineCodeHexPerFunctionUnmodified` prints a hex
  representation of the generated code for the function as it is.
- `wazevoapi.PrintMachineCodeHexPerFunctionDisassemblable` prints a hex
  representation of the generated code for the function that can be
  disassembled.

The reason for the distinction between the last two flags is that the generated
code in some cases might not be disassemblable. The
`PrintMachineCodeHexPerFunctionDisassemblable` flag prints a hex encoding of
the generated code that can be disassembled, but cannot be executed.

<hr>

* Previous Section: [Front-End](../frontend/)
* Next Section: [Appendix: Trampolines](../appendix/)

[ssa-book]: https://pfalcon.github.io/ssabook/latest/book-full.pdf
[go-regalloc]: https://github.com/golang/go/blob/release-branch.go1.21/src/cmd/compile/internal/ssa/regalloc.go