gitee.com/quant1x/num@v0.3.2/asm/c2goasm/README.md (about) 1 # c2goasm: C to Go Assembly 2 3 ## Introduction 4 5 This is a tool to convert assembly as generated by a C/C++ compiler into Golang assembly. It is meant to be used in 6 combination with [asm2plan9s](https://github.com/minio/asm2plan9s) in order to automatically generate pure Go wrappers 7 for C/C++ code (that may for instance take advantage of compiler SIMD intrinsics or `template<>` code). 8 9 Mode of operation: 10 11 ``` 12 $ c2goasm -a /path/to/some/great/c-code.s /path/to/now/great/golang-code_amd64.s 13 ``` 14 15 You can optionally nicely format the code using [asmfmt](https://github.com/klauspost/asmfmt) by passing in an `-f` 16 flag. 17 18 This project has been developed as part of developing a Go wrapper 19 around [Simd](https://github.com/fwessels/go-cv-simd). However it should also work with other projects and libraries. 20 Keep in mind though that it is not intented to 'port' a complete C/C++ project in a single action but rather do it on a 21 case-by-case basis per function/source file (and create accompanying high level Go code to call into the assembly code). 22 23 ## Command line options 24 25 ``` 26 $ c2goasm --help 27 Usage of c2goasm: 28 -a Immediately invoke asm2plan9s 29 -c Compact byte codes 30 -f Format using asmfmt 31 -s Strip comments 32 ``` 33 34 ## A simple example 35 36 Here is a simple C function doing an AVX2 intrinsics computation: 37 38 ``` 39 void MultiplyAndAdd(float* arg1, float* arg2, float* arg3, float* result) { 40 __m256 vec1 = _mm256_load_ps(arg1); 41 __m256 vec2 = _mm256_load_ps(arg2); 42 __m256 vec3 = _mm256_load_ps(arg3); 43 __m256 res = _mm256_fmadd_ps(vec1, vec2, vec3); 44 _mm256_storeu_ps(result, res); 45 } 46 ``` 47 48 Compiling into assembly gives the following 49 50 ``` 51 __ZN14MultiplyAndAddEPfS1_S1_S1_: ## @_ZN14MultiplyAndAddEPfS1_S1_S1_ 52 ## BB#0: 53 push rbp 54 mov rbp, rsp 55 vmovups ymm0, ymmword ptr [rdi] 56 vmovups ymm1, ymmword ptr [rsi] 57 vfmadd213ps ymm1, ymm0, ymmword ptr [rdx] 58 vmovups ymmword ptr [rcx], ymm1 59 pop rbp 60 vzeroupper 61 ret 62 ``` 63 64 Running `c2goasm` will generate the following Go assembly (eg. saved in `MultiplyAndAdd_amd64.s`) 65 66 ``` 67 //+build !noasm !appengine 68 // AUTO-GENERATED BY C2GOASM -- DO NOT EDIT 69 70 TEXT ยท_MultiplyAndAdd(SB), $0-32 71 72 MOVQ vec1+0(FP), DI 73 MOVQ vec2+8(FP), SI 74 MOVQ vec3+16(FP), DX 75 MOVQ result+24(FP), CX 76 77 LONG $0x0710fcc5 // vmovups ymm0, yword [rdi] 78 LONG $0x0e10fcc5 // vmovups ymm1, yword [rsi] 79 LONG $0xa87de2c4; BYTE $0x0a // vfmadd213ps ymm1, ymm0, yword [rdx] 80 LONG $0x0911fcc5 // vmovups yword [rcx], ymm1 81 82 VZEROUPPER 83 RET 84 ``` 85 86 This needs to be accompanied by the following Go code (in `MultiplyAndAdd_amd64.go`) 87 88 ``` 89 //go:noescape 90 func _MultiplyAndAdd(vec1, vec2, vec3, result unsafe.Pointer) 91 92 func MultiplyAndAdd(someObj Object) { 93 94 _MultiplyAndAdd(someObj.GetVec1(), someObj.GetVec2(), someObj.GetVec3(), someObj.GetResult())) 95 } 96 ``` 97 98 And as you may have gathered the amd64.go file needs to be in place in order for the arguments names to be derived (and 99 allow `go vet` to succeed). 100 101 ## Benchmark against cgo 102 103 We have run benchmarks of `c2goasm` versus `cgo` for both Go version 1.7.5 and 1.8.1. You can find the `c2goasm` 104 benchmark test in `test/` and the `cgo` test in `cgocmp/` respectively. Here are the results for both versions: 105 106 ``` 107 $ benchcmp ../cgocmp/cgo-1.7.5.out c2goasm.out 108 benchmark old ns/op new ns/op delta 109 BenchmarkMultiplyAndAdd-12 382 10.9 -97.15% 110 ``` 111 112 ``` 113 $ benchcmp ../cgocmp/cgo-1.8.1.out c2goasm.out 114 benchmark old ns/op new ns/op delta 115 BenchmarkMultiplyAndAdd-12 236 10.9 -95.38% 116 ``` 117 118 As you can see Golang 1.8 has made a significant improvement (38.2%) over 1.7.5, but it is still about 20x slower than 119 directly calling into assembly code as wrapped by `c2goasm`. 120 121 ## Converted projects 122 123 - [go-cv-simd (WIP)](https://github.com/fwessels/go-cv-simd) 124 125 ## Internals 126 127 The basic process is to (in the prologue) setup the stack and registers as how the C code expects this to be the case, 128 and upon exiting the subroutine (in the epilogue) to revert back to the golang world and pass a return value back if 129 required. In more details: 130 131 - Define assembly subroutine with proper golang decoration in terms of needed stack space and overall size of arguments 132 plus return value. 133 - Function arguments are loaded from the golang stack into registers and prior to starting the C code any arguments 134 beyond 6 are stored in C stack space. 135 - Stack space is reserved and setup for the C code. Depending on the C code, the stack pointer maybe aligned on a 136 certain boundary (especially needed for code that takes advantages of SIMD instructions such as AVX etc.). 137 - A constants table is generated (if needed) and any `rip`-based references are replaced with proper offsets to where Go 138 will put the table. 139 140 ## Limitations 141 142 - Arguments need (for now) to be 64-bit size, meaning either a value or a pointer (this requirement will be lifted) 143 - Maximum number of 14 arguments (hard limit -- if you hit this maybe you should rethink your api anyway...) 144 - Generally no `call` statements (thus inline your C code) with a couple of exceptions for functions such as `memset` 145 and `memcpy` (see `clib_amd64.s`) 146 147 ## Generate assembly from C/C++ 148 149 For eg. projects using cmake, here is how to see a list of assembly targets 150 151 ``` 152 $ make help | grep "\.s" 153 ``` 154 155 To see the actual command to generate the assembly 156 157 ``` 158 $ make -n SimdAvx2BgraToGray.s 159 ``` 160 161 ## Supported golang architectures 162 163 For now just the AMD64 architecture is supported. Also ARM64 should work just fine in a similar fashion but support is 164 lacking at the moment. 165 166 ## Compatible compilers 167 168 The following compilers have been tested: 169 170 - `clang` (Apple LLVM version) on OSX/darwin 171 - `clang` on linux 172 173 Compiler flags: 174 175 ``` 176 -masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti 177 ``` 178 179 | Flag | Explanation | 180 |:----------------------------------|:--------------------------------------------------------------------------------------------------------| 181 | `-masm=intel` | Output Intel syntax for assembly | 182 | `-mno-red-zone` | Do not write below stack pointer (avoid [red zone](https://en.wikipedia.org/wiki/Red_zone_(computing))) | 183 | `-mstackrealign` | Use explicit stack initialization | 184 | `-mllvm -inline-threshold=1000` | Higher limit for inlining heuristic (default=255) | 185 | `-fno-asynchronous-unwind-tables` | Do not generate unwind tables (for debug purposes) | 186 | `-fno-exceptions` | Disable exception handling | 187 | `-fno-rtti` | Disable run-time type information | 188 189 The following flags are only available in `clang -cc1` frontend mode (see [below]()): 190 191 | Flag | Explanation | 192 |:-------------------|:-------------------------------------------------------------------| 193 | `-fno-jump-tables` | Do not use jump tables as may be generated for `select` statements | 194 195 #### `clang` vs `clang -cc1` 196 197 As per the clang [FAQ](https://clang.llvm.org/docs/FAQ.html#driver), `clang -cc1` is the frontend, and `clang` is a ( 198 mostly GCC compatible) driver for the frontend. To see all options that the driver passes on to the frontend, use `-###` 199 like this: 200 201 ``` 202 $ clang -### -c hello.c 203 "/usr/lib/llvm/bin/clang" "-cc1" "-triple" "x86_64-pc-linux-gnu" etc. etc. etc. 204 ``` 205 206 #### Command line flags for clang 207 208 To see all command line flags use either `clang --help` or `clang --help-hidden` for the clang driver 209 or `clang -cc1 -help` for the frontend. 210 211 #### Further optimization and fine tuning 212 213 Using the LLVM optimizer ([opt](http://llvm.org/docs/CommandGuide/opt.html)) you can further optimize the code 214 generation. Use `opt -help` or `opt -help-hidden` for all available options. 215 216 An option can be passed in via `clang` using the `-mllvm <value>` option, such as `-mllvm -inline-threshold=1000` as 217 discussed above. 218 219 Also LLVM allows you to tune specific functions 220 via [function attributes](http://llvm.org/docs/LangRef.html#function-attributes) 221 like `define void @f() alwaysinline norecurse { ... }`. 222 223 #### What about GCC support? 224 225 For now GCC code will not work out of the box. However there is no reason why GCC should not work fundamentally (PRs are 226 welcome). 227 228 ## Resources 229 230 - [A Primer on Go Assembly](https://github.com/teh-cmc/go-internals/blob/master/chapter1_assembly_primer/README.md) 231 - [Go Function in Assembly](https://github.com/golang/go/files/447163/GoFunctionsInAssembly.pdf) 232 - [Stack frame layout on x86-64](http://eli.thegreenplace.net/2011/09/06/stack-frame-layout-on-x86-64) 233 - [Compiler Explorer (interactive)](https://go.godbolt.org/) 234 235 ## License 236 237 c2goasm is released under the Apache License v2.0. You can find the complete text in the file LICENSE. 238 239 ## Contributing 240 241 Contributions are welcome, please send PRs for any enhancements.