gitee.com/quant1x/num@v0.3.2/asm/c2goasm/README.md (about)

     1  # c2goasm: C to Go Assembly
     2  
     3  ## Introduction
     4  
     5  This is a tool to convert assembly as generated by a C/C++ compiler into Golang assembly. It is meant to be used in
     6  combination with [asm2plan9s](https://github.com/minio/asm2plan9s) in order to automatically generate pure Go wrappers
     7  for C/C++ code (that may for instance take advantage of compiler SIMD intrinsics or `template<>` code).
     8  
     9  Mode of operation:
    10  
    11  ```
    12  $ c2goasm -a /path/to/some/great/c-code.s /path/to/now/great/golang-code_amd64.s
    13  ```
    14  
    15  You can optionally nicely format the code using [asmfmt](https://github.com/klauspost/asmfmt) by passing in an `-f`
    16  flag.
    17  
    18  This project has been developed as part of developing a Go wrapper
    19  around [Simd](https://github.com/fwessels/go-cv-simd). However it should also work with other projects and libraries.
    20  Keep in mind though that it is not intented to 'port' a complete C/C++ project in a single action but rather do it on a
    21  case-by-case basis per function/source file (and create accompanying high level Go code to call into the assembly code).
    22  
    23  ## Command line options
    24  
    25  ```
    26  $ c2goasm --help
    27  Usage of c2goasm:
    28    -a	Immediately invoke asm2plan9s
    29    -c	Compact byte codes
    30    -f	Format using asmfmt
    31    -s	Strip comments
    32  ```
    33  
    34  ## A simple example
    35  
    36  Here is a simple C function doing an AVX2 intrinsics computation:
    37  
    38  ```
    39  void MultiplyAndAdd(float* arg1, float* arg2, float* arg3, float* result) {
    40      __m256 vec1 = _mm256_load_ps(arg1);
    41      __m256 vec2 = _mm256_load_ps(arg2);
    42      __m256 vec3 = _mm256_load_ps(arg3);
    43      __m256 res  = _mm256_fmadd_ps(vec1, vec2, vec3);
    44      _mm256_storeu_ps(result, res);
    45  }
    46  ```
    47  
    48  Compiling into assembly gives the following
    49  
    50  ```
    51  __ZN14MultiplyAndAddEPfS1_S1_S1_: ## @_ZN14MultiplyAndAddEPfS1_S1_S1_
    52  ## BB#0:
    53          push          rbp
    54          mov           rbp, rsp
    55          vmovups       ymm0, ymmword ptr [rdi]
    56          vmovups       ymm1, ymmword ptr [rsi]
    57          vfmadd213ps   ymm1, ymm0, ymmword ptr [rdx]
    58          vmovups       ymmword ptr [rcx], ymm1
    59          pop           rbp
    60          vzeroupper
    61          ret
    62  ```
    63  
    64  Running `c2goasm` will generate the following Go assembly (eg. saved in `MultiplyAndAdd_amd64.s`)
    65  
    66  ```
    67  //+build !noasm !appengine
    68  // AUTO-GENERATED BY C2GOASM -- DO NOT EDIT
    69  
    70  TEXT ยท_MultiplyAndAdd(SB), $0-32
    71  
    72  	MOVQ vec1+0(FP), DI
    73  	MOVQ vec2+8(FP), SI
    74  	MOVQ vec3+16(FP), DX
    75  	MOVQ result+24(FP), CX
    76  
    77  	LONG $0x0710fcc5             // vmovups    ymm0, yword [rdi]
    78  	LONG $0x0e10fcc5             // vmovups    ymm1, yword [rsi]
    79  	LONG $0xa87de2c4; BYTE $0x0a // vfmadd213ps    ymm1, ymm0, yword [rdx]
    80  	LONG $0x0911fcc5             // vmovups    yword [rcx], ymm1
    81  
    82  	VZEROUPPER
    83  	RET
    84  ```
    85  
    86  This needs to be accompanied by the following Go code (in `MultiplyAndAdd_amd64.go`)
    87  
    88  ```
    89  //go:noescape
    90  func _MultiplyAndAdd(vec1, vec2, vec3, result unsafe.Pointer)
    91  
    92  func MultiplyAndAdd(someObj Object) {
    93  
    94  	_MultiplyAndAdd(someObj.GetVec1(), someObj.GetVec2(), someObj.GetVec3(), someObj.GetResult()))
    95  }
    96  ```
    97  
    98  And as you may have gathered the amd64.go file needs to be in place in order for the arguments names to be derived (and
    99  allow `go vet` to succeed).
   100  
   101  ## Benchmark against cgo
   102  
   103  We have run benchmarks of `c2goasm` versus `cgo` for both Go version 1.7.5 and 1.8.1. You can find the `c2goasm`
   104  benchmark test in `test/` and the `cgo` test in `cgocmp/` respectively. Here are the results for both versions:
   105  
   106  ```
   107  $ benchcmp ../cgocmp/cgo-1.7.5.out c2goasm.out 
   108  benchmark                      old ns/op     new ns/op     delta
   109  BenchmarkMultiplyAndAdd-12     382           10.9          -97.15%
   110  ```
   111  
   112  ```
   113  $ benchcmp ../cgocmp/cgo-1.8.1.out c2goasm.out 
   114  benchmark                      old ns/op     new ns/op     delta
   115  BenchmarkMultiplyAndAdd-12     236           10.9          -95.38%
   116  ```
   117  
   118  As you can see Golang 1.8 has made a significant improvement (38.2%) over 1.7.5, but it is still about 20x slower than
   119  directly calling into assembly code as wrapped by `c2goasm`.
   120  
   121  ## Converted projects
   122  
   123  - [go-cv-simd (WIP)](https://github.com/fwessels/go-cv-simd)
   124  
   125  ## Internals
   126  
   127  The basic process is to (in the prologue) setup the stack and registers as how the C code expects this to be the case,
   128  and upon exiting the subroutine (in the epilogue) to revert back to the golang world and pass a return value back if
   129  required. In more details:
   130  
   131  - Define assembly subroutine with proper golang decoration in terms of needed stack space and overall size of arguments
   132    plus return value.
   133  - Function arguments are loaded from the golang stack into registers and prior to starting the C code any arguments
   134    beyond 6 are stored in C stack space.
   135  - Stack space is reserved and setup for the C code. Depending on the C code, the stack pointer maybe aligned on a
   136    certain boundary (especially needed for code that takes advantages of SIMD instructions such as AVX etc.).
   137  - A constants table is generated (if needed) and any `rip`-based references are replaced with proper offsets to where Go
   138    will put the table.
   139  
   140  ## Limitations
   141  
   142  - Arguments need (for now) to be 64-bit size, meaning either a value or a pointer (this requirement will be lifted)
   143  - Maximum number of 14 arguments (hard limit -- if you hit this maybe you should rethink your api anyway...)
   144  - Generally no `call` statements (thus inline your C code) with a couple of exceptions for functions such as `memset`
   145    and `memcpy` (see `clib_amd64.s`)
   146  
   147  ## Generate assembly from C/C++
   148  
   149  For eg. projects using cmake, here is how to see a list of assembly targets
   150  
   151  ```
   152  $ make help | grep "\.s"
   153  ```
   154  
   155  To see the actual command to generate the assembly
   156  
   157  ```
   158  $ make -n SimdAvx2BgraToGray.s
   159  ```
   160  
   161  ## Supported golang architectures
   162  
   163  For now just the AMD64 architecture is supported. Also ARM64 should work just fine in a similar fashion but support is
   164  lacking at the moment.
   165  
   166  ## Compatible compilers
   167  
   168  The following compilers have been tested:
   169  
   170  - `clang` (Apple LLVM version) on OSX/darwin
   171  - `clang` on linux
   172  
   173  Compiler flags:
   174  
   175  ```
   176  -masm=intel -mno-red-zone -mstackrealign -mllvm -inline-threshold=1000 -fno-asynchronous-unwind-tables -fno-exceptions -fno-rtti
   177  ```
   178  
   179  | Flag                              | Explanation                                                                                             |
   180  |:----------------------------------|:--------------------------------------------------------------------------------------------------------|
   181  | `-masm=intel`                     | Output Intel syntax for assembly                                                                        |
   182  | `-mno-red-zone`                   | Do not write below stack pointer (avoid [red zone](https://en.wikipedia.org/wiki/Red_zone_(computing))) |
   183  | `-mstackrealign`                  | Use explicit stack initialization                                                                       |
   184  | `-mllvm -inline-threshold=1000`   | Higher limit for inlining heuristic (default=255)                                                       |
   185  | `-fno-asynchronous-unwind-tables` | Do not generate unwind tables (for debug purposes)                                                      |
   186  | `-fno-exceptions`                 | Disable exception handling                                                                              |
   187  | `-fno-rtti`                       | Disable run-time type information                                                                       |
   188  
   189  The following flags are only available in `clang -cc1` frontend mode (see [below]()):
   190  
   191  | Flag               | Explanation                                                        |
   192  |:-------------------|:-------------------------------------------------------------------|
   193  | `-fno-jump-tables` | Do not use jump tables as may be generated for `select` statements |
   194  
   195  #### `clang` vs `clang -cc1`
   196  
   197  As per the clang [FAQ](https://clang.llvm.org/docs/FAQ.html#driver), `clang -cc1` is the frontend, and `clang` is a (
   198  mostly GCC compatible) driver for the frontend. To see all options that the driver passes on to the frontend, use `-###`
   199  like this:
   200  
   201  ```
   202  $ clang -### -c hello.c
   203  "/usr/lib/llvm/bin/clang" "-cc1" "-triple" "x86_64-pc-linux-gnu" etc. etc. etc.
   204  ```
   205  
   206  #### Command line flags for clang
   207  
   208  To see all command line flags use either `clang --help` or `clang --help-hidden` for the clang driver
   209  or `clang -cc1 -help` for the frontend.
   210  
   211  #### Further optimization and fine tuning
   212  
   213  Using the LLVM optimizer ([opt](http://llvm.org/docs/CommandGuide/opt.html)) you can further optimize the code
   214  generation. Use `opt -help` or `opt -help-hidden` for all available options.
   215  
   216  An option can be passed in via `clang` using the `-mllvm <value>` option, such as `-mllvm -inline-threshold=1000` as
   217  discussed above.
   218  
   219  Also LLVM allows you to tune specific functions
   220  via [function attributes](http://llvm.org/docs/LangRef.html#function-attributes)
   221  like `define void @f() alwaysinline norecurse { ... }`.
   222  
   223  #### What about GCC support?
   224  
   225  For now GCC code will not work out of the box. However there is no reason why GCC should not work fundamentally (PRs are
   226  welcome).
   227  
   228  ## Resources
   229  
   230  - [A Primer on Go Assembly](https://github.com/teh-cmc/go-internals/blob/master/chapter1_assembly_primer/README.md)
   231  - [Go Function in Assembly](https://github.com/golang/go/files/447163/GoFunctionsInAssembly.pdf)
   232  - [Stack frame layout on x86-64](http://eli.thegreenplace.net/2011/09/06/stack-frame-layout-on-x86-64)
   233  - [Compiler Explorer (interactive)](https://go.godbolt.org/)
   234  
   235  ## License
   236  
   237  c2goasm is released under the Apache License v2.0. You can find the complete text in the file LICENSE.
   238  
   239  ## Contributing
   240  
   241  Contributions are welcome, please send PRs for any enhancements.