<!---
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements.  See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership.  The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
  KIND, either express or implied.  See the License for the
  specific language governing permissions and limitations
  under the License.
-->

# SIMD Bit Packing Implementation

Go doesn't have any SIMD intrinsics, so for some low-level optimizations we can
leverage auto-vectorization by C++ compilers and the fact that Go lets you specify the body of a
function in assembly to benefit from SIMD.

In here we have implementations using SIMD intrinsics for AVX (amd64) and NEON (arm64).

## Generating the Go assembly

c2goasm and asm2plan9s are two projects which can be used in conjunction to generate
compatible Go assembly from C assembly.

First the tools need to be installed:

```bash
go install github.com/klauspost/asmfmt/cmd/asmfmt@latest
go install github.com/minio/asm2plan9s@latest
go install github.com/minio/c2goasm@latest
```

### Generating for amd64

The Makefile in the directory above will work for amd64. `make assembly` will compile
the C sources and then call `c2goasm` to generate the Go assembly for amd64
architectures.

### Generating for arm64

Unfortunately there are some caveats for arm64.
c2goasm / asm2plan9s don't fully
support arm64. However, proper assembly can be created with some slight
manipulation of the result.

The Makefile has the NEON flags for compiling the assembly by using
`make _lib/bit_packing_neon.s` and `make _lib/unpack_bool_neon.s` to generate the
raw assembly sources.

Before calling `c2goasm` there are a few things that need to be modified in the assembly:

* x86-64 assembly uses `#` for comments while arm64 assembly uses `//` for comments.
  `c2goasm` assumes `#` for comments and splits lines based on them. For most lines
  this isn't an issue, but for any constants this is important and will need to have
  the comment character converted from `//` to `#`.
* A `word` for x86-64 is 16 bits, a `double` word is 32 bits, and a `quad` is 64 bits.
  For arm64, a `word` is 32 bits. This means that constants in the assembly need to be
  modified. `c2goasm` and `asm2plan9s` expect the x86-64 meaning for the sizes, so
  usage of `.word ######` needs to be converted to `.long #####` before running
  `c2goasm`. In addition, `.xword` is an 8-byte value and as such should be changed to
  `.quad` before running `c2goasm`.
* Because of this change in bits, `MOVQ` instructions will also be converted to
  `MOVD` instructions.

After running `c2goasm` there will still need to be modifications made to the
resulting assembly.

* Most of the ARM instructions will be converted to using the Go assembly construction
  of `WORD $0x########` to provide an instruction directly to the processor rather than
  going through the Go assembler. Some of the instructions, however, aren't recognized
  by `c2goasm` and will need to be added. If you look at the assembly, you'll see these
  as assembly that is commented out without any `WORD` instruction. For example:
  ```asm
  // stp x29, x30, [sp, #-48]!
  WORD $0x11007c48 // add w8, w2, #31
  ```
  The `stp` instruction needs to be added.
  This can be done in one of two ways:
  1. Many instructions are handled correctly by the Go assembler. You can
     find the arm-specific caveats to Go's assembly [here](https://pkg.go.dev/cmd/internal/obj/arm64).
     In this case, the instruction would be `STP.W (R29, R30), -48(RSP)`.
  2. Assuming that the GNU assembler is installed, you can use it to generate the
     correct byte sequence. Create a file named `neon.asm` with a single line
     (the instruction) and call `as -o neon.o neon.asm`. Then you can run
     `objdump -S neon.o` to get the value to use. The output should look something
     like:
     ```
     Disassembly of section .text:

     0000000000000000 <.text>:
        0: 11 00 7c 48   add w8, w2, #31
     ```
     And then update the assembly as `WORD $0x11007c48 // add w8, w2, #31`.
* Labels used in instructions won't work when using the `WORD $0x#########` syntax.
  They need to be the actual instructions for the labels. So all lines that have a
  label will need to be converted. This is two-fold:
  1. Any lines for branching such as those which end with `// b.le LBB0_10` are updated
     to be `BLE LBB0_10`. The same is true for `b.gt`, `b.ge`, `b.ne`, and `b.eq`. `b`
     instructions are instead converted to `JMP` calls.
  2. References to constants need to be updated, for example `LCPI0_192`. By default,
     these will get converted to global data instructions like
     `DATA LCDATA1<>+0xc68(SB)/8, $0x0000000000000000`. Unfortunately, these seem to
     have issues with being referenced by the assembler. The pattern to look for in
     the assembly is an `adrp x9, .LCPI0_192` instruction that is later followed by
     an instruction that looks like `str d4, [x9, 0:lo12:.LCPI0_192]`. These will
     need to be converted to a macro and a `VMOV` instruction.
     * In the original assembly, you'll see blocks like:
       ```asm
       .LCPI0_0
           .word 1 // 0x00000001
           .word 2 // 0x00000002
       .LCPI0_1
           .word 4294967265 // 0xffffffe1
           .word 4294967266 // 0xffffffe2
       ```
       which were converted to the `DATA LCDATA1`.... lines. Instead they should get
       converted to a macro and a vector instruction:
       ```asm
       #define LCPI0_0 $0x0000000200000001
       #define LCPI0_1 $0xffffffe2ffffffe1
       ```
       Notice the lower/higher bits!
       Then replace the `str`/`ldr`/`mov` instruction with `VMOVD LCPI0_0, V4`. Because
       the original instruction stores the value in `d4`, we use `VMOVD` and `V4`.
       Alternatively we might find a prefix of `q` instead of `d`, in which case we
       need to use `VMOVQ` and pass the lower bytes followed by the higher bytes.
       ```asm
       #define LCPI0_48L $0x0000000d00000008
       #define LCPI0_48H $0x0000001700000012
       ...
       VMOVQ LCPI0_48L, LCPI0_48H, V4
       ```
       After replacing the instructions, both the `adrp` and the `str`/`ldr`/`mov`
       instructions should be removed/commented out.
       There might also be a `LEAQ LCDATA1<>(SB), BP` instruction at the top of the
       function. That should be removed/commented out as we are replacing the constants
       with macros.
* Finally, if the function has a return value, make sure that the function ends
  with something akin to `MOVD R0, num+32(FP)`, where `num` is the
  local variable name of the return value, and `32` is the byte size of the arguments.

To facilitate some automation, a `script.sed` file is provided in this directory which
can be run against the generated assembly from `c2goasm` as
`sed -f _lib/script.sed -i bit_packing_neon_arm64.s`. It will perform several of
these steps on the generated assembly, such as converting `b.le`/etc calls with labels
to proper `BLE LBB0_....` lines, and converting `adrp`/`ldr` pairs to `VMOVD` and
`VMOVQ` instructions.

This should be sufficient to ensure that the assembly is generated and works properly!
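
The directive and comment rewrites that are needed before running `c2goasm` on the raw arm64 assembly can also be scripted. The sketch below is one possible approach and is an assumption rather than something shipped in this directory: `fixDirective` is a hypothetical name, and it only handles lines whose first token is `.word` or `.xword`, leaving every other line untouched.

```go
package main

import (
	"fmt"
	"strings"
)

// fixDirective rewrites one line of raw arm64 assembly so that constants use
// the x86-64 size names and comment character that c2goasm expects:
//   .word  -> .long  (32-bit value)
//   .xword -> .quad  (64-bit value)
//   //     -> #      (only on these constant lines)
// Hypothetical helper for illustration; other lines are returned unchanged.
func fixDirective(line string) string {
	trimmed := strings.TrimSpace(line)
	var from, to string
	switch {
	case strings.HasPrefix(trimmed, ".word"):
		from, to = ".word", ".long"
	case strings.HasPrefix(trimmed, ".xword"):
		from, to = ".xword", ".quad"
	default:
		return line
	}
	line = strings.Replace(line, from, to, 1)
	return strings.Replace(line, "//", "#", 1)
}

func main() {
	raw := []string{
		"\t.word\t1 // 0x00000001",
		"\t.xword\t8 // 0x8",
		"\tadd w8, w2, #31",
	}
	for _, l := range raw {
		fmt.Println(fixDirective(l))
	}
}
```

Restricting the comment rewrite to constant lines follows the note above that the comment character only matters for constants.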