Copyright 1999-2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.





                      X86 MPN SUBROUTINES


This directory contains mpn functions for various 80x86 chips.


CODE ORGANIZATION

        x86               i386, generic
        x86/i486          i486
        x86/pentium       Intel Pentium (P5, P54)
        x86/pentium/mmx   Intel Pentium with MMX (P55)
        x86/p6            Intel Pentium Pro
        x86/p6/mmx        Intel Pentium II, III
        x86/p6/p3mmx      Intel Pentium III
        x86/k6            \ AMD K6
        x86/k6/mmx        /
        x86/k6/k62mmx     AMD K6-2
        x86/k7            \ AMD Athlon
        x86/k7/mmx        /
        x86/pentium4      \
        x86/pentium4/mmx  | Intel Pentium 4
        x86/pentium4/sse2 /


The top-level x86 directory contains blended style code, meant to be
reasonable on all x86s.



STATUS

The code is well-optimized for AMD and Intel chips, but there's nothing
specific for Cyrix chips, nor for actual 80386 and 80486 chips.



ASM FILES

The x86 .asm files are BSD style assembler code, first put through m4 for
macro processing.  The generic mpn/asm-defs.m4 is used, together with
mpn/x86/x86-defs.m4.  See comments in those files.

The code is meant for use with GNU "gas" or a system "as".  There's no
support for assemblers that demand Intel style code.



STACK FRAME

m4 macros are used to define the parameters passed on the stack, and these
act like comments on what the stack frame looks like too.  For example,
mpn_mul_1() has the following.

        defframe(PARAM_MULTIPLIER, 16)
        defframe(PARAM_SIZE,       12)
        defframe(PARAM_SRC,         8)
        defframe(PARAM_DST,         4)

PARAM_MULTIPLIER becomes `FRAME+16(%esp)', and the others similarly.  The
return address is at offset 0, but there's not normally any need to access
that.

FRAME is redefined as necessary through the code so it's the number of bytes
pushed on the stack, and hence the offsets in the parameter macros stay
correct.  At the start of a routine FRAME should be zero.

        deflit(`FRAME',0)
                ...
        deflit(`FRAME',4)
                ...
        deflit(`FRAME',8)
                ...

Helper macros FRAME_pushl(), FRAME_popl(), FRAME_addl_esp() and
FRAME_subl_esp() exist to adjust FRAME for the effect of those instructions,
and can be used instead of explicit definitions if preferred.
defframe_pushl() is a combination of FRAME_pushl() and defframe().

There's generally some slackness in redefining FRAME.  If new values aren't
going to get used then the redefinitions are omitted to keep from cluttering
up the code.  This happens for instance at the end of a routine, where there
might be just four pops and then a ret, so FRAME isn't getting used.

Local variables and saved registers can be similarly defined, with negative
offsets representing stack space below the initial stack pointer.
For example,

        defframe(SAVE_ESI,   -4)
        defframe(SAVE_EDI,   -8)
        defframe(VAR_COUNTER,-12)

        deflit(STACK_SPACE, 12)

Here STACK_SPACE gets used in a "subl $STACK_SPACE, %esp" to allocate the
space, and that instruction must be followed by a redefinition of FRAME
(setting it equal to STACK_SPACE) to reflect the change in %esp.

Definitions for pushed registers are only put in when they're going to be
used.  If registers are just saved and restored with pushes and pops then
definitions aren't made.



ASSEMBLER EXPRESSIONS

Only addition and subtraction seem to be universally available, certainly
that's all the Solaris 8 "as" seems to accept.  If expressions are wanted
then m4 eval() should be used.

In particular note that a "/" anywhere in a line starts a comment in Solaris
"as", and in some configurations of gas too.

        addl    $32/2, %eax           <-- wrong

        addl    $eval(32/2), %eax     <-- right

Binutils gas/config/tc-i386.c has a choice between "/" being a comment
anywhere in a line, or only at the start.  FreeBSD patches 2.9.1 to select
the latter, and from 2.9.5 it's the default for GNU/Linux too.



ASSEMBLER COMMENTS

Solaris "as" doesn't support "#" commenting, using /* */ instead.  For that
reason "C" commenting is used (see asm-defs.m4) and the intermediate ".s"
files have no comments.

Any comments before include(`../config.m4') must use m4 "dnl", since it's
only after the include that "C" is available.  By convention "dnl" is also
used for comments about m4 macros.



TEMPORARY LABELS

Temporary numbered labels like "1:" used as "1f" or "1b" are available in
"gas" and Solaris "as", but not in SCO "as".  Normal L() labels should be
used instead, possibly with a counter to make them unique, see jadcl0() in
x86-defs.m4 for instance.
A separate counter for each macro makes it possible to nest them, for
instance movl_text_address() can be used within an ASSERT().

"1:" etc must be avoided in gcc __asm__ blocks too.  "%=" for generating a
unique number looks like a good alternative, but is that actually a
documented feature?  In any case this problem doesn't currently arise.



ZERO DISPLACEMENTS

In a couple of places addressing modes like 0(%ebx) with a byte-sized zero
displacement are wanted, rather than (%ebx) with no displacement.  These are
either for computed jumps or to get desirable code alignment.  Explicit
.byte sequences are used to ensure the assembler doesn't turn 0(%ebx) into
(%ebx).  The Zdisp() macro in x86-defs.m4 is used for this.

Current gas 2.9.5 or recent 2.9.1 leave 0(%ebx) as written, but old gas
1.92.3 changes it.  In general changing would be the sort of "optimization"
an assembler might perform, hence explicit ".byte"s are used where
necessary.



SHLD/SHRD INSTRUCTIONS

The %cl count forms of double shift instructions like "shldl %cl,%eax,%ebx"
must be written "shldl %eax,%ebx" for some assemblers.  gas takes either,
Solaris "as" doesn't allow %cl, gcc generates %cl for gas and NeXT (which is
gas), and omits %cl elsewhere.

For GMP an autoconf test GMP_ASM_X86_SHLDL_CL is used to determine whether
%cl should be used, and the macros shldl, shrdl, shldw and shrdw in
mpn/x86/x86-defs.m4 pass through or omit %cl as necessary.  See the comments
with those macros for usage.



IMUL INSTRUCTION

GCC config/i386/i386.md (cvs rev 1.187, 21 Oct 00) under *mulsi3_1 notes
that the following two forms produce identical object code

        imul    $12, %eax
        imul    $12, %eax, %eax

but that the former isn't accepted by some assemblers, in particular the SCO
OSR5 COFF assembler.  GMP follows GCC and uses only the latter form.

(This applies only to immediate operands; the three operand form is only
valid with an immediate.)



DIRECTION FLAG

The x86 calling conventions say that the direction flag should be clear at
function entry and exit.  (See iBCS2 and SVR4 ABI books, references below.)
Although this has been so since the year dot, it's not absolutely clear
whether it's universally respected.  Since it's better to be safe than
sorry, GMP follows glibc and does a "cld" if it depends on the direction
flag being clear.  This happens only in a few places.



POSITION INDEPENDENT CODE

  Coding Style

    Defining the symbol PIC in m4 processing selects SVR4 / ELF style
    position independent code.  This is necessary for shared libraries
    because they can be mapped into different processes at different virtual
    addresses.  Actually, relocations are allowed but text pages with
    relocations aren't shared, defeating the purpose of a shared library.

    The GOT is used to access global data, and the PLT is used for
    functions.  The use of the PLT adds a fixed cost to every function call,
    and the GOT adds a cost to any function accessing global variables.
    These are small but might be noticeable when working with small
    operands.

  Scope

    It's intended, as a matter of policy, that references within libgmp are
    resolved within libgmp.  Certainly there's no need for an application to
    replace any internals, and we take the view that there's no value in an
    application subverting anything documented either.

    Resolving references within libgmp in theory means calls can be made with a
    plain PC-relative call instruction, which is faster and smaller than going
    through the PLT, and data references can be similarly PC-relative, saving a
    GOT entry and fetch from there.  Unfortunately the normal linker behaviour
    doesn't allow us to do this.
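
    As a rough illustration only (a sketch, not code taken from libgmp; foo
    and var are hypothetical symbols, and %ebx is assumed to already hold
    the GOT address), the styles of reference being compared look like this
    in gas syntax,

            call    foo@PLT               <-- call through the PLT
            call    foo                   <-- plain PC-relative call

            movl    var@GOT(%ebx), %eax   <-- fetch address of var from GOT
            movl    (%eax), %eax          <-- then load var itself

    On ELF systems the first call is normally assembled with an R_386_PLT32
    relocation and the second with an R_386_PC32 relocation.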

    By default an R_386_PC32 PC-relative reference, either for a call or for
    data, is left in libgmp.so by the linker so that it can be resolved at
    runtime to a location in the application or another shared library.  This
    means a text segment relocation which we don't want.

  -Bsymbolic

    Under the "-Bsymbolic" option, the linker resolves references to symbols
    within libgmp.so.  This gives us the desired effect for R_386_PC32,
    ie. it's resolved at link time.  It also resolves R_386_PLT32 calls
    directly to their target without creating a PLT entry (though if this is
    done to normal compiler-generated code it still leaves a setup of %ebx
    to _GLOBAL_OFFSET_TABLE_ which may then be unnecessary).

    Unfortunately -Bsymbolic does bad things to global variables defined in
    a shared library but accessed by non-PIC code from the mainline (or a
    static library).

    The problem is that the mainline needs a fixed data address to avoid
    text segment relocations, so space is allocated in its data segment and
    the value from the variable is copied from the shared library's data
    segment when the library is loaded.  Under -Bsymbolic, however,
    references in the shared library are then resolved still to the shared
    library data area.  Not surprisingly it bombs badly to have mainline
    code and library code accessing different locations for what should be
    one variable.

    Note that this -Bsymbolic effect for the shared library is not just for
    R_386_PC32 offsets which might have been cooked up in assembler, but is
    done also for the contents of GOT entries.  -Bsymbolic simply applies a
    general rule that symbols are resolved first from the local module.

  Visibility Attributes

    GCC __attribute__ ((visibility ("protected"))), which is available in
    recent versions, eg. 3.3, is probably what we'd like to use.
    It makes gcc generate plain PC-relative calls to indicated functions,
    and directs the linker to resolve references to the given function
    within the link module.

    Unfortunately, as of debian binutils 2.13.90.0.16 at least, the
    resulting libgmp.so comes out with text segment relocations, references
    are not resolved at link time.  If the gcc description is to be believed
    this is not how it should work.  If a symbol cannot be overridden
    by another module then surely references within that module can be
    resolved immediately (ie. at link time).

  Present

    In any case, all this means that we have no optimizations we can
    usefully make to function or variable usages, neither for assembler nor
    C code.  Perhaps in the future the visibility attribute will work as
    we'd like.




GLOBAL OFFSET TABLE

The magic _GLOBAL_OFFSET_TABLE_ used by code establishing the address of the
GOT sometimes requires an extra underscore prefix.  SVR4 systems and NetBSD
don't need a prefix, OpenBSD does need one.  Note that NetBSD and OpenBSD
are both a.out underscore systems, so the prefix for _GLOBAL_OFFSET_TABLE_
is not simply the same as the prefix for ordinary globals.

In any case in the asm code we write _GLOBAL_OFFSET_TABLE_ and let a macro
in x86-defs.m4 add an extra underscore if required (according to a configure
test).

Old gas 1.92.3 which comes with FreeBSD 2.2.8 gets a segmentation fault when
asked to assemble the following,

        L1:
                addl    $_GLOBAL_OFFSET_TABLE_+[.-L1], %ebx

It seems that using the label in the same instruction it refers to is the
problem, since a nop in between works.
But the simplest workaround is to follow gcc and omit the +[.-L1] since it
does nothing,

                addl    $_GLOBAL_OFFSET_TABLE_, %ebx

Current gas 2.10 generates incorrect object code when %eax is used in such a
construction (with or without +[.-L1]),

                addl    $_GLOBAL_OFFSET_TABLE_, %eax

The R_386_GOTPC gets a displacement of 2 rather than the 1 appropriate for
the 1 byte opcode of "addl $n,%eax".  The best workaround is just to use any
other register, since then it's a two byte opcode+mod/rm.  GCC for example
always uses %ebx (which is needed for calls through the PLT).

A similar problem occurs in an leal (again with or without a +[.-L1]),

                leal    _GLOBAL_OFFSET_TABLE_(%edi), %ebx

This time the R_386_GOTPC gets a displacement of 0 rather than the 2
appropriate for the opcode and mod/rm, making this form unusable.




SIMPLE LOOPS

The overheads in setting up for an unrolled loop can mean that at small
sizes a simple loop is faster.  Making small sizes go fast is important,
even if it adds a cycle or two to bigger sizes.  To this end various
routines choose between a simple loop and an unrolled loop according to
operand size.  The path to the simple loop, or to special case code for
small sizes, is always as fast as possible.

Adding a simple loop requires a conditional jump to choose between the
simple and unrolled code.  The size of a branch misprediction penalty
affects whether a simple loop is worthwhile.

The convention is for an m4 definition UNROLL_THRESHOLD to set the crossover
point, with sizes < UNROLL_THRESHOLD using the simple loop, sizes >=
UNROLL_THRESHOLD using the unrolled loop.  If position independent code adds
a couple of cycles to an unrolled loop setup, the threshold will vary with
PIC or non-PIC.  Something like the following is typical.

        deflit(UNROLL_THRESHOLD, ifdef(`PIC',10,8))

There's no automated way to determine the threshold.  Setting it to a small
value and then to a big value makes it possible to measure the simple and
unrolled loops each over a range of sizes, from which the crossover point
can be determined.  Alternatively, just adjust the threshold up or down
until there are no more speedups.



UNROLLED LOOP CODING

The x86 addressing modes allow a byte displacement of -128 to +127, making
it possible to access 256 bytes, which is 64 limbs, without adjusting
pointer registers within the loop.  Dword sized displacements can be used
too, but they increase code size, and unrolling to 64 ought to be enough.

When unrolling to the full 64 limbs/loop, the limb at the top of the loop
will have a displacement of -128, so pointers have to have a corresponding
+128 added before entering the loop.  When unrolling to 32 limbs/loop
displacements 0 to 127 can be used with 0 at the top of the loop and no
adjustment needed to the pointers.

Where 64 limbs/loop is supported, the +128 adjustment is done only when 64
limbs/loop is selected.  Usually the gain in speed using 64 instead of 32 or
16 is small, so support for 64 limbs/loop is generally only for comparison.



COMPUTED JUMPS

When working from least significant limb to most significant limb (most
routines) the computed jump and pointer calculations in preparation for an
unrolled loop are as follows.

        S = operand size in limbs
        N = number of limbs per loop (UNROLL_COUNT)
        L = log2 of unrolling (UNROLL_LOG2)
        M = mask for unrolling (UNROLL_MASK)
        C = code bytes per limb in the loop
        B = bytes per limb (4 for x86)

        computed jump            (-S & M) * C + entrypoint
        subtract from pointers   (-S & M) * B
        initial loop counter     (S-1) >> L
        displacements            0 to B*(N-1)

The loop counter is decremented at the end of each loop, and the looping
stops when the decrement takes the counter to -1.  The displacements are for
the addressing modes accessing each limb, eg. a load with
"movl disp(%ebx), %eax".

Usually the multiply by "C" can be handled without an imul, using instead an
leal, or a shift and subtract.

When working from most significant to least significant limb (eg. mpn_lshift
and mpn_copyd), the calculations change as follows.

        add to pointers          (-S & M) * B
        displacements            0 to -B*(N-1)



OLD GAS 1.92.3

This version comes with FreeBSD 2.2.8 and has a couple of gremlins that
affect GMP code.

Firstly, an expression involving two forward references to labels comes out
as zero.  For example,

                addl    $bar-foo, %eax
        foo:
                nop
        bar:

This should lead to "addl $1, %eax", but it comes out as "addl $0, %eax".
When only one forward reference is involved, it works correctly, as for
example,

        foo:
                addl    $bar-foo, %eax
                nop
        bar:

Secondly, an expression involving two labels can't be used as the
displacement for an leal.  For example,

        foo:
                nop
        bar:
                leal    bar-foo(%eax,%ebx,8), %ecx

A slightly cryptic error is given, "Unimplemented segment type 0 in
parse_operand".  When only one label is used it's ok, and the label can be a
forward reference too, as for example,

                leal    foo(%eax,%ebx,8), %ecx
                nop
        foo:

These problems only affect PIC computed jump calculations.
The workarounds are just to do an leal without a displacement and then an
addl, and to make sure the code is placed so that there's at most one
forward reference in the addl.



REFERENCES

"Intel Architecture Software Developer's Manual", volumes 1, 2a, 2b, 3a, 3b,
2006, order numbers 253665 through 253669.  Available on-line,

        ftp://download.intel.com/design/Pentium4/manuals/25366518.pdf
        ftp://download.intel.com/design/Pentium4/manuals/25366618.pdf
        ftp://download.intel.com/design/Pentium4/manuals/25366718.pdf
        ftp://download.intel.com/design/Pentium4/manuals/25366818.pdf
        ftp://download.intel.com/design/Pentium4/manuals/25366918.pdf


"System V Application Binary Interface", Unix System Laboratories Inc, 1992,
published by Prentice Hall, ISBN 0-13-880410-9.  And the "Intel386 Processor
Supplement", AT&T, 1991, ISBN 0-13-877689-X.  These have details of calling
conventions and ELF shared library PIC coding.  Versions of both are
available on-line,

        http://www.sco.com/developer/devspecs

"Intel386 Family Binary Compatibility Specification 2", Intel Corporation,
published by McGraw-Hill, 1991, ISBN 0-07-031219-2.  (Same as the above 386
ABI supplement.)



----------------
Local variables:
mode: text
fill-column: 76
End: