     1  Copyright 1999-2002 Free Software Foundation, Inc.
     2  
     3  This file is part of the GNU MP Library.
     4  
     5  The GNU MP Library is free software; you can redistribute it and/or modify
     6  it under the terms of either:
     7  
     8    * the GNU Lesser General Public License as published by the Free
     9      Software Foundation; either version 3 of the License, or (at your
    10      option) any later version.
    11  
    12  or
    13  
    14    * the GNU General Public License as published by the Free Software
    15      Foundation; either version 2 of the License, or (at your option) any
    16      later version.
    17  
    18  or both in parallel, as here.
    19  
    20  The GNU MP Library is distributed in the hope that it will be useful, but
    21  WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
    22  or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
    23  for more details.
    24  
    25  You should have received copies of the GNU General Public License and the
    26  GNU Lesser General Public License along with the GNU MP Library.  If not,
    27  see https://www.gnu.org/licenses/.
    28  
    29  
    30  
    31  
    32  
    33                        X86 MPN SUBROUTINES
    34  
    35  
    36  This directory contains mpn functions for various 80x86 chips.
    37  
    38  
    39  CODE ORGANIZATION
    40  
    41  	x86               i386, generic
    42  	x86/i486          i486
    43  	x86/pentium       Intel Pentium (P5, P54)
    44  	x86/pentium/mmx   Intel Pentium with MMX (P55)
    45  	x86/p6            Intel Pentium Pro
    46  	x86/p6/mmx        Intel Pentium II, III
    47  	x86/p6/p3mmx      Intel Pentium III
    48  	x86/k6            \ AMD K6
    49  	x86/k6/mmx        /
    50  	x86/k6/k62mmx     AMD K6-2
    51  	x86/k7            \ AMD Athlon
    52  	x86/k7/mmx        /
    53  	x86/pentium4      \
    54  	x86/pentium4/mmx  | Intel Pentium 4
    55  	x86/pentium4/sse2 /
    56  
    57  
    58  The top-level x86 directory contains blended style code, meant to be
    59  reasonable on all x86s.
    60  
    61  
    62  
    63  STATUS
    64  
    65  The code is well-optimized for AMD and Intel chips, but there's nothing
    66  specific for Cyrix chips, nor for actual 80386 and 80486 chips.
    67  
    68  
    69  
    70  ASM FILES
    71  
    72  The x86 .asm files are BSD style assembler code, first put through m4 for
    73  macro processing.  The generic mpn/asm-defs.m4 is used, together with
    74  mpn/x86/x86-defs.m4.  See comments in those files.
    75  
    76  The code is meant for use with GNU "gas" or a system "as".  There's no
    77  support for assemblers that demand Intel style code.
    78  
    79  
    80  
    81  STACK FRAME
    82  
    83  m4 macros are used to define the parameters passed on the stack, and these
    84  act like comments on what the stack frame looks like too.  For example,
    85  mpn_mul_1() has the following.
    86  
    87          defframe(PARAM_MULTIPLIER, 16)
    88          defframe(PARAM_SIZE,       12)
    89          defframe(PARAM_SRC,         8)
    90          defframe(PARAM_DST,         4)
    91  
    92  PARAM_MULTIPLIER becomes `FRAME+16(%esp)', and the others similarly.  The
    93  return address is at offset 0, but there's not normally any need to access
    94  that.
    95  
    96  FRAME is redefined as necessary through the code so it's the number of bytes
    97  pushed on the stack, and hence the offsets in the parameter macros stay
    98  correct.  At the start of a routine FRAME should be zero.
    99  
   100          deflit(`FRAME',0)
   101  	...
   102  	deflit(`FRAME',4)
   103  	...
   104  	deflit(`FRAME',8)
   105  	...
   106  
   107  Helper macros FRAME_pushl(), FRAME_popl(), FRAME_addl_esp() and
   108  FRAME_subl_esp() exist to adjust FRAME for the effect of those instructions,
   109  and can be used instead of explicit definitions if preferred.
   110  defframe_pushl() is a combination of FRAME_pushl() and defframe().
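
As a rough model of the bookkeeping described above, FRAME can be thought
of as a running count of bytes pushed, which is added to each fixed
parameter offset.  Python is used purely for illustration; the real work is
done by the m4 macros in x86-defs.m4, and the class and method names below
are hypothetical.

```python
class FrameModel:
    """Toy model of FRAME tracking; not the actual m4 implementation."""

    def __init__(self):
        self.frame = 0      # bytes pushed since function entry
        self.offsets = {}   # name -> offset relative to %esp at entry

    def defframe(self, name, offset):
        self.offsets[name] = offset

    def pushl(self):
        self.frame += 4     # models the effect of FRAME_pushl()

    def popl(self):
        self.frame -= 4     # models the effect of FRAME_popl()

    def operand(self, name):
        # Same idea as `FRAME+16(%esp)', with FRAME folded into the number.
        return "%d(%%esp)" % (self.frame + self.offsets[name])


f = FrameModel()
f.defframe("PARAM_MULTIPLIER", 16)
f.defframe("PARAM_DST", 4)
at_entry = f.operand("PARAM_MULTIPLIER")          # FRAME is 0 here
f.pushl()
f.pushl()
after_two_pushes = f.operand("PARAM_MULTIPLIER")  # FRAME is 8 here
```

The parameter stays at the same memory location throughout; only its
distance from the current %esp changes as registers are pushed.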
   111  
   112  There's generally some slackness in redefining FRAME.  If new values aren't
   113  going to get used then the redefinitions are omitted to keep from cluttering
   114  up the code.  This happens for instance at the end of a routine, where there
   115  might be just four pops and then a ret, so FRAME isn't getting used.
   116  
   117  Local variables and saved registers can be similarly defined, with negative
   118  offsets representing stack space below the initial stack pointer.  For
   119  example,
   120  
   121  	defframe(SAVE_ESI,   -4)
   122  	defframe(SAVE_EDI,   -8)
   123  	defframe(VAR_COUNTER,-12)
   124  
   125  	deflit(STACK_SPACE, 12)
   126  
   127  Here STACK_SPACE gets used in a "subl $STACK_SPACE, %esp" to allocate the
   128  space, and that instruction must be followed by a redefinition of FRAME
   129  (setting it equal to STACK_SPACE) to reflect the change in %esp.
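
The arithmetic for the example above can be checked with a small sketch.
This is only a worked illustration in Python, not part of the build; the
helper name esp_offset is hypothetical, and PARAM_DST is borrowed from the
mpn_mul_1() example earlier.

```python
STACK_SPACE = 12
FRAME = STACK_SPACE   # value after "subl $STACK_SPACE, %esp"


def esp_offset(entry_offset, frame=FRAME):
    """Offset from the current %esp for a defframe() entry offset."""
    return frame + entry_offset


local_offsets = {
    name: esp_offset(off)
    for name, off in [("SAVE_ESI", -4), ("SAVE_EDI", -8), ("VAR_COUNTER", -12)]
}
param_dst = esp_offset(4)   # a parameter defined with defframe(PARAM_DST, 4)
```

With FRAME equal to STACK_SPACE, every local lands at a non-negative
offset from %esp, ie. within the space just allocated.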
   130  
   131  Definitions for pushed registers are only put in when they're going to be
   132  used.  If registers are just saved and restored with pushes and pops then
   133  definitions aren't made.
   134  
   135  
   136  
   137  ASSEMBLER EXPRESSIONS
   138  
   139  Only addition and subtraction seem to be universally available, certainly
   140  that's all the Solaris 8 "as" seems to accept.  If expressions are wanted
   141  then m4 eval() should be used.
   142  
   143  In particular note that a "/" anywhere in a line starts a comment in Solaris
   144  "as", and in some configurations of gas too.
   145  
   146  	addl	$32/2, %eax           <-- wrong
   147  
   148  	addl	$eval(32/2), %eax     <-- right
   149  
   150  Binutils gas/config/tc-i386.c has a choice between "/" being a comment
   151  anywhere in a line, or only at the start.  FreeBSD patches 2.9.1 to select
   152  the latter, and from 2.9.5 it's the default for GNU/Linux too.
   153  
   154  
   155  
   156  ASSEMBLER COMMENTS
   157  
   158  Solaris "as" doesn't support "#" commenting, using /* */ instead.  For that
   159  reason "C" commenting is used (see asm-defs.m4) and the intermediate ".s"
   160  files have no comments.
   161  
   162  Any comments before include(`../config.m4') must use m4 "dnl", since it's
   163  only after the include that "C" is available.  By convention "dnl" is also
   164  used for comments about m4 macros.
   165  
   166  
   167  
   168  TEMPORARY LABELS
   169  
   170  Temporary numbered labels like "1:" used as "1f" or "1b" are available in
   171  "gas" and Solaris "as", but not in SCO "as".  Normal L() labels should be
   172  used instead, possibly with a counter to make them unique, see jadcl0() in
   173  x86-defs.m4 for instance.  A separate counter for each macro makes it
   174  possible to nest them, for instance movl_text_address() can be used within
   175  an ASSERT().
   176  
   177  "1:" etc must be avoided in gcc __asm__ blocks too.  "%=" for generating a
   178  unique number looks like a good alternative, but is that actually a
   179  documented feature?  In any case this problem doesn't currently arise.
   180  
   181  
   182  
   183  ZERO DISPLACEMENTS
   184  
   185  In a couple of places addressing modes like 0(%ebx) with a byte-sized zero
   186  displacement are wanted, rather than (%ebx) with no displacement.  These are
   187  either for computed jumps or to get desirable code alignment.  Explicit
   188  .byte sequences are used to ensure the assembler doesn't turn 0(%ebx) into
   189  (%ebx).  The Zdisp() macro in x86-defs.m4 is used for this.
   190  
   191  Current gas 2.9.5 or recent 2.9.1 leave 0(%ebx) as written, but old gas
   192  1.92.3 changes it.  In general changing would be the sort of "optimization"
   193  an assembler might perform, hence explicit ".byte"s are used where
   194  necessary.
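
To make the difference concrete, the following sketch builds the standard
IA-32 encodings of a load with and without a byte-sized zero displacement.
It is an illustration only and does not show how Zdisp() itself is written;
the function name and the restricted register table are assumptions made
for the example.

```python
# Register numbers as used in the mod/rm byte.  %esp and %ebp are left out
# because as a base they need a SIB byte or have a special mod=00 meaning.
REG = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3, "esi": 6, "edi": 7}


def encode_movl_load(base, dest, force_disp8=False):
    """Bytes for "movl (base), dest", or "movl 0(base), dest" with disp8."""
    opcode = 0x8B                           # MOV r32, r/m32
    mod = 0b01 if force_disp8 else 0b00     # 01: 8-bit displacement follows
    modrm = (mod << 6) | (REG[dest] << 3) | REG[base]
    code = [opcode, modrm]
    if force_disp8:
        code.append(0x00)                   # the zero displacement byte
    return bytes(code)


plain = encode_movl_load("ebx", "eax")                    # movl (%ebx), %eax
zdisp = encode_movl_load("ebx", "eax", force_disp8=True)  # movl 0(%ebx), %eax
```

The two forms do the same thing but differ by one byte in length, which is
what matters for computed jumps and alignment.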
   195  
   196  
   197  
   198  SHLD/SHRD INSTRUCTIONS
   199  
   200  The %cl count forms of double shift instructions like "shldl %cl,%eax,%ebx"
   201  must be written "shldl %eax,%ebx" for some assemblers.  gas takes either,
   202  Solaris "as" doesn't allow %cl, gcc generates %cl for gas and NeXT (which is
   203  gas), and omits %cl elsewhere.
   204  
   205  For GMP an autoconf test GMP_ASM_X86_SHLDL_CL is used to determine whether
   206  %cl should be used, and the macros shldl, shrdl, shldw and shrdw in
   207  mpn/x86/x86-defs.m4 pass through or omit %cl as necessary.  See the comments
   208  with those macros for usage.
   209  
   210  
   211  
   212  IMUL INSTRUCTION
   213  
   214  GCC config/i386/i386.md (cvs rev 1.187, 21 Oct 00) under *mulsi3_1 notes
   215  that the following two forms produce identical object code
   216  
   217  	imul	$12, %eax
   218  	imul	$12, %eax, %eax
   219  
   220  but that the former isn't accepted by some assemblers, in particular the SCO
   221  OSR5 COFF assembler.  GMP follows GCC and uses only the latter form.
   222  
   223  (This applies only to immediate operands, the three operand form is only
   224  valid with an immediate.)
   225  
   226  
   227  
   228  DIRECTION FLAG
   229  
   230  The x86 calling conventions say that the direction flag should be clear at
   231  function entry and exit.  (See iBCS2 and SVR4 ABI books, references below.)
   232  Although this has been so since the year dot, it's not absolutely clear
   233  whether it's universally respected.  Since it's better to be safe than
   234  sorry, GMP follows glibc and does a "cld" if it depends on the direction
   235  flag being clear.  This happens only in a few places.
   236  
   237  
   238  
   239  POSITION INDEPENDENT CODE
   240  
   241    Coding Style
   242  
   243      Defining the symbol PIC in m4 processing selects SVR4 / ELF style
   244      position independent code.  This is necessary for shared libraries
   245      because they can be mapped into different processes at different virtual
   246      addresses.  Actually, relocations are allowed but text pages with
   247      relocations aren't shared, defeating the purpose of a shared library.
   248  
   249      The GOT is used to access global data, and the PLT is used for
   250      functions.  The use of the PLT adds a fixed cost to every function call,
   251      and the GOT adds a cost to any function accessing global variables.
   252      These are small but might be noticeable when working with small
   253      operands.
   254  
   255    Scope
   256  
   257      It's intended, as a matter of policy, that references within libgmp are
   258      resolved within libgmp.  Certainly there's no need for an application to
   259      replace any internals, and we take the view that there's no value in an
   260      application subverting anything documented either.
   261  
   262      Resolving references within libgmp in theory means calls can be made with a
   263      plain PC-relative call instruction, which is faster and smaller than going
   264      through the PLT, and data references can be similarly PC-relative, saving a
   265      GOT entry and fetch from there.  Unfortunately the normal linker behaviour
   266      doesn't allow us to do this.
   267  
   268      By default an R_386_PC32 PC-relative reference, either for a call or for
   269      data, is left in libgmp.so by the linker so that it can be resolved at
   270      runtime to a location in the application or another shared library.  This
   271      means a text segment relocation which we don't want.
   272  
   273    -Bsymbolic
   274  
   275      Under the "-Bsymbolic" option, the linker resolves references to symbols
   276      within libgmp.so.  This gives us the desired effect for R_386_PC32,
   277      ie. it's resolved at link time.  It also resolves R_386_PLT32 calls
   278      directly to their target without creating a PLT entry (though if this is
   279      done to normal compiler-generated code it still leaves a setup of %ebx
   280      to _GLOBAL_OFFSET_TABLE_ which may then be unnecessary).
   281  
   282      Unfortunately -Bsymbolic does bad things to global variables defined in
   283      a shared library but accessed by non-PIC code from the mainline (or a
   284      static library).
   285  
   286      The problem is that the mainline needs a fixed data address to avoid
   287      text segment relocations, so space is allocated in its data segment and
   288      the value from the variable is copied from the shared library's data
   289      segment when the library is loaded.  Under -Bsymbolic, however,
   290      references in the shared library are then resolved still to the shared
   291      library data area.  Not surprisingly it bombs badly to have mainline
   292      code and library code accessing different locations for what should be
   293      one variable.
   294  
   295      Note that this -Bsymbolic effect for the shared library is not just for
   296      R_386_PC32 offsets which might have been cooked up in assembler, but is
   297      done also for the contents of GOT entries.  -Bsymbolic simply applies a
   298      general rule that symbols are resolved first from the local module.
   299  
   300    Visibility Attributes
   301  
   302      GCC __attribute__ ((visibility ("protected"))), which is available in
   303      recent versions, eg. 3.3, is probably what we'd like to use.  It makes
   304      gcc generate plain PC-relative calls to indicated functions, and directs
   305      the linker to resolve references to the given function within the link
   306      module.
   307  
   308      Unfortunately, as of debian binutils 2.13.90.0.16 at least, the
   309      resulting libgmp.so comes out with text segment relocations, references
   310      are not resolved at link time.  If the gcc description is to be believed
   311      this is not how it should work.  If a symbol cannot be overridden
   312      by another module then surely references within that module can be
   313      resolved immediately (ie. at link time).
   314  
   315    Present
   316  
   317      In any case, all this means that we have no optimizations we can
   318      usefully make to function or variable usages, neither for assembler nor
   319      C code.  Perhaps in the future the visibility attribute will work as
   320      we'd like.
   321  
   322  
   323  
   324  
   325  GLOBAL OFFSET TABLE
   326  
   327  The magic _GLOBAL_OFFSET_TABLE_ used by code establishing the address of the
   328  GOT sometimes requires an extra underscore prefix.  SVR4 systems and NetBSD
   329  don't need a prefix, OpenBSD does need one.  Note that NetBSD and OpenBSD
   330  are both a.out underscore systems, so the prefix for _GLOBAL_OFFSET_TABLE_
   331  is not simply the same as the prefix for ordinary globals.
   332  
   333  In any case in the asm code we write _GLOBAL_OFFSET_TABLE_ and let a macro
   334  in x86-defs.m4 add an extra underscore if required (according to a configure
   335  test).
   336  
   337  Old gas 1.92.3 which comes with FreeBSD 2.2.8 gets a segmentation fault when
   338  asked to assemble the following,
   339  
   340          L1:
   341              addl  $_GLOBAL_OFFSET_TABLE_+[.-L1], %ebx
   342  
   343  It seems that using the label in the same instruction it refers to is the
   344  problem, since a nop in between works.  But the simplest workaround is to
   345  follow gcc and omit the +[.-L1] since it does nothing,
   346  
   347              addl  $_GLOBAL_OFFSET_TABLE_, %ebx
   348  
   349  Current gas 2.10 generates incorrect object code when %eax is used in such a
   350  construction (with or without +[.-L1]),
   351  
   352              addl  $_GLOBAL_OFFSET_TABLE_, %eax
   353  
   354  The R_386_GOTPC gets a displacement of 2 rather than the 1 appropriate for
   355  the 1 byte opcode of "addl $n,%eax".  The best workaround is just to use any
   356  other register, since then it's a two byte opcode+mod/rm.  GCC for example
   357  always uses %ebx (which is needed for calls through the PLT).
   358  
   359  A similar problem occurs in an leal (again with or without a +[.-L1]),
   360  
   361              leal  _GLOBAL_OFFSET_TABLE_(%edi), %ebx
   362  
   363  This time the R_386_GOTPC gets a displacement of 0 rather than the 2
   364  appropriate for the opcode and mod/rm, making this form unusable.
   365  
   366  
   367  
   368  
   369  SIMPLE LOOPS
   370  
   371  The overheads in setting up for an unrolled loop can mean that at small
   372  sizes a simple loop is faster.  Making small sizes go fast is important,
   373  even if it adds a cycle or two to bigger sizes.  To this end various
   374  routines choose between a simple loop and an unrolled loop according to
   375  operand size.  The path to the simple loop, or to special case code for
   376  small sizes, is always as fast as possible.
   377  
   378  Adding a simple loop requires a conditional jump to choose between the
   379  simple and unrolled code.  The size of a branch misprediction penalty
   380  affects whether a simple loop is worthwhile.
   381  
   382  The convention is for an m4 definition UNROLL_THRESHOLD to set the crossover
   383  point, with sizes < UNROLL_THRESHOLD using the simple loop, sizes >=
   384  UNROLL_THRESHOLD using the unrolled loop.  If position independent code adds
   385  a couple of cycles to an unrolled loop setup, the threshold will vary with
   386  PIC or non-PIC.  Something like the following is typical.
   387  
   388  	deflit(UNROLL_THRESHOLD, ifdef(`PIC',10,8))
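
The selection this threshold controls can be modelled as below.  This is a
sketch in Python with hypothetical function names; in the real code the
choice is a compare and conditional jump at the start of the routine, and
the values 10 and 8 are simply those of the example above.

```python
def unroll_threshold(pic):
    # Mirrors deflit(UNROLL_THRESHOLD, ifdef(`PIC',10,8))
    return 10 if pic else 8


def choose_loop(size, pic=False):
    """Sizes below the threshold use the simple loop, others the unrolled."""
    return "simple" if size < unroll_threshold(pic) else "unrolled"


choices = [(s, choose_loop(s), choose_loop(s, pic=True)) for s in (7, 8, 9, 10)]
```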
   389  
   390  There's no automated way to determine the threshold.  Setting it to a small
   391  value and then to a big value makes it possible to measure the simple and
   392  unrolled loops each over a range of sizes, from which the crossover point
   393  can be determined.  Alternately, just adjust the threshold up or down until
   394  there are no more speedups.
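
Once the two loops have been timed separately, picking the crossover is
simple bookkeeping, as in the sketch below.  The function name is
hypothetical, and the two cycle-count formulas are made-up placeholders
standing in for real measurements, not data from any chip.

```python
def find_threshold(simple, unrolled):
    """Smallest measured size from which the unrolled loop is never slower.

    simple and unrolled map operand size -> measured cycles.
    Returns None if the unrolled loop loses at the largest size.
    """
    threshold = None
    for size in sorted(simple, reverse=True):
        if unrolled[size] <= simple[size]:
            threshold = size
        else:
            break
    return threshold


# Placeholder cost models only: small setup and slow per limb, versus
# larger setup and faster per limb.
sizes = range(1, 33)
simple_cycles = {s: 10 + 5 * s for s in sizes}
unrolled_cycles = {s: 30 + 3 * s for s in sizes}
threshold = find_threshold(simple_cycles, unrolled_cycles)
```

The result is used as UNROLL_THRESHOLD, so sizes at or above it take the
unrolled loop.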
   395  
   396  
   397  
   398  UNROLLED LOOP CODING
   399  
   400  The x86 addressing modes allow a byte displacement of -128 to +127, making
   401  it possible to access 256 bytes, which is 64 limbs, without adjusting
   402  pointer registers within the loop.  Dword sized displacements can be used
   403  too, but they increase code size, and unrolling to 64 ought to be enough.
   404  
   405  When unrolling to the full 64 limbs/loop, the limb at the top of the loop
   406  will have a displacement of -128, so pointers have to have a corresponding
   407  +128 added before entering the loop.  When unrolling to 32 limbs/loop
   408  displacements 0 to 127 can be used with 0 at the top of the loop and no
   409  adjustment needed to the pointers.
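
The displacement ranges can be checked with a few lines of arithmetic.
This Python sketch is only a worked example; the function names are
hypothetical.

```python
B = 4   # bytes per limb on x86


def displacements(limbs_per_loop, bias=0):
    """Byte displacement of each limb in one pass, after the pointer has
    had `bias` bytes added to it before entering the loop."""
    return [i * B - bias for i in range(limbs_per_loop)]


def fits_disp8(d):
    return -128 <= d <= 127


d64 = displacements(64, bias=128)   # pointers pre-adjusted by +128
d32 = displacements(32)             # no adjustment
d64_unbiased = displacements(64)    # what happens without the +128
```

With the +128 adjustment all 64 displacements fit in a byte; without it
the upper half of the loop would need dword displacements.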
   410  
   411  Where 64 limbs/loop is supported, the +128 adjustment is done only when 64
   412  limbs/loop is selected.  Usually the gain in speed using 64 instead of 32 or
   413  16 is small, so support for 64 limbs/loop is generally only for comparison.
   414  
   415  
   416  
   417  COMPUTED JUMPS
   418  
   419  When working from least significant limb to most significant limb (most
   420  routines) the computed jump and pointer calculations in preparation for an
   421  unrolled loop are as follows.
   422  
   423  	S = operand size in limbs
   424  	N = number of limbs per loop (UNROLL_COUNT)
   425  	L = log2 of unrolling (UNROLL_LOG2)
   426  	M = mask for unrolling (UNROLL_MASK)
   427  	C = code bytes per limb in the loop
   428  	B = bytes per limb (4 for x86)
   429  
   430  	computed jump            (-S & M) * C + entrypoint
   431  	subtract from pointers   (-S & M) * B
   432  	initial loop counter     (S-1) >> L
   433  	displacements            0 to B*(N-1)
   434  
   435  The loop counter is decremented at the end of each loop, and the looping
   436  stops when the decrement takes the counter to -1.  The displacements are
   437  for the addressing modes accessing each limb, eg. a load with
   437  "movl disp(%ebx), %eax".
   438  
   439  Usually the multiply by "C" can be handled without an imul, using instead an
   440  leal, or a shift and subtract.
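
The formulas above can be sanity-checked by simulating the loop.  The
sketch below is in Python with hypothetical names, and it assumes that each
pass of the unrolled loop advances the pointer by N*B bytes, which is the
usual arrangement but is not spelled out in the table.

```python
B = 4   # bytes per limb


def unrolled_setup(S, L, C, entrypoint=0):
    N = 1 << L
    M = N - 1
    skip = (-S) & M             # limbs skipped in the first pass
    return {
        "jump": skip * C + entrypoint,
        "ptr_sub": skip * B,
        "counter": (S - 1) >> L,
        "skip": skip,
        "N": N,
    }


def limbs_processed(S, L):
    """Limb indexes touched, in order, for an operand of S limbs."""
    st = unrolled_setup(S, L, C=1)
    ptr = 0 - st["ptr_sub"]     # byte offset of the pointer from limb 0
    counter = st["counter"]
    slot = st["skip"]           # the computed jump enters at this slot
    visited = []
    while True:
        for i in range(slot, st["N"]):
            visited.append((ptr + i * B) // B)
        slot = 0
        ptr += st["N"] * B      # assumed per-pass pointer advance
        counter -= 1
        if counter < 0:         # the decrement took the counter to -1
            break
    return visited


example = unrolled_setup(5, L=2, C=15)   # 5 limbs, 4 limbs per loop
```

For every S >= 1 the simulation visits limbs 0 to S-1 exactly once, the
first pass being a partial one whenever S is not a multiple of N.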
   441  
   442  When working from most significant to least significant limb (eg. mpn_lshift
   443  and mpn_copyd), the calculations change as follows.
   444  
   445  	add to pointers          (-S & M) * B
   446  	displacements            0 to -B*(N-1)
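
The descending case can be simulated the same way.  This Python sketch
uses hypothetical names and assumes the pointer starts at the most
significant limb and moves down by N*B bytes per pass; neither detail is
stated in the table above.

```python
B = 4   # bytes per limb


def limbs_processed_descending(S, L):
    """Limb indexes touched, in order, working from high to low."""
    N = 1 << L
    M = N - 1
    skip = (-S) & M
    ptr = (S - 1) * B + skip * B    # most significant limb, plus the add
    counter = (S - 1) >> L
    slot = skip                     # the computed jump enters at this slot
    visited = []
    while True:
        for i in range(slot, N):
            visited.append((ptr - i * B) // B)   # displacement is -B*i
        slot = 0
        ptr -= N * B                # assumed per-pass pointer step
        counter -= 1
        if counter < 0:
            break
    return visited


order = limbs_processed_descending(5, 2)
```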
   447  
   448  
   449  
   450  OLD GAS 1.92.3
   451  
   452  This version comes with FreeBSD 2.2.8 and has a couple of gremlins that
   453  affect GMP code.
   454  
   455  Firstly, an expression involving two forward references to labels comes out
   456  as zero.  For example,
   457  
   458  		addl	$bar-foo, %eax
   459  	foo:
   460  		nop
   461  	bar:
   462  
   463  This should lead to "addl $1, %eax", but it comes out as "addl $0, %eax".
   464  When only one forward reference is involved, it works correctly, as for
   465  example,
   466  
   467  	foo:
   468  		addl	$bar-foo, %eax
   469  		nop
   470  	bar:
   471  
   472  Secondly, an expression involving two labels can't be used as the
   473  displacement for an leal.  For example,
   474  
   475  	foo:
   476  		nop
   477  	bar:
   478  		leal	bar-foo(%eax,%ebx,8), %ecx
   479  
   480  A slightly cryptic error is given, "Unimplemented segment type 0 in
   481  parse_operand".  When only one label is used it's ok, and the label can be a
   482  forward reference too, as for example,
   483  
   484  		leal	foo(%eax,%ebx,8), %ecx
   485  		nop
   486  	foo:
   487  
   488  These problems only affect PIC computed jump calculations.  The workarounds
   489  are just to do an leal without a displacement and then an addl, and to make
   490  sure the code is placed so that there's at most one forward reference in the
   491  addl.
   492  
   493  
   494  
   495  REFERENCES
   496  
   497  "Intel Architecture Software Developer's Manual", volumes 1, 2a, 2b, 3a, 3b,
   498  2006, order numbers 253665 through 253669.  Available on-line,
   499  
   500  	ftp://download.intel.com/design/Pentium4/manuals/25366518.pdf
   501  	ftp://download.intel.com/design/Pentium4/manuals/25366618.pdf
   502  	ftp://download.intel.com/design/Pentium4/manuals/25366718.pdf
   503  	ftp://download.intel.com/design/Pentium4/manuals/25366818.pdf
   504  	ftp://download.intel.com/design/Pentium4/manuals/25366918.pdf
   505  
   506  
   507  "System V Application Binary Interface", Unix System Laboratories Inc, 1992,
   508  published by Prentice Hall, ISBN 0-13-880410-9.  And the "Intel386 Processor
   509  Supplement", AT&T, 1991, ISBN 0-13-877689-X.  These have details of calling
   510  conventions and ELF shared library PIC coding.  Versions of both are
   511  available on-line,
   512  
   513  	http://www.sco.com/developer/devspecs
   514  
   515  "Intel386 Family Binary Compatibility Specification 2", Intel Corporation,
   516  published by McGraw-Hill, 1991, ISBN 0-07-031219-2.  (Same as the above 386
   517  ABI supplement.)
   518  
   519  
   520  
   521  ----------------
   522  Local variables:
   523  mode: text
   524  fill-column: 76
   525  End: