github.com/aergoio/aergo@v1.3.1/libtool/src/gmp-6.1.2/mpn/x86/k6/README

Copyright 2000, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.




			AMD K6 MPN SUBROUTINES



This directory contains code optimized for AMD K6 CPUs, meaning K6, K6-2 and
K6-3.

The mmx subdirectory has MMX code suiting plain K6; the k62mmx subdirectory
has MMX code suiting K6-2 and K6-3.  All chips in the K6 family have MMX;
the separate directories are just so that ./configure can omit them if the
assembler doesn't support MMX.




STATUS

Times for the loops, with all code and data in L1 cache, are as follows.

                                 cycles/limb

	mpn_add_n/sub_n            3.25 normal, 2.75 in-place

	mpn_mul_1                  6.25
	mpn_add/submul_1           7.65-8.4  (varying with data values)

	mpn_mul_basecase           9.25 cycles/crossproduct (approx)
	mpn_sqr_basecase           4.7  cycles/crossproduct (approx)
                                      or 9.2 cycles/triangleproduct (approx)

	mpn_l/rshift               3.0

	mpn_divrem_1              20.0
	mpn_mod_1                 20.0
	mpn_divexact_by3          11.0

	mpn_copyi                  1.0
	mpn_copyd                  1.0


K6-2 and K6-3 have dual-issue MMX and get the following improvements.

	mpn_l/rshift               1.75


Prefetching of sources hasn't yet given any joy.  With the 3DNow "prefetch"
instruction, code seems to run slower, and with just "mov" loads it doesn't
seem faster.  Results so far are inconsistent.  The K6 does a hardware
prefetch of the second cache line in a sector, so the penalty for not
prefetching in software is reduced.




NOTES

All K6 family chips have MMX, but only K6-2 and K6-3 have 3DNow.

Plain K6 executes MMX instructions only in the X pipe, but K6-2 and K6-3 can
execute them in both X and Y (and in both together).

Branch misprediction penalty is 1 to 4 cycles (Optimization Manual
chapter 6 table 12).

Write-allocate L1 data cache means prefetching of destinations is unnecessary.
Store queue is 7 entries of 64 bits each.

Floating point multiplications can be done in parallel with integer
multiplications, but there doesn't seem to be any way to make use of this.



OPTIMIZATIONS

Unrolled loops are used to reduce looping overhead.  The unrolling is
configurable up to 32 limbs/loop for most routines, up to 64 for some.

Sometimes computed jumps into the unrolling are used to handle sizes not a
multiple of the unrolling.  An attractive feature of this is that times
increase smoothly with operand size, but an indirect jump is about 6 cycles
and the setups about another 6, so whether a computed jump ought to be used
depends on how much faster the unrolled code is than a simple loop.
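
As a rough sketch of the idea (not code from these sources; the label, the
16-way unrolling and the 8 bytes per unrolled copy are assumptions made for
illustration), the entry point is computed from the leftover count and
reached with an indirect jump:

	movl	$16, %eax
	subl	%edx, %eax		# %edx = leftover = size % 16
	shll	$3, %eax		# assume 8 bytes of code per copy
	addl	$unrolled_top, %eax	# hypothetical label of the first copy
	jmp	*%eax			# indirect jump, roughly 6 cycles

Execution then falls through the remaining copies, so exactly "leftover"
copies run before the loop proper begins.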

Position independent code is implemented using a call to get eip for the
computed jumps.  A matching ret is always done, rather than an addl $4,%esp
or a popl, so the CPU return address branch prediction stack stays
synchronised with the actual stack in memory.  Such a call however still
costs 4 to 7 cycles.
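
The basic idiom looks roughly like the following (schematic only; the label
names and the use of %ebx are made up for the example):

	call	pic_calc
here:
	# %ebx now holds the address of "here"
	addl	$unrolled_top-here, %ebx	# offset to the computed target
	jmp	*%ebx

pic_calc:
	movl	(%esp), %ebx		# read the return address, don't pop it
	ret				# balanced with the call above

Because the target is formed relative to "here", the code works wherever
it's loaded, and every call is still matched by a ret.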

Branch prediction, in the absence of any history, will guess forward jumps
are not taken and backward jumps are taken.  Where possible it's arranged
that the less likely or less important case is under a taken forward jump.
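
For instance (a made-up fragment; registers and the label are arbitrary), a
rare size==0 case can be placed past a forward jump so that the common path
falls straight through on the static not-taken guess:

	orl	%ecx, %ecx		# test the size
	jz	size_zero		# rare: forward jump, guessed not taken
	movl	(%ebx), %eax		# common case falls through
	movl	%eax, (%edi)
size_zero:
	ret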



MMX

Putting emms or femms as late as possible in a routine seems to be fastest.
Perhaps an emms or femms stalls until all outstanding MMX instructions have
completed, so putting it later gives them a chance to complete on their own,
in parallel with other operations (like register popping).
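
Schematically (a made-up epilogue, not from these sources), that means
ending a routine along the lines of:

	movq	%mm0, (%edi)		# final MMX store
	popl	%esi			# the MMX work drains while the
	popl	%edi			# registers are popped
	femms				# as late as possible
	ret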

The Optimization Manual chapter 5 recommends using a femms on K6-2 and K6-3
at the start of a routine, in case it's been preceded by x87 floating point
operations.  This isn't done because in GMP programs it's expected that x87
floating point won't be much used, so chances are an mpn routine won't have
been preceded by any x87 code.



CODING

Instructions in general code are shown paired if they can decode and execute
together, meaning two short decode instructions with the second not
depending on the first, only the first using the shifter, no more than one
load, and no more than one store.

K6 does some out of order execution so the pairings aren't essential; they
just show what slots might be available.  When decoding is the limiting
factor, things can be scheduled that might not execute until later.
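
A pair meeting those conditions might look like the following hypothetical
fragment:

	movl	4(%ebx), %eax		# short decode, the one load
	addl	%ecx, %edx		# short decode, independent of the load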



NOTES

Code alignment

- if an opcode/modrm or 0Fh/opcode/modrm crosses a cache line boundary,
  short decode is inhibited.  The cross.pl script detects this.

- loops and branch targets should be aligned to 16 bytes, or ensure at least
  2 instructions before a 32 byte boundary.  This makes use of the 16 byte
  cache in the BTB.
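
  For example (plain assembler directives shown; the sources wrap this in
  their own macros), a loop top can be aligned with:

	.balign	16
loop_top:
	movl	(%ebx), %eax
	movl	%eax, (%edi)
	leal	4(%ebx), %ebx
	leal	4(%edi), %edi
	decl	%ecx
	jnz	loop_top		# backward jump, predicted taken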

Addressing modes

- (%esi) degrades decoding from short to vector.  0(%esi) doesn't have this
  problem and can be used as an equivalent, or it's often easier just to use
  a different register, like %ebx.

- K6 and pre-CXT core K6-2 have the following problem.  (K6-2 CXT and K6-3
  have it fixed, these being cpuid function 1 signatures 0x588 to 0x58F).

  If more than 3 bytes are needed to determine instruction length then
  decoding degrades from direct to long, or from long to vector.  This
  happens with forms like "0F opcode mod/rm" with mod/rm=00-xxx-100 since
  with mod=00 the sib determines whether there's a displacement.

  This affects all MMX and 3DNow instructions, and others with an 0F prefix,
  like movzbl.  The modes affected are anything with an index and no
  displacement, or an index but no base, and this includes (%esp) which is
  really (,%esp,1).

  The cross.pl script detects problem cases.  The workaround is to always
  use a displacement, and to do this with Zdisp if it's zero so the
  assembler doesn't discard it (a sketch follows below).

  See Optimization Manual rev D page 67 and 3DNow Porting Guide rev B pages
  13-14 and 36-37.
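
  For instance (illustrative only; the exact Zdisp invocation lives in the
  sources' support macros):

	# problem form: index but no displacement, so the sib byte is
	# needed before the instruction length is known
	movq	(%eax,%ecx,4), %mm0

	# workaround: an explicit zero displacement.  A plain assembler
	# would normally optimize the 0 byte away again, which is what
	# the Zdisp macro prevents.
	movq	0(%eax,%ecx,4), %mm0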

Calls

- indirect jumps and calls are not branch predicted; they measure about 6
  cycles.

Various

- adcl      2 cycles of decode, maybe 2 cycles executing in the X pipe
- bsf       12-27 cycles
- emms      5 cycles
- femms     3 cycles
- jecxz     2 cycles taken, 13 not taken (optimization manual says 7 not taken)
- divl      20 cycles back-to-back
- imull     2 decode, 3 execute
- mull      2 decode, 3 execute (optimization manual decoding sample)
- prefetch  2 cycles
- rcll/rcrl implicit by one bit: 2 cycles
            immediate or %cl count: 11 + 2 per bit for dword
                                    13 + 4 per bit for byte
- setCC     2 cycles
- xchgl %eax,reg  1.5 cycles, back-to-back (strange)
        reg,reg   2 cycles, back-to-back




REFERENCES

"AMD-K6 Processor Code Optimization Application Note", AMD publication
number 21924, revision D amendment 0, January 2000.  This describes K6-2 and
K6-3.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21924.pdf

"AMD-K6 MMX Enhanced Processor x86 Code Optimization Application Note", AMD
publication number 21828, revision A amendment 0, August 1997.  This is an
older edition of the above document, describing plain K6.  Available
on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21828.pdf

"3DNow Technology Manual", AMD publication number 21928G/0-March 2000.
This describes the femms and prefetch instructions, but nothing else from
3DNow has been used.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21928.pdf

"3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
August 1999.  This has some notes on general K6 optimizations as well as
3DNow.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22621.pdf



----------------
Local variables:
mode: text
fill-column: 76
End: