Copyright 2000, 2001 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.




                      AMD K7 MPN SUBROUTINES


This directory contains code optimized for the AMD Athlon CPU.

The mmx subdirectory has routines using MMX instructions.  All Athlons have
MMX; the separate directory exists only so that configure can omit it if the
assembler doesn't support MMX.



STATUS

Times for the loops, with all code and data in L1 cache.

                               cycles/limb
        mpn_add/sub_n             1.6

        mpn_copyi                 0.75 or 1.0   \ varying with data alignment
        mpn_copyd                 0.75 or 1.0   /

        mpn_divrem_1             17.0 integer part, 15.0 fractional part
        mpn_mod_1                17.0
        mpn_divexact_by3          8.0

        mpn_l/rshift              1.2

        mpn_mul_1                 3.4
        mpn_addmul/submul_1       3.9

        mpn_mul_basecase          4.42 cycles/crossproduct (approx)
        mpn_sqr_basecase          2.3 cycles/crossproduct (approx)
                                  or 4.55 cycles/triangleproduct (approx)

Prefetching of sources hasn't yet been tried.
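
The two mpn_sqr_basecase figures in the table can be related with a small
portable C sketch.  A schoolbook n-limb square has n*n crossproducts, but a
squaring routine need only multiply the pairs with i < j (a triangle of
about half as many products), double that sum, and add the n diagonal
squares.  That is presumably why the per-triangleproduct figure is roughly
twice the per-crossproduct one.  The sketch below only illustrates the idea
and is not the assembly in this directory; the names limb_t and
ref_sqr_basecase and the 32-bit limb size are assumptions made for the
example.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t limb_t;   /* assumption: 32-bit limbs, as on x86 */

/* Hypothetical reference routine: rp[0..2n-1] = ap[0..n-1] squared. */
static void ref_sqr_basecase(limb_t *rp, const limb_t *ap, size_t n)
{
    for (size_t i = 0; i < 2 * n; i++)
        rp[i] = 0;

    /* Off-diagonal triangle: only products ap[i]*ap[j] with i < j. */
    for (size_t i = 0; i < n; i++) {
        limb_t carry = 0;
        for (size_t j = i + 1; j < n; j++) {
            uint64_t t = (uint64_t)ap[i] * ap[j] + rp[i + j] + carry;
            rp[i + j] = (limb_t)t;
            carry = (limb_t)(t >> 32);
        }
        rp[i + n] = carry;   /* this limb has not been written before */
    }

    /* Double the triangle with a one-bit left shift over all 2n limbs. */
    limb_t top = 0;
    for (size_t i = 0; i < 2 * n; i++) {
        limb_t v = rp[i];
        rp[i] = (v << 1) | top;
        top = v >> 31;
    }

    /* Add the diagonal squares ap[i]^2 at limb position 2i. */
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t sq = (uint64_t)ap[i] * ap[i];
        uint64_t lo = (uint64_t)rp[2 * i] + (limb_t)sq + cy;
        rp[2 * i] = (limb_t)lo;
        uint64_t hi = (uint64_t)rp[2 * i + 1] + (limb_t)(sq >> 32)
                      + (limb_t)(lo >> 32);
        rp[2 * i + 1] = (limb_t)hi;
        cy = (limb_t)(hi >> 32);
    }
}
```

The triangle loop performs n*(n-1)/2 multiplies and the diagonal loop n
more, against n*n for a plain schoolbook product of the same size.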



NOTES

cmov, MMX, 3DNow and some extensions to MMX and 3DNow are available.

The L1 data cache is write-allocate, so prefetching of destinations is
unnecessary.

Floating point multiplications can be done in parallel with integer
multiplications, but there doesn't seem to be any way to make use of this.

Unsigned "mul"s can be issued every 3 cycles.  This suggests 3 cycles/limb
is a limit on the speed of the multiplication routines.  The documentation
shows mul executing in IEU0 (or maybe in IEU0 and IEU1 together), so it
might be that, to get near 3 cycles, code has to be arranged so that nothing
else is issued to IEU0.  A busy IEU0 could explain why some code takes 4
cycles and other apparently equivalent code takes 5.
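
The reason the mul issue rate bounds these routines can be seen from a
portable C sketch of an addmul_1 style loop: every limb processed needs
exactly one 32x32->64 bit multiply, so the loop cannot run faster than mul
instructions can be issued.  This is only an illustration under the
assumption of 32-bit limbs; the names limb_t and ref_addmul_1 are
hypothetical and the code is not the K7 assembly.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t limb_t;   /* assumption: 32-bit limbs, as on x86 */

/* Hypothetical reference routine: {rp,n} += {ap,n} * b.
   Returns the limb carried out of the top.  One widening multiply is
   done per limb, which is where the mul issue rate becomes the bound. */
static limb_t ref_addmul_1(limb_t *rp, const limb_t *ap, size_t n, limb_t b)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* (2^32-1)^2 + 2*(2^32-1) = 2^64-1, so this cannot overflow */
        uint64_t t = (uint64_t)ap[i] * b + rp[i] + carry;
        rp[i] = (limb_t)t;
        carry = (limb_t)(t >> 32);
    }
    return carry;
}
```

Measured against that bound, the 3.4 and 3.9 cycles/limb in the STATUS table
leave under one cycle per limb for the adds, loads, stores and loop control.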



OPTIMIZATIONS

Unrolled loops are used to reduce looping overhead.  The unrolling is
configurable up to 32 limbs/loop for most routines and up to 64 for some.
The K7 has a 64k L1 code cache, so quite large unrolling is affordable.

Computed jumps into the unrolling are used to handle sizes not a multiple of
the unrolling.  An attractive feature of this is that times increase
smoothly with operand size, but it may be that some routines should just
have simple loops to finish up, especially when PIC adds between 2 and 16
cycles to get %eip.
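
A rough C analogue of a computed jump into an unrolled loop is a switch that
enters the unrolled body part way through, the construction often called
Duff's device.  The sketch below uses 4x unrolling to stay short, whereas
the routines here unroll much further and compute the jump address in
assembly.  The names limb_t and copy_unrolled are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

typedef uint32_t limb_t;   /* assumption: 32-bit limbs, as on x86 */

/* Copy n limbs with a 4x unrolled loop.  The switch jumps into the middle
   of the unrolled body, so the n % 4 leftover limbs are handled by a
   shortened first pass instead of by a separate clean-up loop. */
static void copy_unrolled(limb_t *dst, const limb_t *src, size_t n)
{
    if (n == 0)
        return;

    size_t passes = (n + 3) / 4;   /* the first pass may be partial */

    switch (n % 4) {
    case 0: do { *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--passes > 0);
    }
}
```

Because the entry point depends only on n % 4, each extra limb adds one
copy to the work done, which corresponds to the smooth growth in time with
operand size described above.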

Position independent code is implemented using a call to get %eip for the
computed jumps.  A ret is always done to return from that call, rather than
an addl $4,%esp or a popl, so the CPU return address branch prediction stack
stays synchronised with the actual stack in memory.

Branch prediction, in the absence of any history, will guess forward jumps
are not taken and backward jumps are taken.  Where possible it's arranged
that the less likely or less important case is under a taken forward jump.
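
In C the same layout preference can only be hinted at.  The sketch below
uses __builtin_expect, a GCC and Clang extension, to mark a rare case so
that the compiler is free to place it out of the fall-through path.  Whether
a taken forward jump is actually emitted is up to the compiler, so this is
an analogy to the hand-arranged assembly rather than a guarantee.  The names
UNLIKELY and sum_limbs are hypothetical.

```c
#include <stddef.h>

/* GCC/Clang extension: tell the compiler the condition is rarely true. */
#define UNLIKELY(x) __builtin_expect(!!(x), 0)

/* Hypothetical example routine: sum n limbs. */
static unsigned long sum_limbs(const unsigned *p, size_t n)
{
    unsigned long total = 0;

    /* Rare case, hinted so it can be moved off the fall-through path. */
    if (UNLIKELY(n == 0))
        return 0;

    /* The loop-closing branch is a backward jump, which the static
       prediction rule above guesses as taken. */
    for (size_t i = 0; i < n; i++)
        total += p[i];

    return total;
}
```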



CODING

Instructions in general code have been shown grouped if they can execute
together, which means up to three direct-path instructions which have no
successive dependencies.  K7 always decodes three and has out-of-order
execution, but the groupings show what slots might be available and what
dependency chains exist.

When there are vector-path instructions an effort is made to get triplets
of direct-path instructions in between them, even if there are dependencies,
since this maximizes decoding throughput and might save a cycle or two if
decoding is the limiting factor.



INSTRUCTIONS

adcl       direct
divl       39 cycles back-to-back
lodsl,etc  vector
loop       1 cycle vector (decl/jnz opens up one decode slot)
movd reg   vector
movd mem   direct
mull       issue every 3 cycles, latency 4 cycles low word, 6 cycles high word
popl       vector (use movl for more than one pop)
pushl      direct, will pair with a load
shrdl %cl  vector, 3 cycles, seems to be 3 decode too
xorl r,r   false read dependency recognised



REFERENCES

"AMD Athlon Processor X86 Code Optimization Guide", AMD publication number
22007, revision K, February 2002.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf

"3DNow Technology Manual", AMD publication number 21928G/0-March 2000.
This describes the femms and prefetch instructions.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/21928.pdf

"AMD Extensions to the 3DNow and MMX Instruction Sets Manual", AMD
publication number 22466, revision D, March 2000.  This describes
instructions added in the Athlon processor, such as pswapd and the extra
prefetch forms.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22466.pdf

"3DNow Instruction Porting Guide", AMD publication number 22621, revision B,
August 1999.  This has some notes on general Athlon optimizations as well as
3DNow.  Available on-line,

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22621.pdf




----------------
Local variables:
mode: text
fill-column: 76
End: