Copyright 1996, 1999-2001, 2003 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.





                   INTEL PENTIUM P5 MPN SUBROUTINES


This directory contains mpn functions optimized for Intel Pentium (P5,P54)
processors.  The mmx subdirectory has additional code for Pentium with MMX
(P55).


STATUS

                                cycles/limb

        mpn_add_n/sub_n            2.375

        mpn_mul_1                 12.0
        mpn_add/submul_1          14.0

        mpn_mul_basecase          14.2 cycles/crossproduct (approx)

        mpn_sqr_basecase           8 cycles/crossproduct (approx)
                                   or 15.5 cycles/triangleproduct (approx)

        mpn_l/rshift               5.375 normal (6.0 on P54)
                                   1.875 special shift by 1 bit

        mpn_divrem_1              44.0
        mpn_mod_1                 28.0
        mpn_divexact_by3          15.0

        mpn_copyi/copyd            1.0

Pentium MMX gets the following improvements

        mpn_l/rshift               1.75

        mpn_mul_1                 12.0 normal, 7.0 for 16-bit multiplier


mpn_add_n and mpn_sub_n run at asymptotically 2 cycles/limb.  Due to loop
overhead and other delays (cache refill?), they run at or near 2.5
cycles/limb.

mpn_mul_1, mpn_addmul_1 and mpn_submul_1 all run 1 cycle faster than they
should.  Intel documentation says a mul instruction takes 10 cycles, but it
measures 9, and the routines using it run as if it takes 9.



P55 MMX AND X87

The cost of switching between MMX and x87 floating point on P55 is about 100
cycles (fld1/por/emms for instance).  To avoid that cost the two aren't
mixed, and currently that means using MMX and not x87.

MMX offers a big speedup for lshift and rshift, and a nice speedup for
16-bit multipliers in mpn_mul_1.  If fast code using x87 is found then
perhaps the preference for MMX will be reversed.




P54 SHLDL

mpn_lshift and mpn_rshift run at about 6 cycles/limb on P5 and P54, but the
documentation indicates that they should take only 43/8 = 5.375 cycles/limb,
or 5 cycles/limb asymptotically.  The P55 runs them at the expected speed.
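
As a rough check on the figures above, 43/8 is what a loop would give if it
handled 8 limbs per pass at 5 cycles each plus 3 cycles of overhead.  That
breakdown is an assumption chosen to match the two numbers quoted, not
something the documentation states.  A minimal C sketch (cycles_per_limb is
a hypothetical helper, not part of GMP):

```c
/* Average cycles per limb for a loop that handles `unroll` limbs per pass.
   Assumes each limb costs `per_limb` cycles and each pass adds `overhead`
   cycles.  All three inputs are illustrative assumptions, not measurements. */
static double cycles_per_limb(double per_limb, double overhead, int unroll)
{
    return (per_limb * unroll + overhead) / unroll;
}
```

With those assumed inputs, cycles_per_limb(5.0, 3.0, 8) gives 43/8 = 5.375,
and the result approaches 5 as the unroll count grows.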

It seems that on P54 a shldl or shrdl allows pairing in one following cycle,
but not two.  For example, back-to-back repetitions of the following

        shldl(  %cl, %eax, %ebx)
        xorl    %edx, %edx
        xorl    %esi, %esi

run at 5 cycles, as expected, but repetitions of the following run at 7
cycles, whereas 6 would be expected (and is achieved on P55),

        shldl(  %cl, %eax, %ebx)
        xorl    %edx, %edx
        xorl    %esi, %esi
        xorl    %edi, %edi
        xorl    %ebp, %ebp

Three xorls run at 7 cycles too, so it doesn't seem to be just that pairing
is inhibited only in the second following cycle (or something like that).

Avoiding this problem would bring P54 shifts down from 6.0 c/l to 5.5 with a
pattern of shift, 2 loads, shift, 2 stores, shift, etc.  A start has been
made on something like that, but it's not yet complete.




OTHER NOTES

Prefetching Destinations

    Pentium doesn't allocate cache lines on writes, unlike most other modern
    processors.  Since the functions in the mpn class do array writes, we
    have to allocate the destination cache lines ourselves, by reading a
    word from each line in the loops, to achieve the best performance.
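
    The idea can be sketched in C as follows.  This is only an
    illustration, not the code in this directory: copy_with_dest_touch and
    WORDS_PER_LINE are hypothetical names, and the sketch assumes 32-byte
    cache lines, 4-byte words and a line-aligned destination.  A compiler
    is free to schedule this differently from hand-written assembly.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 32-byte cache lines holding eight 4-byte words. */
#define WORDS_PER_LINE 8

/* Hypothetical helper, not GMP code: copy n words from src to dst, reading
   one word of each destination line before storing into it, so that the
   line is already in L1 when the stores arrive. */
static void copy_with_dest_touch(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        if (i % WORDS_PER_LINE == 0) {
            /* Dummy read of the destination; the volatile access keeps the
               compiler from discarding it.  The value itself is unused. */
            (void) *(volatile const uint32_t *) &dst[i];
        }
        dst[i] = src[i];
    }
}
```

    The dummy read does not change the result of the copy; it only affects
    whether the destination line is in L1 when the stores are issued.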

Prefetching Sources

    Prefetching of sources is pointless since there are no out-of-order
    loads.  Any load instruction blocks until the line is brought to L1, so
    the blocking load may as well be the one that wants the data.

Data Cache Bank Clashes

    Pairing of memory operations requires that the two issued operations
    refer to different cache banks (i.e. different addresses modulo 32
    bytes).  The simplest way to ensure this is to read/write two words from
    the same object.  Operations on different objects might or might not
    refer to the same cache bank.
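
    A small C sketch of the bank test may make this concrete.  It assumes
    the data cache is split into eight banks interleaved on 4-byte
    boundaries, so that the bank is selected by address bits 2 to 4.  That
    layout is an assumption taken from general P5 descriptions rather than
    from this file, and bank_of and same_bank are hypothetical names.

```c
#include <stdint.h>

/* Assumption: eight banks, each 4 bytes wide, so the bank number is
   address bits 2..4, i.e. (address mod 32) / 4. */
static unsigned bank_of(uintptr_t addr)
{
    return (unsigned) ((addr >> 2) & 7u);
}

/* Nonzero if two accesses would refer to the same bank under that
   assumption, and so could not pair. */
static int same_bank(uintptr_t a, uintptr_t b)
{
    return bank_of(a) == bank_of(b);
}
```

    Under this model two adjacent words of one object always fall in
    different banks, while two words from unrelated objects clash whenever
    their addresses agree in bits 2 to 4.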

PIC %eip Fetching

    A simple call $+5 and popl can be used to get %eip.  There's no need to
    balance calls and returns since P5 doesn't have any return stack branch
    prediction.

Float Multiplies

    fmul is pairable and can be issued every 2 cycles (with a 4 cycle
    latency for data ready to use).  This is a lot better than integer mull
    or imull at 9 cycles non-pairing.  Unfortunately the advantage is
    quickly eaten away by needing to throw data through memory back to the
    integer registers to adjust for fild and fist being signed, and to do
    things like propagating carry bits.





REFERENCES

"Intel Architecture Optimization Manual", 1997, order number 242816.  This
is mostly about P5; the parts about P6 aren't relevant.  Available on-line:

        http://download.intel.com/design/PentiumII/manuals/242816.htm



----------------
Local variables:
mode: text
fill-column: 76
End: