Copyright 1996, 1999-2001, 2003 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.





                   INTEL PENTIUM P5 MPN SUBROUTINES


This directory contains mpn functions optimized for Intel Pentium (P5,P54)
processors.  The mmx subdirectory has additional code for Pentium with MMX
(P55).


STATUS

                                cycles/limb

        mpn_add_n/sub_n            2.375

        mpn_mul_1                  12.0
        mpn_add/submul_1           14.0

        mpn_mul_basecase           14.2 cycles/crossproduct (approx)

        mpn_sqr_basecase            8 cycles/crossproduct (approx)
                                   or 15.5 cycles/triangleproduct (approx)

        mpn_l/rshift                5.375 normal (6.0 on P54)
                                    1.875 special shift by 1 bit

        mpn_divrem_1               44.0
        mpn_mod_1                  28.0
        mpn_divexact_by3           15.0

        mpn_copyi/copyd             1.0

Pentium MMX gets the following improvements

        mpn_l/rshift                1.75

        mpn_mul_1                  12.0 normal, 7.0 for 16-bit multiplier


mpn_add_n and mpn_sub_n run at asymptotically 2 cycles/limb.
Due to loop overhead and other delays (cache refill?), they run at or near
2.5 cycles/limb.

mpn_mul_1, mpn_addmul_1, mpn_submul_1 all run 1 cycle faster than they
should.  Intel documentation says a mul instruction is 10 cycles, but it
measures 9 and the routines using it run as if it were 9.



P55 MMX AND X87

The cost of switching between MMX and x87 floating point on P55 is about 100
cycles (fld1/por/emms for instance).  To avoid that cost the two aren't
mixed, and currently that means using MMX and not x87.

MMX offers a big speedup for lshift and rshift, and a nice speedup for
16-bit multipliers in mpn_mul_1.  If fast code using x87 is found then
perhaps the preference for MMX will be reversed.




P54 SHLDL

mpn_lshift and mpn_rshift run at about 6 cycles/limb on P5 and P54, but the
documentation indicates that they should take only 43/8 = 5.375 cycles/limb,
or 5 cycles/limb asymptotically.  The P55 runs them at the expected speed.

It seems that on P54 a shldl or shrdl allows pairing in one following cycle,
but not two.  For example, back to back repetitions of the following

        shldl(  %cl, %eax, %ebx)
        xorl    %edx, %edx
        xorl    %esi, %esi

run at 5 cycles, as expected, but repetitions of the following run at 7
cycles, whereas 6 would be expected (and is achieved on P55),

        shldl(  %cl, %eax, %ebx)
        xorl    %edx, %edx
        xorl    %esi, %esi
        xorl    %edi, %edi
        xorl    %ebp, %ebp

Three xorls run at 7 cycles too, so it doesn't seem to be just that pairing
is inhibited only in the second following cycle (or something like that).

Avoiding this problem would bring P54 shifts down from 6.0 c/l to 5.5 with a
pattern of shift, 2 loads, shift, 2 stores, shift, etc.  A start has been
made on something like that, but it's not yet complete.
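
For reference, the work done by each shldl in these loops can be sketched in
C.  This is only an illustration of the operation, not the code in this
directory: the names limb_t and ref_lshift are made up for the example, and
it assumes 32-bit limbs and a shift count from 1 to 31.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;   /* hypothetical name: one 32-bit limb */

/* Shift {up,n} left by cnt bits, 1 <= cnt <= 31, and store the result
   at {rp,n}.  Return the bits shifted out of the most significant
   limb.  Limbs are processed from the high end downwards.  */
static limb_t
ref_lshift (limb_t *rp, const limb_t *up, size_t n, unsigned cnt)
{
  unsigned tnc = 32 - cnt;
  limb_t high = up[n - 1];
  limb_t retval = high >> tnc;
  size_t i;

  for (i = n - 1; i > 0; i--)
    {
      limb_t low = up[i - 1];
      /* One shldl: the top cnt bits of low enter the bottom of high.  */
      rp[i] = (high << cnt) | (low >> tnc);
      high = low;
    }
  rp[0] = high << cnt;
  return retval;
}
```

Each iteration needs one load, one double-word shift and one store, which is
why the pairing of the instructions following a shldl matters for the loop
speed.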




OTHER NOTES

Prefetching Destinations

    Pentium doesn't allocate cache lines on writes, unlike most other modern
    processors.  Since the functions in the mpn class do array writes, we
    have to handle allocating the destination cache lines by reading a word
    from the destination in the loops, to achieve the best performance.

Prefetching Sources

    Prefetching of sources is pointless since there are no out-of-order
    loads.  Any load instruction blocks until the line is brought to L1, so
    it may as well be the load that wants the data which blocks.

Data Cache Bank Clashes

    Pairing of memory operations requires that the two issued operations
    refer to different cache banks (i.e. different addresses modulo 32
    bytes).  The simplest way to ensure this is to read/write two words from
    the same object.  If we make operations on different objects, they might
    or might not be to the same cache bank.

PIC %eip Fetching

    A simple call $+5 and popl can be used to get %eip; there's no need to
    balance calls and returns since P5 doesn't have any return stack branch
    prediction.

Float Multiplies

    fmul is pairable and can be issued every 2 cycles (with a 4 cycle
    latency for data ready to use).  This is a lot better than integer mull
    or imull at 9 cycles non-pairing.  Unfortunately the advantage is
    quickly eaten away by needing to throw data through memory back to the
    integer registers to adjust for fild and fist being signed, and to do
    things like propagating carry bits.





REFERENCES

"Intel Architecture Optimization Manual", 1997, order number 242816.  This
is mostly about P5, the parts about P6 aren't relevant.  Available on-line:

        http://download.intel.com/design/PentiumII/manuals/242816.htm



----------------
Local variables:
mode: text
fill-column: 76
End: