github.com/aergoio/aergo@v1.3.1/libtool/src/gmp-6.1.2/mpn/alpha/README

     1  Copyright 1996, 1997, 1999-2005 Free Software Foundation, Inc.
     2  
     3  This file is part of the GNU MP Library.
     4  
     5  The GNU MP Library is free software; you can redistribute it and/or modify
     6  it under the terms of either:
     7  
     8    * the GNU Lesser General Public License as published by the Free
     9      Software Foundation; either version 3 of the License, or (at your
    10      option) any later version.
    11  
    12  or
    13  
    14    * the GNU General Public License as published by the Free Software
    15      Foundation; either version 2 of the License, or (at your option) any
    16      later version.
    17  
    18  or both in parallel, as here.
    19  
    20  The GNU MP Library is distributed in the hope that it will be useful, but
    21  WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
    22  or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
    23  for more details.
    24  
    25  You should have received copies of the GNU General Public License and the
    26  GNU Lesser General Public License along with the GNU MP Library.  If not,
    27  see https://www.gnu.org/licenses/.
    28  
    29  
    30  
    31  
    32  
    33  This directory contains mpn functions optimized for DEC Alpha processors.
    34  
    35  ALPHA ASSEMBLY RULES AND REGULATIONS
    36  
    37  The `.prologue N' pseudo-op marks the end of the instructions that need
    38  special handling by unwinding.  It also says whether $27 is really needed for
    39  computing the gp.  The `.mask M' pseudo-op says which registers are saved on
    40  the stack, and at what offset in the frame.
    41  
    42  Cray T3 code is very very different...
    43  
    44  "$6" / "$f6" etc is the usual syntax for registers, but on Unicos instead "r6"
    45  / "f6" is required.  We use the "r6" / "f6" forms, and have m4 defines expand
    46  them to "$6" or "$f6" where necessary.
    47  
    48  "0x" introduces a hex constant in gas and DEC as, but on Unicos "^X" is
    49  required.  The X() macro accommodates this difference.
    50  
    51  "cvttqc" is required by DEC as, "cvttq/c" is required by Unicos, and gas will
    52  accept either.  We use cvttqc and have an m4 define expand to cvttq/c where
    53  necessary.
    54  
    55  "not" as an alias for "ornot r31, ..." is available in gas and DEC as, but not
    56  the Unicos assembler.  The full "ornot" must be used.
    57  
    58  "unop" is not available in Unicos.  We make an m4 define to the usual "ldq_u
    59  r31,0(r30)", and in fact use that define on all systems since it comes out the
    60  same.
    61  
    62  "!literal!123" etc explicit relocations as per Tru64 4.0 are apparently not
    63  available in older alpha assemblers (including gas prior to 2.12), according to
    64  the GCC manual, so the assembler macro forms must be used (eg. ldgp).
    65  
    66  
    67  
    68  RELEVANT OPTIMIZATION ISSUES
    69  
    70  EV4
    71  
    72  1. This chip has very limited store bandwidth.  The on-chip L1 cache is
    73     write-through, and transferring a cache line from the store buffer to the
    74     off-chip L2 takes as much as 15 cycles on most systems.  This delay hurts
    75     mpn_add_n, mpn_sub_n, mpn_lshift, and mpn_rshift.
    76  
    77  2. Pairing is possible between memory instructions and integer arithmetic
    78     instructions.
    79  
    80  3. mulq and umulh are documented to have a latency of 23 cycles, but 2 of these
    81     cycles are pipelined.  Thus, multiply instructions can be issued at a rate
    82     of one every 21 cycles.
    83  
    84  EV5
    85  
    86  1. The memory bandwidth of this chip is good, both for loads and stores.  The
    87     L1 cache can handle two loads or one store per cycle, but two cycles after a
    88     store, no ld can issue.
    89  
    90  2. mulq has a latency of 12 cycles and an issue rate of one every 8 cycles.
    91     umulh has a latency of 14 cycles and an issue rate of one every 10 cycles.
    92     (Note that published documentation gets these numbers slightly wrong.)
    93  
    94  3. mpn_add_n.  With 4-fold unrolling, we need 37 instructions, whereof 12
    95     are memory operations.  This will take at least
    96  	floor(37/2) [dual issue] + 1 [taken branch] = 19 cycles
    97     We have 12 memory cycles, plus 4 after-store conflict cycles, or 16 data
    98     cache cycles, which should be completely hidden in the 19 issue cycles.
    99     The computation is inherently serial, with these dependencies:
   100  
   101  	       ldq  ldq
   102  		 \  /\
   103  	  (or)   addq |
   104  	   |\   /   \ |
   105  	   | addq  cmpult
   106  	    \  |     |
   107  	     cmpult  |
   108  		 \  /
   109  		  or
   110  
   111     I.e., 3 operations are needed between carry-in and carry-out, making 12
   112     cycles the absolute minimum for the 4 limbs.  We could replace the `or' with
   113     a cmoveq/cmovne, which could issue one cycle earlier than the `or', but that
   114     might waste a cycle on EV4.  The total depth remains unaffected, since cmov
   115     has a latency of 2 cycles.
   116  
   117       addq
   118       /   \
   119     addq  cmpult
   120       |      \
   121     cmpult -> cmovne
   122  
   123    Montgomery has a slightly different way of computing carry that requires one
   124    less instruction, but has depth 4 (instead of the current 3).  Since the code
   125    is currently instruction issue bound, Montgomery's idea should save us 1/2
   126    cycle per limb, or bring us down to a total of 17 cycles or 4.25 cycles/limb.
   127    Unfortunately, this method will not be good for the EV6.
   128  
   129  4. addmul_1 and friends: We previously had a scheme for splitting the
   130     single-limb operand into 21-bit chunks and the multi-limb operand into
   131     32-bit chunks, and then using FP operations for every other multiply and
   132     integer operations for the remaining multiplies.
   133  
   134     But it seems much better to split the single-limb operand into 16-bit
   135     chunks, since we save many integer shifts and adds that way.  See
   136     powerpc64/README for some more details.
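
The carry chain described for mpn_add_n in item 3 can be written out in C as
a rough sketch.  The function name add_n_sketch is hypothetical and this is
not GMP's code; it only mirrors the dependency diagram, with each statement
labelled by the Alpha instruction it stands for.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper, not GMP's mpn_add_n: adds the n-limb operands a and b
   into r and returns the final carry (0 or 1). */
uint64_t add_n_sketch(uint64_t *r, const uint64_t *a, const uint64_t *b,
                      size_t n)
{
    uint64_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s  = a[i] + b[i];  /* addq: limb sum, independent of cy */
        uint64_t c1 = s < b[i];     /* cmpult: carry out of a + b, off the
                                       carry-in critical path */
        uint64_t t  = s + cy;       /* addq: fold in carry-in (depth 1) */
        uint64_t c2 = t < cy;       /* cmpult: carry out of s + cy (depth 2) */
        r[i] = t;
        cy = c1 | c2;               /* or: at most one of c1, c2 is set
                                       (depth 3) */
    }
    return cy;
}
```

Only the second addq, the second cmpult and the or depend on the incoming
carry, which is where the 3 operations per limb in the text come from.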
   137  
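The chunking idea in item 4 can be illustrated in C.  This is a sketch under
assumptions the text does not confirm: that every partial product is formed
in double-precision FP, and that the multi-limb operand is still taken in
32-bit chunks.  The names add_shifted and mul_chunks_fp_sketch are
hypothetical.  A 16-bit chunk times a 32-bit chunk is below 2^48, so it is
exact in a double's 53-bit significand.

```c
#include <stdint.h>

/* Hypothetical helper: adds p << sh into the 128-bit value hi:lo.
   Assumes p < 2^48 and sh <= 80, as produced below. */
static void add_shifted(uint64_t *hi, uint64_t *lo, uint64_t p, unsigned sh)
{
    uint64_t plo, phi;
    if (sh == 0)      { plo = p;       phi = 0; }
    else if (sh < 64) { plo = p << sh; phi = p >> (64 - sh); }
    else              { plo = 0;       phi = p << (sh - 64); }
    *lo += plo;
    *hi += phi + (*lo < plo);   /* propagate the carry out of the low word */
}

/* Hypothetical illustration, not GMP code: the 128-bit product u * v built
   from 32-bit chunks of u and 16-bit chunks of v. */
void mul_chunks_fp_sketch(uint64_t u, uint64_t v, uint64_t *hi, uint64_t *lo)
{
    *hi = 0;
    *lo = 0;
    for (unsigned i = 0; i < 2; i++) {          /* 32-bit chunks of u */
        double uc = (double)(uint32_t)(u >> (32 * i));
        for (unsigned j = 0; j < 4; j++) {      /* 16-bit chunks of v */
            double vc = (double)(uint16_t)(v >> (16 * j));
            uint64_t p = (uint64_t)(uc * vc);   /* exact, below 2^48 */
            add_shifted(hi, lo, p, 32 * i + 16 * j);
        }
    }
}
```

With 16-bit chunks every partial product lands on a 16-bit boundary, which is
one plausible reading of why fewer integer shifts and adds are needed than
with 21-bit chunks.
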
   138  EV6
   139  
   140  Here we have a really parallel pipeline, capable of issuing up to 4 integer
   141  instructions per cycle.  In actual practice, it is never possible to sustain
   142  more than 3.5 integer insns/cycle due to rename register shortage.  One integer
   143  multiply instruction can issue each cycle.  To get optimal speed, we need to
   144  pretend we are vectorizing the code, i.e., minimize the depth of recurrences.
   145  
   146  There are two dependencies to watch out for.  1) Address arithmetic
   147  dependencies, and 2) carry propagation dependencies.
   148  
   149  We can avoid serializing due to address arithmetic by unrolling loops, so that
   150  addresses don't depend heavily on an index variable.  Avoiding serializing
   151  because of carry propagation is trickier; the ultimate performance of the code
   152  will be determined by the number of latency cycles it takes from accepting
   153  carry-in to a vector point until we can generate carry-out.
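
One way to shorten the path from carry-in to carry-out is to precompute, for
each limb, everything that does not depend on the incoming carry.  The C
sketch below uses generate/propagate flags for this.  It is an assumption
made for illustration, not necessarily the method GMP's EV6 code uses, and
the function name add_n_short_carry_sketch is hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch: n-limb addition where only two operations per limb
   sit on the carry recurrence.  Returns the final carry (0 or 1). */
uint64_t add_n_short_carry_sketch(uint64_t *r, const uint64_t *a,
                                  const uint64_t *b, size_t n)
{
    uint64_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        /* Independent of cy, so these can run ahead for many limbs. */
        uint64_t s = a[i] + b[i];
        uint64_t g = s < a[i];           /* generate: a + b overflowed */
        uint64_t p = (s == UINT64_MAX);  /* propagate: s + 1 would overflow */

        /* Dependent on cy. */
        r[i] = s + cy;                   /* off the recurrence */
        cy = g | (p & cy);               /* and, or: depth 2 per limb */
    }
    return cy;
}
```

If g is set then s is at most 2^64 - 2, so s + cy cannot overflow a second
time and the two flags never need to be added together.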
   154  
   155  Most integer instructions can execute in either the L0, U0, L1, or U1
   156  pipelines.  Shifts only execute in U0 and U1, and multiply only in U1.
   157  
   158  CMOV instructions split into two internal instructions, CMOV1 and CMOV2.  CMOV
   159  splits the mapping process (see pg 2-26 in cmpwrgd.pdf), suggesting the CMOV
   160  should always be placed as the last instruction of an aligned 4-instruction
   161  block, or perhaps simply avoided.
   162  
   163  Perhaps the most important issue is the latency between the L0/U0 and L1/U1
   164  clusters; a result obtained on either cluster has an extra cycle of latency for
   165  consumers in the opposite cluster.  Because of the dynamic nature of the
   166  implementation, it is hard to predict where an instruction will execute.
   167  
   168  
   169  
   170  REFERENCES
   171  
   172  "Alpha Architecture Handbook", version 4, Compaq, October 1998, order number
   173  EC-QD2KC-TE.
   174  
   175  "Alpha 21164 Microprocessor Hardware Reference Manual", Compaq, December 1998,
   176  order number EC-QP99C-TE.
   177  
   178  "Alpha 21264/EV67 Microprocessor Hardware Reference Manual", revision 1.4,
   179  Compaq, September 2000, order number DS-0028B-TE.
   180  
   181  "Compiler Writer's Guide for the Alpha 21264", Compaq, June 1999, order number
   182  EC-RJ66A-TE.
   183  
   184  All of the above are available online from
   185  
   186    http://ftp.digital.com/pub/Digital/info/semiconductor/literature/dsc-library.html
   187    ftp://ftp.compaq.com/pub/products/alphaCPUdocs
   188  
   189  "Tru64 Unix Assembly Language Programmer's Guide", Compaq, March 1996, part
   190  number AA-PS31D-TE.
   191  
   192  "Digital UNIX Calling Standard for Alpha Systems", Digital Equipment Corp,
   193  March 1996, part number AA-PY8AC-TE.
   194  
   195  The above are available online,
   196  
   197    http://h30097.www3.hp.com/docs/pub_page/V40F_DOCS.HTM
   198  
   199  (Dunno what h30097 means in this URL, but if it moves try searching for "tru64
   200  online documentation" from the main www.hp.com page.)
   201  
   202  
   203  
   204  ----------------
   205  Local variables:
   206  mode: text
   207  fill-column: 79
   208  End: