github.com/aergoio/aergo@v1.3.1/libtool/src/gmp-6.1.2/mpn/ia64/README

Copyright 2000-2005 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.



                      IA-64 MPN SUBROUTINES


This directory contains mpn functions for the IA-64 architecture.


CODE ORGANIZATION

	mpn/ia64          itanium-2, and generic ia64

The code here has been optimized primarily for Itanium 2.  Very few Itanium 1
chips were ever sold, and Itanium 2 is more powerful, so the latter is what
we concentrate on.



CHIP NOTES

The IA-64 ISA packs instructions three at a time into 128-bit bundles.
Programmers/compilers need to put explicit breaks `;;' wherever there are WAW
or RAW dependencies, with some notable exceptions.  Such "breaks" are
typically at the end of a bundle, but can be put between operations within
some bundle types too.
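
As a rough illustration of the bundle layout: a 128-bit bundle holds a 5-bit
template field followed by three 41-bit instruction slots.  The C sketch below
unpacks a bundle given as two 64-bit halves (low half first).  The names
ia64_bundle, bundle_unpack and bundle_pack are made up for this example and
are not part of GMP.

```c
#include <assert.h>
#include <stdint.h>

#define SLOT_BITS 41
#define SLOT_MASK ((UINT64_C(1) << SLOT_BITS) - 1)

/* Hypothetical container for the fields of one IA-64 bundle. */
typedef struct {
    unsigned tmpl;      /* 5-bit template: unit types and stop positions */
    uint64_t slot[3];   /* three 41-bit instruction slots */
} ia64_bundle;

/* Split a bundle, given as low and high 64-bit halves, into its fields. */
ia64_bundle bundle_unpack(uint64_t lo, uint64_t hi)
{
    ia64_bundle b;
    b.tmpl    = (unsigned)(lo & 0x1f);                  /* bits 0..4    */
    b.slot[0] = (lo >> 5) & SLOT_MASK;                  /* bits 5..45   */
    b.slot[1] = ((lo >> 46) | (hi << 18)) & SLOT_MASK;  /* bits 46..86  */
    b.slot[2] = hi >> 23;                               /* bits 87..127 */
    return b;
}

/* Inverse of bundle_unpack, included only to check the bit positions. */
void bundle_pack(const ia64_bundle *b, uint64_t *lo, uint64_t *hi)
{
    assert(b->tmpl < 32);
    assert(b->slot[0] <= SLOT_MASK);
    assert(b->slot[1] <= SLOT_MASK);
    assert(b->slot[2] <= SLOT_MASK);
    *lo = (uint64_t)b->tmpl | (b->slot[0] << 5) | (b->slot[1] << 46);
    *hi = (b->slot[1] >> 18) | (b->slot[2] << 23);
}
```

The `;;' stop itself is not a separate field in this model; where stops may
occur is determined by the template value.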

The Itanium 1 and Itanium 2 implementations can under ideal conditions
execute two bundles per cycle.  The Itanium 1 allows 4 of these 6
instructions to be integer operations, while the Itanium 2 allows all 6 to be
integer operations.

Taken cloop branches seem to insert a bubble into the pipeline most of the
time on Itanium 1.

Loads to the fp registers bypass the L1 cache and thus get extremely long
latencies, 9 cycles on the Itanium 1 and 6 cycles on the Itanium 2.

Software pipelining with the br.ctop instruction causes delays, since
many issue slots are taken up by instructions with zero predicates, and
since many extra instructions are needed to set things up.  These features
are clearly designed for code density, not speed.

Misc pipeline limitations (Itanium 1):
* The getf.sig instruction can only execute in M0.
* At most four integer instructions/cycle.
* Nops take up resources like any plain instruction.

Misc pipeline limitations (Itanium 2):
* The getf.sig instruction can only execute in M0.
* Nops take up resources like any plain instruction.


ASSEMBLY SYNTAX

.align pads with nops in a text segment, but gas 2.14 and earlier
incorrectly byte-swaps its nop bundle in big endian mode (e.g. hpux), making
it come out as break instructions.  We use the ALIGN() macro in
mpn/ia64/ia64-defs.m4 wherever the padding might be executed across.  That
macro suppresses any .align if the problem is detected by configure.  Lack of
alignment might hurt performance but will at least be correct.

foo:: to create a global symbol is not accepted by gas.  Use separate
".global foo" and "foo:" instead.

.global is the standard global directive.  gas accepts .globl, but hpux "as"
doesn't.

.proc / .endp generate the appropriate .type and .size information for ELF,
so the latter directives don't need to be given explicitly.

.pred.rel "mutex"... is standard for annotating predicate register
relationships.  gas also accepts .pred.rel.mutex, but hpux "as" doesn't.

.pred directives can't be put on a line with a label, like
".Lfoo: .pred ...": the HP assembler on HP-UX 11.23 rejects that.
gas is happy with it, and past versions of the HP assembler had seemed ok.

// is the standard comment sequence, but we prefer "C" since it inhibits m4
macro expansion.  See comments in ia64-defs.m4.


REGISTER USAGE

Special:
   r0: constant 0
   r1: global pointer (gp)
   r8: return value
   r12: stack pointer (sp)
   r13: thread pointer (tp)
Caller-saves: r8-r11 r14-r31 f6-f15 f32-f127
Caller-saves but rotating: r32-


================================================================
mpn_add_n, mpn_sub_n:

The current code runs at 1.25 cycles/limb (c/l) on Itanium 2.

================================================================
mpn_mul_1:

The current code runs at 2 c/l on Itanium 2.

Using a blocked approach, working off of 4 separate places in the operands,
one could make use of the xma accumulation, and approach 1 c/l.

	ldf8 [up]
	xma.l
	xma.hu
	stf8  [wrp]
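
To make the "xma accumulation" concrete, here is a minimal C model, assuming
64-bit limbs.  xma.l and xma.hu return the low and high halves of a*b+c, so
the high half of one product can be fed in as the addend of the next.  The
names xma_model and mul_1_sketch are invented for this sketch; it models the
arithmetic only, not the scheduling or the blocking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of an xma.l/xma.hu pair: (*hi:*lo) = a*b + c, exact in 128 bits.
   Built from 32-bit halves so that no 128-bit integer type is needed. */
static void xma_model(uint64_t a, uint64_t b, uint64_t c,
                      uint64_t *hi, uint64_t *lo)
{
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;
    uint64_t p00 = a0 * b0, p01 = a0 * b1, p10 = a1 * b0, p11 = a1 * b1;
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;
    uint64_t l = (mid << 32) | (uint32_t)p00;
    uint64_t h = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);

    l += c;          /* add the addend ...                    */
    h += (l < c);    /* ... and propagate its carry; a*b+c
                        always fits in 128 bits               */
    *hi = h;
    *lo = l;
}

/* {rp,n} = {up,n} * v, returning the most significant limb. */
uint64_t mul_1_sketch(uint64_t *rp, const uint64_t *up, size_t n, uint64_t v)
{
    uint64_t carry = 0;
    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        uint64_t hi, lo;
        xma_model(up[i], v, carry, &hi, &lo);  /* xma.l + xma.hu */
        rp[i] = lo;                            /* stf8           */
        carry = hi;                            /* addend for the next limb */
    }
    return carry;
}
```

Each iteration depends on the previous one through carry, which is presumably
why several independent blocks are needed to hide the xma latency.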

================================================================
mpn_addmul_1:

The current code runs at 2 c/l on Itanium 2.

It seems possible to use a blocked approach, as with mpn_mul_1.  We should
read rp[] to integer registers, allowing for just one getf.sig per cycle.

	ld8  [rp]
	ldf8 [up]
	xma.l
	xma.hu
	getf.sig
	add+add+cmp+cmp
	st8  [wrp]

These 10 instructions can be scheduled to approach 1.667 c/l, and with
the 4 cycle latency of xma, this means we need at least 3 blocks.  Using
ldfp8 we could approach 1.583 c/l.
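
The sequence above can be modelled in C roughly as follows, assuming 64-bit
limbs.  The add+add+cmp+cmp step is read here as computing the sum for both
possible carry-in values and then selecting one, which is an interpretation
of the listing rather than a transcription of the real loop.  The names
xma_model and addmul_1_sketch are invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of an xma.l/xma.hu pair: (*hi:*lo) = a*b + c, exact in 128 bits. */
static void xma_model(uint64_t a, uint64_t b, uint64_t c,
                      uint64_t *hi, uint64_t *lo)
{
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;
    uint64_t p00 = a0 * b0, p01 = a0 * b1, p10 = a1 * b0, p11 = a1 * b1;
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;
    uint64_t l = (mid << 32) | (uint32_t)p00;
    uint64_t h = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);

    l += c;
    h += (l < c);
    *hi = h;
    *lo = l;
}

/* {rp,n} += {up,n} * v, returning the carry-out limb. */
uint64_t addmul_1_sketch(uint64_t *rp, const uint64_t *up, size_t n,
                         uint64_t v)
{
    uint64_t hi = 0;    /* high product limb, chained through xma */
    unsigned cy = 0;    /* carry of the integer additions         */

    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        uint64_t r = rp[i];                  /* ld8             */
        uint64_t lo;
        xma_model(up[i], v, hi, &hi, &lo);   /* xma.l + xma.hu, then getf.sig */

        uint64_t s0 = lo + r;                /* add: carry-in 0 */
        uint64_t s1 = lo + r + 1;            /* add: carry-in 1 */
        unsigned c0 = s0 < r;                /* cmp: carry-out for s0 */
        unsigned c1 = s1 <= r;               /* cmp: carry-out for s1 */

        rp[i] = cy ? s1 : s0;                /* st8             */
        cy = cy ? c1 : c0;
    }
    return hi + cy;     /* cannot overflow: the full result fits in n+1 limbs */
}
```

The 1.667 figure is consistent with 10 instructions over 6 issue slots per
cycle; 1.583 would match 9.5/6, assuming one ldfp8 replaces two ldf8.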

================================================================
mpn_submul_1:

The current code runs at 2.25 c/l on Itanium 2.  Getting to 2 c/l requires
ldfp8 with all the alignment headaches that implies.

================================================================
mpn_addmul_N

For best speed, we need to give up using mpn_addmul_2 as the main multiply
building block, and instead take multiple v limbs per loop.  For the Itanium
1, we need to take about 8 limbs at a time for full speed.  For the Itanium
2, something like mpn_addmul_4 should be enough.

The add+cmp+cmp+add we use on the other codes is optimal for shortening
recurrences (1 cycle) but the sequence takes up 4 execution slots.  When
recurrence depth is not critical, a more standard 3-cycle add+cmp+add is
better.

/* First load the 8 values from v */
	ldfp8		v0, v1 = [r35], 16;;
	ldfp8		v2, v3 = [r35], 16;;
	ldfp8		v4, v5 = [r35], 16;;
	ldfp8		v6, v7 = [r35], 16;;

/* In the inner loop, get a new U limb and store a result limb. */
	mov		lc = un
Loop:	ldf8		u0 = [r33], 8
	ld8		r0 = [r32]
	xma.l		lp0 = v0, u0, hp0
	xma.hu		hp0 = v0, u0, hp0
	xma.l		lp1 = v1, u0, hp1
	xma.hu		hp1 = v1, u0, hp1
	xma.l		lp2 = v2, u0, hp2
	xma.hu		hp2 = v2, u0, hp2
	xma.l		lp3 = v3, u0, hp3
	xma.hu		hp3 = v3, u0, hp3
	xma.l		lp4 = v4, u0, hp4
	xma.hu		hp4 = v4, u0, hp4
	xma.l		lp5 = v5, u0, hp5
	xma.hu		hp5 = v5, u0, hp5
	xma.l		lp6 = v6, u0, hp6
	xma.hu		hp6 = v6, u0, hp6
	xma.l		lp7 = v7, u0, hp7
	xma.hu		hp7 = v7, u0, hp7
	getf.sig	l0 = lp0
	getf.sig	l1 = lp1
	getf.sig	l2 = lp2
	getf.sig	l3 = lp3
	getf.sig	l4 = lp4
	getf.sig	l5 = lp5
	getf.sig	l6 = lp6
	add+cmp+add	xx, l0, r0
	add+cmp+add	acc0, acc1, l1
	add+cmp+add	acc1, acc2, l2
	add+cmp+add	acc2, acc3, l3
	add+cmp+add	acc3, acc4, l4
	add+cmp+add	acc4, acc5, l5
	add+cmp+add	acc5, acc6, l6
	getf.sig	acc6 = lp7
	st8		[r32] = xx, 8
	br.cloop Loop

	49 insn at max 6 insn/cycle:		8.167 cycles/limb8
	11 memops at max 2 memops/cycle:	5.5 cycles/limb8
	16 fpops at max 2 fpops/cycle:		8 cycles/limb8
	21 intops at max 4 intops/cycle:	5.25 cycles/limb8
	11+21 memops+intops at max 4/cycle:	8 cycles/limb8
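
The table above is a set of per-resource lower bounds, each an operation
count divided by an issue width, with the largest one limiting the loop.  A
small C sketch of that arithmetic follows.  The struct and function names are
invented for illustration, and the widths are simply those quoted in the
table.  The count of 11 memops appears to include the 8 getf.sig instructions
alongside the 3 loads and stores.

```c
#include <assert.h>

/* Operation counts for one loop iteration (hypothetical helper type). */
struct loop_mix {
    int insn;    /* all instructions            */
    int memops;  /* memory-unit operations      */
    int fpops;   /* floating-point operations   */
    int intops;  /* integer operations          */
};

/* Fewest cycles in which `count` operations can issue at `per_cycle`. */
static double bound(int count, int per_cycle)
{
    assert(count >= 0 && per_cycle > 0);
    return (double)count / per_cycle;
}

static double max2(double a, double b)
{
    return a > b ? a : b;
}

/* Largest of the five per-resource bounds listed in the table. */
double loop_cycle_bound(const struct loop_mix *m)
{
    double c = bound(m->insn, 6);
    c = max2(c, bound(m->memops, 2));
    c = max2(c, bound(m->fpops, 2));
    c = max2(c, bound(m->intops, 4));
    c = max2(c, bound(m->memops + m->intops, 4));
    return c;
}
```

With the counts from the listing (49, 11, 16, 21) the total instruction count
is the tightest limit, at 49/6 cycles per iteration.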

================================================================
mpn_lshift, mpn_rshift

The current code runs at 1 cycle/limb on Itanium 2.

Using 63 separate loops, we could use the double-word shrp instruction.
That instruction has a plain single-cycle latency.  We need 63 loops since
this instruction only accepts an immediate count.  That would lead to a
somewhat silly code size, but the speed would be 0.75 c/l on Itanium 2 (by
using shrp each cycle plus shl/shr going down I1 for a further limb every
second cycle).
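
For reference, a C model of what shrp computes and how a left shift could be
built from it, assuming 64-bit limbs.  In C the count can be a run-time
value; the 63 loops in the text correspond to the 63 possible immediates.
The names shrp_model and lshift_sketch are invented for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of shrp: low 64 bits of the 128-bit value (hi:lo) >> count.
   Restricted to 1..63 here so that no C shift reaches 64 bits. */
static uint64_t shrp_model(uint64_t hi, uint64_t lo, unsigned count)
{
    assert(count >= 1 && count <= 63);
    return (hi << (64 - count)) | (lo >> count);
}

/* {rp,n} = {up,n} << cnt, returning the bits shifted out at the top.
   Limbs are processed from the most significant end downwards. */
uint64_t lshift_sketch(uint64_t *rp, const uint64_t *up, size_t n,
                       unsigned cnt)
{
    assert(n >= 1);
    assert(cnt >= 1 && cnt <= 63);

    unsigned c = 64 - cnt;              /* the shrp immediate */
    uint64_t out = up[n - 1] >> c;      /* plain shr for the top limb */

    for (size_t i = n - 1; i > 0; i--)
        rp[i] = shrp_model(up[i], up[i - 1], c);
    rp[0] = up[0] << cnt;               /* plain shl for the bottom limb */
    return out;
}
```

A right shift would follow the same pattern, walking upwards and using cnt
itself as the shrp count.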

================================================================
mpn_copyi, mpn_copyd

The current code runs at 0.5 c/l on Itanium 2.  But that is just for L1
cache hits.  The 4-way unrolled loop takes just 2 cycles, and thus load-use
scheduling isn't great.  It might be best to actually use modulo scheduled
loops, since that will allow us to do better load-use scheduling without too
much unrolling.

Depending on size or operand alignment, we get 1 c/l or 0.5 c/l on Itanium
2, according to tune/speed.  Cache bank conflicts?



REFERENCES

Intel Itanium Architecture Software Developer's Manual, volumes 1 to 3,
Intel document 245317-004, 245318-004, 245319-004 October 2002.  Volume 1
includes an Itanium optimization guide.

Intel Itanium Processor-specific Application Binary Interface (ABI), Intel
document 245370-003, May 2001.  Describes C type sizes, dynamic linking,
etc.

Intel Itanium Architecture Assembly Language Reference Guide, Intel document
248801-004, 2000-2002.  Describes assembly instruction syntax and other
directives.

Itanium Software Conventions and Runtime Architecture Guide, Intel document
245358-003, May 2001.  Describes calling conventions, including stack
unwinding requirements.

Intel Itanium Processor Reference Manual for Software Optimization, Intel
document 245473-003, November 2001.

Intel Itanium-2 Processor Reference Manual for Software Development and
Optimization, Intel document 251110-003, May 2004.

All the above documents can be found online at

    http://developer.intel.com/design/itanium/manuals.htm