github.com/aergoio/aergo@v1.3.1/libtool/src/gmp-6.1.2/mpn/cray/README

Copyright 2000-2002 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.


The code in this directory works for Cray vector systems such as C90,
J90, T90 (both the CFP variant and the IEEE variant) and SV1.  (For
the T3E and T3D systems, see the `alpha' subdirectory at the same
level as the directory containing this file.)

The cfp subdirectory is for systems utilizing the traditional Cray
floating-point format, and the ieee subdirectory is for the newer
systems that use the IEEE floating-point format.

There are several issues that reduce speed on Cray systems.  For
systems with cfp floating point, the main obstacle is forming
128-bit products.  For IEEE systems, addition, and in particular
computing the carry, is the main issue.  There is no vectorizing
unsigned-less-than instruction, and the sequence that implements that
operation is very long.

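The carry test mentioned above can be written in portable C in two
equivalent ways.  This is only a sketch with made-up function names,
not the Cray instruction sequence: the first form relies on an
unsigned compare, the second is one well-known compare-free
formulation that uses only logical operations and a shift.

```c
#include <assert.h>
#include <stdint.h>

/* Carry out of the 64-bit addition a + b, using an unsigned compare. */
unsigned carry_by_compare(uint64_t a, uint64_t b)
{
    uint64_t s = a + b;          /* wraps modulo 2^64 */
    return s < a;                /* 1 exactly when the sum wrapped */
}

/* The same carry without a compare.  The top bit of the expression is
   set when both top bits are set, or when at least one is set and the
   top bit of the sum came out clear. */
unsigned carry_by_logic(uint64_t a, uint64_t b)
{
    uint64_t s = a + b;
    return (unsigned) (((a & b) | ((a | b) & ~s)) >> 63);
}
```

Both functions return the same value for every pair of inputs; which
one is cheaper depends on the instructions the target provides.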
Shifting is the only operation that is simple to make fast.  All Cray
systems have bitblt instructions (Vi Vj,Vj<Ak and Vi Vj,Vj>Ak) that
should be really useful for this.

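A plain C sketch shows why shifting is easy to vectorize: every result
limb depends only on two neighbouring source limbs, so no value is
carried from one iteration to the next.  The name lshift_limbs and its
signature are illustrative (loosely modelled on mpn_lshift), and the
sketch assumes that rp and up do not overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shift the n-limb number up[] left by cnt bits (0 < cnt < 64), store
   the low n limbs in rp[] and return the bits shifted out at the top. */
uint64_t lshift_limbs(uint64_t *rp, const uint64_t *up, size_t n,
                      unsigned cnt)
{
    assert(n > 0 && cnt > 0 && cnt < 64);
    /* Each iteration reads only the source operand, so the iterations
       are independent of each other. */
    for (size_t i = n - 1; i > 0; i--)
        rp[i] = (up[i] << cnt) | (up[i - 1] >> (64 - cnt));
    rp[0] = up[0] << cnt;
    return up[n - 1] >> (64 - cnt);
}
```

A right shift follows the same pattern with the shift directions and
the neighbour index swapped.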
For best speed for cfp systems, we need a mul_basecase, since that
reduces the need for carry propagation to a minimum.  Depending on the
size (vn) of the smaller of the two operands (V), we should split U and V
in different chunk sizes:

U split in 2 32-bit parts
V split according to the table:
parts                   4       5       6       7       8
bits/part               16      13      11      10      8
max allowed vn          1       8       32      64      256
number of multiplies    8       10      12      14      16
peak cycles/limb        4       5       6       7       8

U split in 3 22-bit parts
V split according to the table:
parts                   3       4       5
bits/part               22      16      13
max allowed vn          16      1024    8192
number of multiplies    9       12      15
peak cycles/limb        4.5     6       7.5

U split in 4 16-bit parts
V split according to the table:
parts                   4
bits/part               16
max allowed vn          65536
number of multiplies    16
peak cycles/limb        8

(A T90 CPU can accumulate two products per cycle.)

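The rows of the tables above can be reproduced with a little
arithmetic.  The sketch below is an inference from the numbers, not
something this file states: bits/part is ceil(64/parts), the number of
multiplies is the product of the two part counts, peak cycles/limb is
half of that (two products per cycle), and every "max allowed vn"
entry equals 2^(48 - u_bits - v_bits), as if partial products were
summed exactly in 48 bits.  The constant 48 and all names are
assumptions made for illustration.

```c
#include <assert.h>

#define ACC_BITS 48u   /* assumed accumulator width, inferred from the tables */

struct split_row {
    unsigned v_bits;        /* bits per V part: ceil(64 / v_parts) */
    unsigned long max_vn;   /* 2^(ACC_BITS - u_bits - v_bits) */
    unsigned multiplies;    /* u_parts * v_parts */
    double peak_cycles;     /* multiplies / 2 */
};

/* Recompute one table column from the number of U and V parts. */
struct split_row compute_split_row(unsigned u_parts, unsigned v_parts)
{
    struct split_row row;
    unsigned u_bits;

    assert(u_parts > 0 && v_parts > 0);
    u_bits = (64 + u_parts - 1) / u_parts;          /* ceil(64 / u_parts) */
    row.v_bits = (64 + v_parts - 1) / v_parts;
    assert(u_bits + row.v_bits <= ACC_BITS);        /* product must fit */
    row.max_vn = 1UL << (ACC_BITS - u_bits - row.v_bits);
    row.multiplies = u_parts * v_parts;
    row.peak_cycles = row.multiplies / 2.0;
    return row;
}
```

For example, compute_split_row(3, 4) gives 16 bits/part, a maximum vn
of 1024, 12 multiplies and 6 peak cycles/limb, matching the middle
column of the second table.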
IDEA:
* Rewrite mpn_add_n:
    short cy[n + 1];
    /* Pass 1 (vectorizable): raw sums, carry-outs saved in cy[].  */
    #pragma _CRI ivdep
      for (i = 0; i < n; i++)
        { s = up[i] + vp[i];
          rp[i] = s;
          cy[i + 1] = s < up[i]; }
    /* Pass 2 (vectorizable): add saved carries, count new overflows.  */
      more_carries = 0;
    #pragma _CRI ivdep
      for (i = 1; i < n; i++)
        { s = rp[i] + cy[i];
          rp[i] = s;
          more_carries += s < cy[i]; }
    /* Pass 3 (scalar, rare): ripple the carries generated by pass 2.  */
      cys = 0;
      if (more_carries)
        {
          cys = rp[1] < cy[1];
          for (i = 2; i < n; i++)
            { c = rp[i] < cy[i];	/* did pass 2 overflow at limb i? */
              rp[i] += cys;
              cys = (rp[i] < cys) + c; }
        }
      return cys + cy[n];

* Write mpn_add3_n for adding three operands.  First add operands 1
  and 2, and generate cy[].  Then add operand 3 to the partial result,
  and accumulate carry into cy[].  Finally propagate carry just like
  in the new mpn_add_n.

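The two ideas above can be tried out on an ordinary machine with the
portable sketch below.  It is not the code in this directory: the
names add_n, add3_n and propagate_carries are illustrative, limbs are
assumed to be 64 bits, and a C99 variable-length array holds the carry
vector.  The scalar pass also records where the second pass
overflowed, so that such a carry reaches the following limb.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Passes 2 and 3: add the saved carries cy[1..n-1] into rp[] and
   return the carry out of limb n-1 produced while doing so. */
static uint64_t propagate_carries(uint64_t *rp, const unsigned char *cy,
                                  size_t n)
{
    unsigned more_carries = 0;
    uint64_t cys = 0;

    /* Pass 2: iterations are independent of each other. */
    for (size_t i = 1; i < n; i++) {
        uint64_t s = rp[i] + cy[i];
        rp[i] = s;
        more_carries += s < cy[i];
    }
    /* Pass 3: scalar ripple, needed only when pass 2 overflowed. */
    if (more_carries) {
        for (size_t i = 1; i < n; i++) {
            uint64_t c = rp[i] < cy[i];   /* pass 2 overflowed here */
            rp[i] += cys;
            cys = (rp[i] < cys) + c;
        }
    }
    return cys;
}

/* rp[] = up[] + vp[]; returns the carry out (0 or 1). */
uint64_t add_n(uint64_t *rp, const uint64_t *up, const uint64_t *vp,
               size_t n)
{
    assert(n > 0);
    unsigned char cy[n + 1];
    cy[0] = 0;
    for (size_t i = 0; i < n; i++) {       /* pass 1 */
        uint64_t s = up[i] + vp[i];
        cy[i + 1] = s < up[i];
        rp[i] = s;
    }
    return propagate_carries(rp, cy, n) + cy[n];
}

/* rp[] = up[] + vp[] + wp[]; returns the carry out (0, 1 or 2). */
uint64_t add3_n(uint64_t *rp, const uint64_t *up, const uint64_t *vp,
                const uint64_t *wp, size_t n)
{
    assert(n > 0);
    unsigned char cy[n + 1];
    cy[0] = 0;
    for (size_t i = 0; i < n; i++) {       /* pass 1, two additions */
        uint64_t s = up[i] + vp[i];
        uint64_t t = s + wp[i];
        cy[i + 1] = (s < up[i]) + (t < s); /* 0, 1 or 2 */
        rp[i] = t;
    }
    return propagate_carries(rp, cy, n) + cy[n];
}
```

The point of the arrangement is that the first two passes have no
loop-carried dependency, while the scalar third pass runs only in the
rare case where adding a saved carry overflows a limb.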
IDEA:

Store fewer bits, perhaps 62, per limb.  That brings mpn_add_n time
down to 2.5 cycles/limb and mpn_addmul_1 time to 4 cycles/limb.  By
storing even fewer bits per limb, perhaps 56, it would be possible to
write a mul_basecase that would run at effectively 1 cycle/limb.
(Use VM here to better handle the rhomb-shaped multiply area, perhaps
rounding operand sizes up to the next power of 2.)