
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
  <title>GMP Itemized Development Tasks</title>
  <link rel="shortcut icon" href="favicon.ico">
  <link rel="stylesheet" href="gmp.css">
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>

<center>
  <h1>
    GMP Itemized Development Tasks
  </h1>
</center>

<font size=-1>
<pre>
Copyright 2000-2004, 2006, 2008, 2009 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

or

  * the GNU General Public License as published by the Free Software
    Foundation; either version 2 of the License, or (at your option) any
    later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.
</pre>
</font>

<hr>
<!-- NB. timestamp updated automatically by emacs -->
  This file is current as of 29 Jan 2014.  An up-to-date version is available at
  <a href="https://gmplib.org/tasks.html">https://gmplib.org/tasks.html</a>.
  Please send comments about this page to gmp-devel<font>@</font>gmplib.org.

<p> These are itemized GMP development tasks.  Not all the tasks
    listed here are suitable for volunteers, but many of them are.
    Please see the <a href="projects.html">projects file</a> for more
    sizeable projects.

<p> CAUTION: This file needs updating.  Many of the tasks here have
    either already been taken care of or have become irrelevant.

<h4>Correctness and Completeness</h4>
<ul>
<li> <code>_LONG_LONG_LIMB</code> in gmp.h is not namespace clean.  Reported
     by Patrick Pelissier.
     <br>
     We sort of mentioned <code>_LONG_LONG_LIMB</code> in past releases, so
     we need to be careful about changing it.  It used to be a define
     applications had to set for long long limb systems, but that in
     particular is no longer relevant now that it's established automatically.
<li> The various reuse.c tests need to force reallocation by calling
     <code>_mpz_realloc</code> with a small (1 limb) size.
<li> One reuse case is missing from mpX/tests/reuse.c:
     <code>mpz_XXX(a,a,a)</code>.
<li> Make the string reading functions allow the `0x' prefix when the base is
     explicitly 16.  They currently only allow that prefix when the base is
     unspecified (zero).
<li> <code>mpf_eq</code> is not always correct when one operand is
     1000000000... and the other operand is 0111111111..., i.e., extremely
     close.  There is a special case in <code>mpf_sub</code> for this
     situation; put similar code in <code>mpf_eq</code>.  [In progress.]
<li> <code>mpf_eq</code> doesn't implement what gmp.texi specifies.  It should
     not use just whole limbs, but partial limbs.  [In progress.]
<li> <code>mpf_set_str</code> doesn't validate its exponent; for instance the
     garbage 123.456eX789X is accepted (and an exponent of 0 used), and overflow
     of a <code>long</code> is not detected.
<li> <code>mpf_add</code> doesn't check for a carry from truncated portions of
     the inputs, and in that respect doesn't implement the "infinite precision
     followed by truncate" specified in the manual.
<li> Windows DLLs: tests/mpz/reuse.c and tests/mpf/reuse.c initialize global
     variables with pointers to <code>mpz_add</code> etc, which doesn't work
     when those routines are coming from a DLL (because they're effectively
     function pointer global variables themselves).  Perhaps rearrange them
     into a set of calls to a test function rather than iterating over an array.
<li> <code>mpz_pow_ui</code>: Detect when the result would need more memory than
     a <code>size_t</code> can represent and raise some suitable exception,
     probably an alloc call asking for <code>SIZE_T_MAX</code>, and if that
     somehow succeeds then an <code>abort</code>.  Various size overflows of
     this kind are not handled gracefully, probably resulting in segvs.
     <br>
     In <code>mpz_n_pow_ui</code>, detect when the count of low zero bits
     exceeds an <code>unsigned long</code>.  There's a (small) chance of this
     happening but still having enough memory to represent the value.
     Reported by Winfried Dreckmann in, for instance, <code>mpz_ui_pow_ui (x,
     4UL, 1431655766UL)</code>.
<li> <code>mpf</code>: Detect exponent overflow and raise some exception.
     It'd be nice to allow the full <code>mp_exp_t</code> range since that's
     how it's been in the past, but maybe dropping one bit would make it
     easier to test whether e1+e2 goes out of bounds.
</ul>



<h4>Machine Independent Optimization</h4>
<ul>
<li> <code>mpf_cmp</code>: For better cache locality, don't test for low zero
     limbs until the high limbs fail to give an ordering.  Reduce code size by
     turning the three <code>mpn_cmp</code>'s into a single loop stopping when
     the end of one operand is reached (and then looking for a non-zero in the
     rest of the other).
<li> <code>mpf_mul_2exp</code>, <code>mpf_div_2exp</code>: The use of
     <code>mpn_lshift</code> for any size&lt;=prec means repeated
     <code>mul_2exp</code> and <code>div_2exp</code> calls accumulate low zero
     limbs until size==prec+1 is reached.  Those zeros will slow down
     subsequent operations, especially if the value is otherwise small.
     If low bits of the low limb are zero, use <code>mpn_rshift</code> so as
     to not increase the size.
<li> <code>mpn_dc_sqrtrem</code>, <code>mpn_sqrtrem2</code>: Don't use
     <code>mpn_add_1</code> and <code>mpn_sub_1</code> for 1 limb operations;
     use <code>ADDC_LIMB</code> and <code>SUBC_LIMB</code> instead.
<li> <code>mpn_sqrtrem2</code>: Use plain variables for <code>sp[0]</code> and
     <code>rp[0]</code> calculations, so the compiler needn't worry about
     aliasing between <code>sp</code> and <code>rp</code>.
<li> <code>mpn_sqrtrem</code>: Some work can be saved in the last step when
     the remainder is not required, as noted in Paul's paper.
<li> <code>mpq_add</code>, <code>mpq_sub</code>: The gcd fits a single limb
     with high probability and in this case <code>binvert_limb</code> could
     be used to calculate the inverse just once for the two exact divisions
     "op1.den / gcd" and "op2.den / gcd", rather than letting
     <code>mpn_bdiv_q_1</code> do it each time.  This would require calling
     <code>mpn_pi1_bdiv_q_1</code>.
<li> <code>mpn_gcdext</code>: Don't test <code>count_leading_zeros</code> for
     zero, instead check the high bit of the operand and avoid invoking
     <code>count_leading_zeros</code>.  This is an optimization on all
     machines, and significant on machines with slow
     <code>count_leading_zeros</code>, though it's possible an already
     normalized operand might not be encountered very often.
<li> Rewrite <code>umul_ppmm</code> to use floating-point for generating the
     most significant limb (if <code>GMP_LIMB_BITS</code> &lt;= 52 bits).
     (Peter Montgomery has some ideas on this subject.)
<li> Improve the default <code>umul_ppmm</code> code in longlong.h: Add partial
     products with fewer operations.
<li> Consider inlining <code>mpz_set_ui</code>.  This would be both small and
     fast, especially for compile-time constants, but would make application
     binaries depend on having 1 limb allocated to an <code>mpz_t</code>,
     preventing the "lazy" allocation scheme below.
<li> Consider inlining <code>mpz_[cft]div_ui</code> and maybe
     <code>mpz_[cft]div_r_ui</code>.  A <code>__gmp_divide_by_zero</code>
     would be needed for the divide by zero test, unless that could be left to
     <code>mpn_mod_1</code> (not sure currently whether all the risc chips
     provoke the right exception there if using mul-by-inverse).
<li> Consider inlining: <code>mpz_fits_s*_p</code>.  The setups for
     <code>LONG_MAX</code> etc would need to go into gmp.h, and on Cray it
     might, unfortunately, be necessary to forcibly include &lt;limits.h&gt;
     since there's no apparent way to get <code>SHRT_MAX</code> with an
     expression (since <code>short</code> and <code>unsigned short</code> can
     be different sizes).
<li> <code>mpz_powm</code> and <code>mpz_powm_ui</code> aren't very fast on one
     or two limb moduli, due to a lot of function call overheads.  These could
     perhaps be handled as special cases.
<li> Make sure <code>mpz_powm_ui</code> is never slower than the corresponding
     computation using <code>mpz_powm</code>.
<li> <code>mpz_powm</code> REDC should do multiplications by <code>g[]</code>
     using the division method when they're small, since the REDC form of a
     small multiplier is normally a full size product.  Probably would need a
     new tuned parameter to say what size multiplier is "small", as a function
     of the size of the modulus.
<li> <code>mpn_gcd</code> might be able to be sped up on small to moderate
     sizes by improving <code>find_a</code>, possibly just by providing an
     alternate implementation for CPUs with slowish
     <code>count_leading_zeros</code>.
<li> <code>mpf_set_str</code> produces low zero limbs when a string has a
     fraction but is exactly representable, e.g. 0.5 in decimal.  These could be
     stripped to save work in later operations.
<li> <code>mpz_and</code>, <code>mpz_ior</code> and <code>mpz_xor</code> should
     use <code>mpn_and_n</code> etc for the benefit of the small number of
     targets with native versions of those routines.  Need to be careful not to
     pass size==0.  Is some code sharing possible between the <code>mpz</code>
     routines?
<li> <code>mpf_add</code>: Don't do a copy to avoid overlapping operands
     unless it's really necessary (currently only sizes are tested, not
     whether r really is u or v).
<li> <code>mpf_add</code>: Under the check for v having no effect on the
     result, perhaps test for r==u and do nothing in that case; currently it
     looks like an <code>MPN_COPY_INCR</code> is done to reduce prec+1 limbs
     to prec.
<li> <code>mpf_div_ui</code>: Instead of padding with low zeros, call
     <code>mpn_divrem_1</code> asking for fractional quotient limbs.
<li> <code>mpf_div_ui</code>: Eliminate <code>TMP_ALLOC</code>.  When r!=u
     there's no overlap and the division can be called on those operands.
     When r==u and is prec+1 limbs, then it's an in-place division.  If r==u
     and not prec+1 limbs, then move the available limbs up to prec+1 and do
     an in-place there.
<li> <code>mpf_div_ui</code>: Whether the high quotient limb is zero can be
     determined by testing the dividend for high&lt;divisor.  When non-zero, the
     division can be done on prec dividend limbs instead of prec+1.  The result
     size is also known before the division, so that can be a tail call (once
     the <code>TMP_ALLOC</code> is eliminated).
<li> <code>mpn_divrem_2</code> could usefully accept unnormalized divisors and
     shift the dividend on-the-fly, since this should cost nothing on
     superscalar processors and avoid the need for temporary copying in
     <code>mpn_tdiv_qr</code>.
<li> <code>mpf_sqrt</code>: If r!=u, and if u doesn't need to be padded with
     zeros, then there's no need for the tp temporary.
<li> <code>mpq_cmp_ui</code> could form the <code>num1*den2</code> and
     <code>num2*den1</code> products limb-by-limb from high to low and look at
     each step for values differing by more than the possible carry bit from
     the uncalculated portion.
<li> <code>mpq_cmp</code> could do the same high-to-low progressive multiply
     and compare.  The benefits of karatsuba and higher multiplication
     algorithms are lost, but if it's assumed only a few high limbs will be
     needed to determine an order then that's fine.
<li> <code>mpn_add_1</code>, <code>mpn_sub_1</code>, <code>mpn_add</code>,
     <code>mpn_sub</code>: Internally use <code>__GMPN_ADD_1</code> etc
     instead of the functions, so they get inlined on all compilers, not just
     gcc and others with <code>inline</code> recognised in gmp.h.
     <code>__GMPN_ADD_1</code> etc are meant mostly to support application
     inline <code>mpn_add_1</code> etc and if they don't come out well for
     internal uses then special forms can be introduced, for instance many
     internal uses are in-place.  Sometimes a block of code is executed based
     on the carry-out, rather than using it arithmetically, and those places
     might want to do their own loops entirely.
<li> <code>__gmp_extract_double</code> on 64-bit systems could use just one
     bitfield for the mantissa extraction, not two, when endianness permits.
     Might depend on the compiler allowing <code>long long</code> bit fields
     when that's the only actual 64-bit type.
<li> tal-notreent.c could keep a block of memory permanently allocated.
     Currently the last nested <code>TMP_FREE</code> releases all memory, so
     there's an allocate and free every time a top-level function using
     <code>TMP</code> is called.  Would need
     <code>mp_set_memory_functions</code> to tell tal-notreent.c to release
     any cached memory when changing allocation functions though.
<li> <code>__gmp_tmp_alloc</code> from tal-notreent.c could be partially
     inlined.  If the current chunk has enough room then a couple of pointers
     can be updated.  Only if more space is required would a call to some sort
     of <code>__gmp_tmp_increase</code> be needed.  The requirement that
     <code>TMP_ALLOC</code> is an expression might make the implementation a
     bit ugly and/or a bit sub-optimal.
<pre>
#define TMP_ALLOC(n)						\
  ((ROUND_UP(n) &gt; current-&gt;end - current-&gt;point ?	\
     __gmp_tmp_increase (ROUND_UP (n)) : (void) 0),		\
   current-&gt;point += ROUND_UP (n),				\
   current-&gt;point - ROUND_UP (n))
</pre>
<li> <code>__mp_bases</code> has a lot of data for bases which are pretty much
     never used.  Perhaps the table should just go up to base 16, and have
     code to generate data above that, if and when required.  Naturally this
     assumes the code would be smaller than the data saved.
<li> <code>__mp_bases</code> field <code>big_base_inverted</code> is only used
     if <code>USE_PREINV_DIVREM_1</code> is true, and could be omitted
     otherwise, to save space.
<li> <code>mpz_get_str</code>, <code>mtox</code>: For power-of-2 bases, which
     are of course fast, it seems a little silly to make a second pass over
     the <code>mpn_get_str</code> output to convert to ASCII.  Perhaps combine
     that with the bit extractions.
<li> <code>mpz_gcdext</code>: If the caller requests only the S cofactor (of
     A), and A&lt;B, then the code ends up generating the cofactor T (of B) and
     deriving S from that.  Perhaps it'd be possible to arrange to get S in
     the first place by calling <code>mpn_gcdext</code> with A+B,B.  This
     might only be an advantage if A and B are about the same size.
<li> <code>mpz_n_pow_ui</code> does a good job with small bases and stripping
     powers of 2, but it's perhaps a bit too complicated for what it gains.
     The simpler <code>mpn_pow_1</code> is a little faster on small exponents.
     (Note some of the ugliness in <code>mpz_n_pow_ui</code> is due to
     supporting <code>mpn_mul_2</code>.)
     <br>
     Perhaps the stripping of 2s in <code>mpz_n_pow_ui</code> should be
     confined to single limb operands for simplicity and since that's where
     the greatest gain would be.
     <br>
     Ideally <code>mpn_pow_1</code> and <code>mpz_n_pow_ui</code> would be
     merged.  The reason <code>mpz_n_pow_ui</code> writes to an
     <code>mpz_t</code> is that its callers leave it to make a good estimate
     of the result size.  Callers of <code>mpn_pow_1</code> already know the
     size by separate means (<code>mp_bases</code>).
<li> <code>mpz_invert</code> should call <code>mpn_gcdext</code> directly.
</ul>


<h4>Machine Dependent Optimization</h4>
<ul>
<li> <code>invert_limb</code> on various processors might benefit from the
     little Newton iteration done for alpha and ia64.
<li> Alpha 21264: <code>mpn_addlsh1_n</code> could be implemented with
     <code>mpn_addmul_1</code>, since that code at 3.5 is a touch faster than
     a separate <code>lshift</code> and <code>add_n</code> at
     1.75+2.125=3.875.  Or very likely some specific <code>addlsh1_n</code>
     code could beat both.
<li> Alpha 21264: Improve feed-in code for <code>mpn_mul_1</code>,
     <code>mpn_addmul_1</code>, and <code>mpn_submul_1</code>.
<li> Alpha 21164: Rewrite <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
     and <code>mpn_submul_1</code> for the 21164.  This should use both integer
     multiplies and floating-point multiplies.  For the floating-point
     operations, the single-limb multiplier should be split into three 21-bit
     chunks, or perhaps even better in four 16-bit chunks.  Probably possible
     to reach 9 cycles/limb.
<li> Alpha: GCC 3.4 will introduce <code>__builtin_ctzl</code>,
     <code>__builtin_clzl</code> and <code>__builtin_popcountl</code> using
     the corresponding CIX <code>ct</code> instructions, and
     <code>__builtin_alpha_cmpbge</code>.  These should give GCC more
     information about scheduling etc than the <code>asm</code> blocks
     currently used in longlong.h and gmp-impl.h.
<li> Alpha Unicos: Apparently there's no <code>alloca</code> on this system,
     making <code>configure</code> choose the slower
     <code>malloc-reentrant</code> allocation method.  Is there a better way?
     Maybe variable-length arrays per notes below.
<li> Alpha Unicos 21164, 21264: <code>.align</code> is not used since it pads
     with garbage.  Does the code get the intended slotting required for the
     claimed speeds?  <code>.align</code> at the start of a function would
     presumably be safe no matter how it pads.
<li> ARM V5: <code>count_leading_zeros</code> can use the <code>clz</code>
     instruction.  For GCC 3.4 and up, do this via <code>__builtin_clzl</code>
     since then gcc knows it's "predicable".
<li> Itanium: GCC 3.4 introduces <code>__builtin_popcount</code> which can be
     used instead of an <code>asm</code> block.  The builtin should give gcc
     more opportunities for scheduling, bundling and predication.
     <code>__builtin_ctz</code> similarly (it just uses popcount as per
     current longlong.h).
<li> UltraSPARC/64: Optimize <code>mpn_mul_1</code>, <code>mpn_addmul_1</code>,
     for s2 &lt; 2^32 (or perhaps for any zero 16-bit s2 chunk).  Not sure how
     much this can improve the speed, though, since the symmetry that we rely
     on is lost.  Perhaps we can just gain cycles when s2 &lt; 2^16, or more
     accurately, when two 16-bit s2 chunks which are 16 bits apart are zero.
<li> UltraSPARC/64: Write native <code>mpn_submul_1</code>, analogous to
     <code>mpn_addmul_1</code>.
<li> UltraSPARC/64: Write <code>umul_ppmm</code>.  Using four
     "<code>mulx</code>"s either with an asm block or via the generic C code is
     about 90 cycles.  Try using fp operations, and also try using karatsuba
     for just three "<code>mulx</code>"s.
<li> UltraSPARC/32: Rewrite <code>mpn_lshift</code>, <code>mpn_rshift</code>.
     Will give 2 cycles/limb.  Trivial modifications of mpn/sparc64 should do.
<li> UltraSPARC/32: Write special mpn_Xmul_1 loops for s2 &lt; 2^16.
<li> UltraSPARC/32: Use <code>mulx</code> for <code>umul_ppmm</code> if
     possible (see commented out code in longlong.h).  This is unlikely to
     save more than a couple of cycles, so perhaps isn't worth bothering with.
<li> UltraSPARC/32: On Solaris gcc doesn't give us <code>__sparc_v9__</code>
     or anything to indicate V9 support when -mcpu=v9 is selected.  See
     gcc/config/sol2-sld-64.h.  Will need to pass something through from
     ./configure to select the right code in longlong.h.  (Currently nothing
     is lost because <code>mulx</code> for multiplying is commented out.)
<li> UltraSPARC/32: <code>mpn_divexact_1</code> and
     <code>mpn_modexact_1c_odd</code> can use a 64-bit inverse and take
     64-bits at a time from the dividend, as per the 32-bit divisor case in
     mpn/sparc64/mode1o.c.  This must be done in assembler, since the full
     64-bit registers (<code>%gN</code>) are not available from C.
<li> UltraSPARC/32: <code>mpn_divexact_by3c</code> can work 64-bits at a time
     using <code>mulx</code>, in assembler.  This would be the same as for
     sparc64.
<li> UltraSPARC: <code>binvert_limb</code> might save a few cycles from
     masking down to just the useful bits at each point in the calculation,
     since <code>mulx</code> speed depends on the highest bit set.  Either
     explicit masks or small types like <code>short</code> and
     <code>int</code> ought to work.
<li> Sparc64 HAL R1 <code>popc</code>: This chip reputedly implements
     <code>popc</code> properly (see gcc sparc.md).  Would need to recognise
     it as <code>sparchalr1</code> or something in configure / config.sub /
     config.guess.  <code>popc_limb</code> in gmp-impl.h could use this (per
     commented out code).  <code>count_trailing_zeros</code> could use it too.
<li> PA64: Improve <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
     <code>mpn_mul_1</code>.  The current code runs at 11 cycles/limb.  It
     should be possible to saturate the cache, which will happen at 8
     cycles/limb (7.5 for mpn_mul_1).  Write special loops for s2 &lt; 2^32;
     it should be possible to make them run at about 5 cycles/limb.
<li> PPC601: See which of the power or powerpc32 code runs better.  Currently
     the powerpc32 is used, but only because it's the default for
     <code>powerpc*</code>.
<li> PPC630: Rewrite <code>mpn_addmul_1</code>, <code>mpn_submul_1</code>, and
     <code>mpn_mul_1</code>.  Use both integer and floating-point operations,
     possibly two floating-point and one integer limb per loop.  Split operands
     into four 16-bit chunks for fast fp operations.  Should easily reach 9
     cycles/limb (using one int + one fp), but perhaps even 7 cycles/limb
     (using one int + two fp).
<li> PPC630: <code>mpn_rshift</code> could do the same sort of unrolled loop
     as <code>mpn_lshift</code>.  Some judicious use of m4 might let the two
     share source code, or with a register to control the loop direction
     perhaps even share object code.
<li> Implement <code>mpn_mul_basecase</code> and <code>mpn_sqr_basecase</code>
     for important machines.  Helping the generic sqr_basecase.c with an
     <code>mpn_sqr_diagonal</code> might be enough for some of the RISCs.
<li> POWER2/POWER2SC: Schedule <code>mpn_lshift</code>/<code>mpn_rshift</code>.
     Will bring time from 1.75 to 1.25 cycles/limb.
<li> X86: Optimize non-MMX <code>mpn_lshift</code> for shifts by 1.  (See
     Pentium code.)
<li> X86: Good authority has it that in the past an inline <code>rep
     movs</code> would upset GCC register allocation for the whole function.
     Is this still true in GCC 3?  It uses <code>rep movs</code> itself for
     <code>__builtin_memcpy</code>.  Examine the code for some simple and
     complex functions to find out.  Inlining <code>rep movs</code> would be
     desirable; it'd be both smaller and faster.
<li> Pentium P54: <code>mpn_lshift</code> and <code>mpn_rshift</code> can come
     down from 6.0 c/l to 5.5 or 5.375 by paying attention to pairing after
     <code>shrdl</code> and <code>shldl</code>, see mpn/x86/pentium/README.
<li> Pentium P55 MMX: <code>mpn_lshift</code> and <code>mpn_rshift</code>
     might benefit from some destination prefetching.
<li> PentiumPro: <code>mpn_divrem_1</code> might be able to use a
     mul-by-inverse, hoping for maybe 30 c/l.
<li> K7: <code>mpn_lshift</code> and <code>mpn_rshift</code> might be able to
     do something branch-free for unaligned startups, and shaving one insn
     from the loop with alternative indexing might save a cycle.
<li> PPC32: Try using fewer registers in the current <code>mpn_lshift</code>.
     The pipeline is now extremely deep, perhaps unnecessarily deep.
<li> Fujitsu VPP: Vectorize main functions, perhaps in assembly language.
<li> Fujitsu VPP: Write <code>mpn_mul_basecase</code> and
     <code>mpn_sqr_basecase</code>.  This should use a "vertical multiplication
     method", splitting one of the operands into 11-bit chunks, to avoid
     carry propagation.
<li> Pentium: <code>mpn_lshift</code> by 31 should use the special rshift
     by 1 code, and vice versa <code>mpn_rshift</code> by 31 should use the
     special lshift by 1.  This would be best as a jump across to the other
     routine, could let both live in lshift.asm and omit rshift.asm on finding
     <code>mpn_rshift</code> already provided.
<li> Cray T3E: Experiment with optimization options.  In particular,
     -hpipeline3 seems promising.  We should at least up -O to -O2 or -O3.
<li> Cray: <code>mpn_com</code> and <code>mpn_and_n</code> etc very probably
     want a pragma like <code>MPN_COPY_INCR</code>.
<li> Cray vector systems: <code>mpn_lshift</code>, <code>mpn_rshift</code>,
     <code>mpn_popcount</code> and <code>mpn_hamdist</code> are nice and small
     and could be inlined to avoid function calls.
<li> Cray: Variable length arrays seem to be faster than the tal-notreent.c
     scheme.  Not sure why; maybe they merely give the compiler more
     information about aliasing (or the lack thereof).  Would like to modify
     <code>TMP_ALLOC</code> to use them, or introduce a new scheme.  Memory
     blocks wanted unconditionally are easy enough, those wanted only
     sometimes are a problem.  Perhaps a special size calculation to ask for a
     dummy length 1 when unwanted, or perhaps an inlined subroutine
     duplicating code under each conditional.  Don't really want to turn
     everything into a dog's dinner just because Cray doesn't offer an
     <code>alloca</code>.
<li> Cray: <code>mpn_get_str</code> on power-of-2 bases ought to vectorize.
     Does it?  <code>bits_per_digit</code> and the inner loop over bits in a
     limb might prevent it.  Perhaps special cases for binary, octal and hex
     would be worthwhile (very possibly for all processors too).
<li> S390: <code>BSWAP_LIMB_FETCH</code> looks like it could be done with
     <code>lrvg</code>, as per glibc sysdeps/s390/s390-64/bits/byteswap.h.
     Is this only for 64-bit mode, since 32-bit mode has
     other code?  Also, is it worth using for <code>BSWAP_LIMB</code> too, or
     would that mean a store and re-fetch?  Presumably that's what comes out
     in glibc.
<li> Improve <code>count_leading_zeros</code> for 64-bit machines:
  <pre>
	   if ((x &gt;&gt; 32) == 0) { x &lt;&lt;= 32; cnt += 32; }
	   if ((x &gt;&gt; 48) == 0) { x &lt;&lt;= 16; cnt += 16; }
	   ... </pre>
   455  <li> IRIX 6 MIPSpro compiler has an <code>__inline</code> which could perhaps
   456       be used in <code>__GMP_EXTERN_INLINE</code>.  What would be the right way
   457       to identify suitable versions of that compiler?
   458  <li> IRIX <code>cc</code> is rumoured to have an <code>_int_mult_upper</code>
   459       (in <code>&lt;intrinsics.h&gt;</code> like Cray), but it didn't seem to
   460       exist on some IRIX 6.5 systems tried.  If it does actually exist
   461       somewhere it would very likely be an improvement over a function call to
   462       umul.asm.
   463  <li> <code>mpn_get_str</code> final divisions by the base with
   464       <code>udiv_qrnd_unnorm</code> could use some sort of multiply-by-inverse
   465       on suitable machines.  This ends up happening for decimal by presenting
   466       the compiler with a run-time constant, but the same for other bases would
   467       be good.  Perhaps use could be made of the fact that base&lt;256.
   468  <li> <code>mpn_umul_ppmm</code>, <code>mpn_udiv_qrnnd</code>: Return a
   469       structure like <code>div_t</code> to avoid going through memory, in
   470       particular helping RISCs that don't do store-to-load forwarding.  Clearly
   471       this is only possible if the ABI returns a structure of two
   472       <code>mp_limb_t</code>s in registers.
   473       <br>
   474       On PowerPC, structures are returned in memory on AIX and Darwin.  In SVR4
   475       they're returned in registers, except that draft SVR4 had said memory, so
   476       it'd be prudent to check which is done.  We can jam the compiler into the
   477       right mode if we know how, since all this is purely internal to libgmp.
   478       (gcc has an option, though of course gcc doesn't matter since we use
   479       inline asm there.)
   480  </ul>
   481  
   482  <h4>New Functionality</h4>
   483  <ul>
   484  <li> Maybe add <code>mpz_crr</code> (Chinese Remainder Reconstruction).
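For two moduli the reconstruction is just an inverse and a multiply; a single-word C sketch of what <code>mpz_crr</code> might compute (<code>egcd</code> and <code>crr2</code> are illustrative names, and the moduli are assumed positive and coprime):

```c
/* Extended Euclid: returns g = gcd(a,b), and x with a*x == g (mod b).  */
static long long
egcd (long long a, long long b, long long *x)
{
  long long x0 = 1, x1 = 0;
  while (b != 0)
    {
      long long q = a / b, t;
      t = a - q * b;    a = b;    b = t;
      t = x0 - q * x1;  x0 = x1;  x1 = t;
    }
  *x = x0;
  return a;
}

/* Find n with n == r1 (mod m1) and n == r2 (mod m2), for coprime
   m1, m2; the result is reduced to 0 <= n < m1*m2.  */
static long long
crr2 (long long r1, long long m1, long long r2, long long m2)
{
  long long inv, n;
  egcd (m1, m2, &inv);          /* m1*inv == 1 (mod m2), gcd assumed 1 */
  n = r1 + m1 * (((r2 - r1) * inv) % m2);
  n %= m1 * m2;
  return n < 0 ? n + m1 * m2 : n;
}
```

For example <code>crr2(2,3,3,5)</code> gives 8, the unique residue mod 15.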
   485  <li> Let `0b' and `0B' mean binary input everywhere.
   486  <li> <code>mpz_init</code> and <code>mpq_init</code> could do lazy allocation.
   487       Set <code>ALLOC(var)</code> to 0 to indicate nothing allocated, and let
   488       <code>_mpz_realloc</code> do the initial alloc.  Set
   489       <code>z-&gt;_mp_d</code> to a dummy that <code>mpz_get_ui</code> and
   490       similar can unconditionally fetch from.  Niels Möller has had a go at
   491       this.
   492       <br>
   493       The advantages of the lazy scheme would be:
   494       <ul>
   495       <li> Initial allocate would be the size required for the first value
   496            stored, rather than getting 1 limb in <code>mpz_init</code> and then
   497            more or less immediately reallocating.
   498       <li> <code>mpz_init</code> would only store magic values in the
   499            <code>mpz_t</code> fields, and could be inlined.
   500       <li> A fixed initializer could even be used by applications, like
   501            <code>mpz_t z = MPZ_INITIALIZER;</code>, which might be convenient
   502            for globals.
   503       </ul>
   504       The advantages of the current scheme are:
   505       <ul>
   506       <li> <code>mpz_set_ui</code> and other similar routines needn't check the
   507            size allocated and can just store unconditionally.
   508       <li> <code>mpz_set_ui</code> and perhaps others like
   509            <code>mpz_tdiv_r_ui</code> and a prospective
   510            <code>mpz_set_ull</code> could be inlined.
   511       </ul>
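The lazy scheme can be sketched on a toy type (the struct layout and the <code>LAZY_INITIALIZER</code> macro are illustrative only, not GMP's real fields):

```c
#include <stdlib.h>

/* Toy integer type mimicking lazy allocation: alloc==0 means nothing
   allocated yet; d points at a shared dummy limb holding 0, so reads
   can fetch unconditionally.  */
typedef struct { int alloc; int size; unsigned long *d; } lazy_int;

static unsigned long lazy_dummy = 0;

/* Fixed initializer, usable even for globals.  */
#define LAZY_INITIALIZER { 0, 0, &lazy_dummy }

/* The first store performs the real allocation.  */
static void
lazy_set_ui (lazy_int *z, unsigned long v)
{
  if (z->alloc == 0)
    {
      z->d = malloc (sizeof (unsigned long));
      z->alloc = 1;
    }
  z->d[0] = v;
  z->size = (v != 0);
}

/* No allocation check needed: the dummy limb already holds 0.  */
static unsigned long
lazy_get_ui (const lazy_int *z)
{
  return z->d[0];
}
```

The dummy-limb trick is what lets <code>lazy_get_ui</code> stay branch-free, which is the main attraction claimed for the scheme above.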
   512  <li> Add <code>mpf_out_raw</code> and <code>mpf_inp_raw</code>.  Make sure
   513       format is portable between 32-bit and 64-bit machines, and between
   514       little-endian and big-endian machines.  A format which MPFR can use too
   515       would be good.
   516  <li> <code>mpn_and_n</code> ... <code>mpn_copyd</code>: Perhaps make the mpn
   517       logops and copies available in gmp.h, either as library functions or
   518       inlines, with the availability of library functions instantiated in the
   519       generated gmp.h at build time.
   520  <li> <code>mpz_set_str</code> etc variants taking string lengths rather than
   521       null-terminators.
   522  <li> <code>mpz_andn</code>, <code>mpz_iorn</code>, <code>mpz_nand</code>,
   523       <code>mpz_nior</code>, <code>mpz_xnor</code> might be useful additions,
   524       if they could share code with the current such functions (which should be
   525       possible).
   526  <li> <code>mpz_and_ui</code> etc might be of use sometimes.  Suggested by
   527       Niels Möller.
   528  <li> <code>mpf_set_str</code> and <code>mpf_inp_str</code> could usefully
   529       accept 0x, 0b etc when base==0.  Perhaps the exponent could default to
   530       decimal in this case, with a further 0x, 0b etc allowed there.
   531       Eg. 0xFFAA@0x5A.  A leading "0" for octal would match the integers, but
   532       probably something like "0.123" ought not mean octal.
   533  <li> <code>GMP_LONG_LONG_LIMB</code> or some such could become a documented
   534       feature of gmp.h, so applications could know whether to
   535       <code>printf</code> a limb using <code>%lu</code> or <code>%Lu</code>.
   536  <li> <code>GMP_PRIdMP_LIMB</code> and similar defines following C99
   537       &lt;inttypes.h&gt; might be of use to applications printing limbs.  But
   538       if <code>GMP_LONG_LONG_LIMB</code> or whatever is added then perhaps this
   539       can easily enough be left to applications.
   540  <li> <code>gmp_printf</code> could accept <code>%b</code> for binary output.
   541       It'd be nice if it worked for plain <code>int</code> etc too, not just
   542       <code>mpz_t</code> etc.
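For plain <code>int</code> the conversion would reduce to something like this sketch (<code>to_binary</code> is an illustrative helper, not a proposed interface):

```c
/* Write the binary digits of v into buf (at least 8*sizeof(v)+1 bytes),
   most significant first, with no leading zeros except for v==0.  */
static char *
to_binary (unsigned long v, char *buf)
{
  char *p = buf;
  int bit = 0;
  unsigned long t = v;
  while (t > 1)                 /* find the top set bit's position */
    {
      t >>= 1;
      bit++;
    }
  do
    *p++ = '0' + ((v >> bit) & 1);
  while (--bit >= 0);
  *p = '\0';
  return buf;
}
```

For an <code>mpz_t</code> the same conversion already exists as <code>mpz_get_str</code> with base 2.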
   543  <li> <code>gmp_printf</code> in fact could usefully accept an arbitrary base,
   544       for both integer and float conversions.  A base either in the format
   545       string or as a parameter with <code>*</code> should be allowed.  Maybe
   546       <code>&amp;13b</code> (b for base) or something like that.
   547  <li> <code>gmp_printf</code> could perhaps accept <code>mpq_t</code> for float
   548       conversions, eg. <code>"%.4Qf"</code>.  This would be merely for
   549       convenience, but still might be useful.  Rounding would be the same as
   550       for an <code>mpf_t</code> (ie. currently round-to-nearest, but not
   551       actually documented).  Alternately, perhaps a separate
   552       <code>mpq_get_str_point</code> or some such might be more use.  Suggested
   553       by Pedro Gimeno.
   554  <li> <code>mpz_rscan0</code> or <code>mpz_revscan0</code> or some such
   555       searching towards the low end of an integer might match
   556       <code>mpz_scan0</code> nicely.  Likewise for <code>scan1</code>.
   557       Suggested by Roberto Bagnara.
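On a single word the intended semantics might be as follows (the name <code>rscan0</code> and the -1 convention are illustrative):

```c
/* Search from bit position `start` downwards for a 0 bit; return its
   index, or -1 if bits start..0 are all 1.  */
static int
rscan0 (unsigned long x, int start)
{
  int i;
  for (i = start; i >= 0; i--)
    if (((x >> i) & 1) == 0)
      return i;
  return -1;
}
```

A real <code>mpz_rscan0</code> would of course walk whole limbs rather than individual bits.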
   558  <li> <code>mpz_bit_subset</code> or some such to test whether one integer is a
   559       bitwise subset of another might be of use.  Some sort of return value
   560       indicating whether it's a proper or non-proper subset would be good and
   561       wouldn't cost anything in the implementation.  Suggested by Roberto
   562       Bagnara.
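The test itself is one AND and a compare; a single-word sketch using one possible return convention (0 = not a subset, 1 = equal i.e. non-proper, 2 = proper subset; <code>bit_subset_p</code> is an illustrative name):

```c
/* Is a a bitwise subset of b?  Returns 0 if not, 1 if a == b (the
   non-proper case), 2 if a is a proper subset of b.  */
static int
bit_subset_p (unsigned long a, unsigned long b)
{
  if ((a & ~b) != 0)
    return 0;                   /* some bit of a lies outside b */
  return a == b ? 1 : 2;
}
```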
   563  <li> <code>mpf_get_ld</code>, <code>mpf_set_ld</code>: Conversions between
   564       <code>mpf_t</code> and <code>long double</code>, suggested by Dan
   565       Christensen.  Other <code>long double</code> routines might be desirable
   566       too, but <code>mpf</code> would be a start.
   567       <br>
   568       <code>long double</code> is an ANSI-ism, so everything involving it would
   569       need to be suppressed on a K&amp;R compiler.
   570       <br>
   571       There'd be some work to be done by <code>configure</code> to recognise
   572       the format in use, MPFR has a start on this.  Often <code>long
   573       double</code> is the same as <code>double</code>, which is easy but
   574       pretty pointless.  A single float format detector macro could look at
   575       <code>double</code> then <code>long double</code>.
   576       <br>
   577       Sometimes there's a compiler option for the size of a <code>long
   578       double</code>, eg. xlc on AIX can use either 64-bit or 128-bit.  It's
   579       probably simplest to regard this as a compiler compatibility issue, and
   580       leave it to users or sysadmins to ensure application and library code is
   581       built the same.
   582  <li> <code>mpz_sqrt_if_perfect_square</code>: When
   583       <code>mpz_perfect_square_p</code> does its tests it calculates a square
   584       root and then discards it.  For some applications it might be useful to
   585       return that root.  Suggested by Jason Moxham.
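The single-word analogue shows the shape: compute the root once and report both the flag and the root (names here are illustrative; on <code>mpz_t</code> the root would presumably come from the same computation <code>mpz_perfect_square_p</code> already does):

```c
/* Floor square root by Newton's method on integers.  */
static unsigned long
isqrt (unsigned long n)
{
  unsigned long x, y;
  if (n < 2)
    return n;
  x = n;
  y = (x + 1) / 2;
  while (y < x)
    {
      x = y;
      y = (x + n / x) / 2;
    }
  return x;
}

/* If n is a perfect square, store its root in *root and return 1;
   otherwise return 0 and leave *root untouched.  */
static int
sqrt_if_perfect_square (unsigned long n, unsigned long *root)
{
  unsigned long r = isqrt (n);
  if (r * r != n)
    return 0;
  *root = r;
  return 1;
}
```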
   586  <li> <code>mpz_get_ull</code>, <code>mpz_set_ull</code>,
   587       <code>mpz_get_sll</code>, <code>mpz_set_sll</code>: Conversions for
   588       <code>long long</code>.  These would aid interoperability, though a
   589       mixture of GMP and <code>long long</code> would probably not be too
   590       common.  Since <code>long long</code> is not always available (it's in
   591       C99 and GCC though), disadvantages of using <code>long long</code> in
   592       libgmp.a would be
   593       <ul>
   594       <li> Library contents vary according to the build compiler.
   595       <li> gmp.h would need an ugly <code>#ifdef</code> block to decide if the
   596            application compiler could take the <code>long long</code>
   597            prototypes.
   598       <li> Some sort of <code>LIBGMP_HAS_LONGLONG</code> might be wanted to
   599            indicate whether the functions are available.  (Applications using
   600            autoconf could probe the library too.)
   601       </ul>
   602       It'd be possible to defer the need for <code>long long</code> to
   603       application compile time, by having something like
   604       <code>mpz_set_2ui</code> called with two halves of a <code>long
   605       long</code>.  Disadvantages of this would be,
   606       <ul>
   607       <li> Bigger code in the application, though perhaps not if a <code>long
   608            long</code> is normally passed as two halves anyway.
   609       <li> <code>mpz_get_ull</code> would be a rather big inline, or would have
   610            to be two function calls.
   611       <li> <code>mpz_get_sll</code> would be a worse inline, and would put the
   612            treatment of <code>-0x10..00</code> into applications (see
   613            <code>mpz_get_si</code> correctness above).
   614       <li> Although having libgmp.a independent of the build compiler is nice,
   615            it sort of sacrifices the capabilities of a good compiler to
   616            uniformity with inferior ones.
   617       </ul>
   618       Plain use of <code>long long</code> is probably the lesser evil, if only
   619       because it makes best use of gcc.  In fact perhaps it would suffice to
   620       guarantee <code>long long</code> conversions only when using GCC for both
   621       application and library.  That would cover free software, and we can
   622       worry about selected vendor compilers later.
   623       <br>
   624       In C++ the situation is probably clearer: we demand fairly recent C++, so
   625       <code>long long</code> should be available always.  We'd probably prefer
   626       to have the C and C++ the same in respect of <code>long long</code>
   627       support, but it would be possible to have it unconditionally in gmpxx.h,
   628       by some means or another.
   629  <li> <code>mpz_strtoz</code> parsing the same as <code>strtol</code>.
   630       Suggested by Alexander Kruppa.
   631  </ul>
   632  
   633  
   634  <h4>Configuration</h4>
   635  
   636  <ul>
   637  <li> Alpha ev7, ev79: Add code to config.guess to detect these.  Believe ev7
   638       will be "3-1307" in the current switch, but need to verify that.  (On
   639       OSF, current configfsf.guess identifies ev7 using psrinfo, we need to do
   640       it ourselves for other systems.)
   641  <li> Alpha OSF: Libtool (version 1.5) doesn't seem to recognise this system is
   642       "pic always" and ends up running gcc twice with the same options.  This
   643       is wasteful, but harmless.  Perhaps a newer libtool will be better.
   644  <li> ARM: <code>umul_ppmm</code> in longlong.h always uses <code>umull</code>,
   645       but is that available only for M series chips or some such?  Perhaps it
   646       should be configured in some way.
   647  <li> HPPA: config.guess should recognize 7000, 7100, 7200, and 8x00.
   648  <li> HPPA: gcc 3.2 introduces a <code>-mschedule=7200</code> etc parameter,
   649       which could be driven by an exact hppa cpu type.
   650  <li> Mips: config.guess should say mipsr3000, mipsr4000, mipsr10000, etc.
   651       "hinv -c processor" gives lots of information on Irix.  Standard
   652       config.guess appends "el" to indicate endianness, but
   653       <code>AC_C_BIGENDIAN</code> seems the best way to handle that for GMP.
   654  <li> PowerPC: The function descriptor nonsense for AIX is currently driven by
   655       <code>*-*-aix*</code>.  It might be more reliable to do some sort of
   656       feature test, examining the compiler output perhaps.  It might also be
   657       nice to merge the aix.m4 files into powerpc-defs.m4.
   658  <li> config.m4 is generated only by the configure script; it won't be
   659       regenerated by config.status.  Creating it as an <code>AC_OUTPUT</code>
   660       would work, but it might upset "make" to have things like <code>L$</code>
   661       get into the Makefiles through <code>AC_SUBST</code>.
   662       <code>AC_CONFIG_COMMANDS</code> would be the alternative.  With some
   663       careful m4 quoting the <code>changequote</code> calls might not be
   664       needed, which might free up the order in which things had to be output.
   665  <li> Automake: Latest automake has a <code>CCAS</code>, <code>CCASFLAGS</code>
   666       scheme.  Though we probably wouldn't be using its assembler support we
   667       could try to use those variables in compatible ways.
   668  <li> <code>GMP_LDFLAGS</code> could probably be done with plain
   669       <code>LDFLAGS</code> already used by automake for all linking.  But with
   670       a bit of luck the next libtool will pass pretty much all
   671       <code>CFLAGS</code> through to the compiler when linking, making
   672       <code>GMP_LDFLAGS</code> unnecessary.
   673  <li> mpn/Makeasm.am uses <code>-c</code> and <code>-o</code> together in the
   674       .S and .asm rules, but apparently that isn't completely portable (there's
   675       an autoconf <code>AC_PROG_CC_C_O</code> test for it).  So far we've not
   676       had problems, but perhaps the rules could be rewritten to use "foo.s" as
   677       the temporary, or to do a suitable "mv" of the result.  The only danger
   678       from using foo.s would be if a compile failed and the temporary foo.s
   679       then looked like the primary source.  Hopefully if the
   680       <code>SUFFIXES</code> are ordered to have .S and .asm ahead of .s that
   681       wouldn't happen.  Might need to check.
   682  </ul>
   683  
   684  
   685  <h4>Random Numbers</h4>
   686  <ul>
   687  <li> <code>_gmp_rand</code> is not particularly fast on the linear
   688       congruential algorithm and could stand various improvements.
   689       <ul>
   690       <li> Make a second seed area within <code>gmp_randstate_t</code> (or
   691            <code>_mp_algdata</code> rather) to save some copying.
   692       <li> Make a special case for a single limb <code>2exp</code> modulus, to
   693            avoid <code>mpn_mul</code> calls.  Perhaps the same for two limbs.
   694       <li> Inline the <code>lc</code> code, to avoid a function call and
   695            <code>TMP_ALLOC</code> for every chunk.
   696       <li> Perhaps the <code>2exp</code> and general LC cases should be split,
   697            for clarity (if the general case is retained).
   698       </ul>
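For a single-limb <code>2exp</code> modulus the LC step needs no <code>mpn_mul</code> at all, just one multiply and a mask; a sketch (the multiplier and addend are the classic Knuth MMIX constants, used here purely for illustration):

```c
#include <stdint.h>

/* One step of X <- (a*X + c) mod 2^m for m <= 64: the reduction is
   a mask, so no multiprecision multiply is needed.  */
static uint64_t
lc_step_2exp (uint64_t x, unsigned m)
{
  const uint64_t a = 6364136223846793005u;   /* Knuth MMIX multiplier */
  const uint64_t c = 1442695040888963407u;   /* Knuth MMIX addend */
  uint64_t mask = (m >= 64) ? ~(uint64_t) 0 : (((uint64_t) 1 << m) - 1);
  return (a * x + c) & mask;
}
```

The two-limb case would be similar with <code>umul_ppmm</code> plus an add-with-carry.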
   699  <li> <code>gmp_randstate_t</code> used for parameters perhaps should become
   700       <code>gmp_randstate_ptr</code> the same as other types.
   701  <li> Some of the empirical randomness tests could be included in a "make
   702       check".  They ought to work everywhere, for a given seed at least.
   703  </ul>
   704  
   705  
   706  <h4>C++</h4>
   707  <ul>
   708  <li> <code>mpz_class(string)</code>, etc: Use the C++ global locale to
   709       identify whitespace.
   710       <br>
   711       <code>mpf_class(string)</code>: Use the C++ global locale decimal point,
   712       rather than the C one.
   713       <br>
   714       Consider making these variant <code>mpz_set_str</code> etc forms
   715       available for <code>mpz_t</code> too, not just <code>mpz_class</code>
   716       etc.
   717  <li> <code>mpq_class operator+=</code>: Don't emit an unnecessary
   718       <code>mpq_set(q,q)</code> before <code>mpz_addmul</code> etc.
   719  <li> Put various bits of gmpxx.h into libgmpxx, to avoid excessive inlining.
   720       Candidates for this would be,
   721       <ul>
   722       <li> <code>mpz_class(const char *)</code>, etc: since they're normally
   723            not fast anyway, and we can hide the exception <code>throw</code>.
   724       <li> <code>mpz_class(string)</code>, etc: to hide the <code>cstr</code>
   725            needed to get to the C conversion function.
   726       <li> <code>mpz_class string, char*</code> etc constructors: likewise to
   727            hide the throws and conversions.
   728       <li> <code>mpz_class::get_str</code>, etc: to hide the <code>char*</code>
   729            to <code>string</code> conversion and free.  Perhaps
   730            <code>mpz_get_str</code> can write directly into a
   731            <code>string</code>, to avoid copying.
   732            <br>
   733            Consider making such <code>string</code> returning variants
   734            available for use with plain <code>mpz_t</code> etc too.
   735       </ul>
   736  </ul>
   737  
   738  <h4>Miscellaneous</h4>
   739  <ul>
   740  <li> <code>mpz_gcdext</code> and <code>mpn_gcdext</code> ought to document
   741       what range of values the generated cofactors can take, and preferably
   742       ensure the definition uniquely specifies the cofactors for given inputs.
   743       A basic extended Euclidean algorithm or multi-step variant leads to
   744       |x|&lt;|b| and |y|&lt;|a| or something like that, but there are probably
   745       two solutions under just those restrictions.
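The classic extended Euclidean recurrence gives one concrete choice of cofactors to experiment with against such bounds; a single-word sketch (<code>egcd2</code> is an illustrative name, inputs assumed positive):

```c
/* Extended Euclid: returns g = gcd(a,b), with a*x + b*y == g.  */
static long
egcd2 (long a, long b, long *x, long *y)
{
  long x0 = 1, x1 = 0, y0 = 0, y1 = 1;
  while (b != 0)
    {
      long q = a / b, t;
      t = a - q * b;    a = b;    b = t;
      t = x0 - q * x1;  x0 = x1;  x1 = t;
      t = y0 - q * y1;  y0 = y1;  y1 = t;
    }
  *x = x0;
  *y = y0;
  return a;
}
```

For instance gcd(240,46)=2 with cofactors satisfying |x|&lt;46 and |y|&lt;240, illustrating the bounds in question.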
   746  <li> demos/factorize.c: use <code>mpz_divisible_ui_p</code> rather than
   747       <code>mpz_tdiv_qr_ui</code>.  (Of course dividing multiple primes at a
   748       time would be better still.)
   749  <li> The various test programs use quite a bit of the main
   750       <code>libgmp</code>.  This establishes good cross-checks, but it might be
   751       better to use simple reference routines where possible.  Where it's not
   752       possible some attention could be paid to the order of the tests, so a
   753       <code>libgmp</code> routine is only used for tests once it seems to be
   754       good.
   755  <li> <code>MUL_FFT_THRESHOLD</code> etc: the FFT thresholds should allow a
   756       return to a previous k at certain sizes.  This arises basically due to
   757       the step effect caused by size multiples effectively used for each k.
   758       Looking at a graph makes it fairly clear.
   759  <li> <code>__gmp_doprnt_mpf</code> does a rather unattractive round-to-nearest
   760       on the string returned by <code>mpf_get_str</code>.  Perhaps some variant
   761       of <code>mpf_get_str</code> could be made which would better suit.
   762  </ul>
   763  
   764  
   765  <h4>Aids to Development</h4>
   766  <ul>
   767  <li> Add <code>ASSERT</code>s at the start of each user-visible mpz/mpq/mpf
   768       function to check the validity of each <code>mp?_t</code> parameter, in
   769       particular to check they've been <code>mp?_init</code>ed.  This might
   770       catch elementary mistakes in user programs.  Care would need to be taken
   771       over <code>MPZ_TMP_INIT</code>ed variables used internally.  If nothing
   772       else then consistency checks like size&lt;=alloc, ptr not
   773       <code>NULL</code> and ptr+size not wrapping around the address space,
   774       would be possible.  A more sophisticated scheme could track
   775       <code>_mp_d</code> pointers and ensure only a valid one is used.  Such a
   776       scheme probably wouldn't be reentrant, not without some help from the
   777       system.
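The elementary checks could be bundled into a predicate like this sketch on a toy struct (layout and name are illustrative, not GMP's internals):

```c
#include <stddef.h>
#include <stdint.h>

/* Toy mirror of the mpz fields being validated.  */
typedef struct { int alloc; int size; unsigned long *d; } toy_mpz;

/* Return nonzero iff z passes the elementary consistency checks:
   size within alloc, pointer non-NULL, ptr+size not wrapping around
   the address space.  */
static int
mpz_consistent_p (const toy_mpz *z)
{
  int absize = z->size >= 0 ? z->size : -z->size;
  if (absize > z->alloc)
    return 0;
  if (z->d == NULL)
    return 0;
  if ((uintptr_t) (z->d + absize) < (uintptr_t) z->d)
    return 0;                   /* wrapped around the address space */
  return 1;
}
```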
   778  <li> tune/time.c could try to determine at runtime whether
   779       <code>getrusage</code> and <code>gettimeofday</code> are reliable.
   780       Currently we pretend in configure that the dodgy m68k netbsd 1.4.1
   781       <code>getrusage</code> doesn't exist.  If a test might take a long time
   782       to run then perhaps cache the result in a file somewhere.
   783  <li> tune/time.c could choose the default precision based on the
   784       <code>speed_unittime</code> determined, independent of the method in use.
   785  <li> Cray vector systems: CPU frequency could be determined from
   786       <code>sysconf(_SC_CLK_TCK)</code>, since it seems to be clock cycle
   787       based.  Is this true for all Cray systems?  Would like some documentation
   788       or something to confirm.
   789  </ul>
   790  
   791  
   792  <h4>Documentation</h4>
   793  <ul>
   794  <li> <code>mpz_inp_str</code> (etc) doesn't say when it stops reading digits.
   795  <li> <code>mpn_get_str</code> isn't terribly clear about how many digits it
   796       produces.  It'd probably be possible to say at most one leading zero,
   797       which is what both it and <code>mpz_get_str</code> currently do.  But
   798       want to be careful not to bind ourselves to something that might not suit
   799       another implementation.
   800  <li> <code>va_arg</code> doesn't do the right thing with <code>mpz_t</code>
   801       etc directly, but instead needs a pointer type like <code>MP_INT*</code>.
   802       It'd be good to show how to do this, but we'd either need to document
   803       <code>mpz_ptr</code> and friends, or perhaps fallback on something
   804       slightly nasty with <code>void*</code>.
   805  </ul>
   806  
   807  
   808  <h4>Bright Ideas</h4>
   809  
   810  <p> The following may or may not be feasible, and aren't likely to get done in the
   811  near future, but are at least worth thinking about.
   812  
   813  <ul>
   814  <li> Reorganize longlong.h so that we can inline the operations even for the
   815       system compiler.  When there is no such compiler feature, make calls to
   816       stub functions.  Write such stub functions for as many machines as
   817       possible.
   818  <li> longlong.h could declare when it's using, or would like to use,
   819       <code>mpn_umul_ppmm</code>, and the corresponding umul.asm file could be
   820       included in libgmp only in that case, the same as is effectively done for
   821       <code>__clz_tab</code>.  Likewise udiv.asm and perhaps cntlz.asm.  This
   822       would only be a very small space saving, so perhaps not worth the
   823       complexity.
   824  <li> longlong.h could be built at configure time by concatenating or
   825       #including fragments from each directory in the mpn path.  This would
   826       select CPU specific macros the same way as CPU specific assembler code.
   827       Code used would no longer depend on cpp predefines, and the current
   828       nested conditionals could be flattened out.
   829  <li> <code>mpz_get_si</code> returns 0x80000000 for -0x100000000, whereas it's
   830       sort of supposed to return the low 31 (or 63) bits.  But this is
   831       undocumented, and perhaps not too important.
   832  <li> <code>mpz_init_set*</code> and <code>mpz_realloc</code> could allocate
   833       say an extra 16 limbs over what's needed, so as to reduce the chance of
   834       having to do a reallocate if the <code>mpz_t</code> grows a bit more.
   835       This could only be an option, since it'd badly bloat memory usage in
   836       applications using many small values.
   837  <li> <code>mpq</code> functions could perhaps check for numerator or
   838       denominator equal to 1, on the assumption that integers or
   839       denominator-only values might be expected to occur reasonably often.
   840  <li> <code>count_trailing_zeros</code> is used on more or less uniformly
   841       distributed numbers in a couple of places.  For some CPUs
   842       <code>count_trailing_zeros</code> is slow and it's probably worth handling
   843       the frequently occurring 0 to 2 trailing zeros cases specially.
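Special-casing those counts might look like this sketch (<code>ctz</code> is an illustrative name; x is assumed nonzero):

```c
/* Count trailing zeros of a nonzero word, handling the frequent
   0, 1 and 2 trailing-zero cases before falling back to a loop.  */
static int
ctz (unsigned long x)
{
  int cnt;
  if (x & 1)
    return 0;                   /* odd: no trailing zeros */
  if (x & 2)
    return 1;
  if (x & 4)
    return 2;
  cnt = 3;
  x >>= 3;
  while ((x & 1) == 0)
    {
      x >>= 1;
      cnt++;
    }
  return cnt;
}
```

On uniformly distributed inputs the first three tests resolve 7 cases out of 8.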
   844  <li> <code>mpf_t</code> might like to let the exponent be undefined when
   845       size==0, instead of requiring it to be 0 as now.  It should be possible to do
   846       size==0 tests before paying attention to the exponent.  The advantage is
   847       not needing to set exp in the various places a zero result can arise,
   848       which avoids some tedium but is otherwise perhaps not too important.
   849       Currently <code>mpz_set_f</code> and <code>mpf_cmp_ui</code> depend on
   850       exp==0, maybe elsewhere too.
   851  <li> <code>__gmp_allocate_func</code>: Could use GCC <code>__attribute__
   852       ((malloc))</code> on this, though don't know if it'd do much.  GCC 3.0
   853       allows that attribute on functions, but not function pointers (see info
   854       node "Attribute Syntax"), so would need a new autoconf test.  This can
   855       wait until there's a GCC that supports it.
   856  <li> <code>mpz_add_ui</code> contains two <code>__GMPN_COPY</code>s, one from
   857       <code>mpn_add_1</code> and one from <code>mpn_sub_1</code>.  If those two
   858       routines were opened up a bit maybe that code could be shared.  When a
   859       copy needs to be done there's no carry to append for the add, and if the
   860       copy is non-empty no high zero for the sub.
   861  </ul>
   862  
   863  
   864  <h4>Old and Obsolete Stuff</h4>
   865  
   866  <p> The following tasks apply to chips or systems that are old and/or obsolete.
   867  It's unlikely anything will be done about them unless anyone is actively using
   868  them.
   869  
   870  <ul>
   871  <li> Sparc32: The integer based udiv_nfp.asm used to be selected by
   872       <code>configure --nfp</code> but that option is gone now that autoconf is
   873       used.  The file could go somewhere suitable in the mpn search if any
   874       chips might benefit from it, though it's possible we don't currently
   875       differentiate enough exact cpu types to do this properly.
   876  <li> VAX D and G format <code>double</code> floats are straightforward and
   877       could perhaps be handled directly in <code>__gmp_extract_double</code>
   878       and maybe in <code>mpn_get_d</code>, rather than falling back on the
   879       generic code.  (Both formats are detected by <code>configure</code>.)
   880  </ul>
   881  
   882  
   883  <hr>
   884  
   885  </body>
   886  </html>
   887  
   888  <!--
   889  Local variables:
   890  eval: (add-hook 'write-file-hooks 'time-stamp)
   891  time-stamp-start: "This file current as of "
   892  time-stamp-format: "%:d %3b %:y"
   893  time-stamp-end: "\\."
   894  time-stamp-line-limit: 50
   895  End:
   896  -->