github.com/aergoio/aergo@v1.3.1/libtool/src/gmp-6.1.2/tune/README (about)

     1  Copyright 2000-2002, 2004 Free Software Foundation, Inc.
     2  
     3  This file is part of the GNU MP Library.
     4  
     5  The GNU MP Library is free software; you can redistribute it and/or modify
     6  it under the terms of either:
     7  
     8    * the GNU Lesser General Public License as published by the Free
     9      Software Foundation; either version 3 of the License, or (at your
    10      option) any later version.
    11  
    12  or
    13  
    14    * the GNU General Public License as published by the Free Software
    15      Foundation; either version 2 of the License, or (at your option) any
    16      later version.
    17  
    18  or both in parallel, as here.
    19  
    20  The GNU MP Library is distributed in the hope that it will be useful, but
    21  WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
    22  or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
    23  for more details.
    24  
    25  You should have received copies of the GNU General Public License and the
    26  GNU Lesser General Public License along with the GNU MP Library.  If not,
    27  see https://www.gnu.org/licenses/.
    28  
    29  
    30  
    31  
    32  
    33                 GMP SPEED MEASURING AND PARAMETER TUNING
    34  
    35  
    36  The programs in this directory are for knowledgeable users who want to
    37  measure GMP routines on their machine, and perhaps tweak some settings or
    38  identify things that can be improved.
    39  
    40  The programs here are tools, not ready to run solutions.  Nothing is built
    41  in a normal "make all", but various Makefile targets described below exist.
    42  
    43  Relatively few systems and CPUs have been tested, so be sure to verify that
    44  results are sensible before relying on them.
    45  
    46  
    47  
    48  
    49  MISCELLANEOUS NOTES
    50  
    51  --enable-assert
    52  
    53      Don't configure with --enable-assert, since the extra code added by
    54      assertion checking may influence measurements.
    55  
    56  Direct mapped caches
    57  
    58      Some effort has been made to accommodate CPUs with direct mapped caches,
    59      by putting data blocks more or less contiguously on the stack.  But this
    60      will depend on TMP_ALLOC using alloca, and even then it may or may not
    61      be enough.
    62  
    63  FreeBSD 4.2 i486 getrusage
    64  
    65      This getrusage seems to be a bit doubtful, it looks like it's
    66      microsecond accurate, but sometimes ru_utime remains unchanged after a
    67      time of many microseconds has elapsed.  It'd be good to detect this in
    68      the time.c initializations, but for now the suggestion is to pretend it
    69      doesn't exist.
    70  
    71          ./configure ac_cv_func_getrusage=no
    72  
    73  NetBSD 1.4.1 m68k macintosh time base
    74  
    75      On this system it's been found getrusage often goes backwards, making it
    76      unusable (time.c getrusage_backwards_p detects this).  gettimeofday
    77      sometimes doesn't update atomically when it crosses a 1 second boundary.
    78      Not sure what to do about this.  Expect possible intermittent failures.
    79  
    80  SCO OpenUNIX 8 /etc/hw
    81  
    82      /etc/hw takes about a second to return the cpu frequency, which suggests
    83      perhaps it's measuring each time it runs.  If this is annoying when
    84      running the speed program repeatedly then set a GMP_CPU_FREQUENCY
    85      environment variable (see TIME BASE section below).
    86  
    87  Timing on GNU/Linux
    88  
    89      On Linux, timing currently uses the cycle counter. This is unreliable,
    90      since the counter is not saved and restored at context switches (unlike
    91      FreeBSD and Solaris where the cycle counter is "virtualized").
    92  
    93      Using the clock_gettime method with CLOCK_PROCESS_CPUTIME_ID (posix) or
    94      CLOCK_VIRTUAL (BSD) should be more reliable. To get clock_gettime
    95      with glibc, one has to link with -lrt (which also drags in the pthreads
    96      threading library). configure.in must be hacked to detect this and
    97      arrange proper linking. Something like
    98  
    99        old_LIBS="$LIBS"
   100        AC_SEARCH_LIBS(clock_gettime, rt, [AC_DEFINE(HAVE_CLOCK_GETTIME)])
   101        TUNE_LIBS="$LIBS"
   102        LIBS="$old_LIBS"
   103  
   104        AC_SUBST(TUNE_LIBS)
   105  
   106      might work.
   107  
   108  Low resolution timebase
   109  
   110      Parameter tuning can be very time consuming if the only timebase
   111      available is a 10 millisecond clock tick, to the point of being
   112      unusable.  This is currently the case on VAX and ARM systems.
   113  
   114  
   115  
   116  
   117  PARAMETER TUNING
   118  
   119  The "tuneup" program runs some tests designed to find the best settings for
   120  various thresholds, like MUL_TOOM22_THRESHOLD.  Its output can be put
   121  into gmp-mparam.h.  The program is built and run with
   122  
   123          make tune
   124  
   125  If the thresholds indicated are grossly different from the values in the
   126  selected gmp-mparam.h then there may be a performance boost in applicable
   127  size ranges by changing gmp-mparam.h accordingly.
   128  
   129  Be sure to do a full reconfigure and rebuild to get any newly set thresholds
   130  to take effect.  A partial rebuild is enough sometimes, but a fresh
   131  configure and make is certain to be correct.
   132  
   133  If a CPU has specific tuned parameters coming from a gmp-mparam.h in one of
   134  the mpn subdirectories then the values from "make tune" should be similar.
   135  But check that the configured CPU is right and there are no machine specific
   136  effects causing a difference.
   137  
   138  It's hoped the compiler and options used won't have too much effect on
   139  thresholds, since for most CPUs they ultimately come down to comparisons
   140  between assembler subroutines.  Missing out on the longlong.h macros by not
   141  using gcc will probably have an effect.
   142  
   143  Some thresholds produced by the tune program are merely single values chosen
   144  from what's a range of sizes where two algorithms are pretty much the same
   145  speed.  When this happens the program is likely to give somewhat different
   146  values on successive runs.  This is noticeable on the toom3 thresholds for
   147  instance.
   148  
   149  
   150  
   151  
   152  SPEED PROGRAM
   153  
   154  The "speed" program can be used for measuring and comparing various
   155  routines, and producing tables of data or gnuplot graphs.  Compile it with
   156  
   157  	make speed
   158  
   159  (Or on DOS systems "make speed.exe".)
   160  
   161  Here are some examples of how to use it.  Check the code for all the
   162  options.
   163  
   164  Draw a graph of mpn_mul_n, stepping through sizes by 10 or a factor of 1.05
   165  (whichever is greater).
   166  
   167          ./speed -s 10-5000 -t 10 -f 1.05 -P foo mpn_mul_n
   168  	gnuplot foo.gnuplot
   169  
   170  Compare mpn_add_n and an mpn_lshift by 1, showing times in cycles and
   171  showing under mpn_lshift the difference between it and mpn_add_n.
   172  
   173  	./speed -s 1-40 -c -d mpn_add_n mpn_lshift.1
   174  
   175  Using option -c for times in cycles is interesting but normally only
   176  necessary when looking carefully at assembler subroutines.  You might think
   177  it would always give an integer value, but this doesn't happen in practice,
   178  probably due to overheads in the time measurements.
   179  
   180  In the free-form output the "#" symbol against a measurement means the
   181  corresponding routine is fastest at that size.  This is a convenient visual
   182  cue when comparing different routines.  The graph data files <name>.data
   183  don't get this since it would upset gnuplot or other data viewers.
   184  
   185  
   186  
   187  
   188  TIME BASE
   189  
   190  The time measuring method is determined in time.c, based on what the
   191  configured host has available.  A cycle counter is preferred, possibly
   192  supplemented by another method if the counter has a limited range.  A
   193  microsecond accurate getrusage() or gettimeofday() will work quite well too.
   194  
   195  The cycle counters (except possibly on alpha) and gettimeofday() will depend
   196  on the machine being otherwise idle, or rather on other jobs not stealing
   197  CPU time from the measuring program.  Short routines (those that complete
   198  within a timeslice) should work even on a busy machine.
   199  
   200  Some trouble is taken by speed_measure() in common.c to avoid ill effects
   201  from sporadic interrupts, or other intermittent things (like cron waking up
   202  every minute).  But generally an idle machine will be necessary to be
   203  certain of consistent results.
   204  
   205  The CPU frequency is needed to convert between cycles and seconds, or for
   206  when a cycle counter is supplemented by getrusage() etc.  The speed program
   207  will convert as necessary according to the output format requested.  The
   208  tune program will work with either cycles or seconds.
   209  
   210  freq.c knows how to get the frequency on some systems, or can measure a
   211  cycle counter against gettimeofday() or getrusage(), but when that fails, or
   212  needs to be overridden, an environment variable GMP_CPU_FREQUENCY can be
   213  used (in Hertz).  For example in "bash" on a 650 MHz machine,
   214  
   215  	export GMP_CPU_FREQUENCY=650e6
   216  
   217  A high precision time base makes it possible to get accurate measurements in
   218  a shorter time.
   219  
   220  
   221  
   222  
   223  EXAMPLE COMPARISONS - VARIOUS
   224  
   225  Here are some ideas for things that can be done with the speed program.
   226  
   227  There's always going to be a certain amount of overhead in the time
   228  measurements, due to reading the time base, and in the loop that runs a
   229  routine enough times to get a reading of the desired precision.  Noop
   230  functions taking various arguments are available to measure this.  The
   231  "overhead" printed by the speed program each time in its intro is the "noop"
   232  routine, but note that this is just for information, it isn't deducted from
   233  the times printed or anything.
   234  
   235  	./speed -s 1 noop noop_wxs noop_wxys
   236  
   237  To see how many cycles per limb a routine is taking, look at the time
   238  increase when the size increments, using option -D.  This avoids fixed
   239  overheads in the measuring.  Also, remember many of the assembler routines
   240  have unrolled loops, so it might be necessary to compare times at, say, 16,
   241  32, 48, 64 etc to see what the unrolled part is taking, as opposed to any
   242  finishing off.
   243  
   244          ./speed -s 16-64 -t 16 -C -D mpn_add_n
   245  
   246  The -C option on its own gives cycles per limb, but is really only useful at
   247  big sizes where fixed overheads are small compared to the code doing the
   248  real work.  Remember of course memory caching and/or page swapping will
   249  affect results at large sizes.
   250  
   251          ./speed -s 500000 -C mpn_add_n
   252  
   253  Once a calculation stops fitting in the CPU data cache, it's going to start
   254  taking longer.  Exactly where this happens depends on the cache priming in
   255  the measuring routines, and on what sort of "least recently used" the
   256  hardware does.  Here's an example for a CPU with a 16kbyte L1 data cache and
   257  32-bit limb, showing a suddenly steeper curve for mpn_add_n at about 2000
   258  limbs.
   259  
   260          ./speed -s 1-4000 -t 5 -f 1.02 -P foo mpn_add_n
   261  	gnuplot foo.gnuplot
   262  
   263  When a routine has an unrolled loop for, say, multiples of 8 limbs and then
   264  an ordinary loop for the remainder, it can happen that it's actually faster
   265  to do an operation on, say, 8 limbs than it is on 7 limbs.  The following
   266  draws a graph of mpn_sub_n, to see whether times smoothly increase with
   267  size.
   268  
   269          ./speed -s 1-100 -c -P foo mpn_sub_n
   270  	gnuplot foo.gnuplot
   271  
   272  If mpn_lshift and mpn_rshift have special case code for shifts by 1, it
   273  ought to be faster (or at least not slower) than shifting by, say, 2 bits.
   274  
   275          ./speed -s 1-200 -c mpn_rshift.1 mpn_rshift.2
   276  
   277  An mpn_lshift by 1 can be done by mpn_add_n adding a number to itself, and
   278  if the lshift isn't faster there's an obvious improvement that's possible.
   279  
   280          ./speed -s 1-200 -c mpn_lshift.1 mpn_add_n_self
   281  
   282  On some CPUs (AMD K6 for example) an "in-place" mpn_add_n where the
   283  destination is one of the sources is faster than a separate destination.
   284  Here's an example to see this.  ".1" selects dst==src1 for mpn_add_n (and
   285  mpn_sub_n), for other values see speed.h SPEED_ROUTINE_MPN_BINARY_N_CALL.
   286  
   287          ./speed -s 1-200 -c mpn_add_n mpn_add_n.1
   288  
   289  The gmp manual points out that divisions by powers of two should be done
   290  using a right shift because it'll be significantly faster than an actual
   291  division.  The following shows by what factor mpn_rshift is faster than
   292  mpn_divrem_1, using division by 32 as an example.
   293  
   294          ./speed -s 10-20 -r mpn_rshift.5 mpn_divrem_1.32
   295  
   296  
   297  
   298  
   299  EXAMPLE COMPARISONS - MULTIPLICATION
   300  
   301  mul_basecase takes a ".<r>" parameter. If positive, it gives the second
   302  (smaller) operand size.  For example to show speeds for 3x3 up to 20x3 in
   303  cycles,
   304  
   305          ./speed -s 3-20 -c mpn_mul_basecase.3
   306  
   307  A negative ".<-r>" parameter fixes the size of the product to the absolute
   308  value r.  For example to show speeds for 10x10 up to 19x1 in cycles,
   309  
   310          ./speed -s 10-19 -c mpn_mul_basecase.-20
   311  
   312  mul_basecase with no parameter does an NxN multiply, so for example to show
   313  speeds in cycles for 1x1, 2x2, 3x3, etc, up to 20x20, in cycles,
   314  
   315          ./speed -s 1-20 -c mpn_mul_basecase
   316  
   317  sqr_basecase is implemented by a "triangular" method on most CPUs, making it
   318  up to twice as fast as mul_basecase.  In practice loop overheads and the
   319  products on the diagonal mean it falls short of this.  Here's an example
   320  running the two and showing by what factor an NxN mul_basecase is slower
   321  than an NxN sqr_basecase.  (Some versions of sqr_basecase only allow sizes
   322  below SQR_TOOM2_THRESHOLD, so if it crashes at that point don't worry.)
   323  
   324          ./speed -s 1-20 -r mpn_sqr_basecase mpn_mul_basecase
   325  
   326  The technique described above with -CD for showing the time difference in
   327  cycles per limb between two size operations can be done on an NxN
   328  mul_basecase using -E to change the basis for the size increment to N*N.
   329  For instance a 20x20 operation is taken to be doing 400 limbs, and a 16x16
   330  doing 256 limbs.  The following therefore shows the per crossproduct speed
   331  of mul_basecase and sqr_basecase at around 20x20 limbs.
   332  
   333          ./speed -s 16-20 -t 4 -CDE mpn_mul_basecase mpn_sqr_basecase
   334  
   335  Of course sqr_basecase isn't really doing NxN crossproducts, but it can be
   336  interesting to compare it to mul_basecase as if it was.  For sqr_basecase
   337  the -F option can be used to base the deltas on N*(N+1)/2 operations, which
   338  is the triangular products sqr_basecase does.  For example,
   339  
   340          ./speed -s 16-20 -t 4 -CDF mpn_sqr_basecase
   341  
   342  Both -E and -F are preliminary and might change.  A consistent approach to
   343  using them when claiming certain per crossproduct or per triangularproduct
   344  speeds hasn't really been established, but the increment between speeds in
   345  the range karatsuba will call seems sensible, that being k to k/2.  For
   346  instance, if the karatsuba threshold was 20 for the multiply and 30 for the
   347  square,
   348  
   349          ./speed -s 10-20 -t 10 -CDE mpn_mul_basecase
   350          ./speed -s 15-30 -t 15 -CDF mpn_sqr_basecase
   351  
   352  
   353  
   354  EXAMPLE COMPARISONS - MALLOC
   355  
   356  The gmp manual recommends application programs avoid excessive initializing
   357  and clearing of mpz_t variables (and mpq_t and mpf_t too).  Every new
   358  variable will at a minimum go through an init, a realloc for its first
   359  store, and finally a clear.  Quite how long that takes depends on the C
   360  library.  The following compares an mpz_init/realloc/clear to a 10 limb
   361  mpz_add.  Don't be surprised if the mallocing is quite slow.
   362  
   363          ./speed -s 10 -c mpz_init_realloc_clear mpz_add
   364  
   365  On some systems malloc and free are much slower when dynamic linked.  The
   366  speed-dynamic program can be used to see this.  For example the following
   367  measures malloc/free, first static then dynamic.
   368  
   369          ./speed -s 10 -c malloc_free
   370          ./speed-dynamic -s 10 -c malloc_free
   371  
   372  Of course a real world program has big problems if it's doing so many
   373  mallocs and frees that it gets slowed down by a dynamic linked malloc.
   374  
   375  
   376  
   377  
   378  
   379  EXAMPLE COMPARISONS - STRING CONVERSIONS
   380  
   381  mpn_get_str does a binary to string conversion.  The base is specified with
   382  a ".<r>" parameter, or decimal by default.  Power of 2 bases are much faster
   383  than general bases.  The following compares decimal and hex for instance.
   384  
   385          ./speed -s 1-20 -c mpn_get_str mpn_get_str.16
   386  
   387  Smaller bases need more divisions to split a given size number, and so are
   388  slower.  The following compares base 3 and base 9.  On small operands 9 will
   389  be nearly twice as fast, though at bigger sizes this reduces since in the
   390  current implementation both divide repeatedly by 3^20 (or 3^40 for 64 bit
   391  limbs) and those divisions come to dominate.
   392  
   393          ./speed -s 1-20 -cr mpn_get_str.3 mpn_get_str.9
   394  
   395  mpn_set_str does a string to binary conversion.  The base is specified with
   396  a ".<r>" parameter, or decimal by default.  Power of 2 bases are faster than
   397  general bases on large conversions.
   398  
   399  	./speed -s 1-512 -f 2 -c mpn_set_str.8 mpn_set_str.10
   400  
   401  mpn_set_str also has some special case code for decimal which is a bit
   402  faster than the general case, basically by giving the compiler a chance to
   403  optimize some multiplications by 10.
   404  
   405  	./speed -s 20-40 -c mpn_set_str.9 mpn_set_str.10 mpn_set_str.11
   406  
   407  
   408  
   409  
   410  EXAMPLE COMPARISONS - GCDs
   411  
   412  mpn_gcd_1 has a threshold for when to reduce using an initial x%y when both
   413  x and y are single limbs.  This isn't tuned currently, but a value can be
   414  established by a measurement like
   415  
   416  	./speed -s 10-32 mpn_gcd_1.10
   417  
   418  This runs src[0] from 10 to 32 bits, and y fixed at 10 bits.  If the div
   419  threshold is high, say 31 so it's effectively disabled then a 32x10 bit gcd
   420  is done by nibbling away at the 32-bit operands bit-by-bit.  When the
   421  threshold is small, say 1 bit, then an initial x%y is done to reduce it to a
   422  10x10 bit operation.
   423  
   424  The threshold in mpn/generic/gcd_1.c or the various assembler
   425  implementations can be tweaked up or down until there's no more speedups on
   426  interesting combinations of sizes.  Note that this affects only a 1x1 limb
   427  operation and so isn't very important.  (An Nx1 limb operation always does
   428  an initial modular reduction, using mpn_mod_1 or mpn_modexact_1_odd.)
   429  
   430  
   431  
   432  
   433  SPEED PROGRAM EXTENSIONS
   434  
   435  Potentially lots of things could be made available in the program, but it's
   436  been left at only the things that have actually been wanted and are likely
   437  to be reasonably useful in the future.
   438  
   439  Extensions should be fairly easy to make though.  speed-ext.c is an example,
   440  in a style that should suit one-off tests, or new code fragments under
   441  development.
   442  
   443  many.pl is a script for generating a new speed program supplemented with
   444  alternate versions of the standard routines.  It can be used for measuring
   445  experimental code, or for comparing different implementations that exist
   446  within a CPU family.
   447  
   448  
   449  
   450  
   451  THRESHOLD EXAMINING
   452  
   453  The speed program can be used to examine the speeds of different algorithms
   454  to check the tune program has done the right thing.  For example to examine
   455  the karatsuba multiply threshold,
   456  
   457  	./speed -s 5-40 mpn_mul_basecase mpn_kara_mul_n
   458  
   459  When examining the toom3 threshold, remember it depends on the karatsuba
   460  threshold, so the right karatsuba threshold needs to be compiled into the
   461  library first.  The tune program uses specially recompiled versions of
   462  mpn/mul_n.c etc for this reason, but the speed program simply uses the
   463  normal libgmp.la.
   464  
   465  Note further that the various routines may recurse into themselves on sizes
   466  far enough above applicable thresholds.  For example, mpn_kara_mul_n will
   467  recurse into itself on sizes greater than twice the compiled-in
   468  MUL_TOOM22_THRESHOLD.
   469  
   470  When doing the above comparison between mul_basecase and kara_mul_n what's
   471  probably of interest is mul_basecase versus a kara_mul_n that does one level
   472  of Karatsuba then calls to mul_basecase, but this only happens on sizes less
   473  than twice the compiled MUL_TOOM22_THRESHOLD.  A larger value for that
   474  setting can be compiled-in to avoid the problem if necessary.  The same
   475  applies to toom3 and DC, though in a trickier fashion.
   476  
   477  There are some upper limits on some of the thresholds, arising from arrays
   478  dimensioned according to a threshold (mpn_mul_n), or asm code with certain
   479  sized displacements (some x86 versions of sqr_basecase).  So putting huge
   480  values for the thresholds, even just for testing, may fail.
   481  
   482  
   483  
   484  
   485  FUTURE
   486  
   487  Make a program to check the time base is working properly, for small and
   488  large measurements.  Make it able to test each available method, including
   489  perhaps the apparent resolution of each.
   490  
   491  Make a general mechanism for specifying operand overlap, and a syntax like
   492  maybe "mpn_add_n.dst=src2" to select it.  Some measuring routines do this
   493  sort of thing with the "r" parameter currently.
   494  
   495  
   496  
   497  ----------------
   498  Local variables:
   499  mode: text
   500  fill-column: 76
   501  End: