github.com/kaydxh/golang@v0.0.131/pkg/gocv/cgo/third_path/graphics-magick/share/doc/GraphicsMagick/www/OpenMP.html

     1  <?xml version="1.0" encoding="utf-8" ?>
     2  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
     3  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
     4  <head>
     5  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
     6  <meta name="generator" content="Docutils 0.15.2: http://docutils.sourceforge.net/" />
     7  <title>OpenMP in GraphicsMagick</title>
     8  <link rel="stylesheet" href="docutils-articles.css" type="text/css" />
     9  </head>
    10  <body>
    11  
    12  <div class="banner">
    13  <img src="images/gm-107x76.png" alt="GraphicsMagick logo" width="107" height="76" />
    14  <span class="title">GraphicsMagick</span>
    15  <form action="http://www.google.com/search">
    16  	<input type="hidden" name="domains" value="www.graphicsmagick.org" />
    17  	<input type="hidden" name="sitesearch" value="www.graphicsmagick.org" />
    18      <span class="nowrap"><input type="text" name="q" size="25" maxlength="255" />&nbsp;<input type="submit" name="sa" value="Search" /></span>
    19  </form>
    20  </div>
    21  
    22  <div class="navmenu">
    23  <ul>
    24  <li><a href="index.html">Home</a></li>
    25  <li><a href="project.html">Project</a></li>
    26  <li><a href="download.html">Download</a></li>
    27  <li><a href="README.html">Install</a></li>
    28  <li><a href="Hg.html">Source</a></li>
    29  <li><a href="NEWS.html">News</a> </li>
    30  <li><a href="utilities.html">Utilities</a></li>
    31  <li><a href="programming.html">Programming</a></li>
    32  <li><a href="reference.html">Reference</a></li>
    33  </ul>
    34  </div>
    35  <div class="document" id="openmp-in-graphicsmagick">
    36  <h1 class="title">OpenMP in GraphicsMagick</h1>
    37  
    38  <!-- -*- mode: rst -*- -->
    39  <!-- This text is in reStructuredText format, so it may look a bit odd. -->
    40  <!-- See http://docutils.sourceforge.net/rst.html for details. -->
    41  <div class="contents local topic" id="contents">
    42  <ul class="simple">
    43  <li><a class="reference internal" href="#overview" id="id1">Overview</a></li>
    44  <li><a class="reference internal" href="#limitations" id="id2">Limitations</a></li>
    45  <li><a class="reference internal" href="#openmp-variables" id="id3">OpenMP Variables</a></li>
    46  </ul>
    47  </div>
    48  <div class="section" id="overview">
    49  <h1><a class="toc-backref" href="#id1">Overview</a></h1>
    50  <p>GraphicsMagick has been transformed to use <a class="reference external" href="http://openmp.org/">OpenMP</a> for the 1.3 release
    51  series. OpenMP is a portable framework for accelerating CPU-bound and
    52  memory-bound operations using multiple threads. OpenMP originates in
    53  the super-computing world and has been available in one form or
    54  another since the late '90s.</p>
    55  <p>Since GCC 4.2 introduced excellent OpenMP support via <a class="reference external" href="http://gcc.gnu.org/onlinedocs/libgomp/">GOMP</a>,
    56  OpenMP has become available to the masses.  Recently, <a class="reference external" href="https://clang.llvm.org/">Clang</a> has
    57  also implemented good OpenMP support. Microsoft Visual Studio
    58  Professional 2005 and later support OpenMP so Windows users can
    59  benefit as well. Any multi-CPU and/or multi-core system is potentially
    60  a good candidate for use with OpenMP.  Modern multi-core chipsets from
    61  AMD, Intel, IBM, Oracle, and ARM perform very well with OpenMP.</p>
    62  <p>Most image processing routines consist of loops which iterate
    63  through the image pixels, image rows, or image regions. These loops
    64  are accelerated using OpenMP by executing portions of the loop
    65  iterations in different threads, and therefore on different processor
    66  cores. CPU-bound algorithms benefit most from OpenMP, but memory-bound
    67  algorithms may also benefit since the memory is accessed by
    68  different CPU cores, and sometimes the CPUs have their own path to
    69  memory. For example, the AMD Opteron is a <a class="reference external" href="https://en.wikipedia.org/wiki/Non-uniform_memory_access">NUMA</a> (Non-Uniform Memory
    70  Architecture) design such that multi-CPU systems split the system
    71  memory across CPUs so each CPU adds more memory bandwidth as well.
    72  Server-class CPUs offer more independent memory channels than desktop
    73  CPUs do.</p>
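        <p>As a minimal sketch of this idea (not GraphicsMagick source code), the
        following C function divides the rows of an 8-bit sample buffer among
        threads with an OpenMP work-sharing loop. The function name
        <tt class="docutils literal">negate_rows</tt> and the buffer layout are assumptions made
        for illustration. When the compiler is run without OpenMP support, the
        pragma is ignored and the loop executes serially:</p>
        <pre class="literal-block">
        #include &lt;stddef.h&gt;

        /* Hypothetical helper (not a GraphicsMagick function): invert 8-bit samples. */
        void negate_rows(unsigned char *pixels, long rows, long row_bytes)
        {
            long y;

            /* Rows are independent, so OpenMP may hand different rows to different threads. */
        #pragma omp parallel for schedule(static)
            for (y = 0; y &lt; rows; y++) {
                unsigned char *row = pixels + (size_t) y * (size_t) row_bytes;
                long x;

                for (x = 0; x &lt; row_bytes; x++)
                    row[x] = (unsigned char) (255 - row[x]);
            }
        }
        </pre>
        <p>Because each thread writes only to its own rows, no locking is needed
        inside the loop.</p>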
    74  <p>For severely CPU-bound algorithms, it is not uncommon to see a linear
    75  speed-up (within the constraints of <a class="reference external" href="https://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl's law</a>) due to the number
    76  of cores. For example, a two core system executes the algorithm twice
    77  as fast, and a four core system executes the algorithm four times as
    78  fast. Memory-bound algorithms scale based on the memory bandwidth
    79  available to the cores. For example, memory-bound algorithms scale up
    80  to almost 1.5X on my four core Opteron system due to its <a class="reference external" href="https://en.wikipedia.org/wiki/Non-uniform_memory_access">NUMA</a>
    81  architecture. Some systems/CPUs are able to immediately context switch
    82  to another thread if the core would be blocked waiting for memory,
    83  allowing multiple memory accesses to be pending at once, and thereby
    84  improving throughput.  For example, typical speedup of 20-32X (average
    85  24X) has been observed on the Sun SPARC T2 CPU, which provides 8
    86  cores, with 8 virtual CPUs per core (64 threads).</p>
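        <p>The bound set by Amdahl's law can be written as
        speedup = 1 / ((1 - p) + p / n), where p is the fraction of the run time
        that can be parallelized and n is the number of threads. A minimal C
        sketch follows; the function name is hypothetical:</p>
        <pre class="literal-block">
        /* Ideal speedup predicted by Amdahl's law for a parallel fraction in
           [0, 1] and a positive thread count. Illustrative helper only. */
        double amdahl_speedup(double parallel_fraction, int threads)
        {
            const double serial_fraction = 1.0 - parallel_fraction;

            return 1.0 / (serial_fraction + parallel_fraction / (double) threads);
        }
        </pre>
        <p>With a fully parallel workload (p = 1.0) and four threads this yields
        the ideal 4X, while p = 0.5 with two threads yields only about 1.33X.</p>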
    87  <p>An approach used in GraphicsMagick is to recognize the various access
    88  patterns in the existing code, and re-write the algorithms (sometimes
    89  from scratch) to be based on a framework that we call &quot;pixel iterators&quot;.
    90  With this approach, the computation is restricted to a small unit (a
    91  callback function) with very well defined properties, and no knowledge as
    92  to how it is executed or where the data comes from. This approach removes
    93  the loops from the code and puts the loops in the framework, which may be
    94  adjusted based on experience. The continuing strategy will be to
    95  recognize design patterns and build frameworks which support those
    96  patterns. Sometimes algorithms are special/exotic enough that it is much
    97  easier to instrument the code for OpenMP rather than to attempt to fit
    98  the algorithm into a framework.</p>
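        <p>The following C sketch illustrates the general shape of such a
        framework. It is not the GraphicsMagick pixel iterator API; every name in
        it (<tt class="docutils literal">row_callback</tt>, <tt class="docutils literal">iterate_rows</tt>,
        <tt class="docutils literal">brighten_row</tt>) is hypothetical. The framework owns the
        loop and decides how rows are distributed, while the callback only
        processes the row it is handed:</p>
        <pre class="literal-block">
        #include &lt;stddef.h&gt;

        /* Hypothetical callback type: process one row of 8-bit samples. */
        typedef void (*row_callback)(unsigned char *row, long columns, void *user_data);

        /* Hypothetical framework function: the loop (and its threading) lives here. */
        void iterate_rows(unsigned char *pixels, long rows, long columns,
                          row_callback callback, void *user_data)
        {
            long y;

        #pragma omp parallel for schedule(static)
            for (y = 0; y &lt; rows; y++)
                callback(pixels + (size_t) y * (size_t) columns, columns, user_data);
        }

        /* Example callback: add a non-negative offset to each sample, clamping at 255. */
        void brighten_row(unsigned char *row, long columns, void *user_data)
        {
            const int offset = *(const int *) user_data;
            long x;

            for (x = 0; x &lt; columns; x++) {
                const int value = row[x] + offset;

                row[x] = (unsigned char) (value &gt; 255 ? 255 : value);
            }
        }
        </pre>
        <p>Since the callback never sees the loop, the scheduling strategy can be
        changed in one place without touching any algorithm.</p>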
    99  <p>Since OpenMP is based on multi-threading, multiple threads access the
   100  underlying pixel storage at once. The interface to this underlying
   101  storage is called the &quot;pixel cache&quot;. The original pixel cache code
   102  (derived from ImageMagick) was thread safe only to the extent that it
   103  allowed one thread per image. This code has now been re-written so that
   104  multiple threads may safely and efficiently work on the pixels in one
   105  image. The re-write also makes the pixel cache thread safe if a
   106  multi-threaded application uses an OpenMP-fortified library.</p>
   107  <p>GraphicsMagick provides its own built-in 'benchmark' driver utility
   108  which may be used to execute a multi-threaded benchmark of any other
   109  utility command.</p>
   110  <p>Using the built-in 'benchmark' driver utility, the following is an
   111  example of per-core speed-up due to OpenMP on a four-core AMD Opteron
   112  system (with Firefox and other desktop software still running).  The
   113  image is generated dynamically based on the 'granite' pattern and all
   114  the pixel quantum values have 30% gaussian noise added:</p>
   115  <pre class="literal-block">
   116  % gm benchmark -stepthreads 1 -duration 10 convert \
   117    -size 2048x1080 pattern:granite -operator all Noise-Gaussian 30% null:
   118  Results: 1 threads 5 iter 11.34s user 11.340000s total 0.441 iter/s 0.441 iter/cpu 1.00 speedup 1.000 karp-flatt
   119  Results: 2 threads 9 iter 20.34s user 10.190000s total 0.883 iter/s 0.442 iter/cpu 2.00 speedup 0.000 karp-flatt
   120  Results: 3 threads 14 iter 31.72s user 10.600000s total 1.321 iter/s 0.441 iter/cpu 3.00 speedup 0.001 karp-flatt
   121  Results: 4 threads 18 iter 40.84s user 10.460000s total 1.721 iter/s 0.441 iter/cpu 3.90 speedup 0.008 karp-flatt
   122  </pre>
   123  <p>Note that the &quot;iter/cpu&quot; value is a measure of the number of
   124  iterations given the amount of reported CPU time consumed. It is an
   125  effective measure of relative efficiency since its value should ideally
   126  not drop as threads are added.  The <a class="reference external" href="https://en.wikipedia.org/wiki/Karp%E2%80%93Flatt_metric">Karp-Flatt metric</a> is another
   127  useful metric for evaluating thread-speedup efficiency. In the above
   128  example, the total speedup was about 3.9X with only a slight loss of
   129  CPU efficiency as threads are added.</p>
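        <p>Assuming the benchmark utility follows the standard definition of the
        Karp-Flatt metric, e = (1/s - 1/p) / (1 - 1/p) for a measured speedup s
        on p threads, the reported values can be approximated with a small
        helper. The function name is hypothetical:</p>
        <pre class="literal-block">
        /* Karp-Flatt metric (experimentally determined serial fraction).
           Requires threads greater than 1 and a positive speedup. */
        double karp_flatt(double speedup, int threads)
        {
            const double inverse_threads = 1.0 / (double) threads;

            return (1.0 / speedup - inverse_threads) / (1.0 - inverse_threads);
        }
        </pre>
        <p>For the four-thread result above, the rounded speedup of 3.90 gives
        roughly 0.0085, in line with the reported 0.008 (the utility presumably
        works from unrounded timings). A value near zero means that very little
        of the run behaves as serial work.</p>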
   130  </div>
   131  <div class="section" id="limitations">
   132  <h1><a class="toc-backref" href="#id2">Limitations</a></h1>
   133  <p>Often it is noticed that the memory allocation functions (e.g. from
   134  the standard C library such as GNU libc) significantly hinder
   135  performance since they are designed or optimized for single-threaded
   136  programs, or prioritize returning memory to the system over speed.
   137  Memory allocators are usually designed and optimized for programs
   138  which perform thousands of small allocations, and if they make a large
   139  memory allocation, they retain that memory for a long time.
   140  GraphicsMagick performs large memory allocations for raster image
   141  storage interspersed with a limited number of smaller allocations for
   142  supportive data structures.  This memory is released very quickly
   143  since GraphicsMagick is highly optimized and thus the time between
   144  allocation and deallocation can be very short.  It has been observed
   145  that some memory allocators are much slower to allocate and deallocate
   146  large amounts of memory (e.g. a hundred megabytes) than alternative
   147  allocators, even in single-threaded programs.  Under these conditions,
   148  the program can spend considerable time mysteriously &quot;sleeping&quot;.</p>
   149  <p>In order to help surmount problems with the default memory allocators,
   150  the configure script offers support for use of Google <a class="reference external" href="https://github.com/gperftools/gperftools">gperftools</a> <a class="reference external" href="https://github.com/gperftools/gperftools/wiki">'tcmalloc'</a>, Solaris mtmalloc,
   151  and Solaris umem libraries via the --with-tcmalloc, --with-mtmalloc,
   152  and --with-umem options, respectively.  When the allocation functions
   153  are behaving badly, the memory allocation/deallocation performance
   154  does not scale as threads are added and thus additional threads spend
   155  more time sleeping (e.g. on a lock, or in munmap()) rather than doing
   156  more work.  Performance improvements of a factor of two are not
   157  uncommon even before contending with the huge CPU core/thread counts
   158  available on modern CPUs.  Using more threads which are slowed by
   159  poorly-matched memory allocation functions is wasteful of memory,
   160  system resources, human patience, and electrical power.</p>
   161  <p>Many modern CPUs support &quot;Turbo&quot; modes where the CPU clock rate is
   162  boosted if only a few cores are active.  When a CPU provides a &quot;Turbo&quot;
   163  mode, this decreases the apparent speed-up compared to using one
   164  thread because the one thread was executed at a much higher clock
   165  rate.  Likewise, when a CPU becomes very hot (due to being heavily
   166  used), it may decrease its clock rates overall to avoid burning up,
   167  and this may also decrease the actual speed-up when using many
   168  threads compared to using one thread.  Many CPUs support
   169  &quot;hyperthreads&quot; or other mechanisms in which one physical core will
   170  support multiple light-weight threads, and if the core is efficiently
   171  used by one thread, then this will decrease the apparent per-thread
   172  speed-up but the peak speed-up will hopefully still be bounded by the
   173  number of physical cores.</p>
   174  <p>In most cases, OpenMP does not speed up loading an image from a file,
   175  or writing an image to a file.  It is common for file decode and
   176  encode to take longer than processing the image.  Using uncompressed
   177  formats is recommended with a fast I/O subsystem (or in-memory 'blobs')
   178  in order to obtain the greatest speed-up from OpenMP.</p>
   179  <p>It has been observed that sometimes it takes much longer to start and
   180  stop GraphicsMagick than it takes for it to run the requested
   181  algorithm.  The slowness is due to inefficiencies of the libraries
   182  that GraphicsMagick is linked with (especially the ICU library that
   183  libxml2 is often linked with).  If GraphicsMagick takes too long to
   184  perform trivial operations, then consider using the 'modules' build,
   185  and investigate the 'batch' utility which allows running many
   186  GraphicsMagick commands as a 'batch' script.  If a 'modules' build is
   187  not feasible, then configuring GraphicsMagick to only support the
   188  specific formats actually needed can help with its execution time and
   189  improve opportunity for OpenMP speed-up.</p>
   190  </div>
   191  <div class="section" id="openmp-variables">
   192  <h1><a class="toc-backref" href="#id3">OpenMP Variables</a></h1>
   193  <p>According to the OpenMP specification, the OMP_NUM_THREADS environment
   194  variable may be used to specify the number of threads available to the
   195  application. Typically this is set to the number of processor cores on
   196  the system but may be set lower to limit resource consumption or (in
   197  some cases) to improve execution efficiency.  The GraphicsMagick
   198  commands also accept a <tt class="docutils literal"><span class="pre">-limit</span> threads limit</tt> type option for
   199  specifying the maximum number of threads to use.</p>
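        <p>An application can observe the effect of OMP_NUM_THREADS through the
        standard OpenMP runtime routine <tt class="docutils literal">omp_get_max_threads()</tt>.
        The sketch below is illustrative only and does not show how
        GraphicsMagick implements its thread limit; both helper names are
        hypothetical:</p>
        <pre class="literal-block">
        #ifdef _OPENMP
        #include &lt;omp.h&gt;
        #endif

        /* Hypothetical helper: threads a new parallel region would use by default. */
        int default_thread_count(void)
        {
        #ifdef _OPENMP
            /* Reflects OMP_NUM_THREADS when set, else an implementation default. */
            return omp_get_max_threads();
        #else
            return 1; /* compiled without OpenMP support */
        #endif
        }

        /* Hypothetical helper: apply a caller-supplied cap; a limit of zero or
           less means "no cap". */
        int limited_thread_count(int limit)
        {
            const int available = default_thread_count();

            return (limit &gt; 0 &amp;&amp; limit &lt; available) ? limit : available;
        }
        </pre>
        <p>Running such a program with, for example, OMP_NUM_THREADS=2 in the
        environment would be expected to make
        <tt class="docutils literal">default_thread_count()</tt> return 2 on an OpenMP-enabled
        build.</p>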
   200  <hr class="docutils" />
   201  <div class="line-block">
   202  <div class="line">Copyright (C) 2008 - 2020 GraphicsMagick Group</div>
   203  </div>
   204  <p>This program is covered by multiple licenses, which are described in
   205  Copyright.txt. You should have received a copy of Copyright.txt with this
   206  package; otherwise see <a class="reference external" href="http://www.graphicsmagick.org/Copyright.html">http://www.graphicsmagick.org/Copyright.html</a>.</p>
   207  </div>
   208  </div>
   209  </body>
   210  </html>