<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="Docutils 0.15.2: http://docutils.sourceforge.net/" />
<title>OpenMP in GraphicsMagick</title>
<link rel="stylesheet" href="docutils-articles.css" type="text/css" />
</head>
<body>

<div class="banner">
<img src="images/gm-107x76.png" alt="GraphicsMagick logo" width="107" height="76" />
<span class="title">GraphicsMagick</span>
<form action="http://www.google.com/search">
<input type="hidden" name="domains" value="www.graphicsmagick.org" />
<input type="hidden" name="sitesearch" value="www.graphicsmagick.org" />
<span class="nowrap"><input type="text" name="q" size="25" maxlength="255" /> <input type="submit" name="sa" value="Search" /></span>
</form>
</div>

<div class="navmenu">
<ul>
<li><a href="index.html">Home</a></li>
<li><a href="project.html">Project</a></li>
<li><a href="download.html">Download</a></li>
<li><a href="README.html">Install</a></li>
<li><a href="Hg.html">Source</a></li>
<li><a href="NEWS.html">News</a></li>
<li><a href="utilities.html">Utilities</a></li>
<li><a href="programming.html">Programming</a></li>
<li><a href="reference.html">Reference</a></li>
</ul>
</div>
<div class="document" id="openmp-in-graphicsmagick">
<h1 class="title">OpenMP in GraphicsMagick</h1>

<!-- -*- mode: rst -*- -->
<!-- This text is in reStructuredText format, so it may look a bit odd. -->
<!-- See http://docutils.sourceforge.net/rst.html for details. -->
<div class="contents local topic" id="contents">
<ul class="simple">
<li><a class="reference internal" href="#overview" id="id1">Overview</a></li>
<li><a class="reference internal" href="#limitations" id="id2">Limitations</a></li>
<li><a class="reference internal" href="#openmp-variables" id="id3">OpenMP Variables</a></li>
</ul>
</div>
<div class="section" id="overview">
<h1><a class="toc-backref" href="#id1">Overview</a></h1>
<p>GraphicsMagick has been transformed to use <a class="reference external" href="http://openmp.org/">OpenMP</a> for the 1.3 release
series. OpenMP is a portable framework for accelerating CPU-bound and
memory-bound operations using multiple threads. OpenMP originates in
the super-computing world and has been available in one form or
another since the late '90s.</p>
<p>Since GCC 4.2 introduced excellent OpenMP support via <a class="reference external" href="http://gcc.gnu.org/onlinedocs/libgomp/">GOMP</a>,
OpenMP has become available to the masses. Recently, <a class="reference external" href="https://clang.llvm.org/">Clang</a> has
also implemented good OpenMP support. Microsoft Visual Studio
Professional 2005 and later support OpenMP, so Windows users can
benefit as well. Any multi-CPU and/or multi-core system is potentially
a good candidate for use with OpenMP. Modern multi-core chipsets from
AMD, Intel, IBM, Oracle, and ARM perform very well with OpenMP.</p>
<p>Most image processing routines consist of loops which iterate
through the image pixels, image rows, or image regions. These loops
are accelerated using OpenMP by executing portions of the total loop
in different threads, and therefore on different processor
cores. CPU-bound algorithms benefit most from OpenMP, but memory-bound
algorithms may benefit as well since the memory is accessed by
different CPU cores, and sometimes the CPUs have their own path to
memory. For example, the AMD Opteron is a <a class="reference external" href="https://en.wikipedia.org/wiki/Non-uniform_memory_access">NUMA</a> (Non-Uniform Memory
Architecture) design such that multi-CPU systems split the system
memory across CPUs, so each CPU adds more memory bandwidth as well.
Server-class CPUs offer more independent memory channels than desktop
CPUs do.</p>
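<p>As an illustration of the general pattern (this is a simplified sketch
rather than code from the GraphicsMagick sources, and <tt class="docutils literal">HAVE_OPENMP</tt> here
stands in for whatever feature-test macro the build provides), a
row-oriented loop can be parallelized by asking OpenMP to divide the
rows among threads:</p>
<pre class="literal-block">
/* Sketch only: invert an RGB image, one row per loop iteration.  Each
   OpenMP thread processes a disjoint subset of the rows. */
typedef struct { unsigned char r, g, b; } Pixel;

static void
InvertImage(Pixel *pixels, unsigned long columns, unsigned long rows)
{
  long y;

#if defined(HAVE_OPENMP)
#  pragma omp parallel for schedule(static)
#endif
  for (y = 0; y &lt; (long) rows; y++)
    {
      Pixel *row = pixels + y * (long) columns;
      unsigned long x;

      for (x = 0; x &lt; columns; x++)
        {
          row[x].r = (unsigned char) (255 - row[x].r);
          row[x].g = (unsigned char) (255 - row[x].g);
          row[x].b = (unsigned char) (255 - row[x].b);
        }
    }
}
</pre>
<p>Because each row is independent, no locking is needed and the work
divides cleanly across cores, which is why loops of this form can
approach linear speed-up.</p>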
<p>For severely CPU-bound algorithms, it is not uncommon to see linear
speed-up (within the constraints of <a class="reference external" href="https://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl's law</a>) in the number
of cores. For example, a two-core system executes the algorithm twice
as fast, and a four-core system executes the algorithm four times as
fast. Memory-bound algorithms scale based on the memory bandwidth
available to the cores. For example, memory-bound algorithms scale up
to almost 1.5X on my four-core Opteron system due to its <a class="reference external" href="https://en.wikipedia.org/wiki/Non-uniform_memory_access">NUMA</a>
architecture. Some systems/CPUs are able to immediately context switch
to another thread if the core would be blocked waiting for memory,
allowing multiple memory accesses to be pending at once, and thereby
improving throughput. For example, a typical speedup of 20-32X (average
24X) has been observed on the Sun SPARC T2 CPU, which provides 8
cores, with 8 virtual CPUs per core (64 threads).</p>
<p>An approach used in GraphicsMagick is to recognize the various access
patterns in the existing code and re-write the algorithms (sometimes
from scratch) to be based on a framework that we call "pixel iterators".
With this approach, the computation is restricted to a small unit (a
callback function) with very well defined properties and no knowledge of
how it is executed or where the data comes from. This approach removes
the loops from the code and puts the loops in the framework, which may be
adjusted based on experience. The continuing strategy will be to
recognize design patterns and build frameworks which support those
patterns. Sometimes algorithms are special/exotic enough that it is much
easier to instrument the code for OpenMP than to attempt to fit
the algorithm into a framework.</p>
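<p>The shape of such a framework can be sketched as follows (this is an
illustrative example with invented names, not the actual GraphicsMagick
pixel-iterator API). The callback holds the algorithm and sees only one
row of pixels; the iterator owns the loop and decides how, and with how
many threads, the rows are traversed:</p>
<pre class="literal-block">
/* Sketch only: the framework owns the loop and the threading policy;
   the callback contains the per-row computation. */
typedef struct { unsigned char r, g, b; } Pixel;

typedef void (*RowCallback)(Pixel *row, unsigned long columns, void *context);

static void
IteratePixelRows(Pixel *pixels, unsigned long columns, unsigned long rows,
                 RowCallback callback, void *context)
{
  long y;

#if defined(HAVE_OPENMP)
#  pragma omp parallel for schedule(static)
#endif
  for (y = 0; y &lt; (long) rows; y++)
    callback(pixels + y * (long) columns, columns, context);
}

/* A callback knows nothing about threads; it just transforms one row. */
static void
ThresholdRow(Pixel *row, unsigned long columns, void *context)
{
  unsigned char threshold = *(unsigned char *) context;
  unsigned long x;

  for (x = 0; x &lt; columns; x++)
    {
      unsigned char v = (row[x].r &gt;= threshold) ? 255 : 0;
      row[x].r = v;
      row[x].g = v;
      row[x].b = v;
    }
}
</pre>
<p>Adjusting the scheduling policy or thread count then only requires
touching the iterator, not the many algorithms written as callbacks.</p>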
<p>Since OpenMP is based on multi-threading, multiple threads access the
underlying pixel storage at once. The interface to this underlying
storage is called the "pixel cache". The original pixel cache code
(derived from ImageMagick) was thread safe only to the extent that it
allowed one thread per image. This code has now been re-written so that
multiple threads may safely and efficiently work on the pixels in one
image. The re-write also makes the pixel cache thread safe if a
multi-threaded application uses an OpenMP-fortified library.</p>
<p>GraphicsMagick provides its own built-in 'benchmark' driver utility
which may be used to execute a multi-threaded benchmark of any other
utility command.</p>
<p>Using the built-in 'benchmark' driver utility, the following is an
example of per-core speed-up due to OpenMP on a four-core AMD Opteron
system (with Firefox and other desktop software still running). The
image is generated dynamically based on the 'granite' pattern and all
the pixel quantum values have 30% gaussian noise added:</p>
<pre class="literal-block">
% gm benchmark -stepthreads 1 -duration 10 convert \
   -size 2048x1080 pattern:granite -operator all Noise-Gaussian 30% null:
Results: 1 threads 5 iter 11.34s user 11.340000s total 0.441 iter/s 0.441 iter/cpu 1.00 speedup 1.000 karp-flatt
Results: 2 threads 9 iter 20.34s user 10.190000s total 0.883 iter/s 0.442 iter/cpu 2.00 speedup 0.000 karp-flatt
Results: 3 threads 14 iter 31.72s user 10.600000s total 1.321 iter/s 0.441 iter/cpu 3.00 speedup 0.001 karp-flatt
Results: 4 threads 18 iter 40.84s user 10.460000s total 1.721 iter/s 0.441 iter/cpu 3.90 speedup 0.008 karp-flatt
</pre>
<p>Note that the "iter/cpu" value is a measure of the number of
iterations given the amount of reported CPU time consumed. It is an
effective measure of relative efficiency since its value should ideally
not drop as threads are added. The <a class="reference external" href="https://en.wikipedia.org/wiki/Karp%E2%80%93Flatt_metric">karp-flatt metric</a> is another
useful metric for evaluating thread-speedup efficiency. In the above
example, the total speedup was about 3.9X, with only a slight loss of
CPU efficiency as threads are added.</p>
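<p>For reference, the karp-flatt figure is the experimentally determined
serial fraction of the computation, and it can be recomputed by hand from
the speedup column (worked here for the four-thread row; the small
discrepancy with the reported 0.008 is rounding of the 3.90 speedup
figure):</p>
<pre class="literal-block">
serial fraction e = (1/speedup - 1/p) / (1 - 1/p)      (p = number of threads)

p = 4, speedup = 3.90:
  e = (1/3.90 - 1/4) / (1 - 1/4)
    = (0.2564 - 0.2500) / 0.7500
    = 0.0085
</pre>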
</div>
<div class="section" id="limitations">
<h1><a class="toc-backref" href="#id2">Limitations</a></h1>
<p>It is often observed that the memory allocation functions (e.g. those
provided by a standard C library such as GNU libc) significantly hinder
performance since they are designed or optimized for single-threaded
programs, or prioritize returning memory to the system over speed.
Memory allocators are usually designed and optimized for programs
which perform thousands of small allocations, and if they make a large
memory allocation, they retain that memory for a long time.
GraphicsMagick performs large memory allocations for raster image
storage interspersed with a limited number of smaller allocations for
supportive data structures. This memory is released very quickly
since GraphicsMagick is highly optimized, and thus the time between
allocation and deallocation can be very short. It has been observed
that some memory allocators are much slower to allocate and deallocate
large amounts of memory (e.g. a hundred megabytes) than alternative
allocators, even in single-threaded programs. Under these conditions,
the program can spend considerable time mysteriously "sleeping".</p>
<p>In order to help surmount problems with the default memory allocators,
the configure script offers support for use of Google <a class="reference external" href="https://github.com/gperftools/gperftools">gperftools</a> <a class="reference external" href="https://github.com/gperftools/gperftools/wiki">'tcmalloc'</a>, Solaris mtmalloc,
and Solaris umem libraries via the --with-tcmalloc, --with-mtmalloc,
and --with-umem options, respectively. When the allocation functions
are behaving badly, memory allocation/deallocation performance
does not scale as threads are added, and thus additional threads spend
more time sleeping (e.g. on a lock, or in munmap()) rather than doing
more work. Performance improvements of a factor of two are not
uncommon even before contending with the huge CPU core/thread counts
available on modern CPUs. Using more threads which are slowed by
poorly-matched memory allocation functions is wasteful of memory,
system resources, human patience, and electrical power.</p>
<p>Many modern CPUs support "Turbo" modes where the CPU clock rate is
boosted if only a few cores are active. When a CPU provides a "Turbo"
mode, this decreases the apparent speed-up compared with using one
thread, because the single thread was executed at a much higher clock
rate. Likewise, when a CPU becomes very hot (due to being heavily
used), it may decrease its clock rates overall to avoid burning up,
and this may also decrease the actual speed-up when using many
threads compared to using one thread. Many CPUs support
"hyperthreads" or other mechanisms in which one physical core will
support multiple light-weight threads; if the core is already efficiently
used by one thread, then this will decrease the apparent per-thread
speed-up, but the peak speed-up will hopefully still be bounded by the
number of physical cores.</p>
<p>In most cases, OpenMP does not speed up loading an image from a file,
or writing an image to a file. It is common for file decode and
encode to take longer than processing the image. Using uncompressed
formats with a fast I/O subsystem (or in-memory 'blobs') is recommended
in order to obtain the greatest speed-up from OpenMP.</p>
<p>It has been observed that sometimes it takes much longer to start and
stop GraphicsMagick than it takes for it to run the requested
algorithm. The slowness is due to inefficiencies of the libraries
that GraphicsMagick is linked with (especially the ICU library that
libxml2 is often linked with). If GraphicsMagick takes too long to
perform trivial operations, then consider using the 'modules' build,
and investigate the 'batch' utility, which allows running many
GraphicsMagick commands as a 'batch' script. If a 'modules' build is
not feasible, then configuring GraphicsMagick to support only the
specific formats actually needed can help with its execution time and
improve the opportunity for OpenMP speed-up.</p>
</div>
<div class="section" id="openmp-variables">
<h1><a class="toc-backref" href="#id3">OpenMP Variables</a></h1>
<p>According to the OpenMP specification, the OMP_NUM_THREADS environment
variable may be used to specify the number of threads available to the
application. Typically this is set to the number of processor cores on
the system, but it may be set lower to limit resource consumption or (in
some cases) to improve execution efficiency. The GraphicsMagick
commands also accept a <tt class="docutils literal"><span class="pre">-limit</span> threads limit</tt> type option for
specifying the maximum number of threads to use.</p>
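<p>For example (the file names and the <tt class="docutils literal">-sharpen</tt> operation below are
only placeholders), either of the following runs a command with at most
two threads:</p>
<pre class="literal-block">
% env OMP_NUM_THREADS=2 gm convert input.png -sharpen 0x1 output.png
% gm convert -limit threads 2 input.png -sharpen 0x1 output.png
</pre>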
<hr class="docutils" />
<div class="line-block">
<div class="line">Copyright (C) 2008 - 2020 GraphicsMagick Group</div>
</div>
<p>This program is covered by multiple licenses, which are described in
Copyright.txt. You should have received a copy of Copyright.txt with this
package; otherwise see <a class="reference external" href="http://www.graphicsmagick.org/Copyright.html">http://www.graphicsmagick.org/Copyright.html</a>.</p>
</div>
</div>
</body>
</html>