Profiling Go Programs
24 Jun 2011
Tags: benchmark, pprof, profiling, technical

Russ Cox, July 2011; updated by Shenghou Ma, May 2013

* Introduction

At Scala Days 2011, Robert Hundt presented a paper titled
[[http://research.google.com/pubs/pub37122.html][Loop Recognition in C++/Java/Go/Scala.]]
The paper implemented a specific loop finding algorithm, such as you might use
in a flow analysis pass of a compiler, in C++, Go, Java, Scala, and then used
those programs to draw conclusions about typical performance concerns in these
languages.
The Go program presented in that paper runs quite slowly, making it
an excellent opportunity to demonstrate how to use Go's profiling tools to take
a slow program and make it faster.

_By_using_Go's_profiling_tools_to_identify_and_correct_specific_bottlenecks,_we_can_make_the_Go_loop_finding_program_run_an_order_of_magnitude_faster_and_use_6x_less_memory._
(Update: Due to recent optimizations of `libstdc++` in `gcc`, the memory reduction is now 3.7x.)

Hundt's paper does not specify which versions of the C++, Go, Java, and Scala
tools he used.
In this blog post, we will be using the most recent weekly snapshot of the `6g`
Go compiler and the version of `g++` that ships with the Ubuntu Natty
distribution.
(We will not be using Java or Scala, because we are not skilled at writing efficient
programs in either of those languages, so the comparison would be unfair.
Since C++ was the fastest language in the paper, the comparisons here with C++ should
suffice.)
(Update: In this updated post, we will be using the most recent development snapshot
of the Go compiler on amd64 and the most recent version of `g++` -- 4.8.0, which was
released in March 2013.)

	$ go version
	go version devel +08d20469cc20 Tue Mar 26 08:27:18 2013 +0100 linux/amd64
	$ g++ --version
	g++ (GCC) 4.8.0
	Copyright (C) 2013 Free Software Foundation, Inc.
	...
	$

The programs are run on a computer with a 3.4GHz Core i7-2600 CPU and 16 GB of
RAM running Gentoo Linux's 3.8.4-gentoo kernel.
The machine is running with CPU frequency scaling disabled via

	$ sudo bash
	# for i in /sys/devices/system/cpu/cpu[0-7]
	do
	    echo performance > $i/cpufreq/scaling_governor
	done
	#

We've taken [[http://code.google.com/p/multi-language-bench/][Hundt's benchmark programs]]
in C++ and Go, combined each into a single source file, and removed all but one
line of output.
We'll time the program using Linux's `time` utility with a format that shows user time,
system time, real time, and maximum memory usage:

	$ cat xtime
	#!/bin/sh
	/usr/bin/time -f '%Uu %Ss %er %MkB %C' "$@"
	$

	$ make havlak1cc
	g++ -O3 -o havlak1cc havlak1.cc
	$ ./xtime ./havlak1cc
	# of loops: 76002 (total 3800100)
	loop-0, nest: 0, depth: 0
	17.70u 0.05s 17.80r 715472kB ./havlak1cc
	$

	$ make havlak1
	go build havlak1.go
	$ ./xtime ./havlak1
	# of loops: 76000 (including 1 artificial root node)
	25.05u 0.11s 25.20r 1334032kB ./havlak1
	$

The C++ program runs in 17.80 seconds and uses 700 MB of memory.
The Go program runs in 25.20 seconds and uses 1302 MB of memory.
(These measurements are difficult to reconcile with the ones in the paper, but the
point of this post is to explore how to use `go tool pprof`, not to reproduce the
results from the paper.)

To start tuning the Go program, we have to enable profiling.
If the code used the [[http://golang.org/pkg/testing/][Go testing package]]'s
benchmarking support, we could use the standard `-cpuprofile` and `-memprofile`
flags of `go test`.
In a standalone program like this one, we have to import `runtime/pprof` and add a few
lines of code:

	var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to file")

	func main() {
	    flag.Parse()
	    if *cpuprofile != "" {
	        f, err := os.Create(*cpuprofile)
	        if err != nil {
	            log.Fatal(err)
	        }
	        pprof.StartCPUProfile(f)
	        defer pprof.StopCPUProfile()
	    }
	    ...

The new code defines a flag named `cpuprofile`, calls the
[[http://golang.org/pkg/flag/][Go flag library]] to parse the command line flags,
and then, if the `cpuprofile` flag has been set on the command line,
[[http://golang.org/pkg/runtime/pprof/#StartCPUProfile][starts CPU profiling]]
redirected to that file.
The profiler requires a final call to
[[http://golang.org/pkg/runtime/pprof/#StopCPUProfile][`StopCPUProfile`]] to
flush any pending writes to the file before the program exits; we use `defer`
to make sure this happens as `main` returns.
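
For reference, here is a minimal, self-contained sketch of that pattern; `doWork`
is a placeholder for the real program, and the error returned by `StartCPUProfile`
is checked as well:

	package main

	import (
	    "flag"
	    "log"
	    "os"
	    "runtime/pprof"
	)

	var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to file")

	var sink int

	// doWork stands in for the workload being profiled.
	func doWork() {
	    for i := 0; i < 1e8; i++ {
	        sink += i
	    }
	}

	func main() {
	    flag.Parse()
	    if *cpuprofile != "" {
	        f, err := os.Create(*cpuprofile)
	        if err != nil {
	            log.Fatal(err)
	        }
	        if err := pprof.StartCPUProfile(f); err != nil {
	            log.Fatal(err)
	        }
	        defer pprof.StopCPUProfile()
	    }
	    doWork()
	}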

After adding that code, we can run the program with the new `-cpuprofile` flag
and then run `go tool pprof` to interpret the profile.

	$ make havlak1.prof
	./havlak1 -cpuprofile=havlak1.prof
	# of loops: 76000 (including 1 artificial root node)
	$ go tool pprof havlak1 havlak1.prof
	Welcome to pprof!  For help, type 'help'.
	(pprof)

The `go tool pprof` program is a slight variant of
[[https://code.google.com/p/gperftools/wiki/GooglePerformanceTools][Google's `pprof` C++ profiler]].
The most important command is `topN`, which shows the top `N` samples in the profile:

	(pprof) top10
	Total: 2525 samples
	     298  11.8%  11.8%      345  13.7% runtime.mapaccess1_fast64
	     268  10.6%  22.4%     2124  84.1% main.FindLoops
	     251   9.9%  32.4%      451  17.9% scanblock
	     178   7.0%  39.4%      351  13.9% hash_insert
	     131   5.2%  44.6%      158   6.3% sweepspan
	     119   4.7%  49.3%      350  13.9% main.DFS
	      96   3.8%  53.1%       98   3.9% flushptrbuf
	      95   3.8%  56.9%       95   3.8% runtime.aeshash64
	      95   3.8%  60.6%      101   4.0% runtime.settype_flush
	      88   3.5%  64.1%      988  39.1% runtime.mallocgc

When CPU profiling is enabled, the Go program stops about 100 times per second
and records a sample consisting of the program counters on the currently executing
goroutine's stack.
The profile has 2525 samples, so it was running for a bit over 25 seconds.
In the `go tool pprof` output, there is a row for each function that appeared in
a sample.
The first two columns show the number of samples in which the function was running
(as opposed to waiting for a called function to return), as a raw count and as a
percentage of total samples.
The `runtime.mapaccess1_fast64` function was running during 298 samples, or 11.8%.
The `top10` output is sorted by this sample count.
The third column shows the running total during the listing:
the first three rows account for 32.4% of the samples.
The fourth and fifth columns show the number of samples in which the function appeared
(either running or waiting for a called function to return).
The `main.FindLoops` function was running in 10.6% of the samples, but it was on the
call stack (it or functions it called were running) in 84.1% of the samples.

To sort by the fourth and fifth columns, use the `-cum` (for cumulative) flag:

	(pprof) top5 -cum
	Total: 2525 samples
	       0   0.0%   0.0%     2144  84.9% gosched0
	       0   0.0%   0.0%     2144  84.9% main.main
	       0   0.0%   0.0%     2144  84.9% runtime.main
	       0   0.0%   0.0%     2124  84.1% main.FindHavlakLoops
	     268  10.6%  10.6%     2124  84.1% main.FindLoops
	(pprof) top5 -cum

In fact the total for `main.FindLoops` and `main.main` should have been 100%, but
each stack sample only includes the bottom 100 stack frames; during about a quarter
of the samples, the recursive `main.DFS` function was more than 100 frames deeper
than `main.main` so the complete trace was truncated.

The stack trace samples contain more interesting data about function call relationships
than the text listings can show.
The `web` command writes a graph of the profile data in SVG format and opens it in a web
browser.
(There is also a `gv` command that writes PostScript and opens it in Ghostview.
For either command, you need [[http://www.graphviz.org/][graphviz]] installed.)

	(pprof) web

A small fragment of
[[http://benchgraffiti.googlecode.com/hg/havlak/havlak1.svg][the full graph]] looks like:

.image profiling-go-programs_havlak1a-75.png

Each box in the graph corresponds to a single function, and the boxes are sized
according to the number of samples in which the function was running.
An edge from box X to box Y indicates that X calls Y; the number along the edge is
the number of times that call appears in a sample.
If a call appears multiple times in a single sample, such as during recursive function
calls, each appearance counts toward the edge weight.
That explains the 21342 on the self-edge from `main.DFS` to itself.

Just at a glance, we can see that the program spends much of its time in hash
operations, which correspond to use of Go's `map` values.
We can tell `web` to use only samples that include a specific function, such as
`runtime.mapaccess1_fast64`, which clears some of the noise from the graph:

	(pprof) web mapaccess1

.image profiling-go-programs_havlak1-hash_lookup-75.png

If we squint, we can see that the calls to `runtime.mapaccess1_fast64` are being
made by `main.FindLoops` and `main.DFS`.

Now that we have a rough idea of the big picture, it's time to zoom in on a particular
function.
Let's look at `main.DFS` first, just because it is a shorter function:

	(pprof) list DFS
	Total: 2525 samples
	ROUTINE ====================== main.DFS in /home/rsc/g/benchgraffiti/havlak/havlak1.go
	   119    697 Total samples (flat / cumulative)
	     3      3  240: func DFS(currentNode *BasicBlock, nodes []*UnionFindNode, number map[*BasicBlock]int, last []int, current int) int {
	     1      1  241:     nodes[current].Init(currentNode, current)
	     1     37  242:     number[currentNode] = current
	     .      .  243:
	     1      1  244:     lastid := current
	    89     89  245:     for _, target := range currentNode.OutEdges {
	     9    152  246:             if number[target] == unvisited {
	     7    354  247:                     lastid = DFS(target, nodes, number, last, lastid+1)
	     .      .  248:             }
	     .      .  249:     }
	     7     59  250:     last[number[currentNode]] = lastid
	     1      1  251:     return lastid
	(pprof)

The listing shows the source code for the `DFS` function (really, for every function
matching the regular expression `DFS`).
The first three columns are the number of samples taken while running that line, the
number of samples taken while running that line or in code called from that line, and
the line number in the file.
The related command `disasm` shows a disassembly of the function instead of a source
listing; when there are enough samples this can help you see which instructions are
expensive.
The `weblist` command mixes the two modes: it shows
[[http://benchgraffiti.googlecode.com/hg/havlak/havlak1.html][a source listing in which clicking a line shows the disassembly]].

Since we already know that the time is going into map lookups implemented by the
hash runtime functions, we care most about the second column.
A large fraction of time is spent in recursive calls to `DFS` (line 247), as would be
expected from a recursive traversal.
Excluding the recursion, it looks like the time is going into the accesses to the
`number` map on lines 242, 246, and 250.
For that particular lookup, a map is not the most efficient choice.
Just as they would be in a compiler, the basic block structures have unique sequence
numbers assigned to them.
Instead of using a `map[*BasicBlock]int` we can use a `[]int`, a slice indexed by the
block number.
There's no reason to use a map when an array or slice will do.
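
The linked diff below has the real change; as a self-contained illustration of the
idea, with heavily simplified stand-in types rather than the ones in the havlak
program, replacing a pointer-keyed map with a slice indexed by the block's sequence
number looks like this:

	package main

	import "fmt"

	// Simplified stand-ins: each basic block already carries a unique
	// small integer, here called Name.
	type BasicBlock struct {
	    Name     int
	    OutEdges []*BasicBlock
	}

	const unvisited = -1

	// numberDFS assigns depth-first numbers using a slice indexed by
	// block Name instead of a map[*BasicBlock]int.
	func numberDFS(b *BasicBlock, number []int, current int) int {
	    number[b.Name] = current
	    lastid := current
	    for _, target := range b.OutEdges {
	        if number[target.Name] == unvisited {
	            lastid = numberDFS(target, number, lastid+1)
	        }
	    }
	    return lastid
	}

	func main() {
	    // A tiny three-block graph: 0 -> 1 -> 2.
	    b2 := &BasicBlock{Name: 2}
	    b1 := &BasicBlock{Name: 1, OutEdges: []*BasicBlock{b2}}
	    b0 := &BasicBlock{Name: 0, OutEdges: []*BasicBlock{b1}}

	    number := []int{unvisited, unvisited, unvisited}
	    numberDFS(b0, number, 0)
	    fmt.Println(number) // [0 1 2]
	}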

Changing `number` from a map to a slice requires editing seven lines in the program
and cuts its run time by nearly a factor of two:

	$ make havlak2
	go build havlak2.go
	$ ./xtime ./havlak2
	# of loops: 76000 (including 1 artificial root node)
	16.55u 0.11s 16.69r 1321008kB ./havlak2
	$

(See the [[http://code.google.com/p/benchgraffiti/source/diff?name=34f7624bb2e2&r=240c155236f9&format=unidiff&path=/havlak/havlak.go][diff between `havlak1` and `havlak2`]])

We can run the profiler again to confirm that `main.DFS` is no longer a significant
part of the run time:

	$ make havlak2.prof
	./havlak2 -cpuprofile=havlak2.prof
	# of loops: 76000 (including 1 artificial root node)
	$ go tool pprof havlak2 havlak2.prof
	Welcome to pprof!  For help, type 'help'.
	(pprof)
	(pprof) top5
	Total: 1652 samples
	     197  11.9%  11.9%      382  23.1% scanblock
	     189  11.4%  23.4%     1549  93.8% main.FindLoops
	     130   7.9%  31.2%      152   9.2% sweepspan
	     104   6.3%  37.5%      896  54.2% runtime.mallocgc
	      98   5.9%  43.5%      100   6.1% flushptrbuf
	(pprof)

The entry `main.DFS` no longer appears in the profile, and the rest of the program
runtime has dropped too.
Now the program is spending most of its time allocating memory and garbage collecting
(`runtime.mallocgc`, which both allocates and runs periodic garbage collections,
accounts for 54.2% of the time).
To find out why the garbage collector is running so much, we have to find out what is
allocating memory.
One way is to add memory profiling to the program.
We'll arrange that if the `-memprofile` flag is supplied, the program stops after one
iteration of the loop finding, writes a memory profile, and exits:

	var memprofile = flag.String("memprofile", "", "write memory profile to this file")
	...

		FindHavlakLoops(cfgraph, lsgraph)
		if *memprofile != "" {
			f, err := os.Create(*memprofile)
			if err != nil {
			    log.Fatal(err)
			}
			pprof.WriteHeapProfile(f)
			f.Close()
			return
		}

We invoke the program with the `-memprofile` flag to write a profile:

	$ make havlak3.mprof
	go build havlak3.go
	./havlak3 -memprofile=havlak3.mprof
	$

(See the [[http://code.google.com/p/benchgraffiti/source/diff?name=240c155236f9&r=796913012f93&format=unidiff&path=/havlak/havlak.go][diff from havlak2]])

We use `go tool pprof` exactly the same way. Now the samples we are examining are
memory allocations, not clock ticks.

	$ go tool pprof havlak3 havlak3.mprof
	Adjusting heap profiles for 1-in-524288 sampling rate
	Welcome to pprof!  For help, type 'help'.
	(pprof) top5
	Total: 82.4 MB
	    56.3  68.4%  68.4%     56.3  68.4% main.FindLoops
	    17.6  21.3%  89.7%     17.6  21.3% main.(*CFG).CreateNode
	     8.0   9.7%  99.4%     25.6  31.0% main.NewBasicBlockEdge
	     0.5   0.6% 100.0%      0.5   0.6% itab
	     0.0   0.0% 100.0%      0.5   0.6% fmt.init
	(pprof)

The command `go tool pprof` reports that `FindLoops` has allocated approximately
56.3 of the 82.4 MB in use; `CreateNode` accounts for another 17.6 MB.
To reduce overhead, the memory profiler only records information for approximately
one block per half megabyte allocated (the “1-in-524288 sampling rate”), so these
are approximations to the actual counts.
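
If more precise numbers are needed, the sampling rate can be raised by setting the
runtime's `MemProfileRate` variable early in the program; its default corresponds to
the one-block-per-512 kB rate above. A minimal sketch, at the cost of extra profiling
overhead:

	package main

	import (
	    "runtime"
	)

	func main() {
	    // Record every allocation instead of sampling one block per
	    // 512 kB; this must be set before the allocations of interest
	    // happen, so do it at the very start of main.
	    runtime.MemProfileRate = 1

	    // ... the rest of the program, including the -memprofile
	    // handling shown above, stays the same ...
	}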

To find the memory allocations, we can list those functions.

	(pprof) list FindLoops
	Total: 82.4 MB
	ROUTINE ====================== main.FindLoops in /home/rsc/g/benchgraffiti/havlak/havlak3.go
	  56.3   56.3 Total MB (flat / cumulative)
	...
	   1.9    1.9  268:     nonBackPreds := make([]map[int]bool, size)
	   5.8    5.8  269:     backPreds := make([][]int, size)
	     .      .  270:
	   1.9    1.9  271:     number := make([]int, size)
	   1.9    1.9  272:     header := make([]int, size, size)
	   1.9    1.9  273:     types := make([]int, size, size)
	   1.9    1.9  274:     last := make([]int, size, size)
	   1.9    1.9  275:     nodes := make([]*UnionFindNode, size, size)
	     .      .  276:
	     .      .  277:     for i := 0; i < size; i++ {
	   9.5    9.5  278:             nodes[i] = new(UnionFindNode)
	     .      .  279:     }
	...
	     .      .  286:     for i, bb := range cfgraph.Blocks {
	     .      .  287:             number[bb.Name] = unvisited
	  29.5   29.5  288:             nonBackPreds[i] = make(map[int]bool)
	     .      .  289:     }
	...

It looks like the current bottleneck is the same as the last one: using maps where
simpler data structures suffice.
`FindLoops` is allocating about 29.5 MB of maps.

As an aside, if we run `go tool pprof` with the `--inuse_objects` flag, it will
report allocation counts instead of sizes:

	$ go tool pprof --inuse_objects havlak3 havlak3.mprof
	Adjusting heap profiles for 1-in-524288 sampling rate
	Welcome to pprof!  For help, type 'help'.
	(pprof) list FindLoops
	Total: 1763108 objects
	ROUTINE ====================== main.FindLoops in /home/rsc/g/benchgraffiti/havlak/havlak3.go
	720903 720903 Total objects (flat / cumulative)
	...
	     .      .  277:     for i := 0; i < size; i++ {
	311296 311296  278:             nodes[i] = new(UnionFindNode)
	     .      .  279:     }
	     .      .  280:
	     .      .  281:     // Step a:
	     .      .  282:     //   - initialize all nodes as unvisited.
	     .      .  283:     //   - depth-first traversal and numbering.
	     .      .  284:     //   - unreached BB's are marked as dead.
	     .      .  285:     //
	     .      .  286:     for i, bb := range cfgraph.Blocks {
	     .      .  287:             number[bb.Name] = unvisited
	409600 409600  288:             nonBackPreds[i] = make(map[int]bool)
	     .      .  289:     }
	...
	(pprof)

Since the ~200,000 maps account for 29.5 MB, it looks like the initial map allocation
takes about 150 bytes.
That's reasonable when a map is being used to hold key-value pairs, but not when a map
is being used as a stand-in for a simple set, as it is here.

Instead of using a map, we can use a simple slice to list the elements.
In all but one of the cases where maps are being used, it is impossible for the algorithm
to insert a duplicate element.
In the one remaining case, we can write a simple variant of the `append` built-in function:

	func appendUnique(a []int, x int) []int {
	    for _, y := range a {
	        if x == y {
	            return a
	        }
	    }
	    return append(a, x)
	}
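
Here is a small, self-contained sketch of how a slice plus `appendUnique` replaces a
`map[int]bool` used as a set (the variable names are illustrative, not the ones in the
diff below):

	package main

	import "fmt"

	func appendUnique(a []int, x int) []int {
	    for _, y := range a {
	        if x == y {
	            return a
	        }
	    }
	    return append(a, x)
	}

	func main() {
	    // Map-as-set version: each set costs an allocation up front,
	    // even before any elements are added.
	    set := make(map[int]bool)
	    set[7] = true
	    set[3] = true
	    set[7] = true // duplicate, ignored by the map

	    // Slice version: nothing is allocated until the first append,
	    // and appendUnique rejects the duplicate.
	    var list []int
	    list = appendUnique(list, 7)
	    list = appendUnique(list, 3)
	    list = appendUnique(list, 7)

	    fmt.Println(len(set), list) // 2 [7 3]
	}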

In addition to writing that function, changing the Go program to use slices instead
of maps requires changing just a few lines of code.

	$ make havlak4
	go build havlak4.go
	$ ./xtime ./havlak4
	# of loops: 76000 (including 1 artificial root node)
	11.84u 0.08s 11.94r 810416kB ./havlak4
	$

(See the [[http://code.google.com/p/benchgraffiti/source/diff?name=796913012f93&r=d856c2f698c1&format=unidiff&path=/havlak/havlak.go][diff from havlak3]])

We're now 2.11x faster than when we started. Let's look at a CPU profile again.

	$ make havlak4.prof
	./havlak4 -cpuprofile=havlak4.prof
	# of loops: 76000 (including 1 artificial root node)
	$ go tool pprof havlak4 havlak4.prof
	Welcome to pprof!  For help, type 'help'.
	(pprof) top10
	Total: 1173 samples
	     205  17.5%  17.5%     1083  92.3% main.FindLoops
	     138  11.8%  29.2%      215  18.3% scanblock
	      88   7.5%  36.7%       96   8.2% sweepspan
	      76   6.5%  43.2%      597  50.9% runtime.mallocgc
	      75   6.4%  49.6%       78   6.6% runtime.settype_flush
	      74   6.3%  55.9%       75   6.4% flushptrbuf
	      64   5.5%  61.4%       64   5.5% runtime.memmove
	      63   5.4%  66.8%      524  44.7% runtime.growslice
	      51   4.3%  71.1%       51   4.3% main.DFS
	      50   4.3%  75.4%      146  12.4% runtime.MCache_Alloc
	(pprof)

Now memory allocation and the consequent garbage collection (`runtime.mallocgc`)
account for 50.9% of our run time.
Another way to look at why the system is garbage collecting is to look at the
allocations that are causing the collections, the ones that spend most of the time
in `mallocgc`:

	(pprof) web mallocgc

.image profiling-go-programs_havlak4a-mallocgc.png

It's hard to tell what's going on in that graph, because there are many nodes with
small sample numbers obscuring the big ones.
We can tell `go tool pprof` to ignore nodes that don't account for at least 10% of
the samples:

	$ go tool pprof --nodefraction=0.1 havlak4 havlak4.prof
	Welcome to pprof!  For help, type 'help'.
	(pprof) web mallocgc

.image profiling-go-programs_havlak4a-mallocgc-trim.png

We can follow the thick arrows easily now, to see that `FindLoops` is triggering
most of the garbage collection.
If we list `FindLoops` we can see that much of it is right at the beginning:

	(pprof) list FindLoops
	...
	     .      .  270: func FindLoops(cfgraph *CFG, lsgraph *LSG) {
	     .      .  271:     if cfgraph.Start == nil {
	     .      .  272:             return
	     .      .  273:     }
	     .      .  274:
	     .      .  275:     size := cfgraph.NumNodes()
	     .      .  276:
	     .    145  277:     nonBackPreds := make([][]int, size)
	     .      9  278:     backPreds := make([][]int, size)
	     .      .  279:
	     .      1  280:     number := make([]int, size)
	     .     17  281:     header := make([]int, size, size)
	     .      .  282:     types := make([]int, size, size)
	     .      .  283:     last := make([]int, size, size)
	     .      .  284:     nodes := make([]*UnionFindNode, size, size)
	     .      .  285:
	     .      .  286:     for i := 0; i < size; i++ {
	     2     79  287:             nodes[i] = new(UnionFindNode)
	     .      .  288:     }
	...
	(pprof)

Every time `FindLoops` is called, it allocates some sizable bookkeeping structures.
Since the benchmark calls `FindLoops` 50 times, these add up to a significant amount
of garbage, so a significant amount of work for the garbage collector.

Having a garbage-collected language doesn't mean you can ignore memory allocation
issues.
In this case, a simple solution is to introduce a cache so that each call to `FindLoops`
reuses the previous call's storage when possible.
(In fact, in Hundt's paper, he explains that the Java program needed just this change to
get anything like reasonable performance, but he did not make the same change in the
other garbage-collected implementations.)

We'll add a global `cache` structure:

	var cache struct {
	    size int
	    nonBackPreds [][]int
	    backPreds [][]int
	    number []int
	    header []int
	    types []int
	    last []int
	    nodes []*UnionFindNode
	}

and then have `FindLoops` consult it as a replacement for allocation:

	    if cache.size < size {
	        cache.size = size
	        cache.nonBackPreds = make([][]int, size)
	        cache.backPreds = make([][]int, size)
	        cache.number = make([]int, size)
	        cache.header = make([]int, size)
	        cache.types = make([]int, size)
	        cache.last = make([]int, size)
	        cache.nodes = make([]*UnionFindNode, size)
	        for i := range cache.nodes {
	            cache.nodes[i] = new(UnionFindNode)
	        }
	    }

	    nonBackPreds := cache.nonBackPreds[:size]
	    for i := range nonBackPreds {
	        nonBackPreds[i] = nonBackPreds[i][:0]
	    }
	    backPreds := cache.backPreds[:size]
	    for i := range backPreds {
	        backPreds[i] = backPreds[i][:0]
	    }
	    number := cache.number[:size]
	    header := cache.header[:size]
	    types := cache.types[:size]
	    last := cache.last[:size]
	    nodes := cache.nodes[:size]

Such a global variable is bad engineering practice, of course: it means that
concurrent calls to `FindLoops` are now unsafe.
For now, we are making the minimal possible changes in order to understand what
is important for the performance of our program; this change is simple and mirrors
the code in the Java implementation.
The final version of the Go program will use a separate `LoopFinder` instance to
track this memory, restoring the possibility of concurrent use.
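
As a rough, hypothetical sketch of that final shape (the names here are illustrative
and may not match havlak6.go exactly), the cached slices become fields of a struct and
`FindLoops` becomes a method on it:

	// A LoopFinder owns the bookkeeping storage that the global cache
	// held above, so each caller can use its own instance concurrently.
	type LoopFinder struct {
	    size         int
	    nonBackPreds [][]int
	    backPreds    [][]int
	    number       []int
	    header       []int
	    types        []int
	    last         []int
	    nodes        []*UnionFindNode
	}

	// FindLoops reuses the receiver's buffers across calls, growing
	// them only when the graph is larger than anything seen before.
	func (lf *LoopFinder) FindLoops(cfgraph *CFG, lsgraph *LSG) {
	    // same body as before, reading and writing lf.nonBackPreds,
	    // lf.backPreds, and so on instead of the global cache
	}

For now, though, the simple global cache is enough to measure the effect: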

	$ make havlak5
	go build havlak5.go
	$ ./xtime ./havlak5
	# of loops: 76000 (including 1 artificial root node)
	8.03u 0.06s 8.11r 770352kB ./havlak5
	$

(See the [[http://code.google.com/p/benchgraffiti/source/diff?name=d856c2f698c1&r=5ce46b0ee1db&format=unidiff&path=/havlak/havlak.go][diff from havlak4]])

There's more we can do to clean up the program and make it faster, but none of
it requires profiling techniques that we haven't already shown.
The work list used in the inner loop can be reused across iterations and across
calls to `FindLoops`, and it can be combined with the separate “node pool” generated
during that pass.
Similarly, the loop graph storage can be reused on each iteration instead of reallocated.
In addition to these performance changes, the
[[http://code.google.com/p/benchgraffiti/source/browse/havlak/havlak6.go][final version]]
is written using idiomatic Go style, using data structures and methods.
The stylistic changes have only a minor effect on the run time: the algorithm and
constraints are unchanged.

The final version runs in 2.29 seconds and uses 351 MB of memory:

	$ make havlak6
	go build havlak6.go
	$ ./xtime ./havlak6
	# of loops: 76000 (including 1 artificial root node)
	2.26u 0.02s 2.29r 360224kB ./havlak6
	$

That's 11 times faster than the program we started with.
Even if we disable reuse of the generated loop graph, so that the only cached memory
is the loop finding bookkeeping, the program still runs 6.7x faster than the original
and uses 1.5x less memory.

	$ ./xtime ./havlak6 -reuseloopgraph=false
	# of loops: 76000 (including 1 artificial root node)
	3.69u 0.06s 3.76r 797120kB ./havlak6 -reuseloopgraph=false
	$

Of course, it's no longer fair to compare this Go program to the original C++
program, which used inefficient data structures like `set`s where `vector`s would
be more appropriate.
As a sanity check, we translated the final Go program into
[[http://code.google.com/p/benchgraffiti/source/browse/havlak/havlak6.cc][equivalent C++ code]].
Its execution time is similar to the Go program's:

	$ make havlak6cc
	g++ -O3 -o havlak6cc havlak6.cc
	$ ./xtime ./havlak6cc
	# of loops: 76000 (including 1 artificial root node)
	1.99u 0.19s 2.19r 387936kB ./havlak6cc

The Go program runs almost as fast as the C++ program.
As the C++ program is using automatic deletes and allocation instead of an explicit
cache, the C++ program is a bit shorter and easier to write, but not dramatically so:

	$ wc havlak6.cc; wc havlak6.go
	 401 1220 9040 havlak6.cc
	 461 1441 9467 havlak6.go
	$

(See [[http://code.google.com/p/benchgraffiti/source/browse/havlak/havlak6.cc][havlak6.cc]]
and [[http://code.google.com/p/benchgraffiti/source/browse/havlak/havlak6.go][havlak6.go]])

Benchmarks are only as good as the programs they measure.
We used `go tool pprof` to study an inefficient Go program and then to improve its
performance by an order of magnitude and to reduce its memory usage by a factor of 3.7.
A subsequent comparison with an equivalently optimized C++ program shows that Go can be
competitive with C++ when programmers are careful about how much garbage is generated
by inner loops.

The program sources, Linux x86-64 binaries, and profiles used to write this post
are available in the [[http://code.google.com/p/benchgraffiti/][benchgraffiti project on Google Code]].

As mentioned above, [[http://golang.org/cmd/go/#Test_packages][`go test`]] includes
these profiling flags already: define a
[[http://golang.org/pkg/testing/][benchmark function]] and you're all set.
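
For example, a benchmark for the loop finder might look like the following sketch, in
a `_test.go` file; the benchmark name and the `buildTestGraphs` helper are hypothetical,
standing in for however the package constructs its control-flow and loop-structure graphs:

	package main

	import "testing"

	func BenchmarkFindHavlakLoops(b *testing.B) {
	    // buildTestGraphs is a hypothetical helper that builds the
	    // cfgraph and lsgraph once, outside the timed loop.
	    cfgraph, lsgraph := buildTestGraphs()
	    b.ResetTimer()
	    for i := 0; i < b.N; i++ {
	        FindHavlakLoops(cfgraph, lsgraph)
	    }
	}

Running `go test` with `-bench` and the profiling flags then writes the profile files,
along with the compiled test binary, for use with `go tool pprof`:

	$ go test -bench=FindHavlakLoops -cpuprofile=cpu.prof -memprofile=mem.prof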

There is also a standard HTTP interface to profiling data. In an HTTP server, adding

	import _ "net/http/pprof"

will install handlers for a few URLs under `/debug/pprof/`.
Then you can run `go tool pprof` with a single argument, the URL to your server's
profiling data, and it will download and examine a live profile.

	go tool pprof http://localhost:6060/debug/pprof/profile   # 30-second CPU profile
	go tool pprof http://localhost:6060/debug/pprof/heap      # heap profile
	go tool pprof http://localhost:6060/debug/pprof/block     # goroutine blocking profile
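
For a program whose only job during development is to expose those endpoints, a
minimal sketch looks like this (port 6060 matches the URLs above):

	package main

	import (
	    "log"
	    "net/http"
	    _ "net/http/pprof" // registers the /debug/pprof/ handlers on the default mux
	)

	func main() {
	    // Any real handlers would be registered here as well; the
	    // blank import above is what installs the profiling URLs.
	    log.Fatal(http.ListenAndServe("localhost:6060", nil))
	}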

The goroutine blocking profile will be explained in a future post. Stay tuned.