Profiling Go Programs
24 Jun 2011
Tags: benchmark, pprof, profiling, technical

Russ Cox, July 2011; updated by Shenghou Ma, May 2013

* Introduction

At Scala Days 2011, Robert Hundt presented a paper titled
[[http://research.google.com/pubs/pub37122.html][Loop Recognition in C++/Java/Go/Scala.]]
The paper implemented a specific loop finding algorithm, such as you might use
in a flow analysis pass of a compiler, in C++, Go, Java, and Scala, and then used
those programs to draw conclusions about typical performance concerns in these
languages.
The Go program presented in that paper runs quite slowly, making it
an excellent opportunity to demonstrate how to use Go's profiling tools to take
a slow program and make it faster.

_By_using_Go's_profiling_tools_to_identify_and_correct_specific_bottlenecks,_we_can_make_the_Go_loop_finding_program_run_an_order_of_magnitude_faster_and_use_6x_less_memory._
(Update: Due to recent optimizations of `libstdc++` in `gcc`, the memory reduction is now 3.7x.)

Hundt's paper does not specify which versions of the C++, Go, Java, and Scala
tools he used.
In this blog post, we will be using the most recent weekly snapshot of the `6g`
Go compiler and the version of `g++` that ships with the Ubuntu Natty
distribution.
(We will not be using Java or Scala, because we are not skilled at writing efficient
programs in either of those languages, so the comparison would be unfair.
Since C++ was the fastest language in the paper, the comparisons here with C++ should
suffice.)
(Update: In this updated post, we will be using the most recent development snapshot
of the Go compiler on amd64 and the most recent version of `g++` -- 4.8.0, which was
released in March 2013.)

	$ go version
	go version devel +08d20469cc20 Tue Mar 26 08:27:18 2013 +0100 linux/amd64
	$ g++ --version
	g++ (GCC) 4.8.0
	Copyright (C) 2013 Free Software Foundation, Inc.
	...
	$

The programs are run on a computer with a 3.4GHz Core i7-2600 CPU and 16 GB of
RAM running Gentoo Linux's 3.8.4-gentoo kernel.
The machine is running with CPU frequency scaling disabled via

	$ sudo bash
	# for i in /sys/devices/system/cpu/cpu[0-7]
	do
		echo performance > $i/cpufreq/scaling_governor
	done
	#

We've taken [[http://code.google.com/p/multi-language-bench/][Hundt's benchmark programs]]
in C++ and Go, combined each into a single source file, and removed all but one
line of output.
We'll time the program using Linux's `time` utility with a format that shows user time,
system time, real time, and maximum memory usage:

	$ cat xtime
	#!/bin/sh
	/usr/bin/time -f '%Uu %Ss %er %MkB %C' "$@"
	$

	$ make havlak1cc
	g++ -O3 -o havlak1cc havlak1.cc
	$ ./xtime ./havlak1cc
	# of loops: 76002 (total 3800100)
	loop-0, nest: 0, depth: 0
	17.70u 0.05s 17.80r 715472kB ./havlak1cc
	$

	$ make havlak1
	go build havlak1.go
	$ ./xtime ./havlak1
	# of loops: 76000 (including 1 artificial root node)
	25.05u 0.11s 25.20r 1334032kB ./havlak1
	$

The C++ program runs in 17.80 seconds and uses 700 MB of memory.
The Go program runs in 25.20 seconds and uses 1302 MB of memory.
(These measurements are difficult to reconcile with the ones in the paper, but the
point of this post is to explore how to use `go tool pprof`, not to reproduce the
results from the paper.)

To start tuning the Go program, we have to enable profiling.
If the code used the [[http://golang.org/pkg/testing/][Go testing package]]'s
benchmarking support, we could use `go test`'s standard `-cpuprofile` and
`-memprofile` flags.
In a standalone program like this one, we have to import `runtime/pprof` and add a few
lines of code:

	var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to file")

	func main() {
		flag.Parse()
		if *cpuprofile != "" {
			f, err := os.Create(*cpuprofile)
			if err != nil {
				log.Fatal(err)
			}
			pprof.StartCPUProfile(f)
			defer pprof.StopCPUProfile()
		}
		...

The new code defines a flag named `cpuprofile`, calls the
[[http://golang.org/pkg/flag/][Go flag library]] to parse the command line flags,
and then, if the `cpuprofile` flag has been set on the command line,
[[http://golang.org/pkg/runtime/pprof/#StartCPUProfile][starts CPU profiling]]
redirected to that file.
The profiler requires a final call to
[[http://golang.org/pkg/runtime/pprof/#StopCPUProfile][`StopCPUProfile`]] to
flush any pending writes to the file before the program exits; we use `defer`
to make sure this happens as `main` returns.
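Filling in the elided pieces, a complete skeleton might look like the sketch
below; `doWork` is a hypothetical stand-in for the program's real entry point:

	package main

	import (
		"flag"
		"log"
		"os"
		"runtime/pprof"
	)

	var cpuprofile = flag.String("cpuprofile", "", "write cpu profile to file")

	func main() {
		flag.Parse()
		if *cpuprofile != "" {
			f, err := os.Create(*cpuprofile)
			if err != nil {
				log.Fatal(err)
			}
			pprof.StartCPUProfile(f)
			// Deferring StopCPUProfile means the profile covers
			// everything main does and is flushed on return.
			defer pprof.StopCPUProfile()
		}
		doWork() // hypothetical: the loop finding code would go here
	}

	func doWork() {
		// ... real work ...
	}

One caveat: a deferred `StopCPUProfile` only runs when `main` returns normally,
so a program that exits via `os.Exit` or a signal must stop the profiler itself.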
After adding that code, we can run the program with the new `-cpuprofile` flag
and then run `go tool pprof` to interpret the profile.

	$ make havlak1.prof
	./havlak1 -cpuprofile=havlak1.prof
	# of loops: 76000 (including 1 artificial root node)
	$ go tool pprof havlak1 havlak1.prof
	Welcome to pprof! For help, type 'help'.
	(pprof)

The `go tool pprof` program is a slight variant of
[[https://code.google.com/p/gperftools/wiki/GooglePerformanceTools][Google's `pprof` C++ profiler]].
The most important command is `topN`, which shows the top `N` samples in the profile:

	(pprof) top10
	Total: 2525 samples
	     298  11.8%  11.8%      345  13.7% runtime.mapaccess1_fast64
	     268  10.6%  22.4%     2124  84.1% main.FindLoops
	     251   9.9%  32.4%      451  17.9% scanblock
	     178   7.0%  39.4%      351  13.9% hash_insert
	     131   5.2%  44.6%      158   6.3% sweepspan
	     119   4.7%  49.3%      350  13.9% main.DFS
	      96   3.8%  53.1%       98   3.9% flushptrbuf
	      95   3.8%  56.9%       95   3.8% runtime.aeshash64
	      95   3.8%  60.6%      101   4.0% runtime.settype_flush
	      88   3.5%  64.1%      988  39.1% runtime.mallocgc

When CPU profiling is enabled, the Go program stops about 100 times per second
and records a sample consisting of the program counters on the currently executing
goroutine's stack.
The profile has 2525 samples, so it was running for a bit over 25 seconds.
In the `go tool pprof` output, there is a row for each function that appeared in
a sample.
The first two columns show the number of samples in which the function was running
(as opposed to waiting for a called function to return), as a raw count and as a
percentage of total samples.
The `runtime.mapaccess1_fast64` function was running during 298 samples, or 11.8%.
The `top10` output is sorted by this sample count.
The third column shows the running total during the listing:
the first three rows account for 32.4% of the samples.
The fourth and fifth columns show the number of samples in which the function appeared
(either running or waiting for a called function to return).
The `main.FindLoops` function was running in 10.6% of the samples, but it was on the
call stack (it or functions it called were running) in 84.1% of the samples.

To sort by the fourth and fifth columns, use the `-cum` (for cumulative) flag:

	(pprof) top5 -cum
	Total: 2525 samples
	       0   0.0%   0.0%     2144  84.9% gosched0
	       0   0.0%   0.0%     2144  84.9% main.main
	       0   0.0%   0.0%     2144  84.9% runtime.main
	       0   0.0%   0.0%     2124  84.1% main.FindHavlakLoops
	     268  10.6%  10.6%     2124  84.1% main.FindLoops
	(pprof)

In fact the total for `main.FindLoops` and `main.main` should have been 100%, but
each stack sample only includes the bottom 100 stack frames; during about a quarter
of the samples, the recursive `main.DFS` function was more than 100 frames deeper
than `main.main`, so the complete trace was truncated.

The stack trace samples contain more interesting data about function call relationships
than the text listings can show.
The `web` command writes a graph of the profile data in SVG format and opens it in a web
browser.
(There is also a `gv` command that writes PostScript and opens it in Ghostview.
For either command, you need [[http://www.graphviz.org/][graphviz]] installed.)

	(pprof) web

A small fragment of
[[http://benchgraffiti.googlecode.com/hg/havlak/havlak1.svg][the full graph]] looks like:

.image profiling-go-programs_havlak1a-75.png

Each box in the graph corresponds to a single function, and the boxes are sized
according to the number of samples in which the function was running.
An edge from box X to box Y indicates that X calls Y; the number along the edge is
the number of times that call appears in a sample.
If a call appears multiple times in a single sample, such as during recursive function
calls, each appearance counts toward the edge weight.
That explains the 21342 on the self-edge from `main.DFS` to itself.

Just at a glance, we can see that the program spends much of its time in hash
operations, which correspond to use of Go's `map` values.
We can tell `web` to use only samples that include a specific function, such as
`runtime.mapaccess1_fast64`, which clears some of the noise from the graph:

	(pprof) web mapaccess1

.image profiling-go-programs_havlak1-hash_lookup-75.png

If we squint, we can see that the calls to `runtime.mapaccess1_fast64` are being
made by `main.FindLoops` and `main.DFS`.

Now that we have a rough idea of the big picture, it's time to zoom in on a particular
function.
Let's look at `main.DFS` first, just because it is a shorter function:

	(pprof) list DFS
	Total: 2525 samples
	ROUTINE ====================== main.DFS in /home/rsc/g/benchgraffiti/havlak/havlak1.go
	   119    697 Total samples (flat / cumulative)
	     3      3  240: func DFS(currentNode *BasicBlock, nodes []*UnionFindNode, number map[*BasicBlock]int, last []int, current int) int {
	     1      1  241:     nodes[current].Init(currentNode, current)
	     1     37  242:     number[currentNode] = current
	     .      .  243:
	     1      1  244:     lastid := current
	    89     89  245:     for _, target := range currentNode.OutEdges {
	     9    152  246:         if number[target] == unvisited {
	     7    354  247:             lastid = DFS(target, nodes, number, last, lastid+1)
	     .      .  248:         }
	     .      .  249:     }
	     7     59  250:     last[number[currentNode]] = lastid
	     1      1  251:     return lastid
	(pprof)

The listing shows the source code for the `DFS` function (really, for every function
matching the regular expression `DFS`).
The first three columns are the number of samples taken while running that line, the
number of samples taken while running that line or in code called from that line, and
the line number in the file.
The related command `disasm` shows a disassembly of the function instead of a source
listing; when there are enough samples this can help you see which instructions are
expensive.
The `weblist` command mixes the two modes: it shows
[[http://benchgraffiti.googlecode.com/hg/havlak/havlak1.html][a source listing in which clicking a line shows the disassembly]].

Since we already know that the time is going into map lookups implemented by the
hash runtime functions, we care most about the second column.
A large fraction of time is spent in recursive calls to `DFS` (line 247), as would be
expected from a recursive traversal.
Excluding the recursion, it looks like the time is going into the accesses to the
`number` map on lines 242, 246, and 250.
For that particular lookup, a map is not the most efficient choice.
Just as they would be in a compiler, the basic block structures have unique sequence
numbers assigned to them.
Instead of using a `map[*BasicBlock]int` we can use a `[]int`, a slice indexed by the
block number.
There's no reason to use a map when an array or slice will do.
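To get a feel for the difference, here is a hypothetical microbenchmark (not part
of the havlak sources) contrasting lookups in a map keyed by a dense integer ID
with lookups in a slice indexed by that ID; the slice version avoids hashing entirely:

	package havlak

	import "testing"

	const size = 1 << 16 // hypothetical number of basic blocks

	var sink int // prevents the compiler from discarding the lookups

	func BenchmarkNumberMap(b *testing.B) {
		number := make(map[int]int, size)
		for i := 0; i < size; i++ {
			number[i] = i
		}
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			sink = number[i%size] // hashes the key on every lookup
		}
	}

	func BenchmarkNumberSlice(b *testing.B) {
		number := make([]int, size)
		for i := range number {
			number[i] = i
		}
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			sink = number[i%size] // a single indexed load, no hashing
		}
	}

Saved in a test file, the two benchmarks run with `go test -bench=.` and make the
per-lookup cost of the map plainly visible.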
Changing `number` from a map to a slice requires editing just seven lines in the
program and cuts its run time by nearly a factor of two:

	$ make havlak2
	go build havlak2.go
	$ ./xtime ./havlak2
	# of loops: 76000 (including 1 artificial root node)
	16.55u 0.11s 16.69r 1321008kB ./havlak2
	$

(See the [[http://code.google.com/p/benchgraffiti/source/diff?name=34f7624bb2e2&r=240c155236f9&format=unidiff&path=/havlak/havlak.go][diff between `havlak1` and `havlak2`]])

We can run the profiler again to confirm that `main.DFS` is no longer a significant
part of the run time:

	$ make havlak2.prof
	./havlak2 -cpuprofile=havlak2.prof
	# of loops: 76000 (including 1 artificial root node)
	$ go tool pprof havlak2 havlak2.prof
	Welcome to pprof! For help, type 'help'.
	(pprof)
	(pprof) top5
	Total: 1652 samples
	     197  11.9%  11.9%      382  23.1% scanblock
	     189  11.4%  23.4%     1549  93.8% main.FindLoops
	     130   7.9%  31.2%      152   9.2% sweepspan
	     104   6.3%  37.5%      896  54.2% runtime.mallocgc
	      98   5.9%  43.5%      100   6.1% flushptrbuf
	(pprof)

The entry `main.DFS` no longer appears in the profile, and the rest of the program
runtime has dropped too.
Now the program is spending most of its time allocating memory and garbage collecting
(`runtime.mallocgc`, which both allocates and runs periodic garbage collections,
accounts for 54.2% of the time).
To find out why the garbage collector is running so much, we have to find out what is
allocating memory.
One way is to add memory profiling to the program.
We'll arrange that if the `-memprofile` flag is supplied, the program stops after one
iteration of the loop finding, writes a memory profile, and exits:

	var memprofile = flag.String("memprofile", "", "write memory profile to this file")
	...

		FindHavlakLoops(cfgraph, lsgraph)
		if *memprofile != "" {
			f, err := os.Create(*memprofile)
			if err != nil {
				log.Fatal(err)
			}
			pprof.WriteHeapProfile(f)
			f.Close()
			return
		}

We invoke the program with the `-memprofile` flag to write a profile:

	$ make havlak3.mprof
	go build havlak3.go
	./havlak3 -memprofile=havlak3.mprof
	$

(See the [[http://code.google.com/p/benchgraffiti/source/diff?name=240c155236f9&r=796913012f93&format=unidiff&path=/havlak/havlak.go][diff from havlak2]])

We use `go tool pprof` exactly the same way. Now the samples we are examining are
memory allocations, not clock ticks.
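One detail about `pprof.WriteHeapProfile`, used above, is worth knowing: the heap
profile reports statistics as of the most recently completed garbage collection.
A hypothetical helper like the sketch below (our addition, not part of the havlak
sources) forces a collection first so the profile reflects the current live data:

	package main

	import (
		"log"
		"os"
		"runtime"
		"runtime/pprof"
	)

	// writeHeapProfile forces a garbage collection so the heap
	// statistics are up to date, then writes the profile to a file.
	func writeHeapProfile(name string) {
		f, err := os.Create(name)
		if err != nil {
			log.Fatal(err)
		}
		defer f.Close()
		runtime.GC() // refresh the heap profile statistics
		if err := pprof.WriteHeapProfile(f); err != nil {
			log.Fatal(err)
		}
	}

	func main() {
		writeHeapProfile("heap.mprof") // hypothetical file name
	}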
	$ go tool pprof havlak3 havlak3.mprof
	Adjusting heap profiles for 1-in-524288 sampling rate
	Welcome to pprof! For help, type 'help'.
	(pprof) top5
	Total: 82.4 MB
	    56.3  68.4%  68.4%     56.3  68.4% main.FindLoops
	    17.6  21.3%  89.7%     17.6  21.3% main.(*CFG).CreateNode
	     8.0   9.7%  99.4%     25.6  31.0% main.NewBasicBlockEdge
	     0.5   0.6% 100.0%      0.5   0.6% itab
	     0.0   0.0% 100.0%      0.5   0.6% fmt.init
	(pprof)

The command `go tool pprof` reports that `FindLoops` has allocated approximately
56.3 of the 82.4 MB in use; `CreateNode` accounts for another 17.6 MB.
To reduce overhead, the memory profiler only records information for approximately
one block per half megabyte allocated (the “1-in-524288 sampling rate”), so these
are approximations to the actual counts.

To find the memory allocations, we can list those functions.

	(pprof) list FindLoops
	Total: 82.4 MB
	ROUTINE ====================== main.FindLoops in /home/rsc/g/benchgraffiti/havlak/havlak3.go
	  56.3   56.3 Total MB (flat / cumulative)
	...
	   1.9    1.9  268:     nonBackPreds := make([]map[int]bool, size)
	   5.8    5.8  269:     backPreds := make([][]int, size)
	     .      .  270:
	   1.9    1.9  271:     number := make([]int, size)
	   1.9    1.9  272:     header := make([]int, size, size)
	   1.9    1.9  273:     types := make([]int, size, size)
	   1.9    1.9  274:     last := make([]int, size, size)
	   1.9    1.9  275:     nodes := make([]*UnionFindNode, size, size)
	     .      .  276:
	     .      .  277:     for i := 0; i < size; i++ {
	   9.5    9.5  278:         nodes[i] = new(UnionFindNode)
	     .      .  279:     }
	...
	     .      .  286:     for i, bb := range cfgraph.Blocks {
	     .      .  287:         number[bb.Name] = unvisited
	  29.5   29.5  288:         nonBackPreds[i] = make(map[int]bool)
	     .      .  289:     }
	...

It looks like the current bottleneck is the same as the last one: using maps where
simpler data structures suffice.
`FindLoops` is allocating about 29.5 MB of maps.

As an aside, if we run `go tool pprof` with the `--inuse_objects` flag, it will
report allocation counts instead of sizes:

	$ go tool pprof --inuse_objects havlak3 havlak3.mprof
	Adjusting heap profiles for 1-in-524288 sampling rate
	Welcome to pprof! For help, type 'help'.
	(pprof) list FindLoops
	Total: 1763108 objects
	ROUTINE ====================== main.FindLoops in /home/rsc/g/benchgraffiti/havlak/havlak3.go
	720903 720903 Total objects (flat / cumulative)
	...
	     .      .  277:     for i := 0; i < size; i++ {
	311296 311296  278:         nodes[i] = new(UnionFindNode)
	     .      .  279:     }
	     .      .  280:
	     .      .  281:     // Step a:
	     .      .  282:     //   - initialize all nodes as unvisited.
	     .      .  283:     //   - depth-first traversal and numbering.
	     .      .  284:     //   - unreached BB's are marked as dead.
	     .      .  285:     //
	     .      .  286:     for i, bb := range cfgraph.Blocks {
	     .      .  287:         number[bb.Name] = unvisited
	409600 409600  288:         nonBackPreds[i] = make(map[int]bool)
	     .      .  289:     }
	...
	(pprof)

Since the ~200,000 maps account for 29.5 MB, it looks like the initial map allocation
takes about 150 bytes.
That's reasonable when a map is being used to hold key-value pairs, but not when a map
is being used as a stand-in for a simple set, as it is here.

Instead of using a map, we can use a simple slice to list the elements.
In all but one of the cases where maps are being used, it is impossible for the algorithm
to insert a duplicate element.
In the one remaining case, we can write a simple variant of the `append` built-in function:

	func appendUnique(a []int, x int) []int {
		for _, y := range a {
			if x == y {
				return a
			}
		}
		return append(a, x)
	}

In addition to writing that function, changing the Go program to use slices instead
of maps requires changing just a few lines of code.
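As a quick check that the slice really behaves like a set, here is a self-contained
sketch; the variable `preds` is ours, standing in for an entry like `nonBackPreds[w]`:

	package main

	import "fmt"

	// appendUnique adds x to a only if x is not already present,
	// exactly as defined above.
	func appendUnique(a []int, x int) []int {
		for _, y := range a {
			if x == y {
				return a
			}
		}
		return append(a, x)
	}

	func main() {
		var preds []int // one per-node "set", stored as a slice
		for _, v := range []int{3, 7, 3, 9, 7} {
			preds = appendUnique(preds, v)
		}
		fmt.Println(preds) // prints [3 7 9]: duplicates were dropped
	}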
	$ make havlak4
	go build havlak4.go
	$ ./xtime ./havlak4
	# of loops: 76000 (including 1 artificial root node)
	11.84u 0.08s 11.94r 810416kB ./havlak4
	$

(See the [[http://code.google.com/p/benchgraffiti/source/diff?name=796913012f93&r=d856c2f698c1&format=unidiff&path=/havlak/havlak.go][diff from havlak3]])

We're now at 2.11x faster than when we started. Let's look at a CPU profile again.

	$ make havlak4.prof
	./havlak4 -cpuprofile=havlak4.prof
	# of loops: 76000 (including 1 artificial root node)
	$ go tool pprof havlak4 havlak4.prof
	Welcome to pprof! For help, type 'help'.
	(pprof) top10
	Total: 1173 samples
	     205  17.5%  17.5%     1083  92.3% main.FindLoops
	     138  11.8%  29.2%      215  18.3% scanblock
	      88   7.5%  36.7%       96   8.2% sweepspan
	      76   6.5%  43.2%      597  50.9% runtime.mallocgc
	      75   6.4%  49.6%       78   6.6% runtime.settype_flush
	      74   6.3%  55.9%       75   6.4% flushptrbuf
	      64   5.5%  61.4%       64   5.5% runtime.memmove
	      63   5.4%  66.8%      524  44.7% runtime.growslice
	      51   4.3%  71.1%       51   4.3% main.DFS
	      50   4.3%  75.4%      146  12.4% runtime.MCache_Alloc
	(pprof)

Now memory allocation and the consequent garbage collection (`runtime.mallocgc`)
accounts for 50.9% of our run time.
Another way to look at why the system is garbage collecting is to look at the
allocations that are causing the collections, the ones that spend most of the time
in `mallocgc`:

	(pprof) web mallocgc

.image profiling-go-programs_havlak4a-mallocgc.png

It's hard to tell what's going on in that graph, because there are many nodes with
small sample numbers obscuring the big ones.
We can tell `go tool pprof` to ignore nodes that don't account for at least 10% of
the samples:

	$ go tool pprof --nodefraction=0.1 havlak4 havlak4.prof
	Welcome to pprof! For help, type 'help'.
	(pprof) web mallocgc

.image profiling-go-programs_havlak4a-mallocgc-trim.png

We can follow the thick arrows easily now, to see that `FindLoops` is triggering
most of the garbage collection.
If we list `FindLoops` we can see that much of it is right at the beginning:

	(pprof) list FindLoops
	...
	     .      .  270: func FindLoops(cfgraph *CFG, lsgraph *LSG) {
	     .      .  271:     if cfgraph.Start == nil {
	     .      .  272:         return
	     .      .  273:     }
	     .      .  274:
	     .      .  275:     size := cfgraph.NumNodes()
	     .      .  276:
	     .    145  277:     nonBackPreds := make([][]int, size)
	     .      9  278:     backPreds := make([][]int, size)
	     .      .  279:
	     .      1  280:     number := make([]int, size)
	     .     17  281:     header := make([]int, size, size)
	     .      .  282:     types := make([]int, size, size)
	     .      .  283:     last := make([]int, size, size)
	     .      .  284:     nodes := make([]*UnionFindNode, size, size)
	     .      .  285:
	     .      .  286:     for i := 0; i < size; i++ {
	     2     79  287:         nodes[i] = new(UnionFindNode)
	     .      .  288:     }
	...
	(pprof)

Every time `FindLoops` is called, it allocates some sizable bookkeeping structures.
Since the benchmark calls `FindLoops` 50 times, these add up to a significant amount
of garbage, so a significant amount of work for the garbage collector.

Having a garbage-collected language doesn't mean you can ignore memory allocation
issues.
In this case, a simple solution is to introduce a cache so that each call to `FindLoops`
reuses the previous call's storage when possible.
(In fact, in Hundt's paper, he explains that the Java program needed just this change to
get anything like reasonable performance, but he did not make the same change in the
other garbage-collected implementations.)

We'll add a global `cache` structure:

	var cache struct {
		size         int
		nonBackPreds [][]int
		backPreds    [][]int
		number       []int
		header       []int
		types        []int
		last         []int
		nodes        []*UnionFindNode
	}

and then have `FindLoops` consult it as a replacement for allocation:

	if cache.size < size {
		cache.size = size
		cache.nonBackPreds = make([][]int, size)
		cache.backPreds = make([][]int, size)
		cache.number = make([]int, size)
		cache.header = make([]int, size)
		cache.types = make([]int, size)
		cache.last = make([]int, size)
		cache.nodes = make([]*UnionFindNode, size)
		for i := range cache.nodes {
			cache.nodes[i] = new(UnionFindNode)
		}
	}

	nonBackPreds := cache.nonBackPreds[:size]
	for i := range nonBackPreds {
		nonBackPreds[i] = nonBackPreds[i][:0]
	}
	backPreds := cache.backPreds[:size]
	for i := range backPreds {
		backPreds[i] = backPreds[i][:0]
	}
	number := cache.number[:size]
	header := cache.header[:size]
	types := cache.types[:size]
	last := cache.last[:size]
	nodes := cache.nodes[:size]

Such a global variable is bad engineering practice, of course: it means that
concurrent calls to `FindLoops` are now unsafe.
For now, we are making the minimal possible changes in order to understand what
is important for the performance of our program; this change is simple and mirrors
the code in the Java implementation.
The final version of the Go program will use a separate `LoopFinder` instance to
track this memory, restoring the possibility of concurrent use.
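To illustrate the idea (a sketch of the structure, not the actual havlak6 code),
the cached storage becomes fields on a `LoopFinder`, so separate instances can be
used from separate goroutines; only the `nodes` cache is shown and `UnionFindNode`
is reduced to a stub:

	package main

	import "fmt"

	// UnionFindNode stands in for the real node type.
	type UnionFindNode struct{ parent *UnionFindNode }

	// LoopFinder owns the scratch storage that the global cache held
	// above; each instance can be reused safely in isolation.
	type LoopFinder struct {
		size  int
		nodes []*UnionFindNode
		// ... the other cached slices would live here too ...
	}

	// resize grows the cached storage when a larger graph arrives,
	// mirroring the "if cache.size < size" logic above.
	func (f *LoopFinder) resize(size int) {
		if f.size < size {
			f.size = size
			f.nodes = make([]*UnionFindNode, size)
			for i := range f.nodes {
				f.nodes[i] = new(UnionFindNode)
			}
		}
	}

	func main() {
		var f LoopFinder
		f.resize(4)
		f.resize(2) // no reallocation: the cache is already big enough
		fmt.Println(len(f.nodes[:2]), cap(f.nodes)) // prints 2 4
	}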
	$ make havlak5
	go build havlak5.go
	$ ./xtime ./havlak5
	# of loops: 76000 (including 1 artificial root node)
	8.03u 0.06s 8.11r 770352kB ./havlak5
	$

(See the [[http://code.google.com/p/benchgraffiti/source/diff?name=d856c2f698c1&r=5ce46b0ee1db&format=unidiff&path=/havlak/havlak.go][diff from havlak4]])

There's more we can do to clean up the program and make it faster, but none of
it requires profiling techniques that we haven't already shown.
The work list used in the inner loop can be reused across iterations and across
calls to `FindLoops`, and it can be combined with the separate “node pool” generated
during that pass.
Similarly, the loop graph storage can be reused on each iteration instead of reallocated.
In addition to these performance changes, the
[[http://code.google.com/p/benchgraffiti/source/browse/havlak/havlak6.go][final version]]
is written using idiomatic Go style, using data structures and methods.
The stylistic changes have only a minor effect on the run time: the algorithm and
constraints are unchanged.

The final version runs in 2.29 seconds and uses 351 MB of memory:

	$ make havlak6
	go build havlak6.go
	$ ./xtime ./havlak6
	# of loops: 76000 (including 1 artificial root node)
	2.26u 0.02s 2.29r 360224kB ./havlak6
	$

That's 11 times faster than the program we started with.
Even if we disable reuse of the generated loop graph, so that the only cached memory
is the loop finding bookkeeping, the program still runs 6.7x faster than the original
and uses 1.5x less memory.

	$ ./xtime ./havlak6 -reuseloopgraph=false
	# of loops: 76000 (including 1 artificial root node)
	3.69u 0.06s 3.76r 797120kB ./havlak6 -reuseloopgraph=false
	$

Of course, it's no longer fair to compare this Go program to the original C++
program, which used inefficient data structures like `set`s where `vector`s would
be more appropriate.
As a sanity check, we translated the final Go program into
[[http://code.google.com/p/benchgraffiti/source/browse/havlak/havlak6.cc][equivalent C++ code]].
Its execution time is similar to the Go program's:

	$ make havlak6cc
	g++ -O3 -o havlak6cc havlak6.cc
	$ ./xtime ./havlak6cc
	# of loops: 76000 (including 1 artificial root node)
	1.99u 0.19s 2.19r 387936kB ./havlak6cc

The Go program runs almost as fast as the C++ program.
Since the C++ program uses automatic deletes and allocation instead of an explicit
cache, it is a bit shorter and easier to write, but not dramatically so:

	$ wc havlak6.cc; wc havlak6.go
	 401 1220 9040 havlak6.cc
	 461 1441 9467 havlak6.go
	$

(See [[http://code.google.com/p/benchgraffiti/source/browse/havlak/havlak6.cc][havlak6.cc]]
and [[http://code.google.com/p/benchgraffiti/source/browse/havlak/havlak6.go][havlak6.go]])

Benchmarks are only as good as the programs they measure.
We used `go tool pprof` to study an inefficient Go program and then to improve its
performance by an order of magnitude and to reduce its memory usage by a factor of 3.7.
A subsequent comparison with an equivalently optimized C++ program shows that Go can be
competitive with C++ when programmers are careful about how much garbage is generated
by inner loops.

The program sources, Linux x86-64 binaries, and profiles used to write this post
are available in the [[http://code.google.com/p/benchgraffiti/][benchgraffiti project on Google Code]].

As mentioned above, [[http://golang.org/cmd/go/#Test_packages][`go test`]] includes
these profiling flags already: define a
[[http://golang.org/pkg/testing/][benchmark function]] and you're all set.
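For instance, a benchmark along these lines would do; this is a sketch in which
`buildTestGraphs` and the package name are hypothetical stand-ins for real test setup:

	package havlak

	import "testing"

	// BenchmarkFindLoops exercises the loop finder the same way the
	// standalone benchmark does.
	func BenchmarkFindLoops(b *testing.B) {
		cfgraph, lsgraph := buildTestGraphs() // hypothetical setup helper
		b.ResetTimer()
		for i := 0; i < b.N; i++ {
			FindHavlakLoops(cfgraph, lsgraph)
		}
	}

Running `go test -bench=. -cpuprofile=cpu.prof` then writes a CPU profile that
`go tool pprof` can examine just as before.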
There is also a standard HTTP interface to profiling data. In an HTTP server, adding

	import _ "net/http/pprof"

will install handlers for a few URLs under `/debug/pprof/`.
Then you can run `go tool pprof` with a single argument, the URL to your server's
profiling data, and it will download and examine a live profile.
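A minimal sketch of such a server (the choice of port is ours; it matches the URLs below):

	package main

	import (
		"log"
		"net/http"
		_ "net/http/pprof" // registers /debug/pprof/ handlers on the default mux
	)

	func main() {
		// The profiling handlers are served alongside whatever other
		// handlers the application registers on the default mux.
		log.Fatal(http.ListenAndServe("localhost:6060", nil))
	}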
	go tool pprof http://localhost:6060/debug/pprof/profile   # 30-second CPU profile
	go tool pprof http://localhost:6060/debug/pprof/heap      # heap profile
	go tool pprof http://localhost:6060/debug/pprof/block     # goroutine blocking profile
The goroutine blocking profile will be explained in a future post. Stay tuned.