# I/O Profiling

Linux provides extensive kernel profiling capabilities, including the
ability to trace operations at the block I/O layer. These tools are
incredibly powerful, though sometimes overwhelming in their
flexibility. This document captures some common recipes for profiling
Linux I/O.

* [Perf](#perf)
* [Blktrace](#blktrace)

## Perf

The Linux `perf` command can instrument CPU performance counters, and
the extensive set of kernel trace points. A great place to get started
understanding `perf` is Brendan Gregg's [perf
examples](http://www.brendangregg.com/perf.html).

The two modes of operation are "live" reporting via `perf top`, and
record and report via `perf record` and `perf {report,script}`.

Recording the stack traces for the `block:block_rq_insert` event allows
determination of which Pebble-level code is generating block requests.

### Installation

Ubuntu AWS installation:

```
sudo apt-get install linux-tools-common linux-tools-4.4.0-1049-aws linux-cloud-tools-4.4.0-1049-aws
```

### Recording

`perf record` (and `perf top`) requires read and write access to
`/sys/kernel/debug/tracing`. Running as root is the easiest way to get
the right permissions.

```
# Trace all block device (disk I/O) requests with stack traces, until Ctrl-C.
sudo perf record -e block:block_rq_insert -ag

# Trace all block device (disk I/O) issues and completions with stack traces, until Ctrl-C.
sudo perf record -e block:block_rq_issue -e block:block_rq_complete -ag
```

The `-a` flag records events on all CPUs (almost always desirable).

The `-g` flag records call graphs (a.k.a. stack traces). Capturing the
stack trace makes the recording somewhat more expensive, but it
enables determining the originator of the event. Note that the stack
traces include both kernel and application code, allowing the source of
I/O to be pinpointed as flush, compaction, WAL writes, etc.

The `-e` flag controls which events are instrumented. The list of
`perf` events is enormous. See `sudo perf list`.

The `-o` flag controls where output is recorded. The default is
`perf.data`.

In order to record events for a specific duration, you can append `--
sleep <duration>` to the command line.

```
# Trace all block device (disk I/O) requests with stack traces for 10s.
sudo perf record -e block:block_rq_insert -ag -- sleep 10
```
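Putting the flags together, an invocation like the following (the output
filename is just an example) records block request insertions on all CPUs
with stack traces for 30 seconds, writing the result to a non-default
file; pass the same file back to `perf report` and `perf script` with
`-i` when analyzing it.

```
# Record block request insertions on all CPUs with stack traces for 30s,
# writing the result to pebble-io.data (an example name) instead of the
# default perf.data.
sudo perf record -e block:block_rq_insert -ag -o pebble-io.data -- sleep 30

# Examine the non-default output file.
sudo perf report -i pebble-io.data --stdio
```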
### Reporting

The recorded perf data (`perf.data`) can be explored using `perf
report` and `perf script`.

```
# Show perf.data in an ncurses browser.
sudo perf report

# Show perf.data as a text report.
sudo perf report --stdio
```

As an example, `perf report --stdio` from perf data gathered using
`perf record -e block:block_rq_insert -ag` will show something like:

```
    96.76%     0.00%  pebble   pebble            [.] runtime.goexit
            |
            ---runtime.goexit
               |
               |--85.58%-- github.com/cockroachdb/pebble/internal/record.NewLogWriter.func2
               |          runtime/pprof.Do
               |          github.com/cockroachdb/pebble/internal/record.(*LogWriter).flushLoop-fm
               |          github.com/cockroachdb/pebble/internal/record.(*LogWriter).flushLoop
               |          github.com/cockroachdb/pebble/internal/record.(*LogWriter).flushPending
               |          github.com/cockroachdb/pebble/vfs.(*syncingFile).Sync
               |          github.com/cockroachdb/pebble/vfs.(*syncingFile).syncFdatasync-fm
               |          github.com/cockroachdb/pebble/vfs.(*syncingFile).syncFdatasync
               |          syscall.Syscall
               |          entry_SYSCALL_64_fastpath
               |          sys_fdatasync
               |          do_fsync
               |          vfs_fsync_range
               |          ext4_sync_file
               |          filemap_write_and_wait_range
               |          __filemap_fdatawrite_range
               |          do_writepages
               |          ext4_writepages
               |          blk_finish_plug
               |          blk_flush_plug_list
               |          blk_mq_flush_plug_list
               |          blk_mq_insert_requests
```

This is showing that `96.76%` of block device requests on the entire
system were generated by the `pebble` process, and `85.58%` of the
block device requests on the entire system were generated from WAL
syncing within this `pebble` process.

The `perf script` command provides access to the raw request
data. While there are various pre-recorded scripts that can be
executed, it is primarily useful for seeing call stacks along with the
"trace" data. For block requests, the trace data shows the device, the
operation type, the offset, and the size.

```
# List all events from perf.data with recommended header and fields.
sudo perf script --header -F comm,pid,tid,cpu,time,event,ip,sym,dso,trace
...
pebble  6019/6019  [008] 16492.555957: block:block_rq_insert: 259,0 WS 0 () 3970952 + 256 [pebble]
            7fff813d791a blk_mq_insert_requests
            7fff813d8878 blk_mq_flush_plug_list
            7fff813ccc96 blk_flush_plug_list
            7fff813cd20c blk_finish_plug
            7fff812a143d ext4_writepages
            7fff8119ea1e do_writepages
            7fff81191746 __filemap_fdatawrite_range
            7fff8119188a filemap_write_and_wait_range
            7fff81297c41 ext4_sync_file
            7fff81244ecb vfs_fsync_range
            7fff81244f8d do_fsync
            7fff81245243 sys_fdatasync
            7fff8181ae6d entry_SYSCALL_64_fastpath
                  3145e0 syscall.Syscall
                  6eddf3 github.com/cockroachdb/pebble/vfs.(*syncingFile).syncFdatasync
                  6f069a github.com/cockroachdb/pebble/vfs.(*syncingFile).syncFdatasync-fm
                  6ed8d2 github.com/cockroachdb/pebble/vfs.(*syncingFile).Sync
                  72542f github.com/cockroachdb/pebble/internal/record.(*LogWriter).flushPending
                  724f5c github.com/cockroachdb/pebble/internal/record.(*LogWriter).flushLoop
                  72855e github.com/cockroachdb/pebble/internal/record.(*LogWriter).flushLoop-fm
                  7231d8 runtime/pprof.Do
                  727b09 github.com/cockroachdb/pebble/internal/record.NewLogWriter.func2
                  2c0281 runtime.goexit
```

Let's break down the trace data:

```
259,0 WS 0 () 3970952 + 256
  |   |          |      |
  |   |          |      + size (sectors)
  |   |          |
  |   |          + offset (sectors)
  |   |
  |   +- flags: R(ead), W(rite), B(arrier), S(ync), D(iscard), N(one)
  |
  +- device: <major>, <minor>
```

The above indicates that a synchronous write of `256` sectors was
performed starting at sector `3970952`. The sector size is device
dependent and can be determined with `blockdev --report <device>`,
though it is almost always `512` bytes. In this case, the sector size
is `512` bytes, indicating that this is a write of 128 KB.
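Because each trace line includes the request size in sectors, the raw
`perf script` output can also be post-processed from the shell. The
following is a rough sketch (assuming the field layout shown above, a
recording of `block:block_rq_insert` events, and 512 byte sectors) that
totals the inserted block I/O per process:

```
# Sum the sectors in block:block_rq_insert events per process, reporting
# the totals in megabytes (assumes 512 byte sectors).
sudo perf script -F comm,trace | \
  awk '{ for (i = 1; i < NF; i++) if ($i == "+") sectors[$1] += $(i + 1) }
       END { for (c in sectors) printf "%-16s %8.1f MB\n", c, sectors[c] * 512 / 1048576 }'
```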
## Blktrace

The `blktrace` tool records similar information to `perf`, but is
targeted at the block layer instead of being general purpose. The
`blktrace` command records data, while the `blkparse` command parses
and displays data. The `btrace` command is a shortcut for piping the
output from `blktrace` directly into `blkparse`.

### Installation

Ubuntu AWS installation:

```
sudo apt-get install blktrace
```

### Usage

```
# Pipe the output of blktrace directly into blkparse.
sudo blktrace -d /dev/nvme1n1 -o - | blkparse -i -

# Equivalently.
sudo btrace /dev/nvme1n1
```

The information captured by `blktrace` is similar to what `perf` captures:

```
sudo btrace /dev/nvme1n1
...
259,0    4      186     0.016411295 11538  Q  WS 129341760 + 296 [pebble]
259,0    4      187     0.016412100 11538  Q  WS 129342016 + 40 [pebble]
259,0    4      188     0.016412200 11538  G  WS 129341760 + 256 [pebble]
259,0    4      189     0.016412714 11538  G  WS 129342016 + 40 [pebble]
259,0    4      190     0.016413148 11538  U   N [pebble] 2
259,0    4      191     0.016413255 11538  I  WS 129341760 + 256 [pebble]
259,0    4      192     0.016413321 11538  I  WS 129342016 + 40 [pebble]
259,0    4      193     0.016414271 11538  D  WS 129341760 + 256 [pebble]
259,0    4      194     0.016414860 11538  D  WS 129342016 + 40 [pebble]
259,0   12      217     0.016687595     0  C  WS 129341760 + 256 [0]
259,0   12      218     0.016700021     0  C  WS 129342016 + 40 [0]
```

The standard format is:

```
<device> <cpu> <seqnum> <timestamp> <pid> <action> <RWBS> <start-sector> + <size> [<command>]
```

See `man blkparse` for an explanation of the actions.

The `blktrace` output can be used to highlight problematic I/O
patterns. For example, it can be used to determine whether there is an
excessive number of small sequential read I/Os, indicating that dynamic
readahead is not working correctly.
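As a rough sketch of such an analysis (assuming the standard field
layout above and the example device name used earlier), the following
counts how many of the queued read requests are smaller than 16
sectors; checking for true sequentiality would additionally require
comparing consecutive start sectors:

```
# Trace for 10 seconds, then count queued read requests ('Q' action, RWBS
# containing 'R') and how many of them are smaller than 16 sectors
# (8 KB with 512 byte sectors).
sudo blktrace -d /dev/nvme1n1 -w 10 -o - | blkparse -i - | \
  awk '$6 == "Q" && $7 ~ /R/ { total++; if ($10 < 16) small++ }
       END { printf "small reads: %d of %d queued reads\n", small, total }'
```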