github.com/cockroachdb/pebble@v1.1.1-0.20240513155919-3622ade60459/docs/io_profiling.md (about)

     1  # I/O Profiling
     2  
     3  Linux provide extensive kernel profiling capabilities, including the
     4  ability to trace operations at the block I/O layer. These tools are
     5  incredibly powerful, though sometimes overwhelming in their
     6  flexibility. This document captures some common recipes for profiling
     7  Linux I/O.
     8  
     9  * [Perf](#perf)
    10  * [Blktrace](#blktrace)
    11  
    12  ## Perf
    13  
    14  The Linux `perf` command can instrument CPU performance counters, and
    15  the extensive set of kernel trace points. A great place to get started
    16  understanding `perf` are Brendan Gregg's [perf
    17  examples](http://www.brendangregg.com/perf.html).
    18  
    19  The two modes of operation are "live" reporting via `perf top`, and
    20  record and report via `perf record` and `perf
    21  {report,script}`. 
    22  
    23  Recording the stack traces for `block:block_rq_insert` event allows
    24  determination of what Pebble level code is generating block requests.
    25  
    26  ### Installation
    27  
    28  Ubuntu AWS installation:
    29  
    30  ```
    31  sudo apt-get install linux-tools-common linux-tools-4.4.0-1049-aws linux-cloud-tools-4.4.0-1049-aws
    32  ```
    33  
    34  ### Recording
    35  
    36  `perf record` (and `perf top`) requires read and write access to
    37  `/sys/kernel/debug/tracing`. Running as root as an easiest way to get
    38  the right permissions.
    39  
    40  ```
    41  # Trace all block device (disk I/O) requests with stack traces, until Ctrl-C.
    42  sudo perf record -e block:block_rq_insert -ag
    43  
    44  # Trace all block device (disk I/O) issues and completions with stack traces, until Ctrl-C.
    45  sudo perf record -e block:block_rq_issue -e block:block_rq_complete -ag
    46  ```
    47  
    48  The `-a` flag records events on all CPUs (almost always desirable).
    49  
    50  The `-g` flag records call graphs (a.k.a stack traces). Capturing the
    51  stack trace makes the recording somewhat more expensive, but it
    52  enables determining the originator of the event. Note the stack traces
    53  include both the kernel and application code, allowing pinpointing the
    54  source of I/O as due to flush, compaction, WAL writes, etc.
    55  
    56  The `-e` flag controls which events are instrumented. The list of
    57  `perf` events is enormous. See `sudo perf list`.
    58  
    59  The `-o` flag controls where output is recorded. The default is
    60  `perf.data`.
    61  
    62  In order to record events for a specific duration, you can append `--
    63  sleep <duration>` to the command line.
    64  
    65  ```
    66  # Trace all block device (disk I/O) requests with stack traces for 10s.
    67  sudo perf record -e block:block_rq_insert -ag -- sleep 10
    68  ```
    69  
    70  ### Reporting
    71  
    72  The recorded perf data (`perf.data`) can be explored using `perf
    73  report` and `perf script`.
    74  
    75  ```
    76  # Show perf.data in an ncurses browser.
    77  sudo perf report
    78  
    79  # Show perf.data as a text report.
    80  sudo perf report --stdio
    81  ```
    82  
    83  As an example, `perf report --stdio` from perf data gathered using
    84  `perf record -e block:block_rq_insert -ag` will show something like:
    85  
    86  ```
    87      96.76%     0.00%  pebble          pebble             [.] runtime.goexit
    88                      |
    89                      ---runtime.goexit
    90                         |
    91                         |--85.58%-- github.com/cockroachdb/pebble/internal/record.NewLogWriter.func2
    92                         |          runtime/pprof.Do
    93                         |          github.com/cockroachdb/pebble/internal/record.(*LogWriter).flushLoop-fm
    94                         |          github.com/cockroachdb/pebble/internal/record.(*LogWriter).flushLoop
    95                         |          github.com/cockroachdb/pebble/internal/record.(*LogWriter).flushPending
    96                         |          github.com/cockroachdb/pebble/vfs.(*syncingFile).Sync
    97                         |          github.com/cockroachdb/pebble/vfs.(*syncingFile).syncFdatasync-fm
    98                         |          github.com/cockroachdb/pebble/vfs.(*syncingFile).syncFdatasync
    99                         |          syscall.Syscall
   100                         |          entry_SYSCALL_64_fastpath
   101                         |          sys_fdatasync
   102                         |          do_fsync
   103                         |          vfs_fsync_range
   104                         |          ext4_sync_file
   105                         |          filemap_write_and_wait_range
   106                         |          __filemap_fdatawrite_range
   107                         |          do_writepages
   108                         |          ext4_writepages
   109                         |          blk_finish_plug
   110                         |          blk_flush_plug_list
   111                         |          blk_mq_flush_plug_list
   112                         |          blk_mq_insert_requests
   113  ```
   114  
   115  This is showing that `96.76%` of block device requests on the entire
   116  system were generated by the `pebble` process, and `85.58%` of the
   117  block device requests on the entire system were generated from WAL
   118  syncing within this `pebble` process.
   119  
   120  The `perf script` command provides access to the raw request
   121  data. While there are various pre-recorded scripts that can be
   122  executed, it is primarily useful for seeing call stacks along with the
   123  "trace" data. For block requests, the trace data shows the device, the
   124  operation type, the offset, and the size.
   125  
   126  ```
   127  # List all events from perf.data with recommended header and fields.
   128  sudo perf script --header -F comm,pid,tid,cpu,time,event,ip,sym,dso,trace
   129  ...
   130  pebble  6019/6019  [008] 16492.555957: block:block_rq_insert: 259,0 WS 0 () 3970952 + 256 [pebble]
   131              7fff813d791a blk_mq_insert_requests
   132              7fff813d8878 blk_mq_flush_plug_list
   133              7fff813ccc96 blk_flush_plug_list
   134              7fff813cd20c blk_finish_plug
   135              7fff812a143d ext4_writepages
   136              7fff8119ea1e do_writepages
   137              7fff81191746 __filemap_fdatawrite_range
   138              7fff8119188a filemap_write_and_wait_range
   139              7fff81297c41 ext4_sync_file
   140              7fff81244ecb vfs_fsync_range
   141              7fff81244f8d do_fsync
   142              7fff81245243 sys_fdatasync
   143              7fff8181ae6d entry_SYSCALL_64_fastpath
   144                    3145e0 syscall.Syscall
   145                    6eddf3 github.com/cockroachdb/pebble/vfs.(*syncingFile).syncFdatasync
   146                    6f069a github.com/cockroachdb/pebble/vfs.(*syncingFile).syncFdatasync-fm
   147                    6ed8d2 github.com/cockroachdb/pebble/vfs.(*syncingFile).Sync
   148                    72542f github.com/cockroachdb/pebble/internal/record.(*LogWriter).flushPending
   149                    724f5c github.com/cockroachdb/pebble/internal/record.(*LogWriter).flushLoop
   150                    72855e github.com/cockroachdb/pebble/internal/record.(*LogWriter).flushLoop-fm
   151                    7231d8 runtime/pprof.Do
   152                    727b09 github.com/cockroachdb/pebble/internal/record.NewLogWriter.func2
   153                    2c0281 runtime.goexit
   154  ```
   155  
   156  Let's break down the trace data:
   157  
   158  ```
   159  259,0 WS 0 () 3970952 + 256
   160   |     |         |       |
   161   |     |         |       + size (sectors)
   162   |     |         |
   163   |     |         + offset (sectors)
   164   |     |
   165   |     +- flags: R(ead), W(rite), B(arrier), S(ync), D(iscard), N(one)
   166   |
   167   +- device: <major>, <minor>
   168  ```
   169  
   170  The above is indicating that a synchronous write of `256` sectors was
   171  performed starting at sector `3970952`. The sector size is device
   172  dependent and can be determined with `blockdev --report <device>`,
   173  though it is almost always `512` bytes. In this case, the sector size
   174  is `512` bytes indicating that this is a write of 128 KB.
   175  
   176  ## Blktrace
   177  
   178  The `blktrace` tool records similar info to `perf`, but is targeted to
   179  the block layer instead of being general purpose. The `blktrace`
   180  command records data, while the `blkparse` command parses and displays
   181  data. The `btrace` command is a shortcut for piping the output from
   182  `blktrace` directly into `blkparse.
   183  
   184  ### Installation
   185  
   186  Ubuntu AWS installation:
   187  
   188  ```
   189  sudo apt-get install blktrace
   190  ```
   191  
   192  ## Usage
   193  
   194  ```
   195  # Pipe the output of blktrace directly into blkparse.
   196  sudo blktrace -d /dev/nvme1n1 -o - | blkparse -i -
   197  
   198  # Equivalently.
   199  sudo btrace /dev/nvme1n1
   200  ```
   201  
   202  The information captured by `blktrace` is similar to what `perf` captures:
   203  
   204  ```
   205  sudo btrace /dev/nvme1n1
   206  ...
   207  259,0    4      186     0.016411295 11538  Q  WS 129341760 + 296 [pebble]
   208  259,0    4      187     0.016412100 11538  Q  WS 129342016 + 40 [pebble]
   209  259,0    4      188     0.016412200 11538  G  WS 129341760 + 256 [pebble]
   210  259,0    4      189     0.016412714 11538  G  WS 129342016 + 40 [pebble]
   211  259,0    4      190     0.016413148 11538  U   N [pebble] 2
   212  259,0    4      191     0.016413255 11538  I  WS 129341760 + 256 [pebble]
   213  259,0    4      192     0.016413321 11538  I  WS 129342016 + 40 [pebble]
   214  259,0    4      193     0.016414271 11538  D  WS 129341760 + 256 [pebble]
   215  259,0    4      194     0.016414860 11538  D  WS 129342016 + 40 [pebble]
   216  259,0   12      217     0.016687595     0  C  WS 129341760 + 256 [0]
   217  259,0   12      218     0.016700021     0  C  WS 129342016 + 40 [0]
   218  ```
   219  
   220  The standard format is:
   221  
   222  ```
   223  <device> <cpu> <seqnum> <timestamp> <pid> <action> <RWBS> <start-sector> + <size> [<command>]
   224  ```
   225  
   226  See `man blkparse` for an explanation of the actions.
   227  
   228  The `blktrace` output can be used to highlight problematic I/O
   229  patterns. For example, it can be used to determine there are an
   230  excessive number of small sequential read I/Os indicating that dynamic
   231  readahead is not working correctly.