github.com/google/syzkaller@v0.0.0-20240517125934-c0f1611a36d6/docs/linux/coverage.md

# Coverage

Syzkaller uses [kcov](https://www.kernel.org/doc/html/latest/dev-tools/kcov.html) to collect coverage from the kernel. kcov exports the address of each executed basic block, and the syzkaller runtime uses tools from `binutils` (`objdump`, `nm`, `addr2line` and `readelf`) to map these addresses to lines and functions in the source code.

## Binutils

Note that if you are fuzzing in a cross-arch environment, you need to provide the correct `binutils` cross-tools to syzkaller before starting `syz-manager`:

``` bash
mkdir -p ~/bin/mips64le
ln -s `which mips64el-linux-gnuabi64-addr2line` ~/bin/mips64le/addr2line
ln -s `which mips64el-linux-gnuabi64-nm` ~/bin/mips64le/nm
ln -s `which mips64el-linux-gnuabi64-objdump` ~/bin/mips64le/objdump
ln -s `which mips64el-linux-gnuabi64-readelf` ~/bin/mips64le/readelf
export PATH=~/bin/mips64le:$PATH
```

The target-triple prefix is determined based on the `target` config option.

### readelf

`readelf` is used to detect the virtual memory offset.

```
readelf -SW kernel_image
```

The meaning of the flags is as follows:

* `-S` - list section headers in the kernel image file
* `-W` - output each section header entry in a single line

Example output of the command:

```
There are 59 section headers, starting at offset 0x3825258:

Section Headers:
  [Nr] Name              Type            Address          Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            0000000000000000 000000 000000 00      0   0  0
  [ 1] .text             PROGBITS        ffffffff81000000 200000 e010f7 00  AX  0   0 4096
  [ 2] .rela.text        RELA            0000000000000000 23ff488 684720 18   I 56   1  8
  [ 3] .rodata           PROGBITS        ffffffff82000000 1200000 2df790 00  WA  0   0 4096
  [ 4] .rela.rodata      RELA            0000000000000000 2a83ba8 0d8e28 18   I 56   3  8
  [ 5] .pci_fixup        PROGBITS        ffffffff822df790 14df790 003180 00   A  0   0 16
  [ 6] .rela.pci_fixup   RELA            0000000000000000 2b5c9d0 004a40 18   I 56   5  8
  [ 7] .tracedata        PROGBITS        ffffffff822e2910 14e2910 000078 00   A  0   0  1
  [ 8] .rela.tracedata   RELA            0000000000000000 2b61410 000120 18   I 56   7  8
  [ 9] __ksymtab         PROGBITS        ffffffff822e2988 14e2988 011b68 00   A  0   0  4
  [10] ...
```

The executor truncates PC values to `uint32` before sending them to `syz-manager`, and `syz-manager` uses the section header information to recover the offset. Only section headers of type `PROGBITS` are considered. The `Address` field represents the virtual address of a section in memory (for the sections that are loaded). All `PROGBITS` sections are required to have the same upper 32 bits in the `Address` field. These 32 bits are used as the recovery offset.
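
The recovery step described above can be illustrated with a short Python sketch. It is not syzkaller's Go implementation: the names `recovery_offset` and `restore_pc`, the row-matching regular expression (which assumes 64-bit addresses), and the choice to skip sections whose address is zero are all assumptions made for illustration.

```python
import re

# Hypothetical pattern for one `readelf -SW` row: "[Nr] Name Type Address ...".
# It assumes 16 hex digits in the Address column (a 64-bit image).
SECTION_ROW = re.compile(r"^\s*\[\s*\d+\]\s+(\S+)\s+(\S+)\s+([0-9a-fA-F]{16})\s")

SAMPLE = """\
  [ 0]                   NULL            0000000000000000 000000 000000 00      0   0  0
  [ 1] .text             PROGBITS        ffffffff81000000 200000 e010f7 00  AX  0   0 4096
  [ 2] .rela.text        RELA            0000000000000000 23ff488 684720 18   I 56   1  8
  [ 3] .rodata           PROGBITS        ffffffff82000000 1200000 2df790 00  WA  0   0 4096
"""


def recovery_offset(readelf_output):
    """Return the upper 32 bits shared by all loaded PROGBITS sections."""
    uppers = set()
    for row in readelf_output.splitlines():
        match = SECTION_ROW.match(row)
        if not match or match.group(2) != "PROGBITS":
            continue
        address = int(match.group(3), 16)
        if address == 0:
            continue  # assumption: a zero address means the section is not loaded
        uppers.add(address >> 32)
    if len(uppers) != 1:
        raise ValueError("PROGBITS sections do not share the same upper 32 bits")
    return uppers.pop()


def restore_pc(truncated_pc, upper):
    """Rebuild a 64-bit PC from its lower 32 bits and the recovery offset."""
    return (upper << 32) | (truncated_pc & 0xFFFFFFFF)


if __name__ == "__main__":
    upper = recovery_offset(SAMPLE)
    print(hex(restore_pc(0x81000F41, upper)))
```

Raising an error when the upper halves differ mirrors the requirement stated above that all `PROGBITS` sections agree on those bits.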

## Reporting coverage data

The `MakeReportGenerator` factory creates an object database for the report. It requires target data, as well as information on the location of the source files and the build directory. The first step in building this database is extracting the function data from the target binary.

### nm

`nm` is used to parse the address and size of each function in the kernel image:

```
nm -Ptx kernel_image
```

The meaning of the flags is as follows:

* `-P` - use the portable (POSIX.2 standard) output format
* `-tx` - write numeric values in hexadecimal format

Output is of the following form:

```
tracepoint_module_nb d ffffffff84509580 0000000000000018
...
udp_lib_hash t ffffffff831a4660 0000000000000007
```

The first column is the symbol name and the second column is its type (e.g. text section, data section, debugging symbol, undefined, zero-init section, etc.). The third column is the symbol value in hex format, while the fourth column contains its size. Syzkaller always rounds the size up to a multiple of 16. For the report, we are only interested in the code sections, so the `nm` output is filtered for symbols of type `t` or `T`.
The final result is a map with symbol names as keys and the start and end addresses of each symbol as values. This information is used to map coverage data to symbols (functions). This step is needed to find out whether certain functions are called at all.
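
As a rough illustration of this step, the following Python sketch parses `nm -Ptx` output into such a map. The function names `round_up` and `parse_nm` are hypothetical, and storing a list of ranges per name (so that symbols sharing a name do not overwrite each other) is an assumption of this sketch rather than a description of syzkaller's code.

```python
NM_SAMPLE = """\
tracepoint_module_nb d ffffffff84509580 0000000000000018
...
udp_lib_hash t ffffffff831a4660 0000000000000007
"""


def round_up(value, multiple):
    """Round value up to the nearest multiple."""
    return (value + multiple - 1) // multiple * multiple


def parse_nm(nm_output):
    """Map each text symbol name to a list of (start, end) address ranges."""
    symbols = {}
    for row in nm_output.splitlines():
        fields = row.split()
        if len(fields) != 4:
            continue  # skip rows without both a value and a size
        name, kind, value, size = fields
        if kind not in ("t", "T"):
            continue  # keep only symbols from code sections
        start = int(value, 16)
        end = start + round_up(int(size, 16), 16)
        symbols.setdefault(name, []).append((start, end))
    return symbols


if __name__ == "__main__":
    for name, ranges in parse_nm(NM_SAMPLE).items():
        for start, end in ranges:
            print(name, hex(start), hex(end))
```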

## Object Dump and Symbolize

To give syzkaller the information it needs to track coverage, the kernel is built with compiler instrumentation that inserts a call to `__sanitizer_cov_trace_pc` into every basic block. These calls are then used as anchor points to trace back to the covered lines of code.

### objdump

`objdump` is used to parse the PC value of each call to `__sanitizer_cov_trace_pc` in the kernel image. These PC values represent all code that is built into the kernel image. PC values exported by kcov are compared against these to determine coverage.

The kernel image is disassembled using the following command:

```
objdump -d --no-show-raw-insn kernel_image
```

The meaning of the flags is as follows:

* `-d` - disassemble executable code blocks
* `--no-show-raw-insn` - prevent printing hex alongside symbolic disassembly

An excerpt of the output looks like this:

```
...
ffffffff81000f41:	callq  ffffffff81382a00 <__sanitizer_cov_trace_pc>
ffffffff81000f46:	lea    -0x80(%r13),%rdx
ffffffff81000f4a:	lea    -0x40(%r13),%rsi
ffffffff81000f4e:	mov    $0x1c,%edi
ffffffff81000f53:	callq  ffffffff813ed680 <perf_trace_buf_alloc>
ffffffff81000f58:	test   %rax,%rax
ffffffff81000f5b:	je     ffffffff8100110e <perf_trace_initcall_finish+0x2ae>
ffffffff81000f61:	mov    %rax,-0xd8(%rbp)
ffffffff81000f68:	callq  ffffffff81382a00 <__sanitizer_cov_trace_pc>
ffffffff81000f6d:	mov    -0x40(%r13),%rdx
ffffffff81000f71:	mov    0x8(%rbp),%rsi
...
```

From this output, the calls to the coverage trace function are identified to determine the start addresses of the executable blocks:

```
ffffffff81000f41:	callq  ffffffff81382a00 <__sanitizer_cov_trace_pc>
ffffffff81000f68:	callq  ffffffff81382a00 <__sanitizer_cov_trace_pc>
```
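
The filtering shown above could be sketched in Python as follows. This is only an illustration for x86-64 disassembly like the excerpt; the `TRACE_CALL` pattern and the `coverage_points` helper are hypothetical names, and other architectures use different call mnemonics that this pattern does not cover.

```python
import re

# Hypothetical pattern: "<pc>:<whitespace>call[q] <target> <__sanitizer_cov_trace_pc>".
TRACE_CALL = re.compile(
    r"^\s*([0-9a-f]+):\s+call[a-z]*\s+[0-9a-f]+\s+<__sanitizer_cov_trace_pc>"
)

OBJDUMP_SAMPLE = "\n".join([
    "ffffffff81000f41:\tcallq  ffffffff81382a00 <__sanitizer_cov_trace_pc>",
    "ffffffff81000f46:\tlea    -0x80(%r13),%rdx",
    "ffffffff81000f53:\tcallq  ffffffff813ed680 <perf_trace_buf_alloc>",
    "ffffffff81000f68:\tcallq  ffffffff81382a00 <__sanitizer_cov_trace_pc>",
])


def coverage_points(objdump_output):
    """Return the PC of every call to __sanitizer_cov_trace_pc, in order."""
    pcs = []
    for row in objdump_output.splitlines():
        match = TRACE_CALL.match(row)
        if match:
            pcs.append(int(match.group(1), 16))
    return pcs


if __name__ == "__main__":
    for pc in coverage_points(OBJDUMP_SAMPLE):
        print(hex(pc))
```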

### addr2line

`addr2line` is used for mapping PC values exported by kcov and parsed by `objdump` to source code files and lines.

```
addr2line -afi -e kernel_image
```

The meaning of the flags is as follows:

* `-afi` - show addresses, function names and unwind inlined functions
* `-e` - specify the executable to use instead of the default (`a.out`)

`addr2line` reads hexadecimal addresses from standard input and prints the file name,
function name and line number for each address on standard output. Example usage:

```
>> ffffffff8148ba08
<< 0xffffffff8148ba08
<< generic_file_read_iter
<< /home/user/linux/mm/filemap.c:2363
```

where `>>` represents the query and `<<` is the response from `addr2line`.

The final goal is to have a hash table of frames where the key is a program counter
and the value is an array of frames, each consisting of the following information:

* `PC` - 64-bit program counter value (same as key)
* `Func` - name of the function to which the frame belongs
* `File` - file where the function/frame code is located
* `Line` - line in the file to which the program counter maps
* `Inline` - boolean inlining information

Multiple frames can be linked to a single program counter value due to inlining.
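
A minimal Python sketch of building such a frame table from `addr2line -afi` output is given below. It is an illustration, not syzkaller's symbolizer: the `Frame` class and `parse_addr2line` function are hypothetical, the second address in the sample (with `helper_inline` and `caller_func`) is made up to show inlining, and the sketch assumes that every frame except the last one listed for an address is an inlined frame.

```python
from dataclasses import dataclass

ADDR2LINE_SAMPLE = """\
0xffffffff8148ba08
generic_file_read_iter
/home/user/linux/mm/filemap.c:2363
0xffffffff81000f46
helper_inline
/home/user/linux/example.h:10
caller_func
/home/user/linux/example.c:42
"""


@dataclass
class Frame:
    pc: int
    func: str
    file: str
    line: int
    inline: bool


def parse_addr2line(output):
    """Group addr2line output into {pc: [Frame, ...]}."""
    frames = {}
    pc = None
    rows = [row.strip() for row in output.splitlines() if row.strip()]
    i = 0
    while i < len(rows):
        if rows[i].startswith("0x"):
            # An address row (from the -a flag) starts a new group of frames.
            pc = int(rows[i], 16)
            frames[pc] = []
            i += 1
            continue
        if pc is None or i + 1 >= len(rows):
            break  # malformed input: no address yet, or half a pair
        func, location = rows[i], rows[i + 1]
        path, _, tail = location.rpartition(":")
        parts = tail.split()
        number = parts[0] if parts else ""
        line = int(number) if number.isdigit() else 0  # unknown lines become 0
        frames[pc].append(Frame(pc, func, path, line, inline=False))
        i += 2
    # Assumption: only the last frame of each address is the non-inlined one.
    for pc_frames in frames.values():
        for frame in pc_frames[:-1]:
            frame.inline = True
    return frames


if __name__ == "__main__":
    for pc, pc_frames in parse_addr2line(ADDR2LINE_SAMPLE).items():
        print(hex(pc), [(f.func, f.line, f.inline) for f in pc_frames])
```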

## Creating report

Once the database of frames and function address ranges is created, the next step is to determine the program coverage. Each program is represented here as a series of program counter values. As the function address ranges are known at this point, it is easy to determine which functions were called by simply comparing the program counters against these address intervals. In addition, the coverage information is aggregated over the source files based on the program counters that are keys in the frame hash map. These are marked as `coveredPCs`. The resulting coverage is not line-based but basic-block-based. The end result is stored in the `file` struct containing the following information:

* `lines` - lines covered in the file
* `totalPCs` - total program counters identified for this file
* `coveredPCs` - the program counters that were executed in the program run
* `totalInline` - total number of program counters mapped to inlined frames
* `coveredInline` - the program counters mapped to inlined frames that were executed in the program run
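
The aggregation described in this section might be sketched in Python as follows. The `FileCoverage` class, the `aggregate` and `called_functions` helpers, and the input shapes are hypothetical. In particular, counting non-inlined frames toward `total_pcs` and inlined frames toward `total_inline` is an assumption of this sketch, not a statement about how syzkaller counts them.

```python
from dataclasses import dataclass, field


@dataclass
class FileCoverage:
    lines: set = field(default_factory=set)  # covered line numbers
    total_pcs: int = 0
    covered_pcs: int = 0
    total_inline: int = 0
    covered_inline: int = 0


def called_functions(symbols, executed_pcs):
    """symbols: {name: [(start, end), ...]}; return names hit by any PC."""
    pcs = set(executed_pcs)
    return {
        name
        for name, ranges in symbols.items()
        if any(start <= pc < end for start, end in ranges for pc in pcs)
    }


def aggregate(frames, executed_pcs):
    """frames: {pc: [(file, line, inline), ...]}; return {file: FileCoverage}."""
    executed = set(executed_pcs)
    files = {}
    for pc, pc_frames in frames.items():
        covered = pc in executed
        for path, line, inline in pc_frames:
            cov = files.setdefault(path, FileCoverage())
            if inline:
                cov.total_inline += 1
                if covered:
                    cov.covered_inline += 1
            else:
                cov.total_pcs += 1
                if covered:
                    cov.covered_pcs += 1
            if covered:
                cov.lines.add(line)
    return files
```

Because the counters are keyed on program counters rather than source lines, the result reflects basic-block coverage, as noted above.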