# Add a new Extractor

Extractors are plugins that extract inventory information, represented by the
Inventory struct. They are either called on every file on the host (filesystem
extractor) or query files on their own (standalone extractor).

There should be one Extractor per parsing logic. In Python, for example, several
files can represent installed packages. `PKG-INFO`, `egg-info` and `METADATA`
share the same format (MIME type) and therefore the same parsing logic, so a
single extractor
([wheelegg](/extractor/filesystem/language/python/wheelegg/wheelegg.go))
handles all of them. `.egg` files are zip files that contain one of the
previously mentioned files, so `.egg` is also handled by this extractor. On the
other hand, some files have a different format, e.g. `requirements.txt`, which
is just a list of packages. Thus `requirements.txt` gets a separate extractor.

```
wheel_egg/ <- extractor
  **/*egg-info/PKG-INFO
  */.egg-info
  **/*dist-info/METADATA
  **/EGG-INFO/PKG-INFO
  .egg
requirements/ <- extractor
  requirements.txt
...
```
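
As a rough illustration of "one Extractor per parsing logic", the sketch below routes file paths to an extractor name by suffix. The function `extractorFor`, the suffix list and the returned names are hypothetical: they are derived only from the tree above and are not SCALIBR's actual matching code. Paths are assumed to be slash-separated.

```go
package main

import (
	"path"
	"strings"
)

// wheelEggSuffixes lists path endings that share the PKG-INFO/METADATA format.
// The list is an assumption taken from the tree above.
var wheelEggSuffixes = []string{
	".egg-info/PKG-INFO",
	".egg-info",
	".dist-info/METADATA",
	"EGG-INFO/PKG-INFO",
	".egg",
}

// extractorFor returns the name of the (hypothetical) extractor responsible
// for the slash-separated path p, or "" if no extractor applies.
func extractorFor(p string) string {
	for _, suffix := range wheelEggSuffixes {
		if strings.HasSuffix(p, suffix) {
			// Same format, same parsing logic: one extractor for all of them.
			return "python/wheelegg"
		}
	}
	if path.Base(p) == "requirements.txt" {
		// Different format, so a separate extractor.
		return "python/requirements"
	}
	return ""
}
```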

## Extractor interfaces

Extractors implement the [filesystem.Extractor](https://github.com/google/osv-scalibr/blob/f37275e81582aee9/extractor/filesystem/filesystem.go)
or [standalone.Extractor](https://github.com/google/osv-scalibr/blob/f37275e81582aee9/extractor/standalone/standalone.go#L30) interface.

### Filesystem Extractors

<!--  See extractor/filesystem/filesystem.go symbol \bExtractor\b -->

<!--  See plugin/plugin.go symbol \bPlugin\b -->

`FileRequired` should pre-filter files by their filename and file mode.

`Extract` is called on each file for which `FileRequired` returned true. You
don't have to care about opening files, permissions or closing the file. SCALIBR
takes care of this.

Here is a simplified version of how SCALIBR calls the extractors
([actual code](https://github.com/google/osv-scalibr/blob/f37275e81582aee9/extractor/standalone/standalone.go#L49)):

```py
# Simplified pseudocode, not runnable as-is.
# Filesystem extractors: called once per walked file that passes FileRequired.
for f in walk.files:
  for e in filesystemExtractors:
    if e.FileRequired(f):
      fh = open(f)
      inventory.add(e.Extract(fh))
      fh.close()
# Standalone extractors: called once and look up their own files.
for e in standaloneExtractors:
  inventory.add(e.Extract(fs))
```
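
The filesystem half of that loop can be sketched in runnable Go using only the standard library. Everything here is a simplified assumption for illustration: the `extractor` interface, `nameExtractor`, `run` and `exampleFS` are hypothetical names, and the method signatures are reduced versions that do not match SCALIBR's real interfaces.

```go
package main

import (
	"io"
	"io/fs"
	"path"
	"testing/fstest"
)

// extractor is a simplified, hypothetical stand-in for filesystem.Extractor.
type extractor interface {
	FileRequired(p string, mode fs.FileMode) bool
	Extract(p string, r io.Reader) ([]string, error)
}

// nameExtractor is a toy extractor that reports the path of every regular
// file named requirements.txt.
type nameExtractor struct{}

func (nameExtractor) FileRequired(p string, mode fs.FileMode) bool {
	return mode.IsRegular() && path.Base(p) == "requirements.txt"
}

func (nameExtractor) Extract(p string, r io.Reader) ([]string, error) {
	// A real extractor would parse the contents; this one only drains them.
	if _, err := io.Copy(io.Discard, r); err != nil {
		return nil, err
	}
	return []string{p}, nil
}

// run walks fsys and, for every file an extractor asks for, opens the file,
// calls Extract and closes the file again.
func run(fsys fs.FS, extractors []extractor) ([]string, error) {
	var inventory []string
	err := fs.WalkDir(fsys, ".", func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		info, err := d.Info()
		if err != nil {
			return err
		}
		for _, e := range extractors {
			if !e.FileRequired(p, info.Mode()) {
				continue
			}
			f, err := fsys.Open(p)
			if err != nil {
				return err
			}
			found, err := e.Extract(p, f)
			f.Close()
			if err != nil {
				return err
			}
			inventory = append(inventory, found...)
		}
		return nil
	})
	return inventory, err
}

// exampleFS is an in-memory tree for trying out run.
var exampleFS = fstest.MapFS{
	"app/requirements.txt": {Data: []byte("flask==3.0.0\n")},
	"app/main.py":          {Data: []byte("print('hi')\n")},
}
```

Calling `run(exampleFS, []extractor{nameExtractor{}})` visits both files but only opens `app/requirements.txt`, because the pre-filter rejects everything else before any I/O happens.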

SCALIBR calls `Extract` with a
[ScanInput](https://github.com/google/osv-scalibr/blob/f37275e81582aee9/extractor/filesystem/filesystem.go),
which contains the path, `fs.FileInfo` and `io.Reader` for the file. It also
contains an FS interface and the scan root in case the extractor needs to access
other files on the host.

<!--  See extractor/filesystem/filesystem.go symbol \bScanInput\b -->
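
To make the role of these fields concrete, here is a minimal sketch of an `Extract`-style function. The `scanInput` struct is a hypothetical, trimmed-down stand-in for the real ScanInput (field names are assumptions), and the file names `deps.txt` and `deps.lock` are invented for the example. It reads the current file through the reader and reaches a sibling file through the FS.

```go
package main

import (
	"bufio"
	"io"
	"io/fs"
	"path"
	"strings"
	"testing/fstest"
)

// scanInput is a hypothetical, trimmed-down stand-in for filesystem.ScanInput.
type scanInput struct {
	FS     fs.FS       // file system of the scanned host
	Path   string      // path of the current file, relative to the scan root
	Info   fs.FileInfo // metadata of the current file
	Reader io.Reader   // contents of the current file, opened by the caller
}

// extract reads one name per line from input.Reader and reports whether a
// sibling file called deps.lock exists next to the current file.
func extract(input *scanInput) ([]string, bool, error) {
	var names []string
	sc := bufio.NewScanner(input.Reader)
	for sc.Scan() {
		if line := strings.TrimSpace(sc.Text()); line != "" {
			names = append(names, line)
		}
	}
	if err := sc.Err(); err != nil {
		return nil, false, err
	}
	// Other files are reached through input.FS, not through the OS directly.
	lock := path.Join(path.Dir(input.Path), "deps.lock")
	_, statErr := fs.Stat(input.FS, lock)
	return names, statErr == nil, nil
}

// newExampleInput builds a scanInput over an in-memory file system.
func newExampleInput() (*scanInput, error) {
	fsys := fstest.MapFS{
		"app/deps.txt":  {Data: []byte("left-pad\nlodash\n")},
		"app/deps.lock": {Data: []byte("{}")},
	}
	const p = "app/deps.txt"
	data, err := fs.ReadFile(fsys, p)
	if err != nil {
		return nil, err
	}
	info, err := fs.Stat(fsys, p)
	if err != nil {
		return nil, err
	}
	return &scanInput{FS: fsys, Path: p, Info: info, Reader: strings.NewReader(string(data))}, nil
}
```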

### Standalone Extractors

<!--  See extractor/standalone/standalone.go symbol \bExtractor\b -->

`Extract` receives a [ScanInput](https://github.com/google/osv-scalibr/blob/f37275e81582aee9/extractor/standalone/standalone.go#L43)
that gives it access to the root of the scanned host. Use this to read the files
you're interested in.

### Output

For both extractor types, the `Extract` method should return an [Inventory](https://github.com/google/osv-scalibr/tree/main/inventory/inventory.go) struct.

<!--  See inventory/inventory.go symbol \bInventory\b -->

The Inventory struct should have the appropriate fields set (e.g. `Packages`
for software packages, `Secrets` for leaked credentials):

<!--  See extractor/extractor.go symbol \bPackage\b -->

You can return an empty Inventory struct if you don't find software packages
or other inventory in the file. You can also add multiple Package (or other)
entries if one file contains several.

## Code location

Extractors should be in a sub-folder of
[/extractor/filesystem](/extractor/filesystem/) or
[/extractor/standalone](/extractor/standalone/)
depending on the Extractor type. Take a look at existing folders and pick
whichever is the most appropriate location for your Extractor, or create a new
folder if none of the existing ones apply. Feel free to ask SCALIBR devs for
location suggestions during code review.

## Step by step

You can take the [package.json](/extractor/filesystem/language/javascript/packagejson/packagejson.go)
extractor as an example for Filesystem Extractors.

1.  Add a `New()` function that returns an Extractor from the specified plugin config.
    1.  If you'd like to add new plugin-specific config settings for your Extractor:
        1. Add them as a new message to [config.proto](/binary/proto/config.proto).
        1. Re-generate the go_proto:

            ```
            $ make protos
            ```

        1. You'll be able to specify these config settings from the CLI with the
           `--plugin-config` flag.
1.  Implement `Name()` to return a unique name. Best practice is to use the path
    such as `python/requirements`, `javascript/packagejson`, `debian/dpkg`,
    `sbom/spdx`.
1.  Implement `Version()` to return 0. This should be increased later on
    whenever substantial changes are added to the code. Version is used to track
    when bugs are introduced and fixed for a given Extractor.
1.  Implement `Requirements()` to return any required [Capabilities](https://github.com/google/osv-scalibr/blob/f37275e81582aee9/plugin/plugin.go#L63)
    for the system that runs the scanner. For example, if your code needs
    network access, return `&plugin.Capabilities{Online: true}`.
    Ideally your Extractor is able to run in any scanning environment
    and will return an empty struct.
1.  (For Filesystem Extractors) Implement `FileRequired` to return true if
    the filename and file mode match a file you need to parse. For example,
    the JavaScript `package.json` extractor returns true for any file
    named `package.json`.
1.  Implement `Extract` to extract inventory from the current file
    (or from elsewhere on the filesystem).
1.  If you introduced any new metadata type, be sure to:
    1. Add it to [scan_result.proto](/binary/proto/scan_result.proto).
    1. Re-generate the go_proto:

        ```
        $ make protos
        ```

    1. Implement `func (m *Metadata) SetProto(p *pb.Package)` and `ToStruct(m *pb.MyMetadata) *Metadata`.
    1. Add the `ToStruct` function to the metadata map in `binary/proto/package_metadata.go`.

1.  If you added new dependencies, regenerate the go.mod file by running:

    ```sh
    $ go mod tidy
    ```

1.  If your inventory is a Package that can have a corresponding Package URL,
    check that [extractor.ToPURL](/extractor/convert.go)
    generates a valid PURL for your package's PURL type. Implement your custom
    PURL generation logic here if necessary.
1.  Write tests (you can separate tests for `FileRequired` and `Extract`, to
    avoid having to give test data specific file names).
1.  Register your extractor in
    [list.go](/extractor/filesystem/list/list.go).
1.  Update `docs/supported_inventory_types.md` to include your new extractor.
1.  Optional: Test locally. Use the name of the extractor given by `Name()` to
    select your extractor. For the `packagejson` extractor it would look like
    this:

    ```sh
    $ scalibr --extractors=javascript/packagejson ...
    ```

    You can find more details on how to run scalibr in
    [README.md](/README.md#as-a-standalone-binary).

1.  Submit your code for review. Once merged, the extractor is ready to use, but
    not activated in any defaults yet.
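
The plugin-facing methods from the steps above can be sketched as a skeleton. This is an illustration under stated assumptions, not SCALIBR's real API: `capabilities` is a reduced stand-in for `plugin.Capabilities`, `config` and its `MaxFileSizeBytes` setting are hypothetical, and `FileRequired` takes a path and size instead of SCALIBR's actual argument type.

```go
package main

import "path"

// capabilities is a hypothetical, reduced stand-in for plugin.Capabilities.
type capabilities struct {
	Online bool // true if the plugin needs network access
}

// config holds hypothetical plugin-specific settings.
type config struct {
	MaxFileSizeBytes int64 // 0 means no limit
}

// Extractor is a skeleton for a requirements.txt extractor.
type Extractor struct {
	maxFileSizeBytes int64
}

// New returns an Extractor built from the given plugin config.
func New(cfg config) *Extractor {
	return &Extractor{maxFileSizeBytes: cfg.MaxFileSizeBytes}
}

// Name returns a unique, path-like name.
func (e *Extractor) Name() string { return "python/requirements" }

// Version starts at 0 and is increased on substantial changes.
func (e *Extractor) Version() int { return 0 }

// Requirements returns an empty struct: this extractor runs in any environment.
func (e *Extractor) Requirements() *capabilities { return &capabilities{} }

// FileRequired pre-filters by file name and by the configured size limit.
func (e *Extractor) FileRequired(p string, size int64) bool {
	if path.Base(p) != "requirements.txt" {
		return false
	}
	return e.maxFileSizeBytes <= 0 || size <= e.maxFileSizeBytes
}
```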

To add your extractor to the list of default extractors, add it in
[extractor/filesystem/list/list.go](/extractor/filesystem/list/list.go).
Please submit this code separately from the main extractor logic.

In case you have any questions or feedback, feel free to open an issue.