<!-- Source: github.com/google/osv-scalibr@v0.4.1/docs/new_extractor.md -->

# Add a new Extractor

Extractors are plugins that extract inventory information, represented by the
Inventory struct. They are either called on every file on the host (filesystem
extractor) or query files on their own (standalone extractor).

There should be one Extractor per parsing logic. In Python, for example, there
are multiple files that represent installed packages. `PKG-INFO`, `egg-info` and
`METADATA` have the same format (MIME type) and therefore the same parsing logic.
Therefore there is one extractor
([wheelegg](/extractor/filesystem/language/python/wheelegg/wheelegg.go))
for all of them. `.egg` files are zip files which contain one of the previously
mentioned files, thus `.egg` is also handled by this extractor. On the other
hand, there are files which have a different format, e.g. `requirements.txt`,
which is just a list of packages. Thus `requirements.txt` gets a separate
extractor.

```
wheel_egg/ <- extractor
  **/*egg-info/PKG-INFO
  */.egg-info
  **/*dist-info/METADATA
  **/EGG-INFO/PKG-INFO
  .egg
requirements/ <- extractor
  requirements.txt
...
```

## Extractor interfaces

Extractors use the [filesystem.Extractor](https://github.com/google/osv-scalibr/blob/f37275e81582aee9/extractor/standalone/standalone.go#L30)
or [standalone.Extractor](https://github.com/google/osv-scalibr/blob/f37275e81582aee9/extractor/standalone/standalone.go#L30) interface.

### Filesystem Extractors

<!-- See extractor/filesystem/filesystem.go symbol \bExtractor\b -->

<!-- See plugin/plugin.go symbol \bPlugin\b -->

`FileRequired` should pre-filter the files by their filename and fileMode.

`Extract` will be called on each file for which `FileRequired` returned true.
You don't have to worry about opening files, checking permissions or closing
the file. SCALIBR takes care of this.

Here is a simplified version of how SCALIBR calls the extractors
([actual code](https://github.com/google/osv-scalibr/blob/f37275e81582aee9/extractor/standalone/standalone.go#L49)):

```py
# Simplified pseudocode, not runnable as-is.
# Filesystem extractors: offered every file found during the walk.
for f in walk.files:
  for e in filesystemExtractors:
    if e.FileRequired(f):
      # SCALIBR opens and closes the file on the extractor's behalf.
      fh = open(f)
      inventory.add(e.Extract(fh))
      fh.close()
# Standalone extractors: called once and locate their own files.
for e in standaloneExtractors:
  inventory.add(e.Extract(fs))
```

SCALIBR will call `Extract` with
[ScanInput](https://github.com/google/osv-scalibr/blob/f37275e81582aee9/extractor/standalone/standalone.go#L43),
which contains the path, `fs.FileInfo` and `io.Reader` for the file. It also
contains an FS interface and the scan root in case the extractor needs to access
other files on the host.

<!-- See extractor/filesystem/filesystem.go symbol \bScanInput\b -->

### Standalone Extractors

<!-- See extractor/standalone/standalone.go symbol \bExtractor\b -->

`Extract` receives a [ScanInput](https://github.com/google/osv-scalibr/blob/f37275e81582aee9/extractor/standalone/standalone.go#L43)
that gives it access to the root of the scanned host. Use this to read the files
you're interested in.

### Output

For both extractor types, the `Extract` method should return an [Inventory](https://github.com/google/osv-scalibr/tree/main/inventory/inventory.go) struct.

<!-- See inventory/inventory.go symbol \bInventory\b -->

The Inventory struct should have its appropriate fields set (e.g. `Packages`
for software packages, `Secrets` for leaked credentials):

<!-- See extractor/extractor.go symbol \bPackage\b -->

You can return an empty Inventory struct if you don't find software packages or
other inventory in the file. You can also add multiple Package (or other)
entries if the file contains more than one.

## Code location

Extractors should be in a sub-folder of
[/extractor/filesystem](/extractor/) or
[/extractor/standalone](/standalone/)
depending on the Extractor type. Take a look at existing folders and pick
whichever is the most appropriate location for your Extractor, or create a new
folder if none of the existing ones apply. Feel free to ask SCALIBR devs for
location suggestions during code review.

## Step by step

You can take the [package.json](/extractor/filesystem/language/javascript/packagejson/packagejson.go)
extractor as an example for Filesystem Extractors.

1. Add a `New()` function that returns an Extractor from the specified plugin config.
   1. If you'd like to add new plugin-specific config settings for your Extractor:
      1. Add them as a new message to [config.proto](third_party/scalibr/binary/proto/config.proto).
      1. Re-generate the go_proto:

         ```
         $ make protos
         ```

      1. You'll be able to specify these config settings from the CLI with the
         `--plugin-config` flag.
1. Implement `Name()` to return a unique name. Best practice is to use the path
   such as `python/requirements`, `javascript/packagejson`, `debian/dpkg`,
   `sbom/spdx`.
1. Implement `Version()` to return 0. This should be increased later on
   whenever substantial changes are added to the code. Version is used to track
   when bugs are introduced and fixed for a given Extractor.
1. Implement `Requirements()` to return any required [Capabilities](https://github.com/google/osv-scalibr/blob/f37275e81582aee9/plugin/plugin.go#L63)
   for the system that runs the scanner. For example, if your code needs
   network access, return `&plugin.Capabilities{Online: true}`.
   Ideally your Extractor is able to run in any scanning environment
   and will return an empty struct.
1. (For Filesystem Extractors) Implement `FileRequired` to return true if
   the filename and fileMode match a file you need to parse. For example,
   the JavaScript `package.json` extractor returns true for any file
   named `package.json`.
1. Implement `Extract` to extract inventory inside the current file
   (or from elsewhere on the filesystem).
1. If you introduced any new metadata type, be sure to:
   1. Add it to the [scan_result.proto](third_party/scalibr/binary/proto/scan_result.proto).
   1. Re-generate the go_proto:

      ```
      $ make protos
      ```

   1. Implement `func (m *Metadata) SetProto(p *pb.Package)` and `ToStruct(m *pb.MyMetadata) *Metadata`.
   1. Add the `ToStruct` function to the metadata map in `binary/proto/package_metadata.go`.

1. If you added new dependencies, regenerate the go.mod file by running:

   ```sh
   $ go mod tidy
   ```

1. If your inventory is a Package that can have a corresponding Package URL,
   check that [extractor.ToPURL](/extractor/convert.go)
   generates a valid PURL for your package's PURL type. Implement your custom
   PURL generation logic here if necessary.
1. Write tests (you can separate tests for `FileRequired` and `Extract`, to avoid
   having to give test data specific file names).
1. Register your extractor in
   [list.go](/extractor/filesystem/list/list.go).
1. Update `docs/supported_inventory_types.md` to include your new extractor.
1. Optional: Test locally. Use the name of the extractor given by `Name()` to
   select your extractor. For the `packagejson` extractor it would look like
   this:

   ```sh
   $ scalibr --extractors=javascript/packagejson ...
   ```

   You can find more details on how to run scalibr in
   [README.md](/README.md#as-a-standalone-binary).

1. Submit your code for review. Once merged, the extractor is ready to use, but
   not activated in any defaults yet.
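
The `New()`, `Name()`, `Version()` and `Requirements()` steps above might look roughly like the sketch below. It is self-contained on purpose: `Capabilities` is a hypothetical local stand-in for `plugin.Capabilities` (only the `Online` field mentioned above is modelled), and `New()` takes no plugin config here, which is a simplification of the real constructor.

```go
package main

import "fmt"

// Capabilities is a hypothetical local stand-in for plugin.Capabilities.
// Only the Online field mentioned in this guide is modelled.
type Capabilities struct {
	Online bool
}

// Extractor is an illustrative extractor with no configuration.
type Extractor struct{}

// New returns a ready-to-use Extractor. Assumption: the real New() receives
// a plugin config; that argument is omitted in this sketch.
func New() *Extractor { return &Extractor{} }

// Name returns a unique, path-like name for the extractor.
func (Extractor) Name() string { return "python/requirements" }

// Version starts at 0 and is bumped on substantial changes.
func (Extractor) Version() int { return 0 }

// Requirements returns an empty struct: this extractor needs no network
// access or other special capabilities.
func (Extractor) Requirements() *Capabilities { return &Capabilities{} }

func main() {
	e := New()
	fmt.Println(e.Name(), e.Version(), e.Requirements().Online)
}
```

An extractor that needs network access would instead return `&Capabilities{Online: true}` from `Requirements()`, which limits the environments it can run in.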

To add your extractor to the list of default extractors, add it in
[extractor/list/list.go](/extractor/filesystem/list/list.go).
Please submit this code separately from the main extractor logic.

In case you have any questions or feedback, feel free to open an issue.