# CPE Generation

This package generates Common Platform Enumeration (CPE) identifiers for software packages discovered by Syft.
CPEs are standardized identifiers that enable vulnerability matching by linking packages to known vulnerabilities in databases like the National Vulnerability Database (NVD).

## Overview

CPE generation in Syft uses a **two-tier approach** to balance accuracy and coverage:

1. **Dictionary Lookups** (Authoritative): Pre-validated CPEs from the official NIST CPE dictionary
2. **Heuristic Generation** (Fallback): Intelligent generation based on package metadata and ecosystem-specific patterns

This dual approach ensures:
- **High accuracy** for packages in the NIST dictionary (no false positives)
- **Broad coverage** for packages not yet in the dictionary (maximizes vulnerability detection)
- **Fast performance** with an embedded, indexed CPE dictionary (~814KB)

## Why It Matters

CPEs link discovered packages to security vulnerabilities (CVEs) in tools like Grype. Without accurate CPE generation, vulnerability scanning misses security issues.

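For orientation, a CPE 2.3 formatted string consists of the prefix `cpe:2.3` followed by eleven colon-separated attributes (part, vendor, product, version, update, edition, language, sw_edition, target_sw, target_hw, other). The sketch below is illustrative only: `cpeAttributes` and `formatCPE` are hypothetical names, not part of this package, and the sketch skips the escaping of special characters that a real implementation would need.

```go
package main

import (
	"fmt"
	"strings"
)

// cpeAttributes is a hypothetical struct holding the handful of CPE 2.3
// attributes that matter most for package matching.
type cpeAttributes struct {
	Part     string // "a" (application), "o" (operating system) or "h" (hardware)
	Vendor   string
	Product  string
	Version  string
	TargetSW string
}

// orAny returns the CPE wildcard "*" for an unset attribute.
func orAny(s string) string {
	if s == "" {
		return "*"
	}
	return s
}

// formatCPE renders the attributes as a CPE 2.3 formatted string.
// Note: no escaping of special characters is performed in this sketch.
func formatCPE(a cpeAttributes) string {
	fields := []string{
		"cpe", "2.3",
		orAny(a.Part), orAny(a.Vendor), orAny(a.Product), orAny(a.Version),
		"*", "*", "*", "*", // update, edition, language, sw_edition
		orAny(a.TargetSW),
		"*", "*", // target_hw, other
	}
	return strings.Join(fields, ":")
}

func main() {
	fmt.Println(formatCPE(cpeAttributes{
		Part:     "a",
		Vendor:   "lodash",
		Product:  "lodash",
		TargetSW: "node.js",
	}))
}
```

With the inputs above, the result has the same shape as the `lodash` entry shown under "Embedded Index Format".
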
## How It Works

### Architecture

```
┌─────────────────────────────────────────────────────────┐
│                 Syft Package Discovery                  │
└──────────────────┬──────────────────────────────────────┘
                   │
                   ▼
        ┌─────────────────────┐
        │   CPE Generation    │
        │   (this package)    │
        └──────────┬──────────┘
                   │
       ┌───────────┴────────────┐
       │                        │
       ▼                        ▼
┌──────────────────┐   ┌─────────────────────┐
│  Dictionary      │   │   Heuristic         │
│  Lookup          │   │   Generation        │
│                  │   │                     │
│ • Embedded index │   │ • Ecosystem rules   │
│ • ~22K entries   │   │ • Vendor/product    │
│ • 11 ecosystems  │   │   candidates        │
└──────────────────┘   │ • Curated mappings  │
                       │ • Smart filters     │
                       └─────────────────────┘
```

### Dictionary Generation Process

The dictionary is generated offline and embedded into the Syft binary for fast, offline lookups.

**Location**: `dictionary/index-generator/`

**Process**:
1. **Fetch**: Retrieves CPE data from NVD Products API using incremental updates
2. **Cache**: Stores raw API responses in ORAS registry for reuse (`.cpe-cache/`)
3. **Filter**:
   - Removes CPEs without reference URLs
   - Excludes hardware (`h`) and OS (`o`) CPEs (keeps only applications `a`)
4. **Index by Ecosystem**:
   - Extracts package names from reference URLs (npm, pypi, rubygems, etc.)
   - Creates index: `ecosystem → package_name → [CPE strings]`
5. **Embed**: Generates `data/cpe-index.json` embedded via `go:embed` directive

### Runtime CPE Lookup/Generation

**Entry Point**: `generate.go`

When Syft discovers a package:

1. **Check for Declared CPEs**: If package metadata already contains CPEs (from SBOM imports), skip generation
2. **Try Dictionary Lookup** (`FromDictionaryFind`):
   - Loads embedded CPE index (singleton, loaded once)
   - Looks up by ecosystem + package name
   - Returns pre-validated CPEs if found
   - Marks source as `NVDDictionaryLookupSource`
3. **Fallback to Heuristic Generation** (`FromPackageAttributes`):
   - Generates vendor/product/targetSW candidates using ecosystem-specific logic
   - Creates CPE permutations from candidates
   - Applies filters to remove known false positives
   - Marks source as `GeneratedSource`

### Supported Ecosystems

**Dictionary Lookups** (11 ecosystems):
npm, RubyGems, PyPI, Jenkins Plugins, crates.io, PHP, Go Modules, WordPress Plugins/Themes

**Heuristic Generation** (all package types):
All dictionary ecosystems plus Java, .NET/NuGet, Alpine APK, Debian/RPM, and any other package type Syft discovers

### Ecosystem-Specific Intelligence

The heuristic generator uses per-ecosystem strategies:

- **Java**: Extracts vendor from groupId, product from artifactId
- **Python**: Parses author fields, adds `_project` suffix variants
- **Go**: Extracts org/repo from module paths (`github.com/org/repo`)
- **JavaScript**: Handles npm scope patterns (`@scope/package`)

### Curated Mappings & Filters

- **500+ curated mappings**: `curl` → `haxx`, `spring-boot` → `pivotal`, etc.
- **Filters**: Prevent false positives (Jenkins plugins vs. core, Jira client vs. server)
- **Validation**: Ensures CPE syntax correctness before returning

## Implementation Details

### Embedded Index Format

```json
{
  "ecosystems": {
    "npm": {
      "lodash": ["cpe:2.3:a:lodash:lodash:*:*:*:*:*:node.js:*:*"]
    },
    "pypi": {
      "Django": ["cpe:2.3:a:djangoproject:django:*:*:*:*:*:python:*:*"]
    }
  }
}
```

The dictionary generator maps packages to ecosystems using reference URL patterns (npmjs.com, pypi.org, rubygems.org, etc.).

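The following is a minimal sketch of how an index in this format could drive the runtime flow described under "Runtime CPE Lookup/Generation": declared CPEs first, then a dictionary lookup, then a heuristic fallback. It is not the package's actual code. The names `index`, `lookup`, `heuristic`, and `resolve`, the plain-string source labels, and the naive "vendor equals product" heuristic are all assumptions made for illustration; the real entry points named in this README are `FromDictionaryFind` and `FromPackageAttributes`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// index mirrors the JSON layout shown above:
// ecosystem -> package name -> CPE strings.
type index struct {
	Ecosystems map[string]map[string][]string `json:"ecosystems"`
}

// sampleIndex stands in for the embedded data/cpe-index.json file.
const sampleIndex = `{
  "ecosystems": {
    "npm":  {"lodash": ["cpe:2.3:a:lodash:lodash:*:*:*:*:*:node.js:*:*"]},
    "pypi": {"Django": ["cpe:2.3:a:djangoproject:django:*:*:*:*:*:python:*:*"]}
  }
}`

// mustLoad parses the index JSON and panics on malformed input.
func mustLoad(raw string) index {
	var idx index
	if err := json.Unmarshal([]byte(raw), &idx); err != nil {
		panic(err)
	}
	return idx
}

// lookup returns the CPEs recorded for ecosystem + package name, if any.
// Reading from a missing (nil) inner map is safe in Go and yields no CPEs.
func lookup(idx index, ecosystem, name string) ([]string, bool) {
	cpes := idx.Ecosystems[ecosystem][name]
	return cpes, len(cpes) > 0
}

// heuristic is a hypothetical, deliberately naive fallback that guesses
// vendor == product == package name. The real generator is far richer.
func heuristic(name, targetSW string) []string {
	return []string{fmt.Sprintf("cpe:2.3:a:%s:%s:*:*:*:*:*:%s:*:*", name, name, targetSW)}
}

// resolve applies the three steps in order and reports which one produced
// the result. The source labels are illustrative strings only.
func resolve(idx index, declared []string, ecosystem, name, targetSW string) ([]string, string) {
	if len(declared) > 0 {
		return declared, "declared"
	}
	if found, ok := lookup(idx, ecosystem, name); ok {
		return found, "dictionary"
	}
	return heuristic(name, targetSW), "generated"
}

func main() {
	idx := mustLoad(sampleIndex)
	for _, name := range []string{"lodash", "left-pad"} {
		cpes, source := resolve(idx, nil, "npm", name, "node.js")
		fmt.Println(name, source, cpes)
	}
}
```

In this sketch `lodash` is answered from the dictionary, while `left-pad`, which is absent from the sample index, falls through to the heuristic path.
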
## Maintenance

### Updating the CPE Dictionary

The CPE dictionary should be updated periodically to include new packages:

```bash
# Full workflow: pull cache → update from NVD → build index
make generate:cpe-index

# Or run individual steps:
make generate:cpe-index:cache:pull     # Pull cached CPE data from ORAS
make generate:cpe-index:cache:update   # Fetch updates from NVD Products API
make generate:cpe-index:build          # Generate cpe-index.json from cache
```

**Optional**: Set `NVD_API_KEY` for faster updates (50 req/30s vs 5 req/30s)

This workflow:
1. Pulls existing cache from ORAS registry (avoids re-fetching all ~1.5M CPEs)
2. Fetches only products modified since last update from NVD Products API
3. Builds indexed dictionary (~814KB, ~22K entries)
4. Pushes updated cache for team reuse

### Extending CPE Generation

**Add dictionary support for a new ecosystem:**
1. Add URL pattern in `index-generator/generate.go`
2. Regenerate index with `make generate:cpe-index`

**Improve heuristic generation:**
1. Modify ecosystem-specific file (e.g., `java.go`, `python.go`)
2. Add curated mappings to `candidate_by_package_type.go`

**Key files:**
- `generate.go` - Main generation logic
- `dictionary/` - Dictionary generator and embedded index
- `candidate_by_package_type.go` - Ecosystem-specific candidates
- `filter.go` - Filtering rules
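
To make the "add URL pattern" step more concrete, here is a rough sketch of how reference URLs could be classified into an ecosystem and package name, and then folded into an `ecosystem → package_name → [CPE strings]` index. The regular expressions, the `urlPattern` and `cpeRecord` types, and the `classify` and `buildIndex` functions are assumptions for illustration; the actual patterns live in `index-generator/generate.go` and may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// urlPattern pairs an ecosystem key with a regexp whose first capture
// group is the package name. All patterns here are illustrative guesses.
type urlPattern struct {
	ecosystem string
	re        *regexp.Regexp
}

var patterns = []urlPattern{
	// Allows an optional npm scope such as @scope/package.
	{"npm", regexp.MustCompile(`^https?://(?:www\.)?npmjs\.com/package/((?:@[^/]+/)?[^/?#]+)`)},
	{"pypi", regexp.MustCompile(`^https?://pypi\.org/project/([^/?#]+)`)},
	{"rubygems", regexp.MustCompile(`^https?://rubygems\.org/gems/([^/?#]+)`)},
}

// classify returns the ecosystem and package name for a reference URL,
// or ok == false when no pattern matches.
func classify(refURL string) (ecosystem, name string, ok bool) {
	for _, p := range patterns {
		if m := p.re.FindStringSubmatch(refURL); m != nil {
			return p.ecosystem, m[1], true
		}
	}
	return "", "", false
}

// cpeRecord is a hypothetical, simplified view of one NVD product entry.
type cpeRecord struct {
	cpe  string
	refs []string
}

// buildIndex groups CPE strings by ecosystem and package name. Records
// without a recognizable reference URL contribute nothing to the index.
func buildIndex(records []cpeRecord) map[string]map[string][]string {
	idx := make(map[string]map[string][]string)
	for _, r := range records {
		for _, ref := range r.refs {
			eco, name, ok := classify(ref)
			if !ok {
				continue
			}
			if idx[eco] == nil {
				idx[eco] = make(map[string][]string)
			}
			idx[eco][name] = append(idx[eco][name], r.cpe)
		}
	}
	return idx
}

func main() {
	idx := buildIndex([]cpeRecord{
		{
			cpe:  "cpe:2.3:a:lodash:lodash:*:*:*:*:*:node.js:*:*",
			refs: []string{"https://www.npmjs.com/package/lodash"},
		},
		{
			cpe:  "cpe:2.3:a:djangoproject:django:*:*:*:*:*:python:*:*",
			refs: []string{"https://pypi.org/project/Django/"},
		},
	})
	fmt.Println(idx)
}
```

Under this design, supporting a new ecosystem amounts to appending one more entry to the pattern list and regenerating the index.
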