# CPE Generation

This package generates Common Platform Enumeration (CPE) identifiers for software packages discovered by Syft.
CPEs are standardized identifiers that enable vulnerability matching by linking packages to known vulnerabilities in databases like the National Vulnerability Database (NVD).

## Overview

CPE generation in Syft uses a **two-tier approach** to balance accuracy and coverage:

1. **Dictionary Lookups** (Authoritative): Pre-validated CPEs from the official NIST CPE dictionary
2. **Heuristic Generation** (Fallback): Candidate CPEs derived from package metadata and ecosystem-specific patterns

This dual approach provides:
- **High accuracy** for packages in the NIST dictionary (pre-validated CPEs keep false positives low)
- **Broad coverage** for packages not yet in the dictionary (maximizes vulnerability detection)
- **Fast performance** with an embedded, indexed CPE dictionary (~814 KB)

## Why It Matters

CPEs link discovered packages to security vulnerabilities (CVEs) in tools like Grype. Without accurate CPE generation, vulnerability scanners can miss security issues or report false positives.

## How It Works

### Architecture

```
┌─────────────────────────────────────────────────────────┐
│  Syft Package Discovery                                 │
└──────────────────┬──────────────────────────────────────┘
                   │
                   ▼
         ┌─────────────────────┐
         │  CPE Generation     │
         │  (this package)     │
         └──────────┬──────────┘
                    │
        ┌───────────┴────────────┐
        │                        │
        ▼                        ▼
┌──────────────────┐    ┌─────────────────────┐
│ Dictionary       │    │ Heuristic           │
│ Lookup           │    │ Generation          │
│                  │    │                     │
│ • Embedded index │    │ • Ecosystem rules   │
│ • ~22K entries   │    │ • Vendor/product    │
│ • 11 ecosystems  │    │   candidates        │
└──────────────────┘    │ • Curated mappings  │
                        │ • Smart filters     │
                        └─────────────────────┘
```

### Dictionary Generation Process

The dictionary is generated offline and embedded into the Syft binary, so lookups are fast and need no network access.

**Location**: `dictionary/index-generator/`

**Process**:
1. **Fetch**: Retrieves CPE data from the NVD Products API using incremental updates
2. **Cache**: Stores raw API responses in an ORAS registry for reuse (`.cpe-cache/`)
3. **Filter**:
   - Removes CPEs without reference URLs
   - Excludes hardware (`h`) and OS (`o`) CPEs (keeps only applications, `a`)
4. **Index by Ecosystem**:
   - Extracts package names from reference URLs (npm, pypi, rubygems, etc.)
   - Creates index: `ecosystem → package_name → [CPE strings]`
5. **Embed**: Generates `data/cpe-index.json`, which is embedded via the `go:embed` directive
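
The filter and index steps above can be sketched roughly as follows. This is a minimal illustration, not the generator's actual code: the `product` type, the function names, and the two URL shapes matched here are assumptions made for the example.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// product is a hypothetical, trimmed-down view of one NVD product record.
type product struct {
	CPE  string
	Refs []string // reference URLs attached to the CPE
}

// cpePart returns the "part" field of a CPE 2.3 formatted string
// ("a" application, "o" operating system, "h" hardware), or "" if malformed.
func cpePart(cpe string) string {
	fields := strings.Split(cpe, ":")
	if len(fields) < 3 || fields[0] != "cpe" || fields[1] != "2.3" {
		return ""
	}
	return fields[2]
}

// ecosystemAndName guesses (ecosystem, package name) from a reference URL.
// Only two illustrative URL shapes are handled; they are assumptions.
func ecosystemAndName(ref string) (string, string, bool) {
	u, err := url.Parse(ref)
	if err != nil {
		return "", "", false
	}
	segs := strings.Split(strings.Trim(u.Path, "/"), "/")
	switch strings.TrimPrefix(u.Host, "www.") {
	case "npmjs.com": // e.g. https://www.npmjs.com/package/<name>
		if len(segs) >= 2 && segs[0] == "package" {
			return "npm", strings.Join(segs[1:], "/"), true
		}
	case "pypi.org": // e.g. https://pypi.org/project/<name>/
		if len(segs) >= 2 && segs[0] == "project" {
			return "pypi", segs[1], true
		}
	}
	return "", "", false
}

// buildIndex keeps application CPEs that have reference URLs and groups them
// as ecosystem -> package name -> CPE strings. It does not deduplicate.
func buildIndex(products []product) map[string]map[string][]string {
	index := map[string]map[string][]string{}
	for _, p := range products {
		if cpePart(p.CPE) != "a" || len(p.Refs) == 0 {
			continue // drop hardware/OS CPEs and CPEs without references
		}
		for _, ref := range p.Refs {
			eco, name, ok := ecosystemAndName(ref)
			if !ok {
				continue
			}
			if index[eco] == nil {
				index[eco] = map[string][]string{}
			}
			index[eco][name] = append(index[eco][name], p.CPE)
		}
	}
	return index
}

func main() {
	idx := buildIndex([]product{
		{CPE: "cpe:2.3:a:lodash:lodash:*:*:*:*:*:node.js:*:*", Refs: []string{"https://www.npmjs.com/package/lodash"}},
		{CPE: "cpe:2.3:o:example:os:*:*:*:*:*:*:*:*", Refs: []string{"https://pypi.org/project/os/"}},
	})
	fmt.Println(idx["npm"]["lodash"])
}
```

In this sketch the OS-type CPE is dropped even though its reference URL matches a known host, which mirrors the "applications only" rule in step 3.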

### Runtime CPE Lookup/Generation

**Entry Point**: `generate.go`

When Syft discovers a package:

1. **Check for Declared CPEs**: If the package metadata already contains CPEs (for example, from an imported SBOM), skip generation
2. **Try Dictionary Lookup** (`FromDictionaryFind`):
   - Loads the embedded CPE index (a singleton, loaded once)
   - Looks up by ecosystem + package name
   - Returns pre-validated CPEs if found
   - Marks source as `NVDDictionaryLookupSource`
3. **Fall Back to Heuristic Generation** (`FromPackageAttributes`):
   - Generates vendor/product/targetSW candidates using ecosystem-specific logic
   - Creates CPE permutations from candidates
   - Applies filters to remove known false positives
   - Marks source as `GeneratedSource`
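
The order of these three steps can be illustrated with a small, self-contained sketch. It does not use Syft's real types or the signatures of `FromDictionaryFind` and `FromPackageAttributes`; `pkgInfo`, `result`, `resolveCPEs`, and the source labels are hypothetical stand-ins, and the heuristic is reduced to a single naive candidate.

```go
package main

import (
	"fmt"
	"sync"
)

// pkgInfo is a hypothetical stand-in for a discovered package.
type pkgInfo struct {
	Ecosystem    string
	Name         string
	DeclaredCPEs []string
}

// result pairs CPEs with a label describing where they came from.
// The labels are placeholders, not Syft's source constants.
type result struct {
	CPEs   []string
	Source string
}

var (
	indexOnce sync.Once
	index     map[string]map[string][]string
)

// loadIndex initializes the index exactly once (singleton pattern).
// A literal map stands in for decoding the embedded JSON file.
func loadIndex() map[string]map[string][]string {
	indexOnce.Do(func() {
		index = map[string]map[string][]string{
			"npm": {"lodash": {"cpe:2.3:a:lodash:lodash:*:*:*:*:*:node.js:*:*"}},
		}
	})
	return index
}

// heuristicCPEs is a deliberately naive fallback: vendor = product = name.
func heuristicCPEs(p pkgInfo) []string {
	return []string{fmt.Sprintf("cpe:2.3:a:%s:%s:*:*:*:*:*:*:*:*", p.Name, p.Name)}
}

// resolveCPEs applies the three steps in order: declared, dictionary, heuristic.
func resolveCPEs(p pkgInfo) result {
	if len(p.DeclaredCPEs) > 0 {
		return result{CPEs: p.DeclaredCPEs, Source: "declared"}
	}
	if cpes, ok := loadIndex()[p.Ecosystem][p.Name]; ok {
		return result{CPEs: cpes, Source: "dictionary"}
	}
	return result{CPEs: heuristicCPEs(p), Source: "generated"}
}

func main() {
	fmt.Println(resolveCPEs(pkgInfo{Ecosystem: "npm", Name: "lodash"}).Source)
	fmt.Println(resolveCPEs(pkgInfo{Ecosystem: "npm", Name: "left-pad"}).Source)
}
```

Recording the source alongside each CPE lets downstream consumers distinguish dictionary-backed matches from heuristic guesses.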

### Supported Ecosystems

**Dictionary Lookups** (11 ecosystems):
npm, RubyGems, PyPI, Jenkins Plugins, crates.io, PHP, Go Modules, WordPress Plugins/Themes

**Heuristic Generation** (all package types):
All dictionary ecosystems plus Java, .NET/NuGet, Alpine APK, Debian/RPM, and any other package type Syft discovers

### Ecosystem-Specific Intelligence

The heuristic generator uses per-ecosystem strategies:

- **Java**: Extracts vendor from groupId, product from artifactId
- **Python**: Parses author fields, adds `_project` suffix variants
- **Go**: Extracts org/repo from module paths (`github.com/org/repo`)
- **JavaScript**: Handles npm scope patterns (`@scope/package`)
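
To make three of these strategies concrete, here is a rough sketch of how vendor/product candidates might be derived from a Go module path, an npm name, and Maven coordinates. The function names and the exact rules (for example, skipping `org`/`com`/`net`/`io` in a groupId) are assumptions for illustration and are simpler than the real per-ecosystem files.

```go
package main

import (
	"fmt"
	"strings"
)

// candidate is a hypothetical vendor/product pair used to build a CPE.
type candidate struct{ Vendor, Product string }

// fromGoModule maps "host/org/repo[/...]" to vendor=org, product=repo.
func fromGoModule(path string) (candidate, bool) {
	parts := strings.Split(path, "/")
	if len(parts) < 3 || !strings.Contains(parts[0], ".") {
		return candidate{}, false
	}
	return candidate{Vendor: parts[1], Product: parts[2]}, true
}

// fromNpmName maps "@scope/package" to vendor=scope, product=package.
// Unscoped names fall back to vendor=product=name.
func fromNpmName(name string) candidate {
	if strings.HasPrefix(name, "@") {
		if scope, pkg, ok := strings.Cut(name[1:], "/"); ok {
			return candidate{Vendor: scope, Product: pkg}
		}
	}
	return candidate{Vendor: name, Product: name}
}

// fromMaven uses the first groupId segment that is not a common
// reverse-domain prefix as the vendor, and the artifactId as the product.
func fromMaven(groupID, artifactID string) candidate {
	vendor := groupID
	for _, seg := range strings.Split(groupID, ".") {
		switch seg {
		case "org", "com", "net", "io":
			continue // skip the prefix and look at the next segment
		}
		vendor = seg
		break
	}
	return candidate{Vendor: vendor, Product: artifactID}
}

func main() {
	c, _ := fromGoModule("github.com/anchore/syft")
	fmt.Println(c, fromNpmName("@angular/core"), fromMaven("org.apache.commons", "commons-lang3"))
}
```

Each function yields only one candidate here; the generator described above produces several candidates per package and then permutes and filters them.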

### Curated Mappings & Filters

- **500+ curated mappings**: `curl` → `haxx`, `spring-boot` → `pivotal`, etc.
- **Filters**: Prevent false positives (Jenkins plugins vs. core, Jira client vs. server)
- **Validation**: Ensures CPE syntax correctness before returning
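
A curated mapping and a filter could interact roughly as in the sketch below. The map contents come from the examples above, but the data structures, the `jenkins-plugin` type label, and the filter rule itself are hypothetical and only meant to show the shape of the idea.

```go
package main

import "fmt"

// candidate is a hypothetical vendor/product pair.
type candidate struct{ Vendor, Product string }

// curatedVendors holds extra vendor names for packages whose vendor
// cannot be inferred from the package name alone.
var curatedVendors = map[string][]string{
	"curl":        {"haxx"},
	"spring-boot": {"pivotal"},
}

// filter reports whether a candidate should be dropped for a package type.
type filter func(pkgType string, c candidate) bool

// dropJenkinsCore is an illustrative rule: a Jenkins plugin should not
// be matched against the Jenkins core product.
func dropJenkinsCore(pkgType string, c candidate) bool {
	return pkgType == "jenkins-plugin" && c.Product == "jenkins"
}

// generate builds candidates from the package name plus curated vendors,
// then removes any candidate rejected by a filter.
func generate(pkgType, name string, filters ...filter) []candidate {
	vendors := append([]string{name}, curatedVendors[name]...)
	var out []candidate
	for _, v := range vendors {
		c := candidate{Vendor: v, Product: name}
		drop := false
		for _, f := range filters {
			if f(pkgType, c) {
				drop = true
				break
			}
		}
		if !drop {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	fmt.Println(generate("binary", "curl", dropJenkinsCore))
	fmt.Println(generate("jenkins-plugin", "jenkins", dropJenkinsCore))
}
```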

## Implementation Details

### Embedded Index Format

```json
{
  "ecosystems": {
    "npm": {
      "lodash": ["cpe:2.3:a:lodash:lodash:*:*:*:*:*:node.js:*:*"]
    },
    "pypi": {
      "Django": ["cpe:2.3:a:djangoproject:django:*:*:*:*:*:python:*:*"]
    }
  }
}
```

The dictionary generator maps packages to ecosystems using reference URL patterns (npmjs.com, pypi.org, rubygems.org, etc.).
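
As a rough illustration of consuming this format, the sketch below decodes the sample document with `encoding/json` and performs a lookup. The `cpeIndex` type and its `find` method are hypothetical names; the real package embeds the file with `go:embed`, whereas a string constant is used here so the example runs on its own.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sampleIndex mirrors the JSON layout shown above.
const sampleIndex = `{
  "ecosystems": {
    "npm": {"lodash": ["cpe:2.3:a:lodash:lodash:*:*:*:*:*:node.js:*:*"]},
    "pypi": {"Django": ["cpe:2.3:a:djangoproject:django:*:*:*:*:*:python:*:*"]}
  }
}`

// cpeIndex is a hypothetical Go view of the index:
// ecosystem -> package name -> CPE strings.
type cpeIndex struct {
	Ecosystems map[string]map[string][]string `json:"ecosystems"`
}

// parseIndex decodes the JSON document into a cpeIndex.
func parseIndex(data []byte) (cpeIndex, error) {
	var idx cpeIndex
	err := json.Unmarshal(data, &idx)
	return idx, err
}

// find returns the CPEs for an exact (case-sensitive) ecosystem/name pair.
func (i cpeIndex) find(ecosystem, name string) ([]string, bool) {
	cpes, ok := i.Ecosystems[ecosystem][name]
	return cpes, ok
}

func main() {
	idx, err := parseIndex([]byte(sampleIndex))
	if err != nil {
		panic(err)
	}
	cpes, ok := idx.find("pypi", "Django")
	fmt.Println(cpes, ok)
}
```

Note that this sketch matches keys exactly, so `"django"` would not find the `"Django"` entry; whether and how names are normalized before lookup is not covered here.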

## Maintenance

### Updating the CPE Dictionary

The CPE dictionary should be updated periodically to include new packages:

```bash
# Full workflow: pull cache → update from NVD → build index
make generate:cpe-index

# Or run individual steps:
make generate:cpe-index:cache:pull     # Pull cached CPE data from ORAS
make generate:cpe-index:cache:update   # Fetch updates from NVD Products API
make generate:cpe-index:build          # Generate cpe-index.json from cache
```

**Optional**: Set `NVD_API_KEY` for faster updates (50 requests per 30 seconds with a key vs. 5 without)

The full `make generate:cpe-index` workflow:
1. Pulls the existing cache from the ORAS registry (avoids re-fetching all ~1.5M CPEs)
2. Fetches only products modified since the last update from the NVD Products API
3. Builds the indexed dictionary (~814 KB, ~22K entries)
4. Pushes the updated cache for team reuse
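
To see what the `NVD_API_KEY` rate limits quoted above mean in practice, the following sketch computes the pause a client would need between requests if it spaced them evenly across the 30-second window. Even spacing is just one simple way to stay under the limit and is an assumption here, as is the `requestInterval` helper.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// window is the rate-limit window stated in this README.
const window = 30 * time.Second

// requestInterval returns the pause between requests that keeps an
// evenly paced client within the limit: 5 requests per window without
// an API key, 50 with one.
func requestInterval(hasAPIKey bool) time.Duration {
	limit := 5
	if hasAPIKey {
		limit = 50
	}
	return window / time.Duration(limit)
}

func main() {
	hasKey := os.Getenv("NVD_API_KEY") != ""
	fmt.Printf("API key set: %t, pause between requests: %s\n", hasKey, requestInterval(hasKey))
}
```

Without a key the pause works out to 6 seconds per request, and with a key to 600 milliseconds, which is why pulling the shared cache first and fetching only incremental changes keeps updates short.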

### Extending CPE Generation

**Add dictionary support for a new ecosystem:**
1. Add a URL pattern in `dictionary/index-generator/generate.go`
2. Regenerate the index with `make generate:cpe-index`

**Improve heuristic generation:**
1. Modify the ecosystem-specific file (e.g., `java.go`, `python.go`)
2. Add curated mappings to `candidate_by_package_type.go`

**Key files:**
- `generate.go` - Main generation logic
- `dictionary/` - Dictionary generator and embedded index
- `candidate_by_package_type.go` - Ecosystem-specific candidates
- `filter.go` - Filtering rules