# gen_pypi_map

The gen_pypi_map module generates the `pypi_map.gen.go` file, which
contains the following mappings:

1. module -> package mapping for guessing
2. package -> modules mapping for detecting if a module is already installed
3. package -> download count mapping for stats

It does this in separate steps, and provides a CLI for the admin
to walk through them. When running this program, it is recommended that you
set your working directory (CWD) to `internal/backends/python`.
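
To make the relationship between the three mappings concrete, here is a minimal sketch in Go. It is not the layout of the generated file: `PackageInfo` and `buildModuleIndex` are hypothetical names, the download counts are made up, and the second package name is invented for illustration. It only shows that a module -> package index can be derived by inverting a package -> modules mapping.

```go
package main

import (
	"fmt"
	"sort"
)

// PackageInfo is a hypothetical record for one PyPI package.
type PackageInfo struct {
	Modules   []string // modules the package provides (package -> modules)
	Downloads int      // download count (package -> download count)
}

// buildModuleIndex inverts package -> modules into module -> candidate
// packages. Candidates are sorted by name so the output is deterministic.
func buildModuleIndex(pkgs map[string]PackageInfo) map[string][]string {
	index := make(map[string][]string)
	for name, info := range pkgs {
		for _, mod := range info.Modules {
			index[mod] = append(index[mod], name)
		}
	}
	for mod := range index {
		sort.Strings(index[mod])
	}
	return index
}

func main() {
	// Illustrative data only; the counts are not real statistics.
	pkgs := map[string]PackageInfo{
		"PyYAML":            {Modules: []string{"yaml", "_yaml"}, Downloads: 100},
		"example-yaml-fork": {Modules: []string{"yaml"}, Downloads: 5},
	}
	fmt.Println(buildModuleIndex(pkgs)["yaml"])
}
```

When several packages provide the same module, something has to pick one of the candidates, which is where the download counts come in as a heuristic.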

## Step 1: download / update pypi download stats

The package download counts are needed for heuristics in the guess algorithm during the generate step, and for upm to sort search results. The
`download_stats.json` file contains these stats and is checked in to git.
The stats are downloaded from a public BigQuery table made available
by PyPI. To download the stats and update the `download_stats.json` file:

```bash
go run ./gen_pypi_map bq -gcp <gcp-project-name>
```

The `gcp-project-name` can be any Replit GCP project, because the table we are accessing, `bigquery-public-data.pypi.file_downloads`, is public. More info here: <https://packaging.python.org/en/latest/guides/analyzing-pypi-package-downloads/>

## Step 2: test modules

Next we test the PyPI packages we want to be able to guess. The default test
is:

0. use pkgutil to see what modules exist before installing the package
1. install the package
2. use pkgutil to see what new modules were added compared to before
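
The before/after comparison in this test amounts to a set difference. The sketch below shows that idea in Go under stated assumptions: `newModules` is a hypothetical helper, and the module lists are hard-coded stand-ins for what pkgutil would report. It is not the code the tool actually runs.

```go
package main

import (
	"fmt"
	"sort"
)

// newModules returns the module names present in after but not in before,
// deduplicated and sorted.
func newModules(before, after []string) []string {
	seen := make(map[string]struct{}, len(before))
	for _, m := range before {
		seen[m] = struct{}{}
	}
	var added []string
	for _, m := range after {
		if _, ok := seen[m]; !ok {
			added = append(added, m)
			seen[m] = struct{}{} // avoid reporting the same module twice
		}
	}
	sort.Strings(added)
	return added
}

func main() {
	// Stand-in listings; in the real test these come from pkgutil.
	before := []string{"json", "os", "sys"}
	after := []string{"json", "os", "sys", "yaml", "_yaml"}
	fmt.Println(newModules(before, after)) // prints [_yaml yaml]
}
```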

We used to run this test on all packages on PyPI. Now we have the option to
run it only on a subset of packages. The default is to collect the top
10000 packages. You can change this by passing a different number to the
optional `-threshold` flag.

For example, to test the top 50000 packages:

```bash
go run ./gen_pypi_map/ test -threshold 50000
```
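
As a rough illustration of what selecting the "top N" packages could look like, the sketch below ranks packages by download count and keeps the first `n`. It assumes that "top" means highest download count and that the stats can be viewed as a name -> count map; `topPackages` is a hypothetical helper and the counts are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// topPackages returns up to n package names ordered by descending download
// count. Ties are broken alphabetically so the result is deterministic.
func topPackages(stats map[string]int, n int) []string {
	names := make([]string, 0, len(stats))
	for name := range stats {
		names = append(names, name)
	}
	sort.Slice(names, func(i, j int) bool {
		a, b := names[i], names[j]
		if stats[a] != stats[b] {
			return stats[a] > stats[b]
		}
		return a < b
	})
	if n < 0 {
		n = 0
	}
	if n < len(names) {
		names = names[:n]
	}
	return names
}

func main() {
	// Illustrative counts only, not real PyPI statistics.
	stats := map[string]int{"requests": 300, "numpy": 200, "flask": 100}
	fmt.Println(topPackages(stats, 2)) // prints [requests numpy]
}
```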

The test results for all tested packages are stored in `pkgs.json`, along with
the versions of the packages tested. `pkgs.json` is checked in to git. The command
only tests a package if it is not already in the `pkgs.json` file or if
its latest version is not the one previously tested. If you
want to force a retest of the packages, you can use the `-force` flag.
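
The retest rule described above can be expressed as a small predicate. This is a sketch under assumptions, not the tool's implementation: `TestedPackage` and `needsTest` are hypothetical names, and the real `pkgs.json` schema may differ.

```go
package main

import "fmt"

// TestedPackage is a hypothetical record of a previous test run.
type TestedPackage struct {
	Version string   // version that was tested
	Modules []string // modules found for that version
}

// needsTest reports whether a package should be tested: always when force
// is set, otherwise when it has no recorded result or when the recorded
// version differs from the latest one.
func needsTest(tested map[string]TestedPackage, name, latest string, force bool) bool {
	if force {
		return true
	}
	prev, ok := tested[name]
	if !ok {
		return true
	}
	return prev.Version != latest
}

func main() {
	tested := map[string]TestedPackage{
		"PyYAML": {Version: "6.0", Modules: []string{"yaml"}},
	}
	fmt.Println(needsTest(tested, "PyYAML", "6.0", false))   // false: already tested
	fmt.Println(needsTest(tested, "PyYAML", "6.0.1", false)) // true: newer version
	fmt.Println(needsTest(tested, "requests", "2.0", false)) // true: never tested
	fmt.Println(needsTest(tested, "PyYAML", "6.0", true))    // true: forced
}
```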

## Step 3: generate sqlite db

Finally, we use the collected data to generate the lookup database. This is done with:

```bash
go run ./gen_pypi_map/ gen
```