# gen_pypi_map

The gen_pypi_map module generates the `pypi_map.gen.go` file, which
contains the:

1. module -> package mapping for guessing
2. package -> modules mapping for detecting if a module is already installed
3. package -> download count mapping for stats

It does this in separate steps, and provides a CLI for the admin
to walk through them. To run this program, it is recommended that you
have your working directory (CWD) set to `internal/backends/python`.

## Step 1: download / update pypi download stats

The package download counts are needed for heuristics in the guess algorithm
during the generate step, and for upm to sort search results. The
`download_stats.json` file contains these stats and is checked in to git.
The stats are downloaded from a public BigQuery table made available
by PyPI. To download the stats and update the `download_stats.json` file:

```bash
go run ./gen_pypi_map bq -gcp <gcp-project-name>
```

The gcp-project-name can be any Replit GCP project, because the table we are accessing, `bigquery-public-data.pypi.file_downloads`, is public. More info here: <https://packaging.python.org/en/latest/guides/analyzing-pypi-package-downloads/>

## Step 2: test modules

Next we test the packages on PyPI that we want to be able to guess. The default
test is:

0. use pkgutil to see what modules exist before installing the package
1. install the package
2. use pkgutil to see what new modules were added compared to before

We used to run this test on all packages on PyPI. Now we have the option to
run it only on a subset of packages. The default is to collect the top
10000 packages. You can change this by passing a different number to the
optional `-threshold` flag.
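
The default test above (snapshot the visible modules, install, snapshot again) can be sketched in Python as follows. This is a minimal illustration, not the code this tool actually runs: the function names `visible_modules`, `pip_install`, and `modules_added_by` are hypothetical, and only top-level modules are compared. The `install` parameter is there so the install step can be swapped out, for example for a dry run against a scratch directory.

```python
import pkgutil
import subprocess
import sys


def visible_modules(paths=None):
    """Return the names of top-level modules pkgutil can see.

    paths=None scans sys.path; pass a list of directories to restrict it.
    """
    return {info.name for info in pkgutil.iter_modules(paths)}


def pip_install(package):
    """Install a package into the current interpreter's environment."""
    subprocess.run(
        [sys.executable, "-m", "pip", "install", package], check=True
    )


def modules_added_by(package, install=pip_install, paths=None):
    """Diff the visible modules before and after installing `package`."""
    before = visible_modules(paths)   # step 0: snapshot
    install(package)                  # step 1: install
    after = visible_modules(paths)    # step 2: snapshot again
    return sorted(after - before)
```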
To test the top 50000 packages, pass `-threshold 50000`:

```bash
go run ./gen_pypi_map/ test -threshold 50000
```

The test results for all tested packages are stored in `pkgs.json`, along with
the versions of the packages tested. `pkgs.json` is checked in to git. The command
only tests a package if it is not already in the `pkgs.json` file or if
its latest version is not the one previously tested. To force a retest
of the packages, use the `-force` flag.

## Step 3: generate sqlite db

Finally, we use the collected data to generate the lookup database. This is done with:

```bash
go run ./gen_pypi_map/ gen
```
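
The retest rule from Step 2 (test a package only if it is missing from `pkgs.json`, if its latest version differs from the tested one, or if `-force` is given) can be sketched in Python as below. This is an illustration under assumptions: the function name `needs_test` is hypothetical, and `tested` is a simplified stand-in (package name mapped to the version last tested) for whatever layout `pkgs.json` really uses.

```python
def needs_test(name, latest_version, tested, force=False):
    """Decide whether a package should be (re)tested.

    `tested` maps package name -> version recorded by the previous run
    (an assumed simplification of pkgs.json).
    """
    if force:
        # -force: retest regardless of what was recorded before
        return True
    if name not in tested:
        # never tested before
        return True
    # retest only when a newer (different) version has been released
    return tested[name] != latest_version
```

With `tested = {"requests": "2.31.0"}`, the call `needs_test("requests", "2.31.0", tested)` returns `False`, while a different latest version, an unknown package name, or `force=True` all return `True`.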