github.com/bcampbell/scrapeomat@v0.0.0-20220820232205-23e64141c89e/doc/scrapeomat.md (about) 1 #Scrapeomat usage 2 3 ## Intro 4 5 The scrapeomat tool grabs news articles from web sites, extracts the content 6 and meta data, and loads the results into a postgresql database. 7 8 9 Usage: 10 scrapeomat [options] publication(s) 11 12 Options: 13 -a string 14 archive dir to dump .warc files into (default "archive") 15 -db string 16 database connection string (eg postgres://scrapeomat:password@localhost/scrapeomat) 17 -discover 18 run discovery for target sites, output article links to stdout, then exit 19 -i string 20 input file of URLs (runs scrapers then exit) 21 -l List target sites and exit 22 -s string 23 path for scraper configs (default "scrapers") 24 -v int 25 verbosity of output (0=errors only 1=info 2=debug) (default 1) 26 27 28 ## Database connection 29 30 You can specify a postgresql connection string via the `-db` flag, but it's 31 more advisable to use the `SCRAPEOMAT_DB` environment variable instead: 32 33 export SCRAPEOMAT_DB="user=scrape dbname=ukarts host=/var/run/postgresql port=5434 sslmode=disable" 34 35 See the [postgresql docs](https://www.postgresql.org/docs/current/static/libpq-connect.html#LIBPQ-CONNSTRING) 36 for details. 37 38 ## Running from a list of URLs 39 40 Using the `-i` flag, you can skip the discovery phase and instead pass in a 41 file which lists article URLs to scrape. The file should have one URL per 42 line. 43 44 In this mode, only one publication (scraper config) can be specified. Each 45 URL in the list is subjected to the URL rules for that publication. URLs 46 which fail this test are skipped (eg URLs from the wrong domain or which 47 don't conform to the defined URL patterns). 48 49 This mode is useful when backfilling using a list of URLs obtained by other 50 means, such as the sitemap.xml or via a search engine. 51 52 53 54