github.com/bcampbell/scrapeomat@v0.0.0-20220820232205-23e64141c89e/doc/scrapeomat.md

github.com/bcampbell/scrapeomat@v0.0.0-20220820232205-23e64141c89e/doc/scrapeomat.md (about)

     1  #Scrapeomat usage
     2  
     3  ## Intro
     4  
     5  The scrapeomat tool grabs news articles from web sites, extracts the content
     6  and meta data, and loads the results into a postgresql database.
     7  
     8  
     9      Usage:
    10      scrapeomat [options] publication(s)
    11  
    12      Options:
    13        -a string
    14              archive dir to dump .warc files into (default "archive")
    15        -db string
    16              database connection string (eg postgres://scrapeomat:password@localhost/scrapeomat)
    17        -discover
    18              run discovery for target sites, output article links to stdout, then exit
    19        -i string
    20              input file of URLs (runs scrapers then exit)
    21        -l	List target sites and exit
    22        -s string
    23              path for scraper configs (default "scrapers")
    24        -v int
    25              verbosity of output (0=errors only 1=info 2=debug) (default 1)
    26  
    27  
    28  ## Database connection
    29  
    30  You can specify a postgresql connection string via the `-db` flag, but it's
    31  more advisable to use the `SCRAPEOMAT_DB` environment variable instead:
    32  
    33      export SCRAPEOMAT_DB="user=scrape dbname=ukarts host=/var/run/postgresql port=5434 sslmode=disable"
    34  
    35  See the [postgresql docs](https://www.postgresql.org/docs/current/static/libpq-connect.html#LIBPQ-CONNSTRING)
    36  for details.
    37  
    38  ## Running from a list of URLs
    39  
    40  Using the `-i` flag, you can skip the discovery phase and instead pass in a
    41  file which lists article URLs to scrape. The file should have one URL per
    42  line.
    43  
    44  In this mode, only one publication (scraper config) can be specified. Each
    45  URL in the list is subjected to the URL rules for that publication. URLs
    46  which fail this test are skipped (eg URLs from the wrong domain or which
    47  don't conform to the defined URL patterns).
    48  
    49  This mode is useful when backfilling using a list of URLs obtained by other
    50  means, such as the sitemap.xml or via a search engine.
    51  
    52  
    53  
    54