# Server Setup

This covers setting up a server to scrape articles, store them and provide an API
endpoint for people to slurp them down for analysis.
It's been used on a Linux server, but should apply equally to most other OSes
- Windows, the BSDs, OSX.

## Quick Overview

There are two components to the system:

1. scrapeomat - responsible for finding and fetching articles, extracting them in a structured form (title, content, date etc), and storing them in the database.
2. slurpserver - provides an HTTP-based slurp API for accessing the article database.

The server setup generally follows these steps:

1. Build `scrapeomat` and `slurpserver` from source (they are written in golang).
2. Set up a PostgreSQL database to hold your collected articles.
3. Write config files describing how to scrape the sites you want.
4. Run scrapeomat to perform scraping.
5. Run slurpserver to provide an API to access the data.


## Building from source

You'll need a working [golang](https://golang.org/) install to build the tools from source.

Grab `github.com/bcampbell/scrapeomat` (via `git` or `go get ...`), then:

    $ cd scrapeomat
    $ go build
    $ cd cmd/slurpserver
    $ go build

And, optionally, `go install` for each one.

(TODO: cover package dependencies. Personally I just keep running `go build` then `go get` anything missing, repeating until done :-)


## Database setup

[PostgreSQL](https://postgresql.org/) is used for storing scraped article data.
It's assumed that you've got a PostgreSQL server installed.

You'll need to create a user and database. For example:

    $ sudo -u postgres createuser --no-createdb {DBUSER}
    $ sudo -u postgres createdb -O {DBUSER} -E utf8 {DBNAME}

PostgreSQL has a complex permissions system which is a little outside the scope of this guide, but there are some notes at the end on setting it up for local development (but probably not suitable for production use).

Once your database is set up, you need to load the base schema:

    $ psql -U {DBUSER} {DBNAME} <store/pg/schema.sql

Your database should now be ready to have articles stored in it.


## Scrapeomat

Scrapeomat is designed to be a long-running process. It will scrape the sites it is configured for, sleep for a couple of hours, and repeat.

Scraping a site takes two steps:

1. Article discovery - looking for article URLs, usually by scanning all the "front" pages for sections on the site.
2. Article scraping - this breaks down further:
    1. Check the database, and ignore article URLs we've already got.
    2. Fetch each remaining URL in turn, extract the content & metadata, and store it in the database.


### Configuring Target Sites

Each site you want to scrape requires a configuration entry.
These are read from `.cfg` files in the `scrapers` directory.
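For example, a `scrapers` directory covering two sites might look like this (the filenames here are purely illustrative):

    scrapers/
        notarealnewspaper.cfg
        govblog.cfg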
A simple config file example, `scrapers/notarealnewspaper.cfg`:
```
[scraper "notarealnewspaper"]

# where we start looking for article links
url="https://notarealnewspaper.com"

# Pattern to identify article URLs
# eg: https://notarealnewspaper.com/articles/moon-made-of-cheese
artform="/articles/SLUG"

# css selector to find other pages to scan for article links
# (we want to match links in the site's menu system)
navsel=".site-header nav a"
```

Notes:

- comments start with `#`.
- the `[scraper ...]` line denotes the start of a scraper definition and assigns it a name ("notarealnewspaper", in this case).
- you can define as many site configs as you need. Usually you'd put each one in its own file, but it's the `[scraper ...]` line which defines them, so you could put them all in a single file, or group them across multiple files, or whatever.
- the configuration syntax has a lot of options. Look at the [reference doc](scraper_config.md) for more details.


Here's the config for the scrapeomat at http://slurp.stenoproject.org/govblog,
which scrapes the UK government blogs. Note that this one uses the site's pagination links rather than menu links to discover articles:
```
[scraper "blog.gov.uk"]

# start crawling on the page containing recent posts...
url="https://www.blog.gov.uk/all-posts/"

# ...and follow the pagination links when looking for articles...
navsel="nav.pagination-container a"

# ...but exclude any pages with 2 or more digits (so first 10 pages only)
xnavpat="all-posts/page/[0-9]{2,}"

# we're looking for links with this url form:
# (eg https://ssac.blog.gov.uk/2019/02/20/a-learning-journey-social-security-in-scotland/)
artform="/YYYY/MM/DD/SLUG$"

# allow posts on any subdomain
hostpat=".*[.]blog.gov.uk"
```

Crafting these configurations is usually a case of going to the site and
using your web browser's 'inspect element' feature to examine the structure of
the HTML.

You can run the discovery phase on its own like this:

    $ scrapeomat -discover -v=2 govblog

If all goes well, this will output a list of the article links discovered.

(The `-v=2` flag turns up the verbosity, and will log each nav page fetched during discovery.)


### Running Scrapeomat

Once you have config files set up in the `scrapers` dir, you can run the scrapeomat, eg:

    $ scrapeomat -v=2 ALL

The "ALL" argument is required to run all the scrapers.
The scrapers will be executed in parallel.

You need to specify which database to store articles in, by passing in a database connection string.
This can be passed in as a commandline flag (`-db`) or via the `SCRAPEOMAT_DB` environment variable, eg:

    $ export SCRAPEOMAT_DB="user=scrape dbname=govblog host=/var/run/postgresql sslmode=disable"
    $ scrapeomat ALL

### Installing Scrapeomat as a Service

For a proper server setup, you'd want scrapeomat to run automatically
when the machine starts up, and to direct its `stderr` output to a logfile.

Typically, I use `systemd` and `rsyslog` to handle these (a rough sketch is included under Miscellaneous Notes below).

TODO: add in the govblog example systemd unit file and rsyslog config here.


## SlurpServer

Slurpserver provides an HTTP server which serves up articles from the database.
Read the [API reference](../cmd/slurpserver/api.txt) for details on the endpoints.
### Running

As with scrapeomat, slurpserver needs to be told which database to connect to.
Use the `-db` commandline flag or the `SCRAPEOMAT_DB` environment variable.

To run the slurp API:

    $ slurpserver -port 12345 -prefix /slurptest -v=1

This would accept API requests at http://localhost:12345/slurptest.


### Behaving as a Server

On a production server you'd want SlurpServer running as a long-lived service, started automatically at boot and logging somewhere sensible - much the same setup as for scrapeomat.

TODO: systemd and rsyslog config examples


### Setting up SlurpServer Behind a "real" Web Server

Typically, I run SlurpServers behind an nginx web server (a rough config sketch is included under Miscellaneous Notes below).
Nginx handles https and the public-facing URLs, and passes requests on to the slurp server running on whatever local port it's listening on.

The slurpserver `-prefix` flag can be used to run multiple slurp servers behind a single public-facing site. This keeps things simpler if you have multiple scrapeomats and multiple databases running.

TODO: add some nginx config examples, and maybe some for other web servers (eg Apache)

## Miscellaneous Notes

### PostgreSQL Permissions

For development, I usually do something like this:

    $ sudoedit /etc/postgresql/<VERSION>/main/pg_hba.conf

add a line _AFTER_(!) the first entry:
```
local {DBNAME} scrape peer map=scrapedev
```

add to `/etc/postgresql/<VERSION>/main/pg_ident.conf`:
```
# MAPNAME SYSTEM-USERNAME PG-USERNAME
scrapedev ben scrape
```

Force PostgreSQL to reread the config files:

    $ sudo systemctl reload postgresql

Now, the unix user `ben` should be able to access the database using the postgresql user `scrape`.
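To check that the peer mapping described above works, try connecting as the mapped PostgreSQL user (from the relevant unix account) and listing the tables loaded from `store/pg/schema.sql`:

    $ psql -U scrape {DBNAME} -c '\dt'

If the mapping is wrong, psql will refuse the connection with a peer-authentication error rather than showing the table list.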
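### Example systemd Unit (sketch)

The systemd/rsyslog examples above are still TODO, but as a rough starting point a minimal scrapeomat unit might look something like this. The user, paths and connection string are placeholders (not the actual govblog setup), and it assumes scrapeomat finds its `scrapers` config directory relative to `WorkingDirectory`:

```
[Unit]
Description=scrapeomat article scraper
After=network-online.target postgresql.service

[Service]
# placeholder user and paths - adjust to your install
User=scrape
# run from the directory containing the scrapers/ config dir
WorkingDirectory=/home/scrape/scrapeomat
Environment="SCRAPEOMAT_DB=user=scrape dbname=govblog host=/var/run/postgresql sslmode=disable"
ExecStart=/usr/local/bin/scrapeomat -v=2 ALL
Restart=on-failure
SyslogIdentifier=scrapeomat

[Install]
WantedBy=multi-user.target
```

stderr goes to the journal by default; an rsyslog rule matching the `scrapeomat` program name can then direct it to its own logfile. A slurpserver unit would look much the same, with `ExecStart` pointing at `slurpserver` instead.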
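### Example nginx Config (sketch)

For the nginx TODO above: a minimal location block that forwards the public-facing `/slurptest` prefix to a slurpserver started with `-port 12345 -prefix /slurptest` (the port and prefix are the example values used earlier, not a real deployment):

```
# inside an existing server { ... } block
location /slurptest/ {
    # no URI part on proxy_pass, so the /slurptest prefix is passed
    # through unchanged - it must match the slurpserver -prefix flag
    proxy_pass http://127.0.0.1:12345;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
```

Each additional slurpserver behind the same site gets its own prefix, port and location block.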