# Server Setup

This covers setting up a server to scrape articles, store them and provide an API
endpoint for people to slurp them down for analysis.
It's been used on a Linux server, but should apply equally to most other OSes
- Windows, the BSDs, OSX.

## Quick Overview

There are two components to the system:

1. scrapeomat - responsible for finding and fetching articles, extracting them in a structured form (title, content, date etc), and storing them in the database.
2. slurpserver - provides an HTTP-based slurp API for accessing the article database.

The server setup generally follows these steps:

1. Build `scrapeomat` and `slurpserver` from source (they are written in golang).
2. Set up a PostgreSQL database to hold your collected articles.
3. Write config files describing how to scrape the sites you want.
4. Run scrapeomat to perform scraping.
5. Run slurpserver to provide an API to access the data.


## Building from source

You'll need a working [golang](https://golang.org/) install to build the tools from source.

Grab `github.com/bcampbell/scrapeomat` (via `git` or `go get ...`), then:

    $ cd scrapeomat
    $ go build
    $ cd cmd/slurpserver
    $ go build

And, optionally, `go install` for each one.

(TODO: cover package dependencies. Personally I just keep running `go build` then `go get` anything missing, repeating until done :-)


## Database setup

[PostgreSQL](https://postgresql.org/) is used for storing scraped article data.
It's assumed that you've got a PostgreSQL server installed.

You'll need to create a user and database. For example:

    $ sudo -u postgres createuser --no-createdb {DBUSER}
    $ sudo -u postgres createdb -O {DBUSER} -E utf8 {DBNAME}

PostgreSQL has a complex permissions system which is a little outside the scope of this guide, but there are some notes at the end on setting it up for local development (but probably not suitable for production use).

Once your database is set up, you need to load the base schema:

    $ psql -U {DBUSER} {DBNAME} <store/pg/schema.sql

Your database should now be ready to have articles stored in it.


## Scrapeomat

Scrapeomat is designed to be a long-running process. It will scrape the sites it is configured for, sleep for a couple of hours, and repeat.

Scraping a site takes two steps:

1. Article discovery - looking for article URLs, usually by scanning all the "front" pages for sections on the site.
2. Article scraping - this breaks down further:
    1. Check the database, and ignore article URLs we've already got.
    2. Fetch each remaining URL in turn, extract the content & metadata, and store it in the database.


### Configuring Target Sites

Each site you want to scrape requires a configuration entry.
These are read from `.cfg` files in the `scrapers` directory.
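For example, a `scrapers` directory covering two sites might look like this (the filenames here are purely illustrative):

    scrapers/
        notarealnewspaper.cfg
        govblog.cfg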
A simple config file example, `scrapers/notarealnewspaper.cfg`:
```
[scraper "notarealnewspaper"]

# where we start looking for article links
url="https://notarealnewspaper.com"

# Pattern to identify article URLs
# eg: https://notarealnewspaper.com/articles/moon-made-of-cheese
artform="/articles/SLUG"

# css selector to find other pages to scan for article links
# (we want to match links in the site's menu system)
navsel=".site-header nav a"
```

Notes:

- comments start with `#`.
- the `[scraper ...]` line denotes the start of a scraper definition and assigns it a name ("notarealnewspaper", in this case).
- you can define as many site configs as you need. Usually you'd put each one in its own file, but it's the `[scraper ...]` line which defines them, so you could put them all in a single file, or group them across multiple files, or whatever.
- the configuration syntax has a lot of options. Look at the [reference doc](scraper_config.md) for more details.


Here's the config for the scrapeomat at http://slurp.stenoproject.org/govblog,
which scrapes the UK government blogs. Note that this one uses the site's pagination links rather than menu links to discover articles:
```
[scraper "blog.gov.uk"]

# start crawling on the page containing recent posts...
url="https://www.blog.gov.uk/all-posts/"

# ...and follow the pagination links when looking for articles...
navsel="nav.pagination-container a"

# ...but exclude any pages with 2 or more digits (so first 10 pages only)
xnavpat="all-posts/page/[0-9]{2,}"

# we're looking for links with this url form:
# (eg https://ssac.blog.gov.uk/2019/02/20/a-learning-journey-social-security-in-scotland/)
artform="/YYYY/MM/DD/SLUG$"

# allow posts on any subdomain
hostpat=".*[.]blog.gov.uk"
```

Crafting these configurations is usually a case of going to the site and
using your web browser's 'inspect element' feature to examine the structure of
the HTML.

You can run the discovery phase on its own like this:

    $ scrapeomat -discover -v=2 govblog

If all goes well, this will output a list of the article links discovered.

(The `-v=2` flag turns up the verbosity, and will log each nav page fetched during discovery.)


### Running Scrapeomat

Once you have config files set up in the `scrapers` dir, you can run the scrapeomat, eg:

    $ scrapeomat -v=2 ALL

The "ALL" argument is required to run all the scrapers.
The scrapers will be executed in parallel.

You need to specify which database to store articles in, by passing in a database connection string.
This can be passed in as a commandline flag (`-db`) or via the `SCRAPEOMAT_DB` environment variable, eg:

    $ export SCRAPEOMAT_DB="user=scrape dbname=govblog host=/var/run/postgresql sslmode=disable"
    $ scrapeomat ALL

### Installing Scrapeomat as a Service

For a proper server setup, you'd want scrapeomat to run automatically
when the machine starts up, and to direct its `stderr` output to a logfile.

Typically, I use `systemd` and `rsyslog` to handle these (a rough sketch is included under Miscellaneous Notes below).

TODO: add in the govblog example systemd unit file and rsyslog config here.


## SlurpServer

Slurpserver provides an HTTP server which serves up articles from the database.
Read the [API reference](../cmd/slurpserver/api.txt) for details on the endpoints.
### Running

As with scrapeomat, slurpserver needs to be told which database to connect to.
Use the `-db` commandline flag or the `SCRAPEOMAT_DB` environment variable.

To run the slurp API:

    $ slurpserver -port 12345 -prefix /slurptest -v=1

This would accept API requests at http://localhost:12345/slurptest.


### Behaving as a Server

On a production server you'd want SlurpServer running as a long-lived service, started automatically at boot and logging somewhere sensible - much the same setup as for scrapeomat.

TODO: systemd and rsyslog config examples


### Setting up SlurpServer Behind a "real" Web Server

Typically, I run SlurpServers behind an nginx web server (a rough config sketch is included under Miscellaneous Notes below).
Nginx handles https and the public-facing URLs, and passes requests on to the slurp server running on whatever local port it's listening on.

The slurpserver `-prefix` flag can be used to run multiple slurp servers behind a single public-facing site. This keeps things simpler if you have multiple scrapeomats and multiple databases running.

TODO: add some nginx config examples, and maybe some for other web servers (eg Apache)

## Miscellaneous Notes

### PostgreSQL Permissions

For development, I usually do something like this:

    $ sudoedit /etc/postgresql/<VERSION>/main/pg_hba.conf

add a line _AFTER_(!) the first entry:
```
local {DBNAME} scrape peer map=scrapedev
```

add to `/etc/postgresql/<VERSION>/main/pg_ident.conf`:
```
# MAPNAME SYSTEM-USERNAME PG-USERNAME
scrapedev ben scrape
```

Force PostgreSQL to reread the config files:

    $ sudo systemctl reload postgresql

Now, the unix user `ben` should be able to access the database using the postgresql user `scrape`.
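To check that the peer mapping described above works, try connecting as the mapped PostgreSQL user (from the relevant unix account) and listing the tables loaded from `store/pg/schema.sql`:

    $ psql -U scrape {DBNAME} -c '\dt'

If the mapping is wrong, psql will refuse the connection with a peer-authentication error rather than showing the table list.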
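### Example systemd Unit (sketch)

The systemd/rsyslog examples above are still TODO, but as a rough starting point a minimal scrapeomat unit might look something like this. The user, paths and connection string are placeholders (not the actual govblog setup), and it assumes scrapeomat finds its `scrapers` config directory relative to `WorkingDirectory`:

```
[Unit]
Description=scrapeomat article scraper
After=network-online.target postgresql.service

[Service]
# placeholder user and paths - adjust to your install
User=scrape
# run from the directory containing the scrapers/ config dir
WorkingDirectory=/home/scrape/scrapeomat
Environment="SCRAPEOMAT_DB=user=scrape dbname=govblog host=/var/run/postgresql sslmode=disable"
ExecStart=/usr/local/bin/scrapeomat -v=2 ALL
Restart=on-failure
SyslogIdentifier=scrapeomat

[Install]
WantedBy=multi-user.target
```

stderr goes to the journal by default; an rsyslog rule matching the `scrapeomat` program name can then direct it to its own logfile. A slurpserver unit would look much the same, with `ExecStart` pointing at `slurpserver` instead.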
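### Example nginx Config (sketch)

For the nginx TODO above: a minimal location block that forwards the public-facing `/slurptest` prefix to a slurpserver started with `-port 12345 -prefix /slurptest` (the port and prefix are the example values used earlier, not a real deployment):

```
# inside an existing server { ... } block
location /slurptest/ {
    # no URI part on proxy_pass, so the /slurptest prefix is passed
    # through unchanged - it must match the slurpserver -prefix flag
    proxy_pass http://127.0.0.1:12345;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
```

Each additional slurpserver behind the same site gets its own prefix, port and location block.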