# Scraper configuration

(a sketch of a complete config is given at the end of this document)

name
: the name of the scraper

url
: the root URL for crawling (eg "http://example.com/news")

navsel
: CSS selector to identify section links
  eg ".navigation-container a"

xnavpat
: regex; URLs matching it are ignored when looking for section links.
  Multiple xnavpat lines can be used.
  Handy for excluding overly-numerous navigation pages,
  eg: "/tag/", "/category/"

artpat
: treat URLs matching this regex as articles.

  Multiple artpat (and artform) lines can be used, to
  cover multiple URL forms (a lot of sites use multiple
  URL schemes).

  The URLs filtered by artpat (and artform) already have their
  query and fragment parts stripped, unless nostripquery or
  nostripfragment are also set.

xartpat
: exclude any article URLs matching this regex

artform
: simplified (non-regexp) pattern matching for URLs.
  Patterns are built from:

  ID - a number with 4+ digits
  YYYY - a 4-digit year
  MM - a 2-digit month
  DD - a 2-digit day
  SLUG - anything with a hyphen in it, excluding slashes (/)
    eg "moon-made-of-cheese",
       "moon-made-of-cheese.html",
       "1234-moon-made-of-cheese^1434"
  $ - match the end of the URL

  eg: artform="/SLUG.html$"

xartform
: exclude any article URLs matching this pattern.
  Multiple xartform lines can be used.

hostpat
: regex. URLs from non-matching hosts will be rejected.
  Applies to both section-link discovery and article URL filtering.
  default: only accept the same host as the starting URL

baseerrorthreshold
: default: 5

nostripquery
: by default, the query part of article URLs is stripped off,
  eg "www.example.com/news?article=1234" becomes "www.example.com/news".

  Most of the time the query part is cruft and/or tracking rubbish, but
  some sites require it.
  Add `nostripquery` to turn this behaviour off.
  (`nostripfragment` works the same way for the fragment part of the URL.)

cookies
: retain cookies when making HTTP requests.
  Used mainly for paywalled sites.

pubcode
: short publication code for this site
  TODO: add details.

useragent
: User-Agent string to use when sending HTTP requests for this scraper.
  eg: useragent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.10240"
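## Example

Putting the options together: the sketch below shows what a single scraper
definition might look like, assuming one `key="value"` (or bare flag) line
per option, as in the examples above. Every name and value here is made up,
and how multiple scrapers are grouped within one config file is not covered
by this document.

```
# hypothetical scraper - all values illustrative; comment syntax assumed
name="examplenews"
pubcode="examplenews"
url="http://example.com/news"
navsel=".navigation-container a"
xnavpat="/tag/"
xnavpat="/category/"
artform="/YYYY/MM/DD/SLUG$"
artform="/SLUG.html$"
xartpat="/video/"
hostpat="example[.]com$"
nostripquery
useragent="Mozilla/5.0 (compatible; example-scraper)"
```

Note the repeated `xnavpat` and `artform` lines: as described above, these
options accumulate rather than overwrite, which is how a site with more
than one URL scheme is handled.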