# Scraper configuration

(a sketch of a complete config is given at the end of this document)

name
: the name of the scraper

url
: the root URL for crawling (eg "http://example.com/news")

navsel
: CSS selector to identify section links
  eg ".navigation-container a"

xnavpat
: regex; URLs matching it are ignored when looking for section links.
  Multiple xnavpat lines can be used.
  Handy for excluding overly-numerous navigation pages,
  eg: "/tag/", "/category/"

artpat
: treat URLs matching this regex as articles.

  Multiple artpat (and artform) lines can be used, to
  cover multiple URL forms (a lot of sites use multiple
  URL schemes).

  The URLs filtered by artpat (and artform) already have their
  query and fragment parts stripped, unless nostripquery or
  nostripfragment are also set.

xartpat
: exclude any article URLs matching this regex

artform
: simplified (non-regexp) pattern matching for URLs.
  Patterns are built from:

  ID - a number with 4+ digits
  YYYY - a 4-digit year
  MM - a 2-digit month
  DD - a 2-digit day
  SLUG - anything with a hyphen in it, excluding slashes (/)
    eg "moon-made-of-cheese",
       "moon-made-of-cheese.html",
       "1234-moon-made-of-cheese^1434"
  $ - match the end of the URL

  eg: artform="/SLUG.html$"

xartform
: exclude any article URLs matching this pattern.
  Multiple xartform lines can be used.

hostpat
: regex. URLs from non-matching hosts will be rejected.
  Applies to both section-link discovery and article URL filtering.
  default: only accept the same host as the starting URL

baseerrorthreshold
: default: 5

nostripquery
: by default, the query part of article URLs is stripped off,
  eg "www.example.com/news?article=1234" becomes "www.example.com/news".

  Most of the time the query part is cruft and/or tracking rubbish, but
  some sites require it.
  Add `nostripquery` to turn this behaviour off.
  (`nostripfragment` works the same way for the fragment part of the URL.)

cookies
: retain cookies when making HTTP requests.
  Used mainly for paywalled sites.

pubcode
: short publication code for this site
  TODO: add details.

useragent
: User-Agent string to use when sending HTTP requests for this scraper.
  eg: useragent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.10240"
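## Example

Putting the options together: the sketch below shows what a single scraper
definition might look like, assuming one `key="value"` (or bare flag) line
per option, as in the examples above. Every name and value here is made up,
and how multiple scrapers are grouped within one config file is not covered
by this document.

```
# hypothetical scraper - all values illustrative; comment syntax assumed
name="examplenews"
pubcode="examplenews"
url="http://example.com/news"
navsel=".navigation-container a"
xnavpat="/tag/"
xnavpat="/category/"
artform="/YYYY/MM/DD/SLUG$"
artform="/SLUG.html$"
xartpat="/video/"
hostpat="example[.]com$"
nostripquery
useragent="Mozilla/5.0 (compatible; example-scraper)"
```

Note the repeated `xnavpat` and `artform` lines: as described above, these
options accumulate rather than overwrite, which is how a site with more
than one URL scheme is handled.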