github.com/bcampbell/scrapeomat@v0.0.0-20220820232205-23e64141c89e/doc/backfilling.md

# Backfilling techniques

Backfilling is the process of collecting URLs of older articles to fill
in gaps in coverage.

## Using `sitemap.xml`

A lot of sites (most sites?) have a sitemap file, which lists pages
they'd like search engines to index. On a news site, this will usually
include a lot of article links.

Use `sitemapwalker` to grab all the URLs in a sitemap.

This is a nice generic method for backfilling - if the articles you
want are in the sitemap, you don't need to do anything site-specific.
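
The core of the technique is just XML parsing: a sitemap is an XML
`<urlset>` of `<url>` entries, each with a `<loc>` (and often a
`<lastmod>`). A minimal sketch of extracting the URLs, independent of
`sitemapwalker` itself:

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// urlset mirrors the <urlset> element of a sitemap file.
type urlset struct {
	URLs []struct {
		Loc     string `xml:"loc"`
		LastMod string `xml:"lastmod"`
	} `xml:"url"`
}

// sitemapURLs extracts all <loc> values from sitemap XML.
func sitemapURLs(data []byte) ([]string, error) {
	var us urlset
	if err := xml.Unmarshal(data, &us); err != nil {
		return nil, err
	}
	var out []string
	for _, u := range us.URLs {
		out = append(out, u.Loc)
	}
	return out, nil
}

func main() {
	// Example sitemap fragment - URLs are placeholders.
	sample := []byte(`<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/news/article-1</loc><lastmod>2022-08-01</lastmod></url>
  <url><loc>https://example.com/news/article-2</loc></url>
</urlset>`)
	urls, err := sitemapURLs(sample)
	if err != nil {
		panic(err)
	}
	for _, u := range urls {
		fmt.Println(u)
	}
}
```

Note that big sites usually serve a sitemap *index* file whose entries
point at further sitemap files, so a real walk has to recurse.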

Some sites only have recent articles in their sitemap. This is great
for filling in recent gaps (eg within the last week).
But if the gap is further back, you'll have to resort to other
methods - most likely some site-specific hackery.

Some sites have _really_ comprehensive sitemaps. For example, the
Independent seems to list its entire archive of articles back to 2012
or so. In these cases, the list of URLs can be overwhelmingly large.

Look at a site's `robots.txt` file to see what sitemaps it has. There
will often be multiple starting points there.
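
Discovery can be automated by scanning `robots.txt` for `Sitemap:`
directives - the field name is case-insensitive, and a file can list
several. A sketch:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// sitemapsFromRobots returns the URLs given by Sitemap: directives
// in a robots.txt body, matched case-insensitively.
func sitemapsFromRobots(body string) []string {
	var out []string
	scanner := bufio.NewScanner(strings.NewReader(body))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if len(line) >= 8 && strings.EqualFold(line[:8], "sitemap:") {
			if u := strings.TrimSpace(line[8:]); u != "" {
				out = append(out, u)
			}
		}
	}
	return out
}

func main() {
	// Example robots.txt body - URLs are placeholders.
	robots := `User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
sitemap: https://example.com/sitemap-news.xml`
	for _, u := range sitemapsFromRobots(robots) {
		fmt.Println(u)
	}
}
```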

The usefulness of the `LastMod` timestamps varies by site. Some sites
set it to the time the sitemap was generated, which might be very
recent, even for archival material.
For other sites, it's a useful way to filter just the articles
you want.

TODO: document any progress in filtering sitemaps by rough date ranges

## wp-json

TODO
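
The general idea, pending a proper writeup: WordPress sites expose the
REST API under `/wp-json/`, and `/wp-json/wp/v2/posts` returns posts
as JSON, pageable via `page`/`per_page` and filterable by date with
the standard `after`/`before` parameters (ISO8601). A sketch of
building such a query URL - the hostname is a placeholder, and the
endpoint and parameters are the stock WordPress ones, not anything
specific to this tool:

```go
package main

import (
	"fmt"
	"net/url"
)

// postsQuery builds a WordPress REST API URL listing posts in a
// date range. site is the bare site URL, e.g. "https://example.com".
func postsQuery(site string, page int, after, before string) string {
	q := url.Values{}
	q.Set("page", fmt.Sprintf("%d", page))
	q.Set("per_page", "100") // 100 is the API's maximum page size
	q.Set("after", after)    // ISO8601, e.g. "2022-08-01T00:00:00"
	q.Set("before", before)
	return site + "/wp-json/wp/v2/posts?" + q.Encode()
}

func main() {
	// To backfill, walk pages until the API stops returning posts -
	// only the URL-building step is shown here.
	fmt.Println(postsQuery("https://example.com", 1,
		"2022-08-01T00:00:00", "2022-08-31T23:59:59"))
}
```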
## Site-specific Hackery

If the generic `sitemap.xml` scanning doesn't cover the articles you're
looking for, you'll probably have to write some custom code to cover
the site you want.

Such site-specific hacks are being collected in the `backfill` tool.

Most sites have a search facility which can be used to generate a list
of older articles.

Other sites have good archive sections which can be iterated through.

Either way... coding.
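
The shape of such a hack is usually the same: generate search-results
or archive page URLs, fetch each one, pull out the article links, and
stop when a page comes back empty. A sketch of the URL-generation
part, with a made-up archive URL pattern - real sites each need their
own, which is exactly the hackery:

```go
package main

import "fmt"

// archivePages yields page URLs for a hypothetical archive section
// paginated as /archive/page/1, /archive/page/2, ...
func archivePages(site string, n int) []string {
	var out []string
	for i := 1; i <= n; i++ {
		out = append(out, fmt.Sprintf("%s/archive/page/%d", site, i))
	}
	return out
}

func main() {
	for _, u := range archivePages("https://example.com", 3) {
		fmt.Println(u)
	}
}
```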

## Scraping articles from backfill lists

Once you've got a list of article URLs, you can scrape them using
the `scrapeomat` tool with the `-i` flag. This flag skips the usual
article-discovery phase for the scraper and instead reads a list of
URLs from a file.

In this mode, only a single scraper can be invoked - it's assumed
that all the URLs in the list file are from the same publication.

The usual scraper-specific article URL patterns are still applied
to the URLs before scraping, so non-article links will be filtered
out of the list.