# Backfilling techniques

Backfilling is the process of collecting URLs of older articles to fill
in gaps in coverage.


## Using `sitemap.xml`

A lot of sites (most sites?) have a sitemap file, which lists pages
they'd like search engines to index. On a news site, this will usually
include a lot of article links.

Use `sitemapwalker` to grab all the URLs in a sitemap.

This is a nice generic method for backfilling - if the articles you
want are in the sitemap, you don't need to do anything site-specific.

Some sites only have recent articles in their sitemap. This is great
for filling in recent gaps (eg within the last week).
But if the gap is further back, you'll have to resort to other
methods - most likely some site-specific hackery.

Some sites have _really_ comprehensive sitemaps. For example, the
Independent seems to list its entire archive of articles back to 2012
or so. In these cases, the list of URLs can be overwhelmingly large.

Look at a site's `robots.txt` file to see what sitemaps it has. There
will often be multiple starting points listed there.

The usefulness of the `LastMod` timestamps varies by site. Some sites
set it to the time the sitemap was generated, which might be very
recent, even for archival material.
For other sites, it's a useful way to filter out just the articles
you want.

TODO: document any progress in filtering sitemaps by rough date ranges

## wp-json

TODO


## Site-specific Hackery

If the generic `sitemap.xml` scanning doesn't cover the articles you're
looking for, you'll probably have to write some custom code to cover
the site you want.

Such site-specific hacks are being collected in the `backfill` tool.

Most sites have a search facility which can be used to generate a list
of older articles.
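A search-driven backfill hack usually amounts to walking the site's paginated search results and pulling out article-looking links. A minimal sketch, with an invented search URL template and an invented link pattern (neither is from a real site or from the `backfill` tool):

```python
# Hypothetical sketch of a search-driven backfill.
# The URL template and link pattern below are invented examples.
import re

def search_page_url(page):
    """URL for page `page` of an imagined site search facility."""
    return f"https://example.com/search?q=*&page={page}"

# A real hack would use a proper HTML parser; a regexp is enough for a sketch.
ARTICLE_LINK = re.compile(r'href="(https://example\.com/news/[^"]+)"')

def article_links(html):
    """Pull article-looking hrefs out of one page of search results."""
    return ARTICLE_LINK.findall(html)

# A real backfill would fetch search_page_url(n) for n = 1, 2, ... until a
# page comes back empty; here we just run the extraction on a canned page.
page = ('<a href="https://example.com/news/2019/old-story">x</a>'
        '<a href="https://example.com/about">y</a>')
print(search_page_url(2))
for url in article_links(page):
    print(url)
```

The stopping condition varies per site - some searches report a total result count, others just return an empty page when you run off the end.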
Other sites have good archive sections which can be iterated through.

Either way... coding.

## Scraping articles from backfill lists

Once you've got a list of article URLs, you can scrape them using
the `scrapeomat` tool with the `-i` flag. This flag skips the scraper's
usual article-discovery phase and instead reads a list of
URLs from a file.

In this mode, only a single scraper can be invoked -
it's assumed that all the URLs in the list file are from the same
publication.

The usual scraper-specific article URL patterns are still applied
to the URLs before scraping, so non-article links will be filtered
out of the list.
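That filtering step is essentially a whitelist match: each URL in the backfill list is kept only if it matches one of the scraper's article URL patterns. A sketch of the idea (the patterns and URLs here are invented, not scrapeomat's actual configuration):

```python
# Sketch of filtering a backfill list against article URL patterns.
# The patterns below are invented examples of what a scraper config
# might declare as article-shaped URLs.
import re

art_patterns = [re.compile(p) for p in (
    r"/news/\d{4}/",    # e.g. /news/2019/some-story
    r"/article/\d+$",   # e.g. /article/12345
)]

def filter_article_urls(urls):
    """Keep only URLs matching at least one article pattern."""
    return [u for u in urls if any(p.search(u) for p in art_patterns)]

urls = [
    "https://example.com/news/2018/budget-row",
    "https://example.com/about-us",
    "https://example.com/article/987654",
]
print(filter_article_urls(urls))   # drops the /about-us link
```

Because the list is filtered this way, it's harmless for a backfill list to contain section pages, tag pages, or other junk - they just won't survive the pattern match.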