Scrapeomat API docs
===================

GET http://<host>/<path>/api/slurp

Fetches full articles from the scrapeomat store.


PARAMETERS:

pubfrom
pubto
  Only articles with publication dates within this range are returned.
  More precisely, the range is half-open: pubfrom <= published < pubto.

  These can be days like "2006-03-23", or full RFC3339 timestamps,
  timezone offset and all (eg "2006-01-02T15:04:05+07:00").

  For the day-only form, the day is taken as UTC. So while London is on
  BST, the articles returned for a given day will be skewed by one hour -
  you'll be missing an hour from one day, but you'll get an hour from the
  next day instead.

  Don't forget to url-escape the params (the plus sign in the timezone
  offset caused me a little head-scratching ;-)

pub
  Filter by publication.

  By default all publications are included in the results, but if one or
  more "pub" params are given, the results are narrowed down to just
  those. The values for "pub" are publication codes: "bbc", "dailymail",
  "guardian" etc...
  (I can get you a list if you need one, or you can just pick them out
  of the results yourself :-)

xpub
  Exclude publications. Any publications specified with xpub are
  filtered out.

since_id
  Only return articles with an internal ID larger than this.

count
  Limit the returned set to at most this many articles.
  There'll be some internal limit too, which will probably end up at
  about 2000 or so.


EXAMPLE:

To fetch all the articles published on May 3rd, London time (+01:00 at
that time of year):

  http://foo.scumways.com/ukarts/api/slurp?pubfrom=2015-05-03T00%3A00%3A00%2B01%3A00&pubto=2015-05-04T00%3A00%3A00%2B01%3A00


RETURNS:

Upon error, a non-200 HTTP code is returned (eg "400 Bad Request" if the
parameters are bad).

Upon success, the articles are returned as a stream of JSON objects:

  {"article": { ... article 1 data ... }}
  {"article": { ... article 2 data ... }}
  ...
  {"article": { ... article N data ... }}

If an error occurs after the data starts flowing, an error object with a
description is returned, eg:

  {"error": "too many fish"}

I plan to define some other objects in addition to "article" and "error"
(eg progress updates), so if you just ignore anything unknown you should
be fine.

The article data should be reasonably self-explanatory.
The "content" field is the article text, in somewhat-sanitised HTML.
The "urls" field is a list of known URLs for the article (including the
canonical URL, if known).

If the results were clipped by the count limit, the last object returned
will be:

  {"next": {"since_id": N}}

where N is the highest article ID received. Pass it as the since_id
parameter in the next request to pick up where you left off.
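
For concreteness, here's a minimal consumer sketch in Go (the language
scrapeomat itself is written in). It's illustrative only: the slurp()
helper and msg struct are invented for this example, and the only
wire-format details it relies on are the "article"/"error"/"next"
objects and the since_id parameter described above.

  // minimal slurp client sketch - illustrative, not part of scrapeomat
  package main

  import (
      "encoding/json"
      "fmt"
      "io"
      "net/http"
      "net/url"
  )

  // msg is the envelope for each object in the stream.
  // At most one member is set per object.
  type msg struct {
      Article json.RawMessage `json:"article"`
      Error   string          `json:"error"`
      Next    *struct {
          SinceID int64 `json:"since_id"`
      } `json:"next"`
  }

  func slurp(base string, params url.Values) error {
      for {
          resp, err := http.Get(base + "/api/slurp?" + params.Encode())
          if err != nil {
              return err
          }
          if resp.StatusCode != http.StatusOK {
              resp.Body.Close()
              return fmt.Errorf("request failed: %s", resp.Status)
          }

          var next int64 = -1
          dec := json.NewDecoder(resp.Body)
          for {
              var m msg
              if err := dec.Decode(&m); err == io.EOF {
                  break
              } else if err != nil {
                  resp.Body.Close()
                  return err
              }
              switch {
              case m.Error != "":
                  resp.Body.Close()
                  return fmt.Errorf("server error: %s", m.Error)
              case m.Next != nil:
                  next = m.Next.SinceID // results were clipped
              case m.Article != nil:
                  fmt.Printf("article: %d bytes\n", len(m.Article))
              }
              // anything else is an unknown object type - ignore it
          }
          resp.Body.Close()

          if next < 0 {
              return nil // no {"next":...}, so we've got the lot
          }
          // resume from the highest ID seen so far
          params.Set("since_id", fmt.Sprint(next))
      }
  }

  func main() {
      params := url.Values{}
      // Encode() handles the url-escaping, including the "+" in
      // the timezone offset
      params.Set("pubfrom", "2015-05-03T00:00:00+01:00")
      params.Set("pubto", "2015-05-04T00:00:00+01:00")
      if err := slurp("http://foo.scumways.com/ukarts", params); err != nil {
          fmt.Println("ERROR:", err)
      }
  }

The outer loop re-requests with since_id set whenever a {"next"} object
turns up, which is all the pagination this API needs.
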
112 "publications" is a list of the publications in the DB, each with the fields: 113 code - short code (lowercase) for publication (eg "dailyblah") 114 name - human-readable name of publication (eg "The Daily Blah") 115 domain - main domain for publication (eg "www.dailyblah.com") 116