Scrapeomat API docs
===================

GET http://<host>/<path>/api/slurp

Fetches full articles from the scrapeomat store.


PARAMETERS:

pubfrom
pubto

  Only articles with publication dates within this range will be returned.
  More specifically, the range is:   pubfrom <= published < pubto

  These can be days like "2006-03-23", or full RFC3339 timestamps, with the
  timezone offset and all (eg: "2006-01-02T15:04:05+07:00").

  For the day-only form, the day is taken as UTC. So while London is on BST,
  the returned articles will be skewed by one hour - you'll lose an hour from
  one end of the day and pick up an hour from the neighbouring day instead.

  Don't forget to url-escape the params (the plus sign in the timezone offset
  caused me a little head-scratching ;-). Or let a library build the query
  for you, as in the Go sketch after the example below.

pub
  Filter by publication.

  By default, all publications are included in the results, but if one or
  more "pub" params are given, the results are narrowed down to just those
  publications. The values for "pub" are the publication codes "bbc",
  "dailymail", "guardian" etc etc...
  (The /api/pubs endpoint described below will give you the full list, or
  you can just pick the codes out of the results yourself :-)

xpub
  Exclude publications. Any publications specified with xpub will be
  filtered out.


since_id
  Only return articles with an internal ID larger than this.

count
  Return at most this many articles. There'll be an internal cap as well,
  which will probably end up at about 2000 or so.


EXAMPLE:
  to fetch all the articles published on May 3rd, London time (+01:00
  currently):

  http://foo.scumways.com/ukarts/api/slurp?pubfrom=2015-05-03T00%3A00%3A00%2B01%3A00&pubto=2015-05-04T00%3A00%3A00%2B01%3A00
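
  For what it's worth, here's a rough Go sketch of building that same request
  URL with net/url, which takes care of all the escaping. The host, path and
  dates are just the values from the example above - nothing here is part of
  scrapeomat itself:

  package main

  import (
      "fmt"
      "net/url"
      "time"
  )

  func main() {
      // Express the London-local day as RFC3339 timestamps, rather than
      // leaving the server to interpret a bare day as UTC.
      loc, err := time.LoadLocation("Europe/London")
      if err != nil {
          panic(err)
      }
      from := time.Date(2015, 5, 3, 0, 0, 0, 0, loc)
      to := from.AddDate(0, 0, 1)

      // url.Values handles escaping (including the '+' in the offset).
      params := url.Values{}
      params.Set("pubfrom", from.Format(time.RFC3339))
      params.Set("pubto", to.Format(time.RFC3339))
      params.Add("pub", "bbc")      // optional: narrow to particular pubs
      params.Add("pub", "guardian")

      u := url.URL{
          Scheme:   "http",
          Host:     "foo.scumways.com",
          Path:     "/ukarts/api/slurp",
          RawQuery: params.Encode(),
      }
      fmt.Println(u.String())
  }

  (params.Encode() sorts the keys, so the query string won't be byte-for-byte
  identical to the example above, but it means the same thing.)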



RETURNS:

Upon error, a non-200 HTTP code will be returned (eg "400 Bad Request"
if the parameters are bad).

Upon success, the articles are returned as a stream of json
objects:

  {"article": { ... article 1 data ... }}
  {"article": { ... article 2 data ... }}
     ...
  {"article": { ... article N data ... }}

If an error occurs after the data starts flowing, an error object will be
returned with some description, eg:

  {"error": "too many fish"}

I plan to define some other objects in addition to "article" and "error"
(eg progress updates), so if you just ignore anything unknown you should
be fine.

The article data should be reasonably self-explanatory.
The "content" field is the article text, in somewhat-sanitised HTML.
The "urls" field contains a list of known URLs (including the canonical
URL, if known).

If the results were clipped, the last object returned will be:
  {"next": {"since_id": N}}
where N is the highest article ID received so far, which can be used as
the since_id parameter in the next request.
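
As an illustration, here's roughly how a Go client might read that stream
with encoding/json and follow the "next" object until the results stop
being clipped. The slurpObject/slurp names (and treating IDs as plain ints)
are just my own for this sketch, not part of scrapeomat:

  package main

  import (
      "encoding/json"
      "fmt"
      "io"
      "net/http"
  )

  // slurpObject mirrors the one-key-per-object framing described above.
  // Object types we don't know about decode to zero values and get skipped.
  type slurpObject struct {
      Article json.RawMessage `json:"article"`
      Error   string          `json:"error"`
      Next    *struct {
          SinceID int `json:"since_id"`
      } `json:"next"`
  }

  // slurp fetches one batch of articles, returning the raw article objects
  // and the since_id to use next time (0 if the results weren't clipped).
  func slurp(u string) ([]json.RawMessage, int, error) {
      resp, err := http.Get(u)
      if err != nil {
          return nil, 0, err
      }
      defer resp.Body.Close()
      if resp.StatusCode != http.StatusOK {
          return nil, 0, fmt.Errorf("slurp: %s", resp.Status)
      }

      var arts []json.RawMessage
      next := 0
      dec := json.NewDecoder(resp.Body)
      for {
          var obj slurpObject
          if err := dec.Decode(&obj); err == io.EOF {
              break
          } else if err != nil {
              return arts, next, err
          }
          switch {
          case obj.Article != nil:
              arts = append(arts, obj.Article)
          case obj.Error != "":
              return arts, next, fmt.Errorf("slurp: %s", obj.Error)
          case obj.Next != nil:
              next = obj.Next.SinceID
          }
      }
      return arts, next, nil
  }

  func main() {
      base := "http://foo.scumways.com/ukarts/api/slurp?pubfrom=2015-05-03&pubto=2015-05-04"
      sinceID := 0
      for {
          u := base
          if sinceID > 0 {
              u += fmt.Sprintf("&since_id=%d", sinceID)
          }
          arts, next, err := slurp(u)
          if err != nil {
              panic(err)
          }
          fmt.Printf("got %d articles\n", len(arts))
          if next == 0 {
              break // results weren't clipped - we're done
          }
          sinceID = next
      }
  }

json.Decoder reads a sequence of concatenated JSON values straight off the
body, so there's no need to split the stream into lines first.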


FUTURE PLANS:

- Some sort of simple token-based auth.

- Other API endpoints for interrogating publication codes, article counts
  and whatever other stats or diagnostic stuff would be useful.



METHOD:
GET /api/pubs

PARAMETERS:
none

RETURNS:
json object with one member, "publications".
"publications" is a list of the publications in the DB, each with the fields:

  code    - short code (lowercase) for publication (eg "dailyblah")
  name    - human-readable name of publication (eg "The Daily Blah")
  domain  - main domain for publication  (eg "www.dailyblah.com")
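
A rough Go sketch of decoding that (the Publication struct and fetchPubs
function are just made-up names for illustration, and the host is the one
from the slurp example earlier):

  package main

  import (
      "encoding/json"
      "fmt"
      "net/http"
  )

  // Publication mirrors the fields listed above.
  type Publication struct {
      Code   string `json:"code"`
      Name   string `json:"name"`
      Domain string `json:"domain"`
  }

  func fetchPubs(base string) ([]Publication, error) {
      resp, err := http.Get(base + "/api/pubs")
      if err != nil {
          return nil, err
      }
      defer resp.Body.Close()
      if resp.StatusCode != http.StatusOK {
          return nil, fmt.Errorf("pubs: %s", resp.Status)
      }

      var out struct {
          Publications []Publication `json:"publications"`
      }
      if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
          return nil, err
      }
      return out.Publications, nil
  }

  func main() {
      pubs, err := fetchPubs("http://foo.scumways.com/ukarts")
      if err != nil {
          panic(err)
      }
      for _, p := range pubs {
          fmt.Printf("%s\t%s\t%s\n", p.Code, p.Name, p.Domain)
      }
  }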