legeorges / redditimagegrabber / 0.3.1

README.md

Grab images from a subreddit's front page.

The following JSON input parameters are mandatory:
  • subreddit: The name of the subreddit.
  • category: Can be 'hot', 'new', 'rising', 'controversial', or 'top'.

The following JSON input parameters are optional:
  • limit: The maximum number of items desired (default: 25, maximum: 100).
  • after: Fetch entries listed after the item whose fullname is assigned to this property.
  • before: Fetch entries listed before the item whose fullname is assigned to this property.
  • count: The number of items already seen. A non-negative integer (default: 0).
  • user_agent: Reddit drastically rate-limits many default User-Agents to encourage unique and descriptive user-agent strings. Use this parameter to define your own.
  • domains: The list of image host domains you want to grab images from. Defaults to two possible imgur domains: ['i.imgur.com', 'imgur.com'].
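
As an illustration of how the parameters above fit together, here is a minimal Python sketch that assembles and checks an input object before it is sent. The helper name `build_input` and the example subreddit `pics` are hypothetical and not part of this algorithm. The checks only mirror what this README states (the allowed categories, the maximum limit, and the "only one of after / before" rule from the pagination notes).

```python
import json

# Category values listed in this README.
VALID_CATEGORIES = {"hot", "new", "rising", "controversial", "top"}


def build_input(subreddit, category, limit=25, count=0,
                after=None, before=None, user_agent=None, domains=None):
    """Return a dict of input parameters (hypothetical helper, not part of the algorithm)."""
    if category not in VALID_CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    if limit > 100:
        raise ValueError("limit may not exceed 100")
    if after is not None and before is not None:
        raise ValueError("specify only one of 'after' and 'before'")

    payload = {"subreddit": subreddit, "category": category,
               "limit": limit, "count": count}
    # Optional keys are only included when set, so the documented defaults apply otherwise.
    for key, value in (("after", after), ("before", before),
                       ("user_agent", user_agent), ("domains", domains)):
        if value is not None:
            payload[key] = value
    return payload


if __name__ == "__main__":
    # "pics" is only an example subreddit name.
    print(json.dumps(build_input("pics", "hot", limit=10), indent=2))
```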


On pagination:

"Many endpoints on reddit use the same protocol for controlling pagination and filtering. These endpoints are called Listings and share five common parameters: after / before, limit, count, and show.

Listings do not use page numbers because their content changes so frequently. Instead, they allow you to view slices of the underlying data. Listing JSON responses contain after and before fields which are equivalent to the "next" and "prev" buttons on the site and in combination with count can be used to page through the listing.

The common parameters are as follows:

  • after / before - only one should be specified. these indicate the fullname of an item in the listing to use as the anchor point of the slice.
  • limit - the maximum number of items to return in this slice of the listing.
  • count - the number of items already seen in this listing. on the html site, the builder uses this to determine when to give values for before and after in the response.

To page through a listing, start by fetching the first page without specifying values for after and count. The response will contain an after value which you can pass in the next request. It is a good idea, but not required, to send an updated value for count which should be the number of items already fetched." (source)
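
The paging procedure quoted above can be sketched in Python roughly as follows. This is an illustration under assumptions, not the algorithm's own code: `fetch_page` is a hypothetical caller-supplied function that performs one request and returns the decoded Listing JSON, assumed here to have the usual `{"data": {"children": [...], "after": ...}}` shape. `iter_listing` and `max_items` are likewise made-up names.

```python
def iter_listing(fetch_page, limit=25, max_items=100):
    """Yield items from a listing, following the 'after' anchor page by page.

    fetch_page(after=..., count=..., limit=...) is a hypothetical callable
    that returns one decoded Listing response.
    """
    after = None   # the first page is fetched without an anchor
    count = 0      # number of items already seen
    while count < max_items:
        listing = fetch_page(after=after, count=count, limit=limit)
        data = listing["data"]
        children = data["children"]
        if not children:
            return
        for child in children:
            yield child
            count += 1
            if count >= max_items:
                return
        # The response's 'after' value anchors the next slice; None means no more pages.
        after = data.get("after")
        if after is None:
            return
```

Passing the fetch function in keeps the paging logic separate from the HTTP details, so it can be exercised with a stub before any real request is made.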

This algorithm makes requests to the Reddit API, so pay careful attention to Reddit's API restrictions, particularly the limit on request frequency:

"Reddit provides an API, and unlike some websites, it’s actually quite easy to use. It’s based on REST and JSON, so in theory doesn’t require any fancy setup.

http://www.reddit.com/dev/api

The important thing is to follow the rules they set. Two of the most important ones are:

  • You can’t make more than 1 request every 2 seconds (or 30 a minute).
  • You must not lie about your user agent.

Read the rest here.

The user agent is what identifies your browser. Libraries like Python’s urllib are severely restricted by Reddit to prevent abuse. Reddit recommends you use your own special user agent." (source)
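
For callers who talk to Reddit directly, the two rules quoted above could be respected with something like the following Python sketch. It is not how this algorithm is implemented. `Throttle` and `build_request` are hypothetical names, and the 2-second interval is taken from the quoted rule. Only standard-library calls are used.

```python
import time
import urllib.request


class Throttle:
    """Block so that consecutive calls to wait() are at least min_interval seconds apart."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without real delays
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        if self._last is not None:
            remaining = self._min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


def build_request(url, user_agent):
    """Return a urllib Request carrying a custom, descriptive User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})
```

Call `throttle.wait()` immediately before each request, and pass a user agent string that honestly describes your own client.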