brannerchinese/feedergrabber

feedergrabber

Retrieves the links and titles of recent posts from blog feeds.

Version

0.3, 20130421

Set-up

  1. Dependencies are listed in the requirements.txt files.
  2. There are versions for Python 2.7 (file names contain "27") and for Python 3 (no special notation). They are in separate PYTHON27 and PYTHON3 directories.
  3. The main program is feedergrabber.py and its main function is feedergrabber(), which takes a single URL as an argument. The URL should point to an RSS or Atom feed; the function normally returns an error if it encounters ordinary HTML or malformed XML.
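
The distinction in item 3 between feeds, ordinary HTML, and malformed XML can be illustrated with a minimal sketch. This is not the code in feedergrabber.py; the function name classify_document and its use of the standard-library xml.etree.ElementTree parser are assumptions made only for illustration.

```python
import xml.etree.ElementTree as ET

# Root element names that identify the common feed formats.
ATOM_ROOT = "{http://www.w3.org/2005/Atom}feed"
RDF_ROOT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF"  # RSS 1.0


def classify_document(text):
    """Hypothetical helper: return 'rss', 'atom', or None.

    None covers both malformed XML and well-formed documents
    (such as XHTML pages) whose root element is not a feed.
    """
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        # Malformed XML, which includes most ordinary HTML.
        return None
    if root.tag in ("rss", RDF_ROOT):
        return "rss"
    if root.tag == ATOM_ROOT:
        return "atom"
    return None
```

A check of this kind only inspects the root element; it says nothing about whether the entries inside the feed have usable links or titles.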

Output

  1. The function feedergrabber() returns a 3-tuple containing two lists and a datetime.datetime object:
  • a first list of 2-tuples, each containing the URL and title of a single post; this element may be None if something went wrong with the look-up or parsing.
  • a second list of 2-tuples, each containing the URL and error message associated with an error encountered; if this element is None, no errors were observed.
  • a datetime.datetime object containing the post's date of publication or of updating, preferring the latter when available.
  2. A supplementary program is supply_feedergrabber.py, which runs through a list of known feeds and non-feed blogs, calls feedergrabber for each, and reports a period (.) if the look-up and parsing proceeded smoothly. Since non-feed sources are no longer supported, they return the error "Parsing methods not successful." This supplementary program is used only for internal testing.
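
As a rough sketch of how a caller might consume the 3-tuple described in item 1, the helper below unpacks it and allows for either list being None. The name summarize_result and the sample data are hypothetical; the tuple is built by hand here rather than obtained from feedergrabber().

```python
import datetime


def summarize_result(result):
    """Hypothetical helper: turn the documented 3-tuple into printable lines."""
    posts, errors, when = result
    lines = []
    # Either list may be None, so fall back to an empty list.
    for url, title in posts or []:
        lines.append("{0} <{1}>".format(title, url))
    for url, message in errors or []:
        lines.append("ERROR {0}: {1}".format(url, message))
    if when is not None:
        lines.append("dated " + when.isoformat())
    return lines


# Hand-built sample in the documented shape: one post, no errors.
example = (
    [("http://example.com/post-1", "First post")],
    None,
    datetime.datetime(2013, 4, 21, 12, 0),
)

for line in summarize_result(example):
    print(line)
```

Treating None as an empty list keeps the loop bodies free of special cases, at the cost of no longer distinguishing "no posts found" from "look-up failed"; a caller that needs that distinction should test for None explicitly.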

New in this version

  1. Empty titles are now checked for and reported as an error, parallel to the handling of empty links.
  2. Doc-strings complete.
  3. Obsolete function removed.
  4. More commenting.

Past versions

  • 0.2, 20130420 (initial commit; the previous version was bloggergrabber v. 0.1). The initial prototype of this module used Beautiful Soup 4 to scrape both feeds and ordinary HTML. Here, however, support for HTML blogs is discontinued, both to eliminate the need to configure the scraping process manually for each new blog and to speed up parsing.

Future work

  1. Unit testing.
  2. Error-logging.
  3. Systematize error codes.
  4. Is it possible to subscribe to a feed using a socket, so that there is no need to process anything more than once or wait for HTTP requests to be answered?

[end]
