web2feed: turn webpages into feeds

This is a script to turn any webpage into a feed. The program relies first on site-specific rules (for popular sites and detectable software packages), then on heuristics (TODO).
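The "site-specific rules first, then heuristics" order could be sketched as follows. This is only an illustration, not the actual layout of web2feed.py: the registry `SITE_RULES`, the `site_rule` decorator, and the functions `scrape_example`, `scrape_heuristic`, and `scrape` are all hypothetical names.

```python
from urllib.parse import urlparse

# Hypothetical registry: hostname -> scraper function for that site.
SITE_RULES = {}


def site_rule(hostname):
    """Register a site-specific scraper for the given hostname."""
    def register(func):
        SITE_RULES[hostname] = func
        return func
    return register


@site_rule("example.com")
def scrape_example(html):
    # Placeholder rule: a real rule would parse the HTML for this site.
    return [{"uri": "http://example.com/post/1", "title": "Example post"}]


def scrape_heuristic(html):
    # Heuristic scraping is still a TODO, so this fallback finds nothing.
    return []


def scrape(uri, html):
    """Try a site-specific rule first, then fall back to heuristics."""
    host = urlparse(uri).hostname or ""
    rule = SITE_RULES.get(host)
    if rule is not None:
        return rule(html)
    return scrape_heuristic(html)
```

With this sketch, `scrape("http://example.com/a", html)` would use the registered rule, while any unknown host would fall through to the (currently empty) heuristic path.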

The premise is that RSS/Atom feeds often don't tell the whole story. We should be able to scrape any webpage for content regardless of what the author wants to make easily available.

Output is currently a list of dictionaries; serializations (JSON, XML, RDF) are planned. Additional options to be developed include advertisement/JavaScript removal, link/image/media isolation, etc.
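Since serialization is not implemented yet, here is a minimal sketch of what a JSON serializer for the list of dictionaries might look like. It uses the standard library `json` module rather than simplejson, and the function name `to_json` and the choice of ISO 8601 strings for dates are assumptions, not part of the project.

```python
import json
from datetime import date, datetime


def to_json(entries):
    """Serialize a list of entry dictionaries to a JSON string.

    datetime/date values are not JSON-serializable by default, so they
    are written as ISO 8601 strings (an assumed convention).
    """
    def encode(value):
        if isinstance(value, (datetime, date)):
            return value.isoformat()
        raise TypeError("Cannot serialize %r" % type(value))

    return json.dumps(entries, default=encode, indent=2)
```

For example, `to_json([{"uri": "http://example.com/", "date": datetime(2010, 1, 2)}])` would return a JSON array with the date written as a string.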

Ultimately, this was written as a support package for Sylph, which aims to completely decentralize the web and take bootstrapped content with it. (Please read more about that and consider contributing.)

The following libraries are used:

BeautifulSoup
html5lib (for BeautifulSoup parser fixes)
simplejson

(A complete client would have no non-standard library dependencies.)

Sites/Blogs

Software

None yet

(Also, heuristic scraping hasn't even been started yet.)

Output format:

Output is a list of dictionaries, with the following keys:

uri
title
date, a Python datetime object, which may not include a time component if the website didn't list the time
author, name of the author
contents and/or summary, which probably contain minor HTML such as <p> and <img>
contents_format and/or summary_format, which are either 'text/plain' or 'text/html'
contents_markdown and/or summary_markdown, which contain a Markdown-formatted version of the respective text
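As an illustration of the keys above, a single entry might look like the dictionary below. All values are made up, and the helper `has_body` is a hypothetical check, not something the script provides. The date is shown with the time left at midnight, which is only an assumption about how a missing time component would be represented.

```python
from datetime import datetime

# The two formats named in the key list above.
ALLOWED_FORMATS = {"text/plain", "text/html"}

# A made-up entry using the documented keys.
entry = {
    "uri": "http://example.com/post/1",
    "title": "Example post",
    "date": datetime(2010, 1, 2),  # no time listed on the page (assumed midnight)
    "author": "Jane Doe",
    "summary": "<p>Short summary.</p>",
    "summary_format": "text/html",
    "summary_markdown": "Short summary.",
}


def has_body(entry):
    """Return True if the entry has contents or a summary with a known format."""
    for key in ("contents", "summary"):
        if key in entry and entry.get(key + "_format") in ALLOWED_FORMATS:
            return True
    return False
```

Here `has_body(entry)` is true because the entry carries a summary with a `summary_format` of 'text/html'; an entry with only a uri and title would fail the check.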

More will be added as I write code to support comments, etc. Also to be output is the type of page, the heuristics used, etc.

License

My code is licensed under the MIT and BSD licenses. However, I have included Aaron Swartz's GPL-licensed html2text, which generates structured Markdown from HTML. Thus, to redistribute the code as-is, it must be under the GPL. (Removing this feature should be simple and straightforward, however.)
