yavila/retina-crawler

The goal of this experiment is to download articles and extract text and metadata from them.


This repo is written in Python 2.7.8. You will need to:

  1. Install Python 2.7.8 and add Python to your path (if installing with apt-get, brew, or an equivalent package manager on Linux and macOS systems, this should happen automatically).
  2. Install pip, the Python package manager.
  3. Optionally, install virtualenv by running pip install virtualenv.

Installation

To parse XML articles, you'll need two system packages, libxml2 and libxslt. On Ubuntu, install them with sudo apt-get install libxml2 libxslt1-dev

Then, install the required Python libraries with: pip install -r requirements.txt
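
A quick way to confirm the XML toolchain is working (assuming requirements.txt pulls in lxml, the usual Python binding for libxml2/libxslt; this is an assumption, not confirmed by the repo) is to parse a trivial document:

# Sanity check: parse a tiny XML document with lxml.
# If this import fails, recheck the system packages above
# and the contents of requirements.txt.
from lxml import etree

root = etree.fromstring("<article><title>hello</title></article>")
print root.findtext("title")  # prints "hello"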

Using Mongo

  1. Install mongo.
  2. Install genghisapp with gem install genghisapp, which is like phpMyAdmin for MongoDB. genghisapp requires Ruby/RubyGems; you can install Ruby by following this guide and RubyGems by downloading and installing from here.
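
You can also inspect stored articles directly from Python. Here is a minimal sketch using pymongo, where the database and collection names are assumptions for illustration (the real names depend on your config):

# Sketch: peek at crawled articles in MongoDB with pymongo.
# The "retina" database and "articles" collection names are
# assumed; check your config for the actual names.
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
collection = client["retina"]["articles"]

for doc in collection.find().limit(5):
    print doc.get("title")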

Running everything

Once everything is installed, you should be able to run the crawler with

python poc.py

This will run the crawler in "SimpleMode," which will crawl articles from the CNN RSS feed and write them to a directory as JSON files.
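
To spot-check that output, you can load one of the generated files. A short sketch, where the output directory and field names are assumptions (check the crawler's defaults for the real values):

# Sketch: list the JSON files SimpleMode wrote and show their keys.
# The "articles/" directory is an assumed default, not confirmed.
import glob
import json

for path in glob.glob("articles/*.json"):
    with open(path) as f:
        article = json.load(f)
    print path, article.keys()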

To configure different behavior, you can pass the path to a configuration file (in JSON format) as the first command-line argument. For example, to run on several RSS feeds and store the results in a MongoDB instance running on localhost, run the following command:

python poc.py configs/local-mongo-several-rss-conf.json

Format of configs

See configs/local-mongo-several-rss-conf.json for an example.
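
While the exact fields depend on the runner, every config follows the runner/args shape explained below. A hypothetical sketch (the runner name and arg keys here are invented for illustration):

{
  "runner" : "MongoRssCrawler",
  "args" : {
    "rss_feeds" : ["http://rss.cnn.com/rss/cnn_topstories.rss"],
    "mongo_host" : "localhost"
  }
}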

You can make new configs and runners in the following way. First, in the file CrawlerRunners.py, add a class with a run method. run will be called in a while True loop, and any errors thrown will be caught by external code and logged, roughly as in the sketch below.
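
A conceptual sketch of that external loop (the real code lives in poc.py; the names and pause interval here are illustrative):

# Illustrative driver loop; the actual implementation is in poc.py.
import logging
import time

def drive(runner):
    while True:
        try:
            runner.run()
        except Exception:
            # A bad feed or article is logged, not fatal.
            logging.exception("runner raised; continuing")
        time.sleep(1)  # assumed pause between iterations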

Example: Suppose we made the following runner in CrawlerRunners.py, which will simply print new links from an RSS feed:

class RSSLinkPrinter(object):
    def __init__(self, args):
        # args is the "args" dict from the config file.
        self._rssRunner = RssLinkParser(args['rss_feed'])

    def run(self):
        # Called repeatedly by the driver loop; prints links that
        # appeared in the feed since the last call.
        print self._rssRunner.get_new_links()

Which RSS feed will this pull from? Let's say we want to test out the New York Times RSS feed. Then we should make the following configuration file, print-nytimes-rss-links.json:

{
  "runner" : "RSSLinkPrinter",
  "args" : {
    "rss_feed" : "http://www.nytimes.com/services/xml/rss/nyt/HomePage.xml"
  }
}

The "runner" field in the JSON file is the name of the class you defined in CrawlerRunners.py, and the args field is a dict that will be passed directly to the constructor of your runner.

So we can run the crawler with our RSSLinkPrinter with the following command:

python poc.py print-nytimes-rss-links.json
