nmilford/crawlr

A Crawlr worker pulls a URL off of the Celery/Redis queue (a rough sketch of the hash-and-decide steps follows this list).

[x] It first normalizes the URL, then hashes it with 128-bit MurmurHash3. That hash is the primary ID for the URL.

[x] It checks Cassandra to see if the URL already exists.

[x] If the URL is found, it checks:

    the last crawl date, and does not re-crawl if the last crawl was less than X hours ago.

    the failure count; if the failure count is greater than Y, it skips the URL and backs off exponentially.

[x] If the URL is not found, it crawls it.

[x] If the crawl fails, it increments the failure count in Cassandra and logs the reason.

[x] If it succeeds, it grabs the page and returns a dict with the fruit of the crawl.
Setup Environment

echo oracle-java8-installer shared/accepted-oracle-license-v1-1 select true | sudo /usr/bin/debconf-set-selections
sudo add-apt-repository ppa:webupd8team/java
echo "deb http://debian.datastax.com/community stable main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
sudo apt-get update
sudo apt-get install oracle-java8-installer dsc30 cassandra-tools redis-server -y
sudo apt-get install build-essential python-pip python-bs4 python-requests python-celery python-dev libev4 libev-dev -y
sudo apt-get install nodejs-legacy npm -y
sudo pip install tld
sudo pip install mmh3
sudo pip install urlnorm
sudo pip install cassandra-driver
sudo npm -g install phantomjs
sudo pip install selenium

sudo /etc/init.d/cassandra start
cqlsh < crawlr.cql
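
With Cassandra running and the schema loaded, a worker can look a URL up by its hash with the cassandra-driver installed above. The keyspace, table, and column names below are illustrative guesses; the real schema is defined in crawlr.cql:

import mmh3
import urlnorm
from cassandra.cluster import Cluster

# Keyspace/table/column names here are assumptions; see crawlr.cql for the real schema.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('crawlr')

url = 'http://www.example.com'
uid = mmh3.hash128(urlnorm.norm(url))
rows = list(session.execute(
    'SELECT last_crawl, failures FROM urls WHERE id = %s', (uid,)))
existing = rows[0] if rows else None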

from crawlr import Crawlr
a = Crawlr('http://www.fsdfd.om')
b = a.crawl()
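
Since the setup installs PhantomJS, Selenium, and BeautifulSoup, the fetch step presumably renders pages headlessly before parsing; something along these lines, though Crawlr.crawl() itself may well differ:

from selenium import webdriver

# Headless fetch sketch only; the real Crawlr internals may differ.
driver = webdriver.PhantomJS()
driver.get('http://www.example.com')
html = driver.page_source  # rendered HTML, ready for BeautifulSoup
driver.quit()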

About

A toy web crawler that reads from Celery and writes to Cassandra. A small project to re-learn Python after a long layoff.
