GitHub - nmilford/crawlr: A toy web crawler that reads from celery and write to Cassandra. A small project to re-learn Python from a long layoff.

A Crawlr worker pulls a URL off of the Celery/Redis queue.

[x] It first normalizes the url, then MurmurHashe128s the url.

The Hash of the url will be the primary ID for the url

It checks Cssandra if the URL exists

if it does find it, it checks.

the last crawl date and does not crawl it if now is less than X hours.


the failure count, if the failure count is greater Y it skips it and exponentially backs off.

[x] If it doesn't it crawls it

[x] if it fails it adds it to the failure count in Cassandra and logs a reason.

[x] if it succeeds it grabs the page and returns a dict with the fruit of the crawl.

Setup Environment

echo oracle-java8-installer shared/accepted-oracle-license-v1-1 select true | sudo /usr/bin/debconf-set-selections
sudo add-apt-repository ppa:webupd8team/java
echo "deb http://debian.datastax.com/community stable main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
sudo apt-get update
sudo apt-get install oracle-java8-installer dsc30 cassandra-tools redis-server -y
sudo apt-get install build-essential python-pip python-bs4 python-requests python-celery python-dev libev4 libev-dev -y
sudo apt-get install nodejs-legacy npm -y
sudo pip install tld
sudo pip install mmh3
sudo pip install urlnorm
sudo pip install cassandra-driver
sudo npm -g install phantomjs
sudo pip install selenium

sudo /etc/init.d/cassandra start
cqlsh < crawlr.cql

from crawlr import Crawlr a = Crawlr('http://www.fsdfd.om') b = a.crawl()

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
contrib		contrib
crawlr		crawlr
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
crawlr-cli		crawlr-cli
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

contrib

contrib

crawlr

crawlr

.gitignore

.gitignore

Dockerfile

Dockerfile

README.md

README.md

crawlr-cli

crawlr-cli

requirements.txt

requirements.txt

setup.py

setup.py

Repository files navigation

About

Releases

Packages

Languages

nmilford/crawlr

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Languages