Skip to content

Experimental Distributed Web Crawling with Python + Gearman

Notifications You must be signed in to change notification settings

lihuimail/gearnado

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

gearnado

Experimental Distributed Web Crawling with Python + Gearman

Setup Instructions for Ubuntu:

$ sudo apt-get install git gearman libgearman-dev python-setuptools build-essential libxml2-dev libxslt-dev python-dev

$ sudo easy_install pyquery gearman tornado progressbar

If you are looking to do more than 1024 simultaneous connections on a single machine make sure you edit /etc/security/limits.conf and increase the soft/hard nofile limits.

Clone the Git Repo:

$ git clone git://github.com/iAcquire/gearnado.git
$ cd gearnado

Launch 30 TweetScout workers in one terminal:

$ for i in `seq 1 30`; do ./TweetScout.py & done

And run the TweetHandler in another:

$ time ./TweetHandler.py --url_file=python_crawler_urls.txt

About

Experimental Distributed Web Crawling with Python + Gearman

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published