pip install -r requirements.txt
- Before installing the requirements, make sure your system has the libffi-dev and libssl-dev libraries installed; they are required for HTTPS support.
2. PhantomJS 1.9.7 is installed.
3. Python 2.7.6.
The current version of the code uses Python 2.7. To scrape a site using the Tornado version of the library, execute:
python tornado_spider.py --url='http://www.example.com'
For JavaScript error detection (this takes significantly more time):
python tornado_spider.py --jserrors --url='http://www.example.com'
The Tornado version spawns a large number of connections to the server, producing conditions similar to a DoS attack, which might lead to pages returning 503 errors. In such scenarios, reduce the max concurrent connections setting in the config.py file.
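The throttling idea behind that setting can be sketched with a plain semaphore. This is an illustrative Python sketch of capping concurrency, not the scraper's actual implementation; `fetch` is a hypothetical placeholder that sleeps instead of making a real HTTP request.

```python
import threading
import time

# Hypothetical illustration: a semaphore allows at most
# MAX_CONCURRENT workers to "fetch" at the same time.
MAX_CONCURRENT = 10
semaphore = threading.Semaphore(MAX_CONCURRENT)

active = [0]   # number of workers currently inside the semaphore
peak = [0]     # highest concurrency observed
lock = threading.Lock()

def fetch(url):
    """Placeholder for an HTTP request; sleeps instead of fetching."""
    with semaphore:
        with lock:
            active[0] += 1
            peak[0] = max(peak[0], active[0])
        time.sleep(0.01)   # stand-in for network latency
        with lock:
            active[0] -= 1

threads = [threading.Thread(target=fetch,
                            args=('http://www.example.com/%d' % i,))
           for i in range(50)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(peak[0])  # never exceeds MAX_CONCURRENT
```

Lowering `MAX_CONCURRENT_REQUESTS_PER_SERVER` in config.py has the same effect on the real scraper: fewer simultaneous connections hit the server, at the cost of a longer crawl.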
Certain configurations for the scraper can be set via the scraper's config.py file. The available configuration options are as follows:
Starting URL:
START_URL = 'http://www.example.com/'
Maximum concurrent requests made to the server; too high a value will choke the server:
MAX_CONCURRENT_REQUESTS_PER_SERVER = 10
Idle ping count used for determining when to terminate the process:
IDLE_PING_COUNT = 10
Comma-separated subdomains that need to be skipped:
DOMAINS_TO_BE_SKIPPED = 'sub1.example.com,sub2.example.com'
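Putting the options above together, a complete config.py might look like the sketch below. The option names come from this README; the value types (plain strings, an integer) are assumptions, so check the shipped config.py for the exact expected format.

```python
# config.py -- scraper settings (value types are assumed for illustration)

# Starting URL for the crawl
START_URL = 'http://www.example.com/'

# Max concurrent requests made to the server;
# too high a value will choke the server
MAX_CONCURRENT_REQUESTS_PER_SERVER = 10

# Idle ping count used for determining when to terminate the process
IDLE_PING_COUNT = 10

# Comma-separated subdomains that need to be skipped
DOMAINS_TO_BE_SKIPPED = 'sub1.example.com,sub2.example.com'

# The skip list can be turned into individual hostnames like this:
skipped = DOMAINS_TO_BE_SKIPPED.split(',')
```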
1. Currently the utility doesn't scrape pages obtained after logging in.
2. Handling localhost-based URLs might require some tweaking.
3. Currently only Python 2.7.6 is supported.