SiteScrapper

Install the dependencies using

pip install -r requirements.txt

1. Before installing the requirements, make sure your system has the libffi-dev and libssl-dev libraries installed; they are required for HTTPS support (see the example after this list).

2. Make sure PhantomJS 1.9.7 is installed.

3. Make sure Python 2.7.6 is installed.
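On Debian/Ubuntu-based systems, for example, the two libraries can typically be installed with the following command (assuming apt-get is available; package names may differ on other distributions):

sudo apt-get install libffi-dev libssl-dev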

Start scraping by executing

The current version of the code uses Python 2.7. To scrape a site using the Twisted version of the library, execute:

python tornado_spider.py --url='http://www.example.com'

For JavaScript error detection (this takes considerably longer):

python tornado_spider.py --jserrors --url='http://www.example.com'

The Twisted version spawns a large number of connections to the server, resulting in conditions similar to a DoS attack, which might lead to pages returning 503 errors. In such scenarios, modify the maximum concurrent connections setting in the config.py file.
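For example, lowering the value of the setting described in the Configurations section below reduces the load on the target server (the value 5 here is only illustrative):

MAX_CONCURRENT_REQUESTS_PER_SERVER = 5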

Configurations

Certain configurations for the scraper can be set via its config.py file. The various configurations available are as follows:

Starting URL

START_URL = http://www.example.com/

Maximum concurrent requests made to the server; too high a value will choke the server

MAX_CONCURRENT_REQUESTS_PER_SERVER = 10

Idle ping count used to determine when the process terminates

IDLE_PING_COUNT = 10

Comma-separated subdomains that should be skipped

DOMAINS_TO_BE_SKIPPED = sub1.example.com,sub2.example.com
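A minimal sketch of how these settings might look inside config.py, assuming the file is plain Python and the skip list is kept as a comma-separated string (the exact types the scraper expects may differ):

# Example config.py values -- a sketch only; the exact types the scraper
# expects (strings vs. lists) may differ.

# Starting URL for the crawl
START_URL = 'http://www.example.com/'

# Maximum concurrent requests made to the server; too high a value chokes it
MAX_CONCURRENT_REQUESTS_PER_SERVER = 10

# Idle ping count used to decide when the process should terminate
IDLE_PING_COUNT = 10

# Comma-separated subdomains that should be skipped
DOMAINS_TO_BE_SKIPPED = 'sub1.example.com,sub2.example.com'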

Limitations

1. Currently the utility doesn't scrape pages obtained after logging in.

2. Handling localhost-based URLs might require some tweaking.

3. Currently only Python 2.7.6 is supported.