Domain Crawler

A command line utility to crawl a single domain. When invoked from the console, it prints the site map as a digraph in the form of a sequence of vertices and edges.
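
To make the output format concrete, here is a minimal sketch of how a site map could be held as a digraph and printed as a sequence of vertices followed by edges. The function names (`same_domain`, `add_links`, `format_digraph`) and the `vertex:` / `edge:` line layout are assumptions made for illustration; the crawler's actual data structures and print format may differ.

```python
from urllib.parse import urljoin, urlparse


def same_domain(base_url, candidate_url):
    """Return True if candidate_url has the same host (and port) as base_url."""
    return urlparse(base_url).netloc == urlparse(candidate_url).netloc


def add_links(graph, base_url, page_url, hrefs):
    """Record edges from page_url to each same-domain link found on it.

    graph maps a page URL (a vertex) to the set of URLs it links to (its edges).
    """
    graph.setdefault(page_url, set())
    for href in hrefs:
        target = urljoin(page_url, href)  # resolve relative links
        if same_domain(base_url, target):
            graph[page_url].add(target)
            graph.setdefault(target, set())
    return graph


def format_digraph(graph):
    """Render the graph as sorted vertex lines followed by sorted edge lines."""
    lines = ["vertex: " + vertex for vertex in sorted(graph)]
    for source in sorted(graph):
        for target in sorted(graph[source]):
            lines.append("edge: " + source + " -> " + target)
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical links, standing in for what a real fetch would discover.
    site = {}
    base = "http://localhost:8000/"
    add_links(site, base, base, ["/about", "http://example.org/", "contact"])
    add_links(site, base, "http://localhost:8000/about", ["/"])
    print(format_digraph(site))
```

Links to other hosts (here `http://example.org/`) are dropped, which is one simple way to keep the crawl confined to a single domain.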

### Usage

To use the crawler:

1. Clone this repository.
2. Create a virtualenv if need be.
3. Change to the repository directory, then to the `src` directory.
4. In either the global Python environment or the virtualenv, install the dependencies: `pip install -r requirements.txt`
5. Set the library path: `export PYTHONPATH=$PWD:$PYTHONPATH`
6. Start crawling a domain: `python domaincrawl/crawler_init.py --log_conf ../conf/logging.conf --url http://localhost:8000/`

Alternatively, invoke the script with the `-h` option to see the following help message:

    usage: crawler_init.py [-h] [--alt_conf_path ALT_CONF_PATH] --log_conf LOG_CONF --url URL

    Crawl a single domain

    optional arguments:
      -h, --help            show this help message and exit
      --alt_conf_path ALT_CONF_PATH
                            Path to a python config (without sections) that may be used to override config defaults
      --log_conf LOG_CONF   The location of the logging configuration file in the python logging config format
      --url URL             The base url with the domain name of the site to be crawled. (e.g. http://acme.com)
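
For readers who want to see how options like these fit together, the following is a rough sketch of an `argparse` parser that accepts the same flags as the help message above. The `build_parser` function is a hypothetical name, and the real `crawler_init.py` may define its arguments differently; only the flag names, their required/optional status, and the help strings are taken from the help output.

```python
import argparse


def build_parser():
    """Build a parser that accepts the flags listed in the help message."""
    parser = argparse.ArgumentParser(
        prog="crawler_init.py",
        description="Crawl a single domain",
    )
    # Optional: path to a config file that overrides the defaults.
    parser.add_argument(
        "--alt_conf_path",
        help="Path to a python config (without sections) that may be used "
             "to override config defaults",
    )
    # Required: shown without brackets in the usage line.
    parser.add_argument(
        "--log_conf",
        required=True,
        help="The location of the logging configuration file in the python "
             "logging config format",
    )
    parser.add_argument(
        "--url",
        required=True,
        help="The base url with the domain name of the site to be crawled. "
             "(e.g. http://acme.com)",
    )
    return parser


if __name__ == "__main__":
    # Parse the example command line from the usage steps.
    args = build_parser().parse_args(
        ["--log_conf", "../conf/logging.conf", "--url", "http://localhost:8000/"]
    )
    print(args.url, args.log_conf, args.alt_conf_path)
```

Because `--log_conf` and `--url` are marked `required=True`, omitting either one makes `argparse` print an error and exit, which matches their unbracketed appearance in the usage line.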

### Limitations and Enhancements

Refer to this page for details.
