Skip to content

planBrk/domaincrawler

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Domain Crawler

A command line utility to crawl a single domain. When invoked from the console, it prints the site map as a digraph in the form of a sequence of vertices and edges.

###Usage: To use the crawler:

  • Clone this repository:
  • Create a virtualenv if need be
  • Change to the directory. Then change to the src directory.
  • Either in the global python environment or in the virtual env, run the following command : pip install -r requirements.txt
  • set the library path : export PYTHONPATH=$PWD:$PYTHONPATH
  • Run the following command to start crawling a domain : python domaincrawl/crawler_init.py --log_conf ../conf/logging.conf --url http://localhost:8000/
  • Alternatively, invoke the script with the -h option to see the following help message:

usage: crawler_init.py [-h] [--alt_conf_path ALT_CONF_PATH] --log_conf LOG_CONF --url URL

Crawl a single domain

optional arguments:

-h, --help show this help message and exit

--alt_conf_path ALT_CONF_PATH Path to a python config (without sections) that may be used to override config defaults

--log_conf LOG_CONF The location of the logging configuration file in the python logging config format

--url URL The base url with the domain name of the site to be crawled. (e.g. http://acme.com)

###Limitations and Enhancements: Refer to this page for details.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages