Creepy is a simple search engine. It will be developed in four stages: (i) web crawler; (ii) data cleanser; (iii) indexer; and (iv) query processor. Each stage is made up of components, some accessible through the Creepy interface and some standalone.
$ creepy.py [options] <arguments>
Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -v, --verbose         Verbose Output [default: False]
  -d, --debug           Debug Mode [default: False]
  -S STORAGE, --storage=STORAGE
                        Data Storage location [default: ./storage/]
  -C, --clean_start     Start clean by deleting existing storage location
                        [default: False]

  Web Crawler:
    Crawling the Web: Takes set of seed urls and start crawling the web
    and stores the pages in the storage location.

    -c <seeds | seedfile>, --crawl=<seeds | seedfile>
                        Crawl the web
    -T THRESHOLD, --page_threshold=THRESHOLD
                        Max. number of pages to crawl [default: 2500]
    -N NUM_THREADS, --num_threads=NUM_THREADS
                        Number of threads to use [default: 1]

  Data Cleanser:
    Data Preparation/Cleansing: Reads pages stored by the crawler and
    perform requested actions.

    -s, --strip_tags    HTML Tag Stripping
    -t, --tokenize      Tokenize the document
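The options listed above can be reproduced with Python's standard optparse module, whose default help text matches the wording shown. The following is only a minimal sketch of how such a parser might be declared, not Creepy's actual source: the function name build_parser, the version string, the dest names, and the seeds.txt file in the usage lines are all assumptions made for illustration.

```python
from optparse import OptionGroup, OptionParser


def build_parser():
    """Return a parser that mirrors the documented creepy.py options (illustrative only)."""
    # The version string is a placeholder; the real version is not stated in this README.
    parser = OptionParser(usage="%prog [options] <arguments>", version="%prog 0.0")

    # General options. %default is expanded by optparse in the help output.
    parser.add_option("-v", "--verbose", action="store_true", dest="verbose",
                      default=False, help="Verbose Output [default: %default]")
    parser.add_option("-d", "--debug", action="store_true", dest="debug",
                      default=False, help="Debug Mode [default: %default]")
    parser.add_option("-S", "--storage", dest="storage", metavar="STORAGE",
                      default="./storage/",
                      help="Data Storage location [default: %default]")
    parser.add_option("-C", "--clean_start", action="store_true", dest="clean_start",
                      default=False,
                      help="Start clean by deleting existing storage location "
                           "[default: %default]")

    # Options for the first stage: the web crawler.
    crawler = OptionGroup(parser, "Web Crawler",
                          "Crawling the Web: takes a set of seed URLs, crawls the web, "
                          "and stores the pages in the storage location.")
    crawler.add_option("-c", "--crawl", dest="crawl", metavar="<seeds | seedfile>",
                       help="Crawl the web")
    crawler.add_option("-T", "--page_threshold", dest="page_threshold", type="int",
                       metavar="THRESHOLD", default=2500,
                       help="Max. number of pages to crawl [default: %default]")
    crawler.add_option("-N", "--num_threads", dest="num_threads", type="int",
                       metavar="NUM_THREADS", default=1,
                       help="Number of threads to use [default: %default]")
    parser.add_option_group(crawler)

    # Options for the second stage: the data cleanser.
    cleanser = OptionGroup(parser, "Data Cleanser",
                           "Data Preparation/Cleansing: reads pages stored by the "
                           "crawler and performs the requested actions.")
    cleanser.add_option("-s", "--strip_tags", action="store_true", dest="strip_tags",
                        default=False, help="HTML Tag Stripping")
    cleanser.add_option("-t", "--tokenize", action="store_true", dest="tokenize",
                        default=False, help="Tokenize the document")
    parser.add_option_group(cleanser)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
# "seeds.txt" is a hypothetical seed file name.
options, args = build_parser().parse_args(["-c", "seeds.txt", "-T", "100", "-s"])
print(options.crawl, options.page_threshold, options.strip_tags)
```

Grouping the crawler and cleanser flags with OptionGroup is what produces the separate "Web Crawler:" and "Data Cleanser:" sections in the help output.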
$ rucksack.rb [options] <pid_map file>
Options:
  -o, --output FILE     Output resultant XML to FILE
  -i, --indent N        Indent N spaces
                        Default: 2
  -a, --append DIR      Append DIR to the pid_map path
      --delimiter DEL   pid_map uses a delimiter DEL to separate URL and ID
                        Default: "=>"
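To illustrate what the --delimiter option implies about the pid_map file, here is a small sketch of reading such a file. It is written in Python only to match creepy.py; rucksack.rb itself is a Ruby script and its real parsing code is not shown in this README. The sketch assumes one entry per line with the URL first and the ID second, which the option description suggests but does not confirm. The function name parse_pid_map and the example URLs are hypothetical.

```python
def parse_pid_map(lines, delimiter="=>"):
    """Map each URL to its ID, assuming one 'URL<delimiter>ID' entry per line."""
    mapping = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        # Split on the last delimiter so a delimiter inside the URL does not break the entry.
        url, sep, page_id = line.rpartition(delimiter)
        if not sep:
            raise ValueError("missing delimiter %r in line: %r" % (delimiter, line))
        mapping[url.strip()] = page_id.strip()
    return mapping


# Hypothetical sample entries using the default "=>" delimiter.
sample = [
    "http://example.com/ => 1",
    "http://example.com/about => 2",
]
print(parse_pid_map(sample))
```

The function accepts any iterable of lines, so an open file object can be passed directly in place of the sample list.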
$ validate.rb <dtd_file> <xml_file>
- Python 2.6.1
- Ruby 1.8.7 or higher
- Gem: hpricot
- Gem: builder
- Gem: libxml-ruby
- libxml2
Travis Hall, trvs.hll@gmail.com
Brittany Thompson, miller317@gmail.com
Bhadresh Patel, bhadresh@wsu.edu