Skip to content

tvs/creepy

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Creepy

DESCRIPTION

Creepy is a simple search engine. It will be developed in four stages: (i) web crawler; (ii) data cleanser; (iii) indexer; and (iv) query processor. Each of the stages are defined by components, some of which are accessible through the Creepy interface, some of which are standalone components.

RUNNING

$ creepy.py [options] <arguments>

Creepy Search Engine

Options:

--version             show program's version number and exit
-h, --help            show this help message and exit
-v, --verbose         Verbose Output [default: False]
-d, --debug           Debug Mode [default: False]
-S STORAGE, --storage=STORAGE
                      Data Storage location [default: ./storage/]
-C, --clean_start     Start clean by deleting existing storage location
                      [default: False]

Web Crawler:
  Crawling the Web: Takes set of seed urls and start crawling the web
  and stores the pages in the storage location.

  -c <seeds | seedfile>, --crawl=<seeds | seedfile>
                      Crawl the web
  -T THRESHOLD, --page_threshold=THRESHOLD
                      Max. number of pages to crawl [default: 2500]
  -N NUM_THREADS, --num_threads=NUM_THREADS
                      Number of threads to use [default: 1]

Data Cleanser:
  Data Preparation/Cleansing: Reads pages stored by the crawler and
  perform requested actions.

  -s, --strip_tags    HTML Tag Stripping
  -t, --tokenize      Tokenize the document

XML Document Graph

$ rucksack.rb [options] <pid_map file>

Options:

-o, --output FILE                Output resultant XML to FILE
-i, --indent N                   Indent N spaces
                                  Default: 2

-a, --append DIR                 Append DIR to the pid_map path
    --delimiter DEL              pid_map uses a delimiter DEL to separate URL and ID
                                  Default: "=>"

XML Validation

$ validate.rb <dtd_file> <xml_file>

REQUIRES

  • Python 2.6.1

  • Ruby 1.8.7 or higher

  • Gem: hpricot

  • Gem: builder

  • Gem: libxml-ruby

  • libxml2

LOCATION

github.com/doggles/creepy

AUTHORS

Travis Hall, trvs.hll@gmail.com

Brittany Thompson, miller317@gmail.com

Bhadresh Patel, bhadresh@wsu.edu

About

Python Web Crawler for CS453

Resources

Stars

Watchers

Forks

Packages

No packages published