# zhangcshcn/wse-basic-search-engine
Author: Chen Zhang, CIMS, NYU
Email: chen.zhang@nyu.edu

This is a basic search engine with query enhancement, built for Programming Assignment 3 of CSCI-GA.2580-001 Web Search Engine, taught by Prof. Ernest Davis at New York University.

### Requirements

- Python 2.7 ( Preferably. Some Python standard libraries are imported in the 2.7 fashion. If you insist on 3.x, feel free to modify the import section in ./source/crawler.py. )
- numpy ( Computes the pseudo-relevance score. )
- Django ( The website is deployed using Django. )
- PyLucene 3.x ( Preferably. The website was developed with 3.5 and 3.6. In 6.2, the functions and classes are imported in a very different fashion. If you insist on 6.2, feel free to modify the .py files in ./lucene. )
- BeautifulSoup4 ( The HTML parser used in this project. )
- lxml ( The parser BeautifulSoup4 depends on. )

### Instructions

Simply run

> $ python manage.py runserver

You can then access the website at

> localhost:8000

The instructions on the website should be straightforward.

### Potential Issues

Due to my lack of familiarity with PyLucene, I tokenized the documents with a simple regular expression:

> [0-9a-zA-Z]+

This may cause some inconsistency with the PyLucene indexer. Moreover, the HTML parser used here succeeds in removing all the tags inside `<body></body>`, but fails to remove the content between `<script></script>`, so JavaScript code is included in the index. This may cause some issues; for example, the token "rlq" may appear among the high-ranking tokens from time to time.

### Directory

For your interest, the files that differ from those in Programming Assignment 1 are

- ./source/stopwords.py
- ./lucene/SearchFiles.py

Contents:

- ./source
  - stopwords.py

    According to my investigation, human behavior does not perfectly follow the Zipf distribution. The frequently used words and the less frequently used ones can easily be separated by truncating the word frequency distribution.
    This script reads all cached documents and generates a set of stopwords based on the idea mentioned above. There is a web page illustrating the result.

  - Build.py

    Defines BuildSearchEngine(start, number, domain). It uses ./crawler.py and ../lucene/IndexFiles.py to download pages as instructed and index the text ( and, occasionally, JavaScript code ) in the body of the downloaded pages.

  - crawler.py

    Built on the crawler from http://www.netinstructions.com/how-to-make-a-web-crawler-in-under-50-lines-of-python-code/ with a few features added, namely:

    1. Robots Exclusion Standard. Implemented using the robotparser module from the Python standard library.
    2. URL normalization. There are two kinds of URL normalization in the crawler. One ensures the website is visited with the right protocol. The other truncates the URL ( removing the leading "http://" or "https://" and the trailing "/" or "/index.html" ) so that pages with effectively the same URL are not visited twice.
    3. Cache and index. The text in the body of each page ( including JavaScript code ) is cached with the help of BeautifulSoup4 and indexed using PyLucene.

- ./lucene
  - ./index

    The index files generated by PyLucene.

  - IndexFiles.py

    Essentially the sample file shipped with the PyLucene package.

  - SearchFiles.py

    Essentially the sample file shipped with the PyLucene package. The query enhancement method is also implemented in this script.

- cache

  Generated by the crawler. The text of the cached pages is stored here.

- static

  Static files for illustration purposes.

- *

  Django-related files.
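The stopword idea described above — tokenize with the project's simple regular expression, then truncate the word frequency distribution so the most frequent tokens become stopwords — can be sketched as follows. This is a minimal illustration in Python 3 syntax, not the actual ./source/stopwords.py; the function name, the in-memory corpus, and the cutoff value are all made up for the example.

```python
import re
from collections import Counter

# The same simple regular expression used to tokenize the cached documents.
TOKEN_RE = re.compile(r"[0-9a-zA-Z]+")

def build_stopwords(documents, freq_cutoff=100):
    """Collect tokens whose total corpus frequency exceeds freq_cutoff.

    The cutoff truncates the word frequency distribution: tokens above it
    are the "frequently used" words and are treated as stopwords.
    """
    counts = Counter()
    for text in documents:
        counts.update(token.lower() for token in TOKEN_RE.findall(text))
    return {word for word, freq in counts.items() if freq > freq_cutoff}

# Toy corpus: "the" occurs three times, every other token only once,
# so a cutoff of 2 separates it out.
docs = ["the quick brown fox", "the lazy dog", "the search engine"]
stopwords = build_stopwords(docs, freq_cutoff=2)  # {'the'}
```

In the real script the documents would be read from the ./cache directory, and the cutoff would be chosen by inspecting the frequency distribution rather than fixed in advance.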