zhangcshcn/wse-basic-search-engine

Author: Chen Zhang, with CIMS NYU
Email:  chen.zhang@nyu.edu

This is a basic search engine with query enhancement, written for 
Programming Assignment 3 of CSCI-GA.2580-001 Web Search Engine, 
taught by Prof. Ernest Davis at New York University. 

### Requirement  
    
- Python 2.7     ( Preferred. Some Python standard libraries are imported 
                    in the 2.7 fashion. If you insist on 3.x, feel free to modify the 
                    import part in ./source/crawler.py. )
- numpy          ( Computes the pseudo-relevance score. )
- Django         ( The website is deployed using Django. )
- Pylucene 3.x   ( Preferred. The website was developed using 3.5 and 3.6. 
                    In 6.2, the functions and classes are imported in a very different 
                    fashion. If you insist on 6.2, feel free to modify the .py files in 
                    ./lucene. )
- BeautifulSoup4 ( The HTML parser used in this project. )
- lxml           ( The parser that BeautifulSoup4 relies on here. )

### Instructions  
Simply run 
> $ python manage.py runserver

You can then access the website at 
> localhost:8000

The instructions on the website should be straightforward. 

### Potential Issues  
Due to my lack of familiarity with Pylucene, I tokenized the documents with 
my own simple regular expression. 
> [0-9a-zA-Z]+  

This may cause some inconsistency with the Pylucene indexer. 
Moreover, the HTML parser used here succeeded in removing all the tags in the 
<body></body>, but failed to remove the content between <script></script>. As a 
result, JavaScript code was included in the index. This may cause some issues. 
For example, the token "rlq" may appear among the high-ranking tokens from time to time. 
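
The two issues above can be illustrated with a minimal sketch that uses only the Python standard library. It is not the project's code: the project parses pages with BeautifulSoup4, and the names `tokenize`, `BodyTextExtractor`, and `visible_text` are hypothetical. The sketch applies the `[0-9a-zA-Z]+` pattern and skips text inside script (and style) elements, so JavaScript identifiers such as "rlq" stay out of the token stream.

```python
import re

try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser   # Python 2.7

TOKEN_RE = re.compile(r"[0-9a-zA-Z]+")  # the pattern quoted above


def tokenize(text):
    """Return the alphanumeric runs in text, in order."""
    return TOKEN_RE.findall(text)


class BodyTextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = ("script", "style")

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self.chunks.append(data)


def visible_text(html):
    """Return the page text with script and style content left out."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = "<body><p>Web search, 2016!</p><script>var rlq = [];</script></body>"
print(tokenize(visible_text(page)))  # ['Web', 'search', '2016']
```

With BeautifulSoup4, the comparable step would be removing the script elements from the parsed tree before extracting text.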


### Directory  

For reference, the files that differ from those in Programming Assignment 1 are 

- ./source/stopwords.py 
- ./lucene/SearchFiles.py
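
Since stopwords.py is one of the new files, here is a minimal sketch of the frequency-truncation idea described in its entry in the listing below. The README does not say where the distribution is truncated, so the `top_fraction` cutoff is an assumption, as are the names `build_stopwords` and `documents`. The actual script reads the cached documents rather than an in-memory list.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[0-9a-zA-Z]+")


def build_stopwords(documents, top_fraction=0.2):
    """Return the most frequent tokens as a stopword set.

    top_fraction is an assumed cutoff: the share of the distinct
    vocabulary, ranked by frequency, that is treated as stopwords.
    """
    counts = Counter()
    for text in documents:
        counts.update(token.lower() for token in TOKEN_RE.findall(text))
    # Truncate the ranked frequency list at the cutoff (at least one word).
    keep = max(1, int(len(counts) * top_fraction))
    return {word for word, _ in counts.most_common(keep)}


documents = [
    "the crawler fetches the page",
    "the indexer reads the cache",
    "the search page ranks the results",
]
print(sorted(build_stopwords(documents)))  # ['page', 'the']
```

A fixed fraction is only one way to place the cut. A threshold on raw counts, or a cut at a visible gap in the ranked frequencies, would fit the same description.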

    - ./source  
        - stopwords.py 
            According to my investigations, human behavior does not perfectly follow a Zipf distribution. 
            The frequently used words and the less frequently used ones can be easily separated by truncating 
            the word frequency distribution. This script reads all cached documents and generates a set of 
            stopwords based on this idea. There will be a web page illustrating the result. 
        - Build.py  
            Defines BuildSearchEngine(start,number,domain). It uses ./crawler.py 
            and ../lucene/IndexFiles.py to download pages as instructed and 
            index the text ( and occasionally JavaScript code ) in the body 
            of the downloaded pages. 
        - crawler.py  
            It is based on the crawler in  
            http://www.netinstructions.com/how-to-make-a-web-crawler-in-under-50-lines-of-python-code/
            A few features are added to the basic crawler, 
            namely: 
                1. Robots Exclusion Standard. 
                    It is implemented using the robotparser from the Python 
                    Standard Library. 
                2. URL normalization. 
                    There are two kinds of URL normalization in the crawler. 
                    One is for visiting the website with the right protocol. 
                    The other is for truncating the URL ( removing the "http://" 
                    or "https://" at the front, and "/" or "/index.html" at 
                    the end ) so that websites with actually the same URL 
                    will not be visited again. 
                3. Cache and index.
                    The text in the body of websites ( including JavaScript code )
                    is cached with the help of BeautifulSoup4, and indexed using 
                    Pylucene. 
    - ./lucene
        - ./index
            This is the index generated by Pylucene. 
        - IndexFiles.py
            It is basically the sample file shipped with the Pylucene package. 
        - SearchFiles.py
            It is basically the sample file shipped with the Pylucene package. 
            This script also implements the query enhancement method. 
    - cache
        Generated by the crawler. The text of the cached pages is stored here. 
    - static
        Static files for illustration purposes. 
    - *
        Django related files. 
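
The crawler features listed above (URL truncation for duplicate detection and the robots.txt check) can be sketched as follows. The function names, the sample rules, and the example.com URLs are hypothetical, and the real crawler.py may differ. The rules are parsed from an in-memory list so the sketch runs without network access.

```python
try:
    from urllib.robotparser import RobotFileParser  # Python 3
except ImportError:
    from robotparser import RobotFileParser         # Python 2.7


def truncate_url(url):
    """Strip the scheme and a trailing "/" or "/index.html" so that
    equivalent URLs map to the same key."""
    for prefix in ("http://", "https://"):
        if url.startswith(prefix):
            url = url[len(prefix):]
            break
    for suffix in ("/index.html", "/"):
        if url.endswith(suffix):
            url = url[:-len(suffix)]
            break
    return url


def allowed(rules, url, agent="*"):
    """Check url against robots.txt rules given as a list of lines."""
    parser = RobotFileParser()
    parser.parse(rules)
    return parser.can_fetch(agent, url)


rules = ["User-agent: *", "Disallow: /private/"]
candidates = [
    "http://example.com/",
    "https://example.com/index.html",     # same page as the first URL
    "http://example.com/private/a.html",  # excluded by the rules
]

visited = set()
for url in candidates:
    key = truncate_url(url)
    if key in visited or not allowed(rules, url):
        continue
    visited.add(key)

print(sorted(visited))  # ['example.com']
```

A real crawl would load the rules from each site's robots.txt (for example with `set_url` and `read`) instead of a hard-coded list.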
