# zhangcshcn/wse-basic-search-engine
Author: Chen Zhang, CIMS, NYU
Email: chen.zhang@nyu.edu

This is a basic search engine with query enhancement, built for Programming Assignment 3 of CSCI-GA.2580-001 Web Search Engine, taught by Prof. Ernest Davis at New York University.

### Requirements

- Python 2.7 ( Preferably. Some Python standard libraries are imported in the 2.7 fashion. If you insist on 3.x, feel free to modify the import section in ./source/crawler.py. )
- numpy ( Computes the pseudo-relevance score. )
- Django ( The website is deployed using Django. )
- PyLucene 3.x ( Preferably. The website was developed with 3.5 and 3.6. In 6.2, the functions and classes are imported in a very different fashion. If you insist on 6.2, feel free to modify the .py files in ./lucene. )
- BeautifulSoup4 ( The HTML parser used in this project. )
- lxml ( The parser BeautifulSoup4 depends on. )

### Instructions

Simply run

> $ python manage.py runserver

You can then access the website at

> localhost:8000

The instructions on the website should be straightforward.

### Potential Issues

Due to my lack of familiarity with PyLucene, I tokenized the documents with a simple regular expression:

> [0-9a-zA-Z]+

This may cause some inconsistency with the PyLucene indexer. Moreover, the HTML parser used here succeeds in removing all the tags inside `<body></body>`, but fails to remove the content between `<script></script>`, so JavaScript code is included in the index. This may cause some issues; for example, the token "rlq" may appear among the high-ranking tokens from time to time.

### Directory

For your interest, the files that differ from those in Programming Assignment 1 are

- ./source/stopwords.py
- ./lucene/SearchFiles.py

Contents:

- ./source
  - stopwords.py

    According to my investigation, human behavior does not perfectly follow the Zipf distribution. The frequently used words and the less frequently used ones can easily be separated by truncating the word frequency distribution.
    This script reads all cached documents and generates a set of stopwords based on the idea mentioned above. There is a web page illustrating the result.

  - Build.py

    Defines BuildSearchEngine(start, number, domain). It uses ./crawler.py and ../lucene/IndexFiles.py to download pages as instructed and index the text ( and, occasionally, JavaScript code ) in the body of the downloaded pages.

  - crawler.py

    Built on the crawler from http://www.netinstructions.com/how-to-make-a-web-crawler-in-under-50-lines-of-python-code/ with a few features added, namely:

    1. Robots Exclusion Standard. Implemented using the robotparser module from the Python standard library.
    2. URL normalization. There are two kinds of URL normalization in the crawler. One ensures the website is visited with the right protocol. The other truncates the URL ( removing the leading "http://" or "https://" and the trailing "/" or "/index.html" ) so that pages with effectively the same URL are not visited twice.
    3. Cache and index. The text in the body of each page ( including JavaScript code ) is cached with the help of BeautifulSoup4 and indexed using PyLucene.

- ./lucene
  - ./index

    The index files generated by PyLucene.

  - IndexFiles.py

    Essentially the sample file shipped with the PyLucene package.

  - SearchFiles.py

    Essentially the sample file shipped with the PyLucene package. The query enhancement method is also implemented in this script.

- cache

  Generated by the crawler. The text of the cached pages is stored here.

- static

  Static files for illustration purposes.

- *

  Django-related files.
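The stopword idea described above — tokenize with the project's simple regular expression, then truncate the word frequency distribution so the most frequent tokens become stopwords — can be sketched as follows. This is a minimal illustration in Python 3 syntax, not the actual ./source/stopwords.py; the function name, the in-memory corpus, and the cutoff value are all made up for the example.

```python
import re
from collections import Counter

# The same simple regular expression used to tokenize the cached documents.
TOKEN_RE = re.compile(r"[0-9a-zA-Z]+")

def build_stopwords(documents, freq_cutoff=100):
    """Collect tokens whose total corpus frequency exceeds freq_cutoff.

    The cutoff truncates the word frequency distribution: tokens above it
    are the "frequently used" words and are treated as stopwords.
    """
    counts = Counter()
    for text in documents:
        counts.update(token.lower() for token in TOKEN_RE.findall(text))
    return {word for word, freq in counts.items() if freq > freq_cutoff}

# Toy corpus: "the" occurs three times, every other token only once,
# so a cutoff of 2 separates it out.
docs = ["the quick brown fox", "the lazy dog", "the search engine"]
stopwords = build_stopwords(docs, freq_cutoff=2)  # {'the'}
```

In the real script the documents would be read from the ./cache directory, and the cutoff would be chosen by inspecting the frequency distribution rather than fixed in advance.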