
Pre-computation: createEngine.py
- crawler.py
- indexTest.py

crawler.py: The crawler crawls pages within wikipedia.org, starting at the root https://en.wikipedia.org/, and stops once 200 pages have been crawled and stored. It skips pages that the robots exclusion rules (robots.txt) forbid and pages outside the domain. It does not store a link that has already been stored.
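The crawl policy described above can be sketched as follows. This is a minimal illustration in Python 3 using only the standard library, not the code in crawler.py: the names `LinkParser`, `in_domain`, and `crawl`, the breadth-first order, and the injected `fetch` callable are all assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse
from urllib.robotparser import RobotFileParser


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def in_domain(url, domain="wikipedia.org"):
    """True if the URL's host is the domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)


def crawl(root, fetch, robots, max_pages=200, domain="wikipedia.org"):
    """Breadth-first crawl from `root`.

    `fetch(url)` is assumed to return the page's HTML as a string, or None
    on failure. `robots` is a RobotFileParser already loaded with the
    site's robots.txt rules. Returns a dict mapping URL to HTML.
    """
    seen = {root}          # every link queued so far, so none is stored twice
    queue = deque([root])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        # Skip pages outside the domain or forbidden by robots.txt.
        if not in_domain(url, domain) or not robots.can_fetch("*", url):
            continue
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            link, _ = urldefrag(urljoin(url, href))
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

In a real run, `robots` could be prepared with `set_url(...)` and `read()` on the site's robots.txt, and `fetch` could wrap `urllib.request.urlopen`; passing `fetch` in as a parameter keeps the sketch testable without network access.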

indexTest.py: The index test first opens every HTML file stored in html_files/, removes the HTML markup, parses the documents, removes stop words, and returns a list of words for each document. The indexer then indexes all 200 pages by indexing each word in the cleaned files (HTML markup and stop words stripped, text parsed and broken into words).
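The cleaning and indexing steps above can be sketched as follows. This is an illustration only, not the code in indexTest.py: it assumes a plain in-memory inverted index, a tiny stand-in stop-word set in place of stop_words.txt, and hypothetical names `TextExtractor`, `clean_words`, and `build_index`.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Stand-in for the contents of stop_words.txt (assumed, not the real list).
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


class TextExtractor(HTMLParser):
    """Keeps text nodes and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def clean_words(html, stop_words=STOP_WORDS):
    """Strip markup, lowercase, split into words, and drop stop words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    return [w for w in re.findall(r"[a-z0-9]+", text) if w not in stop_words]


def build_index(docs):
    """Map each word to {document id: number of occurrences}.

    `docs` maps a document id (for example a file name) to its raw HTML.
    """
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, html in docs.items():
        for word in clean_words(html):
            index[word][doc_id] += 1
    return index
```

To index the crawled pages, `docs` could be built by reading each file under html_files/ into a dict keyed by file name before calling `build_index(docs)`.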

__pycache__
html_files
index
nltk-3.0.0
test_folder
README.md
calcDocDistance.py
calcDocDistance.pyc
calculateCoverageErrors.py
calculateCoverageErrors.pyc
calculateMaxCoverageScore.py
calculateMaxCoverageScore.pyc
cleanUpHTML.py
crawler.py
createEngine.py
distances.p
doc_urls.p
findMaxDistance.py
findMaxDistance.pyc
findMaxScore.py
findMaxScore.pyc
findMinDistance.py
findMinDistance.pyc
findMinPrimeDistance.py
findMinPrimeDistance.pyc
frontend.py
functionScore.py
functionScore.pyc
getWordsForScoring.py
getWordsForScoring.pyc
getWordsFromText.py
getWordsFromText.pyc
htmlparser.py
htmlparser.pyc
index.html
indexTest.py
indexfiles.py
infexfiles.py
innerProduct.py
innerProduct.pyc
insertionSort.py
insertionSort.pyc
luceneRetriever.py
luceneRetriever.pyc
maxCoverage.py
maxCoverage.pyc
maxMinDispersion.py
maxMinDispersion.pyc
mergeSort.py
mergeSort.pyc
new_urls.p
output.file
readFile.py
readFile.pyc
removeHTML.py
removeStopWords.py
removeStopWords.pyc
retrieveDocs.py
retriever.py
run.py
stop_words.txt
stripHTML.py
stripHTML.pyc
testRetriever.py

mefagan/artsearch
