GitHub - youngsydney/inforetrieval: Search Engine, COSC 488

BigSample
README.txt
constant.py
index.py
myHTMLParser.py
preprocess.py
query.py
queryfile.txt
run.py
stops.txt


READ_ME

Project Part 1: Pre-Processing Documents & Building Inverted Index
Sydney Young
October 9, 2015
COSC 488: Introduction to Information Retrieval

How to Run:
	Run the run.py file with Python. Its main method controls the processing and indexing of all the documents.
	Note: Before running, set the memory constraint at the top of index.py and the path to the folder where the data files are kept at the top of run.py.

The program prints the execution time for each index type to the screen. For each index type, the lexicon/posting list is stored in indexType_out_path and the term list with document frequencies is stored in indexType_terms.


term index = {term:docFreq, term:docFreq}
lexicon = {termID: {docID:tf, docID:tf}, termID: {docID:tf}} 
lexicon_positional = {termID: {docID:[pos, pos, pos]}, termID: {docID:[pos]}} 
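As an illustration of the three structures above, here is a minimal sketch in Python 3 that builds them from documents that have already been tokenized. It is not the code in index.py: the function name build_indexes, the term_ids mapping from term to termID, and the sample documents are all assumptions made for this example, and the memory constraint is ignored.

```python
from collections import defaultdict


def build_indexes(docs):
    """Build the three structures from docs = {docID: [token, token, ...]}.

    Hypothetical helper for illustration; everything is kept in memory.
    """
    term_ids = {}                            # term -> termID (assumed mapping)
    term_index = defaultdict(int)            # term -> document frequency
    lexicon = defaultdict(dict)              # termID -> {docID: tf}
    lexicon_positional = defaultdict(dict)   # termID -> {docID: [pos, ...]}

    for doc_id, tokens in docs.items():
        for pos, term in enumerate(tokens):
            # Assign the next free integer ID the first time a term is seen.
            term_id = term_ids.setdefault(term, len(term_ids))
            lexicon[term_id][doc_id] = lexicon[term_id].get(doc_id, 0) + 1
            lexicon_positional[term_id].setdefault(doc_id, []).append(pos)
        # Document frequency counts each document once per distinct term.
        for term in set(tokens):
            term_index[term] += 1

    return term_ids, dict(term_index), dict(lexicon), dict(lexicon_positional)


# Made-up sample input: two tiny documents.
docs = {1: ["black", "cat", "black"], 2: ["cat", "sat"]}
term_ids, term_index, lexicon, lexicon_positional = build_indexes(docs)
```

With this sample input, "cat" has a document frequency of 2, and the positional entry for "black" records positions 0 and 2 in document 1.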

Notes about processing:
	For the phrase index, special cases are replaced with " STOP " so that, during the phrase identification stage, phrases cannot cross over special cases.
	For hyphenated words, I merged the parts into one word (black-tie became blacktie) and stored only the combined form in the positional index.
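
The two notes above can be sketched roughly as follows in Python 3. This is an illustration under stated assumptions, not the logic in preprocess.py: the README does not define what counts as a special case, so the sketch treats any run of punctuation as one, and it assumes two-word phrases. The function names are hypothetical.

```python
import re

STOP_MARK = "STOP"


def merge_hyphens(text):
    # Join hyphenated words into one token: "black-tie" -> "blacktie".
    return re.sub(r"(?<=\w)-(?=\w)", "", text)


def mark_special_cases(text):
    # Assumption: any run of characters that is not a word character or
    # whitespace is a "special case" and becomes a " STOP " marker.
    return re.sub(r"[^\w\s]+", " " + STOP_MARK + " ", text)


def two_word_phrases(text):
    # Lowercase first so the uppercase marker cannot collide with real text.
    tokens = mark_special_cases(merge_hyphens(text.lower())).split()
    # Keep only adjacent pairs that do not touch a STOP marker, so no
    # phrase crosses over a special case.
    return [
        (a, b)
        for a, b in zip(tokens, tokens[1:])
        if STOP_MARK not in (a, b)
    ]


phrases = two_word_phrases("A black-tie event, held downtown.")
```

Here the comma becomes a STOP marker, so no phrase pairs "event" with "held".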

	To Fix -- .., .,.0, 1$.4, 1$.6