CS6200Project

Information Retrieval Project

The input files are assumed to be in the relative folder ../data. This can easily be changed using the constants at the beginning of the Python files.

The following retrieval systems are implemented in this project, identified by system number:

1. BM25
2. TFIDF
3. Lucene (implemented in Java, not in the Python scripts)
4. Query expansion (pseudo relevance) + BM25
5. Query expansion (synonyms) + BM25
6. BM25 + stopping
7. BM25 + stemming
8. BM25 + query expansion + stopping

Model 3 uses Lucene and is implemented in Java. Models 1, 2, 4, 5, 6, and 7 are implemented in Python and are run as a sequence of commands, one per step, as described below.

Lucene dependencies: lucene-queryparser-4.7.2.jar, lucene-core-4.7.2.jar, lucene-analyzers-common-4.7.2.jar

1) FileReadWrite.java - Reads CACM_QUERY.txt, which must be present in the same folder.

2) Lucene.java - Accepts the following four arguments. Note: actual paths are needed for all four.

indexLocation - Location where the Lucene index files will be generated. Example: E:\Lucene\Luceneindex
dataLocation - Location of the data files, which are placed in the ../data/cacm folder. Example: E:\NEUSUBJECTS\IR_Project\data\cacm
queryFile - Location of the CACM query text file, which is inside the CS6200Project folder with the Python files. Example: E:\NEUSUBJECTS\IR_Project\CS6200Project\CACM_QUERY.TXT
resultFile - Location of the output file. It should be in the same folder as queryFile. Example: E:\NEUSUBJECTS\IR_Project\CS6200Project\model3_queries_results.txt

After Lucene.java has finished, run Evaluation.py with --sys=3 to get query-by-query results in Excel.

Python programs

Dependencies:

PyDictionary: used for dictionary lookups and synonyms. The computer must be connected to the internet, because the library makes web requests to thesaurus.com.

The following Python programs have to be run in order:

1) Indexing.py

To create the index file, run the following command: python Indexing.py

The command takes no arguments. data.txt, which contains the raw crawled documents, is presumed to be in the current working directory.

Running Indexing.py creates an index file called unigramIndex.txt and a mapping file called docIDMapping.txt in the same directory.
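To illustrate what a unigram index and a document ID mapping contain, here is a minimal in-memory sketch. It is not the code in Indexing.py: the function name `build_unigram_index`, the whitespace tokenization, and the dictionary layout are all assumptions made for illustration, and the on-disk format of unigramIndex.txt and docIDMapping.txt is not described in this README.

```python
from collections import Counter, defaultdict


def build_unigram_index(docs):
    """Build a term -> {doc_id: term frequency} index (hypothetical helper).

    docs maps a document name to its raw text. Returns the index and a
    doc_id -> document name mapping.
    """
    doc_id_mapping = {}
    index = defaultdict(dict)
    # Assign numeric IDs in sorted name order so the result is deterministic.
    for doc_id, name in enumerate(sorted(docs), start=1):
        doc_id_mapping[doc_id] = name
        # Assumption: lowercase and split on whitespace; real tokenization may differ.
        term_counts = Counter(docs[name].lower().split())
        for term, tf in term_counts.items():
            index[term][doc_id] = tf
    return dict(index), doc_id_mapping
```

For example, indexing two small documents named CACM-0001 ("algebraic language") and CACM-0002 ("language design language") gives the term "language" a posting with frequency 1 in document 1 and frequency 2 in document 2.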

2) RetrievalModels.py

Takes the system number as an argument and runs the corresponding retrieval model. It reads the queries and the index and produces an output file with the ranking of documents for each query (TREC eval format).
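As a rough illustration of the output format, the sketch below formats one query's ranking in the commonly used TREC run layout: query ID, the literal Q0, document ID, rank, score, and system name. The function name `format_trec_run`, the 100-document cutoff, and the four-decimal score formatting are assumptions, not details taken from RetrievalModels.py.

```python
def format_trec_run(query_id, scored_docs, system_name, limit=100):
    """Return TREC-style result lines for one query (hypothetical helper).

    scored_docs maps a document ID to its retrieval score.
    """
    # Rank documents by descending score and keep the top `limit`.
    ranked = sorted(scored_docs.items(), key=lambda item: item[1], reverse=True)[:limit]
    return [
        f"{query_id} Q0 {doc_id} {rank} {score:.4f} {system_name}"
        for rank, (doc_id, score) in enumerate(ranked, start=1)
    ]
```

Writing the returned lines to a file, one per line, gives a run file that standard TREC evaluation tools can typically read.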

To run a retrieval model, for example system 1: python RetrievalModels.py --sys=1

See the system numbers above for all the system options.

When running system 5, some error messages might show up. They come from PyDictionary and do not affect the system.
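Since most of the systems build on BM25, a short sketch of BM25 scoring may help. This is one common simplified form of the formula, without the query-term-frequency and relevance-feedback factors, and with a non-negative IDF variant; it is not necessarily the variant implemented in bm25.py. The function name `bm25_score` and the parameter values k1=1.2 and b=0.75 are assumptions, and the index is assumed to map each term to a dictionary of document ID to term frequency.

```python
import math


def bm25_score(query_terms, doc_id, index, doc_lengths, k1=1.2, b=0.75):
    """Score one document for a query with a simplified BM25 (illustrative only).

    index: term -> {doc_id: term frequency}
    doc_lengths: doc_id -> number of tokens in the document
    """
    n_docs = len(doc_lengths)
    avg_len = sum(doc_lengths.values()) / n_docs
    score = 0.0
    for term in query_terms:
        postings = index.get(term, {})
        tf = postings.get(doc_id, 0)
        if tf == 0:
            continue  # the term does not occur in this document
        df = len(postings)  # number of documents containing the term
        # Non-negative IDF variant (assumption; other variants exist).
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        # Length normalization: longer-than-average documents are penalized.
        norm = k1 * (1 - b + b * doc_lengths[doc_id] / avg_len)
        score += idf * tf * (k1 + 1) / (tf + norm)
    return score
```

Scoring every document that contains at least one query term and sorting by descending score yields the ranking for that query.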

3) Evaluation.py

Takes the system number as an argument and runs the evaluation on the corresponding retrieval model.

To run the evaluation, for example on system 1: python Evaluation.py --sys=1

Relevance judgments are needed to perform the evaluation. System 7 (BM25 + stemming) has no relevance judgments for its queries, so no evaluation metrics can be calculated for it.

The script outputs MAP, MRR, P@5, and P@20. Precision and recall tables are saved in a file named with the system number.
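The metrics above follow standard definitions, which can be sketched as below. This is an illustration under assumed names (`precision_at_k`, `average_precision`, `reciprocal_rank`, `mean_over_queries`), not the code in Evaluation.py. Each function takes a ranked list of document IDs and the set of relevant document IDs for one query; MAP and MRR are the means of average precision and reciprocal rank over all evaluated queries.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top k ranked documents that are relevant (P@k)."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    # Divide by the number of relevant documents, retrieved or not.
    return total / len(relevant) if relevant else 0.0


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_over_queries(metric, runs, judgments):
    """Average a per-query metric over queries that have relevance judgments."""
    scored = [metric(runs[q], judgments[q]) for q in runs if q in judgments]
    return sum(scored) / len(scored) if scored else 0.0
```

With these helpers, MAP would be `mean_over_queries(average_precision, runs, judgments)` and MRR would be `mean_over_queries(reciprocal_rank, runs, judgments)`, where `runs` maps each query ID to its ranked list and `judgments` maps each query ID to its set of relevant documents.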
