Information-Retrieval Project 3: Query Reduction / Expansion

Design Overview

My project’s design is broadly the same as it was in prior iterations, but I have outlined the main changes in the algorithm below. A full outline of the indexing design can be found in Indexing Project Report.pdf and more details on query processing decisions may be found in Query Processing Project 2 Report.pdf, so I will spend no further time on it here. The “processQueries” script initializes the relevant section of this project, as it includes both query processing and query expansion/reduction. This script runs a sequential process that does the following (new parts are bolded):

Extract queries from the file at the path specified by ENV.QUERY_SRC.
Load the necessary indexes to memory to be stored as an Index object (contains posting list, doc list, and lexicon). In the case of standard processing, this is the index specified by ENV.QUERY_PROCESSING_INDEX. In the case of Conditional, the file system will load the phrase index and the positional index as well as a backup index specified by ENV.BACKUP_INDEX. Please note that conditional processing does not support query expansion or reduction.
Queries are preprocessed one at a time, using the same preprocessing methods as documents. To ensure consistency, Queries are stored as Query objects that inherit from a parent TextObject class (which documents also inherit from)
If ENV.USE_QUERY_REDUCTION is set to “True” and ENV.QUERY_TYPE is set to “NARRATIVE,” the system will employ query reduction. This is done using a document retrieval threshold rather than a strict percentage of the query terms. The document threshold is specified by ENV.QUERY_THRESHOLD_DOCS_RETRIEVED. Terms are added to the query iteratively, and the query is re-ranked until a specified number of docs are retrieved.
(if query reduction is not set to true) For each query, relevant docs are ranked in a results dictionary a. In the case of “STANDARD” processing, each query is processed according to the specified index using the similarity measure specified by ENV.SIMILARITY_MEASURE (possible options are BM25, COSINE, and LANGUAGE) b. In the case of “CONDITIONAL” processing, queries are sent to the positional index if they have a document frequency above ENV.PHRASE_MIN_DF in the phrase index. Otherwise, they are processed by the positional index. If not enough documents have been retrieved after accessing either index, thequery will be matched with documents using the index specified by ENV.BACKUP_INDEX. These results are added to the others to produce a new ranking.
If ENV.USE_QUERY_EXPANSION is set to True and ENV.QUERY_TYPE is set to “TITLE,” the system will use relevance feedback to expand the queries. This works using the queryExpander.py file. It will select the top n documents as specified by ENV.REL_NUM_TOP_DOCS and then select the top m terms as specified by ENV.REL_NUM_TOP_TERMS.
Results are written to a file in treceval format at treceval/results with the name assigned according to the settings

In terms of overall design of the query processing, most relevant files are within src/query, and all files rely on the settings specified in settings.py. queryProcessor.py contains all the logic related to query preprocessing and basic tf.idf extraction. index.py contains the object model for an index stored in memory, and it holds all logic related to accessing items within the index. This abstraction keeps the accessing of index items separate from the query processing task. vectorSpace.py, languageModel.py, and bm25.py each contains its respective similarity calculations. Few files were added for this stage of the project, but queryExpander.py serves as a wrapper for expansion.

External Libraries Used

Numpy –Used to calculate statistics (i.e. mean df, median df) for the program NLTK Porter Stemmer – Used to stem terms for the stem index. Tabulate – Used to display tabular results for the program after it has run. Bisect (built into python/not external) – Used for insort methods to keep list in sorted order.

Running the Project

To get started and run my project, there are a few things you need to do:

For everything to run, you’ll have to make sure the dependencies are installed (only numpy, tabulate, and nltk). To do so, run: pip install –r requirements.txt (note: this command requires pip)
Now, open settings.py and make sure that the settings are as you want them.
If indexes have not yet been generated (there is nothing in the index folder), generate them now by running this command: python buildIndices.py
Now, run the query processing task using the following command: python queryProcessing.py
View your output in the treceval folder under results and run the treceval script to generate the resulting statistics.

Please Note: In the current state of the project, the only setting you will have to change to test reduction/expansion is the value of QUERY_TYPE. If it is set to “NARRATIVE,” it will reduce the query, and if it is set to “TITLE,” it will expand it. The settings should already have USE_QUERY_REDUCTION and USE_QUERY_EXPANSION set to True.

###Recommended Settings The following settings reflect the optimal configuration for running the program quickly and receiving ideal results (this is the current configuration):

####Indexing Settings DOCUMENT_SRC = './data/BigSample/' STOP_LIST_SRC = './data/stops.txt' INCLUDE_DECIMALS = True REMOVE_STOP_WORDS = True INDEX_LOCATION = './index/' TEMP_FILE_NAME = 'TempIndex' TRIPLE_LIST_NAME = 'TripleList' BUILD_ALL_INDEXES = True MIN_PHRASE_TF = 2 MEMORY_MAXIMUM = 100000

####Query Retrieval Settings QUERY_PROCESSING_METHOD = "STANDARD" SIMILARITY_MEASURE = "BM25" BM_25_K1 = 1.2 BM_25_K2 = 500.0 BM_25_B = .75 QUERY_PROCESSING_INDEX = "INVERTED"

####Query Expansion Settings QUERY_TYPE = “TITLE” USE_QUERY_EXPANSION = True QUERY_EXPANSION_METHOD = 'RELEVANCE' REL_NUM_TOP_DOCS = 10 REL_NUM_TOP_TERMS = 5 REL_SORT_CRITERIA = 'NIDF'

####Query Reduction Settings (note: mutually exclusive with query expansion) QUERY_TYPE = “NARRATIVE” USE_QUERY_REDUCTION = True QUERY_THRESHOLD_DOCS_RETRIEVED = 50

Name		Name	Last commit message	Last commit date
Latest commit History 52 Commits
data		data
src		src
.gitignore		.gitignore
README.md		README.md
buildIndices.py		buildIndices.py
processQueries.py		processQueries.py
requirements.txt		requirements.txt
run_treceval.sh		run_treceval.sh
settings.py		settings.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

data

data

src

src

.gitignore

.gitignore

README.md

README.md

buildIndices.py

buildIndices.py

processQueries.py

processQueries.py

requirements.txt

requirements.txt

run_treceval.sh

run_treceval.sh

settings.py

settings.py

Repository files navigation

Information-Retrieval Project 3: Query Reduction / Expansion

Design Overview

External Libraries Used

Running the Project

About

Releases

Packages

Languages

nataliemeurer/basic_search_engine

Folders and files

Latest commit

History

Repository files navigation

Information-Retrieval Project 3: Query Reduction / Expansion

Design Overview

External Libraries Used

Running the Project

About

Resources

Stars

Watchers

Forks

Languages