Skip to content

amkahn/question-answering

Repository files navigation

LING 573: D4
Claire Jaja, Andrea Kahn, Clara Gordon
5/30/2014

This repository holds our question-answering system for LING 573: Systems and Applications. It processes the AQUAINT corpus of English news text to produce answers to the natural language questions used in the TREC Q&A shared tasks of 2006 and 2007. See the project reports in doc/ for more detailed information, and descriptions of the main project scripts below. 

D4.CMD
The condor script D4.cmd runs question_answering.sh on the development set with the following non-default parameter settings (all others are set to default):
web_cache=src/cached_web_results/TREC-2006.4pg.web_cache
indri_passages=20
passage_length=75
indri_window_size=30


It runs question_answering_evaltest.sh on the evaluation set with the following non-default parameter settings (all others are set to default):
q_file=/dropbox/13-14/573/Data/Questions/evaltest/QA2007_testset.xml
web_cache=src/cached_web_results/QA2007_testset.4pg.web_cache
index=src/indexes/AQUAINT-2/porter.stoplist
indri_passages=20
passage_length=75
indri_window_size=30

This code runs in about 15 minutes on the Condor cluster.


SRC
Our main code is in six scripts within the src directory.

1. index.cmd : This Condor script runs Indri's IndriBuildIndex code to build an index.  It has a parameter file specified as an argument which gives the path to the document collection, the path to the output index, and other parameters.  Indexing of the AQUAINT corpus takes approximately 15 minutes.
We created several different versions of the index for D3, using both Porter and Krovetz stemmer, both including and excluding a list of stopwords. 

We stored the various indexes under /home2/cjaja/classwork/spring-2014/ling573/question-answering/src/indexes/


2. cache_web_results.py : This script will cache web results for all the questions in a given question file.  It takes three arguments:
1) path of TREC-format question file
2) path of output file for the web cached results
3) number of pages of web search results to return

3. cache_web_results.cmd : This script runs cache_web_results.py on the Condor cluster.

4. question_answering.py : This script runs our question answering system on a given question file.  It takes one mandatory argument and 20 optional parameters.  The mandatory argument is the run tag which must be the first argument to the script.  The parameters, which are described in the file parameters.txt, must be given in the format <parameter_name>=<parameter_setting> where <parameter_name> is the name of the parameter and <parameter_setting> is the desired setting.  These parameters all have default values.

This script implements parallelization to process multiple questions at once.


5. question_answering.sh : This script runs our question answering system and then evaluates the output.  It takes the same arguments as question_answering.py  It runs the evaluation script in both strict and lenient mode on the output.  Note that this script is currently hard-coded to use the pattern file for the devtest set.

6. question_answering_evaltest.sh : This script is identical to question_answering.sh except it is hard-coded to use the pattern file for the evaltest set.

Our code is divided into three main modules, which live within the query_processing, info_retrieval, and answer_processing directories.  Some additional classes are defined with the general_classes module.  Our script uses a stopword list (taken from the Indri/Lemur documentation) in Indri/Lemur parameter XML format which sits in the src directory, named "stoplist.dft".

Third-party modules that we use include: BeautifulSoup, NLTK, and pymur (Python wrapper for Indri/Lemur).


RESULTS
After extensive tuning of our parameters on the development set, the evaluation script gives us the following accuracies:
dev: strict: 0.2160, lenient: 0.3773
eval: strict: 0.1766, lenient: 0.3584

About

A basic question-answering system

Resources

License

Stars

Watchers

Forks

Packages

No packages published