
TREC WebTrack

This is a repository used to employ Machine Learning models on the ad-hoc task of the TREC Web Track. Any issues, PRs, or suggestions are welcome.

More specifically, these are reranking models for query-document pairs. Since computing a relevance score for every query-document pair is too expensive, the objective is to rerank the QL (query likelihood) submissions of each year, which you can find here.

These models order a list of text documents according to their relevance to a particular query. You can use this repository to train your own reranking models or to apply a pre-trained model to custom data, i.e., a set of queries and documents.
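Conceptually, reranking reduces to scoring each candidate document against the query and sorting by that score. The sketch below is purely illustrative and not an API from this repository; score stands in for whatever trained model you load, and the toy term-overlap scorer is only there to make the example runnable.

    # Illustrative only: rerank candidates by a relevance score.
    # "score" is a hypothetical stand-in for a trained model's scoring function.
    def rerank(query, candidates, score):
        return sorted(candidates, key=lambda doc: score(query, doc), reverse=True)

    # Toy scorer (term overlap), just to make the sketch runnable.
    def overlap_score(query, doc):
        return len(set(query.lower().split()) & set(doc.lower().split()))

    docs = ["an overview of the trec web track", "an unrelated document"]
    print(rerank("trec web track", docs, overlap_score))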

Currently, two models are implemented (the examples below use PACRR). Their implementation was adapted from the official release.

To install, run (Python 3.5+):

python setup.py develop

Create a symbolic link named DATA pointing to your stored data:

ln -s path_to_your_data DATA

Under your DATA directory you'll need different data, depending on whether you want to reproduce the TREC Web Track results or apply a pre-trained model to your own data. Both workflows are described below.

Reproduce TREC Web Track results

For now, I only provide scripts to reproduce the 2013 and 2014 results; however, you can adapt the bash and config files to run other configurations. Throughout the instructions, replace gpu_device with the CUDA device ID you want to run on (or None to run on the CPU).

Under your DATA directory, download the official similarity matrices provided by the authors and extract them:

cd DATA
mkdir corpora
cd corpora
tar xvf simmat.tar.gz

Also download the query IDF vectors and extract them:

cd DATA/corpora/
unzip query_idf.zip

Now you can either train and test the PACRR model, or run testing only.

  • To train and test, run any script under bin/test13 or bin/test14, as in the following example:

    bash bin/test1*/run_pacrr_1*val.sh gpu_device
    

    or, to use a round-robin procedure (which takes longer):

    bash bin/test1*/run_pacrr_test1*.sh gpu_device
    
  • To test only, you'll need to download my weight files and extract them under DATA:

    cd DATA
    unzip model_outputs.zip
    

    Then comment out the part of the bin/test1*/ bash scripts that calls script/train.py, and run them the same way as described for train and test.

Using a pre-trained model on your data

You'll need to download my weight files and extract them under DATA:

  cd DATA
  unzip model_outputs.zip

Also download the pretrained embeddings and extract them:

  cd DATA
  unzip embeddings.zip

Now, change the file qrels/customdata.txt according to your data. As the example file shows, it is constructed in the following format:

query text
document text

(...)

query text
document text
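If you prefer to generate qrels/customdata.txt programmatically, here is a minimal sketch; the (query, document) pairs are placeholder data, and it assumes each query line is immediately followed by its document line, as in the example file.

    # Minimal sketch: write alternating query/document lines to qrels/customdata.txt.
    # The pairs below are placeholders; replace them with your own data.
    pairs = [
        ("first query text", "first document text"),
        ("second query text", "second document text"),
    ]
    with open("qrels/customdata.txt", "w") as f:
        for query, doc in pairs:
            f.write(query + "\n")
            f.write(doc + "\n")
            # If your copy of the example file separates pairs with a blank
            # line, uncomment the next line to match it:
            # f.write("\n")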

Once this file contains your queries and documents, run:

bash bin/run_pacrr_customdata.sh

At the end, the script prints where the test.probs file was saved. That file contains the relevance scores of every query-document pair you listed in qrels/customdata.txt.
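To inspect the output, the following sketch pairs the scores back with your queries and documents. It assumes test.probs holds one score per line, in the same order as the pairs in qrels/customdata.txt, and that the pairs are written without blank lines; check the file your run actually produces, since its exact format is not documented here.

    # Hedged sketch: join test.probs scores with the query/document pairs.
    # Assumes one score per line, ordered like qrels/customdata.txt.
    with open("test.probs") as f:
        scores = [float(line) for line in f if line.strip()]

    with open("qrels/customdata.txt") as f:
        lines = [line.rstrip("\n") for line in f if line.strip()]
    pairs = list(zip(lines[0::2], lines[1::2]))  # (query, document)

    # Print pairs from most to least relevant.
    for (query, doc), score in sorted(zip(pairs, scores), key=lambda x: -x[1]):
        print("{:.4f}\t{}\t{}".format(score, query, doc))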
