This repository contains the implementation of two methods for background linking: IR-BERT and Weighted BM25.
- `./src/path.cfg`
  - Ignore the following variables:
    - topics19
    - entities
    - entities19
    - eqrels
  - All the background linking related files (the dataset, topics, and qrels) go in the path given by the "DataPath" variable
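As a sketch of how these settings might be consumed (assuming `path.cfg` uses an INI-style layout; the `[Paths]` section name and the sample contents below are hypothetical, not taken from the repo):

```python
import configparser

# Hypothetical contents standing in for src/path.cfg; the [Paths]
# section name and INI layout are assumptions.
SAMPLE_CFG = """
[Paths]
DataPath = /data/wapo/
"""

cfg = configparser.ConfigParser()
cfg.read_string(SAMPLE_CFG)

# All background-linking files (dataset, topics, qrels) live under DataPath.
data_path = cfg["Paths"]["DataPath"]
print(data_path)
```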
- Result files are created by the main scripts of both models, IR-BERT and Weighted BM25
- These result files can then be evaluated directly with the background linking eval script
- Set appropriate paths in src/path.cfg
- Run merge.py in wapo/WashingtonPost/data. You will need the files listed in "filenames" in this directory alongside the merge script.
- Start the Elasticsearch server with the command "elasticsearch". (In case of a port mismatch, check "http.port" in elasticsearch.yml.)
- Run Preprocess.py
- Run IR-BERT.py
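For the port-mismatch case mentioned above, one way to check which port Elasticsearch is configured on is to scan `elasticsearch.yml` for an uncommented `http.port` line; a minimal sketch (the sample YAML contents are illustrative, not from a real installation):

```python
def read_http_port(yml_text, default=9200):
    """Scan elasticsearch.yml-style text for an uncommented http.port line."""
    for line in yml_text.splitlines():
        line = line.strip()
        if line.startswith("http.port"):
            return int(line.split(":", 1)[1].strip())
    return default  # Elasticsearch's default HTTP port

# Illustrative config snippet; the commented-out line is skipped.
sample_yml = """
cluster.name: wapo-cluster
#http.port: 9201
http.port: 9250
"""
print(read_http_port(sample_yml))
```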
The data processing code performs the following steps:
- lowercase all the text
- stemming and lemmatization
- remove stop words, using "stopwords.txt" as the dictionary of words
- filter the articles based on their kicker field
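The steps above can be sketched as follows; the crude suffix-stripper and the inline stopword set are toy stand-ins for the repo's real stemmer/lemmatizer and for "stopwords.txt":

```python
import re

# Tiny stand-in for the words loaded from stopwords.txt
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}

def simple_stem(token):
    # Crude suffix stripping; the repo presumably uses a real
    # stemmer/lemmatizer (e.g. from NLTK) instead.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    # 1. lowercase  2. tokenize  3. drop stop words  4. stem
    tokens = re.findall(r"[a-z]+", text.lower())
    return [simple_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("The Reporters Were Covering the Elections"))
```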
- We propose IR-BERT, which combines the retrieval power of BM25 with the contextual understanding gained through a BERT-based model. It has the following components:
  - Elasticsearch BM25
  - RAKE for keyword extraction
  - Sentence-BERT for semantic similarity
- Our model outperforms the TREC median as well as the highest scoring model of 2018 in terms of the nDCG@5 metric.
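In this pipeline, BM25 retrieves candidate articles and Sentence-BERT embeddings are used to rank them by semantic similarity to the query. A minimal sketch of that final re-ranking step, with toy 2-d vectors standing in for real Sentence-BERT embeddings:

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rerank(query_vec, candidates):
    # candidates: (doc_id, embedding) pairs from the BM25 retrieval stage;
    # in the real model the embeddings would come from Sentence-BERT.
    return sorted(candidates, key=lambda c: cosine(query_vec, c[1]), reverse=True)

# Toy vectors, not real sentence embeddings
query = [1.0, 0.0]
docs = [("a", [0.0, 1.0]), ("b", [1.0, 0.1]), ("c", [0.5, 0.5])]
ranked = [doc_id for doc_id, _ in rerank(query, docs)]
print(ranked)  # most similar first
```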