This repository contains an implementation of a document ranking system built on sentence embeddings tailored to your domain-specific corpora.
It allows you to train the sentence embedding model on your own corpus, build an indexed document repository, and run a ranking system that outputs the top K most similar/relevant docs for a given input query.
Clone this repository locally:

```
git clone https://github.com/ffangsong/Sentence_Embedding_for_Ranking.git
cd Sentence_Embedding_for_Ranking
```
Create the environment and install the dependencies from the provided requirement.txt file:

```
conda create -n sentence_embedding python=3.6   # the environment name is arbitrary
conda activate sentence_embedding
pip install -r requirement.txt
```
Make sure you have your dataset in the data/ folder (you can specify the path in the train script later).
For the embedding model, I used transfer learning to leverage pretrained models that had been trained on huge amounts of general language data prior to release:
- The first model leverages pretrained word embeddings, then uses an LSTM layer to capture contextual information and thus a richer semantic representation.
- The second model leverages the pretrained BERT model, a deep bidirectional transformer. The BERT layer was initialized with the pre-trained weights, followed by a pooling layer, and the weights were fine-tuned on the domain-specific corpus during training.
I restricted the classification head to a simple cosine similarity metric to compel the model to learn a better text representation.
After training is complete, the classification layer is dropped and the output of the second-to-last layer is used as the text embedding.
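The cosine similarity used as the training metric can be sketched as follows. This is a minimal NumPy illustration, not the repository's training code; the toy vectors stand in for the second-to-last layer's output.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the two L2-normalized vectors.
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

# Toy "embeddings" standing in for the second-to-last layer's output.
query = np.array([1.0, 2.0, 3.0])
doc_similar = np.array([2.0, 4.0, 6.0])      # same direction -> similarity 1.0
doc_unrelated = np.array([3.0, -1.5, 0.0])   # orthogonal -> similarity 0.0

print(cosine_similarity(query, doc_similar))    # 1.0
print(cosine_similarity(query, doc_unrelated))  # 0.0
```

Because the score depends only on vector direction, the model is pushed to place semantically similar texts along similar directions in embedding space.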
- To train the word2vec_LSTM model, please download Google's pretrained model here and put it in the docs/pretrained folder. To start training:

```
cd src/
python word2vec_LSTM_train.py
```
- To fine-tune the BERT model, please download the pretrained BERT model here, unzip it, and put it in the docs/pretrained folder. To start training:

```
cd src/
python bert_fine_tune_train.py
```
After training, the checkpoints will be saved at docs/model_checkpoint/
To generate the index, please put your docs at data/ and run:

```
python src/encoder.py
python src/indexing.py
```
The index will be saved at docs/
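Conceptually, the two scripts embed every document and stack the vectors into a matrix that serves as the index. The sketch below uses a dummy deterministic encoder purely for illustration; encoder.py uses the trained model, and the file layout here is assumed, not taken from the repo.

```python
import numpy as np

def encode(text, dim=8):
    # Toy deterministic stand-in for the trained encoder; the real
    # encoder.py embeds each document with the fine-tuned model.
    rng = np.random.default_rng(sum(text.encode()))
    v = rng.normal(size=dim)
    # L2-normalize so that cosine similarity later reduces to a dot product.
    return v / np.linalg.norm(v)

docs = ["first document", "second document", "third document"]
index = np.vstack([encode(d) for d in docs])  # one row per doc, shape (3, 8)
# indexing.py would then persist this matrix under docs/ (e.g. via np.save).
print(index.shape)  # (3, 8)
```

Storing normalized rows up front means query time needs only a matrix-vector product, with no per-query normalization of the corpus.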
To run a test of the ranking application, please run:

```
python src/ranking.py
```
The application will output the top 3 most relevant docs for a given input query.
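With L2-normalized embeddings, retrieving the top K docs reduces to one matrix-vector product followed by a sort. This is an illustrative sketch of that idea, not ranking.py itself; the toy 2-D index and function name are assumptions.

```python
import numpy as np

def top_k(query_vec, index, k=3):
    # With L2-normalized rows, index @ query_vec gives the cosine
    # similarity of the query against every indexed document.
    scores = index @ query_vec
    order = np.argsort(scores)[::-1][:k]  # highest-scoring docs first
    return [(int(i), float(scores[i])) for i in order]

# Toy normalized index of 4 documents in 2-D.
index = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
    [-1.0, 0.0],
])
query = np.array([1.0, 0.0])
print(top_k(query, index, k=3))  # doc 0 (score 1.0), doc 2 (0.6), doc 1 (0.0)
```

For a small corpus a brute-force dot product like this is fast enough; larger corpora typically swap in an approximate nearest-neighbor index.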