This repository contains an implementation of a document ranking system built on sentence embeddings tailored to your domain-specific corpora.
It allows you to train the sentence embedding model on your own corpus, build an indexed document repository, and run a ranking system that outputs the top K most similar/relevant docs for a given input query.
Clone this repository locally:

```
git clone https://github.com/ffangsong/Sentence_Embedding_for_Ranking.git
cd Sentence_Embedding_for_Ranking
```
Create the environment and install the dependencies from the provided requirement.txt file:

```
conda create -n sentence_embedding python=3.6   # the environment name is arbitrary
conda activate sentence_embedding
pip install -r requirement.txt
```
Make sure you have your dataset in the data/ folder (you can specify the path in the train script later).
For the embedding model, I used transfer learning to leverage pretrained models that had been trained on huge amounts of general language data prior to release:
- The first model leverages pretrained word embeddings, then uses an LSTM layer to capture contextual information and thus a richer semantic representation.
- The second model leverages the pretrained BERT model, a deep bidirectional transformer. The BERT layer was initialized with the pre-trained weights, followed by a pooling layer, and the weights were fine-tuned on the domain-specific corpus during training.
I restricted the classification head to a simple cosine similarity metric to compel the model to learn a better text representation.
After training is complete, the classification layer is dropped and the output of the second-to-last layer is used as the text embedding.
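The cosine similarity used as the training metric can be sketched as follows. This is a minimal NumPy illustration, not the repository's training code; the toy vectors stand in for the second-to-last layer's output.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the two L2-normalized vectors.
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

# Toy "embeddings" standing in for the second-to-last layer's output.
query = np.array([1.0, 2.0, 3.0])
doc_similar = np.array([2.0, 4.0, 6.0])      # same direction -> similarity 1.0
doc_unrelated = np.array([3.0, -1.5, 0.0])   # orthogonal -> similarity 0.0

print(cosine_similarity(query, doc_similar))    # 1.0
print(cosine_similarity(query, doc_unrelated))  # 0.0
```

Because the score depends only on vector direction, the model is pushed to place semantically similar texts along similar directions in embedding space.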
- To train the word2vec_LSTM model, please download Google's pretrained model here and put it in the docs/pretrained folder. To start training:

```
cd src/
python word2vec_LSTM_train.py
```
- To fine-tune the BERT model, please download the pretrained BERT model here, unzip it, and put it in the docs/pretrained folder. To start training:

```
cd src/
python bert_fine_tune_train.py
```
After training, the checkpoints will be saved at docs/model_checkpoint/
To generate the index, please put your docs at data/ and run:

```
python src/encoder.py
python src/indexing.py
```
The index will be saved at docs/
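Conceptually, the two scripts embed every document and stack the vectors into a matrix that serves as the index. The sketch below uses a dummy deterministic encoder purely for illustration; encoder.py uses the trained model, and the file layout here is assumed, not taken from the repo.

```python
import numpy as np

def encode(text, dim=8):
    # Toy deterministic stand-in for the trained encoder; the real
    # encoder.py embeds each document with the fine-tuned model.
    rng = np.random.default_rng(sum(text.encode()))
    v = rng.normal(size=dim)
    # L2-normalize so that cosine similarity later reduces to a dot product.
    return v / np.linalg.norm(v)

docs = ["first document", "second document", "third document"]
index = np.vstack([encode(d) for d in docs])  # one row per doc, shape (3, 8)
# indexing.py would then persist this matrix under docs/ (e.g. via np.save).
print(index.shape)  # (3, 8)
```

Storing normalized rows up front means query time needs only a matrix-vector product, with no per-query normalization of the corpus.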
To run a test of the ranking application, please run:

```
python src/ranking.py
```
The application will output the top 3 most relevant docs for a given input query.
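With L2-normalized embeddings, retrieving the top K docs reduces to one matrix-vector product followed by a sort. This is an illustrative sketch of that idea, not ranking.py itself; the toy 2-D index and function name are assumptions.

```python
import numpy as np

def top_k(query_vec, index, k=3):
    # With L2-normalized rows, index @ query_vec gives the cosine
    # similarity of the query against every indexed document.
    scores = index @ query_vec
    order = np.argsort(scores)[::-1][:k]  # highest-scoring docs first
    return [(int(i), float(scores[i])) for i in order]

# Toy normalized index of 4 documents in 2-D.
index = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.8],
    [-1.0, 0.0],
])
query = np.array([1.0, 0.0])
print(top_k(query, index, k=3))  # doc 0 (score 1.0), doc 2 (0.6), doc 1 (0.0)
```

For a small corpus a brute-force dot product like this is fast enough; larger corpora typically swap in an approximate nearest-neighbor index.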