

Variational Language Model

QHack hackathon | February 22–26, 2021
By TeamX: Slimane Thabet & Jonas Landman

Introduction

In this project, we developed a variational quantum algorithm for Natural Language Processing (NLP). Our goal is to train a quantum circuit so that it can process and recognize words. Applications range from word matching, sentence completion, and sentence generation to named entity recognition and more.


Word encoding

Words are preprocessed using state-of-the-art deep learning word embedding methods such as FastText. These embeddings are then cast down to a few features using dimensionality reduction; for instance, each word is described as a vector of 8 dimensions. Using quantum amplitude encoding, we can then encode each word into a 3-qubit register. To represent a sentence of $N$ words, we stack $N$ such 3-qubit registers sequentially.
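As an illustration, here is a minimal sketch of this encoding pipeline; the function names, the use of scikit-learn's PCA, and the random example data are assumptions for the sketch, not the repository's actual code:

```python
import numpy as np
import pennylane as qml
from sklearn.decomposition import PCA

def reduce_embeddings(word_vectors, n_features=8):
    """Cast full FastText vectors (e.g. 300-d) down to n_features dimensions."""
    return PCA(n_components=n_features).fit_transform(word_vectors)

dev = qml.device("default.qubit", wires=3)

@qml.qnode(dev)
def encode_word(features):
    """Amplitude-encode one 8-dimensional word vector into a 3-qubit register."""
    qml.templates.AmplitudeEmbedding(features, wires=[0, 1, 2], normalize=True)
    return qml.probs(wires=[0, 1, 2])

# Example: 100 random stand-ins for 300-d FastText vectors, reduced to 8 features each.
vectors = np.random.randn(100, 300)
features = reduce_embeddings(vectors)
print(encode_word(features[0]))
```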

Variational Circuit

We propose a new ansatz and training methodology to perform this NLP quantum learning:

  • The ansatz is composed of several layers of controlled rotations that mix the words with each other and within themselves.
  • During the training, we randomly mask one word in each sentence by forcing its quantum register to $|0\rangle$.
  • Using a SWAP test, a supplementary word is then compared with the output register of the missing word (at the output of the ansatz). The cost function is therefore the probability of measuring '0' on the SWAP test's ancillary qubit. We choose the supplementary word to be the missing word itself in order to drive the learning (a minimal circuit sketch follows this list).
  • The goal of the training is to adjust the ansatz's parameters such that the missing word is guessed.
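Below is a hedged sketch of the training circuit for one sentence. The register layout, the ansatz structure, and all function names are illustrative assumptions; the actual ansatz lives in utils.py and may differ.

```python
import pennylane as qml
from pennylane import numpy as np

N, Q, L = 4, 3, 2                          # words per sentence, qubits per word, layers
sentence_wires = list(range(N * Q))        # N word registers of Q qubits each
ref_wires = list(range(N * Q, N * Q + Q))  # register for the supplementary word
ancilla = N * Q + Q                        # SWAP-test ancillary qubit

dev = qml.device("default.qubit", wires=ancilla + 1)

# Trainable parameters: one shape that fits the illustrative ansatz below.
params = np.random.uniform(0, 2 * np.pi, (L, N * Q, 2), requires_grad=True)

def ansatz(params):
    """Layers of single-qubit rotations plus controlled rotations that mix
    neighbouring qubits (illustrative structure only)."""
    for l in range(L):
        for w in sentence_wires:
            qml.RY(params[l, w, 0], wires=w)
        for w in range(N * Q - 1):
            qml.CRZ(params[l, w, 1], wires=[w, w + 1])

@qml.qnode(dev)
def masked_word_overlap(params, sentence_features, masked_idx, ref_features):
    # Encode every word except the masked one, whose register stays in |0>.
    # (On local simulators, repeated amplitude embeddings may need to be replaced
    # by qml.templates.MottonenStatePreparation; see "How to run this project".)
    for i, feat in enumerate(sentence_features):
        if i != masked_idx:
            wires = sentence_wires[i * Q:(i + 1) * Q]
            qml.templates.AmplitudeEmbedding(feat, wires=wires, normalize=True)
    ansatz(params)
    # Encode the supplementary word (during training: the masked word itself).
    qml.templates.AmplitudeEmbedding(ref_features, wires=ref_wires, normalize=True)
    # SWAP test between the masked word's output register and the reference.
    qml.Hadamard(wires=ancilla)
    masked = sentence_wires[masked_idx * Q:(masked_idx + 1) * Q]
    for a, b in zip(masked, ref_wires):
        qml.CSWAP(wires=[ancilla, a, b])
    qml.Hadamard(wires=ancilla)
    return qml.probs(wires=ancilla)        # entry 0 is P('0'), the fidelity-based score
```

For a SWAP test between pure states, the probability of measuring '0' on the ancilla is $\frac{1}{2} + \frac{1}{2}|\langle\psi|\phi\rangle|^2$, so maximizing it drives the masked word's output register towards the reference word.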


Applications

Once such a circuit is trained, we can provide a new sentence with a missing word and compare the missing slot with all possible words in the "dictionary". We can also generate artificial sentences by starting from a single word, or complete a sentence from its last words, as sketched below.
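As a sketch, filling in a missing word amounts to scoring every dictionary word as the SWAP-test reference and keeping the best match. This reuses the hypothetical `masked_word_overlap` circuit sketched earlier; `dictionary_features` (a mapping from words to their reduced feature vectors) is an assumption for the example.

```python
def predict_missing_word(params, sentence_features, masked_idx, dictionary_features):
    """Return the dictionary word whose SWAP-test P('0') with the masked
    register is highest (illustrative helper, not part of the repository)."""
    scores = {
        word: masked_word_overlap(params, sentence_features, masked_idx, feat)[0]
        for word, feat in dictionary_features.items()
    }
    return max(scores, key=scores.get)
```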


By training a decoder circuit, we can perform Named Entity Recognition to classify words into categories, such as people, places, etc.


Performance

We consider $M$ sentences of $N$ words, each word encoded on $Q$ qubits.

  • Number of qubits required: one quantum circuit corresponds to one sentence plus an extra word and an ancillary qubit, hence $Q(N+1)+1$ qubits. E.g. for a 4-word sentence with 3 qubits per word, we require 16 qubits; for a 5-word sentence with 4 qubits per word, we require 25 qubits.
  • Number of trainable parameters: the number of trainable parameters in the ansatz is around $Q(1+N/2)L$, where $L$ is the number of layers (on average; the exact count depends on the parity of the number of words and of qubits). E.g. for a 4-word sentence with 3 qubits per word and 3 layers, we require 27 parameters (see the quick check after this list).
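A quick sanity check of these formulas (hypothetical helper, not part of the repository):

```python
def resources(N, Q, L):
    n_qubits = Q * (N + 1) + 1        # sentence registers + reference word + ancilla
    n_params = Q * (1 + N / 2) * L    # approximate, depends on parities
    return n_qubits, n_params

print(resources(4, 3, 3))   # (16, 27.0)
print(resources(5, 4, 3))   # (25, 42.0)
```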

We can use the AWS SV1 simulator to parallelize gradient evaluation during training, but the computational cost remains high due to the number of sentences and the total number of words in the dictionary.
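For example, a hedged sketch of selecting SV1 through the PennyLane-Braket plugin with parallel gradient execution; the device ARN is the public SV1 ARN, while the S3 bucket and prefix are placeholders:

```python
import pennylane as qml

n_wires = 16  # e.g. a 4-word sentence with 3 qubits per word
dev = qml.device(
    "braket.aws.qubit",
    device_arn="arn:aws:braket:::device/quantum-simulator/amazon/sv1",
    s3_destination_folder=("my-braket-bucket", "qhack"),  # placeholder bucket/prefix
    wires=n_wires,
    parallel=True,  # evaluate parameter-shift gradient terms in parallel batches
)
```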

Datasets

We propose three different datasets to train and test our algorithm:

  • An IMDB dataset composed of 100 000 sentences and 12 words in total
  • A Newsgroup dataset composed of 100 000 sentences and 12 words in total
  • A synthetic dataset of 'dummy' sentences, with a small number of sentences and words, chosen to limit the computational cost and keep the grammar simple

Code architecture

  • The PennyLane variational ansatzes are defined in utils.py
  • The NLP preprocessing using FastText is done in preprocessing_dataset.py and generates readable files such as embeddings.npy, sentences.npy, etc.
  • The global configuration, such as the number of words, the number of qubits per word, and the number of layers of the ansatz, is defined in config.py (a sketch of its layout follows this list)
  • In the notebook Final_notebook_train.ipynb, we train the quantum variational circuit and test the applications
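For reference, a hypothetical sketch of what such a configuration might look like; the variable names below are assumptions, not necessarily the ones used in config.py:

```python
# config.py (illustrative layout only; check the actual file for the real names)
NUM_WORDS = 4             # N: words per sentence
NUM_QUBITS_PER_WORD = 3   # Q: qubits in each word register
NUM_LAYERS = 2            # L: layers in the variational ansatz
```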

How to run this project

  • You will need the following libraries:
nltk
scikit-learn
gensim
numpy
pennylane
pickle
  • The Newsgroup dataset and the synthetic dataset can be created via
python preprocessing_dataset.py
  • The training is done in Final_notebook_train.ipynb
  • The applications with saved parameters are in Applications.ipynb
  • All the code runs on Amazon Braket instances; you may encounter issues if you run it locally, especially when using the PennyLane qml.templates.AmplitudeEmbedding procedure.
  • Be careful to update the global variables in config.py such that they match your desired configuration.

Pretrained models

We provide several pretrained models in the folder saved_parameters:

  • In dummy_dataset/5_words: 5 words, 2 qubits/word, 2 layers, trained on the 5-word dummy dataset
  • In dummy_dataset/4_words: 4 words, 3 qubits/word, 2 layers, trained on the 4-word dummy dataset
  • In dummy_dataset/decoder: 5 words, 2 qubits/word, decoder ansatz for Named Entity Recognition, trained on the 5-word dummy dataset
  • In newsgroup: 10 words, 2 qubits/word, trained on the newsgroup dataset
