Automated Lip Reading

Lip reading, also known as visual speech recognition, has been considered a solution for speech recognition tasks, especially when the audio is corrupted or when the conversation takes place in a noisy environment. It can also be an extremely helpful tool for people who are hearing-impaired to communicate through video calls. The task is challenging, however, due to factors such as variance in the inputs (facial features, skin colors, speaking speeds, etc.) and the one-to-many relationship between visemes and phonemes. This project tackles lip reading by modeling an agent that learns features by interacting with its environment, using reinforcement learning.

Model

The three main components of the model are listed below (a sketch of how they fit together follows the list):

  • Convolutional Neural Network: a VGG16 model pre-trained on the ImageNet dataset transforms images of the lip region into vector representations.

  • Long Short-Term Memory network: a recurrent neural network with LSTM cells acts as the agent and uses REINFORCE to learn its parameters.

  • Reinforcement Learning:

    The components of RL in the lip reading setting:

    • Agent: the generative model (an RNN with LSTM cells)
    • Environment: the words and the context vector the agent receives as input at every time step
    • Policy: the parameters of the generative model
    • Action: predicting the next word in the sequence at each time step
    • Reward function: the BLEU score, which measures the similarity between the generated text and the ground truth
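
Below is a minimal PyTorch sketch of how these components could fit together: frozen VGG16 features are extracted for each lip-region frame and fed to an LSTM, which emits a word distribution at every time step. The repository itself works with the TensorFlow-style vgg16_weights.npz file (see the notes below), so the class name, dimensions, and the use of torchvision here are illustrative assumptions rather than the project's actual code.

```python
import torch
import torch.nn as nn
from torchvision import models

class LipReader(nn.Module):
    """Hypothetical CNN + LSTM lip-reading agent (illustrative sketch only)."""

    def __init__(self, vocab_size, hidden_size=512):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        # Frozen VGG16 backbone: conv features + first FC layers -> 4096-d vector per frame.
        self.cnn = nn.Sequential(
            vgg.features,
            nn.Flatten(),
            *list(vgg.classifier.children())[:-1],
        )
        for p in self.cnn.parameters():
            p.requires_grad = False
        # LSTM agent: consumes the per-frame context vectors.
        self.lstm = nn.LSTM(input_size=4096, hidden_size=hidden_size, batch_first=True)
        # Action head: predicts the next word at each time step.
        self.word_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, frames):
        # frames: (batch, time, 3, 224, 224) crops of the lip region
        b, t = frames.shape[:2]
        feats = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)
        return self.word_head(out)  # (batch, time, vocab_size) word logits
```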

Visualization of Lip Reading Model

Reinforcement Learning Methods

The three RL methods implemented in this project are listed in the table below, followed by a sketch of the REINFORCE update:

| Method | References | Implementation |
| --- | --- | --- |
| REINFORCE | Williams, 1992; Zaremba & Sutskever, 2015 | REINFORCE |
| Deep Q-Network | Mnih et al., 2014 | DDQN |
| Asynchronous Advantage Actor-Critic (A3C) | Mnih et al., 2016 | A3C |
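
A minimal sketch of one REINFORCE update, assuming a model like the LipReader sketch above, NLTK's sentence_bleu as the reward, and an optimizer over the trainable (LSTM and output-head) parameters; the function name and the plain sum-of-log-probabilities loss are illustrative assumptions. DDQN and A3C replace this update with value-based and actor-critic variants, respectively.

```python
import torch
from nltk.translate.bleu_score import sentence_bleu

def reinforce_step(model, optimizer, frames, reference_words, idx2word):
    # frames: (time, 3, 224, 224) lip crops for one video; reference_words: ground-truth tokens.
    logits = model(frames.unsqueeze(0)).squeeze(0)           # (time, vocab_size)
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample()                                  # sampled word index per time step
    hypothesis = [idx2word[i.item()] for i in actions]
    # Reward: BLEU similarity between the generated sequence and the ground truth.
    reward = sentence_bleu([reference_words], hypothesis)
    # REINFORCE: scale the negative log-likelihood of the sampled actions by the reward.
    loss = -(dist.log_prob(actions).sum() * reward)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```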

Important Notes

  1. The GRID corpus contains 33,000 facial recordings. The folder GRID corpus/vectors includes the vector representations of only 100 videos, to demonstrate the method. The GRID corpus can be found HERE.
  2. Another dataset planned for use in this project is the BBC-Oxford 'Multi-View Lip Reading Sentences' (MV-LRS) dataset, which can be found HERE.
  3. In the folder Video Processing, the file shape_predictor_68_face_landmarks.dat, which is used to detect the lip region (a usage sketch follows these notes), is left out due to its large size. This file can be downloaded HERE.
  4. In the folder CNN/pretrained-VGG16, the file vgg16_weights.npz is also left out due to its large size. It contains VGG16 weights pre-trained on the ImageNet dataset and can be downloaded HERE.
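
For reference, here is a minimal sketch of lip-region extraction with dlib and the 68-point landmark model from note 3 (the mouth landmarks are points 48-67 of that model); the file path, margin, and function name are illustrative assumptions.

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("Video Processing/shape_predictor_68_face_landmarks.dat")

def crop_lips(frame, margin=10):
    """Return a crop of the lip region from a BGR video frame, or None if no face is found."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    # Points 48-67 of the 68-point model outline the mouth.
    xs = [shape.part(i).x for i in range(48, 68)]
    ys = [shape.part(i).y for i in range(48, 68)]
    return frame[max(min(ys) - margin, 0):max(ys) + margin,
                 max(min(xs) - margin, 0):max(xs) + margin]
```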
