native-language-recognition

Introduction

The goal of this project is to build an automatic speech recognition system that can recognize the native language of a speaker from English utterances. Knowing the cultural background of a non-native English speaker might allow virtual assistants to better understand the context of the speaker's question or request. This project is motivated by a recent research challenge, which made a large dataset available for developing and testing such a method. Starting from the initial goal of reproducing the baseline results, we aim to gain a deeper understanding of the techniques required to recognize these slight differences.

Data set

The dataset used is the Educational Testing Service (ETS) Corpus of Non-Native Spoken English, which consists of 45-second English utterances from speakers of eleven different native-language (L1) backgrounds. There are 3300 training instances (41.3 h, ∼64%), 965 validation instances (12.1 h, ∼19%) and 867 testing instances (10.8 h, ∼17%). In total, there are 5132 unique non-native speakers. The following table shows the distribution of samples per native language; the Development column corresponds to the validation set.

L1	Training	Development	Test	Sum
Arabic	300	86	80	466
Chinese	300	84	74	458
French	300	80	78	458
German	300	85	75	460
Hindi	300	83	82	465
Italian	300	94	68	462
Japanese	300	85	75	460
Korean	300	90	80	470
Spanish	300	100	77	477
Telugu	300	83	88	471
Turkish	300	95	90	485
TOTAL:	3300	965	867	5132

Number of 45-second recordings for each native language (L1) in each partition.
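The hours and percentages quoted for each partition follow from the recording counts in the table and the 45-second recording length. The short sketch below recomputes them as a sanity check; the variable names are illustrative only and are not taken from the repository.

```python
# Partition sizes copied from the TOTAL row of the table above.
partitions = {"Training": 3300, "Development": 965, "Test": 867}
seconds_per_recording = 45  # each utterance is stated to last 45 seconds

total = sum(partitions.values())
hours = {}
shares = {}
for name, count in partitions.items():
    # Duration in hours and share of the whole corpus for this partition.
    hours[name] = count * seconds_per_recording / 3600
    shares[name] = 100 * count / total
    print(f"{name}: {count} recordings, {hours[name]:.2f} h, {shares[name]:.1f}%")

print(f"Total: {total} recordings")
```

Small differences from the figures quoted in the text come down to rounding.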

Methods

Our goals for this project include the following approaches:

Replicate the SVM based baseline from the INTERSPEECH paper.
Simplify the ComParE dataset using feature selection and dimensionality reduction algorithms:
- Variance thresholding, Chi-squared, Relief-F, Principal component analysis
Try other (possibly better) classifiers: boosting, tree-based methods and DNNs.
Evaluate feasibility of an end-to-end deep learning approach using raw audio.
Evaluate the feasibility of image-based systems (spectrograms).
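As a rough illustration of the feature-based approaches listed above, the following is a minimal sketch that chains variance thresholding, principal component analysis and a linear SVM using scikit-learn. It is not the code in `baseline.py` or `feature_selection.py`: the data is synthetic, and the feature count, the number of components and the SVM settings are assumptions chosen only to make the example run. Chi-squared and Relief-F selection are left out of this sketch.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for an acoustic feature matrix (hypothetical shapes;
# the real corpus features are not reproduced here).
rng = np.random.default_rng(0)
n_classes, n_per_class, n_features = 11, 30, 200
y = np.repeat(np.arange(n_classes), n_per_class)
centers = rng.normal(scale=2.0, size=(n_classes, n_features))
X = centers[y] + rng.normal(size=(y.size, n_features))
X[:, :20] = 0.0  # constant columns that variance thresholding should drop

# Random split into a training part and a held-out development part.
idx = rng.permutation(y.size)
train, dev = idx[:250], idx[250:]

model = make_pipeline(
    VarianceThreshold(threshold=0.0),      # remove zero-variance features
    StandardScaler(),                      # put features on a common scale
    PCA(n_components=50, random_state=0),  # assumed number of components
    LinearSVC(C=1.0, max_iter=5000),       # assumed SVM settings
)
model.fit(X[train], y[train])

dev_accuracy = model.score(X[dev], y[dev])
n_kept = int(model.named_steps["variancethreshold"].get_support().sum())
print(f"features kept: {n_kept}, dev accuracy: {dev_accuracy:.2f}")
```

Wrapping the steps in a single pipeline means the selection and projection are fitted on the training split only, so the development score is not influenced by the held-out data.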
