GitHub - chagge/langdetect: Spoken language detection (Stanford CS 229 class project)

Installation

Clone this repo, then run:

# Set up virtualenv
virtualenv . && source bin/activate

# Install package requirements
pip install -r requirements.txt

To build the training data (with prepare.py), you'll need the following packages:

SPHERE, for decoding from the OGI sound formats
SoX, for sound preprocessing
openSMILE, for audio feature extraction

The binaries from these pacakages need to be on your PATH for the corpus handling scripts to work properly.

Data

We use the OGI Multilanguage Corpus for training and evaluation. You can acquire a copy of this corpus through the LDC.

With a fresh download of the OGI corpus, run prepare.py to get the corpus into a format usable with this software:

python prepare.py <OGI_PATH> ge,ma

In this command the argument ge,ma specifies that we'd like to perform language detection with German and Mandarin recordings.

Training

The prepare.py script generates a directory with devtest, evaltest, and train Pickle files for each provided language. To train on this data, run the command

python nodules.py -d <data_dir> train

The nodules.py script will train SVM and logistic regression models with default settings and output them to a local directory. Training can take anywhere from 5 minutes (baseline) to 24 hours (SVM, complex feature set, lots of data).

Testing

You can use the same nodules.py script to evaluate trained models. The train phase should save model files with .jbl extensions. Provide such a file to the nodules.py script like so:

python nodules.py -d <data_dir> --model-in-file <model_file> -v test

Name		Name	Last commit message	Last commit date
Latest commit History 123 Commits
config		config
data		data
.gitignore		.gitignore
README.md		README.md
__init__.py		__init__.py
make_dummy_data.py		make_dummy_data.py
nodule_features.py		nodule_features.py
nodules.py		nodules.py
prepare.py		prepare.py
prepare_accents.py		prepare_accents.py
print_weights.py		print_weights.py
recording.py		recording.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

config

config

data

data

.gitignore

.gitignore

README.md

README.md

init.py

init.py

make_dummy_data.py

make_dummy_data.py

nodule_features.py

nodule_features.py

nodules.py

nodules.py

prepare.py

prepare.py

prepare_accents.py

prepare_accents.py

print_weights.py

print_weights.py

recording.py

recording.py

requirements.txt

requirements.txt

Repository files navigation

Installation

Data

Training

Testing

About

Releases

Packages

Languages

chagge/langdetect

Folders and files

Latest commit

History

Repository files navigation

Installation

Data

Training

Testing

About

Resources

Stars

Watchers

Forks

Languages