# LIVE_SER

Live demo for speech emotion recognition using Keras and Tensorflow models

Maintainer: batikim09 - j.kim@utwente.nl

This folder contains the source code for speech emotion recognition. The module relies on several machine learning packages and platforms, such as Google TensorFlow and Keras, and is computationally expensive; hence, it may not be operational on mobile devices. Performance depends on contextual factors such as speaker, language, and environment.

The module mainly consists of two parts: voice activity detection and recognition (see the code for details). At the moment, we provide an emotion recognition model trained on aggregated English speech corpora (eNTERFACE, SEMAINE, and IEMOCAP), so English speakers will likely fit this model best. However, you can use your own trained model for the best performance.
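For orientation only, here is a conceptual sketch of that two-stage loop; `is_speech()` and `predict_emotion()` are hypothetical placeholders, and the actual implementation lives in src/offline_ser.py:

```python
# Conceptual sketch only -- not the project's actual code.
def process_stream(frames):
    """Gate incoming audio frames with VAD, then classify detected speech."""
    segment = []
    for frame in frames:                  # frames: small chunks of raw audio
        if is_speech(frame):              # stage 1: voice activity detection (hypothetical helper)
            segment.append(frame)
        elif segment:                     # silence after speech: the segment is complete
            yield predict_emotion(segment)    # stage 2: emotion recognition (hypothetical helper)
            segment = []
```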

The prediction is very sensitive to the microphone gain. Hence, it is important to set the minimum and maximum gains used for gain normalisation. Please find details of the arguments below.

## Contents

  1. Installation Requirements

  2. Build

  3. Device setup

  4. Usage

  5. References

## 1. Installation Requirements

This software only runs on OSX or Linux (tested on Ubuntu). It is compatible with Python 2.x and 3.x, but the following descriptions assume that Python 3.x is installed.

### Basic system packages

This software relies on several system packages that must be installed with a package manager.

For Ubuntu, please run the following:

`sudo apt-get install python-pip python-dev libhdf5-dev portaudio19-dev`

For OSX, you should install portaudio and pulseaudio using Homebrew as follows:

brew install portaudio

brew install pulseaudio

If you have any issues with portaudio, see: https://stackoverflow.com/questions/33851379/pyaudio-installation-on-mac-python-3?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa

### Python packages

Using pip, install all required Python modules (pip version >= 8.1 is required; see: http://askubuntu.com/questions/712339/how-to-upgrade-pip-to-latest).

If you have an old version of numpy (<1.12), please remove it first (see tensorflow/tensorflow#559).
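If you are not sure which numpy you have, a quick check from Python:

```python
# Print the installed numpy version; anything below 1.12 should be removed first.
import numpy
print(numpy.__version__)
```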

Next, install the right version of TensorFlow for your OS (see https://www.tensorflow.org/install/).

For example, you can install a CPU-only build for OSX:

pip3 install --upgrade https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-1.8.0-py3-none-any.whl

Then, install the remaining Python dependencies:

sudo pip3 install -r requirements.txt

## 2. Build

No build step is required at the moment.

## 3. Device setup

### Running on Linux

Currently, using PulseAudio as the input device is the most stable approach. If you do not specify a device ID, a PulseAudio device will be chosen as the default input. However, you must make sure the PulseAudio server is running (if not, type "pulseaudio --start").

If you are familiar with PulseAudio and ALSA, you can select your own input device in PulseAudio and use it as the input device for emotion recognition as follows:

  1. Run pulseaudio as a daemon process if it is not already running:

pulseaudio -D

  2. Find your device (see: https://wiki.archlinux.org/index.php/PulseAudio/Examples) by typing in your terminal:

pacmd list-sources | grep -e device.string -e 'name:'

Depending on your OS and devices, this lists various names. You need to choose the right input device among them.

For example, suppose you find "name: <Channel_1__Channel_2.4>" as your input source. Then, set the default source for pulse by typing:

pacmd "set-default-source Channel_1__Channel_2.4"

  3. Run the demo.

The following script will use the default pulse device:

python ./src/offline_ser.py -p_mode 2 -f_mode 1 -log ./output/live.csv -md ./model/AIBO.si.ENG.cw.raw.2d.res.lstm.gpool.dnn.1.h5 -c_len 1600 -m_t_step 16000 -tasks 'arousal:3,valence:3'

### Running on OSX

Type:

python3 ./src/offline_ser.py

This will give you a list of audio devices, and you need to identify the index of your device. Note that this index changes depending on which USB devices are plugged in, so it is safest to check it before every run.
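If you prefer to check from Python, here is a minimal sketch using pyaudio (assumed to be available via the requirements; the listing logic inside src/offline_ser.py may differ):

```python
import pyaudio

# Enumerate audio devices and print the index of each input-capable one;
# use that index as the value of the -d_id argument.
pa = pyaudio.PyAudio()
for i in range(pa.get_device_count()):
    info = pa.get_device_info_by_index(i)
    if info.get("maxInputChannels", 0) > 0:
        print(i, info["name"])
pa.terminate()
```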

If you find your device at index 2 (the value for the d_id argument), run:

python ./src/offline_ser.py -d_id 2 -p_mode 2 -f_mode 1 -log ./output/live.csv -md ./model/AIBO.si.ENG.cw.raw.2d.res.lstm.gpool.dnn.1.h5 -c_len 1600 -m_t_step 16000 -tasks 'arousal:3,valence:3'

## 4. Usage

There are many parameters that control VAD, feature extraction, and prediction. To get information about the parameters, type:

python ./src/offline_ser.py

For quick use (assuming your device id is 2):

python ./src/offline_ser.py -d_id 2 -p_mode 2 -f_mode 1 -log ./output/live.csv -md ./model/AIBO.si.ENG.cw.raw.2d.res.lstm.gpool.dnn.1.h5 -c_len 1600 -m_t_step 16000 -tasks 'arousal:3,valence:3'

The provided model (./model/AIBO.si.ENG.cw.raw.2d.res.lstm.gpool.dnn.1.h5) was trained end-to-end, which means its input feature is the raw waveform, so you have to set -f_mode to "1". The raw waveform has 16000 samples per second, so we set -m_t_step to "16000". The model uses 10 contextual windows, so each window has 1600 samples (-c_len 1600).

To get the probability distribution over classes, we set -p_mode to "2".

The model predicts two tasks: arousal and valence. Each task has 3 classes (low, neutral, and high), so we specify -tasks as "arousal:3,valence:3".
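Putting these numbers together, here is a rough offline sketch of a single prediction step (the reshaping and output layout are assumptions, not the demo's exact pipeline; see src/offline_ser.py for the real code):

```python
import numpy as np
from keras.models import load_model

M_T_STEP = 16000   # -m_t_step: samples per prediction step (1 s at 16 kHz)
C_LEN = 1600       # -c_len: samples per contextual window (10 windows per step)

model = load_model("./model/AIBO.si.ENG.cw.raw.2d.res.lstm.gpool.dnn.1.h5")

# One step of gain-normalised raw audio; silence is used here as a placeholder.
audio = np.zeros(M_T_STEP, dtype=np.float32)

# Split into 10 contextual windows of 1600 samples each. The exact input
# shape the bundled model expects may differ -- check src/offline_ser.py.
windows = audio.reshape(1, M_T_STEP // C_LEN, C_LEN)

# With -p_mode 2 the demo reports class probabilities. The model is assumed
# to expose one softmax output per task: arousal and valence, 3 classes each.
arousal_probs, valence_probs = model.predict(windows)
print("arousal:", arousal_probs[0], "valence:", valence_probs[0])
```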

Note that gain normalisation is crucial for performance. Details can be found in ./scripts/decode_file.sh.
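For intuition, here is a minimal sketch of min/max level normalisation (the function and formula are illustrative assumptions; the demo's actual normalisation is driven by the arguments shown in that script):

```python
import numpy as np

def normalised_level(samples, min_gain, max_gain):
    """Map a chunk's level into [0, 1] via min/max normalisation (illustrative only)."""
    level = float(np.abs(samples).mean())        # crude estimate of the chunk's gain
    level = np.clip(level, min_gain, max_gain)   # clamp into the expected range
    return (level - min_gain) / (max_gain - min_gain)
```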

## 5. References

This software is based on the following papers. Please cite one of them in your publications if this software helps your research:

@inproceedings{kim2017interspeech, title={Towards Speech Emotion Recognition ``in the wild'' using Aggregated Corpora and Deep Multi-Task Learning}, author={Kim, Jaebok and Englebienne, Gwenn and Truong, Khiet P and Evers, Vanessa}, booktitle={Proceedings of the INTERSPEECH}, pages={1113--1117}, year={2017} }

@inproceedings{kim2017acmmm, title={Deep Temporal Models using Identity Skip-Connections for Speech Emotion Recognition}, author={Kim, Jaebok and Englebienne, Gwenn and Truong, Khiet P and Evers, Vanessa}, booktitle={Proceedings of ACM Multimedia}, pages={1006--1013}, year={2017} }

@inproceedings{kim2017acii, title={Learning spectro-temporal features with 3D CNNs for speech emotion recognition}, author={Kim, Jaebok and Truong, Khiet and Englebienne, Gwenn and Evers, Vanessa}, booktitle={Proceedings of International Conference on Affective Computing and Intelligent Interaction}, year={2017} }
