The goal of this project is to create a multi-modal Speech Emotion Recognition system trained on the IEMOCAP dataset.
- Feb 2019 - IEMOCAP dataset acquisition and parsing
- Mar 2019 - Baseline of the linguistic model
- Apr 2019 - Baseline of the acoustic model
- May 2019 - Integration and optimization of both models
- Jun 2019 - Integration with an open-source ASR (most likely DeepSpeech)
IEMOCAP stands for the Interactive Emotional Dyadic Motion Capture database. It is the most popular database used for multi-modal speech emotion recognition.
The IEMOCAP database suffers from major class imbalance. To mitigate this, we reduce the number of classes to four and merge Excitement and Happiness into a single class.
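A minimal sketch of this label mapping, assuming the category abbreviations used in IEMOCAP's evaluation files (`neu`, `hap`, `exc`, `sad`, `ang`); the `map_label` helper is illustrative, not part of the actual codebase:

```python
# Hypothetical helper: collapse raw IEMOCAP categories into the 4 classes used here.
# 'exc' (Excitement) is merged into Happiness; remaining categories are discarded.
LABEL_MAP = {
    "neu": "Neutral",
    "hap": "Happiness",
    "exc": "Happiness",  # merged with Happiness to ease class imbalance
    "sad": "Sadness",
    "ang": "Anger",
}

def map_label(raw_label: str):
    """Return one of the 4 target classes, or None for discarded categories."""
    return LABEL_MAP.get(raw_label)
```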
Results for the single-modality models and their ensembles on the 4-class setup:

Model | Weighted Accuracy | Unweighted Accuracy | Loss |
---|---|---|---|
Acoustic | 0.602 | 0.601 | 0.983 |
Linguistic | 0.642 | 0.638 | 0.913 |
Ensemble (highest confidence) | 0.699 | 0.704 | 0.827 |
Ensemble (average) | 0.711 | 0.708 | 0.948 |
Ensemble (weighted average) | 0.716 | 0.712 | 0.944 |
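The three ensembling rules above can be sketched directly over the two models' softmax outputs. This is an illustrative reconstruction, not the project's actual code; each model is assumed to return an `(n_samples, n_classes)` probability matrix, and the ensemble weight shown is only a placeholder:

```python
import numpy as np

def ensemble_highest_confidence(p_acoustic, p_linguistic):
    """Per sample, keep the prediction of whichever model is more confident."""
    use_acoustic = p_acoustic.max(axis=1) >= p_linguistic.max(axis=1)
    return np.where(use_acoustic,
                    p_acoustic.argmax(axis=1),
                    p_linguistic.argmax(axis=1))

def ensemble_average(p_acoustic, p_linguistic):
    """Average the two probability distributions, then take the argmax."""
    return ((p_acoustic + p_linguistic) / 2).argmax(axis=1)

def ensemble_weighted_average(p_acoustic, p_linguistic, w_acoustic=0.4):
    """Weighted average of the distributions; the weight is a tunable
    hyperparameter (0.4 is a placeholder, not the experiment's value)."""
    return (w_acoustic * p_acoustic
            + (1 - w_acoustic) * p_linguistic).argmax(axis=1)
```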
Final metrics of the weighted-average ensemble: loss 0.944, weighted accuracy 0.716, unweighted accuracy 0.712. Confusion matrix (rows: true class, columns: predicted class):

True \ Predicted | Neutral | Happiness | Sadness | Anger |
---|---|---|---|---|
Neutral | 291 | 60 | 31 | 9 |
Happiness | 88 | 282 | 17 | 6 |
Sadness | 46 | 19 | 191 | 2 |
Anger | 61 | 26 | 4 | 167 |
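Both accuracy figures follow directly from this matrix: weighted accuracy is the overall fraction of correct predictions, and unweighted accuracy is the mean of per-class recalls. A small self-contained sanity check (assumes only numpy):

```python
import numpy as np

# Confusion matrix from above: rows are true classes, columns are predictions,
# in the order [Neutral, Happiness, Sadness, Anger].
conf_mat = np.array([
    [291,  60,  31,   9],
    [ 88, 282,  17,   6],
    [ 46,  19, 191,   2],
    [ 61,  26,   4, 167],
])

# Weighted accuracy: correct predictions over all samples.
weighted_acc = conf_mat.trace() / conf_mat.sum()

# Unweighted accuracy: mean of per-class recalls (diagonal over row sums).
unweighted_acc = (conf_mat.diagonal() / conf_mat.sum(axis=1)).mean()

print(f"weighted acc: {weighted_acc:.3f}")      # 0.716
print(f"unweighted acc: {unweighted_acc:.3f}")  # 0.712
```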