songtaoshi/open_stt_e2e

 
 

PyTorch E2E ASR for open_stt dataset

A minimal set of scripts for training language and acoustic models for speech recognition. The training pipeline consists of the following stages:

  1. Character-based RNN language model

  2. CNN-RNN acoustic model with CTC loss

  3. Character-based RNN language model and CNN-RNN acoustic model with RNN-T loss

  4. Fine-tuning with Reinforcement Learning and RNN-T loss
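Stage 2 trains the acoustic model with CTC loss, which has the model emit one label per frame, including a special blank symbol. The sketch below shows the standard CTC collapsing rule used to turn such a frame-level sequence into an output sequence: merge repeated labels, then drop blanks. It is an illustration only, not code from this repository; the function name `ctc_greedy_collapse` and the choice of index 0 for the blank are assumptions.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_labels, blank=0):
    """Collapse per-frame labels into an output label sequence.

    Assumption: `blank` (index 0 here) is the CTC blank symbol.
    """
    # groupby merges each run of identical consecutive labels into one item;
    # blanks are then removed, so a blank between two equal labels keeps both.
    return [label for label, _ in groupby(frame_labels) if label != blank]


# Hypothetical per-frame argmax output of an acoustic model.
frames = [0, 3, 3, 0, 3, 5, 5, 0]
print(ctc_greedy_collapse(frames))
```

Note that the blank between the two runs of label 3 is what allows a doubled character to survive the collapse.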

Results

The following table shows the results on the Russian Open Speech To Text (STT/ASR) Dataset.

Stage  Model  Loss   Updates  CER   WER
1      LM     CE     2407000
2      AM     CTC    216850   19.9  57.0
3      LM+AM  RNN-T  108425   21.7  45.6
4      LM+AM  RL     300      19.2  43.9
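The table reports CER (character error rate) and WER (word error rate) without defining them. Under the usual definitions, each is the Levenshtein edit distance between the reference and the hypothesis, divided by the reference length and expressed as a percentage, computed over characters for CER and over whitespace-separated words for WER. The sketch below follows those definitions; whether the repository computes its numbers in exactly this way is an assumption, and all function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference tokens
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance as a percentage of the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def cer(ref, hyp):
    return error_rate(list(ref), list(hyp))


def wer(ref, hyp):
    return error_rate(ref.split(), hyp.split())


print(cer("abcd", "abxd"), wer("a b c d", "a x c d"))
```

Because a single wrong character makes the whole word wrong, WER is typically much higher than CER for the same output.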

Requirements

Preprocessing

Acoustic models are based on log mel filterbank features: 40 filters computed over 25 ms windows with a 10 ms stride.

  • features.py - extract features of utterances listed in manifest file
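To make the window and stride figures concrete, the sketch below computes how many feature frames an utterance produces, so that the feature matrix has shape (frames, 40). The 16 kHz sample rate and the absence of padding are assumptions that the text above does not state; libraries that pad or center windows will give slightly different counts. The function name `frame_count` is illustrative and does not come from `features.py`.

```python
def frame_count(num_samples, sample_rate=16000, window_ms=25, stride_ms=10):
    """Number of full analysis windows that fit in a signal, without padding.

    Assumption: 16 kHz audio; the sample rate is not given in the README.
    """
    window = sample_rate * window_ms // 1000  # 400 samples at 16 kHz
    stride = sample_rate * stride_ms // 1000  # 160 samples at 16 kHz
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // stride


# One second of audio at the assumed 16 kHz sample rate.
frames = frame_count(16000)
print(frames, "frames x 40 filters")
```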

The language model is character-based and case-insensitive.

  • utterances.py - extract transcriptions of precomputed utterances
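A character-based, case-insensitive model implies that transcriptions are lowercased and mapped to per-character indices before training. The sketch below shows one way this could look; it is not taken from `utterances.py`. The whitespace normalization and the reservation of index 0 (for example, for a blank or padding symbol) are assumptions, as are all function names.

```python
def normalize(text):
    """Lowercase and collapse whitespace (assumed normalization)."""
    return " ".join(text.lower().split())


def build_vocab(transcripts):
    """Map every character seen in the transcripts to an integer index."""
    chars = sorted({ch for t in transcripts for ch in normalize(t)})
    # Index 0 is left unused here, reserved by assumption for blank/padding.
    return {ch: i for i, ch in enumerate(chars, start=1)}


def encode(text, vocab):
    """Convert a transcription to a list of character indices."""
    return [vocab[ch] for ch in normalize(text)]


vocab = build_vocab(["Привет мир", "ПРИВЕТ"])
# Both spellings encode identically because case is ignored.
print(encode("Привет", vocab) == encode("ПРИВЕТ", vocab))
```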

Google Cloud Storage

Pre-processed datasets:

Pre-trained models:

Kaggle Kernels

There are outdated kernels with small training subsets:
