A Pytorch Implementation of Tacotron: End-to-end Text-to-speech Deep-Learning Model

mosh2151984/Tacotron-pytorch

An implementation of Google's Tacotron TTS system in PyTorch.

Updates

2018/09/15 => Fix RNN feeding bug.
2018/11/04 => Add attention mask and loss mask.
2019/05/17 => 2nd version updated.

TODO

  • Add vocoder
  • Multispeaker version

Requirements

See used_packages.txt.

Usage

  • Data
    Download the LJSpeech dataset provided by keithito. It contains 13,100 short audio clips from a single speaker, with a total length of approximately 24 hours.

  • Preprocessing

# Generate a directory 'training/' containing extracted features and a new meta file 'ljspeech_meta.txt'
$ python data/preprocess.py --output-dir training \
                            --data-dir <WHERE_YOU_PUT_YOUR_DATASET>/LJSpeech-1.1/wavs \
                            --old-meta <WHERE_YOU_PUT_YOUR_DATASET>/LJSpeech-1.1/metadata.csv \
                            --config config/config.yaml
  • Split dataset
# Generate 'meta_train.txt' and 'meta_test.txt' in 'training/'
$ python data/train_test_split.py --meta-all training/ljspeech_meta.txt \
                                  --ratio-test 0.1
  • Train
# Start training
$ python main.py --config config/config.yaml \
                 --checkpoint-dir <WHERE_TO_PUT_YOUR_CHECKPOINTS> 

# Restart training
$ python main.py --config config/config.yaml \
                 --checkpoint-dir <WHERE_TO_PUT_YOUR_CHECKPOINTS> \
                 --checkpoint-path <LAST_CHECKPOINT_PATH>
  • Examine the training process
# Scalars : loss curve 
# Audio   : validation wavs
# Images  : validation spectrograms & attentions
$ tensorboard --logdir log
  • Inference
# Generate synthesized speech 
$ python generate_speech.py --text "For example, Taiwan is a great place." \
                            --output <DESIRED_OUTPUT_PATH> \
                            --checkpoint-path <CHECKPOINT_PATH> \
                            --config config/config.yaml
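
To make the "Split dataset" step more concrete, here is a minimal sketch of a ratio-based train/test split. It is not the code in data/train_test_split.py. The function name `split_meta`, the fixed seed, and the assumption that the meta file holds one utterance per line are all illustrative.

```python
import random

def split_meta(lines, ratio_test=0.1, seed=0):
    """Shuffle meta lines and split them into (train, test) lists.

    Hypothetical helper: assumes one utterance per non-empty line.
    """
    entries = [ln for ln in lines if ln.strip()]  # drop blank lines
    rng = random.Random(seed)                     # fixed seed for a reproducible split
    shuffled = entries[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * ratio_test)    # number of held-out utterances
    return shuffled[n_test:], shuffled[:n_test]

if __name__ == "__main__":
    # Made-up entries standing in for lines of 'ljspeech_meta.txt'
    demo = ["utt%03d|some text" % i for i in range(100)]
    train, test = split_meta(demo, ratio_test=0.1)
    print(len(train), len(test))
```

With `--ratio-test 0.1`, a split of this kind would hold out roughly one tenth of the utterances for validation.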

Samples

All the samples can be found here. They were generated after 102k training updates.

Alignment

Proper attention alignment appears after 10k training steps.

Differences from the original Tacotron

  1. Gradient clipping
  2. Noam-style learning rate decay (the schedule used in "Attention Is All You Need")
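
As a rough illustration of the two differences listed above, the sketch below implements one common form of the Noam schedule (linear warm-up followed by inverse-square-root decay) and clipping by global gradient norm, both in plain Python. The constants `base_lr=0.002` and `warmup_steps=4000` and both function names are assumptions for illustration. The values actually used by this repository are set in config/config.yaml and may differ.

```python
import math

def noam_lr(step, base_lr=0.002, warmup_steps=4000):
    """Noam-style schedule: the rate rises linearly for `warmup_steps`
    steps, peaks at `base_lr`, then decays proportionally to step**-0.5.
    (base_lr and warmup_steps are assumed values, not the repo's.)
    """
    step = max(step, 1)  # guard against step 0
    scale = warmup_steps ** 0.5 * min(step * warmup_steps ** -1.5, step ** -0.5)
    return base_lr * scale

def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)          # already within the limit
    factor = max_norm / total
    return [g * factor for g in grads]

if __name__ == "__main__":
    for s in (1000, 4000, 16000):
        print(s, noam_lr(s))
    print(clip_by_global_norm([3.0, 4.0], max_norm=1.0))
```

In a PyTorch training loop, the clipping step is usually done with `torch.nn.utils.clip_grad_norm_` called between `loss.backward()` and `optimizer.step()`, and the scheduled rate is written into each optimizer parameter group before the update.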

Acknowledgements

This work is mainly based on r9y9's implementation of Tacotron; however, my implementation aims to be more user-friendly.

Reference

  • Tacotron: Towards End-to-End Speech Synthesis

Finally, this is the code used in my work "End-to-end Text-to-speech for Low-resource Languages by Cross-Lingual Transfer Learning".
