A PyTorch Implementation of Tacotron: End-to-End Text-to-Speech Deep-Learning Model

An implementation of Google's Tacotron TTS system in PyTorch.

Updates

2018/09/15 => Fix RNN feeding bug.
2018/11/04 => Add attention mask and loss mask.
2019/05/17 => 2nd version updated.

TODO

Add vocoder
Multispeaker version

Requirements

See used_packages.txt.

Usage

Data
Download the LJSpeech dataset provided by keithito. It contains 13,100 short audio clips of a single speaker, approximately 24 hours in total.
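For orientation, the LJSpeech `metadata.csv` file has one record per line with three pipe-separated fields: clip ID, raw transcription, and normalized transcription. The sketch below shows one way to read it. It is not the code in `data/preprocess.py`; the function names `parse_metadata_line` and `load_metadata` are hypothetical and only illustrate the file format.

```python
from typing import List, Tuple


def parse_metadata_line(line: str) -> Tuple[str, str]:
    """Return (clip_id, text) from one pipe-separated metadata.csv row."""
    fields = line.rstrip("\n").split("|")
    clip_id = fields[0]
    # Prefer the normalized transcription (third field) when it is present
    # and non-empty; otherwise fall back to the raw transcription.
    text = fields[2] if len(fields) > 2 and fields[2] else fields[1]
    return clip_id, text


def load_metadata(path: str) -> List[Tuple[str, str]]:
    """Read every non-empty row of metadata.csv (hypothetical helper)."""
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                rows.append(parse_metadata_line(line))
    return rows
```

Each clip ID corresponds to a file `<clip_id>.wav` in the `wavs/` directory of the dataset.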
Preprocessing

# Generate a directory 'training/' containing extracted features and a new meta file 'ljspeech_meta.txt'
$ python data/preprocess.py --output-dir training \
                            --data-dir <WHERE_YOU_PUT_YOUR_DATASET>/LJSpeech-1.1/wavs \
                            --old-meta <WHERE_YOU_PUT_YOUR_DATASET>/LJSpeech-1.1/metadata.csv \
                            --config config/config.yaml

Split dataset

# Generate 'meta_train.txt' and 'meta_test.txt' in 'training/'
$ python data/train_test_split.py --meta-all training/ljspeech_meta.txt \
                                  --ratio-test 0.1
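To make the effect of `--ratio-test` concrete, here is a minimal sketch of a ratio-based split over the lines of a meta file. It is an illustration, not the logic of `data/train_test_split.py`: the function name `split_meta`, the fixed seed, and the choice to shuffle before splitting are all assumptions.

```python
import random
from typing import List, Sequence, Tuple


def split_meta(
    lines: Sequence[str], ratio_test: float = 0.1, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Split meta-file lines into (train, test) lists (hypothetical helper)."""
    if not 0.0 <= ratio_test <= 1.0:
        raise ValueError("ratio_test must be between 0 and 1")
    shuffled = list(lines)
    # Shuffle with a fixed seed so the split is reproducible (an assumption).
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * ratio_test)
    # The first n_test shuffled lines become the test set, the rest train.
    return shuffled[n_test:], shuffled[:n_test]
```

With a ratio of 0.1, roughly one tenth of the 13,100 LJSpeech clips would be held out for testing.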

Train

# Start training
$ python main.py --config config/config.yaml \
                 --checkpoint-dir <WHERE_TO_PUT_YOUR_CHECKPOINTS> 

# Restart training
$ python main.py --config config/config.yaml \
                 --checkpoint-dir <WHERE_TO_PUT_YOUR_CHECKPOINTS> \
                 --checkpoint-path <LAST_CHECKPOINT_PATH>

Examine the training process

# Scalars : loss curve 
# Audio   : validation wavs
# Images  : validation spectrograms & attentions
$ tensorboard --logdir log

Inference

# Generate synthesized speech 
$ python generate_speech.py --text "For example, Taiwan is a great place." \
                            --output <DESIRED_OUTPUT_PATH> \
                            --checkpoint-path <CHECKPOINT_PATH> \
                            --config config/config.yaml

Samples

All the samples can be found here. These samples were generated after 102k updates.

Alignment

Proper alignment occurs after 10k steps of updating.

Differences from the original Tacotron

Gradient clipping
Noam-style learning rate decay (the schedule used in "Attention Is All You Need")
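The two additions above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in this repository: `noam_lr` uses a common formulation in which the learning rate rises linearly during warmup, peaks at `base_lr`, and then decays with the inverse square root of the step. The defaults `base_lr=0.002` and `warmup_steps=4000`, and both function names, are hypothetical. `clip_by_global_norm` shows the idea behind gradient clipping on a flat list of gradient values; in PyTorch the same effect is normally obtained with `torch.nn.utils.clip_grad_norm_`.

```python
import math
from typing import List


def noam_lr(step: int, base_lr: float = 0.002, warmup_steps: int = 4000) -> float:
    """Noam-style schedule: linear warmup, then inverse-square-root decay."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first update
    return base_lr * warmup_steps ** 0.5 * min(
        step * warmup_steps ** -1.5, step ** -0.5
    )


def clip_by_global_norm(grads: List[float], max_norm: float) -> List[float]:
    """Rescale gradients so that their L2 norm does not exceed max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm == 0.0 or total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]
```

Under these assumed defaults, the rate peaks at step 4000 and has halved by step 16000, since the decay follows the inverse square root of the step count.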

Acknowledgements

This work is mainly based on r9y9's implementation of Tacotron; however, my implementation is more user-friendly.

Reference

Tacotron: Towards End-to-End Speech Synthesis [link]

Finally, this is the code used in my work "End-to-end Text-to-speech for Low-resource Languages by Cross-Lingual Transfer Learning".
