- Transformer-TTS
- FastSpeech
- FastSpeech2
- ...
- Build Dataset Loader
- Compare mel-spectrogram processing/loading time (3:1)
- Build a model and modules
- Baseline model architecture
- Tensorboard logging
- requirements.txt or Docker image
- Overwrite configs with parsed command-line arguments
- Check why phoneme dictionary is of length 12463
- Make phoneme dictionary process multi-threaded
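For the config-overwriting TODO above, one possible approach is to give every CLI flag a `None` default and copy over only the values a user actually passed. This is a minimal sketch, not the repo's implementation; the key names (`batch_size`, `learning_rate`) and the helper `overwrite_config` are illustrative assumptions.

```python
import argparse

def overwrite_config(config: dict, args: argparse.Namespace) -> dict:
    """Return a copy of `config` where any CLI argument that was
    explicitly set (i.e. is not None) replaces the config value.
    Hypothetical helper; key names are illustrative."""
    updated = dict(config)
    for key, value in vars(args).items():
        if value is not None and key in updated:
            updated[key] = value
    return updated

parser = argparse.ArgumentParser()
# Defaults of None mean "not set on the command line"
parser.add_argument("--batch_size", type=int, default=None)
parser.add_argument("--learning_rate", type=float, default=None)
args = parser.parse_args(["--batch_size", "32"])  # simulate a CLI call

config = {"batch_size": 16, "learning_rate": 1e-4}
config = overwrite_config(config, args)
print(config)  # batch_size overwritten, learning_rate kept from config
```

Because unset flags stay `None`, config-file values survive unless the user overrides them explicitly.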
```bash
git clone https://github.com/Joovvhan/transformer-tts.git
cd transformer-tts
source scripts/set_locale.sh
source scripts/init.sh
python main.py
```
- Neural Speech Synthesis with Transformer Network
- Each phoneme has a trainable 512-dimensional embedding
- The output of each convolution layer has 512 channels and is followed by batch normalization, a ReLU activation, and a dropout layer
- A linear projection is added after the final ReLU activation
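The encoder pre-net described in the notes above can be sketched in PyTorch as follows. This is a sketch under assumptions, not the repo's code: the paper's pre-net uses a 3-layer CNN, and the kernel size (5), phoneme-vocabulary size, and dropout rate here are illustrative.

```python
import torch
import torch.nn as nn

class EncoderPrenet(nn.Module):
    """Sketch of the Transformer-TTS encoder pre-net: phoneme embedding,
    conv stack (conv -> batch norm -> ReLU -> dropout), final projection.
    Vocabulary size, kernel size, and dropout are assumed values."""

    def __init__(self, num_phonemes=80, emb_dim=512, num_layers=3, dropout=0.1):
        super().__init__()
        # Each phoneme gets a trainable 512-dim embedding
        self.embedding = nn.Embedding(num_phonemes, emb_dim)
        layers = []
        for _ in range(num_layers):
            layers += [
                # Each conv layer outputs 512 channels ...
                nn.Conv1d(emb_dim, emb_dim, kernel_size=5, padding=2),
                # ... followed by batch norm, ReLU, and dropout
                nn.BatchNorm1d(emb_dim),
                nn.ReLU(),
                nn.Dropout(dropout),
            ]
        self.convs = nn.Sequential(*layers)
        # Linear projection after the final ReLU activation
        self.projection = nn.Linear(emb_dim, emb_dim)

    def forward(self, phoneme_ids):
        # phoneme_ids: (batch, time) int64
        x = self.embedding(phoneme_ids)            # (B, T, 512)
        x = self.convs(x.transpose(1, 2))          # conv over time: (B, 512, T)
        return self.projection(x.transpose(1, 2))  # (B, T, 512)

prenet = EncoderPrenet()
out = prenet(torch.zeros(2, 7, dtype=torch.long))
print(out.shape)  # each of the 7 phoneme positions maps to a 512-dim vector
```

The transposes are needed because `nn.Conv1d` and `nn.BatchNorm1d` expect the channel dimension second, while the embedding and projection operate on the last dimension.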