
Larynx


A fork of MozillaTTS that uses gruut for cleaning and phonemizing text.

Used by the Rhasspy project to train freely available voices from public datasets. See pre-trained models.

See the tutorial below for step-by-step instructions.

Once installed, you can run a web server and test it out at http://localhost:5002

Dependencies

Pre-Trained Models

Models and Docker images are available here:

If you use Home Assistant, these are also available as Hass.io add-ons.

Differences from MozillaTTS

MozillaTTS models (the project is maintained by the awesome erogol) are typically trained on a set of phonemes derived from text for a given language. By default, the phonemizer tool is used, which calls out to espeak-ng to guess phonemes for words.

Inside MozillaTTS, there is a file called symbols.py that contains a large set of phonemes (129) meant to cover a large number of languages:

# Phonemes definition
_vowels = 'iyɨʉɯuɪʏʊeøɘəɵɤoɛœɜɞʌɔæɐaɶɑɒᵻ'
_non_pulmonic_consonants = 'ʘɓǀɗǃʄǂɠǁʛ'
_pulmonic_consonants = 'pbtdʈɖcɟkɡqɢʔɴŋɲɳnɱmʙrʀⱱɾɽɸβfvθðszʃʒʂʐçʝxɣχʁħʕhɦɬɮʋɹɻjɰlɭʎʟ'
_suprasegmentals = 'ˈˌːˑ'
_other_symbols = 'ʍwɥʜʢʡɕʑɺɧ'
_diacrilics = 'ɚ˞ɫ'
_phonemes = _vowels + _non_pulmonic_consonants + _pulmonic_consonants + _suprasegmentals + _other_symbols + _diacrilics

Contrast this with the set of phonemes (45) used by gruut for U.S. English.

_ | ‖ # aɪ aʊ b d d͡ʒ eɪ f h i iː j k l m n oʊ p s t t͡ʃ uː v w z æ ð ŋ ɑ ɑː ɔ ɔɪ ə ɛ ɝ ɡ ɪ ɹ ʃ ʊ ʌ ʒ θ

Fewer phonemes mean smaller models, which train and synthesize faster. Unfortunately, this also means Larynx models are not compatible with vanilla MozillaTTS.

Larynx is intended to be used on small datasets from volunteers, typically with only 1,000 examples. We therefore do more work up front, so the model does not have to learn about diphthongs, short/long vowels, or all of the ways breaks can be written.
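
To see what gruut produces, here is a minimal sketch of phonemizing a sentence. It assumes the sentences API from recent gruut releases; treat it as illustrative rather than the exact interface Larynx uses internally:

from gruut import sentences

# Print each word with its gruut phonemes for U.S. English
for sent in sentences("This is a test sentence!", lang="en-us"):
    for word in sent:
        if word.phonemes:
            print(word.text, " ".join(word.phonemes))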

Datasets

Larynx assumes your datasets follow a simple convention:

  • A metadata.csv file
    • Delimiter is | and there is no header
    • Each row is id|text
    • Each corresponding WAV file must be named <id>.wav
  • WAV files in the same directory
    • All WAVs have the same sample rate (22050 recommended)
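
For example, a minimal (hypothetical) dataset looks like this:

/path/to/dataset/
    metadata.csv
    0001.wav
    0002.wav

where metadata.csv contains one invented row per WAV file:

0001|This is the first sentence.
0002|This is the second sentence.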

Installation

See scripts/create-venv.sh

This includes cloning rhasspy/TTS as a submodule (dev branch).

Activate the virtual environment with source .venv/bin/activate and leave it with deactivate.
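
Putting it together, installation looks roughly like this (the clone URL is illustrative; create-venv.sh is expected to set up .venv and the rhasspy/TTS submodule as described above):

$ git clone https://github.com/rhasspy/larynx.git
$ cd larynx
$ ./scripts/create-venv.sh
$ source .venv/bin/activate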

Docker

A CPU-only Docker image is available at rhasspy/larynx with no voices included. See the voices section for Docker images containing specific voices.

$ docker run -it -p 5002:5002 \
    --device /dev/snd:/dev/snd \
    rhasspy/larynx:<VOICE>-<VERSION>

See the web server section for endpoints.

You can leave off --device if you don't plan to play test audio through your speakers.
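
Once the container is running, you can sanity-check it from another terminal using the /api/tts endpoint described below (the text is just an example):

$ curl -G 'http://localhost:5002/api/tts' \
    --data-urlencode 'text=Welcome to Larynx!' \
    -o welcome.wav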

Usage

Before training, you must initialize a model directory. Larynx will scan your dataset(s) and generate appropriate config files for both TTS and a vocoder.

Initialization

$ python3 -m larynx init /path/to/model --language <LANGUAGE> --dataset /path/to/dataset

Add --model-type glowtts to use GlowTTS instead of Tacotron2.

See python3 -m larynx init --help for more options.
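
For example, a hypothetical U.S. English setup (the paths are placeholders, and en-us is the gruut-style language code assumed here):

$ python3 -m larynx init /home/user/models/my-voice \
    --language en-us \
    --dataset /home/user/datasets/my-voice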

Training (Tacotron2)

$ python3 TTS/TTS/bin/train_tacotron.py \
    --config_path /path/to/model/config.json

Training (GlowTTS)

You should have added --model-type glowtts during initialization.

$ python3 TTS/TTS/bin/train_glow_tts.py \
    --config_path /path/to/model/config.json

Training (Vocoder)

$ python3 TTS/TTS/bin/train_vocoder.py \
    --config /path/to/model/vocoder/config.json

Synthesis

$ python3 -m larynx synthesize \
    --model /path/to/model/<timestamp>/best_model.pth.tar \
    --config /path/to/model/<timestamp>/config.json \
    --vocoder-model /path/to/model/vocoder/<timestamp>/best_model.pth.tar \
    --vocoder-config /path/to/model/vocoder/<timestamp>/config.json \
    --output-file /path/to/test.wav \
    'This is a test sentence!'

If you have sox installed, you can leave off --output-file and type lines via standard input; they will be played using the play command.

You may also specify --output-dir to have each sentence (line on stdin or argument) written to a different WAV file.
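
For example, piping two sentences through standard input and writing one WAV file per line (paths are placeholders as above):

$ printf 'First sentence.\nSecond sentence.\n' | \
    python3 -m larynx synthesize \
      --model /path/to/model/<timestamp>/best_model.pth.tar \
      --config /path/to/model/<timestamp>/config.json \
      --vocoder-model /path/to/model/vocoder/<timestamp>/best_model.pth.tar \
      --vocoder-config /path/to/model/vocoder/<timestamp>/config.json \
      --output-dir /path/to/wavs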

Web Server

Run a web server at http://localhost:5002

$ python3 -m larynx serve \
    --model /path/to/model/<timestamp>/best_model.pth.tar \
    --config /path/to/model/<timestamp>/config.json \
    --vocoder-model /path/to/model/vocoder/<timestamp>/best_model.pth.tar \
    --vocoder-config /path/to/model/vocoder/<timestamp>/config.json \
    --cache-dir /tmp/larynx

Endpoints:

  • /api/tts - returns WAV audio for text
    • GET with ?text=...
    • POST with text body
  • /api/phonemize - returns phonemes for text
    • GET with ?text=...
    • POST with text body
  • /process - compatibility endpoint to emulate MaryTTS
    • GET with ?INPUT_TEXT=...
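
For example, with curl (the sentences are illustrative):

$ curl -G 'http://localhost:5002/api/tts' --data-urlencode 'text=This is a test.' -o test.wav
$ curl -G 'http://localhost:5002/api/phonemize' --data-urlencode 'text=This is a test.'
$ curl -G 'http://localhost:5002/process' --data-urlencode 'INPUT_TEXT=This is a test.' -o mary.wav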
