GitHub - anirudh9119/SpeechSyn

Speech
bricks
cluster
datasets
extensions
models
projects
results/cluster/blizzard
sk-master
toy
utils
README
__init__.py
continue_experiment.py


The mel-generalized cepstrum (MGC) is an approximate representation of the spectral envelope of a speech signal, computed frame by frame. Each frame covers 32 ms of speech and overlaps the next frame by 28 ms, so successive frames start 4 ms apart. For each frame, we estimate MGC coefficients that define a filter whose frequency response approximates the log-magnitude spectrum of that frame.
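As a rough illustration of the framing described above, the sketch below splits a signal into 32 ms frames that overlap by 28 ms. The function names and the 16 kHz sample rate in the usage note are assumptions made for the example, not values taken from this repository, and the MGC estimation itself is not shown.

```python
def frame_params(sample_rate, frame_ms=32.0, overlap_ms=28.0):
    """Return (frame length, hop length) in samples for the given timing."""
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    hop_len = int(round(sample_rate * (frame_ms - overlap_ms) / 1000.0))
    return frame_len, hop_len


def frame_signal(samples, sample_rate, frame_ms=32.0, overlap_ms=28.0):
    """Split a sequence of samples into overlapping frames.

    Trailing samples that do not fill a whole frame are dropped
    (an assumption; padding the last frame is another common choice).
    """
    frame_len, hop_len = frame_params(sample_rate, frame_ms, overlap_ms)
    if hop_len <= 0:
        raise ValueError("overlap must be shorter than the frame")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames
```

Assuming 16 kHz audio, `frame_params(16000)` gives a frame length of 512 samples and a hop of 64 samples, i.e. one set of MGC coefficients every 4 ms.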

To generate speech from those coefficients, we use the pitch (a variable related to the fundamental frequency of a speech frame) to generate an excitation signal, which is then passed through the filter found in the MGC estimation step. In unvoiced segments (i.e., segments where the excitation is noisy because the vocal cords are not vibrating), we currently use noise as the excitation (either Gaussian noise or a maximum length sequence). The diagram below helps visualize this:

[Diagram: inline image 1, not reproduced here]

Our models generate values related to the two inputs in this diagram: smoothed FFT amplitudes, which are converted to MGC coefficients, and pitch. The circle represents a binary decision the synthesizer makes based on the pitch value: if pitch > 0, it uses the excitation generated by the upper branch; otherwise it uses the noise from the lower branch.
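The voiced/unvoiced switch described above can be sketched as follows. This is a minimal illustration, not the synthesizer used in this repository: it assumes the pitch is given per frame as a fundamental frequency in Hz (with 0 meaning unvoiced), uses a unit pulse train for the upper branch and Gaussian noise for the lower branch, and leaves out the maximum length sequence option and the MGC filtering step. The name `make_excitation` and its parameters are hypothetical.

```python
import random


def make_excitation(pitch_per_frame, hop_len, sample_rate, seed=0):
    """Build an excitation signal with hop_len samples per frame.

    pitch_per_frame: assumed to hold F0 in Hz, 0 for unvoiced frames.
    """
    rng = random.Random(seed)
    excitation = []
    until_next_pulse = 0.0  # samples left before the next pulse
    for f0 in pitch_per_frame:
        for _ in range(hop_len):
            if f0 > 0:
                # Upper branch: pulse train at the pitch period.
                if until_next_pulse <= 0:
                    excitation.append(1.0)
                    until_next_pulse += sample_rate / f0
                else:
                    excitation.append(0.0)
                until_next_pulse -= 1
            else:
                # Lower branch: noise excitation for unvoiced frames.
                excitation.append(rng.gauss(0.0, 1.0))
                until_next_pulse = 0.0
    return excitation
```

For example, with a 16 kHz sample rate and a 64-sample hop, five voiced frames at 100 Hz give pulses 160 samples apart. A real synthesizer would typically also scale the pulses and then filter the excitation with the per-frame MGC filter.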

To train the models, we extract these features from real speech and then train the models to predict the next frame from the previously seen frames. Our current model is a stack of GRUs, with feedforward layers between the model input and the first GRU input, and between the last GRU output and the model output. The training criterion is based on the mixture density network principle: the model outputs means and standard deviations for a GMM, and to generate from this output we sample from this GMM. For the pitch, we additionally have a binomial output which represents the binary decision (pitch > 0 or pitch == 0). (José: please let me know if there's anything wrong in this explanation!)
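A minimal sketch of the mixture density criterion and the sampling step described above is given below. It is written under several assumptions that are not stated in this README: the GMM has diagonal covariance, the model also outputs mixture weights that sum to one, and the binary pitch decision is trained with a cross-entropy term. All function names are hypothetical, and the GRU stack that would produce these parameters is not shown.

```python
import math
import random


def gmm_nll(target, weights, means, stds):
    """Negative log-likelihood of one target frame under a diagonal GMM."""
    log_terms = []
    for w, mu, sigma in zip(weights, means, stds):
        log_p = math.log(w)
        for x, m, s in zip(target, mu, sigma):
            log_p += (-0.5 * math.log(2.0 * math.pi) - math.log(s)
                      - 0.5 * ((x - m) / s) ** 2)
        log_terms.append(log_p)
    # Log-sum-exp over components, for numerical stability.
    top = max(log_terms)
    return -(top + math.log(sum(math.exp(t - top) for t in log_terms)))


def voiced_nll(is_voiced, p_voiced, eps=1e-7):
    """Cross-entropy for the binary decision pitch > 0 versus pitch == 0."""
    p = min(max(p_voiced, eps), 1.0 - eps)
    return -math.log(p if is_voiced else 1.0 - p)


def sample_gmm(weights, means, stds, rng):
    """Pick a component by weight, then sample each dimension from it."""
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return [rng.gauss(m, s) for m, s in zip(means[k], stds[k])]
```

During training, the loss for a frame would be the sum of `gmm_nll` for the continuous features and `voiced_nll` for the voicing decision. At generation time, `sample_gmm` produces the next frame, which is fed back as input for the following step.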
