anirudh9119/SpeechSyn
The mel-generalized cepstrum (MGC) is an approximate representation of the spectral envelope of a speech signal on a per-frame basis. For each frame (32 ms of speech, with an overlap of 28 ms between successive frames, i.e., a 4 ms hop), we estimate MGC coefficients that correspond to a filter whose frequency response approximates the log-magnitude spectrum of the speech frame.

To generate speech from those coefficients, we use the pitch (a variable related to the fundamental frequency of a speech frame) to generate an excitation signal, which is then filtered by the filter found in the MGC estimation step. In unvoiced segments (i.e., segments where the excitation is noisy because the vocal cords are not vibrating), we currently use noise as the excitation (either Gaussian noise or a maximum length sequence). The diagram below helps visualize this:

[Figure: block diagram of the synthesizer]

Our models generate values related to the two inputs in this diagram: smoothed FFT amplitudes, which are converted to MGC coefficients, and pitch. The circle represents a binary decision the synthesizer makes based on the pitch value: if pitch > 0, it uses the excitation generated by the upper branch; otherwise it uses the noise from the lower branch.

To train the models, we extract these features from real speech and then train the models to predict the next frame from previously seen frames. Our current model is a stack of GRUs, with feedforward layers between the model input and the first GRU input, and between the last GRU output and the model output. The training criterion is based on the mixture density network principle: the model outputs means and standard deviations for a GMM, and to generate from this output we sample from this GMM. For the pitch, we additionally have a binary (Bernoulli) output that represents the voiced/unvoiced decision (pitch > 0 or pitch == 0).

(José: please let me know if there's anything wrong in this explanation!)
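The two decisions described above (choosing the excitation from the pitch value, and sampling an output frame from the GMM parameters plus the binary voicing output) can be sketched in a few lines. This is only an illustration under stated assumptions, not the code in this repository: the function names `frame_excitation` and `sample_frame` are hypothetical, the 16 kHz sample rate is assumed (the text only gives the 4 ms hop), the voiced excitation is assumed to be a simple impulse train, and the GMM is reduced to a single one-dimensional feature for brevity.

```python
import math
import random


def frame_excitation(pitch_hz, n_samples, sample_rate, rng, phase=0.0):
    """Return (excitation, phase) for one frame hop.

    pitch_hz > 0  -> voiced: impulse train at the pitch period (assumed form).
    pitch_hz == 0 -> unvoiced: Gaussian white noise.
    `phase` carries the position inside the pitch period across frames.
    """
    if pitch_hz > 0:
        period = sample_rate / pitch_hz  # samples per pitch period
        out = []
        for _ in range(n_samples):
            phase += 1.0
            if phase >= period:
                phase -= period
                # Scale each pulse so the average power per sample is 1,
                # matching the unit-variance noise branch.
                out.append(math.sqrt(period))
            else:
                out.append(0.0)
        return out, phase
    # Unvoiced branch: unit-variance Gaussian noise, phase is reset.
    return [rng.gauss(0.0, 1.0) for _ in range(n_samples)], 0.0


def sample_frame(weights, means, stds, voiced_prob, rng):
    """Sample one value from a 1-D GMM and one Bernoulli voicing flag."""
    # Pick a mixture component according to the mixture weights.
    k = rng.choices(range(len(weights)), weights=weights)[0]
    # Draw from the chosen Gaussian component.
    value = rng.gauss(means[k], stds[k])
    # Binary decision: True means "pitch > 0" (voiced).
    voiced = rng.random() < voiced_prob
    return value, voiced


# Usage: 4 ms hop at an assumed 16 kHz sample rate -> 64 samples per hop.
rng = random.Random(0)
hop = 64
phase = 0.0
signal = []
for pitch in [100.0, 100.0, 100.0, 100.0, 100.0, 0.0]:
    chunk, phase = frame_excitation(pitch, hop, 16000, rng, phase)
    signal.extend(chunk)

# Hypothetical network outputs for one pitch frame (values are made up).
pitch_value, is_voiced = sample_frame([0.3, 0.7], [110.0, 180.0], [5.0, 8.0], 0.9, rng)
```

In a full synthesizer the resulting `signal` would then be passed through the filter defined by the MGC coefficients; that filtering step is omitted here.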