szcom/samplernn

SampleRNN for speech synthesis

Keras implementation of the SampleRNN model published here. This repo implements only the three-tier architecture. The original audio sequence is fed to 3 inputs. Input_1 (in the picture) goes to the slow-tier RNN, which groups 8 audio samples into 1 timestep. The mid tier gets 2 audio samples at a time plus the input from the slow tier (see add_1). Finally, the samples are generated by an MLP that gets the embedding of the previous audio sample (input_3) and the output from the mid-tier layer (see add_2).
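As an orientation aid, below is a minimal (non-stateful) sketch of how the three tiers could be wired in Keras. The layer names input_1/add_1/add_2 mirror the picture; the dimensions and the learned-upsampling choice are illustrative assumptions, not the exact code from train_srnn.py:

    # Minimal sketch; dims and upsampling choice are illustrative assumptions
    from keras.layers import (Input, GRU, Dense, Embedding, Add,
                              Reshape, TimeDistributed)
    from keras.models import Model

    CUTLEN, SLOWDIM, DIM, Q = 512, 32, 32, 256   # Q = 8-bit quantization levels

    # Slow tier: 8 raw samples per timestep
    in1 = Input(shape=(CUTLEN // 8, 8), name='input_1')
    slow = GRU(SLOWDIM, return_sequences=True)(in1)
    # learned 4x upsampling to the mid tier's rate (one of several options)
    slow_up = Reshape((CUTLEN // 2, DIM))(TimeDistributed(Dense(4 * DIM))(slow))

    # Mid tier: 2 raw samples per timestep, conditioned on the slow tier
    in2 = Input(shape=(CUTLEN // 2, 2), name='input_2')
    mid = Add(name='add_1')([TimeDistributed(Dense(DIM))(in2), slow_up])
    mid = GRU(DIM, return_sequences=True)(mid)
    # learned 2x upsampling to the sample rate
    mid_up = Reshape((CUTLEN, DIM))(TimeDistributed(Dense(2 * DIM))(mid))

    # Sample-level MLP: embedding of the previous sample + mid-tier output
    in3 = Input(shape=(CUTLEN,), dtype='int32', name='input_3')
    emb = Embedding(Q, DIM)(in3)
    mlp = Add(name='add_2')([emb, mid_up])
    mlp = TimeDistributed(Dense(DIM, activation='relu'))(mlp)
    out = TimeDistributed(Dense(Q, activation='softmax'))(mlp)

    model = Model([in1, in2, in3], out)
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')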

Audio preprocessing

Before training can start, the audio must undergo some preprocessing. The steps to follow are (a sketch of the chunking step comes after the list):

  • mkdir -p blizzard/tiny
  • copy some wav files to ./blizzard/tiny; for example 1 min of audio in total
  • run python preprocess.py $PWD/blizzard/tiny
  • blizzard/tiny_parts now contains the audio material split into 8-second-long chunks
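preprocess.py performs the splitting; a rough, hypothetical equivalent of the chunking step is shown below (the 8-second chunk length and the _parts suffix come from the steps above, the function name is made up):

    # Rough, hypothetical equivalent of the chunking step (not the repo's code)
    import os, sys, glob
    from scipy.io import wavfile

    def split_to_chunks(src_dir, chunk_seconds=8):
        dst_dir = src_dir.rstrip('/') + '_parts'
        os.makedirs(dst_dir, exist_ok=True)
        for path in glob.glob(os.path.join(src_dir, '*.wav')):
            rate, audio = wavfile.read(path)
            step = rate * chunk_seconds                  # samples per chunk
            base = os.path.splitext(os.path.basename(path))[0]
            for i in range(0, len(audio) - step + 1, step):
                out = os.path.join(dst_dir, '%s_%d.wav' % (base, i // step))
                wavfile.write(out, rate, audio[i:i + step])

    if __name__ == '__main__':
        split_to_chunks(sys.argv[1])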

Baseline

The original implementation of SampleRNN can be found here. It served as the baseline reference during development. Training results on the 'tiny' dataset (see below) were compared with the baseline. The costs in bits per sequence for this code and for the baseline are shown below, followed by a note on converting Keras's nats to bits.

Epoch   This code (Training / Validation)   Baseline (Training / Validation)
1       3.98438 / 4.87372                   3.9624 / 4.9070
10      2.29819 / 4.14896                   2.6645 / 4.2562
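Note that Keras reports cross-entropy in nats by default; to compare a training log against the bits figures above, divide by ln 2:

    import math

    def nats_to_bits(loss_in_nats):
        # cross-entropy in nats -> bits (divide by ln 2)
        return loss_in_nats / math.log(2)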

Training

Unfortunately, the start/stop indexes that separate the validation and training data sets have to be picked manually, depending on the dataset size. The following values were used for the two datasets, tiny and blizzard2013 (the split is illustrated after the table). The index of the last training sequence is given by the --trainstop command-line argument (see below), and --validstop points to the index of the last validation sequence.

Dataset               --trainstop   --validstop   minibatch size
tiny (~50 sec)        4             6             2
blizzard2013 (~20 h)  8000          9000          100
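In other words, the split is positional over the list of preprocessed sequences. Assuming inclusive boundaries (an assumption; check train_srnn.py), it amounts to:

    # Hypothetical illustration for the 'tiny' dataset (7 chunks, indexes 0..6)
    sequences = ['chunk_%d.wav' % i for i in range(7)]
    trainstop, validstop = 4, 6
    train_seqs = sequences[:trainstop + 1]               # indexes 0..trainstop
    valid_seqs = sequences[trainstop + 1:validstop + 1]  # trainstop+1..validstop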

To start training, run THEANO_FLAGS=device=cpu,mode=FAST_RUN python train_srnn.py --exp=tiny --slowdim=32 --dim=32 --cutlen=512 --batchsize=2 --validstop=6 --trainstop=4. This will create a model with 32 hidden units in each layer and run TBPTT (truncated backpropagation through time) for 512 timesteps (due to --cutlen=512), using the Theano backend and the CPU to compute.
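Conceptually, TBPTT here means each 8-second sequence is consumed in consecutive --cutlen slices while the recurrent state is carried over and gradients are truncated at the slice boundaries. A hedged sketch, assuming stateful RNN layers and a batch of one for clarity:

    # Hedged TBPTT sketch: stateful layers and batch size 1 assumed
    def train_epoch(model, sequences, cutlen=512):
        for seq in sequences:                   # seq: 1-D array of quantized samples
            model.reset_states()                # fresh recurrent state per sequence
            for start in range(0, len(seq) - cutlen, cutlen):
                chunk = seq[start:start + cutlen + 1]
                x, y = chunk[:-1], chunk[1:]    # next-sample prediction targets
                inputs = [x.reshape(1, cutlen // 8, 8),   # slow tier
                          x.reshape(1, cutlen // 2, 2),   # mid tier
                          x.reshape(1, cutlen)]           # previous-sample ids
                model.train_on_batch(inputs, y.reshape(1, cutlen, 1))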

After about 3 epochs of training on the blizzard2013 dataset, the model should be able to generate nice-looking and even nice-sounding samples.

Sampling

The training process produces files named <tiny|all>_srnn_sz<dim>_e<epoch>.h5 with model weights every --svepoch epochs and at the end of training. Choose the one with the best validation performance to generate a wav sample. For example, THEANO_FLAGS=device=cpu,mode=FAST_RUN python train_srnn.py --exp=tiny --slowdim=32 --dim=32 --cutlen=512 --batchsize=2 --validstop=6 --trainstop=4 --sample=<filename> will produce generated.wav.
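At generation time the model runs autoregressively: each new audio sample is drawn from the softmax and fed back as the next input. A heavily simplified sketch, assuming a stateful single-step wrapper around all three tiers (the wrapper is hypothetical):

    # Heavily simplified sketch; 'step_model' is a hypothetical stateful
    # single-step wrapper around all three tiers
    import numpy as np

    def generate(step_model, n_samples=16000, q=256):
        out = np.zeros(n_samples, dtype='int32')
        out[0] = q // 2                          # start from quantized silence
        for t in range(1, n_samples):
            probs = step_model.predict(out[t - 1:t].reshape(1, 1))[0, 0]
            probs = probs / probs.sum()          # guard against float32 drift
            out[t] = np.random.choice(q, p=probs)
        return out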

Sampling from pretrained model

This repo contains a file allmost_1e.h5 with model weights after about 12 hours of training on blizzard2013 using Colab's K80 GPU. It is thus possible to try sampling right away with the following command: THEANO_FLAGS=device=cpu,mode=FAST_RUN python train_srnn.py --slowdim=1024 --dim=1024 --sample=allmost_1e.h5. This will use the CPU and the Theano backend to do the work and produce something like this sample. The audio sample shown in the picture can be found in sample4s.wav.
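For reference, writing the generated quantized stream out as a wav file can be done with scipy; the linear de-quantization and the 16 kHz rate below are assumptions, not necessarily what train_srnn.py uses:

    # Hedged sketch: linear de-quantization and 16 kHz are assumptions
    import numpy as np
    from scipy.io import wavfile

    def write_wav(samples, path='generated.wav', rate=16000, q=256):
        audio = samples.astype('float32') / (q - 1) * 2.0 - 1.0  # [0,q-1] -> [-1,1]
        wavfile.write(path, rate, (audio * 32767).astype('int16'))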
