Conditional WaveGAN

In this project we developed Conditional WaveGAN, which synthesizes raw speech/audio samples conditioned on class labels. The synthesized audio is then used to improve a baseline ASR system.

Getting Started

Generative models have been used successfully for image synthesis in recent years, but far less progress has been made in other modalities such as audio and text. Recent work focuses on generating audio with generative models in an unsupervised setting. We explore the possibility of using generative models conditioned on class labels.

Methods

Speech style transfer and applications in improving ASR system

Can synthesized data be used to train Automatic Speech Recognition (ASR) systems? We tackle this problem by generating samples with a large variety. We first build the Conditional WaveGAN explored in this repository to synthesize the target samples. We then use the Discovery GAN architecture to perform style transfer in the speech domain. The resulting varied samples can be used to build a robust ASR system. Conditional WaveGAN is one part of this bigger project. Please refer to this Repo to learn more about our original ideas.

Conditional WaveGANs

Usage

Training is supported on both GPU and TPU. The GPU code supports only concatenation-based conditioning, whereas the TPU code supports both concatenation-based and bias-based conditioning.
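
The two conditioning schemes can be illustrated with a minimal sketch in plain Python. It assumes the usual meaning of the terms: concatenation appends a one-hot label to the latent vector, and bias-based conditioning adds a per-class vector looked up by integer label. The names `concat_condition`, `bias_condition`, `LATENT_DIM`, and `bias_table` are hypothetical, and the layer at which this repository actually applies the condition may differ.

```python
import random

NUM_CLASSES = 10   # assumed: ten digit classes, as in SC09
LATENT_DIM = 100   # assumed latent size, for illustration only


def one_hot(label, num_classes=NUM_CLASSES):
    """Return a one-hot list for an integer class label."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]


def concat_condition(z, label):
    """Concatenation-based conditioning: append the one-hot label to z."""
    return list(z) + one_hot(label)


def bias_condition(features, label, bias_table):
    """Bias-based conditioning: add a per-class bias vector selected by integer label."""
    bias = bias_table[label]
    return [f + b for f, b in zip(features, bias)]


rng = random.Random(0)
z = [rng.uniform(-1.0, 1.0) for _ in range(LATENT_DIM)]

# Concatenation: the input grows from LATENT_DIM to LATENT_DIM + NUM_CLASSES values.
z_cond = concat_condition(z, label=3)

# Bias: one vector per class (learned in a real model, random here); the size is unchanged.
bias_table = [[rng.gauss(0.0, 0.02) for _ in range(LATENT_DIM)]
              for _ in range(NUM_CLASSES)]
h_cond = bias_condition(z, label=3, bias_table=bias_table)
```

The sketch also shows why the two schemes expect different label formats: concatenation consumes the one-hot vector directly, while the bias lookup needs an integer index.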

Prerequisites

TensorFlow >= 1.4
Python 2.7

Datasets

Speech Commands Zero through Nine (SC09)
TensorFlow Challenge Speech Commands data full

Data must be stored in TFRecord format. Labels must be one-hot encoded for concatenation-based conditioning and plain integers for bias-based conditioning, so the code that builds the TFRecord differs by conditioning type.

python make_tfrecord.py \
	new/sc09/train \
	new/sc09_tf \
	--name train --labels \
	--ext wav \
	--fs 16000 \
	--nshards 128 \
	--slice_len 1
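
As a small illustration of the two label formats described above, the following sketch converts an SC09 word label into both forms. The helper names and the word-to-index order are assumptions made for this example, not taken from make_tfrecord.py.

```python
# Assumed class order: index i corresponds to the spoken digit i.
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]


def to_int_label(word):
    """Integer class id, the form described for bias-based conditioning."""
    return DIGITS.index(word)


def to_one_hot_label(word):
    """One-hot vector, the form described for concatenation-based conditioning."""
    idx = to_int_label(word)
    return [1 if i == idx else 0 for i in range(len(DIGITS))]


print(to_int_label("three"))      # 3
print(to_one_hot_label("three"))  # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```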

Training in GPU

To begin or resume training

python gpu/train_wavegan.py train ./gpu/train \
	--data_dir ./data/customdataset

To save checkpoints at a fixed interval (in minutes) during training

# save checkpoints every 60 minutes
python gpu/backup.py ./gpu/train 60
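
For readers who want to see what a periodic backup might involve, here is a minimal sketch that snapshots a training directory every N minutes. It is not the repository's backup.py, whose behavior is not documented here; `backup_once`, `backup_forever`, and the destination layout are hypothetical.

```python
import os
import shutil
import time


def backup_once(train_dir, backup_root):
    """Copy train_dir into a new timestamped folder under backup_root and return its path."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = os.path.join(backup_root, "backup-" + stamp)
    # copytree creates dest (and backup_root if needed); it fails if dest already exists.
    shutil.copytree(train_dir, dest)
    return dest


def backup_forever(train_dir, backup_root, minutes):
    """Take a snapshot, sleep for the given number of minutes, and repeat."""
    while True:
        backup_once(train_dir, backup_root)
        time.sleep(minutes * 60)
```

A call such as `backup_forever("./gpu/train", "./gpu/backup", 60)` would mirror the command above, where `./gpu/backup` is an illustrative destination. Because folder names have one-second resolution, the interval should be at least a few seconds.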

To generate 20 preview audio samples with two per class

python gpu/train_wavegan.py preview ./gpu/preview
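
Twenty samples at two per class implies ten classes. A label batch for such a preview could be built as in the sketch below; the ordering used by the actual preview code is an assumption.

```python
NUM_CLASSES = 10        # inferred from 20 samples at two per class
SAMPLES_PER_CLASS = 2

# Repeat each class id twice, in ascending class order.
preview_labels = [c for c in range(NUM_CLASSES) for _ in range(SAMPLES_PER_CLASS)]
print(preview_labels)
# [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9]
```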

Training in TPU

Setting up TPU is explained here.

To begin or resume training

# concatenation based conditioning
python tpu/concat_main.py

# bias based conditioning
python tpu/bias_main.py

Create a bucket for backup checkpoints and name it [CKPT_BUCKET_NAME]-backup. To save checkpoints at a fixed interval (in minutes) during training

# save checkpoints every 60 minutes
python tpu/backup.py gs://ckpt 60

To generate 20 preview audio samples with two per class

python tpu/preview.py

Synthesized audio samples

https://colab.research.google.com/drive/1VRyNJQBgiFF-Gi9qlZkOhiBE-KkUaHjw

References

Donahue, Chris, Julian McAuley, and Miller Puckette. "Synthesizing Audio with Generative Adversarial Networks." arXiv preprint arXiv:1802.04208 (2018).
Shen, Jonathan, et al. "Natural TTS synthesis by conditioning wavenet on mel spectrogram predictions." arXiv preprint arXiv:1712.05884 (2017).
Perez, Anthony, Chris Proctor, and Archa Jain. "Style transfer for prosodic speech." Tech. Rep., Stanford University, 2017.
Goodfellow, Ian, et al. "Generative adversarial nets." Advances in Neural Information Processing Systems. 2014.
Salimans, Tim, et al. "Improved techniques for training gans." Advances in Neural Information Processing Systems. 2016.
Grinstein, Eric, et al. "Audio style transfer." arXiv preprint arXiv:1710.11385 (2017).
Pascual, Santiago, Antonio Bonafonte, and Joan Serra. "SEGAN: Speech enhancement generative adversarial network." arXiv preprint arXiv:1703.09452 (2017).
Jing, Yongcheng, Yezhou Yang, Zunlei Feng, Jingwen Ye, Yizhou Yu, and Mingli Song. "Neural Style Transfer: A Review." arXiv preprint arXiv:1705.04058 (2017).
Van Den Oord, Aäron, et al. "Wavenet: A generative model for raw audio." CoRR abs/1609.03499 (2016).
"Glow: Generative Flow with Invertible 1×1 Convolutions."
Kingma, Diederik P., et al. "Semi-supervised learning with deep generative models." Advances in Neural Information Processing Systems. 2014.

Authors

Anoop Toffy - IIIT Bangalore - Personal Website
Chae Young Lee - Hankuk Academy of Foreign Studies - Homepage

License

This project is licensed under the MIT License - see the LICENSE.md file for details

Credits

Our baseline model is taken from the WaveGAN paper by Chris Donahue et al. (2018).

@article{donahue2018synthesizing,
  title={Synthesizing Audio with Generative Adversarial Networks},
  author={Donahue, Chris and McAuley, Julian and Puckette, Miller},
  journal={arXiv preprint arXiv:1802.04208},
  year={2018}
}

The TPU implementations are based on the DCGAN implementation released by TensorFlow Hub. link

Acknowledgments

Dr. Gue Jun Jung, Speech Recognition Tech, SK Telecom
Dr. Woo-Jin Han, Netmarble IGS
Google Mentors
Tensorflow Korea
Google

This was supported by Deep Learning Camp Jeju 2018 which was organized by TensorFlow Korea User Group.
