av_segmentation

Audio segmentation in videos using visual information from regions of interest.

This repo contains code for isolating sounds from sources of interest (e.g. a human speaker or a musical instrument) in videos. The sound sources are specified by drawing regions of interest (ROIs) on the first frame of the video. A trained neural network combines visual and audio information to extract the sounds produced by objects within the ROIs. The design of the network is inspired by:

@inproceedings{gabbay2018visual,
author  	= {Aviv Gabbay and
	  	   Asaph Shamir and
		   Shmuel Peleg},
title     	= {Visual Speech Enhancement},
booktitle 	= {Interspeech},
pages     	= {1170--1174},
publisher 	= {{ISCA}},
year      	= {2018}
}

The code also lets the user download videos from online sources, upload videos from local storage, draw ROIs, and train the network on new video data. Libraries are included for extracting images and audio signals from videos, for mixing sound sources to produce new training data, and for converting audio time series to Short-Time Fourier Transform (STFT) representations.
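To make the mixing and STFT steps concrete, here is a minimal NumPy sketch. It is not the repo's own implementation: the function names `mix_sources` and `stft`, the SNR-based scaling, and the default `n_fft` and `hop` values are all assumptions chosen for illustration.

```python
import numpy as np


def mix_sources(target, interference, snr_db=0.0):
    """Mix two mono signals so that the target is snr_db above the interference.

    Hypothetical helper: both inputs are truncated to the shorter length.
    """
    n = min(len(target), len(interference))
    target = np.asarray(target[:n], dtype=np.float64)
    interference = np.asarray(interference[:n], dtype=np.float64)
    p_target = np.mean(target ** 2)
    p_interf = np.mean(interference ** 2)
    if p_interf == 0:
        raise ValueError("interference signal is silent")
    # Scale the interference so that 10*log10(p_target / p_scaled) == snr_db.
    gain = np.sqrt(p_target / (p_interf * 10 ** (snr_db / 10.0)))
    return target + gain * interference


def stft(signal, n_fft=512, hop=128):
    """Return a complex STFT of shape (n_fft // 2 + 1, n_frames).

    Frames are Hann-windowed; no padding is applied (an assumption),
    so trailing samples that do not fill a frame are dropped.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) < n_fft:
        raise ValueError("signal is shorter than one frame")
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop: i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=1).T


if __name__ == "__main__":
    sr = 44100  # sampling rate in Hz, matching the 44.1 kHz audio mentioned in this README
    t = np.arange(sr) / sr
    speaker = np.sin(2 * np.pi * 440 * t)      # stand-in for the source of interest
    background = np.sin(2 * np.pi * 1000 * t)  # stand-in for an interfering source
    mixture = mix_sources(speaker, background, snr_db=0.0)
    spectrogram = np.abs(stft(mixture))
    print(spectrogram.shape)
```

A network of this kind would typically take the mixture spectrogram as input and the clean target spectrogram as the training label; how this repo arranges that is not shown here.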

The repo contains scripts for downloading the AVSpeech dataset (720p/360p videos at 25 fps with audio at 44.1 kHz), as well as other datasets (e.g. AudioSet). This part of the code has been adapted from Nabarun Goswami's code at https://github.com/naba89/AVSpeechDownloader

Assumptions/Limitations:

The script creates a file called badfiles_train.txt, which lists the YouTube IDs of deleted or private videos that are no longer available for download.
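One way to use this file on later runs is to skip the IDs it lists. The sketch below assumes badfiles_train.txt holds one YouTube ID per line, which this README does not confirm; `load_bad_ids` and `filter_available` are hypothetical helpers, not functions from this repo.

```python
from pathlib import Path


def load_bad_ids(path):
    """Return the set of video IDs listed in the bad-files log.

    Assumes one ID per line; blank lines are ignored. A missing file
    is treated as "no bad IDs recorded yet".
    """
    log = Path(path)
    if not log.exists():
        return set()
    return {line.strip() for line in log.read_text().splitlines() if line.strip()}


def filter_available(video_ids, bad_ids):
    """Keep only the IDs that were not recorded as deleted or private."""
    return [video_id for video_id in video_ids if video_id not in bad_ids]


if __name__ == "__main__":
    bad = load_bad_ids("badfiles_train.txt")
    candidates = ["id_one", "id_two"]  # placeholder IDs for illustration
    print(filter_available(candidates, bad))
```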

Usage:

inOut.download_av_speech.py train

Cloning, creating virtual environment, installing dependencies:

git clone --single-branch -b minimal_win --depth 1 https://github.com/avinashpujala/av_segmentation.git
cd av_segmentation
conda env create -f environment/av_segmentaton.yml
conda activate av_segmentation

NB: When installing librosa >= 0.5.1, make sure that the numba version is compatible, for example: pip install numba==0.48
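To check which versions ended up in the environment, a small script along the following lines may help. It uses only the standard library; `installed_version` and `matches_pin` are hypothetical helpers, and the prefix comparison is a simplification rather than full version-specifier parsing.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def matches_pin(installed, pinned):
    """True if the installed version starts with the pinned release components.

    For example, "0.48.0" matches the pin "0.48", while "0.53.1" does not.
    """
    if installed is None:
        return False
    want = pinned.split(".")
    have = installed.split(".")
    return have[: len(want)] == want


if __name__ == "__main__":
    for name in ("librosa", "numba"):
        print(name, installed_version(name))
    print("numba matches 0.48:", matches_pin(installed_version("numba"), "0.48"))
```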

The complete pipeline (data preprocessing, building the neural network, training, and running inference) can now be run through subroutines available in seeSound.py.
