
ML-interpolation-Master-Thesis

Variational Autoencoders for Polyphonic Music Interpolation

This is my Master's thesis, submitted to National Tsing Hua University (Taiwan) in July 2020. All the code can be found in this repository.

Abstract

This thesis applies Machine Learning techniques to the novel problem of music interpolation composition. Two models based on Variational Autoencoders (VAEs) are proposed to generate a suitable polyphonic harmonic bridge between two given songs, smoothly changing the pitches and dynamics of the interpolation. The interpolations generated by the first model surpass a random-data baseline and a bidirectional LSTM approach, and their quality is comparable to the current state of the art. The novel architecture of the second model outperforms state-of-the-art interpolation approaches in terms of reconstruction loss by using an additional neural network to directly estimate the interpolation's encoded vector. Furthermore, the Hsinchu Interpolation MIDI Dataset was created, making both models proposed in this thesis more efficient to train than previous approaches in the literature, in terms of both computation and time. Finally, a quantitative user study was conducted to validate the results.

What do we mean by music interpolation?

“Interpolation is a type of estimation, a method of constructing new data points within the range of a discrete set of known data points” (Fleetwood, 1991)

In traditional Machine Learning approaches, music generation is conditioned only on past events. But what if we could condition the generation on both past and future events? We input a begin track and an end track of 10 seconds each into our model and obtain as output a middle (interpolation) track, also 10 seconds long, whose pitches and dynamics match both given tracks.

What is polyphonic music and how to model it?

In monophonic music, every timestep (time unit) contains a single note. In polyphonic music, by contrast, every timestep can contain several notes, forming chords that make the composition richer. We use MIDI (Musical Instrument Digital Interface) to represent the music symbolically, instead of using the raw waveform (which is computationally expensive to manipulate). Each timestep of a song is represented as a vector of 64 binary elements, where each element corresponds to one piano key (one note or pitch): 1 means note on and 0 means note off.
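
A minimal sketch of this piano-roll encoding, using `pretty_midi` as one convenient way to read MIDI files. The specific 64-key window (MIDI pitches 32-95) and the sampling rate are illustrative assumptions, not values fixed by the thesis:

```python
import numpy as np
import pretty_midi

LOW, HIGH = 32, 96  # assumed 64-key window (MIDI pitch numbers)
FS = 10             # assumed sampling rate: 10 timesteps per second

def midi_to_binary_roll(path: str) -> np.ndarray:
    """Return a (timesteps, 64) binary piano roll: 1 = note on, 0 = note off."""
    pm = pretty_midi.PrettyMIDI(path)
    roll = pm.get_piano_roll(fs=FS)         # (128 pitches, timesteps), velocities
    roll = roll[LOW:HIGH]                   # keep the 64-key window
    return (roll.T > 0).astype(np.float32)  # binarize, put time on axis 0

# e.g. a 10-second track becomes a (100, 64) binary matrix at FS = 10
```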

Dataset

A new MIDI dataset based on the Lakh MIDI Dataset has been created: the Hsinchu Interpolation MIDI Dataset. It contains only valuable interpolation segments, in which the begin track and the end track are very different (simulating a style transfer within the same human composition). The similarity between begin and end tracks was evaluated with a neural network, framed as a binary classification problem (a sketch follows). The Hsinchu Interpolation MIDI Dataset contains 30,830 segments of 30 seconds each.
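
A hedged sketch of such a begin/end similarity filter. The thesis does not detail the classifier here, so this small MLP is an illustrative assumption rather than the actual model:

```python
import torch
import torch.nn as nn

class SimilarityClassifier(nn.Module):
    """Predicts whether a begin/end pair is similar (1) or different (0)."""
    def __init__(self, steps: int = 100, keys: int = 64, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),                         # (B, 2, steps, keys) -> (B, 2*steps*keys)
            nn.Linear(2 * steps * keys, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),                 # logit; sigmoid gives a probability
        )

    def forward(self, pair: torch.Tensor) -> torch.Tensor:
        return self.net(pair)

# A segment would be kept for the dataset only if its begin/end pair is
# classified as "different enough", e.g.:
# keep = torch.sigmoid(model(pair)) < 0.5
```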

Experiments

Four experiments were carried out in this thesis. Experiments 1 and 2 are baselines; Experiments 3 and 4 are the proposed novel models based on Variational Autoencoders for the interpolation problem:

  1. Random Data
  2. Bi-LSTM
  3. VAE (Variational Autoencoder)
  4. VAE+NN (Variational Autoencoder + Neural Network)

Experiment 3. VAE: interpolation done with linear sampling of the latent space (see the sketch after this list). Steps:

  1. Encode the begin track and the end track with the VAE to obtain z_begin and z_end, respectively.

  2. Average z_begin and z_end to obtain the interpolation encoded vector z_interpolation.

  3. Decode z_interpolation to obtain the interpolation track.

  4. Ideally, the reconstructed interpolation track should be identical to the original interpolation track (ground truth).
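
A minimal sketch of these four steps, assuming a simple fully connected VAE over flattened piano rolls; the layer sizes, latent dimension, and the use of the mean vector as the code are illustrative assumptions, not the thesis architecture:

```python
import torch
import torch.nn as nn

STEPS, KEYS, LATENT = 100, 64, 128  # assumed: 10 s at 10 steps/s, 64 keys

class VAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Flatten(), nn.Linear(STEPS * KEYS, 512), nn.ReLU())
        self.mu, self.logvar = nn.Linear(512, LATENT), nn.Linear(512, LATENT)
        self.dec = nn.Sequential(
            nn.Linear(LATENT, 512), nn.ReLU(),
            nn.Linear(512, STEPS * KEYS), nn.Sigmoid(),  # note-on probabilities
        )

    def encode(self, x):
        h = self.enc(x)
        return self.mu(h), self.logvar(h)

    def decode(self, z):
        return self.dec(z).view(-1, STEPS, KEYS)

vae = VAE()
begin = torch.rand(1, STEPS, KEYS).round()  # stand-ins for real piano rolls
end = torch.rand(1, STEPS, KEYS).round()

z_begin, _ = vae.encode(begin)               # step 1: encode both tracks
z_end, _ = vae.encode(end)
z_interpolation = (z_begin + z_end) / 2      # step 2: average the latent codes
interp = vae.decode(z_interpolation)         # step 3: decode the middle track
interp_binary = (interp > 0.5).float()       # step 4: binarize, compare to ground truth
```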

Experiment 4. VAE+NN: interpolation done with direct estimation of the interpolation encoded vector (see the sketch after this list). Steps:

  1. Encode the begin track and the end track with the VAE to obtain z_begin and z_end, respectively.

  2. Use the novel neural network (NN) to directly estimate the interpolation encoded vector z_interpolation from z_begin and z_end.

  3. Decode z_interpolation to obtain the interpolation track.

  4. Ideally, the reconstructed interpolation track should be identical to the original interpolation track (ground truth).
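
A hedged sketch of the VAE+NN idea, building on the VAE sketch above: a small network maps (z_begin, z_end) to z_interpolation instead of averaging them. The layer sizes and the MSE training target below are illustrative assumptions:

```python
import torch
import torch.nn as nn

LATENT = 128  # same assumed latent dimension as the VAE sketch

class InterpolationNN(nn.Module):
    """Directly estimates z_interpolation from the two endpoint codes."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * LATENT, 256), nn.ReLU(),
            nn.Linear(256, LATENT),
        )

    def forward(self, z_begin, z_end):
        return self.net(torch.cat([z_begin, z_end], dim=-1))

nn_model = InterpolationNN()
z_begin, z_end = torch.randn(1, LATENT), torch.randn(1, LATENT)
z_true = torch.randn(1, LATENT)  # stand-in for the encoding of the ground-truth middle track

# Train the NN so its estimate matches the encoding of the human-composed middle:
loss = nn.functional.mse_loss(nn_model(z_begin, z_end), z_true)
loss.backward()
```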

Full architecture of the novel VAE+NN model proposed in this thesis.

Results

Objective evaluation based on MSE with respect to human compositions (ground truth); a sketch of the threshold metric follows the list:

  1. Averaged MSE of each model's tracks.

  2. MSE by binarization threshold for each model's tracks.

  3. MSE in the latent space of the VAE models.
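
A short sketch of the thresholded-MSE metric (item 2 above), assuming the decoder outputs note-on probabilities that are binarized before comparison; the threshold sweep range is an assumption:

```python
import numpy as np

def mse_at_threshold(pred: np.ndarray, truth: np.ndarray, t: float) -> float:
    """MSE between a binarized prediction and the binary ground truth."""
    return float(np.mean(((pred > t).astype(float) - truth) ** 2))

# Sweep thresholds, e.g. 0.1 ... 0.9:
# errors = [mse_at_threshold(pred, truth, t) for t in np.linspace(0.1, 0.9, 9)]
```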

Subjective evaluation based on the preferences of 32 users from 8 countries across Africa, America, Asia and Europe:

  1. Pair-wise comparisons of each model's tracks.

  2. Total number of votes per model.
