Work in progress on 'learning to fingerprint' for challenging audio-based content ID problems, such as cover song detection.
Currently focused on experiments in which a fingerprint is learned from a dataset of cover songs. The main idea behind this is explained in our Audio Bigrams paper [1].
See this notebook.
Very briefly explained:

- most fingerprints encode some kind of co-occurrence of salient events (e.g., Shazam's landmark-based fingerprinter, 'intervalgrams'...)
- 'salient event detection' can be implemented as a convolution: `conv2d(X, W)`, with W the 'salient events'
- co-occurrence can be implemented as `conv2d(X, w) @ X.T`, with w a window and `@` the matrix product
- all of this is differentiable; therefore, any fingerprinting system that can be formulated like this can be trained 'end-to-end'
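The two operations above can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration with synthetic data, not the trained system: the input shape (a 12-bin chromagram), the single pattern `W`, and the 8-frame window are all assumptions chosen for the example.

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
X = rng.random((12, 100))   # e.g. a chromagram: 12 pitch classes x 100 frames

# 'salient event detection' as a convolution with a pattern W
# (in the learned setting, W is a bank of learnable filters)
W = rng.standard_normal((12, 5))
S = convolve2d(X, W, mode='same')         # salience map, same shape as X

# co-occurrence: smooth X over time with a window w, then take
# the matrix product with X itself
w = np.ones((1, 8)) / 8                   # 8-frame rectangular window
F = convolve2d(X, w, mode='same') @ X.T   # (12, 12) co-occurrence matrix

print(S.shape, F.shape)
```

Both steps are compositions of convolutions and matrix products, so the same computation can be expressed in any autodiff framework and trained with gradient descent.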
To evaluate the learned fingerprint, we compare against the elegant and performant '2D Fourier Transform Magnitude Coefficients' by Bertin-Mahieux and Ellis [2], and a simpler fingerprinting approach by Kim et al. [3].
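For reference, the 2DFTM baseline can be sketched as follows: take overlapping patches of a beat-synchronous chromagram, compute the 2D Fourier transform magnitude of each patch, and aggregate with a median. This is a simplified sketch, not the authors' code; the 75-beat patch length follows [2], but the synthetic input and the omission of the paper's PCA step are simplifications.

```python
import numpy as np

rng = np.random.default_rng(0)
chroma = rng.random((12, 300))  # stand-in for a beat-synchronous chromagram

patch_len = 75  # patch length (in beats) used in [2]
patches = [chroma[:, i:i + patch_len]
           for i in range(0, chroma.shape[1] - patch_len + 1)]

# 2D Fourier transform magnitude of each patch, median over patches
ftms = np.stack([np.abs(np.fft.fft2(p)) for p in patches])
fingerprint = np.median(ftms, axis=0).ravel()  # 12 * 75 = 900 dims

print(fingerprint.shape)
```

The magnitude of the 2D transform discards phase, which makes the representation invariant to transposition and time shifts within a patch; this is what makes it attractive for cover detection.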
We use the Second Hand Songs dataset with duplicates removed, as proposed by Julien Osmalskyj.
[1] Van Balen, J., Wiering, F., & Veltkamp, R. (2015). Audio Bigrams as a Unifying Model of Pitch-based Song Description.
[2] Bertin-Mahieux, T., & Ellis, D. P. W. (2012). Large-Scale Cover Song Recognition Using The 2d Fourier Transform Magnitude. In Proc. International Society for Music Information Retrieval Conference.
[3] Kim, S., Unal, E., & Narayanan, S. (2008). Music Fingerprint Extraction for Classical Music Cover Song Identification. In Proc. IEEE International Conference on Multimedia and Expo.
(c) 2016 Jan Van Balen