- Description
- Public Domain Annotations
- Citations and Repositories
- Notation
- Installation
- Tutorial
- Detecting Sounds
- Visualizing Clusters
- Manually Annotating
- Training a Classifier
- Quantifying Accuracy
- Making Predictions
- Correcting False Alarms
- Correcting Misses
- Minimizing Annotation Effort
- Double Checking Annotations
- Measuring Generalization
- Searching Hyperparameters
- Examining Errors
- Testing Densely
- Discovering Novel Sounds
- Scripting Automation
- Troubleshooting
- Frequently Asked Questions
- Reporting Problems
- Development
You have an audio recording, and you want to know where certain classes of sounds are. You train SongExplorer to recognize these sounds, called words, by manually giving it a few examples. It will then automatically calculate the probability, over time, that those words occur in all of your recordings.
Applications suitable for SongExplorer include quantifying the rate or pattern of words emitted by a particular species, distinguishing a recording of one species from another, and discerning whether individuals of the same species produce different song.
Underneath the hood is a deep convolutional neural network. The input is the raw audio stream, and the output is a set of mutually-exclusive probability waveforms corresponding to each word of interest.
Training begins by first thresholding one of your recordings in the time- and frequency-domains to find sounds that exceed the ambient noise. These sounds are then clustered into similar categories for you to manually annotate with however many word labels naturally occur. A classifier is then trained on this corpus of ground truth, and a new recording is analyzed by it. The words it automatically finds are then clustered as before, but this time are displayed with predicted labels. You manually correct the mistakes, both re-labeling words that it got wrong, as well as labeling words it missed. These new annotations are added to the ground truth, and the process of retraining the classifier and analyzing and correcting new recordings is repeated until the desired accuracy is reached.
SongExplorer is open source and free for you to use. However, SongExplorer is not a static piece of software. Its performance improves with additional high-quality annotations.
Therefore, when you publish results based on SongExplorer, we request that you make all of your primary data and annotations freely available in a recognized data repository, such as figshare, Dryad, or Zenodo. Many journals already require deposition of raw data, but we strongly encourage you to also provide your manual annotations. These manual annotations will serve to improve the performance of SongExplorer over time, helping both your own work and that of everyone else.
Please let us know where you have deposited your raw data and annotations by posting an issue to the SongExplorer repository. We will endeavor to maintain a database of these recordings and annotations and will periodically re-train SongExplorer with the new data.
In addition, consider donating your recordings to a library or museum, like the Cornell Lab of Ornithology's Macaulay Library or the Museo de Ciencias Naturales de Madrid's Fonoteca Zoológica.
BJ Arthur, Y Ding, M Sosale, F Khalif, S Turaga, DL Stern (in prep)
SongExplorer: A deep learning workflow for discovery and segmentation of animal acoustic communication signals
Throughout this document, buttons and variables in the SongExplorer graphical user interface (GUI), as well as code, are highlighted `like so`. Files and paths are enclosed in double quotes ("..."). The dollar sign ($) in code snippets signifies your computer terminal's command line. Square brackets ([...]) in code indicate optional components, and angle brackets (<...>) represent sections which you must customize.
SongExplorer can be run on all three major platforms. The installation procedure differs on each due to varying support for the technologies used. We recommend using Singularity on Linux, and Docker on Microsoft Windows and Apple Macintosh. Training your own classifier is fastest with an Nvidia graphics processing unit (GPU).
TensorFlow, the machine learning framework from Google that SongExplorer uses, supports Ubuntu, Windows and Mac. The catch is that Nvidia (and hence TensorFlow) currently doesn't support GPUs on Macs. So while using a pre-trained classifier would be fine on a Mac, because inference is just as fast on the CPU, training your own would take several times longer.
Docker, a popular container framework which provides an easy way to deploy software across platforms, supports Linux, Windows and Mac, but only supports GPUs on Linux. Moreover, on Windows and Mac it runs within a heavy-weight virtual machine, and on all platforms it requires administrator privileges to both install and run.
Singularity is an alternative to Docker that does not require root access. For this reason it is required in certain high-performance computing (HPC) environments. Currently it only natively supports Linux. There is a version for Macs which uses a light-weight virtual machine, but it is no longer actively developed. You can run Singularity on Windows within a virtual environment, like Docker does, but you would have to set that up yourself. As with Docker, GPUs are only accessible on Linux.
To use SongExplorer with a GPU on Windows one must install it manually, without the convenience of a container. We're looking for volunteers to write a Conda recipe to make this easy.
Platform-specific installation instructions can be found at Sylabs. SongExplorer has been tested with version 3.4 on Linux and 3.3 Desktop Beta on Mac.
On Linux you'll also need to install the CUDA and CUDNN drivers from nvidia.com. The latter requires you to register for an account. SongExplorer was tested and built with version 10.2.
Next download the SongExplorer image from the cloud. You can either go to SongExplorer's cloud.sylabs.io page and click the Download button, or equivalently use the command line (for which you might need an access token):
$ singularity remote login SylabsCloud
$ singularity pull library://bjarthur/janelia/songexplorer:latest
INFO: Container is signed
Data integrity checked, authentic and signed by:
ben arthur (songexplorer) <arthurb@hhmi.org>, Fingerprint XXABCXXX
$ ls -lht | head -n 2
total 16G
-rwxr-xr-x 1 arthurb scicompsoft 1.5G Sep 2 08:16 songexplorer_latest.sif*
Put these definitions in your .bashrc file (or .zshrc on macOS Catalina):
export SONGEXPLORER_BIN="singularity exec [--nv] [-B <disk-drive>] \
[--vm-cpu] [--vm-ram] <path-to-songexplorer_latest.sif>"
alias songexplorer="$SONGEXPLORER_BIN gui.sh <path-to-configuration.pysh> 5006"
Add to the SONGEXPLORER_BIN export any directories you want to access using the `-B` flag (e.g. `singularity exec -B /my/home/directory ...`).
On Mac, Singularity runs within a virtual machine that is configured by default to use only one CPU core and one GB of memory. Use the `--vm-cpu` and `--vm-ram` flags to allocate a different amount of system resources to SongExplorer (e.g. `singularity exec --vm-cpu 4 --vm-ram 4096 ...`). Note that even when SongExplorer is idle these resources will not be available to other programs, including the operating system.
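For example, a filled-in version of these definitions on a Linux machine with a GPU might look like the following (the bound directory and image path here are hypothetical; substitute your own):

export SONGEXPLORER_BIN="singularity exec --nv -B /my/data \
    $HOME/songexplorer/songexplorer_latest.sif"
alias songexplorer="$SONGEXPLORER_BIN gui.sh $HOME/songexplorer/configuration.pysh 5006"

The `--nv` flag exposes the Nvidia GPU to the container; omit it on machines without one.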
In System Configuration we'll make a copy of the default configuration file. For now, you just need to decide where you're going to put it, and then specify the full path to that file in the alias definition (e.g. "$HOME/songexplorer/configuration.pysh").
Platform-specific installation instructions can be found at Docker. Once you have it installed, open the Command Prompt on Windows, or the Terminal on Mac, and download the SongExplorer image from cloud.docker.com:
$ docker login
$ docker pull bjarthur/songexplorer
Using default tag: latest
latest: Pulling from bjarthur/songexplorer
Digest: sha256:466674507a10ae118219d83f8d0a3217ed31e4763209da96dddb03994cc26420
Status: Image is up to date for bjarthur/songexplorer:latest
$ docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
bjarthur/songexplorer latest b63784a710bb 20 hours ago 2.27GB
Use Notepad to create two ".bat" files.
"SONGEXPLORER_BIN.bat":
docker run ^
[-v <disk-drive>] [-u <userid>] [-w <working-directory>] ^
bjarthur/songexplorer %*
"songexplorer.bat":
docker run ^
[-v <disk-drive>] [-u <userid>] [-w <working-directory>] ^
-e SONGEXPLORER_BIN -h=`hostname` -p 5006:5006 ^
bjarthur/songexplorer gui.sh <path-to-configuration.pysh> 5006
The equivalent on Mac and Linux is to put these definitions in your .bashrc file (or .zshrc on macOS Catalina):
export SONGEXPLORER_BIN="docker run \
[-v <disk-drive>] [-u <userid>] [-w <working-directory>] \
bjarthur/songexplorer"
alias songexplorer="docker run \
[-v <disk-drive>] [-u <userid>] [-w <working-directory>] \
-e SONGEXPLORER_BIN -h=`hostname` -p 5006:5006 \
bjarthur/songexplorer gui.sh <path-to-configuration.pysh> 5006"
Add to these definitions any directories you want to access using the `-v` flag. You might also need to use the `-u` flag to specify your username or userid. Optionally specify the current working directory with the `-w` flag. All together these options would look something like `docker run -v C:\:/C -w /C/Users/%USERNAME% ...` on Windows, and `docker run -v /Users:/Users -w $HOME ...` on Mac.
In System Configuration we'll make a copy of the default configuration file. For now, you just need to decide where you're going to put it, and then specify the full path to that file in the alias definition (e.g. "%HOMEPATH%/songexplorer/configuration.pysh" on Windows, or "$HOME/..." on Mac and Linux).
To quit out of SongExplorer you might need to open another terminal window and issue the `stop` command:
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS ...
6a26ad9d005e bjarthur/songexplorer "detect.sh /src/p..." 3 seconds ago Up 2 seconds ...
$ docker stop 6a26ad9d005e
To make this easy, put this shortcut in your .bashrc file:
alias dockerkill='docker stop $(docker ps --latest --format "{{.ID}}")'
On Windows and Mac docker runs within a virtual machine that is configured by default to only use half the available CPU cores and half of the memory. This configuration can be changed in the Preferences window. Note that even when SongExplorer is idle these resources will not be available to other programs, including the operating system.
SongExplorer is capable of training a classifier and making predictions on recordings either locally on the host computer, or remotely on a workstation or a cluster. You specify how you want this to work by editing "configuration.pysh".
Copy the exemplar configuration file out of the container and into your home directory:
$ $SONGEXPLORER_BIN cp /opt/songexplorer/configuration.pysh $PWD [%CD% on Windows]
Inside you'll find many variables which control where SongExplorer does its work:
$ grep _where= configuration.pysh
default_where="local"
detect_where=default_where
misses_where=default_where
train_where=default_where
generalize_where=default_where
xvalidate_where=default_where
mistakes_where=default_where
activations_where=default_where
cluster_where=default_where
accuracy_where=default_where
freeze_where=default_where
classify_where=default_where
ethogram_where=default_where
compare_where=default_where
congruence_where=default_where
Each operation (e.g. detect, train, classify, generalize, etc.) is dispatched according to these `_where` variables. SongExplorer is shipped with each set to "local" via the `default_where` variable at the top of the configuration file. This value instructs SongExplorer to perform the task on the same machine used for the GUI. You can change which computer does the actual work either globally through this variable, or by configuring the operation-specific ones later in the file. Other valid values for these variables are "server", for a remote workstation that you can `ssh` into, and "cluster", for an on-premise Beowulf-style cluster with a job scheduler.
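For example, to dispatch training jobs to a cluster while continuing to do everything else locally, you would edit just the one operation-specific variable; a sketch of the relevant lines in "configuration.pysh":

default_where="local"
train_where="cluster"    # overrides default_where for the train operation only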
Note that "configuration.pysh" must be a valid python and bash file. Hence the unusual ".pysh" extension.
Irrespective of where you want to perform your compute, there are additional variables that need to be tailored to your specific resources.
When running locally SongExplorer uses a custom job scheduler to manage the resources required by different commands. This permits doing multiple things at once, as well as queueing a bunch of jobs for offline analysis. By default, each task reserves all of your computer's CPU cores, GPU cards, and memory. To tailor resources according to your particular data set, you need to specify for each kind of task how much it actually requires. Here, for example, are the default settings for training a model locally:
$ grep train_ configuration.pysh | head -8
train_gpu=0
train_where=default_where
train_cpu_ncpu_cores=-1
train_cpu_ngpu_cards=-1
train_cpu_ngigabytes_memory=-1
train_gpu_ncpu_cores=-1
train_gpu_ngpu_cards=-1
train_gpu_ngigabytes_memory=-1
Let's break this down. When training (as well as for certain other tasks), SongExplorer provides the option to use a GPU or not via the `train_gpu` variable. Depending on the size of your model, the resources you have access to and their associated cost, and how many tasks you want to run in parallel, you might or might not want, or be able, to use a GPU. The `train_{cpu,gpu}_{ncpu_cores,ngpu_cards,ngigabytes_memory}` variables specify the number of CPU cores, number of GPU cards, and number of gigabytes of memory needed, respectively, with -1 reserving everything available.
Training the model in the Tutorial below, for example, only needs two CPU cores and a gigabyte of memory. So in this case you could set `train_gpu` to 0 and `train_cpu_{ncpu_cores,ngpu_cards,ngigabytes_memory}` to 2, 0, and 1, respectively. Doing so would then permit you to train multiple models at once even on modest machines. Alternatively, if you have a GPU, you could set `train_gpu` to 1 and `train_gpu_{ncpu_cores,ngpu_cards,ngigabytes_memory}` to 2, 1, and 1. As it happens, for models of this size, training is quicker without a GPU.
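In "configuration.pysh" the CPU-only settings just described would look like this sketch:

train_gpu=0                      # train on the CPU, not the GPU
train_cpu_ncpu_cores=2           # two CPU cores per training job
train_cpu_ngpu_cards=0           # no GPU cards
train_cpu_ngigabytes_memory=1    # one gigabyte of memory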
Note that these settings don't actually limit a job to that amount of resources; rather, they just limit how many jobs run simultaneously. It is important not to overburden your computer with tasks, so don't underestimate the resources required, particularly memory consumption. To make an accurate assessment for your particular workflow, use the `top` and `nvidia-smi` commands to monitor jobs while they are running.
$ $SONGEXPLORER_BIN top -b | head -10
top - 09:36:18 up 25 days, 1:18, 0 users, load average: 11.40, 12.46, 12.36
Tasks: 252 total, 1 running, 247 sleeping, 0 stopped, 4 zombie
%Cpu(s): 0.7 us, 0.9 sy, 87.9 ni, 10.4 id, 0.1 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 32702004 total, 3726752 free, 2770128 used, 26205124 buff/cache
KiB Swap: 16449532 total, 16174964 free, 274568 used. 29211496 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21124 arthurb 20 0 55.520g 2.142g 320792 S 131.2 6.9 1:38.17 python3
1 root 20 0 191628 3004 1520 S 0.0 0.0 1:20.57 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.33 kthreadd
The output above shows that a `python3` command, which is how a training session appears, is currently using 131.2% of a CPU core (i.e. 1.3 cores) and 6.9% of the 32702004 KiB of total system memory (about 2.15 GiB). To monitor how these numbers change throughout the course of an entire job, omit the `-b` flag and do not pipe the output into `head` (so just use `$SONGEXPLORER_BIN top`), and the screen will be refreshed every few seconds.
The output below shows how to similarly monitor the GPU card. The same `python3` command as above is currently using 4946 MiB of GPU memory and 67% of the GPU cores. Use the `watch` command to receive repeated updates (i.e. `$SONGEXPLORER_BIN watch nvidia-smi`).
$ $SONGEXPLORER_BIN nvidia-smi
Fri Jan 31 09:35:13 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.39 Driver Version: 418.39 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 980 Ti Off | 00000000:03:00.0 Off | N/A |
| 22% 65C P2 150W / 250W | 4957MiB / 6083MiB | 67% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 21124 C /usr/bin/python3 4946MiB |
+-----------------------------------------------------------------------------+
Using a lab or departmental server, or perhaps a colleague's workstation remotely, is easiest if you run SongExplorer on it directly and then view the GUI in your own workstation's internet browser. To do this, simply `ssh` into the server and install SongExplorer as described above.
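If the server's port 5006 is not directly reachable from your workstation, you can forward it over ssh. This is generic ssh port forwarding, not anything specific to SongExplorer, and the hostname here is hypothetical:

$ ssh -L 5006:localhost:5006 you@yourserver.example.org

With the tunnel open, browse to "http://localhost:5006/gui" on your workstation.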
Alternatively, you can run the GUI code (in addition to viewing its output) on your own personal workstation and batch compute jobs to the remote server. This is easiest if there is a shared file system between the two computers. The advantage here is that less compute intensive jobs (e.g. freeze, accuracy) can be run on your workstation. In this case:
- Store all SongExplorer related files on the share, including the container image, "configuration.pysh", and all of your data.
- Make the remote and local file paths match by creating a symbolic link. For example, if on a Mac you use SMB to mount as "/Volumes/sternlab" an NFS drive whose path is "/groups/stern/sternlab", then add `-[v|B] /groups/stern/sternlab` to `SONGEXPLORER_BIN` and `mkdir -p /groups/stern && ln -s /Volumes/sternlab/ /groups/stern/sternlab`. With Docker you'll additionally need to open the preferences panel and configure file sharing to bind "/groups".
- Set the `SONGEXPLORER_BIN` environment variable plus the `songexplorer` alias on both your workstation and the server to point to this same image.
- You might need an RSA key pair. If so, you'll need to add `-[v|B] ~/.ssh:/ssh` to `SONGEXPLORER_BIN`.
- You might need to use the ssh flags `-i /ssh/id_rsa -o "StrictHostKeyChecking no"` in "configuration.pysh".
If you do not have a shared file system, the SongExplorer image and configuration file must be separately installed on both computers, and you'll need to do all of the compute jobs remotely.
Lastly, update "configuration.pysh" with the IP address of the server. As when doing compute locally, SongExplorer uses a job scheduler on the server to manage resources. The per-task resources used are the same as those specified for the local machine in `<task>_{gpu,cpu}_{ncpu_cores,ngpu_cards,ngigabytes_memory}`.
$ grep -A1 \'server configuration.pysh
# URL of the 'server' computer
server_ipaddr="c03u14"
Submitting jobs to a cluster is similar to using a remote workstation, so read the above section first. You might even want to try batching to another workstation first, as it can be easier to debug problems there than on a cluster.
You use your own workstation to view the GUI in a browser, and can either run the GUI code locally or on the cluster. With the former you have the option to submit only a portion of the compute jobs to the cluster, whereas with the latter they must all be performed by the cluster. Running the GUI code on the cluster also requires that the cluster be configured to permit hosting a web page. Moreover, if your cluster charges a use fee, you'll be charged even when the GUI is sitting idle.
As before, it is easiest if there is a shared file system, and if so, all files need to be on it, and the local and remote file paths must be the same or made to be the same with links. The environment variables and aliases must also be the same.
You'll likely need an RSA key pair, possibly need special `ssh` flags, and definitely need to specify the IP address of the head node and the corresponding job submission command and its flags. The best person to ask for help here is your system administrator.
$ grep -A3 \'cluster configuration.pysh
# specs of the 'cluster'
cluster_ipaddr="login1"
cluster_cmd="bsub -Ne -Pstern"
cluster_logfile_flag="-oo"
The syntax used to specify the required resources is unique to the particular scheduler your cluster uses and how it is configured. SongExplorer was developed and tested using the Load Sharing Facility (LSF) from IBM. To support any cluster scheduler (e.g. SGE, PBS, Slurm, etc.), SongExplorer ignores the `<task>_{gpu,cpu}_{ncpu_cores,ngpu_cards,ngigabytes_memory}` variables when `<task>_where` is set to "cluster" and instead uses the `<task>_{gpu,cpu}_cluster_flags` variables to provide maximum flexibility. Instead of specifying the needed cores, GPUs, and RAM explicitly, you give it the flags that the job submission command uses to allocate those same resources.
$ grep -E train.*cluster configuration.pysh
train_gpu_cluster_flags="-n 2 -gpu 'num=1' -q gpu_rtx"
train_cpu_cluster_flags="-n 12"
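For a cluster running Slurm instead of LSF, the analogous settings might look something like the sketch below. These flags are illustrative assumptions, so consult your system administrator for the exact syntax your site uses:

cluster_cmd="sbatch"
cluster_logfile_flag="-o"
train_gpu_cluster_flags="--ntasks=2 --gres=gpu:1 --partition=gpu"
train_cpu_cluster_flags="--ntasks=12"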
Let's walk through the steps needed to train a classifier completely from scratch.
Recordings need to be monaural 16-bit little-endian PCM-encoded WAV files. They should all be sampled at the same rate, which can be anything. For this tutorial we supply you with Drosophila melanogaster data sampled at 2500 Hz.
First, let's get some data bundled with SongExplorer into your home directory.
$ $SONGEXPLORER_BIN ls -1 /opt/songexplorer/data
20161207T102314_ch1-annotated-person1.csv
20161207T102314_ch1.wav*
20190122T093303a-7-annotated-person2.csv*
20190122T093303a-7-annotated-person3.csv*
20190122T093303a-7.wav*
20190122T132554a-14-annotated-person2.csv*
20190122T132554a-14-annotated-person3.csv*
20190122T132554a-14.wav*
Antigua_20110313095210_ch26.wav
my_frozen_graph_1k_0.pb*
PS_20130625111709_ch3-annotated-person1.csv
PS_20130625111709_ch3.wav*
vgg_labels.txt*
$ mkdir -p groundtruth-data/round1
$ $SONGEXPLORER_BIN cp /opt/songexplorer/data/PS_20130625111709_ch3.wav \
$PWD/groundtruth-data/round1 [%CD%/... on Windows]
Now that we have some data, let's extract the timestamps of some sounds from one of these as-yet-unannotated audio recordings.
First, start SongExplorer's GUI:
$ songexplorer
arthurb-ws2:5006
2020-08-09 09:30:02,377 Starting Bokeh server version 2.0.2 (running on Tornado 6.0.4)
2020-08-09 09:30:02,381 User authentication hooks NOT provided (default user enabled)
2020-08-09 09:30:02,387 Bokeh app running at: http://localhost:5006/gui
2020-08-09 09:30:02,387 Starting Bokeh server with process id: 1189
2020-08-09 09:30:15,054 404 GET /favicon.ico (10.60.1.47) 1.15ms
2020-08-09 09:30:15,054 WebSocket connection opened
2020-08-09 09:30:15,055 ServerConnection created
Then in your favorite internet browser navigate to the URL on the first line printed to the terminal. In the output above this is "arthurb-ws2:5006", which is my computer's name, but for you it will be different. If that doesn't work, try "http://localhost:5006/gui".
On the left you'll see three empty panels (two large squares side by side and one wide rectangle underneath) in which the sound recordings are displayed and annotated. In the middle are buttons and text boxes used to train the classifier and make predictions with it, as well as a file browser and a large editable text box with "configuration.pysh". On the right is this instruction manual for easy reference.
Click on the `Label Sounds` button and then `Detect`. All of the parameters below that are not used will be greyed out and disabled. If all of the required parameters are filled in, the `DoIt!` button in the upper right will also be enabled and turn red.
The first time you use SongExplorer all of the parameters will need to be specified. In the `File Browser`, navigate to the WAV file in the "round1/" directory and click on the `WAV Files` button. Then specify the eight numeric parameters that control the algorithm used to find sounds: In the time domain, subtract the median, take the absolute value, threshold by the median absolute deviation times the first number in `time σ`, and morphologically close gaps shorter than `time smooth` milliseconds. Separately, use multi-taper harmonic analysis (Thomson, 1982; IEEE) in the frequency domain to create a spectrogram using a window of length `freq N` milliseconds (`freq N` / 1000 * `audio_tic_rate` should be a power of two) and twice `freq NW` Slepian tapers, multiply the default threshold of the F-test by the first number in `freq ρ`, and open islands and close gaps shorter than `freq smooth` milliseconds. Sound events are considered to be periods of time which pass either of these two criteria. Quiescent intervals are similarly defined as those which pass neither the time- nor the frequency-domain criteria, using the second numbers in the `time σ` and `freq ρ` text boxes.
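To make the time-domain criterion concrete, here is a minimal Python sketch of the thresholding just described. The function and variable names are ours, not SongExplorer's, and the real implementation additionally applies the frequency-domain test:

import numpy as np
from scipy.ndimage import binary_closing

def detect_time_domain(audio, time_sigma, time_smooth_ms, audio_tic_rate):
    # subtract the median and take the absolute value
    deviation = np.abs(audio - np.median(audio))
    # threshold by the median absolute deviation times time_sigma
    above = deviation > time_sigma * np.median(deviation)
    # morphologically close gaps shorter than time_smooth_ms milliseconds
    gap_tics = int(time_smooth_ms / 1000 * audio_tic_rate)
    return binary_closing(above, structure=np.ones(gap_tics))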
Once all the needed parameters are specified, click on the red `DoIt!` button to start detecting sounds. It will turn orange while the job is being asynchronously dispatched, and then back to grey. "DETECT PS_20130625111709_ch3.wav ()" will appear in the status bar. Its font will initially be grey to indicate that it is pending, then turn black when it is running, and finally either blue if it successfully finished or red if it failed.
The result is a file of comma-separated values with the start and stop times (in tics) of sounds which exceeded a threshold in either the time or frequency domain, plus intervals which did not exceed either.
$ grep -m 3 time groundtruth-data/round1/PS_20130625111709_ch3-detected.csv
PS_20130625111709_ch3.wav,2251,2252,detected,time
PS_20130625111709_ch3.wav,2314,2316,detected,time
PS_20130625111709_ch3.wav,2404,2405,detected,time
$ grep -m 3 frequency groundtruth-data/round1/PS_20130625111709_ch3-detected.csv
PS_20130625111709_ch3.wav,113872,114032,detected,frequency
PS_20130625111709_ch3.wav,158224,158672,detected,frequency
PS_20130625111709_ch3.wav,182864,182960,detected,frequency
$ grep -m 3 ambient groundtruth-data/round1/PS_20130625111709_ch3-detected.csv
PS_20130625111709_ch3.wav,388,795,detected,ambient
PS_20130625111709_ch3.wav,813,829,detected,ambient
PS_20130625111709_ch3.wav,868,2201,detected,ambient
To cluster these detected sounds we're going to use the same method that we'll later use to cluster the hidden state activations of a trained classifier.
Click on the `Train` button to create a randomly initialized network. Then use the `File Browser` to choose directories in which to put the log files (e.g. "untrained-classifier") and to find the ground-truth data. The latter should point to a folder two levels up in the file hierarchy from the WAV and CSV files (i.e. "groundtruth-data" in this case). Check to make sure that the `Label Types` button automatically set `# steps`, `validate period`, and `validation %` to 0, `restore from` to blank, `wanted words` to "time,frequency,ambient", and `label types` to "detected". The rest of the fields, most of which specify the network architecture, are filled in with default values the first time you ever use SongExplorer, and any changes you make to them, along with all of the other text fields, are saved to a file named "songexplorer.state.yml" in the directory specified by "state_dir" in "configuration.pysh". Now press `DoIt!`. Output into the log directory are "train1.log", "train_1r.log", and "train_1r/". The former two files contain error transcripts should any problems arise, and the latter folder contains checkpoint files prefixed with "vgg.ckpt-" which save the weights of the neural network at regular intervals.
Use the `Activations` button to save the input to the neural network as well as its hidden state activations and output logits by mock-classifying these detected sounds with this untrained network. You'll need to tell it which model to use by selecting the last checkpoint file in the untrained classifier's log files with the `File Browser` (i.e. "untrained-classifier/train_1r/vgg.ckpt-0.{index,meta,data*}" in this case). The time and amount of memory this takes depend directly on the number and dimensionality of detected sounds. To limit the problem to a manageable size one can use `max samples` to randomly choose a subset of samples to cluster. (The `time σ` and `freq ρ` variables can also be used to limit how many sound events are detected in the first place.) The `Activations` button also limits the relative proportion of each `wanted word` to `equalize ratio`. In the case of "detected" `label types` you'll want to set this to a large number, as it does not matter if the number of samples which pass the "time" threshold far exceeds the "frequency" threshold, or vice versa. Output are three files in the ground-truth directory: "activations.log", "activations-samples.log", and "activations.npz". The two ending in ".log" report any errors, and the ".npz" file contains the actual data in binary format.
Now reduce the dimensionality of the hidden state activations to either two or three dimensions with the `Cluster` button. Choose to do so using either UMAP (McInnes, Healy, and Melville (2018)), t-SNE (van der Maaten and Hinton (2008)), or PCA. UMAP and t-SNE are each controlled by separate parameters (`neighbors` and `distance`, and `perplexity` and `exaggeration`, respectively), a description of which can be found in the aforementioned articles. UMAP and t-SNE can also optionally be preceded by PCA, in which case you'll need to specify the fraction of coefficients to retain using `PCA fraction`. You'll also need to choose which network layer to cluster. At this point, choose just the input layer. Output are two or three files in the ground-truth directory: "cluster.log" contains any errors, "cluster.npz" contains binary data, and "cluster-pca.pdf" shows the results of the principal components analysis (PCA) if one was performed.
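Conceptually, the UMAP reduction resembles this Python sketch using the umap-learn package. The array layout and the "hidden" key are illustrative assumptions, not the actual schema of "activations.npz":

import numpy as np
import umap

# hypothetical: one array of activations, one row per detected sound
activations = np.load("groundtruth-data/activations.npz")["hidden"]
reducer = umap.UMAP(n_components=2, n_neighbors=10, min_dist=0.1)
embedding = reducer.fit_transform(activations)  # shape (nsounds, 2)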
Finally, click on the `Visualize` button to render the clusters in the left-most panel. Adjust the size and transparency of the markers using the `Dot Size` and `Dot Alpha` sliders, respectively. Nominally there should be some structure to the clusters based on just the raw data alone. This structure will become much more pronounced after a model is trained with annotated data.
To browse through your recordings, click on one of the denser areas and a fuchsia circle (or sphere if the clustering was done in 3D) will appear. In the right panel are now displayed snippets of detected waveforms which are within that circle. The size of the circle can be adjusted with the `Circle Radius` slider, and the number of snippets displayed with `gui_snippet_n{x,y}` in "configuration.pysh". Nominally the snippets are all similar to one another, since they are neighbors in the clustered space. They will each be labeled "detected time", "detected frequency", or "detected ambient" to indicate which threshold criterion they passed and that they were detected (as opposed to annotated, predicted, or missed; see below). The color is the scale bar-- yellow is loud and purple is quiet. Clicking on a snippet will show it in greater temporal context in the wide panel below. Pan and zoom with the buttons labeled with arrows. The `Play` button can be used to listen to the sound, and if the `Video` button is selected and a movie with the same root basename exists alongside the corresponding WAV file, it will be displayed as well.
To record a manual annotation, first pick a waveform snippet that contains an unambiguous example of a particular word. Type the word's name into one of the text boxes to the right of the pan and zoom controls and hit return to activate the corresponding counter to its left. Hopefully the gray box in the upper half of the wide context window nicely demarcates the temporal extent of the word. If so, all you have to do is to double click either the grey box itself, or the corresponding snippet above, and it will be extended to the bottom half and your chosen label will be applied. If not, either double-click or click-and-drag in the bottom half of the wide context window to create a custom time span for a new annotation. In all cases, annotations can be deleted by double clicking any of the gray boxes.
For this tutorial, choose the words "mel-pulse", "mel-sine", "ambient", and "other". We use the syntax "A-B" here, where A is the species (mel being short for D. melanogaster) and B is the song type, but that is not strictly required. The word syntax could nominally be anything. The GUI does have a feature, however, to split labels at the hyphen and display groups of words that share a common prefix or suffix.
Once you have a few tens of examples for each word, it's time to train a classifier and make some predictions. First, confirm that the annotations you just made were saved into an "-annotated.csv" file in the ground-truth folder.
$ tree groundtruth-data
groundtruth-data
├── cluster.npz
├── cluster.log
├── activations.npz
├── activations.log
├── activations-samples.log
└── round1
├── PS_20130625111709_ch3-annotated.csv
├── PS_20130625111709_ch3-detected.csv
├── PS_20130625111709_ch3-detect.log
└── PS_20130625111709_ch3.wav
$ tail -5 groundtruth-data/round1/PS_20130625111709_ch3-annotated.csv
PS_20130625111709_ch3.wav,470151,470719,annotated,mel-sine
PS_20130625111709_ch3.wav,471673,471673,annotated,mel-pulse
PS_20130625111709_ch3.wav,471752,471752,annotated,mel-pulse
PS_20130625111709_ch3.wav,471839,471839,annotated,mel-pulse
PS_20130625111709_ch3.wav,492342,498579,annotated,ambient
Click on the `Make Predictions` button to disable the irrelevant actions and fields. Now train a classifier on your annotations using the `Train` button. Choose a `Logs Folder` as before (e.g. "trained-classifier1"). One hundred steps suffice for this amount of ground truth. So that we can accurately monitor the progress, withhold 40% of the annotations to validate on, and do so every 10 steps. Enter these values into the `# steps`, `validate %`, and `validate period` variables. You'll also need to change the `wanted words` variable to "mel-pulse,mel-sine,ambient,other" and `label types` to "annotated" so that it will ignore the detected annotations in the ground-truth directory. It's important to include "other" as a wanted word here, even if you haven't labeled any sounds as such, as it will be used later by SongExplorer to highlight false negatives (see Correcting Misses). Note that the total number of annotations must exceed the size of the mini-batches, which is specified by the `mini-batch` variable.
With small data sets the network should just take a minute or so to train. As your example set grows, you might want to monitor the training progress as it goes:
$ watch tail trained-classifier1/train_1r.log
Every 2.0s: tail trained-classifier1/train_1.log Mon Apr 22 14:37:31 2019
INFO:tensorflow:Elapsed 39.697532, Step #9: accuracy 75.8%, cross entropy 0.947476
INFO:tensorflow:Elapsed 43.414184, Step #10: accuracy 84.4%, cross entropy 0.871244
INFO:tensorflow:Saving to "/home/arthurb/songexplorer/trained-classifier1/train_1k/vgg.ckpt-10"
INFO:tensorflow:Confusion Matrix:
['mel-pulse', 'mel-sine', 'ambient']
[[26 9 9]
[ 0 4 0]
[ 0 0 4]]
INFO:tensorflow:Elapsed 45.067488, Step 10: Validation accuracy = 65.4% (N=52)
INFO:tensorflow:Elapsed 48.786851, Step #11: accuracy 79.7%, cross entropy 0.811077
It is common for the accuracy, as measured on the withheld data and reported as "Validation accuracy" in the log file above, to be worse than the training accuracy. If so, it is an indication that the classifier does not generalize well at that point. With more training steps and more ground-truth data though the validation accuracy should become well above chance.
Measure the classifier's performance using the `Accuracy` button. Output are the following charts and tables in the logs folder and the `train_1r` subdirectory therein:
- "train-loss.pdf" shows the loss value and the training and validation accuracies as a function of the number of training steps, wall-clock time, and epochs. Should the curves not quite plateau, choose a checkpoint to `restore from`, increase `# steps`, and train some more. If you've changed any of the parameters, you'll need to first reset them as they were, which is made easy by selecting one of the original log files and pressing the `Copy` button.
- "accuracy.pdf" shows a confusion matrix in the left panel. Each annotation is placed in this two-dimensional grid according to the label it was manually assigned and the label it was automatically predicted to be. For a perfect classifier this matrix would be diagonal-- that is, only the fuchsia numbers in the upper-left to lower-right boxes would be non-zero. The number in the upper right triangle of each square is the number of annotations in that square divided by the number of annotations in that row. For boxes along the diagonal it indicates the recall, or sensitivity, for that word, which is the percentage of true positives among all real events (true positives plus false negatives). Similarly, in the lower left is the precision, or positive predictive value-- the percentage of true positives among all (both true and false) positives. It is calculated by dividing the numbers in the upper right corner of each box by the sum of the corresponding column. In the title is the overall accuracy, which is calculated as the average of the upper right numbers along the diagonal.

  In the middle panel of "accuracy.pdf" are the precision and recall for each word, plotted separately. For a perfect classifier they would all be 100. These are simply the values in the corners of the boxes along the diagonal of the confusion matrix.

  The right panel of "accuracy.pdf" shows the overall accuracy. The fuchsia numbers along the diagonal of the confusion matrix are first divided by the number of annotations for the corresponding word (equivalent to the sum of the corresponding row) and then averaged. In this case there will only be one point plotted, but in Measuring Generalization and Searching Hyperparameters we'll train multiple models, and each will have its own point here, calculated from its individual confusion matrix.
- "validation-F1.pdf" plots the F1 score (twice the product divided by the sum of the precision and recall) over time for each of the `wanted words` separately. Check here to make sure that the accuracy of each word has converged.
- "validation-PvR-.pdf" plots, separately for each word, the trajectory of the precision-versus-recall curve over the number of training steps.
- "train_1r/precision-recall.ckpt-<>.pdf" and "train_1r/sensitivity-specificity.ckpt-<>.pdf" show how the ratio of false positives to false negatives changes as the threshold used to call an event changes. The areas underneath these curves are widely-cited metrics of performance.
- "train_1r/probability-density.ckpt-<>.pdf" shows, separately for each word, histograms of the values of the classifier's output taps across all of that word's annotations. The difference between a given word's probability distribution and that of the second most probable word can be used as a measure of the classifier's confidence.
- "train_1r/thresholds.ckpt-<>.csv" lists the word-specific probability thresholds that one can use to achieve a specified precision-recall ratio. Use the `P/Rs` variable to specify which ratios to include. This file is used when creating ethograms (see Making Predictions).
- The CSV files in the "train_1r/predictions.ckpt-<>" directory list the specific annotations in the withheld validation set which were mis-classified (plus those that were correct). The WAV files and time stamps therein can be used to look for patterns in the raw data (see Examining Errors).
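For reference, the precision, recall, and F1 quantities appearing in these figures reduce to the standard definitions; a quick Python sketch:

def precision(tp, fp): return tp / (tp + fp)       # true positives among all calls
def recall(tp, fn):    return tp / (tp + fn)       # true positives among all real events
def f1(p, r):          return 2 * p * r / (p + r)  # harmonic mean of precision and recall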
At this point in the tutorial we have just trained a single model, but SongExplorer does have workflows where multiple models are saved to a single `Logs Folder` (e.g. if `replicates` is >1, or if `Omit One`, `Omit All`, or `X-Validate` is used). In these cases, the left panel of "accuracy.pdf" will show the sum of the confusion matrices across all models, the right panel will have a gray box showing the mean and standard deviation of the overall accuracy across all models, and two additional files will be produced:
- "train-overlay.pdf" shows the same validation accuracies as in "train-loss.pdf", overlayed across all replicates, leave-one-out models, or cross-validation folds.
- "confusion-matrices.pdf" shows the individual confusion matrices for each replicate, model, or fold.
For the next round of manual annotations, we're going to have this newly trained classifier find sounds for us instead of using a simple threshold. And we're going to do so with a different recording so that the classifier learns to be insensitive to experimental conditions.
First let's get some more data bundled with SongExplorer into your home directory:
$ mkdir groundtruth-data/round2
$ $SONGEXPLORER_BIN cp /opt/songexplorer/data/20161207T102314_ch1.wav \
$PWD/groundtruth-data/round2 [%CD%/... on Windows]
Use the `Freeze` button to save the classifier's neural network graph structure and weight parameters into the single file that TensorFlow needs for inference. You'll need to choose a checkpoint to use with the `File Browser`, as you did before when saving the activations (i.e. "trained-classifier1/train_1r/vgg.ckpt-100.{index,meta,data*}" in this case). Output into the log files directory are "freeze.ckpt-<>.log" and "frozen-graph.ckpt-<>.log" files for errors, and "frozen-graph.ckpt-<>.pb" containing the binary data. This latter PB file can in the future be chosen as the model instead of a checkpoint file.
Now use the `Classify` button to generate probabilities over time for each annotated word. Specify which recordings using the `File Browser` and the `WAV Files` button. Note that while the "Checkpoint File" button changed to "PB File", you can leave the text box as is; all SongExplorer needs is a filename from which it can parse "ckpt-*". The probabilities are first stored in a file ending in ".tf", and then converted to WAV files for easy viewing.
$ ls groundtruth-data/round2/
20161207T102314_ch1-ambient.wav 20161207T102314_ch1-other.wav
20161207T102314_ch1-classify.log 20161207T102314_ch1.tf
20161207T102314_ch1-mel-pulse.wav 20161207T102314_ch1.wav
20161207T102314_ch1-mel-sine.wav
Discretize these probabilities using thresholds based on a set of precision-recall ratios using the `Ethogram` button. Choose one of the "thresholds.ckpt-*.csv" files in the log files folder using the `File Browser`. These are created by the `Accuracy` button and controlled by the `P/Rs` variable at the time you quantified the accuracy. For convenience you can also just leave this text box as it was when freezing or classifying; all SongExplorer needs is a filename in the logs folder from which it can parse "ckpt-*". You'll also need to specify which ".tf" files to threshold using the `TF Files` button. Again, for convenience, you can specify the ".wav" files too, and hence leave this as it was when classifying.
$ ls -t1 groundtruth-data/round2/ | head -4
20161207T102314_ch1-ethogram.log
20161207T102314_ch1-predicted-0.5pr.csv
20161207T102314_ch1-predicted-1.0pr.csv
20161207T102314_ch1-predicted-2.0pr.csv
$ head -5 groundtruth-data/round2/20161207T102314_ch1-predicted-1.0pr.csv
20161207T102314_ch1.wav,19976,20008,predicted,mel-pulse
20161207T102314_ch1.wav,20072,20152,predicted,mel-sine
20161207T102314_ch1.wav,20176,20232,predicted,mel-pulse
20161207T102314_ch1.wav,20256,20336,predicted,mel-sine
20161207T102314_ch1.wav,20360,20416,predicted,mel-pulse
The resulting CSV files are in the same format as those generated when we detected sounds in the time and frequency domains as well as when we manually annotated words earlier using the GUI. Note that the fourth column distinguishes whether these words were detected, annotated, or predicted.
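Since all of these CSV files share the same five columns, they are easy to analyze programmatically. Here is a sketch assuming pandas; the column names are ours, as the files have no header row:

import pandas as pd

columns = ["wavfile", "start_tic", "stop_tic", "kind", "label"]
df = pd.read_csv("groundtruth-data/round2/20161207T102314_ch1-predicted-1.0pr.csv",
                 header=None, names=columns)
print(df["label"].value_counts())  # how many of each word were predicted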
In the preceding section we generated three sets of predicted sounds by applying three sets of word-specific thresholds to the probability waveforms:
$ cat trained-classifier/thresholds.csv
precision/recall,2.0,1.0,0.5
mel-pulse,0.9977890984593017,0.508651224000211,-1.0884193525904096
mel-sine,0.999982304641803,0.9986744484433365,0.9965472849431617
ambient,0.999900757998532,0.9997531463467944,0.9996660975683063
Higher thresholds result in fewer false positives and more false negatives. A precision-recall ratio of one means these two types of errors occur at equal rates. Your experimental design drives this choice.
Let's manually check whether our current classifier accurately calls sounds using these thresholds. First, click on the `Fix False Positives` button. Then choose one of the predicted CSV files that has a good mix of the labels, and either delete the others or move them outside of the ground-truth directory. Double check that the `label types` variable was auto-populated with "annotated,predicted". Not having "detected" in this field ensures that "detected.csv" files in the ground-truth folder are ignored. Finally, cluster and visualize the neural network's hidden state activations as we did before, using the `Activations`, `Cluster`, and `Visualize` buttons. This time though, choose to cluster just the last hidden layer. So that words with few samples are not obscured by those with many, randomly subsample the latter by setting `equalize ratio` to a small integer when saving the hidden state activations.
Now let's correct the mistakes! Select `predicted` and `ambient` from the `kind` and `no hyphen` pull-down menus, respectively, and then click on a dense part of the density map. Optionally adjust the `dot size`, `dot alpha`, and `circle radius` sliders. Were the classifier perfect, all the snippets now displayed would look like background noise. Click on the ones that don't and manually annotate them appropriately. Similarly select `mel-` and `-pulse` from the `species` and `word` pull-down menus and correct any mistakes, and then `mel-` and `-sine`.
Keep in mind that the only words which show up in the clusters are those that exceed the chosen threshold. Any mistakes you find in the snippets are hence strictly false positives.
It's important that false negatives are corrected as well. One way to find them is to click on random snippets and look in the surrounding context in the window below for sounds that have not been predicted. A better way is to home in on detected sounds that don't exceed the probability threshold.
To systematically label missed sounds, first click on the `Fix False Negatives` button. Then detect sounds in the recording you just classified, using the `Detect` button as before, and create a list of the subset of these sounds which were not assigned a label using the `Misses` button. For the latter, you'll need to specify both the detected and predicted CSV files with the `File Browser` and the `WAV Files` button. The result is another CSV file, this time ending in "missed.csv":
$ head -5 groundtruth-data/round2/20161207T102314_ch1-missed.csv
20161207T102314_ch1.wav,12849,13367,missed,other
20161207T102314_ch1.wav,13425,13727,missed,other
20161207T102314_ch1.wav,16105,18743,missed,other
20161207T102314_ch1.wav,18817,18848,missed,other
20161207T102314_ch1.wav,19360,19936,missed,other
Now visualize the hidden state activations-- double check that the `label types` variable was auto-populated with "annotated,missed" by the `Fix False Negatives` button, and then use the `Activations`, `Cluster`, and `Visualize` buttons in turn.
Examine the false negatives by selecting `missed` in the `kind` pull-down menu and clicking on a dense cluster. Were the classifier perfect, none of the snippets would be an unambiguous example of any of the labels you trained upon earlier. Annotate any that are, and add new label types for sound events which fall outside the current categories.
From here, we just keep alternating between annotating false positives and false negatives, using a new recording for each iteration, until mistakes become sufficiently rare. The most effective annotations are those that correct the classifier's mistakes, so don't spend much time, if any, annotating what it got right.
Each time you train a new classifier, all of the existing "predicted.csv", "missed.csv", ".tf", and word-probability WAV files are moved to an "oldfiles" sub-folder as they will be out of date. You might want to occasionally delete these folders to conserve disk space:
$ rm groundtruth-data/*/oldfiles*
Ideally a new model would be trained after each new annotation is made, so that subsequent time is not spent correcting a prediction (or lack thereof) that would no longer be made in error. Training a classifier takes time though, so a balance must be struck with how quickly you alternate between annotating and training.
Since there are more annotations each time you train, use a proportionately smaller percentage of them for validation and a proportionately larger number of training steps. You don't need more than ten-ish annotations for each word to confirm that the learning curves converge, and a hundred-ish suffice to quantify accuracy. Since the learning curves generally don't converge until the entire data set has been sampled many times over, set `# steps` to be several fold greater than the number of annotations (shown in the table near the labels) divided by the `mini-batch` size. (For example, with 1000 annotations and a mini-batch of 32, one epoch is about 31 steps, so a few hundred steps would be a sensible starting point.) Check that it actually converges with the "train-loss.pdf", "validation-F1.pdf", and "validation-PvR-.pdf" figures generated by the `Accuracy` button. If the accuracy converges before an entire epoch has been trained upon, use a smaller `learning rate`.
As the wall-clock time spent training is generally shorter with larger mini-batches, set it as high as the memory in your GPU will permit. Multiples of 32 are generally faster. The caveat here is that exceedingly large mini-batches can reduce accuracy, so make sure to compare it with smaller ones.
One should make an effort to choose a recording at each step that is most different from the ones trained upon so far. Doing so will produce a classifier that generalizes better.
Once a qualitatively acceptable number of errors in the ethograms is achieved, quantitatively measure your model's ability to generalize by leaving entire recordings out for validation (see Measuring Generalization), and/or by using cross-validation (see Searching Hyperparameters). Then train a single model with nearly all of your annotations for use in your experiments. Optionally, use `replicates` to train multiple models with different batch orderings and/or initial weights to measure the variance. Report accuracy on an entirely separate set of densely-annotated test data (see Testing Densely).
If a mistake is made while annotating, say the wrong label is applied to a particular time interval, and one notices this immediately, the `Undo` button can be used to correct it.
Sometimes though, mistakes might slip into the ground truth and a model is trained with them. These latter mistakes can be corrected in a fashion similar to correcting false positives and false negatives. Simply cluster the hidden state activations using the `Activations`, `Cluster`, and `Visualize` buttons as before, making sure that "annotated" is in `label types`. Then click on `annotated` in the `kind` pull-down menu and select one of your labels (e.g. `mel-` in `species` and `-pulse` in `word`). Scan through the visualized clusters by clicking on several points and looking at the snippets therein. If you find an error, simply choose the correct label in one of the text boxes below and then double click on either the snippet itself or the corresponding gray box in the upper half of the wide context window. If you want to remove the annotation entirely, choose a label with an empty text box and double-click. In both cases, the entry in the original "annotated.csv" file is removed, and in the former case a new entry is created in the current "annotated.csv" file. Should you make a mistake while correcting a mistake, simply `Undo` it, or double click it again. In this case, the original CSV entry remains deleted and the new one is modified in the current "annotated.csv" file.
Up to this point we have validated on a small portion of each recording. Once you have annotated many recordings though, it is good to set aside entire WAV files to validate on. In this way we measure the classifier's ability to extrapolate to different microphones, individuals, or whatever other characteristics that are unique to the withheld recordings.
To train one classifier with a single recording or set of recordings withheld for validation, first click on `Generalize` and then `Omit All`. Use the `File Browser` to select either (1) specific WAV file(s), (2) a text file containing a list of WAV file(s) (either comma separated or one per line), or (3) the ground-truth folder or a subdirectory therein. Finally press the `Validation Files` button and `DoIt!`.
To train multiple classifiers, each of which withholds a single recording in a set you specify, click on `Omit One`. Select the set as described above for `Omit All`. The `DoIt!` button will then iteratively launch a job for each WAV file that has been selected, storing the results in the same `Logs Folder` but in separate files and subdirectories suffixed with the letter "w". Of course, training multiple classifiers is quickest when done simultaneously instead of sequentially. If your model is small, you might be able to fit multiple on a single GPU (see the `models_per_job` variable in "configuration.pysh"). Otherwise, you'll need a machine with multiple GPUs, access to a cluster, or patience.
A simple jitter plot of the accuracies on withheld recordings is included in the output of the `Accuracy` button (right panel of "accuracy.pdf"). It will likely be worse than for a model trained on a portion of each recording. If so, label more data, or try modifying the hyperparameters (see Searching Hyperparameters).
Achieving high accuracy is not just about annotating lots of data, it also depends on choosing the right model architecture. While SongExplorer is (currently) set up solely for convolutional neural networks, there are many free parameters by which to tune its architecture. You configure them by editing the variables itemized below, and then use cross-validation to compare different choices. One could of course also modify the source code to permit radically different neural architectures, or even something other than neural networks.
- `context` is the temporal duration, in milliseconds, that the classifier inputs.
- `shift by` is the asymmetry, in milliseconds, of `context` with respect to the point in time that is annotated or being classified. `shift by` divided by `stride` (see below) should be an integer. For positive values the duration of the context preceding the annotation is longer than that succeeding it.
- `representation` specifies whether to use the raw waveform directly, to make a spectrogram of the waveform to input to the neural network, or to use a mel-frequency cepstrum (see Davis and Mermelstein 1980; IEEE). Waveforms do not make any assumptions about the data, and so can learn arbitrary features that spectrograms and cepstrums might be blind to, but they need more annotations for training to converge.
- `window` is the length of the temporal slices, in milliseconds, that constitute the spectrogram. `window` / 1000 * `audio_tic_rate` should round down to a power of two (see the sanity-check sketch after this list).
- `stride` is the time, in milliseconds, by which the `window`s in the spectrogram are shifted. 1000 / `stride` must be an integer.
- `mel & DCT` specifies how many taps to use in the mel-frequency cepstrum. The first number is for the mel-frequency resampling and the second for the discrete cosine transform. Modifying these is tricky, as valid values depend on `audio_tic_rate` and `window`. The table below shows the maximum permissible values for each, and these are what is recommended. See the code in "tensorflow/contrib/lite/kernels/internal/mfcc.cc" for more details.
sample rate | window | mel,DCT |
---|---|---|
10000 | 12.8 | 28,28 |
10000 | 6.4 | 15,15 |
5000 | 6.4 | 11,11 |
2500 | 6.4 | 7,7 |
1250 | 6.4 | 3,3 |
10000 | 3.2 | 7,7 |
6000 | 10.7 | 19,19 |
- `dropout` is the fraction of hidden units to omit on each forward pass during training. See Srivastava, Hinton, et al. (2014; J. Machine Learning Res.).
- `optimizer` can be one of stochastic gradient descent (SGD), Adam, AdaGrad, or RMSProp.
- `learning rate` specifies the fraction of the gradient by which to change each weight at each training step. Set it such that the training curve in "train-loss.pdf" does not saturate until after at least one epoch of ground truth has been trained upon.
- `kernels` is a 3-vector of the sizes of the convolutional kernels. The first value is used for each layer until the tensor height in the frequency axis is smaller than it. The second value is then repeatedly used until the height is again smaller than it. Finally, the third value is used until the width is less than `last conv width`. Only the third value matters when `representation` is "waveform".
- `# features` is the number of feature maps to use at each of the corresponding stages in `kernels`. See LeCun et al. (1989; Neural Computation).
- `dilate after` specifies the first layer, starting from zero, at which to start dilating the convolutional kernels. See Yu and Koltun (2016; arXiv).
- `stride after` specifies the first layer, starting from zero, at which to start striding the convolutional kernels by two.
- `connection` specifies whether to use identity bypasses, which can help models with many layers converge. See He, Zhang, Ren, and Sun (2015; arXiv).
- `weights seed` specifies whether to randomize the initial weights or not. A value of -1 results in different values for each fold; they are also different each time you run `X-Validate`. Any other number results in a set of initial weights that is unique to that number across all folds and repeated runs.
- `batch seed` similarly specifies whether to randomize the order in which samples are drawn from the ground-truth data set during training. A value of -1 results in a different order for each fold and run; any other number results in a unique order specific to that number across folds and runs.
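As a quick sanity check of the `window` constraint above, here is a sketch in Python using one row of the table:

window_ms, audio_tic_rate = 6.4, 2500    # one row of the table above
ntics = int(window_ms / 1000 * audio_tic_rate)    # round down to whole tics
print(ntics, (ntics & (ntics - 1)) == 0)    # 16 True, i.e. a power of two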
To perform a simple grid search for the optimal value of a particular
hyperparameter, first choose how many folds you want to partition your
ground-truth data into using `k-fold`. Then set the hyperparameter of interest
to the first value you want to try and choose a name for the `Logs Folder` such
that its prefix will be shared across all of the hyperparameter values you plan
to validate. Suffix any additional hyperparameters of interest using
underscores. (For example, to search mini-batch and keep track of kernel size
and feature maps, use "mb-64_ks129_fm64".) If your model is small, use
`models_per_job` in "configuration.pysh" to train multiple folds on a single GPU.
Click the `X-Validate` button and then `DoIt!`. One classifier will be trained
for each fold, using it as the validation set and the remaining folds for
training. Separate files and subdirectories are created in the `Logs Folder`
that are suffixed by the fold number and the letter "k". Plot overlaid
training curves with the `Accuracy` button, as before. Repeat the above
procedure for each of the remaining hyperparameter values you want to try (e.g.
"mb-128_ks129_fm64", "mb-256_ks129_fm64", etc.). Then use the `Compare` button
to create a figure of the cross-validation data over the hyperparameter values,
specifying the prefix that the logs folders have in common ("mb" in this case).
Three files are output:

- "[suffix]-compare-confusion-matrices.pdf" contains the summed confusion matrix for each of the values tested.
- "[suffix]-compare-overall-params-speed.pdf" plots the accuracy, number of trainable parameters, and training time for each model.
- "[suffix]-compare-precision-recall.pdf" shows the final false negative and false positive rates for each model and wanted word.
Mistakes can often be corrected by annotating more sounds similar to those that were misclassified. To find such sounds, cluster the errors made on the ground-truth annotations together with the sounds detected in your recordings. Then look for localized hot spots of mistakes and make annotations therein.
SongExplorer provides two ways to generate lists of errors, which you'll need to
choose between. The `Accuracy` button does so just for the validation data,
while `Activations` uses the entire ground truth or a randomly sampled subset
thereof.

As mentioned earlier, the `Accuracy` button creates a
"predictions/" folder in the `Logs Folder` containing CSV files itemizing whether
the sounds in the validation set were correctly or incorrectly classified.
Each CSV file corresponds to a sub-folder within the ground-truth folder. The
file format is similar to SongExplorer's other CSV files, with the difference being
that the penultimate column is the prediction and the final one the annotation.
To use these predictions, copy these CSV files into their corresponding
ground-truth sub-folders.
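For instance, using the file and folder names from this tutorial (a sketch; adjust the paths to your own layout):

$ cp trained-classifier1/predictions/round1-mistakes.csv groundtruth-data/round1/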
$ tail -n 10 trained-classifier1/predictions/round1-mistakes.csv
PS_20130625111709_ch3.wav,377778,377778,correct,mel-pulse,mel-pulse
PS_20130625111709_ch3.wav,157257,157257,correct,mel-pulse,mel-pulse
PS_20130625111709_ch3.wav,164503,165339,correct,ambient,ambient
PS_20130625111709_ch3.wav,379518,379518,mistaken,ambient,mel-pulse
PS_20130625111709_ch3.wav,377827,377827,correct,mel-pulse,mel-pulse
PS_20130625111709_ch3.wav,378085,378085,correct,mel-pulse,mel-pulse
PS_20130625111709_ch3.wav,379412,379412,mistaken,ambient,mel-pulse
PS_20130625111709_ch3.wav,160474,161353,correct,ambient,ambient
PS_20130625111709_ch3.wav,207780,208572,correct,mel-sine,mel-sine
PS_20130625111709_ch3.wav,157630,157630,correct,mel-pulse,mel-pulse
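Since the format is plain CSV, these files are also easy to summarize programmatically. A minimal sketch in Python that tallies the correct-versus-mistaken column:

import csv
from collections import Counter

# tally the fourth column of a predictions CSV
tally = Counter()
with open("trained-classifier1/predictions/round1-mistakes.csv") as f:
    for wavfile, start, stop, result, predicted, annotated in csv.reader(f):
        tally[result] += 1
print(tally)    # e.g. Counter({'correct': 8, 'mistaken': 2})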
Similarly, the `Activations` button creates an "activations.npz" file
containing the logits of the output layer (which is just a vector of word
probabilities), as well as the correct answer from the ground-truth
annotations. To turn these data into a CSV file, use the `Mistakes` button.
In the ground-truth sub-folders, CSV files are created for each WAV file, with
an extra column just like above. No need to copy any files here.
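The archive can also be inspected directly with NumPy. A minimal sketch, assuming only that "activations.npz" is in the current directory (the key names within are printed rather than assumed):

import numpy as np

# list the arrays saved by the Activations button
data = np.load("activations.npz", allow_pickle=True)
print(data.files)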
Now detect sounds in the ground-truth recordings for which you haven't done so
already. Press the `Examine Errors` wizard and confirm that `label types` is
set to "detected,mistaken", and save the hidden state activations, cluster, and
visualize as before. Select `mistaken` in the `kind` pull-down menu to look
for a localized density. View the snippets in any hot spots to examine the
shapes of waveforms that are mis-classified-- the ones whose text label, which
is the prediction, does not match the waveform. Then select `detected` in the
`kind` pull-down menu and manually annotate similar waveforms. Nominally they
will cluster at the same location.
The accuracy statistics reported in the confusion matrices described above are limited to the points in time which are annotated. If an annotation withheld to validate upon does not elicit the maximum probability across all output taps at the corresponding label, it is considered an error. Quantifying accuracy in this way is a bit misleading, as when a model is used to make ethograms, a word-specific threshold is applied to the probabilities instead. Moreover, ethograms are made over the entire recording, not just at specific times of interest. To more precisely quantify a model's accuracy then, as it would be used in your experiments, a dense annotation is needed-- one for which all occurrences of any words of interest are annotated.
To quantify an ethogram's accuracy, first select a set of recordings in your
validation data that are collectively long enough to capture the variance in
your data set but short enough that you are willing to manually label every
word in them. Then detect and cluster the sounds in these recordings using the
`Detect`, `Activations`, `Cluster`, and `Visualize` buttons as described
earlier. Annotate every occurrence of each word of interest by jumping to the
beginning of each recording and panning all the way to the end. Afterwards,
manually suffix each resulting "annotated.csv" file with the name of the
annotator (e.g. "annotated-<annotator>.csv"). Take your best model to date and
make ethograms of these densely annotated recordings using the `Classify` and
`Ethogram` buttons as before. Finally, use the `Congruence` button to plot the
fraction of false positives and negatives, specifying which files you've
densely annotated with `ground truth` and either `validation files` or
`test files` (a comma-separated list of .wav files, a text file of
.wav filenames, or a folder of .wav files; see Measuring
Generalization). If the accuracy is not
acceptable, iteratively adjust the hyperparameters, train a new model, and make
new ethograms and congruence plots until it is. You might also need to add new
annotations to your training set.
Once the accuracy is acceptable on validation data, quantify the accuracy on a densely annotated test set. The network should have never been trained or validated on these latter data before; otherwise the resulting accuracy could be spuriously better. Label every word of interest as before, make ethograms with your best model, and plot the congruence with SongExplorer's predictions. Hopefully the accuracy will be okay. If not, and you want to change the hyperparameters or add more training data, then the proper thing to do is to use this test data as training or validation data going forward, and densely annotate a new set of data to test against.
The congruence between multiple human annotators can be quantified using the
same procedure. Simply create "annotated-<annotator>.csv" files for each one.
The plots created by `Congruence` will include lines for the number of sounds
labeled by all annotators (including SongExplorer), by only each annotator, and
by everyone except a given annotator.
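As a toy illustration of tic-wise congruence (not SongExplorer's actual implementation), one can mark which tics each annotator's intervals cover and count the agreements and disagreements:

# mark which tics are covered by a list of (start, stop) intervals
def covered(intervals, ntics):
    tics = [False] * ntics
    for start, stop in intervals:
        for t in range(start, stop + 1):
            tics[t] = True
    return tics

a = covered([(10, 20), (40, 50)], 100)    # hypothetical annotator 1
b = covered([(12, 22)], 100)              # hypothetical annotator 2
print(sum(x and y for x, y in zip(a, b)))        # labeled by both: 9
print(sum(x and not y for x, y in zip(a, b)))    # only annotator 1: 13
print(sum(y and not x for x, y in zip(a, b)))    # only annotator 2: 2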
Much as one can examine the mistakes of a particular model with respect to
sparsely annotated ground truth by clustering with "mistaken" as one of the
`label types`, one can look closely at the errors in congruence between a model
and a densely annotated test set by using
"everyone|{tic,word}-{only,not}{1.0pr,annotator1,annotator2,...}" as the
`label types`. The `Congruence` button generates a number of "disjoint.csv" files:
"disjoint-everyone.csv" contains the intersection of intervals that SongExplorer
and all annotators agreed upon; "disjoint-only.csv" files contain the intervals
which only SongExplorer or one particular annotator labelled; "disjoint-not.csv"
contains those which were labelled by everyone except SongExplorer or a given
annotator. Choose one or all of these label types and then use the
`Activations`, `Cluster`, and `Visualize` buttons as before.
Should no amount of adjustments to the hyperparameters yield acceptable
accuracy, and annotating additional ground truth becomes tiresome, try
specifying the a priori occurrence frequency of each word in `prevalences`.
For a given interval of time, enter a comma-separated list of expected
durations for each entry in `wanted words` (e.g. 6,12,42 seconds for a minute
of mel-pulse,mel-sine,ambient respectively); alternatively, the relative
probability for each word can be given (e.g. 0.1,0.2,0.7). The probability
waveforms generated by `Classify` will then be adjusted on a word-specific basis
to account for these known imbalanced distributions. Note that a similar
effect can be achieved by changing the `P/Rs` variable, but there the
thresholds are adjusted instead of the probabilities, and they are changed
equally for all words.
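The two forms are equivalent; a quick sketch of the conversion from expected durations to relative probabilities:

# expected durations, in seconds, within a minute of recording
durations = {"mel-pulse": 6, "mel-sine": 12, "ambient": 42}
total = sum(durations.values())    # 60 seconds
prevalences = {word: d / total for word, d in durations.items()}
print(prevalences)    # {'mel-pulse': 0.1, 'mel-sine': 0.2, 'ambient': 0.7}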
One can also use thresholds derived from this dense annotation to achieve better accuracy. Doing so is particularly useful when making predictions on recordings that were made under different conditions than the data used to train and validate the model. To do so, choose the "thresholds-dense.ckpt-*.csv" file when making ethograms. This file is created when the congruence is quantified, and the precision-recall ratios therein are controlled by the `P/Rs` box used when the sparse annotation accuracy was quantified with the `Accuracy` button.
After amassing a sizeable amount of ground truth one might wonder whether one has manually annotated all of the types of words that exist in the recordings. One way to check for any missed types is to look for hot spots in the clusters of detected sounds that have no corresponding annotations. Annotating known types in these spots should improve generalization too.
First, set `label types` to "annotated" and train a model that includes "time"
and "frequency" plus all of your existing `wanted words`
("mel-pulse,mel-sine,ambient,other"). Then, use the `Detect` button to
threshold the recordings that you want to search for novel sounds. Save their
hidden state activations, along with those of the manually annotated sounds,
using the `Activations` button by setting the `label types` to "annotated,detected".
Cluster and visualize as before. Now rapidly and alternately switch between
`annotated` and `detected` in the `kind` pull-down menu to find any differences
in the density distributions. Click on any new hot spots you find in the
detected clusters, and annotate sounds which are labeled as detected but not
annotated. Create new word types as necessary.
For some tasks it may be easier to write code instead of using the GUI-- tasks
which require many tedious mouse clicks, for example, or simpler ones that must
be performed repeatedly. To facilitate coding your analysis, SongExplorer is
structured such that each action button (`Detect`, `Misses`, `Activations`,
etc.) is backed by a Linux bash script. At the top of each script is
documentation showing how to call it. Here, for example, is the interface for
`Detect`:
$ $SONGEXPLORER_BIN head -n 8 /opt/songexplorer/src/detect.sh
#!/bin/bash
# threshold an audio recording in both the time and frequency spaces
# detect.sh <full-path-to-wavfile> <time-sigma> <time-smooth-ms>
# <frequency-n-ms> <frequency-nw> <frequency-p> <frequency-smooth-ms>
# <audio-tic-rate> <audio-nchannels>
# e.g.
# $SONGEXPLORER_BIN detect.sh \
#     `pwd`/groundtruth-data/round2/20161207T102314_ch1_p1.wav \
#     6 6.4 25.6 4 0.1 25.6 2500 1
The following bash code directly calls this script to detect sounds in a set of recordings residing in different folders:
$ wavfiles=(
      groundtruth-data/round1/PS_20130625111709_ch3.wav
      groundtruth-data/round2/20161207T102314_ch1.wav
      groundtruth-data/round3/Antigua_20110313095210_ch26.wav
      )
$ for wavfile in "${wavfiles[@]}" ; do
      $SONGEXPLORER_BIN detect.sh "$wavfile" 6 6.4 25.6 4 0.1 25.6 2500 1
  done
The above workflow could also easily be performed in Julia, Python, Matlab, or any other language that can execute shell commands.
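For example, a minimal Python equivalent of the bash loop above, assuming the SONGEXPLORER_BIN environment variable is set as described in Installation:

import os
import subprocess

# detect sounds in several recordings by shelling out to detect.sh
wavfiles = [
    "groundtruth-data/round1/PS_20130625111709_ch3.wav",
    "groundtruth-data/round2/20161207T102314_ch1.wav",
    "groundtruth-data/round3/Antigua_20110313095210_ch26.wav",
]
for wavfile in wavfiles:
    cmd = os.environ["SONGEXPLORER_BIN"] + \
          " detect.sh " + wavfile + " 6 6.4 25.6 4 0.1 25.6 2500 1"
    subprocess.run(cmd, shell=True, check=True)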
Alternatively, one can write a Python script which invokes SongExplorer's GUI to programmatically fill text boxes with values and to push action buttons:
import sys
import os
from subprocess import run, PIPE, STDOUT    # used below to manage the job scheduler

# load the GUI
sys.path.append("/opt/songexplorer/src/gui")
import model as M
import view as V
import controller as C

# start the GUI
M.init("configuration.pysh")
V.init(None)
C.init(None)

# start the job scheduler
run(["hetero", "start", str(M.local_ncpu_cores),
     str(M.local_ngpu_cards), str(M.local_ngigabytes_memory)])

# set the needed textbox variables
V.time_sigma_string.value = "6"
V.time_smooth_ms_string.value = "6.4"
V.frequency_n_ms_string.value = "25.6"
V.frequency_nw_string.value = "4"
V.frequency_p_string.value = "0.1"
V.frequency_smooth_ms_string.value = "25.6"

# repeatedly push the Detect button
wavpaths_noext = [
    "groundtruth-data/round1/PS_20130625111709_ch3",
    "groundtruth-data/round2/20161207T102314_ch1",
    "groundtruth-data/round3/Antigua_20110313095210_ch26",
]
for wavpath_noext in wavpaths_noext:
    V.wavtfcsvfiles_string.value = wavpath_noext + ".wav"
    C.detect_actuate()

# stop the job scheduler
run(["hetero", "stop"], stdout=PIPE, stderr=STDOUT)
For more details see the system tests in /opt/songexplorer/test/tutorial.{sh,py}. These two files implement, as bash and python scripts respectively, the entire workflow presented in this Tutorial, from Detecting Sounds all the way to Testing Densely.
- Sometimes using control-C to quit out of SongExplorer does not work. In this
case, find the errant process IDs with `ps auxc | grep -E '(gui.sh|bokeh)'` and
then `kill -9 <pid>`. Errant jobs can be killed similarly.
- The `WAV,TF,CSV Files` text box, being plural, can contain multiple
comma-separated filenames. Select multiple files in the File Browser using
shift/command-click as you would in most other file browsers.
The code is hosted on GitHub. Please file an issue there for all bug reports and feature requests. Pull requests are also welcome! For major changes it is best to file an issue first so we can discuss implementation details. Please work with us to improve SongExplorer instead of forking your own version.
To build an image, change to a local (i.e. not NFS mounted; e.g. /opt/users) directory and:
$ git clone https://github.com/JaneliaSciComp/SongExplorer.git songexplorer
$ rm -rf songexplorer/.git
$ sudo singularity build -s songexplorer.img songexplorer/containers/singularity.def
To confirm that the image works:
$ singularity run --nv songexplorer.img
>>> import tensorflow as tf
>>> msg = tf.constant('Hello, TensorFlow!')
>>> tf.print(msg)
Compress the image into a single file:
$ sudo singularity build songexplorer.sif songexplorer.img
Next create an access token at cloud.sylabs.io and login using:
$ singularity remote login SylabsCloud
Then push the image to the cloud:
$ singularity sign songexplorer.sif
$ singularity push songexplorer.sif library://bjarthur/janelia/songexplorer:<version>
To build an image without GPU support, comment out the section titled "install
CUDA" in "singularity.def" and omit the --nv
flags.
To use a copy of the SongExplorer source code outside of the container, set
SINGULARITYENV_PREPEND_PATH to the full path to SongExplorer's src
directory in
your shell environment. source_path
in "configuration.pysh" must be set
similarly if using a remote workstation or a cluster.
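For example, in bash (the path here is a placeholder for your own copy of the source):

$ export SINGULARITYENV_PREPEND_PATH=/opt/users/songexplorer/src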
To start docker on linux and set permissions:
$ service docker start
$ setfacl -m user:$USER:rw /var/run/docker.sock
To build a docker image and push it to docker hub:
$ cd songexplorer
$ docker build --file=containers/dockerfile --tag=bjarthur/songexplorer \
[--no-cache=true] .
$ docker login
$ docker {push,pull} bjarthur/songexplorer
To monitor resource usage:
$ docker stats
To run a container interactively add "-i --tty".
SongExplorer comes with a comprehensive set of tests to validate that everything works, both after you've first installed it and after any changes have been made to the code. The tests exercise both the Python GUI and the Linux bash interfaces. To run them, simply execute "runtests.sh":
$ singularity exec -B /tmp:/opt/songexplorer/test/scratch [--nv] <songexplorer.sif> \
/opt/songexplorer/test/runtests.sh
or with docker:
$ docker run -v %TMP%:/opt/songexplorer/test/scratch ^
[-v <other-disks>] [-u <userid>] [-w <working-directory>] ^
-e SONGEXPLORER_BIN bjarthur/songexplorer /opt/songexplorer/test/runtests.sh