LIEPA dataset stats and tools for cleanup.
The audio processing scripts require the soundfile, matplotlib and resampy packages.
To install them, run:
pip install -r requirements.txt
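To confirm that the packages listed above are importable after installation, something like the following can be used. This is a minimal sketch and not part of this repository: the helper name `missing_packages` is hypothetical, and it relies only on the standard library's `importlib`.

```python
import importlib.util

# Package names taken from this README; the helper itself is illustrative only.
REQUIRED = ("soundfile", "matplotlib", "resampy")


def missing_packages(names=REQUIRED):
    """Return the top-level package names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

An empty list means every named package was found; nothing is imported or executed by the check itself.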
One part of the dataset contains audio utterances and phoneme transcriptions for 4 speakers.
To download this part and unpack it locally, run the following command:
python get_liepa.py
To check data integrity, run the following command:
python clean_syn.py -a
To fix known issues in the dataset, run the following command:
python clean_syn.py -ax
Another part of the dataset contains audio utterances and transcriptions for over 300 speakers.
To download this part and unpack it locally, run the following command:
python get_liepa.py -rx
To download the additional annotations for the dataset and unpack them locally, run the following command:
python get_liepa.py -nx
To check data integrity, run the following command:
python clean_rec.py -a
It should report file and directory naming issues, audio file frame rate issues and transcription encoding issues.
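To illustrate what a frame rate check on a single file might look like, here is a minimal sketch using the standard library's `wave` module. It is not how the cleaning script itself works, which this README does not describe. The function name `wav_issues` is hypothetical, the expected frame rate must be supplied by the caller, and the default sample width of 2 bytes is an assumption based on the PCM_16 encoding mentioned in this README.

```python
import wave


def wav_issues(path, rate, sample_width=2):
    """Return a list of human-readable problems found in one WAV file.

    rate: expected frame rate in Hz (supplied by the caller).
    sample_width: expected bytes per sample; 2 corresponds to 16-bit PCM.
    """
    issues = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != rate:
            issues.append(f"frame rate {wav.getframerate()} != {rate}")
        if wav.getsampwidth() != sample_width:
            issues.append(f"sample width {wav.getsampwidth()} != {sample_width}")
    return issues
```

The `wave` module reads only uncompressed PCM WAV files, so a file in another format raises `wave.Error` rather than producing a report.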
To fix known issues in the dataset, run the following command:
python clean.py -u -x
This fixes the file structure. The next command fixes all other issues, including forcing WAV PCM_16 encoding:
python clean.py -a -x
To see the help text, you can also run:
python clean.py -h
Decode LIEPA dataset structure
To get a word count, run the following command:
python stats.py -w
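For readers who want a rough idea of what a word count over transcription files involves, here is a minimal sketch. It does not reproduce the behavior of stats.py, which this README does not describe. The function name `word_counts`, the `*.txt` file pattern, the UTF-8 encoding and the lowercase, whitespace-based tokenization are all assumptions made for illustration.

```python
from collections import Counter
from pathlib import Path


def word_counts(root, pattern="*.txt", encoding="utf-8"):
    """Count whitespace-separated, lowercased words in files under root.

    pattern and encoding are assumed defaults; adjust them to the actual
    layout and encoding of the unpacked dataset.
    """
    counts = Counter()
    for path in sorted(Path(root).rglob(pattern)):
        text = path.read_text(encoding=encoding)
        counts.update(text.lower().split())
    return counts
```

Calling `word_counts(some_directory).most_common(10)` would then list the ten most frequent words, where `some_directory` stands for wherever the transcriptions were unpacked.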