ailsamm/errorInjectionPipeline

Error Injection Pipeline

This pipeline takes an input txt file and injects artificially generated errors into its text. It relies on phonetic similarity to imitate the types of errors one may expect from automatic speech recognition (ASR) output. More specifically, this functions by adding

Each text to be processed must be placed on its own line in the file. Please note that the pipeline currently only works for English text. However, it can easily be extended to another language by (1) downloading the Spacy model for that language and (2) creating Annoy dictionaries by training on a large set of text data in the same language.

The pipeline first loads a set of adjective, verb and noun phonetic vectors created from this repository. These are used to find phonetically similar alternatives with which to replace words in the input. Annoy is used to create a phonetic vector space in which these phonetically similar words can be found quickly. Spacy is then used to pre-process the input text. The rest of the code randomly selects one or more words in each line of input text and swaps them for alternatives. The number of words swapped depends on the input noise level (1-5).
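The swap step described above can be illustrated with a minimal sketch. This is not the repository's code: the vectors are made-up toy values, a brute-force cosine search stands in for the Annoy index, whitespace splitting stands in for Spacy pre-processing, and the names TOY_VECTORS, nearest_alternative, inject_errors and n_swaps are hypothetical. How a noise level maps to a number of swaps is not specified here, so n_swaps is passed in directly.

```python
import math
import random

# Toy "phonetic" vectors. The values are invented for illustration only;
# the real pipeline loads pre-built adjective, verb and noun vectors.
TOY_VECTORS = {
    "night": [0.9, 0.1, 0.3],
    "knight": [0.9, 0.1, 0.3],
    "light": [0.8, 0.2, 0.3],
    "sea": [0.1, 0.9, 0.5],
    "see": [0.1, 0.9, 0.5],
    "bee": [0.2, 0.8, 0.5],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_alternative(word, vectors):
    """Return the most similar *other* word, or None if the word has no vector.

    Brute-force search; an approximate index such as Annoy would replace
    this loop when the vocabulary is large.
    """
    if word not in vectors:
        return None
    candidates = [w for w in vectors if w != word]
    return max(candidates, key=lambda w: cosine(vectors[word], vectors[w]))


def inject_errors(line, n_swaps, vectors, rng):
    """Swap up to n_swaps randomly chosen known words for their nearest neighbour."""
    tokens = line.split()
    swappable = [i for i, t in enumerate(tokens) if t.lower() in vectors]
    for i in rng.sample(swappable, min(n_swaps, len(swappable))):
        tokens[i] = nearest_alternative(tokens[i].lower(), vectors)
    return " ".join(tokens)


rng = random.Random(0)
print(inject_errors("the night was dark by the sea", 2, TOY_VECTORS, rng))
# prints: the knight was dark by the see
```

Lines containing no word with a known vector are returned unchanged, which mirrors the idea that only words with phonetic alternatives can be swapped.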

Installation

Python 3 is required to run this programme. The following installations are also required:

pip3 install -U spacy
pip3 install annoy
python3 -m spacy download en

Usage

Run with:

python3 main.py --file <name_of_input_file> --level <noise_level> --splitType <split_type>

Note that:

  • noise_level is an integer between 1 and 5, where 1 adds a low amount of noise and 5 adds a high amount. (Further details may be found in the paper.)
  • split_type should be one of test, train or dev. It is only used to name the output file.

The output file is written as <splitType>Articles_<level>.txt in a separate folder, errorInjectionOutput.
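As a rough illustration of how the three flags and the output file name fit together, here is a sketch using argparse. The flag names and the naming scheme come from the usage notes above; the functions build_parser and output_path, and the validation choices, are assumptions and may differ from what main.py actually does.

```python
import argparse
import os


def build_parser():
    """Hypothetical parser mirroring the flags documented above."""
    parser = argparse.ArgumentParser(
        description="Inject ASR-like errors into a text file."
    )
    parser.add_argument("--file", required=True,
                        help="input txt file, one text per line")
    parser.add_argument("--level", type=int, choices=range(1, 6), required=True,
                        help="noise level from 1 (low) to 5 (high)")
    parser.add_argument("--splitType", choices=["test", "train", "dev"],
                        required=True,
                        help="only used to name the output file")
    return parser


def output_path(split_type, level, out_dir="errorInjectionOutput"):
    """Build <splitType>Articles_<level>.txt inside the output folder."""
    return os.path.join(out_dir, f"{split_type}Articles_{level}.txt")


# Parse an explicit argument list so the example runs without a command line.
args = build_parser().parse_args(
    ["--file", "resources/test.txt", "--level", "2", "--splitType", "test"]
)
print(output_path(args.splitType, args.level))
```

Restricting --level and --splitType with choices makes an out-of-range value fail at parse time rather than partway through processing.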

Testing

The folder resources includes a file named test.txt, which consists of 10 sentences taken from the Gigaword corpus. The pipeline may be tested on this file by running the following command:

python3 main.py --file resources/test.txt --level 2 --splitType test

The output file will be written to ./errorInjectionOutput/testArticles_2.txt, where it can be compared with the input.
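One way to inspect the result is to list the tokens that differ between each input line and its noisy counterpart. The helper below is a hypothetical sketch, not part of the repository. It assumes the output keeps one line per input line and that whitespace tokens stay aligned, which may not hold if pre-processing changes the tokenisation. The example sentences are invented.

```python
def changed_tokens(original_line, noisy_line):
    """Return (position, original, noisy) for whitespace tokens that differ."""
    pairs = zip(original_line.split(), noisy_line.split())
    return [(i, a, b) for i, (a, b) in enumerate(pairs) if a != b]


def compare_files(original_path, noisy_path):
    """Print the changed tokens for each pair of corresponding lines."""
    with open(original_path, encoding="utf-8") as f_orig, \
            open(noisy_path, encoding="utf-8") as f_noisy:
        for line_no, (orig, noisy) in enumerate(zip(f_orig, f_noisy), start=1):
            print(line_no, changed_tokens(orig, noisy))


# Invented example pair, not taken from test.txt.
print(changed_tokens("the night was dark", "the knight was dark"))
# prints: [(1, 'night', 'knight')]
```

For the test run above, compare_files("resources/test.txt", "errorInjectionOutput/testArticles_2.txt") would print one list per sentence.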

Paper

The paper for which this pipeline was created can be found here. Further details on the pipeline may also be found there.
