Skip to content

ahkimn/Lang8-NAIST-extractor

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

25 Commits
 
 
 
 
 
 
 
 

Repository files navigation

These are the scripts for extracting error-correct pairs from the NAIST Lang-8 Learner Corpora These programs support only python 3.

How to use

For the Lang-8 Learner Corpora (raw format) in NAIST Lang-8 Learner Corpora

python extract_err-cor-pair.py -d lang-8-20111007-L1-v2.dat (-l1 [native_language]) (-l2 [learning_language; default: English]) (-tags)

Language list

['Korean', 'English', 'Japanese', 'Mandarin', 'Traditional Chinese', 'Vietnamese', 'German', 'French', 'Other language', 'Spanish', 'Indonesian', 'Russian', 'Arabic', 'Thai', 'Swedish', 'Dutch', 'Hebrew', 'Tagalog', 'Portuguese(Brazil)', 'Cantonese', 'Italian', 'Esperanto', 'Hawaiian', 'Afrikaans', 'Mongolian', 'Hindi', 'Polish', 'Finnish', 'Greek', 'Bihari', 'Farsi', 'Urdu', 'Turkish', 'Portuguese(Portugal)', 'Bulgarian', 'Norwegian', 'Romanian', 'Albanian', 'Ukrainian', 'Catalan', 'Latvian', 'Danish', 'Serbian', 'Slovak', 'Georgian', 'Hungarian', 'Malaysian', 'Icelandic', 'Latin', 'Laotian', 'Croatian', 'Lithuanian', 'Bengali', 'Tongan', 'Slovenian', 'Swahili', 'Irish', 'Czech', 'Estonian', 'Khmer', 'Javanese', 'Sinhalese', 'Sanskrit', 'Armenian', 'Tamil', 'Basque', 'Welsh', 'Bosnian', 'Macedonian', 'Telugu', 'Uzbek', 'Gaelic', 'Azerbaijanian', 'Tibetan', 'Panjabi', 'Marathi', 'Yiddish', 'Ainu', 'Haitian', 'Slavic']

For the Lang-8 Corpus of Learner English in NAIST Lang-8 Learner Corpora

python extract_err-cor_pair4en.py entries.train

Example of outputs

These outputs are pairs of learner sentence and correct sentence separated by tab character. If learner sentence and correct sentence are same, the programs output original sentence as correct sentence.

original sentence written by learners \t sentence corrected by native speakers
And he took in my favorite subject like soccer .        And he took in my favorite subjects like soccer .
It said that was disappointing .        It said that it was disappointing .

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 100.0%