callison-burch/wikilanguages-pipeline

 
 

Wiki Languages Pipeline

Wikipedia Languages Pipeline is a multistep pipeline script that will:

  • Collect the list of Wikipedia languages, based on the statistics at http://meta.wikimedia.org/wiki/List_of_Wikipedias#All_Wikipedias_ordered_by_number_of_articles

  • Collect Wikipedia article usage statistics per language

  • Create a vocabulary for every language

  • Download the top articles for every language

  • Split every article into sentences, and then into words

  • Automatically train a sentence splitter/tokenizer for every language (based on the top articles)

  • Build foreign language to English dictionaries (based on single-word Wikipedia titles with language links)
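
To make the splitting, vocabulary, and dictionary steps concrete, here is a minimal sketch using only the Python standard library. It is not the pipeline's own code. The regex-based splitters are naive stand-ins for the trained per-language tokenizers, and every function name (`split_sentences`, `split_words`, `build_vocabulary`, `single_word_dictionary`) is hypothetical.

```python
import re
from collections import Counter

# Assumption: naive stand-ins for the trained per-language tokenizers.
# A sentence ends at '.', '!' or '?' followed by whitespace; a word is a
# run of Unicode word characters.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
_WORD = re.compile(r"\w+")


def split_sentences(text):
    """Split an article's text into sentences."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def split_words(sentence):
    """Split one sentence into lowercased words."""
    return [w.lower() for w in _WORD.findall(sentence)]


def build_vocabulary(articles):
    """Count word frequencies over a list of article texts."""
    vocab = Counter()
    for text in articles:
        for sentence in split_sentences(text):
            vocab.update(split_words(sentence))
    return vocab


def single_word_dictionary(title_links):
    """Keep (foreign title, English title) pairs where both are one word."""
    return {
        foreign: english
        for foreign, english in title_links
        if len(foreign.split()) == 1 and len(english.split()) == 1
    }
```

For example, `build_vocabulary(["Der Hund schläft. Der Hund bellt!"])` counts `der` and `hund` twice each, and `single_word_dictionary` drops a pair such as `("Vereinigte Staaten", "United States")` because the titles are not single words.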

Requirements

Wikipydia library

wpTextExtractor library

NLTK library

Running pipeline

Run pipeline to generate vocabularies for all languages

Get help on command line options

python main.py --help

Run full production pipeline

python main.py --settings settings --debug INFO --tokenizer TRAIN

--tokenizer parameter explained:

  • TRAIN - train new tokenizers and save them

  • SKIP - do not train tokenizers; use the existing ones (assuming they exist for all languages)

--pipeline parameter explained:

  • PROCESS - generate the vocabulary/dictionary

  • SKIP - skip generating the vocabulary/dictionary

--override parameter explained:

  • YES - override existing data

  • NO - skip if the vocabulary/dictionary already exists
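
As an illustration of how these options fit together, the sketch below declares them with the standard `argparse` module. It is an assumption about the interface, not the actual parsing code in main.py. The flag names and values come from this README, but the default values, the help strings, and the `build_parser` function name are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options documented in this README.

    Assumption: all defaults below are illustrative; main.py may differ.
    """
    parser = argparse.ArgumentParser(
        description="Wikipedia Languages Pipeline (illustrative sketch)"
    )
    parser.add_argument("--settings", default="settings",
                        help="name of the settings module")
    parser.add_argument("--debug", default="INFO",
                        help="logging level, e.g. INFO")
    parser.add_argument("--tokenizer", choices=["TRAIN", "SKIP"], default="SKIP",
                        help="TRAIN new tokenizers or SKIP and reuse existing ones")
    parser.add_argument("--pipeline", choices=["PROCESS", "SKIP"], default="PROCESS",
                        help="PROCESS or SKIP vocabulary/dictionary generation")
    parser.add_argument("--override", choices=["YES", "NO"], default="NO",
                        help="YES to override existing data, NO to keep it")
    return parser
```

Calling `build_parser().parse_args(["--settings", "settings", "--debug", "INFO", "--tokenizer", "TRAIN"])` reproduces the full production invocation shown above. With `choices` set, `argparse` rejects any value outside the documented ones.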
