Wiki Languages Pipeline

Wikipedia Languages Pipeline is a multistep pipeline script that will:

Collect the list of Wikipedia languages, based on stats from http://meta.wikimedia.org/wiki/List_of_Wikipedias#All_Wikipedias_ordered_by_number_of_articles
Collect Wikipedia article usage per language
Create a vocabulary for every language
Download the top articles for every language
Split every article into sentences and then into words
Automatically train a sentence splitter/tokenizer for every language (based on the top articles)
Build foreign-language-to-English dictionaries (based on single-word Wikipedia titles with language links)
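
The vocabulary and dictionary steps above can be sketched roughly as follows. This is a minimal illustration, not the code in main.py or langlib.py: the regex-based splitter is a naive stand-in for the per-language tokenizers that the pipeline trains, and the function names (`split_sentences`, `build_vocabulary`, `build_dictionary`) and the input shapes are hypothetical.

```python
import re
from collections import Counter

# Naive stand-ins for the trained sentence splitter/tokenizer.
# Assumption: the real pipeline trains these per language instead.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
WORD = re.compile(r"\w+")


def split_sentences(text):
    """Split text at whitespace that follows '.', '!' or '?'."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def build_vocabulary(articles):
    """Count lowercased word occurrences across a list of article texts."""
    counts = Counter()
    for article in articles:
        for sentence in split_sentences(article):
            counts.update(w.lower() for w in WORD.findall(sentence))
    return counts


def build_dictionary(title_links):
    """Keep (foreign, english) title pairs where both titles are single words."""
    return {
        foreign: english
        for foreign, english in title_links
        if len(foreign.split()) == 1 and len(english.split()) == 1
    }


# Hypothetical sample data, for illustration only.
vocabulary = build_vocabulary(["Der Hund schläft. Der Hund frisst!"])
dictionary = build_dictionary([("Hund", "Dog"), ("Vereinigte Staaten", "United States")])
```

Restricting the dictionary to single-word titles on both sides keeps the entries word-to-word, which is what the pipeline description asks for; multi-word titles such as "Vereinigte Staaten" are dropped.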

Requirements

Wikipydia library

wpTextExtractor library

NLTK library

Running pipeline

Run pipeline to generate vocabularies for all languages

Get help on command line options

python main.py --help

Run full production pipeline

python main.py --settings settings --debug INFO --tokenizer TRAIN

--tokenizer parameter explained:
TRAIN - train new tokenizers and save them
SKIP - do not train tokenizers; use existing ones (assuming that they exist for all languages)

--pipeline parameter explained:
PROCESS - generate the vocabulary/dictionary
SKIP - skip generating the vocabulary/dictionary

--override parameter explained:
YES - override existing data
NO - skip if the vocabulary/dictionary already exists
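
For orientation, the options described above could be declared with `argparse` along these lines. This is only a sketch under assumptions: the actual parser in main.py is not shown here, and the default values and help strings below are hypothetical. Only the option names and their allowed values come from this README.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options documented in this README."""
    parser = argparse.ArgumentParser(description="Wiki Languages Pipeline (illustrative sketch)")
    # Name of the settings module; the default is an assumption.
    parser.add_argument("--settings", default="settings", help="settings module name")
    # Logging level name, e.g. INFO; the default is an assumption.
    parser.add_argument("--debug", default="INFO", help="logging level")
    parser.add_argument("--tokenizer", choices=["TRAIN", "SKIP"], default="TRAIN",
                        help="TRAIN new tokenizers or SKIP and reuse existing ones")
    parser.add_argument("--pipeline", choices=["PROCESS", "SKIP"], default="PROCESS",
                        help="PROCESS or SKIP vocabulary/dictionary generation")
    parser.add_argument("--override", choices=["YES", "NO"], default="NO",
                        help="YES overrides existing data, NO skips existing output")
    return parser


# Parse the same arguments as the full production run shown earlier.
args = build_parser().parse_args(
    ["--settings", "settings", "--debug", "INFO", "--tokenizer", "TRAIN"]
)
```

Using `choices` makes the parser reject any value other than the documented ones, so a typo such as `--tokenizer TRIAN` fails immediately instead of silently changing behavior.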

callison-burch/wikilanguages-pipeline
