language_predictor

Simple project I made for fun to build language models and predict what language a string or text file belongs to. It creates a model for each reference language, caching it to disk for repeated use, and rebuilds the model if the language file is updated.

I use the NGram class from http://blog.ebookglue.com/write-language-detector-50-lines-python/, which I've modified slightly. Check it out for the theory as to why this predictor works.

I've included some language files with text I got from Project Gutenberg

Included languages: chinese, dutch, english, french, german, japanese, latin, ancient-greek, modern-greek, polish, romanian, russian, spanish

To create a new language reference text, just slap any old text into a file and load it in (as described below). The only thing it does before building the language model is reduce all whitespace down to single spaces. It also has unicode support.

One thing to note is that languages with a lot of diacritics (á, é, č, ć, đ, etc) might not perform well if you input a string without the marks, as it treats them as completely separate characters. For these languages make sure to construct the model file with text that includes the marks and text that doesn't have them, so the model has a variety to work with.

Using the `Predictor` Class

from Predictor import Predictor
language_files = [
    {"language": "english", "training_file": "english_language.txt"},
    {"language": "spanish", "training_file": "spanish_language.txt"},
    {"language": "german", "training_file": "german_language.txt"},
]

predictor = Predictor(language_files)
best_guess = predictor.predict("Ich bin ein Berliner")

print("Language is: {}".format(best_guess.language))

# prints: Language is: german

`predict_language.py`

Or you can use the script I included, which takes a string or text file from the command line and guesses it's language. It loads all files of the format <languagename>_language.txt in the current directory and builds models for them

usage: predict_language.py [-h] [-s STRING] [-f TEXT_FILE]

optional arguments:
  -h, --help    show this help message and exit
  -s STRING
  -f TEXT_FILE

Example:

$ predict_language.py -s "Fryderyk Franciszek Chopin polski kompozytor i pianista"
Loading model for ancient-greek - ancient-greek_language_ngrams.txt
Loading model for polish - polish_language_ngrams.txt
.
.
.
Loading model for spanish - spanish_language_ngrams.txt
Language is: polish

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
.gitignore		.gitignore
Predictor.py		Predictor.py
README.md		README.md
ancient-greek_language.txt		ancient-greek_language.txt
chinese_language.txt		chinese_language.txt
dutch_language.txt		dutch_language.txt
english_forum_post.txt		english_forum_post.txt
english_language.txt		english_language.txt
french_language.txt		french_language.txt
german_language.txt		german_language.txt
japanese_language.txt		japanese_language.txt
latin_language.txt		latin_language.txt
modern-greek_language.txt		modern-greek_language.txt
polish_language.txt		polish_language.txt
predict_language.py		predict_language.py
predict_server.py		predict_server.py
romanian_language.txt		romanian_language.txt
russian_language.txt		russian_language.txt
spanish_language.txt		spanish_language.txt

jpinsonault/language_predictor

Folders and files

Latest commit

History

Repository files navigation

language_predictor

Using the Predictor Class

predict_language.py

About

Resources

Stars

Watchers

Forks

Languages

Using the `Predictor` Class

`predict_language.py`