darius/greek_to_me
This Python module takes text input and guesses what natural language it's in:

>>> from greek_to_me import make_pundit
>>> p = make_pundit('models')   # The models/ dir in this distro
>>> p.best_guess('hello world')
'en'   # English
>>> p.best_guess('hola mundo')
'es'   # Spanish

You can also build new models, and you can ask the pundit for more detail when you want a measure of confidence or need to make subtler discriminations, e.g. to combine this textual evidence with an Accept-Language header.
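The pundit's confidence API isn't shown here (see the code), but one way per-language scores could be combined with an Accept-Language header is to treat the header's q-values as a prior over languages. This is an illustrative sketch, not this module's API: the function names, the 0.01 floor for unlisted languages, and the score values are all made up for the example.

```python
import math

def parse_accept_language(header):
    """Parse an Accept-Language header into {lang: q} weights.
    e.g. 'es, en;q=0.5' -> {'es': 1.0, 'en': 0.5}"""
    prefs = {}
    for part in header.split(','):
        fields = part.strip().split(';')
        lang = fields[0].split('-')[0].lower()   # 'en-US' -> 'en'
        q = 1.0                                  # q defaults to 1 per the HTTP spec
        for f in fields[1:]:
            if f.strip().startswith('q='):
                q = float(f.strip()[2:])
        prefs[lang] = max(prefs.get(lang, 0.0), q)
    return prefs

def combine(log_likelihoods, header):
    """Treat q-values as unnormalized priors: score each language by
    log-likelihood + log-prior, then pick the argmax."""
    prefs = parse_accept_language(header)
    return max(log_likelihoods,
               key=lambda lang: log_likelihoods[lang]
                   + math.log(prefs.get(lang, 0.01)))  # small floor for unlisted languages

# Made-up log-likelihoods for a short, ambiguous input: the text alone
# barely favors 'es', but the browser's stated preference tips it to 'en'.
scores = {'en': -9.3, 'es': -9.1}
combine(scores, 'en-US, en;q=0.9, es;q=0.4')   # -> 'en'
```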

See the code for documentation; smoketest.py shows some sample usage.

The judgments use a character n-gram model of each language. Supplied with this module in models/ are some bigram models built from the Europarl and Leipzig parallel corpora, mostly for European languages. (In code not supplied here, I first used http://pypi.python.org/pypi/guess-language to screen out text in other languages like Mandarin. So why not use guess-language for the whole job? Because it works poorly on very short inputs like search queries; our approach needs less evidence to reach a reasonable judgment.)
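As a rough illustration of the technique (not this module's actual code or API), a character-bigram classifier fits in a few lines: count bigrams per language, then score an input by its smoothed log-likelihood under each model. The toy training strings and the 28*28 vocabulary constant below are assumptions for the sketch; real models would be trained on corpora like Europarl.

```python
import math
from collections import Counter

def bigrams(text):
    """Character bigrams: 'cat' -> ['ca', 'at']."""
    return [text[i:i+2] for i in range(len(text) - 1)]

def train(text):
    """A 'model' is just bigram counts plus their total."""
    counts = Counter(bigrams(text.lower()))
    return counts, sum(counts.values())

def log_likelihood(model, text, vocab=28 * 28):
    """Add-one-smoothed log P(text | model); vocab is a rough guess
    at the number of possible bigrams (toy assumption)."""
    counts, total = model
    return sum(math.log((counts[bg] + 1) / (total + vocab))
               for bg in bigrams(text.lower()))

def best_guess(models, text):
    """Pick the language whose model makes the text most likely."""
    return max(models, key=lambda lang: log_likelihood(models[lang], text))

# Toy training data, one short sentence per language.
models = {
    'en': train("the quick brown fox jumps over the lazy dog"),
    'es': train("el rapido zorro marron salta sobre el perro perezoso"),
}
best_guess(models, 'el perro rapido')   # -> 'es'
```

Even with this little training data, short inputs whose bigrams clearly favor one model get classified correctly; smoothing keeps unseen bigrams from zeroing out a score.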

IIRC trigram models do noticeably better but take an order of magnitude more space; I didn't feel like checking 4MB into this repo.

See http://alias-i.com/lingpipe/demos/tutorial/langid/read-me.html for a similar but more sophisticated package in Java.
