darius/greek_to_me
This Python module takes text input and guesses what natural language it's in:

>>> from greek_to_me import make_pundit
>>> p = make_pundit('models')   # The models/ dir in this distro
>>> p.best_guess('hello world')
'en'   # English
>>> p.best_guess('hola mundo')
'es'   # Spanish

You can also build new models, and you can ask the pundit for more detail when you want a measure of confidence or need to make subtler discriminations, e.g. to combine this textual evidence with an Accept-Language header.
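The pundit's confidence API isn't shown here (see the code), but one way per-language scores could be combined with an Accept-Language header is to treat the header's q-values as a prior over languages. This is an illustrative sketch, not this module's API: the function names, the 0.01 floor for unlisted languages, and the score values are all made up for the example.

```python
import math

def parse_accept_language(header):
    """Parse an Accept-Language header into {lang: q} weights.
    e.g. 'es, en;q=0.5' -> {'es': 1.0, 'en': 0.5}"""
    prefs = {}
    for part in header.split(','):
        fields = part.strip().split(';')
        lang = fields[0].split('-')[0].lower()   # 'en-US' -> 'en'
        q = 1.0                                  # q defaults to 1 per the HTTP spec
        for f in fields[1:]:
            if f.strip().startswith('q='):
                q = float(f.strip()[2:])
        prefs[lang] = max(prefs.get(lang, 0.0), q)
    return prefs

def combine(log_likelihoods, header):
    """Treat q-values as unnormalized priors: score each language by
    log-likelihood + log-prior, then pick the argmax."""
    prefs = parse_accept_language(header)
    return max(log_likelihoods,
               key=lambda lang: log_likelihoods[lang]
                   + math.log(prefs.get(lang, 0.01)))  # small floor for unlisted languages

# Made-up log-likelihoods for a short, ambiguous input: the text alone
# barely favors 'es', but the browser's stated preference tips it to 'en'.
scores = {'en': -9.3, 'es': -9.1}
combine(scores, 'en-US, en;q=0.9, es;q=0.4')   # -> 'en'
```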

See the code for documentation; smoketest.py shows some sample usage.

The judgments use a character n-gram model of each language. Supplied with this module in models/ are some bigram models built from the Europarl and Leipzig parallel corpora, mostly for European languages. (In code not supplied here, I first used http://pypi.python.org/pypi/guess-language to screen out text in other languages like Mandarin. So why not use guess-language for the whole job? Because it works poorly on very short inputs like search queries; our approach needs less evidence to reach a reasonable judgment.)
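As a rough illustration of the technique (not this module's actual code or API), a character-bigram classifier fits in a few lines: count bigrams per language, then score an input by its smoothed log-likelihood under each model. The toy training strings and the 28*28 vocabulary constant below are assumptions for the sketch; real models would be trained on corpora like Europarl.

```python
import math
from collections import Counter

def bigrams(text):
    """Character bigrams: 'cat' -> ['ca', 'at']."""
    return [text[i:i+2] for i in range(len(text) - 1)]

def train(text):
    """A 'model' is just bigram counts plus their total."""
    counts = Counter(bigrams(text.lower()))
    return counts, sum(counts.values())

def log_likelihood(model, text, vocab=28 * 28):
    """Add-one-smoothed log P(text | model); vocab is a rough guess
    at the number of possible bigrams (toy assumption)."""
    counts, total = model
    return sum(math.log((counts[bg] + 1) / (total + vocab))
               for bg in bigrams(text.lower()))

def best_guess(models, text):
    """Pick the language whose model makes the text most likely."""
    return max(models, key=lambda lang: log_likelihood(models[lang], text))

# Toy training data, one short sentence per language.
models = {
    'en': train("the quick brown fox jumps over the lazy dog"),
    'es': train("el rapido zorro marron salta sobre el perro perezoso"),
}
best_guess(models, 'el perro rapido')   # -> 'es'
```

Even with this little training data, short inputs whose bigrams clearly favor one model get classified correctly; smoothing keeps unseen bigrams from zeroing out a score.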

IIRC trigram models do noticeably better but take an order of magnitude more space; I didn't feel like checking 4MB into this repo.

See http://alias-i.com/lingpipe/demos/tutorial/langid/read-me.html for a similar but more sophisticated package in Java.
