ItsLastDay/Twitter-language-identification

Twitter-language-identification

Academic project: coursework from my third year of study.

Researched language identification (LID) algorithms, with a focus on short informal messages.
Gathered a dataset of 227k messages (most of them in Russian) from various sources (Twitter API, other works).
Implemented two approaches to the language identification task and modified one of them.
Compared the performance of 6 approaches on the gathered dataset.

As a result, the approach I modified outperforms the others. However, it is rather memory-consuming.

This repository contains all files that were gathered or produced during the research. My implementations of the LID algorithms are in /progs/logr and /progs/liga. There are also a number of programs in /scripts that helped with tweet processing.
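
For readers new to the task, here is a minimal sketch of what language identification on short messages can look like. It is not the code from /progs and does not reproduce either implemented approach. It is a generic character-trigram scorer with add-one smoothing, and the names `char_ngrams` and `NgramLanguageIdentifier` as well as the toy training samples are assumptions made up for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Split text into overlapping character n-grams (hypothetical helper)."""
    padded = f" {text.lower()} "  # pad so word boundaries form n-grams too
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLanguageIdentifier:
    """Toy classifier: per-language n-gram counts with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.vocab = set()                  # all n-grams seen in training

    def train(self, samples):
        for text, lang in samples:
            grams = char_ngrams(text, self.n)
            self.counts[lang].update(grams)
            self.vocab.update(grams)

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        vocab_size = len(self.vocab)
        best_lang, best_score = None, -math.inf
        for lang, counter in self.counts.items():
            total = sum(counter.values())
            # Sum of smoothed log-probabilities of each n-gram under this language.
            score = sum(
                math.log((counter[g] + 1) / (total + vocab_size)) for g in grams
            )
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang


# Illustrative toy data, not taken from the 227k-message dataset.
samples = [
    ("привет, как дела?", "ru"),
    ("сегодня хорошая погода", "ru"),
    ("hello, how are you?", "en"),
    ("the weather is nice today", "en"),
]

model = NgramLanguageIdentifier()
model.train(samples)
print(model.predict("как погода?"))
print(model.predict("how is the weather"))
```

Keeping one counter of n-grams per language is simple, but the tables grow with the amount of training text, which hints at why n-gram based methods can become memory-hungry on larger datasets.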