Text Processing Framework for Indian Languages
Introduction
The goal of this project is to develop a text processing framework for nine Indian languages (Hindi, Tamil, Telugu,
Bengali, Malayalam, Kannada, Marathi, Gujarati, Punjabi). The framework that the team develops should include
the basic text processing algorithms for these languages: stop word detection, tokenization, sentence
breaking, POS tagging, key concept identification, entity recognition, and categorization.
Project Description
Indian languages are morphologically very rich. The challenge for anyone working with Indian languages is to identify
the resulting subtle variations in writing. Hence, an important task is to identify the variations of a
topic/entity and map them to a single correct form.
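
One possible reading of "map variations to a single form" is a lookup against a list of preferred spellings. The sketch below is only an illustration under that assumption: the lexicon, the function name canonical_form, and the 0.6 similarity cutoff are hypothetical choices, not part of the project specification. It combines Unicode NFC normalization with the string similarity from Python's standard difflib module.

```python
import difflib
import unicodedata


def canonical_form(token, canonical_lexicon, cutoff=0.6):
    """Return the closest entry in canonical_lexicon, or the token itself.

    The lexicon and the cutoff are illustrative assumptions.
    """
    # NFC folds canonically equivalent code point sequences into one form.
    normalized = unicodedata.normalize("NFC", token)
    entries = [unicodedata.normalize("NFC", e) for e in canonical_lexicon]
    if normalized in entries:
        return normalized
    # Fall back to the most similar entry, if any clears the cutoff.
    matches = difflib.get_close_matches(normalized, entries, n=1, cutoff=cutoff)
    return matches[0] if matches else normalized


# Hypothetical lexicon of preferred spellings.
LEXICON = ["हिन्दी", "तमिल"]
variant = "हिंदी"  # the first lexicon entry written with an anusvara
resolved = canonical_form(variant, LEXICON)
```

A purely character-based similarity like this will also merge unrelated words that happen to look alike, so a real module would likely need language-specific rules or context on top of it.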
The team can also use existing tools that already provide some or all of these modules, and work on them
to improve on the state of the art.
The modules in the framework should be:
Stop Word Detection
Tokenization
Sentence Breaker
Identify Variations (Highest Priority; should not be confused with stemming)
POS Tagging (Highest Priority)
    Tag continuous text, much as in English POS tagging
Concept/Keyword Identification
    Use POS tagging or other approaches to identify key concepts
Entity Recognition (Highest Priority)
    Identify people, locations, products, organizations, brands, money, health industry terminology
    (Zika virus, Pregnancy, Autism), etc.
Categorization (Highest Priority)
    Categorize an article into one of the following categories: Politics, Crime, Entertainment, Sports,
    Business, Technology, Science, Health, Foods, Travel, Auto and Fashion. Politics can be treated as
    the default category.
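
A few of the modules listed above can be sketched in miniature. The following is a rough baseline, not a proposed design: it splits sentences on the danda and on Latin end marks, tokenizes by separating punctuation from everything else, and categorizes by counting keyword hits, falling back to Politics when nothing matches. The function names and the tiny Hindi keyword lists are assumptions made up for the example.

```python
import re

# Danda, double danda, and Latin end marks are treated as sentence ends.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[।॥.?!])\s+")
# A token is either a run of non-space, non-punctuation characters
# or a single punctuation mark. Avoiding \w keeps vowel signs attached.
_TOKEN = re.compile(r"[^\s।॥.,;:?!()]+|[।॥.,;:?!()]")


def split_sentences(text):
    """Split text after sentence-final punctuation followed by whitespace."""
    return [s for s in _SENTENCE_BOUNDARY.split(text.strip()) if s]


def tokenize(sentence):
    """Return words and punctuation marks as separate tokens."""
    return _TOKEN.findall(sentence)


def categorize(tokens, keywords, default="Politics"):
    """Pick the category with the most keyword hits, else the default."""
    scores = {
        category: sum(1 for token in tokens if token in words)
        for category, words in keywords.items()
    }
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] == 0:
        return default
    return best


# Hypothetical keyword lists, for illustration only.
EXAMPLE_KEYWORDS = {
    "Sports": {"क्रिकेट", "मैच"},
    "Health": {"वायरस", "अस्पताल"},
}
```

For instance, split_sentences can be applied to an article, each sentence passed through tokenize, and the combined tokens handed to categorize. Splitting on every period will break on abbreviations and decimal numbers, and keyword counting is only a baseline; a trained classifier over the provided news corpus would be the more likely final approach.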
Dataset and Evaluation
The team will be provided with a news corpus for the Indian languages and can also make use of Wikipedia. Test data will
be provided for categorization, entity recognition, and variation identification. The other modules can be evaluated
with standard evaluation procedures.
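
The exact scoring procedure for the provided test data is not specified here. As one common convention, entity recognition output can be scored with precision, recall, and F1 over exact matches of (text, type) pairs. The sketch below assumes that convention; the function name and the sample entities are made up for illustration.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted items against gold items using exact matches."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical gold and predicted entities as (text, type) pairs.
gold_entities = {("Zika virus", "HEALTH"), ("Delhi", "LOCATION")}
predicted_entities = {("Zika virus", "HEALTH"), ("Delhi", "ORGANIZATION")}
scores = precision_recall_f1(gold_entities, predicted_entities)
```

Here one of the two predictions matches the gold set exactly, so precision, recall, and F1 all come out to 0.5. The same function can score categorization or variation mapping if each item is represented as an (input, label) pair.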