
deepanshu1995/IRE_MajorProject

 
 


  1. Text Processing Framework for Indian Languages

Introduction

The goal of this project is to develop a text processing framework for nine Indian languages (Hindi, Tamil, Telugu, Bengali, Malayalam, Kannada, Marathi, Gujarati, Punjabi). The framework should include the basic text processing algorithms for Indian languages: stop word detection, tokenization, sentence breaking, POS tagging, key concept identification, entity recognition, and categorization.

Project Description

Indian languages are morphologically very rich. The challenge for anyone working with Indian languages is to identify the subtle variations in how a word is written. It is therefore important to identify the variations of a topic or entity and map them to a single canonical form.
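
As an illustration of mapping written variations to a single form, here is a minimal sketch that groups spellings by character-level similarity. It is one possible reading of the task, not the project's prescribed method: the function name `group_variants`, the greedy grouping strategy, and the 0.7 threshold are all assumptions made for this example.

```python
import unicodedata
from difflib import SequenceMatcher


def group_variants(words, threshold=0.7):
    """Greedily group words whose spellings are similar.

    Each word joins the first group whose representative (the group's
    first word) has a similarity ratio >= threshold; otherwise it
    starts a new group. The threshold is an assumed, untuned value.
    """
    groups = []
    for word in words:
        # NFC normalization makes equivalent code point sequences compare equal.
        word = unicodedata.normalize("NFC", word)
        for group in groups:
            ratio = SequenceMatcher(None, group[0], word).ratio()
            if ratio >= threshold:
                group.append(word)
                break
        else:
            groups.append([word])
    return groups


# Two common spellings of "Hindi" end up in one group; "Tamil" stays separate.
variants = group_variants(["हिंदी", "हिन्दी", "तमिल"])
```

The first word of each group can then serve as the canonical form. A real system would likely need language-specific rules on top of plain string similarity.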
The team can also use existing tools that already provide some or all of these modules, and work on improving them beyond the state of the art.
The modules in the framework should be:

  • Stop Word Detection
  • Tokenization
  • Sentence Breaker
  • Identify Variations (Highest Priority; should not be confused with stemming)
  • POS Tagging (Highest Priority): Tag continuous text, very similar to English POS tagging.
  • Concept/Keyword Identification: Use POS tagging or some other approach to identify key concepts.
  • Entity Recognition (Highest Priority): Identify people, locations, products, organizations, brands, money, health industry terminology (Zika virus, Pregnancy, Autism), etc.
  • Categorization (Highest Priority): Categorize an article into one of the following categories: Politics, Crime, Entertainment, Sports, Business, Technology, Science, Health, Foods, Travel, Auto and Fashion. Politics can be treated as the default category.
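
The first three modules in the list can be sketched with the standard library alone. The following is a rough baseline under stated assumptions, not the project's implementation: the punctuation set, the function names, and the frequency-based stop word heuristic are all choices made for this example. It avoids `\w` because Python's `re` does not treat combining vowel signs as word characters, which would split Indic words in the middle.

```python
import re
from collections import Counter

# Assumed punctuation set: danda, double danda, and common ASCII marks.
PUNCT = "।॥.,!?;:()\"'"
PUNCT_SET = set(PUNCT)
_P = re.escape(PUNCT)

# A token is a run of non-space, non-punctuation characters,
# or a single punctuation mark.
TOKEN_RE = re.compile(r"[^\s%s]+|[%s]" % (_P, _P))

# Split after a sentence-final mark followed by whitespace.
# Treating "." as final will mis-split abbreviations.
SENTENCE_END_RE = re.compile(r"(?<=[।॥.?!])\s+")


def split_sentences(text):
    """Break text into sentences at danda or ASCII end marks."""
    return [s.strip() for s in SENTENCE_END_RE.split(text) if s.strip()]


def tokenize(text):
    """Return word and punctuation tokens in order of appearance."""
    return TOKEN_RE.findall(text)


def candidate_stop_words(documents, top_k):
    """Return the top_k most frequent word tokens as stop word candidates."""
    counts = Counter(
        tok
        for doc in documents
        for tok in tokenize(doc)
        if tok not in PUNCT_SET
    )
    return [word for word, _ in counts.most_common(top_k)]


sample = "राम घर में है। सीता बाज़ार में है। क्या मोहन स्कूल में है?"
sentences = split_sentences(sample)
stop_candidates = candidate_stop_words([sample], top_k=2)
```

On a real news corpus the frequency list would need manual review, since frequent content words can also rank highly.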

Dataset and Evaluation

The team will be provided with a news corpus for Indian languages and can also make use of Wikipedia. Test data will be provided for categorization, entity recognition, and variations. Other modules can use standard evaluation procedures.
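
The brief does not name the evaluation metrics. A common choice, assumed here, is precision, recall, and F1 over predicted entities, and plain accuracy for categorization. The function names and the `(text, type)` entity representation below are illustrative only.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted entities against gold entities.

    Both arguments are iterables of hashable items, for example
    (entity_text, entity_type) pairs. Only exact matches count.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def accuracy(gold_labels, predicted_labels):
    """Fraction of articles assigned the correct category."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not gold_labels:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)


gold_entities = [("दिल्ली", "LOCATION"), ("गूगल", "ORGANIZATION"), ("सचिन", "PERSON")]
pred_entities = [("दिल्ली", "LOCATION"), ("गूगल", "PERSON")]
p, r, f1 = precision_recall_f1(gold_entities, pred_entities)

acc = accuracy(
    ["Sports", "Politics", "Health", "Crime"],
    ["Sports", "Politics", "Politics", "Crime"],
)
```

Exact matching is strict: an entity with the right text but the wrong type counts as both a false positive and a false negative.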

