GitHub - deepanshu1995/IRE_MajorProject: IRE Project

Text Processing Framework for Indian Languages

Introduction
The goal of this project is to develop a text processing framework for nine Indian languages (Hindi, Tamil, Telugu, Bengali, Malayalam, Kannada, Marathi, Gujarati, Punjabi). The framework should include the basic text processing algorithms for Indian languages, such as stop word detection, tokenization, sentence breaking, POS tagging, key concept identification, entity recognition, and categorization.
Project Description
Indian languages are morphologically very rich. The challenge for anyone working with Indian languages is to identify the subtle variations in how a word is written. Hence, it is an important task to identify these variations of a topic or entity and map them to a single canonical form.
The team can also use existing tools that already provide all or some of these modules, and work on improving them beyond the state of the art.
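The variation problem described above can be illustrated with a small sketch. The source does not say which kinds of variation the project targets, so this example assumes one well-known case in Hindi: a nasal consonant plus virama written in place of an anusvara (for example हिन्दी and हिंदी). The names `variation_key` and `group_variants` are hypothetical and not taken from the repository.

```python
import re
import unicodedata
from collections import defaultdict

# Assumption: a nasal consonant + virama (U+094D) before another consonant
# is treated as a spelling variant of the anusvara (U+0902).
NASAL_CLUSTER = re.compile("[ङञणनम]\u094D(?=[\u0915-\u0939])")
ANUSVARA = "\u0902"


def variation_key(word):
    """Return a crude canonical key shared by spelling variants of a word."""
    word = unicodedata.normalize("NFC", word)
    return NASAL_CLUSTER.sub(ANUSVARA, word)


def group_variants(words):
    """Group words whose canonical keys coincide."""
    groups = defaultdict(list)
    for word in words:
        groups[variation_key(word)].append(word)
    return dict(groups)


if __name__ == "__main__":
    for key, variants in group_variants(["हिन्दी", "हिंदी", "पञ्जाब", "पंजाब"]).items():
        print(key, variants)
```

The key is only a grouping device, not a linguistically correct spelling, and it covers a single variation pattern. Unlike stemming, it leaves suffixes untouched.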
The modules in the framework should be:

Stop Word Detection
Tokenization
Sentence Breaker
Identify Variations (Highest Priority; should not be confused with stemming)
POS Tagging (Highest Priority): Tag continuous text, very similar to English POS tagging.
Concept/Keyword Identification: Use POS tagging or other approaches to identify key concepts.
Entity Recognition (Highest Priority): Identify people, locations, products, organizations, brands, money, health industry terminology (Zika virus, Pregnancy, Autism), etc.
Categorization (Highest Priority): Categorize an article into one of the following categories: Politics, Crime, Entertainment, Sports, Business, Technology, Science, Health, Foods, Travel, Auto, and Fashion. Politics can be treated as the default category.
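A minimal sketch of how the simpler modules might fit together for Hindi, assuming rule-based sentence breaking on the danda (।), a regex tokenizer, a tiny stop-word list, and keyword matching that falls back to Politics as the default category. The stop words, keyword lists, and function names are all hypothetical. The source does not say how the repository implements these modules, and the presence of an SVM folder suggests categorization may be done with a trained classifier instead.

```python
import re

# Hypothetical, deliberately tiny stop-word list; a real one would be far larger.
STOP_WORDS = {"है", "और", "का", "की", "के", "में", "से", "को", "ने"}

# Hypothetical keyword lists, used only to illustrate the default-category rule.
CATEGORY_KEYWORDS = {
    "Sports": {"क्रिकेट", "मैच"},
    "Health": {"अस्पताल", "डॉक्टर"},
}
DEFAULT_CATEGORY = "Politics"

# Devanagari letters, signs and digits, excluding danda (U+0964) and double
# danda (U+0965). Python's \w is avoided because it does not match Devanagari
# vowel signs and would split words in the middle.
TOKEN_RE = re.compile(r"[\u0900-\u0963\u0966-\u097F]+|[A-Za-z0-9]+")


def split_sentences(text):
    """Break text after a danda, double danda, '?' or '!'."""
    parts = re.split(r"(?<=[।॥?!])\s+", text.strip())
    return [part for part in parts if part]


def tokenize(sentence):
    """Return the word tokens of a sentence, dropping punctuation."""
    return TOKEN_RE.findall(sentence)


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def categorize(tokens):
    """Pick the category with the most keyword hits, else the default."""
    token_set = set(tokens)
    best, best_hits = DEFAULT_CATEGORY, 0
    for category, keywords in CATEGORY_KEYWORDS.items():
        hits = len(token_set & keywords)
        if hits > best_hits:
            best, best_hits = category, hits
    return best


if __name__ == "__main__":
    article = "भारत ने मैच जीता। टीम खुश है!"
    tokens = []
    for sentence in split_sentences(article):
        tokens.extend(remove_stop_words(tokenize(sentence)))
    print(tokens, categorize(tokens))
```

The explicit Unicode range keeps whole Devanagari words together. Other scripts would need their own ranges added to `TOKEN_RE`.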
Dataset and Evaluation
The team will be provided with a news corpus for Indian languages and can also make use of Wikipedia. Test data will be provided for categorization, entity recognition, and variations. The other modules can use standard evaluation procedures.
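The source does not name the evaluation metrics. As an assumption, the sketch below uses two common choices: precision, recall and F1 over predicted entities, and plain accuracy for category labels. The function names and the sample entity tuples are hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Compare two collections of (entity text, entity type) pairs."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def accuracy(gold_labels, predicted_labels):
    """Fraction of articles whose predicted category matches the gold label."""
    if len(gold_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not gold_labels:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_labels, predicted_labels))
    return correct / len(gold_labels)


if __name__ == "__main__":
    gold = [("Zika virus", "HEALTH"), ("Delhi", "LOCATION")]
    predicted = [("Delhi", "LOCATION"), ("Delhi", "PERSON")]
    print(precision_recall_f1(gold, predicted))
    print(accuracy(["Sports", "Politics"], ["Sports", "Health"]))
```

Here an entity counts as correct only when both its text and its type match exactly. Partial-match scoring would be a different design choice.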

Folders and files (55 commits)
Categoriser
NER
POSTagger
Parser
SVM
Stemmer
Tokeniser
Variation
IRE_Project_Scope_Doc.pdf
Project_Report.pdf
README.md
