A Hindi-English Dataset for Text Normalization

prashantkodali/hinglishNorm

hinglishNorm - A Corpus of Hindi-English Code Mixed Sentences for Normalization

A Hindi-English Code-Mixed Dataset for Text Normalization

License

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.

Dataset Description

We are releasing our dataset for normalization of Hindi-English code-mixed text in JSON format.

Each object in the released dataset has the following fields:

| Field | Description | Example |
| --- | --- | --- |
| id | Unique identifier for each datapoint | 30 |
| inputText | Filtered and cleaned input text | whtas ur name |
| tags | Tags describing the transformations that turn inputText into normalizedText | ['Short Form', 'Short Form', 'Looks Good'] |
| normalizedText | Manually annotated normalized form of inputText | what is your name |
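
As a minimal sketch of how a record with these fields might be read, the Python snippet below parses the example from the table above and pairs each input token with its tag. Several things here are assumptions rather than facts from this README: that the file is a JSON array of objects, that `tags` may be stored either as a JSON array or as a stringified list, and that there is one tag per whitespace-separated token of `inputText` (as the example suggests). The helper names `parse_tags` and `align_tokens` are hypothetical and are not part of the repository.

```python
import ast
import json

# Sample record copied from the field table. The top-level layout
# (a JSON array of objects) is an assumption, not confirmed by the README.
SAMPLE_JSON = """
[
  {
    "id": 30,
    "inputText": "whtas ur name",
    "tags": ["Short Form", "Short Form", "Looks Good"],
    "normalizedText": "what is your name"
  }
]
"""


def parse_tags(raw):
    """Return tags as a list, whether stored as a JSON array or a stringified list."""
    if isinstance(raw, str):
        # Assumption: a string value looks like "['Short Form', 'Looks Good']".
        return list(ast.literal_eval(raw))
    return list(raw)


def align_tokens(record):
    """Pair each whitespace-separated input token with its tag.

    Assumes one tag per input token; raises ValueError otherwise.
    """
    tokens = record["inputText"].split()
    tags = parse_tags(record["tags"])
    if len(tokens) != len(tags):
        raise ValueError(
            f"record {record['id']}: {len(tokens)} tokens but {len(tags)} tags"
        )
    return list(zip(tokens, tags))


records = json.loads(SAMPLE_JSON)
pairs = align_tokens(records[0])
for token, tag in pairs:
    print(f"{token}\t{tag}")
```

To use this with the released data, replace `SAMPLE_JSON` with the contents of the dataset file. If the one-tag-per-token assumption does not hold for some records, `align_tokens` raises a `ValueError` instead of silently misaligning tokens and tags.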
