hinglishNorm - A Corpus of Hindi-English Code Mixed Sentences for Normalization

A Hindi-English Code-Mixed Dataset for Text Normalization

License

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/4.0/.

Dataset Description

We are releasing our dataset for normalization of Hindi-English code-mixed text in JSON format.

Each object in the released dataset has the fields shown in the following table:

Field	Description	Example
id	Unique identifier for each datapoint	30
inputText	Filtered & cleaned input text	whtas ur name
tags	Tags describing the transformations applied to inputText to obtain normalizedText	['Short Form', 'Short Form', 'Looks Good']
normalizedText	Manually annotated normalized inputText	what is your name
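
As a rough illustration of how the fields above might be read, here is a minimal Python sketch. It assumes the released JSON is an array of objects carrying the four documented fields; the inline sample, the `load_records` helper, and the `id` being an integer are assumptions made for this example, not details confirmed by the dataset files.

```python
import json

# Hypothetical sample mirroring the documented fields. The layout of the
# released file (assumed here: a JSON array of objects) may differ.
sample_json = """
[
  {
    "id": 30,
    "inputText": "whtas ur name",
    "tags": ["Short Form", "Short Form", "Looks Good"],
    "normalizedText": "what is your name"
  }
]
"""

REQUIRED_FIELDS = {"id", "inputText", "tags", "normalizedText"}


def load_records(text):
    """Parse JSON text and check that every record has the documented fields."""
    records = json.loads(text)
    for record in records:
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(
                f"record {record.get('id')} is missing fields: {sorted(missing)}"
            )
    return records


records = load_records(sample_json)
for record in records:
    # Show the noisy input next to its manually annotated normalization.
    print(record["id"], "|", record["inputText"], "->", record["normalizedText"])
```

To read the actual release, the same helper could be given the contents of the dataset file, for example `load_records(open(path, encoding="utf-8").read())`, where `path` points to wherever the JSON file was downloaded.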
