We build a pipeline that does spelling normalization over Code-Switched text

sumeet-iitg/CS-TextNormalization

CS-TextNormalization

We build a pipeline to clean noisy code-switched text online.

Getting the repo

git clone --recursive https://github.com/sumeet-iitg/CS-TextNormalization.git

Note: don't omit the --recursive flag; it is needed to pull in the required submodules.

Components of the Normalization Pipeline

  • DataManagement: This folder contains the abstractions that make up the pipeline. When you add a new implementation of a tool for the pipeline, make sure it follows one of the abstractions in this folder. Feel free to add new abstractions to this folder. Some of the abstractions are:
    languageUtils.py: Classes for Language-Specific Identifiers, Lexicons and SpellCheckers.
    dataloader.py: Classes for loading a corpus, monolingual or multilingual.
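
To illustrate what "following an abstraction" could look like, here is a minimal sketch. It is not taken from languageUtils.py: the class names SpellChecker and LexiconSpellChecker, the correct() method, and the use of difflib for fuzzy matching are all assumptions made for illustration. The idea is only that a new tool subclasses a shared interface so the rest of the pipeline can call it without knowing the implementation.

```python
import difflib
from abc import ABC, abstractmethod


class SpellChecker(ABC):
    """Hypothetical abstraction (not the repo's actual class)."""

    @abstractmethod
    def correct(self, token: str) -> str:
        """Return a normalized spelling for a single token."""


class LexiconSpellChecker(SpellChecker):
    """Hypothetical implementation backed by a plain word list."""

    def __init__(self, lexicon):
        # Sort for a stable candidate order; lowercase for case-insensitive lookup.
        self.lexicon = sorted({word.lower() for word in lexicon})

    def correct(self, token: str) -> str:
        lowered = token.lower()
        if lowered in self.lexicon:
            return lowered
        # Fall back to the closest lexicon entry by difflib similarity ratio.
        matches = difflib.get_close_matches(lowered, self.lexicon, n=1, cutoff=0.8)
        # Leave the token unchanged when nothing is close enough.
        return matches[0] if matches else token


checker = LexiconSpellChecker(["hello", "world"])
print(checker.correct("helo"))
```

A language-specific checker for another language would be a second subclass with its own lexicon, which is one way a per-language design like the one described above might be organized.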

Requirements

Usage

You can use this pipeline end to end, or run its individual components separately. For example:

python main.py "source_tanglish.txt" "english,telugu"
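
The command above passes two positional arguments. The sketch below shows one way such arguments could be parsed, assuming the first is the input file and the second is a comma-separated list of language names. The function name parse_args and the argument names source and languages are hypothetical; the actual handling in main.py may differ.

```python
import argparse


def parse_args(argv):
    """Parse a source path and a comma-separated language list (assumed layout)."""
    parser = argparse.ArgumentParser(
        description="Sketch of a command line for a normalization pipeline."
    )
    parser.add_argument("source", help="path to the code-switched input text")
    parser.add_argument("languages", help='comma-separated names, e.g. "english,telugu"')
    args = parser.parse_args(argv)
    # Split on commas, trim whitespace, and drop empty entries.
    languages = [
        name.strip().lower() for name in args.languages.split(",") if name.strip()
    ]
    return args.source, languages


source, languages = parse_args(["source_tanglish.txt", "english,telugu"])
print(source, languages)
```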
