Anuvaad Corpus Tools

Overview

This repository houses the crawler code for building the Anuvaad parallel corpus. The ultimate goal is to build quality parallel datasets across various domains (General, Judicial, Educational, Financial, Press, etc.) and various Indian languages.

The current set of crawlers is built to scrape, tokenize, and align multilingual reports/documents available at the following sources:

  1. Press Information Bureau (http://pib.gov.in)
  2. Press Information Bureau Archives (http://pibarchive.nic.in)
  3. Wikipedia (https://www.wikipedia.org)
  4. Prothomalo (https://www.prothomalo.com)
  5. Newsonair (http://newsonair.com)
  6. Indianexpress (https://indianexpress.com)
  7. DW (https://dw.com)
  8. Goodreturns (https://www.goodreturns.in/)
  9. Jagran-Josh (https://www.jagran.com/)
  10. Tribune (https://tribuneindia.com)
  11. Times of India (https://timesofindia.indiatimes.com/)
  12. Zee News (https://zeenews.india.com/)
  13. Pranabmukherjee (http://pranabmukherjee.nic.in/)
  14. Eparliament (http://eparlib.nic.in/)
  15. Ebalbook (https://cart.ebalbharati.in/BalBooks/ebook.aspx)
  16. National Institute of Open Schooling (https://nios.ac.in/)
  17. tntextbooks (https://www.tntextbooks.in/p/school-books.html)
  18. keralatextbooks (https://samagra.kite.kerala.gov.in/#/textbook/page)

Processing Steps

The broader steps involved in all the tools can be generalized as follows:

1. Scraping

Fetch the required web page and download the contents in the respective languages.
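
A minimal sketch of this step, assuming a requests + BeautifulSoup style crawler; the URL, CSS selector, and file names below are hypothetical placeholders, since each source in this repository needs its own selectors:

```python
import requests
from bs4 import BeautifulSoup

def scrape_article(url, out_path):
    """Fetch a page and save its visible paragraph text to a file."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # "div.content-area" is a hypothetical selector; each source
    # needs its own selector for the article body.
    body = soup.select_one("div.content-area") or soup
    paragraphs = [p.get_text(strip=True) for p in body.find_all("p")]
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(paragraphs))

# Hypothetical usage:
# scrape_article("http://pib.gov.in/some-release", "pib_en.txt")
```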

2. Tokenizing

The process of splitting the scraped document into individual sentences using the tokenizer.
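
A minimal sketch of sentence splitting, assuming the open-source indic-nlp-library tokenizer; the tokenizer actually used by these tools may differ:

```python
from indicnlp.tokenize.sentence_tokenize import sentence_split

def tokenize_document(path, lang):
    """Split a scraped document into a list of sentences."""
    with open(path, encoding="utf-8") as f:
        text = f.read().replace("\n", " ")
    # sentence_split understands Indic sentence delimiters such as
    # the danda (।) in addition to ".", "?" and "!".
    return sentence_split(text, lang=lang)

# Hypothetical usage, one sentence per line for Hindi input:
# for sent in tokenize_document("pib_hi.txt", "hi"):
#     print(sent)
```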

3. Sentence Aligning

The process of pairing sentences that have the same meaning across different languages.
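
A minimal sketch of one common approach, greedy matching over multilingual sentence embeddings (LaBSE via sentence-transformers); the aligners in this repository may use a different model or matching strategy, and the 0.8 threshold is an assumption:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/LaBSE")

def align_sentences(src_sents, tgt_sents, threshold=0.8):
    """Pair each source sentence with its most similar target sentence."""
    src_emb = model.encode(src_sents, convert_to_tensor=True)
    tgt_emb = model.encode(tgt_sents, convert_to_tensor=True)
    sim = util.cos_sim(src_emb, tgt_emb)  # |src| x |tgt| similarity matrix
    pairs = []
    for i, row in enumerate(sim):
        j = int(row.argmax())
        if float(row[j]) >= threshold:  # drop low-confidence pairs
            pairs.append((src_sents[i], tgt_sents[j], float(row[j])))
    return pairs
```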

4. Data Validation Pipeline

This involves both model-based validation and generating a representative sample for manual review.
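
A minimal sketch of such a pipeline, filtering aligned pairs on the aligner's similarity score and drawing a random sample for manual review; the threshold, sample size, and output file name are illustrative assumptions:

```python
import csv
import random

def validate_and_sample(pairs, threshold=0.85, sample_size=100):
    """Keep high-scoring pairs and write a random subset for human review.

    `pairs` is a list of (src, tgt, score) tuples, e.g. from the
    aligner sketched above.
    """
    accepted = [p for p in pairs if p[2] >= threshold]
    review = random.sample(accepted, min(sample_size, len(accepted)))
    with open("manual_review.csv", "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["source", "target", "score"])
        writer.writerows(review)
    return accepted
```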

Parallel Corpus

The parallel corpora for the above datasets are available under: anuvaad-parallel-corpus