Number one TODO: the first pass on everything is done. Now make everything more resilient and accurate.
| Source | Notes | Status |
| --- | --- | --- |
| d-addicts | ~600 dramas, ~5M sentence pairs; fansubs; Japanese & English subs in different parts of the same page | crawled, parsed, matched; 620,408 pairs |
| OpenSubtitles | ~1.4M sentence pairs; professional translations; 1-to-1 en/jp matching (in the same file) | crawled, parsed, matched; 1,381,339 pairs |
| kitsunekko | ~600 dramas/movies (largely incomplete), ~3M pairs; fansubs; en/jp lists on different pages | crawled, parsed, matched(?); 161,792 pairs |
| subscene | ~2,000 movies/shows, ~5M raw sentence pairs, ~1.3M usable pairs; mix of fansubs & professional translations | crawled, parsed, matched(?); 810,678 pairs |
| TED talks | ~100k pairs | crawled, parsed, matched; 497,294 pairs |
I'd like to make this the biggest open-source Japanese-English corpus out there. Looks like ~1M sentence pairs is the number to beat: http://www.phontron.com/japanese-translation-data.php?lang=en
- Accessing paired translations.
- soln: crawl sub sites and pull down en and jp subs for matched films/TV shows (fetch sketch below).
- Poor translations, e.g. I've seen subs that were generated by running another language's subs through Google Translate.
- soln: run a language model over each movie/show's corpus; if average sentence quality is below some threshold, throw it out (perplexity sketch below).
- En/jp subtitle mismatch. Sometimes the srt files don't have the same number of entries, and entries don't correspond to the same times.
- soln: sentence alignment model. Run an encoder over the en/jp srt files and pair up nearby sentences with similar thought vectors (alignment sketch below).
- Romaji transliterations.
- soln: detect and throw out (romaji check below).
- Broken character encodings (gobbledygook).
- soln: convert to UTF-8; throw out subs that can't be converted (decoding sketch below).
- Stuff that doesn't belong, e.g. "TRANSLATED BY ______" (credits filter below).
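A minimal fetch sketch for the crawl step. The URL and page structure here are hypothetical placeholders; each real site (d-addicts, kitsunekko, subscene, etc.) needs its own scraping logic.

```python
import requests
from bs4 import BeautifulSoup

SHOW_PAGE = "https://example-subs.example/show/123"  # hypothetical listing page

def fetch_sub_links(page_url):
    """Pull all .srt links off a show's page. The selector logic is
    site-specific; this just grabs every anchor ending in .srt."""
    html = requests.get(page_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)
            if a["href"].endswith(".srt")]

def download(url, path):
    with open(path, "wb") as f:
        f.write(requests.get(url, timeout=30).content)
```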
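A sketch of the quality filter, assuming a KenLM n-gram model trained on known-good English subtitle text (the notes don't name a specific LM). The model file and cutoff value are made up and would need tuning against hand-labeled good/bad subs.

```python
import kenlm

# Hypothetical n-gram model trained on clean English subtitles.
lm = kenlm.Model("en_subs.arpa")

PPL_CUTOFF = 500.0  # made-up threshold; tune on labeled examples

def avg_perplexity(sentences):
    return sum(lm.perplexity(s) for s in sentences) / max(len(sentences), 1)

def keep_show(en_sentences):
    """Drop a whole show's subs if the average sentence is implausible
    under the LM, which is what machine-translated junk tends to look like."""
    return avg_perplexity(en_sentences) < PPL_CUTOFF
```

Scoring at the show level rather than per sentence matches the plan above: one bad translator tends to poison an entire file, so the whole thing gets thrown out.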
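A sketch of the alignment step, assuming LaBSE via sentence-transformers as the encoder (one reasonable multilingual choice; the notes don't name a model) and greedy nearest-neighbor matching within a time window. The window and similarity cutoff are guesses.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/LaBSE")  # multilingual encoder

def align(en_subs, jp_subs, window_s=5.0, min_sim=0.6):
    """en_subs/jp_subs: lists of (start_seconds, text) parsed from srt files.
    Pair each jp line with the most-similar en line that starts nearby."""
    en_vecs = encoder.encode([t for _, t in en_subs], normalize_embeddings=True)
    jp_vecs = encoder.encode([t for _, t in jp_subs], normalize_embeddings=True)
    pairs = []
    for j, (jp_start, jp_text) in enumerate(jp_subs):
        # Candidates: en lines whose start time falls inside the window.
        cands = [i for i, (en_start, _) in enumerate(en_subs)
                 if abs(en_start - jp_start) <= window_s]
        if not cands:
            continue
        sims = en_vecs[cands] @ jp_vecs[j]  # cosine sim (vectors are normalized)
        best = cands[int(np.argmax(sims))]
        if sims.max() >= min_sim:
            pairs.append((en_subs[best][1], jp_text))
    return pairs
```

Constraining candidates by timestamp keeps the embedding comparison from pairing up similar lines from completely different scenes, and it also handles srt files with different entry counts, since unmatched lines simply drop out.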
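The romaji check can be a simple character-class test: a supposedly Japanese sub file with almost no kana or kanji is probably a transliteration. The ratio cutoff is a guess.

```python
import re

# Any hiragana, katakana, or CJK ideograph.
JA_CHARS = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")

def looks_like_romaji(jp_lines, min_ratio=0.2):
    """True if too few lines contain real Japanese script,
    in which case the file should be thrown out."""
    with_ja = sum(1 for line in jp_lines if JA_CHARS.search(line))
    return with_ja / max(len(jp_lines), 1) < min_ratio
```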
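A sketch of the UTF-8 conversion: try the encodings that actually show up in Japanese fansubs in order, and signal failure if none decode cleanly so the file gets thrown out.

```python
CANDIDATE_ENCODINGS = ["utf-8", "shift_jis", "cp932", "euc_jp", "utf-16"]

def to_utf8(raw: bytes):
    """Return decoded text, or None if no candidate encoding works
    (meaning the sub file is unconvertible and should be dropped)."""
    for enc in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return None
```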
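The notes don't spell out a soln for the credits problem, but one plausible fix is a regex filter over lines that look like translator credits, sync notes, or site ads rather than dialogue. The pattern list here is illustrative, not exhaustive.

```python
import re

# Lines that are credits/ads rather than dialogue.
BOILERPLATE = re.compile(
    r"(translated by|subtitles? by|sync(ed)? by|www\.|https?://)",
    re.IGNORECASE,
)

def is_dialogue(line: str) -> bool:
    return not BOILERPLATE.search(line)
```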