
saitamandd/japanese_corpus

 
 


3471571 sentence pairs so far

Number one TODO: first pass on everything is done. Now make everything more resilient and accurate.

| Source | Notes | Status |
| --- | --- | --- |
| d-addicts | ~600 dramas, ~5M sentence pairs; fansubs; Japanese & English subs in different parts of same page | crawled; parsed; matched; 620408 |
| OpenSubtitles | 1.4M sentence pairs; professional translations, 1-1 en/jp matching (in same file) | crawled; parsed; matched; 1381339 |
| kitsunekko | ~600 dramas/movies (largely incomplete), ~3M pairs; fansubs; en/jp lists on different pages | crawled; parsed; matched(?); 161792 |
| subscene | ~2000 movies/shows, ~5M raw sentence pairs, ~1.3M usable pairs; mix of fansubs & professional translations | crawled; parsed; matched(?); 810678 |
| TED talks | ~100k pairs | crawled; parsed; matched; 497294 |

Numbers to beat

I'd like to make this the biggest open-source Japanese-English corpus out there. Looks like ~1M sentence pairs is the number to beat: http://www.phontron.com/japanese-translation-data.php?lang=en

Roadblocks

  • accessing paired translations
    • soln: crawl sub sites, pull down en and jp subs for matched films/tv shows
  • Poor translations. E.g. I've seen subs that were generated by running another language's subs through Google Translate
    • soln: run a language model over each movie/show's corpus. If average sentence quality is below some threshold, throw it out
  • En/Jp subtitle mismatch. Sometimes the srt files don't have the same number of entries, and entries don't correspond to the same times.
    • soln: sentence alignment model. Run an encoder over the en/jp srt files and pair up nearby sentences with similar thought vectors
  • romaji transliterations
    • soln: throw out
  • broken character encodings (gobbledygook)
    • soln: convert to utf-8, throw out subs that can't be converted
  • stuff that doesn't belong (e.g. "TRANSLATED BY ______")
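
The three simplest roadblocks above (broken encodings, romaji, and stray credit lines) could be handled with a small cleaning pass along these lines. This is only a rough sketch, not the code in this repo: the function names, the list of candidate encodings, and the credit-line patterns are all assumptions made for illustration. Guessing an encoding by trial decoding can also misfire, for example on a file that happens to decode under the wrong codec.

```python
import re

# Assumed candidate encodings, tried in order. Japanese subtitle files are
# often UTF-8, Shift_JIS, or EUC-JP, but this list is a guess, not a spec.
CANDIDATE_ENCODINGS = ("utf-8", "shift_jis", "euc_jp")

# Hiragana and katakana (U+3040-U+30FF) plus CJK unified ideographs.
JAPANESE_CHAR = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")

# Hypothetical patterns for lines that are credits rather than dialogue.
CREDIT_LINE = re.compile(r"translated by|subtitles? by|timing by", re.IGNORECASE)


def decode_subtitle(raw, encodings=CANDIDATE_ENCODINGS):
    """Decode raw bytes with the first encoding that works.

    Returns None when no candidate encoding can decode the bytes, so the
    caller can throw the file out.
    """
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return None


def has_japanese(text):
    """True if the text contains at least one kana or kanji character."""
    return JAPANESE_CHAR.search(text) is not None


def clean_pairs(pairs):
    """Drop pairs that look like credits or romaji transliterations.

    `pairs` is an iterable of (english, japanese) strings. A Japanese side
    with no kana or kanji at all is treated as romaji and thrown out.
    """
    kept = []
    for english, japanese in pairs:
        if CREDIT_LINE.search(english) or CREDIT_LINE.search(japanese):
            continue
        if not has_japanese(japanese):
            continue
        kept.append((english, japanese))
    return kept


if __name__ == "__main__":
    sample = [
        ("Hello.", "こんにちは。"),
        ("Thank you.", "arigatou"),
        ("TRANSLATED BY someone", "翻訳"),
    ]
    for english, japanese in clean_pairs(sample):
        print(english, japanese)
```

The kana/kanji check is deliberately blunt: it also drops Japanese-side lines that are only punctuation or digits, which seems acceptable for a corpus where throwing out questionable pairs is cheap.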

About

large open-source EN-JP translation corpus
