Number one TODO: the first pass on everything is done. Now make everything more resilient and accurate.
| Source | Notes | Status |
| --- | --- | --- |
| d-addicts | ~600 dramas, ~5M sentence pairs; fansubs; Japanese & English subs in different parts of the same page | crawled, parsed, matched; 620,408 pairs |
| OpenSubtitles | ~1.4M sentence pairs; professional translations; 1-to-1 en/jp matching (in the same file) | crawled, parsed, matched; 1,381,339 pairs |
| kitsunekko | ~600 dramas/movies (largely incomplete), ~3M pairs; fansubs; en/jp lists on different pages | crawled, parsed, matched(?); 161,792 pairs |
| subscene | ~2,000 movies/shows, ~5M raw sentence pairs, ~1.3M usable pairs; mix of fansubs & professional translations | crawled, parsed, matched(?); 810,678 pairs |
| TED talks | ~100k pairs | crawled, parsed, matched; 497,294 pairs |
I'd like to make this the biggest open-source Japanese-English corpus out there. Looks like ~1M sentence pairs is the number to beat: http://www.phontron.com/japanese-translation-data.php?lang=en
- Accessing paired translations.
- soln: crawl sub sites and pull down en and jp subs for matched films/TV shows (fetch sketch below).
- Poor translations, e.g. I've seen subs that were generated by running another language's subs through Google Translate.
- soln: run a language model over each movie/show's corpus; if average sentence quality is below some threshold, throw it out (perplexity sketch below).
- En/jp subtitle mismatch. Sometimes the srt files don't have the same number of entries, and entries don't correspond to the same times.
- soln: sentence alignment model. Run an encoder over the en/jp srt files and pair up nearby sentences with similar thought vectors (alignment sketch below).
- Romaji transliterations.
- soln: detect and throw out (romaji check below).
- Broken character encodings (gobbledygook).
- soln: convert to UTF-8; throw out subs that can't be converted (decoding sketch below).
- Stuff that doesn't belong, e.g. "TRANSLATED BY ______" (credits filter below).
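A minimal fetch sketch for the crawl step. The URL and page structure here are hypothetical placeholders; each real site (d-addicts, kitsunekko, subscene, etc.) needs its own scraping logic.

```python
import requests
from bs4 import BeautifulSoup

SHOW_PAGE = "https://example-subs.example/show/123"  # hypothetical listing page

def fetch_sub_links(page_url):
    """Pull all .srt links off a show's page. The selector logic is
    site-specific; this just grabs every anchor ending in .srt."""
    html = requests.get(page_url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)
            if a["href"].endswith(".srt")]

def download(url, path):
    with open(path, "wb") as f:
        f.write(requests.get(url, timeout=30).content)
```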
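A sketch of the quality filter, assuming a KenLM n-gram model trained on known-good English subtitle text (the notes don't name a specific LM). The model file and cutoff value are made up and would need tuning against hand-labeled good/bad subs.

```python
import kenlm

# Hypothetical n-gram model trained on clean English subtitles.
lm = kenlm.Model("en_subs.arpa")

PPL_CUTOFF = 500.0  # made-up threshold; tune on labeled examples

def avg_perplexity(sentences):
    return sum(lm.perplexity(s) for s in sentences) / max(len(sentences), 1)

def keep_show(en_sentences):
    """Drop a whole show's subs if the average sentence is implausible
    under the LM, which is what machine-translated junk tends to look like."""
    return avg_perplexity(en_sentences) < PPL_CUTOFF
```

Scoring at the show level rather than per sentence matches the plan above: one bad translator tends to poison an entire file, so the whole thing gets thrown out.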
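A sketch of the alignment step, assuming LaBSE via sentence-transformers as the encoder (one reasonable multilingual choice; the notes don't name a model) and greedy nearest-neighbor matching within a time window. The window and similarity cutoff are guesses.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/LaBSE")  # multilingual encoder

def align(en_subs, jp_subs, window_s=5.0, min_sim=0.6):
    """en_subs/jp_subs: lists of (start_seconds, text) parsed from srt files.
    Pair each jp line with the most-similar en line that starts nearby."""
    en_vecs = encoder.encode([t for _, t in en_subs], normalize_embeddings=True)
    jp_vecs = encoder.encode([t for _, t in jp_subs], normalize_embeddings=True)
    pairs = []
    for j, (jp_start, jp_text) in enumerate(jp_subs):
        # Candidates: en lines whose start time falls inside the window.
        cands = [i for i, (en_start, _) in enumerate(en_subs)
                 if abs(en_start - jp_start) <= window_s]
        if not cands:
            continue
        sims = en_vecs[cands] @ jp_vecs[j]  # cosine sim (vectors are normalized)
        best = cands[int(np.argmax(sims))]
        if sims.max() >= min_sim:
            pairs.append((en_subs[best][1], jp_text))
    return pairs
```

Constraining candidates by timestamp keeps the embedding comparison from pairing up similar lines from completely different scenes, and it also handles srt files with different entry counts, since unmatched lines simply drop out.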
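The romaji check can be a simple character-class test: a supposedly Japanese sub file with almost no kana or kanji is probably a transliteration. The ratio cutoff is a guess.

```python
import re

# Any hiragana, katakana, or CJK ideograph.
JA_CHARS = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")

def looks_like_romaji(jp_lines, min_ratio=0.2):
    """True if too few lines contain real Japanese script,
    in which case the file should be thrown out."""
    with_ja = sum(1 for line in jp_lines if JA_CHARS.search(line))
    return with_ja / max(len(jp_lines), 1) < min_ratio
```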
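A sketch of the UTF-8 conversion: try the encodings that actually show up in Japanese fansubs in order, and signal failure if none decode cleanly so the file gets thrown out.

```python
CANDIDATE_ENCODINGS = ["utf-8", "shift_jis", "cp932", "euc_jp", "utf-16"]

def to_utf8(raw: bytes):
    """Return decoded text, or None if no candidate encoding works
    (meaning the sub file is unconvertible and should be dropped)."""
    for enc in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return None
```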
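The notes don't spell out a soln for the credits problem, but one plausible fix is a regex filter over lines that look like translator credits, sync notes, or site ads rather than dialogue. The pattern list here is illustrative, not exhaustive.

```python
import re

# Lines that are credits/ads rather than dialogue.
BOILERPLATE = re.compile(
    r"(translated by|subtitles? by|sync(ed)? by|www\.|https?://)",
    re.IGNORECASE,
)

def is_dialogue(line: str) -> bool:
    return not BOILERPLATE.search(line)
```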