This tool builds comparable corpora by crawling subtitles from the talks pages at https://www.ted.com and aligning them at the article (talk) level.
You may specify as many languages as you need, and skip any you do not want by listing them in the ignore_list file.
To see the available options, run:
python crawl_ted_com.py -h
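
The language skipping and article-level alignment described above can be pictured with a small sketch. It is not taken from crawl_ted_com.py: the function names, the one-language-code-per-line format assumed for ignore_list, and the in-memory data layout are all hypothetical and chosen only for illustration.

```python
def parse_ignore_list(text):
    """Return the language codes found in the text.

    Assumption: one language code per line; blank lines are skipped.
    The real ignore_list format may differ.
    """
    return {line.strip() for line in text.splitlines() if line.strip()}


def align_by_talk(subtitles, ignored):
    """Group subtitles by talk, keeping only talks available in every kept language.

    subtitles: {language_code: {talk_id: subtitle_text}}  (hypothetical layout)
    ignored:   set of language codes to skip
    Returns    {talk_id: {language_code: subtitle_text}}
    """
    languages = sorted(code for code in subtitles if code not in ignored)
    if not languages:
        return {}
    # A talk is kept only if every remaining language has subtitles for it.
    shared = set(subtitles[languages[0]])
    for code in languages[1:]:
        shared &= set(subtitles[code])
    return {
        talk: {code: subtitles[code][talk] for code in languages}
        for talk in sorted(shared)
    }


if __name__ == "__main__":
    # Made-up sample data, not real TED subtitles.
    sample = {
        "en": {"talk_1": "Hello.", "talk_2": "Thank you."},
        "pl": {"talk_1": "Cześć.", "talk_3": "Dziękuję."},
        "de": {"talk_2": "Danke."},
    }
    ignored = parse_ignore_list("de\n\n")
    print(align_by_talk(sample, ignored))
```

With "de" ignored, only talks that have both English and Polish subtitles remain in the result; the pairs are comparable at the talk level rather than aligned sentence by sentence.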
Final info
==========
Feel free to use this tool, provided that you cite:
Wołk K., Marasek K., “Unsupervised comparable corpora preparation and exploration for bi-lingual translation equivalents”, Proceedings of the 12th International Workshop on Spoken Language Translation, Da Nang, Vietnam, December 3-4, 2015, pp. 118-125.
For more information, see: http://arxiv.org/pdf/1512.01641
For any questions, contact Krzysztof Wolk: krzysztof@wolk.pl