krzwolk/TED-Talks-Crawler

About

This tool builds comparable corpora by crawling subtitles from the talk pages at https://www.ted.com and aligning them at the article level.

You can specify as many languages as you need, and skip any you do not want by listing them in the ignore_list file.
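To illustrate what article-level alignment with an ignore list could look like, here is a minimal sketch. It is not the code from crawl_ted_com.py: the function name `align_by_talk`, the in-memory data layout, and the assumption that ignored languages are given as plain language codes are all hypothetical, chosen only to show the idea of pairing subtitles that belong to the same talk.

```python
from typing import Dict, Iterable, List, Tuple


def align_by_talk(
    subtitles: Dict[str, Dict[str, str]],
    ignore: Iterable[str] = (),
) -> List[Tuple[str, Dict[str, str]]]:
    """Pair up subtitle texts that belong to the same talk.

    `subtitles` maps a language code to {talk_id: subtitle text}.
    Languages listed in `ignore` are skipped (hypothetical stand-in
    for the contents of the ignore_list file).
    """
    skipped = set(ignore)
    languages = sorted(lang for lang in subtitles if lang not in skipped)
    if not languages:
        return []

    # Keep only talks that have subtitles in every remaining language.
    common = set(subtitles[languages[0]])
    for lang in languages[1:]:
        common &= set(subtitles[lang])

    # One entry per talk: the talk id plus its text in each language.
    return [
        (talk, {lang: subtitles[lang][talk] for lang in languages})
        for talk in sorted(common)
    ]


if __name__ == "__main__":
    sample = {
        "en": {"t1": "Hello", "t2": "Bye"},
        "pl": {"t1": "Cześć"},
        "de": {"t2": "Tschüss"},
    }
    for talk_id, texts in align_by_talk(sample, ignore=["de"]):
        print(talk_id, texts)
```

Requiring a talk to be present in every selected language keeps the sketch simple. A real crawler might instead keep partial matches and align language pairs independently.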

Usage

python crawl_ted_com.py -h

Final info

Feel free to use this tool, provided you cite:

• Wołk K., Marasek K., “Unsupervised comparable corpora preparation and exploration for bi-lingual translation equivalents”, Proceedings of the 12th International Workshop on Spoken Language Translation, Da Nang, Vietnam, December 3-4, 2015, pp. 118-125.

For more information, see: http://arxiv.org/pdf/1512.01641

For any questions, contact Krzysztof Wolk: krzysztof@wolk.pl
