About

This tool builds comparable corpora by crawling subtitles from the https://www.ted.com talks pages and aligning them at the article level.
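Article-level alignment means pairing whole talks across languages rather than individual sentences. The following is a minimal sketch of that idea, not the code in `crawl_ted_com.py`. It assumes the subtitles are already downloaded into a mapping from talk id to per-language text, and that a talk missing any requested language is dropped. The `align_talks` helper, the data layout, and the sample data are all hypothetical.

```python
def align_talks(subtitles, languages):
    """Keep only talks that have subtitles in every requested language.

    subtitles: {talk_id: {language_code: subtitle_text}} (assumed layout)
    languages: list of language codes to align
    """
    aligned = {}
    for talk_id, by_lang in subtitles.items():
        if all(lang in by_lang for lang in languages):
            # One aligned "article" per talk: the same talk in each language.
            aligned[talk_id] = {lang: by_lang[lang] for lang in languages}
    return aligned


# Hypothetical sample data, not real TED content.
subtitles = {
    "talk_1": {"en": "Hello.", "pl": "Cześć."},
    "talk_2": {"en": "Only English."},
}
pairs = align_talks(subtitles, ["en", "pl"])
```

In this sketch `talk_2` is left out because it has no Polish subtitles. The actual tool may handle partial coverage differently.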

You may specify as many languages as you need, and skip some by listing them in the ignore_list file.
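The format of the ignore_list file is not documented here. As an illustration only, the sketch below assumes it holds one language code per line, with blank lines and `#` comments skipped. Both helper names are hypothetical and do not come from the tool.

```python
def load_ignore_list(text):
    """Parse ignore-list text into a set of language codes (assumed format)."""
    codes = set()
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if line and not line.startswith("#"):
            codes.add(line)
    return codes


def select_languages(requested, ignored):
    """Return the requested languages that are not ignored, preserving order."""
    return [lang for lang in requested if lang not in ignored]


# Hypothetical file contents and language request.
ignored = load_ignore_list("de\n\n# skip French\nfr\n")
selected = select_languages(["en", "pl", "de", "fr"], ignored)
```

With these inputs, only `en` and `pl` remain selected.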

Usage

python crawl_ted_com.py -h

Final info

Feel free to use this tool if you cite: Wołk K., Marasek K., “Unsupervised comparable corpora preparation and exploration for bi-lingual translation equivalents”, Proceedings of the 12th International Workshop on Spoken Language Translation, Da Nang, Vietnam, December 3-4, 2015, pp. 118-125.

For more information, see: http://arxiv.org/pdf/1512.01641

For any questions, contact Krzysztof Wolk: krzysztof@wolk.pl

