Skip to content

chimney37/github-python-term-standardizer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

The purpose of the program is to scrape wikipedia using a set of keywords to get their canonical form. We can go ja => en or en => ja. Therefore the input can either be Japanese or English. If a search result exist on wikipedia (language is set to "ja"), we get the top ranked result, by fetching the page of the search result. Note: this software is dependent on wikipedia API for python.


Installation:

    1) install python >= 2.6
    2) get wikipedia API for python >= 1.4.0
        pip install wikipedia

Running the software:

    python main.py --batch <path-to-input-file> --out <path-to-output-file>

Input file format (TSV) Per row:

    <Input Keyword>
    Note: Multiple rows of keywords in Japanese or English

Output file format (TSV) Per row:

    <Input Keyword><tab><Output Keyword><tab>Summary
    Note: Summary refers to the page summary of the wiki article, specified by the output keyword (title of wiki page)

TODO:

    Try to use a more versatile wiki scraping API, to get the translation equivalent of ja => en and en => ja by traversing the language link of a page article rather than relying it on a single language setting and getting page based on its title.


Changelog:

    2016/5/1 - Initial version of program.

About

Input a list of keywords to scrape wikipedia to get their canonical form. We can get ja => en or en => ja, using "ja" lang setting of wiki to search.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages