Refer to http://www.google-melange.com/gci/task/view/google/gci2013/5591760998760448
$ sudo pip install urlnorm BeautifulSoup4

Depending on your language, you may need to install more dependencies. Here is the list of language-specific dependencies:
- zh: $ sudo pip install mafan PyICU

$ python main.py

That's... all you have to do. All configuration is done in config.py.
The crawler goes through 5 stages:

1. The crawler looks for sub-categories in a starting category (config.start_cat) and records them in data/site/start_cat/subcats.txt. Any blacklisted sub-categories will not be included. The crawler then looks for sub-categories in those sub-categories and adds them to the list of sub-categories.
2. The crawler crawls the sub-categories for pages and records them in data/site/start_cat/subcat/pages.txt. Any pages blacklisted or filtered will not be included. More information can be found at Filters. These pages are then added to a list of pages which will be crawled at the next stage.
3. The crawler goes through every page in the list and downloads it into data/pages. NOTE: This stage will take a very long time to complete as the crawler has to abide by crawl delays.
4. The parser goes through every page in the list and parses it based on the language. More information can be found at Parsers.
5. All spellings are written to data/spelings.txt and the program terminates with statistics.
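The staged flow above can be sketched in miniature. This is an illustrative sketch only, using an in-memory fake wiki; the real crawler talks to the Wiktionary API and writes its results to the files listed above:

```python
# Minimal self-contained sketch of stages 1-2 using an in-memory
# fake wiki; the data and function names are illustrative only.
FAKE_WIKI = {
    "Category:Mandarin language": ["Category:Mandarin nouns"],
    "Category:Mandarin nouns": [],
}
FAKE_PAGES = {"Category:Mandarin nouns": ["word1", "word2"]}

def find_subcats(cat, seen=None):
    # Stage 1: recursively collect sub-categories of the start category.
    seen = seen if seen is not None else []
    for sub in FAKE_WIKI.get(cat, []):
        if sub not in seen:
            seen.append(sub)
            find_subcats(sub, seen)
    return seen

def collect_pages(subcats):
    # Stage 2: gather pages from every sub-category found.
    pages = []
    for subcat in subcats:
        pages.extend(FAKE_PAGES.get(subcat, []))
    return pages

subcats = collect_pages.__globals__ and find_subcats("Category:Mandarin language")
pages = collect_pages(subcats)
```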
At stage 2, pages can be filtered out by plugins based on language. Here is the list of filters:
- zh.py - Chinese (simplified and traditional). Pages are filtered out based on whether the word is simplified, traditional or both. This can be set in config.py.

Filters are stored in filters/. Every filter has a test suite which goes by the filename filter_test.py. This test suite can be run to check if the filter has any errors.
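A zh filter's decision might look something like the sketch below. This is a hypothetical illustration, not the project's actual filter: the tiny character sets exist only for the example (the real filter relies on proper script detection via its dependencies), and the function name is invented.

```python
# Hypothetical sketch of a simplified/traditional filter decision.
# These character sets are illustrative stand-ins, not real data.
SIMPLIFIED_ONLY = set("国语")   # characters found only in simplified text
TRADITIONAL_ONLY = set("國語")  # characters found only in traditional text

def keep_word(word, zh_s=True, zh_t=False):
    is_trad = any(c in TRADITIONAL_ONLY for c in word)
    is_simp = any(c in SIMPLIFIED_ONLY for c in word)
    if is_simp and not is_trad:
        return zh_s  # simplified-only word: keep if crawling simplified
    if is_trad and not is_simp:
        return zh_t  # traditional-only word: keep if crawling traditional
    # Words that are both (or undetermined) are always kept.
    return True
```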
At stage 4, pages are parsed based on language. Here is the list of parsers:
- zh.py - Chinese (simplified and traditional). Parses pages into the following format:
word ; POS tag ; pinyin ; gloss (meaning)

The parser is automatically selected based on the language set in config.py. More information can be found at General Config.

Parsers are stored in parsers/. Every parser has a test suite which goes by the filename parser_test.py. This test suite can be run to check if the parser has any errors.
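The documented output format can be illustrated with a small sketch. The function name and the sample field values below are hypothetical; only the `word ; POS tag ; pinyin ; gloss` layout comes from the description above:

```python
# Hypothetical helper showing the documented output line format:
#   word ; POS tag ; pinyin ; gloss (meaning)
def format_entry(word, pos, pinyin, gloss):
    return " ; ".join([word, pos, pinyin, gloss])

line = format_entry("你好", "interjection", "nǐ hǎo", "hello")
```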
The default config.py looks like this:

# coding=utf8

start_cat = "Category:Mandarin language"

api_crawl_delay = 1 # in seconds
page_crawl_delay = 0.4 # in seconds

lang = "zh"
wiki_lang = "en"

# blacklists
subcats_bl = []
pages_bl = [
    "Appendix:.*"
]

## lang-specific config vals ##
...
## lang-specific config vals ##

## DO NOT MODIFY ##
...
## DO NOT MODIFY ##
The top comment, # coding=utf8, is required to set the encoding of the file so that Unicode characters can be included in comments, if you ever need to.

start_cat is the category where the crawler begins crawling for sub-categories. The default value is Category:Mandarin_language. Adapt this to the language which you wish to crawl. Remember to modify lang as well.
api_crawl_delay is the time the crawler waits before sending a request to the API, while page_crawl_delay is the time the crawler waits before crawling the next page. Both delays are in seconds. The default delays are fine and you should not set a value lower than the defaults: api_crawl_delay defaults to 1 second as it throttles API requests, whereas page_crawl_delay defaults to 0.4 seconds as pages are cached.
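One common way to honour such delays is to sleep before each request. The sketch below is illustrative only; the wrapper and the stand-in fetch function are not the crawler's actual code:

```python
import time

# Illustrative sketch of honouring a crawl delay: wrap a fetch
# function so that every call sleeps `delay` seconds first.
page_crawl_delay = 0.4  # in seconds, matching the default config

def throttled(fetch, delay):
    def wrapper(url):
        time.sleep(delay)  # wait out the crawl delay before fetching
        return fetch(url)
    return wrapper

# Stand-in fetch function for demonstration purposes.
fetch_page = throttled(lambda url: "<html>", page_crawl_delay)
```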
lang is the language code which determines the parser and filter the program will use. The default value is zh. The following languages are supported:
- zh

wiki_lang is the language code which determines which language edition of Wiktionary the program will crawl from. The default value is en. Note: If you change this to another value, there is no guarantee that the parser will work correctly. If you wish to obtain glosses in another language, please use a translator instead.
subcats_bl is a list of regular expressions which can be used to match sub-categories that should be blacklisted. By default, the following sub-categories are blacklisted:
- Category:cmn.*
- .* derived from Mandarin
- .* in (simplified|traditional) script

Note that this is specific to Chinese.

pages_bl is a list of regular expressions which can be used to match pages that should be blacklisted. By default, the following pages are blacklisted:
- Appendix:.*
- Template:.*
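Blacklist matching with such patterns can be sketched with the standard re module. The helper name below is hypothetical, not the crawler's actual code; only the patterns come from the defaults above:

```python
import re

# Illustrative sketch of applying the blacklist regexes; the helper
# name is invented for this example.
pages_bl = ["Appendix:.*", "Template:.*"]

def is_blacklisted(title, patterns):
    # A title is excluded when any blacklist pattern matches it
    # from the start of the string (re.match anchors at position 0).
    return any(re.match(p, title) for p in patterns)
```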
The default configuration values parse simplified Chinese text. If you wish to parse another language, please comment out these values, look for your language, then uncomment those config values. The config values for each language are specified here. These values are available:
- zh_s: Set to True if you wish to crawl only simplified words.
- zh_t: Set to True if you wish to crawl only traditional words.

Words that are both simplified and traditional will always be parsed if either zh_s or zh_t is set to True.
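For example, a config.py fragment for crawling only simplified words would look like this (a sketch of the two flags described above, under the default zh setup):

```python
# lang-specific config vals for zh: crawl simplified words only.
zh_s = True
zh_t = False
```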