Extract translation pairs from each page of selected Wiktionary editions.
ZIM is a file format that stores wiki content for offline usage.
- Wiktionary dumps in .zim format can be obtained from Kiwix.
The input to this program can be either a .zim file containing all pages of a Wiktionary edition, or a list of urls. See usage for more detail.
Each line consists of these fields:
- edition: the edition of Wiktionary the translation pair is from. It is a 2-3 letter code used in the Wiktionary url.
- headword: the word that is being translated.
- head_lang: the language of the headword. It might be different from the language of the edition.
- translation: the translation of the headword.
- trans_lang: the language the headword is translated into.
- trans_lang_code: the language code of trans_lang. This is the edition code used by Wiktionary, and it is not from a single ISO standard.
- pos: the part of speech of the headword in head_lang.
- pronunciation: the IPA representation of the headword; reflects how the word would be spoken in the head_lang.
The output is in CSV format with these eight columns.
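As a rough illustration of consuming this output, the sketch below reads rows with Python's csv module and maps them onto the eight field names. It assumes the columns appear in the order listed above and that the file has no header row; neither is confirmed here. The read_pairs helper and the sample row are made up for the example.

```python
import csv
import io

# Field names as listed above. Their order in the real output is an assumption.
FIELDS = [
    "edition", "headword", "head_lang", "translation",
    "trans_lang", "trans_lang_code", "pos", "pronunciation",
]

def read_pairs(stream):
    """Yield one dict per CSV row, keyed by FIELDS (hypothetical helper)."""
    for row in csv.reader(stream):
        if len(row) != len(FIELDS):
            continue  # skip rows that do not have exactly eight columns
        yield dict(zip(FIELDS, row))

# Made-up sample row, only to show the shape of the data.
sample = io.StringIO("en,cat,English,chat,French,fr,noun,/kæt/\n")
pairs = list(read_pairs(sample))
```

If the real output does start with a header row, csv.DictReader would be the simpler choice.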
- beautifulsoup4: used for parsing html.
- requests: used to make http calls and fetch .html from the Internet. Required if using the Internet as data source.
- pycountry and iso-639: used for conversion between language codes. Required if you do not specify a Wiktionary edition code.
- repoze.lru: LRU cache which significantly improves performance for .zim. Recommended if using .zim as data source.
Install in a virtualenv as appropriate.
To install all dependencies (you don't have to):
$ pip install -r requirements.txt
To install one by one, use pip install [PACKAGE NAME].
Run either parser.py or extract.py.
usage: parser.py [-h] (--url_zim URL_ZIM | --url_list URL_LIST | --zim ZIM)
                 [--edition EDITION]

optional arguments:
  -h, --help            show this help message and exit
  --url_zim URL_ZIM, -uz URL_ZIM
                        use a zim file as the source of urls and get html from
                        the Internet
  --url_list URL_LIST, -ul URL_LIST
                        use a file containing a list of urls and get html from
                        the Internet
  --zim ZIM, -z ZIM     use the zim file as input instead of html
  --edition EDITION, -e EDITION
                        explicitly specify the language edition, for either
                        html or zim
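The usage text above implies that exactly one of the three input options must be given. A command line with that shape could be declared with argparse roughly as follows; this is a sketch reconstructed from the help output, not the actual code in parser.py, and build_arg_parser is a hypothetical name.

```python
import argparse

def build_arg_parser():
    """Sketch of an argparse setup matching the usage text (hypothetical)."""
    parser = argparse.ArgumentParser(prog="parser.py")
    # Exactly one data source has to be chosen.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument(
        "--url_zim", "-uz",
        help="use a zim file as the source of urls and get html from the Internet")
    source.add_argument(
        "--url_list", "-ul",
        help="use a file containing a list of urls and get html from the Internet")
    source.add_argument(
        "--zim", "-z",
        help="use the zim file as input instead of html")
    parser.add_argument(
        "--edition", "-e",
        help="explicitly specify the language edition, for either html or zim")
    return parser

# Example file name is made up for illustration.
args = build_arg_parser().parse_args(["-z", "wiktionary_ja.zim", "-e", "ja"])
```

Passing two sources at once, for example -z together with -ul, makes argparse exit with an error.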
- Support for using .zim files has only been tested with Python 3.5. It probably does not work with Python 2 at this moment.
- parser.py should be able to automatically figure out the Wiktionary edition and choose the correct parser based on the url or the metadata in the .zim file. If it doesn't use the parser you expect, please use -e to explicitly specify the edition.
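One plausible way to infer the edition from a url is to read the first label of the hostname, since Wiktionary editions live at addresses such as ja.wiktionary.org. The sketch below illustrates that idea only; how parser.py really detects the edition is not shown here, and infer_edition is a hypothetical name.

```python
from urllib.parse import urlparse

def infer_edition(url):
    """Return the edition code from a Wiktionary url, or None (hypothetical)."""
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    # Expect something like ["ja", "wiktionary", "org"].
    if len(labels) >= 3 and labels[-2:] == ["wiktionary", "org"]:
        if labels[0] != "www":
            return labels[0]
    return None
```

For example, infer_edition("https://ja.wiktionary.org/wiki/%E7%8C%AB") gives "ja", while a url from another site gives None, in which case -e would be needed.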
A .zim file contains all pages in a Wiktionary edition.
To run parser.py with .zim as input:
$ python parser.py -z [ZIM FILE]
Instead of using a .zim file, you can also provide a list of urls to specify the pages to extract. The parser will fetch html from the urls to use as data source.
If you already have a file with a list of urls:
$ python parser.py -ul [FILE]
- The file should contain one url on each line.
- All urls should come from the same Wiktionary edition.
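The two requirements above can be checked before running the parser. The following sketch reads a url list, skips blank lines, and rejects lists that mix hosts; it is an optional pre-check written for illustration, not part of parser.py, and load_url_list is a hypothetical name.

```python
from urllib.parse import urlparse

def load_url_list(lines):
    """Return the urls from an iterable of lines, one url per line (hypothetical)."""
    urls = [line.strip() for line in lines if line.strip()]
    # All urls should share one host, i.e. one Wiktionary edition.
    hosts = {urlparse(url).hostname for url in urls}
    if len(hosts) > 1:
        names = ", ".join(sorted(str(host) for host in hosts))
        raise ValueError("urls come from more than one host: " + names)
    return urls
```

It can be called on an open file, for example load_url_list(open("urls.txt", encoding="utf-8")), where urls.txt is a placeholder name.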
If you want to use the urls from a .zim file, which contains all the urls from a Wiktionary edition:
$ python parser.py -uz [ZIM FILE]
$ python -m zim.extract
-m tells Python to run the named module as the main program.
- Notice there is no .py extension.
$ python -m zim.extract -i ZIMFILE url
If you want full url:
$ python -m zim.extract -i ZIMFILE url -f
The edition will be inferred from metadata in the .zim file. If you want to explicitly specify the edition instead:
$ python -m zim.extract -i ZIMFILE url -f -e EDITION
$ python -m zim.extract -i ZIMFILE html -o OUTPUT_DIRECTORY
$ python -m parser.parse_[EDITION]
This tells Python to run the main() of the named file in the module.
- Notice there is no .py extension.
- Tested with .zim file: ja, de
- Tested with some representative .html pages: az, fr, ru, tr, uz, vi
- Started: pl
- Write parsers for two or three editions.
- Run parsers on zim files (entire foreign editions of Wiktionary)
- Generalize them and create a skeleton for writing other parsers.
  - Make it so that we need minimal changes in order to parse another edition.
- Generate parsers for editions of interest.
- Modify current scripts to include pronunciation extraction from foreign editions of Wiktionary.
- Use translation scripts as base for derivation-table-parsing scripts.