A tool for parsing and tokenizing text from a Wikipedia dump XML file.
- Name: T.Furukawa
- Email: tfurukawa.mail@gmail.com
pip install wscraper
- japanese: Japanese Wikipedia
- english: English Wikipedia
Please run this command.
wscraper --help
The executable commands will be listed.
To get started, you have to execute this command.
It creates the necessary directories and files.
wscraper initialize
The wscraper root directory is created at $HOME/.wscraper.
If you want to change this path, set the environment variable WSCRAPER_ROOT.
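As a rough sketch of how this root-directory lookup behaves (the function name `resolve_root` is hypothetical; wscraper's actual implementation may differ), the path is taken from the environment with a fallback to the home directory:

```python
import os
from pathlib import Path

def resolve_root() -> Path:
    # Use WSCRAPER_ROOT if it is set; otherwise fall back to
    # $HOME/.wscraper, matching the default described above.
    env = os.environ.get("WSCRAPER_ROOT")
    return Path(env) if env else Path.home() / ".wscraper"
```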
wscraper root set --language japanese --page_chunk 1000
language
- Default language. If you do not set the parameter language for each corpus, this default language is used.
page_chunk
- A Wikipedia dump XML file contains a large amount of text spread over many pages. For analysis, it is split into several smaller files for memory efficiency.
See wscraper root set -h
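To illustrate the idea behind page_chunk (this is only a sketch of the chunking concept, not wscraper's actual code), splitting a long stream of pages into fixed-size groups keeps memory usage bounded:

```python
from itertools import islice

def chunked(pages, chunk_size):
    """Yield lists of at most chunk_size pages at a time."""
    it = iter(pages)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk

# With page_chunk = 1000, a dump of 2500 pages would be written
# out as three files of 1000, 1000, and 500 pages.
```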
The file wikipedia.xml is assumed to have a name like (lang)wiki-(date)-pages-articles-multistream.xml.
wscraper import /path/to/sample.xml
wscraper import /path/to/wikipedia.xml --name my_wp
See wscraper import -h.
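The expected dump file name can be checked with a simple pattern. This is a hypothetical helper for illustration, not part of wscraper; it assumes the standard Wikimedia naming with an 8-digit date:

```python
import re

# Matches names like jawiki-20240101-pages-articles-multistream.xml
DUMP_NAME = re.compile(r"^[a-z]+wiki-\d{8}-pages-articles-multistream\.xml$")

def looks_like_dump(filename: str) -> bool:
    return DUMP_NAME.match(filename) is not None
```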
You can check the available Wikipedia corpus resources.
wscraper list
output
Available wikipedia:
- sample
- my_wp
You can switch the current corpus.
wscraper switch my_wp
wscraper status
output
current: my_wp
language [default]: japanese
Required parameters should be set for the current corpus.
wscraper set --language english
parameters:
language
You can delete parameters by running the following command.
wscraper unset --language
You can rename a corpus from $source to $target.
wscraper rename $source $target
When a corpus (for example, $target) is no longer needed, it can be removed.
wscraper delete $target
Import the iterator classes.
from wscraper.analysis import *
You can iterate over the pages of a corpus like this.
# entry
entry_iterator = EntryIterator()
# You can specify the corpus name and language.
# If these parameters are not given, the current Wikipedia corpus is used.
# >>> EntryIterator(name = "sample", language = "japanese")
both_iterator = BothIterator()
redirection_iterator = RedirectionIterator()
for i, b in enumerate(both_iterator):
print(f"both: {i}: {type(b)}")
for i, e in enumerate(entry_iterator):
print(f"entry {i}: {e.title} {len(e.mediawiki)}")
for i, r in enumerate(redirection_iterator):
print(f"redirection: {i}: {r.source} -> {r.target}")
For example, you can give an iterator a tokenizer and feed it to an ML model.
def to_words(x):
return x.split()
# return word list for each iteration
iterator = ArticleIterator(tagger = to_words)
# If you set `type = dict`, you can get each record as a dictionary
# ex: { "title": "ABC...", "article": to_words(article) }
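The tagger can be any callable that maps raw text to a list of tokens. For instance, a simple regex-based tokenizer (shown only as an illustration; `regex_tagger` is not shipped with wscraper) could replace `to_words`:

```python
import re

WORD = re.compile(r"\w+")

def regex_tagger(text):
    # Lowercase the text and extract word tokens; punctuation is dropped.
    return [w.lower() for w in WORD.findall(text)]
```

It would be passed the same way: `ArticleIterator(tagger = regex_tagger)`.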
# Iterators:
# ArticleIterator
# - 1 page / record
# - dict keys: ["title", "article"]
# ParagraphIterator:
# - N records / page (one record per paragraph)
# - A heading like "== A ==" is the delimiter between paragraphs.
# - dict keys: ["page_title", "paragraph_title", "paragraph"]
# For example, gensim word2vec can consume this iterator.
from gensim.models.word2vec import Word2Vec
Word2Vec(iterator)
# You can concatenate iterators to train ML model using CombinedIterator.
Word2Vec(CombinedIterator(iterator, another_iterator))
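Note that gensim's Word2Vec iterates over its corpus several times (once to build the vocabulary, then once per epoch), so the combined object must be restartable rather than a one-shot generator. A minimal sketch of that pattern, assuming the parts are themselves restartable iterables (CombinedIterator presumably behaves similarly):

```python
from itertools import chain

class Combined:
    """Restartable concatenation of several iterables.

    Each call to __iter__ starts a fresh pass over all parts,
    which is what multi-pass consumers like Word2Vec require.
    """
    def __init__(self, *iterables):
        self.iterables = iterables

    def __iter__(self):
        return chain.from_iterable(self.iterables)
```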
The source code is licensed under the MIT License.
Please check the LICENSE file.