
Wikipedia Scraper

A tool for parsing and tokenizing text from a Wikipedia dump XML file.

Author

Installation

pip install wscraper

Supported Languages

  • japanese
    • Japanese Wikipedia
  • english
    • English Wikipedia

How to Work (Command)

Check Console Commands

Run this command:

wscraper --help

The available commands will be listed.

Initialize

To get started, run this command. It creates the necessary directories and files.

wscraper initialize

The wscraper root directory is created at $HOME/.wscraper.
To change this path, set the environment variable WSCRAPER_ROOT.

Set Global Parameters

wscraper root set --language japanese --page_chunk 1000
  • language
    • The default language. If you do not set the language parameter for an individual corpus, this default is used.
  • page_chunk
    • A Wikipedia dump XML file contains a large amount of text across many pages. For memory efficiency, the dump is split into several smaller files of this many pages each during analysis.
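The chunking idea can be sketched with plain Python. This is an illustration only; `chunk_pages` is a hypothetical helper, not wscraper's internal implementation:

```python
from itertools import islice

def chunk_pages(pages, page_chunk=1000):
    """Yield successive lists of at most `page_chunk` pages.

    Illustrates the effect of the page_chunk parameter:
    a long stream of pages is broken into bounded-size batches,
    so each batch fits comfortably in memory.
    """
    it = iter(pages)
    while True:
        chunk = list(islice(it, page_chunk))
        if not chunk:
            return
        yield chunk

# A dump of 2500 pages with page_chunk=1000 produces batches of 1000, 1000, 500.
sizes = [len(c) for c in chunk_pages(range(2500), page_chunk=1000)]
print(sizes)  # [1000, 1000, 500]
```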

See wscraper root set -h.

Import a Wikipedia XML File

The input file is assumed to be a Wikipedia dump named like (lang)wiki-(date)-pages-articles-multistream.xml.

wscraper import /path/to/sample.xml
wscraper import /path/to/wikipedia.xml --name my_wp

See wscraper import -h.
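Such a dump is ordinary MediaWiki export XML. As background (this is not wscraper's code), the `<page>` elements can be read incrementally with the standard library, which is how multi-gigabyte dumps stay processable:

```python
import io
import xml.etree.ElementTree as ET

# A tiny stand-in for a real dump; actual dumps also carry the
# MediaWiki export namespace and are gigabytes in size.
SAMPLE = """<mediawiki>
  <page><title>Alpha</title><revision><text>Alpha text</text></revision></page>
  <page><title>Beta</title><revision><text>Beta text</text></revision></page>
</mediawiki>"""

def iter_pages(stream):
    """Yield (title, text) pairs from a dump-like XML stream."""
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "page":
            title = elem.findtext("title")
            text = elem.findtext("revision/text")
            yield title, text
            elem.clear()  # free the parsed subtree; essential for huge dumps

pages = list(iter_pages(io.StringIO(SAMPLE)))
print(pages)  # [('Alpha', 'Alpha text'), ('Beta', 'Beta text')]
```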

Check Wikipedia Resources

You can list the available Wikipedia corpus resources.

wscraper list

output

Available wikipedia:
  - sample
  - my_wp

Switch Current Corpus

wscraper switch my_wp

Check the Status of Current Corpus

wscraper status

output

current: my_wp

language [default]: japanese

Set Parameters for Current Corpus

Required parameters should be set for the current corpus.

wscraper set --language english

parameters:

  • language

Unset Parameters

You can delete parameters by running the following command.

wscraper unset --language

Rename a Corpus Name

You can rename a corpus from $source to $target.

wscraper rename $source $target

Delete a Corpus

When a corpus (for example, $target) is no longer needed, it can be removed.

wscraper delete $target

How to Work (Python)

Import the iterator classes:

from wscraper.analysis import *

You can iterate over the pages of a corpus like this:

# entry
entry_iterator = EntryIterator()
# You can specify the corpus name and language.
# If no parameter is given, the current Wikipedia corpus is used.
# >>> EntryIterator(name = "sample", language = "japanese")
both_iterator = BothIterator()
redirection_iterator = RedirectionIterator()

for i, b in enumerate(both_iterator):
    print(f"both: {i}: {type(b)}")

for i, e in enumerate(entry_iterator):
    print(f"entry {i}: {e.title} {len(e.mediawiki)}")

for i, r in enumerate(redirection_iterator):
    print(f"redirection: {i}: {r.source} -> {r.target}")

For example, you can pass a tokenizer to an iterator and feed the iterator to an ML model.

def to_words(x):
    return x.split()

# return word list for each iteration
iterator = ArticleIterator(tagger = to_words)
# If you set `type = dict`, each record is returned as a dictionary,
# e.g. { "title": "ABC...", "article": to_words(article) }

# Iterators:
#   ArticleIterator
#     - 1 page / record
#     - dict keys: ["title", "article"]
#   ParagraphIterator:
#     - N page / record
#     - Description like "== A ==" is delimiter of paragraphs.
#     - dict keys: ["page_title", "paragraph_title", "paragraph"]

# For example, gensim's word2vec can consume this iterator.
from gensim.models.word2vec import Word2Vec
Word2Vec(iterator)

# You can concatenate iterators to train ML model using CombinedIterator.
Word2Vec(CombinedIterator(iterator, another_iterator))
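gensim makes multiple passes over its corpus, so a combined iterator must be restartable. The idea behind CombinedIterator can be sketched generically; the `Combined` class below is an illustration under that assumption, not wscraper's actual implementation:

```python
from itertools import chain

class Combined:
    """Chain several iterables while staying restartable,
    so a consumer such as gensim can make multiple passes.
    (A generic sketch of the CombinedIterator idea.)
    """
    def __init__(self, *iterables):
        self.iterables = iterables

    def __iter__(self):
        # A fresh chain is built on every pass, so iteration restarts cleanly.
        return chain.from_iterable(self.iterables)

corpus = Combined([["a", "b"]], [["c"]])
print(list(corpus))  # [['a', 'b'], ['c']]
print(list(corpus))  # restartable: same result on a second pass
```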

License

The source code is licensed under the MIT License.

Please check the file LICENSE.