Skip to content

halfak/MediaWiki-Streaming

Repository files navigation

MediaWiki Streaming

A set of utilities for stream-processing MediaWiki data.

Usage

mwstream (-h | --help)

mwstream <utility> [-h|--help]

Data processing utilities

diffs2persistence

Generates token persistence statistics using revision JSON blobs with diff information.

dump2json

Converts an XML dump to a stream of revision JSON blobs

dump2diffs

Computes diffs directly from an XML dump

fetch_missing_diffs

Scans diff documents looking for missing diffs and fills them in.

json2diffs

Computes and adds a "diff" field to a stream of revision JSON blobs

mend_diffs

Mends diffs that were computed in chunks and out of order.

persistence2stats

Aggregates a token persistence statistics to revision statistics

wikihadoop2json

Converts a Wikihadoop-processed stream of XML pages to JSON blobs

General utilities

json2tsv

Converts a stream of JSON blobs to tab-separated values based a set of fieldnames.

normalize

Normalizes old versions of RevisionDocument json schemas to correspond to the most recent schema version.

validate

Validates JSON against a provided schema.

truncate_text

Truncates the 'text' field of JSON blobs to a limited length in unicode characters. (addresses content dump vandalism issues) and adds a boolean 'truncated' field.

Installation

pip install mwstreaming

About

A set of utilities for processing MediaWiki data with streams.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages