GitHub - natsheh/arwiki_parser: Extract plain text from Arabic Wikipedia dumps.

About

arwiki_parser is a small Python script for extracting plain-text articles from Arabic Wikipedia dumps. Grab a fresh dump, extract it, and give arwiki_parser a try.

Requirements:

The script should run on any Unix-like system with Python 2.7 installed, along with pip or easy_install for installing additional packages.

The following third-party libraries need to be installed as well:

Beautiful Soup 4: $ pip install beautifulsoup4
mwlib: $ pip install mwlib mwlib.xhtml
lxml: $ pip install lxml
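
Before running the script, it can help to confirm that the libraries listed above are importable. The snippet below is a minimal sketch, not part of arwiki_parser; the helper name `missing_modules` is hypothetical, and it assumes the usual import names `bs4`, `mwlib`, and `lxml` for the three packages.

```python
import importlib


def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Import names assumed for beautifulsoup4, mwlib, and lxml.
    absent = missing_modules(["bs4", "mwlib", "lxml"])
    if absent:
        print("Missing modules: " + ", ".join(absent))
    else:
        print("All required modules are importable.")
```

An empty result means all three packages were found by the interpreter that ran the check, so run it with the same Python 2.7 interpreter you intend to use for the script.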

Usage:

$ python arwiki_parser.py path/to/dump.xml path/to/output/dir/

The script extracts each article into its own file. To make the output easier to browse in a file manager, the files are distributed across 256 directories with hexadecimal names 00 through ff. Each article is stored as <article_id>.txt, and the first line of each file is always the title of the article.
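
The output layout described above can be read back with a few lines of Python. This is a sketch under stated assumptions rather than code from the repository: the helper names (`bucket_names`, `read_article`, `iter_articles`) are hypothetical, and the files are assumed to be UTF-8 encoded. It makes no assumption about how the script decides which directory an article goes into; it simply walks all 256 directories.

```python
import io
import os


def bucket_names():
    """Return the 256 directory names '00' through 'ff'."""
    return ["{:02x}".format(i) for i in range(256)]


def read_article(path):
    """Split one extracted file into (title, body).

    The first line is the title; everything after it is the body.
    UTF-8 encoding is an assumption, not something the README states.
    """
    with io.open(path, encoding="utf-8") as handle:
        title = handle.readline().rstrip("\n")
        body = handle.read()
    return title, body


def iter_articles(output_dir):
    """Yield (article_id, title, body) for every <article_id>.txt found."""
    for bucket in bucket_names():
        bucket_dir = os.path.join(output_dir, bucket)
        if not os.path.isdir(bucket_dir):
            continue  # tolerate buckets that were never created
        for name in sorted(os.listdir(bucket_dir)):
            if name.endswith(".txt"):
                title, body = read_article(os.path.join(bucket_dir, name))
                yield name[:-4], title, body
```

For example, `for article_id, title, body in iter_articles("path/to/output/dir/"): ...` would visit every extracted article in directory order.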

License:

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.
