# wikipedia-scripts
#### 0. Preparation

  - Download the Wikidata and Wikipedia dumps.
  - Import the Wikipedia `*.sql` dumps into a MySQL database.
  - Prepare the Stanford Tokenizer and CoreNLP.
  - Prepare WikiExtractor (a sample invocation is sketched below).

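Step 3 below expects the Wikipedia dump to have been run through WikiExtractor with link markup preserved. A minimal sketch of that invocation follows; the dump filename, output directory, and exact flags are assumptions and may differ between WikiExtractor versions.

```python
# Sketch: run WikiExtractor over the Wikipedia XML dump with link markup kept,
# which step 3 below relies on. The dump filename, output directory, and flags
# are assumptions and may differ between WikiExtractor versions.
import subprocess

subprocess.run(
    [
        "python", "WikiExtractor.py",
        "--json",           # one JSON object per article
        "--links",          # keep internal links as <a href="...">...</a>
        "-o", "extracted",  # output directory
        "enwiki-latest-pages-articles.xml.bz2",
    ],
    check=True,
)
```
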
#### 1. Extract all items, properties, and triples from Wikidata

`python wd.extract_all.py target_dir`
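
For orientation, here is a hedged sketch of what this extraction amounts to: streaming the Wikidata JSON dump and pulling out entity-valued claims as triples. The dump filename, the English-only label/description choice, and the output handling are assumptions, not necessarily what `wd.extract_all.py` does.

```python
# Conceptual sketch of the Wikidata extraction step: stream the JSON dump and
# pull out items, properties, and (subject, property, object) triples.
import bz2
import json

def iter_entities(dump_path):
    """Yield one Wikidata entity dict per line of the JSON dump."""
    with bz2.open(dump_path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue
            yield json.loads(line)

def entity_triples(entity):
    """Yield (subject, property, object) for claims whose value is another entity."""
    subject = entity["id"]
    for prop, claims in entity.get("claims", {}).items():
        for claim in claims:
            value = claim.get("mainsnak", {}).get("datavalue", {}).get("value")
            if isinstance(value, dict) and "id" in value:
                yield subject, prop, value["id"]

for entity in iter_entities("wikidata-latest-all.json.bz2"):
    label = entity.get("labels", {}).get("en", {}).get("value", "")
    description = entity.get("descriptions", {}).get("en", {}).get("value", "")
    triples = list(entity_triples(entity))
    # ... write items (Q...), properties (P...), and triples out under target_dir
```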

#### 2. Tokenize the entities' descriptions

`python wd.tokenize.py`
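
A sketch of one way to drive the Stanford PTBTokenizer over a file with one description per line; the jar name and file layout are assumptions, and `wd.tokenize.py` may invoke the tokenizer differently.

```python
# Sketch of tokenizing entity descriptions with the Stanford PTBTokenizer.
# The jar path is an assumption about the local CoreNLP installation.
import subprocess

def tokenize_descriptions(infile, outfile, corenlp_jar="stanford-corenlp.jar"):
    """Tokenize infile line by line; -preserveLines keeps one description per output line."""
    with open(outfile, "w", encoding="utf-8") as out:
        subprocess.run(
            ["java", "-cp", corenlp_jar,
             "edu.stanford.nlp.process.PTBTokenizer", "-preserveLines", infile],
            stdout=out,
            check=True,
        )

tokenize_descriptions("descriptions.txt", "descriptions.tok.txt")
```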

#### 3. Extract linked sentences from the Wikipedia XML dump parsed by WikiExtractor.py

`python wp.extract_all.py`
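
Assuming WikiExtractor was run with `--json --links` (step 0), a sketch of pulling out sentences that still contain internal links; the file path, the naive sentence splitter, and the anchor regex are placeholders, and `wp.extract_all.py` may instead use CoreNLP sentence splitting.

```python
# Sketch of scanning WikiExtractor's --json --links output for sentences that
# contain internal links.
import json
import re
import urllib.parse

ANCHOR = re.compile(r'<a href="([^"]+)">([^<]*)</a>')

def linked_sentences(extracted_file):
    with open(extracted_file, encoding="utf-8") as f:
        for line in f:
            article = json.loads(line)                     # {"id", "url", "title", "text"}
            for sentence in article["text"].split(". "):   # naive sentence split
                links = [(urllib.parse.unquote(target), anchor)
                         for target, anchor in ANCHOR.findall(sentence)]
                if links:
                    yield article["title"], sentence, links

for title, sentence, links in linked_sentences("extracted/AA/wiki_00"):
    pass  # e.g. store (source page, sentence, [(target title, anchor text), ...])
```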

#### 4. Combine the two data sources

`python wp.combine_wd.py`
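
A sketch of the join, assuming both sides were written as JSON lines and keyed on the English Wikipedia title via Wikidata sitelinks; the file names, formats, and key choice are assumptions about what the earlier steps produced.

```python
# Sketch of joining the Wikidata side with the Wikipedia side: map enwiki page
# titles to Wikidata item IDs, then attach the ID to each linked sentence.
import json

def load_title_to_qid(items_path):
    """Map enwiki page title -> Wikidata item ID."""
    mapping = {}
    with open(items_path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            title = item.get("sitelinks", {}).get("enwiki", {}).get("title")
            if title:
                mapping[title] = item["id"]
    return mapping

title_to_qid = load_title_to_qid("wikidata_items.jsonl")

with open("linked_sentences.jsonl", encoding="utf-8") as f, \
     open("combined.jsonl", "w", encoding="utf-8") as out:
    for line in f:
        record = json.loads(line)             # {"title": ..., "sentence": ..., "links": ...}
        qid = title_to_qid.get(record["title"])
        if qid:                               # keep only pages that map to a Wikidata item
            record["qid"] = qid
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
```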

#### 5. Create the wikiP2D dataset

`python create_dataset.py`
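
A sketch of the final assembly, grouping records by entity and writing train/dev/test splits; the 8:1:1 ratio, the input file, and the pickle output are assumptions rather than the actual wikiP2D format.

```python
# Sketch of assembling the dataset: group records by Wikidata item and write
# train/dev/test splits.
import json
import pickle
import random

entities = {}
with open("combined.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        entities.setdefault(record["qid"], []).append(record)

qids = sorted(entities)
random.seed(0)          # reproducible split
random.shuffle(qids)
n = len(qids)
splits = {
    "train": qids[:int(0.8 * n)],
    "dev":   qids[int(0.8 * n):int(0.9 * n)],
    "test":  qids[int(0.9 * n):],
}
for name, subset in splits.items():
    with open(f"wikiP2D.{name}.pkl", "wb") as out:
        pickle.dump({qid: entities[qid] for qid in subset}, out)
```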
