EconstorCorpus

EconStor is ZBW's open access server for scientific publications. The software in this repository builds a text-mining corpus from EconStor documents.

Overview

The repository contains two independent yet related components, described below:

Luke the Downloader

Generates an index of all EconStor files using the EconBiz API
Downloads PDF files
Determines RePEc handles for the documents
Fetches citation count figures (using CitEc)
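
The download step can be sketched as follows. This is a minimal illustration, not the code shipped in Luke_the_Downloader: the function names `fetch_url` and `download_pdfs`, the shape of the index (a dict mapping a document identifier to a PDF URL), and the file naming scheme are all assumptions made for this example. The sketch skips files that already exist, so an interrupted run can be resumed.

```python
import os
import urllib.request


def fetch_url(url):
    """Return the raw bytes behind a URL (plain HTTP GET)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def download_pdfs(index, out_dir, fetch=fetch_url):
    """Download every PDF in `index` that is not yet present in `out_dir`.

    `index` is a hypothetical dict mapping a document identifier to its
    PDF URL. Returns the identifiers that were newly downloaded.
    """
    os.makedirs(out_dir, exist_ok=True)
    downloaded = []
    for doc_id, url in index.items():
        # Identifiers may contain slashes, which cannot appear in file names.
        target = os.path.join(out_dir, doc_id.replace("/", "_") + ".pdf")
        if os.path.exists(target):
            continue  # already fetched in an earlier run
        data = fetch(url)
        with open(target, "wb") as handle:
            handle.write(data)
        downloaded.append(doc_id)
    return downloaded
```

Passing the fetch function as a parameter keeps the loop testable without network access; a real run would use the default `fetch_url`.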

Han the Converter

Extracts plaintext from PDF files
Guesses the language of the document
Normalizes the plaintext (this may require tailoring for your purposes)
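
To give an idea of what normalization can involve, here is a minimal sketch using only the Python standard library. It is not the normalization implemented in Han_the_Converter; the function name `normalize_plaintext` and the choice of steps (Unicode folding, rejoining hyphenated line breaks, whitespace collapsing, lowercasing) are assumptions chosen for illustration.

```python
import re
import unicodedata


def normalize_plaintext(text):
    """Apply a few generic cleanup steps to text extracted from a PDF."""
    # Fold compatibility characters, e.g. the single-glyph "fi" ligature
    # (U+FB01) becomes the two letters "f" and "i".
    text = unicodedata.normalize("NFKC", text)
    # Rejoin words hyphenated across a line break ("docu-\nment").
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse every run of whitespace, including newlines, into one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()
```

Each step is a candidate for tailoring. For example, the hyphen rule also merges genuine compounds that happen to break at a line end, and lowercasing may be unwanted if case carries meaning in your analysis.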

More information is provided in the IPython notebooks and README files in the subdirectories.

