n-witt/EconstorCorpus

EconstorCorpus

EconStor is ZBW's open access server for scientific publications. The software in this repository builds a text-mining corpus from EconStor documents.

Overview

The repository contains two independent but related components, described below:

Luke the Downloader

  1. Generates an index of all EconStor files using the EconBiz API
  2. Downloads PDF files
  3. Determines RePEc handles for the documents
  4. Fetches citation count figures (using CitEc)
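
As a rough illustration of step 2, the sketch below downloads every indexed PDF that is not yet on disk. It is not the repository's implementation: the record keys `id` and `pdf_url`, the file naming scheme, and the injected `fetch` callable are all assumptions made for this example.

```python
from pathlib import Path


def pdf_path(out_dir, doc_id):
    """Map a document identifier to a local file name (hypothetical scheme)."""
    safe = "".join(c if c.isalnum() else "_" for c in doc_id)
    return Path(out_dir) / f"{safe}.pdf"


def download_all(index, out_dir, fetch):
    """Download every indexed PDF that is not already on disk.

    index: iterable of dicts with the assumed keys "id" and "pdf_url".
    fetch: callable that takes a URL and returns the file content as bytes.
    Returns the list of ids that were newly downloaded.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    downloaded = []
    for record in index:
        target = pdf_path(out_dir, record["id"])
        if target.exists():  # skip files left over from an earlier run
            continue
        target.write_bytes(fetch(record["pdf_url"]))
        downloaded.append(record["id"])
    return downloaded
```

Passing `fetch` as a parameter keeps the loop independent of any particular HTTP client. For a real run it could be as simple as `lambda url: urllib.request.urlopen(url).read()`, and because existing files are skipped, an interrupted download can be resumed by calling the function again.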

Han the Converter

  1. Extracts plaintext from PDF files
  2. Guesses the language of the document
  3. Normalizes the plaintext (this may require tailoring for your purposes)
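
Steps 2 and 3 can be sketched with the standard library alone. The snippet below is not the code used by Han the Converter; the stopword lists, the function names `guess_language` and `normalize`, and the chosen clean-up rules are assumptions that show one possible approach.

```python
import re
import unicodedata

# Tiny illustrative stopword lists (an assumption of this sketch); a real
# corpus build would more likely rely on a language-identification library.
STOPWORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "for"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "mit", "für"},
}


def guess_language(text):
    """Return the language code whose stopwords occur most often, or None."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    counts = {
        lang: sum(tok in words for tok in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


def normalize(text):
    """Apply a few common clean-up steps to text extracted from a PDF."""
    text = unicodedata.normalize("NFKC", text)    # fold ligatures such as "ﬁ"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)  # rejoin words split at line ends
    text = re.sub(r"\s+", " ", text)              # collapse whitespace and newlines
    return text.strip().lower()
```

Each rule in `normalize` is a separate line so that steps can be dropped or added when tailoring the output, for example keeping the original case if the downstream task is case-sensitive.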

More information is provided in the IPython notebooks and README files in the subdirectories.
