EconStor is ZBW's open access server for scientific publications. The software in this repository builds a text mining corpus from EconStor documents.
The repository contains two independent (yet related) components. Their functionality is described below:
- Generates an index of all EconStor files using the EconBiz API
- Downloads PDF files
- Determines RePEc handles for the documents
- Fetches citation count figures (using CitEc)
- Extracts plaintext from PDF files
- Guesses the language of the document
- Normalizes the plaintext (this may require tailoring for your purposes)
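
To give an idea of what the normalization step might involve, here is a minimal sketch using only the Python standard library. It is not the code shipped in this repository: the function name `normalize_plaintext` and the chosen rules (Unicode NFKC folding, rejoining words hyphenated across line breaks, collapsing whitespace, lowercasing) are assumptions made for illustration.

```python
import re
import unicodedata


def normalize_plaintext(text: str) -> str:
    """Hypothetical normalizer for text extracted from PDF files."""
    # Fold compatibility characters, e.g. the "fi" ligature U+FB01 becomes "fi".
    text = unicodedata.normalize("NFKC", text)
    # Rejoin words that were hyphenated across a line break ("eco-\nnomic").
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse runs of whitespace (spaces, newlines, form feeds) to one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip().lower()


print(normalize_plaintext("Eco-\nnomic  \ufb01gures\n"))  # prints "economic figures"
```

Note that the dehyphenation rule also joins genuine compounds that happen to break at a line end (for example "open-\naccess" becomes "openaccess"). This is one of the places where tailoring may be needed.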
More information is provided in the IPython notebooks and README files in the subdirectories.