metadata-extraction-from-content-files

Automatically extract metadata from educational content files (ebooks, articles, etc.)

Currently supports filetype:

This repository contains submodules:

To get the complete repository along with its submodules, clone recursively, i.e.,

extractFromEpub attempts to auto-extract metadata from EPUB files, such as the author, publication date, publisher, and other fields that may be relevant for indexing content in a digital library. This information may be stated explicitly in the metadata section of the directory structure zipped inside the EPUB file. If no explicit metadata is present, it tries to extract some metadata from the first few pages of the book's content.
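
As a rough illustration of the explicit-metadata case, the sketch below reads the Dublin Core fields from an EPUB's package document using only the Python standard library. It is not the code in extractFromEpub.py: the function name `read_epub_metadata` and the choice of fields are assumptions made for this example, and the fallback that scans the first few pages is not covered.

```python
import zipfile
import xml.etree.ElementTree as ET

# Fixed location of the container file inside every EPUB archive.
CONTAINER_PATH = "META-INF/container.xml"

# XML namespaces used by the container file and the package document.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical selection of fields; a real indexer may want more.
DC_FIELDS = ("title", "creator", "publisher", "date", "language")


def read_epub_metadata(epub_path):
    """Return a dict mapping Dublin Core field names to lists of values.

    Fields that are absent or empty are left out, so an EPUB without
    explicit metadata yields an empty dict.
    """
    with zipfile.ZipFile(epub_path) as zf:
        # container.xml points at the package (.opf) document.
        container = ET.fromstring(zf.read(CONTAINER_PATH))
        rootfile = container.find("c:rootfiles/c:rootfile", NS)
        if rootfile is None:
            return {}
        package = ET.fromstring(zf.read(rootfile.get("full-path")))

    metadata = package.find("opf:metadata", NS)
    if metadata is None:
        return {}

    fields = {}
    for name in DC_FIELDS:
        values = [
            el.text.strip()
            for el in metadata.findall(f"dc:{name}", NS)
            if el.text and el.text.strip()
        ]
        if values:
            fields[name] = values
    return fields
```

An empty result from a function like this would be the signal to fall back to inspecting the opening pages of the book.
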
After extracting whatever metadata it can, it indexes the files in SIP format (a specific directory-structure protocol) using that metadata.
The extractor script takes a folder location from the user, finds the EPUB files within that directory, auto-extracts metadata from them, and indexes them into the desired SIP-format directory structure.
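
The file-discovery step could look roughly like the sketch below. The helper name `find_epub_files` is hypothetical, and searching subfolders recursively is an assumption; extractor.py may only look at the top level of the folder. Building the SIP directory structure is left out because its layout is not described here.

```python
from pathlib import Path


def find_epub_files(folder):
    """Return a sorted list of paths to .epub files under `folder`.

    Assumption: subfolders are searched too, and the extension match
    is case-insensitive (so `BOOK.EPUB` is also picked up).
    """
    root = Path(folder)
    if not root.is_dir():
        raise NotADirectoryError(f"not a directory: {folder}")
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() == ".epub"
    )
```

Each returned path could then be handed to the metadata extraction step before the file is placed into the SIP structure.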
