MetaDataExtractor

Python repo to extract metadata from a variety of documents (MS Office docs, PDF, images).

Launch with:

python3 -m pip install -r requirements.txt

python3 main.py

This creates a JSON file, "metadata.json", at the root of the repo.

You will also find a Shiny app in the visualization folder. To use it, convert the JSON file to CSV with the code below and store the result in /visualization/data/. For some reason Python segfaults when this code is embedded in the repo, so run the snippet below separately (for example in your favorite IDE) to avoid it!

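import pandas as pd

# Path to the JSON file produced by main.py; adjust it to where your copy lives.
path = 'data/data/metadata.json'

# read_json puts each top-level key in a column, so transpose to get one row per key.
df = pd.read_json(path).T

# Write the CSV, then move it into visualization/data/ for the Shiny app.
df.to_csv('metadata.csv')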
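If pandas is not at hand, the same conversion can be sketched with the standard library only. This is a rough sketch, not the repo's own code: it assumes metadata.json is an object mapping each file name to a flat dict of metadata fields (inferred from the transpose above, not confirmed), and the function name json_to_csv and the "file" column header are made up for illustration.

```python
import csv
import json
from pathlib import Path


def json_to_csv(json_path, csv_path):
    """Convert {file_name: {field: value}} JSON into a CSV with one row per file.

    Assumption: every top-level value is a flat dict of metadata fields.
    """
    with open(json_path, encoding="utf-8") as fh:
        records = json.load(fh)

    # Collect every field name, keeping first-seen order across all files.
    fields = []
    for metadata in records.values():
        for key in metadata:
            if key not in fields:
                fields.append(key)

    # Create the target folder if it does not exist yet.
    csv_path = Path(csv_path)
    csv_path.parent.mkdir(parents=True, exist_ok=True)

    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        # "file" is a hypothetical header for the file-name column;
        # fields missing for a given file are written as empty cells.
        writer = csv.DictWriter(fh, fieldnames=["file"] + fields, restval="")
        writer.writeheader()
        for name, metadata in records.items():
            writer.writerow({"file": name, **metadata})
    return csv_path
```

Calling json_to_csv("metadata.json", "visualization/data/metadata.csv") from the root of the repo would write the CSV directly where the Shiny app expects its data, provided the JSON really has the shape assumed above.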
