Python repo to extract metadata from a variety of documents (MS Office docs, PDF, images)

Henry-nlp/MetaDataExtractor

MetaDataExtractor

Python repo to extract metadata from a variety of documents (MS Office docs, PDF, images).

Install the dependencies and launch with:

python3 -m pip install -r requirements.txt

python3 main.py

This creates a JSON file, "metadata.json", at the root of the repo.

The visualization folder also contains a Shiny app. To feed it, convert the JSON file to CSV with the code below and store the result in /visualization/data/. For an unknown reason, Python segfaults when this code is embedded in the repo, so run it separately (for example, in your favorite IDE) to avoid the crash.

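import pandas as pd

# Path to the JSON file produced by main.py; adjust it to wherever metadata.json is stored.
path = 'data/data/metadata.json'

# Read the JSON, then transpose so that each row is one record.
temp = pd.read_json(path)
df = temp.T

# Writes to the current working directory; move the file to /visualization/data/ for the Shiny app.
df.to_csv('metadata.csv')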
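
If the pandas snippet is inconvenient to run (for instance because of the segfault mentioned above, whose cause is not established here), a standard-library-only conversion is one possible alternative. This is a minimal sketch, not code from the repo. It assumes metadata.json maps each file name to a flat dictionary of metadata fields, which is what the transpose in the pandas snippet suggests. The function name json_to_csv and the column name "file" are illustrative choices.

```python
import csv
import json
from pathlib import Path


def json_to_csv(json_path, csv_path):
    """Convert a {file_name: {field: value}} JSON mapping to a CSV file.

    Assumed layout (not confirmed by the repo): one top-level key per
    document, each holding a flat dict of metadata fields.
    Returns the number of rows written.
    """
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)

    # Collect the union of metadata fields, keeping first-seen order,
    # because different document types may expose different fields.
    fields = []
    for meta in records.values():
        for key in meta:
            if key not in fields:
                fields.append(key)

    # Create the output folder if it does not exist yet.
    Path(csv_path).parent.mkdir(parents=True, exist_ok=True)

    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        # restval="" leaves a cell empty when a document lacks a field.
        writer = csv.DictWriter(f, fieldnames=["file"] + fields, restval="")
        writer.writeheader()
        for name, meta in records.items():
            writer.writerow({"file": name, **meta})
    return len(records)


if __name__ == "__main__":
    # Paths taken from the README text; adjust them to your checkout.
    json_to_csv("metadata.json", "visualization/data/metadata.csv")
```

Unlike the pandas version, this writes the document name under an explicit "file" column rather than an unnamed index column, so check which header the Shiny app expects before swapping one for the other.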
