GitHub - jdmonaco/annot2md: Script workflow for using Skim/BibDesk on OS X to extract Adobe-style PDF annotations into beautiful markdown output.

annot2md

This is a set of scripts for an automatic PDF annotation extraction workflow based around the Skim viewer and BibDesk bibliography manager for OS X. Skim is leveraged for its ability to cleanly extract text from Adobe-style highlight/underline/strike-through annotations.

The script bin/annot2md ties everything together and should be the single entry point for taking the path to an annotated PDF file and producing a beautifully formatted markdown file presenting the annotated text.

Usage: annot2md [-h] filename

Use Skim to extract standard Adobe annotations and other information about a PDF article to markdown format.

Some notes:

- Put annot2md/bin on your $PATH. Do not symlink the annot2md/bin/annot2md script directly: a symlinked copy won't be able to find the other scripts that it calls. In theory, all of the scripts could be symlinked into your ~/bin or similar, but I didn't want to pollute the executable namespace.
- Markdown output files currently go to ~/Dropbox/Papers/Annotations, but this can be changed at the top of the annot2md script.
- Parsing the cite-key currently depends on the PDF file name, which I have set to <cite-key> [<first-keyword>].pdf for BibDesk auto-filing. Either change txt2md.Article._parse_cite_key() to fit your filenames, or re-autofile under my scheme.
- You may want to change the locale set by the line export LANG=en_US.UTF-8 in the annot2md script, which is necessary to ensure that notes are extracted using the proper Unicode encoding.
- No guarantees: this works for my setup, but that's all I know for now.
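
If you need to adapt the cite-key parsing to your own file names, something like the following may be a useful starting point. This is only a sketch of splitting a name of the form <cite-key> [<first-keyword>].pdf; the function name parse_cite_key and the regular expression are my own assumptions here, not the code in txt2md.Article._parse_cite_key().

```python
import os
import re

# Hypothetical pattern for "<cite-key> [<first-keyword>]" (extension removed).
# This is an illustration, not the actual code in txt2md.py.
_FILENAME_RE = re.compile(r"^(?P<key>.+?)\s*\[(?P<keyword>[^\]]*)\]$")


def parse_cite_key(pdf_path):
    """Return (cite_key, first_keyword) guessed from a PDF file name."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    match = _FILENAME_RE.match(stem)
    if match is None:
        # No bracketed keyword: treat the whole stem as the cite-key.
        return stem.strip(), None
    return match.group("key").strip(), match.group("keyword").strip()
```

A file named with a different scheme would need a different pattern; the fallback branch just returns the bare file name as the cite-key.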

Todo:

- PDF links currently only work for my path and folder structure (~/Dropbox/Papers/<Year>/*.pdf), so this needs to be handled more generally.
- Fix parsing of the cite key, which currently depends on a particular PDF file-naming scheme.
- ~~Notes and highlights should be ordered by (-y, x) of top/left bounds; they seem to be nearly random right now for a given PDF page~~
- ~~There should be a batch script to process a directory full of PDFs, find the ones with annotations, and then extract all notes~~
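
For reference, the (-y, x) ordering mentioned above can be sketched as follows. This is not the code used in txt2md.py: the Note tuple and its fields are hypothetical, and it assumes the usual PDF coordinate system with the origin at the bottom-left of the page, so that a larger y means higher on the page.

```python
from collections import namedtuple

# Hypothetical record for one extracted annotation: page index plus the
# left (x) and top (y) bounds of its box, in PDF points.
Note = namedtuple("Note", ["page", "x", "y", "text"])


def reading_order(notes):
    """Sort notes by page, then top-to-bottom (-y), then left-to-right (x)."""
    return sorted(notes, key=lambda n: (n.page, -n.y, n.x))
```

Sorting on the exact top bound can misorder notes in multi-column layouts or notes whose tops differ by a fraction of a point, so some tolerance or column detection may be needed in practice.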

