GitHub - joepickrell/pubMunch: various tools to download, convert and process scientific articles

journalLists
lib
scripts
sql
tests/pubParseDb
ucscScripts
.gitignore
README.txt
imgtToFasta
log.txt
problem.txt
pubBedOverlapSearch
pubBlast
pubClassify
pubCompare
pubConvCrawler
pubConvElsevier
pubConvGenbank
pubConvGoogle
pubConvImgt
pubConvMedline
pubConvPdfDir
pubConvPmc
pubConvYif
pubCorrectPublisher.py
pubCountArticles
pubCrawl
pubCronDailyUpdate.sh
pubCronWeeklyUpdate.sh
pubExpMatrix
pubFilter
pubGetElsevier
pubGetMedline
pubGetPmc
pubGroundMutations
pubLoadMysql
pubLoadSqlite
pubMap
pubParseDb
pubPrepCdnaDir
pubPrepCrawlDir
pubPrepMarkerDir
pubRunAnnot
pubRunMap
pubRunMapReduce
pubRunReduce
pubSearch
search
todo.txt


These are the tools that I use for the UCSC Genocoding project, see
http://text.soe.ucsc.edu

Most tool names consist of the prefix "pub", followed by the category and
then the specific publisher or data source. The tool categories are:

- pubGetX = download files from publisher X (medline, pmc, elsevier)
- pubConvX = convert downloaded files to a pub format (tab-separated table
             with fields defined in lib/pubStore.py)
- pubLoadX = load pub format data into a database system (mysql or sqlite)
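
The naming scheme above can be illustrated with a short Python sketch. It is
not part of pubMunch: the function name split_tool_name and the regular
expression are assumptions made for illustration, based only on the three
categories listed above.

```python
import re

# Categories from the list above: pubGetX, pubConvX, pubLoadX.
# Assumption: the source name always starts with an uppercase letter.
CATEGORIES = ("Get", "Conv", "Load")

# "pub" + category + publisher/source, e.g. pubConvElsevier
_TOOL_RE = re.compile(r"^pub(%s)([A-Z]\w*)$" % "|".join(CATEGORIES))


def split_tool_name(name):
    """Return (category, source) for names like 'pubGetMedline', else None."""
    match = _TOOL_RE.match(name)
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    for tool in ("pubGetMedline", "pubConvElsevier", "pubLoadSqlite", "pubCrawl"):
        print(tool, "->", split_tool_name(tool))
```

Names that do not follow the three-part scheme, such as pubCrawl, yield None
in this sketch.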

More general tools are:

- pubPrepX = prepare directory structures, e.g. to download taxon names
        or import gene models from websites like NCBI or UCSC.
- pubRunAnnot = run an annotator from the scripts directory on text data in
             pub format
- pubRunMapReduce = run a map/reduce style job from "scripts" on full text.
- pubCrawl = crawl papers from various publishers; needs a directory set up
             with pubPrepCrawlDir and the journalLists directory
- pubLoadMysql = load pub format files into a MySQL database
- pubMap = complex multi-stage pipeline to find and map markers found in text
           (sequences, SNPs, bands, genes, etc.) to genomic locations
           and create/load bed files into the UCSC browser
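
As a rough illustration of what a "map/reduce style job" means, the following
Python sketch counts in how many articles each word occurs. It does not use
the pubMunch script interface, which is not described in this README; the
names map_text, reduce_articles and run_map_reduce and the sample input are
hypothetical.

```python
from collections import defaultdict


def map_text(article_id, text):
    """Map step: emit (word, article_id) pairs for one article's text."""
    for word in text.lower().split():
        yield word, article_id


def reduce_articles(key, values):
    """Reduce step: count the distinct articles that mention one word."""
    return key, len(set(values))


def run_map_reduce(articles):
    """Run map on every article, group values by key, then reduce each group."""
    grouped = defaultdict(list)
    for article_id, text in articles.items():
        for key, value in map_text(article_id, text):
            grouped[key].append(value)
    return dict(reduce_articles(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = {"a1": "BRCA1 binds BRCA1", "a2": "brca1 mutation"}
    print(run_map_reduce(sample))
```

On a cluster, the map step would typically run in parallel over chunks of
articles and the reduce step would merge the partial results; the sketch runs
both steps in a single process.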

If you plan to use any of these, make sure to go over lib/pubConf.py first.
Most commands need some settings in the config file adapted to your particular
server / cluster system. For example, pubCrawl needs your email address, and
the pubConvX tools need the cluster system and various input/output directories.

Maximilian Haeussler, maximilianh@gmail.com


BUGS to fix:

fixme: illegal DOI landing page
http://www.nature.com/doifinder/10.1046/j.1523-1747.1998.00092.x

URL constructor:
http://www.nature.com/nature/journal/v437/n7062/full/4371102a.html
for DOI  doi:10.1038/4371102a

URL construction for supplemental files:
http://www.nature.com/bjc/journal/v103/n10/suppinfo/6605908s1.html
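
The two URL patterns above could be built along the following lines. This is
a minimal sketch, not the crawler's actual code: the patterns are inferred
only from the two example URLs, the function names are hypothetical, and the
journal abbreviation, volume and issue are assumed to come from the article
metadata. Whether the same pattern holds for every nature.com journal is not
confirmed here.

```python
NATURE_BASE = "http://www.nature.com"


def doi_suffix(doi):
    """Strip an optional 'doi:' prefix and return the part after the first '/'."""
    doi = doi.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:]
    return doi.split("/", 1)[1]


def fulltext_url(journal, volume, issue, article_id):
    """Build a full-text URL following the example pattern above."""
    return "%s/%s/journal/v%s/n%s/full/%s.html" % (
        NATURE_BASE, journal, volume, issue, article_id)


def suppinfo_url(journal, volume, issue, article_id, supp_no=1):
    """Build a supplemental-information URL (article id + 's' + number)."""
    return "%s/%s/journal/v%s/n%s/suppinfo/%ss%d.html" % (
        NATURE_BASE, journal, volume, issue, article_id, supp_no)


if __name__ == "__main__":
    print(fulltext_url("nature", 437, 7062, doi_suffix("doi:10.1038/4371102a")))
    print(suppinfo_url("bjc", 103, 10, "6605908"))
```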

no access page:
http://www.nature.com/nrclinonc/journal/v7/n11/full/nrclinonc.2010.119.html
- in wget, it triggers a 401 error
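
One way to detect such a no-access page is to look at the HTTP status code
instead of parsing the page. The sketch below uses Python's standard urllib
rather than wget; the function names and the returned labels are assumptions
for illustration only.

```python
import urllib.error
import urllib.request


def classify_status(code):
    """Map an HTTP status code to a coarse label for the crawler log."""
    if code == 200:
        return "ok"
    if code == 401:
        # 401 is what the no-access page above triggers in wget
        return "no_access"
    return "error"


def fetch_status(url, timeout=30):
    """Return the HTTP status code for url, including error codes like 401."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses; the code is on the exception
        return err.code
```

A crawler could call classify_status(fetch_status(url)) and skip articles
labeled "no_access" rather than storing the error page as full text.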


cat /cluster/home/max/projects/pubs/crawlDir/rupress/articleMeta.tab | head -n13658 | tail -n2 > problem.txt
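
The same extraction (the two lines ending at line 13658) can be sketched in
Python. The function name extract_lines is hypothetical, and the sketch
assumes the file has at least last_line lines; for shorter files, head/tail
would behave differently.

```python
from itertools import islice


def extract_lines(path, last_line, count):
    """Return the `count` lines ending at 1-based line number `last_line`."""
    first = max(last_line - count, 0)  # 0-based index of the first wanted line
    with open(path, encoding="utf-8", errors="replace") as handle:
        return list(islice(handle, first, last_line))


if __name__ == "__main__":
    import sys

    # Usage: python extract_lines.py articleMeta.tab 13658 2 > problem.txt
    path, last_line, count = sys.argv[1], int(sys.argv[2]), int(sys.argv[3])
    sys.stdout.writelines(extract_lines(path, last_line, count))
```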