cochrane-nlp

Files for the systematic review automation project.

bilearn.py

The main algorithm for co-training/distant supervision.

pipeline.py

Runs the NLP processing: takes in text and outputs dicts of features which can be used by sklearn algorithms.
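As a rough illustration of the "text in, feature dicts out" idea, here is a minimal sketch. The function names (`tokenize`, `token_features`, `text_to_features`) and the choice of features are assumptions made for this example and are not taken from pipeline.py. Lists of dicts in this shape are the kind of input that sklearn's `DictVectorizer` accepts.

```python
import re


def tokenize(text):
    """Split text into lowercase word and number tokens (hypothetical helper)."""
    return re.findall(r"[a-z]+|\d+", text.lower())


def token_features(tokens, i):
    """Build a feature dict for the token at position i."""
    word = tokens[i]
    return {
        "word": word,
        "is_digit": word.isdigit(),
        # Neighbouring tokens, with sentinel values at the boundaries
        "prev_word": tokens[i - 1] if i > 0 else "<START>",
        "next_word": tokens[i + 1] if i < len(tokens) - 1 else "<END>",
    }


def text_to_features(text):
    """Return one feature dict per token in the text."""
    tokens = tokenize(text)
    return [token_features(tokens, i) for i in range(len(tokens))]


features = text_to_features("A total of 120 patients were randomised.")
print(features[3])
```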

indexnumbers.py

Some code I wrote for something else, but it contains a function to efficiently convert numbers in words to integers. It is not perfect: it won't handle big ordinals correctly yet (e.g. "one hundred and second" is converted to "100 and second"), and there are probably other problems too.
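To make the conversion concrete, the following is a minimal sketch of one common way to turn a cardinal number phrase into an integer. It is not the function from indexnumbers.py: the name `words_to_int` and the lookup tables are assumptions for this example, and it converts a whole phrase rather than numbers embedded in running text. Like the limitation described above, it does not handle ordinals, and raises an error for them instead.

```python
UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
SCALES = {"thousand": 1000, "million": 1000000}


def words_to_int(phrase):
    """Convert a cardinal phrase such as 'one hundred and two' to an int."""
    total = 0    # sum of completed groups (thousands, millions)
    current = 0  # value of the group currently being built
    for word in phrase.lower().replace("-", " ").split():
        if word == "and":
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            # A bare "hundred" counts as one hundred
            current = max(current, 1) * 100
        elif word in SCALES:
            total += max(current, 1) * SCALES[word]
            current = 0
        else:
            # Ordinals ("second") and other words are not handled here
            raise ValueError("not a number word: %r" % word)
    return total + current


print(words_to_int("one hundred and two"))
```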

pmlib.py

Manages eutils connections and batch downloads

rm5reader.py

Parses Cochrane review XML files

pmreader.py

Parses PubMed eutils XML output

biviewer.py

Contains BiViewer class which manages use of the parallel corpora by creating a 'list' of (cochrane, pubmed) tuples which can be accessed from memory or disk
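The idea of a list-like object that serves pairs either from memory or from disk can be sketched as below. This is not the BiViewer implementation: the class name `LazyPairs`, its constructor arguments, and the two loader callables are hypothetical, standing in for whatever reads a Cochrane reference and a PubMed record.

```python
class LazyPairs(object):
    """List-like view over (cochrane, pubmed) pairs.

    With in_memory=True every pair is loaded up front; otherwise each
    pair is loaded on demand when it is indexed.
    """

    def __init__(self, links, load_cochrane, load_pubmed, in_memory=False):
        self._links = links
        self._load_cochrane = load_cochrane  # hypothetical loader callable
        self._load_pubmed = load_pubmed      # hypothetical loader callable
        self._cache = None
        if in_memory:
            self._cache = [self._load(link) for link in links]

    def _load(self, link):
        cochrane = self._load_cochrane(link["CDSRfilename"], link["CDSRrefcode"])
        pubmed = self._load_pubmed(link["PMfilename"])
        return (cochrane, pubmed)

    def __len__(self):
        return len(self._links)

    def __getitem__(self, index):
        if self._cache is not None:
            return self._cache[index]
        return self._load(self._links[index])


# Stand-in loaders that only echo their arguments, for demonstration
def fake_cochrane(filename, refcode):
    return "cochrane:%s:%s" % (filename, refcode)


def fake_pubmed(filename):
    return "pubmed:%s" % filename


links = [{"CDSRfilename": "r1", "CDSRrefcode": "A", "PMfilename": "p1"}]
pairs = LazyPairs(links, fake_cochrane, fake_pubmed)
print(pairs[0])
```

Loading on demand keeps memory use low for a large corpus, at the cost of re-reading from disk on every access; the in-memory mode makes the opposite trade.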

data/biviewer_links_all.pck

Pickle file containing linkage data between Cochrane and PubMed, in the following format (used by biviewer.py):

[{"CDSRfilename": cdsr_filename1, "CDSRrefcode": cdsr_refcode1A, "PMfilename": pm_filename1A},
 {"CDSRfilename": cdsr_filename1, "CDSRrefcode": cdsr_refcode1B, "PMfilename": pm_filename1B},
 {"CDSRfilename": cdsr_filename1, "CDSRrefcode": cdsr_refcode1C, "PMfilename": pm_filename1C},
 {"CDSRfilename": cdsr_filename2, "CDSRrefcode": cdsr_refcode2A, "PMfilename": pm_filename2A},
 ...
]
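A minimal sketch of reading a file in this format and grouping the links by review follows. The helper names `load_links` and `group_by_review` are assumptions for this example, and the sample values are made up. If the pickle was written under Python 2, loading it under Python 3 may need an extra argument such as `encoding="latin1"`; that is an untested assumption about this particular file. Only unpickle files from a source you trust.

```python
import pickle
from collections import defaultdict


def load_links(path):
    """Load the list of link dicts from a pickle file."""
    with open(path, "rb") as f:
        return pickle.load(f)


def group_by_review(links):
    """Map each Cochrane review file to its (refcode, PubMed file) pairs."""
    grouped = defaultdict(list)
    for link in links:
        grouped[link["CDSRfilename"]].append(
            (link["CDSRrefcode"], link["PMfilename"])
        )
    return dict(grouped)


# Made-up sample in the documented format, used instead of the real file
example_links = [
    {"CDSRfilename": "review1", "CDSRrefcode": "ref1A", "PMfilename": "pm1A"},
    {"CDSRfilename": "review1", "CDSRrefcode": "ref1B", "PMfilename": "pm1B"},
    {"CDSRfilename": "review2", "CDSRrefcode": "ref2A", "PMfilename": "pm2A"},
]
grouped = group_by_review(example_links)
print(grouped["review1"])
```

With the real data, `group_by_review(load_links("data/biviewer_links_all.pck"))` would be the equivalent call.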

data/test_abstracts.pck

Pickle file containing 137 abstracts with population size manually tagged, as a list of dicts:

[{"test": abstract1_as_str,
  "answer": population_size1_as_int},
 {"test": abstract2_as_str,
  "answer": population_size2_as_int},
 ...
]
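A list in this format lends itself to a simple accuracy check of any population-size extractor. The sketch below assumes Python 3; the baseline `naive_population_size` and the two sample abstracts are invented for illustration and are not part of the repository or its data.

```python
import re


def naive_population_size(abstract):
    """Hypothetical baseline: return the first integer in the text, or None."""
    match = re.search(r"\d+", abstract)
    return int(match.group()) if match else None


def accuracy(examples, predict):
    """Fraction of examples where predict(test) equals the tagged answer."""
    correct = sum(1 for ex in examples if predict(ex["test"]) == ex["answer"])
    return correct / len(examples)


# Made-up examples in the documented format
examples = [
    {"test": "We randomised 120 patients to two groups.", "answer": 120},
    {"test": "In 2004, 58 children were enrolled.", "answer": 58},
]
score = accuracy(examples, naive_population_size)
print(score)  # 0.5: the baseline picks the year 2004 in the second abstract
```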

data/brill_pos_tagger.pck

Brill POS tagger from NLTK, for temporary use; to be replaced soon by a CRFsuite version trained on the MedPost corpus.

