This is a research and exploration pipeline designed to analyze grants, publication abstracts, and other biomedical corpora. While not intended for production, it is used internally within the Office of Portfolio Analysis at the National Institutes of Health.
Everything is run from the file `config.ini`; the defaults should help guide a new project.
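As a rough illustration, a run could load its settings with Python's standard `configparser`. The section name below is an assumption made for this sketch; `input_data_directories` and `target_columns` are the keys referred to in the steps that follow.

```python
# Minimal sketch of reading config.ini; section/option layout is illustrative only.
import configparser

config = configparser.ConfigParser()
config.read("config.ini")

# Hypothetical [import_data] section: where the CSV files live and which columns to embed
data_dirs = config.get("import_data", "input_data_directories", fallback="datasets").split()
targets = config.get("import_data", "target_columns", fallback="abstract title").split()
```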
All CSV files in `input_data_directories` are read, passed through unidecode, and given a reference number.
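A minimal sketch of that import step, assuming pandas and the `unidecode` package are available; the `_ref` column name and the glob pattern are illustrative, not necessarily the pipeline's exact conventions.

```python
import glob
import os

import pandas as pd
from unidecode import unidecode

def import_csv_files(input_dir):
    frames = []
    for path in sorted(glob.glob(os.path.join(input_dir, "*.csv"))):
        df = pd.read_csv(path)
        # Strip non-ASCII characters from every text column
        for col in df.select_dtypes(include="object"):
            df[col] = df[col].map(lambda x: unidecode(x) if isinstance(x, str) else x)
        frames.append(df)
    data = pd.concat(frames, ignore_index=True)
    data["_ref"] = range(len(data))  # sequential reference number for each row
    return data
```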
Imported data are tokenized via a configurable NLP pipeline. The default pipeline includes `replace_phrases`, `remove_parenthesis`, `replace_from_dictionary`, `token_replacement`, `decaps_text`, and `pos_tokenizer`.
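The steps themselves are defined inside the pipeline; the sketch below only shows how such configurable steps can be composed, with simplified stand-ins for two of them (the real `remove_parenthesis` and `decaps_text` may behave differently).

```python
import re

def remove_parenthesis(text):
    # Drop parenthetical asides, e.g. "(see note)" -- simplified stand-in
    return re.sub(r"\([^)]*\)", " ", text)

def decaps_text(text):
    # Lowercase words unless they look like acronyms -- simplified guess at the step's intent
    return " ".join(w if w.isupper() else w.lower() for w in text.split())

def run_pipeline(text, steps):
    # Apply each configured step to the text in turn
    for step in steps:
        text = step(text)
    return text

print(run_pipeline("The NIH (see note) Funds Research", [remove_parenthesis, decaps_text]))
```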
The selected `target_columns` are fed into word2vec (implemented by gensim) and a word embedding is trained.
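A minimal sketch of that embedding step with gensim's `Word2Vec`; the parameter values here are illustrative, not the pipeline's configured defaults.

```python
from gensim.models import Word2Vec

# Each document is the list of tokens produced by the NLP pipeline above
tokenized_docs = [
    ["cancer", "immunotherapy", "clinical", "trial"],
    ["gene", "expression", "analysis"],
]

model = Word2Vec(
    sentences=tokenized_docs,
    vector_size=300,   # embedding dimension (illustrative setting)
    window=5,
    min_count=1,
    workers=4,
)
vector = model.wv["cancer"]  # 300-dimensional vector for the token "cancer"
```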
Documents are scored by several methods; the current options are `locality_hash`, `unique_TF`, `simple_TF`, `simple`, and `unique`.
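As a hedged sketch in the spirit of the `simple` method, a document can be represented by the average of its tokens' word vectors. How each listed method actually weights terms (by frequency, uniqueness, or hashing) is not shown here and is an assumption.

```python
import numpy as np

def simple_document_vector(tokens, word_vectors, dim):
    """Average the vectors of known tokens; zero vector if none are known."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Toy 4-dimensional word vectors, for illustration only
word_vectors = {"cancer": np.ones(4), "trial": np.full(4, 0.5)}
print(simple_document_vector(["cancer", "trial", "unknown"], word_vectors, dim=4))
```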
You can predict over other columns in the data using a random forest. A meta-method that takes the predictions of the other classifiers as input will be built as well.
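A sketch of predicting another column from document vectors with scikit-learn's random forest; the features and labels below are random stand-ins for the pipeline's actual document scores and metadata columns.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 300))                     # stand-in document vectors
y = rng.choice(["clinical", "basic_science"], 100)  # hypothetical target column

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))
```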
Similar to batch K-means, clustering is run on subsets of the data and the resulting centroids are clustered at the end. This is often much faster than clustering the full dataset at once.
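An illustrative two-stage scheme in that spirit, using scikit-learn's k-means on subsets and then clustering the per-subset centroids; the chunk size and cluster counts are arbitrary assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def two_stage_kmeans(X, chunk_size=1000, n_sub=20, n_final=10):
    centroids = []
    for start in range(0, len(X), chunk_size):
        chunk = X[start:start + chunk_size]
        k = min(n_sub, len(chunk))
        km = KMeans(n_clusters=k, n_init=10).fit(chunk)  # cluster one subset
        centroids.append(km.cluster_centers_)
    centroids = np.vstack(centroids)
    # Cluster the per-subset centroids to form the final clusters
    return KMeans(n_clusters=min(n_final, len(centroids)), n_init=10).fit(centroids)

X = np.random.default_rng(0).normal(size=(5000, 50))  # stand-in document vectors
final = two_stage_kmeans(X)
```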
Finally, a higher-level description of the clusters found during metaclustering is produced. Cluster dispersion, cluster descriptions, and labeling can be found in `results/`.
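One possible way to compute a cluster-dispersion summary is the mean distance of each cluster's members to its centroid; the exact definition the pipeline writes to `results/` may differ.

```python
import numpy as np

def cluster_dispersion(X, labels, centroids):
    """Mean distance of each cluster's members to its centroid."""
    return {int(c): float(np.linalg.norm(X[labels == c] - centroids[c], axis=1).mean())
            for c in np.unique(labels)}

# e.g. cluster_dispersion(X, final.labels_, final.cluster_centers_) using the objects above
```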