
karthi2016/pipeline_word2vec

 
 


w2v pipeline

This is a research and exploration pipeline designed to analyze grants, publication abstracts, and other biomedical corpora. While not designed for production, it is used internally within the Office of Portfolio Analysis at the National Institutes of Health.

Every step is configured through the file config.ini; its defaults should help guide a new project.
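As a rough illustration of how a step might read its settings, the sketch below parses an INI fragment with Python's standard configparser. The section names and values are assumptions made for the example; only the option names input_data_directories and target_columns appear in this README, and the real config.ini may be laid out differently.

```python
import configparser

# Hypothetical config.ini contents: section names and values are
# illustrative assumptions, not the project's actual defaults.
EXAMPLE_CONFIG = """
[import_data]
input_data_directories = datasets, extra_datasets

[embedding]
target_columns = abstract, title
"""


def split_list(value):
    """Split a comma-separated option value into a list of stripped names."""
    return [item.strip() for item in value.split(",") if item.strip()]


config = configparser.ConfigParser()
config.read_string(EXAMPLE_CONFIG)

input_dirs = split_list(config["import_data"]["input_data_directories"])
target_columns = split_list(config["embedding"]["target_columns"])
print(input_dirs, target_columns)
```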

fab import_data

All CSV files in input_data_directories are read, passed through unidecode and given a reference number.
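The import step can be pictured roughly as follows. This is a minimal sketch, not the pipeline's own code: the function import_rows and the column name _ref are hypothetical, and a standard-library fallback stands in for unidecode when that package is not installed.

```python
import csv
import io
import unicodedata

try:
    from unidecode import unidecode  # the library named in this README
except ImportError:
    def unidecode(text):
        """Rough stand-in: strip accents and drop remaining non-ASCII."""
        decomposed = unicodedata.normalize("NFKD", text)
        return decomposed.encode("ascii", "ignore").decode("ascii")


def import_rows(csv_text, start_ref=0):
    """Read CSV text, transliterate each field, attach a reference number."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for ref, row in enumerate(reader, start=start_ref):
        clean = {key: unidecode(value) for key, value in row.items()}
        clean["_ref"] = ref  # hypothetical column name for the reference
        rows.append(clean)
    return rows


sample = "title,abstract\nCaf\u00e9 study,Na\u00efve T cells\n"
rows = import_rows(sample)
print(rows)
```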

fab parse

Imported data are tokenized via a configurable NLP pipeline. The default pipeline includes replace_phrases, remove_parenthesis, replace_from_dictionary, token_replacement, decaps_text, pos_tokenizer.
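A configurable pipeline of this kind can be sketched as an ordered list of text-to-text functions. The two functions below borrow the names remove_parenthesis and decaps_text, but they are simplified stand-ins written for this example; the behaviour of the real steps may differ, and run_pipeline is a hypothetical helper.

```python
import re


def remove_parenthesis(text):
    """Stand-in: drop parenthesised spans (nesting is not handled)."""
    return re.sub(r"\s*\([^()]*\)", "", text)


def decaps_text(text):
    """Stand-in: lowercase words whose only capital is the first letter."""
    def fix(word):
        if word[:1].isupper() and word[1:].islower():
            return word.lower()
        return word  # acronyms such as DNA are left untouched
    return " ".join(fix(w) for w in text.split())


def run_pipeline(text, steps):
    """Apply each step in order, feeding the output of one into the next."""
    for step in steps:
        text = step(text)
    return text


PIPELINE = [remove_parenthesis, decaps_text]
result = run_pipeline("Heart failure (HF) affects DNA repair", PIPELINE)
print(result)
```

Keeping each step a plain function makes it easy to reorder or drop steps from configuration.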

fab embedding

The selected target_columns are fed into word2vec (implemented by gensim) and an embedding layer is trained.

fab score

Documents are scored by several methods. The currently available methods are locality_hash, unique_TF, simple_TF, simple, and unique.
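To make the idea of scoring a document from word vectors concrete, here is a small sketch of two averaging schemes. It assumes that simple averages the vector of every token and that unique averages each distinct token once; that reading of the method names is an assumption, and the toy embedding and function names are invented for the example.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def score_simple(tokens, embedding):
    """Average the vectors of all in-vocabulary tokens, repeats included."""
    return mean_vector([embedding[t] for t in tokens if t in embedding])


def score_unique(tokens, embedding):
    """Average the vectors of distinct in-vocabulary tokens only."""
    distinct = dict.fromkeys(t for t in tokens if t in embedding)
    return mean_vector([embedding[t] for t in distinct])


# Toy two-dimensional embedding, purely illustrative.
TOY_EMBEDDING = {"gene": [1.0, 0.0], "therapy": [0.0, 1.0]}
doc = ["gene", "gene", "gene", "therapy", "unknownword"]

simple_vec = score_simple(doc, TOY_EMBEDDING)
unique_vec = score_unique(doc, TOY_EMBEDDING)
print(simple_vec, unique_vec)
```

In this toy case the repeated token pulls the simple score toward its vector, while the unique score weights both words equally.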

fab predict

You can predict over other columns in the data using a random forest. A meta-method that uses the inputs from the other classifiers will be built as well.

fab metacluster

Similar to batch K-means, clustering is run on subsets and the centroids are clustered at the end. This is often much faster than standard clustering.
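The two-stage idea can be sketched in plain Python: run k-means on each subset, pool the resulting centroids, and run k-means once more on the pool. This is an illustrative sketch on synthetic 2-D points, not the pipeline's implementation; the function names, the subset count, and the use of Lloyd's algorithm at both stages are assumptions.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, rng, iterations=20):
    """Plain Lloyd's k-means; returns k centroids as tuples."""
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: squared_distance(p, centroids[i]))
            groups[nearest].append(p)
        # Keep the old centroid if a group ends up empty.
        centroids = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


def metacluster(points, k, n_subsets, rng):
    """Cluster each subset, then cluster the pooled subset centroids."""
    shuffled = points[:]
    rng.shuffle(shuffled)
    subsets = [shuffled[i::n_subsets] for i in range(n_subsets)]
    pooled = [c for subset in subsets for c in kmeans(subset, k, rng)]
    return kmeans(pooled, k, rng)


rng = random.Random(0)
# Two synthetic, well-separated blobs around (0, 0) and (10, 10).
blob_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(40)]
blob_b = [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(40)]
final = metacluster(blob_a + blob_b, k=2, n_subsets=4, rng=rng)
print(sorted(final))
```

The final pass only sees the pooled centroids (here 8 points instead of 80), which is where the speed-up over clustering the full data set would come from.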

fab analyze_metaclusters

Returns a higher-level description of the clusters found during metaclustering. Cluster dispersion, cluster descriptions, and labeling are written to results/.

About

Pipeline to turn input text into a w2v embedding.
