w2v pipeline

This is a research and exploration pipeline designed to analyze grants, publication abstracts, and other biomedical corpora. While not designed for production, it is used internally within the Office of Portfolio Analysis at the National Institutes of Health.

Everything is controlled by the file config.ini; the defaults should help guide a new project.
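As a rough illustration of how an INI file can drive the steps below, here is a minimal sketch using Python's standard `configparser`. The section names (`import_data`, `embedding`) and the example values are assumptions made for this example and may not match the layout of the shipped config.ini; only the option names `input_data_directories` and `target_columns` come from this README.

```python
import configparser

# Hypothetical config text: section names and values are illustrative
# assumptions, not the pipeline's actual schema.
EXAMPLE_CONFIG = """
[import_data]
input_data_directories = datasets/grants, datasets/abstracts

[embedding]
target_columns = abstract, title
"""

def load_list_option(config, section, option):
    """Split a comma-separated option into a list of stripped strings."""
    raw = config.get(section, option)
    return [item.strip() for item in raw.split(",") if item.strip()]

config = configparser.ConfigParser()
config.read_string(EXAMPLE_CONFIG)

directories = load_list_option(config, "import_data", "input_data_directories")
columns = load_list_option(config, "embedding", "target_columns")
print(directories)
print(columns)
```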

`fab import_data`

All CSV files in input_data_directories are read, passed through unidecode and given a reference number.
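The import step can be pictured roughly as follows. This is a sketch under stated assumptions, not the pipeline's own code: the `_ref` key and the `import_rows` helper are hypothetical names, and a standard-library fallback is used if the `unidecode` package is not installed.

```python
import csv
import io
import unicodedata

try:
    from unidecode import unidecode  # third-party package named in the text
except ImportError:
    # Fallback for this sketch only: strip accents with the standard library.
    def unidecode(text):
        decomposed = unicodedata.normalize("NFKD", text)
        return decomposed.encode("ascii", "ignore").decode("ascii")

def import_rows(csv_file, start_ref=0):
    """Read CSV rows, transliterate every field to ASCII, and attach a
    sequential reference number (the `_ref` key is a hypothetical name)."""
    rows = []
    for offset, row in enumerate(csv.DictReader(csv_file)):
        clean = {key: unidecode(value) for key, value in row.items()}
        clean["_ref"] = start_ref + offset
        rows.append(clean)
    return rows

# An in-memory file stands in for a CSV found in input_data_directories.
sample = io.StringIO("title,abstract\nCafé study,Naïve T cells\nSecond,Plain text\n")
rows = import_rows(sample)
print(rows[0])
```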

`fab parse`

Imported data are tokenized via a configurable NLP pipeline. The default pipeline includes replace_phrases, remove_parenthesis, replace_from_dictionary, token_replacement, decaps_text, pos_tokenizer.
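One way to think about a configurable pipeline is as an ordered list of text-to-text functions. The sketch below borrows three of the step names above, but the function bodies are simplified stand-ins written for this example; the behavior of the real steps may differ, and `run_pipeline` is a hypothetical helper.

```python
import re

def remove_parenthesis(text):
    """Stand-in: drop parenthesized spans, 'cells (in vitro)' -> 'cells'."""
    return re.sub(r"\s*\([^()]*\)", "", text)

def replace_from_dictionary(text, mapping):
    """Stand-in: replace whole-word matches using a lookup table."""
    for source, target in mapping.items():
        text = re.sub(r"\b%s\b" % re.escape(source), target, text)
    return text

def decaps_text(text):
    """Stand-in: lowercase words with only a leading capital, keep acronyms."""
    def fix(match):
        word = match.group(0)
        return word.lower() if word[1:].islower() else word
    return re.sub(r"\b[A-Z][A-Za-z]+\b", fix, text)

def run_pipeline(text, steps):
    """Apply each step in order; reordering `steps` reconfigures the pipeline."""
    for step in steps:
        text = step(text)
    return text

steps = [
    remove_parenthesis,
    lambda t: replace_from_dictionary(t, {"tumour": "tumor"}),
    decaps_text,
    str.split,  # final step: whitespace tokenization
]
tokens = run_pipeline("The tumour (grade II) expresses DNA repair genes", steps)
print(tokens)
```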

`fab embedding`

The selected target_columns are fed into word2vec (implemented by gensim), and an embedding layer is trained.

`fab score`

Documents are scored by several methods. The currently available methods are locality_hash, unique_TF, simple_TF, simple, and unique.
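To make the idea of scoring a document from word vectors concrete, here is a small sketch. It assumes, without confirmation from the source, that a "simple" score averages the vectors of every token and a "unique" score averages over distinct tokens only; the toy `embedding` dictionary and the `score_document` helper are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def score_document(tokens, embedding, unique=False):
    """Average the word vectors of a document.

    unique=False counts every occurrence (assumed meaning of "simple");
    unique=True counts each distinct token once (assumed meaning of "unique").
    Tokens missing from the embedding are skipped.
    """
    known = [t for t in tokens if t in embedding]
    if unique:
        known = sorted(set(known))
    if not known:
        return None
    return mean_vector([embedding[t] for t in known])

# Toy two-dimensional embedding, purely illustrative.
embedding = {"tumor": [1.0, 0.0], "gene": [0.0, 1.0]}
doc = ["tumor", "tumor", "tumor", "gene", "unknown"]

simple_score = score_document(doc, embedding)
unique_score = score_document(doc, embedding, unique=True)
print(simple_score)  # [0.75, 0.25]
print(unique_score)  # [0.5, 0.5]
```

Repeated words pull the "simple" average toward themselves, while the "unique" variant treats the document as a set of terms.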

`fab predict`

You can predict over other columns in the data using a random forest. A meta-method that uses the inputs from the other classifiers will be built as well.

`fab metacluster`

Similar to batch K-means, clustering is run on subsets and the centroids are clustered at the end. This is often much faster than standard clustering.
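The two-stage idea can be sketched in plain Python: cluster each subset, then cluster the collected centroids. This is an illustrative sketch, not the pipeline's implementation; the function names are hypothetical, and the farthest-point seeding is a choice made here to keep the example deterministic.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_point_init(points, k):
    """Deterministic seeding: start from the first point, then repeatedly
    add the point farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        ))
    return centroids

def kmeans(points, k, iterations=10):
    """Plain Lloyd's k-means; assumes len(points) >= k."""
    centroids = farthest_point_init(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            groups[nearest].append(p)
        for i, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(axis) / len(group) for axis in zip(*group))
    return centroids

def metacluster(points, k, n_subsets, seed=0):
    """Cluster each subset separately, then cluster the pooled centroids."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    local_centroids = []
    for i in range(n_subsets):
        local_centroids.extend(kmeans(shuffled[i::n_subsets], k))
    return kmeans(local_centroids, k)

# Two small, well-separated grids of points near (0, 0) and (10, 10).
blob_a = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(4)]
blob_b = [(10 + 0.1 * i, 10 + 0.1 * j) for i in range(5) for j in range(4)]
centroids = sorted(metacluster(blob_a + blob_b, k=2, n_subsets=4))
print(centroids)
```

Each call to `kmeans` only sees a fraction of the data, and the final pass runs on `k * n_subsets` centroids rather than on every point, which is where the speedup would come from.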

`fab analyze_metaclusters`

Returns a higher level description of the clusters found during the metaclustering. Cluster dispersion, cluster descriptions, and labeling will be found in results/.
