This is a research and exploration pipeline designed to analyze grants, publication abstracts, and other biomedical corpora. While not intended for production, it is used internally within the Office of Portfolio Analysis at the National Institutes of Health.
Everything is run from the file `config.ini`; the defaults should help guide a new project.
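As a rough illustration, a run could load its settings with Python's standard `configparser`. The section name below is an assumption made for this sketch; `input_data_directories` and `target_columns` are the keys referred to in the steps that follow.

```python
# Minimal sketch of reading config.ini; section/option layout is illustrative only.
import configparser

config = configparser.ConfigParser()
config.read("config.ini")

# Hypothetical [import_data] section: where the CSV files live and which columns to embed
data_dirs = config.get("import_data", "input_data_directories", fallback="datasets").split()
targets = config.get("import_data", "target_columns", fallback="abstract title").split()
```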
All CSV files in `input_data_directories` are read, passed through unidecode, and given a reference number.
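A minimal sketch of that import step, assuming pandas and the `unidecode` package are available; the `_ref` column name and the glob pattern are illustrative, not necessarily the pipeline's exact conventions.

```python
import glob
import os

import pandas as pd
from unidecode import unidecode

def import_csv_files(input_dir):
    frames = []
    for path in sorted(glob.glob(os.path.join(input_dir, "*.csv"))):
        df = pd.read_csv(path)
        # Strip non-ASCII characters from every text column
        for col in df.select_dtypes(include="object"):
            df[col] = df[col].map(lambda x: unidecode(x) if isinstance(x, str) else x)
        frames.append(df)
    data = pd.concat(frames, ignore_index=True)
    data["_ref"] = range(len(data))  # sequential reference number for each row
    return data
```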
Imported data are tokenized via a configurable NLP pipeline. The default pipeline includes `replace_phrases`, `remove_parenthesis`, `replace_from_dictionary`, `token_replacement`, `decaps_text`, and `pos_tokenizer`.
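The steps themselves are defined inside the pipeline; the sketch below only shows how such configurable steps can be composed, with simplified stand-ins for two of them (the real `remove_parenthesis` and `decaps_text` may behave differently).

```python
import re

def remove_parenthesis(text):
    # Drop parenthetical asides, e.g. "(see note)" -- simplified stand-in
    return re.sub(r"\([^)]*\)", " ", text)

def decaps_text(text):
    # Lowercase words unless they look like acronyms -- simplified guess at the step's intent
    return " ".join(w if w.isupper() else w.lower() for w in text.split())

def run_pipeline(text, steps):
    # Apply each configured step to the text in turn
    for step in steps:
        text = step(text)
    return text

print(run_pipeline("The NIH (see note) Funds Research", [remove_parenthesis, decaps_text]))
```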
The selected `target_columns` are fed into word2vec (implemented by gensim) and a word embedding is trained.
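A minimal sketch of that embedding step with gensim's `Word2Vec`; the parameter values here are illustrative, not the pipeline's configured defaults.

```python
from gensim.models import Word2Vec

# Each document is the list of tokens produced by the NLP pipeline above
tokenized_docs = [
    ["cancer", "immunotherapy", "clinical", "trial"],
    ["gene", "expression", "analysis"],
]

model = Word2Vec(
    sentences=tokenized_docs,
    vector_size=300,   # embedding dimension (illustrative setting)
    window=5,
    min_count=1,
    workers=4,
)
vector = model.wv["cancer"]  # 300-dimensional vector for the token "cancer"
```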
Documents are scored by several methods; the current options are `locality_hash`, `unique_TF`, `simple_TF`, `simple`, and `unique`.
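As a hedged sketch in the spirit of the `simple` method, a document can be represented by the average of its tokens' word vectors. How each listed method actually weights terms (by frequency, uniqueness, or hashing) is not shown here and is an assumption.

```python
import numpy as np

def simple_document_vector(tokens, word_vectors, dim):
    """Average the vectors of known tokens; zero vector if none are known."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Toy 4-dimensional word vectors, for illustration only
word_vectors = {"cancer": np.ones(4), "trial": np.full(4, 0.5)}
print(simple_document_vector(["cancer", "trial", "unknown"], word_vectors, dim=4))
```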
You can predict over other columns in the data using a random forest. A meta-method that takes the predictions of the other classifiers as input will be built as well.
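A sketch of predicting another column from document vectors with scikit-learn's random forest; the features and labels below are random stand-ins for the pipeline's actual document scores and metadata columns.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 300))                     # stand-in document vectors
y = rng.choice(["clinical", "basic_science"], 100)  # hypothetical target column

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))
```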
Similar to batch K-means, clustering is run on subsets of the data and the resulting centroids are clustered at the end. This is often much faster than clustering the full dataset at once.
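An illustrative two-stage scheme in that spirit, using scikit-learn's k-means on subsets and then clustering the per-subset centroids; the chunk size and cluster counts are arbitrary assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def two_stage_kmeans(X, chunk_size=1000, n_sub=20, n_final=10):
    centroids = []
    for start in range(0, len(X), chunk_size):
        chunk = X[start:start + chunk_size]
        k = min(n_sub, len(chunk))
        km = KMeans(n_clusters=k, n_init=10).fit(chunk)  # cluster one subset
        centroids.append(km.cluster_centers_)
    centroids = np.vstack(centroids)
    # Cluster the per-subset centroids to form the final clusters
    return KMeans(n_clusters=min(n_final, len(centroids)), n_init=10).fit(centroids)

X = np.random.default_rng(0).normal(size=(5000, 50))  # stand-in document vectors
final = two_stage_kmeans(X)
```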
Finally, a higher-level description of the clusters found during metaclustering is produced. Cluster dispersion, cluster descriptions, and labeling can be found in `results/`.
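One possible way to compute a cluster-dispersion summary is the mean distance of each cluster's members to its centroid; the exact definition the pipeline writes to `results/` may differ.

```python
import numpy as np

def cluster_dispersion(X, labels, centroids):
    """Mean distance of each cluster's members to its centroid."""
    return {int(c): float(np.linalg.norm(X[labels == c] - centroids[c], axis=1).mean())
            for c in np.unique(labels)}

# e.g. cluster_dispersion(X, final.labels_, final.cluster_centers_) using the objects above
```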