This pipeline takes raw JSTOR OCR text and expert-generated seed dictionaries, trains word embeddings, uses them to expand the dictionaries, computes document/dictionary cosine similarities, and visualizes the resulting trends over time.
- script for training full and decade-specific word2vec models: word2vec/word2vec_train.py
- template notebook for exploring word2vec training, including a full preprocessing workflow and decade-specific models: word2vec/w2v_nb_template_workflow.ipynb
- notebook that uses the w2v model to expand the seed vocabulary: refine_dictionaries/expand_dictonary_and_visualize.ipynb
- notebook that uses the w2v model to expand the seed vocabulary for each decade and visualize the expansions with t-SNE: refine_dictionaries/refine_dict.ipynb
- prevalence of theories over time using word counts of expanded dictionaries (not normalized): validate/plot_engagement.ipynb
- prevalence of theories over time using word counts and cosine scores of seed dictionaries (normalized): validate/viz_patterns_coredict_ratios_cosines_normalized.ipynb
- correlations between theories via cosine similarities of core dictionaries: validate/correlations_cosine_coredict_by_year.ipynb
- notebook to view hierarchical clusters based on the seed dictionaries, split by decade:
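The core of the pipeline above (train word2vec, expand a seed dictionary with nearest neighbors, then score documents against the dictionary by cosine similarity) can be sketched as follows. This is a minimal illustration, not the notebooks' actual code: the toy corpus, seed terms, and all parameter values are invented for the example.

```python
# Sketch of the expand-and-score workflow. Corpus and seed terms are
# stand-ins for the preprocessed JSTOR text and expert dictionaries.
import numpy as np
from gensim.models import Word2Vec

# Toy stand-in for tokenized, preprocessed document text.
corpus = [
    ["social", "theory", "structure", "agency"],
    ["rational", "choice", "theory", "action"],
    ["structure", "agency", "practice", "habitus"],
    ["network", "ties", "structure", "social"],
] * 50  # repeat so the tiny vocabulary gets enough training examples

model = Word2Vec(corpus, vector_size=32, window=3, min_count=1,
                 workers=1, seed=42, epochs=20)

def expand_dictionary(seed_terms, topn=3):
    """Add each seed term's nearest embedding-space neighbors."""
    expanded = set(seed_terms)
    for term in seed_terms:
        if term in model.wv:
            expanded.update(w for w, _ in model.wv.most_similar(term, topn=topn))
    return expanded

def doc_dict_cosine(doc_tokens, dictionary):
    """Cosine similarity between mean doc vector and mean dict vector."""
    doc_vecs = [model.wv[t] for t in doc_tokens if t in model.wv]
    dict_vecs = [model.wv[t] for t in dictionary if t in model.wv]
    d, k = np.mean(doc_vecs, axis=0), np.mean(dict_vecs, axis=0)
    return float(np.dot(d, k) / (np.linalg.norm(d) * np.linalg.norm(k)))

seed = ["structure", "agency"]
expanded = expand_dictionary(seed)
score = doc_dict_cosine(["social", "structure", "practice"], expanded)
```

Decade-specific analysis would repeat the same steps on per-decade corpus slices, yielding one model (and one expanded dictionary) per decade.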
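The "normalized" prevalence measure mentioned above (dictionary hits divided by total token counts, aggregated by year) can be sketched like this. The data frame, column names, and dictionary are hypothetical, chosen only to show the normalization step.

```python
# Sketch of normalized prevalence: per-year dictionary word counts
# divided by per-year total token counts. All data here is invented.
import pandas as pd

docs = pd.DataFrame({
    "year": [1960, 1960, 1970, 1970],
    "tokens": [["structure", "agency", "social"],
               ["network", "ties"],
               ["structure", "structure", "action"],
               ["choice", "theory", "action"]],
})
dictionary = {"structure", "agency"}

# Count dictionary hits and total tokens per document, then sum by year.
docs["dict_hits"] = docs["tokens"].apply(lambda t: sum(w in dictionary for w in t))
docs["n_tokens"] = docs["tokens"].apply(len)
by_year = docs.groupby("year")[["dict_hits", "n_tokens"]].sum()
by_year["prevalence"] = by_year["dict_hits"] / by_year["n_tokens"]
```

Normalizing by total token count keeps prevalence comparable across years with very different publication volumes, which is why the un-normalized word-count plots and the normalized ones are kept in separate notebooks.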