This pipeline takes raw JSTOR OCR text and expert-generated seed dictionaries, trains word embeddings, uses them to expand the dictionaries, computes document/dictionary cosine similarities, and visualizes the resulting trends over time.
- script for training full and decade-specific word2vec models: word2vec/word2vec_train.py
- template notebook for exploring word2vec training, including a full preprocessing workflow and decade-specific models: word2vec/w2v_nb_template_workflow.ipynb
- notebook that uses the w2v model to expand the seed vocabulary: refine_dictionaries/expand_dictonary_and_visualize.ipynb
- notebook that uses the w2v model to expand the seed vocabulary for each decade and visualize the expansions with t-SNE: refine_dictionaries/refine_dict.ipynb
- prevalence of theories over time using word counts of expanded dictionaries (not normalized): validate/plot_engagement.ipynb
- prevalence of theories over time using word counts and cosine scores of seed dictionaries (normalized): validate/viz_patterns_coredict_ratios_cosines_normalized.ipynb
- correlations between theories via cosine similarities of core dictionaries: validate/correlations_cosine_coredict_by_year.ipynb
- notebook to view hierarchical clusters based on the seed dictionaries, split by decade:
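The core of the pipeline above (train word2vec, expand a seed dictionary with nearest neighbors, then score documents against the dictionary by cosine similarity) can be sketched as follows. This is a minimal illustration, not the notebooks' actual code: the toy corpus, seed terms, and all parameter values are invented for the example.

```python
# Sketch of the expand-and-score workflow. Corpus and seed terms are
# stand-ins for the preprocessed JSTOR text and expert dictionaries.
import numpy as np
from gensim.models import Word2Vec

# Toy stand-in for tokenized, preprocessed document text.
corpus = [
    ["social", "theory", "structure", "agency"],
    ["rational", "choice", "theory", "action"],
    ["structure", "agency", "practice", "habitus"],
    ["network", "ties", "structure", "social"],
] * 50  # repeat so the tiny vocabulary gets enough training examples

model = Word2Vec(corpus, vector_size=32, window=3, min_count=1,
                 workers=1, seed=42, epochs=20)

def expand_dictionary(seed_terms, topn=3):
    """Add each seed term's nearest embedding-space neighbors."""
    expanded = set(seed_terms)
    for term in seed_terms:
        if term in model.wv:
            expanded.update(w for w, _ in model.wv.most_similar(term, topn=topn))
    return expanded

def doc_dict_cosine(doc_tokens, dictionary):
    """Cosine similarity between mean doc vector and mean dict vector."""
    doc_vecs = [model.wv[t] for t in doc_tokens if t in model.wv]
    dict_vecs = [model.wv[t] for t in dictionary if t in model.wv]
    d, k = np.mean(doc_vecs, axis=0), np.mean(dict_vecs, axis=0)
    return float(np.dot(d, k) / (np.linalg.norm(d) * np.linalg.norm(k)))

seed = ["structure", "agency"]
expanded = expand_dictionary(seed)
score = doc_dict_cosine(["social", "structure", "practice"], expanded)
```

Decade-specific analysis would repeat the same steps on per-decade corpus slices, yielding one model (and one expanded dictionary) per decade.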
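The "normalized" prevalence measure mentioned above (dictionary hits divided by total token counts, aggregated by year) can be sketched like this. The data frame, column names, and dictionary are hypothetical, chosen only to show the normalization step.

```python
# Sketch of normalized prevalence: per-year dictionary word counts
# divided by per-year total token counts. All data here is invented.
import pandas as pd

docs = pd.DataFrame({
    "year": [1960, 1960, 1970, 1970],
    "tokens": [["structure", "agency", "social"],
               ["network", "ties"],
               ["structure", "structure", "action"],
               ["choice", "theory", "action"]],
})
dictionary = {"structure", "agency"}

# Count dictionary hits and total tokens per document, then sum by year.
docs["dict_hits"] = docs["tokens"].apply(lambda t: sum(w in dictionary for w in t))
docs["n_tokens"] = docs["tokens"].apply(len)
by_year = docs.groupby("year")[["dict_hits", "n_tokens"]].sum()
by_year["prevalence"] = by_year["dict_hits"] / by_year["n_tokens"]
```

Normalizing by total token count keeps prevalence comparable across years with very different publication volumes, which is why the un-normalized word-count plots and the normalized ones are kept in separate notebooks.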