Tools for analysis of the gistr data, and generation of null data.
Using the Fish shell with virtualfish and Python 3.6+:
```fish
# Set up the base environment
vf new -p (which python3) interpretation-experiment.analysis
pip install -r requirements.txt
# Get the necessary nltk and spaCy data
python -m nltk.downloader punkt averaged_perceptron_tagger wordnet stopwords brown cmudict
python -m spacy download en_core_web_md
# Check out all the spreadr submodules and set up their environments
git submodule update --init
./setup_envs.sh
```
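As a quick sanity check after installation, you can verify that the core packages resolve before moving on. This is a minimal sketch using only the standard library; the package list is illustrative, not the full contents of `requirements.txt`:

```python
import importlib.util


def missing_packages(names):
    """Return the subset of package names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]


# Illustrative subset of the packages the environment should provide
print(missing_packages(["nltk", "spacy"]))
```

An empty list means the listed packages are importable; anything printed is still missing from the environment.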
Note that experiment 1 was run with spreadr@0.10.2, but the analysis here runs on spreadr@0.11.0 (for compatibility with Python 3.6, and because of some changes in the way settings are managed). This doesn't seem to lead to any problems.
Experiment data importing:

- Get a `.sql` file of the whole database. Back it up.
- Edit your `.sql` file to set the database name you want to create locally to hold the data (in what follows we'll use `spreadr_XXX`). There are three replacements to make (a comment, the `CREATE DATABASE` statement, and the `USE` statement).
- Import the data into MySQL: `mysql -u root < db.sql`
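The three database-name replacements can also be scripted instead of edited by hand. This is a hedged sketch using Python's `re`; the dump excerpt and the `spreadr_exp1` target name are placeholders, and the pattern assumes the original name starts with `spreadr`, so adjust it to match your dump:

```python
import re


def rename_database(sql_text, new_name):
    """Replace the database name in the dump's comment, CREATE DATABASE
    statement, and USE statement (first three occurrences). Assumes the
    original name starts with 'spreadr'; adapt the pattern otherwise."""
    return re.sub(r"\bspreadr\w*\b", new_name, sql_text, count=3)


# Hypothetical mysqldump-style header
dump = (
    "-- Current Database: `spreadr_prod`\n"
    "CREATE DATABASE /*!32312 IF NOT EXISTS*/ `spreadr_prod`;\n"
    "USE `spreadr_prod`;\n"
)
print(rename_database(dump, "spreadr_exp1"))
```

Always run this on a copy of the dump, per the backup advice above.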
User setup (in a MySQL shell, i.e. `mysql -u root`):

- Create the analysis user: `CREATE USER 'spreadr_analysis'@'localhost';`
- Grant it all privileges (change the database name to your own): `GRANT ALL ON spreadr_XXX.* TO 'spreadr_analysis'@'localhost';`
Use `python -m analysis.cli --help` to explore what's possible.
Right now there are two parts in the code:

- Commands (in `analysis/commands/`) accessible through the CLI, which are used to preprocess, code, or generate data
- Library code in `analysis/`, which serves as utilities for the notebooks in all the `notebooks/*` folders.
These correspond to three steps you should follow:

1. Generate the language models needed by the notebooks if they're not already in `data/models/`, or if outdated:

   ```sh
   for n in 1 2 3; do
     for type in word tag; do
       python -m analysis.cli load --db nothing language_model $n $type
     done
   done
   ```

2. Code the data with any needed codings (mostly spam, but this could be any sentence-level feature): use `python -m analysis.cli load --db DB_NAME sentences_to_codable_csv CODING OUTFILE.csv` and follow the instructions.

3. Open Jupyter and run any of the notebooks you want. They use the spreadr models, but augment them with utilities and any sentence codings you provided in the previous step, thanks to the code in `analysis/`.
TODO: use a stub database for tests, instead of the real databases.