GitHub - sTranaeus/tweet_depress: Developed to run topic models with priors on twitter depression data for the 2015 CLPsych Shared Task

To run, use "python run.py [OPTIONS]". The only dependency is segan.jar, which can be checked out and built from https://github.com/vietansegan/segan and should be placed at BASE_DIR/segan/dist/segan.jar. If the Weka classifier is to be used, uncomment the relevant lines near the end of run.py and point them to the appropriate jar file. Other global variables near the top of the file may need to be changed, but most are set through input parameters.

Example command to run 50-topic LDA prior then SLDA:

python run.py -in_tag train -ne lda -ne_run -ne_pp -predict -pred_set dev -prior -tw slda -tw_pp -tw_run --k 5 -vocab vocab.txt --neuro input/in_file1.txt -in_file input/in_file2

This runs LDA on the data in input/in_file1.txt, then uses the result as a prior for SLDA training on the text in input/in_file2.train.txt and for prediction on the text in input/in_file2.dev.txt. For training and testing, the value given to the "-in_file" parameter is treated as a prefix: the training set is expected at [in_file].train.txt and the testing set at [in_file].dev.txt. If supervised labels are used, the label files should be named [in_file].train.lbl and [in_file].dev.lbl.

The vocabulary in vocab.txt is an alphabetized whitelist of words (tokens) to consider, 1 token per line. The input files are lists of documents, 1 document per line, in the format "[id]\t[document]", where [id] is the document id and [document] is a space-separated list of the words in the document. The label files (if any) use the format "[id]\t[label]", where [id] matches the id of the corresponding document in the text file.
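The file naming and line formats described above can be illustrated with a minimal Python sketch. This is not code from run.py: the helper names (split_paths, parse_id_lines, filter_tokens) and the sample data are hypothetical, and dropping out-of-vocabulary tokens is only one reading of what "whitelist" means here.

```python
def split_paths(in_file, split):
    """Build (text_path, label_path) for a split such as "train" or "dev".

    Assumes the "-in_file" value is a prefix, as described in the README.
    """
    return f"{in_file}.{split}.txt", f"{in_file}.{split}.lbl"


def parse_id_lines(lines):
    """Parse lines of the form id<TAB>value into a dict of id -> value."""
    records = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        doc_id, _, value = line.partition("\t")
        records[doc_id] = value
    return records


def filter_tokens(document, vocab):
    """Keep only the space-separated tokens that appear in the whitelist."""
    return [tok for tok in document.split() if tok in vocab]


# Made-up in-memory data standing in for vocab.txt, a text file, and a label file.
vocab = {"happy", "sad", "tired"}
docs = parse_id_lines(["u1\tso tired and sad today\n", "u2\thappy happy day\n"])
labels = parse_id_lines(["u1\t1\n", "u2\t0\n"])

print(split_paths("input/in_file2", "train"))
for doc_id, text in docs.items():
    # Labels are matched to documents by id.
    print(doc_id, labels.get(doc_id), filter_tokens(text, vocab))
```

With real data, the lines would come from open(path) on the paths returned by split_paths, and the vocabulary from reading vocab.txt one token per line.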

By default output is sent to BASE_DIR/output/_[timestamp], but can be further labeled with the -out_tag parameter, or set explicitly with the -model_dir parameter. Output is in the same format as in https://github.com/vietansegan/segan, with some additional utility files in the root of the output directory.

The code/ directory contains the Python scripts used by run.py as well as a few additional utilities for preprocessing data or extracting results. Most simply adapt between formats. The most useful is "get_predictions.py", which extracts the predictions output by SLDA/SNLDA and recombines "weekly" documents into a single prediction per author. Some of the scripts can also be used to extract and tokenize documents from raw Twitter formats.
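As a rough illustration of recombining weekly documents into one prediction per author, the sketch below averages weekly scores. It is not the logic of get_predictions.py. The id format "[author]_[week]", the use of a mean, and the function name combine_weekly are all assumptions made for the example.

```python
from collections import defaultdict


def combine_weekly(predictions, sep="_"):
    """Average per-week scores into one score per author.

    `predictions` maps a weekly document id to a numeric score.
    Assumption: ids look like "<author><sep><week>"; ids without
    the separator are treated as a complete author id.
    """
    by_author = defaultdict(list)
    for doc_id, score in predictions.items():
        author, found, _week = doc_id.rpartition(sep)
        if not found:
            author = doc_id  # no separator: use the whole id
        by_author[author].append(score)
    return {author: sum(s) / len(s) for author, s in by_author.items()}


# Made-up weekly scores for two authors.
weekly = {"alice_1": 0.25, "alice_2": 0.75, "bob_1": 0.5}
print(combine_weekly(weekly))
```

Other aggregation rules (a maximum, or a length-weighted mean) would slot into the same structure by changing the final dictionary comprehension.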

