GitHub - coltekin/identifying-depression

Identifying depression on Reddit: the effect of training data

This repository contains the code and data used in the paper:

Inna Pirina and Çağrı Çöltekin (2018) Identifying Depression on Reddit: The Effect of Training Data. In: Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task, pages 9–12

The results presented in the paper use the bag-of-n-grams SVM classifiers implemented in bong.py. The following demonstrates a typical use case. Brief descriptions of the other options are available through the -h command line option.
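To make the term concrete, here is a minimal sketch of bag-of-n-grams feature extraction using only the standard library. It is not the code in bong.py. The parameter names w_ngmax and c_ngmax are borrowed from the tuning example in this README, and reading them as the maximum word and character n-gram order is an assumption; the function name bag_of_ngrams and the "w:"/"c:" prefixes are hypothetical.

```python
from collections import Counter


def bag_of_ngrams(text, w_ngmax=1, c_ngmax=3):
    """Count word n-grams up to w_ngmax and character n-grams up to c_ngmax.

    Assumption: both parameters are maximum n-gram orders, starting from 1.
    """
    features = Counter()
    words = text.split()
    # Word n-grams, prefixed with "w:" to keep them apart from character n-grams.
    for n in range(1, w_ngmax + 1):
        for i in range(len(words) - n + 1):
            features["w:" + " ".join(words[i:i + n])] += 1
    # Character n-grams over the raw string, spaces included.
    for n in range(1, c_ngmax + 1):
        for i in range(len(text) - n + 1):
            features["c:" + text[i:i + n]] += 1
    return features


feats = bag_of_ngrams("so tired", w_ngmax=2, c_ngmax=2)
print(feats["w:so tired"], feats["c:ti"])
```

A classifier such as a linear SVM would then be trained on these counts, typically after weighting and vectorizing them.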

The first step is tuning the hyperparameters:

# -i: training data
# -l ...: remove these labels from the training data
# --negative-class: explicitly state the negative class
# tune: the command; -S saves the results to this log file
# -s random -k 5: random search with 5-fold CV; -m 1000: try 1000 hyperparameter settings
./bong.py -i data/train.csv \
    -l ds: -l ff: -l do: -l ds: -l bc: -l nd: \
    --negative-class=ndf \
    tune \
    -S dsf-ndf.log \
    -s random -k 5 \
    -m 1000 \
    '(("w_ngmax", "int", (1, 2)), ("c_ngmax", "int", (2, 4)), ("C", "real", (0.1, 2.0)), ("lowercase", "cat", ("word", "char", "both")))'

Yes, the command line interface, particularly the label filtering, is somewhat convoluted. If you want a cleaner command line, you can instead split the training data into separate files with binary class labels.
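The -l options used in this README take the form old: (remove the label) or old:new (rename it). The following sketch illustrates that filtering logic on training rows. It is an illustration under that reading, not the implementation in bong.py; parse_label_specs and apply_label_map are hypothetical names, the rows are toy data, and how the options interact with a test file is not covered.

```python
def parse_label_specs(specs):
    """Turn specs such as ["ds:", "dsf:do"] into {"ds": None, "dsf": "do"}."""
    mapping = {}
    for spec in specs:
        old, _, new = spec.partition(":")
        mapping[old] = new or None  # an empty target means "drop this label"
    return mapping


def apply_label_map(rows, mapping):
    """Filter and rename (label, text) pairs according to the mapping."""
    kept = []
    for label, text in rows:
        if label in mapping:
            if mapping[label] is None:
                continue  # label was marked for removal
            label = mapping[label]
        kept.append((label, text))
    return kept


# Toy rows, made up for illustration only.
rows = [("dsf", "post 1"), ("ds", "post 2"), ("ndf", "post 3")]
mapping = parse_label_specs(["ds:", "dsf:do", "ndf:nd"])
print(apply_label_map(rows, mapping))
```

Writing the filtered rows to a new file once is one way to obtain the binary-label training files mentioned above.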

The rest of the tunable hyperparameters can be found in the __init__ method of the Bong class in bong.py. For more information on the other tuning options, see ./bong.py tune -h. The command above will crunch numbers for a while and write the hyperparameter settings and evaluation metrics to the file dsf-ndf.log.
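The search space given to the tune command above is written as a Python literal of (name, type, values) triples. As a rough sketch of what random search over such a space might look like, the code below parses the string and draws a few settings. This is not taken from the repository: sample_setting is a hypothetical helper, and treating the "int" and "real" bounds as inclusive ranges is an assumption.

```python
import ast
import random

SPEC = ('(("w_ngmax", "int", (1, 2)), ("c_ngmax", "int", (2, 4)), '
        '("C", "real", (0.1, 2.0)), ("lowercase", "cat", ("word", "char", "both")))')


def sample_setting(spec, rng):
    """Draw one random hyperparameter setting from a parsed search space."""
    setting = {}
    for name, kind, values in spec:
        if kind == "int":
            # Assumption: integer bounds are inclusive on both ends.
            setting[name] = rng.randint(values[0], values[1])
        elif kind == "real":
            setting[name] = rng.uniform(values[0], values[1])
        elif kind == "cat":
            setting[name] = rng.choice(values)
        else:
            raise ValueError(f"unknown parameter type: {kind}")
    return setting


space = ast.literal_eval(SPEC)  # safe parsing of the literal, no code execution
rng = random.Random(0)
for _ in range(3):
    print(sample_setting(space, rng))
```

Each sampled setting would then be evaluated with k-fold cross-validation, and the scores logged for later comparison.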

To get the best hyperparameters and scores (example is based on a short run):

./read-logs.py dsf-ndf.log
Based on 34 entries.
Best score (p r f a): 94.50±1.08 94.50±1.08 94.50±1.08 94.50±1.08
Top 20:
94.50±1.08 C=0.6762,c_ngmax=3,lowercase=char,w_ngmax=1
94.38±0.79 C=1.6789,c_ngmax=2,lowercase=both,w_ngmax=1
94.38±1.12 C=0.5059,c_ngmax=3,lowercase=both,w_ngmax=1
94.25±1.21 C=1.4373,c_ngmax=3,lowercase=both,w_ngmax=1
94.25±1.21 C=1.8086,c_ngmax=3,lowercase=both,w_ngmax=1
...
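Each line of the list above pairs a score with a comma-separated setting string. If you want to reuse such a string in your own scripts, a small parser along the following lines could convert it into a dictionary. parse_setting is a hypothetical helper and not part of this repository.

```python
def parse_setting(text):
    """Parse 'C=0.6762,c_ngmax=3,lowercase=char' into a typed dictionary."""
    setting = {}
    for item in text.split(","):
        key, _, raw = item.partition("=")
        # Try int first, then float; fall back to the raw string.
        for cast in (int, float):
            try:
                setting[key] = cast(raw)
                break
            except ValueError:
                continue
        else:
            setting[key] = raw
    return setting


best = parse_setting("C=0.6762,c_ngmax=3,lowercase=char,w_ngmax=1")
print(best)
# {'C': 0.6762, 'c_ngmax': 3, 'lowercase': 'char', 'w_ngmax': 1}
```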

Now we can retrain the model with the best parameters, and test it on the test data.

# The two additional -l options (dsf:do and ndf:nd) map training file labels to test file labels.
# "score" is the command, followed by the hyperparameter settings to use.
./bong.py -i data/train.csv -t data/test.csv \
    -l ds: -l ff: -l do: -l ds: -l bc: -l nd: \
    -l dsf:do -l ndf:nd \
    score \
    C=0.6762,c_ngmax=3,lowercase=char,w_ngmax=1
2022-09-23 22:05:49,866 Classes: OrderedDict([('do', 400), ('nd', 400)])
2022-09-23 22:05:49,866 Training
2022-09-23 22:05:49,867 Converting documents to BoNG vectors
2022-09-23 22:05:51,773 Number of features: 27081
2022-09-23 22:05:51,775 Fitting the model
2022-09-23 22:05:52,936 Testing
Precision 0.6377374671898959, Recall: 0.6325000000000001, F-score: 0.6289729238574195
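This README does not state how the three figures are averaged over the two classes. The sketch below shows one common choice, macro-averaging, in which precision, recall and F-score are computed per class and then averaged. Whether bong.py does exactly this is an assumption; macro_prf is a hypothetical helper and the labels are toy data.

```python
def macro_prf(gold, predicted):
    """Return macro-averaged precision, recall and F1 over the gold labels."""
    labels = sorted(set(gold))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy labels, made up for illustration only.
gold = ["do", "do", "nd", "nd"]
pred = ["do", "nd", "nd", "nd"]
precision, recall, fscore = macro_prf(gold, pred)
print(f"Precision {precision:.4f}, Recall {recall:.4f}, F-score {fscore:.4f}")
```

Note that a macro-averaged F-score can fall below both the averaged precision and the averaged recall, since it is the mean of per-class F-scores rather than the harmonic mean of the two averages.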

Using the predict command instead of score prints out the predicted labels instead.

The datasets are included in the data directory. The class labels match the class labels used in the paper.
