Identifying depression on Reddit: the effect of training data

This repository contains the code and data used in the paper:

Inna Pirina and Çağrı Çöltekin (2018) Identifying Depression on Reddit: The Effect of Training Data. In: Proceedings of the 2018 EMNLP Workshop SMM4H: The 3rd Social Media Mining for Health Applications Workshop & Shared Task, pages 9–12

The results presented in the paper use the bag-of-n-grams SVM classifiers implemented in this repository. The following demonstrates a typical use case. Brief descriptions of other options are available through the -h command-line option.
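To make the term concrete, here is a minimal sketch of bag-of-n-grams feature extraction in plain Python. It is not the code from this repository: the function names, the `w:`/`c:` key prefixes, and the parameters `word_ngmax` and `char_ngmax` are illustrative assumptions, and whitespace tokenization stands in for whatever tokenizer the real implementation uses.

```python
from collections import Counter


def ngrams(items, n_max):
    """Yield all contiguous n-grams of order 1..n_max from a sequence."""
    for n in range(1, n_max + 1):
        for i in range(len(items) - n + 1):
            yield tuple(items[i:i + n])


def bag_of_ngrams(text, word_ngmax=1, char_ngmax=3):
    """Count word and character n-grams of one document in a single bag.

    Hypothetical helper: parameter names and defaults are assumptions.
    """
    features = Counter()
    # Prefix the keys so word and character n-grams never collide.
    for gram in ngrams(text.split(), word_ngmax):
        features["w:" + " ".join(gram)] += 1
    for gram in ngrams(list(text), char_ngmax):
        features["c:" + "".join(gram)] += 1
    return features


feats = bag_of_ngrams("so tired", word_ngmax=2, char_ngmax=2)
```

Counts like these would then be turned into a sparse vector per document and passed to a linear SVM; the weighting and normalization steps are left out of this sketch.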

The first step is tuning the hyperparameters:

./ -i data/train.csv # training data \
    -l ds: -l ff: -l do: -l ds: -l bc: -l nd: # we remove these labels from the training data \
    --negative-class=ndf # explicitly state the negative class \
    tune # this is the command \
    -S dsf-ndf.log  # save results to this log file \
    -s random -k 5 # random search with 5-fold CV \
    -m 1000 # try 1000 hyperparameter settings \
    '(("w_ngmax", "int", (1, 2)), ("c_ngmax", "int", (2, 4)), ("C", "real", (0.1, 2.0)), ("lowercase", "cat", ("word", "char", "both")))'

Yes, the command-line interface, particularly the label filtering, is somewhat convoluted. If you want a cleaner command line, you can also split the training data into separate files with binary class labels.
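As a rough illustration of what the label options do, the sketch below treats each `-l` argument as an `old:new` rule, where an empty `new` drops the label (as in `-l ds:`) and a non-empty one renames it (as in the `-l dsf:do` form used in the scoring command). This reading is inferred from the comments in the commands; the function names and the `(label, text)` row format are assumptions, not the script's actual internals.

```python
def parse_label_rules(rules):
    """Turn 'old:new' strings into a dict; an empty 'new' means drop."""
    mapping = {}
    for rule in rules:
        old, _, new = rule.partition(":")
        mapping[old] = new or None
    return mapping


def apply_label_rules(rows, mapping):
    """Filter and relabel (label, text) pairs (assumed row format)."""
    kept = []
    for label, text in rows:
        target = mapping.get(label, label)  # unlisted labels pass through
        if target is not None:
            kept.append((target, text))
    return kept


# Toy rows, invented for illustration only.
rows = [("dsf", "post a"), ("ds", "post b"), ("ndf", "post c")]
train = apply_label_rules(rows, parse_label_rules(["ds:", "nd:"]))
mapped = apply_label_rules(train, parse_label_rules(["dsf:do", "ndf:nd"]))
```

Writing `train` out to its own file is one way to get the binary training file mentioned above, so that no filtering options are needed at all.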

The rest of the tunable hyperparameters can be found in the __init__ method of class Bong. For more information on other tuning options, see ./ tune -h. The above will crunch numbers for a while and write the hyperparameter settings and evaluation metrics to the file dsf-ndf.log.
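The search-space argument in the tuning command is a Python literal of `(name, type, values)` triples. The sketch below shows one plausible way such a specification could drive random search. It is an assumption about the semantics, not the repository's implementation: integer ranges are taken as inclusive, real ranges as uniform, and `sample_setting` is a hypothetical helper.

```python
import ast
import random

SPEC = ('(("w_ngmax", "int", (1, 2)), ("c_ngmax", "int", (2, 4)), '
        '("C", "real", (0.1, 2.0)), '
        '("lowercase", "cat", ("word", "char", "both")))')


def sample_setting(spec, rng):
    """Draw one random hyperparameter setting from the specification."""
    params = {}
    for name, kind, values in spec:
        if kind == "int":
            # Assumed: both endpoints are included.
            params[name] = rng.randint(values[0], values[1])
        elif kind == "real":
            # Assumed: uniform sampling over the interval.
            params[name] = rng.uniform(values[0], values[1])
        elif kind == "cat":
            params[name] = rng.choice(values)
        else:
            raise ValueError(f"unknown parameter type: {kind}")
    return params


spec = ast.literal_eval(SPEC)  # safe parsing of the literal, no eval()
rng = random.Random(0)         # fixed seed so the draws are repeatable
settings = [sample_setting(spec, rng) for _ in range(5)]
```

In a full random search, each sampled setting would be scored with k-fold cross-validation and the result appended to the log file.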

To get the best hyperparameters and scores (example is based on a short run):

./ dsf-ndf.log
Based on 34 entries.
Best score (p r f a): 94.50±1.08 94.50±1.08 94.50±1.08 94.50±1.08
Top 20:
94.50±1.08 C=0.6762,c_ngmax=3,lowercase=char,w_ngmax=1
94.38±0.79 C=1.6789,c_ngmax=2,lowercase=both,w_ngmax=1
94.38±1.12 C=0.5059,c_ngmax=3,lowercase=both,w_ngmax=1
94.25±1.21 C=1.4373,c_ngmax=3,lowercase=both,w_ngmax=1
94.25±1.21 C=1.8086,c_ngmax=3,lowercase=both,w_ngmax=1
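Each line in this summary appears to be a mean score with its spread across the cross-validation folds. The sketch below shows how such a ranking could be produced from per-fold scores. The fold scores are invented toy numbers, `rank` is a hypothetical helper, and the use of the population standard deviation is an assumption, since the README does not say which variant the tool reports.

```python
from statistics import mean, pstdev


def rank(results, top=20):
    """Format 'mean±std setting' lines, best mean first.

    results maps a setting description to its list of per-fold scores.
    """
    rows = [(mean(scores), pstdev(scores), name)
            for name, scores in results.items()]
    rows.sort(key=lambda row: row[0], reverse=True)
    return [f"{m:.2f}±{sd:.2f} {name}" for m, sd, name in rows[:top]]


# Toy per-fold scores, not taken from the paper or the log file.
toy_results = {
    "C=1.0": [90.0, 92.0, 94.0],
    "C=0.5": [95.0, 95.0, 95.0],
}
lines = rank(toy_results)
```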

Now we can retrain the model with the best parameters, and test it on the test data.

./ -i data/train.csv -t data/test.csv \
    -l ds: -l ff: -l do: -l ds: -l bc: -l nd: \
    -l dsf:do -l ndf:nd  # these two are new; they map training file labels to test file labels \
    score # now we want the score
2022-09-23 22:05:49,866 Classes: OrderedDict([('do', 400), ('nd', 400)])
2022-09-23 22:05:49,866 Training
2022-09-23 22:05:49,867 Converting documents to BoNG vectors
2022-09-23 22:05:51,773 Number of features: 27081
2022-09-23 22:05:51,775 Fitting the model
2022-09-23 22:05:52,936 Testing
Precision 0.6377374671898959, Recall: 0.6325000000000001, F-score:
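For reference, precision, recall, and F-score over two balanced classes can be computed as in the sketch below. The README does not state how the reported scores are averaged, so macro-averaging (the unweighted mean of per-class values) is an assumption here, and `macro_prf` is a hypothetical helper rather than the script's own scoring code.

```python
def macro_prf(gold, pred):
    """Return macro-averaged precision, recall and F1 over all labels."""
    labels = sorted(set(gold) | set(pred))
    precisions, recalls, f_scores = [], [], []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f_scores.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f_scores) / n


# Toy labels, invented for illustration only.
gold = ["do", "do", "nd", "nd"]
pred = ["do", "nd", "nd", "nd"]
precision, recall, f_score = macro_prf(gold, pred)
```

Note that some tools instead report the harmonic mean of the macro-averaged precision and recall as the F-score, which generally gives a slightly different number.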

Using predict instead of score prints out the predicted labels instead.

The datasets are included in the data directory. The class labels match the class labels used in the paper.

