overdosed 0.1

What linguistic features are unique to discussions of nonmedical substance use?

Background

Social media (Twitter, Facebook, websites like CrazyMeds) can show how the general population uses substances for nonmedical purposes. It may, in fact, give a more accurate picture of usage than data from surveys or emergency rooms. Surveys ask a small sample of the population to recall activities that are sometimes illicit and to report them to a federal authority under the promise of anonymity. Emergency rooms see only the part of the story in which substance use goes wrong.

Methodology

overdosed 0.1 uses latent semantic analysis to identify the words or phrases that distinguish tweets discussing the use of substances from other tweets. There are two phases:

Phase 1

1. Sample two streams from the Twitter gardenhose (1% sampler).
   - Stream 1: Unfiltered.
   - Stream 2: Filtered for keywords describing the substance of interest.
2. Develop the classifier.
   - Sensitive (rule-in) component: Identify words present in both streams.
   - Specific (rule-out) component: Identify words present in the filtered stream but not the unfiltered stream (filtered stream - unfiltered stream).
3. Analyze the classifier.
   - Identify groups of semantically related words in the rule-in component.
   - Do the same for the rule-out component (i.e., taxonomize).
4. Test the classifier.
   - Curate new samples from the two streams.
   - Adjust which words must be present or absent in a tweet to achieve acceptable sensitivity and specificity.
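
The develop and test steps above can be sketched with plain set operations. This is a minimal illustration under stated assumptions, not the repository's implementation: the whitespace tokenizer, the function names (`tokenize`, `vocabulary`, `build_components`, `classify`, `sensitivity_specificity`), the `min_hits` threshold, the hand-picked `required` and `forbidden` word sets, and all of the toy tweets and labels are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def tokenize(tweet: str) -> Set[str]:
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return set(tweet.lower().split())


def vocabulary(tweets: Iterable[str]) -> Set[str]:
    """Union of all tokens seen in one stream sample."""
    vocab: Set[str] = set()
    for tweet in tweets:
        vocab |= tokenize(tweet)
    return vocab


def build_components(filtered: Iterable[str],
                     unfiltered: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (rule_in, rule_out) following the Phase 1 definitions."""
    filtered_vocab = vocabulary(filtered)
    unfiltered_vocab = vocabulary(unfiltered)
    rule_in = filtered_vocab & unfiltered_vocab   # words present in both streams
    rule_out = filtered_vocab - unfiltered_vocab  # filtered stream - unfiltered stream
    return rule_in, rule_out


def classify(tweet: str, required: Set[str], forbidden: Set[str],
             min_hits: int = 1) -> bool:
    """Flag a tweet with at least min_hits required words and no forbidden word."""
    tokens = tokenize(tweet)
    return len(tokens & required) >= min_hits and not (tokens & forbidden)


def sensitivity_specificity(predictions: List[bool],
                            labels: List[bool]) -> Tuple[float, float]:
    """Compute (sensitivity, specificity) from paired predictions and labels."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, l in pairs if p and l)
    fn = sum(1 for p, l in pairs if not p and l)
    tn = sum(1 for p, l in pairs if not p and not l)
    fp = sum(1 for p, l in pairs if p and not l)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


# Toy samples standing in for the two streams (hypothetical data).
filtered_sample = ["took two percocet last night", "percocet and oxy again"]
unfiltered_sample = ["last night was great", "coffee again"]
rule_in, rule_out = build_components(filtered_sample, unfiltered_sample)

# Hand-picked word sets for the example; which words must be present or
# absent is exactly what the testing step is meant to tune.
required = {"percocet", "oxy"}
forbidden = {"lyrics"}

# Newly curated, manually labelled tweets (hypothetical labels).
curated = [
    ("popped a percocet", True),
    ("percocet lyrics stuck in my head", False),
    ("coffee again", False),
    ("need oxy for my back", True),
    ("popped a perc", True),
]
predictions = [classify(text, required, forbidden) for text, _ in curated]
labels = [label for _, label in curated]
sensitivity, specificity = sensitivity_specificity(predictions, labels)
```

Changing `required`, `forbidden`, or `min_hits` and recomputing the two rates on the curated sample mirrors the adjustment loop described in the testing step.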

Phase 2

1. Sample the unfiltered Twitter gardenhose (1% sampler).
   - Valid sample statistics cannot be calculated if the streams are combined.
2. Partition the unfiltered Twitter stream into:
   - All tweets discussing use of the substance
   - All other tweets
3. Calculate the relative abundance of each component of the metadata, e.g.:
   - Are the geographic distributions the same?
   - What latent attributes differ?
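
The relative-abundance comparison in Phase 2 can be sketched as follows. This is an illustrative example rather than the project's code: the function names (`relative_abundance`, `abundance_difference`), the choice of time zone as the metadata field, and the toy values are assumptions.

```python
from collections import Counter
from typing import Dict, Iterable


def relative_abundance(values: Iterable[str]) -> Dict[str, float]:
    """Map each metadata value to its share of the partition."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: count / total for value, count in counts.items()}


def abundance_difference(case: Iterable[str],
                         control: Iterable[str]) -> Dict[str, float]:
    """Share in the case partition minus share in the control partition."""
    case_freq = relative_abundance(case)
    control_freq = relative_abundance(control)
    keys = set(case_freq) | set(control_freq)
    return {k: case_freq.get(k, 0.0) - control_freq.get(k, 0.0) for k in keys}


# Hypothetical metadata: one time-zone value per tweet in each partition.
case_time_zones = ["Eastern", "Eastern", "Pacific", "Central"]
control_time_zones = ["Eastern", "Pacific", "Pacific", "Pacific"]

difference = abundance_difference(case_time_zones, control_time_zones)
```

A positive value means the metadata value is relatively more common among tweets discussing the substance. The sketch stops at descriptive shares; deciding whether a difference exceeds sampling noise would need a separate significance test.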

Quickstart

 git clone https://github.com/mac389/overdosed.git
 cd overdosed
 sh setup.sh

Dependencies

Tweepy (3.3.0)
Gensim (0.10.3)
Seaborn (0.6dev, for visualization, also requires pandas)
NumPy (1.9.1)
Matplotlib
SciPy

License