overdosed 0.1

What linguistic features are unique to discussions of nonmedical substance use?

Background

Social media (Twitter, Facebook, websites like CrazyMeds) can tell us how the general population uses substances for nonmedical purposes. Social media may, in fact, provide a more accurate picture of usage than data from surveys or emergency rooms. Surveys ask a small sample of the population to recall activities that are sometimes illicit and report them to a federal authority under the promise of anonymity. Emergency rooms see only the part of the story where substance use goes wrong.

Methodology

overdosed 0.1 uses latent semantic analysis to identify the words or phrases that distinguish tweets discussing substance use from other tweets. There are two phases:

Phase 1

  1. Sample two streams from Twitter gardenhose (1% sampler).
    Stream 1: Unfiltered.
    Stream 2: Filtered for keywords describing substance of interest.

  2. Develop the classifier.
    Sensitive (rule-in) component: Identify words present in both streams.
    Specific (rule-out) component: Identify words present in the filtered stream but not in the unfiltered stream (filtered stream minus unfiltered stream).

  3. Analyze the classifier.
    Identify groups of semantically related words in the rule-in component.
    Do the same for the rule-out component (i.e., taxonomize).

  4. Test the classifier.
    Curate new samples from the two streams.
    Adjust which words must be present or absent in a tweet until the classifier reaches an acceptable sensitivity and specificity.
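
Steps 2 and 4 can be sketched with plain Python sets. This is a minimal illustration, not the repository's code: the tokenizer, the function names (`tokenize`, `build_components`, `classify`, `sensitivity_specificity`), the `min_hits` threshold, and the toy tweets are all assumptions. The decision rule, which flags a tweet when it contains enough words unique to the filtered stream, is only one possible reading of the steps above.

```python
import re

def tokenize(text):
    """Lowercase a tweet and return its set of word tokens (hypothetical tokenizer)."""
    return set(re.findall(r"[a-z']+", text.lower()))

def vocabulary(tweets):
    """Union of the tokens of every tweet in a stream."""
    vocab = set()
    for tweet in tweets:
        vocab |= tokenize(tweet)
    return vocab

def build_components(filtered, unfiltered):
    """Return (rule_in, rule_out) following the definitions in step 2."""
    f, u = vocabulary(filtered), vocabulary(unfiltered)
    rule_in = f & u   # words present in both streams
    rule_out = f - u  # words present in the filtered stream only
    return rule_in, rule_out

def classify(tweet, distinctive_words, min_hits=1):
    """Assumed rule: flag a tweet that contains at least min_hits distinctive words."""
    return len(tokenize(tweet) & distinctive_words) >= min_hits

def sensitivity_specificity(labeled, distinctive_words, min_hits=1):
    """labeled is an iterable of (tweet, is_substance_tweet) pairs."""
    tp = fn = tn = fp = 0
    for tweet, truth in labeled:
        predicted = classify(tweet, distinctive_words, min_hits)
        if truth and predicted:
            tp += 1
        elif truth:
            fn += 1
        elif predicted:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity

# Toy streams and a hand-labeled test sample (all invented for illustration).
filtered = ["popped xanax before class", "xanax with wine tonight"]
unfiltered = ["running late before class", "wine tonight with friends"]
rule_in, rule_out = build_components(filtered, unfiltered)

labeled = [
    ("just popped two xanax", True),
    ("xanax ads are everywhere", False),
    ("late for class again", False),
    ("feeling numb tonight", True),
]
loose = sensitivity_specificity(labeled, rule_out, min_hits=1)
strict = sensitivity_specificity(labeled, rule_out, min_hits=2)
```

On this toy sample, raising `min_hits` from 1 to 2 removes the one false positive without losing a true positive, which is the kind of adjustment step 4 describes. Real streams would also need stop-word handling, since common words can land in either component by chance.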

Phase 2

  1. Sample the unfiltered Twitter gardenhose (1% sampler).
    Use only the unfiltered stream; sample statistics are not valid if the two streams are combined.

  2. Partition the unfiltered Twitter stream into:
    All tweets discussing use of the substance
    All other tweets

  3. Calculate the relative abundance of each component of the metadata, e.g.
    Are the geographic distributions the same?
    What latent attributes differ?
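
Steps 2 and 3 amount to splitting one sample into two groups and comparing how a metadata field is distributed in each. The sketch below assumes each tweet is a dict with a `text` key and a hypothetical `place` metadata key, and uses a simple keyword predicate as a stand-in for the Phase 1 classifier. None of these names come from the repository.

```python
from collections import Counter

def partition(tweets, is_substance_tweet):
    """Split one unfiltered sample into (substance tweets, all other tweets)."""
    substance, other = [], []
    for tweet in tweets:
        (substance if is_substance_tweet(tweet) else other).append(tweet)
    return substance, other

def relative_abundance(tweets, field):
    """Fraction of tweets taking each value of a metadata field."""
    counts = Counter(t[field] for t in tweets if t.get(field) is not None)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}

def abundance_difference(group_a, group_b, field):
    """Per-value difference in relative abundance (group_a minus group_b)."""
    a = relative_abundance(group_a, field)
    b = relative_abundance(group_b, field)
    return {v: a.get(v, 0.0) - b.get(v, 0.0) for v in set(a) | set(b)}

# Toy unfiltered sample (invented for illustration).
sample = [
    {"text": "popped xanax", "place": "NY"},
    {"text": "xanax again", "place": "NY"},
    {"text": "xanax tonight", "place": "CA"},
    {"text": "late for class", "place": "CA"},
    {"text": "wine with friends", "place": "CA"},
    {"text": "good morning", "place": "NY"},
    {"text": "traffic is bad", "place": "CA"},
]

# Stand-in predicate for the Phase 1 classifier.
substance, other = partition(sample, lambda t: "xanax" in t["text"])
diff = abundance_difference(substance, other, "place")
```

A positive entry in `diff` means that value is over-represented among substance tweets relative to the rest of the sample. Deciding whether the two distributions are actually different would need a statistical test on the raw counts, for example a chi-squared test of independence with SciPy, which is already a dependency.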

Quickstart

 git clone https://github.com/mac389/overdosed.git
 cd overdosed
 sh setup.sh

Dependencies

  1. Tweepy (3.3.0)
  2. Gensim (0.10.3)
  3. Seaborn (0.6dev, for visualization, also requires pandas)
  4. NumPy (1.9.1)
  5. Matplotlib
  6. SciPy
