Skip to content

jonathanronen/ml101-biebola

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Example code for tweet classification

Special Topics In Computer Science: Computing with Large Data Sets, NYU 2014

Code

Code for preprocessing, vectorizing, and classifying tweets.

Data

Some tweets about ebola or justin bieber, collected in a 24 hour period 10/6-7 2014.

Requirements

  • smappPy here
  • nltk use pip
  • numpy, scipy use pip
  • scikit-learn use pip

Run

  1. The script preprocessing.py reads the data from tweets.csv and does some word cleanup:

    • lowercase
    • removes punctuation
    • removes words which contain ebola or bieber
    • removes numbers
  2. The script vectorizer.py reads the output from preprocessing.py, which is clean_data.csv and dictionary.csv. It then counts word occurences in clean tweet texts, and produces a document-term-matrix. Saves that ti tfm.m.

  3. The script train_and_predict.py uses sklearn's Naive Bayes classifier to train on a training set, and report some results on a held out test set.


© Jonathan Ronen, New York University 2014

About

Code, slides and data for Naive Bayes tweets classification tutorial for NYU Special Topics in Computer Science, 2014

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages