Special Topics In Computer Science: Computing with Large Data Sets, NYU 2014
Code for preprocessing, vectorizing, and classifying tweets.
The data are tweets about ebola or Justin Bieber, collected over a 24-hour period, 10/6-7 2014.
Requirements (install with pip):

- smappPy
- nltk
- numpy
- scipy
- scikit-learn
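All five requirements can be fetched in one command, assuming each package is published on PyPI under the name listed above:

```shell
# install the project's dependencies with pip
pip install smappPy nltk numpy scipy scikit-learn
```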
- The script preprocessing.py reads the data from tweets.csv and does some word cleanup:
  - lowercases the text
  - removes punctuation
  - removes words which contain "ebola" or "bieber"
  - removes numbers
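The cleanup steps above can be sketched as follows. This is an illustration, not the actual preprocessing.py; the function name clean_words is hypothetical.

```python
import string

def clean_words(text):
    """Sketch of the README's cleanup: lowercase, strip punctuation,
    drop words containing the topic terms, drop plain numbers."""
    cleaned = []
    for word in text.lower().split():
        # strip punctuation characters from the token
        word = word.translate(str.maketrans("", "", string.punctuation))
        # skip empty tokens, topic words, and bare numbers
        if not word or "ebola" in word or "bieber" in word or word.isdigit():
            continue
        cleaned.append(word)
    return cleaned
```

For example, `clean_words("Ebola outbreak: 2 new cases!")` keeps only `["outbreak", "new", "cases"]`.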
- The script vectorizer.py reads the output from preprocessing.py, which is clean_data.csv and dictionary.csv. It then counts word occurrences in the clean tweet texts and produces a document-term matrix, which it saves to tfm.m.
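The same kind of document-term matrix can be built with sklearn's CountVectorizer, sketched below on toy strings; the real script works from clean_data.csv and dictionary.csv, whose contents are not shown here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy stand-ins for cleaned tweet texts
docs = ["outbreak spreads fast", "new single drops", "outbreak new cases"]

vectorizer = CountVectorizer()
# rows = documents, columns = vocabulary words, entries = occurrence counts
dtm = vectorizer.fit_transform(docs)  # scipy sparse matrix
```

Here `dtm` has shape (3 documents, 7 vocabulary words), stored sparsely since most counts are zero.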
- The script train_and_predict.py uses sklearn's Naive Bayes classifier to train on a training set, and report some results on a held-out test set.
© Jonathan Ronen, New York University 2014