Special Topics In Computer Science: Computing with Large Data Sets, NYU 2014
Code for preprocessing, vectorizing, and classifying tweets.
The data are tweets about ebola or Justin Bieber, collected over a 24-hour period, 10/6-7 2014.
Requirements (install with pip):

- smappPy
- nltk
- numpy
- scipy
- scikit-learn
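All five requirements can be fetched in one command, assuming each package is published on PyPI under the name listed above:

```shell
# install the project's dependencies with pip
pip install smappPy nltk numpy scipy scikit-learn
```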
- The script preprocessing.py reads the data from tweets.csv and does some word cleanup:
  - lowercases the text
  - removes punctuation
  - removes words which contain "ebola" or "bieber"
  - removes numbers
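The cleanup steps above can be sketched as follows. This is an illustration, not the actual preprocessing.py; the function name clean_words is hypothetical.

```python
import string

def clean_words(text):
    """Sketch of the README's cleanup: lowercase, strip punctuation,
    drop words containing the topic terms, drop plain numbers."""
    cleaned = []
    for word in text.lower().split():
        # strip punctuation characters from the token
        word = word.translate(str.maketrans("", "", string.punctuation))
        # skip empty tokens, topic words, and bare numbers
        if not word or "ebola" in word or "bieber" in word or word.isdigit():
            continue
        cleaned.append(word)
    return cleaned
```

For example, `clean_words("Ebola outbreak: 2 new cases!")` keeps only `["outbreak", "new", "cases"]`.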
- The script vectorizer.py reads the output from preprocessing.py, which is clean_data.csv and dictionary.csv. It then counts word occurrences in the clean tweet texts and produces a document-term matrix, which it saves to tfm.m.
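The same kind of document-term matrix can be built with sklearn's CountVectorizer, sketched below on toy strings; the real script works from clean_data.csv and dictionary.csv, whose contents are not shown here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# toy stand-ins for cleaned tweet texts
docs = ["outbreak spreads fast", "new single drops", "outbreak new cases"]

vectorizer = CountVectorizer()
# rows = documents, columns = vocabulary words, entries = occurrence counts
dtm = vectorizer.fit_transform(docs)  # scipy sparse matrix
```

Here `dtm` has shape (3 documents, 7 vocabulary words), stored sparsely since most counts are zero.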
- The script train_and_predict.py uses sklearn's Naive Bayes classifier to train on a training set, and report some results on a held-out test set.
© Jonathan Ronen, New York University 2014