Twitter Sentiment Analysis

Introduction:

Around 6,000 new tweets are posted every second by Twitter users around the world, which corresponds to roughly 500 million tweets per day. Valuable insights can be extracted from this stream using text mining tools such as sentiment analysis. Sentiment analysis has been of great interest to companies and organizations in business, marketing and politics as a tool to gauge public reaction to product launches and campaigns. This project uses three different machine learning models to classify the sentiment of tweets as ‘positive’ or ‘negative’. The strengths and weaknesses of each model are analyzed and compared to determine its suitability for the task.

Related Work:

In “Probabilistic vs Deterministic Short Text Sentiment Classifiers” [1], Ramaneek Gill and Alexandros Tagtalenidis analyse how Bayesian networks affect the precision and recall of a text sentiment classifier compared with the Gaussian class-conditional model of Linear Discriminant Analysis. They then compare their probabilistic models (Naive Bayes and Linear Discriminant Analysis) against deterministic models (SVMs and neural networks) on the task of classifying positive and negative sentiment, to see whether generative algorithms outperform deterministic ones.

Dataset:

The Stanford dataset is used in this project. The training set has ~1.6 million tweets and the test set has ~500 tweets. The data is in CSV format with the following columns/features: sentiment of the tweet, tweet ID, date of the tweet, query used to find the tweet, username and tweet text. The sentiment column is the classification label and uses numeric IDs: 0 for a negative tweet, 2 for a neutral tweet and 4 for a positive tweet. Since the project focuses on classifying tweets with positive and negative sentiment, tweets with neutral sentiment (i.e. sentiment ID = 2) are removed during preprocessing. The dataset has also been processed to remove emoticons.
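A minimal loading-and-filtering sketch for this step is shown below, assuming the Stanford CSV layout described above; the file names and column names are illustrative placeholders, not the exact identifiers used in this repository.

```python
# Hypothetical sketch: load the Stanford dataset CSVs and keep only
# positive and negative tweets. File and column names are assumptions.
import pandas as pd

COLUMNS = ["sentiment", "tweet_id", "date", "query", "username", "text"]

def load_tweets(path):
    df = pd.read_csv(path, names=COLUMNS, encoding="latin-1")
    df = df[df["sentiment"] != 2]                      # drop neutral tweets (ID = 2)
    df["label"] = (df["sentiment"] == 4).astype(int)   # 1 = positive, 0 = negative
    return df[["text", "label"]]

train_df = load_tweets("training.csv")   # ~1.6 million tweets
test_df = load_tweets("test.csv")        # ~500 tweets
```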

Methodology:

We implemented two different feature-extraction approaches for sentiment analysis of tweets:

  • Bag-Of-Words Approach
  • Word2Vec Embeddings Approach

Bag-Of-Words:

This approach encodes each tweet based on the presence or absence of the top 2000 words in the training vocabulary. For example, in the tweet "hello this is a great day", suppose hello and great are among the top 2000 words; the tweet then becomes a vector of length 2000 with 1's at the positions of hello and great and 0's everywhere else. In data preprocessing, we did the following steps (a short sketch of the encoding is shown after the list):

  1. Removed punctuation
  2. Removed stop words
  3. Built a global dictionary of words
  4. Chose the top 2000 words and used them as features for each tweet
  5. Ramaneek Gill and Alexandros Tagtalenidis [1] used both training and test data to build their global vocabulary, which risks leaking information from the test data into the model. To avoid this, we built the global dictionary from the training data only and tested the model on unseen (test) data.
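A minimal sketch of this encoding, assuming scikit-learn's CountVectorizer and the train_df/test_df frames from the dataset sketch above; the repository's own preprocessing may differ in detail:

```python
# Hypothetical sketch of the binary bag-of-words encoding over the top 2000 words.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(
    max_features=2000,     # keep only the 2000 most frequent words
    stop_words="english",  # drop stop words
    binary=True,           # 1 if the word occurs in the tweet, 0 otherwise
)

# Fit on the training tweets only, so no vocabulary leaks in from the test set.
X_train_bow = vectorizer.fit_transform(train_df["text"])
X_test_bow = vectorizer.transform(test_df["text"])  # words outside the top 2000 are ignored
```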

Word2Vec Embeddings:

This approach converts every word in the vocabulary into a vector representation, the idea being that words with similar meanings end up close to each other in the vector space. In data preprocessing we did the following (a sketch of steps 5-7 follows the list):

  1. Removed stop words
  2. Removed punctuation
  3. Removed neutral tweets from both the training and test sets
  4. Tokenized each tweet
  5. Created word2vec embeddings with the gensim package, so that each word in the training vocabulary is represented as a vector
  6. Computed the tf-idf (term frequency - inverse document frequency) score for each token in the training data
  7. Generated the training and test data arrays by taking, for each tweet, a weighted average of its word vectors, with each word weighted by its tf-idf score from the training data
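A minimal sketch of steps 5-7, assuming gensim 4.x and scikit-learn; the embedding dimension, tokenizer and other settings are placeholders rather than the project's exact values:

```python
# Hypothetical sketch: train word2vec embeddings, compute tf-idf weights on the
# training data, and build each tweet's feature vector as a tf-idf-weighted
# average of its word vectors.
import numpy as np
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer

DIM = 100
tokenized_train = [t.lower().split() for t in train_df["text"]]
tokenized_test = [t.lower().split() for t in test_df["text"]]

w2v = Word2Vec(sentences=tokenized_train, vector_size=DIM, min_count=2, workers=4)

tfidf = TfidfVectorizer()
tfidf.fit(train_df["text"])
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))  # word -> idf weight

def tweet_vector(tokens):
    """Weighted average of the word vectors present in one tweet."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv and t in idf]
    weights = [idf[t] for t in tokens if t in w2v.wv and t in idf]
    if not vecs:
        return np.zeros(DIM)
    return np.average(vecs, axis=0, weights=weights)

X_train = np.vstack([tweet_vector(t) for t in tokenized_train])
X_test = np.vstack([tweet_vector(t) for t in tokenized_test])
```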

We implemented three models for classifying tweets:

  1. Support Vector Machines (deterministic model): We implemented SVMs because they work well for text classification. SVMs do not scale well to high-dimensional inputs, so we applied Principal Component Analysis (PCA) for dimensionality reduction before passing the features to the SVM.
  2. Naive Bayes Classifier (probabilistic model): Naive Bayes treats the features as independent of each other, and we wanted to study how this independence assumption affects classification. We implemented Gaussian, Bernoulli and Multinomial Naive Bayes models.
  3. Neural Networks (deterministic model): We also implemented a neural network and experimented with the learning rate and the number of hidden layers.

After the classification phase, the performance of each model is evaluated using measures such as accuracy, the confusion matrix and precision-recall curves; a sketch of the models and the evaluation step follows.
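Below is a minimal sketch of the three classifiers and the evaluation step, using scikit-learn on the dense word2vec features from the previous sketch (y_train/y_test are the 0/1 labels). The hyperparameters shown are placeholders, not the values tuned in this project.

```python
# Hypothetical sketch of the three models and the evaluation metrics listed above.
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

y_train, y_test = train_df["label"].to_numpy(), test_df["label"].to_numpy()

models = {
    # PCA reduces the dimensionality before the SVM, since SVMs scale poorly
    # with many features.
    "pca+svm": make_pipeline(PCA(n_components=50), SVC(kernel="rbf")),
    "gaussian_nb": GaussianNB(),    # suits continuous embedding features
    "bernoulli_nb": BernoulliNB(),  # binarizes features at 0 by default
    "mlp": MLPClassifier(hidden_layer_sizes=(64, 32), learning_rate_init=1e-3),
}
# MultinomialNB expects non-negative counts, so it pairs with the bag-of-words
# vectors rather than the word2vec embeddings used here.

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(name, "accuracy:", accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred))
    print(classification_report(y_test, pred))  # per-class precision and recall
```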
