The Theoretical Background of Naive Bayes [Citation: http://blog.datumbox.com/machine-learning-tutorial-the-naive-bayes-text-classifier/]

As stated earlier, the Naive Bayes classifier assumes that the features used in the classification are independent. Although this assumption is usually false, analysis of the Bayesian classification problem has shown that there are theoretical reasons for the apparently unreasonable efficacy of Naive Bayes classifiers (Zhang, 2004). It can be proven (Manning et al., 2008) that even though the probability estimates of Naive Bayes are of low quality, its classification decisions are quite good. Thus, although Naive Bayes usually overestimates the probability of the selected class, as long as we use it only to make the decision and not to predict the actual probabilities, the decision making is correct and the model is accurate.

In a text classification problem, we use the words (or terms/tokens) of the document in order to classify it into the appropriate class. By applying the "maximum a posteriori" (MAP) decision rule, we come up with the following classifier:

$$c_{MAP} = \arg\max_{c \in C} P(c \mid d) = \arg\max_{c \in C} P(c) \prod_{1 \le k \le n_d} P(t_k \mid c)$$

Where $t_k$ are the tokens (terms/words) of the document, $n_d$ is the number of tokens in document $d$, $C$ is the set of classes used in the classification, $P(c \mid d)$ is the conditional probability of class $c$ given document $d$, $P(c)$ is the prior probability of class $c$, and $P(t_k \mid c)$ is the conditional probability of token $t_k$ given class $c$.

This means that in order to decide to which class a new document belongs, we must estimate the product of the probability of each word of the document given a particular class (the likelihood), multiplied by the probability of that class (the prior). After calculating this for every class in the set C, we select the one with the highest probability.
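As an illustration, here is a minimal Python sketch of this product-of-probabilities decision, assuming the priors and the per-class token probabilities have already been estimated and stored in plain dictionaries (all names are illustrative, not taken from this repository's code):

```python
# Minimal sketch of the product-of-probabilities MAP decision. The priors and
# per-class token probabilities are assumed to be precomputed and stored in
# plain dictionaries; all names here are illustrative, not from this repository.

def classify_map(tokens, priors, likelihoods):
    """Return the class c maximizing P(c) * prod_k P(t_k | c)."""
    best_class, best_score = None, 0.0
    for c, prior in priors.items():
        score = prior
        for t in tokens:
            # Tokens never seen for class c are skipped (treated as probability 1);
            # the proper fix is the Laplace smoothing discussed further below.
            score *= likelihoods[c].get(t, 1.0)
        if best_class is None or score > best_score:
            best_class, best_score = c, score
    return best_class


priors = {"positive": 0.5, "negative": 0.5}
likelihoods = {
    "positive": {"great": 0.05, "awful": 0.001},
    "negative": {"great": 0.002, "awful": 0.04},
}
print(classify_map(["great", "great", "awful"], priors, likelihoods))  # -> positive
```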

Because computers handle numbers with limited floating-point precision, calculating the product of the above probabilities leads to floating-point underflow: we end up with a number so small that it cannot be represented accurately and is rounded to zero, rendering our analysis useless. To avoid this, instead of maximizing the product of the probabilities we maximize the sum of their logarithms:

$$c_{MAP} = \arg\max_{c \in C} \left[ \log P(c) + \sum_{1 \le k \le n_d} \log P(t_k \mid c) \right]$$

Thus, instead of choosing the class with the highest probability, we choose the one with the highest log score. Since the logarithm function is monotonic, the MAP decision remains the same.
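A corresponding sketch of the same decision performed in log space (same assumed dictionary layout as the sketch above):

```python
import math

# The same decision performed in log space, which avoids the floating-point
# underflow caused by multiplying many small probabilities.

def classify_map_log(tokens, priors, likelihoods):
    """Return the class c maximizing log P(c) + sum_k log P(t_k | c)."""
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for t in tokens:
            p = likelihoods[c].get(t)
            if p is not None:  # zero counts are dealt with by smoothing, below
                score += math.log(p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```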

The last problem we should address is that if a particular feature/word does not appear in a particular class, its conditional probability is equal to 0. If we use the first decision method (product of probabilities) the product becomes 0, while if we use the second (sum of logarithms) log(0) is undefined. To avoid this, we use add-one or Laplace smoothing, adding 1 to each count:

$$P(t \mid c) = \frac{T_{ct} + 1}{\sum_{t' \in V} T_{ct'} + B'}$$

Where $T_{ct}$ is the number of occurrences of term $t$ in the training documents of class $c$ and $B'$ is equal to the number of terms contained in the vocabulary $V$.
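Finally, a sketch of how the priors and the Laplace-smoothed likelihoods could be estimated from tokenized training documents; the `train_naive_bayes` function and its `training_docs` argument (a mapping from class to lists of token lists) are assumptions for illustration, not this repository's actual interface:

```python
from collections import Counter

# Sketch of estimating the priors and the Laplace-smoothed likelihoods
# P(t|c) = (T_ct + 1) / (sum_t' T_ct' + B'), with B' = |V|.
# `training_docs` maps each class to a list of tokenized documents.

def train_naive_bayes(training_docs):
    vocabulary = {t for docs in training_docs.values() for doc in docs for t in doc}
    b_prime = len(vocabulary)
    total_docs = sum(len(docs) for docs in training_docs.values())

    priors, likelihoods = {}, {}
    for c, docs in training_docs.items():
        priors[c] = len(docs) / total_docs
        counts = Counter(t for doc in docs for t in doc)
        denom = sum(counts.values()) + b_prime
        # Add-one smoothing: every vocabulary term gets a non-zero probability.
        likelihoods[c] = {t: (counts[t] + 1) / denom for t in vocabulary}
    return priors, likelihoods


priors, likelihoods = train_naive_bayes({
    "positive": [["great", "movie"], ["great", "fun"]],
    "negative": [["awful", "movie"]],
})
```

The returned `priors` and `likelihoods` plug straight into the log-space classifier sketched above.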
