
movie_classifier

We based our project on building a sentiment classifier geared specifically toward movie reviews. To train and evaluate the classifier, we used the IMDB corpus, in which each review carries one of two labels: positive or negative.

To create our features we used WordNet, more specifically SentiWordNet, together with part-of-speech taggers. Our algorithm relies on two kinds of features. The first is based on Osgood's semantic differential. Osgood studied the connotations of words themselves along three dimensions: evaluation, potency, and activity. We focused on evaluation, which measures how closely a word is related to the words 'good' and 'bad', and adopted the same metric in our implementation. Treating each document as a bag of words, we used SentiWordNet to compile the set of words we classified as subjective, and we used this set of subjective words as a feature.
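As a rough illustration, the sketch below uses NLTK's SentiWordNet interface to decide whether a word counts as subjective; the is_subjective helper and its threshold are hypothetical, not code from this repository:

```python
# Minimal sketch, assuming NLTK with the wordnet and sentiwordnet corpora downloaded.
# The helper name and threshold are illustrative, not taken from the repository.
from nltk.corpus import sentiwordnet as swn

def is_subjective(word, threshold=0.5):
    """Treat a word as subjective if any of its senses carries enough sentiment."""
    for sense in swn.senti_synsets(word):
        if sense.pos_score() + sense.neg_score() >= threshold:
            return True
    return False

# Building the bag-of-subjective-words feature for one tokenized review.
review_tokens = ["terrible", "plot", "but", "wonderful", "acting"]
subjective_words = {w for w in review_tokens if is_subjective(w)}
print(subjective_words)
```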

Next, we looked at descriptive phrases, which we expected to capture the users' feelings about the movie. We wanted to extract the positive descriptive phrases in each document, and we used the PMI-IR method to do so. To extract the phrases, we defined our own grammar that matches only descriptive phrases, part-of-speech tagged the files, and kept the word sequences that fit that grammar. We used the presence of these descriptive phrases as features.
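The following sketch illustrates the idea of the phrase-extraction step; the tag patterns here are a generic adjective/adverb grammar in the spirit of Turney's PMI-IR work, not the exact grammar used in this project:

```python
# Minimal sketch, assuming NLTK's default tokenizer and POS tagger
# (punkt and averaged_perceptron_tagger data). The tag patterns below are
# illustrative; the project's own grammar may differ.
import nltk

DESCRIPTIVE_PATTERNS = {
    ("JJ", "NN"), ("JJ", "NNS"),   # adjective + noun
    ("RB", "JJ"), ("RBR", "JJ"),   # adverb + adjective
}

def extract_descriptive_phrases(text):
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return [f"{w1} {w2}"
            for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
            if (t1, t2) in DESCRIPTIVE_PATTERNS]

print(extract_descriptive_phrases("An absolutely brilliant film with a weak ending."))
```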

In the end, we represented our dataset as a 5000x17173 matrix and used it to train a support vector machine with a linear kernel. To set the hyperparameter C, we ran a log-scale grid search over the range 1e-3 to 1e3.
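A sketch of this training step is shown below; scikit-learn is an assumption here, and the feature matrix is replaced by a small random stand-in so the snippet runs on its own:

```python
# Minimal sketch of training a linear-kernel SVM with a log-scale grid search
# over C from 1e-3 to 1e3. scikit-learn and the stand-in data are assumptions;
# in the project, X would be the 5000x17173 feature matrix and y the labels.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.random((100, 50))            # stand-in for the real feature matrix
y = rng.integers(0, 2, size=100)     # stand-in for positive/negative labels

search = GridSearchCV(LinearSVC(), {"C": np.logspace(-3, 3, 7)}, cv=5)
search.fit(X, y)
print("best C:", search.best_params_["C"])
```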

We tested the support vector machine classifier on 1000 randomly selected movie reviews from the test set and obtained an accuracy of roughly 81%. We also noticed that increasing the training set from 5000 to 25000 reviews improved the accuracy by an additional 5%.

To test this program on the test dataset, run ./pipelined_classifier.sh. If you want to test it on a single review, save the review as test.file and run ./pipelined_classifier_ind.py. This will also save all of the extracted features to test.feat.

About

Coursework for ECE-467 Natural Language Processing
