tweettracker-sa

A sentiment analysis tool for Twitter.

Check the Twitter-specific tokenizing repository for updates and bug fixes.

Required Dependencies

- Scikit-Learn
- NumPy
- NLTK
- SciPy
- Goslate

References

Go, Alec, Richa Bhayani, and Lei Huang. "Twitter sentiment classification using distant supervision." CS224N Project Report, Stanford 1 (2009): 12.
Mohammad, Saif M., Svetlana Kiritchenko, and Xiaodan Zhu. "NRC-Canada: Building the state-of-the-art in sentiment analysis of tweets." Second Joint Conference on Lexical and Computational Semantics (*SEM). Vol. 2. 2013.
Owoputi, Olutobi, et al. "Improved part-of-speech tagging for online conversational text with word clusters." Association for Computational Linguistics, 2013.

Dataset

The dataset for the distant_supervision.py example is provided by Sentiment140. You can directly download the corpus here.
The dataset for the demonstration.py example comes from TweetTracker's backup.

Usage

Use the Processor class for tweet processing and vectorizing.
Use parse.py to collect suitable tweets from TweetTracker's backup to fit the classifier.
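
Since one classifier is needed per language, parse.py has to be run once for each language. The following is a minimal sketch of how those invocations could be generated, not part of the repository: the helper name build_parse_commands and the example language codes are hypothetical, and the command shape simply mirrors the zcat pipeline shown in the How to section.

```python
import shlex


def build_parse_commands(backup_path, langs, limit=-1):
    """Return one shell pipeline per language (hypothetical helper).

    Each pipeline decompresses the backup with zcat and pipes it into
    parse.py with the language and the tweet limit as arguments.
    """
    commands = []
    for lang in langs:
        commands.append(
            "zcat {} | python parse.py {} {}".format(
                shlex.quote(backup_path), shlex.quote(lang), int(limit)
            )
        )
    return commands


if __name__ == "__main__":
    # The language codes below are placeholders; use whatever the
    # "tweet-lang" field of your backup actually contains.
    for cmd in build_parse_commands("path/to/backup_file.json.gz.Z", ["en", "pt"]):
        print(cmd)
```

Each returned string can be passed to subprocess.run(cmd, shell=True, check=True) or pasted into a shell script.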

How to:

Before deployment, the following steps should be done:

Collect data from a TweetTracker backup using parse.py: zcat path/to/backup_file.json.gz.Z | python parse.py lang limit, where lang is the language (either a code or an abbreviation, depending on the "tweet-lang" type) and limit is the number of tweets to collect (pass -1 to collect every possible tweet).
- WARNING (1): there should be one classifier per language, so either call parse.py once for each desired language or write a bash script to automate this for many languages.
Process the collected data with the Processor class. The default settings are:
- TF-IDF representation of the vocabulary
- Unigrams and bigrams only (trigrams are left out because their accuracy gain does not justify the extra processing time)
- Twitter-specific features are not concatenated by default; the corresponding parameter must be turned on (see the documentation for usage)
- WARNING (2): the tokenizer makes mistakes, and sometimes the label cannot be inferred from the emoticons, so the sample can't be used to fit the classifier. Since such instances are few, they can simply be discarded. Use the Processor.clear method for this.
Store the fitted vectorizer for each language. It is needed to classify unseen data online.
Use sklearn.linear_model.LogisticRegression as the classifier for best results. A linear SVM does not yield a big accuracy improvement, is not as fast, and does not make it easy to obtain probabilities.
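
The steps above can be sketched with plain scikit-learn. This is an illustration under stated assumptions, not the Processor API: TfidfVectorizer stands in for the Processor class, the sample tweets and labels are made up, the "en" key is a placeholder, and pickle is just one possible way to store the fitted objects.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up samples; None marks a tweet whose label could not be inferred.
samples = [
    ("i love this phone", 1),
    ("what a great day", 1),
    ("best concert ever", 1),
    ("i hate waiting in line", 0),
    ("this is the worst service", 0),
    ("so sad and tired today", 0),
    ("just landed in paris", None),
]

# Discard samples without a label, as WARNING (2) suggests.
labeled = [(text, label) for text, label in samples if label is not None]
texts = [text for text, _ in labeled]
labels = [label for _, label in labeled]

# TF-IDF representation over unigrams and bigrams only.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
features = vectorizer.fit_transform(texts)

# Logistic regression gives class probabilities directly.
classifier = LogisticRegression()
classifier.fit(features, labels)

# Keep one fitted vectorizer and classifier per language
# (in practice, write the bytes to a file per language).
stored = {"en": pickle.dumps((vectorizer, classifier))}

# Later, when classifying unseen data online: reuse the stored vectorizer.
loaded_vectorizer, loaded_classifier = pickle.loads(stored["en"])
probs = loaded_classifier.predict_proba(
    loaded_vectorizer.transform(["i love this great day"])
)
print(probs)  # one row with the probabilities for class 0 and class 1
```

The unseen tweet must go through the same fitted vectorizer so that its features line up with the columns the classifier was trained on, which is why the vectorizer is stored alongside the model.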

License

MIT License (MIT)
