This repository contains code and data sets I used in my Master's thesis "Cross-collection aspect based opinion mining using topic models". The document can be found at http://research.sabanciuniv.edu/36615/1/10209464_HemedHamisiKaporo.pdf.

Directorman9/cross_collection_opinion_mining

Cross-collection opinion mining

This repository contains the code and data sets I used in my Master's thesis "Cross-collection aspect based opinion mining using topic models".

How to use these resources.

Datasets: All data sets are in the data sets directory. Details of each data set are given in the thesis document (http://research.sabanciuniv.edu/36615/1/10209464_HemedHamisiKaporo.pdf).

• The airlines data set can also be downloaded from https://www.kaggle.com/crowdflower/twitter-airline-sentiment
• The debate data set can also be downloaded from https://www.kaggle.com/benhamner/clinton-trump-tweets
• The hotels data set can also be downloaded from http://times.cs.uiuc.edu/~wang296/Data 
• The full movies data set can be downloaded from http://snap.stanford.edu/data/web-Movies.html
• The phones data set was scraped from GSMArena.

Codes: All related code is in the codes directory and is written in Python. Usage instructions for each script are given inside the script itself; this section describes when to use each one.

1) Preprocessing
• Run get-hotels_reviews_text.py on the 3 selected hotels of interest from the downloaded data set to extract only the text of each review, excluding images and sentiment scores.
• Run get_tweets_by_airlines.py on the downloaded airlines data set, to get tweets arranged according to the airline of affiliation.
• Run get_tweets_by_debate.py on the downloaded debate data set, to get tweets arranged according to the candidate of affiliation.
• Run get_movie_by_id.py on the downloaded movies data set to get movies of interest.
• Run prepro_tweets.py on the two Twitter data sets (airlines and debate) to remove hashtags, web links and numerical values, and to convert all characters to lower case.
• Run cclda_tam_prepro.py on all of the data sets above to prepare them as input for the cclda and tam algorithms.
• Run cptm_prepro.py on all of the data sets above to prepare them as input for the cptm algorithm.
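
The tweet-cleaning step described above can be sketched as follows. This is a minimal illustration, not the contents of prepro_tweets.py: the function name clean_tweet and the regular expressions are assumptions, and the sketch assumes that "removing hashtags" drops the whole tag rather than only the # symbol.

```python
import re

# Hypothetical patterns for illustration; prepro_tweets.py may use different rules.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")   # web links
HASHTAG_RE = re.compile(r"#\w+")                 # whole hashtag, including the word
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")       # integers and simple decimals
SPACE_RE = re.compile(r"\s+")


def clean_tweet(text):
    """Remove links, hashtags and numbers, then lower-case the tweet."""
    # Links go first because they can contain digits and '#' characters.
    text = URL_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = NUMBER_RE.sub(" ", text)
    # Collapse the gaps left behind by the removals.
    text = SPACE_RE.sub(" ", text).strip()
    return text.lower()


if __name__ == "__main__":
    print(clean_tweet("Flight 1234 DELAYED again #fail http://t.co/abc"))
```

User mentions such as @united are left untouched in this sketch, since the description above does not say they are removed.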

2) Run the topic modeling algorithms.
• An implementation of cclda and tam, with instructions, is available at https://github.com/blade091shenwei/TAM_ccLDA
• An implementation of cptm, with instructions, is available at https://github.com/NLeSC/cptm


3) Post process topic modeling outputs.
• cclda_post_processing.py and tam_post_processing.py post-process the cclda and tam outputs; no post-processing is required for cptm outputs. Post-processing means placing aspect words (nouns) and opinion words (adjectives, verbs, adverbs) in their appropriate places.
• cclda_refinement.py, cptm_refinement.py and tam_refinement.py refine the post-processed topic modeling outputs. Refinement means keeping only the topics with the highest average pairwise cosine similarity, measured with word embeddings pre-trained on an unseen corpus.
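
The refinement criterion can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the code in the refinement scripts: the function names, the keep parameter and the toy two-dimensional embeddings are all hypothetical, and words missing from the embedding table are simply skipped.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def average_pairwise_cosine(words, embeddings):
    """Mean cosine similarity over all pairs of topic words that have a vector."""
    vectors = [embeddings[w] for w in words if w in embeddings]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0  # assumption: a topic with fewer than two known words scores 0
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def refine_topics(topics, embeddings, keep):
    """Return the `keep` topics with the highest average pairwise similarity."""
    ranked = sorted(
        topics,
        key=lambda words: average_pairwise_cosine(words, embeddings),
        reverse=True,
    )
    return ranked[:keep]


if __name__ == "__main__":
    # Toy vectors standing in for pre-trained embeddings.
    toy = {"room": [1.0, 0.0], "bed": [0.9, 0.1], "pilot": [0.0, 1.0]}
    topics = [["room", "pilot"], ["room", "bed"]]
    print(refine_topics(topics, toy, keep=1))
```

In practice the embeddings dictionary would be loaded from a pre-trained model, and keep would be chosen according to how many topics are to be retained.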

4) Conduct evaluation
• The coherence measures scripts contain the code used to measure topic coherence for the outputs of all three algorithms.
• As their names suggest, cclda_coherence_measures.py is for cclda, cptm_coherence_measures.py is for cptm and tam_coherence_measures.py is for tam.
• Measures include lcp, pmi, npmi and cosim. These measures require an unseen corpus; here, a Wikipedia corpus prepared using the code in the side_codes directory.
• cclda_sentiment_measure.py and cptm_sentiment_measure.py measure sentiment scores of the output opinion words in each perspective and each topic.
• average_sentiments.py computes the baseline sentiment scores. Run it on the initial 3 hotels of interest before preprocessing them.
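
As an illustration of one of the coherence measures listed above, the sketch below computes an average NPMI score for a topic from document-level co-occurrence in a reference corpus. It is a sketch under stated assumptions rather than the code in the coherence scripts: the function name npmi_coherence is hypothetical, co-occurrence is counted per document (the scripts may use a sliding window or another scheme), and the scores assigned to the two boundary cases are conventions chosen here.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, documents):
    """Average NPMI over all word pairs in a topic.

    `documents` is a list of tokenised documents from the reference corpus.
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)
    if n_docs == 0:
        return 0.0

    def prob(*words):
        # Fraction of documents that contain every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words)) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0.0:
            scores.append(-1.0)  # convention: words that never co-occur
        elif p12 == 1.0:
            scores.append(1.0)   # convention: limit for complete co-occurrence
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    corpus = [
        ["room", "bed", "clean"],
        ["room", "bed"],
        ["flight", "delay"],
        ["flight", "crew"],
    ]
    print(npmi_coherence(["room", "bed"], corpus))
    print(npmi_coherence(["room", "flight"], corpus))
```

NPMI ranges from -1 (the words never appear together) to 1 (they always appear together), so higher averages indicate more coherent topics.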
