MattChanTK/KaggleTwitterProject

Partly Sunny with a Chance of Hashtags
----------------------------------------------

The proposed project is to analyze a set of weather-related messages posted on Twitter (also known as "tweets") in order to determine three principal characteristics of each one. These characteristics are the sentiment about the weather (positive, negative, or neutral), the time the tweet refers to (past, current, or future weather), and the kind of weather the tweet refers to (e.g. hot). Each characteristic has a different number of categories, which together add up to 24 labels. A single tweet can belong to several labels at once (at least three).
The objective of the project is to classify the tweets and determine which labels each one belongs to. The training set consists of tweets that were classified manually by several analysts, who selected the labels for each tweet. Using the training set as a reference, the test set, which consists of raw unlabeled tweets, will be classified into the labels mentioned above.
Several topics taught in the course need to be combined to address the different challenges that the project presents. The project will follow the general steps developed throughout the course, selecting the most appropriate methods for the data and focusing mainly on data preprocessing, text mining, and classification methods.
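
As an illustration of the data preprocessing step, the following is a minimal sketch of how a raw tweet could be turned into a list of tokens before text mining. The function name `tokenize_tweet` and the specific cleaning rules (lowercasing, dropping URLs and user mentions, keeping the word part of a hashtag) are assumptions made for this example, not the method the project actually uses.

```python
import re

# Hypothetical cleaning rules, chosen only for illustration.
URL_RE = re.compile(r"https?://\S+")   # web links
MENTION_RE = re.compile(r"@\w+")       # user mentions such as @someone
TOKEN_RE = re.compile(r"[a-z0-9']+")   # runs of letters, digits, apostrophes


def tokenize_tweet(text):
    """Lowercase a tweet, strip URLs and mentions, and split it into tokens."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Keep the word of a hashtag but drop the '#' symbol itself.
    text = text.replace("#", " ")
    return TOKEN_RE.findall(text)


tokens = tokenize_tweet("Sunny day in #Boston! @bob http://t.co/x")
print(tokens)
```

The resulting token lists could then be fed to whichever text mining and classification methods are selected for the project.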


Dataset
-----------------------------------------------

The dataset that will be used during the course of the project is available from the new open data library launched by CrowdFlower, and it is run as a competition on Kaggle. The competition webpage is: http://www.kaggle.com/c/crowdflower-weather-twitter/data.
As mentioned before, the dataset consists of tweets related to the weather. The training set contains 77,946 samples and the test set contains 42,157 samples. The training set contains tweets, locations, and a confidence score for each of the 24 possible class labels. The 24 class labels come from three categories: sentiment, when, and kind. Human raters could choose only one class label from each of the "sentiment" and "when" categories, but were allowed multiple choices for "kind".

s = sentiment
w = when
k = kind
============================================================
s1,"I can't tell"
s2,"Negative"
s3,"Neutral / author is just sharing information"
s4,"Positive"
s5,"Tweet not related to weather condition"
w1,"current (same day) weather"
w2,"future (forecast)"
w3,"I can't tell"
w4,"past weather"
k1,"clouds"
k2,"cold"
k3,"dry"
k4,"hot"
k5,"humid"
k6,"hurricane"
k7,"I can't tell"
k8,"ice"
k9,"other"
k10,"rain"
k11,"snow"
k12,"storms"
k13,"sun"
k14,"tornado"
k15,"wind"
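
The label structure above can be sketched in code. The example below builds the 24 label codes and reads one set of confidence scores back into labels: the single most confident label for "sentiment" and for "when", and every "kind" label whose score reaches a threshold. The function name `decode_labels`, the dictionary input format, and the 0.5 threshold are assumptions made for illustration; they are not part of the dataset or of the competition rules.

```python
# Label codes taken from the legend above: 5 sentiment, 4 when, 15 kind.
SENTIMENT = ["s%d" % i for i in range(1, 6)]
WHEN = ["w%d" % i for i in range(1, 5)]
KIND = ["k%d" % i for i in range(1, 16)]
ALL_LABELS = SENTIMENT + WHEN + KIND  # 24 labels in total


def decode_labels(scores, kind_threshold=0.5):
    """Turn confidence scores into labels.

    scores maps a label code (e.g. "s4") to a confidence score;
    codes missing from the mapping are treated as 0.0.
    kind_threshold is a hypothetical cut-off chosen for this sketch.
    """
    # Raters pick exactly one sentiment and one "when" label,
    # so take the most confident code in each of those groups.
    sentiment = max(SENTIMENT, key=lambda code: scores.get(code, 0.0))
    when = max(WHEN, key=lambda code: scores.get(code, 0.0))
    # Several "kind" labels may apply at once, so keep all above the cut-off.
    kinds = [code for code in KIND if scores.get(code, 0.0) >= kind_threshold]
    return {"sentiment": sentiment, "when": when, "kind": kinds}


example_scores = {"s3": 0.2, "s4": 0.8, "w1": 1.0, "k4": 0.6, "k10": 0.1, "k13": 0.7}
decoded = decode_labels(example_scores)
print(decoded)
```

For the example scores, the sketch selects s4 (Positive), w1 (current weather), and the kinds k4 (hot) and k13 (sun).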
