GitHub - MattChanTK/KaggleTwitterProject

Data
KaggleTwitterProject_Sentiment
KaggleTwitterProject_Weather
KaggleTwitterProject_When
linguistics/en
peach-doc
.gitattributes
.gitignore
README
Useful links.txt


Partly Sunny with a Chance of Hashtags
----------------------------------------------

The proposed project analyzes a set of messages posted on Twitter (also known as "tweets") about the weather in order to determine three principal characteristics: the sentiment about the weather (positive, negative or neutral), the time the tweet refers to (past, current or future weather), and the type of weather the tweet refers to (e.g. hot). Each factor has a different number of categories, which add up to 24 labels in total. A single tweet can belong to several labels (at least three).
The objective of the project is to classify this data and determine which labels each tweet belongs to. The training set consists of tweets that have been classified manually by several analysts, who selected the labels for each tweet. Using this as a reference, the test set, consisting of raw tweets, will be classified into the labels mentioned above.
Several topics taught in the course need to be combined to address the different challenges of the project. The project will follow the general steps developed throughout the course, selecting the most appropriate methods for the data and focusing mainly on data preprocessing, text mining, and classification methods.
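
As an illustration of the data preprocessing step, the sketch below normalizes a raw tweet into a list of word tokens. It is not taken from the project code: the function name normalize_tweet and the specific cleaning rules (lowercasing, dropping URLs and @mentions, keeping hashtag words) are assumptions chosen for the example.

```python
import re

# Hypothetical cleaning rules, chosen for illustration only.
URL_RE = re.compile(r"https?://\S+")      # links carry little weather signal
MENTION_RE = re.compile(r"@\w+")          # user handles
TOKEN_RE = re.compile(r"[a-z0-9']+")      # simple word tokens


def normalize_tweet(text):
    """Lowercase a tweet, strip URLs and mentions, and return word tokens."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Keep the hashtag word itself (e.g. "#sunny" becomes "sunny").
    text = text.replace("#", " ")
    return TOKEN_RE.findall(text)


tokens = normalize_tweet("@mention Loving this #sunny day! http://t.co/abc")
print(tokens)
```

The resulting tokens could then feed a bag-of-words or similar text mining representation before classification.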


Dataset
-----------------------------------------------

The dataset used in this project comes from the new open data library launched by CrowdFlower and is run as a competition on Kaggle. The competition webpage is: http://www.kaggle.com/c/crowdflower-weather-twitter/data.
As mentioned above, the dataset consists of tweets related to the weather. It contains 77,946 samples in the training set and 42,157 samples in the test set. The training set contains tweets, locations, and a confidence score for each of the 24 possible class labels. The 24 class labels come from three categories: sentiment, when, and kind. Human raters can choose only one class label from each of the "sentiment" and "when" categories, but are allowed multiple choices for "kind".

s = sentiment
w = when
k = kind
============================================================
s1,"I can't tell"
s2,"Negative"
s3,"Neutral / author is just sharing information"
s4,"Positive"
s5,"Tweet not related to weather condition"
w1,"current (same day) weather"
w2,"future (forecast)"
w3,"I can't tell"
w4,"past weather"
k1,"clouds"
k2,"cold"
k3,"dry"
k4,"hot"
k5,"humid"
k6,"hurricane"
k7,"I can't tell"
k8,"ice"
k9,"other"
k10,"rain"
k11,"snow"
k12,"storms"
k13,"sun"
k14,"tornado"
k15,"wind"
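
The label scheme above can be turned into predictions with a simple rule. The sketch below is only an illustration under stated assumptions: it takes a dictionary of confidence scores keyed by the label codes listed above, picks the single highest-scoring label for "sentiment" and "when" (raters choose only one), and keeps every "kind" label whose score reaches a threshold (raters may choose several). The function name decode_labels, the dictionary layout, and the 0.5 threshold are hypothetical and not part of the dataset or the project code.

```python
# Label codes as listed above: 5 sentiment + 4 when + 15 kind = 24 labels.
SENTIMENT = ["s%d" % i for i in range(1, 6)]
WHEN = ["w%d" % i for i in range(1, 5)]
KIND = ["k%d" % i for i in range(1, 16)]


def decode_labels(scores, kind_threshold=0.5):
    """Map confidence scores to one sentiment, one when, and a list of kinds.

    scores: dict from label code (e.g. "s4") to a confidence in [0, 1];
            missing codes are treated as 0.0.
    kind_threshold: assumed cut-off for keeping a "kind" label.
    """
    sentiment = max(SENTIMENT, key=lambda code: scores.get(code, 0.0))
    when = max(WHEN, key=lambda code: scores.get(code, 0.0))
    kinds = [code for code in KIND if scores.get(code, 0.0) >= kind_threshold]
    return sentiment, when, kinds


# Made-up scores for a single tweet, for illustration only.
example = {"s3": 0.2, "s4": 0.8, "w1": 1.0, "k4": 0.6, "k13": 0.7, "k15": 0.2}
print(decode_labels(example))
```

Treating "sentiment" and "when" as single-choice and "kind" as multi-choice mirrors how the raters were asked to label the tweets; a different threshold or a per-label threshold may suit the data better.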
