This directory contains code for data prep (`/dataprep`), EDA (`/eda`), and regression models fit to the data (`/regression`). Details about each are in the sections below. Scripts in `/dataprep` must be executed first to populate `/data`; the EDA and regression scripts can then be executed. Figures generated by a script are stored in a `figures` subdirectory of the directory the script lives in, i.e. in `/eda/figures` and `/regression/figures`.
## Prerequisites

Install the following packages:

```shell
pip3 install pandas
pip3 install plotly
pip3 install psutil
pip3 install requests
pip3 install matplotlib
pip3 install vaderSentiment
pip3 install scikit-learn
pip3 install yellowbrick
pip3 install statsmodels
pip3 install Cython
```

Then build and install py-earth from source (GitHub no longer serves the `git://` protocol, so clone over `https://`):

```shell
git clone https://github.com/scikit-learn-contrib/py-earth.git
cd py-earth
sudo python3 setup.py install --cythonize
```
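After installing, a quick sanity check confirms everything is importable. The import names below (e.g. `sklearn` for scikit-learn, `pyearth` for py-earth) are the standard ones, but treat this as a sketch against your own environment:

```python
import importlib.util

# PyPI names and import names differ for some packages:
# scikit-learn -> sklearn, py-earth -> pyearth.
required = [
    "pandas", "plotly", "psutil", "requests", "matplotlib",
    "vaderSentiment", "sklearn", "yellowbrick", "statsmodels", "pyearth",
]

# find_spec returns None for packages that are not installed.
missing = [name for name in required if importlib.util.find_spec(name) is None]
print("missing packages:", missing or "none")
```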
## I. /dataprep

Contains scripts to derive new features from the existing features in the Facebook News Dataset.

- Clone the Facebook News Dataset into the `/dataprep` folder:

  ```shell
  cd dataprep
  git clone https://github.com/jbencina/facebook-news.git
  cd facebook-news
  ```

- Unzip the two 7z files. This should result in two new csv files: `/facebook-news/fb_news_posts_20K.csv` and `/facebook-news/fb_news_comments_1000K_hashed.csv`.

- Run the scripts that derive new features. This should create two new csv files, `/data/commentsProcessed.csv` and `/data/postsProcessed.csv`:

  ```shell
  cd ..
  python3 prep_comments.py
  python3 prep_posts.py
  ```
## II. /data

Contains csv files with all original and derived features. These can be either:

- Generated by running the scripts in `/dataprep` (steps in section I), OR
- Downloaded from https://drive.google.com/drive/folders/10RDSPErTmFY9LI-s07HNCfw90iSaW77_?usp=sharing

Either way, this folder must contain the two files `/data/commentsProcessed.csv` and `/data/postsProcessed.csv` before scripts from subsequent sections / other folders are executed. Other csv files with transformed features, created by subsequent scripts like `/eda/reactions.py`, are also placed in this folder when created.
## III. /eda

Contains scripts to perform exploratory data analysis. Execute these scripts in the order listed below, because each one produces a new csv that includes any new or transformed features created in the process.

```shell
cd eda
python3 reactions.py
python3 sentiment.py
python3 distributions.py
python3 relationships_outcome.py
python3 relationships_sentiment.py
python3 relationships_day_time.py
```
- `reactions.py`
  - Plots and analyzes the distribution of the fraction of each type of reaction on each post, i.e. the distributions of six features: frac_angry, frac_wow, frac_sad, frac_haha, frac_love and frac_like.
  - Plots and analyzes the distribution of the number of reactions (of each type) obtained per second, and the distributions of their log-transformed values (the log transform yields an approximately normal distribution).
  - Reads from `/data/postsProcessed.csv`.
  - Writes log-transformed features to a new csv, `/data/postsWithFracReactions.csv`.
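The fraction and per-second computations can be sketched as follows; the raw column names (`likes`, `age_secs`, etc.) are hypothetical stand-ins for whatever `postsProcessed.csv` actually contains:

```python
import numpy as np
import pandas as pd

# Two hypothetical posts; real column names in postsProcessed.csv may differ.
posts = pd.DataFrame({
    "likes": [120, 30], "loves": [10, 5], "hahas": [4, 0],
    "wows": [2, 1], "sads": [0, 3], "angrys": [1, 6],
    "age_secs": [3600, 7200],  # seconds since the post was created
})

reaction_cols = ["likes", "loves", "hahas", "wows", "sads", "angrys"]
total = posts[reaction_cols].sum(axis=1)

# Fraction of each reaction type per post: frac_like, frac_angry, ...
for col in reaction_cols:
    posts["frac_" + col.rstrip("s")] = posts[col] / total

# Reactions obtained per second, plus a log transform that pulls the
# heavy right tail toward an approximately normal shape.
posts["reactions_per_sec"] = total / posts["age_secs"]
posts["log_reactions_per_sec"] = np.log(posts["reactions_per_sec"])
```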
- `sentiment.py`
  - Plots the distribution of each of the sentiment fields.
  - Displays information to help analyze the relationship between the fraction of reactions of each type received by a post and its sentiment as computed by VADER.
- `distributions.py`
  - Plots the distributions of all other features: mins_to_first_comment, day of the week a post was created, time of day it was created, number of shares per second, and number of reactions per second.
  - Applies a reciprocal root transform to make mins_to_first_comment, mins_to_100_comment, shares_per_sec and reactions_per_sec approximately normal.
  - Reads from `/data/postsWithFracReactions.csv`.
  - Writes reciprocal-root-transformed features to `/data/postsWithReciRoot.csv`.
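A reciprocal root transform can be sketched as below; the exact form (here `1 / sqrt(x + 1)`, with the `+1` guarding against division by zero) and the output column name are assumptions, not necessarily what the script uses:

```python
import numpy as np
import pandas as pd

def reciprocal_root(x):
    # Assumed form of the transform: 1 / sqrt(x + 1).
    return 1.0 / np.sqrt(x + 1.0)

posts = pd.DataFrame({"mins_to_first_comment": [0.0, 3.0, 15.0, 120.0]})
posts["reci_root_mins_to_first_comment"] = reciprocal_root(
    posts["mins_to_first_comment"]
)
print(posts)
```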
- `relationships_outcome.py`
  - Explores the relationship between each (transformed) feature and the mins_to_100_comment feature.
  - Plots either scatter plots or line plots showing the mean over a certain category.
  - Reads from `/data/postsWithReciRoot.csv`.
- `relationships_sentiment.py`
  - Explores the relationships between the sentiment of a post and all other features.
  - Plots line plots binned by sentiment (bin width 0.1).
  - Reads from `/data/postsWithReciRoot.csv`.
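The width-0.1 sentiment binning can be sketched with `pandas.cut`; the data and the `frac_angry` column here are synthetic stand-ins:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
posts = pd.DataFrame({
    "sentiment": rng.uniform(-1, 1, 500),    # stand-in for a VADER score
    "frac_angry": rng.uniform(0, 0.3, 500),  # hypothetical feature
})

# Bin sentiment into width-0.1 intervals and take the mean of the other
# feature inside each bin; plotting bin centres against these means
# gives the line plot described above.
bins = np.linspace(-1.0, 1.0, 21)  # 20 bins of width 0.1
posts["sent_bin"] = pd.cut(posts["sentiment"], bins)
binned_means = posts.groupby("sent_bin", observed=True)["frac_angry"].mean()
print(binned_means.head())
```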
- `relationships_day_time.py`
  - Explores the relationships between the day/time of a post's creation and all other features.
  - Plots line plots binned by one-hot encoded day/time features.
  - Reads from `/data/postsWithReciRoot.csv`.
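One-hot encoding of day/time features can be sketched with `pandas.get_dummies`; the timestamps and the four time-of-day buckets below are illustrative assumptions:

```python
import pandas as pd

posts = pd.DataFrame({
    "created_time": pd.to_datetime([
        "2017-05-01 08:30", "2017-05-02 14:10", "2017-05-06 22:45",
    ]),
})

# Derive day-of-week and a coarse time-of-day bucket, then one-hot
# encode both; the encoded columns can be used to bin other features.
posts["day"] = posts["created_time"].dt.day_name()
posts["time_of_day"] = pd.cut(
    posts["created_time"].dt.hour,
    bins=[0, 6, 12, 18, 24],
    labels=["night", "morning", "afternoon", "evening"],
    right=False,
)
encoded = pd.get_dummies(posts[["day", "time_of_day"]])
print(encoded.columns.tolist())
```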
## IV. /regression

Contains scripts that fit the following models to the data: univariate linear regression, multivariate linear regression, MARS, and SVR.

```shell
cd regression
python3 univariate.py
python3 multivariate.py
python3 mars.py
python3 svr.py
```
- `univariate.py`
  - Fits a univariate linear regression model between mins_to_100_comment and each other transformed feature.
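A univariate fit of this kind can be sketched with scikit-learn on synthetic data (the feature and target here are stand-ins, not the real columns):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200).reshape(-1, 1)      # one transformed feature
y = 2.0 * x.ravel() + rng.normal(0, 0.1, 200)  # stand-in outcome

# Fit y on a single feature and report slope and R^2.
model = LinearRegression().fit(x, y)
print("slope:", model.coef_[0], "R^2:", model.score(x, y))
```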
- `multivariate.py`
  - Fits a multivariate linear regression model on all features to predict mins_to_100_comment.
  - The fit is repeated multiple times; the number of repetitions is defined by NUMITER.
  - In each iteration, features are added to the model one at a time in a random order, and the change in R2 and RMSE from adding each feature is noted. P-values and coefficients are displayed. Across all iterations, the script tracks which six features' addition leads to the largest increase in the R2 score.
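The random-order forward addition can be sketched as follows. The feature names, data, and NUMITER value are synthetic, and this simplification credits each feature with its marginal R2 gain rather than tracking six-feature subsets as the real script does:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n, features = 300, ["f1", "f2", "f3", "f4"]
X = rng.normal(size=(n, len(features)))
y = 3 * X[:, 0] + 1 * X[:, 2] + rng.normal(0, 0.5, n)  # f1 dominates

NUMITER = 20
gain_per_feature = {f: 0.0 for f in features}

for _ in range(NUMITER):
    order = rng.permutation(len(features))  # random addition order
    included, prev_r2 = [], 0.0
    for idx in order:
        included.append(idx)
        model = LinearRegression().fit(X[:, included], y)
        r2 = r2_score(y, model.predict(X[:, included]))
        # Credit this feature with the marginal R^2 gain of adding it.
        gain_per_feature[features[idx]] += r2 - prev_r2
        prev_r2 = r2

best = sorted(gain_per_feature, key=gain_per_feature.get, reverse=True)
print("features ranked by accumulated R^2 gain:", best)
```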
- `mars.py`
  - Fits a MARS model to the top 6 features chosen by the multivariate linear regression model and plots feature importances.
- `svr.py`
  - Performs a grid search and then fits an SVR with an RBF kernel to the top 6 features.
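The grid search plus RBF-kernel SVR can be sketched with scikit-learn; the parameter grid and the synthetic six-feature data are assumptions, not the script's actual settings:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 6))  # stand-in for the top 6 features
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=200)

# Cross-validated grid search over C and gamma for an RBF-kernel SVR;
# GridSearchCV refits on the full data with the best parameters by default.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1.0]}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=3)
search.fit(X, y)
print("best params:", search.best_params_, "CV R^2:", search.best_score_)
```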