This directory contains code for data prep (`/dataprep`), EDA (`/eda`), and regression models fit to the data (`/regression`). Details about each are in the sections below. Scripts in `/dataprep` must be executed first to populate `/data`; the EDA and regression scripts can then be executed. Figures generated by a script are stored in a `figures` subdirectory of the directory the script lives in, i.e. in `/eda/figures` and `/regression/figures`.
## Prerequisites

Install the following packages:

```shell
pip3 install pandas
pip3 install plotly
pip3 install psutil
pip3 install requests
pip3 install matplotlib
pip3 install vaderSentiment
pip3 install scikit-learn
pip3 install yellowbrick
pip3 install statsmodels
pip3 install Cython
```

Then build and install py-earth from source (GitHub no longer serves the `git://` protocol, so clone over `https://`):

```shell
git clone https://github.com/scikit-learn-contrib/py-earth.git
cd py-earth
sudo python3 setup.py install --cythonize
```
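After installing, a quick sanity check confirms everything is importable. The import names below (e.g. `sklearn` for scikit-learn, `pyearth` for py-earth) are the standard ones, but treat this as a sketch against your own environment:

```python
import importlib.util

# PyPI names and import names differ for some packages:
# scikit-learn -> sklearn, py-earth -> pyearth.
required = [
    "pandas", "plotly", "psutil", "requests", "matplotlib",
    "vaderSentiment", "sklearn", "yellowbrick", "statsmodels", "pyearth",
]

# find_spec returns None for packages that are not installed.
missing = [name for name in required if importlib.util.find_spec(name) is None]
print("missing packages:", missing or "none")
```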
## I. /dataprep

Contains scripts to derive new features from the existing features in the Facebook News Dataset.

- Clone the Facebook News Dataset into the `/dataprep` folder:

  ```shell
  cd dataprep
  git clone https://github.com/jbencina/facebook-news.git
  cd facebook-news
  ```

- Unzip the two 7z files. This should result in two new csv files: `/facebook-news/fb_news_posts_20K.csv` and `/facebook-news/fb_news_comments_1000K_hashed.csv`.

- Run the scripts that derive new features. This should create two new csv files, `/data/commentsProcessed.csv` and `/data/postsProcessed.csv`:

  ```shell
  cd ..
  python3 prep_comments.py
  python3 prep_posts.py
  ```
## II. /data

Contains csv files with all original and derived features. These can be either:

- Generated by running the scripts in `/dataprep` (steps in section I), OR
- Downloaded from https://drive.google.com/drive/folders/10RDSPErTmFY9LI-s07HNCfw90iSaW77_?usp=sharing

Either way, this folder must contain the two files `/data/commentsProcessed.csv` and `/data/postsProcessed.csv` before scripts from subsequent sections / other folders are executed. Other csv files with transformed features, created by subsequent scripts like `/eda/reactions.py`, are also placed in this folder when created.
## III. /eda

Contains scripts to perform exploratory data analysis. Execute these scripts in the order listed below, because each one produces a new csv that includes any new or transformed features created in the process.

```shell
cd eda
python3 reactions.py
python3 sentiment.py
python3 distributions.py
python3 relationships_outcome.py
python3 relationships_sentiment.py
python3 relationships_day_time.py
```
- `reactions.py`
  - Plots and analyzes the distribution of the fraction of each type of reaction on each post, i.e. the distributions of six features: frac_angry, frac_wow, frac_sad, frac_haha, frac_love and frac_like.
  - Plots and analyzes the distribution of the number of reactions (of each type) obtained per second, and the distributions of their log-transformed values (the log transform yields an approximately normal distribution).
  - Reads from `/data/postsProcessed.csv`.
  - Writes log-transformed features to a new csv, `/data/postsWithFracReactions.csv`.
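The fraction and per-second computations can be sketched as follows; the raw column names (`likes`, `age_secs`, etc.) are hypothetical stand-ins for whatever `postsProcessed.csv` actually contains:

```python
import numpy as np
import pandas as pd

# Two hypothetical posts; real column names in postsProcessed.csv may differ.
posts = pd.DataFrame({
    "likes": [120, 30], "loves": [10, 5], "hahas": [4, 0],
    "wows": [2, 1], "sads": [0, 3], "angrys": [1, 6],
    "age_secs": [3600, 7200],  # seconds since the post was created
})

reaction_cols = ["likes", "loves", "hahas", "wows", "sads", "angrys"]
total = posts[reaction_cols].sum(axis=1)

# Fraction of each reaction type per post: frac_like, frac_angry, ...
for col in reaction_cols:
    posts["frac_" + col.rstrip("s")] = posts[col] / total

# Reactions obtained per second, plus a log transform that pulls the
# heavy right tail toward an approximately normal shape.
posts["reactions_per_sec"] = total / posts["age_secs"]
posts["log_reactions_per_sec"] = np.log(posts["reactions_per_sec"])
```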
- `sentiment.py`
  - Plots the distribution of each of the sentiment fields.
  - Displays information to help analyze the relationship between the fraction of reactions of each type received by a post and its sentiment as computed by VADER.
- `distributions.py`
  - Plots the distributions of all other features: mins_to_first_comment, day of the week a post was created, time of day it was created, number of shares per second, and number of reactions per second.
  - Applies a reciprocal root transform to make mins_to_first_comment, mins_to_100_comment, shares_per_sec and reactions_per_sec approximately normal.
  - Reads from `/data/postsWithFracReactions.csv`.
  - Writes reciprocal-root-transformed features to `/data/postsWithReciRoot.csv`.
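A reciprocal root transform can be sketched as below; the exact form (here `1 / sqrt(x + 1)`, with the `+1` guarding against division by zero) and the output column name are assumptions, not necessarily what the script uses:

```python
import numpy as np
import pandas as pd

def reciprocal_root(x):
    # Assumed form of the transform: 1 / sqrt(x + 1).
    return 1.0 / np.sqrt(x + 1.0)

posts = pd.DataFrame({"mins_to_first_comment": [0.0, 3.0, 15.0, 120.0]})
posts["reci_root_mins_to_first_comment"] = reciprocal_root(
    posts["mins_to_first_comment"]
)
print(posts)
```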
- `relationships_outcome.py`
  - Explores the relationship between each (transformed) feature and the mins_to_100_comment feature.
  - Plots either scatter plots or line plots showing the mean over a certain category.
  - Reads from `/data/postsWithReciRoot.csv`.
- `relationships_sentiment.py`
  - Explores the relationships between the sentiment of a post and all other features.
  - Plots line plots binned by sentiment (bin width 0.1).
  - Reads from `/data/postsWithReciRoot.csv`.
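The width-0.1 sentiment binning can be sketched with `pandas.cut`; the data and the `frac_angry` column here are synthetic stand-ins:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
posts = pd.DataFrame({
    "sentiment": rng.uniform(-1, 1, 500),    # stand-in for a VADER score
    "frac_angry": rng.uniform(0, 0.3, 500),  # hypothetical feature
})

# Bin sentiment into width-0.1 intervals and take the mean of the other
# feature inside each bin; plotting bin centres against these means
# gives the line plot described above.
bins = np.linspace(-1.0, 1.0, 21)  # 20 bins of width 0.1
posts["sent_bin"] = pd.cut(posts["sentiment"], bins)
binned_means = posts.groupby("sent_bin", observed=True)["frac_angry"].mean()
print(binned_means.head())
```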
- `relationships_day_time.py`
  - Explores the relationships between the day/time of a post's creation and all other features.
  - Plots line plots binned by one-hot encoded day/time features.
  - Reads from `/data/postsWithReciRoot.csv`.
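One-hot encoding of day/time features can be sketched with `pandas.get_dummies`; the timestamps and the four time-of-day buckets below are illustrative assumptions:

```python
import pandas as pd

posts = pd.DataFrame({
    "created_time": pd.to_datetime([
        "2017-05-01 08:30", "2017-05-02 14:10", "2017-05-06 22:45",
    ]),
})

# Derive day-of-week and a coarse time-of-day bucket, then one-hot
# encode both; the encoded columns can be used to bin other features.
posts["day"] = posts["created_time"].dt.day_name()
posts["time_of_day"] = pd.cut(
    posts["created_time"].dt.hour,
    bins=[0, 6, 12, 18, 24],
    labels=["night", "morning", "afternoon", "evening"],
    right=False,
)
encoded = pd.get_dummies(posts[["day", "time_of_day"]])
print(encoded.columns.tolist())
```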
## IV. /regression

Contains scripts that fit the following models to the data: univariate linear regression, multivariate linear regression, MARS, and SVR.

```shell
cd regression
python3 univariate.py
python3 multivariate.py
python3 mars.py
python3 svr.py
```
- `univariate.py`
  - Fits a univariate linear regression model between mins_to_100_comment and each other transformed feature.
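A univariate fit of this kind can be sketched with scikit-learn on synthetic data (the feature and target here are stand-ins, not the real columns):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200).reshape(-1, 1)      # one transformed feature
y = 2.0 * x.ravel() + rng.normal(0, 0.1, 200)  # stand-in outcome

# Fit y on a single feature and report slope and R^2.
model = LinearRegression().fit(x, y)
print("slope:", model.coef_[0], "R^2:", model.score(x, y))
```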
- `multivariate.py`
  - Fits a multivariate linear regression model on all features to predict mins_to_100_comment.
  - The fit is repeated multiple times; the number of repetitions is defined by NUMITER.
  - In each iteration, features are added to the model one at a time in a random order, and the change in R2 and RMSE from adding each feature is noted. P-values and coefficients are displayed. Across all iterations, the script tracks which six features' addition leads to the largest increase in the R2 score.
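The random-order forward addition can be sketched as follows. The feature names, data, and NUMITER value are synthetic, and this simplification credits each feature with its marginal R2 gain rather than tracking six-feature subsets as the real script does:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n, features = 300, ["f1", "f2", "f3", "f4"]
X = rng.normal(size=(n, len(features)))
y = 3 * X[:, 0] + 1 * X[:, 2] + rng.normal(0, 0.5, n)  # f1 dominates

NUMITER = 20
gain_per_feature = {f: 0.0 for f in features}

for _ in range(NUMITER):
    order = rng.permutation(len(features))  # random addition order
    included, prev_r2 = [], 0.0
    for idx in order:
        included.append(idx)
        model = LinearRegression().fit(X[:, included], y)
        r2 = r2_score(y, model.predict(X[:, included]))
        # Credit this feature with the marginal R^2 gain of adding it.
        gain_per_feature[features[idx]] += r2 - prev_r2
        prev_r2 = r2

best = sorted(gain_per_feature, key=gain_per_feature.get, reverse=True)
print("features ranked by accumulated R^2 gain:", best)
```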
- `mars.py`
  - Fits a MARS model to the top 6 features chosen by the multivariate linear regression model and plots feature importances.
- `svr.py`
  - Performs a grid search and then fits an SVR with an RBF kernel to the top 6 features.
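The grid search plus RBF-kernel SVR can be sketched with scikit-learn; the parameter grid and the synthetic six-feature data are assumptions, not the script's actual settings:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 6))  # stand-in for the top 6 features
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=200)

# Cross-validated grid search over C and gamma for an RBF-kernel SVR;
# GridSearchCV refits on the full data with the best parameters by default.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1.0]}
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=3)
search.fit(X, y)
print("best params:", search.best_params_, "CV R^2:", search.best_score_)
```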