thanikaReddy/fbNewsAnalysis

This directory contains code for data prep (/dataprep), EDA (/eda) and regression models that are fit to the data (/regression). Details about each are in the sections below - scripts in /dataprep need to be executed first to populate /data. The scripts for EDA and regression can then be executed.

Figures generated by a script are stored in the figures subdirectory (in the directory the script is in), i.e. in /eda/figures and /regression/figures.

Setup

Install the following packages:

pip3 install pandas
pip3 install plotly
pip3 install psutil
pip3 install requests
pip3 install matplotlib
pip3 install vaderSentiment
pip3 install scikit-learn
pip3 install yellowbrick
pip3 install statsmodels
pip3 install Cython
git clone https://github.com/scikit-learn-contrib/py-earth.git
cd py-earth
sudo python3 setup.py install --cythonize

I. /dataprep

Contains scripts to derive features from already existing features in the Facebook News Dataset.

  • Clone the Facebook News Dataset into the /dataprep folder.
    cd dataprep
    git clone https://github.com/jbencina/facebook-news.git
    cd facebook-news 
    
  • Unzip the two 7z files. This should result in two new csv files: /facebook-news/fb_news_posts_20K.csv and /facebook-news/fb_news_comments_1000K_hashed.csv
  • Run the scripts that derive new features. This should create two new csv files: /data/commentsProcessed.csv and /data/postsProcessed.csv
    cd ..
    python3 prep_comments.py
    python3 prep_posts.py
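
As an illustration of the kind of derived feature these scripts produce, here is a minimal sketch of how a timing feature such as mins_to_first_comment or mins_to_100_comment could be computed from a post's creation time and its comment timestamps. This is an assumption about the approach, not the code in prep_posts.py; the function name minutes_to_nth_comment and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def minutes_to_nth_comment(post_created, comment_times, n):
    """Minutes from post creation to its n-th comment (1-based).

    Returns None when the post has fewer than n comments.
    Hypothetical helper, not taken from the repository.
    """
    ordered = sorted(comment_times)
    if len(ordered) < n:
        return None
    return (ordered[n - 1] - post_created).total_seconds() / 60.0


# Made-up example: one comment every 30 seconds after the post is created.
post_created = datetime(2017, 1, 1, 12, 0, 0)
comment_times = [post_created + timedelta(seconds=30 * i) for i in range(1, 151)]

mins_to_first_comment = minutes_to_nth_comment(post_created, comment_times, 1)
mins_to_100_comment = minutes_to_nth_comment(post_created, comment_times, 100)
```

Posts that never reach the requested number of comments yield None here; how the actual scripts treat such posts is not stated in this README.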
    

II. /data

Contains csv files with all original and derived features. These can be:

  • Generated by running the scripts in /dataprep (steps in section I)

OR

Either way, this folder must contain two files, /data/commentsProcessed.csv and /data/postsProcessed.csv, before scripts from subsequent sections / other folders are executed.

Other csv files with transformed features, created by subsequent scripts such as /eda/reactions.py, are also placed in this folder when they are created.
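
The transformed features mentioned here come from the two transforms described under /eda: a log transform (reactions.py) and a reciprocal root transform (distributions.py). The sketch below shows both on made-up numbers. It assumes strictly positive inputs and reads "reciprocal root" as 1/sqrt(x). The helper names and the skewness check are illustrative and not taken from the scripts, which may handle zeros or define the transform differently.

```python
import math


def log_transform(values):
    """Natural log of each value (inputs assumed strictly positive)."""
    return [math.log(v) for v in values]


def reciprocal_root(values):
    """1 / sqrt(x) for each value (inputs assumed strictly positive)."""
    return [1.0 / math.sqrt(v) for v in values]


def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up, heavily right-skewed rates (e.g. reactions per second).
rates = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]

raw_skew = skewness(rates)
log_skew = skewness(log_transform(rates))
recip = reciprocal_root([4.0, 16.0])
```

On this toy series the raw values are right-skewed while the log-transformed values are symmetric, which is the effect the EDA scripts rely on when they transform features before regression.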

III. /eda

Contains scripts to perform exploratory data analysis. Execute these scripts in the order listed below, because each one produces a new csv that includes any new or transformed features created in the process.

cd eda
python3 reactions.py
python3 sentiment.py
python3 distributions.py
python3 relationships_outcome.py
python3 relationships_sentiment.py
python3 relationships_day_time.py
  • reactions.py

    • Plots and analyzes the distribution of the fraction of each type of reaction on each post, i.e. it looks at the distribution of six features: frac_angry, frac_wow, frac_sad, frac_haha, frac_love and frac_like.
    • Plots and analyzes the distribution of the number of reactions (of each type) obtained per second, and the distributions of their log-transformed values (the log-transformed values are approximately normally distributed).
    • Reads from /data/postsProcessed.csv
    • Writes log-transformed features to a new csv /data/postsWithFracReactions.csv
  • sentiment.py

    • Plots the distribution of each of the sentiment fields
    • Displays information to help analyze the relationship between the fraction of reactions of each type received by a post, and its sentiment as computed by VADER
  • distributions.py

    • Plots the distribution of all other features: mins_to_first_comment, day of the week a post was created, time of the day it was created, number of shares per second, number of reactions per second.
    • Applies a reciprocal root transform to bring mins_to_first_comment, mins_to_100_comment, shares_per_sec and reactions_per_sec closer to a normal distribution.
    • Reads from /data/postsWithFracReactions.csv
    • Writes reciprocal-root transformed features to /data/postsWithReciRoot.csv
  • relationships_outcome.py

    • Explores the relationships between each (transformed) feature and the mins_to_100_comment feature.
    • Plots either scatter plots, or line plots showing the mean over a certain category.
    • Reads from /data/postsWithReciRoot.csv
  • relationships_sentiment.py

    • Explores the relationships between the sentiment of a post and all other features.
    • Plots line plots binned by sentiment (bin width 0.1)
    • Reads from /data/postsWithReciRoot.csv
  • relationships_day_time.py

    • Explores the relationships between the day/time of a post's creation and all other features.
    • Plots line plots binned by one-hot encoded day/time features
    • Reads from /data/postsWithReciRoot.csv
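
The binned line plots described for relationships_sentiment.py can be sketched as follows: group posts into sentiment bins of width 0.1 and take the mean of another feature within each bin. This is a minimal sketch on made-up data. The names bin_start and mean_by_bin are hypothetical, and it assumes the sentiment value is a score in [-1, 1] such as VADER's compound score.

```python
import math
from collections import defaultdict

BIN_WIDTH = 0.1


def bin_start(score, width=BIN_WIDTH):
    """Lower edge of the bin containing score.

    The quotient is rounded before flooring to guard against float
    noise (0.3 / 0.1 evaluates to 2.9999999999999996 in Python).
    """
    index = math.floor(round(score / width, 9))
    return round(index * width, 10)


def mean_by_bin(pairs, width=BIN_WIDTH):
    """Map each bin's lower edge to the mean value of the pairs in it."""
    groups = defaultdict(list)
    for score, value in pairs:
        groups[bin_start(score, width)].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(groups.items())}


# Made-up (sentiment, feature value) pairs.
pairs = [(-0.95, 40.0), (-0.91, 60.0), (0.02, 30.0), (0.07, 10.0), (0.55, 25.0)]
binned = mean_by_bin(pairs)
```

Plotting the bin edges against the bin means gives one line per feature; bins with no posts are simply absent from the result.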

IV. /regression

Contains scripts that fit the following models to the data: univariate linear regression, multivariate linear regression, MARS and SVR.

cd regression
python3 univariate.py
python3 multivariate.py
python3 mars.py
python3 svr.py
  • univariate.py
    • Fits a univariate linear regression model between mins_to_100_comment and each other transformed feature.
  • multivariate.py
    • Fits a multivariate linear regression model to all features, to predict mins_to_100_comment
    • Fits a multivariate linear regression model multiple times - the number of times is defined by NUMITER.
    • In each iteration, features are added to the model one at a time in a random order, and the change in R2 and RMSE is recorded for each added feature. P-values and coefficients are displayed. Across all iterations, the script tracks which six features produce the largest increase in the R2 score when added.
  • mars.py
    • Fits a MARS model to the top 6 features chosen by the multivariate linear regression model and plots feature importances.
  • svr.py
    • Performs a grid search and then fits an SVR model with an RBF kernel to the top 6 features.
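
The random-order feature addition described for multivariate.py can be sketched as below. This is one possible reading of that procedure, not the script itself: it uses a small pure-Python least-squares fit instead of a library, tracks only in-sample R2 (no RMSE or p-values), and averages each feature's R2 gain over the iterations. The names solve, r_squared and mean_r2_gain, the num_iter parameter (standing in for NUMITER) and the synthetic data are all assumptions for illustration.

```python
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def r_squared(columns, y):
    """In-sample R2 of an OLS fit of y on the columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * y[i] for i, row in enumerate(design)) for p in range(k)]
    beta = solve(xtx, xty)
    preds = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, preds))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def mean_r2_gain(features, y, num_iter, seed=0):
    """Average R2 increase per feature over num_iter random addition orders."""
    rng = random.Random(seed)
    totals = {name: 0.0 for name in features}
    for _ in range(num_iter):
        order = list(features)
        rng.shuffle(order)
        used, previous = [], 0.0
        for name in order:
            used.append(features[name])
            current = r_squared(used, y)
            totals[name] += current - previous
            previous = current
    return {name: total / num_iter for name, total in totals.items()}


# Synthetic data: y depends strongly on "a", weakly on "b", not at all on "c".
data_rng = random.Random(42)
n_rows = 200
features = {name: [data_rng.gauss(0, 1) for _ in range(n_rows)] for name in "abc"}
y = [3 * a + b + 0.1 * data_rng.gauss(0, 1)
     for a, b in zip(features["a"], features["b"])]

gains = mean_r2_gain(features, y, num_iter=20)
ranking = sorted(gains, key=gains.get, reverse=True)
```

Taking the first six entries of such a ranking would give a "top 6" feature list of the kind passed on to mars.py and svr.py. Averaging over several random orders matters because a feature's R2 gain depends on which correlated features are already in the model.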
