chaitanyakasaraneni/youtubetrendingdataanalysis

YouTube Trending Video Data Analysis

This project was developed as part of the CMPE 255 course.

Data Description:

This dataset includes several months (and counting) of data on daily trending YouTube videos. Data is included for the US, GB, DE, CA, FR, RU, MX, KR, JP and IN regions (USA, Great Britain, Germany, Canada, France, Russia, Mexico, South Korea, Japan and India respectively), with up to 200 listed trending videos per day.

Each region’s data is in a separate file. Data includes the video title, channel title, publish time, tags, views, likes and dislikes, description, and comment count.

The data also includes a category_id field, whose meaning varies between regions. To retrieve the category name for a specific video, look up its category_id in the associated JSON file. One such file is included for each region in the dataset.
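
The lookup described above can be sketched as follows. This is a minimal example, not the project's own code: it assumes each category JSON file follows the YouTube Data API layout, with an `items` list whose entries carry an `id` and a `snippet.title`. The function name `category_names` and the two-entry sample are hypothetical.

```python
import json


def category_names(category_json):
    """Return a {category_id: category_title} dict from one region's JSON text.

    Assumes the layout {"items": [{"id": "...", "snippet": {"title": "..."}}]}.
    """
    data = json.loads(category_json)
    return {int(item["id"]): item["snippet"]["title"] for item in data["items"]}


# Hypothetical sample in the assumed layout (not taken from the dataset).
sample = (
    '{"items": ['
    '{"id": "10", "snippet": {"title": "Music"}}, '
    '{"id": "17", "snippet": {"title": "Sports"}}'
    ']}'
)

names = category_names(sample)
print(names[10])
```

With a real file, the text passed to `category_names` would come from reading that region's JSON file, and the resulting dict could be used to translate the category_id column into readable names.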

For more information on specific columns in the dataset, refer to the column metadata. Only the US, CA, GB, and IN data were used for this project.

Tasks performed:

Analysis and prediction were performed on the YouTube trending video dataset:

  • Task 1: Predicting the Category of a YouTube video based on its Title.
  • Task 2: Predicting the number of views (popularity) of a particular video given its Title.
  • Task 3: Sentiment analysis of the description, tags and title.

Algorithms used:

Task 1: Predicting the Category of a YouTube video based on its Title.

  - Multinomial NB
  - Support Vector Classifier
  - Random Forest Classifier
  - K Neighbors Classifier
  - Decision Tree Classifier
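
As an illustration of Task 1, the sketch below trains one of the listed classifiers (Multinomial NB) on video titles. It assumes scikit-learn and TF-IDF features, neither of which is confirmed by this README, and the titles and category labels are hypothetical toy data rather than rows from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy titles and category labels (not from the dataset).
titles = [
    "Official music video new single",
    "Live concert music performance",
    "Football match highlights goals",
    "Basketball game highlights dunk",
]
categories = ["Music", "Music", "Sports", "Sports"]

# Turn each title into TF-IDF features, then fit a Multinomial Naive Bayes model.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(titles, categories)

predicted = model.predict(["new music single", "football goals"])
print(list(predicted))
```

The other classifiers in the list could be compared by swapping the last pipeline step, for example for `sklearn.svm.SVC` or `sklearn.ensemble.RandomForestClassifier`.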

Task 2: Predicting the number of views (popularity) of a particular video given its Title.

  - Linear Regression
  - Random Forest Regressor
  - Gradient Boosting Regressor
  - Ridge Regression
  - ElasticNet
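
A comparable sketch for Task 2 uses Ridge Regression from the list above. The TF-IDF features and the log transform of the view counts are assumptions made here for illustration (view counts tend to be heavily skewed); the README does not state how the project prepared its features or target. The titles and view counts are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Hypothetical toy titles and view counts (not from the dataset).
titles = [
    "Official music video new single",
    "Live concert music performance",
    "Football match highlights goals",
    "Basketball game highlights dunk",
]
views = np.array([1_200_000, 300_000, 800_000, 50_000])

# Fit on log1p(views) so very large counts do not dominate the squared error
# (an assumption of this sketch, not a documented step of the project).
model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))
model.fit(titles, np.log1p(views))

# Undo the log transform to get predictions back on the original view scale.
predicted_views = np.expm1(model.predict(["new music video"]))
print(predicted_views)
```

Replacing `Ridge` with `LinearRegression`, `ElasticNet`, `RandomForestRegressor`, or `GradientBoostingRegressor` from scikit-learn gives the other models in the list under the same setup.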

Task 3: Sentiment analysis of the description, tags and title.

  - TextBlob
  - Support Vector Machine (SVM)
  - Logistic Regression

Note:

As the dataset has no separate sentiment feature, TextBlob was first used to perform sentiment analysis on the description, tags, and title separately, and the result was attached to the dataframe as a new column called sentiment. SVM and Logistic Regression were then trained and evaluated on that dataframe. The reported accuracy for SVM and Logistic Regression therefore measures how well they agree with the TextBlob-generated labels.
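
The labelling and evaluation steps in this note can be sketched as below. The mapping from polarity score to a three-way label and the threshold of 0.0 are assumptions, since the README does not say how TextBlob's output was converted into the sentiment column. The helper names are hypothetical, and `textblob_polarity` needs the third-party textblob package, so the example run uses hard-coded polarity values instead.

```python
def polarity_to_label(polarity, threshold=0.0):
    """Map a polarity score in [-1, 1] to a label (assumed scheme)."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


def textblob_polarity(text):
    """Polarity score from TextBlob (requires the textblob package)."""
    from textblob import TextBlob  # imported lazily; third-party dependency

    return TextBlob(text).sentiment.polarity


def agreement(reference_labels, predicted_labels):
    """Fraction of predictions that match the reference (TextBlob) labels."""
    matches = sum(r == p for r, p in zip(reference_labels, predicted_labels))
    return matches / len(reference_labels)


# Hypothetical polarity scores standing in for textblob_polarity(...) output.
polarities = [0.8, -0.5, 0.0]
labels = [polarity_to_label(p) for p in polarities]

# Hypothetical classifier predictions, scored against the TextBlob-style labels.
score = agreement(labels, ["positive", "positive", "neutral"])
print(labels, score)
```

Because the reference labels come from TextBlob itself, a high `agreement` value shows only that a classifier reproduces TextBlob's judgements.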

Instructions to run:

You will need Jupyter Notebook or any Python IDE, with Python 3.0 or later installed, to run the code.
Before loading the data, change the data directory path in the code to match the location of the dataset on your machine.
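
One way to keep the data location in a single place is sketched below. The `DATA_DIR` value, the `region_files` helper, and the file-name patterns (`USvideos.csv`, `US_category_id.json`) are assumptions for illustration; adjust them to match the files you actually downloaded.

```python
from pathlib import Path

# Hypothetical location of the downloaded dataset; edit this one line.
DATA_DIR = Path("data")


def region_files(region, data_dir=DATA_DIR):
    """Return (csv_path, json_path) for a region code such as "US".

    The file-name patterns are assumed, not confirmed by this README.
    """
    return data_dir / f"{region}videos.csv", data_dir / f"{region}_category_id.json"


# The four regions used in this project.
for region in ["US", "CA", "GB", "IN"]:
    csv_path, json_path = region_files(region)
    print(csv_path, json_path)

csv_path, json_path = region_files("US")
```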
