Skip to content

AWS cloud deployment of big data train/test pipeline of gigabyte scale NLP machine learning algorithm to analyze the sentiment of Amazon product reviews using AWS elastic map reduce architecture

anubhavcoder/AWS-EMR-AmazonReviews-NLP-Sentiment-Analysis

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

38 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Gigabyte scale Amazon Product Reviews Sentiment Analysis Challenge: A scalable pySPARK NLP Machine Learning Pipeline hosted on AWS EMR Cloud

This project implements the Amazon Web Services (AWS) cloud deployment of big data train/test pipeline of gigabyte scale data using parallelized Natural Language Processing (NLP) machine learning algorithms to analyze the sentiment of Amazon product reviews using the AWS elastic map reduce architecture. The fully automated object oriented code was implemented using pySpark machine learning and pySpark SQL libraries. Prior to building the big data solution, I did extensive data cleaning, feature engineering, data visualization, and built prototype models to investigate the underlying data using sci-kit learn and seaborn. The details of the prototype model can be found in the /prototype/prototype.ipynb notebook.

Data Source: Amazon Review Data (2018) was sourced from deepyeti.ucsd.edu/jianmo/amazon/index.html

The per category review data was unzipped and stored in the Amazon S3 bucket. A total of 10 GB of data and 20 million reviews were used.

Feature Data Pipeline:

Three class labels were generated for the positive, negative, and neutral classes. For the two class labels, positive and neutral/negative labels were used.

The feature data was processed using the pySpark feature pipeline shown in the illustration below:

The following AWS architecture was launched to run the pySpark feature pipeline and machine learning models:

A total of 12 cluster nodes were used to deploy the above pipeline. The big data model yielded better performance compared to the baseline prototype model. This model can be scaled up to handle larger chunks of data by allocating more cluster nodes in the launch cluster script depending on the user requirements.

Results:

For the three class label case, the following results were obtained for the best case Logistic Regression + TFIDF model:

81% Precision Positive Sentiment Class, 72% Precision For Negative Sentiment Class, 55% Precision For Neutral Sentiment Class

The low precision for the neutral class can be explained by the classification of the neutral positive text as positive or the neutral negative text as negative. This phenomenon can be further understood by exploring the feature importances of neutral words with high gini coefficients. In the graphs below we observe that the word 'would' which is equally likely to occur in the positive, negative, neutral class can be misclassified as positive if positive sounding text like 'would have' had occured or as negative if 'would not' occured.

However, if the best case logistic Regression + TFIDF model was trained only on the positive and negative classes, we get a positive class precision of 92% and a negative class precision of 83%. This further confirms the hypothesis that neutral class data which is neutral-positive in character can be classified as positive or neutral-negative sounding text can be misclassified as negative.

Tech Stack Used:

AWS (S3 bucket, boto3, Dynamo DB, PySpark EMR), Yarn, hadoop

Scikit-learn, seaborn, python, pySpark ML, pySpark SQL, Numpy, Pandas, NLTK-NLP

Notes:

The instructions to launch the AWS EMR cluster and execute the code are under /scripts/AWS_EMR_Configure_Instructions.

The source files are under /src/

The prototype model Jupyter Notebook can be found in the folder /prototype/

About

AWS cloud deployment of big data train/test pipeline of gigabyte scale NLP machine learning algorithm to analyze the sentiment of Amazon product reviews using AWS elastic map reduce architecture

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Jupyter Notebook 94.3%
  • Python 5.1%
  • Shell 0.6%