Amazon Review Validator

My data science project for the Galvanize DSI: predicting whether an Amazon review is helpful.

Automation Pipeline

Planning and Reasoning

The Amazon Customer Review dataset is hosted on Amazon S3, but my $1000 AWS credit is not available yet, so I decided to use my Azure VMs, which are only about 20% utilized.

I am going to deploy Spark Docker containers to perform the data analysis, and may create a Spark cluster in the production stage.

Data Migration AWS S3 Bucket -> Azure Blob Storage

The Azure documentation is outdated, but Azure staff responded to my questions within a reasonable timeframe.

Azcopy ss
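
The exact AzCopy invocation used for the migration is not recorded here, so the following is only a minimal sketch of how an S3 to Blob Storage copy could be assembled. The helper name `build_azcopy_command`, the bucket, prefix, storage account, container, and SAS token values are all hypothetical placeholders. It assumes AzCopy v10, which reads the AWS credentials from the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables.

```python
import shlex


def build_azcopy_command(s3_bucket, s3_prefix, storage_account, container, sas_token):
    """Build the argv list for an AzCopy S3 -> Azure Blob Storage copy.

    All arguments are placeholders supplied by the caller; nothing here
    is taken from the project's real configuration.
    """
    source = f"https://s3.amazonaws.com/{s3_bucket}/{s3_prefix}"
    destination = (
        f"https://{storage_account}.blob.core.windows.net/{container}?{sas_token}"
    )
    # --recursive copies every object under the source prefix.
    return ["azcopy", "copy", source, destination, "--recursive=true"]


# Hypothetical values, for illustration only.
command = build_azcopy_command(
    s3_bucket="example-bucket",
    s3_prefix="tsv/",
    storage_account="examplestorage",
    container="reviews",
    sas_token="<SAS-token>",
)
print(shlex.join(command))
# To actually run it: subprocess.run(command, check=True)
```

Building the command as a list, rather than one shell string, avoids quoting problems with the `?` and `&` characters that a SAS token usually contains.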

Data Preparation

Convert the TSV files to Parquet in partitions, because Spark performs best with Parquet files and some of the original TSV data files are larger than 2GB; once partitioned, they can be read by Spark in parallel.
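
A minimal sketch of this conversion step, assuming PySpark and a header row in each TSV file. The function names, the paths, and the 128 MiB target partition size are illustrative assumptions, not values taken from the notebooks.

```python
import math


def num_partitions(file_size_bytes, target_partition_bytes=128 * 1024**2):
    """Pick a partition count so each partition is roughly the target size.

    The 128 MiB default is an assumption, not a value from this project.
    """
    return max(1, math.ceil(file_size_bytes / target_partition_bytes))


def tsv_to_parquet(spark, tsv_path, parquet_path, partitions):
    """Read one TSV file with Spark and rewrite it as partitioned Parquet.

    `spark` is an existing SparkSession passed in by the caller.
    """
    df = spark.read.csv(tsv_path, sep="\t", header=True)
    # repartition() spreads the rows across `partitions` output files,
    # which Spark can later read back in parallel.
    df.repartition(partitions).write.mode("overwrite").parquet(parquet_path)


# Example usage (hypothetical paths; requires pyspark to be installed):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()
#   tsv_to_parquet(spark, "reviews.tsv", "reviews_parquet",
#                  num_partitions(2 * 1024**3))
```

With these defaults a 2 GiB input file is split into 16 partitions. Real review text may contain quote characters that need extra reader options, which this sketch does not handle.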

EDA

Hypothesis

In reviews with rating > 4 or < 2, people find reviews containing the 50 most frequently occurring words more helpful

Are reviews that use more frequent words more helpful, or reviews that use less frequent words?

H0: the mean helpfulness of reviews with the 100 most used words >= the mean helpfulness of the population
H1: the mean helpfulness of reviews with the 100 most used words < the mean helpfulness of the population

H0: the mean helpfulness of reviews with the 100 least used words >= the mean helpfulness of the population
H1: the mean helpfulness of reviews with the 100 least used words < the mean helpfulness of the population
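
One way such a hypothesis could be checked is a lower-tail one-sample z-test, sketched below with the Python standard library only. This is an assumption about the method, not the test used in the notebooks: it treats the population mean as a known constant and relies on the normal approximation, which is only reasonable for large samples. The function name and the toy numbers are illustrative.

```python
import math
from statistics import NormalDist, mean, stdev


def lower_tail_z_test(sample, population_mean):
    """Test H1: the sample mean is below the population mean.

    Returns the z statistic and the lower-tail p-value.
    """
    standard_error = stdev(sample) / math.sqrt(len(sample))
    z = (mean(sample) - population_mean) / standard_error
    # A small p-value is evidence for H1 (sample mean < population mean).
    p_value = NormalDist().cdf(z)
    return z, p_value


# Toy helpfulness scores, not project data.
toy_scores = [0.25, 0.5, 0.75]
z, p = lower_tail_z_test(toy_scores, population_mean=0.9)
print(f"z = {z:.3f}, p = {p:.4f}")
```

H0 is rejected when the p-value falls below the chosen significance level, for example 0.05.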

How about ratings?

Are higher-rated reviews or lower-rated reviews more helpful?

H0: the mean helpfulness of higher-rated reviews >= the mean helpfulness of the population
H1: the mean helpfulness of higher-rated reviews < the mean helpfulness of the population

H0: the mean helpfulness of lower-rated reviews >= the mean helpfulness of the population
H1: the mean helpfulness of lower-rated reviews < the mean helpfulness of the population
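
Before testing, the group means have to be computed. The sketch below shows one possible way, under several assumptions that are not stated in this README: helpfulness is defined as helpful votes divided by total votes, reviews with no votes are skipped, "higher" means a star rating of at least 4, and "lower" means at most 2. The function name and the input tuple layout are hypothetical.

```python
from statistics import mean


def mean_helpfulness_by_group(reviews, high_min=4, low_max=2):
    """Compute mean helpfulness for high-rated, low-rated, and all reviews.

    `reviews` is an iterable of (star_rating, helpful_votes, total_votes).
    The thresholds are assumptions; adjust them to the definition in use.
    """
    groups = {"high": [], "low": [], "all": []}
    for star_rating, helpful_votes, total_votes in reviews:
        if total_votes == 0:
            continue  # helpfulness is undefined without any votes
        score = helpful_votes / total_votes
        groups["all"].append(score)
        if star_rating >= high_min:
            groups["high"].append(score)
        elif star_rating <= low_max:
            groups["low"].append(score)
    # Leave out groups that ended up empty.
    return {name: mean(scores) for name, scores in groups.items() if scores}


# Toy rows, not project data: (star_rating, helpful_votes, total_votes)
toy_reviews = [(5, 3, 4), (1, 1, 4), (3, 2, 4), (5, 0, 0)]
print(mean_helpfulness_by_group(toy_reviews))
```

The "all" entry plays the role of the population mean that each group mean is compared against.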

0xd5dc/amazon_review_valitator