
A full-stack web application for Kindle book reviews using big data tools and analytics, built with Python, JavaScript, Flask, Hadoop, Spark, and AWS services.


50.043 Database and Big Data Systems Project

NOTE: Due to a recent update (9 December) to the Python cryptography package, which is required by the fabric package we use, some of our scripts may not work as intended. We informed Prof Dinh, Prof Kenny and GTA Vishva of this during the presentation. See here.

Table of Contents

  • Contributors
  • Automated Use Flow
  • Tearing Down Instructions
  • Description
  • Dataset
  • Frontend
  • Production Backend
  • Analytics
  • Extras

Contributors

  • Tharun
  • Nina
  • Hadi
  • Jun Yuan
  • Akmal Hakim

Automated Use Flow

Prerequisites

  • AWS EC2 Ubuntu 18.04 LTS (due to use of bash scripts)
  • AWS EC2 key pair created in the region 'us-east-1'

Instructions

Setting up project folder and required packages

  1. Install the necessary packages
     sudo apt-get -y update &&
     sudo apt -y install python3-pip &&
     pip3 install --upgrade pip &&
     pip3 install --upgrade setuptools &&
     pip3 install boto3 &&
     pip3 install -U python-dotenv &&
     sudo apt-get -y install build-essential libssl-dev libffi-dev python-dev &&
     pip3 install fabric
    
  2. Clone the project repository using the following command
    git clone https://github.com/thawalk/db-final.git
    

Running the production script

  1. Change directory into the db-final/production folder

    cd db-final/production
    
  2. Run production.py to set up and configure the EC2 instances. This will create instances for hosting the MongoDB, MySQL, frontend and backend servers (a minimal sketch of the pattern the script follows is shown after this list).

    python3 production.py
    

    NOTE: Upon execution, you will be prompted for your AWS Access Key ID, AWS Secret Access Key and AWS Session Token. These details can be found in your AWS account.

    In the event that production.py fails or gets stuck in a loop, stop the script with CTRL + C. Then tear down all the instances using the instructions HERE.

  3. The public IP of the EC2 instance hosting the web application will be printed on the console. The website can then be viewed at <public IP 1> printed at the end.

    This is the link to the website.
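
For reference, production.py drives AWS through boto3 (installed in the prerequisites above). The snippet below is a hypothetical, minimal sketch of the pattern such a script follows: prompt for the credentials, then launch an EC2 instance in us-east-1. The AMI ID, instance type and key pair name are placeholders, not the values the actual script uses.

    import getpass
    import boto3

    # Hypothetical sketch, not the project's exact code.
    access_key = input("AWS Access Key ID: ")
    secret_key = getpass.getpass("AWS Secret Access Key: ")
    session_token = getpass.getpass("AWS Session Token: ")

    ec2 = boto3.resource(
        "ec2",
        region_name="us-east-1",          # region of the key pair (see Prerequisites)
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        aws_session_token=session_token,
    )

    # Launch one Ubuntu 18.04 instance; production.py creates one instance per role
    # (MongoDB, MySQL, frontend, backend).
    instances = ec2.create_instances(
        ImageId="ami-xxxxxxxx",           # placeholder Ubuntu 18.04 AMI ID
        InstanceType="t2.medium",         # placeholder instance type
        KeyName="your-key-pair",          # placeholder EC2 key pair name
        MinCount=1,
        MaxCount=1,
    )
    print("Launched instance:", instances[0].id)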

Running the Analytics Script

  1. You should now be in db-final/production. For the analytics portion, head over to db-final/hadoop by running

    cd ..
    cd hadoop
    
  2. Let us now run the analytics script. This execution will take approximately 20-30 minutes. Run:

    python3 hadoop_spark.py
    

    NOTE: As with production.py, you will be asked for your AWS Access Key ID, AWS Secret Access Key and AWS Session Token. These details can be found in your AWS account. You will also be asked for the number of nodes for the cluster; key in the number of nodes accordingly.


    In the event that hadoop_spark.py fails or gets stuck in a loop, click HERE.

  3. The results of the analytics tasks will be stored in db-final/analytics_files. First, change directory to the analytics_files folder.

    cd ..
    cd analytics_files
    
  4. To obtain the results of the analytics, we run:

    python3 get_results.py
    

    A sample of the TF-IDF and correlation analytics output can be seen below.

    Pearson

    TF-IDF

    The folder with all the CSVs will be saved locally in the analytics_files folder after you run the script.

  5. If you want to commission new nodes to the cluster, you can execute this script. This execution will take approximately 20-30 minutes. Run:

    cd ..
    cd hadoop
    python3 hadoop_new_node.py
    

    NOTE: As with production.py, you will be asked for your AWS Access Key ID, AWS Secret Access Key and AWS Session Token. These details can be found in your AWS account. You will also be asked for the number of nodes to be added to the cluster; key in the number of nodes that you want to add.


    In the event that hadoop_new_node.py fails or gets stuck in a loop, press CTRL + C and tear down the cluster using the instructions HERE. Then rerun hadoop_spark.py and try running hadoop_new_node.py again.

  6. If you want to decommission nodes from the cluster, you can execute this script. This execution will take approximately 20-30 minutes. Run:

    python3 hadoop_down_node.py
    

    NOTE: If hadoop_new_node.py has previously failed, please tear down the entire cluster and rerun hadoop_spark.py before running hadoop_down_node.py. As with production.py, you will be asked for your AWS Access Key ID, AWS Secret Access Key and AWS Session Token. These details can be found in your AWS account. You will also be asked for the number of nodes that you want to decommission from the cluster; key in the number of nodes that you want to remove.


    In the event that hadoop_down_node.py fails or gets stuck in a loop, press CTRL + C, then rerun hadoop_down_node.py.

Tearing Down Instructions

  1. First, change directory into the db-final/teardown folder, where all the teardown scripts are located:
    cd db-final/teardown
    
  2. For tearing down production side instances, run:
    python3 teardown_production.py
    

  3. For tearing down Hadoop side instances, run:
    python3 teardown_hadoop.py
    

Description

In this project, we built a web application for Kindle book reviews, similar to Goodreads, using public datasets from Amazon. The full details of the project requirements can be found here.

Dataset

We were provided with 2 public datasets from Amazon.

  • Amazon Kindle reviews, available from the Kaggle website. This dataset has 982,619 entries (about 700 MB).
  • Amazon Kindle metadata, available from the UCSD website. This dataset has 434,702 products (about 450 MB).

After cleaning the dataset, we loaded the data into the respective databases.

Frontend

Requirements

A front-end with at least the following functionalities:

  • Add a new book
  • Search for an existing book by author and by title
  • Add a new review
  • Sort books by reviews and genres
  • User-friendliness

Framework

We chose to use Handlebars.js as a framework as it is simple, clean, and decoupled from logic-based JavaScript files. Its templating engine makes client-side templating simple without requiring complicated Single Page Application frameworks.

Features

Some features of our frontend include:

  • Search a book by title
  • Search a book by author
  • Give reviews on books
  • Filter books based on Categories
  • Filter books based on Rating
  • Add new book

Preview





Production Backend

Backend Requirements

  • Deploy at least 3 separate servers: one for the web server, one for the relational database, and one for the non-relational database.
  • A web server
  • A relational database
  • A non-relational database
  • The web server logs are recorded in a document store. Each log record must have at least the following information:
    • Timestamp
    • Type of request
    • Response code (404, 200, etc)

Backend Framework

Flask was chosen because of its well-maintained libraries for the MySQL and MongoDB databases we needed. Libraries used include PyMongo, Python-sql and mysql-connector.

Design

1. Books Metadata

Schema (MongoDB)

  • id: ObjectId, primary key
  • asin: String
  • categories:
  • description?: String
  • title: String
  • price: Double
  • imUrl: String
  • related: json
    • also_bought: string[]
    • also_viewed: string[]
    • bought_together: string[]
    • buy_after_viewing: string[]
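
To make the document shape concrete, here is a hypothetical example of inserting one metadata document with PyMongo (one of the libraries the backend uses). The connection string, database and collection names, and all field values are illustrative assumptions, not taken from the project or the dataset.

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")   # replace with the MongoDB host
    metadata = client["goodreviews"]["metadata"]        # assumed database/collection names

    # Illustrative document following the schema above; the id (ObjectId) is
    # generated by MongoDB as the _id field. The type of 'categories' is not
    # specified above, so it is shown here simply as a list of category names.
    metadata.insert_one({
        "asin": "B000EXAMPLE",
        "categories": ["Kindle Store", "Kindle eBooks"],
        "description": "An example description.",
        "title": "An Example Book",
        "price": 4.99,
        "imUrl": "http://example.com/cover.jpg",
        "related": {
            "also_bought": ["B000OTHER1"],
            "also_viewed": ["B000OTHER2"],
            "bought_together": [],
            "buy_after_viewing": [],
        },
    })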

2. Kindle Reviews

Schema (MySQL)

  • id: Integer, primary key
  • asin: varchar
  • helpful: varchar
  • overall: Integer
  • reviewText: Text
  • reviewTime: varchar
  • reviewerID: varchar
  • reviewerName: varchar
  • summary: varchar
  • unixReviewTime: Integer
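
As a rough illustration, a table matching this schema could be created through mysql-connector (one of the libraries listed under the backend framework) along the following lines. The connection details, table name and VARCHAR lengths are assumptions, not the project's actual DDL.

    import mysql.connector

    # Assumed connection details; replace with the MySQL instance's credentials.
    conn = mysql.connector.connect(
        host="localhost", user="root", password="password", database="goodreviews"
    )
    cur = conn.cursor()

    # Illustrative DDL following the schema above.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS kindle_reviews (
            id              INT PRIMARY KEY AUTO_INCREMENT,
            asin            VARCHAR(20),
            helpful         VARCHAR(20),
            overall         INT,
            reviewText      TEXT,
            reviewTime      VARCHAR(20),
            reviewerID      VARCHAR(30),
            reviewerName    VARCHAR(255),
            summary         VARCHAR(255),
            unixReviewTime  INT
        )
    """)
    conn.commit()
    conn.close()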

3. User Logging

Schema (MongoDB)

  • id: Integer, primary key
  • timestamp: String
  • request: String
  • response: String
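
One way to satisfy the logging requirement above is a Flask after_request hook that writes one document per request into MongoDB. The sketch below is an assumption about how this could look, not the project's actual implementation; the database and collection names are made up.

    from datetime import datetime

    from flask import Flask, request
    from pymongo import MongoClient

    app = Flask(__name__)
    logs = MongoClient("mongodb://localhost:27017")["goodreviews"]["logs"]  # assumed names

    @app.after_request
    def log_request(response):
        # Record the timestamp, type of request and response code for every request.
        logs.insert_one({
            "timestamp": datetime.utcnow().isoformat(),
            "request": f"{request.method} {request.path}",
            "response": str(response.status_code),
        })
        return response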

4. Extra Details

Schema (MongoDB)

  • id: ObjectId, primary key
  • asin: String
  • book_title: String
  • author_names: String

Analytics

For analytics, we built a system that:

  1. Gets data from the production system and loads it into HDFS
  2. Uses Spark to produce two analyses:
    • Pearson Correlation: the covariance of the two variables divided by the product of their standard deviations.
    • TF-IDF: a statistical measure of how relevant a word is to a document in a collection of documents.

Implementation

(1) Hadoop and Spark installation

When the user executes the hadoop_spark.py script in the hadoop folder with their credentials and the specified number of nodes, the script performs all the necessary configuration, such as installing Hadoop and Spark on all the nodes. It then starts the Hadoop and Spark cluster.

(2) Hadoop commissioning new nodes

When hadoop_new_node.py (in the hadoop folder) is executed, the specified number of nodes is commissioned into the cluster. Below is the explanation.

We implemented the adding of new nodes the proper way, which is to commission them into the running cluster.

When the script is executed, the specified number of new nodes is created and fully configured, which includes installing Hadoop and fetching the public key of the namenode running in the cluster.

All the existing nodes are then informed of the new nodes that are about to join the cluster through the hosts and workers files, along with some other configuration.

Once all the necessary configuration has been completed, the script places an includes file on the namenode containing the private IPs of all the datanodes, both old and new. The script then stops and starts the cluster. Through the new workers file and the includes file, the namenode is 'informed' of the incoming nodes, and when the cluster starts, the new nodes are commissioned. A minimal sketch of this flow is shown below.
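
The sketch below illustrates this flow with fabric (which the setup installs); the namenode address, key file, file paths and the choice to run everything from the namenode are assumptions, and the real script performs considerably more configuration than this.

    from fabric import Connection

    # Assumed values for illustration only.
    namenode_ip = "203.0.113.10"
    new_node_private_ips = ["10.0.0.21", "10.0.0.22"]

    nn = Connection(host=namenode_ip, user="ubuntu",
                    connect_kwargs={"key_filename": "my-key.pem"})

    # Tell the namenode about the new datanodes via the workers and includes files
    # (paths assumed relative to a standard $HADOOP_HOME layout).
    for ip in new_node_private_ips:
        nn.run(f"echo {ip} >> $HADOOP_HOME/etc/hadoop/workers")
        nn.run(f"echo {ip} >> $HADOOP_HOME/etc/hadoop/includes")

    # Restart the cluster so the namenode picks up the new nodes.
    nn.run("$HADOOP_HOME/sbin/stop-dfs.sh && $HADOOP_HOME/sbin/start-dfs.sh")
    nn.run("$HADOOP_HOME/sbin/stop-yarn.sh && $HADOOP_HOME/sbin/start-yarn.sh")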


(3) Hadoop decommissioning nodes

When the user executes the hadoop_down_node.py script, the script decommissions the specified number of nodes from the cluster. Below is the explanation.

This is done through the excludes file, which is placed on the namenode. The private IPs of the nodes that are to be decommissioned are listed in the excludes file. The script then refreshes the cluster, at which point the namenode decommissions the nodes listed in the excludes file. A minimal sketch of this step is shown below.

Point to note: excludes takes precedence over includes, so when datanodes are decommissioned, their private IPs do not have to be removed from the includes file.
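
Under the same assumptions as the commissioning sketch above, the decommissioning step can be illustrated as follows (hdfs dfsadmin -refreshNodes and yarn rmadmin -refreshNodes are the standard Hadoop refresh commands; whether the project invokes them exactly this way is an assumption).

    from fabric import Connection

    namenode_ip = "203.0.113.10"        # assumed namenode address
    nodes_to_remove = ["10.0.0.22"]     # private IPs of the nodes to decommission

    nn = Connection(host=namenode_ip, user="ubuntu",
                    connect_kwargs={"key_filename": "my-key.pem"})

    # List the nodes to decommission in the excludes file on the namenode
    # (path assumed relative to a standard $HADOOP_HOME layout).
    for ip in nodes_to_remove:
        nn.run(f"echo {ip} >> $HADOOP_HOME/etc/hadoop/excludes")

    # Refresh the cluster; the namenode then decommissions the listed datanodes.
    nn.run("hdfs dfsadmin -refreshNodes")
    nn.run("yarn rmadmin -refreshNodes")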

This is the image for the excludes file:


(4) How we achieved the commissioning and decommissioning of nodes

We added some special configuration to the namenode's yarn configuration file, shown in the image below.

This configuration uses a file named 'includes', which lists the private IPs of the datanodes that are part of the cluster (both old and new).

Approach

After the Hadoop and Spark cluster is up, the user can obtain the results of the analytics tasks by running the get_results.py file in the analytics_files folder. Below is the approach that we took for the analytics tasks.

(1) Pearson Correlation

To compute the Pearson correlation between price and average review length, we first need to retrieve this data from HDFS. Since price comes from the Meta dataset while the reviews come from the Reviews dataset, we need to fetch the two datasets and do some preprocessing.

The preprocessing includes getting the average reviewText length for each book (asin) and removing all unnecessary columns in order to clean up our data. We then joined the two datasets on their asin to get a table with both price and reviewText length.

We used an RDD to map the values of x, y, xy, x² and y², and to get the sum of each of these values. After getting these results, we were able to calculate the Pearson Correlation using the following equation.
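
As a sketch of that computation (assuming a PySpark RDD of (price, average review length) pairs; this is not the project's exact code), the sums above combine into Pearson's r like this:

    from math import sqrt

    def pearson_from_pairs(pairs_rdd):
        """pairs_rdd: RDD of (x, y) pairs, e.g. (price, average reviewText length)."""
        n = pairs_rdd.count()
        sum_x  = pairs_rdd.map(lambda p: p[0]).sum()
        sum_y  = pairs_rdd.map(lambda p: p[1]).sum()
        sum_xy = pairs_rdd.map(lambda p: p[0] * p[1]).sum()
        sum_x2 = pairs_rdd.map(lambda p: p[0] ** 2).sum()
        sum_y2 = pairs_rdd.map(lambda p: p[1] ** 2).sum()
        # r = (n*sum(xy) - sum(x)*sum(y))
        #     / sqrt((n*sum(x^2) - sum(x)^2) * (n*sum(y^2) - sum(y)^2))
        numerator = n * sum_xy - sum_x * sum_y
        denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
        return numerator / denominator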

The output of this is printed in the console when get_results.py is run (as shown above).

(2) TF-IDF

In order to calculate the TF-IDF, we need two quantities from our dataset: Term Frequency and Inverse Document Frequency.

  • Term Frequency: the number of times a word appears in the document
  • Inverse Document Frequency: a numerical measure of how much information a word provides

TF-IDF is calculated by multiplying the Term Frequency and the Inverse Document Frequency. The higher the score, the more relevant the word is to the document.

Finding the TF-IDF score for a word t in document d from a set of documents D can be done with the following:

    TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)
    TF(t, d)  = (number of times t appears in d) / (total number of words in d)
    IDF(t, D) = log(|D| / number of documents in D that contain t)

We first need to retrieve the Reviews data stored in HDFS.

Following that, to get the TF, we need to calculate the count of every word in the reviewText of each book and divide it by the total word count of the reviewText.

To get the IDF, we used the IDF(t, D) equation from above to calculate the score. We then proceeded to multiply the TF and IDF in order to get the TF-IDF scores for the books.
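
A simplified sketch of this calculation in plain Python (the actual job runs on Spark, and the tokenisation shown here is an assumption):

    import math
    from collections import Counter

    def tf_idf(documents):
        """documents: dict mapping asin -> reviewText. Returns asin -> {word: TF-IDF score}."""
        tokenised = {asin: text.lower().split() for asin, text in documents.items()}
        n_docs = len(tokenised)

        # Document frequency: the number of documents each word appears in.
        df = Counter()
        for words in tokenised.values():
            df.update(set(words))

        scores = {}
        for asin, words in tokenised.items():
            counts = Counter(words)
            total = len(words)
            scores[asin] = {
                # TF  = count of word in document / total words in document
                # IDF = log(number of documents / number of documents containing word)
                word: (count / total) * math.log(n_docs / df[word])
                for word, count in counts.items()
            }
        return scores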

The results for the TF-IDF are downloaded into db-final/analytics_files/tfidf_output.zip, which contains an archived folder of the TF-IDF scores of the books, saved as CSVs.

Extras

Web Scraping

Since the metadata given for the Amazon Kindle reviews has a lot of missing book_details, we had to scrape the missing details from the web.
