
A full-stack web application for Kindle book reviews using big data tools and analytics, built with Python, JavaScript, Flask, Hadoop, Spark, and AWS services.


50.043 Database and Big Data Systems Project

NOTE: Due to a recent update (9 December) to the Python cryptography package, which is required by the fabric package we use, some of our scripts may not work as intended. We informed Prof Dinh, Prof Kenny and GTA Vishva of this during the presentation. See here.

Table of Contents

  • Contributors
  • Automated Use Flow
  • Tearing Down Instructions
  • Description
  • Dataset
  • Frontend
  • Production Backend
  • Analytics
  • Extras

Contributors

  • Tharun
  • Nina
  • Hadi
  • Jun Yuan
  • Akmal Hakim

Automated Use Flow

Prerequisites

  • AWS EC2 Ubuntu 18.04 LTS (due to use of bash scripts)
  • AWS EC2 key pair created in the region 'us-east-1'

Instructions

Setting up project folder and required packages

  1. Install the necessary packages
     sudo apt-get -y update &&
     sudo apt -y install python3-pip &&
     pip3 install --upgrade pip &&
     pip3 install --upgrade setuptools &&
     pip3 install boto3 &&
     pip3 install -U python-dotenv &&
     sudo apt-get -y install build-essential libssl-dev libffi-dev python-dev &&
     pip3 install fabric
    
  2. Clone the project repository using the following command
    git clone https://github.com/thawalk/db-final.git
    

Running the production script

  1. Change directory into the db-final/production folder

    cd db-final/production
    
  2. Run production.py to set up and configure the EC2 instances. This will create instances for hosting the MongoDB, MySQL, frontend and backend servers (a minimal sketch of the pattern the script follows is shown after this list).

    python3 production.py
    

    NOTE: Upon execution, you will be prompted for your AWS Access Key ID, AWS Secret Access Key and AWS Session Token. These details can be found in your AWS account.

    In the event that production.py fails or gets stuck in a loop, stop the script with CTRL + C. Then tear down all the instances using the instructions HERE.

  3. The public IP of the EC2 instance hosting the web application will be printed on the console. The website can then be viewed at <public IP 1> printed at the end.

    This is the link to the website.
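
For reference, production.py drives AWS through boto3 (installed in the prerequisites above). The snippet below is a hypothetical, minimal sketch of the pattern such a script follows: prompt for the credentials, then launch an EC2 instance in us-east-1. The AMI ID, instance type and key pair name are placeholders, not the values the actual script uses.

    import getpass
    import boto3

    # Hypothetical sketch, not the project's exact code.
    access_key = input("AWS Access Key ID: ")
    secret_key = getpass.getpass("AWS Secret Access Key: ")
    session_token = getpass.getpass("AWS Session Token: ")

    ec2 = boto3.resource(
        "ec2",
        region_name="us-east-1",          # region of the key pair (see Prerequisites)
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
        aws_session_token=session_token,
    )

    # Launch one Ubuntu 18.04 instance; production.py creates one instance per role
    # (MongoDB, MySQL, frontend, backend).
    instances = ec2.create_instances(
        ImageId="ami-xxxxxxxx",           # placeholder Ubuntu 18.04 AMI ID
        InstanceType="t2.medium",         # placeholder instance type
        KeyName="your-key-pair",          # placeholder EC2 key pair name
        MinCount=1,
        MaxCount=1,
    )
    print("Launched instance:", instances[0].id)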

Running the Analytics Script

  1. You should now be in db-final/production. For the analytics portion, head over to db-final/hadoop by running

    cd ..
    cd hadoop
    
  2. Let us now run the analytics script. This execution will take approximately 20-30 minutes. Run:

    python3 hadoop_spark.py
    

    NOTE: As with production.py, you will be asked for your AWS Access Key ID, AWS Secret Access Key and AWS Session Token. These details can be found in your AWS account. You will also be asked for the number of nodes for the cluster; key in the number of nodes accordingly.


    In the event that hadoop_spark.py fails or gets stuck in a loop, click HERE.

  3. The results of the analytics tasks will be stored in db-final/analytics_files. First, change directory to the analytics_files folder.

    cd ..
    cd analytics_files
    
  4. To obtain the results of the analytics, we run:

    python3 get_results.py
    

    A sample of the TF-IDF and correlation analytics output can be seen below.

    Pearson

    TF-IDF

    The folder with all the CSVs will be saved locally in the analytics_files folder after you run the script.

  5. If you want to commission new nodes to the cluster, you can execute this script. This execution will take approximately 20-30 minutes. Run:

    cd ..
    cd hadoop
    python3 hadoop_new_node.py
    

    NOTE: As with production.py, you will be asked for your AWS Access Key ID, AWS Secret Access Key and AWS Session Token. These details can be found in your AWS account. You will also be asked for the number of nodes to be added to the cluster; key in the number of nodes that you want to add.


    In the event that hadoop_new_node.py fails or gets stuck in a loop, press CTRL + C and tear down the cluster using the instructions HERE. Then rerun hadoop_spark.py and try running hadoop_new_node.py again.

  6. If you want to decommission nodes from the cluster, you can execute this script. This execution will take approximately 20-30 minutes. Run:

    python3 hadoop_down_node.py
    

    NOTE: If hadoop_new_node.py has previously failed, please tear down the entire cluster and rerun hadoop_spark.py before running hadoop_down_node.py. As with production.py, you will be asked for your AWS Access Key ID, AWS Secret Access Key and AWS Session Token. These details can be found in your AWS account. You will also be asked for the number of nodes that you want to decommission from the cluster; key in the number of nodes that you want to remove.


    In the event that hadoop_down_node.py fails or gets stuck in a loop, press CTRL + C, then rerun hadoop_down_node.py.

Tearing Down Instructions

  1. First, change directory into the db-final/teardown folder, where all the teardown scripts are located:
    cd db-final/teardown
    
  2. For tearing down production side instances, run:
    python3 teardown_production.py
    

  3. For tearing down Hadoop side instances, run:
    python3 teardown_hadoop.py
    

Description

In this project, we built a web application for Kindle book reviews, similar to Goodreads, using public datasets from Amazon. The full details of the project requirements can be found here.

Dataset

We were provided with 2 public datasets from Amazon.

  • Amazon Kindle reviews, available from the Kaggle website. This dataset has 982,619 entries (about 700 MB).
  • Amazon Kindle metadata, available from the UCSD website. This dataset has 434,702 products (about 450 MB).

After cleaning the dataset, we loaded the data into the respective databases.

Frontend

Requirements

A front-end with at least the following functionalities:

  • Add a new book
  • Search for an existing book by author and by title
  • Add a new review
  • Sort books by reviews and genres
  • User-friendliness

Framework

We chose to use Handlebars.js as a framework as it is simple, clean, and decoupled from logic-based JavaScript files. Its templating engine makes client-side templating simple without requiring complicated Single Page Application frameworks.

Features

Some features of our frontend include:

  • Search a book by title
  • Search a book by author
  • Give reviews on books
  • Filter books based on Categories
  • Filter books based on Rating
  • Add new book

Preview





Production Backend

Backend Requirements

  • Deploy at least 3 separate servers: one for the web server, one for the relational database, and one for the non-relational database.
  • A web server
  • A relational database
  • A non-relational database
  • The web server logs are recorded in a document store. Each log record must have at least the following information:
    • Timestamp
    • Type of request
    • Response code (404, 200, etc)

Backend Framework

Flask was chosen because of its well-maintained libraries for the MySQL and MongoDB databases we needed. Libraries used include PyMongo, Python-sql and mysql-connector.

Design

1. Books Metadata

Schema (MongoDB)

  • id: ObjectId, primary key
  • asin: String
  • categories:
  • description?: String
  • title: String
  • price: Double
  • imUrl: String
  • related: json
    • also_bought: string[]
    • also_viewed: string[]
    • bought_together: string[]
    • buy_after_viewing: string[]
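
To make the document shape concrete, here is a hypothetical example of inserting one metadata document with PyMongo (one of the libraries the backend uses). The connection string, database and collection names, and all field values are illustrative assumptions, not taken from the project or the dataset.

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")   # replace with the MongoDB host
    metadata = client["goodreviews"]["metadata"]        # assumed database/collection names

    # Illustrative document following the schema above; the id (ObjectId) is
    # generated by MongoDB as the _id field. The type of 'categories' is not
    # specified above, so it is shown here simply as a list of category names.
    metadata.insert_one({
        "asin": "B000EXAMPLE",
        "categories": ["Kindle Store", "Kindle eBooks"],
        "description": "An example description.",
        "title": "An Example Book",
        "price": 4.99,
        "imUrl": "http://example.com/cover.jpg",
        "related": {
            "also_bought": ["B000OTHER1"],
            "also_viewed": ["B000OTHER2"],
            "bought_together": [],
            "buy_after_viewing": [],
        },
    })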

2. Kindle Reviews

Schema (MySQL)

  • id: Integer, primary key
  • asin: varchar
  • helpful: varchar
  • overall: Integer
  • reviewText: Text
  • reviewTime: varchar
  • reviewerID: varchar
  • reviewerName: varchar
  • summary: varchar
  • unixReviewTime: Integer
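
As a rough illustration, a table matching this schema could be created through mysql-connector (one of the libraries listed under the backend framework) along the following lines. The connection details, table name and VARCHAR lengths are assumptions, not the project's actual DDL.

    import mysql.connector

    # Assumed connection details; replace with the MySQL instance's credentials.
    conn = mysql.connector.connect(
        host="localhost", user="root", password="password", database="goodreviews"
    )
    cur = conn.cursor()

    # Illustrative DDL following the schema above.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS kindle_reviews (
            id              INT PRIMARY KEY AUTO_INCREMENT,
            asin            VARCHAR(20),
            helpful         VARCHAR(20),
            overall         INT,
            reviewText      TEXT,
            reviewTime      VARCHAR(20),
            reviewerID      VARCHAR(30),
            reviewerName    VARCHAR(255),
            summary         VARCHAR(255),
            unixReviewTime  INT
        )
    """)
    conn.commit()
    conn.close()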

3. User Logging

Schema (MongoDB)

  • id: Integer, primary key
  • timestamp: String
  • request: String
  • response: String
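
One way to satisfy the logging requirement above is a Flask after_request hook that writes one document per request into MongoDB. The sketch below is an assumption about how this could look, not the project's actual implementation; the database and collection names are made up.

    from datetime import datetime

    from flask import Flask, request
    from pymongo import MongoClient

    app = Flask(__name__)
    logs = MongoClient("mongodb://localhost:27017")["goodreviews"]["logs"]  # assumed names

    @app.after_request
    def log_request(response):
        # Record the timestamp, type of request and response code for every request.
        logs.insert_one({
            "timestamp": datetime.utcnow().isoformat(),
            "request": f"{request.method} {request.path}",
            "response": str(response.status_code),
        })
        return response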

4. Extra Details

Schema (MongoDB)

  • id: ObjectId, primary key
  • asin: String
  • book_title: String
  • author_names: String

Analytics

For analytics, we built a system that:

  1. Gets data from the production system and loads it into HDFS
  2. Uses Spark to produce two analyses:
    • Pearson Correlation: the covariance of the two variables divided by the product of their standard deviations.
    • TF-IDF: a statistical measure of how relevant a word is to a document in a collection of documents.

Implementation

(1) Hadoop and Spark installation

When the user executes the hadoop_spark.py script in the hadoop folder with their credentials and the specified number of nodes, the script performs all the necessary configuration, such as installing Hadoop and Spark on all the nodes. It then starts the Hadoop and Spark cluster.

(2) Hadoop commissioning new nodes

When hadoop_new_node.py (in the hadoop folder) is executed, the specified number of nodes is commissioned into the cluster. Below is the explanation.

We implemented the adding of new nodes the proper way, which is to commission them into the running cluster.

When the script is executed, the specified number of new nodes is created and fully configured, which includes installing Hadoop and fetching the public key of the namenode running in the cluster.

All the existing nodes are then informed of the new nodes that are about to join the cluster through the hosts and workers files, along with some other configuration.

Once all the necessary configuration has been completed, the script places an includes file on the namenode containing the private IPs of all the datanodes, both old and new. The script then stops and starts the cluster. Through the new workers file and the includes file, the namenode is 'informed' of the incoming nodes, and when the cluster starts, the new nodes are commissioned. A minimal sketch of this flow is shown below.
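
The sketch below illustrates this flow with fabric (which the setup installs); the namenode address, key file, file paths and the choice to run everything from the namenode are assumptions, and the real script performs considerably more configuration than this.

    from fabric import Connection

    # Assumed values for illustration only.
    namenode_ip = "203.0.113.10"
    new_node_private_ips = ["10.0.0.21", "10.0.0.22"]

    nn = Connection(host=namenode_ip, user="ubuntu",
                    connect_kwargs={"key_filename": "my-key.pem"})

    # Tell the namenode about the new datanodes via the workers and includes files
    # (paths assumed relative to a standard $HADOOP_HOME layout).
    for ip in new_node_private_ips:
        nn.run(f"echo {ip} >> $HADOOP_HOME/etc/hadoop/workers")
        nn.run(f"echo {ip} >> $HADOOP_HOME/etc/hadoop/includes")

    # Restart the cluster so the namenode picks up the new nodes.
    nn.run("$HADOOP_HOME/sbin/stop-dfs.sh && $HADOOP_HOME/sbin/start-dfs.sh")
    nn.run("$HADOOP_HOME/sbin/stop-yarn.sh && $HADOOP_HOME/sbin/start-yarn.sh")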


(3) Hadoop decommissioning nodes

When the user executes the hadoop_down_node.py script, the script decommissions the specified number of nodes from the cluster. Below is the explanation.

This is done through the excludes file, which is placed on the namenode. The private IPs of the nodes that are to be decommissioned are listed in the excludes file. The script then refreshes the cluster, at which point the namenode decommissions the nodes listed in the excludes file. A minimal sketch of this step is shown below.

Point to note: excludes takes precedence over includes, so when datanodes are decommissioned, their private IPs do not have to be removed from the includes file.
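
Under the same assumptions as the commissioning sketch above, the decommissioning step can be illustrated as follows (hdfs dfsadmin -refreshNodes and yarn rmadmin -refreshNodes are the standard Hadoop refresh commands; whether the project invokes them exactly this way is an assumption).

    from fabric import Connection

    namenode_ip = "203.0.113.10"        # assumed namenode address
    nodes_to_remove = ["10.0.0.22"]     # private IPs of the nodes to decommission

    nn = Connection(host=namenode_ip, user="ubuntu",
                    connect_kwargs={"key_filename": "my-key.pem"})

    # List the nodes to decommission in the excludes file on the namenode
    # (path assumed relative to a standard $HADOOP_HOME layout).
    for ip in nodes_to_remove:
        nn.run(f"echo {ip} >> $HADOOP_HOME/etc/hadoop/excludes")

    # Refresh the cluster; the namenode then decommissions the listed datanodes.
    nn.run("hdfs dfsadmin -refreshNodes")
    nn.run("yarn rmadmin -refreshNodes")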

This is the image for the excludes file:


(4) How we achieved the commissioning and decommissioning of nodes

We added some special configuration to the namenode's yarn configuration file, shown in the image below.

This configuration uses a file named 'includes', which lists the private IPs of the datanodes that are part of the cluster (both old and new).

Approach

After the Hadoop and Spark cluster is up, the user can obtain the results of the analytics tasks by running the get_results.py file in the analytics_files folder. Below is the approach that we took for the analytics tasks.

(1) Pearson Correlation

To compute the Pearson correlation between price and average review length, we first need to retrieve this data from HDFS. Since price comes from the Meta dataset while the reviews come from the Reviews dataset, we need to fetch the two datasets and do some preprocessing.

The preprocessing includes getting the average reviewText length for each book (asin) and removing all unnecessary columns in order to clean up our data. We then joined the two datasets on their asin to get a table with both price and reviewText length.

We used an RDD to map the values of x, y, xy, x² and y², and to get the sum of each of these values. After getting these results, we were able to calculate the Pearson Correlation using the following equation.
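
As a sketch of that computation (assuming a PySpark RDD of (price, average review length) pairs; this is not the project's exact code), the sums above combine into Pearson's r like this:

    from math import sqrt

    def pearson_from_pairs(pairs_rdd):
        """pairs_rdd: RDD of (x, y) pairs, e.g. (price, average reviewText length)."""
        n = pairs_rdd.count()
        sum_x  = pairs_rdd.map(lambda p: p[0]).sum()
        sum_y  = pairs_rdd.map(lambda p: p[1]).sum()
        sum_xy = pairs_rdd.map(lambda p: p[0] * p[1]).sum()
        sum_x2 = pairs_rdd.map(lambda p: p[0] ** 2).sum()
        sum_y2 = pairs_rdd.map(lambda p: p[1] ** 2).sum()
        # r = (n*sum(xy) - sum(x)*sum(y))
        #     / sqrt((n*sum(x^2) - sum(x)^2) * (n*sum(y^2) - sum(y)^2))
        numerator = n * sum_xy - sum_x * sum_y
        denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
        return numerator / denominator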

The output of this is printed in the console when get_results.py is run (as shown above).

(2) TF-IDF

In order to calculate the TF-IDF, we need two quantities from our dataset: Term Frequency and Inverse Document Frequency.

  • Term Frequency: the number of times a word appears in the document
  • Inverse Document Frequency: a numerical measure of how much information a word provides

TF-IDF is calculated by multiplying the Term Frequency and the Inverse Document Frequency. The higher the score, the more relevant the word is to the document.

Finding the TF-IDF score for a word t in document d from a set of documents D can be done with the following:

    TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)
    TF(t, d)  = (number of times t appears in d) / (total number of words in d)
    IDF(t, D) = log(|D| / number of documents in D that contain t)

We first need to retrieve the Reviews data stored in HDFS.

Following that, to get the TF, we need to calculate the count of every word in the reviewText of each book and divide it by the total word count of the reviewText.

To get the IDF, we used the IDF(t, D) equation from above to calculate the score. We then proceeded to multiply the TF and IDF in order to get the TF-IDF scores for the books.
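
A simplified sketch of this calculation in plain Python (the actual job runs on Spark, and the tokenisation shown here is an assumption):

    import math
    from collections import Counter

    def tf_idf(documents):
        """documents: dict mapping asin -> reviewText. Returns asin -> {word: TF-IDF score}."""
        tokenised = {asin: text.lower().split() for asin, text in documents.items()}
        n_docs = len(tokenised)

        # Document frequency: the number of documents each word appears in.
        df = Counter()
        for words in tokenised.values():
            df.update(set(words))

        scores = {}
        for asin, words in tokenised.items():
            counts = Counter(words)
            total = len(words)
            scores[asin] = {
                # TF  = count of word in document / total words in document
                # IDF = log(number of documents / number of documents containing word)
                word: (count / total) * math.log(n_docs / df[word])
                for word, count in counts.items()
            }
        return scores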

The results for the TF-IDF are downloaded into db-final/analytics_files/tfidf_output.zip, which contains an archived folder of the TF-IDF scores of the books, saved as CSVs.

Extras

Web Scraping

Since the metadata given for the Amazon Kindle reviews has a lot of missing book_details, we had to scrape the missing details from the web.
