minhngh/Parallel-Computing-Assignment

Assignment for the Parallel Computing course


  1. Prerequisites
    • Environment: Python 3, PySpark
    • A cloud service account, if you want to deploy to the cloud.
  2. Dataset
    • Task: Text classification
    • Name: Amazon review dataset
    • Description: The dataset contains a few million Amazon customer reviews as text, with star ratings as the output labels. The training set is about 1.6 GB and the test set is about 177 MB.
    • Link: Download. After downloading and extracting the dataset, place it in the dataset folder.
  3. Work
    • We use the dataset above with PySpark to evaluate how well Spark performs when processing big data. We apply the machine learning libraries provided by Spark to the classification task: TF-IDF for the feature engineering step, followed by several classification algorithms (gradient boosting, logistic regression, support vector machine, and random forest). We measure the running time and accuracy of each algorithm.
    • Deployment
      • Currently, we run the task on a local machine.
      • In the future, we plan to run it on Super Node XP or AWS (paid).
  4. Result: details are in the file result/result.csv
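
The workflow described in step 3 (TF-IDF features, then one classifier at a time, recording time and accuracy) could look roughly like the sketch below. It is not the code in this repository. The column names `text` and `label`, the helper names `build_pipeline`, `timed`, `evaluate`, and `run_demo`, the `num_features` value, and the four toy rows are all assumptions made for illustration. Note also that Spark ML's `LinearSVC` and `GBTClassifier` only handle binary labels, so the sketch assumes the star ratings have been reduced to two classes.

```python
import time


def build_pipeline(classifier, num_features=1 << 12):
    """Chain tokenization, hashed term frequencies, IDF weighting, and a classifier."""
    # PySpark is imported lazily so the timing helper below works without it.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import HashingTF, IDF, Tokenizer

    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    tf = HashingTF(inputCol="words", outputCol="tf", numFeatures=num_features)
    idf = IDF(inputCol="tf", outputCol="features")  # classifiers read "features" by default
    return Pipeline(stages=[tokenizer, tf, idf, classifier])


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def evaluate(classifier, train_df, test_df):
    """Fit the TF-IDF pipeline with one classifier; report fit time and test accuracy."""
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    model, fit_seconds = timed(build_pipeline(classifier).fit, train_df)
    predictions = model.transform(test_df)
    evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="accuracy"
    )
    return {
        "algorithm": type(classifier).__name__,
        "fit_seconds": fit_seconds,
        "accuracy": evaluator.evaluate(predictions),
    }


def run_demo():
    """Run two classifiers on a tiny made-up dataset (hypothetical rows, not the Amazon data)."""
    from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("tfidf-demo").getOrCreate()
    rows = [
        ("great product works well", 1.0),
        ("terrible quality broke quickly", 0.0),
        ("love it highly recommend", 1.0),
        ("waste of money do not buy", 0.0),
    ]
    df = spark.createDataFrame(rows, ["text", "label"])
    # The toy data is reused for training and testing only to keep the sketch short.
    for clf in (LogisticRegression(maxIter=10), RandomForestClassifier(numTrees=10)):
        print(evaluate(clf, df, df))
    spark.stop()
```

Calling `run_demo()` requires a local PySpark installation. Each dictionary returned by `evaluate` has one entry per measured quantity, so a list of them could be written out as rows of a CSV file.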
