GitHub - minhngh/Parallel-Computing-Assignment

Prerequisites
- Environment: Python 3, PySpark
- A cloud service account, if you want to deploy to the cloud.
Dataset
- Task: Text classification
- Name: Amazon review dataset
- Description: The dataset contains a few million Amazon customer reviews as text, with star ratings as the output labels. The training set is about 1.6 GB and the test set is about 177 MB.
- Link: Download. After downloading and extracting the archive, place it in the dataset folder.
Works
- We use the dataset above with PySpark to evaluate how Spark performs when processing big data. The task is solved with the machine learning libraries that Spark provides: TF-IDF for the feature engineering step, followed by several classification algorithms (gradient boosting, logistic regression, support vector machine, and random forest). We measure the running time and accuracy of each algorithm.
- Deployment
  - Currently, the task runs on a local machine.
  - In the future, we plan to run it on Super Node XP or AWS (paid).
Results: see result/result.csv for details.
