- Prerequisites
  - Environment: Python 3, PySpark
  - A cloud service account, if you want to deploy to the cloud.
- Dataset
  - Task: Text classification
  - Name: Amazon review dataset
  - Description: The dataset contains a few million Amazon customer reviews as text, with star ratings as the output labels. The training set is about 1.6 GB and the test set is about 177 MB.
  - Link: Download. After downloading and extracting the archive, place it in the dataset folder.
- Works
  - We use the dataset above with PySpark to evaluate how well Spark processes big data. The task is solved with the machine learning libraries that Spark provides: TF-IDF for the feature engineering step, followed by several classification algorithms (gradient boosting, logistic regression, support vector machine, and random forest). We measure the running time and accuracy of each algorithm.
- Deployment
  - For now, we run the task on a local machine.
  - In the future, we will run it on Super Node XP or AWS (paid).
- Result: details are in the file result/result.csv
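
The per-algorithm measurement of time and accuracy described under Works can be sketched roughly as below. This is a minimal illustration, not the repository's code: the function names, the CSV column names (`algorithm`, `seconds`, `accuracy`), and the two stand-in trainers are all assumptions, and the layout of result/result.csv may differ. The stand-ins are plain Python so the sketch runs without Spark; in practice each trainer could wrap a Spark pipeline (TF-IDF plus one classifier) instead.

```python
import csv
import io
import time
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical types: a trainer fits on (texts, labels) and returns a predict function.
Predictor = Callable[[Sequence[str]], List[int]]
Trainer = Callable[[Sequence[str], Sequence[int]], Predictor]
Split = Tuple[Sequence[str], Sequence[int]]


def accuracy(predicted: Sequence[int], expected: Sequence[int]) -> float:
    """Fraction of predictions that match the expected labels."""
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def benchmark(trainers: Dict[str, Trainer], train: Split, test: Split) -> List[dict]:
    """Train and evaluate each algorithm, recording wall-clock time and accuracy."""
    train_x, train_y = train
    test_x, test_y = test
    rows = []
    for name, trainer in trainers.items():
        start = time.perf_counter()
        predict = trainer(train_x, train_y)   # training step
        predictions = predict(test_x)         # inference on the test set
        elapsed = time.perf_counter() - start
        rows.append({
            "algorithm": name,
            "seconds": elapsed,
            "accuracy": accuracy(predictions, test_y),
        })
    return rows


def to_csv(rows: List[dict]) -> str:
    """Serialize the results; the column names are an assumed layout."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["algorithm", "seconds", "accuracy"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Stand-in trainers (assumptions for illustration only, not real models).
def majority_trainer(texts: Sequence[str], labels: Sequence[int]) -> Predictor:
    """Always predicts the most frequent training label."""
    most_common = max(set(labels), key=list(labels).count)
    return lambda xs: [most_common for _ in xs]


def keyword_trainer(texts: Sequence[str], labels: Sequence[int]) -> Predictor:
    """Ignores the training data and predicts 1 when the text contains 'good'."""
    return lambda xs: [1 if "good" in x.lower() else 0 for x in xs]


# Toy data made up for this sketch; 1 = positive review, 0 = negative review.
train = (["good phone", "bad cable", "good book"], [1, 0, 1])
test = (["good value", "bad fit"], [1, 0])

rows = benchmark({"majority": majority_trainer, "keyword": keyword_trainer}, train, test)
print(to_csv(rows))
```

Timing training and prediction together keeps the comparison simple; if the two phases need to be compared separately, the timer can be split around each call.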