salonimishr/Mapreduce
This project runs the same MapReduce algorithm (mapping, shuffling and reducing) on three platforms: Google Colab, Databricks and Docker Toolbox. Google Colab works well for PySpark because it handles high-volume processing and provides templates. Databricks is very similar, except that a cluster has to be created before a job can run. Docker Toolbox is a container-based technology: a container is just an isolated user space of the operating system, and all running containers share the host OS kernel.

In Colab, I first initialized the PySpark environment and mounted Google Drive. I then downloaded the data and used PySpark to count the occurrences of each UniqueCarrier value. In Databricks, I uploaded the data and performed the same task. Finally, I repeated the task in Docker Toolbox after copying the data into the container.

All three platforms produced the same results. Of the three, Google Colab was the best because of its fast processing. In Colab the dataset was loaded into an RDD, whereas in Databricks and Docker Toolbox it was loaded using both a DataFrame (DF) and an RDD.
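The three phases named above can be illustrated with a minimal, plain-Python sketch. This is not the notebook code from this repository: the sample rows, the extra column names and the helper functions `map_phase`, `shuffle_phase`, `reduce_phase` and `count_carriers` are all hypothetical, and only the `UniqueCarrier` column name comes from the text. It only shows what counting by carrier looks like when each phase is written out by hand.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; the real dataset is not shown in this README.
SAMPLE_CSV = """Year,UniqueCarrier,FlightNum
2008,WN,335
2008,WN,3231
2008,AA,448
2008,UA,1746
2008,AA,1
"""


def map_phase(rows, column="UniqueCarrier"):
    """Map: emit a (carrier, 1) pair for every input row."""
    return [(row[column], 1) for row in rows]


def shuffle_phase(pairs):
    """Shuffle: group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the grouped values for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def count_carriers(csv_text):
    """Run map, shuffle and reduce over CSV text and return counts per carrier."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return reduce_phase(shuffle_phase(map_phase(rows)))


counts = count_carriers(SAMPLE_CSV)
print(counts)
```

In PySpark the same pattern is usually expressed on an RDD as a `map` to `(key, 1)` pairs followed by `reduceByKey`, with Spark performing the shuffle between the two steps; the exact calls used in the original notebooks are not shown here.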