Data-Mining (Python)

Programming assignments and competition project for the course DSCI 553: Foundations and Applications of Data Mining

Competition Project

Ranked among the top 3 students in the class, based on the lowest RMSE of predictions.
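For reference, RMSE (root-mean-square error) is the square root of the mean squared difference between predicted and actual ratings. The snippet below is a minimal sketch of the metric only, not the project's evaluation script; the function name `rmse` and its arguments are hypothetical.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length rating sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(predicted))
```

A lower value means the predicted ratings sit closer to the true ones; a perfect predictor scores 0.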

Assignments

Explore the Yelp datasets using Spark and understand how partitions work in RDDs.
Implement the SON algorithm using Spark.
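The two passes of SON can be sketched in plain Python, without Spark, so the idea runs standalone. This is an illustrative sketch under stated assumptions, not the assignment's code: the names `local_frequent` and `son` are hypothetical, the per-chunk step uses brute-force counting rather than A-Priori, and itemsets are capped at `max_size` items.

```python
from collections import Counter
from itertools import combinations


def local_frequent(chunk, threshold, max_size):
    """Brute-force in-memory pass: itemsets of up to max_size items that
    appear in at least `threshold` baskets of this chunk."""
    counts = Counter()
    for basket in chunk:
        items = sorted(set(basket))
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return {itemset for itemset, n in counts.items() if n >= threshold}


def son(baskets, support, num_chunks=2, max_size=2):
    """Two-pass SON: local candidates per chunk, then a global count."""
    if not baskets:
        return {}
    chunk_len = -(-len(baskets) // num_chunks)  # integer ceiling division
    chunks = [baskets[i:i + chunk_len] for i in range(0, len(baskets), chunk_len)]

    # Pass 1: scale the support threshold by each chunk's share of the data.
    candidates = set()
    for chunk in chunks:
        scaled = -(-support * len(chunk) // len(baskets))  # ceil(support * share)
        candidates |= local_frequent(chunk, scaled, max_size)

    # Pass 2: count every candidate over the full data set to drop false positives.
    counts = Counter()
    for basket in baskets:
        items = set(basket)
        for itemset in candidates:
            if items.issuperset(itemset):
                counts[itemset] += 1
    return {itemset: n for itemset, n in counts.items() if n >= support}
```

In a Spark version, each chunk would naturally correspond to an RDD partition, with pass 1 and pass 2 each running as a map step followed by a reduce step.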
Tasks:
(a) Implement the MinHash and Locality-Sensitive Hashing (LSH) algorithms using Spark.
(b) Build a content-based recommendation system for Yelp users.
(c) Build recommendation systems using the following Collaborative Filtering (CF) techniques:
(i) User-based CF
(ii) Item-based CF
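As a rough illustration of the MinHash and LSH task, the sketch below builds MinHash signatures and uses banding to find candidate pairs. It is plain Python rather than Spark, and it assumes each business is represented by the set of users who rated it, which is an assumption of this example rather than something stated above. All function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def stable_hash(value, seed):
    """Deterministic 64-bit hash of (seed, value); the built-in hash() is salted per process."""
    digest = hashlib.blake2b(f"{seed}:{value}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items, num_hashes):
    """One minimum per hash function; assumes `items` is non-empty."""
    return [min(stable_hash(item, seed) for item in items) for seed in range(num_hashes)]


def lsh_candidate_pairs(signatures, bands, rows):
    """Pairs of keys whose signatures agree on every row of at least one band."""
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for key, sig in signatures.items():
            buckets[tuple(sig[band * rows:(band + 1) * rows])].append(key)
        for keys in buckets.values():
            candidates.update(combinations(sorted(keys), 2))
    return candidates


def jaccard(a, b):
    """Exact Jaccard similarity, used to verify candidate pairs."""
    return len(a & b) / len(a | b)
```

Candidate pairs from `lsh_candidate_pairs` would then be checked with `jaccard` against a similarity threshold, since banding can produce false positives.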
Detect communities in graphs using the following two algorithms:
(a) The Girvan-Newman algorithm, implemented from scratch using the Spark framework
(b) The Label Propagation Algorithm, using the Spark GraphFrames library
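The core of Girvan-Newman is edge betweenness: the number of shortest paths that pass through each edge. A minimal single-machine sketch follows (plain Python, not the Spark implementation). The function name `edge_betweenness` and the adjacency-dict input format are assumptions of this example, and node labels are assumed to be mutually comparable so edges can be stored as sorted tuples.

```python
from collections import defaultdict, deque


def edge_betweenness(graph):
    """graph: dict mapping node -> set of neighbours (undirected).
    Returns {(u, v): betweenness} with u < v."""
    betweenness = defaultdict(float)
    for root in graph:
        # BFS from root: record depth, number of shortest paths, and parents.
        depth = {root: 0}
        num_paths = defaultdict(int)
        num_paths[root] = 1
        parents = defaultdict(list)
        order = []
        queue = deque([root])
        while queue:
            node = queue.popleft()
            order.append(node)
            for nb in graph[node]:
                if nb not in depth:
                    depth[nb] = depth[node] + 1
                    queue.append(nb)
                if depth[nb] == depth[node] + 1:
                    num_paths[nb] += num_paths[node]
                    parents[nb].append(node)
        # Walk back from the deepest nodes, splitting credit among parents
        # in proportion to the number of shortest paths through each parent.
        credit = {node: 1.0 for node in order}
        for node in reversed(order):
            for parent in parents[node]:
                share = credit[node] * num_paths[parent] / num_paths[node]
                betweenness[tuple(sorted((parent, node)))] += share
                credit[parent] += share
    # Every shortest path was counted once from each of its two endpoints.
    return {edge: value / 2 for edge, value in betweenness.items()}
```

Girvan-Newman then repeatedly removes the edge (or edges) with the highest betweenness, recomputes, and typically keeps the partition with the best modularity.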
Implement the following three algorithms on Streaming Data using Spark Streaming:
(a) Reservoir sampling on Twitter streaming data (via the Twitter API) to find popular tags associated with tweets.
(b) Bloom filtering on the offline Yelp business dataset to estimate whether the name of an incoming business in the data stream has been seen before.
(c) The Flajolet-Martin algorithm, using simulated streaming on the Yelp dataset, to estimate the number of unique states of the incoming businesses within a window of the data stream.
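The three streaming techniques above can each be sketched in a few lines of plain Python. These are simplified illustrations, not the Spark Streaming code: every name here is hypothetical, the hash family is an assumption of this example, and the Flajolet-Martin version uses a single hash function, whereas practical implementations combine many (for example, by taking a median of group averages).

```python
import hashlib


def stable_hash(value, seed):
    """Deterministic 64-bit hash of (seed, value)."""
    digest = hashlib.blake2b(f"{seed}:{value}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if len(reservoir) < k:
            reservoir.append(item)
        elif rng.random() < k / n:
            # Keep the n-th item with probability k/n, evicting a random slot.
            reservoir[rng.randrange(k)] = item
    return reservoir


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.bits = [False] * num_bits
        self.num_hashes = num_hashes

    def _positions(self, value):
        return [stable_hash(value, seed) % len(self.bits) for seed in range(self.num_hashes)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = True

    def might_contain(self, value):
        # False means definitely unseen; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(value))


def trailing_zeros(n):
    """Number of trailing zero bits in n (0 for n == 0, by convention here)."""
    return (n & -n).bit_length() - 1 if n else 0


def flajolet_martin(stream, seed=0):
    """Estimate the distinct count as 2**R, where R is the longest run of trailing zeros seen."""
    max_zeros = 0
    for item in stream:
        max_zeros = max(max_zeros, trailing_zeros(stable_hash(item, seed)))
    return 2 ** max_zeros
```

`reservoir_sample` takes a `random.Random` instance as `rng` so that runs can be seeded and reproduced.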
Implement the K-Means and Bradley-Fayyad-Reina (BFR) clustering algorithms.
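A minimal sketch of K-Means (Lloyd's algorithm) in plain Python is shown below. The function name `kmeans` and the explicit `initial_centroids` argument are assumptions of this example, chosen so the result is deterministic. BFR is not shown; it builds on the same idea but summarizes each cluster by its point count, per-dimension sum, and per-dimension sum of squares, so the full data set need not fit in memory.

```python
import math


def kmeans(points, initial_centroids, max_iters=100):
    """Lloyd's algorithm on tuples of numbers; returns (centroids, labels)."""
    centroids = [tuple(c) for c in initial_centroids]
    labels = []
    for _ in range(max_iters):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # leave an empty cluster's centroid where it is
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, labels
```

The result depends on the starting centroids, so in practice they are often chosen at random or with a spreading heuristic and the run is repeated.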
