Programming assignments and competition project for the course DSCI 553: Foundations and Applications of Data Mining

shringarsharan/Data-Mining

Data-Mining (Python)

Programming assignments and competition project for the course DSCI 553: Foundations and Applications of Data Mining

Competition Project

Ranked among the top 3 students in the class, based on the lowest RMSE of predictions.

Objective: Build a recommendation system to provide accurate predictions of customer reviews for businesses using Yelp datasets.
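
For reference, the ranking metric mentioned above, root-mean-square error (RMSE), can be sketched as follows. This is a minimal illustration, not the project's evaluation code: the function name `rmse` and the sample ratings are assumptions made up for the example.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared_error / len(predicted))


# Made-up star ratings, for illustration only (not Yelp data).
predicted = [4.0, 3.0, 5.0, 2.0]
actual = [5.0, 3.0, 4.0, 2.0]
score = rmse(predicted, actual)
print(f"RMSE: {score:.4f}")
```

A lower RMSE means the predicted ratings are closer to the true ratings on average, with large errors penalized more heavily than small ones.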

Programming Assignments

Objective: Understand and implement data mining algorithms on Yelp datasets using the Apache Spark framework.

  1. Explore the Yelp datasets using Spark and understand how partitions work in RDDs.
  2. Implement the SON algorithm using Spark.
  3. Tasks:
    (a) Implement the Minhashing and Locality-Sensitive Hashing (LSH) algorithms using Spark.
    (b) Build a content-based recommendation system for Yelp users.
    (c) Build recommendation systems using the following Collaborative Filtering (CF) techniques:
      (i) User-based CF
      (ii) Item-based CF
  4. Detect communities in graphs using the following two algorithms:
    (a) The Girvan-Newman algorithm, implemented from scratch using the Spark framework
    (b) The Label Propagation Algorithm, implemented using the Spark GraphFrames library
  5. Implement the following three algorithms on Streaming Data using Spark Streaming:
    (a) Reservoir Sampling algorithm on Twitter streaming data, obtained through the Twitter API, to find popular tags associated with tweets.
    (b) Bloom Filtering algorithm on the offline Yelp business dataset to estimate whether the name of an incoming business in the data stream has been seen before.
    (c) Flajolet-Martin algorithm on simulated streaming of the Yelp dataset to estimate the number of unique states of the incoming businesses within a window of the data stream.
  6. Implement the K-Means and Bradley-Fayyad-Reina (BFR) clustering algorithms.
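
To give a flavor of the streaming algorithms in assignment 5, here is a minimal sketch of reservoir sampling and a Bloom filter. It is written in plain Python rather than Spark Streaming and is not the code from this repository. The names `reservoir_sample` and `BloomFilter`, the bit-array size, the number of hashes, and the salted SHA-256 hashing scheme are all assumptions chosen for illustration.

```python
import hashlib
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the n-th item with probability k/n by drawing a slot in
            # [0, n) and replacing only if the slot falls inside the reservoir.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir


class BloomFilter:
    """Set-membership sketch: false positives are possible, false negatives are not."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Hypothetical defaults; real sizes depend on the expected stream volume.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, value):
        # Derive several bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode("utf-8")).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = True

    def might_contain(self, value):
        return all(self.bits[pos] for pos in self._positions(value))


# Made-up inputs, for illustration only.
sample = reservoir_sample(range(1000), k=10, rng=random.Random(0))
seen = BloomFilter()
seen.add("Example Diner")
print(sample)
print(seen.might_contain("Example Diner"))
```

The Bloom filter answers "possibly seen" or "definitely not seen", which is why the assignment describes it as estimating whether a business name has appeared before.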
