eriq-augustine/242-2016

242-2016

Project for the "Para-normal Distributions" team (Team #0) of CMPS242 Fall 2016. The group members are:

  • Eriq Augustine
  • Varun Embar
  • Dhawal Joharapurkar
  • Xiao Li

The aim of this project is to experiment with using clustering to find similar businesses in the Yelp Challenge Dataset.

Code

We are using Python 3. The code directory contains the actual code for our clustering and evaluation.

Running run.py performs a short clustering run on a small (100-point) subset of our data.

Running experiments.py will run our actual experiments. It runs on the entire test set and tries many parameter combinations, so we suggest not running it.

To run the tests, you can use the test.sh script. It is a very small script, but the command to run all tests in a directory is easy to forget.

The tests, run.py, and experiments.py do not hit the database by default. Instead, they load the data from a pickle generated by running data.py. If you want to use the database, then you will need a file called secrets.py that defines the constants used to connect to the database. The following constants must be defined:

  • DB_HOST
  • DB_PORT
  • DB_NAME
  • DB_USER
  • DB_PASS
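A minimal sketch of what such a secrets.py could look like is shown here. All values are placeholders, and the helper functions connection_kwargs and connect are hypothetical illustrations, not part of the project; only the five constant names come from the list above.

```python
# secrets.py: placeholder values only; replace them with your own settings.
DB_HOST = "localhost"
DB_PORT = 5432  # the default Postgres port
DB_NAME = "yelp"
DB_USER = "yelp_user"
DB_PASS = "change-me"


def connection_kwargs():
    """Map the constants onto psycopg2.connect keyword arguments (hypothetical helper)."""
    return {
        "host": DB_HOST,
        "port": DB_PORT,
        "dbname": DB_NAME,
        "user": DB_USER,
        "password": DB_PASS,
    }


def connect():
    """Open a database connection; psycopg2 is imported only when needed (hypothetical helper)."""
    import psycopg2

    return psycopg2.connect(**connection_kwargs())
```

Importing psycopg2 inside connect keeps the driver optional for anyone who only uses the pickled data.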

Dependencies

Our project uses the numpy library, which will need to be installed prior to running.

In addition, if you are going to connect to the database, you will also need to install the psycopg2 Postgres driver.

Data

The data directory mainly contains scripts for:

  • Generating SQL files from the Yelp JSON dataset.
  • Creating tables to hold the data.
  • Inserting the data.
  • Optimizing the data for our specific queries.

The build.sh script takes the data from the JSON files to optimized database tables.
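To illustrate the first step, turning one line of the Yelp JSON data into SQL might look roughly like the sketch below. This is not the project's actual script: the table name, the column names, and the helper functions sql_literal and json_line_to_insert are all assumptions made for the example, and the quoting handles only simple scalar values.

```python
import json


def sql_literal(value):
    """Render a Python scalar as a SQL literal (simplified, illustration only)."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    # Escape single quotes by doubling them, as standard SQL requires.
    return "'" + str(value).replace("'", "''") + "'"


def json_line_to_insert(line, table, columns):
    """Build one INSERT statement from one JSON record (hypothetical helper)."""
    record = json.loads(line)
    # Fields missing from the record become NULL.
    values = ", ".join(sql_literal(record.get(column)) for column in columns)
    return "INSERT INTO {} ({}) VALUES ({});".format(table, ", ".join(columns), values)
```

For example, json_line_to_insert(line, "businesses", ["business_id", "name", "stars"]) would produce one statement per input line, which can then be written to a SQL file.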