Project for the "Para-normal Distributions" team (Team #0) of CMPS242 Fall 2016. The group members are:
- Eriq Augustine
- Varun Embar
- Dhawal Joharapurkar
- Xiao Li
The aim of this project is to experiment on finding similar businesses in the Yelp Challenge Dataset using clustering.
We are using Python 3.
The code directory contains the actual code for our clustering and evaluation.
Running run.py performs a short clustering run on a small subset (100 points) of our data.
Running experiments.py runs our actual experiments.
It runs on the entire test set and tries many parameter combinations, so we suggest that you do not run it.
To run the tests, you can use the test.sh script.
It is a very small script, but the command to run all tests in a directory is easy to forget.
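For reference, running every test in a directory can also be done from Python itself. The sketch below assumes the tests are written with the standard unittest module and follow its default test*.py naming; that is an assumption, since the contents of test.sh are not shown here, and run_all_tests is a hypothetical helper name.

```python
import unittest


def run_all_tests(start_dir="."):
    """Discover and run every test*.py module under start_dir.

    Assumes unittest-style tests; returns True if no test failed or errored.
    """
    suite = unittest.defaultTestLoader.discover(start_dir)
    result = unittest.TextTestRunner(verbosity=1).run(suite)
    return result.wasSuccessful()


if __name__ == "__main__":
    # Exit non-zero on failure so the call can be used from a shell script.
    raise SystemExit(0 if run_all_tests() else 1)
```

Under the same assumption, the equivalent one-line command is `python3 -m unittest discover`.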
The tests, run.py, and experiments.py do not hit the database by default.
Instead, they load the data from a pickle generated by running data.py.
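As a rough illustration of the pickle approach, the sketch below writes a small list of records to a file and reads it back. The function names, the file name, and the record layout are all hypothetical; the actual format produced by data.py may differ.

```python
import os
import pickle
import tempfile


def save_points(points, path):
    """Serialize the data once so later runs can skip the database."""
    with open(path, "wb") as handle:
        pickle.dump(points, handle)


def load_points(path):
    """Load previously pickled data from disk."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


if __name__ == "__main__":
    # Hypothetical records: (business id, feature values).
    sample = [("business-1", [1.0, 2.0]), ("business-2", [3.0, 4.0])]
    path = os.path.join(tempfile.mkdtemp(), "points.pickle")
    save_points(sample, path)
    print(load_points(path) == sample)
```

Only load pickles that you generated yourself, since unpickling untrusted data can execute arbitrary code.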
If you want to use the database, then you will need a file called secrets.py that defines the constants used to connect to the database.
The following constants must be defined:
- DB_HOST
- DB_PORT
- DB_NAME
- DB_USER
- DB_PASS
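A minimal secrets.py might look like the sketch below. Every value is a placeholder to be replaced with your own settings, and connection_kwargs is a hypothetical helper added only to show how the constants map onto the keyword arguments accepted by psycopg2.connect; how our code actually consumes the constants is not shown here.

```python
# secrets.py (placeholder values; substitute your own and keep this file
# out of version control).
DB_HOST = "localhost"
DB_PORT = 5432
DB_NAME = "yelp"
DB_USER = "yelp_user"
DB_PASS = "change-me"


def connection_kwargs():
    """Hypothetical helper: bundle the constants as connection arguments."""
    return {
        "host": DB_HOST,
        "port": DB_PORT,
        "dbname": DB_NAME,
        "user": DB_USER,
        "password": DB_PASS,
    }


# With psycopg2 installed, a connection could then be opened with:
#     import psycopg2
#     connection = psycopg2.connect(**connection_kwargs())
```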
Our project uses the numpy library, which will need to be installed prior to running.
In addition, if you are going to connect to the database, then you will also need to install the psycopg2 Postgres driver.
The data directory mainly contains scripts for:
- Generating SQL files from the Yelp JSON dataset.
- Creating tables to hold the data.
- Inserting the data.
- Optimizing the data for our specific queries.
The build.sh script takes the data from the JSON files to optimized database tables.
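To give a feel for the first step, the sketch below turns one line of JSON into a SQL INSERT statement. It is a simplified illustration, not our actual script: the table name, the column list, and both function names are hypothetical, and the quoting covers only basic scalar values.

```python
import json


def sql_literal(value):
    """Render a Python scalar as a SQL literal."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return str(value)
    # Strings: wrap in single quotes and double any embedded single quote.
    return "'" + str(value).replace("'", "''") + "'"


def json_line_to_insert(line, table, columns):
    """Build an INSERT statement from one JSON object (one line of a file)."""
    record = json.loads(line)
    values = ", ".join(sql_literal(record.get(column)) for column in columns)
    return "INSERT INTO {} ({}) VALUES ({});".format(
        table, ", ".join(columns), values
    )


if __name__ == "__main__":
    sample = '{"business_id": "abc123", "name": "Joe\'s Diner", "stars": 4.5}'
    print(json_line_to_insert(sample, "Businesses", ["business_id", "name", "stars"]))
    # → INSERT INTO Businesses (business_id, name, stars) VALUES ('abc123', 'Joe''s Diner', 4.5);
```

Writing statements to a SQL file this way suits a one-off bulk load; for queries issued from application code, parameterized queries through the database driver are the safer choice.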