Big Data Project

Instructions to run the code:

1. Connect to NYU HPC

2. ssh dumbo

3. In your /home/<user>/ directory, do the following:
3.1 vi .bashrc
3.2 In the .bashrc file opened above, add the following lines:

alias hfs='/usr/bin/hadoop fs '
export HAS=/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib
export HSJ=hadoop-mapreduce/hadoop-streaming.jar
alias hjs='/usr/bin/hadoop jar $HAS/$HSJ'

4. Download the repository either from the GitHub web interface by clicking "Download ZIP" or by executing the following command:
git clone https://github.com/aparajita2930/NYC_Complaints_Analysis.git

5. cd NYC_Complaints_Analysis

6. From the top-level directory, execute the run.sh script with the following command:
sh run.sh

This generates all the summary files in the directory "out_data", ensuring that the output files we generated in "results/res_col_summary" during our runs are not overwritten.

To run only specific scripts, do not execute "sh run.sh". Instead, run the commands for the column of interest individually; for example, for the column "city":
hfs -rm -r 17_details.out
hfs -rm -r 17_summary.out
hfs -rm -r 17_datatype.out
hfs -rm -r 17_semantictype.out
hfs -rm -r 17_validity.out
spark-submit src/column_summary/17_city.py /user/ac5901/NYC_Complaints.csv/part-00000
spark-submit src/column_summary/0_create_file_summary.py 17_details.out
hfs -getmerge 17_summary.out out_data/17_summary.out
spark-submit src/column_summary/0_column_summary.py 17
hfs -getmerge 17_datatype.out out_data/17_datatype.out
hfs -getmerge 17_semantictype.out out_data/17_semantictype.out
hfs -getmerge 17_validity.out out_data/17_validity.out

(Here, 17 refers to the column number. Every script in the src/column_summary directory begins with a column number. Example: 17_city.py)
The <col_num>_datatype.out file contains the count of each datatype in the column.
The <col_num>_semantictype.out file contains the count of each semantic type in the column.
The <col_num>_validity.out file contains the count of each label (valid, invalid, or null) in the column.
The <col_num>_summary.out file combines the above three.
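
For reference, below is a minimal sketch of what a column-summary script like 17_city.py might look like. This is an illustration under assumptions: the tagging rules, the naive comma split, and the use of column index 17 are simplifications, not the project's exact logic.

# Hypothetical sketch in the style of 17_city.py; the tagging rules,
# naive CSV split, and labels below are illustrative only.
import sys
from pyspark import SparkContext

sc = SparkContext()
# sys.argv[1] is the input file, e.g. /user/<user>/NYC_Complaints.csv/part-00000
col = sc.textFile(sys.argv[1]).map(lambda line: line.split(',')[17])

def label(value):
    # Return a (datatype, semantic type, validity) triple for one cell.
    if value == '':
        return ('NULL', 'missing', 'null')
    if any(ch.isdigit() for ch in value):
        return ('TEXT', 'unknown', 'invalid')
    return ('TEXT', 'city', 'valid')

labeled = col.map(label).cache()

# One (label, count) output per aspect, mirroring the three .out files.
labeled.map(lambda t: (t[0], 1)).reduceByKey(lambda a, b: a + b) \
       .map(lambda kv: '%s\t%d' % kv).saveAsTextFile('17_datatype.out')
labeled.map(lambda t: (t[1], 1)).reduceByKey(lambda a, b: a + b) \
       .map(lambda kv: '%s\t%d' % kv).saveAsTextFile('17_semantictype.out')
labeled.map(lambda t: (t[2], 1)).reduceByKey(lambda a, b: a + b) \
       .map(lambda kv: '%s\t%d' % kv).saveAsTextFile('17_validity.out')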

7. From the top-level directory, execute the run_use_cases.sh script with the following command to reproduce the additional analyses that extract trends from the data:
sh run_use_cases.sh

This generates all the output files in the directory "use_cases_data", ensuring that the output files we generated in "results/res_use_cases" are not overwritten.
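
To give a flavor of these analyses, here is a minimal hypothetical PySpark sketch that counts complaints per value of one column. The column index (25) and the output name are assumptions for illustration, not taken from the actual use-case scripts.

# Hypothetical trend sketch; the column index and output name are illustrative.
import sys
from pyspark import SparkContext

sc = SparkContext()
rows = sc.textFile(sys.argv[1]).map(lambda line: line.split(','))

# Count complaints per value of an assumed borough column (index 25),
# then write the counts in descending order.
rows.map(lambda r: (r[25], 1)) \
    .reduceByKey(lambda a, b: a + b) \
    .sortBy(lambda kv: -kv[1]) \
    .map(lambda kv: '%s\t%d' % kv) \
    .saveAsTextFile('borough_counts.out')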

8. To generate the plots, run the plotting code on your local machine, not on NYU HPC (Dumbo).
Requirements:
Plot generation requires the packages matplotlib, seaborn, and jupyter (the csv module used alongside them ships with Python):
pip install matplotlib
pip install seaborn
pip install jupyter

If "pip install jupyter" fails on Python 2, install Anaconda instead, following the instructions at https://www.continuum.io/downloads.

Then change into the plots directory:
cd results/plots

For Part I, run the following command:
jupyter notebook visualizations.ipynb

For Part II, run the following command (the file name contains spaces, so quote it):
jupyter notebook "Part 2 Visualizations.ipynb"

You can run all the cells in the notebook by clicking on the menu Cell -> Run All.
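
As a rough idea of what a plotting cell in these notebooks does, here is a minimal sketch. It assumes a merged summary file of tab-separated (label, count) lines; the file path is an example, not necessarily one read by the notebooks.

# Hypothetical plotting cell; the input path and line format are assumptions.
import csv
import matplotlib.pyplot as plt
import seaborn as sns

labels, counts = [], []
# Assumed input: a getmerge'd summary file of tab-separated (label, count) lines.
with open('../res_col_summary/17_validity.out') as f:
    for row in csv.reader(f, delimiter='\t'):
        labels.append(row[0])
        counts.append(int(row[1]))

sns.barplot(x=labels, y=counts)  # bar chart of label frequencies
plt.title('Validity counts for the "city" column')
plt.show()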

You can also view all the plots in the notebooks hosted on GitHub:

Part I
https://github.com/aparajita2930/NYC_Complaints_Analysis/blob/master/results/plots/visualizations.ipynb 

Part II
https://github.com/aparajita2930/NYC_Complaints_Analysis/blob/master/results/plots/Part%202%20Visualizations.ipynb
