# aparajita2930/NYC_Complaints_Analysis
Big Data Project

## Instructions to run the code

1. Connect to NYU HPC.
2. `ssh dumbo`
3. In the `/home/<user>/` directory, open `.bashrc` (e.g. `vi .bashrc`) and add the following lines:

   ```
   alias hfs='/usr/bin/hadoop fs '
   export HAS=/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib
   export HSJ=hadoop-mapreduce/hadoop-streaming.jar
   alias hjs='/usr/bin/hadoop jar $HAS/$HSJ'
   ```

4. Download the repository, either from the GitHub web console by clicking "Download ZIP" or by running:

   ```
   git clone https://github.com/aparajita2930/NYC_Complaints_Analysis.git
   ```

5. `cd NYC_Complaints_Analysis`
6. From the top-level directory, execute the `run.sh` script:

   ```
   sh run.sh
   ```

   This generates all the summary files in the directory `out_data`. This ensures that the output files we generated in `results/res_col_summary` during our runs are not overwritten.

   To run only specific scripts, do not execute `sh run.sh`. Instead, for example, to run the script for the column "city", run the following commands:

   ```
   hfs -rm -r 17_details.out
   hfs -rm -r 17_summary.out
   hfs -rm -r 17_datatype.out
   hfs -rm -r 17_semantictype.out
   hfs -rm -r 17_validity.out
   spark-submit src/column_summary/17_city.py /user/ac5901/NYC_Complaints.csv/part-00000
   spark-submit src/column_summary/0_create_file_summary.py 17_details.out
   hfs -getmerge 17_summary.out out_data/17_summary.out
   spark-submit src/column_summary/0_column_summary.py 17
   hfs -getmerge 17_datatype.out out_data/17_datatype.out
   hfs -getmerge 17_semantictype.out out_data/17_semantictype.out
   hfs -getmerge 17_validity.out out_data/17_validity.out
   ```

   Here, 17 refers to the column number. Every script in the `src/column_summary` directory begins with a column number (example: `17_city.py`).

   - The `<col_num>_datatype.out` file contains the count of each datatype in the column.
   - The `<col_num>_semantictype.out` file contains the count of each semantic type in the column.
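The per-column scripts themselves are PySpark jobs; as a rough illustration of the kind of per-value labelling and counting they perform, here is a minimal pure-Python sketch. The function names, type rules, and city list are illustrative assumptions, not the project's actual logic:

```python
from collections import Counter

def datatype_of(value):
    """Guess a coarse datatype for one raw CSV field (illustrative rules only)."""
    if value is None or value.strip() == "":
        return "empty"
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "text"

def validity_of(value, known_cities):
    """Label a 'city' value as valid, invalid, or null (assumed semantics)."""
    if value is None or value.strip() == "":
        return "null"
    return "valid" if value.strip().upper() in known_cities else "invalid"

# Toy column of city values standing in for one column of NYC_Complaints.csv.
column = ["BROOKLYN", "", "QUEENS", "123", "BRONX", "BROOKLYN"]
known = {"BROOKLYN", "QUEENS", "BRONX", "MANHATTAN", "STATEN ISLAND"}

datatype_counts = Counter(datatype_of(v) for v in column)
validity_counts = Counter(validity_of(v, known) for v in column)
print(datatype_counts)  # Counter({'text': 4, 'empty': 1, 'int': 1})
print(validity_counts)  # Counter({'valid': 4, 'null': 1, 'invalid': 1})
```

In the real pipeline these counts are computed over HDFS data by Spark and written to the `*_datatype.out` / `*_validity.out` directories rather than printed.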
   - The `<col_num>_validity.out` file contains the count of each label (valid, invalid, or null) in the column.
   - The `<col_num>_summary.out` file contains the combination of the above three.

7. From the top-level directory, execute the `run_use_cases.sh` script to get the results of the other analyses performed to obtain trends from the data:

   ```
   sh run_use_cases.sh
   ```

   This generates all the output files in the directory `use_cases_data`. This ensures that the output files we generated in `results/res_use_cases` are not overwritten.

8. To generate the plots, run the plotting code on a local machine, not on NYU HPC/DUMBO.

   Requirements: the packages matplotlib, seaborn, jupyter, and csv.

   ```
   pip install matplotlib
   pip install seaborn
   pip install jupyter
   ```

   If this does not work on Python 2, install Anaconda following the instructions at https://www.continuum.io/downloads. Then:

   ```
   cd results/plots
   ```

   For Part I, run:

   ```
   jupyter notebook visualizations.ipynb
   ```

   For Part II, run:

   ```
   jupyter notebook "Part 2 Visualizations.ipynb"
   ```

   You can run all the cells in a notebook via the menu Cell -> Run All.

   You can also view the plots in the hosted notebooks:
   - Part I: https://github.com/aparajita2930/NYC_Complaints_Analysis/blob/master/results/plots/visualizations.ipynb
   - Part II: https://github.com/aparajita2930/NYC_Complaints_Analysis/blob/master/results/plots/Part%202%20Visualizations.ipynb
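Before plotting, the merged `*.out` files under `out_data` have to be read back into label/count pairs. A minimal sketch, assuming a simple tab-separated `label<TAB>count` line format (the actual output format of the Spark jobs may differ, and the sample numbers below are made up):

```python
import io

def read_counts(fileobj):
    """Parse 'label<TAB>count' lines into a dict, skipping blank lines."""
    counts = {}
    for line in fileobj:
        line = line.strip()
        if not line:
            continue
        label, count = line.rsplit("\t", 1)
        counts[label] = int(count)
    return counts

# Stand-in for a file such as out_data/17_validity.out (invented sample data).
sample = io.StringIO("valid\t180000\ninvalid\t2300\nnull\t4100\n")
counts = read_counts(sample)
print(counts)  # {'valid': 180000, 'invalid': 2300, 'null': 4100}
```

A dict like this can then be fed directly to matplotlib/seaborn bar plots in the notebooks.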