# aparajita2930/NYC_Complaints_Analysis
Big Data Project

## Instructions to run the code

1. Connect to NYU HPC.
2. `ssh dumbo`
3. In the `/home/<user>/` directory, open `.bashrc` (e.g. `vi .bashrc`) and add the following lines:

   ```
   alias hfs='/usr/bin/hadoop fs '
   export HAS=/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib
   export HSJ=hadoop-mapreduce/hadoop-streaming.jar
   alias hjs='/usr/bin/hadoop jar $HAS/$HSJ'
   ```

4. Download the repository, either from the GitHub web console by clicking "Download ZIP" or by running:

   ```
   git clone https://github.com/aparajita2930/NYC_Complaints_Analysis.git
   ```

5. `cd NYC_Complaints_Analysis`
6. From the top-level directory, execute the `run.sh` script:

   ```
   sh run.sh
   ```

   This generates all the summary files in the directory `out_data`. This ensures that the output files we generated in `results/res_col_summary` during our runs are not overwritten.

   To run only specific scripts, do not execute `sh run.sh`. Instead, for example, to run the script for the column "city", run the following commands:

   ```
   hfs -rm -r 17_details.out
   hfs -rm -r 17_summary.out
   hfs -rm -r 17_datatype.out
   hfs -rm -r 17_semantictype.out
   hfs -rm -r 17_validity.out
   spark-submit src/column_summary/17_city.py /user/ac5901/NYC_Complaints.csv/part-00000
   spark-submit src/column_summary/0_create_file_summary.py 17_details.out
   hfs -getmerge 17_summary.out out_data/17_summary.out
   spark-submit src/column_summary/0_column_summary.py 17
   hfs -getmerge 17_datatype.out out_data/17_datatype.out
   hfs -getmerge 17_semantictype.out out_data/17_semantictype.out
   hfs -getmerge 17_validity.out out_data/17_validity.out
   ```

   Here, 17 refers to the column number. Every script in the `src/column_summary` directory begins with a column number (example: `17_city.py`).

   - The `<col_num>_datatype.out` file contains the count of each datatype in the column.
   - The `<col_num>_semantictype.out` file contains the count of each semantic type in the column.
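The per-column scripts themselves are PySpark jobs; as a rough illustration of the kind of per-value labelling and counting they perform, here is a minimal pure-Python sketch. The function names, type rules, and city list are illustrative assumptions, not the project's actual logic:

```python
from collections import Counter

def datatype_of(value):
    """Guess a coarse datatype for one raw CSV field (illustrative rules only)."""
    if value is None or value.strip() == "":
        return "empty"
    try:
        int(value)
        return "int"
    except ValueError:
        pass
    try:
        float(value)
        return "float"
    except ValueError:
        return "text"

def validity_of(value, known_cities):
    """Label a 'city' value as valid, invalid, or null (assumed semantics)."""
    if value is None or value.strip() == "":
        return "null"
    return "valid" if value.strip().upper() in known_cities else "invalid"

# Toy column of city values standing in for one column of NYC_Complaints.csv.
column = ["BROOKLYN", "", "QUEENS", "123", "BRONX", "BROOKLYN"]
known = {"BROOKLYN", "QUEENS", "BRONX", "MANHATTAN", "STATEN ISLAND"}

datatype_counts = Counter(datatype_of(v) for v in column)
validity_counts = Counter(validity_of(v, known) for v in column)
print(datatype_counts)  # Counter({'text': 4, 'empty': 1, 'int': 1})
print(validity_counts)  # Counter({'valid': 4, 'null': 1, 'invalid': 1})
```

In the real pipeline these counts are computed over HDFS data by Spark and written to the `*_datatype.out` / `*_validity.out` directories rather than printed.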
   - The `<col_num>_validity.out` file contains the count of each label (valid, invalid, or null) in the column.
   - The `<col_num>_summary.out` file contains the combination of the above three.

7. From the top-level directory, execute the `run_use_cases.sh` script to get the results of the other analyses performed to obtain trends from the data:

   ```
   sh run_use_cases.sh
   ```

   This generates all the output files in the directory `use_cases_data`. This ensures that the output files we generated in `results/res_use_cases` are not overwritten.

8. To generate the plots, run the plotting code on a local machine, not on NYU HPC/DUMBO.

   Requirements: the packages matplotlib, seaborn, jupyter, and csv.

   ```
   pip install matplotlib
   pip install seaborn
   pip install jupyter
   ```

   If this does not work on Python 2, install Anaconda following the instructions at https://www.continuum.io/downloads. Then:

   ```
   cd results/plots
   ```

   For Part I, run:

   ```
   jupyter notebook visualizations.ipynb
   ```

   For Part II, run:

   ```
   jupyter notebook "Part 2 Visualizations.ipynb"
   ```

   You can run all the cells in a notebook via the menu Cell -> Run All.

   You can also view the plots in the hosted notebooks:
   - Part I: https://github.com/aparajita2930/NYC_Complaints_Analysis/blob/master/results/plots/visualizations.ipynb
   - Part II: https://github.com/aparajita2930/NYC_Complaints_Analysis/blob/master/results/plots/Part%202%20Visualizations.ipynb
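Before plotting, the merged `*.out` files under `out_data` have to be read back into label/count pairs. A minimal sketch, assuming a simple tab-separated `label<TAB>count` line format (the actual output format of the Spark jobs may differ, and the sample numbers below are made up):

```python
import io

def read_counts(fileobj):
    """Parse 'label<TAB>count' lines into a dict, skipping blank lines."""
    counts = {}
    for line in fileobj:
        line = line.strip()
        if not line:
            continue
        label, count = line.rsplit("\t", 1)
        counts[label] = int(count)
    return counts

# Stand-in for a file such as out_data/17_validity.out (invented sample data).
sample = io.StringIO("valid\t180000\ninvalid\t2300\nnull\t4100\n")
counts = read_counts(sample)
print(counts)  # {'valid': 180000, 'invalid': 2300, 'null': 4100}
```

A dict like this can then be fed directly to matplotlib/seaborn bar plots in the notebooks.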