This project deploys a Big Data analytics stack on virtual machines and then constructs 100 graphs from 100 CSV files using Python, Spark, and Hadoop.
Network analysis usually requires intensive computing resources and long waiting times; running these tasks on HDFS, Spark, and Python therefore greatly improves efficiency.
- Ansible: for automated deployment of software packages across multiple VMs and for running scripts;
- Hadoop: for hosting the dataset;
- Spark and Python: for constructing and analyzing the graphs.
This project completes the following tasks:
- Deploy the Big Data Stack following the official documents;
- Use an Ansible playbook to install Python packages (i.e., networkx and pandas) on the VMs for network analysis;
- Use an Ansible playbook to download the 100 CSV files (the dataset is hosted on my own website);
- Use an Ansible playbook to put the dataset onto HDFS;
- Use an Ansible playbook to download the Python analysis script (written by myself and hosted on my own website);
- Use an Ansible playbook to run the analysis script.
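The core of the analysis step is building a graph from each CSV file. A minimal sketch of what such a step might look like is below; it assumes each CSV is an edge list with `source` and `target` columns, which may differ from the actual format used by `graph_generator.py`.

```python
# Hypothetical sketch of the per-file graph construction: the real
# graph_generator.py hosted with this project may differ in details.
import csv
import io

import networkx as nx


def build_graph(csv_text):
    """Build an undirected graph from CSV text with a source,target header."""
    graph = nx.Graph()
    for row in csv.DictReader(io.StringIO(csv_text)):
        graph.add_edge(row["source"], row["target"])
    return graph


sample = "source,target\na,b\nb,c\n"
g = build_graph(sample)
print(g.number_of_nodes(), g.number_of_edges())  # 3 2
```

In the full pipeline, each of the 100 graphs built this way would then be written out with `nx.write_graphml`.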
- Clone this repository.
- Deploy the Big Data Stack following the official documents (https://github.com/futuresystems/big-data-stack).
- Run `ansible-playbook addons/{pig,spark}.yml` to install the Pig and Spark addons.
- Run `ansible-playbook deploy.yml run.yml` to download the dataset, deploy it onto HDFS, and run the analysis (the major tasks of this project).
After the analysis (which takes about 1 minute), 100 GraphML files are produced and stored under the /tmp/graphs directory on the frontend VM.
The major finding is the improvement in efficiency: a similar task usually takes about 20 minutes to complete, while this project finishes in about 1 minute.
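One way to verify the output is to load a GraphML file back with networkx. The sketch below is illustrative only: it writes a stand-in graph to a temporary file rather than reading one of the actual files from /tmp/graphs.

```python
# Illustrative round trip: save and reload a graph in GraphML format,
# the same format as the 100 files produced under /tmp/graphs.
import tempfile

import networkx as nx

# Stand-in for one of the generated graphs.
g = nx.Graph()
g.add_edge("a", "b")

with tempfile.NamedTemporaryFile(suffix=".graphml") as f:
    nx.write_graphml(g, f.name)       # how an analysis script could save a graph
    loaded = nx.read_graphml(f.name)  # how a downstream tool would load it back

print(loaded.number_of_nodes(), loaded.number_of_edges())  # 2 1
```

GraphML files produced this way can also be opened in tools such as Gephi for visual inspection.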
- Original dataset: http://jima-wordpress.stor.sinaapp.com/simu-0-99.zip
- Python analysis script: http://jima-wordpress.stor.sinaapp.com/graph_generator.py
Special thanks to Hyungro Lee and Badi' Abdul-Wahid for their patient guidance.