This project deploys a Big Data analytics stack on virtual machines and then constructs 100 graphs from 100 CSV files using Python, Spark, and Hadoop.
Network analysis usually requires intensive computing resources and long waiting times; running these tasks on HDFS, Spark, and Python therefore greatly improves efficiency.
- Ansible: for automated deployment of software packages across multiple VMs and for running scripts;
- Hadoop: for hosting the dataset;
- Spark and Python: for constructing and analyzing the graphs.
This project completes the following tasks:
- Deploy the Big Data Stack following the official documents;
- Use an Ansible playbook to install Python packages (i.e., networkx and pandas) on the VMs for network analysis;
- Use an Ansible playbook to download the 100 CSV files (the dataset is hosted on my own website);
- Use an Ansible playbook to put the dataset onto HDFS;
- Use an Ansible playbook to download the Python analysis script (written by myself and hosted on my own website);
- Use an Ansible playbook to run the analysis script.
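The core of the analysis step is building a graph from each CSV file. A minimal sketch of what such a step might look like is below; it assumes each CSV is an edge list with `source` and `target` columns, which may differ from the actual format used by `graph_generator.py`.

```python
# Hypothetical sketch of the per-file graph construction: the real
# graph_generator.py hosted with this project may differ in details.
import csv
import io

import networkx as nx


def build_graph(csv_text):
    """Build an undirected graph from CSV text with a source,target header."""
    graph = nx.Graph()
    for row in csv.DictReader(io.StringIO(csv_text)):
        graph.add_edge(row["source"], row["target"])
    return graph


sample = "source,target\na,b\nb,c\n"
g = build_graph(sample)
print(g.number_of_nodes(), g.number_of_edges())  # 3 2
```

In the full pipeline, each of the 100 graphs built this way would then be written out with `nx.write_graphml`.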
- Clone this repository.
- Deploy the Big Data Stack following the official documents (https://github.com/futuresystems/big-data-stack).
- Run `ansible-playbook addons/{pig,spark}.yml` to install the Pig and Spark addons.
- Run `ansible-playbook deploy.yml run.yml` to download the dataset, deploy it onto HDFS, and run the analysis (the major tasks of this project).
After the analysis (which takes about 1 minute), 100 GraphML files are produced and stored under the /tmp/graphs directory on the frontend VM.
The major finding is the improvement in efficiency: a similar task usually takes about 20 minutes to complete, while this project finishes in about 1 minute.
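One way to verify the output is to load a GraphML file back with networkx. The sketch below is illustrative only: it writes a stand-in graph to a temporary file rather than reading one of the actual files from /tmp/graphs.

```python
# Illustrative round trip: save and reload a graph in GraphML format,
# the same format as the 100 files produced under /tmp/graphs.
import tempfile

import networkx as nx

# Stand-in for one of the generated graphs.
g = nx.Graph()
g.add_edge("a", "b")

with tempfile.NamedTemporaryFile(suffix=".graphml") as f:
    nx.write_graphml(g, f.name)       # how an analysis script could save a graph
    loaded = nx.read_graphml(f.name)  # how a downstream tool would load it back

print(loaded.number_of_nodes(), loaded.number_of_edges())  # 2 1
```

GraphML files produced this way can also be opened in tools such as Gephi for visual inspection.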
- Original dataset: http://jima-wordpress.stor.sinaapp.com/simu-0-99.zip
- Python analysis script: http://jima-wordpress.stor.sinaapp.com/graph_generator.py
Special thanks to Hyungro Lee and Badi' Abdul-Wahid for their patient guidance.