
Big Data Analytics Stack

Note: for the latest version, refer to futuresystems/big-data-stack

Provides a set of Ansible playbooks to deploy a Big Data analytics stack on top of Hadoop/YARN.

The play-hadoop.yml playbook deploys the base system. Addons, such as Pig, Spark, etc., are deployed using the playbooks in the addons directory. A playbook for deploying all the addons is given in play-alladdons.yml.

Stack

Legend:

  • available
  • italic bold: work-in-progress
  • planned

Analytics Layer

Data Processing Layer

Database Layer

Scheduling

Storage

Monitoring

Usage

  1. Make sure to start an ssh-agent so you don't need to retype your passphrase multiple times. We've also noticed that if you are running on india, Ansible may be unable to access the nodes and complain with something like:

    master0 | UNREACHABLE! => {
        "changed": false,
        "msg": "ssh cc@129.114.110.126:22 : Private key file is encrypted\nTo connect as a different user, use -u <username>.",
        "unreachable": true
    }
    

    To start the agent:

    badi@i136 ~$ eval $(ssh-agent)
    badi@i136 ~$ ssh-add
    
  2. Make sure your public key is added to github.com

  3. Download this repository using git clone --recursive. IMPORTANT: make sure you specify the --recursive option, otherwise you will get errors.
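
    For example, using the upstream URL (the same URL appears under Upgrading below; substitute your fork's URL if needed):

    badi@i136 ~$ git clone --recursive https://github.com/futuresystems/big-data-stack.git
    badi@i136 ~$ cd big-data-stack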

  4. Install the requirements using pip install -r requirements.txt

  5. Edit .cluster.py to define the machines in the cluster.

  6. Launch the cluster using vcl boot -p openstack -P $USER-. This will start the machines on whatever OpenStack environment is currently available (via $OS_PROJECT_NAME, $OS_AUTH_URL, etc.), prefixing $USER- to the name of each VM (e.g. zk0 becomes badi-zk0).
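
    For example (a sketch assuming your OpenStack credentials are loaded from an openrc file; the filename is site-specific):

    badi@i136 ~$ source openrc.sh    # sets $OS_AUTH_URL, $OS_PROJECT_NAME, etc.
    badi@i136 ~$ env | grep ^OS_     # sanity-check the credentials are set
    badi@i136 ~$ vcl boot -p openstack -P $USER-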

  7. Make sure that ansible.cfg reflects your environment. Look especially at remote_user if you are not using Ubuntu.
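
    The relevant setting lives under the [defaults] section; a minimal sketch (the repository's ansible.cfg may set more options):

    [defaults]
    remote_user = ubuntu    # change this if your images use a different login user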

  8. Ensure ssh_bastion_config is to your liking (it assumes you are using the openstack cluster on FutureSystems).
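
    For reference, a bastion setup of this kind typically looks like the following sketch (the host pattern and bastion address here are hypothetical; adapt them to your site):

    Host 10.0.0.*
        User ubuntu
        ProxyCommand ssh -W %h:%p badi@india.futuresystems.org
        UserKnownHostsFile /dev/null
        StrictHostKeyChecking no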

  9. Run ansible all -m ping to make sure all nodes can be managed.
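
    A healthy node answers along these lines:

    master0 | SUCCESS => {
        "changed": false,
        "ping": "pong"
    }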

  10. Define zookeeper_id for each zookeeper node by adapting the following (NO LONGER NEEDED as of v0.2.4):

    mkdir host_vars
    for i in 0 1 2; do
      echo "zookeeper_id: $(( i+1 ))" > host_vars/master$i
    done
    
  11. Run ansible-playbook play-hadoop.yml to install the base system.

  12. Run ansible-playbook addons/{pig,spark}.yml # etc to install the Pig and Spark addons.

  13. Log into the frontend node (see the [frontends] group in the inventory) and use the hadoop user (sudo su - hadoop) to run jobs on the cluster.
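
    For example, once on the frontend (the host name frontend0 in the prompts below is illustrative):

    ubuntu@frontend0 ~$ sudo su - hadoop
    hadoop@frontend0 ~$ yarn node -list    # confirm the NodeManagers have registered
    hadoop@frontend0 ~$ hdfs dfs -ls /     # confirm HDFS is reachable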

Sidenote: you may want to pass the -f <N> flag to ansible-playbook to use N parallel connections. This will make the deployment go faster. For example:

$ ansible-playbook -f $(egrep '^[a-zA-Z]' inventory.txt | sort | uniq | wc -l) # etc ...

The hadoop user is present on all the nodes and is the hadoop administrator. If you need to change anything on HDFS, it must be done as hadoop.
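
For instance, to create an HDFS home directory for a regular user (the user name alice is hypothetical):

$ sudo su - hadoop
hadoop@frontend0 ~$ hdfs dfs -mkdir -p /user/alice
hadoop@frontend0 ~$ hdfs dfs -chown alice /user/alice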

Access

vcl ssh can be used as shorthand to access the nodes. It looks up the IP address in the generated .machines.yml, using the floating IP if available.
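
For example (assuming a node named zk0 as in the boot step above; exact names come from your .cluster.py):

$ vcl ssh zk0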

Monitoring

You can access the Ganglia display on the monitoring node. The interface is kept local to the virtual cluster, so you need to log in with X forwarding enabled and install a browser. For example:

[badi@india]: ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no ubuntu@123.45.67.89 -X
[ubuntu@master2]: sudo apt-get -y install firefox
[ubuntu@master2]: firefox http://localhost/ganglia

Upgrading

Whenever a new release is made, you can get the changes by either cloning a fresh repository (as above), or pulling changes from the upstream master branch and updating the submodules:

$ git pull https://github.com/futuresystems/big-data-stack master
$ git submodule update
$ pip install -U -r requirements.txt

Examples

See the examples directory:

  • nist_fingerprint: fingerprint analysis using Spark, with results pushed to HBase

License

Please see the LICENSE file in the root directory of the repository.

Contributing

  1. Fork the repository
  2. Add yourself to the CONTRIBUTORS.yml file
  3. Submit a pull request to the unstable branch
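
A typical flow looks something like this sketch (branch and fork names are illustrative; origin is assumed to be your fork):

$ git clone --recursive git@github.com:<your-username>/big-data-stack.git
$ cd big-data-stack
$ git checkout -b my-change origin/unstable
# ... edit, commit, add yourself to CONTRIBUTORS.yml ...
$ git push origin my-change
# then open a pull request against the unstable branch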
