Note: for the latest version, refer to the `futuresystems/big-data-stack` repository.
Provides a set of Ansible playbooks to deploy a Big Data analytics stack on top of Hadoop/YARN. The `play-hadoop.yml` playbook deploys the base system. Addons, such as Pig, Spark, etc., are deployed using the playbooks in the `addons` directory. A playbook for deploying all the addons is given in `play-alladdons.yml`.
Legend:

- available
- ***ItaliBold***: work-in-progress
- planned
- Make sure to start an ssh-agent so you don't need to retype your passphrase multiple times. We've also noticed that if you are running on `india`, Ansible may be unable to access the nodes and complain with something like:

  ```
  master0 | UNREACHABLE! => {
      "changed": false,
      "msg": "ssh cc@129.114.110.126:22 : Private key file is encrypted\nTo connect as a different user, use -u <username>.",
      "unreachable": true
  }
  ```

  To start the agent:

  ```
  badi@i136 ~$ eval $(ssh-agent)
  badi@i136 ~$ ssh-add
  ```
- Make sure your public key is added to github.com.
- Download this repository using `git clone --recursive`. IMPORTANT: make sure you specify the `--recursive` option, otherwise you will get errors.
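  For example (the URL follows from the repository named at the top of this README):

  ```
  $ git clone --recursive https://github.com/futuresystems/big-data-stack.git
  $ cd big-data-stack
  ```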
- Install the requirements using `pip install -r requirements.txt`.
- Edit `.cluster.py` to define the machines in the cluster.
- Launch the cluster using `vcl boot -p openstack -P $USER-`. This will start the machines on whatever OpenStack environment is currently available (via `$OS_PROJECT_NAME`, `$OS_AUTH_URL`, etc.), prefixing `$USER-` to the name of each VM (e.g. `zk0` becomes `badi-zk0`).
- Make sure that `ansible.cfg` reflects your environment. Look especially at `remote_user` if you are not using Ubuntu.
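  As an illustration only (these values are assumptions, not the shipped defaults), the relevant settings look something like:

  ```ini
  [defaults]
  inventory = inventory.txt
  # change remote_user if your images do not use Ubuntu's default login
  remote_user = ubuntu
  ```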
- Ensure `ssh_bastion_config` is to your liking (it assumes you are using the OpenStack cluster on FutureSystems).
- Run `ansible all -m ping` to make sure all nodes can be managed.
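  If the nodes are reachable, each one should reply with a `pong`, along the lines of:

  ```
  master0 | SUCCESS => {
      "changed": false,
      "ping": "pong"
  }
  ```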
- (NO LONGER NEEDED as of v0.2.4) Define `zookeeper_id` for each zookeeper node. Adapt the following:

  ```
  mkdir host_vars
  for i in 0 1 2; do
      echo "zookeeper_id: $(( i+1 ))" > host_vars/master$i
  done
  ```
- Run `ansible-playbook play-hadoop.yml` to install the base system.
- Run `ansible-playbook addons/{pig,spark}.yml` (etc.) to install the Pig and Spark addons.
- Log into the frontend node (see the `[frontends]` group in the inventory) and use the `hadoop` user (`sudo su - hadoop`) to run jobs on the cluster.
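  For example, a quick HDFS smoke test (the hostname in the prompt is illustrative; `hdfs dfs -ls /` lists the HDFS root):

  ```
  ubuntu@frontend0:~$ sudo su - hadoop
  hadoop@frontend0:~$ hdfs dfs -ls /
  ```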
Sidenote: you may want to pass the `-f <N>` flag to `ansible-playbook` to use N parallel connections; this will make the deployment go faster. For example:

```
$ ansible-playbook -f $(egrep '^[a-zA-Z]' inventory.txt | sort | uniq | wc -l) # etc ...
```
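The subshell above just counts the distinct hosts in the inventory. A sketch against a small, made-up `inventory.txt` (the group and host names are hypothetical):

```shell
# Toy inventory: group headers start with '[', host lines with a letter.
cat > /tmp/inventory.txt <<'EOF'
[namenodes]
master0
master1

[frontends]
master0

[datanodes]
data0
data1
EOF

# Keep lines that begin with a letter, de-duplicate, and count:
NHOSTS=$(egrep '^[a-zA-Z]' /tmp/inventory.txt | sort | uniq | wc -l)
echo "$NHOSTS"   # 4 distinct hosts here
```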
The `hadoop` user is present on all the nodes and is the Hadoop administrator. If you need to change anything on HDFS, it must be done as `hadoop`.
`vcl ssh` can be used as shorthand to access the nodes. It looks up the IP address in the generated `.machines.yml`, using the floating IP if available.
You can access the Ganglia display on the monitoring node. The interface is kept local to the virtual cluster, so you need to log in with X forwarding enabled and install a browser. For example:

```
[badi@india]: ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no ubuntu@123.45.67.89 -X
[ubuntu@master2]: sudo apt-get -y install firefox
[ubuntu@master2]: firefox http://localhost/ganglia
```
Whenever a new release is made, you can get the changes by either cloning a fresh repository (as above), or pulling changes from the upstream master branch and updating the submodules:
```
$ git pull https://github.com/futuresystems/big-data-stack master
$ git submodule update
$ pip install -U -r requirements.txt
```
See the `examples` directory:

- `nist_fingerprint`: fingerprint analysis using Spark with results pushed to HBase
Please see the `LICENSE` file in the root directory of the repository.
- Fork the repository
- Add yourself to the `CONTRIBUTORS.yml` file
- Submit a pull request to the `unstable` branch