
Big Data Analytics Stack

Note: for the latest version, refer to futuresystems/big-data-stack

Provides a set of Ansible playbooks to deploy a Big Data analytics stack on top of Hadoop/YARN.

The play-hadoop.yml playbook deploys the base system. Addons, such as Pig, Spark, etc., are deployed using the playbooks in the addons directory. A playbook for deploying all the addons is given in play-alladdons.yml.

Stack

Legend:

  • available
  • italic bold: work-in-progress
  • planned

Analytics Layer

Data Processing Layer

Database Layer

Scheduling

Storage

Monitoring

Usage

  1. Make sure to start an ssh-agent so you don't need to retype your passphrase multiple times. We've also noticed that if you are running on india, Ansible may be unable to access the nodes and complain with something like:

    master0 | UNREACHABLE! => {
        "changed": false,
        "msg": "ssh cc@129.114.110.126:22 : Private key file is encrypted\nTo connect as a different user, use -u <username>.",
        "unreachable": true
    }
    

    To start the agent:

    badi@i136 ~$ eval $(ssh-agent)
    badi@i136 ~$ ssh-add
    
  2. Make sure your public key is added to github.com

  3. Download this repository using git clone --recursive. IMPORTANT: make sure you specify the --recursive option, otherwise you will get errors.
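
    For example, using the upstream URL (the same URL appears under Upgrading below; substitute your fork's URL if needed):

    badi@i136 ~$ git clone --recursive https://github.com/futuresystems/big-data-stack.git
    badi@i136 ~$ cd big-data-stack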

  4. Install the requirements using pip install -r requirements.txt

  5. Edit .cluster.py to define the machines in the cluster.

  6. Launch the cluster using vcl boot -p openstack -P $USER-. This will start the machines on whatever OpenStack environment is currently available (via $OS_PROJECT_NAME, $OS_AUTH_URL, etc.), prefixing $USER- to the name of each VM (e.g. zk0 becomes badi-zk0).
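
    For example (a sketch assuming your OpenStack credentials are loaded from an openrc file; the filename is site-specific):

    badi@i136 ~$ source openrc.sh    # sets $OS_AUTH_URL, $OS_PROJECT_NAME, etc.
    badi@i136 ~$ env | grep ^OS_     # sanity-check the credentials are set
    badi@i136 ~$ vcl boot -p openstack -P $USER-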

  7. Make sure that ansible.cfg reflects your environment. Look especially at remote_user if you are not using Ubuntu.
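
    The relevant setting lives under the [defaults] section; a minimal sketch (the repository's ansible.cfg may set more options):

    [defaults]
    remote_user = ubuntu    # change this if your images use a different login user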

  8. Ensure ssh_bastion_config is to your liking (it assumes you are using the openstack cluster on FutureSystems).
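
    For reference, a bastion setup of this kind typically looks like the following sketch (the host pattern and bastion address here are hypothetical; adapt them to your site):

    Host 10.0.0.*
        User ubuntu
        ProxyCommand ssh -W %h:%p badi@india.futuresystems.org
        UserKnownHostsFile /dev/null
        StrictHostKeyChecking no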

  9. Run ansible all -m ping to make sure all nodes can be managed.
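
    A healthy node answers along these lines:

    master0 | SUCCESS => {
        "changed": false,
        "ping": "pong"
    }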

  10. Define zookeeper_id for each zookeeper node by adapting the following (NO LONGER NEEDED as of v0.2.4):

    mkdir host_vars
    for i in 0 1 2; do
      echo "zookeeper_id: $(( i+1 ))" > host_vars/master$i
    done
    
  11. Run ansible-playbook play-hadoop.yml to install the base system.

  12. Run ansible-playbook addons/{pig,spark}.yml # etc to install the Pig and Spark addons.

  13. Log into the frontend node (see the [frontends] group in the inventory) and use the hadoop user (sudo su - hadoop) to run jobs on the cluster.
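
    For example, once on the frontend (the host name frontend0 in the prompts below is illustrative):

    ubuntu@frontend0 ~$ sudo su - hadoop
    hadoop@frontend0 ~$ yarn node -list    # confirm the NodeManagers have registered
    hadoop@frontend0 ~$ hdfs dfs -ls /     # confirm HDFS is reachable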

Sidenote: you may want to pass the -f <N> flag to ansible-playbook to use N parallel connections. This will make the deployment go faster. For example:

$ ansible-playbook -f $(egrep '^[a-zA-Z]' inventory.txt | sort | uniq | wc -l) # etc ...

The hadoop user is present on all the nodes and is the hadoop administrator. If you need to change anything on HDFS, it must be done as hadoop.
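
For instance, to create an HDFS home directory for a regular user (the user name alice is hypothetical):

$ sudo su - hadoop
hadoop@frontend0 ~$ hdfs dfs -mkdir -p /user/alice
hadoop@frontend0 ~$ hdfs dfs -chown alice /user/alice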

Access

vcl ssh can be used as shorthand to access the nodes. It looks up the IP address in the generated .machines.yml, using the floating IP if available.
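
For example (assuming a node named zk0 as in the boot step above; exact names come from your .cluster.py):

$ vcl ssh zk0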

Monitoring

You can access the Ganglia display on the monitoring node. The interface is kept local to the virtual cluster, so you need to log in with X forwarding enabled and install a browser. For example:

[badi@india]: ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no ubuntu@123.45.67.89 -X
[ubuntu@master2]: sudo apt-get -y install firefox
[ubuntu@master2]: firefox http://localhost/ganglia

Upgrading

Whenever a new release is made, you can get the changes by either cloning a fresh repository (as above), or pulling changes from the upstream master branch and updating the submodules:

$ git pull https://github.com/futuresystems/big-data-stack master
$ git submodule update
$ pip install -U -r requirements.txt

Examples

See the examples directory:

  • nist_fingerprint: fingerprint analysis using Spark, with results pushed to HBase

License

Please see the LICENSE file in the root directory of the repository.

Contributing

  1. Fork the repository
  2. Add yourself to the CONTRIBUTORS.yml file
  3. Submit a pull request to the unstable branch
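
A typical flow looks something like this sketch (branch and fork names are illustrative; origin is assumed to be your fork):

$ git clone --recursive git@github.com:<your-username>/big-data-stack.git
$ cd big-data-stack
$ git checkout -b my-change origin/unstable
# ... edit, commit, add yourself to CONTRIBUTORS.yml ...
$ git push origin my-change
# then open a pull request against the unstable branch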
