Open Data Platform

The Open Data Platform (ODP) is an open-source data management platform that can be rapidly deployed and tailored to accelerate Big Data and Cloud-scale solution delivery. This bootstrap repository provides an Ansible playbook that automates the deployment of a five-server Hadoop cluster on Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instances, managed by either HortonWorks (Ambari) or Cloudera (Cloudera Manager).


Requirements

  1. Ansible is installed locally or on a Linux VM (the Ansible host)
  2. The AWS IAM user has permissions to launch EC2 instances
  3. AWS security groups are set up to allow communication between the EC2 instances (see the example below)
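If the security groups are not yet configured, one option is a self-referencing rule that allows TCP traffic between all instances in the same group. The following AWS CLI sketch assumes the CLI is installed and configured; the group ID sg-0123456789abcdef0 is a placeholder for your own security group:
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 0-65535 --source-group sg-0123456789abcdef0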

Setup Instructions

Ansible Host Setup

  1. Set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables for the user running Ansible:
export AWS_ACCESS_KEY_ID=aws_access_key_id
export AWS_SECRET_ACCESS_KEY=aws_secret_access_key
  2. Save the AWS private key from the key pair as ~/.ssh/id_rsa for the user running Ansible (id_rsa is the filename of the key itself, not a directory, and has no extension)
  • Notice: it is extremely important that the AWS SSH key for Ansible is saved as ~/.ssh/id_rsa
  • It is best practice to set permissions of 0400 (read-only for the file owner) on id_rsa files
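For example, assuming the key was saved as ~/.ssh/id_rsa:
chmod 0400 ~/.ssh/id_rsa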
  3. Clone the Open Data Platform repository to the Ansible host and cd to the repository's base directory
  4. Use Ansible Vault to encrypt the properties file. The Ansible Vault password entered in this step will be needed to edit the properties file and to run the Ansible playbook.
ansible-vault encrypt group_vars/all
  5. Use Ansible Vault to edit group_vars/all and configure the AWS settings for aws_user, aws_access_mode, aws_unique_identifier, aws_image, aws_region, aws_subnet_id, aws_security_group, aws_keypair, aws_device_name, aws_instance_type, aws_management_server_volume_size, and aws_client_server_volume_size. Further details and example values for these properties can be found in the comments in the group_vars/all file; an illustrative sketch also follows this list.
ansible-vault edit group_vars/all
  • The AMI image ID for RHEL 7 can be found in the AWS console by clicking 'Launch Instance' and looking under the 'Quick Start' tab
  • The AMI image ID for CentOS 7 can be found on the CentOS wiki
  • We have verified that the following AMIs work:
    • RHEL-7.4_HVM_GA-20170808-x86_64-2-Hourly2-GP2 (ami-c998b6b2)
    • CentOS Linux 7 x86_64 HVM EBS 1602-b7ee8a69-ee97-4a49-9e68-afaee216db2e-ami-d7e1d2bd.3 (ami-6d1c2007)
  • Region, Subnet, Security Group and Key Pair should be available from your AWS administrator
  • We recommend allocating at least 50 GB of primary disk space
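As an illustration only, the AWS section of group_vars/all might end up looking something like the sketch below. Every value is a placeholder or an assumption except the AMI ID, which is the verified CentOS 7 AMI listed above; the authoritative descriptions and example values remain the comments in group_vars/all itself, and only a subset of the properties is shown:
aws_image: ami-6d1c2007                   # verified CentOS 7 AMI (see list above)
aws_region: us-east-1                     # placeholder; use your region
aws_subnet_id: subnet-0123456789abcdef0   # placeholder from your AWS administrator
aws_security_group: sg-0123456789abcdef0  # placeholder from your AWS administrator
aws_keypair: odp-keypair                  # placeholder; the key pair whose private key is ~/.ssh/id_rsa
aws_instance_type: m4.xlarge              # assumption; any instance type with adequate memory
aws_management_server_volume_size: 50     # at least 50 GB of primary disk is recommended
aws_client_server_volume_size: 50         # at least 50 GB of primary disk is recommended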

HortonWorks Deployment

Components Installed via Ambari

The HortonWorks deployment installs the following services in Ambari:

  • HDFS
  • YARN
  • MapReduce2
  • Tez
  • Hive
  • Oozie
  • Zookeeper
  • Kafka
  • Spark
  • Zeppelin Notebook

Topology of HortonWorks Deployment

The specific components of each HortonWorks service are installed using the following default topology:

  • Master Node (Ambari Server): HDFS NameNode, HDFS Client, HDFS DataNode, Kafka Broker, HBase RegionServer, HBase Client, Zeppelin Master, Oozie Server, Spark Client, Tez Client, History Server, Node Manager
  • Client Node 1: HDFS SecondaryNameNode, HDFS Client, HDFS DataNode, Kafka Broker, HBase Master, HBase RegionServer, HBase Client, ZooKeeper Server, Spark JobHistory Server, Hive WebHCat Server, Spark Client, Tez Client
  • Client Node 2: HDFS Client, HDFS DataNode, Kafka Broker, HBase RegionServer, HBase Client, MySQL Server, Hive MetaStore, Hive Server, Resource Manager, App Timeline Server, ZooKeeper Client, Spark Client, Tez Client, MapReduce2 Client
  • Client Nodes 3 & 4: HDFS Client, HDFS DataNode, Kafka Broker, HBase Client, ZooKeeper Client, Spark Client, Hive Client, Oozie Client, Tez Client, Yarn Client, MapReduce2 Client

Steps to Provision HortonWorks Stack

  1. Execute the following command from the repo base directory:
ansible-playbook --vault-id @prompt provision_hortonworks.yml

Note: The .retry files do not work; if the playbook fails, simply re-run it.

  2. When the playbook has completed execution, Ansible will print a message specifying the URL for accessing the Ambari console.
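The Ambari server typically listens on port 8080, so the printed URL will generally take the form http://<ambari-server-public-address>:8080 (the exact address is provided by the playbook output).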

Cloudera Deployment

Components Installed in Cloudera

The Cloudera deployment installs the following services in Cloudera Manager:

  • HBase
  • HDFS
  • Hive
  • Hue
  • Spark
  • Kafka
  • Oozie
  • YARN (MapReduce2 included)
  • Zookeeper

Topology of Cloudera Deployment

The specific components of each Cloudera service are installed using the following default topology:

  • Master Server (Cloudera Manager): Cloudera Management Service, HBase RegionServer, HDFS DataNode, Hive Metastore Server, HiveServer2, Hue Server, Oozie Server, YARN NodeManager, Spark JobHistory Server, Spark Server
  • Client 1: HBase Thrift Server, HBase RegionServer, HDFS DataNode, YARN NodeManager, YARN ResourceManager, ZooKeeper Server, Spark Server
  • Client 2: HBase REST Server, HBase RegionServer, HDFS DataNode, Hue Load Balancer, YARN ResourceManager, ZooKeeper Server, Spark Server
  • Client 3: HBase Master, HBase RegionServer, HDFS SecondaryNameNode, HDFS DataNode, YARN JobHistory Server, YARN NodeManager, ZooKeeper Server, Spark Server
  • Client 4: HBase RegionServer, HDFS NameNode, HDFS DataNode, Hive WebHCat Server, Kafka Broker, YARN NodeManager, Spark Server

Steps to Provision Cloudera Stack

  1. Edit the group_vars/all file and create PostgreSQL database passwords by setting the values for cloudera_db_password, hive_metastore_db_password, hue_db_password, and oozie_db_password.
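For illustration only, the relevant lines in group_vars/all might look like the following after editing with ansible-vault edit group_vars/all; the values are placeholders, so substitute your own passwords:
cloudera_db_password: ChangeMe1
hive_metastore_db_password: ChangeMe2
hue_db_password: ChangeMe3
oozie_db_password: ChangeMe4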
  2. Execute the following command from the repo base directory:
ansible-playbook --vault-id @prompt provision_cloudera.yml

Note: The .retry files do not work. If the playbook failed while provisioning the AWS instances, you can simply re-run it. If it failed during the Python-based Cloudera setup, you will need to delete your EC2 instances and try again.

  3. When the playbook has completed execution, Ansible will print a message specifying the URL for accessing the Cloudera Manager console.
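Cloudera Manager typically listens on port 7180, so the printed URL will generally take the form http://<cloudera-manager-public-address>:7180.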

Deploy Using a Local Repository

In certain cases, such as AWS environments with very limited bandwidth, it may be necessary to set up a local mirror of the HortonWorks or Cloudera repositories. To do this, first create a new EC2 instance to host the repositories, then use the reposync utility to mirror the necessary repositories, and finally update the group_vars/all properties to point to the local repositories.

Configure Repository Server

  1. Provision a new EC2 instance using either the Red Hat or CentOS AMI. Ensure the instance is of size t2.medium or larger.
  2. SSH into the instance and become the root user.
  3. Execute the following commands to disable SELinux, install the Apache httpd web server, wget, and the createrepo utility, start httpd, and create a directory to hold the repositories:
setenforce 0
yum -y install httpd wget createrepo
systemctl start httpd
systemctl enable httpd
mkdir /var/www/html/repos
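With httpd running, files under /var/www/html are served over HTTP, so the mirrors created below will be reachable at http://<repo_server_private_ip>/repos/. As a quick, optional sanity check that httpd is answering (any HTTP response indicates the server is up):
curl -I http://localhost/repos/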

Clone the HortonWorks Repositories

Execute the following commands to create a full mirror of the Ambari, HDP, and HDP-Utils repositories:

cd /etc/yum.repos.d
wget http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.5.2.0/ambari.repo http://public-repo-1.hortonworks.com/HDP/centos7/2.x/updates/2.6.2.14/hdp.repo
cd /var/www/html/repos
nohup reposync -r ambari-2.5.2.0 -r HDP-2.6.2.14 -r HDP-UTILS-1.1.0.21 &
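# reposync runs in the background; wait for it to finish (check nohup.out or the jobs command) before running createrepo below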
createrepo ambari-2.5.2.0
createrepo HDP-2.6.2.14
createrepo HDP-UTILS-1.1.0.21

Clone the Cloudera Repositories

Execute the following commands to create a full mirror of the Cloudera Manager repository and to create a partial mirror of the parcels repository that only pulls the necessary artifacts:

cd /etc/yum.repos.d
wget https://archive.cloudera.com/cm5/redhat/7/x86_64/cm/cloudera-manager.repo
cd /var/www/html/repos
nohup reposync -r cloudera-manager &
mkdir cloudera-parcels
cd cloudera-parcels
nohup wget http://archive.cloudera.com/cdh5/parcels/5.13.0.29/CDH-5.13.0-1.cdh5.13.0.p0.29-el7.parcel http://archive.cloudera.com/cdh5/parcels/5.13.0.29/manifest.json &
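# reposync and wget run in the background; wait for both downloads to finish before running createrepo below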
cd /var/www/html/repos
createrepo cloudera-manager
createrepo cloudera-parcels

Update Ansible Properties to Use Local Repositories

Edit the group_vars/all file and update the following properties, inserting the private IP address of the EC2 instance being used to host the repositories in place of repo_server_private_ip:

  • For HortonWorks:
ambari_repo_7: http://repo_server_private_ip/repos/ambari-2.5.2.0
hdp_repo_7: http://repo_server_private_ip/repos/HDP-2.6.2.14
hdp_utils_repo_7: http://repo_server_private_ip/repos/HDP-UTILS-1.1.0.21
  • For Cloudera:
cloudera_manager_repo: http://repo_server_private_ip/repos/cloudera-manager
cloudera_parcel_repo: http://repo_server_private_ip/repos/cloudera-parcels
