
eggo

Provides Parquet-formatted public 'omics datasets in S3 for easy use with ADAM and the Hadoop stack (including Spark and Impala). Instead of acquiring and converting the data sets yourself, simply get them from the eggo S3 bucket:

s3://bdg-eggo
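
For example, you can list the bucket's contents with boto (a quick sketch, assuming your AWS credentials are set in the environment as described below):

import boto

# Connect using AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY from the environment
conn = boto.connect_s3()
bucket = conn.get_bucket('bdg-eggo')

# Print every object key (i.e., every dataset file) in the eggo bucket
for key in bucket.list():
    print(key.name)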

Eggo also provides a command-line interface for easily provisioning Hadoop clusters in the cloud (built with Fabric), as well as the code needed to convert the data sets from their legacy formats into Hadoop-friendly versions (built with Luigi).

User interface

Not implemented yet.

Getting started

git clone https://github.com/bigdatagenomics/eggo.git
cd eggo
python setup.py install

TODO: pip-installable scripts for listing datasets, etc.

You will also need to install Fabric (e.g., pip install fabric).

Developer/maintainer interface

The eggo machinery uses Fabric and Luigi for its operation.

registry/

The registry/ directory contains the metadata for the data sets we ingest and convert to ADAM/Parquet. Each data set is stored as a JSON file loosely based on the Data Protocols spec.
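
For example, a registry entry can be inspected as ordinary JSON (a minimal sketch; see the files in registry/ for the exact fields, which loosely follow the Data Protocols layout):

import json

# Load the 1000 Genomes registry entry and show its top-level structure
with open('registry/1kg.json') as ip:
    datapackage = json.load(ip)
print(sorted(datapackage.keys()))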

Environment

If using AWS, ensure the following variables are set locally:

export SPARK_HOME= # local path to Spark
export EC2_KEY_PAIR= # EC2 name of the registered key pair
export EC2_PRIVATE_KEY_FILE= # local path to associated private key (.pem file)
export AWS_ACCESS_KEY_ID= # AWS credentials
export AWS_SECRET_ACCESS_KEY= # AWS credentials

These variables must be set remotely, which can be done by sourcing eggo-ec2-variables.sh:

export AWS_ACCESS_KEY_ID= # AWS credentials
export AWS_SECRET_ACCESS_KEY= # AWS credentials
export SPARK_HOME= # remote path to Spark
export ADAM_HOME= # remote path to ADAM
export SPARK_MASTER= # Spark master host name

Setting up a cluster

Set EGGO_EXP=TRUE to have the setup commands use the experiment branch of eggo.

cd path/to/eggo

# provision a cluster on EC2 with 5 slave (worker) nodes
fab provision:5,r3.2xlarge

# configure proper environment on the instances
fab setup_master
fab setup_slaves

# (Cloudera infra-only)
./tag-my-instances.py

# get an interactive shell on the master node
fab login

# destroy the cluster
fab teardown
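
Each of these commands is a Fabric task. As a rough sketch of the pattern (the task body below is hypothetical, not eggo's actual implementation):

from fabric.api import env, run, task

# Hypothetical task body for illustration only; see eggo's fabfile for
# the real setup_master logic
@task
def setup_master():
    env.user = 'ec2-user'  # assumed login user; depends on the AMI
    run('git clone https://github.com/bigdatagenomics/eggo.git')
    run('cd eggo && python setup.py install')

Running fab setup_master then executes each run() call over SSH on the target host.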

There is experimental support for using Cloudera Director to provision a cluster. This is useful for running a cluster with more services, including YARN, the Hive metastore, and Impala; however, it takes longer (>30 min) to bring up a cluster than the Spark EC2 scripts do.

# provision a cluster on EC2 with 5 worker nodes
fab provision_director

# run a proxy to access Cloudera Manager via http://localhost:7180
# type 'exit' to quit process
fab cm_web_proxy

# log in to the gateway node
fab login_director

# destroy the cluster
fab teardown_director

Converting data sets

The toast command will build the Luigi DAG for downloading the necessary data to S3 and running the ADAM command to transform it to Parquet.

# toast the 1000 Genomes data set
fab toast:registry/1kg.json
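
In Luigi, a DAG like this is expressed as tasks whose requires() methods point at their upstream dependencies; Luigi then only runs the pieces whose outputs are missing. A minimal sketch of the pattern (class and parameter names are illustrative, not eggo's actual tasks):

import luigi

class DownloadTask(luigi.Task):
    # Illustrative download-then-convert chain; eggo's real task classes differ
    config = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget('/tmp/raw-data')

    def run(self):
        pass  # fetch the source files named in the registry JSON

class ConvertTask(luigi.Task):
    config = luigi.Parameter()

    def requires(self):
        return DownloadTask(config=self.config)

    def run(self):
        pass  # shell out to ADAM to transform the raw data to Parquet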

Configuration

Environment variables that should be set

# launch a cluster named "eggo" directly with the Spark EC2 scripts
ec2/spark-ec2 -k laserson-cloudera -i ~/.ssh/laserson-cloudera.pem -s 3 -t m3.large -z us-east-1a --delete-groups --copy-aws-credentials launch eggo

# log in to the cluster
ec2/spark-ec2 -k laserson-cloudera -i ~/.ssh/laserson-cloudera.pem login eggo

# destroy the cluster
ec2/spark-ec2 -k laserson-cloudera -i ~/.ssh/laserson-cloudera.pem destroy eggo

# get the instance's public hostname from the EC2 metadata service
curl http://169.254.169.254/latest/meta-data/public-hostname

from fabric.api import require

def verify_env():
    # ensure the required environment variables are set before running tasks
    require('SPARK_HOME')
    require('EC2_KEY_PAIR')
    require('EC2_PRIVATE_KEY_FILE')
    require('AWS_ACCESS_KEY_ID')
    require('AWS_SECRET_ACCESS_KEY')

TODO: have two CLI commands: eggo for users and toaster for maintainers.

Testing

You can run Eggo from a local machine, which is helpful while developing Eggo itself.

Ensure that Hadoop, Spark, and ADAM are all installed.

Set up the environment with:

export AWS_DEFAULT_REGION=us-east-1
export EPHEMERAL_MOUNT=/tmp
export ADAM_HOME=~/workspace/adam
export HADOOP_HOME=~/sw/hadoop-2.5.1/
export SPARK_HOME=~/sw/spark-1.3.0-bin-hadoop2.4/
export SPARK_MASTER_URI=local
export STREAMING_JAR=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar
export PATH=$PATH:$HADOOP_HOME/bin

By default, datasets will be stored on S3, and you will need to set fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey in Hadoop's core-site.xml file.
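
For example, using the standard Hadoop property syntax (substitute your own credentials):

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>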

To store datasets locally, set the EGGO_BASE_URL environment variable to a Hadoop path:

export EGGO_BASE_URL=file:///tmp/bdg-eggo

Generate a test dataset with

bin/toaster.py --local-scheduler VCF2ADAMTask --ToastConfig-config test/registry/test-genotypes.json

or

bin/toaster.py --local-scheduler BAM2ADAMTask --ToastConfig-config test/registry/test-alignments.json

You can delete the test datasets with

bin/toaster.py --local-scheduler DeleteDatasetTask --ToastConfig-config test/registry/test-genotypes.json
bin/toaster.py --local-scheduler DeleteDatasetTask --ToastConfig-config test/registry/test-alignments.json

NEW config-file-based organization

Concepts:

  • dfs: the target "distributed" filesystem that will contain the final ETL'd data

  • workers: the machines on which ETL is executed

  • worker_env: the environment assumed to be available on the worker machines, including env variables and paths for writing data

  • client: the local machine from which we issue the CLI commands

  • client_env: the environment assumed on the local machine

The only environment variable that MUST be set on the local client machine is EGGO_CONFIG. This config file will be deployed across all relevant worker machines.

Other local client env vars that will be respected include: SPARK_HOME, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, EC2_KEY_PAIR, EC2_PRIVATE_KEY_FILE. Everything else is derived from the EGGO_CONFIG file.
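
Code on either the client or the workers can then resolve settings from that single file. A sketch that assumes an INI-style config (eggo's actual format, sections, and option names may differ):

import os
from ConfigParser import ConfigParser  # Python 2 stdlib module

# Hypothetical: read the config file named by EGGO_CONFIG; the section
# and option names below are illustrative only
config = ConfigParser()
config.read(os.environ['EGGO_CONFIG'])
dfs_url = config.get('dfs', 'base_url')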

One of the workers is designated the master, which is where the computations are executed. This node needs additional configuration.

eggo provision
eggo deploy_config
eggo setup_master
eggo setup_slaves
eggo delete_all:config=$EGGO_HOME/test/registry/test-genotypes.json
eggo toast:config=$EGGO_HOME/test/registry/test-genotypes.json
eggo teardown
