Eggo provides Parquet-formatted public 'omics datasets in S3 for easy use with ADAM and the Hadoop stack (including Spark and Impala). Instead of acquiring and converting the data sets yourself, simply get them from the eggo S3 bucket:
s3://bdg-eggo
Eggo also provides a command-line interface (built with Fabric) for provisioning Hadoop clusters in the cloud, along with the code (built with Luigi) that converts the data sets from the legacy formats into the Hadoop-friendly versions.
Not implemented yet.
git clone https://github.com/bigdatagenomics/eggo.git
cd eggo
python setup.py install
TODO: pip-installable scripts for listing datasets, etc.
You also need to install Fabric.
The eggo machinery uses Fabric and Luigi for its operation.
The registry/ directory contains the metadata for the data sets we ingest and convert to ADAM/Parquet. Each data set is stored as a JSON file loosely based on the Data Protocols spec.
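As an illustration of how a registry entry might be read, here is a minimal Python sketch. The field names (`name`, `resources`, `url`) follow the general shape of a Data Protocols data package and are assumptions; the actual files in registry/ may use different keys. The dataset name and URLs are placeholders.

```python
import json

# Hypothetical registry entry; keys and values are illustrative only.
EXAMPLE_ENTRY = """
{
  "name": "example-dataset",
  "resources": [
    {"url": "http://example.com/data/chr1.vcf.gz"},
    {"url": "http://example.com/data/chr2.vcf.gz"}
  ]
}
"""


def resource_urls(entry_text):
    """Return the source URLs listed in a registry-style JSON document."""
    entry = json.loads(entry_text)
    # A missing "resources" key is treated as an empty data set.
    return [resource["url"] for resource in entry.get("resources", [])]
```

Calling `resource_urls(EXAMPLE_ENTRY)` returns the two placeholder URLs in the order they appear in the file.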
If using AWS, ensure the following variables are set locally:
export SPARK_HOME= # local path to Spark
export EC2_KEY_PAIR= # EC2 name of the registered key pair
export EC2_PRIVATE_KEY_FILE= # local path to associated private key (.pem file)
export AWS_ACCESS_KEY_ID= # AWS credentials
export AWS_SECRET_ACCESS_KEY= # AWS credentials
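Before running any fab command, it can help to confirm that these variables are actually set. The following is a minimal sketch in plain Python, independent of Fabric; the helper name `missing_vars` is hypothetical and not part of eggo.

```python
import os

# The local variables listed above.
REQUIRED_LOCAL_VARS = [
    "SPARK_HOME",
    "EC2_KEY_PAIR",
    "EC2_PRIVATE_KEY_FILE",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
]


def missing_vars(environ=None, required=REQUIRED_LOCAL_VARS):
    """Return the required variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]
```

An empty list from `missing_vars()` means every variable is present in the current shell environment; otherwise the returned names are the ones still to export.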
The following variables must be set on the remote machines, which can be done by sourcing eggo-ec2-variables.sh:
export AWS_ACCESS_KEY_ID= # AWS credentials
export AWS_SECRET_ACCESS_KEY= # AWS credentials
export SPARK_HOME= # remote path to Spark
export ADAM_HOME= # remote path to ADAM
export SPARK_MASTER= # Spark master host name
Set EGGO_EXP=TRUE to have the setup commands use the experiment branch of eggo.
cd path/to/eggo
# provision a cluster on EC2 with 5 slave (worker) nodes
fab provision:5,r3.2xlarge
# configure proper environment on the instances
fab setup_master
fab setup_slaves
# (Cloudera infra-only)
./tag-my-instances.py
# get an interactive shell on the master node
fab login
# destroy the cluster
fab teardown
There is experimental support for using Cloudera Director to provision a cluster. This is useful for running a cluster with more services, including YARN, the Hive metastore, and Impala; however, it takes longer (more than 30 minutes) to bring up a cluster than the Spark EC2 scripts do.
# provision a cluster on EC2 with 5 worker nodes
fab provision_director
# run a proxy to access Cloudera Manager via http://localhost:7180
# type 'exit' to quit process
fab cm_web_proxy
# log in to the gateway node
fab login_director
# destroy the cluster
fab teardown_director
The toast command will build the Luigi DAG for downloading the necessary data to S3 and running the ADAM command to transform it to Parquet.
# toast the 1000 Genomes data set
fab toast:registry/1kg.json
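To make the idea of a task DAG concrete, here is a toy model in plain Python. It does not use Luigi's API; the task names (`download`, `convert`) and the `requires` mapping are hypothetical stand-ins for the download and ADAM conversion steps described above. It only shows how declared dependencies determine execution order.

```python
def execution_order(requires):
    """Return task names ordered so every task follows its dependencies.

    `requires` maps a task name to the list of tasks it depends on.
    """
    order = []
    visiting = set()

    def visit(task):
        if task in order:
            return  # already scheduled
        if task in visiting:
            raise ValueError("dependency cycle at " + task)
        visiting.add(task)
        for dependency in requires.get(task, []):
            visit(dependency)
        visiting.discard(task)
        order.append(task)

    for task in sorted(requires):
        visit(task)
    return order


# Hypothetical two-step pipeline: conversion needs the downloaded data.
PIPELINE = {
    "download": [],
    "convert": ["download"],
}
```

With this mapping, `execution_order(PIPELINE)` schedules `download` before `convert`, which mirrors the ordering a scheduler such as Luigi derives from each task's declared requirements.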
Environment variables that should be set
ec2/spark-ec2 -k laserson-cloudera -i ~/.ssh/laserson-cloudera.pem -s 3 -t m3.large -z us-east-1a --delete-groups --copy-aws-credentials launch eggo
ec2/spark-ec2 -k laserson-cloudera -i ~/.ssh/laserson-cloudera.pem login eggo
ec2/spark-ec2 -k laserson-cloudera -i ~/.ssh/laserson-cloudera.pem destroy eggo
curl http://169.254.169.254/latest/meta-data/public-hostname
def verify_env():
    require('SPARK_HOME')
    require('EC2_KEY_PAIR')
    require('EC2_PRIVATE_KEY_FILE')
    require('AWS_ACCESS_KEY_ID')
    require('AWS_SECRET_ACCESS_KEY')
TODO: have two CLI commands: eggo for users and toaster for maintainers.
You can run Eggo from a local machine, which is helpful while developing Eggo itself.
Ensure that Hadoop, Spark, and ADAM are all installed.
Set up the environment with:
export AWS_DEFAULT_REGION=us-east-1
export EPHEMERAL_MOUNT=/tmp
export ADAM_HOME=~/workspace/adam
export HADOOP_HOME=~/sw/hadoop-2.5.1/
export SPARK_HOME=~/sw/spark-1.3.0-bin-hadoop2.4/
export SPARK_MASTER_URI=local
export STREAMING_JAR=$HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar
export PATH=$PATH:$HADOOP_HOME/bin
By default, datasets are stored on S3, and you need to set fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey in Hadoop's core-site.xml file.
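For reference, the two properties take the usual Hadoop `<property>` form inside `<configuration>`. The sketch below generates that fragment with Python's standard library; the function name `s3n_core_site` is hypothetical, and the credential values you pass in are your own.

```python
import xml.etree.ElementTree as ET


def s3n_core_site(access_key_id, secret_access_key):
    """Build a core-site.xml document holding the two s3n credential properties."""
    configuration = ET.Element("configuration")
    settings = [
        ("fs.s3n.awsAccessKeyId", access_key_id),
        ("fs.s3n.awsSecretAccessKey", secret_access_key),
    ]
    for name, value in settings:
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    # encoding="unicode" returns a str rather than bytes.
    return ET.tostring(configuration, encoding="unicode")
```

If your core-site.xml already contains other settings, merge the two generated `<property>` elements into it rather than overwriting the file.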
To store datasets locally, set the EGGO_BASE_URL environment variable to a Hadoop path:
export EGGO_BASE_URL=file:///tmp/bdg-eggo
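The override behaves like any environment-variable default. A minimal sketch of the lookup follows; the default value `s3n://bdg-eggo` is an assumption based on the bucket name and the s3n properties mentioned above, and the helper names are hypothetical rather than eggo's own.

```python
import os
from urllib.parse import urlparse

# Assumed default; the text only says datasets go to S3 by default.
DEFAULT_BASE_URL = "s3n://bdg-eggo"


def base_url(environ=None):
    """Return EGGO_BASE_URL if set and non-empty, else the assumed S3 default."""
    environ = os.environ if environ is None else environ
    return environ.get("EGGO_BASE_URL") or DEFAULT_BASE_URL


def is_local(url):
    """True when the URL points at the local filesystem (file:// scheme)."""
    return urlparse(url).scheme == "file"
```

With the export above in place, `is_local(base_url())` is true, so no S3 credentials are needed for the test datasets.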
Generate a test dataset with
bin/toaster.py --local-scheduler VCF2ADAMTask --ToastConfig-config test/registry/test-genotypes.json
or
bin/toaster.py --local-scheduler BAM2ADAMTask --ToastConfig-config test/registry/test-alignments.json
You can delete the test datasets with
bin/toaster.py --local-scheduler DeleteDatasetTask --ToastConfig-config test/registry/test-genotypes.json
bin/toaster.py --local-scheduler DeleteDatasetTask --ToastConfig-config test/registry/test-alignments.json
Concepts:
- dfs: the target "distributed" filesystem that will contain the final ETL'd data
- workers: the machines on which ETL is executed
- worker_env: an environment that we assume is available on the worker machines, including env variables and paths to write data
- client: the local machine from which we issue the CLI commands
- client_env: the environment assumed on the local machine
The only environment variable that MUST be set on the local client machine is EGGO_CONFIG. This config file will be deployed across all relevant worker machines.
Other local client env vars that will be respected include: SPARK_HOME, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, EC2_KEY_PAIR, EC2_PRIVATE_KEY_FILE. Everything else is derived from the EGGO_CONFIG file.
One of the workers is designated a master, which is where the computations are executed. This node needs additional configuration.
eggo provision
eggo deploy_config
eggo setup_master
eggo setup_slaves
eggo delete_all:config=$EGGO_HOME/test/registry/test-genotypes.json
eggo toast:config=$EGGO_HOME/test/registry/test-genotypes.json
eggo teardown