This Python client runs a MapReduce on Google App Engine that fetches a set of data from the Google Genomics API and calculates read coverage.
To run with App Engine, you'll need to set up a Google App Engine environment.
You will also need to follow the setup instructions to create a project in the Google Cloud Console that has access to the Genomics API.
Once you have that project, you will need to create an OAuth Service Account and convert the .p12 file that gets downloaded into a .pem file:
cat privatekey.p12 | openssl pkcs12 -nodes -nocerts \
-passin pass:notasecret | openssl rsa > private-key.pem
Then using the email address associated with the Service Account you created, run the dev app server as follows:
dev_appserver.py --appidentity_private_key_path private-key.pem \
--appidentity_email_address XXX@developer.gserviceaccount.com
Without this setup, the Genomics API will return Invalid Credentials.
Visit http://localhost:8080 to use a simple UI for launching the MapReduce.
To deploy on App Engine, you'll need to make sure the API_KEY variable in the app.yaml file is properly set.
The App Engine project that you will be deploying to needs to have access to the Genomics API. Once that's set up, go to the APIs & Auth -> Credentials page in the Google Cloud Console.
In the bottom section called "Public API access", there should be an API key that you can copy into the app.yaml file. Create a new key if there isn't one present already.
For easy development, make sure your API key and the privatekey.p12 file come from the same project. Mismatches will cause exceptions.
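A missing or empty API key is easier to diagnose when the app fails fast at startup. The sketch below assumes the API_KEY value from app.yaml reaches the application as an environment variable (for example through an env_variables section); that wiring is not described here, and require_api_key is a hypothetical helper, not part of this sample.

```python
import os


def require_api_key(environ=None):
    """Return the configured API key, or raise if it is missing or blank.

    Assumption: the key is exposed under the name API_KEY in the process
    environment. Pass a dict as `environ` to check other sources.
    """
    environ = os.environ if environ is None else environ
    key = (environ.get("API_KEY") or "").strip()
    if not key:
        raise ValueError(
            "API_KEY is not set; copy a key from the Cloud Console into app.yaml")
    return key
```

Calling this once before the first API request gives a clear error message instead of a later credentials failure. It cannot detect a key that belongs to a different project than the private key file.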
- cloudstorage, httplib2, mapreduce, oauth2client: these are all external libraries this sample depends on.
- genomicsapi.py: handles queries to the Genomics API. Includes helper methods to process API data (like compute_coverage).
- input_reader.py: implements the App Engine MapReduce input reader class. Determines the shard split for a given query and uses genomicsapi to retrieve data for the pipeline.
- pipeline.py: includes several MapReduce pipeline configurations. The one used by main.py, PipelineGenerateCoverage, computes read coverage for a genomic region, and the output ends up in Google Cloud Storage (as many files).
- main.py: provides a web interface (using templates/index.html) for starting up the MapReduce pipeline.
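To make the shard split and the coverage calculation concrete, here is a minimal standalone sketch. It is not the code in input_reader.py or genomicsapi.py: split_range and compute_coverage_sketch are hypothetical names, and the sketch assumes each read can be reduced to a 0-based (position, length) pair, ignoring details such as gaps in the alignment.

```python
from collections import Counter


def split_range(start, end, shard_count):
    """Split the half-open range [start, end) into contiguous shards.

    Returns at most shard_count (shard_start, shard_end) pairs.
    """
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    total = end - start
    if total <= 0:
        return []
    # Ceiling division so that every base falls into exactly one shard.
    size = -(-total // shard_count)
    return [(s, min(s + size, end)) for s in range(start, end, size)]


def compute_coverage_sketch(reads, start, end):
    """Count how many reads overlap each position in [start, end).

    `reads` is an iterable of (position, length) pairs. Positions with no
    overlapping read are omitted from the result.
    """
    coverage = Counter()
    for position, length in reads:
        # Clip the read to the shard so shards never double-count a base.
        lo = max(position, start)
        hi = min(position + length, end)
        for base in range(lo, hi):
            coverage[base] += 1
    return dict(coverage)
```

With this shape, each shard can be processed independently: a mapper would call compute_coverage_sketch on the reads overlapping its own (shard_start, shard_end) range and write the resulting counts out, which is one way a pipeline could end up with many output files.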
- Provide a real-world MapReduce example that uses the Genomics API (in Python).
- Prove that a MapReduce is both feasible and a good idea for Genomics data. The resulting analysis should be useful.
This code is nearing the end of active development. It is still receiving performance improvements and other small tweaks.
It also might be more useful to switch the MapReduce to implement some sort of Variant analysis rather than outputting Read coverage data. The former should provide a better example of Reducing (which is fairly useless in coverage), and should have a smaller final result (which makes it easier to consume).