IMPC ETL Process

IMPC Extraction, Transformation and Loading process that generates the data supporting mousephenotype.org, along with other internal processes.

Requirements

How to run it

Download the latest release package from the releases page and decompress it. Then submit your job to your Spark 2 cluster using:

spark-submit --py-files impc_etl.zip,libs.zip main.py
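
If the job is launched from a script rather than typed by hand, the same command can be assembled programmatically. The sketch below is only an illustration: `build_submit_command` is a hypothetical helper, not part of this repo, and it assumes that `impc_etl.zip`, `libs.zip` and `main.py` sit in the current directory after decompressing the release.

```python
from typing import List, Sequence


def build_submit_command(py_files: Sequence[str], entry_point: str) -> List[str]:
    """Build the spark-submit argument list (hypothetical helper).

    spark-submit expects --py-files as one comma-separated value,
    followed by the Python entry point.
    """
    return ["spark-submit", "--py-files", ",".join(py_files), entry_point]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to try.
    command = build_submit_command(["impc_etl.zip", "libs.zip"], "main.py")
    print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where `spark-submit` is on the `PATH`.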

Development environment setup

  1. Install Spark 2+ and remember to set the SPARK_HOME environment variable.

  2. Fork this repo and then clone your forked version:

    git clone https://github.com/USERNAME/impc-etl.git
    cd impc-etl
  3. Run make to create a venv in the ./.venv path and install the development dependencies in it:

    make devEnv
  4. Use your favorite IDE to make your awesome changes, and make sure the project points to the generated venv. To do that in PyCharm, follow the instructions here.

  5. Then update and run the unit tests:

    make test
  6. Run pylint to check that the code follows best practices:

    make lint
  7. Finally, commit and push your changes to your fork, then make a pull request to the original repo when you are ready to go. Another member of the team will review your changes, and once they have two +1s they can be merged into the base repo.

    In order to sync your forked local version with the base repo you need to add an upstream remote:

    git remote add upstream https://github.com/mpi2/impc-etl.git

    Please keep your version in sync with the base repo to avoid merge hell.

    git fetch upstream
    git checkout master
    git merge upstream/master
    git push origin master
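
Step 1 above relies on SPARK_HOME being set correctly. A quick way to check it before running the tests is sketched below. `find_spark_submit` is a hypothetical helper, not part of this repo, and it assumes a standard Spark layout on a Unix-like system, where the launcher lives at `bin/spark-submit` under SPARK_HOME.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_spark_submit(env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return the spark-submit path under SPARK_HOME, or None if it is missing.

    Hypothetical helper: assumes the launcher is at $SPARK_HOME/bin/spark-submit.
    """
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        # Variable is unset or empty.
        return None
    candidate = Path(spark_home) / "bin" / "spark-submit"
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    found = find_spark_submit()
    if found is None:
        print("SPARK_HOME is unset or does not contain bin/spark-submit")
    else:
        print(f"Found spark-submit at {found}")
```

Passing the environment as an argument keeps the check easy to exercise with a plain dictionary instead of the real environment.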

Re-generate the documentation

pdoc --html --force --template-dir docs/templates -o docs impc_etl