A flexible toolkit for doing and sharing reproducible data science.
EasyData started life as an experimental fork of cookiecutter-data-science where we could try out ideas before proposing them as fixes to the upstream branch. It has grown into its own toolkit for implementing a reproducible data science workflow, and is the basis of our Bus Number tutorial on Reproducible Data Science.
For a tutorial on making use of this framework, visit: https://github.com/hackalog/bus_number/
- anaconda (or miniconda)
- python3.6+. Technically, we still prompt for a choice between python and python3, but we aim to deprecate this and move all Python version control to conda.
- Cookiecutter Python package >= 1.4.0. This can be installed with pip or conda, depending on how you manage your Python packages:
$ pip install cookiecutter
or
$ conda config --add channels conda-forge
$ conda install cookiecutter
To start a new project, run:
$ cookiecutter https://github.com/hackalog/cookiecutter-easydata
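The same template can also be rendered from Python through cookiecutter's own API, which may be convenient in scripts. The following is a minimal sketch, not part of EasyData itself: `create_project` is a hypothetical helper name, and the call assumes that cookiecutter is installed and that the template URL is reachable.

```python
# URL of the EasyData template, as given in this README.
TEMPLATE_URL = "https://github.com/hackalog/cookiecutter-easydata"


def create_project(output_dir=".", no_input=False):
    """Generate a new project from the EasyData template.

    With no_input=True, cookiecutter skips its interactive prompts and
    uses the template's default answers.
    """
    # Imported lazily so this module loads even without cookiecutter installed.
    from cookiecutter.main import cookiecutter

    return cookiecutter(TEMPLATE_URL, output_dir=output_dir, no_input=no_input)
```

Calling `create_project()` with no arguments should behave like the command-line invocation, prompting for each template option in turn.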
The directory structure of your new project looks like this:
LICENSE
Makefile
- Top-level makefile. Type make for a list of valid commands
README.md
- this file
catalog
- Data catalog. This is where data sources and data transformations are saved
data
- Data directory, often symlinked to a filesystem with lots of space
data/raw
- Raw (immutable) hash-verified downloads
data/interim
- Extracted and interim data representations
data/processed
- The final, canonical data sets for modeling.
docs
- A default Sphinx project; see sphinx-doc.org for details
models
- Trained and serialized models, model predictions, or model summaries
models/trained
- Trained models
models/output
- Predictions and transformations from the trained models
notebooks
- Jupyter notebooks. Naming convention is a number (for ordering), the creator's initials, and a short hyphen-delimited description, e.g. 1.0-jqp-initial-data-exploration.
references
- Data dictionaries, manuals, and all other explanatory materials.
reports
- Generated analysis as HTML, PDF, LaTeX, etc.
reports/figures
- Generated graphics and figures to be used in reporting
reports/tables
- Generated data tables to be used in reporting
reports/summary
- Generated summary information to be used in reporting
requirements.txt
- (if using pip+virtualenv) The requirements file for reproducing the analysis environment, e.g. generated with pip freeze > requirements.txt
environment.yml
- (if using conda) The YAML file for reproducing the analysis environment
setup.py
- Turns contents of MODULE_NAME into a pip-installable python module (pip install -e .) so it can be imported in python code
MODULE_NAME
- Source code for use in this project.
MODULE_NAME/__init__.py
- Makes MODULE_NAME a Python module
MODULE_NAME/data
- Scripts to fetch or generate data. In particular:
MODULE_NAME/data/make_dataset.py
- Run with python -m MODULE_NAME.data.make_dataset fetch or python -m MODULE_NAME.data.make_dataset process
MODULE_NAME/analysis
- Scripts to turn datasets into output products
MODULE_NAME/models
- Scripts to train models and then use trained models to make predictions, e.g. predict_model.py, train_model.py
tox.ini
- tox file with settings for running tox; see tox.testrun.org
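The notebook naming convention described above (a number for ordering, the creator's initials, and a short hyphen-delimited description) can be checked mechanically. The sketch below is one possible interpretation, not something shipped with the template: the regular expression, the optional .ipynb suffix, and the `parse_notebook_name` helper are all assumptions made for illustration.

```python
import re

# Assumed pattern: "<number>-<initials>-<hyphen-delimited-description>",
# optionally followed by ".ipynb". The exact rules are an assumption.
NOTEBOOK_NAME = re.compile(
    r"^(?P<order>\d+(?:\.\d+)*)"
    r"-(?P<initials>[a-z]+)"
    r"-(?P<description>[a-z0-9]+(?:-[a-z0-9]+)*)"
    r"(?:\.ipynb)?$"
)


def parse_notebook_name(name):
    """Return the parts of a notebook name as a dict, or None if it does not match."""
    match = NOTEBOOK_NAME.match(name)
    return match.groupdict() if match else None


parts = parse_notebook_name("1.0-jqp-initial-data-exploration")
print(parts)
```

A name such as Untitled.ipynb returns None under this pattern, which makes the helper usable as a simple check before committing notebooks.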
- Remove python2 support (python2 can be supported via a conda environment if absolutely needed)
The first time:
make create_environment
Subsequent updates:
make requirements