Note that the following procedure encodes the sequences into an array of shape (bs, 4, 1, len), where bs is the number of samples and len is the length of each DNA sequence. To work with datasets generated by this procedure, set "image_data_format" to "channels_first" in your ~/.keras/keras.json file.
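
If you prefer to apply this setting from a script, a minimal sketch follows. The key name and file path come from the note above; the helper name set_channels_first is illustrative, and the sketch assumes the file, if it exists, contains a single JSON object.

```python
import json
import os


def set_channels_first(config_path):
    """Set "image_data_format" to "channels_first" in a Keras JSON config file."""
    config = {}
    if os.path.exists(config_path):
        with open(config_path) as handle:
            config = json.load(handle)  # keep any other settings already present

    config["image_data_format"] = "channels_first"

    # Create the parent directory (e.g. ~/.keras) if it does not exist yet.
    parent = os.path.dirname(config_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(config_path, "w") as handle:
        json.dump(config, handle, indent=4)
    return config


# Example call (commented out so that nothing is modified on import):
# set_channels_first(os.path.expanduser("~/.keras/keras.json"))
```

Editing the file by hand works just as well; the sketch only preserves the other keys while changing this one.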

Prepare a sequence file in FASTA format and a target file for each of the training, validation, and test sets. Refer to the toy data we provide for examples.

Then run the following to embed each set into HDF5 format.

paste - - -d' ' < FASTA_FILE > tmp.tsv
python $REPO_HOME/embedH5.py tmp.tsv TARGET_FILE DATA_TOPDIR/FILE_NAME  -b BATCHSIZE

FASTA_FILE: sequence in FASTA format
TARGET_FILE: targets (labels or real values) corresponding to the sequences (in the same order)
DATA_TOPDIR: the absolute path of the output directory
FILE_NAME: 'train.h5.batch', 'valid.h5.batch', or 'test.h5.batch' for the training, validation, and test set, respectively.
BATCHSIZE: optional; the default is 5000. Every BATCHSIZE samples are saved to a separate file DATA_CODE.h5.batchX, where X is the corresponding batch index.
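
To make the steps above more concrete, here is a rough pure-Python sketch of what they amount to. It is not the code in embedH5.py. It assumes each FASTA record spans exactly two lines (header, then sequence), an A/C/G/T channel order, and batch indices starting at 1; the function names are made up for illustration.

```python
BASES = "ACGT"  # assumed channel order; the order used by embedH5.py may differ


def pair_fasta_lines(lines):
    """Mimic `paste - - -d' '`: join every two consecutive lines with a space.

    Assumes an even number of lines, i.e. one header line and one sequence
    line per record.
    """
    lines = [line.rstrip("\n") for line in lines]
    return [" ".join(lines[i:i + 2]) for i in range(0, len(lines), 2)]


def one_hot(sequence):
    """Encode one DNA sequence as nested lists of shape (4, 1, len).

    Characters outside BASES (e.g. N) become all-zero columns.
    """
    sequence = sequence.upper()
    return [[[1.0 if base == channel else 0.0 for base in sequence]]
            for channel in BASES]


def batch_file_names(n_samples, prefix, batch_size=5000):
    """Names of the batch files needed to hold n_samples, numbered from 1."""
    n_batches = (n_samples + batch_size - 1) // batch_size  # ceiling division
    return [prefix + str(i) for i in range(1, n_batches + 1)]


# Two toy records, as they would appear line by line in a FASTA file.
rows = pair_fasta_lines([">seq1\n", "ACGT\n", ">seq2\n", "TTGA\n"])
encoded = [one_hot(row.split(" ")[1]) for row in rows]  # shape (2, 4, 1, 4)
names = batch_file_names(12000, "train.h5.batch")       # three batch files
```

Stacking the per-sequence encodings gives the (bs, 4, 1, len) layout described at the top of this section.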

Model preparation

Change the model function in the template provided to implement your favorite network. Refer to here for examples of how to specify the hyper-parameters to tune.

Running the model

python main.py -d DATA_TOPDIR -m MODEL_FILE_NAME ORDER

DATA_TOPDIR: same as above.
MODEL_FILE_NAME: the model file prepared above.
ORDER: actions to take. Multiple ones can be used and they will be executed in order.
- -y: hyper-parameter tuning. Output will be saved under "$DATA_TOPDIR/$MODELNAME", where MODELNAME is the base name of the model file (minus the ".py" at the end if there is one).
  
  Optional:
  - -hi: the maximum number of iterations for each hyper-parameter combination (default: 20)
  - -dm: 'memory' for loading all the data into memory and 'generator' for using a Python generator to load the data in batch (default: 'memory').
- -t: train on the training set. Output will be saved in the same folder as -y. After early stopping or reaching the maximum number of epochs specified, the model from the epoch with the smallest validation loss (best model) and the model from the last epoch (last model) will be saved.
  
  Optional:
  - -te: the number of epochs to train for (default 20)
  - -bs: the size of minibatch (default 100).
  - -dm: same as above.
  - -pa: number of epochs with no improvement in validation loss after which training will be stopped (default 10).
  - -w: the file to save the weights of the best model to (default: $DATA_TOPDIR/$MODELNAME/${MODELNAME}_bestmodel_weights.h5).
  - -l: the file to save the weights of the last model to (default: $DATA_TOPDIR/$MODELNAME/${MODELNAME}_lastmodel_weights.h5).
- -e: evaluate the model on the test set. Output will be saved in the same folder as -y.
- -p data_to_predict: predict on new data. data_to_predict should be the prefix of the embedded file to predict on, up to the batch number. For example, to predict on sequence data prepared at /my_folder/mydata.batchX, where X is 1, 2, 3, etc., data_to_predict should be /my_folder/mydata.batch.
  
  Optional:
  - -o: the output directory (default /my_folder/pred.mymodel.mydata.batch). Predictions for every batch will be saved to a separate subdirectory and split into different pickle files, one for each output neuron.
- -r runcode -re weightfile: resume training from a weight file
  - runcode: the codename for this new run. The new model files will be the original ones plus .runcode.
  - weightfile: the weight file to resume training from.
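
The interplay between -te and -pa described under -t can be sketched in plain Python as follows. This mirrors the description above rather than the code in main.py; the function name and the list of validation losses are made up for illustration.

```python
def simulate_training(val_losses, max_epochs=20, patience=10):
    """Return (best_epoch, last_epoch), both 1-based.

    val_losses holds one validation loss per epoch. Training stops after
    max_epochs epochs, or earlier once `patience` consecutive epochs have
    passed without a new smallest validation loss.
    """
    best_loss = float("inf")
    best_epoch = 0
    last_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # early stopping
    return best_epoch, last_epoch


losses = [0.9, 0.8, 0.7, 0.75, 0.76, 0.77]
best, last = simulate_training(losses, max_epochs=20, patience=2)
```

In this toy run the best model comes from epoch 3 and the last model from epoch 5, which is why the two saved weight files can differ.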

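The default output locations and the -p prefix convention described above can be illustrated with a short sketch. The helper names are hypothetical and only the naming rules are taken from the option descriptions; main.py may build these paths differently.

```python
import glob
import os


def model_name(model_file):
    """Base name of the model file, minus a trailing ".py" if there is one."""
    base = os.path.basename(model_file)
    return base[:-3] if base.endswith(".py") else base


def default_outputs(data_topdir, model_file):
    """Default output folder and weight files, following the -y, -w and -l rules."""
    name = model_name(model_file)
    out_dir = os.path.join(data_topdir, name)
    return {
        "out_dir": out_dir,
        "best_weights": os.path.join(out_dir, name + "_bestmodel_weights.h5"),
        "last_weights": os.path.join(out_dir, name + "_lastmodel_weights.h5"),
    }


def list_batches(data_to_predict):
    """Files named <prefix><number>, in numeric order of the batch index."""
    pattern = glob.escape(data_to_predict) + "*"
    suffixes = [path[len(data_to_predict):] for path in glob.glob(pattern)]
    numbers = sorted(int(s) for s in suffixes if s.isdigit())
    return [data_to_predict + str(n) for n in numbers]


outputs = default_outputs("expt1", "example/model.py")
# outputs["out_dir"] is "expt1/model" on POSIX systems
```

Calling list_batches("/my_folder/mydata.batch") would return /my_folder/mydata.batch1, /my_folder/mydata.batch2, and so on, for whichever batch files exist.
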
Quick run on the toy data

We provide some toy data and a toy model in the example folder.

To perform a quick run, first run the following commands to convert the data to the desired format and save it under "expt1" in the current folder.

cd $REPO_HOME
for dtype in 'train' 'valid' 'test'
do
	paste - - -d' ' < example/$dtype.fa > tmp.tsv
	python embedH5.py tmp.tsv example/$dtype.target expt1/$dtype.h5.batch
done

Then perform hyper-parameter tuning, training and testing by:

python main.py -d expt1 -m example/model.py -y -t -e

All the intermediate output will be saved under "expt1". If everything works, you should get a test AUC of around 0.97.
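
For readers who want to sanity-check an AUC value by hand, the sketch below computes ROC AUC directly from its definition: the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as half. It is an illustration only, not how main.py computes the metric, and the labels and scores are made-up numbers.

```python
def roc_auc(labels, scores):
    """ROC AUC from binary labels (0 or 1) and predicted scores.

    Compares every positive with every negative, so it is quadratic in the
    number of examples and only suited to small sanity checks.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking.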

zhouyu/Keras-genomics
