prune

A neural network approach to DNA contamination removal

Uses the Keras library to build a neural network that filters specific sequences from a fastq/fasta file.

Usage

Provide a txt file containing the sequences to be searched for and a fastq file as your background.

Change the input files in the scripts accordingly.

System

Linux: use the command line to create a conda environment and clone the repo. The scripts can then be run with the PYTHONHASHSEED= parameter set.

Windows: I used Anaconda's PowerShell prompt, which gives you access to all your packages. You can also install git through Anaconda so that git clone works in PowerShell as well. Run the programs via python. Fix the hash seed with $env:PYTHONHASHSEED= followed by a number.

Preparation

Clone or copy the files. One easy way to do this is git clone https://github.com/JonasDauster/prune
Provide a set of normal, clean sequences that do not contain the sequences you want to search for or simply your (as fastq, for reference see example data)
Provide the sequences you want to search for (as txt, for reference see example data)
Install all required packages; I recommend conda for that (keras, tensorflow, pandas, biopython, numpy; all Python 3 versions should work fine).
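Before training, it can help to confirm that the packages listed above are importable in the active environment. The following is a minimal sketch and not part of the repository; the helper name missing_modules is made up for illustration. The one detail to watch is that the biopython package is imported under the name Bio.

```python
import importlib.util

# Import names for the packages listed above. The biopython distribution
# is imported as "Bio", so that is the name checked here.
REQUIRED_MODULES = ["keras", "tensorflow", "pandas", "Bio", "numpy"]


def missing_modules(names):
    """Return the top-level module names that cannot be found (hypothetical helper)."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

If anything is reported as missing, install it into the same conda environment that you use to run the scripts.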

Training

Change the input files in TrainingDataCreation.py accordingly and run it.
Do the same with CNN.py, but fix the Python hash seed before you run it. One easy way on Linux is to set it on the command line: PYTHONHASHSEED=6 python3 CNN.py . You can use any number after PYTHONHASHSEED= , just choose the same one when running the next steps. On Windows the same can be achieved with $env:PYTHONHASHSEED=6 , for example.
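If you prefer to start the scripts from Python on either system, the hash seed can be passed through the environment of a child process. This is a sketch under assumptions: run_with_hash_seed is a hypothetical helper, not something the repository provides. The background is that Python randomizes string hashes per process unless PYTHONHASHSEED is set, so presumably the scripts depend on stable hashes between the training and the classification runs.

```python
import os
import subprocess
import sys


def run_with_hash_seed(args, seed, capture=False):
    """Run `python <args>` in a child process with PYTHONHASHSEED fixed (hypothetical helper)."""
    env = dict(os.environ)
    env["PYTHONHASHSEED"] = str(seed)
    return subprocess.run(
        [sys.executable, *args],
        env=env,
        capture_output=capture,  # keep output on the console unless asked to capture it
        text=True,
        check=True,  # raise CalledProcessError if the script exits with an error
    )
```

For example, run_with_hash_seed(["CNN.py"], 6) would be the counterpart of PYTHONHASHSEED=6 python3 CNN.py , and the same seed should then be used for DataLoad.py.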

Testing/Running

For testing, either generate a new training file and run it with DataLoad.py (use the same hash seed as for CNN.py) or use the validation_split option built into Keras.
For running on data to classify, prepare the data with TestDataPrep.py . Then run DataLoad.py on the prepared data (again with the same hash seed). Without proper labels you can ignore the score. The sequences that were found are written to network_hits.csv, the clean ones to no_hits.csv
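A note on the validation_split option mentioned above: Keras holds out the last fraction of the samples passed to fit, taken before any shuffling. The sketch below mimics that behaviour in plain Python so the effect is easy to see; the function name split_like_validation_split is made up, and this is an illustration, not the code from CNN.py.

```python
def split_like_validation_split(samples, labels, validation_split):
    """Hold out the last fraction of the data, as Keras does before shuffling (illustrative)."""
    if not 0.0 < validation_split < 1.0:
        raise ValueError("validation_split must be between 0 and 1")
    if len(samples) != len(labels):
        raise ValueError("samples and labels must have the same length")
    # Everything before split_at is used for training, the rest for validation.
    split_at = int(len(samples) * (1.0 - validation_split))
    training = (samples[:split_at], labels[:split_at])
    validation = (samples[split_at:], labels[split_at:])
    return training, validation
```

The practical consequence: if the training file lists all hits first and all background reads afterwards, the held-out part would contain only one class, so the rows should be shuffled before training.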

Performance Issues

Do not use more than about 10 individual sequences for training; fewer gives better results.
Do not use fragments that are too short; I would recommend k-mers at least 12 bases long.
Make sure your reads are all about the same length, since large differences lead to bad training. If the fastq you want to search has a different read length than the data the network was trained with, consider setting max_lengthtest in the scripts to get a fixed length.
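To check the read-length advice above before training, the lengths in a fastq can be summarized with a few lines of Python. This is a rough sketch with hypothetical helper names. It assumes plain four-line fastq records, and the padding with N in fix_length is only one possible way to get a fixed length; how max_lengthtest is applied inside the scripts may differ.

```python
from collections import Counter


def fastq_sequences(lines):
    """Yield the sequence line of each record, assuming 4 lines per record."""
    for index, line in enumerate(lines):
        if index % 4 == 1:
            yield line.strip()


def length_summary(sequences):
    """Return (shortest, longest, most common) read length."""
    lengths = [len(sequence) for sequence in sequences]
    most_common = Counter(lengths).most_common(1)[0][0]
    return min(lengths), max(lengths), most_common


def fix_length(sequence, length, pad="N"):
    """Truncate or right-pad a read to exactly `length` bases (padding is an assumption)."""
    return sequence[:length].ljust(length, pad)
```

With a real file this could be called as length_summary(fastq_sequences(open("reads.fastq"))), where reads.fastq is a placeholder name. A large gap between the shortest and the longest read is a hint that a fixed length should be set.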

Further Methods

To also spot mutations of a given fragment, the script Mutated.py can be used to generate mutations. The generated txt is then used for training.
To remove the given sequence directly from a fastq, DirectFastqFilter.py can be used.
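The two ideas above can be illustrated with a small sketch. It is not the code of Mutated.py or DirectFastqFilter.py, whose details are not shown here; it only assumes that a mutation means a single-base substitution and that filtering means dropping every read that contains a target as an exact substring. Both function names are made up.

```python
def single_substitutions(fragment, alphabet="ACGT"):
    """Return every variant of `fragment` that differs at exactly one position."""
    variants = []
    for position, original in enumerate(fragment):
        for base in alphabet:
            if base != original:
                variants.append(fragment[:position] + base + fragment[position + 1:])
    return variants


def filter_records(records, targets):
    """Split (header, sequence, quality) records into (clean, hits) by exact substring match."""
    clean, hits = [], []
    for record in records:
        sequence = record[1]
        if any(target in sequence for target in targets):
            hits.append(record)
        else:
            clean.append(record)
    return clean, hits
```

A fragment of length n yields 3n variants this way, so the generated list grows quickly. Keep the advice about the number of training sequences in mind when deciding how many variants to include.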
