Nano-Align

A set of algorithms for protein identification using a sub-nanopore.

Installation

The package is implemented in Python 2.7 and requires no installation. However, it depends on some third party Python packages:

These packages should be instaled in your system. You can use either system package manager (e.g. apt-get in Ubuntu) or pip for installation:

pip install package_name

Data Availability

To download the datasets used in the paper, type:

wget https://github.com/fenderglass/datasets/raw/master/nano-align/nano-align.zip

The package also contains the trained Random Forest and SVR model for convenience.

Quick Example

This is a quick example of Nano-Align pipeline, assuming that you have downloaded the manuscript datasets into 'data' directory. Please see the detailed usage information below.

Train Random Forest model on H32 nanospectra:

./train-model.py rf data/nanospectra/H32.mat h32_rf.pcl

Perform identification of H4 nanospectra:

./identify.py data/nanospectra/H4.mat h32_rf.pcl

Plot H32 nanospectra against the RF and MV models:

./plotting/models-fit.py data/nanospectra/H32.mat h32_rf.pcl,-

Plot identification p-values as a function of number of nanospectra in a cluster:

./plotting/identification-pvalues.py data/nanospectra/H4.mat h32_rf.pcl

Usage

See detailed description of the parameters for each script by specifying "-h" option.

train-model.py

Trains the regression model based on Random Forest / SVR, given nanospectra of a known protein. The output file (model) then is then used as an input for other algorithms.

identify.py

Performs protein identification and estimates p-values. It takes trained RF/SVR model as an input.

Visualization scripts

There is a number of scripts that can be used to visualize different features of the data. They are located in the "plotting" directory

models-fit.py

Plots nanospectra against the corresponding regression models.

identification-pvalues.py

Plots identification p-values depending on the cluster size

mixture.py

Plots the frequency distribution of a multiple sets of nanospectra, originating from different proteins.

volume-bias.py

Plots volume- or hydrophilicity-related bias of the model.

Scripts for Input Data Manipulation

The scripts located in "scripts" directory provide some extra functionality for the data preprocessing and analysis.

cut-protein-db.py

Creates a protein database with the certain protein lengths from a bigger FASTA database (such as human proteome).

flip-blockades.py

Given blockades singals and trained SVR model, for each blockade determines if was recorded from straight (N- to C-terminus) or reverse protein translocation. Reverse translocations are then flipped.

merge-mats.py

Merges multiple .mat files into one.

protein-label.py

Adds protein sequence labels to .mat file - a prerequisite for the further analysis.

Name		Name	Last commit message	Last commit date
Latest commit History 75 Commits
nanoalign		nanoalign
plotting		plotting
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
identify.py		identify.py
train-model.py		train-model.py

License

fenderglass/Nano-Align

Folders and files

Latest commit

History

Repository files navigation