morozlab/autonomics: an automatic sequence analysis system
The autonomics pipeline is described in the paper XXXXXX. Here we describe how to install and use the autonomics pipeline (the pipeline).

----------------
QUICK START
----------------

1) Scan the full installation instructions below.

2) In your .bashrc or .cshrc file set the following environment variables:

   AUTONOMICS_PATH  <path.to.autonomics.dir>, e.g. /home/pwilliams/autonomics
   PIPELINE_PATH    <path.to.where.your.project.folders.will.be.created>, e.g. /srv/data2/pipeline
                    (this needs to be in an area that can store big data; create this directory)
   CLASSPATH        <path.to.autonomics.dir>/java/bin:<path.to.autonomics.dir>/java/mysql-connector-java-5.0.8-bin.jar
                    e.g. /home/pwilliams/autonomics/java/bin:/home/pwilliams/autonomics/java/mysql-connector-java-5.0.8-bin.jar
   PATH             <path.to.autonomics.dir>/bin
   PYTHONPATH       <path.to.autonomics.dir>; you also need to add the path to khmer/python
                    (https://github.com/ctb/khmer/blob/master/scripts/normalize-by-median.py)

----------------
THE INSTALLATION
----------------

The pipeline is intended to be run on a Linux computer with multiple nodes (the local computer). In addition, an HPC cluster (referred to as the HPC or remote computer) is required to run blast, bowtie and hmmer. The pipeline code itself, including the MySQL database, the pre-assembly applications (adapter_trim, quality_trim (cutadapt), and khmer read normalization), the assemblers (Trinity and mira; mira is optional), and the code that creates the GO and KEGG results all run on the local machine. The assemblers use multiple nodes; the other local jobs use a single CPU.

The default arguments for the jobs (blast, hmmer, bowtie, trinity, etc.) are stored in the MySQL table default_args. Once your database is set up, log in and execute "select * from default_args;" to see them. For non-default cases the arguments are inserted into the args table (see HOW TO MANUALLY INSERT A PROJECT INTO THE PIPELINE, below).

All the Perl scripts (those ending in .pl) in the autonomics/bin and autonomics/scripts directories assume that the Perl interpreter is at /usr/bin/perl. If this is not the case, edit each .pl file and change its first line from "#!/usr/bin/perl -w" to the correct path. This line must start with "#!", must have no spaces before it, and must be the very first line of the file.

A Redis key store is used to speed up the annotation of Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) terms. First, download the Redis distribution from http://redis.io/ and install redis-server following the instructions in the distribution README; this creates /usr/local/bin/redis-server. Be sure /usr/local/bin is in your PATH. Next start the Redis server: in the Redis distribution directory execute "redis-server ./redis.conf", which starts the server in the background; run the ps command and you should see redis-server running. Next, download dump.rdb from the autonomics directory on our public ftp site, 128.227.123.35 (username: moroz-guest, password: moroz-guest), and put the dump.rdb file in your Redis distribution directory. Stop the redis-server (kill -9 ...) and do a ps to be sure it has stopped. Now restart the redis-server exactly as before; it will automatically read in the dump file and populate the key store with the necessary annotation mappings. This may take a while, as the dump file is large; be patient.
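Condensed, the Redis setup described above looks like the following; the pid reported by ps and the download location are placeholders:

   % cd <redis.distribution.dir>
   % redis-server ./redis.conf            # starts the server in the background
   % ps aux | grep redis-server           # confirm the server is running
     (download dump.rdb from ftp://128.227.123.35, user moroz-guest, password moroz-guest)
   % kill -9 <redis-server-pid>           # stop the server
   % ps aux | grep redis-server           # confirm it has stopped
   % cp <download.dir>/dump.rdb .         # put the dump file in the Redis distribution directory
   % redis-server ./redis.conf            # restart; the dump file is loaded on startup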
Also download the file autonomics.data.for.sql.tables from the ftp site and put that file in the autonomics/data directory.

To start, edit the settings.py file in the autonomics directory and enter data appropriate for your situation. The settings file is clearly documented to indicate what you need to modify. Create the necessary directories.

Some of the applications you will need installed on your local machine are: khmer, trinity, mira, cutadapt, the Redis server, an ssh server, python 2.7.x, MySQL 5+, biopython, MySQL-python, paramiko, pg8000 and sqlalchemy; and on your remote cluster: pfam, blast and hmmer.

Setting the wall time for blast NR jobs may require some trial and error, as it depends on your HPC cluster's performance and on the size of your projects, i.e. the number of contigs in the assembled fasta file. While setting the wall time very high is safe, it may lower the priority of your jobs in the submission queue.

In your .bashrc or .cshrc file, add the following (creating any of these variables that do not already exist there) and then 'source' the file. Here <path.to.autonomics> is the directory that contains your autonomics checkout:

   PYTHONPATH: add <path.to.autonomics>:<path.to.autonomics>/autonomics
   CLASSPATH:  add <path.to.autonomics>/autonomics/java/bin:<path.to.autonomics>/autonomics/java/mysql-connector-java-5.0.8-bin.jar
   PATH:       add <path.to.autonomics>/autonomics/bin

For example:

   PYTHONPATH=/home/morozgroup/autonomics:/home/morozgroup
   CLASSPATH=/home/morozgroup/autonomics/java/bin:/home/morozgroup/autonomics/java/mysql-connector-java-5.0.8-bin.jar
   PATH=/home/morozgroup/autonomics/bin:.....

Also add the environment variable AUTONOMICS_PATH and set it to <path.to.autonomics>/autonomics. For example:

   AUTONOMICS_PATH=/home/morozgroup/autonomics

Set up the MySQL database for the autonomics pipeline. Here is how we did it (this assumes ZC_DB_NAME = "zero_click" in settings.py; if not, change "zero_click" below to whatever you set ZC_DB_NAME to):

   % mysql -u root --password=XXXXX --database=mysql
   mysql> create database zero_click;
   mysql> create user zeroclick;
   mysql> GRANT SELECT, INSERT, UPDATE, DELETE, INDEX, LOCK TABLES ON `zero_click`.* TO 'zeroclick'@'localhost';
   mysql> GRANT SELECT, INSERT, UPDATE, DELETE, INDEX, LOCK TABLES ON `zero_click`.* TO 'zeroclick'@'127.0.0.1';
   mysql> GRANT FILE ON *.* TO 'zero_click'@'localhost';
   mysql> GRANT FILE ON *.* TO 'zero_click'@'127.0.0.1';
   mysql> use mysql;
   mysql> update user set Password=PASSWORD('XXXXXXX') where User like 'zeroclick%';
   mysql> flush privileges;
   mysql> use zero_click;

Now create the necessary tables in your MySQL database on your local machine (see autonomics/docs/README), then load the data to initialize the tables. Here is what we did, right after the above commands, in the autonomics directory:

   % mysql -u root --password=XXXXXXXX zero_click < autonomics/data/autonomics.sql.tables
   % mysql -u root --password=XXXXXXXX zero_click < autonomics/data/autonomics.data.for.sql.tables

The code in autonomics/jobs.py creates qsub files that control jobs on the remote HPC cluster. Our cluster uses the Moab control system and the qsub scripts look like this:
   #!/bin/bash
   #PBS -r n
   #PBS -N Capitella_proteins_pfam96
   #PBS -o Capitella_proteins_pfam96.stdout
   #PBS -e Capitella_proteins_pfam96.stderr
   #PBS -q bio
   #PBS -m a
   #PBS -M morozhpc@gmail.com,plw1080@gmail.com
   #PBS -l walltime=48:00:00
   #PBS -l pmem=500mb
   #PBS -l nodes=1:ppn=1
   module load hmmer
   pfam_scan.pl -dir /scratch/lfs/moroz/pfam/ -cpu 1 -fasta /scratch/lfs/moroz/Capitella_proteins_pfam20131028171739/Capitella_proteins_project_pfam_96.fasta

The above script is created automatically by the autonomics pipeline code; it gets its values, such as the email addresses, from the settings.py file you edited above. You do not need to do anything here; we show it only for your information. However, if your cluster does not use qsub scripts to run batch jobs (most clusters do), edit the code in jobs.py so that it creates submission scripts appropriate for your HPC cluster.

---------------------
STARTING THE PIPELINE
---------------------

After installing all of the above applications and modifying the settings.py file, start the pipeline by executing start.auto in the autonomics directory. You will be asked to give a log file name. This starts the dispatcher and manager loops, which run continuously in the background and control the pipeline. These processes monitor the MySQL database for jobs to run, and the local machine and remote cluster for job completion. If you wish to use the data gremlin, then also execute "python data_gremlin.py &".

For example, "autonomics/start.auto FOO" results in:

   % ls autonomics/*log*
   dlogFOO    (the log from the dispatcher)
   mlogFOO    (the log from the manager)

   % ps -U morozgroup -u morozgroup u
   USER PID   %CPU %MEM VSZ      RSS     TTY STAT START TIME COMMAND
   501  7611  0.0  0.4  245608   24916   ?   S    06:20 0:00 python dispatcher.py
   501  7613  0.0  0.4  251328   26392   ?   S    06:20 0:07 python manager.py
   501  25264 0.0  74.9 10685964 4490928 ?   Sl   Nov12 4:14 redis-server ./redis.conf

If you need to start the pipeline again, BE SURE to run the 'ps' command and then 'kill' ALL the dispatcher and manager processes that are running; there may be more than one of each. If you need help from us, we will need to look at your log files.

----------------
THE DATA GREMLIN
----------------

The data-gremlin unit listens on the sequencing server and, when a project is completed, starts the project on its journey through the pipeline. In that journey the reads are assembled and then annotated using blast (NR and SwissProt) and hmmer for the Pfams. Based on the SwissProt results, the Gene Ontology (GO) and KEGG results are computed. In addition, quantification (abundances) is computed using bowtie. At our lab we have chosen not to use the data-gremlin, as the lead scientist wishes to evaluate the quality and validity of sequenced projects before assembling/annotating them. In addition, the default names given to a project by the sequencer can be meaningless, e.g. R_2013_09_26_14_06_58_user_PRO-23; in the curation process, meaningful names are assigned to the projects. This avoids clogging up the pipeline with bad data and annotating data with meaningless project names.

----------------------------------------------------
HOW TO MANUALLY INSERT A PROJECT INTO THE PIPELINE
----------------------------------------------------

For the reasons given above, we manually insert projects into the pipeline rather than using the data_gremlin.
To start a job manually, first create a directory in the pipeline directory (the directory you specified as "proj_dir" in the settings.py file) whose name is the name of the project, <project>, and then copy the reads into that directory, naming them <project>.fastq, or in the case of paired-end reads, <project>.fastq and <project>.fastq.end2. This naming convention MUST be strictly and exactly followed. The pipeline will automatically name the assembled data <project>_project.fasta. If you wish to annotate already-assembled data, e.g. gene models, you MUST name the already-assembled data (whether NT or AA) <project>_project.fasta in the same way. The pipeline directory must look like this:

   <projectA>/<projectA>.fastq                (ordinary reads)
   <projectB>/<projectB>.fastq                (paired-end reads)
   <projectB>/<projectB>.fastq.end2
   <projectG>/<projectG>_project.fasta        (already assembled data or gene models)

where <projectA> might be Aplysia_californica_gastrulation, for example. The use of fairly short but meaningful names is important.

Once the data to be assembled/annotated is in the pipeline directory as described above, manually initiate the assembly/annotation of a project by calling annotate.proj.pl (in the autonomics/bin directory). A complete end-to-end example is given at the end of this section.

annotate.proj.pl usage:
   -proj <project_name_with_path>   (REQUIRED)
   -mira                            (OPTIONAL, default is Trinity)
   -paired_end                      (OPTIONAL, default is non-paired-end)
   -noass                           (OPTIONAL, use when the assembly step is to be omitted)
   -data [ NT | AA ]                (OPTIONAL; REQUIRED if -noass is used)

The -noass case can be used for annotating gene models. You may not submit a project more than once with the same name. If you do need to annotate a project a second time, change its name AND the name(s) of the reads or the _project.fasta file.

To specify even finer control of the arguments and applications used during the annotation/assembly process, the script systemtools.py can be used instead of annotate.proj.pl. The interface to systemtools.py is complex; see "python systemtools.py --help" (systemtools.py is in the autonomics/scripts directory). 99% of the time we use the simple annotate.proj.pl to initiate jobs; when you execute it, you will see how it in turn invokes systemtools.py, for example:

   % annotate.proj.pl -proj /data/pipeline/aplysia.brain.rear
   python /home/morozgroup/autonomics/scripts/systemtools.py --add-project --assign-workflow --add-jobs adapter_trim quality_trim read_normalization assemble quantification blast_nr blast_swissprot pfam kegg go --set-config adapter_trim:+ quality_trim:+ read_normalization:+ assemble:+ quantification:+ blast_nr:+ blast_swissprot:+ pfam:+ kegg:+ go:+ --project-names aplysia.brain.rear
   job submitted successfully

It is essential that the submission ends with: job submitted successfully.

When a project is started with annotate.proj.pl (or systemtools.py), the data for the project is entered into the MySQL database in the pn_mapping and jn_mapping tables. The manager and dispatcher processes described above then check the database and the progress of the various running jobs once a minute. Any jobs found are queued for starting if their dependencies are satisfied. The number of jobs of each job type is limited, e.g. only two blast NR jobs can run at once to avoid overloading the HPC cluster queue (this limit can be changed in the settings.py file).
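Putting the above together, a complete manual submission looks like the following. The project name and the source read file are hypothetical, and this assumes $PIPELINE_PATH (set in the QUICK START section) points at the same pipeline directory as proj_dir in settings.py:

   % mkdir $PIPELINE_PATH/Aplysia_californica_gastrulation
   % cp /path/to/raw_reads.fastq $PIPELINE_PATH/Aplysia_californica_gastrulation/Aplysia_californica_gastrulation.fastq
   % annotate.proj.pl -proj $PIPELINE_PATH/Aplysia_californica_gastrulation
   ...
   job submitted successfully

For paired-end reads, also copy the second read file to Aplysia_californica_gastrulation.fastq.end2 and add the -paired_end flag to the annotate.proj.pl call.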
The dependencies are given in the dependency table:

   mysql> select * from dependency;
   +--------------------+--------------------+
   | job_type           | depends_on         |
   +--------------------+--------------------+
   | assemble           | adapter_trim       |
   | assemble           | quality_trim       |
   | assemble           | read_normalization |
   | blast_nr           | assemble           |
   | blast_swissprot    | assemble           |
   | go                 | assemble           |
   | go                 | blast_swissprot    |
   | kegg               | assemble           |
   | kegg               | blast_swissprot    |
   | pfam               | assemble           |
   | quality_trim       | adapter_trim       |
   | quantification     | assemble           |
   | read_normalization | adapter_trim       |
   | read_normalization | quality_trim       |
   +--------------------+--------------------+

By default, the jobs shown in the default_configuration table are run for a project:

   mysql> select * from default_configuration;
   +--------------------+------+
   | job_type           | code |
   +--------------------+------+
   | adapter_trim       | +    |
   | assemble           | +    |
   | blast_nr           | +    |
   | blast_swissprot    | +    |
   | go                 | +    |
   | kegg               | +    |
   | pfam               | +    |
   | quality_trim       | +    |
   | quantification     | +    |
   | read_normalization | +    |
   +--------------------+------+

-----------------------------------
CHECKING STATUS OF JOBS IN PIPELINE
-----------------------------------

The script autonomics/bin/pipe_status allows you to see the status of jobs in the autonomics pipeline.

   % ./bin/pipe_status
   Usage: ./bin/pipe_status <cmd pa | pn | pname | pid | jobs | q | run | nf | nfns | snf> <arg1 optional>

   pipe_status pa              : prints <project_id> <project_name> for all projects in the database, finished and unfinished, sorted alphabetically
   pipe_status pn              : prints <project_id> <project_name> for all projects in the database, finished and unfinished, sorted by project_id
   pipe_status pname <proj_id> : prints the <project_name> for that <project_id>
   pipe_status pid <proj_name> : prints the <project_id> for that <project_name>
   pipe_status jobs <proj_id>  : prints the status of all jobs for that <project_id>
   pipe_status q               : prints jobs in the queue
   pipe_status run             : prints all jobs that are currently running
   pipe_status nf              : prints all jobs that are not finished
   pipe_status snf             : prints all jobs that are started but not finished
   pipe_status nfns            : prints all jobs that are not finished and not yet started

For example:

   % pipe_status pid Strombus_gigas_CNS_mix_trans
   project_id  project_name
   312         Strombus_gigas_CNS_mix_trans

   % pipe_status jobs 312
   job_id  project_id  job_type      job_name                                   q_ts                 started  s_ts                 finished  f_ts                 queued
   242     312         adapter_trim  Strombus_gigas_CNS_mix_trans_adapter_trim  2013-04-18 01:03:02  Y        2013-04-18 01:04:23  Y         2013-04-18 01:04:24  Y
   244     312         assemble      Strombus_gigas_CNS_mix_trans_assemble      2013-04-18 01:03:02  Y        2013-04-18 08:19:48  Y         2013-04-18 13:35:45  Y
   ..
   ..

   % pipe_status nf
   job_id  project_id  job_type  job_name          q_ts                 started  s_ts                 finished  f_ts                 queued
   2178    549         blast_nr  PAIRED5_blast_nr  2013-11-11 03:00:24  Y        2013-11-11 10:05:52  N         0000-00-00 00:00:00  Y

These are all just shortcuts to accessing the MySQL database. If you know how to use mysql, you may query the database directly instead.
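For example, a direct query roughly equivalent to "pipe_status jobs 312" might look like the one below; the table and column names are assumed to match the jn_mapping table and the pipe_status output shown above, so adjust them if your schema differs:

   % mysql -u zeroclick -p zero_click -e "select job_id, job_type, started, finished from jn_mapping where project_id = 312;"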
-----------
THE RESULTS
-----------

The autonomics pipeline puts the assembled data and annotations in the <project> directory. For example, here is the data from a completed project:

   % ls -1 Helix_lucorum_Control/
   Helix_lucorum_Control.fastq                  (the original reads)
   Helix_lucorum_Control_project.fasta          (the assembled data)
   Helix_lucorum_Control_blast_nr.txt
   Helix_lucorum_Control_blast_swissprot.txt
   Helix_lucorum_Control_gocats.txt
   Helix_lucorum_Control_GO.txt
   Helix_lucorum_Control_KEGG.txt
   Helix_lucorum_Control_pfam.txt
   Helix_lucorum_Control_quantification.txt

The gocats file is a GO category file containing 1) the number of unique sequences placed into each GO category, 2) the number of unique annotations present in each category, and 3) the abundance of all sequences placed into that category. The above files are the input to the neurobase comparative browser, described in the paper YYYYYY.

We recommend checking the results when all jobs for a project are completed, using the script autonomics/bin/check_results:

   % check_results Helix_lucorum_Space
   Helix_lucorum_Space NUM_FA_SEQS 7454
   NR: SEQS_TRIED 7454 SEQS_W_HITS 881 (11%) SEQS_WO_HITS 6573 (88%)
   SP: SEQS_TRIED 7454 SEQS_W_HITS 415 (5%) SEQS_WO_HITS 7039 (94%)
   GO: 6603
   gocats: 1926
   KEGG: 405
   pfam: 4734
   quantification: 6916

The first three numbers (NUM_FA_SEQS and the two SEQS_TRIED values) should always be equal, 7454 in this case, if the two blasts ran correctly. If either the NR or the SP count is less than the total number of sequences in the <project>_project.fasta file, that blast job is incomplete and you need to deal with it as described in the next section.

--------------------------------
DEALING WITH PROBLEMS ON CLUSTER
--------------------------------

When running hundreds of jobs concurrently on a large shared cluster, as the autonomics pipeline does, occasionally a job may fail because of a node failure or some other hardware problem, or, rarely, because the qsub did not ask for enough wall time. When this happens it usually involves only one job out of the entire run. For example, when running blast_nr for a project, the data is split into 100 parts and a separate job is run for each part. Note that the term job is used in two senses: there is, e.g., the blast_nr job for a project; however, that blast job is divided into 100 sub-jobs that run in parallel, and these sub-jobs are also referred to as jobs.

The procedure described below assumes some computational sophistication on the part of the user. If this is not the case, the easiest, but not the fastest, solution is simply to resubmit the entire blast or pfam job. You can do this by executing "stop_job.py <project_name> <job_type>" and then "restart_job <project_name> <job_type>". In addition there is a script stop_project. Do not restart go or kegg jobs until you are certain that blast_swissprot has completed for that project, as the go and kegg computations need the blast_swissprot results. Also, do not restart quantification unless you have reads, as quantification requires them; this applies to gene models, for example.

When getting started you may need to experiment with a few runs to see how much wall time to give your jobs on the cluster, as clusters vary in performance. You modify the wall time in settings.py. After modifying the settings file, you will need to stop the pipeline by killing all processes running the dispatcher and the manager (there may be more than one of each; use the ps and kill -9 commands), then restart the pipeline using start.auto. You may also need to restart any jobs (using restart.py) that were running prior to this.
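For example, the resubmit-everything route for an incomplete blast_nr job, followed by a pipeline restart after a settings.py change, might look like this. The project name is one used earlier in this README; the pids, user name and log name are placeholders (check autonomics/bin for the exact script names on your installation):

   % stop_job.py Helix_lucorum_Space blast_nr
   % restart_job Helix_lucorum_Space blast_nr

   % ps -U morozgroup -u morozgroup u | grep -E 'dispatcher|manager'
   % kill -9 <every dispatcher.py and manager.py pid listed>
   % autonomics/start.auto FOO2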
If a problem with a job is encountered, the qsub script directs the cluster to send an email to the address specified in the settings.py file. But clusters are not perfect, so if a problem is suspected, log in to the cluster and run the qstat command to see the status of all your jobs. You can also go to the remote_dir specified in the settings.py file and then to the directory with your run data, e.g. Capitella_proteins_blast_nr20131028171733/. A done flag is set for each job that has completed, e.g. done.Capitella_proteins_blast_nr21. If you count these, for example for blast_nr jobs, there should be 100 dones if all jobs completed (a condensed command sequence is given at the end of this section). If a job is missing, see if it is still running using qstat; if not, look in its stderr file to see what the problem is. If necessary you can rerun any failed job by executing, e.g., "qsub Capitella_proteins_blast_nrXX.qsub", where XX is the number of the job that failed.

For successfully completed jobs, all jobs except the last one (job 100 in the blast_nr case) will have the same number of sequences blasted. Use autonomics/bin/countblasts to verify this:

   gator1.ufhpc{plw1080}27: countblasts Capitella_proteins_blast_nr 1 100 nr
   Capitella_proteins_blast_nr1_blast_nr.txt 325
   Capitella_proteins_blast_nr2_blast_nr.txt 325
   Capitella_proteins_blast_nr3_blast_nr.txt 325
   Capitella_proteins_blast_nr4_blast_nr.txt 325
   Capitella_proteins_blast_nr5_blast_nr.txt 325
   Capitella_proteins_blast_nr6_blast_nr.txt 325
   Capitella_proteins_blast_nr7_blast_nr.txt 325
   ..

In the rare case of a failed job on the cluster, after rerunning it to completion you can merge all the blast outputs using catall_blasts; for example, the command "catall_blasts Capitella_proteins NR 100" will create one file, Capitella_proteins_blast_nr.txt, with the merged results. You can then scp that file to your project directory on your local machine.

If the problem encountered was that some jobs did not have enough wall time, edit the relevant qsub files to increase the wall time and qsub them again. If all the qsub files need an increase, use sed with a command something like this:

   sed -i 's/old-wallTime/new-wallTime/g' *.qsub

This may be an indication that you need to increase the wall time in the settings.py file; see the section THE INSTALLATION above.
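Condensed, the checking procedure above looks like the following for a blast_nr run; the run directory, the failed part number (37) and the user name are placeholders:

   # on the HPC cluster
   % cd <remote_dir>/Capitella_proteins_blast_nr20131028171733/
   % ls done.* | wc -l                           # should print 100 if every blast_nr part finished
   % qstat -u <your_hpc_username>                # are any parts still running?
   % cat Capitella_proteins_blast_nr37.stderr    # inspect the failed part's stderr
   % qsub Capitella_proteins_blast_nr37.qsub     # resubmit just that part
   % catall_blasts Capitella_proteins NR 100     # once all 100 parts are done, merge the outputs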
------------------------------------------------------------
GETTING HELP IN CASE OF TROUBLE WITH THE AUTONOMICS PIPELINE
------------------------------------------------------------

If you have carefully followed the installation instructions above and still run into problems, send an email to plw1080@gmail.com. We will need a record of all the steps you took to install and set up the autonomics pipeline, plus the two runtime log files (the dispatcher and manager logs described above).

------------------------------------------
DE-NOVO PREDICTION OF SECRETORY MOLECULES
------------------------------------------

Code for de-novo prediction of secretory molecules from proteomic data is included in the autonomics distribution and can be found in the directory autonomics/prediction. The prediction is based on the results from several heuristic tools, including SignalP4, TMHMM and Phobius, which in turn are based on a variety of universal protein motifs, including the presence or absence of N-terminal signal peptides (SP), transmembrane domains (TM), and other sublocalization motifs.

You need /lib/ld-linux.so.2 on your machine in order to run the prediction code. If you don't have it, you can get it with "sudo yum install /lib/ld-linux.so.2" (you may need your sysadmin to do this if you don't have root privileges).

The input to the code is a file of proteins in fasta format. The input file must be put in the relevant project directory in the pipeline directory and given the name <proj_name>_proteins.fasta, like this:

   <pipeline_dir>/<proj_name>/<proj_name>_proteins.fasta

Start the prediction by invoking:

   python <path.to>/autonomics/prediction/predicts.py [--max_threads <n>] \
       [--sp_cutoff <float 0.0 .. 1.0>] <proj_name> >& log.pred.<proj_name>.<sp_cutoff> &

--sp_cutoff defaults to 0.5 if not specified; this is the SignalP4 score for discriminating signal peptides from transmembrane regions. --max_threads defaults to 1 if not specified and must be 1 or a multiple of 3. You may also specify the args:

   --use_tmhmm  ('0 or 1', default=1)   # use TMHMM to predict transmembrane domains
   --use_phobtm ('0 or 1', default=1)   # use Phobius to predict transmembrane domains
   --use_phobsp ('0 or 1', default=1)   # use Phobius to predict signal peptides

You do not normally use these args; they are for experimentation with the code.

The final output is in two files:

   [file1] <proj_dir>/<proj_name>/<proj_name>_peptide_prediction
   [file2] <proj_dir>/<proj_name>/<proj_name>_peptides_<sp_cutoff_value>.fasta

A line of [file1] looks like this:

   jgi|Lotgi1|54396|gw1.1.87.1,0.099,0,0,0,-

and is interpreted as follows. The first item in the comma-separated output is the sequence name; the next items are:

   [A] the SignalP4 confidence score, a floating point number between 0 and 1
   [B] the number of transmembrane domains predicted by TMHMM
   [C] the number of transmembrane domains predicted by Phobius
   [D] whether Phobius predicts the presence of a signal peptide (0 or 1)
   [E] a '+' or '-' indicating that the combined score predicts the protein is / is not a secretory molecule

For the combined score [E] to be '+', [A] must be > sp_cutoff, and [B] must be 0 unless ([C] = 0 and [D] = 1). (A one-line rendering of this rule is sketched at the very end of this README.) The reasoning is as follows: [B (TMHMM)] detects the widest range of transmembrane domains. However, signal peptides share many traits with transmembrane domains, and so can be misclassified. Phobius is specifically designed to distinguish between TM domains and signal peptides, so if [B (TMHMM)] says there is a TM domain but Phobius [C (PhobiusTM)] says there is no TM domain and [D (PhobiusSP)] indicates there is a signal peptide, we assume that [B (TMHMM)] misclassified the protein and ignore its result.

The subset of input proteins whose combined score is '+' is written to [file2] in fasta format. The total time to run the peptide prediction is given at the end of the log file you created above. If for any reason you abort a predicts.py run or it fails to run to completion, you must delete the following directory before running predicts.py again:

   <proj_dir>/<project_name>/prediction

Be careful not to remove the wrong directory. You only need to run predicts.py once for a given project. After that, to get a prediction for a different SignalP cutoff, use the script parse_preds.py as follows:

   python parse_preds.py --sp_cutoff <new_cutoff> <proj_name>

This will create a file: <pipeline>/<proj_name>/<proj_name>_<cutoff>.fasta

For more information on the neuropeptide prediction code, please contact: David Girardo dogirardo@WPI.EDU
For assistance with the autonomics pipeline code, contact: Peter Williams plw1080@gmail.com
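For reference, the combined-score rule described in the prediction section can be written as a one-line filter over [file1]. This is only a minimal sketch, assuming the comma-separated field order shown above and an sp_cutoff of 0.5; it is not a substitute for parse_preds.py:

   % awk -F, '$2 > 0.5 && ($3 == 0 || ($4 == 0 && $5 == 1)) {print $1}' <proj_name>_peptide_prediction

This prints the names of the sequences that would receive a '+' combined score.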