morozlab/autonomics: an automatic sequence analysis system
The autonomics pipeline is described in the paper XXXXXX. Here we describe how to install and use the autonomics pipeline (the pipeline).

----------------
QUICK START
----------------

1) Scan the full installation instructions below.

2) In your .bashrc or .cshrc file set the following environment variables:

   AUTONOMICS_PATH  <path.to.autonomics.dir>, e.g. /home/pwilliams/autonomics
   PIPELINE_PATH    <path.to.where.your.project.folders.will.be.created>, e.g. /srv/data2/pipeline
                    (this needs to be in an area that can store big data; create this directory)
   CLASSPATH        <path.to.autonomics.dir>/java/bin:<path.to.autonomics.dir>/java/mysql-connector-java-5.0.8-bin.jar
                    e.g. /home/pwilliams/autonomics/java/bin:/home/pwilliams/autonomics/java/mysql-connector-java-5.0.8-bin.jar
   PATH             <path.to.autonomics.dir>/bin
   PYTHONPATH       <path.to.autonomics.dir>; you also need to add the path to khmer/python
                    (https://github.com/ctb/khmer/blob/master/scripts/normalize-by-median.py)

----------------
THE INSTALLATION
----------------

The pipeline is intended to be run on a Linux computer with multiple nodes (the local computer). In addition, an HPC cluster (referred to as the HPC or remote computer) is required to run blast, bowtie and hmmer. The pipeline code itself, including the MySQL database, the pre-assembly applications (adapter_trim, quality_trim (cutadapt), and khmer read normalization), the assemblers (Trinity and mira; mira is optional), and the code that creates the GO and KEGG results all run on the local machine. The assemblers use multiple nodes; the other local jobs use a single CPU.

The default arguments for the jobs (blast, hmmer, bowtie, trinity, etc.) are stored in the MySQL table default_args. Once your database is set up, log in and execute "select * from default_args;" to see them. For non-default cases the arguments are inserted into the args table (see HOW TO MANUALLY INSERT A PROJECT INTO THE PIPELINE, below).

All the Perl scripts (those ending in .pl) in the autonomics/bin and autonomics/scripts directories assume that the Perl interpreter is at /usr/bin/perl. If this is not the case, edit each .pl file and change its first line from "#!/usr/bin/perl -w" to the correct path. This line must start with "#!", must have no spaces before it, and must be the very first line of the file.

A Redis key store is used to speed up the annotation of Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) terms. First, download the Redis distribution from http://redis.io/ and install redis-server following the instructions in the distribution README; this creates /usr/local/bin/redis-server. Be sure /usr/local/bin is in your PATH. Next start the Redis server: in the Redis distribution directory execute "redis-server ./redis.conf", which starts the server in the background; run the ps command and you should see redis-server running. Next, download dump.rdb from the autonomics directory on our public ftp site, 128.227.123.35 (username: moroz-guest, password: moroz-guest), and put the dump.rdb file in your Redis distribution directory. Stop the redis-server (kill -9 ...) and do a ps to be sure it has stopped. Now restart the redis-server exactly as before; it will automatically read in the dump file and populate the key store with the necessary annotation mappings. This may take a while, as the dump file is large; be patient.
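Condensed, the Redis setup described above looks like the following; the pid reported by ps and the download location are placeholders:

   % cd <redis.distribution.dir>
   % redis-server ./redis.conf            # starts the server in the background
   % ps aux | grep redis-server           # confirm the server is running
     (download dump.rdb from ftp://128.227.123.35, user moroz-guest, password moroz-guest)
   % kill -9 <redis-server-pid>           # stop the server
   % ps aux | grep redis-server           # confirm it has stopped
   % cp <download.dir>/dump.rdb .         # put the dump file in the Redis distribution directory
   % redis-server ./redis.conf            # restart; the dump file is loaded on startup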
Also download the file autonomics.data.for.sql.tables from the ftp site and put that file in the autonomics/data directory.

To start, edit the settings.py file in the autonomics directory and enter data appropriate for your situation. The settings file is clearly documented to indicate what you need to modify. Create the necessary directories.

Some of the applications you will need installed on your local machine are: khmer, trinity, mira, cutadapt, the Redis server, an ssh server, python 2.7.x, MySQL 5+, biopython, MySQL-python, paramiko, pg8000 and sqlalchemy; and on your remote cluster: pfam, blast and hmmer.

Setting the wall time for blast NR jobs may require some trial and error, as it depends on your HPC cluster's performance and on the size of your projects, i.e. the number of contigs in the assembled fasta file. While setting the wall time very high is safe, it may lower the priority of your jobs in the submission queue.

In your .bashrc or .cshrc file, add the following (creating any of these variables that do not already exist there) and then 'source' the file. Here <path.to.autonomics> is the directory that contains your autonomics checkout:

   PYTHONPATH: add <path.to.autonomics>:<path.to.autonomics>/autonomics
   CLASSPATH:  add <path.to.autonomics>/autonomics/java/bin:<path.to.autonomics>/autonomics/java/mysql-connector-java-5.0.8-bin.jar
   PATH:       add <path.to.autonomics>/autonomics/bin

For example:

   PYTHONPATH=/home/morozgroup/autonomics:/home/morozgroup
   CLASSPATH=/home/morozgroup/autonomics/java/bin:/home/morozgroup/autonomics/java/mysql-connector-java-5.0.8-bin.jar
   PATH=/home/morozgroup/autonomics/bin:.....

Also add the environment variable AUTONOMICS_PATH and set it to <path.to.autonomics>/autonomics. For example:

   AUTONOMICS_PATH=/home/morozgroup/autonomics

Set up the MySQL database for the autonomics pipeline. Here is how we did it (this assumes ZC_DB_NAME = "zero_click" in settings.py; if not, change "zero_click" below to whatever you set ZC_DB_NAME to):

   % mysql -u root --password=XXXXX --database=mysql
   mysql> create database zero_click;
   mysql> create user zeroclick;
   mysql> GRANT SELECT, INSERT, UPDATE, DELETE, INDEX, LOCK TABLES ON `zero_click`.* TO 'zeroclick'@'localhost';
   mysql> GRANT SELECT, INSERT, UPDATE, DELETE, INDEX, LOCK TABLES ON `zero_click`.* TO 'zeroclick'@'127.0.0.1';
   mysql> GRANT FILE ON *.* TO 'zero_click'@'localhost';
   mysql> GRANT FILE ON *.* TO 'zero_click'@'127.0.0.1';
   mysql> use mysql;
   mysql> update user set Password=PASSWORD('XXXXXXX') where User like 'zeroclick%';
   mysql> flush privileges;
   mysql> use zero_click;

Now create the necessary tables in your MySQL database on your local machine (see autonomics/docs/README), then load the data to initialize the tables. Here is what we did, right after the above commands, in the autonomics directory:

   % mysql -u root --password=XXXXXXXX zero_click < autonomics/data/autonomics.sql.tables
   % mysql -u root --password=XXXXXXXX zero_click < autonomics/data/autonomics.data.for.sql.tables

The code in autonomics/jobs.py creates qsub files that control jobs on the remote HPC cluster. Our cluster uses the Moab control system and the qsub scripts look like this:
   #!/bin/bash
   #PBS -r n
   #PBS -N Capitella_proteins_pfam96
   #PBS -o Capitella_proteins_pfam96.stdout
   #PBS -e Capitella_proteins_pfam96.stderr
   #PBS -q bio
   #PBS -m a
   #PBS -M morozhpc@gmail.com,plw1080@gmail.com
   #PBS -l walltime=48:00:00
   #PBS -l pmem=500mb
   #PBS -l nodes=1:ppn=1
   module load hmmer
   pfam_scan.pl -dir /scratch/lfs/moroz/pfam/ -cpu 1 -fasta /scratch/lfs/moroz/Capitella_proteins_pfam20131028171739/Capitella_proteins_project_pfam_96.fasta

The above script is created automatically by the autonomics pipeline code; it gets its values, such as the email addresses, from the settings.py file you edited above. You do not need to do anything here; we show it only for your information. However, if your cluster does not use qsub scripts to run batch jobs (most clusters do), edit the code in jobs.py so that it creates submission scripts appropriate for your HPC cluster.

---------------------
STARTING THE PIPELINE
---------------------

After installing all of the above applications and modifying the settings.py file, start the pipeline by executing start.auto in the autonomics directory. You will be asked to give a log file name. This starts the dispatcher and manager loops, which run continuously in the background and control the pipeline. These processes monitor the MySQL database for jobs to run, and the local machine and remote cluster for job completion. If you wish to use the data gremlin, then also execute "python data_gremlin.py &".

For example, "autonomics/start.auto FOO" results in:

   % ls autonomics/*log*
   dlogFOO    (the log from the dispatcher)
   mlogFOO    (the log from the manager)

   % ps -U morozgroup -u morozgroup u
   USER PID   %CPU %MEM VSZ      RSS     TTY STAT START TIME COMMAND
   501  7611  0.0  0.4  245608   24916   ?   S    06:20 0:00 python dispatcher.py
   501  7613  0.0  0.4  251328   26392   ?   S    06:20 0:07 python manager.py
   501  25264 0.0  74.9 10685964 4490928 ?   Sl   Nov12 4:14 redis-server ./redis.conf

If you need to start the pipeline again, BE SURE to run the 'ps' command and then 'kill' ALL the dispatcher and manager processes that are running; there may be more than one of each. If you need help from us, we will need to look at your log files.

----------------
THE DATA GREMLIN
----------------

The data-gremlin unit listens on the sequencing server and, when a project is completed, starts the project on its journey through the pipeline. In that journey the reads are assembled and then annotated using blast (NR and SwissProt) and hmmer for the Pfams. Based on the SwissProt results, the Gene Ontology (GO) and KEGG results are computed. In addition, quantification (abundances) is computed using bowtie. At our lab we have chosen not to use the data-gremlin, as the lead scientist wishes to evaluate the quality and validity of sequenced projects before assembling/annotating them. In addition, the default names given to a project by the sequencer can be meaningless, e.g. R_2013_09_26_14_06_58_user_PRO-23; in the curation process, meaningful names are assigned to the projects. This avoids clogging up the pipeline with bad data and annotating data with meaningless project names.

----------------------------------------------------
HOW TO MANUALLY INSERT A PROJECT INTO THE PIPELINE
----------------------------------------------------

For the reasons given above, we manually insert projects into the pipeline rather than using the data_gremlin.
To start a job manually, first create a directory in the pipeline directory (the directory you specified as "proj_dir" in the settings.py file) whose name is the name of the project, <project>, and then copy the reads into that directory, naming them <project>.fastq, or in the case of paired-end reads, <project>.fastq and <project>.fastq.end2. This naming convention MUST be strictly and exactly followed. The pipeline will automatically name the assembled data <project>_project.fasta. If you wish to annotate already-assembled data, e.g. gene models, you MUST name the already-assembled data (whether NT or AA) <project>_project.fasta in the same way. The pipeline directory must look like this:

   <projectA>/<projectA>.fastq                (ordinary reads)
   <projectB>/<projectB>.fastq                (paired-end reads)
   <projectB>/<projectB>.fastq.end2
   <projectG>/<projectG>_project.fasta        (already assembled data or gene models)

where <projectA> might be Aplysia_californica_gastrulation, for example. The use of fairly short but meaningful names is important.

Once the data to be assembled/annotated is in the pipeline directory as described above, manually initiate the assembly/annotation of a project by calling annotate.proj.pl (in the autonomics/bin directory). A complete end-to-end example is given at the end of this section.

annotate.proj.pl usage:
   -proj <project_name_with_path>   (REQUIRED)
   -mira                            (OPTIONAL, default is Trinity)
   -paired_end                      (OPTIONAL, default is non-paired-end)
   -noass                           (OPTIONAL, use when the assembly step is to be omitted)
   -data [ NT | AA ]                (OPTIONAL; REQUIRED if -noass is used)

The -noass case can be used for annotating gene models. You may not submit a project more than once with the same name. If you do need to annotate a project a second time, change its name AND the name(s) of the reads or the _project.fasta file.

To specify even finer control of the arguments and applications used during the annotation/assembly process, the script systemtools.py can be used instead of annotate.proj.pl. The interface to systemtools.py is complex; see "python systemtools.py --help" (systemtools.py is in the autonomics/scripts directory). 99% of the time we use the simple annotate.proj.pl to initiate jobs; when you execute it, you will see how it in turn invokes systemtools.py, for example:

   % annotate.proj.pl -proj /data/pipeline/aplysia.brain.rear
   python /home/morozgroup/autonomics/scripts/systemtools.py --add-project --assign-workflow --add-jobs adapter_trim quality_trim read_normalization assemble quantification blast_nr blast_swissprot pfam kegg go --set-config adapter_trim:+ quality_trim:+ read_normalization:+ assemble:+ quantification:+ blast_nr:+ blast_swissprot:+ pfam:+ kegg:+ go:+ --project-names aplysia.brain.rear
   job submitted successfully

It is essential that the submission ends with: job submitted successfully.

When a project is started with annotate.proj.pl (or systemtools.py), the data for the project is entered into the MySQL database in the pn_mapping and jn_mapping tables. The manager and dispatcher processes described above then check the database and the progress of the various running jobs once a minute. Any jobs found are queued for starting if their dependencies are satisfied. The number of jobs of each job type is limited, e.g. only two blast NR jobs can run at once to avoid overloading the HPC cluster queue (this limit can be changed in the settings.py file).
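Putting the above together, a complete manual submission looks like the following. The project name and the source read file are hypothetical, and this assumes $PIPELINE_PATH (set in the QUICK START section) points at the same pipeline directory as proj_dir in settings.py:

   % mkdir $PIPELINE_PATH/Aplysia_californica_gastrulation
   % cp /path/to/raw_reads.fastq $PIPELINE_PATH/Aplysia_californica_gastrulation/Aplysia_californica_gastrulation.fastq
   % annotate.proj.pl -proj $PIPELINE_PATH/Aplysia_californica_gastrulation
   ...
   job submitted successfully

For paired-end reads, also copy the second read file to Aplysia_californica_gastrulation.fastq.end2 and add the -paired_end flag to the annotate.proj.pl call.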
The dependencies are given in the dependency table:

   mysql> select * from dependency;
   +--------------------+--------------------+
   | job_type           | depends_on         |
   +--------------------+--------------------+
   | assemble           | adapter_trim       |
   | assemble           | quality_trim       |
   | assemble           | read_normalization |
   | blast_nr           | assemble           |
   | blast_swissprot    | assemble           |
   | go                 | assemble           |
   | go                 | blast_swissprot    |
   | kegg               | assemble           |
   | kegg               | blast_swissprot    |
   | pfam               | assemble           |
   | quality_trim       | adapter_trim       |
   | quantification     | assemble           |
   | read_normalization | adapter_trim       |
   | read_normalization | quality_trim       |
   +--------------------+--------------------+

By default, the jobs shown in the default_configuration table are run for a project:

   mysql> select * from default_configuration;
   +--------------------+------+
   | job_type           | code |
   +--------------------+------+
   | adapter_trim       | +    |
   | assemble           | +    |
   | blast_nr           | +    |
   | blast_swissprot    | +    |
   | go                 | +    |
   | kegg               | +    |
   | pfam               | +    |
   | quality_trim       | +    |
   | quantification     | +    |
   | read_normalization | +    |
   +--------------------+------+

-----------------------------------
CHECKING STATUS OF JOBS IN PIPELINE
-----------------------------------

The script autonomics/bin/pipe_status allows you to see the status of jobs in the autonomics pipeline.

   % ./bin/pipe_status
   Usage: ./bin/pipe_status <cmd pa | pn | pname | pid | jobs | q | run | nf | nfns | snf> <arg1 optional>

   pipe_status pa              : prints <project_id> <project_name> for all projects in the database, finished and unfinished, sorted alphabetically
   pipe_status pn              : prints <project_id> <project_name> for all projects in the database, finished and unfinished, sorted by project_id
   pipe_status pname <proj_id> : prints the <project_name> for that <project_id>
   pipe_status pid <proj_name> : prints the <project_id> for that <project_name>
   pipe_status jobs <proj_id>  : prints the status of all jobs for that <project_id>
   pipe_status q               : prints jobs in the queue
   pipe_status run             : prints all jobs that are currently running
   pipe_status nf              : prints all jobs that are not finished
   pipe_status snf             : prints all jobs that are started but not finished
   pipe_status nfns            : prints all jobs that are not finished and not yet started

For example:

   % pipe_status pid Strombus_gigas_CNS_mix_trans
   project_id  project_name
   312         Strombus_gigas_CNS_mix_trans

   % pipe_status jobs 312
   job_id  project_id  job_type      job_name                                   q_ts                 started  s_ts                 finished  f_ts                 queued
   242     312         adapter_trim  Strombus_gigas_CNS_mix_trans_adapter_trim  2013-04-18 01:03:02  Y        2013-04-18 01:04:23  Y         2013-04-18 01:04:24  Y
   244     312         assemble      Strombus_gigas_CNS_mix_trans_assemble      2013-04-18 01:03:02  Y        2013-04-18 08:19:48  Y         2013-04-18 13:35:45  Y
   ..
   ..

   % pipe_status nf
   job_id  project_id  job_type  job_name          q_ts                 started  s_ts                 finished  f_ts                 queued
   2178    549         blast_nr  PAIRED5_blast_nr  2013-11-11 03:00:24  Y        2013-11-11 10:05:52  N         0000-00-00 00:00:00  Y

These are all just shortcuts to accessing the MySQL database. If you know how to use mysql, you may query the database directly instead.
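For example, a direct query roughly equivalent to "pipe_status jobs 312" might look like the one below; the table and column names are assumed to match the jn_mapping table and the pipe_status output shown above, so adjust them if your schema differs:

   % mysql -u zeroclick -p zero_click -e "select job_id, job_type, started, finished from jn_mapping where project_id = 312;"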
-----------
THE RESULTS
-----------

The autonomics pipeline puts the assembled data and annotations in the <project> directory. For example, here is the data from a completed project:

   % ls -1 Helix_lucorum_Control/
   Helix_lucorum_Control.fastq                  (the original reads)
   Helix_lucorum_Control_project.fasta          (the assembled data)
   Helix_lucorum_Control_blast_nr.txt
   Helix_lucorum_Control_blast_swissprot.txt
   Helix_lucorum_Control_gocats.txt
   Helix_lucorum_Control_GO.txt
   Helix_lucorum_Control_KEGG.txt
   Helix_lucorum_Control_pfam.txt
   Helix_lucorum_Control_quantification.txt

The gocats file is a GO category file containing 1) the number of unique sequences placed into each GO category, 2) the number of unique annotations present in each category, and 3) the abundance of all sequences placed into that category. The above files are the input to the neurobase comparative browser, described in the paper YYYYYY.

We recommend checking the results when all jobs for a project are completed, using the script autonomics/bin/check_results:

   % check_results Helix_lucorum_Space
   Helix_lucorum_Space NUM_FA_SEQS 7454
   NR: SEQS_TRIED 7454 SEQS_W_HITS 881 (11%) SEQS_WO_HITS 6573 (88%)
   SP: SEQS_TRIED 7454 SEQS_W_HITS 415 (5%) SEQS_WO_HITS 7039 (94%)
   GO: 6603
   gocats: 1926
   KEGG: 405
   pfam: 4734
   quantification: 6916

The first three numbers (NUM_FA_SEQS and the two SEQS_TRIED values) should always be equal, 7454 in this case, if the two blasts ran correctly. If either the NR or the SP count is less than the total number of sequences in the <project>_project.fasta file, that blast job is incomplete and you need to deal with it as described in the next section.

--------------------------------
DEALING WITH PROBLEMS ON CLUSTER
--------------------------------

When running hundreds of jobs concurrently on a large shared cluster, as the autonomics pipeline does, occasionally a job may fail because of a node failure or some other hardware problem, or, rarely, because the qsub did not ask for enough wall time. When this happens it usually involves only one job out of the entire run. For example, when running blast_nr for a project, the data is split into 100 parts and a separate job is run for each part. Note that the term job is used in two senses: there is, e.g., the blast_nr job for a project; however, that blast job is divided into 100 sub-jobs that run in parallel, and these sub-jobs are also referred to as jobs.

The procedure described below assumes some computational sophistication on the part of the user. If this is not the case, the easiest, but not the fastest, solution is simply to resubmit the entire blast or pfam job. You can do this by executing "stop_job.py <project_name> <job_type>" and then "restart_job <project_name> <job_type>". In addition there is a script stop_project. Do not restart go or kegg jobs until you are certain that blast_swissprot has completed for that project, as the go and kegg computations need the blast_swissprot results. Also, do not restart quantification unless you have reads, as quantification requires them; this applies to gene models, for example.

When getting started you may need to experiment with a few runs to see how much wall time to give your jobs on the cluster, as clusters vary in performance. You modify the wall time in settings.py. After modifying the settings file, you will need to stop the pipeline by killing all processes running the dispatcher and the manager (there may be more than one of each; use the ps and kill -9 commands), then restart the pipeline using start.auto. You may also need to restart any jobs (using restart.py) that were running prior to this.
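For example, the resubmit-everything route for an incomplete blast_nr job, followed by a pipeline restart after a settings.py change, might look like this. The project name is one used earlier in this README; the pids, user name and log name are placeholders (check autonomics/bin for the exact script names on your installation):

   % stop_job.py Helix_lucorum_Space blast_nr
   % restart_job Helix_lucorum_Space blast_nr

   % ps -U morozgroup -u morozgroup u | grep -E 'dispatcher|manager'
   % kill -9 <every dispatcher.py and manager.py pid listed>
   % autonomics/start.auto FOO2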
If a problem with a job is encountered, the qsub script directs the cluster to send an email to the address specified in the settings.py file. But clusters are not perfect, so if a problem is suspected, log in to the cluster and run the qstat command to see the status of all your jobs. You can also go to the remote_dir specified in the settings.py file and then to the directory with your run data, e.g. Capitella_proteins_blast_nr20131028171733/. A done flag is set for each job that has completed, e.g. done.Capitella_proteins_blast_nr21. If you count these, for example for blast_nr jobs, there should be 100 dones if all jobs completed (a condensed command sequence is given at the end of this section). If a job is missing, see if it is still running using qstat; if not, look in its stderr file to see what the problem is. If necessary you can rerun any failed job by executing, e.g., "qsub Capitella_proteins_blast_nrXX.qsub", where XX is the number of the job that failed.

For successfully completed jobs, all jobs except the last one (job 100 in the blast_nr case) will have the same number of sequences blasted. Use autonomics/bin/countblasts to verify this:

   gator1.ufhpc{plw1080}27: countblasts Capitella_proteins_blast_nr 1 100 nr
   Capitella_proteins_blast_nr1_blast_nr.txt 325
   Capitella_proteins_blast_nr2_blast_nr.txt 325
   Capitella_proteins_blast_nr3_blast_nr.txt 325
   Capitella_proteins_blast_nr4_blast_nr.txt 325
   Capitella_proteins_blast_nr5_blast_nr.txt 325
   Capitella_proteins_blast_nr6_blast_nr.txt 325
   Capitella_proteins_blast_nr7_blast_nr.txt 325
   ..

In the rare case of a failed job on the cluster, after rerunning it to completion you can merge all the blast outputs using catall_blasts; for example, the command "catall_blasts Capitella_proteins NR 100" will create one file, Capitella_proteins_blast_nr.txt, with the merged results. You can then scp that file to your project directory on your local machine.

If the problem encountered was that some jobs did not have enough wall time, edit the relevant qsub files to increase the wall time and qsub them again. If all the qsub files need an increase, use sed with a command something like this:

   sed -i 's/old-wallTime/new-wallTime/g' *.qsub

This may be an indication that you need to increase the wall time in the settings.py file; see the section THE INSTALLATION above.
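Condensed, the checking procedure above looks like the following for a blast_nr run; the run directory, the failed part number (37) and the user name are placeholders:

   # on the HPC cluster
   % cd <remote_dir>/Capitella_proteins_blast_nr20131028171733/
   % ls done.* | wc -l                           # should print 100 if every blast_nr part finished
   % qstat -u <your_hpc_username>                # are any parts still running?
   % cat Capitella_proteins_blast_nr37.stderr    # inspect the failed part's stderr
   % qsub Capitella_proteins_blast_nr37.qsub     # resubmit just that part
   % catall_blasts Capitella_proteins NR 100     # once all 100 parts are done, merge the outputs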
------------------------------------------------------------
GETTING HELP IN CASE OF TROUBLE WITH THE AUTONOMICS PIPELINE
------------------------------------------------------------

If you have carefully followed the installation instructions above and still run into problems, send an email to plw1080@gmail.com. We will need a record of all the steps you took to install and set up the autonomics pipeline, plus the two runtime log files (the dispatcher and manager logs described above).

------------------------------------------
DE-NOVO PREDICTION OF SECRETORY MOLECULES
------------------------------------------

Code for de-novo prediction of secretory molecules from proteomic data is included in the autonomics distribution and can be found in the directory autonomics/prediction. The prediction is based on the results from several heuristic tools, including SignalP4, TMHMM and Phobius, which in turn are based on a variety of universal protein motifs, including the presence or absence of N-terminal signal peptides (SP), transmembrane domains (TM), and other sublocalization motifs.

You need /lib/ld-linux.so.2 on your machine in order to run the prediction code. If you don't have it, you can get it with "sudo yum install /lib/ld-linux.so.2" (you may need your sysadmin to do this if you don't have root privileges).

The input to the code is a file of proteins in fasta format. The input file must be put in the relevant project directory in the pipeline directory and given the name <proj_name>_proteins.fasta, like this:

   <pipeline_dir>/<proj_name>/<proj_name>_proteins.fasta

Start the prediction by invoking:

   python <path.to>/autonomics/prediction/predicts.py [--max_threads <n>] \
       [--sp_cutoff <float 0.0 .. 1.0>] <proj_name> >& log.pred.<proj_name>.<sp_cutoff> &

--sp_cutoff defaults to 0.5 if not specified; this is the SignalP4 score for discriminating signal peptides from transmembrane regions. --max_threads defaults to 1 if not specified and must be 1 or a multiple of 3. You may also specify the args:

   --use_tmhmm  ('0 or 1', default=1)   # use TMHMM to predict transmembrane domains
   --use_phobtm ('0 or 1', default=1)   # use Phobius to predict transmembrane domains
   --use_phobsp ('0 or 1', default=1)   # use Phobius to predict signal peptides

You do not normally use these args; they are for experimentation with the code.

The final output is in two files:

   [file1] <proj_dir>/<proj_name>/<proj_name>_peptide_prediction
   [file2] <proj_dir>/<proj_name>/<proj_name>_peptides_<sp_cutoff_value>.fasta

A line of [file1] looks like this:

   jgi|Lotgi1|54396|gw1.1.87.1,0.099,0,0,0,-

and is interpreted as follows. The first item in the comma-separated output is the sequence name; the next items are:

   [A] the SignalP4 confidence score, a floating point number between 0 and 1
   [B] the number of transmembrane domains predicted by TMHMM
   [C] the number of transmembrane domains predicted by Phobius
   [D] whether Phobius predicts the presence of a signal peptide (0 or 1)
   [E] a '+' or '-' indicating that the combined score predicts the protein is / is not a secretory molecule

For the combined score [E] to be '+', [A] must be > sp_cutoff, and [B] must be 0 unless ([C] = 0 and [D] = 1). (A one-line rendering of this rule is sketched at the very end of this README.) The reasoning is as follows: [B (TMHMM)] detects the widest range of transmembrane domains. However, signal peptides share many traits with transmembrane domains, and so can be misclassified. Phobius is specifically designed to distinguish between TM domains and signal peptides, so if [B (TMHMM)] says there is a TM domain but Phobius [C (PhobiusTM)] says there is no TM domain and [D (PhobiusSP)] indicates there is a signal peptide, we assume that [B (TMHMM)] misclassified the protein and ignore its result.

The subset of input proteins whose combined score is '+' is written to [file2] in fasta format. The total time to run the peptide prediction is given at the end of the log file you created above. If for any reason you abort a predicts.py run or it fails to run to completion, you must delete the following directory before running predicts.py again:

   <proj_dir>/<project_name>/prediction

Be careful not to remove the wrong directory. You only need to run predicts.py once for a given project. After that, to get a prediction for a different SignalP cutoff, use the script parse_preds.py as follows:

   python parse_preds.py --sp_cutoff <new_cutoff> <proj_name>

This will create a file: <pipeline>/<proj_name>/<proj_name>_<cutoff>.fasta

For more information on the neuropeptide prediction code, please contact: David Girardo dogirardo@WPI.EDU
For assistance with the autonomics pipeline code, contact: Peter Williams plw1080@gmail.com
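For reference, the combined-score rule described in the prediction section can be written as a one-line filter over [file1]. This is only a minimal sketch, assuming the comma-separated field order shown above and an sp_cutoff of 0.5; it is not a substitute for parse_preds.py:

   % awk -F, '$2 > 0.5 && ($3 == 0 || ($4 == 0 && $5 == 1)) {print $1}' <proj_name>_peptide_prediction

This prints the names of the sequences that would receive a '+' combined score.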