Proteoformer

A proteogenomic pipeline that delineates true in vivo proteoforms and generates a protein sequence search space for peptide to MS/MS matching.

Table of contents

  1. Introduction
  2. Dependencies
  3. Preparations
    1. iGenomes reference information download
    2. Ensembl download
    3. UTR simulation for Prokaryotes
    4. SRA parallel download
  4. Main pipeline
    1. General quality check: fastQC
    2. Mapping
    3. General quality check: fastQC
    4. Specific ribosome profiling quality check: mappingQC
    5. Transcript calling
      1. Rule-based transcript calling
      2. RiboZINB
    6. ORF calling
      1. PROTEOFORMER classic ORF calling
        1. TIS calling
        2. SNP calling
        3. ORF assembly
      2. PRICE
      3. SPECtre
      4. Analysis ID overview table
    7. FASTA file generation
  5. Optional steps
    1. ORF-based counts
    2. FLOSS calculation
    3. Feature summarization
    4. Database combinations
      1. Different analysis IDs
      2. With UniProt
  6. MS validation
    1. SearchGUI and PeptideShaker
    2. MaxQuant
  7. Pipeline master script
  8. Copyright
  9. Publications
  10. More information

Introduction

PROTEOFORMER is a proteogenomic pipeline that delineates true in vivo proteoforms and generates a protein sequence search space for peptide-to-MS/MS matching. It can be combined with canonical protein databases or used independently for the identification of novel translation products. The pipeline makes use of the recently developed next-generation sequencing strategy termed ribosome profiling (RIBO-seq), which provides genome-wide information on protein synthesis in vivo. RIBO-seq is based on the deep sequencing of ribosome-protected mRNA fragments. It allows mapping the location of translating ribosomes on mRNA with sub-codon precision; it can indicate which portion of the genome is actually being translated at the time of the experiment, and it can account for sequence variations such as single nucleotide polymorphisms and RNA splicing.

The pipeline

  • aligns your ribosome profiling data to a reference genome
  • checks the quality and general features of these alignments
  • searches for translated transcripts
  • searches for all possible proteoforms in these transcripts
  • constructs counts for different feature levels and calculates FLOSS scores
  • constructs fasta files which allow mass spectrometry validation

Most modules of this pipeline are provided with a built-in help message. Execute the script of choice with -h or --help to print the full help message in the command line.
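
For example, to print the help message of the mapping module:

perl mapping.pl --help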

PROTEOFORMER is also available for Galaxy environments. Galaxy files (tool XML wrappers, general tool and tool data config XML files and LOC files) are available: the tool-specific XML files are present in the directories of the different modules in this GitHub repo, and the general Galaxy config files are present in the 'Galaxy files' folder of this GitHub repo.

We set up our own Galaxy environment at: http://galaxy.ugent.be

Dependencies

PROTEOFORMER is built in Perl 5, Python 2.7 and Bash. All necessary scripts are included in this GitHub repository. Some parts have also been updated to Python 3.10.

To prevent problems with missing dependencies, we included all necessary dependencies in a Conda environment. For more information about installing Conda, see the Conda documentation.

Once Conda is installed, make sure to get the right channel order by executing the following commands in the same order as listed here:

conda config --add channels gtcg
conda config --add channels r
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda

Then you can install all dependencies based on the YML file in the Dependency_envs folder of this GitHub repository with the following command:

conda env create -f Dependency_envs/proteoformer.yml

This installs a new Conda environment in which all needed Conda dependencies are installed and available, including Perl and Python. Unless mentioned otherwise, all tools of the PROTEOFORMER pipeline should be executed in this environment. To activate this new Conda environment:

source activate proteoformer

Some Perl packages are not included in Conda, so after installation and first activation of the new environment, execute the following script:

perl install_add_perl_tools.pl

If you want to exit the proteoformer Conda environment:

source deactivate
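
To check at any time which Conda environments are installed on your system:

conda env list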

For running this pipeline with Python 3.10, we advise using Mamba to install the tool. Mamba is very similar to Conda but performs much faster; more info can be found in the Mamba documentation. To install the dependencies for the Python 3.10 version of PROTEOFORMER, run the following command:

mamba env create -f Dependency_envs/proteoformer_py3.yml

Then, use the following environment instead of the default proteoformer environment for running tools:

mamba activate proteoformer_py3

To exit this environment:

mamba deactivate

Additional environments for RiboZINB, SPECtre, PRICE and SRA download

For some tools, we needed to construct separate environments with different versions of the underlying tools. For all the other tools, the proteoformer environment is used.

RiboZINB

conda env create -f Dependency_envs/ribozinb.yml
source activate ribozinb

SPECtre

conda env create -f Dependency_envs/spectre.yml
source activate spectre

PRICE

conda env create -f Dependency_envs/price.yml
source activate price

SRA download

conda env create -f Dependency_envs/download_sra_parallel.yml
source activate download_sra_parallel

Preparations

iGenomes reference information download

Mapping is done based on reference information in the form of iGenomes directories. These directories can easily be downloaded and constructed with the get_igenomes.py script in the Additional_tools folder. For example:

python get_igenomes.py -v 92 -s human -d /path/to/dir -r -c 15

Input arguments:

Argument Default Description
-d / --dir Mandatory Directory wherein the igenomes tree structure will be installed
-v / --version Mandatory Ensembl annotation version to download (Ensembl plant (for arabidopsis) has separate annotation versions!)
-s / --species Mandatory Specify the desired species for which gene annotation files should be downloaded
-r / --remove If any, overwrite the existing igenomes structure for that species
-c / --cores Mandatory The number of cores that will be used for downloading chromosome files (do not use more than 15 cores, as the download server can only establish 15 connections at once)
-h / --help Print this useful help message

The tool currently supports following species:

Species Input value species argument
Homo sapiens human
Mus musculus mouse
Rattus norvegicus rat
Drosophila melanogaster fruitfly
Saccharomyces cerevisiae yeast
Danio rerio zebrafish
Arabidopsis thaliana arabidopsis
Caenorhabditis elegans c.elegans
Salmonella enterica subsp. enterica serovar Typhimurium str. SL1344 SL1344
Mycobacterium abscessus atcc 19977 MYC_ABS_ATCC_19977

There is also an analogous Python 3.10 version of this step:

python get_igenomes_py3.py -v 92 -s human -d /path/to/dir -r -c 15

Ensembl download

After mapping, reference annotation from Ensembl is mostly used (exons, splicing, canonical translation initiation, ...). This information is available as an SQLite database and can be downloaded using the ENS_db.py script in the Additional_tools folder. For example:

python ENS_db.py -v 92 -s human

Input arguments:

Argument Default Description
-v / --version Mandatory Ensembl annotation version to download (supported versions: from 74)
-s / --species Mandatory Specify the desired species for which gene annotation files should be downloaded
-h / --help Print this useful help message

Currently supported species:

Species Input value species argument
Homo sapiens human
Mus musculus mouse
Drosophila melanogaster fruitfly
Saccharomyces cerevisiae yeast
Caenorhabditis elegans c.elegans
Salmonella enterica subsp. enterica serovar Typhimurium str. SL1344 SL1344
Mycobacterium abscessus atcc 19977 MYC_ABS_ATCC_19977

There is also a Python 3.10 version of this step:

python ENS_db_py3.py -v 92 -s human

UTR simulation for Prokaryotes

For prokaryotes, no annotated untranslated regions (UTRs) exist. However, the offset callers used during mapping need these regions in order to calculate P-site offsets. Therefore, for prokaryotes, these UTRs need to be simulated with the simulate_utr_for_prokaryotes.py script in the Additional_tools folder. For example:

python simulate_utr_for_prokaryotes.py igenomes/SL1344/Ensembl/ASM21085v2/Annotation/Genes/genes_32.gtf > igenomes/SL1344/Ensembl/ASM21085v2/Annotation/Genes/genes_32_with_utr.gtf
# Move and copy GTFs
mv igenomes/SL1344/Ensembl/ASM21085v2/Annotation/Genes/genes_32.gtf igenomes/SL1344/Ensembl/ASM21085v2/Annotation/Genes/genes_32_without_utr.gtf
cp igenomes/SL1344/Ensembl/ASM21085v2/Annotation/Genes/genes_32_with_utr.gtf igenomes/SL1344/Ensembl/ASM21085v2/Annotation/Genes/genes_32.gtf

This outputs a new GTF file. It is best to rename the old GTF file and copy the new one under the name of the original GTF file, as shown in the example.

Additional documentation can be found in the help message of the module.

SRA parallel download

If you download the raw data (FASTQ) from SRA, you can use the SRA toolkit. However, we made a module that speeds up the download by using multiple cores of your system to fetch multiple files at once. Use the specific environment for this tool when running this module. For example:

source activate download_sra_parallel
./download_sra_parallel.sh -c 20 -f 3034567 -l 3034572 #This downloads all fastq data from SRR3034567 up until SRR3034572 on 20 cores

Main pipeline

General quality check: fastQC

Before starting the analysis, we believe it is useful to check the general features and quality of the raw data. This can be done by running FastQC. FastQC is included in the proteoformer Conda environment, so you do not need to download it separately.

For example, to run it on 20 cores:

fastqc yourdata.fastq -t 20

This generates an HTML output report with figures for assessing the quality and general features, and a ZIP file (for exporting the results to another system). More information about the tool can be found on the FastQC help pages; a tutorial video explaining what to expect from the figures is available on the FastQC project website.

Mapping

The first big task of the pipeline is mapping the raw data on the reference genome. All reference data should be downloaded as an iGenomes folder.

First, bad and low-quality reads are filtered out using the FastX toolkit. Adapters can also be clipped off, using the FastX clipper (recommended) or the clipper included in STAR (faster).

Mapping can be done using STAR, TopHat or Bowtie. Bowtie is less preferable as it does not take splicing into account. Before mapping against the genome, it is possible to filter out contaminants (PhiX, rRNA, sn(o)RNA and tRNA) with the same mapping tool you chose for the genomic mapping.

After mapping, SAM and BAM files with the aligned data are obtained. Plastid can be used to determine the P-site offsets per RPF length. These offsets allow pinpointing all reads to an exact base position, as explained in the Plastid documentation, and they are very important further down the pipeline to assign reads to the correct base position.

Next, these offsets are used to parse the alignment files into count tables. These count tables are placed in an SQLite results database, together with the mapping statistics and the run arguments. During all following steps of the pipeline, all results are stored in this database and are available for consultation. We recommend using the sqlite3 command-line shell for easy consultation of this database; more information can be found on the SQLite website. Use the following command to start up sqlite3:

sqlite3 SQLite/results.db
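
From the shell or directly from Bash, you can then list the available tables and inspect them. For example (the arguments table name is an assumption based on the output overview further below):

sqlite3 SQLite/results.db ".tables"
sqlite3 SQLite/results.db "SELECT * FROM arguments;"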

For visualization, BedGraph files are generated. These can be used on different genome browsers like UCSC.
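
To load a BedGraph file into the UCSC genome browser, a standard UCSC track definition line can be prepended, for example (default file name taken from the mapping argument table below):

(echo 'track type=bedGraph name="untreat_sense" description="untreated sense RPF counts"'; cat untreat_sense.bedgraph) > untreat_sense_ucsc.bedgraph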

The mapping can be run as in one of the following examples:

#Combination of untreated and treated data
perl mapping.pl --inputfile1 link/to/your/untr_data.fq --inputfile2 link/to/your/tr_data.fq --name my_experiment --species human --ensembl 92 --cores 20 --unique Y --igenomes_root /user/igenomes --readtype ribo

#Only untreated data with fastx clipping
perl mapping.pl --inputfile1 link/to/your/untr_data.fq --name my_experiment --species human --ensembl 92 --cores 20 --unique Y --igenomes_root /user/igenomes --readtype ribo_untr --clipper fastx --adaptor AGATCGGAAGAGCACAC

#To automatically determine P site offsets with Plastid and parse the results into count tables
perl mapping.pl --inputfile1 link/to/your/untr_data.fq --inputfile2 link/to/your/tr_data.fq --name my_experiment --species human --ensembl 92 --cores 20 --unique Y --igenomes_root /user/igenomes --readtype ribo --suite plastid

#With extra prefiltering, clipping, P site offset calling, count table parsing, extra file generation
perl mapping.pl --inputfile1 link/to/your/untr_data.fq --inputfile2 link/to/your/tr_data.fq --name my_experiment --species human --ensembl 92 --cores 20 --unique Y --igenomes_root /user/igenomes --readtype ribo --clipper fastx --adaptor CTGTAGGCACCATCAAT --phix Y --rRNA Y --snRNA Y --tRNA Y --rpf_split Y --pricefiles Y --suite plastid

Input arguments:

Argument Default Description
--inputfile1 Mandatory The fastq file of the 'untreated' data for RIBO-seq (no drug, CHX or EMT) or the first fastq for single/paired-end RNA-seq
--inputfile2 Mandatory for ribo and PE readtypes The fastq file of the treated data for RIBO-seq (PUR, LTM, HARR) or the second fastq for paired-end RNA-seq
--name Mandatory Name of the run
--species Mandatory Species, possibilities: Mus musculus (mouse), Rattus norvegicus (rat), Homo sapiens (human), Drosophila melanogaster (fruitfly), Arabidopsis thaliana (arabidopsis), Salmonella (SL1344), Danio rerio (zebrafish), Saccharomyces cerevisiae (yeast)
--ensembl Mandatory Ensembl annotation version
--cores Mandatory Available cores for the pipeline to run on
--unique Mandatory Y/N. Retain only unique mapped reads (Y) or also multimapped reads (N)
--igenomes_root Mandatory Root folder of igenomes folder structure
--readtype ribo The readtype, possibilities: RIBO-seq untreated and treated sample (ribo), RIBO-seq only untreated sample (ribo_untr), RNA-seq paired-end polyA enriched (PE_polyA), RNA-seq single-end polyA enriched (SE_polyA), RNA-seq paired-end total (PE_total) or RNA-seq single-end total (SE_total)
--adaptor CTGTAGGCACCATCAAT The adaptor sequence that needs to be clipped off by the clipper
--work_dir $CWD env setting Working directory
--tmp work_dir/tmp Temporary files folder
--min_l_plastid 22 Minimum RPF length for Plastid
--max_l_plastid 34 Maximum RPF length for Plastid
--offset_img_untr work_dir/plastid/run_name_untreated_p_offsets.png Path to save the offset image of plastid in for untreated sample
--offset_img_tr work_dir/plastid/run_name_treated_p_offsets.png Path to save the offset image of plastid in for treated sample
--min_l_parsing 26 (25 for fruitfly) Minimum length for count table parsing
--max_l_parsing 34 Maximum length for count table parsing
--out_bg_s_untr untreat_sense.bedgraph Output visual file for sense untreated count data
--out_bg_as_untr untreat_antisense.bedgraph Output visual file for antisense untreated count data
--out_bg_s_tr treat_sense.bedgraph Output visual file for sense treated count data
--out_bg_as_tr treat_antisense.bedgraph Output visual file for antisense treated count data
--out_sam_untr untreat.sam Output alignment file of untreated data
--out_sam_tr treat.sam Output alignment file of treated data
--out_bam_untr untreat.bam Binary output alignment file of untreated data
--out_bam_tr treat.bam Binary output alignment file of treated data
--out_bam_tr_untr untreat_tr.bam Binary output alignment file of untreated data in transcript coordinates (if tr_coord option selected)
--out_bam_tr_tr treat_tr.bam Binary output alignment file of treated data in transcript coordinates (if tr_coord option selected)
--out_sqlite work_dir/SQLite/results.db Output results SQLite database
--clipper none Whether and which clipper has to be used; possibilities: none, STAR, fastx, trimmomatic
--phix N Map to Phix data as prefilter before genomic mapping: Yes (Y) or No (N)
--rRNA Y Map to rRNA data as prefilter before genomic mapping: Yes (Y) or No (N)
--snRNA N Map to sn(-o-)RNA data as prefilter before genomic mapping: Yes (Y) or No (N)
--tRNA N Map to tRNA data as prefilter before genomic mapping: Yes (Y) or No (N)
--tr_coord N Generate alignment files based on transcript coordinates: Yes (Y) or No (N)
--truseq Y Whether strands (+ and -) are assigned as in TruSeq for RNAseq: Yes (Y) or No (N)
--mismatch 2 Alignment will be outputted only if it has fewer mismatches than this value
--maxmultimap 16 Alignments will be outputted only if the read maps fewer times than this value
--splicing Y Allow splicing for genome alignment: Yes (Y) or No (N)
--firstrankmultimap N Only retain the first ranked alignment when multimapping is allowed: Yes (Y) or No (N)
--rpf_split N Whether the program generates RPF specific bedgraph files: Yes (Y) or No (N)
--price_files N If the program needs to generate SAM files specifically for PRICE: Yes (Y) or No (N), choose Y if you plan to execute PRICE later on
--cst_prime_offset 12 The constant 5prime or 3prime offset when using those kind of offsets
--min_cst_prime_offset 22 The minimum RPF length for which a constant offset will be calculated
--max_cst_prime_offset 40 The maximum RPF length for which a constant offset will be calculated
--suite custom Options of how to run the different mapping modules (mapping, mapping_plastid, mapping_parsing) all together for ribo data: only execute mapping and startup plastid and parsing manually later (custom), mapping and parsing with standard offsets (standard), mapping with afterwards plastid offset calculation and parsing based on these offsets (plastid), use offsets with constant 5prime (cst_5prime), use offsets with constant 3prime (cst_3prime)
--suite_tools_loc work_dir The folder with the scripts of the subsequent mapping tools when using the plastid or standard suite
--help Generate help message

If you chose the custom suite, you can execute the Plastid and parsing parts of the mapping yourself after mapping.pl. This gives you the opportunity to test different offset sets. A tab-separated offset list can also be used in the parsing by inputting it as a txt file; in that way, you can let the program use your offsets of choice (a sketch of such an offset file is given below the commands).

Examples of how to run these separated programs:

#Plastid for untreated sample
perl mapping_plastid.pl --out_sqlite SQLite/results.db --treated untreated
#Plastid for treated sample
perl mapping_plastid.pl --out_sqlite SQLite/results.db --treated treated
#Parsing of all results into counts (mapping itself, Plastid untreated, Plastid treated)
perl mapping_parsing.pl --out_sqlite SQLite/results.db --offset plastid
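
A minimal sketch of parsing with your own offsets; the two-column, tab-separated layout of the offset files (RPF length, then P-site offset) is our assumption based on the description above, and the values are purely illustrative:

#Create tab-separated offset files (RPF length <TAB> P-site offset)
printf "26\t12\n27\t12\n28\t12\n29\t12\n30\t12\n" > offsets_untr.txt
printf "26\t12\n27\t12\n28\t12\n29\t12\n30\t12\n" > offsets_tr.txt
#Parse the counts with these custom offsets
perl mapping_parsing.pl --out_sqlite SQLite/results.db --offset from_file --offset_file_untr offsets_untr.txt --offset_file_tr offsets_tr.txt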

Input arguments of mapping_plastid.pl:

Argument Default Description
--work_dir CWD env setting The working directory
--tmp work_dir/tmp Folder to store the temporary files in
--out_sqlite work_dir/SQLite/results.db SQLite results database
--treated untreated Which sample to calculate offsets for (options: untreated/treated)
--offset_img work_dir/plastid/run_name_(un)treated_p_offsets.png P-site offsets output image path
--verbose N Use Y for debugging
--help Print help message

Input arguments of mapping_parsing.pl:

Argument Default Description
--work_dir CWD env setting The working directory
--tmp work_dir/tmp Folder to store the temporary files in
--out_sqlite work_dir/SQLite/results.db SQLite results database
--offset standard Offset option: use the standard offsets from Ingolia 2009 (standard), add offsets as a tab-separated txt file (from_file), use the offsets determined with Plastid (plastid), or use offsets with a constant 5prime (cst_5prime) or constant 3prime (cst_3prime) distance
--offset_file_untr mandatory if offset=from_file Offset input tab-separated txt file for untreated sample
--offset_file_tr mandatory if offset=from_file Offset input tab-separated txt file for treated sample
--help Print help message

Output:

After running the full mapping step, different tables will be generated in the SQLite results database.

A table of all inputted arguments:

Variable Value
run_name experiment1
... ...

A table with mapping statistics:

Sample type total mapped_U mapped_M mapped_T unmapped map_freq_U map_freq_M map_freq_T
experiment1.fastq1 genomic 31133026 17607183 1240756 18847939 12285087 0.565546792656775 0.039853369863554 0.3945998375403
... ... ... ... ... ... ... ... ... ...

A table with counts per base position for each sample:

chr strand start count
8 -1 116252 2.0
... ... ... ...

A table with counts per base position and per RPF length for each sample:

chr strand start RPF count
8 -1 116252 28 2.0
... ... ... ... ...

Furthermore, BED and BedGraph files are generated, which allow visualizing these counts in a genome browser.

Extra: to obtain BedGraph files normalized for the different library sizes, there is an extra tool in the Additional_tools folder. As input, it takes the original BedGraph files and the library sizes of both samples (i.e. the total number of reads mapped against the genomic reference, which can be found in the statistics table of the results database).

bash normBedgraph --untrs output/untreat_sense.bedgraph --untras output/untreat_antisense.bedgraph --nttrs output/treat_sense.bedgraph --nttras output/treat_antisense.bedgraph --libuntr 37873493 --libtr 45427218
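
The library sizes can, for instance, be looked up with a query like the following (table and column names taken from the mapping statistics overview above; treat the exact names as an assumption):

sqlite3 SQLite/results.db "SELECT sample, mapped_T FROM statistics WHERE type='genomic';"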

General quality check: fastQC

This step is quite similar to the first quality check with FastQC, although this time FastQC is run on the alignment files (SAM). This again generates HTML reports, but due to the adapter clipping and prefiltering during the mapping step, the quality will normally have improved remarkably.

fastqc output/untreat.sam -t 20
fastqc output/treat.sam -t 20
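
FastQC also accepts BAM files, so the binary alignment files can be checked in the same way:

fastqc output/untreat.bam -t 20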

FastQC is available in the Galaxy tool shed and can be added easily to the general workflow in Galaxy.

Specific ribosome profiling quality check: mappingQC

MappingQC generates figures which give a nice overview of the quality and general features of the aligned ribosome profiling data. More specifically, it gives an overview of the P-site offset calculation, the gene distribution and the metagenic classification. Furthermore, mappingQC does a thorough analysis of the triplet periodicity and the linked triplet phase (typical for ribosome profiling) in the canonical transcripts of your data. In particular, the links between the phase distribution and the RPF length, the relative sequence position and the triplet identity are examined. MappingQC is also available as a stand-alone tool, separate from PROTEOFORMER and its SQLite results structure, so that you can apply it to any SAM file you like, independent of the mapping source.

For PROTEOFORMER, mappingQC needs a SAM file (from one of both samples), an Ensembl database and the results database to run. It generates an HTML file which gives an overview of all generated figures. For exporting this report, a ZIP file is generated with the HTML and all figures included. The tool needs a tool directory with different background scripts. This directory (mqc_tools) is available in our GitHub repository under 2_mappingQC. Input the path to this directory in the --tool_dir argument.

Example:

perl mappingQC.pl --samfile output/untreat.sam --treated untreated --cores 20 --result_db SQLite/results.db --unique Y --ens_db ENS_hsa_92.db --offset plastid --plastid plastid/experiment1_untreated_p_offsets.png --tool_dir mqc_tools --plotrpftool pyplot3D

Input arguments:

Argument Default Description
--work_dir CWD env setting The working directory
--tmp work_dir/tmp The temporary files folder
--samfile Mandatory The SAM file to do the analysis for
--treated untreated Whether the SAM file is from the treated or untreated sample (possibilities: untreated/treated)
--cores 5 The amount of cores that can be used by the program
--result_db Mandatory The result database with all mapping results
--unique Y Whether to use only the unique alignments (Y/N) (has to be Y if the mapping was done uniquely)
--ens_db Mandatory The Ensembl database with annotation info
--offset standard The source of offsets: calculated with Plastid during mapping (plastid), standard offsets from Ingolia 2009 (standard), from an input file (from_file), constant 5 prime offsets (cst_5prime) or constant 3 prime offsets (cst_3prime)
--offset_file Mandatory if offset=from_file The offsets input file
--offset_img Mandatory if offset=plastid The offsets image Plastid generated during mapping
--output_folder work_dir/mappingQC_output The output folder for storing output files
--tool_dir work_dir/mqc_tools The directory with all necessary underlying tools
--plotrpftool grouped2D Module used for plotting the RPF-phase figure: Seaborn grouped 2D chart (grouped2D), mplot3d 3D bar chart (pyplot3D) or mayavi 3D bar chart (mayavi)
--html work_dir/mappingqc_out.html The output HTML file
--zip work_dir/mappingQC_(un)treated.zip The output ZIP file
--help This helpful screen

Caution: the mplot3d package in Pyplot is vulnerable to so-called Escher effects: sometimes certain boxes are plotted in the wrong z-order, although overall the figures are clear. On the other hand, the Mayavi package requires a graphical card, which is mostly not an option on servers.

There is a Python 3.10 version of the mappingQC step available. Example:

perl 2_mappingQC_py3/mappingQC.pl --samfile output/untreat.sam --treated untreated --cores 20 --result_db SQLite/results.db --unique Y --ens_db ENS_hsa_92.db --offset plastid --plastid plastid/experiment1_untreated_p_offsets.png --tool_dir mqc_tools --plotrpftool pyplot3D

Transcript calling

After checking the aligned data for quality and general features, you can search for the translated transcripts.

Rule-based transcript calling

A first way to determine these translated transcripts is based on general rules. Transcripts without RIBO-seq counts are ignored from the start. Then, for each exon of a transcript, the ribosome read counts are calculated and normalized by exon length. If the normalized count of an exon is 5 times lower than the mean normalized exon count of the given transcript, the exon is classified as a noise exon. For example, if the mean normalized exon count of a transcript is 2.0, every exon with a normalized count below 0.4 is flagged as noise. Transcripts with fewer than 15% noise exons are called as actively translated transcripts (exon_coverage = Yes in the output table).

An example of how to run this tool:

perl ribo_translation.pl --in_sqlite SQLite/results.db --out_sqlite SQLite/results.db --ens_db ENS_hsa_92.db

Input arguments:

Argument Default Description
--work_dir CWD env setting The working directory
--tmp work_dir/tmp The temporary files folder
--in_sqlite SQLite/results.db The SQLite results database from previous steps
--out_sqlite The same as the in_sqlite argument The SQLite results database to store all results in
--ens_db Mandatory The Ensembl database with annotation info
--help Generate help message

Output table structure in the SQLite results database:

transcript_id stable_id chr seq_region_id seq_region_strand seq_region_start seq_region_end read_counts normalized_counts biotype exon_coverage canonical ccds gene_stable_id FPKM coverage
196519 ENST00000371471 20 131538 -1 53567065 53593839 585 0.187680461982676 protein_coding Yes Yes CCDS13443 ENSG00000171940 10.6593122808274 0.6674393749734
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

RiboZINB

Another way to call translated transcripts is by using the RiboZINB tool, which makes use of a zero-inflated negative binomial model to determine actively translated transcript isoforms. RiboZINB is directly included in the PROTEOFORMER pipeline. The RiboZINB tool itself is still in beta; its statistics are still under development and validation.

The RiboZINB tool requires its own earlier installed Conda environment to run:

source activate ribozinb

An example of how to run this tool:

python RiboZINB.py -i SQLite/results.db

Input arguments:

Argument Default Description
-w/--work_dir CWD env setting The working directory
-x/--tmpfolder work_dir/tmp The temporary files folder
-i/--in_sqlite Mandatory The input SQLite results database
-o/--out_sqlite Same path as in_sqlite The output SQLite results database
-e/--ens_db Mandatory The Ensembl annotation database
-m/--mincount 5 The minimum reads for a transcript to be called
-n/--no_of_samples 30 The number of iterations when generating a negative set
-f/--fdr 0.05 The false discovery rate
-s/--default_score d Use the default score threshold (d) or estimate the threshold by performing a permutation test (p)
-v/--cutoff 0.1 The default score threshold
-a/--alpha 1 Proportion of noise when generating the negative set

Output table structure in the SQLite results database:

transcript_id stable_id chr seq_region_id seq_region_strand seq_region_start seq_region_end read_counts normalized_counts biotype exon_coverage canonical ccds gene_stable_id FPKM
173461 ENST00000314167 4 131552 -1 849278 932298 1513 0.340612336785 protein_coding NA Yes CCDS3340 ENSG00000178950 19.3450784708
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

ORF calling

Based on the translated transcripts, you can search for candidate translation products. Three methods are currently implemented: PROTEOFORMER classic ORF calling, PRICE and SPECtre. For an overview of all analyses you performed, an overview table of all analysis IDs can be generated.

PROTEOFORMER classic ORF calling

A first method to determine the candidate translation products based on ribosome profiling is the PROTEOFORMER classic method. This method is rule-based and therefore does not work with a score. However, it was developed with subsequent MS validation in mind: by default it works quite unstringently, allowing stringent filtering later on during MS. As this method is based on TIS calling, it needs an initiation ribosome profile to work (i.e. a second sample treated with LTM or HARR). The classic ORF calling can be separated into three subsequent steps: TIS calling, SNP calling and ORF assembly. The SNP calling step is optional.

TIS calling

The first step of this method determines the translation initiation sites based on the initiation profile (HARR/LTM). Only (near-)cognate start codons are considered as possible translation initiation sites (TISs); this means that a codon can differ in at most one base from the canonical 'ATG' start codon to be considered as a possible TIS. Furthermore, a local maximum in the initiation profile counts needs to be observable within the inputted local_max range. The TIS peak also needs to contain more counts than the min_count parameter of its annotation class, and the RLTM - RCHX value needs to be above the threshold set for its annotation class (RLTM and RCHX are the normalized read densities at the candidate TIS in the initiation (LTM/HARR) and elongation (CHX) samples, respectively).

An example of how to run this tool:

perl TIScalling_categorised.pl --sqlite_db SQLite/results.db

Input arguments:

Argument Default Description
--dir CWD env setting The working directory
--tmp work_dir/tmp The temporary files folder
--sqlite_db Mandatory The SQLite results database with all mapping and translated transcript calling results
--local_max 1 The range wherein the local maximum can be located (e.g. 1 means +/- one triplet)
--min_count_aTIS 5 The minimum count of ribosome profiling reads that need to map to the aTIS
--R_aTIS 0.01 The Rltm - Rchx value threshold for aTIS ORFs
--min_count_5 10 The minimum count of ribosome profiling reads that need to map to a 5'UTR ORF TIS
--R_5 0.05 The Rltm - Rchx value threshold for 5'UTR ORFs
--min_count_CDS 15 The minimum count of ribosome profiling reads that need to map to a CDS ORF TIS
--R_CDS 0.15 The Rltm - Rchx value threshold for CDS ORFs
--min_count_3 10 The minimum count of ribosome profiling reads that need to map to a 3'UTR ORF TIS
--R_3 0.05 The Rltm - Rchx value threshold for 3'UTR ORFs
--min_count_ntr 10 The minimum count of ribosome profiling reads that need to map to a non-coding region ORF TIS
--R_ntr 0.05 The Rltm - Rchx value threshold for non-coding region ORFs
--out_sqlite / Parameter is only used in Galaxy, not in CLI
--transcriptfilter none Use certain filters at transcript level (options: none, canonical, ccds)
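
For example, to call TISs more stringently and only on canonical transcripts (threshold values purely illustrative):

perl TIScalling_categorised.pl --sqlite_db SQLite/results.db --min_count_aTIS 10 --R_aTIS 0.05 --transcriptfilter canonical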

Called TISs are saved in the SQLite results database in the following table format:

transcript_id stable_id biotype chr strand start dist_to_transcript_start dist_to_aTIS annotation exon aTIS_call start_codon peak_shift count Rltm_min_Rchx
669998 ENST00000337132 protein_coding 1 1 16440762 91 0 aTIS 1 True ATG +1 0 -1 130.0 0.961529860001382
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

SNP calling

In addition to the information about called TIS positions, it is also possible to search for SNP positions based on the ribosome profiling data. More information about SNP calling can be found in its separate README file.

With Picard tools (for removing duplicates) and SAMtools, SNP positions are called based on the aligned ribosome profiling data. Data from SNPdb can also be included in the analysis.

An example of how to run this tool:

bash snp_calling --sqlitein path/to/results/database.db --sqliteout path/to/output/database.db --removeduplicates [y|n] --picardpath /path/to/picardmap --snpdbselected [y|n] --snpdb path/to/snpdb --toolsdir /path/to/tooldir --reads path/to/mapped/reads.sam --mincoverage 3 --maxcoverage 100 --high_af 0.95 --lower_af 0.3 --upper_af 0.7

Input arguments:

Argument Default Description
-s / --sqlitein Mandatory Path to the database with results from previous steps
--sqliteout Mandatory Path to store the database with all results after this step
--removeduplicates Mandatory 'y' (yes) or 'n' (no), whether to remove duplicate reads with Picard
--picardpath Mandatory if previous option is 'y' Path to the Picard tools jar file
--snpdbselected Mandatory 'y' (yes) or 'n' (no), whether the mapped reads will be searched against known SNP data
--snpdb Mandatory if previous option is 'y' Path to the SNPdb
--toolsdir Mandatory Path to the folder where filterSAMfile.pl, snpIndexBuilder.pl and splitVCFaltRecords.pl are
-r / --reads Mandatory Path to the SAM file containing alignments ('untreated')
--mincoverage 3 SAMtools parameter, the minimal number of reads that need to map at a location so that a SNP can be called there
--maxcoverage 100 SAMtools parameter, the maximal number of reads that may map at a location for a SNP to be called there
--high_af 0.95 High, lower and upper allelic frequency; input variables for mergeVCFfiles.pl; SNPs & INDELs are selected when their allelic frequency lies between lower_af and upper_af or is higher than high_af
--lower_af 0.3 Lower allelic frequency
--upper_af 0.7 Upper allelic frequency

The reasoning behind the high, lower and upper allelic frequencies is to provide the ability to select both heterozygous and homozygous SNP variations.

Called SNPs are stored in the SQLite results database in the following table format:

id chr pos ref alt dp af new seq_region_id
100001506993 10 150699 A C 1 0.5 m 131544
... ... ... ... ... ... ... ... ...

The column "new" is added when the results are compared to SNPdb. A "y" (yes) indicates that a variant was new, ie NOT found in SNPdb. A "n" (no) means not new (ie found in SNPdb) and an "m" (mismatch) is used for those mismatches in the mapped reads that were found in SNPdb. There is no allelic frequency information for the "m" variants, so this has been set to 0.0.

ORF assembly

The information about called TISs and SNPs can be used to construct the candidate translation products in silico. In this process, the program keeps track of the exonic structure of the transcript and of known selenocysteines (encoded by a 'UGA' codon, which can also function as a stop signal). Both the translation product with the selenocysteine and the earlier terminated product are kept. For near-cognate start sites, the first codon is replaced by a cognate methionine. Furthermore, translation products are only outputted if translation stops at a valid stop signal. The coverage and FPKM value of each translation product are calculated as well.

An example of how to run this tool:

perl assembly.pl --sqliteRES SQLite/results.db --snp samtools_dbSNP --tis_ids 1

Input arguments:

Argument Default Description
--dir CWD env setting The working directory
--tmp work_dir/tmp The temporary files folder
--sqliteRES Mandatory The SQLite results database with all previous results
--local_max 1 The range wherein the local maximum can be located (e.g. 1 means +/- one triplet)
--min_count_aTIS 5 The minimum count of ribosome profiling reads that need to map to the aTIS
--R_aTIS 0.05 The Rltm - Rchx value threshold for aTIS ORFs
--min_count_5 10 The minimum count of ribosome profiling reads that need to map to a 5'UTR ORF TIS
--R_5 0.05 The Rltm - Rchx value threshold for 5'UTR ORFs
--min_count_CDS 15 The minimum count of ribosome profiling reads that need to map to a CDS ORF TIS
--R_CDS 0.15 The Rltm - Rchx value threshold for CDS ORFs
--min_count_3 10 The minimum count of ribosome profiling reads that need to map to a 3'UTR ORF TIS
--R_3 0.05 The Rltm - Rchx value threshold for 3'UTR ORFs
--min_count_ntr 10 The minimum count of ribosome profiling reads that need to map to a non-coding region ORF TIS
--R_ntr 0.05 The Rltm - Rchx value threshold for non-coding region ORFs
--out_sqlite same as sqliteRES parameter The SQLite database holding all output
--snp NO The applied SNP algorithm. Options: 'NO', 'samtools', 'samtools_dbSNP'
--tis_ids Mandatory List of the analysis ids (one integer or a comma-separated list of integers)

Candidate translation products are stored in the SQLite results database in the following table format:

tr_stable_id chr strand start start_codon stop starts_list ends_list dist_to_transcript_start dist_to_aTIS annotation aTIS_call peak_shift count Rltm_min_Rchx coverage FPKM SNP INDEL secs tr_seq aa_seq
ENST00000368236 1 1 156581365 ATG 156586550 156581365_156582231_156583042_156583338_156583797_156584877_156585125_156585713_156585949_156586469 156582070_156582434_156583170_156583450_156583895_156584974_156585231_156585826_156586045_156586550 37 0 aTIS no_data NA 2164.0 NA 0.183533447684391 9.89861772252107 692_A_G_0.5 ATGTCTTCCCCCAACCCTGAGGATGTGCCCCGGAGGCCAGAACCTGAGCCCTCAAGCTCCAATAAGAAAAAGAAGAAAAGAAAGTGGCTGCGGCAAGAAGCCAGCATCCAAGCCCTCACCAGGGCTGGCCATGGGGCCCTTCAGGCTGGCCAGAACCATGAAGCCTTGAACAACTTCCAGAGGGCCTTCCTTCTGGCCTCCAAGGCCCCACAAACCAGGGATACCCCTGTGCTCCAGGCCTGCGCCTTCAACCTGGGGGCTGCCTATGTGGAGACTGGGGACCCAGCCAGAGGCCTTGAGCTACTCCTGCGAGCCCACCCTGAAGAGAAGGCACAGGGCAGGCGACACGGCGACCAATGTTTCAATGTGGCTTTGGCCTACCATGCCCTCGGCGAGCTGCCTCAAGCTTTGGCCTGGTACCACAGGGCCCTGGGCCACTACCAGCCACAGGGTGACCAGGGAGAAGCCTGGGCAAAAATGGGAGCCTGCTACCAGGCTCTGGGACAGCCTGAGCTAGCAGCCCACTGCCTGCAGGAAGCAAGCCAGGCCTATGCTCAAGAGAGACAGCTGCGGGCCGCAGCCCTGGCACTGGGGGCTGCGGCAGGATGTATGCTGAAGAGTGGGCGGCATCGGGTGGGGGAAGTTGTGCAGGTGCTGGAGAAAAGCCGGAGGCTTGCCGAGAGGAGCACTGGGAGGCGACTGCTGGGGCACCTCTATAACGATCTAGGCCTGGGCTACTCCCAGCTCCAGCTGTTCCCGCTGGCCGTGGAGGCCTTCCTGCAGGCCCTGCCCCTGTGCTGGGTGCCAGGAGAGCAGGCCACAGTGCTAAGAAACCTCGGGATGGCCCACAATGCCCTCGGCAACTATCAGGAAGCTCGGGAGTTTCACCAGAAGGCTGCTGACCTACACGGCTCTGTGGGGCAGCGGTGGGAGCAGGGCCGGAGCTTTGGCAGCCTGGCCTTTGCATTGAGCCAGCTGGGGGACCACAAGGCTGCCAGAGACAACTACCTGCATGCCCTGCAGGCTGCCCGGGACTCTGGGGACATGAAGGGACAGTGGCAGGCCTGTGAGGGTCTGGGGGCTGCTGCAGCCAGGCTGGGGCAGTATGACCAGGCCTTGAAGTACTATAAGGAAGCACTGGCCCAGTGTCAGAAGGAGCCAGATTCTGTGCGAGAACGGCTGGTGGCCAAGCTGGCAGACACCGTGAGGACGCGCTTGGCCCAGGTGGGGCTGGTCCAGACTCACACCCTGACTTCGGCTCCGGGAAGACTCCAGGCTCCAGGTGGGGCCAGCCAGGCGGAGGGGACCCCAGCAAAGGCAGGAAGCAGCACAGCAGGTGTCCAGCACAGATCTTCCAGTGGGTGGGAAGATGAAGAGTTTGAGGAGGGCCACCAGAAGAAAAAAGAGGAGAGGTCGGCAAACGTTCCGGTGAGGGCTGGGCCGGGAAGACCAGAGCTGTGTTTCCTTCCAGGCACAGTGAATCATTCGCACCATCTAGCTTCTAGTTGCCCCACGTTTACCAAGCACACGCCCTGCAGAGGGACAGTCCTCGGCAAAGCCTCCATCTATAGTCCAGGACCCAGGGCCCATCTTCCATTTGTAGGTCCAGGCCCTCCCAGAGCGGAGTACCCTAGCATCTTGGTACCCAATGGCCCTCAAGCCAATAGGTCATCCAGGTGGCCCAGGGAAAGCCTCAGCAGGAGCCGCCAGAGGAGACCCATGGAGTCGGGCATCTGCACTATTGTGTGA MSSPNPEDVPRRPEPEPSSSNKKKKKRKWLRQEASIQALTRAGHGALQAGQNHEALNNFQRAFLLASKAPQTRDTPVLQACAFNLGAAYVETGDPARGLELLLRAHPEEKAQGRRHGDQCFNVALAYHALGELPQALAWYHRALGHYQPQGDQGEAWAKMGACYQALGQPELAAHCLQEASQAYAQERQLRAAALALGAAAGCMLKSGRHRVGEVVQVLEKSRRLAERSTGRRLLGHLYNDLGLGYSQLQLFPLAVEAFLQALPLCWVPGEQATVLRNLGMAHNALGNYQEAREFHQKAADLHGSVGQRWEQGRSFGSLAFALSQLGDHKAARDNYLHALQAARDSGDMKGQWQACEGLGAAAARLGQYDQALKYYKEALAQCQKEPDSVRERLVAKLADTVRTRLAQVGLVQTHTLTSAPGRLQAPGGASQAEGTPAKAGSSTAGVQHRSSSGWEDEEFEEGHQKKKEERSANVPVRAGPGRPELCFLPGTVNHSHHLASSCPTFTKHTPCRGTVLGKASIYSPGPRAHLPFVGPGPPRAEYPSILVPNGPQANRSSRWPRESLSRSRQRRPMESGICTIV*
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

PRICE

Another method uses the PRICE algorithm to pick up translation products from ribosome profiling data. PRICE (Probabilistic inference of codon activities by an EM algorithm) is a method to identify translated ORFs. Make sure you have the right Java distribution, as stated on the PRICE wiki. The right Java distribution is also available in a separate Conda environment, which you can activate with the following command:

source activate price

PROTEOFORMER contains a wrapper that fully executes PRICE and uses its results to obtain a candidate translation product table similar to the assembly table. PRICE does not account for selenocysteines, so these cannot be included with this methodology. Execution of PRICE comprises genome reference preparation, the actual PRICE run and conversion of the output CIT file. Afterwards, the output of PRICE is combined with information from Ensembl to obtain all needed features of the translation product table. All these steps are included in the PROTEOFORMER wrapper program. The wrapper lets PRICE use both treatment samples if they are available, but it is also possible to run the PRICE wrapper with only an untreated/CHX sample.

An example of how to run this wrapper:

python PRICE.py -r SQLite/results.db

Input arguments:

Argument Default Description
-w/--workdir CWD env setting Working directory
-t/--tmp workdir/tmp Temporary files folder
-r/--result_db Mandatory The SQLite database with results of mapping and translated transcript calling
-f/--fdr 0.1 FDR value threshold used in PRICE
-m/--max_ram PRICE default Maximum amount of gigabytes RAM available for running PRICE
-h/--help / Show help message

The output table has the following format:

tr_stable_id chr strand start start_codon stop starts_list ends_list dist_to_transcript_start dist_to_aTIS annotation biotype aTIS_call peak_shift count Rltm_min_Rchx coverage FPKM SNP INDEL secs tr_seq aa_seq price_id price_annotation price_pval
ENST00000218516 X -1 101407909 GTG 101397809 101397809_101398370_101398785_101400666_101401632_101403811_101407710 101398099_101398567_101398946_101400757_101401809_101403985_101407909 17 -6 5UTR protein_coding NA NA 9878.0 NA 0.545524691358 60.9777121395 GTGACAATGCAGCTGAGGAACCCAGAACTACATCTGGGCTGCGCGCTTGCGCTTCGCTTCCTGGCCCTCGTTTCCTGGGACATCCCTGGGGCTAGAGCACTGGACAATGGATTGGCAAGGACGCCTACCATGGGCTGGCTGCACTGGGAGCGCTTCATGTGCAACCTTGACTGCCAGGAAGAGCCAGATTCCTGCATCAGTGAGAAGCTCTTCATGGAGATGGCAGAGCTCATGGTCTCAGAAGGCTGGAAGGATGCAGGTTATGAGTACCTCTGCATTGATGACTGTTGGATGGCTCCCCAAAGAGATTCAGAAGGCAGACTTCAGGCAGACCCTCAGCGCTTTCCTCATGGGATTCGCCAGCTAGCTAATTATGTTCACAGCAAAGGACTGAAGCTAGGGATTTATGCAGATGTTGGAAATAAAACCTGCGCAGGCTTCCCTGGGAGTTTTGGATACTACGACATTGATGCCCAGACCTTTGCTGACTGGGGAGTAGATCTGCTAAAATTTGATGGTTGTTACTGTGACAGTTTGGAAAATTTGGCAGATGGTTATAAGCACATGTCCTTGGCCCTGAATAGGACTGGCAGAAGCATTGTGTACTCCTGTGAGTGGCCTCTTTATATGTGGCCCTTTCAAAAGCCCAATTATACAGAAATCCGACAGTACTGCAATCACTGGCGAAATTTTGCTGACATTGATGATTCCTGGAAAAGTATAAAGAGTATCTTGGACTGGACATCTTTTAACCAGGAGAGAATTGTTGATGTTGCTGGACCAGGGGGTTGGAATGACCCAGATATGTTAGTGATTGGCAACTTTGGCCTCAGCTGGAATCAGCAAGTAACTCAGATGGCCCTCTGGGCTATCATGGCTGCTCCTTTATTCATGTCTAATGACCTCCGACACATCAGCCCTCAAGCCAAAGCTCTCCTTCAGGATAAGGACGTAATTGCCATCAATCAGGACCCCTTGGGCAAGCAAGGGTACCAGCTTAGACAGGGAGACAACTTTGAAGTGTGGGAACGACCTCTCTCAGGCTTAGCCTGGGCTGTAGCTATGATAAACCGGCAGGAGATTGGTGGACCTCGCTCTTATACCATCGCAGTTGCTTCCCTGGGTAAAGGAGTGGCCTGTAATCCTGCCTGCTTCATCACACAGCTCCTCCCTGTGAAAAGGAAGCTAGGGTTCTATGAATGGACTTCAAGGTTAAGAAGTCACATAAATCCCACAGGCACTGTTTTGCTTCAGCTAGAAAATACAATGCAGATGTCATTAAAAGACTTACTTTAA MTMQLRNPELHLGCALALRFLALVSWDIPGARALDNGLARTPTMGWLHWERFMCNLDCQEEPDSCISEKLFMEMAELMVSEGWKDAGYEYLCIDDCWMAPQRDSEGRLQADPQRFPHGIRQLANYVHSKGLKLGIYADVGNKTCAGFPGSFGYYDIDAQTFADWGVDLLKFDGCYCDSLENLADGYKHMSLALNRTGRSIVYSCEWPLYMWPFQKPNYTEIRQYCNHWRNFADIDDSWKSIKSILDWTSFNQERIVDVAGPGGWNDPDMLVIGNFGLSWNQQVTQMALWAIMAAPLFMSNDLRHISPQAKALLQDKDVIAINQDPLGKQGYQLRQGDNFEVWERPLSGLAWAVAMINRQEIGGPRSYTIAVASLGKGVACNPACFITQLLPVKRKLGFYEWTSRLRSHINPTGTVLLQLENTMQMSLKDLL* ENST00000218516_Variant_0 Variant 1.6931e-11
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

SPECtre

A third method is based on SPECtre, a tool for searching actively translated regions in ribosome profiling data using a spectral coherence classifier. SPECtre runs solely on an untreated/CHX sample, so no initiation profile is necessary. However, SPECtre starts from the reference information in the GTF file and is therefore less suited to pick up unknown new ORFs.

SPECtre needs its own earlier installed Conda environment to run:

source activate spectre

In this environment, the PROTEOFORMER wrapper for SPECtre can be run. The wrapper runs SPECtre in the background, multithreaded over all chromosomes, and merges all results afterwards. Next, these results are parsed and supplemented with information from Ensembl to obtain all features needed for the PROTEOFORMER translation product table. Furthermore, SPECtre takes selenocysteines into account.

An example of how to run the SPECtre wrapper:

python SPECtre.py -r SQLite/results.db -o 28:12,29:12,30:12 -c 20 -x 1

Input arguments:

Argument Default Description
-w/--workdir CWD env setting The working directory
-t/--tmp workdir/tmp The temporary files folder
-r/--result_db Mandatory SQLite database with results from mapping and translated transcript calling
-b/--untr_bam Get from results database BAM file of the untreated/CHX sample
-o/--offsets Get from results database Custom list of offsets; by default the program takes these from the results database
-c/--cores Get from results database By default taken from the results database, but you can suggest a different number of cores specifically for SPECtre
-x/--threads_per_chrom 1 The number of threads used per chromosome to run SPECtre
-f/--fdr 0.05 FDR threshold used in SPECtre
-h/--help / Show help message

Caution: SPECtre is currently not suited to work with a large offset list like the one outputted by Plastid. Therefore, it is advisable to use only the P-site offsets of the most abundant RPF lengths.

The output table has the following format:

tr_stable_id chr strand start start_codon stop starts_list ends_list dist_to_transcript_start dist_to_aTIS annotation biotype aTIS_call peak_shift count Rltm_min_Rchx coverage FPKM SNP INDEL secs tr_seq aa_seq spectre_id
ENST00000588919 19 1 1104125 ATG 1106766 1104125_1105186_1105366_1105658_1106242_1106378_1106540 1104127_1105280_1105510_1105809_1106266_1106459_1106766 53 0 aTIS protein_coding NA NA 27921.0 NA 0.552812071331 277.962315247 97_96 ATGTGCGCGTCCCGGGACGACTGGCGCTGTGCGCGCTCCATGCACGAGTTTTCCGCCAAGGACATCGACGGGCACATGGTTAACCTGGACAAGTACCGGGGCTTCGTGTGCATCGTCACCAACGTGGCCTCCCAGTGAGGCAAGACCGAAGTAAACTACACTCAGCTCGTCGACCTGCACGCCCGATACGCTGAGTGTGGTTTGCGGATCCTGGCCTTCCCGTGTAACCAGTTCGGGAAGCAGGAGCCAGGGAGTAACGAAGAGATCAAAGAGTTCGCCGCGGGCTACAACGTCAAATTCGATATGTTCAGCAAGATCTGCGTGAACGGGGACGACGCCCACCCGCTGTGGAAGTGGATGAAGATCCAACCCAAGGGCAAGGGCATCCTGGGAAATGCCATCAAGTGGAACTTCACCAAGTTTGGACACCGTCTCTCCACAGTTCCTCATCGACAAGAACGGCTGCGTGGTGAAGCGCTACGGACCCATGGAGGAGCCCCTGGTGATAGAGAAGGACCTGCCCCACTATTTCTAGCTCCACAAGTGTGTGGCCCCGCCCGAGCCCCTGCCCACGCCCTTGGAGCCTTCCACCGGCACTCATGACGGCCTGCCTGCAAACCTGCTGGTGGGGCAGACCCGAAAATCCAGCGTGCACCCCGCCGGAGGAAGGTCCCATGGCCTGCTGGGCTTGGCTCGGCGCCCCCACCCCTGGCTACCTTGTGGGAATAA MCASRDDWRCARSMHEFSAKDIDGHMVNLDKYRGFVCIVTNVASQUGKTEVNYTQLVDLHARYAECGLRILAFPCNQFGKQEPGSNEEIKEFAAGYNVKFDMFSKICVNGDDAHPLWKWMKIQPKGKGILGNAIKWNFTKFGHRLSTVPHRQERLRGEALRTHGGAPGDREGPAPLFLAPQVCGPARAPAHALGAFHRHSURPACKPAGGADPKIQRAPRRRKVPWPAGLGSAPPPLATLWE* 55220
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

Analysis ID overview table

Over the different analysis strategies (translated transcript calling and ORF calling), the applied parameters are kept in a TIS overview table. For each analysis executed and stored in a given SQLite results database, PROTEOFORMER keeps track of the analysis ID and its corresponding parameters in this separate table. At any time, the content of the analysis ID table can be printed to a tab file using the following program:

perl TIS_overview.pl --sqlite_db SQLite/results.db

Input arguments:

Argument Default Description
--work_dir CWD env setting Working directory
--sqlite_db Mandatory SQLite results database
--out_tis_overview work_dir/SQLite/overview.tis Analysis ID overview output file

The overview table has the following format:

ID local_max min_count_aTIS R_aTis min_count_5UTR R_5UTR min_count_CDS R_CDS min_count_3UTR R_3UTR min_count_ntr R_ntr PRICE_FDR SPECTRE_FDR SNP indel filter tr_calling TIS_calling
1 1 5 0.01 10 0.05 15 0.15 10 0.05 10 0.05 samtools_dbSNP NO none rule_based Yes
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...

This table will be outputted as a tab-separated text file.

FASTA file generation

Translation products estimated in a particular analysis ID can be outputted in FASTA file format. This allows submitting them as a search space in a proteomics analysis. During this output generation, additional processing can be executed: the candidate translation products can be mapped to reference information, and redundancy between translation products at the subsequence level can be removed.

An example of how to run this program for analysis ID 1:

perl generate_translation_db.pl --result_db SQLite/results.db --tis_ids 1 --mflag 4 

Input arguments:

Argument Default Description
--work_dir CWD env setting Working directory
--result_db Mandatory SQLite database holding all results
--tis_ids Mandatory Analysis IDs for which an output file needs to be generated. Either an integer with one ID, a comma-separated list of IDs or 'all', leading to a FASTA/PEFF for all of the analysis IDs available
--mslength 6 Minimum sequence length allowed in the output FASTA/PEFF file
--translation_db species_TIS_analysisID_transcripts.fasta/peff Name of the output translation database file (FASTA/PEFF)
--var_file species_TIS_analysisID_transcripts_VAR.txt File to store SNP and indel information of sequences in the translation database
--tis_call Y Whether annotated TISs that did not pass the TIS calling algorithm are allowed in the output file. Possible options: Yes ('Y') or No ('N')
--peff N Generate PEFF instead of FASTA format. Options: Yes ('Y') or No ('N')
--mflag 4 Flag describing the additional tasks that will be performed: 1 = remote BioMart mapping, 2 = local file BioMart mapping, 3 = sequence based blast mapping, 4 = no mapping, only redundancy removal, 5 = no mapping and no redundancy removal
--mapping Mandatory if mflag=1 or 2 Ensembl database to download BioMart mapped transcripts or local file to map data
--db_config_version Mandatory if mflag=1 Ensembl database configuration version
--external_ref Mandatory if mflag=1 External reference in biomart to map transcripts to
--blast_pgm ublast BLAST program to map based on sequence. Possible options: 'ublast' and 'blastp'
--blastdb Mandatory if mflag=3 The UBLAST or BLASTP formatted database
--evalue 1e-10 BLAST e-value
--min_blast_length 32 Minimum sequence length to perform BLAST
--identity 75 Minimum alignment score for BLAST
--coverage 30 Minimum number of identical positions for BLAST
--word_size 3 Minimum word size for BLAST
--gapopen 11 Cost of gap open in BLAST
--gapextend 1 Gap extension penalty for BLAST
--matrix BLOSUM62 Matrix for the BLAST search

Redundancy is removed by default, unless mflag 5 is selected or a PEFF file is generated instead of a FASTA file. Mapping can be done based on BioMart (remote or with a local file) or sequence-based using UBLAST or BLASTP. After mapping, inherent redundancy (also at the substring level) will be removed if desired. Next to the FASTA file, a VAR file is also generated, to better describe which SNPs and indels are linked to which sequences.
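
For example, to generate the output with sequence-based BLASTP mapping instead of plain redundancy removal (database path hypothetical):

perl generate_translation_db.pl --result_db SQLite/results.db --tis_ids 1 --mflag 3 --blast_pgm blastp --blastdb /path/to/blastp_formatted_db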

Hereafter, a snippet of an example FASTA output file is given:

>generic|ENST00000647097_4_102760894_aTIS_1|ENSG00000109323 ATG True -1 0 +1 [ENST00000489071_20_35548863_ntr#ENST00000226578_4_102760894_aTIS_1]
MRLHLLLLLALCGAGTTAAELSYSLRGNWSICNGNGSLELPGAVPGCVHSALFQQGLIQD
SYYRFNDLNYRWVSLDNWTYSKEFKIPFEISKWQKVNLILEGVDTVSKILFNEVTIGETD
NMFNRYSFDITNVVRDVNSIELRFQSAVLYAAQQSKAHTRYQVPPDCPPLVQKGECHVNF
VRKEQCSFSWDWGPSFPTQGIWKDVRIEAYNICHLNYFTFSPIYDKSAQEWNLEIESTFD
VVSSKPVGGQVIIAIPKLQTQQTYSIELQPGKRIVELFVNISKNITVETWWPHGHGNQTG
YNMTVLFELDGGLNIEKSAKVYFRTVELIEEPIKGSPGLSFYFKINGFPIFLKGSNWIPA
DSFQDRVTSELLRLLLQSVVDANMNTLRVWGGGIYEQDEFYELCDELGIMVWQDFMFACA
LYPTDQGFLDSVTAEVAYQIKRLKSHPSIIIWSGNNENEEALMMNWYHISFTDRPIYIKD
YVTLYVKNIRELVLAGDKSRPFITSSPTNGAETVAEAWVSQNPNSNYFGDVHFYDYISDC
WNWKVFPKARFASEYGYQSWPSFSTLEKVSSTEDWSFNSKFSLHRQHHEGGNKQMLYQAG
LHFKLPQSTDPLRTFKDTIYLTQVMQAQCVKTETEFYRRSRSEIVDQQGHTMGALYWQLN
DIWQAPSWASLEYGGKWKMLHYFAQNFFAPLLPVGFENENTFYIYGVSDLHSDYSMTLSV
RVHTWSSLEPVCSRVTERFVMKGGEAVCLYEEPVSELLRRCGNCTRESCVVSFYLSADHE
LLSPTNYHFLSSPKEAVGLCKAQITAIISQQGDIFVFDLETSAVAPFVWLDVGSIPGRFS
DNGFLMTEKTRTILFYPWEPTSKNELEQSFHVTSLTDIY
>generic|ENST00000647097_4_102760894_aTIS_2|ENSG00000109323 ATG True -1 0 +1 [ENST00000226578_4_102760894_aTIS_2]
MRLHLLLLLALCGAGTTAAELSYSLRGNWSICNGNGSLELPGAVPGCVHSALFQQGLIQD
SYYRFNDLNYRWVSLDNWTYSKEFKIPFEISKWQKVNLILEGVDTVSKILFNEVTIGETD
NMFNRYSFDITNVVRDVNSIELRFQSAVLYAAQQSKAHTRYQVPPDCPPLVQKGECHVNF
VRKEQCSFSWDWGPSFPTQGIWKDVRIEAYNICHLNYFTFSPIYDKSAQEWNLEIESTFD
VVSSKPVGGQVIVAIPKLQTQQTYSIELQPGKRIVELFVNISKNITVETWWPHGHGNQTG
YNMTVLFELDGGLNIEKSAKVYFRTVELIEEPIKGSPGLSFYFKINGFPIFLKGSNWIPA
DSFQDRVTSELLRLLLQSVVDANMNTLRVWGGGIYEQDEFYELCDELGIMVWQDFMFACA
LYPTDQGFLDSVTAEVAYQIKRLKSHPSIIIWSGNNENEEALMMNWYHISFTDRPIYIKD
YVTLYVKNIRELVLAGDKSRPFITSSPTNGAETVAEAWVSQNPNSNYFGDVHFYDYISDC
WNWKVFPKARFASEYGYQSWPSFSTLEKVSSTEDWSFNSKFSLHRQHHEGGNKQMLYQAG
LHFKLPQSTDPLRTFKDTIYLTQVMQAQCVKTETEFYRRSRSEIVDQQGHTMGALYWQLN
DIWQAPSWASLEYGGKWKMLHYFAQNFFAPLLPVGFENENMFYIYGVSDLHSDYSMTLSV
RVHTWSSLEPVCSRVTERFVMKGGEAVCLYEEPVSELLRRCGNCTRESCVVSFYLSADHE
LLSPTNYHFLSSPKEAVGLCKAQITAIISQQGDIFVFDLETSAVAPFVWLDVGSIPGRFS
DNGFLMTEKTRTILFYPWEPTSKNELEQSFHVTSLTDIY
>generic|ENST00000647097_4_102760938_5UTR|ENSG00000109323 CTG NA -1 0 +1 [ENST00000645348_4_102760938_5UTR#ENST00000646727_4_102760938_5UTR#ENST00000644545_4_102760938_5UTR#ENST00000226578_4_102760938_5UTR#ENST00000642252_4_102760938_5UTR#ENST00000644159_4_102760938_5UTR#ENST00000644965_4_102760797_CDS#ENST00000646311_4_102760938_5UTR]
MPFDLSTSRWRGISRCASTCSCCSRCAVQAPPPRSSVTACVATGASAMGTARWSCPGRSL
AACTAPCSSRA
...
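
To quickly count the number of sequences in a generated FASTA file (file name follows the default pattern from the argument table above):

grep -c ">" species_TIS_analysisID_transcripts.fasta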

Another option is to generate a PEFF file, an extended FASTA format for proteogenomics recently devised by the HUPO-PSI. When the PEFF option is turned on, redundancy is not removed but schematically included in the PEFF tags.

Hereafter, a snippet of an example PEFF output file is given:

# PEFF 1.0
# //
# DbName=genericDB
# DbSource=http://www.biobix.be/proteoform/
# DbVersion=2018-03-22
# HasAnnotationIdentifiers=true
# ProteoformDb=true
# Prefix=gen
# NumberOfEntries=4
# SequenceType=AA
# GeneralComment=Proteogenomics application to generate protein sequences from RIBO-seq
# //
>gen:ENST00000000412-1 \PName=mannose-6-phosphate receptor, cation dependent  \GName=mannose-6-phosphate receptor, cation dependent  \NcbiTaxId=9606 \TaxName=Homo sapiens \Length=277 \VariantSimple=(1:122|H) \VariantComplex=(2:1|120|M)(3:1|191|M)(4:1|235|M)  \Proteoform=(ENST00000000412_12_8946404_aTIS_pf1.0|1-277||canonical form)(ENST00000000412_12_8946404_aTIS_1_pf1.4|1-277|1|proteoform with SAV)(ENST00000000412_12_8943896_CDS_pf1.5|1-277|2|N-terminal truncation)(ENST00000000412_12_8943418_CDS_pf1.2|1-277|3|N-terminal truncation)(ENST00000000412_12_8942424_CDS_pf1.3|1-277|4|N-terminal truncation)(ENST00000000412_12_8943896_CDS_1_pf1.1|1-277|1,2|N-terminal truncation, proteoform with SAV)
MFPFYSCWRTGLLLLLLAVAVRESWQTEEKTCDLVGEKGKESEKELALVKRLKPLFNKSFESTVGQGSDTYIYIFRVCREAGNHTSGAGLVQINKSNGKETVVGRLNETHIFNGSNWIMLIYKGGDEYDNHCGKEQRRAVVMISCNRHTLADNFNPVSEERGKVQDCFYLFEMDSSLACSPEISHLSVGSILLVTFASLVAVYVVGGFLYQRLVVGAKGMEQFPHLAFWQDLGNLVADGCDFVCRSKPRNVPAAYRGVGDDQLGEESEERDDHLLPM
>gen:ENST00000000412-2 \PName=uORF of mannose-6-phosphate receptor, cation dependent  \GName=mannose-6-phosphate receptor, cation dependent  \NcbiTaxId=9606 \TaxName=Homo sapiens \Length=41 \VariantComplex=(1:1|11|M)  \Proteoform=(ENST00000000412_12_8949642_5UTR_pf2.0|1-41||canonical form)(ENST00000000412_12_8949612_5UTR_pf2.1|1-41|1|N-terminal truncation)
MGHSEALGGTLASETSSGTGVCGRPLAGPVSHPQKGIHWGF
>gen:ENST00000000412-3 \PName=Alternative reading frame product of mannose-6-phosphate receptor, cation dependent  \GName=mannose-6-phosphate receptor, cation dependent  \NcbiTaxId=9606 \TaxName=Homo sapiens \Length=16  \Proteoform=(ENST00000000412_12_8946336_CDS_pf3.0|1-16||canonical form)
MLADRRKNLRLGRRKG
>gen:ENST00000000412-4 \PName=Alternative reading frame product of mannose-6-phosphate receptor, cation dependent  \GName=mannose-6-phosphate receptor, cation dependent  \NcbiTaxId=9606 \TaxName=Homo sapiens \Length=56  \Proteoform=(ENST00000000412_12_8943516_CDS_pf4.0|1-56||canonical form)
MRSVAKSKIVSTSLRWIAAWPVHQRSPTSVWVPSYLSRLHHWLLFMLLGGSYTSDW
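
A PEFF description line stores its metadata in backslash-prefixed key=value tags, with the proteoforms enumerated in the \Proteoform tag. A minimal parsing sketch follows; dedicated PEFF parsers exist (e.g. in the pyteomics Python library), so treat this only as an illustration.

import re

def parse_peff_tags(line):
    """Split a PEFF description line into accession, tags and proteoforms (sketch)."""
    accession = line[1:].split(" ", 1)[0]               # e.g. 'gen:ENST00000000412-1'
    tags = {k: v.strip() for k, v in re.findall(r"\\(\w+)=([^\\]*)", line)}
    # Each proteoform is one parenthesized group: accession|range|variant ids|label
    proteoforms = [p.split("|") for p in
                   re.findall(r"\(([^)]*)\)", tags.get("Proteoform", ""))]
    return accession, tags, proteoforms

line = (">gen:ENST00000000412-2 \\PName=uORF example \\Length=41 "
        "\\Proteoform=(ENST00000000412_12_8949642_5UTR_pf2.0|1-41||canonical form)"
        "(ENST00000000412_12_8949612_5UTR_pf2.1|1-41|1|N-terminal truncation)")
acc, tags, pfs = parse_peff_tags(line)
print(acc, tags["Length"], pfs[1][3])   # gen:ENST00000000412-2 41 N-terminal truncation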

Optional steps

FLOSS calculation

The fragment length organization similarity score (FLOSS) measures the similarity between the fragment length distribution of an ORF and the averaged fragment length distribution of a set of reference ORFs. This statistic, including the FLOSS formula, was first described by Ingolia et al. (2014).

FLOSS calculation is embedded in the PROTEOFORMER pipeline. A tool is included that:

  1. Calculates the fragment length distribution of the reference set (the canonical translation regions of all Ensembl nuclear protein-coding genes, except those overlapping non-coding regions). This reference distribution is stored in the results database for further use.
  2. Calculates smoother data for the reference set. For each canonical coding reference ORF (step 1), a FLOSS score is calculated (the formula is illustrated in the sketch right after this list). A sliding window (size 200) is then moved over these FLOSS scores, ranked by increasing read count of the underlying reference ORFs. For each window, the first (Q1) and third quartile (Q3) are used to calculate a raw extreme value (Q3 + 3·(Q3−Q1)). A Loess smoother is fitted through all raw extremes using R, which is also used to generate some plots. The smoother results are stored in the results database for further use.
  3. Calculates FLOSS scores for the candidate ORFs. Based on the read count of a candidate ORF, its FLOSS score is compared to the cutoff smoother, enabling a FLOSS classification of the candidate ORF. 'Good' means that the ORF has a good coding probability. 'Extreme' is used for ORFs that are probably non-coding. 'Not in cutoff range' means that the read count of the ORF falls outside the smoother range (due to the sliding window approach). All classification results are stored in a separate table of the results database.
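
In formula form, the FLOSS score is half the sum of the absolute differences between the normalized fragment length distributions of the ORF and the reference set. A minimal sketch of this calculation (illustrative only, not the pipeline code):

def floss(orf_counts, ref_counts):
    """FLOSS = 0.5 * sum over lengths l of |f_orf(l) - f_ref(l)|,
    with f the fraction of reads at fragment length l (Ingolia et al., 2014)."""
    orf_total = float(sum(orf_counts.values()))
    ref_total = float(sum(ref_counts.values()))
    lengths = set(orf_counts) | set(ref_counts)
    return 0.5 * sum(abs(orf_counts.get(l, 0) / orf_total -
                         ref_counts.get(l, 0) / ref_total) for l in lengths)

# Identical length distributions give FLOSS = 0, i.e. perfectly 'coding-like'
print(floss({28: 80, 29: 20}, {28: 800, 29: 200}))   # 0.0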

An example of how to run this tool:

perl FlossProteoformer.pl --sqlite_db SQLite/results.db --tis_ids 1

Input arguments:

Argument | Default | Description
--dir | CWD env setting | Path to the working directory
--tmp | dir/tmp | Temporary files folder
--sqlite_db | Mandatory | SQLite results database with mapping, translated transcript and ORF calling results
--tis_ids | Mandatory | List of analysis IDs

The results table is in the following format:

ID | TIS ID | nreads | FLOSS | classification
1 | ENST0000042332 | 2522.0 | 0.8456787728763 | Good
... | ... | ... | ... | ...

ORF-based counts

An optional module was written to fetch the read counts for each identified open reading frame (ORF). The tool offers an option to trim off counts at the start and end of the ORF, as artefacts can occur at the ORF borders due to the RIBO-seq protocol. The trim length can be set in an absolute or a relative fashion (as a percentage of the ORF length). Note that for small ORFs, absolute trimming is performed relatively anyway, as the absolute trim length could otherwise exceed the ORF length (see the sketch below). The ORF read counts can then be used for differential expression analysis with packages like DESeq or edgeR (a small export sketch follows the output table further below).
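
A minimal sketch of the trimming rule described above (assumed logic for illustration, not the module's actual code):

def trim_boundaries(start, end, mode="absolute", nt_trim=15, rel_trim=0.025):
    """Return trimmed ORF boundaries; falls back to relative trimming
    when an absolute trim would not fit inside a small ORF."""
    orf_len = end - start + 1
    if mode == "none":
        trim = 0
    elif mode == "absolute" and 2 * nt_trim < orf_len:
        trim = nt_trim
    else:                      # relative trimming, also the fallback for small ORFs
        trim = int(orf_len * rel_trim)
    return start + trim, end - trim

print(trim_boundaries(100, 400))   # (115, 385): absolute trim of 15 nt per side
print(trim_boundaries(100, 124))   # (100, 124): a 25 nt ORF falls back to relative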

An example of how to run this program:

python ORFbasedCounts.py --sqlitedb SQLite/results.db --tis_ids 1 --trim absolute --nt_trim 15 --ltm n --rna /data2/steven/eIF1/NTmRNA/SQLite/results.db

Input arguments:

Argument | Default | Description
-w/--work_dir | CWD env setting | The working directory
-s/--sqlitedb | Mandatory | The SQLite results database with results from former steps
-t/--tis_ids | Mandatory | The TIS ID or a comma-separated list of analysis IDs for which the ORF-based counts need to be determined
-c/--trim | relative | Whether ORFs should be trimmed. Options: absolute, relative, none
-n/--nt_trim | '15' for absolute trimming, '0.025' for relative trimming | The trim length for absolute trimming, or the fraction of the ORF length to trim off for relative trimming
-l/--ltm | n | Whether the program should make an ORF-based counts table for the 'treated' data. Options: 'y' (yes) or 'n' (no)
-r/--rna | n | Whether the program should make an ORF-based counts table for RNA data. Options: 'n' (no) or the SQLite database where the RNA results are stored

The ORF-based count results are saved in the SQLite results database in the following format:

ORF_ID | tr_stable_id | start | chr | strand | trim | start_main_orf | end_main_orf | pre_count | count | post_count
ENST00000307677_31722851 | ENST00000307677 | 31722851 | 20 | -1 | 6 | 31722723 | 31722845 | 22.0 | 936.0 | 23.0
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ...
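
These counts can be exported from the results database as a plain count matrix for R packages such as DESeq or edgeR. A minimal sketch, assuming the counts ended up in a table called orf_counts (check the actual table name in your results database, as it may carry an analysis ID suffix):

import csv
import sqlite3

conn = sqlite3.connect("SQLite/results.db")
rows = conn.execute('SELECT ORF_ID, "count" FROM orf_counts')   # hypothetical table name
with open("orf_counts.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["ORF_ID", "count"])
    writer.writerows(rows)
conn.close()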

Feature summarization

Next to ORF-based counts, counts can also be summarized over other features:

  • genes
  • transcripts
  • exons
  • promoters

Counts are summarized per feature based on the Ensembl annotation by another tool of the PROTEOFORMER pipeline (a minimal sketch of this summarization idea is given after the output table below). Results can afterwards be used in differential expression tools.

An example of how to run this tool:

python summarize_feature.py --sqliteC SQLite/results.db --sqliteE SQLite/ENS_hsa_82.db --feature transcript --transcripts all --data ribo

Input arguments:

Argument | Default | Description
-w/--work_dir | CWD env setting | Working directory
-c/--sqliteC | Mandatory | The SQLite database holding count data
-e/--sqliteE | Mandatory | The SQLite database holding Ensembl annotation
-f/--feature | Mandatory | The annotation feature level to summarize on. Options: exon, transcript, gene, promotor
-d/--data | Mandatory | The type of data used to generate counts. Options: ribo, rna, rrbs
-m/--mapping | unique (mandatory for rrbs data) | The type of read mapping. Options: unique, multi
-t/--transcripts | all | The type of transcripts to process. Options: all, canonical
-r/--rrbs | Mandatory for promotor summarization | The table holding the rrbs count data
-o/--five_prime_offset | 0 | The 5' offset for feature boundaries, used for gene features

Output table format:

feature_stable_id | counts_sum | counts_n
ENST0000052565 | 230.0 | 45.0
... | ... | ...
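
A minimal sketch of the summarization idea itself, with illustrative values only: counts of lower-level features are summed per parent feature via the annotation mapping.

from collections import defaultdict

# Illustrative transcript counts and transcript-to-gene annotation mapping
transcript_counts = {"ENST0000052565": 230.0, "ENST0000052566": 120.0}
transcript_to_gene = {"ENST0000052565": "ENSG0000011111",
                      "ENST0000052566": "ENSG0000011111"}

gene_counts = defaultdict(float)
for tr, count in transcript_counts.items():
    gene_counts[transcript_to_gene[tr]] += count

print(dict(gene_counts))   # {'ENSG0000011111': 350.0}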

Database combinations

The FASTA files per TIS ID can be further combined with other FASTA files. Tools to do so are present in the Additional_tools folder of this GitHub repo.

Different analysis IDs

The FASTA files generated per TIS ID can be combined to obtain one total FASTA file covering multiple analyses. A bincode is generated per sequence to represent in which analysis IDs the sequence occurs. An overview file reports the composition of the bincodes and the total sequence counts per bincode. Venn diagrams are also generated, showing the overlap between the analysis sets.

If a sequence occurs in multiple analyses, its IDs from the other analyses are kept at the end of the description line when verbose output is turned on. A minimal sketch of the bincode idea is given below.
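
A minimal sketch on illustrative data: each input analysis contributes one binary digit, so a sequence found in both of two analyses gets bincode '11'.

# Illustrative sequence -> accession dictionaries, one per analysis ID
analyses = [
    {"SEQA": "acc1_analysis1", "SEQB": "acc2_analysis1"},
    {"SEQB": "acc7_analysis2", "SEQC": "acc9_analysis2"},
]

bincodes = {}
for i, seqs in enumerate(analyses):
    for seq in seqs:
        code = bincodes.setdefault(seq, ["0"] * len(analyses))
        code[i] = "1"

for seq, code in sorted(bincodes.items()):
    print(seq, "".join(code))   # SEQB -> '11': present in both analyses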

This tool has an extensive built-in help message:

python combine_dbs.py --help
usage: combine_dbs.py --in_files [COMMA-SEPARATED LIST] [--help]
                      [--workdir [FOLDER]] [--output_file [PATH]]
                      [--overview_file [PATH]] [--venn_diagram [PATH]]
                      [--verbose_output [Y/N]]

Mandatory parameters:
  --in_files [COMMA-SEPARATED LIST], -i [COMMA-SEPARATED LIST]
                        Comma-separated list of all paths to all input fasta
                        files (mandatory)

Optional parameters:
  --help, -h            Show help message and exit
  --workdir [FOLDER], -w [FOLDER]
                        Working directory (default: CWD)
  --output_file [PATH], -o [PATH]
                        Path of the combined output fasta file(default:
                        outpub_combined_dbs.fa)
  --overview_file [PATH], -t [PATH]
                        Path to a file where information about the input files
                        and the database combinations with their binary codes
                        will be stored (default: fasta_file_overview.txt)
  --venn_diagram [PATH], -d [PATH]
                        Path to a png file where the Venn diagram will be
                        stored (default: venn_diagram.png)
  --verbose_output [Y/N], -v [Y/N]
                        Whether the output fasta should give the full list of
                        accessions per combination. If not, redundant
                        accessions between input files, will be omitted
                        (default: Y)

The combined FASTA file can then be converted to an SQLite database in order to investigate the overlap between the initial FASTA files using SQL (a hypothetical example query is sketched below). For more information:

python combfasta2sqlite.py --help
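
Once converted, the overlap can be queried with plain SQL. The table and column names below are hypothetical; inspect the generated database (e.g. with .schema in the sqlite3 shell) for the real ones.

import sqlite3

conn = sqlite3.connect("combined_dbs.db")   # hypothetical database file name
# Hypothetical schema: one row per sequence with its bincode
(n,) = conn.execute("SELECT COUNT(*) FROM sequences WHERE bincode = '11'").fetchone()
print(n, "sequences shared by analyses 1 and 2")
conn.close()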

With UniProt

The FASTA files per TIS ID, or the combined FASTA files, can later on be combined with UniProt FASTA files. The origin of each sequence is clear from the structure of its ID, so no bincodes are needed here. An overview file again reports the counts, which are also plotted in Venn diagrams. With verbose output, redundant sequences keep all their IDs at the end of the description lines; the UniProt ID is, however, privileged as the main ID (see the sketch below).
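
A minimal sketch of the ID precedence rule, on illustrative accessions (not the tool's actual code): when the same sequence occurs in both databases, the UniProt accession becomes the main ID and the PROTEOFORMER accession is kept as a secondary one.

# Illustrative sequence -> accession dictionaries
uniprot = {"MSHAREDSEQ": "sp|P12345|EXMP_HUMAN"}
proteoformer = {"MSHAREDSEQ": "generic|example_aTIS_1|ENSG_example",
                "MNOVELSEQ": "generic|example_5UTR_2|ENSG_example"}

combined = {}
for seq, acc in proteoformer.items():
    if seq in uniprot:
        combined[seq] = (uniprot[seq], [acc])   # UniProt ID privileged as main ID
    else:
        combined[seq] = (acc, [])
for seq, acc in uniprot.items():
    combined.setdefault(seq, (acc, []))         # UniProt-only sequences

print(combined["MSHAREDSEQ"])   # ('sp|P12345|EXMP_HUMAN', ['generic|example_aTIS_1|ENSG_example'])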

The tool has an extensive built-in help message:

python combine_with_uniprot.py --help
usage: combine_with_uniprot.py --fasta [PATH] --uniprot [PATH] [--help]
                               [--workdir [FOLDER]] [--output_fasta [PATH]]
                               [--overview_file [PATH]]
                               [--venn_diagram [PATH]]

Mandatory parameters:
  --fasta [PATH], -f [PATH]
                        Input (combined) fasta file (mandatory)
  --uniprot [PATH], -u [PATH]
                        Input UniProt fasta file (mandatory)

Optional parameters:
  --help, -h            Show help message and exit
  --workdir [FOLDER], -w [FOLDER]
                        Working directory (default: CWD)
  --output_fasta [PATH], -o [PATH]
                        Output fasta file with combination of data of Uniprot
                        and PROTEOFORMER (default:
                        comb_fasta_uniprot_proteoformer.fasta)
  --overview_file [PATH], -t [PATH]
                        Path to a file where information about the input files
                        and the database combinations are stored (default:
                        uniprot_proteoformer_overview.txt)
  --venn_diagram [PATH], -d [PATH]
                        Path to a png file where the Venn diagram will be
                        stored (default: venn_diagram.png)

Here too, the combined FASTA file can be converted to an SQLite database in order to investigate the overlap between the initial FASTA files using SQL. For more information:

python combuniprot2sqlite.py --help

MS validation

All different FASTA (and PEFF) files can be used as a search space in mass spectrometry-based validation. Different programs and search engines are available for analysis.

SearchGUI and PeptideShaker

SearchGUI is a user-friendly open-source tool that enables running different proteomics identification search engines, including:

  • X! Tandem
  • MS-GF+
  • MS Amanda
  • MyriMatch
  • Comet
  • Tide
  • Andromeda
  • OMSSA
  • Novor
  • DirecTag

The tool is available in both a command line and a graphical user interface version. All details on installation and usage can be found here.

MaxQuant

MaxQuant is a proteomics software package, specifically aimed at high-resolution MS data. Several labeling techniques as well as label-free quantification are supported. The package includes both the Andromeda search engine and a viewer application for inspecting raw data, identification and quantification results. The tool is available as a graphical user interface. More information (download info, manual, tutorials, forums) can be found here.

In the folder Additional_tools/MaxQuant_results_parsing of this GitHub repo, some additional scripts can be found to parse the output of MaxQuant searches. The essential MaxQuant output files needed for this results parsing can be found in the combined/txt subfolder of the MaxQuant output folder and include:

  • proteinGroups.txt
  • peptides.txt
  • msms.txt

Based on these files, results parsing scripts can be run.

The parse_maxquant.py script analyses the contribution of each ORF calling method to the final MS-identified output of MaxQuant. The parse_maxquant_uniprot.py script analyses the contribution of the total PROTEOFORMER pipeline to the final MS-identified output of MaxQuant, compared to the contribution of UniProt, and generates a max_protein_db database. Based on this database, a third script can be run (analyse_proteoforms.py), which classifies all newly found PROTEOFORMER proteoforms based on the nature of their variation. For this last script, the Clustal Omega tool needs to be installed on your system.
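
As an illustration of what these scripts work on, the following sketch (not one of the repo scripts) reads the tab-separated peptides.txt and counts peptides that map exclusively to PROTEOFORMER entries. The 'generic|' prefix test assumes the FASTA headers shown earlier in this document and may need adapting to how MaxQuant reports your accessions.

import csv

n_total = n_proteoformer_only = 0
with open("combined/txt/peptides.txt") as handle:
    for row in csv.DictReader(handle, delimiter="\t"):
        proteins = [p for p in row["Proteins"].split(";") if p]
        n_total += 1
        if proteins and all(p.startswith("generic|") for p in proteins):
            n_proteoformer_only += 1   # peptide only explained by PROTEOFORMER entries

print(n_proteoformer_only, "of", n_total, "peptides map only to PROTEOFORMER entries")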

For more information about the usage of these scripts, use their help command:

python parse_maxquant.py --help
python parse_maxquant_uniprot.py --help
python analyse_proteoforms.py --help

Pipeline master script

If you want to run the complete pipeline (or several parts of it) on one or multiple samples, the PROTEOFORMER pipeline master script (written in Bash) can be a useful tool. An example or default version is given in the main directory of this repository. Feel free to adapt it by (un)commenting the blocks of steps you need or want to skip. The input arguments can also be changed to fit your purpose.

To run the master script, use the following command:

chmod 755 proteoformer_pipeline.sh
./proteoformer_pipeline.sh

There is also a version of the master script that uses the Python 3.10 version of every tool where available:

chmod 755 proteoformer_pipeline_py3.sh
./proteoformer_pipeline_py3.sh

Publications

  • Crappé, J., Ndah, E., Koch, A., Steyaert, S., Gawron, D., De Keulenaer, S., De Meester, E., De Meyer, T., Van Criekinge, W., Van Damme, P., and Menschaert, G. (2014) PROTEOFORMER: deep proteome coverage through ribosome profiling and MS integration. Nucleic Acids Res. 43, e29
  • Verbruggen, S., Ndah, E., Van Criekinge, W., Gessulat, S., Kuster, B., Wilhelm, M., Van Damme, P., and Menschaert, G. (2019) PROTEOFORMER 2.0: further developments in the ribosome profiling-assisted proteogenomic hunt for new proteoforms. Molecular & Cellular Proteomics, https://doi.org/10.1074/mcp.RA118.001218

Copyright

Copyright (C) 2014 G. Menschaert, J. Crappé, E. Ndah, A. Koch & S. Steyaert

Later updates: Copyright (C) 2019 S. Verbruggen, G. Menschaert, E. Ndah

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

More information

For more (contact) information visit http://www.biobix.be/PROTEOFORMER
