OR gene family identification

This is a pipeline for identifying the olfactory receptor gene family.

ORFAM

A pipeline for identifying the olfactory receptor (OR) gene family.

Quick start

Install

git clone git@github.com:jianzuoyi/orfam.git
cd orfam
make

Run the example script
```
cd example
./run_orfam
```

Installation

Prerequisites

Python 2.7 (https://www.python.org)
- Biopython

Configuration

System paths to orfam's component software are specified in the [orfam.config](bin/orfam.config) file, which should reside in the same directory as the orfam executable (to use a different location, pass the -K flag). Upon installation, orfam attempts to generate this file automatically, but manual editing may be necessary.

Install components

Bioawk (https://github.com/lh3/bioawk)
bedtools (https://github.com/arq5x/bedtools2)
tblastn (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/)
Exonerate (http://www.ebi.ac.uk/about/vertebrate-genomics/software/exonerate)
MAFFT (http://mafft.cbrc.jp/alignment/software/)
MEGACC (http://www.megasoftware.net/)

If any components already exist on the system, their paths should be manually specified by editing orfam.config.
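
Before editing orfam.config by hand, it can help to see which components are already on the PATH. The sketch below is not part of orfam. It assumes Python 3 (for `shutil.which`) and assumes the executables carry their conventional names (`bioawk`, `bedtools`, `tblastn`, `exonerate`, `mafft`, `megacc`), which may differ on a given system. It does not read or write orfam.config, whose format is not described here.

```python
import shutil

# Conventional executable names for the components listed above (assumed;
# adjust them if your installation uses different names).
COMPONENTS = ["bioawk", "bedtools", "tblastn", "exonerate", "mafft", "megacc"]


def locate_components(names=COMPONENTS):
    """Map each executable name to its absolute path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in sorted(locate_components().items()):
        print("{:<10} {}".format(name, path or "NOT FOUND"))
```

Any path reported here can then be copied into orfam.config manually.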

Usage

orfam is a modular framework with three components:

orfam align - Searches the target genome using known OR protein sequences as queries and produces an alignment result file that can be processed by the other orfam modules.
orfam func - Identifies intact OR genes.
orfam pseudo - Identifies truncated OR genes and OR pseudogenes.

orfam align

orfam align searches the target genome using known OR protein sequences as queries and produces an alignment result file that can be processed by the other orfam modules.

Internally, orfam align runs the following steps to produce an output file (BLAST format 6):

Discard query sequences whose length is less than 250
Align the remaining query sequences to the subject genome with TBLASTN
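
The first step can be pictured with a short Python sketch. It is an illustration only, not orfam's own code: it assumes Python 3, parses FASTA by hand rather than with Biopython, and the function names and the `min_length` parameter are hypothetical. The threshold of 250 comes from the step listed above.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_length=250):
    """Keep records whose sequence length is at least min_length."""
    return [(h, s) for h, s in records if len(s) >= min_length]


if __name__ == "__main__":
    demo = [">short", "MKT", ">long", "A" * 300]
    kept = filter_by_length(read_fasta(demo))
    print([h for h, _ in kept])  # prints ['long']
```

To filter a real file, pass an open file handle to `read_fasta` instead of the demo list.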

usage:	orfam align [options]

Alignment options

-q FILE olfactory receptor proteins (FASTA)
-s FILE subject genome (FASTA)
-o STR 	output file [.align]
-T DIR 	temp directory [./tmpXXXXXXXX]

TBLASTN options

-e FLOAT evalue for hits
-t INT 	threads [1]

Global options

-K FILE path to orfam.config file (default: same directory as orfam)
-v 	verbose
-h 	show this message

Output

orfam align produces a single output file (BLAST format 6):

outprefix.tblastn
- The alignment result file. This file serves as input for orfam func.
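
BLAST format 6 is tab-separated text. With BLAST+ default settings it has twelve columns (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The Python 3 sketch below reads such a file into dictionaries. It assumes orfam align keeps the default column set, which this README does not state, and the function name `parse_blast6` is hypothetical.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
INT_FIELDS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_FIELDS = {"pident", "evalue", "bitscore"}


def parse_blast6(handle):
    """Yield one dict per hit from a tab-separated BLAST format 6 stream."""
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        for key in INT_FIELDS:
            hit[key] = int(hit[key])
        for key in FLOAT_FIELDS:
            hit[key] = float(hit[key])
        yield hit


if __name__ == "__main__":
    sample = "Q1\tchr1\t85.5\t300\t40\t2\t1\t300\t2000\t1101\t1e-50\t250.0\n"
    for hit in parse_blast6(io.StringIO(sample)):
        # In tblastn output, sstart > send indicates a minus-strand hit.
        strand = "+" if hit["sstart"] <= hit["send"] else "-"
        print(hit["qseqid"], hit["sseqid"], strand, hit["evalue"])
```

For a real run, replace the `io.StringIO` sample with an open handle on the .tblastn file.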

orfam func

orfam func identifies intact OR genes from the target genome.

usage:	orfam func [options]

Options

-R FILE reference file (fasta) (required)
-r FILE reference olfactory receptor (fasta) (required)
-B FILE BED file describing the regions of the reference olfactory receptor (required)
-A FILE tblastn output (tabular) (required)
-O FILE olfactory receptor for outgroup (fasta) (required)
-S FILE MAO file with the settings used to construct the phylogenetic tree (generated by megaproto) (required)
-o STR 	output prefix (required)
-t INT  threads [1]
-T DIR 	temp directory [./tmpXXXXXXXX]
-k 		keep temporary files
-K FILE path to orfam.config file (default: same directory as orfam)
-v 		verbose
-h 		show this message

Output

orfam func produces two output files:

outprefix_best_hit.gff
- This GFF file contains all OR candidate sequences, which fall into three types: intact OR genes, truncated OR genes and OR pseudogenes.
outprefix_intact.fa
- This FASTA file contains all intact OR gene sequences.

orfam pseudo

orfam pseudo identifies truncated OR genes and OR pseudogenes.

usage:	orfam pseudo [options]

Options

-s FILE subject genome (fasta) (required)
-q FILE query olfactory receptor proteins (fasta) (required)
-b FILE best hits (gff) (required)
-i FILE intact olfactory receptor (fasta) (required)
-o STR 	output prefix
-T DIR 	temp directory [./tmpXXXXXXXX]
-k 	keep temporary files
-K FILE path to orfam.config file (default: same directory as orfam)
-v 	verbose
-h 	show this message

Output

orfam pseudo produces five output files:

outprefix_truncated.gff
- This GFF file contains truncated OR genes.
outprefix_pseudo.gff
- This GFF file contains OR pseudogenes.
outprefix_pseudo_nonsense.fa
- This FASTA file contains olfactory receptors with nonsense mutations.
outprefix_pseudo_frameshift.fa
- This FASTA file contains olfactory receptors with frame shift mutations.
outprefix_pseudo_others.fa
- This FASTA file contains olfactory receptors with other mutations.
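
This README does not describe how orfam pseudo separates these categories. As a rough illustration of the terms only, the Python 3 sketch below flags an in-frame premature stop codon (a nonsense mutation) in a DNA coding sequence and treats a length that is not a multiple of three as a hint of a frameshift. The function name `classify_cds` and the heuristic itself are assumptions made for illustration; a real pipeline would compare each candidate against an aligned reference protein.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def classify_cds(seq):
    """Return 'frameshift', 'nonsense' or 'no_disruption' for a DNA coding sequence.

    Simplified heuristic (an assumption, not orfam's method): a length that is
    not a multiple of 3 is reported as a frameshift, and a stop codon before
    the final codon is reported as a nonsense mutation.
    """
    seq = seq.upper()
    if len(seq) % 3 != 0:
        return "frameshift"
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    # A stop codon anywhere before the final codon truncates the protein.
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        return "nonsense"
    return "no_disruption"


if __name__ == "__main__":
    for cds in ("ATGAAATAA", "ATGTAGAAATAA", "ATGAATAA"):
        print(cds, classify_cds(cds))
```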

Example workflows

Identification of OR genes from a target genome

Use orfam align to produce an alignment result file.

orfam align \
	-q data/ORs/ORs.fa \
	-s data/mm10/mm10.fa \
	-o mm10 \
	-e 1e-10 \
	-t 20 \
	-T temp \
	-v \
	-k

Use orfam func to identify intact OR genes.

orfam func \
	-R data/mm10/mm10.fa \
	-r data/ORs/O43749.fasta \
	-B data/ORs/O43749.bed \
	-O data/ORs/outgroup.fa \
	-S bin/infer_NJ_protein.mao \
	-A mm10.tblastn \
	-o mm10 \
	-t 20 \
	-T temp \
	-k \
	-v

Use orfam pseudo to identify truncated OR genes and OR pseudogenes.

orfam pseudo \
	-s data/mm10/mm10.fa \
	-q intact/mm10_intact.fa \
	-b mm10_best_hit.gff \
	-i mm10_intact.fa \
	-o mm10 \
	-T temp \
	-k \
	-v
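
The three steps can also be chained from a script. The Python 3 sketch below only assembles the same command lines shown above, with the same file paths and flags. It assumes that orfam is on the PATH and that the example data exist. The helper names `build_commands` and `run_all` are hypothetical and are not part of orfam.

```python
import subprocess


def build_commands(prefix="mm10", genome="data/mm10/mm10.fa", threads=20):
    """Return the align, func and pseudo command lines from the example above."""
    common = ["-T", "temp", "-k", "-v"]
    align = ["orfam", "align",
             "-q", "data/ORs/ORs.fa",
             "-s", genome,
             "-o", prefix,
             "-e", "1e-10",
             "-t", str(threads)] + common
    func = ["orfam", "func",
            "-R", genome,
            "-r", "data/ORs/O43749.fasta",
            "-B", "data/ORs/O43749.bed",
            "-O", "data/ORs/outgroup.fa",
            "-S", "bin/infer_NJ_protein.mao",
            "-A", prefix + ".tblastn",
            "-o", prefix,
            "-t", str(threads)] + common
    pseudo = ["orfam", "pseudo",
              "-s", genome,
              "-q", "intact/" + prefix + "_intact.fa",
              "-b", prefix + "_best_hit.gff",
              "-i", prefix + "_intact.fa",
              "-o", prefix] + common
    return [align, func, pseudo]


def run_all(commands):
    """Run each command in order, stopping at the first one that fails."""
    for cmd in commands:
        subprocess.check_call(cmd)


if __name__ == "__main__":
    # Print the commands for inspection; call run_all(build_commands()) to execute them.
    for cmd in build_commands():
        print(" ".join(cmd))
```

Because `subprocess.check_call` raises on a non-zero exit status, a failed align step prevents func and pseudo from running on missing input.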
