#repertoire
Profiling model T-cell and B-cell metagenomes with short reads.
##Receptor Assembly
My reads for this T-cell analysis came from sheared fragments, so I had to assemble them using iSSAKE.
Gzipped (.gz) files are supported throughout the pipeline.
Quality trim your fastq:

    seqtk trimfq in.fq > trimmed.fq

Convert fastq to fasta:

    bioawk -c fastx '{print ">"$name"\n"$seq}' trimmed.fq > trimmed.fa

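
If bioawk isn't available, the conversion can be sketched in plain Python. This is only a rough stand-in for the bioawk one-liner above, not part of the pipeline: the function names are made up for illustration, and it assumes strict four-line FASTQ records (no wrapped sequences).

```python
import gzip


def open_maybe_gzip(path):
    """Open a plain or gzipped text file for reading (hypothetical helper)."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Write one FASTA record per four-line FASTQ record; return the count."""
    count = 0
    while True:
        header = fastq_handle.readline()
        if not header.strip():
            break  # end of file (or a trailing blank line)
        seq = fastq_handle.readline().rstrip("\n")
        fastq_handle.readline()  # '+' separator line, ignored
        fastq_handle.readline()  # quality line, ignored
        # Like bioawk's $name: the header up to the first whitespace, minus '@'
        name = header[1:].split()[0]
        fasta_handle.write(">%s\n%s\n" % (name, seq))
        count += 1
    return count
```

Open the input with `open_maybe_gzip("trimmed.fq")`, open `trimmed.fa` for writing, and pass both handles to `fastq_to_fasta`.
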
Download V region sequences from IMGT (TRAV or TRBV)
Create tags from IMGT regions:

    python create_tags.py -v -l 35 trav.fa > trav.tags.fa

Find seeds among your reads:

    python find_seeds.py -v trav.tags.fa in.fq > seeds.fa

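
To give a feel for what seed finding means here, below is a toy sketch that keeps every read containing at least one tag as an exact substring. This is an assumption about the general idea, not a description of what find_seeds.py actually does (it may handle reverse complements, mismatches, or quality differently). `find_seed_reads` is a hypothetical name.

```python
def find_seed_reads(tags, reads):
    """Return the (name, seq) reads that contain at least one tag verbatim.

    `tags` and `reads` are iterables of (name, sequence) pairs.
    Matching is case-insensitive and forward-strand only (an assumption).
    """
    tag_seqs = {seq.upper() for _, seq in tags}
    seeds = []
    for name, seq in reads:
        upper = seq.upper()
        if any(tag in upper for tag in tag_seqs):
            seeds.append((name, seq))
    return seeds
```
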
Run iSSAKE:

    iSSAKE -f trimmed.fa -s seeds.fa -b sampleid

##Contig Assessment
Download J regions for the chain you are analyzing (TRAJ for alpha, TRBJ for beta).
Rename fasta names:

    python renameIMGT.py --gene TRAJ imgt_traj.fa > traj.fa

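
IMGT fasta headers are long and pipe-delimited, which is why the rename is needed. The sketch below shows the kind of shortening I mean, assuming the allele name (e.g. `TRAJ1*01`) is the second pipe-separated field and that `--gene` acts as a prefix filter. Both are assumptions; renameIMGT.py may do something different, and `rename_imgt` is a made-up name.

```python
def rename_imgt(lines, gene):
    """Yield FASTA lines with IMGT headers shortened to the allele name.

    Records whose allele name does not start with `gene` are dropped.
    """
    keep = False
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split("|")
            # Assumed IMGT layout: accession|allele|species|...
            allele = fields[1] if len(fields) > 1 else fields[0]
            keep = allele.startswith(gene)
            if keep:
                yield ">" + allele
        elif keep and line:
            yield line
```
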
Locally align J regions to assembled contigs:

    exonerate -q sampleid.contigs \
        -t traj.fa \
        --bestn 1 \
        --ryo ">%qi|%ti\n%qs" \
        --showalignment FALSE \
        --showvulgar FALSE \
        > sampleid.exonerate_out.fa

This step not only filters out possibly bad contigs that have no identifiable J
region, but also appends the J region name to the read name. You'll have to
filter out some unwanted lines added by exonerate:

    grep -v "Command line:\|Hostname:\|-- completed" sampleid.exonerate_out.fa > sampleid.fa

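
The same cleanup can be done in Python if you'd rather not rely on GNU grep's `\|` alternation. This is a minimal sketch that mirrors the grep above (it drops any line containing one of the three markers and nothing else); `strip_exonerate_noise` is just an illustrative name.

```python
# Substrings taken from the grep pattern above
UNWANTED = ("Command line:", "Hostname:", "-- completed")


def strip_exonerate_noise(lines):
    """Yield only the lines that contain none of the unwanted markers."""
    for line in lines:
        if not any(marker in line for marker in UNWANTED):
            yield line
```

Pass it an open file handle for `sampleid.exonerate_out.fa` and write the yielded lines to `sampleid.fa`.
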
Parse read names into data table:

    python reads2meta.py sampleid.fa > sampleid.metadata

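
Because the exonerate `--ryo` format writes each header as `>contig_id|J_region`, the names are easy to split into columns. Here is a rough sketch of that idea. The column names and the extra length column are my own choices for illustration and are not necessarily what reads2meta.py outputs.

```python
import csv
import sys


def headers_to_rows(lines):
    """Yield (contig_id, j_region, sequence_length) for each FASTA record."""
    current, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                yield current + (length,)
            # Split on the LAST pipe, since the contig id may itself contain pipes
            contig, sep, j_region = line[1:].rpartition("|")
            if not sep:  # no pipe at all: treat the whole name as the contig id
                contig, j_region = j_region, ""
            current, length = (contig, j_region), 0
        elif line and current is not None:
            length += len(line)
    if current is not None:
        yield current + (length,)


def write_table(lines, out=sys.stdout):
    """Write the rows as a tab-separated table with a header line."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["contig", "j_region", "length"])
    writer.writerows(headers_to_rows(lines))
```
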
##Links
Bioawk: https://github.com/lh3/bioawk
Python dependency: pip install toolshed