TRIAD

Scripts for analysing the composition of libraries generated by TRIAD cloning, which accompany the manuscript by Stephane Emond, Maya Petek, Emily Kay, Brennen Heames, Sean Devenish, Nobuhiko Tokuriki and Florian Hollfelder.

Setup and installation

Alignment software

PEAR: Obtain separately from https://cme.h-its.org/exelixis/web/software/pear/
samtools version 1.9
htslib version 1.9
bowtie2 version 2.3
emboss version 6.6

All are currently (August 2019) available as packages in standard Ubuntu repositories, apart from PEAR, which now requires a separate download due to a change in licensing.

The versions given here were used to process the Illumina dataset described in the manuscript. The pickled summary data and reference sequences for alignment are provided in the folder manuscript_data.

Python and Conda dependencies

These scripts were developed and tested within an Anaconda environment, which is specified in TRIAD.yml. To install the environment, go to the TRIAD directory and use:

conda env create -f TRIAD.yml

Environment set-up usually takes about 10 minutes. Activate the environment with source activate TRIAD or conda activate TRIAD; the latter is preferred for conda 4.4 or higher.

Possible issue with conda and resolution

While BioPython is included in the conda environment file, you may run into an issue where BioPython cannot be loaded. The workaround is to first install pip3 with your preferred package manager, then create the conda environment, and finally install BioPython with sudo pip3 install biopython (pip takes no channel argument; the conda equivalent is conda install -c conda-forge biopython).
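To confirm whether the issue applies to your set-up, a quick check from inside the activated environment can help. The following is a minimal sketch; the helper name module_available is hypothetical and not part of the repository.

```python
import importlib.util


def module_available(name):
    """Return True if the top-level module `name` can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None


if module_available("Bio"):
    import Bio  # BioPython installs under the package name "Bio"
    print("BioPython", Bio.__version__)
else:
    print("BioPython not found in this environment")
```

If the second message is printed while the TRIAD environment is active, the workaround described above is likely needed.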

Short version of scripts

1a. If working on a cluster and it is difficult to install PEAR, assemble the reads separately:

pear -f $forwardReads -r $reverseReads -o $baseName.$activity --keep-original --min-overlap 5 --min-assembly-length 0 --quality-threshold 15 --max-uncalled-base 0.01
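When many read pairs need assembling, the same PEAR call can be built programmatically. This is a sketch only: it reuses exactly the options shown above, while the function name and the example file names are hypothetical.

```python
import shlex


def build_pear_command(forward_reads, reverse_reads, base_name, activity):
    """Return the PEAR invocation from step 1a as an argument list."""
    return [
        "pear",
        "-f", forward_reads,
        "-r", reverse_reads,
        "-o", f"{base_name}.{activity}",
        "--keep-original",
        "--min-overlap", "5",
        "--min-assembly-length", "0",
        "--quality-threshold", "15",
        "--max-uncalled-base", "0.01",
    ]


# Hypothetical input file names, for illustration only.
cmd = build_pear_command("forward.fastq", "reverse.fastq", "PTE", "3bp_deletion")
print(shlex.join(cmd))
# To run it (PEAR must be on PATH): subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a path contains spaces.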

1b. Run count.sh reads_fw reads_rv reference.fa base_name activity or count_PC.sh [arguments], as appropriate to your environment.

Arguments:

reads_fw : (only for PC version) path to forward fastq.gz reads
reads_rv : (only for PC version) path to reverse fastq.gz reads
reference.fa : path to the reference fasta file
base_name : usually a shorthand for the gene of interest, e.g. PTE
activity : a label for the fraction / activity gate / input library these reads came from

Steps in the pre-processing script.

(If using paired end reads in 1a: merge reads with PEAR. Take the opportunity to filter out very broken data.)
Align all reads against reference.
While we have SAM files, take the opportunity to calculate depth / position.
Extract reads that are correctly mapped, keep the name.
Throw away reads that fully match the reference. This is faster than running NW alignment on all reads later.
Feed interesting reads to EMBOSS Needleman-Wunsch aligner.
Output fasta ALN files.
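The extraction and filtering steps above can be illustrated in plain Python. This is a sketch under stated assumptions, not the logic of count.sh: it treats a read as fully matching when its CIGAR string is a single match block and its NM tag (edit distance, as written by bowtie2) is 0. All function names are hypothetical.

```python
import re

# A CIGAR consisting of one match block only, e.g. "12M" (no indels or clipping).
PERFECT_CIGAR = re.compile(r"^\d+M$")


def parse_sam_line(line):
    """Split one SAM alignment line into the fields used below."""
    fields = line.rstrip("\n").split("\t")
    # Optional fields look like "NM:i:0": two-letter tag, type, value.
    tags = {f[:2]: f[5:] for f in fields[11:]}
    return {
        "name": fields[0],
        "flag": int(fields[1]),
        "cigar": fields[5],
        "seq": fields[9],
        "tags": tags,
    }


def is_mapped(read):
    """Bit 0x4 of the SAM flag marks an unmapped read."""
    return not read["flag"] & 0x4


def matches_reference(read):
    """Assumed definition: single match block and zero edit distance."""
    return bool(PERFECT_CIGAR.match(read["cigar"])) and read["tags"].get("NM") == "0"


def interesting_reads(sam_lines):
    """Yield (name, sequence) for mapped reads that differ from the reference."""
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        read = parse_sam_line(line)
        if is_mapped(read) and not matches_reference(read):
            yield read["name"], read["seq"]


# Toy SAM records, invented for illustration.
sam = [
    "@SQ\tSN:ref\tLN:12",
    "r1\t0\tref\t1\t42\t12M\t*\t0\t0\tACGTACGTACGT\tIIIIIIIIIIII\tNM:i:0",
    "r2\t0\tref\t1\t42\t4M3D5M\t*\t0\t0\tACGTTACGT\tIIIIIIIII\tNM:i:3",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTTTTT\tIIIIIIII",
]
kept = list(interesting_reads(sam))
print(kept)
```

Only r2 survives: r1 is identical to the reference and r3 is unmapped. Keeping the read name alongside the sequence mirrors the "keep the name" step in the list.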

These scripts will output an alignment file in FASTA format from each pair of input forward & reverse reads. The output names are specified as options to count.sh such that alignments will be named base_name.activity.aln (for example, PTE.3bp_deletion.aln).

Counting substitutions, deletions and combinations for each sequencing file

For the PTE library data, the script was run with the following command and options:

PTE_composition.py --folder /path/to/aln/files --reference TRIAD/manuscript_data/full_fragment.fa --start_offset 200 --end_trail 97 --output S6_full
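For readers adapting the script to another library, the sketch below shows a parser that accepts the same option names as the command above. The argument types, defaults and required flags are assumptions and are not taken from PTE_composition.py.

```python
import argparse


def build_parser():
    """Parser mirroring the option names in the command above (types are assumed)."""
    parser = argparse.ArgumentParser(prog="PTE_composition.py")
    parser.add_argument("--folder", required=True)
    parser.add_argument("--reference", required=True)
    parser.add_argument("--start_offset", type=int, default=0)  # assumed integer
    parser.add_argument("--end_trail", type=int, default=0)     # assumed integer
    parser.add_argument("--output", required=True)
    return parser


args = build_parser().parse_args([
    "--folder", "/path/to/aln/files",
    "--reference", "TRIAD/manuscript_data/full_fragment.fa",
    "--start_offset", "200",
    "--end_trail", "97",
    "--output", "S6_full",
])
print(args.start_offset, args.end_trail, args.output)
```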

Load a dictionary of all interesting mutations we're considering
Read a read+reference into a SeqRecord
Various checks that the read is not broken
Barcodes: as long as the read does not contain insertions, the barcode is ignored and does not contribute to detected mutations
If the mutation is defined as interesting, figure out what kind it is and add to valid_counts dictionary
Add the mutation to a dictionary counting everything
Save both dictionaries for later viewing.
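The counting steps above can be sketched as follows. This is an illustration in plain Python, not the implementation in PTE_composition.py: the tuple encoding of mutations, the function names and the toy sequences are all assumptions, and the read checks and barcode handling are omitted.

```python
from collections import Counter


def find_mutations(ref_aln, read_aln):
    """Compare two gapped, equal-length aligned strings.

    Returns a tuple of mutations using 1-based reference positions:
    ('s', pos, ref_base, read_base), ('d', pos, deleted_bases) or
    ('i', pos, inserted_bases), where an insertion's pos is the last
    reference position before it.
    """
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned sequences must have equal length")
    mutations = []
    ref_pos = 0  # number of reference bases consumed so far
    i = 0
    while i < len(ref_aln):
        if ref_aln[i] == "-":  # insertion in the read
            j = i
            while j < len(ref_aln) and ref_aln[j] == "-":
                j += 1
            mutations.append(("i", ref_pos, read_aln[i:j]))
            i = j
        elif read_aln[i] == "-":  # deletion in the read
            j = i
            while j < len(ref_aln) and read_aln[j] == "-" and ref_aln[j] != "-":
                j += 1
            mutations.append(("d", ref_pos + 1, ref_aln[i:j]))
            ref_pos += j - i
            i = j
        else:
            ref_pos += 1
            if ref_aln[i] != read_aln[i]:
                mutations.append(("s", ref_pos, ref_aln[i], read_aln[i]))
            i += 1
    return tuple(mutations)


def count_mutations(alignments, interesting):
    """Count every mutation pattern, and separately those listed as interesting."""
    valid_counts, all_counts = Counter(), Counter()
    for ref_aln, read_aln in alignments:
        muts = find_mutations(ref_aln, read_aln)
        all_counts[muts] += 1
        if muts in interesting:
            valid_counts[muts] += 1
    return valid_counts, all_counts


# Toy data: one reference and four aligned reads.
ref = "ACGTACGTAC"
reads = ["ACGTACGTAC", "ACG---GTAC", "ACGTTCGTAC", "ACG---GTAC"]
interesting = {(("d", 4, "TAC"),)}  # hypothetical expected 3 bp deletion
valid_counts, all_counts = count_mutations([(ref, r) for r in reads], interesting)
print(valid_counts)
print(all_counts)
```

Using a tuple of mutations as the dictionary key lets a read with several changes be counted as one combination, while a read identical to the reference maps to the empty tuple.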

IPYNB notebook that reproduces figures and statistics in the manuscript

Start a jupyter notebook with jupyter lab and have a look at results in TRIAD_composition_figures.ipynb.
