Scripts for analysing the composition of libraries generated by TRIAD cloning, which accompany the manuscript by Stephane Emond, Maya Petek, Emily Kay, Brennen Heames, Sean Devenish, Nobuhiko Tokuriki and Florian Hollfelder.
The scripts depend on the following external tools:
- PEAR: Obtain separately from https://cme.h-its.org/exelixis/web/software/pear/
- samtools version 1.9
- htslib version 1.9
- bowtie2 version 2.3
- emboss version 6.6
All are currently (August 2019) available as packages in standard Ubuntu repositories, apart from PEAR, which now requires a separate download due to a change in licensing.
The versions given here were used to process the Illumina dataset described in the manuscript. The pickled summary data and reference sequences for alignment are provided in the folder manuscript_data.
These scripts were developed and tested within an Anaconda environment, which is specified in TRIAD.yml. To install the environment, go to the TRIAD directory and run:
conda env create -f TRIAD.yml
Environment set-up usually takes about 10 minutes. Activate the environment with source activate TRIAD or conda activate TRIAD; the latter is preferred for conda v4.4 or higher.
While BioPython is included in the conda environment file, you may run into an issue where BioPython cannot be loaded. The workaround is to first install pip3 with your preferred package manager, then create the conda environment, and finally install BioPython with sudo pip3 install -c conda-forge biopython.
1a. If working on a cluster and it is difficult to install PEAR, assemble the reads separately:
pear -f $forwardReads -r $reverseReads -o $baseName.$activity --keep-original --min-overlap 5 --min-assembly-length 0 --quality-threshold 15 --max-uncalled-base 0.01
1b. Run count.sh reads_fw reads_rv reference.fa base_name activity or count_PC.sh [arguments], as appropriate to your environment.
Arguments:
- reads_fw : (only for PC version) path to forward fastq.gz reads
- reads_rv : (only for PC version) path to reverse reads
- reference.fa : filepath to reference fasta file
- base_name : Usually a shorthand for the gene being analysed, e.g. PTE
- activity : A label for what fraction / activity gate / input library these reads came from
Steps in the pre-processing script:
- (If using paired-end reads in 1a: merge reads with PEAR. Take the opportunity to filter out very broken data.)
- Align all reads against reference.
- While we have SAM files, take the opportunity to calculate depth / position.
- Extract reads that are correctly mapped, keep the name.
- Throw away reads that fully match the reference. This is faster than running NW alignment on all reads later.
- Feed interesting reads to EMBOSS Needleman-Wunsch aligner.
- Output fasta ALN files.
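The filtering step in the list above can be illustrated with a minimal Python sketch. It assumes mapped reads are already available in memory as simple records; the MappedRead type and the drop_reference_matches function are hypothetical names introduced here for illustration, and the actual scripts work on SAM files with samtools rather than on Python objects.

```python
from typing import Iterable, Iterator, NamedTuple


class MappedRead(NamedTuple):
    """Hypothetical minimal record for a mapped read (not a samtools type)."""
    name: str
    start: int      # 0-based reference position of the first aligned base
    sequence: str


def drop_reference_matches(reads: Iterable[MappedRead],
                           reference: str) -> Iterator[MappedRead]:
    """Yield only reads that differ from the reference at their mapped position."""
    reference = reference.upper()
    for read in reads:
        window = reference[read.start:read.start + len(read.sequence)]
        if read.sequence.upper() != window:
            yield read


reference = "ATGACCGGTTAA"
reads = [
    MappedRead("read1", 0, "ATGACC"),   # identical to the reference: dropped
    MappedRead("read2", 3, "ACCGTT"),   # differs from the reference: kept
    MappedRead("read3", 6, "GGTTAA"),   # identical to the reference: dropped
]
interesting = list(drop_reference_matches(reads, reference))
print([read.name for read in interesting])
```

Only the reads that survive this cheap comparison would then need to go through the slower Needleman-Wunsch alignment.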
These scripts will output an alignment file in FASTA format from each pair of input forward & reverse reads. The output names are specified as options to count.sh such that alignments will be named base_name.activity.aln (for example, PTE.3bp_deletion.aln).
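As a small illustration of the naming convention, the following Python sketch builds and parses such file names. Both helper functions are hypothetical and not part of the TRIAD scripts, and the parser assumes that base_name itself contains no dots.

```python
from pathlib import Path
from typing import Tuple


def alignment_filename(base_name: str, activity: str, folder: str = ".") -> Path:
    """Build the expected alignment path, e.g. PTE.3bp_deletion.aln."""
    return Path(folder) / f"{base_name}.{activity}.aln"


def parse_alignment_filename(path) -> Tuple[str, str]:
    """Recover (base_name, activity) from a name like PTE.3bp_deletion.aln.

    Assumption of this sketch: base_name contains no '.' characters.
    """
    name = Path(path).name
    if not name.endswith(".aln"):
        raise ValueError(f"not an alignment file: {name}")
    base_name, _, activity = name[:-len(".aln")].partition(".")
    if not base_name or not activity:
        raise ValueError(f"expected base_name.activity.aln, got: {name}")
    return base_name, activity


print(alignment_filename("PTE", "3bp_deletion").name)
print(parse_alignment_filename("/path/to/aln/files/PTE.3bp_deletion.aln"))
```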
The composition analysis of the PTE library data was run with the following command and options:
PTE_composition.py --folder /path/to/aln/files --reference TRIAD/manuscript_data/full_fragment.fa --start_offset 200 --end_trail 97 --output S6_full
- Load a dictionary of all interesting mutations we're considering
- Read a read+reference into a SeqRecord
- Various checks that the read is not broken
  - Barcodes: As long as the read does not contain insertions, the barcode is ignored and does not contribute to detected mutations
- If the mutation is defined as interesting, figure out what kind it is and add to valid_counts dictionary
- Add the mutation to a dictionary counting everything
- Save both dictionaries for later viewing.
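The counting logic in the steps above can be sketched in plain Python as follows. This is a simplified illustration, not the code in PTE_composition.py: the find_mutations function, the per-column event encoding and the toy sequences are all assumptions made for this example, barcode handling is left out, and the only check for a broken read is a length comparison. The name valid_counts follows the list above; all_counts is a hypothetical name for the dictionary that counts everything.

```python
import pickle
from collections import Counter


def find_mutations(ref_aln: str, read_aln: str) -> tuple:
    """Describe the differences in one gapped pairwise alignment.

    Returns a tuple of (kind, reference_position, detail) events, where kind is
    's' (substitution), 'd' (deletion) or 'i' (insertion) and positions are
    0-based on the ungapped reference. This encoding is an assumption.
    """
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned sequences must have equal length")
    events = []
    ref_pos = 0
    for ref_base, read_base in zip(ref_aln.upper(), read_aln.upper()):
        if ref_base == "-":                 # extra base in the read
            events.append(("i", ref_pos, read_base))
            continue
        if read_base == "-":                # reference base missing from the read
            events.append(("d", ref_pos, ref_base))
        elif read_base != ref_base:
            events.append(("s", ref_pos, f"{ref_base}>{read_base}"))
        ref_pos += 1
    return tuple(events)


# Toy set of "interesting" mutations: a single 3 bp deletion.
three_bp_deletion = (("d", 3, "A"), ("d", 4, "C"), ("d", 5, "C"))
interesting = {three_bp_deletion}

# Toy (reference, read) pairs standing in for entries of an .aln file.
alignments = [
    ("ATGACCGGTTAA", "ATG---GGTTAA"),
    ("ATGACCGGTTAA", "ATG---GGTTAA"),
    ("ATGACCGGTTAA", "ATGACCGGATAA"),
]

all_counts = Counter()      # every mutation pattern observed
valid_counts = Counter()    # only patterns defined as interesting
for ref_aln, read_aln in alignments:
    signature = find_mutations(ref_aln, read_aln)
    if not signature:       # read identical to the reference
        continue
    all_counts[signature] += 1
    if signature in interesting:
        valid_counts[signature] += 1

# Serialise both dictionaries; pickle.dump to an open file works the same way.
saved = pickle.dumps({"valid": dict(valid_counts), "all": dict(all_counts)})
restored = pickle.loads(saved)
print(restored["valid"])
```

Using the full tuple of events as the dictionary key means that two reads are counted together only if they carry exactly the same set of changes.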
Start a Jupyter notebook with jupyter lab and have a look at the results in TRIAD_composition_figures.ipynb.