Gordonia-CRISPR

Scripts used for the reconstruction of Gordonia CRISPR arrays. All scripts run under the Linux operating system.

This pipeline reconstructs the diversity of a specific CRISPR array across several samples using Illumina paired-end reads. Reads should be at least 100 bp long; longer reads give better results. As a rule, the read length should be >= 2 spacers + 1 repeat.
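The read-length rule above can be checked with a few lines of Python. This is only a minimal sketch: the function names and the example repeat and spacer lengths are hypothetical and are not part of the repository scripts.

```python
def min_read_length(repeat_len: int, spacer_len: int) -> int:
    """Shortest read that can span two spacers and the repeat between them."""
    return 2 * spacer_len + repeat_len


def read_is_long_enough(read_len: int, repeat_len: int, spacer_len: int) -> bool:
    """Apply the rule: read length >= 2 spacers + 1 repeat."""
    return read_len >= min_read_length(repeat_len, spacer_len)


# Hypothetical array: 29 bp repeat and 33 bp spacers
print(min_read_length(29, 33))           # 95
print(read_is_long_enough(100, 29, 33))  # True
```

With these assumed lengths, 100 bp reads satisfy the rule; arrays with longer spacers or repeats need longer reads.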

It may be possible to use read files that have already been filtered by read length and quality.

Requirements:

bbduk (https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide)

Mothur v.1.39.5 (https://mothur.org)

Blast

Python modules:

Biopython (pip install biopython)

Use any available CRISPR detection software to identify candidate CRISPR arrays and extract the repeat sequence. Note the number of mismatches to allow in the repeat, since it is needed in the next step.
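To make the idea of "allowed mismatches" concrete, the sketch below checks whether a read contains the repeat, or its reverse complement, with at most a given number of substitutions. It is a conceptual illustration only: the function names and the toy sequences are hypothetical, and the comparison is done on the whole repeat rather than on k-mers, so it is not how bbduk works internally.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def contains_repeat(read: str, repeat: str, max_mismatches: int) -> bool:
    """True if the repeat or its reverse complement occurs in the read
    with at most max_mismatches substitutions (indels are not considered)."""
    targets = (repeat, reverse_complement(repeat))
    n = len(repeat)
    for start in range(len(read) - n + 1):
        window = read[start:start + n]
        if any(hamming(window, t) <= max_mismatches for t in targets):
            return True
    return False


# Toy sequences, not real CRISPR repeats
print(contains_repeat("TTAAACGCTT", "AAACCC", max_mismatches=1))  # True
print(contains_repeat("TTAAACGCTT", "AAACCC", max_mismatches=0))  # False
```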

-Using this information run bbduk over all samples in order to extract the reads with the repeat sequence:

bbduk.sh

in=R1_sample_1.fastq

in2=R2_sample_1.fastq

outm=R1_matched_sample_1.fq

outm2=R2_matched_sample_1.fq

k= # k-mer length (max 31)

mm=f

literal= # repeat sequence

hdist= # allowed mismatches

rcomp=T
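When there are many samples, the same command can be generated for each one. The following sketch only builds and prints the command lines, using the parameters listed above; the helper name, the sample names, and the placeholder repeat are assumptions, and the values of k and hdist must be chosen for your own array.

```python
import shlex


def bbduk_command(r1, r2, out1, out2, repeat, k, hdist):
    """Return the bbduk.sh argument list for one paired-end sample."""
    if not 1 <= k <= 31:
        raise ValueError("k must be between 1 and 31")
    if k > len(repeat):
        # A k-mer longer than the repeat cannot be taken from it
        raise ValueError("k cannot be longer than the repeat sequence")
    return [
        "bbduk.sh",
        f"in={r1}", f"in2={r2}",
        f"outm={out1}", f"outm2={out2}",
        f"k={k}", "mm=f",
        f"literal={repeat}",
        f"hdist={hdist}",
        "rcomp=T",
    ]


# Hypothetical sample names; the repeat is a placeholder, not a real sequence
placeholder_repeat = "ACGT" * 8
for name in ["sample_1", "sample_2"]:
    cmd = bbduk_command(
        f"R1_{name}.fastq", f"R2_{name}.fastq",
        f"R1_matched_{name}.fq", f"R2_matched_{name}.fq",
        repeat=placeholder_repeat, k=31, hdist=1,
    )
    print(shlex.join(cmd))
    # To execute instead of printing: subprocess.run(cmd, check=True)
```

Printing the commands first makes it easy to inspect them before running anything on the full data set.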

-Make a source file to be used by the script findRepeatCRISPR.py:

The source file is a tab-delimited file with a sample identifier followed by the two read file names:

sample1 R1.sample1.fastq R2.sample1.fastq

sample2 R1.sample2.fastq R2.sample2.fastq

sample3 R1.sample3.fastq R2.sample3.fastq

…

samplen R1.samplen.fastq R2.samplen.fastq
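The source file can also be written programmatically. This is a minimal sketch assuming the three-column layout shown above; the function name, the output file name source.tsv, and the sample list are hypothetical.

```python
import csv


def write_source_file(path, samples):
    """Write one tab-delimited line per sample: identifier, R1 file, R2 file."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for sample_id, r1, r2 in samples:
            writer.writerow([sample_id, r1, r2])


# Hypothetical samples following the naming used in the example above
samples = [
    (f"sample{i}", f"R1.sample{i}.fastq", f"R2.sample{i}.fastq")
    for i in range(1, 4)
]
write_source_file("source.tsv", samples)
```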

-Run the following Mothur commands to process the sequences, choosing the desired parameters:

screen.seqs(fasta=, maxlength=130)

unique.seqs(fasta=)

pre.cluster(fasta=, name=, diffs=3)

cluster.fragments(fasta=, name=, diffs=3)

-Mothur cannot merge identical sequences of different length. To overcome this problem, run the mergeLengthCRISPR.py script.
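One way to picture this merging step is to fold every sequence into a longer sequence that fully contains it, adding up the read counts. The sketch below illustrates that idea only; it is not the code of mergeLengthCRISPR.py, whose exact rules may differ, and the function name and toy data are hypothetical.

```python
def merge_contained(counts):
    """Merge each sequence into the longest kept sequence that contains it.

    counts maps sequence -> number of reads; returns a new dict.
    """
    merged = {}
    # Longest first, so containers are registered before their fragments
    for seq in sorted(counts, key=lambda s: (-len(s), s)):
        for kept in merged:
            if seq in kept:
                merged[kept] += counts[seq]
                break
        else:
            # No longer sequence contains this one: keep it as its own entry
            merged[seq] = counts[seq]
    return merged


# Toy data: the second sequence is a fragment of the first
print(merge_contained({"ACGTACGT": 5, "CGTAC": 2, "TTTT": 1}))
# {'ACGTACGT': 7, 'TTTT': 1}
```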

-Use the generated files to create the input nodes and edges files with the networkCRISPR.py script. These two files can be loaded into Gephi to visualize the raw network.

-To filter the network by eliminating low-quality connections and nodes, the qualityCRISPR.py script uses the following files to generate a file that can be loaded into Gephi:

The last .fasta file created (mergeLengthCRISPR.py step).
The merged reads (fastq) files for all samples (e.g. all.R1.fastq, all.R2.fastq).
The last .name file created (mergeLengthCRISPR.py step).
The original .seq file created at the beginning (findRepeatCRISPR.py step).

-Spacers fasta file creation

The easiest way to extract the spacers as a fasta file is from the network.nodes.csv file:

sed 's/ /\t/g' network.nodes.csv | cut -f 1,3 | sed 's/^/>/' | sed 's/\t/\n/' | tail -n+3 > spacersL.fa

sed 's/ /\t/g' network.nodes.csv | cut -f 1,5 | sed 's/^/>/' | sed 's/\t/\n/' | tail -n+2 > spacersR.fa
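For readers who prefer Python, the extraction above can be approximated as follows. This is a rough sketch rather than an exact equivalent: it assumes a space-delimited network.nodes.csv with a single header line, uses 1-based column numbers as cut does, and skips rows that lack the requested columns. The function name and the toy rows are hypothetical.

```python
def nodes_to_fasta(lines, id_col, seq_col, skip_header=True):
    """Build fasta text from space-delimited rows.

    id_col and seq_col are 1-based column numbers, as in `cut -f`.
    """
    records = []
    for i, line in enumerate(lines):
        if skip_header and i == 0:
            continue
        fields = line.rstrip("\n").split(" ")
        if len(fields) < max(id_col, seq_col):
            continue  # row does not have the requested columns
        records.append(f">{fields[id_col - 1]}\n{fields[seq_col - 1]}\n")
    return "".join(records)


# Toy rows with an assumed column layout, not real repository output
rows = [
    "Id Label SpacerL Other SpacerR",
    "n1 a ACGT b TTGA",
    "n2 c GGCC d CCAA",
]
print(nodes_to_fasta(rows, id_col=1, seq_col=3), end="")
```

To process the real file, the rows can be read with open("network.nodes.csv") and the result written to spacersL.fa (column 3) or spacersR.fa (column 5).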
