Group_CID_7180

1 install packages

Set up the running environment and install the software:

(1) Modify the shebang line of setup.py so that it runs correctly
(2) $ python3 setup.py build to build the environment
(3) $ python3 setup.py install to install the software needed
(4) $ python3 setup.py --help lists the available options

If the installation does not succeed, you can install the dependencies with pip instead. Run the commands below from within your $HOME directory on the ASC!

Use python3 to install pip (a python package/library manager) in your user account using a python script copied from the biobootcamp directory

cp ~/biobootcamp/get-pip.py . && python3 get-pip.py --user

Confirm that pip is running in your user account and its version is 20.0.2

.local/bin/pip --version

Now we need to add ~/.local/bin to our $PATH, so:

nano ~/.bashrc.local

And add the following line to the bottom of the file. It puts ~/.local/bin ahead of the other directories in $PATH, so the shell finds the pip that was just installed with --user:

export PATH=~/.local/bin:$PATH

Now log out and back into your supercomputer account to pick up the new $PATH. To make sure this is the case,

echo $PATH

And be sure the new ~/.local/bin is at the beginning of your $PATH
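The same check can be scripted. This is a minimal sketch, assuming the only requirement is that ~/.local/bin is the first entry of $PATH; the function name local_bin_is_first is hypothetical and not part of this repository.

```python
import os


def local_bin_is_first(path_value, home):
    """Return True if <home>/.local/bin is the first entry of a PATH string."""
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    if not entries:
        return False
    expected = os.path.normpath(os.path.join(home, ".local", "bin"))
    # expanduser covers the case where PATH still holds a literal "~".
    first = os.path.normpath(os.path.expanduser(entries[0]))
    return first == expected


if __name__ == "__main__":
    ok = local_bin_is_first(os.environ.get("PATH", ""), os.path.expanduser("~"))
    print("~/.local/bin is first on PATH" if ok else "~/.local/bin is NOT first on PATH")
```

If the script reports that the directory is not first, re-check the export line in ~/.bashrc.local and log in again.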

Now it is time to use pip to install packages such as:

pip install biopython
pip install reportlab
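To confirm that the packages are visible to python3, one option is a short import check. This is a sketch only: the helper name missing_modules is hypothetical, and it relies on Biopython being imported under the module name Bio.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Biopython is imported as "Bio"; reportlab is imported as "reportlab".
    missing = missing_modules(["Bio", "reportlab"])
    print("All packages found" if not missing else "Missing: " + ", ".join(missing))
```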

2 Download genome sequence

Use the scripts in Part_1.

Welcome to our pipeline for analysis of COVID-19 sequences.

We are starting with the file COVID_Seqs.csv as a reference for accession numbers.

The first step will be to sort the accessions by country for comparison. SortCut_filter.sh will copy the location and accession data to a separate file, then copy out accessions from different countries.
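For readers who want to see the idea of this sorting step without reading the shell script, here is a rough Python sketch. It is not the logic of SortCut_filter.sh itself. The column names Accession and Geo_Location, the "Country: region" convention, and the function name accessions_by_country are all assumptions made for illustration; adjust them to the actual header of COVID_Seqs.csv.

```python
import csv
import io
from collections import defaultdict


def accessions_by_country(csv_text, accession_col="Accession", location_col="Geo_Location"):
    """Group accession numbers by country from CSV text that has a header row."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        accession = (row.get(accession_col) or "").strip()
        location = (row.get(location_col) or "").strip()
        if not accession:
            continue  # skip rows without an accession number
        # Assumed convention: "USA: CA" -> country "USA"; empty location -> "Unknown".
        country = location.split(":")[0].strip() or "Unknown"
        groups[country].append(accession)
    return dict(groups)
```

Calling accessions_by_country(open("COVID_Seqs.csv").read()) would return a dictionary that maps each country to its list of accessions, provided the column names match.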

--------------outline version-----------

Step 1: Start with COVID_Seqs.csv

Run SortCut_Filter.sh to separate the sequences into lists of “location; accession” ($place_sorted.txt)
--also splits up USA into sets of 100 randomized (USA_randsplit_aa)

ls *_sorted* for reference

Run Lines_to_commas.sh to convert the location/accession lists to comma-separated lists for download use

ls DL* for reference

Run Collect_seqs.sh and input the name of your DL file to get the fasta

ls -thor *.fasta to make sure it went through

If the size seems too big or too small, check the number of samples with:

cat Other_seqs.fasta | grep ">" | wc -l
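The record count can also be taken in Python. The sketch below counts lines that start with ">", which matches the grep pipeline for a well-formed FASTA file (grep would additionally count any line with ">" elsewhere in it). The function name count_fasta_records is hypothetical.

```python
import sys


def count_fasta_records(lines):
    """Count header lines (those starting with '>') in an iterable of lines."""
    return sum(1 for line in lines if line.startswith(">"))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 this_script.py Other_seqs.fasta
    with open(sys.argv[1]) as handle:
        print(count_fasta_records(handle))
```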

Step 2: Alternatively, download all complete genomes in fasta format, together with the GenBank record NC_045512.2.gb, from NCBI: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=SARS-CoV-2,%20taxid:2697049&SLen_i=29000%20TO%2030000

3 data preparation

Replace every character other than A, T, G or C with N (header lines are left untouched):

sed -e '/^[^>]/s/[^ATGCatgc]/N/g' example_sequence.fasta > example_sequence_processed.fasta
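For anyone who prefers to do this step in Python, the following sketch mirrors the sed expression: lines that start with ">" are copied unchanged, and on every other line any character outside ATGCatgc becomes N. The function name mask_non_acgt is hypothetical.

```python
import re

_NON_ACGT = re.compile(r"[^ATGCatgc]")


def mask_non_acgt(fasta_text):
    """Replace every character other than A/T/G/C (either case) with N on sequence lines."""
    out = []
    for line in fasta_text.split("\n"):
        if line.startswith(">"):
            out.append(line)  # header line: keep as-is
        else:
            out.append(_NON_ACGT.sub("N", line))
    return "\n".join(out)
```

As with the sed command, lowercase a, t, g and c are kept in lowercase rather than converted.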

4 Get genome sequences general data

Run from the command line:

python3 CID_sequence_description.py example_sequence_processed.fasta
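The exact output of CID_sequence_description.py is not documented here. As an illustration of what "general data" for a genome can look like, the sketch below computes length, base counts and GC fraction per sequence. The function names parse_fasta and describe, and the choice to compute GC over unambiguous bases only, are assumptions of this example and not a description of the script.

```python
from collections import Counter


def parse_fasta(text):
    """Return {header: sequence} from FASTA text (sequences upper-cased)."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def describe(sequence):
    """Return length, per-base counts and GC fraction for one sequence."""
    counts = Counter(sequence.upper())
    acgt = sum(counts[base] for base in "ACGT")
    summary = {"length": len(sequence)}
    for base in "ACGTN":
        summary[base] = counts[base]
    # GC fraction over unambiguous bases only (an assumption of this sketch).
    summary["gc_fraction"] = (counts["G"] + counts["C"]) / acgt if acgt else 0.0
    return summary
```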

5 Output longest orf fasta file

python3 CID_longest_peptide.py example_sequence_processed.fasta
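To make the idea of a "longest ORF" concrete, here is a simplified sketch. It assumes an ORF runs from the first ATG in a reading frame to the next in-frame stop codon, searches the three forward frames only, and returns nucleotides without translating them. CID_longest_peptide.py may use different rules; the function name longest_orf is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(sequence):
    """Return the longest ATG...stop ORF (stop included) on the forward strand, or ''."""
    seq = sequence.upper()
    best = ""
    for frame in range(3):
        start = None  # index of the ATG that opened the current ORF in this frame
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None  # close this ORF and look for the next ATG
    return best
```

An ORF that is still open when the sequence ends is ignored in this sketch, because it has no stop codon.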

6 Multiple alignment

Run from the command line:

./multiple_alignment.sh

or run it with run_script:

run_script multiple_alignment.sh

7 Construct COVID 19 genome structure using NC_045512.2.gb and Multiple alignment results

python3 genome_structure.py

8 PCA for Cluster analysis based on genome sequences general data using R

R code in PCA.R

9 phylogenetic tree using NCBI, ETE Toolkit, and Clustalw2

On the command line, type 'clustalw2'

A menu appears with options for alignments and for creating phylogenetic trees

Select Phylogenetic Tree

Next, select the .fasta or .aln file to load

Then select the type of output file (.ph)

Select the type of phylogenetic tree (Neighbor joining)

Enter the outfile name and run

Take the outfile and view it on NCBI or with ETE Toolkit
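Before loading the tree into a viewer, it can be useful to check that the .ph file contains the expected sequences. The sketch below assumes the file is plain Newick text with unquoted leaf names, and pulls out each name that directly follows "(" or ",". The function name newick_leaf_names is hypothetical, and this is a quick sanity check, not a full Newick parser.

```python
import re

# A leaf name follows "(" or "," (optionally after whitespace or a line break)
# and ends at the next ":", ",", "(", ")", ";" or whitespace.
_LEAF = re.compile(r"[(,]\s*([^():,;\s]+)")


def newick_leaf_names(newick_text):
    """Return the leaf names of an unquoted Newick tree, in order of appearance."""
    return _LEAF.findall(newick_text)
```

Labels on internal nodes come after ")", so they are not reported as leaves.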

*After ETEToolkit has been debugged, one should be able to use the script to create a .PNG file of the Tree using this on the command line

mza0150/Group_CID_7180_final_project
