Software tools to collect and analyze coronavirus (COV) sequences. The code extracts gene and protein sequences from COV genome files downloaded from NCBI and creates gene- and protein-specific sequence collections, alignments, and Hidden Markov Models (HMMs). The HMMs can be used to annotate COV sequences and create BED files for genome visualization.
./download_gb_by_taxid.py
./feature_to_gene_and_protein.py *.gb
./seqs_to_aligns_and_hmms.py *.fa
./annotate_to_bed.py *.gb
The COV genes and proteins are parsed from the GenBank files as features and assigned standard names based on their product tags. Possible synonyms for these standard names are listed in cov_dictionary.yaml. A QC step compares all the COV protein sequences to expected lengths listed in the cov_length_variants.yaml file, and sequences that do not match expected lengths are not included in sequence files, alignments, or HMMs.
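The naming and QC steps described above could look roughly like the following sketch. The dictionary contents, the length ranges, and the function names `standard_name` and `passes_length_qc` are hypothetical and used only for illustration; the actual layout of cov_dictionary.yaml and cov_length_variants.yaml may differ.

```python
# Hypothetical stand-ins for cov_dictionary.yaml and cov_length_variants.yaml;
# the real files may be structured differently.
SYNONYMS = {
    "S": ["spike protein", "surface glycoprotein", "spike glycoprotein"],
    "N": ["nucleocapsid protein", "nucleocapsid phosphoprotein"],
}
# Illustrative (min, max) protein lengths, not the project's real values.
EXPECTED_LENGTHS = {"S": (1200, 1300), "N": (400, 430)}


def standard_name(product):
    """Map a GenBank product tag to a standard name, or None if unknown."""
    wanted = product.strip().lower()
    for name, synonyms in SYNONYMS.items():
        if wanted == name.lower() or wanted in synonyms:
            return name
    return None


def passes_length_qc(name, sequence):
    """Return True if the sequence length falls within the expected range."""
    if name not in EXPECTED_LENGTHS:
        return False
    low, high = EXPECTED_LENGTHS[name]
    return low <= len(sequence) <= high
```

In this sketch, a sequence whose name is unknown or whose length falls outside the range is simply skipped, which mirrors the exclusion from sequence files, alignments, and HMMs.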
- Python3 and packages, including Biopython
- Sequence aligner (muscle, clustalo, or mafft)
- HMMER 3.3
- To make BED files:
- Get an NCBI API key
- Pass the key as a command-line argument or configure as an env variable (export NCBI_API_KEY=8cc3fffffff2b4444492e68a8167aaaa08)
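One way a script might accept the key either on the command line or from the environment is sketched below. The option name `--api_key` and the function `resolve_api_key` are assumptions for illustration and may not match what the scripts in this repository actually use.

```python
import argparse
import os


def resolve_api_key(argv=None):
    """Return the API key from the command line, else from NCBI_API_KEY."""
    parser = argparse.ArgumentParser()
    # "--api_key" is a hypothetical option name; the environment variable
    # supplies the default when the option is not given.
    parser.add_argument("--api_key", default=os.environ.get("NCBI_API_KEY"))
    args = parser.parse_args(argv)
    return args.api_key
```

With this arrangement a key given on the command line takes precedence over the environment variable, and the result is `None` when neither is set.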
- GenBank sequence: .gb
- Fasta sequence: .fa
- Fasta index file: .fai
- Fasta alignment: .fasta
- Stockholm alignment: .sto
- MAF alignment: .maf
- HMMER profile HMM: .hmm
- hmmsearch table: .tblout
- BED track file: .bed
- HMM-based annotation and QC
- Visualization (e.g. for IGV)
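For the visualization use case, annotations are written as BED lines. The sketch below shows how a single annotated feature could be formatted as a six-column BED record; the function name `bed_line` and the example values are hypothetical, and the project's own annotate_to_bed.py may build its output differently.

```python
def bed_line(chrom, start_1based, end_1based, name, strand="+", score=0):
    """Format one six-column BED line from 1-based, inclusive coordinates."""
    # BED intervals are 0-based and half-open, so only the start shifts by 1.
    fields = [chrom, start_1based - 1, end_1based, name, score, strand]
    return "\t".join(str(field) for field in fields)


# Example with made-up values: a feature spanning positions 100 to 200.
print(bed_line("genome1", 100, 200, "geneA"))
```

The coordinate shift matters because GenBank feature locations are 1-based and inclusive, while genome browsers such as IGV interpret BED starts as 0-based.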