
uc-cdis/covid-bioinformatics

 
 


covid-bioinformatics

Software tools to collect and analyze coronavirus sequences. The code extracts gene and protein sequences from COV (coronavirus) genome files downloaded from NCBI and creates gene- and protein-specific sequence collections, alignments, and Hidden Markov Models (HMMs). The HMMs can be used to annotate COV sequences and create BED files for genome visualization.

typical usage

  • ./download_gb_by_taxid.py
  • ./feature_to_gene_and_protein.py *.gb
  • ./seqs_to_aligns_and_hmms.py *.fa
  • ./annotate_to_bed.py *.gb

The COV genes and proteins are parsed from the GenBank files as features and assigned standard names based on their product tags. Possible synonyms for these standard names are listed in cov_dictionary.yaml. A QC step compares every COV protein sequence to the expected lengths listed in cov_length_variants.yaml, and sequences that do not match an expected length are excluded from the sequence files, alignments, and HMMs.
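The name assignment and length QC described above can be sketched as follows. This is a minimal illustration, not the repository's code: the dictionaries stand in for cov_dictionary.yaml and cov_length_variants.yaml, whose real layout is not shown here, and the function names, synonyms, and lengths are example values chosen for the sketch.

```python
# Hypothetical in-memory stand-ins for cov_dictionary.yaml and
# cov_length_variants.yaml; the real files may be structured differently.
SYNONYMS = {
    "S": ["spike glycoprotein", "surface glycoprotein", "spike protein"],
    "N": ["nucleocapsid phosphoprotein", "nucleocapsid protein"],
}
EXPECTED_LENGTHS = {
    "S": [1273],  # example value only
    "N": [419],   # example value only
}


def standard_name(product, synonyms=SYNONYMS):
    """Map a GenBank product tag to a standard name, or None if unknown."""
    wanted = product.strip().lower()
    for name, aliases in synonyms.items():
        if wanted == name.lower() or wanted in (a.lower() for a in aliases):
            return name
    return None


def passes_length_qc(name, sequence, expected=EXPECTED_LENGTHS):
    """True if the sequence length is one of the expected lengths for name."""
    return len(sequence) in expected.get(name, [])


def filter_proteins(records):
    """Keep (standard_name, sequence) pairs that are named and pass QC.

    records is an iterable of (product_tag, protein_sequence) tuples.
    """
    kept = []
    for product, sequence in records:
        name = standard_name(product)
        if name is not None and passes_length_qc(name, sequence):
            kept.append((name, sequence))
    return kept
```

Matching is case-insensitive here so that variations in capitalization of the product tag do not cause a sequence to be dropped; whether the actual scripts do the same is an assumption.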

requirements

recommended for best performance

  • Get an NCBI API key
  • Pass the key as a command-line argument or set it as an environment variable (export NCBI_API_KEY=8cc3fffffff2b4444492e68a8167aaaa08)
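One way a script could resolve the key from either source is sketched below. The flag name --api_key and the function resolve_api_key are assumptions made for this example; only the NCBI_API_KEY variable name comes from the text above, and the actual option accepted by download_gb_by_taxid.py may differ.

```python
import argparse
import os


def resolve_api_key(argv=None, environ=None):
    """Return the NCBI API key from the command line, else the environment.

    The --api_key flag is hypothetical; NCBI_API_KEY is the variable
    named in this README. Returns None when neither is set.
    """
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(description="Resolve an NCBI API key")
    parser.add_argument("--api_key", default=None,
                        help="NCBI API key (overrides NCBI_API_KEY)")
    args = parser.parse_args(argv)
    # A key given on the command line takes precedence over the environment.
    return args.api_key or environ.get("NCBI_API_KEY")
```

Letting the command-line value win keeps a one-off override possible without changing the shell environment.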

file formats created

  • GenBank sequence: .gb
  • Fasta sequence: .fa
  • Fasta index file: .fai
  • Fasta alignment: .fasta
  • Stockholm alignment: .sto
  • MAF alignment: .maf
  • HMMER profile HMM: .hmm
  • hmmsearch table: .tblout
  • BED track file: .bed
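Of the formats above, BED uses 0-based, half-open coordinates, while GenBank feature locations are 1-based and inclusive, so the start position shifts by one when a feature is written as a BED line. The sketch below shows that conversion for a six-column BED line; the function name bed_line is hypothetical, and how annotate_to_bed.py builds its output is not shown in this README.

```python
def bed_line(chrom, start_1based, end_1based, name, strand="+", score=0):
    """Format a six-column BED line from 1-based inclusive coordinates.

    BED start is 0-based and BED end is exclusive, so only the start
    is decremented; the end value is numerically unchanged.
    """
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("coordinates must satisfy 1 <= start <= end")
    if strand not in ("+", "-", "."):
        raise ValueError("strand must be '+', '-' or '.'")
    fields = [chrom, str(start_1based - 1), str(end_1based),
              name, str(score), strand]
    return "\t".join(fields)
```

For example, a feature annotated at 21563..25384 on a sequence named NC_045512.2 would be written with a BED start of 21562 and an end of 25384.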

to do

  • HMM-based annotation and QC
  • Visualization (e.g. for IGV)
