Link to paper: https://doi.org/10.1093/molbev/msx223
The vast majority of bacteria show an enrichment of adenine beyond that expected by chance alone. We determine the extent of this bias, then propose and test a series of hypotheses that may help to explain it. This repository contains the source code for running the analyses described in the paper.
- All scripts are run from the root directory, except the site_tests scripts.
- Scripts 1 and 2 are used to download the EMBL files.
- A list of accessions for the genomes used can be found in accession_lists.
- Eukaryote genomes are downloaded directly from Ensembl and need to be placed in a folder named eukaryote_genomes in the root directory.
- 1_bacterial_genome_download/_get_genomes.tcl and 14_archaea/14.1_get_genomes.tcl are run using Tcl.
- 4_ratio_testing/_all_t4_genomes.py, 18_protists/get_genomes.py and 18_protists/protists.py are run using Python 3.
- All other files are run using Python 2 or R.
- R scripts are found in scripts/r_scripts/.
- get_genomes.tcl: download genomes (command requires path to accession list)
- group_genomes.py: group genomes into one folder
- _sortFiles.py: sort the genomes
- _parseFiles.py: parse the EMBL files and output to FASTA-like formats
- _filterGenes.py: filter genes
- _parseFiles_t4.py: parse the files for all table 4 genomes
- _calculate_ratios.py: calculate enrichment ratios for a given site
- _chitest.py: chi-square test for a given site
- _site_4_no_overlap.py: calculate enrichment ratios when CDSs that overlap the next CDS are excluded
- _chitest_site_4_no_overlap: chi-square test for site 4 with overlapping CDSs excluded
- _a_usage_start_cds.py: get the A usage at sites in the 5' domain
- _all_t4_genomes.py: calculate fourth site ratios for all table 4 genomes
- _codon_nucleotide_proportions.py: calculate the nucleotide proportions at each position
- _codon_gc_proportions.py: calculate the GC content at each position
- _codon_gc_varaiance.py: calculate the GC variance at each position
- _aod_second_amino.py: calculate aod scores for amino acids
- _stopDistanceGenome.py: distances to first +1 stop
- _stopDistanceGenomeSecondStop.py: distances to second +1 stop
- _stopDistanceGenomeThirdStop.py: distances to third +1 stop
- _getHighExpGenes.py: get genomes with annotations of the highly expressed genes
- _genomeGenes.py: create files with all genes, and the highly expressed genes
- _runCodonW.py: calculate CAI scores
- _CAI_analysis.py: run analysis on CAI values
- _CAI_stats.py: get CAI for the different start codons
- _compare_base_changes_related_species.py: compare orthologs between E. coli and Shigella
- _leaderGenes.py: search for upstream leader genes
- _filteredLeader.py: filter the genes that may have a leader
- _leaderAnalysis.py: analyse the leaders
- _leader_leaderless_a_content: get the A content according to leader status
- _upstream_cds.py: test whether an upstream CDS affects fourth site content
- _met_pairs.py: compare use of amino acids after methionine
- _anti_sd_sequences.py: get the 16S anti-SD sequence
- _sdSequences.py: calculate the binding strength of each CDS with the anti-SD sequence
- _analyse_sd_sequences.py: analyse the SD sequences
- 14.1_get_genomes.tcl: download archaea genomes (command requires path to accession list)
- 14.2_group_genomes.py: group archaea genomes
- 14.3_sortFiles.py: sort files
- 14.4_parseFiles.py: extract to FASTA-like format
- 14.5_filterGenes.py: filter the CDSs
- 14.6_calculate_ratios: calculate enrichment ratios for a given site
- 14.7_chitest.py: chi-square test for a given site
- 15.1_filterGenes.py: filter the CDSs
- 15.2_calculate_ratios.py: calculate ratios for a given site
- _second_amino_relative_usage.py: amino acid use in the second position
- _second_amino.py: get enrichment ratios of amino acids
- 17_multivar.py: sort outputs for multivar analysis
- get_genomes.py: download protist genomes
- protists.py: calculate enrichment ratios for protists
- Two independent scripts verifying fourth site calculations
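
For readers unfamiliar with the two quantities that recur in the list above (site enrichment ratios and distances to the first +1 stop), the following is a minimal illustrative sketch in Python 3. It is not taken from the repository's scripts. The function names are hypothetical, and two assumptions are made: the "expected" proportion is taken from the same codon position in all downstream codons of the same sequences (excluding the stop codon), and the +1 distance is counted as the number of complete +1-frame codons read before the first stop. The repository's scripts may define both differently.

```python
from collections import Counter

# Stop codons under NCBI translation table 11 (bacterial) and table 4
# (where TGA encodes tryptophan rather than stop).
STOPS_TABLE_11 = frozenset({"TAA", "TAG", "TGA"})
STOPS_TABLE_4 = frozenset({"TAA", "TAG"})


def site_enrichment_ratio(cds_list, site=4, base="A"):
    """Observed / expected proportion of `base` at a 1-based CDS site.

    Assumption: the expected proportion is measured at the same codon
    position in every downstream codon, excluding the terminal stop.
    """
    index = site - 1
    observed, background = Counter(), Counter()
    for cds in cds_list:
        cds = cds.upper()
        if len(cds) <= index:
            continue  # sequence too short to have this site
        observed[cds[index]] += 1
        # Same codon position, one codon further on each step,
        # stopping before the final (stop) codon.
        for i in range(index + 3, len(cds) - 3, 3):
            background[cds[i]] += 1
    if not observed or background[base] == 0:
        raise ValueError("not enough data to compute a ratio")
    observed_prop = observed[base] / sum(observed.values())
    expected_prop = background[base] / sum(background.values())
    return observed_prop / expected_prop


def plus_one_stop_distance(cds, stops=STOPS_TABLE_11):
    """Number of complete +1-frame codons before the first +1 stop.

    Returns None if the +1 frame contains no stop codon.
    """
    cds = cds.upper()
    # The +1 frame starts at the second nucleotide (0-based index 1).
    for n, i in enumerate(range(1, len(cds) - 2, 3)):
        if cds[i:i + 3] in stops:
            return n
    return None


if __name__ == "__main__":
    toy = ["ATGAAACCCGGGTAA", "ATGGCTACTTTTTGA"]  # made-up sequences
    print(site_enrichment_ratio(toy))                # 2.0
    print(plus_one_stop_distance("ATGCTAAGGCTGACC"))  # 1
```

A ratio above 1 indicates that the base is more common at the site than at the chosen background positions. Passing `STOPS_TABLE_4` to `plus_one_stop_distance` treats TGA as a sense codon, which matters for the table 4 genomes.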