CORE: COde for Romps in Evolutionary data

A mixture of scripts and libraries to help with sequence data manipulation, tree parsing, and other things.

Author

Gregg Thomas

About

These scripts can be used for many tasks including sequence handling, tree making, and sequence alignment.

Some of these programs are mainly used as wrappers to easily run other genomics or phylogenetics programs on a bunch of files. Pay attention to the dependencies for each script to make sure you have the proper programs installed.

Please note that many of these scripts expect input as FASTA files. For my scripts, these must have the extension .fa. If you don't have FASTA-formatted files, you can use seq_convert.py to convert them to FASTA format and fa_edit.py to make any changes you need afterwards.

Almost all of these scripts are written in Python 2.7 (https://www.python.org/downloads/).

For any script, use the -h flag for specific usage details.

CORE scripts

cafecore/cafe_report_analysis.py

This script reads the report output file from a CAFE run and makes the results more understandable. It has a lot of options for output based on the files you have.

corelib/core.py

General helper functions, such as reading sequences into a dictionary. Browse the file to see everything that is available.
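As an illustration of the "sequences to a dictionary" idea, here is a minimal sketch of a FASTA reader. The function name read_fasta and its behavior are assumptions made for this example; they are not taken from core.py, whose actual helpers may differ.

```python
def read_fasta(lines):
    """Return a {header: sequence} dictionary from an iterable of FASTA lines.

    Hypothetical helper for illustration only; not the function in core.py.
    """
    seqs = {}
    title = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            title = line[1:]  # header without the leading ">"
            seqs[title] = []
        elif title is not None:
            seqs[title].append(line)  # sequences may wrap over several lines
    # Join the wrapped lines of each record into one string.
    return {t: "".join(parts) for t, parts in seqs.items()}


example = [">seq1 human", "ATGGCC", "TTAA", "", ">seq2 mouse", "ATGGCA"]
records = read_fasta(example)
print(sorted(records.keys()))
```

To read a real file, pass an open file handle instead of a list, for example read_fasta(open("genes.fa")), where genes.fa is a placeholder name.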

corelib/nj_tree.r

Simple R script to build a neighbor-joining tree. Used by supertreemaker.py and probably not useful on its own.
Dependencies:
i. R (https://www.r-project.org/)

corelib/treeparse.py

A couple of functions that read (rooted) Newick-formatted trees and return the relevant information in a form that is easier to work with in code.

count_aln.py

This script gathers statistics about a single alignment file, or a directory full of alignment files.

count_pos.py

This script simply counts the number of amino acids or nucleotides in a file or directory.

fa_concat.py

Concatenates many FASTA formatted sequence files into a single FASTA file.

fa_edit.py

A general purpose FASTA handling script. Can relabel and trim headers and remove start and stop AAs.

how_many_trees

Just a little script to show the number of possible rooted tree topologies for a given number of species.
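For reference, the count can be sketched in a few lines. This example assumes the standard result for rooted, strictly bifurcating trees with labeled tips, (2n-3)!! for n species; whether how_many_trees uses exactly this definition is an assumption, and the function name below is made up for the example.

```python
def num_rooted_trees(n_species):
    """Number of rooted, bifurcating, labeled topologies: (2n - 3)!!.

    Hypothetical helper; assumes the standard double-factorial formula.
    """
    if n_species < 1:
        raise ValueError("need at least one species")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3.
    for k in range(3, 2 * n_species - 2, 2):
        count *= k
    return count


for n in (3, 4, 5, 10):
    print(n, num_rooted_trees(n))
```

The count grows very quickly: 3 species give 3 topologies, 5 species give 105, and 10 species already give 34,459,425.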

paml_lrt.py

Performs a likelihood ratio test on output from the branch-site test in codeml.
Dependencies: Output from two run_codeml.py runs with -b 1 (null model) and -b 2 (alternate model).
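The core of a likelihood ratio test is small enough to sketch here. This example assumes the test statistic 2 * (lnL_alt - lnL_null) is compared against a chi-square distribution with 1 degree of freedom, a common convention for the branch-site test; paml_lrt.py may use a different reference distribution or correction, and the function name and log-likelihood values are made up for illustration.

```python
import math


def lrt_pvalue(lnl_null, lnl_alt):
    """Return (statistic, p-value) for a likelihood ratio test with 1 df.

    Hypothetical helper; assumes a chi-square reference with df = 1.
    """
    # Clamp at zero: the alternate model should never fit worse than the null,
    # but optimization noise can produce slightly negative values.
    stat = max(2.0 * (lnl_alt - lnl_null), 0.0)
    # For df = 1, the chi-square survival function equals erfc(sqrt(x / 2)).
    pval = math.erfc(math.sqrt(stat / 2.0))
    return stat, pval


stat, pval = lrt_pvalue(-1002.5, -1000.0)  # illustrative log-likelihoods
print("LRT statistic: %.2f, p-value: %.4f" % (stat, pval))
```

Using math.erfc keeps the sketch free of third-party dependencies; with SciPy installed, scipy.stats.chi2.sf(stat, 1) gives the same quantity.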

run_codeml.py

A script to run some basic PAML analyses with codeml.
Dependencies:
i. PAML (http://abacus.gene.ucl.ac.uk/software/paml.html) called as codeml
ii. Newick Utilities (http://cegg.unige.ch/newick_utils) called as nw_prune

run_gblocks.py

A script to run GBlocks to mask a directory full of alignments in FASTA format. Note: This currently runs GBlocks at the most relaxed settings for phylogenetic tree inference. It will reject any masks that remove more than 20% of the columns from the original alignment.
Dependencies:
i. GBlocks (http://molevol.cmima.csic.es/castresana/Gblocks.html) called as gblocks
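The 20% rejection rule described above can be expressed as a simple check. This is a sketch only: the function name, its arguments, and the treatment of a mask that removes exactly 20% are assumptions, not code from run_gblocks.py.

```python
def mask_is_acceptable(original_cols, masked_cols, max_removed=0.20):
    """Return True if masking removed no more than max_removed of the columns.

    Hypothetical helper illustrating the rejection rule; not from run_gblocks.py.
    """
    if original_cols <= 0:
        raise ValueError("original alignment must have at least one column")
    removed_fraction = (original_cols - masked_cols) / float(original_cols)
    # "More than 20%" is read strictly, so exactly 20% removed is still accepted.
    return removed_fraction <= max_removed


print(mask_is_acceptable(1000, 900))  # 10% removed
print(mask_is_acceptable(1000, 700))  # 30% removed
```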

run_muscle.py

This will make MUSCLE alignments out of a directory of FASTA files.
Dependencies:
i. muscle (http://www.drive5.com/muscle/downloads.htm) called as muscle

run_pasta_aln.py

This will make PASTA alignments out of a directory of FASTA files.
Dependencies:
i. PASTA (http://www.cs.utexas.edu/~phylo/software/pasta/) called as python run_pasta.py

run_raxml.py

Runs some basic RAxML analyses on a directory full of FASTA files.
Dependencies:
i. RAxML (http://sco.h-its.org/exelixis/web/software/raxml/index.html) called as raxml, though you can specify the path to your own raxml executable.

seq_convert.py

A sequence file format conversion tool. Currently converts between FASTA (.fa), Phylip (.ph), and Nexus (.nex) formats. It assumes files will have those extensions. Remember, these formats vary a lot in the details, so they might not work right away for everything. Let me know if you run into problems and I'll try to fix it.

supertreemaker.py

This script can do several things: it runs SDM to get average consensus distance matrices, makes NJ trees from distance matrices, re-roots trees, and makes ultrametric trees. All of the programs listed below will need to be in your PATH.
Dependencies:
i. SDM (http://www.atgc-montpellier.fr/sdm/) to calculate the matrix, called as java -jar ~/bin/SDM/SDM.jar
ii. R (https://www.r-project.org/) to make NJ trees, called as Rscript
iii. Newick Utilities (http://cegg.unige.ch/newick_utils) to re-root trees, called as nw_reroot
iv. r8s (http://loco.biosci.arizona.edu/r8s/) to smooth trees, called as r8s


