A Bayesian APOBEC3G-induced hypermutation analysis tool implemented in Python.
The following examples illustrate some of the basic usage of hyperfreq.
Note that if one navigates to the root of this code base, these commands should all execute using the test data in tests/data
.
To run an analysis, use the hyperfreq analyze
command.
Running hyperfreq analyze -h
will give you a full list of options.
Here are some examples to get you started and give you a rough sense of hyperfreq's capabilities.
# Simple analysis comparing each sequence to a consensus sequence constructed from the entire alignment
hyperfreq analyze tests/data/alignment.fasta
# Instead, compare each sequence to the consensus for a cluster specified in a clusters file
hyperfreq analyze tests/data/alignment.fasta -c tests/data/clusters.csv
# Specify the reference sequence(s) you want each sequence to be compared to
hyperfreq analyze tests/data/alignment.fasta -r tests/data/ref_seqs.fasta
These commands will all output data to a file named hyperfreq_analysis.call.csv
.
The prefix and location of this file can be specified using the -P
and -o
flags.
By default, these analyses look for GG context hypermutation, suggestive of APOBEC3G activity.
One can specify multiple contexts for analysis using the -p
or --patterns
flag(s).
Pattern options include
pattern | associated with |
---|
`GG` | A3G activity
`GA` | A3F (and other A3) activity in humans
`GR` | combined A3G and A3F activity (as often observed in hypermutated HIV
`GM` | rhesus macaque A3DE activity (as observed in XMRV and SFV infections) \*
`GV` | combined rhesus A3DE and A3G activity
Note that R, M and V are IUPAC degenerate codes for A or G; A or C; and A, C or G, respectively.
When running multiple patterns, the call.csv
file contains a column called call_pattern
which represents the pattern in which the evidence of hypermutation appears to be strongest.
Other data in the call.csv
file will contain counts, and statistics specifically for the pattern considered the call pattern.
If the -F/--full-output
flag is specified, a separate file is output for each pattern analyzed (for example hyperfreq_analysis.GG.csv
).
Given an alignment, we can cut out sites/columns suspected of hypermutation by using the split
command.
Running this command requires specifying an alignment and a CSV file with a column named column
, specifying which positions in the alignment to be cut out.
In addition to the hyperfreq_analysis.call.csv
file, hyperfreq analyze
also produces a hyperfreq_analysis.sites.csv
file which has such a column, as well as information regarding which sequences were hypermutated at which sequence positions.
For example,
# for an alignment with hypermutated columns removed
hyperfreq split alignment.fasta hypermutated_columns.csv
For more thorough usage, run hyperfreq split -h
at the command line.
If you want to write your own scripts, you can do so by importing the appropriate modules
from hyperfreq import Alignment
from Bio import SeqIO
# Create a hyperfreq alignment object
seqs = SeqIO.parse('some_file.fasta', 'fasta')
aln = Alignment(seqs)
# Obtain an analysis generator which can be iterated over.
analysis = aln.analyze()
# Iterate over each sequence in the analysis, and do whatever you like!
for seq_result in analysis:
print seq_result['sequence'], "hm status:", seq_result['hm_pos']
It's also possible to define your own mutation patterns using the MutPattern
and MutPatternSet
classes.
It may be possible in the future to more flexibly specify patterns more flexibly via the CLI, but for now, doing so requires using this code base as a library in writing your own scripts.
If having this functionality available via the command line is important to you, please submit an issue, and we'll see what we can do.
It is also now possible to run a hyperfreq analyze
command with the -N / --interactive
flag.
This causes the program to load up an interactive python session with the analysis
generator, instead of simply writing the results to file.
You can enter dir()
to see what namespaces and data have been included for you, and obviously load any libraries (numpy, biopython, etc.) that might be helpful.
Do note though that analysis
is a generator, so if you need to make more than one pass through the results interactively, you'll need to throw them into an list.
analysis = [result for result in analysis]
Currently, Hyperfreq has only been tested on Linux systems, but it should be possible to get it set up on OSX fairly easily. If you do get it set up on OSX or Windows and have any tips to share, please feel free to add a page to the wiki.
Hyperfreq depends on the following python libraries
We recommend installing betarat and alnclst first, using the directions in the links above. Since alnclst requires biopython, you should now only have to install fisher. Assuming you have pip and the python-dev libraries installed (which you should after following the instructions above), you should be able to run
# (sudo may not be necessary, depending on how you set up your python environment)
sudo pip install fisher
Now you should be able to download and install hyperfreq
# I like to download things to a src directory in my home folder
mkdir -p ~/src
cd ~/src
# Download and unzip the source code
wget https://github.com/fhcrc/hyperfreq/archive/betarat-refactor.zip -O hyperfreq.zip
unzip hyperfreq.zip
cd hyperfreq
# Install (sudo may not be necessary, depending on how you set up your python environment)
sudo python setup.py install
And there you have it!
You can try running hyperfreq -h
from the command line to test your installation.
If you have any trouble installing, please submit an issue, so we can try to help and update the documentation.