GitHub - TomSkelly/PacBioEDA: Python scripts for Exploratory Data Analysis of Pacific Biosciences sequence data

This is a tools package for performing fairly low-level exploratory
data analysis on sequence data from a Pacific Biosciences instrument.

The scripts typically take as input a .bas.h5 file, as produced by the
primary analysis software running on the instrument server. Some
scripts also expect a .cmp.h5 file containing alignments produced by
the secondary analysis step.

The scripts are in Python, since that seems to be the native language
of PacBio. Keep in mind that I'm a Python newbie, so they will not be
as idiomatically correct as I'd like.

The scripts (and their supporting packages) are described at a high
level here. If you want to know more detail about what a script does
and how, use the Force: read the source. I am a fairly loquacious
commenter.

Feedback and patches are always welcome.

As of February 2013, I've moved from the Sanger Institute to the
Frederick (MD) National Laboratory for Cancer Research. My new email
is below.

Tom Skelly (thomas.skelly@nih.gov)


How to install:
~~~~~~~~~~~~~~

Installation is trivial: Just untar the tar file to a location of your
choice. At the Linux command line:

   cd MyFavouriteDirectory
   tar -xzvf TomSkelly-PacBioEDA-e1e8636.tar.gz

(with e1e8636 replaced by the current github version id of the file
you downloaded).

That will create a subdirectory called TomSkelly-PacBioEDA-<version>
containing the scripts. (You'll probably want to create a link to that
directory with a more meaningful name.) The only further configuration
required is to tell the scripts where to find the supporting Python
modules they require, as described in the next section.


How to run:
~~~~~~~~~~

In theory, the scripts can be run directly from the command
line. However, they need to be told where to find the relevant Python
packages (my modules, plus numpy, matplotlib, h5py, etc). You could do
that by permanently setting the PATH, PYTHONPATH, and LD_LIBRARY_PATH
environment variables in your shell profile. But it's probably easier
to run the scripts via the run_pac shell script included here. The
settings in run_pac are a subset of those in
smrtanalysis/etc/setup.sh. run_pac sets the appropriate environment
variables, then invokes the python script you specify. You'll need to
modify run_pac in a couple of places (indicated by comments) to point
to your installation directories.

Note that, in particular, the python executable itself (which the
scripts find via $PATH) should be the one supplied in the smrtanalysis
release (currently python 2.5, moving to 2.7 in PacBio release
1.3.0). If you mix and match (even different installations of v2.5),
you may see obscure errors having to do with UTF8 encodings. Use the
run_pac script to get all this right.

Start by trying "run_pac ScriptName.py --help" for each script to see
what parameters and options it accepts.

Then do something like:

    run_pac PacBio_FindAdapt.py --nocol MyBasFile.bas.h5 6789


A note about offsets:
~~~~~~~~~~~~~~~~~~~~

** REVISED REVISED REVISED REVISED **

Offsets printed are zero-based. Start-stop intervals are printed
python-style: The interval 0-10 is ten bases long, and its last base
is offset 9.

Alignment coordinates against a reference genome used to be a stated
exception to this rule. But it turns out that the alignment start/stop
offsets in a cmp.h5 file are zero-based. (I just discovered this --
sorry!) Rather than go fixing this by adding a bunch of +1's everywhere
in the code, let's just let alignment offsets join the 0-based club
along with everything else.
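
The convention above is the same one Python slicing uses. A small
sketch, using a made-up read and a hypothetical helper name
(interval_length is not part of this package):

```python
# Illustration only: a made-up 12-base read.
read = "ACGTACGTACGT"

def interval_length(start, stop):
    """Length of a python-style (zero-based, half-open) interval."""
    return stop - start

start, stop = 0, 10
subseq = read[start:stop]   # the bases at offsets 0 through 9
last_offset = stop - 1      # offset of the last base in the interval
```

Here the interval 0-10 yields ten bases, and the last one sits at
offset 9.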


Scripts:
~~~~~~~

* PacBio_BasSummary.py:

Python script to print summary information about the contents of a
*.bas.h5 file, as produced by the PacBio instrument. Provides counts
of reads, subreads and bases for a hierarchy of increasingly
restrictive subsets of subreads:

      Total contents of bas file
        Sequencing ZMWs
          Productivity-0 Sequencing ZMWs
          Productivity-2 Sequencing ZMWs
          Productivity-1 Sequencing ZMWs
            Reads with HQ region length > threshold
              Reads with HQ score > threshold
                Reads with average insert size > threshold (i.e., not adapter dimers)
                  Reads which aligned (if a .cmp.h5 file was supplied)

All of the above except average insert size are criteria applied by
the filtering step of secondary analysis.
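
The nesting of those subsets can be sketched as a chain of filters,
each applied to the survivors of the previous one. This is only an
illustration: the record fields, the threshold values and the function
name below are all invented, and the real script reads its data from
the bas.h5 file rather than from a list of dicts.

```python
# Hypothetical per-ZMW records. Field names and values are made up.
zmws = [
    {"sequencing": True,  "productivity": 1, "hq_len": 900, "hq_score": 0.85, "avg_insert": 400},
    {"sequencing": True,  "productivity": 1, "hq_len": 30,  "hq_score": 0.90, "avg_insert": 20},
    {"sequencing": True,  "productivity": 1, "hq_len": 500, "hq_score": 0.60, "avg_insert": 300},
    {"sequencing": True,  "productivity": 1, "hq_len": 700, "hq_score": 0.80, "avg_insert": 10},
    {"sequencing": True,  "productivity": 0, "hq_len": 0,   "hq_score": 0.0,  "avg_insert": 0},
    {"sequencing": True,  "productivity": 2, "hq_len": 0,   "hq_score": 0.0,  "avg_insert": 0},
    {"sequencing": False, "productivity": 0, "hq_len": 0,   "hq_score": 0.0,  "avg_insert": 0},
]

def nested_counts(zmws, min_hq_len=50, min_hq_score=0.75, min_insert=100):
    """Count ZMWs surviving each successive filter (thresholds are assumptions)."""
    steps = [
        ("Sequencing ZMWs",         lambda z: z["sequencing"]),
        ("Productivity-1",          lambda z: z["productivity"] == 1),
        ("HQ length > threshold",   lambda z: z["hq_len"] > min_hq_len),
        ("HQ score > threshold",    lambda z: z["hq_score"] > min_hq_score),
        ("Avg insert > threshold",  lambda z: z["avg_insert"] > min_insert),
    ]
    counts = [("Total contents", len(zmws))]
    subset = zmws
    for label, keep in steps:
        # Each filter only sees what the previous filter let through.
        subset = [z for z in subset if keep(z)]
        counts.append((label, len(subset)))
    return counts

for depth, (label, n) in enumerate(nested_counts(zmws)):
    print("%s%s: %d" % ("  " * depth, label, n))
```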

Inputs are a .bas.h5 file and (optionally) a .cmp.h5 file, specified
as command line parameters. Filtering parameters can be overridden
on the command line. See the --help option for details. Output is to
stdout.


* PacBio_Bas.py:

Print detailed data from a *.bas.h5 file and (optionally) the *.cmp.h5
alignments file on a region-by-region basis. If a cmp.h5 file is
supplied, it prints the alignment outcome for each region. First
parameter is the input *.bas.h5 file. Optional second parameter is a
*.cmp.h5 file. Output is to stdout; redirect with >.

NOTE: For help in interpreting the various line types printed by
this script, do "run_pac PacBio_Bas.py --linehelp".


* PacBio_BasFilteredSubreads.py:

Python script to extract filtered subreads from a bas.h5 file, and
write them to a fasta (or, optionally, fastq) file. By default, it
will reproduce the sec/data/filtered_subreads.fasta file for the set
described by the input .bas.h5 file. Different filtering parameters
can be specified as command line parameters.

The main reason for the existence of this script is to convince
myself that I understand how filtering works. But it also provides a
quick way to extract a set of reads from a .bas.h5 file.

The option to produce a fastq file is not easily available elsewhere
(as of SMRTanalysis 1.2.3). It uses the QualityValue field from the
bas.h5 data, and ascii-encodes it as Q+33.
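
For readers unfamiliar with the Q+33 encoding, here is a minimal
sketch of it. The function names are hypothetical and this is not the
script's own code; it only shows how a list of integer quality values
becomes the quality line of a fastq record.

```python
def encode_phred33(quality_values):
    """Map each integer quality value Q to the ASCII character chr(Q + 33)."""
    return "".join(chr(q + 33) for q in quality_values)

def fastq_record(name, sequence, quality_values):
    """Format one four-line fastq record (name, bases, '+', qualities)."""
    return "@%s\n%s\n+\n%s\n" % (name, sequence, encode_phred33(quality_values))
```

With this encoding, Q=0 is written as '!', Q=10 as '+' and Q=40 as 'I'.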


* PacBio_HQ.py:

Plot the number of ZMWs in pre-HQ, HQ, and post-HQ status as a
function of time.


* PacBio_HQHistory.py:

Create a time-stepped series of plots showing the spatial
distribution on the SMRTcell of ZMWs in their HQ regions.


* PacBio_XY.py:

Create a plot of ZMW status by x/y position on the SMRTcell. First
parameter is input hdf5 file. Output png file is optional command
line parameter, defaulting to XY.png.


* PacBio_Productivity.py:

Create a plot of ZMW productivity by x/y position on the
SMRTcell. First parameter is input .bas.h5 file. Output png file is
optional command line parameter, defaulting to productivity.png.


* PacBio_FindAdapt.py:

Read a PacBio bas.h5 file, extract the sequence for a ZMW specified
on the command line, align it to the adapter sequence, and output a
plot of Smith-Waterman alignment scores at each position in the
read. By default, output is to file ZMW-zzzzz.png, where zzzzz is
the zero-padded ZMW number -- this can be overridden on the command
line. Optional command line parameters specify a 0-based inclusive
range of bases within the read.


* PacBio_FullInsert.py:

Read a PacBio bas.h5 file, find 'full' insert regions, and produce a
histogram of their lengths. Full inserts are insert regions which
cover the entire sequenced molecule, i.e., inserts which run from one
adapter to the other.
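
A rough sketch of that idea, assuming adapter hits within one read are
available as python-style (start, stop) intervals. That representation
and the function name are assumptions made for illustration; the
script itself works from the regions stored in the bas.h5 file and may
apply further checks.

```python
def full_inserts(adapters):
    """Return the (start, stop) intervals lying between consecutive adapters.

    Sequence before the first adapter or after the last one is bounded
    by an adapter on one side only, so it is not a full insert.
    """
    adapters = sorted(adapters)
    inserts = []
    for (_, left_stop), (right_start, _) in zip(adapters, adapters[1:]):
        if right_start > left_stop:
            inserts.append((left_stop, right_start))
    return inserts

# Made-up adapter positions in a single read.
adapters = [(100, 145), (1145, 1190), (2200, 2245)]
lengths = [stop - start for start, stop in full_inserts(adapters)]
```

A histogram of the resulting lengths over all reads would then give
the kind of plot the script produces.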


* PacBio_Show.py:

Read a PacBio bas.h5 file, print a range of sequence from a specified
ZMW in fasta format. Command line parameters are bas.h5 file name and
ZMW#. Options include 0-based start and stop offsets.


* PacBio_Dump.py:

Dump a generic hdf5 file. Doesn't know anything about PacBio
formats. First (only) parameter is input hdf5 file. Output is to
stdout.


Modules:
~~~~~~~

* H5BasFile.py:

Support for common operations reading a PacBio bas.h5 file.


* H5CmpFile.py:

Package for accessing data from PacBio cmp.h5 alignment files.


* SWAligner.py:

Python module implementing a Smith-Waterman aligner with fixed gap
penalty. NOTE: This is not going to be fast! It's intended for small
jobs like finding adapters. We need a C version for the heavy lifting.
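
For orientation, a minimal sketch of Smith-Waterman local alignment
scoring with a fixed per-base gap penalty follows. It is not the code
in SWAligner.py: the function name and the match, mismatch and gap
values are assumptions chosen for illustration, and it returns only
the best score and where it ends, with no traceback.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-1):
    """Return (best local score, python-style stop offset in target)."""
    prev = [0] * (len(target) + 1)      # previous row of the score matrix
    best, best_end = 0, 0
    for i in range(1, len(query) + 1):
        curr = [0] * (len(target) + 1)
        for j in range(1, len(target) + 1):
            step = match if query[i - 1] == target[j - 1] else mismatch
            curr[j] = max(0,                  # start a new local alignment
                          prev[j - 1] + step, # align the two bases
                          prev[j] + gap,      # gap in target
                          curr[j - 1] + gap)  # gap in query
            if curr[j] > best:
                best, best_end = curr[j], j
        prev = curr
    return best, best_end
```

For example, aligning "ACGT" against "TTACGTTT" with these scores gives
a best score of 8, ending at stop offset 6 in the target. The double
loop is quadratic in sequence length, which is why this approach is
only practical for short sequences such as adapters.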


* tt_log.py:

Tiny module setting up parameters for the Python logger used by these
scripts.