Please cite Shirley, Matthew (2014): pyfaidx: efficient pythonic random access to fasta subsequences. figshare. DOI:10.6084/m9.figshare.972933.

Description

Samtools provides a function "faidx" (FAsta InDeX), which creates a small flat index file ".fai" allowing fast random access to any subsequence in the indexed FASTA file while loading a minimal amount of the file into memory.
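To make the role of the index concrete, here is a minimal sketch of building a faidx-style index in pure Python. It is not the samtools or pyfaidx implementation: the function name build_index is hypothetical, the field names (rlen, offset, lenc, lenb) are borrowed from the Faidx index output shown later in this document, and their meanings (sequence length, byte offset of the first base, bases per line, bytes per line) are an assumption based on the samtools .fai layout. It also assumes every line of a record has the same width except the last one.

```python
import os
import tempfile


def build_index(path):
    """Return {name: {'rlen', 'offset', 'lenc', 'lenb'}} for a FASTA file.

    Hypothetical helper for illustration only, not part of pyfaidx.
    """
    index = {}
    entry = None
    pos = 0  # byte offset of the start of the current line
    with open(path, 'rb') as handle:
        for line in handle:
            if line.startswith(b'>'):
                name = line[1:].split()[0].decode()
                # Sequence bytes start right after the header line.
                entry = {'rlen': 0, 'offset': pos + len(line),
                         'lenc': 0, 'lenb': 0}
                index[name] = entry
            elif entry is not None:
                bases = len(line.rstrip(b'\r\n'))
                if entry['lenc'] == 0:
                    # The first sequence line defines the line width.
                    entry['lenc'] = bases      # bases per line
                    entry['lenb'] = len(line)  # bytes per line, newline included
                entry['rlen'] += bases
            pos += len(line)
    return index


# Demo on a tiny temporary FASTA file wrapped at 4 bases per line.
with tempfile.NamedTemporaryFile('wb', suffix='.fa', delete=False) as tmp:
    tmp.write(b'>seq1 demo\nACGT\nACGT\nAC\n>seq2\nGGCC\nT\n')
demo_index = build_index(tmp.name)
os.remove(tmp.name)
print(demo_index)
```

Only these four numbers per sequence need to be kept, which is why the index stays small even for very large FASTA files.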

pyfaidx provides an interface for creating and using this index, giving fast random access to DNA subsequences in huge FASTA files in a "pythonic" manner. Indexing speed is comparable to samtools, and in some cases sequence retrieval is much faster (benchmark). For example:

>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta')
>>> genes
Fasta("tests/data/genes.fasta")

Acts like a dictionary.

>>> genes.keys()
['NR_104215.1',
'KF435150.1', 'NM_001282548.1', 'NM_001282549.1', 'XM_005249644.1',
'NM_001282543.1', 'NR_104216.1', 'XM_005265508.1', 'XR_241079.1',
'AB821309.1', 'XM_005249645.1', 'XR_241081.1', 'XM_005249643.1',
'XM_005249642.1', 'NM_001282545.1', 'NR_104212.1', 'XR_241080.1',
'XM_005265507.1', 'KF435149.1', 'NM_000465.3']

>>> genes['NM_001282543.1'][200:230]
>NM_001282543.1:201-230
CTCGTTCCGCGCCCGCCATGGAACCGGATG

>>> genes['NM_001282543.1'][200:230].seq
'CTCGTTCCGCGCCCGCCATGGAACCGGATG'

>>> genes['NM_001282543.1'][200:230].name
'NM_001282543.1:201-230'

>>> genes['NM_001282543.1'][200:230].start
201

>>> genes['NM_001282543.1'][200:230].end
230

>>> len(genes['NM_001282543.1'])
5466

Slices just like a string:

>>> genes['NM_001282543.1'][200:230][:10]
>NM_001282543.1:201-210
CTCGTTCCGC

>>> genes['NM_001282543.1'][200:230][::-1]
>NM_001282543.1:230-201
GTAGGCCAAGGTACCGCCCGCGCCTTGCTC

>>> genes['NM_001282543.1'][200:230][::3]
>NM_001282543.1:201-230
CGCCCCTACA

>>> genes['NM_001282543.1'][:]
>NM_001282543.1:1-5466
CCCCGCCCCT........

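Slice indices are 0-based and end-exclusive, just like Python. The name, start and end reported for a slice use 1-based, inclusive coordinates, so [200:230] is reported as 201-230.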
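The relationship between the Python slice and the coordinates printed in the examples above can be sketched as follows. The helper slice_to_coords is hypothetical and only covers forward slices with a step of 1; it is not a pyfaidx function.

```python
def slice_to_coords(start, stop, length):
    """Map a 0-based, end-exclusive slice onto 1-based, inclusive coordinates."""
    # slice.indices() resolves None and out-of-range bounds like Python does.
    start, stop, _ = slice(start, stop).indices(length)
    return start + 1, stop


print(slice_to_coords(200, 230, 5466))    # (201, 230)
print(slice_to_coords(None, None, 5466))  # (1, 5466)
```

The start shifts by one while the stop is unchanged, because a 0-based exclusive end and a 1-based inclusive end refer to the same base.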

Complements and reverse complements just like DNA:

>>> genes['NM_001282543.1'][200:230].complement
>NM_001282543.1 (complement):201-230
GAGCAAGGCGCGGGCGGTACCTTGGCCTAC

>>> genes['NM_001282543.1'][200:230].reverse
>NM_001282543.1:230-201
GTAGGCCAAGGTACCGCCCGCGCCTTGCTC

>>> -genes['NM_001282543.1'][200:230]
>NM_001282543.1 (complement):230-201
CATCCGGTTCCATGGCGGGCGCGGAACGAG
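For readers who want to check the three outputs above by hand, the operations can be reproduced on plain strings. This is a minimal sketch that handles only A, C, G and T (upper and lower case) and leaves every other character unchanged; it is not how pyfaidx implements them, and the function names are chosen for illustration.

```python
# Translation table for the four unambiguous bases.
COMPLEMENT = str.maketrans('ACGTacgt', 'TGCAtgca')


def complement(seq):
    """Swap each base for its pairing partner, keeping the order."""
    return seq.translate(COMPLEMENT)


def reverse_complement(seq):
    """Complement the sequence and read it from the other end."""
    return complement(seq)[::-1]


seq = 'CTCGTTCCGCGCCCGCCATGGAACCGGATG'
print(complement(seq))          # GAGCAAGGCGCGGGCGGTACCTTGGCCTAC
print(seq[::-1])                # GTAGGCCAAGGTACCGCCCGCGCCTTGCTC
print(reverse_complement(seq))  # CATCCGGTTCCATGGCGGGCGCGGAACGAG
```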

Custom key functions provide cleaner access:

>>> from pyfaidx import Fasta
>>> genes = Fasta('tests/data/genes.fasta', key_function = lambda x: x.split('.')[0])
>>> genes.keys()
dict_keys(['NR_104212', 'NM_001282543', 'XM_005249644', 'XM_005249645', 'NR_104216', 'XM_005249643', 'NR_104215', 'KF435150', 'AB821309', 'NM_001282549', 'XR_241081', 'KF435149', 'XR_241079', 'NM_000465', 'XM_005265508', 'XR_241080', 'XM_005249642', 'NM_001282545', 'XM_005265507', 'NM_001282548'])
>>> genes['NR_104212'][:10]
>NR_104212:1-10
CCCCGCCCCT
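One thing to keep in mind with a key function is that two sequence names may collapse to the same key. How pyfaidx handles that case is not covered here; the sketch below only illustrates the concern with a hypothetical helper, apply_key_function, which refuses duplicate keys.

```python
def apply_key_function(names, key_function):
    """Map each shortened key to its original name, refusing duplicates."""
    mapping = {}
    for name in names:
        key = key_function(name)
        if key in mapping:
            raise ValueError('duplicate key %r for %r and %r'
                             % (key, mapping[key], name))
        mapping[key] = name
    return mapping


names = ['NR_104212.1', 'NM_001282543.1', 'NM_000465.3']
print(apply_key_function(names, lambda x: x.split('.')[0]))
```

With the version suffix stripped, 'NM_000465.3' and a hypothetical 'NM_000465.4' would both map to 'NM_000465', which is the situation the check guards against.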

It also provides a command-line script:

cli script: faidx

$ faidx tests/data/genes.fasta NM_001282543.1:201-210 NM_001282543.1:300-320
>NM_001282543.1
CTCGTTCCGC
>NM_001282543.1
GTAATTGTGTAAGTGACTGCA

$ faidx --complement tests/data/genes.fasta NM_001282543.1:201-210
>NM_001282543.1
GAGCAAGGCG

$ faidx --reverse tests/data/genes.fasta NM_001282543.1:201-210
>NM_001282543.1
CGCCTTGCTC

$ faidx tests/data/genes.fasta NM_001282543.1
>NM_001282543.1
CCCCGCCCCT........

$ faidx tests/data/genes.fasta --list regions.txt
...

The syntax is similar to samtools faidx.
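The region strings used above follow a name:start-end pattern with 1-based coordinates. As a rough sketch of how such a string can be split, and not the parser that faidx actually uses, the hypothetical function parse_region below splits on the last colon so that names which themselves contain a colon (such as EM_PHG:V01146) still work.

```python
def parse_region(region):
    """Split 'name:start-end' into (name, start, end); coordinates are 1-based.

    A bare name returns (name, None, None), meaning the whole sequence.
    """
    name, sep, span = region.rpartition(':')
    if sep and '-' in span:
        start, _, end = span.partition('-')
        if start.isdigit() and end.isdigit():
            return name, int(start), int(end)
    # No trailing 'start-end' part: treat the whole string as the name.
    return region, None, None


print(parse_region('NM_001282543.1:201-210'))
print(parse_region('NM_001282543.1'))
print(parse_region('EM_PHG:V01146:1-10'))
```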

A lower-level Faidx class is also available:

>>> from pyfaidx import Faidx
>>> fa = Faidx('T7.fa')
>>> fa.build('T7.fa', 'T7.fa.fai')
>>> fa.index
{'EM_PHG:V01146': {'lenc': 60, 'lenb': 61, 'rlen': 39937, 'offset': 40571}, 'EM_PHG:GU071091': {'lenc': 60, 'lenb': 61, 'rlen': 39778, 'offset': 74}}

>>> fa.fetch('EM_PHG:V01146', 1, 10)
>EM_PHG:V01146
TCTCACAGTG

>>> fa.fetch('EM_PHG:V01146', 100, 120)
>EM_PHG:V01146
GGTTGGGGATGACCCTTGGGT

If the FASTA file is not yet indexed, initializing Faidx runs the build method automatically, producing "filename.fa.fai", where "filename.fa" is the original FASTA file.
Start and end coordinates passed to fetch are 1-based and inclusive.
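The four fields of an index entry are enough to turn a 1-based coordinate into a byte position. The sketch below assumes that offset is the byte position of the first base, lenc the number of bases per line and lenb the number of bytes per line including the newline, as in the samtools .fai layout. The function name fetch is reused for illustration only; this is not the pyfaidx method.

```python
import io


def fetch(handle, entry, start, end):
    """Read bases start..end (1-based, inclusive) using one index entry."""
    def byte_offset(pos0):
        # Each full line before pos0 costs lenb bytes; the remainder
        # is the column within the current line.
        return (entry['offset']
                + (pos0 // entry['lenc']) * entry['lenb']
                + pos0 % entry['lenc'])

    first = byte_offset(start - 1)
    last = byte_offset(end - 1)
    handle.seek(first)
    chunk = handle.read(last - first + 1)
    # Drop the newline bytes that fall inside the requested range.
    return chunk.replace(b'\n', b'').replace(b'\r', b'').decode()


# In-memory FASTA wrapped at 6 bases per line, with a hand-written entry.
fasta = io.BytesIO(b'>seq1\nACGTAC\nGTTTGA\nCC\n')
entry = {'rlen': 14, 'offset': 6, 'lenc': 6, 'lenb': 7}
print(fetch(fasta, entry, 5, 9))  # ACGTT
```

Only the bytes between the first and last requested base are read, which is what keeps memory use low for huge files.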

Installation

This package is tested under Python 2.6, 2.7, 3.2, 3.3, and PyPy.

pip install pyfaidx

or

python setup.py install

CLI Usage

"samtools faidx" compatible FASTA indexing in pure Python.

usage: faidx [-h] [-l LIST] [-n] [--complement] [--reverse]
             fasta [regions [regions ...]]

Fetch sequence from faidx-indexed FASTA

positional arguments:
  fasta                 FASTA file
  regions               space separated regions of sequence to fetch e.g.
                        chr1:1-1000

optional arguments:
  -h, --help            show this help message and exit
  -l LIST, --list LIST  list of regions, one per line
  -n, --name            print sequence names. default: True
  --complement          complement the sequence. default: False
  --reverse             reverse the sequence. default: False

Changes

New in version 0.1.9:

line wrapping of faidx is set based on the wrapping of the indexed fasta file
added --reverse and --complement arguments to faidx

New in version 0.1.8:

key_function keyword argument to Fasta allows lookup based on function output

Acknowledgements

This project is freely licensed by the author, Matthew Shirley, and was completed under the mentorship and financial support of Drs. Sarah Wheelan and Vasan Yegnasubramanian at the Sidney Kimmel Comprehensive Cancer Center in the Department of Oncology.
