WARNING: This program is under active development and this documentation might not reflect reality. Please file a GitHub issue and we will take care of it as soon as we can.
FACS is the C implementation of an earlier Perl module. To look at the old (unsupported) implementation, select the perl branch.
- 'build' is for building a bloom filter from a reference file. It supports large genome files (>4GB), such as the human genome.
- 'query' is for querying a fastq/fasta file against the bloom filter.
- 'remove' is for removing contamination sequences from a fastq/fasta file.
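To make the build/query idea concrete, here is a toy bloom filter in Python. It is a sketch for illustration only, not the data structure FACS implements in C: the class name, the bit-array size, the number of hash functions, the use of SHA-256 and the word size K are all assumptions, as is the idea that sequences are split into fixed-length words (k-mers), which this README does not state. It is written for Python 3.

```python
import hashlib


class ToyBloomFilter:
    """Illustrative bloom filter; not the structure FACS ships."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a single hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def kmers(sequence, k):
    """Yield every overlapping substring of length k."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


# 'build': index the words of a made-up reference sequence.
reference = "ACGTACGTTAGCCGATAGGCTTAACG"
K = 8  # hypothetical word size
bloom = ToyBloomFilter()
for word in kmers(reference, K):
    bloom.add(word)

# 'query': count how many words of a read hit the filter.
read = reference[5:20]
hits = sum(word in bloom for word in kmers(read, K))
total = len(read) - K + 1
print(hits, "of", total, "words hit the filter")
```

A bloom filter never reports a stored item as absent, so a read taken from the reference always hits on every word. Reads from other organisms may still hit occasionally, because false positives are possible.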
In order to fetch the source code and compile, run:
$ git clone https://github.com/SciLifeLab/facs && cd facs && make -j8
Please note that Python's virtualenv is needed to run the tests.
If you are compiling on Mac OS X or with LLVM's clang, please note that FACS will run in single-core mode, since OpenMP has not yet been ported to clang.
Also, wget is required to run the test suite, so please run:
brew install wget
Or use whichever package manager you have on Mac OS X.
FACS uses a command-line structure similar to the one found in the popular bwa. There are three main commands: build, query and remove. Each of them might have slightly different flags, but they should behave similarly.
$ ./facs -h
Program: facs - Sequence analysis using bloom filters
Version: 2.0
Contact: Enze Liu <enze.liu@scilifelab.se>
Usage: facs <command> [options]
Command: build     build a bloom filter from a FASTA/FASTQ reference file
         query     query a bloom filter given a FASTA/FASTQ file
         remove    remove (contamination) sequences from FASTQ/FASTA file
For example, to build a bloom filter out of a FASTA reference genome, one should type:
$ ./facs build -r ecoli.fasta -o ecoli.bloom
That would generate an E. coli bloom filter that could be used to query a FASTQ file:
$ ./facs query -r ecoli.bloom -q contaminated_sample.fastq.gz -f "json"
Note that both plain-text and gzip-compressed FASTQ files are supported transparently.
The command returns some metrics, in JSON format, indicating how many reads might be contaminated with E. coli in that particular sample:
{
  "timestamp": "2013-03-27T11:16:21.809+0100",
  "organism": "test200.fastq",
  "bloom_filter": "eschColi_K12.bloom",
  "total_read_count": 201,
  "contaminated_reads": 1,
  "total_hits": 36,
  "contamination_rate": 0.004975
}
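The figures in this report are consistent with contamination_rate being contaminated_reads divided by total_read_count. That definition is inferred from the sample numbers rather than stated by FACS, so treat the following Python 3 check as a sketch under that assumption:

```python
# Counts copied from the sample report above.
total_read_count = 201
contaminated_reads = 1

# Assumed definition: fraction of reads flagged as contaminated.
contamination_rate = contaminated_reads / total_read_count
print(f"{contamination_rate:.6f}")  # prints 0.004975
```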
If one wishes to get TSV output that is easy to import into LibreOffice or Excel, pass -f "tsv" on the command line, and a TSV file will be written to the current directory:
$ cat test200.fastq.tsv
organism       bloom_filter        total_read_count  contaminated_reads  contamination_rate
test200.fastq  eschColi_K12.bloom  201               1                   0.004975
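Besides a spreadsheet, such a report can be read with Python's standard csv module. The sketch below assumes the columns are separated by tabs and that the first row is the header, as in the listing above. The file contents are embedded as a string so the example runs without FACS installed.

```python
import csv
import io

# Contents of test200.fastq.tsv as listed above, assuming tab separators.
tsv_text = (
    "organism\tbloom_filter\ttotal_read_count\t"
    "contaminated_reads\tcontamination_rate\n"
    "test200.fastq\teschColi_K12.bloom\t201\t1\t0.004975\n"
)

# For a real file, use open("test200.fastq.tsv", newline="") instead
# of io.StringIO(tsv_text).
rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

for row in rows:
    reads = int(row["total_read_count"])
    contaminated = int(row["contaminated_reads"])
    rate = float(row["contamination_rate"])
    print(row["organism"], row["bloom_filter"], reads, contaminated, rate)
```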
Finally, if one wants to remove those reads from the sample, one should run the following command:
$ ./facs remove -r ecoli.bloom -q contaminated_sample.fastq
Two output files will be generated:
contaminated_sample_ecoli_contam.fastq
contaminated_sample_ecoli_clean.fastq
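The three steps above can also be scripted. The following Python 3 sketch only assembles the command lines shown in this document, using the -r, -o, -q and -f flags exactly as they appear in the examples. The helper names (build_cmd, query_cmd, remove_cmd, run) are hypothetical, and run() assumes that facs prints its report to standard output, which this README does not confirm.

```python
import shlex
import subprocess

FACS = "./facs"  # path to the compiled binary; adjust as needed


def build_cmd(reference, bloom):
    """Command line for building a bloom filter from a reference."""
    return [FACS, "build", "-r", reference, "-o", bloom]


def query_cmd(bloom, sample, fmt="json"):
    """Command line for querying a sample against a bloom filter."""
    return [FACS, "query", "-r", bloom, "-q", sample, "-f", fmt]


def remove_cmd(bloom, sample):
    """Command line for removing contaminated reads from a sample."""
    return [FACS, "remove", "-r", bloom, "-q", sample]


def run(cmd):
    """Run one command; assumes any report goes to standard output."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


pipeline = [
    build_cmd("ecoli.fasta", "ecoli.bloom"),
    query_cmd("ecoli.bloom", "contaminated_sample.fastq.gz"),
    remove_cmd("ecoli.bloom", "contaminated_sample.fastq"),
]

# Print the commands instead of running them, so that the sketch works
# without a compiled facs binary. Call run(cmd) to execute for real.
for cmd in pipeline:
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Passing the arguments as a list avoids shell quoting problems with unusual file names, and check=True makes a failing step raise an exception instead of letting the pipeline continue silently.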
A Python C extension provides a very simple API to build, query and remove sequences, mirroring the plain C-based command line described above.
$ python
Python 2.6.6 (r266:84292, Jun 18 2012, 09:57:52)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import facs
>>> facs.build("ecoli.fasta", "ecoli.bloom")
>>> facs.query("contaminated_sample.fastq.gz", "ecoli.bloom")
>>> facs.remove("contaminated_sample.fastq", "ecoli.bloom")