Find and extract components of long-read assemblies that match to a database, for the purposes of decontamination.
Still early in development. Buyer beware! Here be dragons!!
Clone this repository and change into the top-level repo directory.
The file environment.yml contains the necessary conda packages (python and snakemake) to run charcoal; see the Quickstart section for explicit instructions.
Clone the repository, change into it, create the environment, and activate it:
git clone https://github.com/ctb/2020-long-read-assembly-decontam
cd ./2020-long-read-assembly-decontam/
conda env create -f environment.yml -n lra-decontam
conda activate lra-decontam
To run, execute (in the top-level directory):
snakemake --use-conda -p -j 1
This should succeed :).
Once that works, you can configure it yourself by copying test-data/conf-test.yml to a new file and editing it. See conf/conf-necator.yml for a real example.
In the output directory (e.g. output.test, or whatever is specified in the config file you use), there will be a few important files. The main ones are:
- gather.csv: the list of contaminants
- matching-contigs.fa: all contigs with any matches to the database
- matching-fragments.fa: all fragments with any matches to the database
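As a quick sanity check after a run, the sketch below counts the sequences in the two FASTA outputs and the data rows in gather.csv. It assumes the output directory is output.test (as in the test configuration) and that gather.csv has a single header line. The helper names count_fasta_records, count_csv_rows and summarize_outputs are made up for illustration and are not part of this repository.

```shell
# Hypothetical helpers, not part of this repository.

# Print the number of FASTA records (header lines starting with '>') in a file.
count_fasta_records() {
    # grep exits non-zero when nothing matches, but still prints 0.
    grep -c '^>' "$1" || true
}

# Print the number of non-blank data rows in a CSV file.
# Assumes one header line and no quoted fields containing newlines.
count_csv_rows() {
    awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Summarize an output directory (defaults to output.test, an assumption).
summarize_outputs() {
    outdir=${1:-output.test}
    echo "contaminant matches: $(count_csv_rows "$outdir/gather.csv")"
    echo "matching contigs:    $(count_fasta_records "$outdir/matching-contigs.fa")"
    echo "matching fragments:  $(count_fasta_records "$outdir/matching-fragments.fa")"
}
```

After defining the functions, running summarize_outputs output.test in the top-level directory prints one line per file. A row count of zero for gather.csv would suggest that no contaminants were reported.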
On a ~300 MB assembly, this took about 2 hours and required about 2 GB of RAM, using the RefSeq microbial genomes SBT. The disk space requirement is more significant, mainly because the SBTs are in the ~10-30 GB range when unpacked.
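Since disk space is the main constraint, it may be worth checking free space before downloading and unpacking the databases. The following is a minimal sketch using POSIX df. The helper name free_gb and the 30 GB threshold (the upper end of the range quoted above) are assumptions for illustration, and the check looks at the current directory, which may not be where the SBTs end up on your system.

```shell
# Hypothetical helper, not part of this repository.
# Print the free space, in whole GB, on the filesystem holding a directory.
free_gb() {
    # -P forces the one-line-per-filesystem POSIX format; -k reports 1024-byte blocks.
    # Assumes the filesystem name contains no spaces, so column 4 is "Available".
    df -Pk "${1:-.}" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# 30 GB is the upper end of the unpacked SBT size range; adjust as needed.
needed_gb=30
if [ "$(free_gb .)" -lt "$needed_gb" ]; then
    echo "warning: less than ${needed_gb} GB free in $(pwd)" >&2
fi
```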
Please ask questions and file issues on the sourmash GitHub issue tracker.
Thanks to Erich Schwarz (for stubborn pursuit of contamination in long-read assemblies) and Taylor Reiter (for stubborn pursuit of contamination, period) for their inspiration!
A first try at this approach is detailed here, and the discussion that led to this particular repo is in sourmash issue #940.
@ctb April 2020