Try 2 of detecting/removing microbial contamination from long-read assemblies.


pythseq/2020-long-read-assembly-decontam

 
 


2020-long-read-assembly-decontam

Find and extract components of long-read assemblies that match to a database, for the purposes of decontamination.

Still early in development. Buyer beware! Here be dragons!!

Installing!

Clone this repository and change into the top-level repo directory. The file environment.yml lists the conda packages (python and snakemake) needed to run the workflow; see the Quickstart section for explicit instructions.

Quickstart:

Clone the repository, change into it, create the environment, and activate it:

git clone https://github.com/ctb/2020-long-read-assembly-decontam
cd ./2020-long-read-assembly-decontam/
conda env create -f environment.yml -n lra-decontam
conda activate lra-decontam

Running!

To run, execute (in the top-level directory):

snakemake --use-conda -p -j 1

This should succeed :).

Once that works, you can configure it yourself by copying test-data/conf-test.yml to a new file and editing it. See conf/conf-necator.yml for a real example.
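The copy-and-edit step can also be scripted. The sketch below is only an illustration: the helper names make_config and snakemake_command are made up for this example, and it assumes the workflow honors snakemake's standard --configfile option, which this README does not confirm. It builds the command line but does not run it.

```python
import shutil
from pathlib import Path


def make_config(template, new_config):
    """Copy a template config to a new path, refusing to overwrite."""
    template, new_config = Path(template), Path(new_config)
    if new_config.exists():
        raise FileExistsError(f"{new_config} already exists")
    new_config.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(template, new_config)
    return new_config


def snakemake_command(config_path, jobs=1):
    """Build (but do not run) a snakemake command line for a config.

    Assumption: the workflow picks up its settings via --configfile.
    """
    return [
        "snakemake", "--use-conda", "-p",
        "-j", str(jobs),
        "--configfile", str(config_path),
    ]
```

For example, make_config("test-data/conf-test.yml", "conf/conf-mine.yml") would create a copy to edit (conf-mine.yml is a hypothetical name), and the list returned by snakemake_command could be passed to subprocess.run once the config is ready.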

Explanation of output files.

In the output directory (e.g. output.test, or whatever the config file you use specifies), there will be a few important files. The main ones are:

  • gather.csv - the list of contaminants
  • matching-contigs.fa - all contigs with any matches to the database
  • matching-fragments.fa - all fragments with any matches to the database
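To get a quick overview of these outputs, something like the following sketch may help. It is not part of the workflow. It assumes gather.csv has a header row with a column called name (true of sourmash gather CSV output, but not verified for this repo), and the function names are hypothetical.

```python
import csv
from pathlib import Path


def read_fasta_lengths(path):
    """Return {record_id: sequence_length} for a FASTA file."""
    lengths = {}
    name = None
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first word of the header as the record ID.
                parts = line[1:].split()
                name = parts[0] if parts else ""
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths


def read_gather_names(path, column="name"):
    """Return one column of gather.csv as a list.

    Assumption: the CSV has a header row containing `column`.
    """
    with open(path, newline="") as fh:
        return [row[column] for row in csv.DictReader(fh)]


def summarize(output_dir):
    """Summarize the contaminant list and the matching contigs."""
    out = Path(output_dir)
    contigs = read_fasta_lengths(out / "matching-contigs.fa")
    return {
        "contaminants": read_gather_names(out / "gather.csv"),
        "n_contigs": len(contigs),
        "total_bp": sum(contigs.values()),
    }
```

Calling summarize("output.test") after a successful test run would then report how many contigs matched the database and how many bases they cover.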

Resources

On a ~300 MB assembly, this took about 2 hours and required about 2 GB of RAM, using the RefSeq microbial genomes SBT. The disk space requirement is more significant, mainly because the SBTs are in the ~10-30 GB range when unpacked.
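Since disk space is the main constraint, it can be worth checking free space before unpacking a database. A minimal sketch, assuming an upper estimate of 30 GB taken from the range above and an arbitrary 1.5x safety margin (both are illustrative choices, not requirements stated by the workflow):

```python
import shutil


def enough_space(free_bytes, needed_gb, safety_factor=1.5):
    """Return True if free_bytes covers needed_gb plus a safety margin."""
    needed_bytes = needed_gb * safety_factor * 1e9  # decimal GB
    return free_bytes >= needed_bytes


def has_room(path, needed_gb=30, safety_factor=1.5):
    """Check the free space on the filesystem that holds `path`."""
    free = shutil.disk_usage(path).free
    return enough_space(free, needed_gb, safety_factor)
```

For example, has_room(".") checks the filesystem that holds the current directory against the 30 GB estimate.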

Need help?

Please ask questions and file issues on the sourmash GitHub issue tracker.

Credits

Thanks to Erich Schwarz (for stubborn pursuit of contamination in long-read assemblies) and Taylor Reiter (for stubborn pursuit of contamination, period) for their inspiration!

A first try at this approach is detailed here, and the discussion that led to this particular repo is in sourmash issue #940.


@ctb April 2020
