2020-long-read-assembly-decontam

Find and extract components of long-read assemblies that match to a database, for the purposes of decontamination.

Still early in development. Buyer beware! Here be dragons!!

Installing!

Clone this repository and change into the top-level repo directory. The file environment.yml lists the conda packages (python and snakemake) needed to run charcoal; see the Quickstart section for explicit instructions.

Quickstart:

Clone the repository, change into it, create the environment, and activate it:

git clone https://github.com/ctb/2020-long-read-assembly-decontam
cd ./2020-long-read-assembly-decontam/
conda env create -f environment.yml -n lra-decontam
conda activate lra-decontam

Running!

To run, execute (in the top-level directory):

snakemake --use-conda -p -j 1

This should succeed :).

Once that works, you can configure it yourself by copying test-data/conf-test.yml to a new file and editing it. See conf/conf-necator.yml for a real example.
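A minimal sketch of the copy-and-edit step, in Python. The helper names `make_config` and `snakemake_command` are made up for illustration, and the use of snakemake's `--configfile` option is an assumption: check the Snakefile to see how this workflow actually picks its config file. The second function only builds a command line; it does not run anything.

```python
import shutil
from pathlib import Path


def make_config(new_path, template="test-data/conf-test.yml"):
    """Copy the test config to new_path, refusing to overwrite an existing file."""
    new_path = Path(new_path)
    if new_path.exists():
        raise FileExistsError(f"{new_path} already exists; not overwriting")
    new_path.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(template, new_path)
    return new_path


def snakemake_command(config_path, jobs=1):
    """Build (but do not run) a snakemake command line that uses config_path.

    ASSUMPTION: the workflow reads its settings via snakemake's
    --configfile option; confirm this against the Snakefile.
    """
    return ["snakemake", "--use-conda", "-p", "-j", str(jobs),
            "--configfile", str(config_path)]
```

For example, `make_config("conf/conf-mine.yml")` (a hypothetical file name) gives you a fresh copy to edit, and `" ".join(snakemake_command("conf/conf-mine.yml"))` shows the command you would then try from the top-level directory.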

Explanation of output files.

In the output directory (e.g. output.test, or whatever is specified in the config file you use), there will be a few important files. The main ones are:

gather.csv - the list of contaminants
matching-contigs.fa - all contigs with any matches to the database
matching-fragments.fa - all fragments with any matches to the database
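To get a quick overview of these three files, something like the following sketch may help. It makes no assumptions about which columns gather.csv contains beyond it having a header row; it just reports the column names and row count, and counts FASTA records by their `>` header lines. The function names are made up for illustration and are not part of this repo.

```python
import csv
from pathlib import Path


def count_fasta_records(path):
    """Count sequences in a FASTA file by counting header lines ('>')."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def summarize_output(outdir):
    """Return a small summary dict for one output directory."""
    outdir = Path(outdir)
    # ASSUMPTION: gather.csv starts with a header row naming its columns.
    with open(outdir / "gather.csv", newline="") as handle:
        rows = list(csv.DictReader(handle))
    return {
        "gather_rows": len(rows),
        "gather_columns": list(rows[0].keys()) if rows else [],
        "matching_contigs": count_fasta_records(outdir / "matching-contigs.fa"),
        "matching_fragments": count_fasta_records(outdir / "matching-fragments.fa"),
    }
```

After a test run, `summarize_output("output.test")` returns the counts for the test output directory; pass whatever output directory your own config file specifies.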

Resources

On a ~300 MB assembly, a run took about 2 hours and required about 2 GB of RAM, using the RefSeq microbial genomes SBT (Sequence Bloom Tree). The disk space requirement is more significant, mainly because the SBTs are in the ~10-30 GB range when unpacked.

Need help?

Please ask questions and file issues on the sourmash GitHub issue tracker.

Credits

Thanks to Erich Schwarz (for stubborn pursuit of contamination in long-read assemblies) and Taylor Reiter (for stubborn pursuit of contamination, period) for their inspiration!

A first try at this approach is detailed here, and the discussion that led to this particular repo is in sourmash issue #940.

@ctb April 2020
