Skip to content

tlentali/idseq-bench

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

28 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

idseq-bench

Benchmark generator for the IDSeq Portal.

So far just a thin wrapper around InSilicoSeq.

setup

pip3 install InSilicoSeq
pip3 install ncbi-acc-download

running

python3 main.py

This produces zipped fastq files that you can upload to the IDSeq Portal via IDSEQ-CLI.

selecting organisms and chromosomes

Add/modify entries like this in params.py

    Genome("fungi", "aspergillus_fumigatus",
           [("subspecies", 330879), ("species", 746128), ("genus", 5052), ("family", 1131492)],
           ["NC_007194.1", "NC_007195.1", "NC_007196.1", "NC_007197.1", "NC_007198.1", "NC_007199.1",
            "NC_007200.1", "NC_007201.1"],
           "https://www.ncbi.nlm.nih.gov/genome/18?genome_assembly_id=22576"),

tweaking InSilicoSeq options

Edit params.py or main.py as desired, e.g., to select a different set of error models.

If the output data changes, we expect the output file name to change as well. It's always a good idea to increment LOGICAL_VERSION (whole numbers only) after making changes to the code.

interpreting the output

Each output file name reflects the params of its generation, like so:

norg_6__nacc_27__uniform_weight_per_organism__hiseq_reads__v4__[R1, R2].fastq.gz
  -- number of organisms: 6
  -- number of accessions: 27
  -- distribution: uniform per organism
  -- error model: hiseq
  -- logical version: 4

TODO: Random generator seed control.

We generate a summary file for each pair of fastqs, indicating read counts per organism, and the average coverage of the organism's genome. Each pair counts as 2 reads / 300 bases, matching InSilicoSeq and IDSeq conventions.

READS  COVERAGE    LINEAGE                                          GENOME
----------------------------------------------------------------------------------------------------------------------
16656    215.3x    benchmark_lineage_0_37124_11019_11018            viruses__chikungunya__37124
16594      0.1x    benchmark_lineage_330879_746128_5052_1131492     fungi__aspergillus_fumigatus__330879
16564    352.1x    benchmark_lineage_0_463676_12059_12058           viruses__rhinovirus_c__463676
16074      0.5x    benchmark_lineage_1125630_573_570_543            bacteria__klebsiella_pneumoniae_HS11286__1125630
15078      0.8x    benchmark_lineage_93061_1280_1279_90964          bacteria__staphylococcus_aureus__93061
14894      0.1x    benchmark_lineage_36329_5833_5820_1639119        protista__plasmodium_falciuparum__36329

We alter the read id's generated by ISS to satisfy the input requirements of tools like STAR. This requires stripping those _1 and _2 pair indicators from all read id's, so that both reads in a pair have the exact same read ID. Then, each read ID gets a serial number, and a tag identifying the taxonomic lineage of the organism that read was sourced from, like so:

@NC_016845.1_503__benchmark_lineage_1125630_573_570_543__s0000001169

This is helpful in tracking reads through complex bioinformatic pipelines and scoring results. We assume the pipelines would not cheat by inspecting those tags.

About

infectious disease benchmarking

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages