Benchmark generator for the IDSeq Portal.
So far just a thin wrapper around InSilicoSeq.
pip3 install InSilicoSeq
pip3 install ncbi-acc-download
python3 main.py
This produces zipped fastq files that you can upload to the IDSeq Portal via IDSEQ-CLI.
Add/modify entries like this in params.py
Genome("fungi", "aspergillus_fumigatus",
[("subspecies", 330879), ("species", 746128), ("genus", 5052), ("family", 1131492)],
["NC_007194.1", "NC_007195.1", "NC_007196.1", "NC_007197.1", "NC_007198.1", "NC_007199.1",
"NC_007200.1", "NC_007201.1"],
"https://www.ncbi.nlm.nih.gov/genome/18?genome_assembly_id=22576"),
Edit params.py or main.py as desired, e.g., to select a different set of error models.
If the output data changes, we expect the output file name to change as well. It's always a good idea to increment LOGICAL_VERSION (whole numbers only) after making changes to the code.
Each output file name reflects the params of its generation, like so:
norg_6__nacc_27__uniform_weight_per_organism__hiseq_reads__v4__[R1, R2].fastq.gz
-- number of organisms: 6
-- number of accessions: 27
-- distribution: uniform per organism
-- error model: hiseq
-- logical version: 4
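The naming scheme above can be sketched as a small helper. This is only an illustration of the pattern shown in the example file name, not the code in main.py; the function name `output_file_names` and its parameters are hypothetical.

```python
def output_file_names(num_organisms, num_accessions, distribution,
                      error_model, logical_version):
    """Return the [R1, R2] file names for one benchmark.

    Hypothetical helper: it only mirrors the pattern of the example
    file name above and is not taken from main.py.
    """
    stem = (
        f"norg_{num_organisms}__nacc_{num_accessions}"
        f"__{distribution}__{error_model}_reads__v{logical_version}__"
    )
    # One file per mate of the read pair.
    return [f"{stem}{mate}.fastq.gz" for mate in ("R1", "R2")]


names = output_file_names(6, 27, "uniform_weight_per_organism", "hiseq", 4)
for name in names:
    print(name)
```

Because every parameter, including the logical version, is part of the name, a change to any of them produces a new file name rather than silently overwriting an older benchmark.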
TODO: Random generator seed control.
We generate a summary file for each pair of fastqs, listing the read count per organism and the average coverage of that organism's genome. Each pair counts as 2 reads / 300 bases, matching InSilicoSeq and IDSeq conventions.
READS  COVERAGE  LINEAGE                                       GENOME
----------------------------------------------------------------------------------------------------------------------
16656    215.3x  benchmark_lineage_0_37124_11019_11018         viruses__chikungunya__37124
16594      0.1x  benchmark_lineage_330879_746128_5052_1131492  fungi__aspergillus_fumigatus__330879
16564    352.1x  benchmark_lineage_0_463676_12059_12058        viruses__rhinovirus_c__463676
16074      0.5x  benchmark_lineage_1125630_573_570_543         bacteria__klebsiella_pneumoniae_HS11286__1125630
15078      0.8x  benchmark_lineage_93061_1280_1279_90964       bacteria__staphylococcus_aureus__93061
14894      0.1x  benchmark_lineage_36329_5833_5820_1639119     protista__plasmodium_falciuparum__36329
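As a rough guide to reading the COVERAGE column: if each pair counts as 2 reads / 300 bases, each read contributes 150 bases, so average coverage is approximately reads × 150 / genome length. The sketch below assumes exactly that formula; the function name and the genome length in the example are made up for illustration and are not taken from the generator.

```python
BASES_PER_READ = 150  # each pair counts as 2 reads / 300 bases


def average_coverage(read_count: int, genome_length: int) -> float:
    """Approximate average coverage: total sequenced bases / genome length.

    Assumption: every read contributes exactly BASES_PER_READ bases.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    return read_count * BASES_PER_READ / genome_length


# Hypothetical numbers: 2000 reads drawn from a 10,000-base genome.
coverage = average_coverage(2000, 10_000)
print(f"{coverage:.1f}x")
```

This is why, at similar read counts, the small viral genomes in the table show coverage in the hundreds while the much larger fungal and protist genomes stay well below 1x.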
We alter the read IDs generated by ISS to satisfy the input requirements of tools like STAR. This requires stripping the _1 and _2 pair indicators from all read IDs, so that both reads in a pair have exactly the same read ID. Then each read ID gets a serial number and a tag identifying the taxonomic lineage of the organism that read was sourced from, like so:
@NC_016845.1_503__benchmark_lineage_1125630_573_570_543__s0000001169
This is helpful in tracking reads through complex bioinformatic pipelines and scoring results. We assume the pipelines would not cheat by inspecting those tags.
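A minimal sketch of the read ID rewrite and of how a scoring script might recover the lineage afterwards. It assumes the ISS read IDs end in a literal `_1` or `_2` suffix and that the serial number is zero-padded to 10 digits, as in the example above; the function names are hypothetical and this is not the code in main.py.

```python
import re
from typing import List, Tuple

PAIR_SUFFIX = re.compile(r"_[12]$")  # assumed form of the ISS pair indicator
LINEAGE_TAG = re.compile(r"__benchmark_lineage_(\d+(?:_\d+)*)__s\d+$")


def lineage_tag(lineage: List[Tuple[str, int]]) -> str:
    """Join the taxids of a params.py-style lineage list into a tag."""
    return "benchmark_lineage_" + "_".join(str(taxid) for _rank, taxid in lineage)


def benchmark_read_id(iss_read_id: str, tag: str, serial: int) -> str:
    """Strip the pair indicator, then append the lineage tag and serial number."""
    base = PAIR_SUFFIX.sub("", iss_read_id)
    return f"{base}__{tag}__s{serial:010d}"


def lineage_from_read_id(read_id: str) -> List[int]:
    """Recover the taxids from a tagged read ID (for scoring, not for pipelines)."""
    match = LINEAGE_TAG.search(read_id)
    if match is None:
        raise ValueError(f"no benchmark lineage tag in {read_id!r}")
    return [int(taxid) for taxid in match.group(1).split("_")]


# Lineage list copied from the params.py example above.
fungus_tag = lineage_tag(
    [("subspecies", 330879), ("species", 746128), ("genus", 5052), ("family", 1131492)]
)

# Both mates of a pair end up with the same read ID.
tag = "benchmark_lineage_1125630_573_570_543"
r1 = benchmark_read_id("@NC_016845.1_503_1", tag, 1169)
r2 = benchmark_read_id("@NC_016845.1_503_2", tag, 1169)
print(r1)
print(lineage_from_read_id(r1))
```

Anchoring both patterns at the end of the string keeps underscores inside accession names such as `NC_016845.1` from being mistaken for pair indicators or tag separators.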