
DiTaxa

Nucleotide-pair encoding of 16S rRNA sequences for host phenotype and biomarker detection

Ehsaneddin Asgari, Philipp C Münch, Till R Lesker, Alice C McHardy, Mohammad R K Mofrad; DiTaxa: nucleotide-pair encoding of 16S rRNA for host phenotype and biomarker detection, Bioinformatics, bty954, https://doi.org/10.1093/bioinformatics/bty954

Developer: Ehsaneddin Asgari (asgari [at] berkeley [dot] edu)
Please feel free to report any technical issue by sending an email or reporting an issue here.
Project page: http://llp.berkeley.edu/ditaxa
PIs: Prof. Alice McHardy* and Prof. Mohammad Mofrad*
         


Summary   Identifying distinctive taxa for microbiome-related diseases is considered key to the establishment of diagnosis and therapy options in precision medicine and imposes high demands on the accuracy of microbiome analysis techniques. We propose an alignment- and reference-free, subsequence-based 16S rRNA data analysis as a new paradigm for microbiome phenotype and biomarker detection. Our method, called DiTaxa, replaces standard OTU clustering with a segmentation of 16S rRNA reads into the most frequent variable-length subsequences. We compared the performance of DiTaxa to the state-of-the-art methods in phenotype and biomarker detection, using human-associated 16S rRNA samples for periodontal disease, rheumatoid arthritis, and inflammatory bowel diseases, as well as a synthetic benchmark dataset. DiTaxa performed competitively with the k-mer based state-of-the-art approach in phenotype prediction, while outperforming the OTU-based state-of-the-art approach in finding biomarkers, in both resolution and coverage, as evaluated over known links from the literature and synthetic benchmark datasets.
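The segmentation idea in the summary can be illustrated with a small sketch. It assumes that nucleotide-pair encoding follows the byte-pair-encoding scheme its name suggests: starting from single nucleotides, the most frequent adjacent pair of symbols is merged repeatedly. The function names, the tie-breaking rule, and the toy reads are illustrative only and are not taken from the DiTaxa code.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` by one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_merges(reads, num_merges):
    """Learn BPE-style merges over reads; return the merges and segmented reads."""
    # Every read starts as a list of single-nucleotide symbols.
    segmented = [list(read) for read in reads]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across all reads.
        pairs = Counter()
        for symbols in segmented:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself (an arbitrary
        # but deterministic choice made for this sketch).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        segmented = [apply_merge(symbols, best) for symbols in segmented]
    return merges, segmented


merges, segmented = learn_merges(["ACGACG", "ACGT"], num_merges=2)
print(merges)
print(segmented)
```

After enough merges, frequent subsequences become single variable-length symbols, and each read is represented by the symbols it contains.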

Please cite the Bioinformatics paper

@article{10.1093/bioinformatics/bty954,
    author = {Asgari, Ehsaneddin and Münch, Philipp C and Lesker, Till R and McHardy, Alice C and Mofrad, Mohammad R K},
    title = "{DiTaxa: nucleotide-pair encoding of 16S rRNA for host phenotype and biomarker detection}",
    year = {2018},
    month = {11},
    doi = {10.1093/bioinformatics/bty954},
    url = {https://dx.doi.org/10.1093/bioinformatics/bty954},
    eprint = {http://oup.prod.sis.lan/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/bty954/27452903/bty954.pdf},
}

Installation

For detailed installation instructions using a conda virtual environment, and for testing the working example, please refer to the installation guideline.

Working Example

An example periodontal disease dataset (Jorth et al., 2015) is provided in the repo. To see how DiTaxa runs, you may run the following command after installation.

python3 ditaxa.py --indir dataset/periodontal/ \
 --fast2label dataset/periodontal/mapping.txt \
 --ext fastq \
 --outdir results_dental/ \
 --dbname periodontal \
 --cores 20 \
 --phenomap diseased:1,healthy:0 \
 --heatmap PeriodontalSamples:HealthySamples \
 --phenoname DvsH \
 --override 1
# Optional: append --blastn BLASTN_PATH to the command above

Alternatively you can run:

bash ./run_test.sh
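
If DiTaxa is launched from another Python script, the long command line of the working example can be assembled from a dictionary. This is a minimal sketch: `build_ditaxa_command` is a hypothetical helper, not part of DiTaxa, and it only uses the option names shown in the working example.

```python
import shlex


def build_ditaxa_command(options):
    """Turn a mapping of option names to values into an argv list."""
    argv = ["python3", "ditaxa.py"]
    for name, value in options.items():
        argv += [f"--{name}", str(value)]
    return argv


options = {
    "indir": "dataset/periodontal/",
    "fast2label": "dataset/periodontal/mapping.txt",
    "ext": "fastq",
    "outdir": "results_dental/",
    "dbname": "periodontal",
    "cores": 20,
    "phenomap": "diseased:1,healthy:0",
    "heatmap": "PeriodontalSamples:HealthySamples",
    "phenoname": "DvsH",
    "override": 1,
}

argv = build_ditaxa_command(options)
# Print a copy-pasteable shell command; nothing is executed here.
print(shlex.join(argv))
```

To actually start the run, the list can be passed to `subprocess.run(argv, check=True)` from the repository root.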

Example Dataset and parameter explanation

You may use this example to prepare your input files:

The "indir", e.g. «dataset/periodontal/», contains one fastq file per 16S rRNA sample.


The "fast2label", e.g. «dataset/periodontal/mapping.txt», is a file that maps fastq files to their labels in a tabular format:

d1.fastq    diseased
d2.fastq    diseased
d3.fastq    diseased
d4.fastq    diseased
d5.fastq    diseased
d6.fastq    diseased
d7.fastq    diseased
d8.fastq    diseased
d9.fastq    diseased
d10.fastq    diseased
h1.fastq    healthy
h2.fastq    healthy
h3.fastq    healthy
h4.fastq    healthy
h5.fastq    healthy
h6.fastq    healthy
h7.fastq    healthy
h8.fastq    healthy
h9.fastq    healthy
h10.fastq    healthy

The "phenomap", e.g. «diseased:1,healthy:0», determines which labels are treated as the positive class and which as the negative class. It is given as a string with no spaces in the following format:
diseased:1,healthy:0

The "override": 1 will overwrite already existing files in the directory.
The "heatmap", e.g. «PeriodontalSamples:HealthySamples», sets the names used for the positive and negative phenotypes on the heatmap.
The "blastn" (optional): specify this only if you did not run build.sh. It is the path to the "bin" directory of the BLAST installation on your system. You can get the latest version of BLAST for your operating system from: ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/

After running this command, the output files will be generated in 'results_dental' as described below. Example output files are provided in the './output_example/' directory.
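
Before a long run it can help to sanity-check the mapping file and the phenomap string described in this section. The sketch below is not DiTaxa's own parser. It assumes that mapping columns are separated by whitespace and that files whose label is missing from the phenomap are simply left out; all function names are hypothetical.

```python
def read_mapping(lines):
    """Parse 'file name<whitespace>label' lines into a {file name: label} dict."""
    mapping = {}
    for number, line in enumerate(lines, start=1):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 2:
            raise ValueError(f"line {number}: expected 2 columns, got {len(parts)}")
        name, label = parts
        if name in mapping:
            raise ValueError(f"line {number}: duplicate file name {name!r}")
        mapping[name] = label
    return mapping


def parse_phenomap(spec):
    """Parse a string such as 'diseased:1,healthy:0' into {label: 0 or 1}."""
    if any(ch.isspace() for ch in spec):
        raise ValueError("phenomap must not contain spaces")
    phenomap = {}
    for item in spec.split(","):
        label, sep, value = item.rpartition(":")
        if not sep or not label or value not in {"0", "1"}:
            raise ValueError(f"bad phenomap entry: {item!r}")
        phenomap[label] = int(value)
    return phenomap


def binary_targets(mapping, phenomap):
    """Return ({file: 0/1}, [files skipped because their label is unmapped])."""
    targets = {f: phenomap[lab] for f, lab in mapping.items() if lab in phenomap}
    skipped = sorted(f for f in mapping if f not in targets)
    return targets, skipped


example_lines = ["d1.fastq\tdiseased", "d2.fastq\tdiseased", "h1.fastq\thealthy"]
targets, skipped = binary_targets(
    read_mapping(example_lines), parse_phenomap("diseased:1,healthy:0")
)
print(targets, skipped)
```

With a real file, the lines can be read with `open("dataset/periodontal/mapping.txt")` and passed to `read_mapping` directly.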

Output example

The output generated automatically for the example is described below. You may also see the generated files in `output_example`.

Biomarker detection

DiTaxa provides a taxonomic tree of the significant discriminative biomarkers, in which the taxa associated with the positive and negative classes are colored according to their phenotype (red for the positive class and blue for the negative class). The DiTaxa implementation for taxonomic tree generation uses a Phylophlan-based backend. In addition to PDF files, DiTaxa provides the raw graphlan outputs to facilitate further annotation.

STDP here denotes the state-of-the-art OTU-based approach to which DiTaxa is compared.

Heatmap of Biomarkers

DiTaxa generates a heatmap of the occurrences of the top biomarkers across samples, where rows denote markers and columns denote samples. Such a heatmap gives biologists a detailed overview of the markers' occurrences across samples. The heatmap shows the number of distinctive sequences hit by each biomarker in the different samples, and stars in the heatmap denote hits to unique sequences, which cannot be analyzed by OTU clustering approaches.

Excel file of Biomarkers

In addition, DiTaxa provides a detailed Excel file of biomarker sequences and their taxonomic annotations, along with their p-values.

t-SNE visualization

A t-SNE visualization of the data, using all NPEs and using the selected markers, is also generated by default.

Detailed User Manual

After installation using the installation guideline, you may use DiTaxa. The parameters for running DiTaxa are as follows:

python3 ditaxa.py --indir address_of_samples \
 --ext extension_of_the_files \
 --outdir output_directory \
 --dbname database_name \
 --cores 20 \
 --fast2label mapping_file_from_name_to_phenotype \
 --phenomap mapping_labels_to_binary_1_or_0_phenotype \
 --blastn /mounts/data/proj/asgari/dissertation/deepbio/taxonomy/ncbi-blast-2.5.0+/bin/

With the above command, all steps are run sequentially and the output is organized in subdirectories.

Main parameters for biomarker detection/analysis

--indir: The input directory containing all fasta or fastq files. (e.g.: dataset/periodontal/)
--ext: Sequence file extensions (fasta or fastq) (e.g.: fastq)
--outdir: The output directory (e.g.: /mounts/data/ditaxa/results/test_dental_out/)
--cores: Number of cores (e.g.: 40)
--fast2label: tabular mapping file between file names and the labels
--phenomap: mapping from label to binary phenotypes
--phenoname: name of the phenotype mapping. If not given, the labels and their values are used as the identifier: label1@1#label2@1...#label3@0. Please note that a single project may have several phenotype mapping schemes (e.g., untreated diseased versus all, or untreated versus healthy).
--override: 1 to overwrite the existing files, 0 to only generate the missing files
--heatmap: generates an occurrence heatmap for the top 100 markers (e.g.: positive_title:negative_title).
--excel: 1 or 0; the default is 1, which generates a detailed list of markers, their taxonomic assignments, and their p-values
--blastn: If you have already run './build.sh', you do not need to specify this parameter: the script downloads NCBI BLAST+ and sets up the path to its /bin/ directory on your system. Otherwise, if BLAST+ is already installed on your system, you can specify the path to its /bin/ directory here.

You can also download BLAST+ from the links below and specify the path:
Linux
http://ftp.ncbi.nlm.nih.gov/blast/executables/blast%2B/2.7.1/ncbi-blast-2.7.1%2B-x64-linux.tar.gz
macOS
http://ftp.ncbi.nlm.nih.gov/blast/executables/blast%2B/2.7.1/ncbi-blast-2.7.1%2B-x64-macosx.tar.gz
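
The default identifier described under --phenoname (label1@1#label2@1...#label3@0) can be reproduced with a short helper. This is a sketch under assumptions: `default_phenoname` is a hypothetical name, and the ordering of labels (here, the order in which they appear in the mapping) is a guess, since the description above does not specify it.

```python
def default_phenoname(phenomap):
    """Join 'label@value' items with '#', in the mapping's own order."""
    return "#".join(f"{label}@{value}" for label, value in phenomap.items())


print(default_phenoname({"diseased": 1, "healthy": 0}))
```

Passing an explicit --phenoname, as in the working example (DvsH), avoids depending on this generated string.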

Phenotype prediction

Phenotype classification is evaluated in a 10-fold cross-validation framework.
--classify: which predictive model to use; choices=[False: default, 'RF': random forest, 'SVM': support vector machines, 'DNN': deep multi-layer perceptron, 'LR': logistic regression]
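
To make the evaluation scheme concrete, the sketch below shows how a 10-fold cross-validation split can be built so that each fold keeps roughly the same class proportions. It is a generic pure-Python illustration and not DiTaxa's implementation; whether DiTaxa stratifies its folds is an assumption, and the function names are hypothetical.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds, spreading each class evenly."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin into the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def train_test_splits(labels, k=10, seed=0):
    """Yield (train indices, test indices), holding out one fold at a time."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        test = sorted(folds[held_out])
        train = sorted(
            i for f, fold in enumerate(folds) if f != held_out for i in fold
        )
        yield train, test


# Ten diseased (1) and ten healthy (0) samples, as in the periodontal example.
labels = [1] * 10 + [0] * 10
for train, test in train_test_splits(labels):
    print(len(train), test)
```

Each sample is held out exactly once, and the model selected with --classify would be trained on the remaining nine folds.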

Deep neural network parameters

Although a full script is provided, we have commented out the deep neural network classifier and its dependencies in order to simplify the core installation of DiTaxa for biomarker detection/analysis. If you are interested in neural network prediction of the phenotype, you only need to install some further dependencies (keras/tensorflow) and uncomment "import DNN" in main/DiTaxa.py.

--arch: The comma-separated definition of the neural network layers connected to each other. You do not need to specify the input and output layers. Values between 0 and 1 are treated as dropout rates, e.g., 1024,0.2,512
--batchsize: Batch size
--gpu_id: which GPU to use
--epochs: Number of epochs
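
The --arch convention can be illustrated with a small parser that separates hidden-layer sizes from dropout rates. This sketch only mirrors the rule stated above (values between 0 and 1 are dropouts); `parse_arch` is a hypothetical helper and not the code DiTaxa uses to build its network.

```python
def parse_arch(arch):
    """Split an --arch string into ('dense', units) and ('dropout', rate) items."""
    layers = []
    for token in arch.split(","):
        value = float(token)
        if 0 < value < 1:
            layers.append(("dropout", value))
        elif value >= 1 and value == int(value):
            layers.append(("dense", int(value)))
        else:
            raise ValueError(f"cannot interpret layer definition {token!r}")
    return layers


print(parse_arch("1024,0.2,512"))
```

For the example string, this gives a 1024-unit hidden layer, a dropout of 0.2, and a 512-unit hidden layer, with the input and output layers left to the program.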

Bootstrapping for sample size selection

We use bootstrapping to investigate the sufficiency and consistency of the NPE representation when only a small portion of the sequences is used. This has two important implications: first, sub-sampling reduces the preprocessing run-time; second, it shows that even shallow 16S rRNA sequencing is enough for phenotype prediction. We use a resampling framework to find a proper sampling size. The DiTaxa implementation uses a default parameter setting based on bootstrapping on several datasets. The bootstrapping library is located at "DiTaxa/bootstrapping/bootstrapping.py" if further investigation is needed.
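
The resampling idea can be sketched as follows: draw reads with replacement at a given sample size, build a frequency profile from each resample, and measure how far it is from the profile of the full sample. The sketch makes several simplifications that are assumptions rather than DiTaxa's method: it uses fixed-length k-mers instead of NPEs, an L1 distance as the error measure, and hypothetical function names. The actual procedure is in bootstrapping.py.

```python
import random
from collections import Counter


def kmer_profile(reads, k=3):
    """Relative frequency of every k-mer over a list of reads."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    total = sum(counts.values())
    return {kmer: count / total for kmer, count in counts.items()}


def l1_distance(p, q):
    """Sum of absolute differences between two frequency profiles."""
    return sum(abs(p.get(key, 0.0) - q.get(key, 0.0)) for key in set(p) | set(q))


def resampling_error(reads, sample_size, repeats=10, k=3, seed=0):
    """Mean L1 distance between resampled profiles and the full profile."""
    rng = random.Random(seed)
    full = kmer_profile(reads, k)
    distances = []
    for _ in range(repeats):
        subsample = rng.choices(reads, k=sample_size)  # with replacement
        distances.append(l1_distance(kmer_profile(subsample, k), full))
    return sum(distances) / len(distances)


reads = ["ACGTACGT", "TTTTACGT", "ACGTTTTT", "GGGTACGA"] * 25
for size in (5, 20, 80):
    print(size, round(resampling_error(reads, size), 3))
```

A sample size is considered sufficient once the error stops decreasing noticeably as the size grows.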
