
flalix/mia


title: MIA
author: Flavio Lichtenstein
date: December 11, 2015
output: html_document

Mutual Information Analyzer: software that clusters molecular sequences based on entropy and mutual information

Federal University of Sao Paulo (UNIFESP), DIS-Bioinformatics

Motivation: Here we propose a method to discriminate closely related species at the molecular level using entropy and mutual information. Sequences of orthologous genes in the same species might contain species-specific covariation patterns that can be identified through mutual information.

Summary: Mutual Information Analyzer (MIA) is a pipeline written in Python that calculates Vertical Entropy (VH), Vertical Mutual Information (VMI) and Horizontal Mutual Information (HMI). From the VH, VMI and HMI distributions, the Jensen-Shannon Divergence (JSD) is calculated to estimate the distances between the sequences of different species. The distance between each pair of mutual information distributions is calculated together with its standard error and stored in distance matrices. These distances can be presented as histograms or as hierarchical cluster dendrograms.

How to install:

See the install document.

We built executables for Windows and Linux with PyInstaller, but they did not work, for reasons we have not yet identified. We apologize.

Because the source code is open, you can run MIA from the command line, but you must first install some libraries and set the Python path.


Methods:

Mutual Information Analyzer (MIA) is a pipeline written in Python that consists of the following algorithms:

  • A1) NCBI: gathers data from NCBI and stores them in GBK file format;
  • A2) Gbk to Fasta: parses the GBK file and organizes the sequences into one fasta file per species;
  • A3) Alignment: aligns all sequences and, at the end, creates two fasta files: "mincut", which cuts out columns and sequences with large gaps, and "maxmer", which keeps as many gaps as possible;
  • A4) Purging: replaces ambiguous nucleotides using the IUPAC nucleotide ambiguity table, and eliminates sequences with undesirable words in their names, such as "synthetic";
  • A5) Consensus: replaces gaps with their vertical consensus nucleotide;
  • A6) VMI: calculates and stores the Vertical Entropy (VH) and Vertical Mutual Information (VMI) distributions, and plots the respective histograms and heat maps;
  • A7) HMI: calculates and stores the Horizontal Mutual Information (HMI) distributions, and plots the histograms;
  • A8) JSD: calculates the Jensen-Shannon Divergence, stores the distances and their standard errors (SEs) in distance matrix files, and plots the histograms;
  • A9) HC: calculates the hierarchical clustering and presents it as a dendrogram;
  • A10) Entropy: simulates Shannon Entropy.
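As an illustration of step A6, the following is a minimal sketch of how a per-column entropy and a column-pair mutual information could be computed. It assumes that "vertical" means column-wise across the aligned sequences, and it applies no bias correction. The function names and the toy alignment are hypothetical and are not taken from the MIA source code.

```python
from collections import Counter
from math import log2


def column(alignment, i):
    """Return column i of an alignment given as equal-length strings."""
    return [seq[i] for seq in alignment]


def entropy(symbols):
    """Shannon entropy (in bits) of the empirical distribution of symbols."""
    n = len(symbols)
    return -sum((c / n) * log2(c / n) for c in Counter(symbols).values())


def mutual_information(col_i, col_j):
    """Mutual information (in bits): H(i) + H(j) - H(i, j)."""
    joint = list(zip(col_i, col_j))
    return entropy(col_i) + entropy(col_j) - entropy(joint)


# Toy alignment (hypothetical data): 4 sequences, 3 columns.
alignment = ["ACG", "ACT", "GTG", "GTT"]

# Vertical entropy: one value per column.
vh = [entropy(column(alignment, i)) for i in range(len(alignment[0]))]

# Columns 0 and 1 covary perfectly; columns 0 and 2 are independent.
vmi_01 = mutual_information(column(alignment, 0), column(alignment, 1))
vmi_02 = mutual_information(column(alignment, 0), column(alignment, 2))
```

In this toy case every column has an entropy of 1 bit, the covarying pair shares 1 bit of mutual information, and the independent pair shares none.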


HMI and VMI are calculated both with and without bias correction, so the gain or loss of information for "mincut" versus "maxmer", with or without bias correction, can be compared. Distances between distributions are calculated as the square root of the JSD. Since Mutual Information and JSD are not linear functions of the data, their standard errors are calculated by empirical propagation.
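The square root of the JSD can be sketched as follows for two discrete distributions defined over the same bins. This is a minimal example that assumes base-2 logarithms, which bound the JSD between 0 and 1. The function names are hypothetical, and the standard-error propagation used by MIA is not reproduced here.

```python
from math import log2, sqrt


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; bins with p_i = 0 add 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two distributions over the same bins."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def jsd_distance(p, q):
    """Square root of the JSD; max() guards against tiny negative rounding."""
    return sqrt(max(jsd(p, q), 0.0))


# Hypothetical normalized histograms for two species.
p = [0.5, 0.3, 0.2]
q = [0.1, 0.4, 0.5]
d = jsd_distance(p, q)
```

The divergence is symmetric, and its square root satisfies the triangle inequality, which makes it suitable as an entry in a distance matrix.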



Images:


Vertical Shannon Entropy

[Figure: Vertical Entropy]


Vertical Mutual Information 2D Heatmap

[Figure: VMI 2D Heatmap]


Vertical Mutual Information 3D Heatmap

[Figure: VMI 3D Heatmap]


Horizontal Entropy

[Figure: HMI]


JSD Histogram

[Figure: JSD Histogram]


Hierarchical Cluster

[Figure: Hierarchical Cluster]


Soon: a nonparametric classifier

We have also developed MI_Classifier, a nonparametric classifier. Given all sequences in a fasta file, it calculates the MI spectra and JSD distances and exports the Informational Dendrogram to FigTree. It is currently being tested.
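To make the last step concrete, here is a minimal sketch of turning a distance matrix into a tree string that FigTree can open. It assumes average-linkage agglomerative clustering and the Newick format. Neither the linkage method nor the export format used by MI_Classifier is stated here, so both are assumptions, and the function name and sample distances are hypothetical.

```python
def average_linkage_newick(names, dist):
    """Cluster by average linkage and return the topology as a Newick string.

    names: list of leaf labels; dist: symmetric matrix as a list of lists.
    Branch lengths are omitted; labels are assumed to need no Newick quoting.
    """
    # Each cluster is a pair: (Newick subtree, list of leaf indices).
    clusters = [(name, [i]) for i, name in enumerate(names)]

    def cluster_distance(a, b):
        # Mean of all leaf-to-leaf distances between the two clusters.
        pairs = [(i, j) for i in a[1] for j in b[1]]
        return sum(dist[i][j] for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the closest pair of clusters and merge them.
        a, b = min(
            ((x, y) for k, x in enumerate(clusters) for y in clusters[k + 1:]),
            key=lambda pair: cluster_distance(*pair),
        )
        clusters.remove(a)
        clusters.remove(b)
        clusters.append((f"({a[0]},{b[0]})", a[1] + b[1]))

    return clusters[0][0] + ";"


# Hypothetical JSD-based distances between three species.
names = ["sp1", "sp2", "sp3"]
dist = [
    [0.0, 0.1, 0.8],
    [0.1, 0.0, 0.7],
    [0.8, 0.7, 0.0],
]
tree = average_linkage_newick(names, dist)
```

Writing `tree` to a text file gives a file that FigTree can load. A production version would also carry branch lengths derived from the merge distances.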
