GUI-based visualisation software for quick, interactive exploration of high-dimensional data, with functionality to identify genomic signals and relate these to user-specified phenotypes.
Early exploratory analysis of high-dimensional data typically involves transforming the data to a reduced-dimensionality space for visualisation. Visual inspection of the transformed data highlights, in an intuitive way, global patterns such as whether the samples cluster in accordance with the hypotheses. Batch effects, sample mismatches and other technical artefacts are also revealed by this visualisation. An unsupervised dimensionality reduction method, such as principal components analysis (PCA), produces views of the data that are unbiased by our hypotheses. PCA creates a low-dimensional representation of a data set that is optimal in the sense of retaining as much of the variance in the original data set as possible. The principal components are ordered by the amount of variance they explain, from highest to lowest. Plotting principal components shows how samples cluster on each dimension, with clustering indicative of 'likeness' on that dimension. This allows users to visually discover, in an unbiased manner, variables that are characteristic of specific sample groups. Often, this unbiased view reveals unexpected insights into the data. It would be particularly useful to characterise these insights further and determine why samples are clustering or segregating on given dimension(s), and whether this is related to a phenotype or experimental technical factor. Further, if this clustering is correlated with a phenotype of interest, which genes, transcripts, methylated CpG sites and so on are driving it? Capturing this information would lead to a far more powerful exploratory data analysis, one which generates new hypotheses and analytical questions for the next phase of the analysis.
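The ordering of components by variance, and the question of which variables drive a given component, can be illustrated with a small sketch. This is an illustration only, not the implementation used by the software: the function names `pca` and `top_loadings` are hypothetical. Power iteration with deflation stands in for the SVD or eigendecomposition routine a real tool would more likely call, purely to keep the example free of dependencies. Rows are assumed to be samples and columns variables (for example genes).

```python
import math
import random


def _leading_eigenpair(mat, iters, rng):
    """Largest eigenvalue and unit eigenvector of a symmetric matrix (power iteration)."""
    p = len(mat)
    v = [rng.random() + 0.1 for _ in range(p)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]
    for _ in range(iters):
        w = [sum(mat[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm < 1e-12:  # no variance left to capture
            break
        v = [c / norm for c in w]
    # Rayleigh quotient v^T M v gives the eigenvalue for the unit vector v.
    lam = sum(v[i] * sum(mat[i][j] * v[j] for j in range(p)) for i in range(p))
    return lam, v


def pca(rows, n_components=2, iters=500, seed=0):
    """Return (scores, components, variances) for samples-by-variables data."""
    n, p = len(rows), len(rows[0])
    # Centre each variable so that variance is measured around its mean.
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    x = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix of the variables (p x p).
    cov = [[sum(r[i] * r[j] for r in x) / (n - 1) for j in range(p)]
           for i in range(p)]
    rng = random.Random(seed)
    components, variances = [], []
    for _ in range(n_components):
        lam, v = _leading_eigenpair(cov, iters, rng)
        components.append(v)
        variances.append(lam)
        # Deflate: remove the variance already explained, so the next pass
        # finds the direction with the next-highest variance.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(p)]
               for i in range(p)]
    # Project the centred samples onto each component to get their scores.
    scores = [[sum(r[j] * v[j] for j in range(p)) for v in components]
              for r in x]
    return scores, components, variances


def top_loadings(component, names, k=3):
    """Variables with the largest absolute loading on one component."""
    ranked = sorted(zip(names, component), key=lambda t: abs(t[1]), reverse=True)
    return ranked[:k]
```

Calling `pca` on a samples-by-variables table and then `top_loadings(components[0], names)` lists the variables that contribute most to the first component. Loadings are only one heuristic for identifying the features behind a cluster, and the sign of each component is arbitrary.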
To explore all principal components (PCs) concurrently across the n samples, we present a scatterplot with the PC order on the y-axis. To explore one or two PCs in more detail, we present a standard 2D scatterplot. To highlight clusters, user-specified phenotypes can be mapped to colour, shape or point size via drop-down menus. The two graphs are linked, and clusters can be defined with a selection tool.
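The data flow behind the two linked views can be sketched as follows. This is a hedged illustration and not the software's code: all names (`assign_colours`, `overview_points`, `select_samples`, `detail_points`) and the palette are hypothetical. It also assumes that the x-axis of the overview plot carries each sample's score on the PC, and that the selection tool is a rectangle. Neither detail is stated above.

```python
PALETTE = ("#1b9e77", "#d95f02", "#7570b3", "#e7298a")  # illustrative colours


def assign_colours(phenotypes, palette=PALETTE):
    """Give each distinct phenotype level one colour, in order of first appearance."""
    levels = {}
    for value in phenotypes:
        if value not in levels:
            levels[value] = palette[len(levels) % len(palette)]
    return [levels[value] for value in phenotypes]


def overview_points(scores):
    """Overview plot: one point per (sample, PC), x = score, y = PC number (1-based)."""
    return [(score, pc + 1, sample)
            for sample, row in enumerate(scores)
            for pc, score in enumerate(row)]


def select_samples(points, x_min, x_max, y_min, y_max):
    """Indices of the samples with at least one point inside the selection rectangle."""
    return sorted({sample for x, y, sample in points
                   if x_min <= x <= x_max and y_min <= y <= y_max})


def detail_points(scores, pc_x, pc_y, selected):
    """Detail plot: (x, y, is_selected) per sample for two chosen PCs (1-based)."""
    chosen = set(selected)
    return [(row[pc_x - 1], row[pc_y - 1], i in chosen)
            for i, row in enumerate(scores)]


# Four samples scored on two PCs, with a phenotype chosen from a drop-down menu.
scores = [[2.0, 0.1], [1.8, -0.2], [-1.9, 0.3], [-2.1, -0.1]]
colours = assign_colours(["case", "case", "control", "control"])

# Drag a rectangle around the high-scoring samples on PC1 in the overview plot...
selected = select_samples(overview_points(scores), 1.0, 3.0, 0.5, 1.5)
# ...and the same samples are flagged in the PC1 versus PC2 detail plot.
linked = detail_points(scores, 1, 2, selected)
```

Keying the selection on sample indices rather than on plotted coordinates is what allows a cluster defined in one graph to be highlighted in the other.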