PrismEXP

Prediction of gene Insights from Stratified Mammalian gene co-Expression

Gene-gene co-expression can be effectively applied to impute gene functional annotations with machine learning. The elucidation of gene annotations relies on the correlation structure within gene-gene co-expression matrices. The performance of the predictions relies on the chosen gene expression data. In some applications, correlations derived from tissue-specific or cell-type-specific gene expression outperform correlations derived from global cross-tissue cross-cell-type gene expression. However, the identification of the optimal tissue and cell type is not trivial. Since tissues are made of multiple cell types, the constraint of a single tissue might be limiting in some circumstances. Here we introduce and validate a statistical approach called PRediction of gene Insights from Stratified Mammalian gene co-EXPression (PrismEXP), for accurate gene annotation prediction. We apply PrismEXP using the ARCHS4 gene expression compendium to predict a wide variety of gene annotations such as pathway memberships, phenotype associations, and regulation by transcription factors. PrismEXP outperforms single correlation matrix approaches on all tested domains. PrismEXP can enhance existing machine learning methods that use correlation matrices from other domains such as proteomics and metabolomics, requiring only minor adjustments to the existing algorithm.

Python package

The PrismEXP Python3 package enables the generation of correlation matrices needed for the prediction of gene annotations from GMT files. The memory requirement depends on the number of genes and the number of gene expression profiles used.

Default settings with the ARCHS4 mouse gene expression matrix should require less than 8GB of memory. The file formats used are HDF5 and feather. Gene expression has to be provided in H5 format. Gene expression should be stored as a matrix under "data/expression", gene symbols under "meta/genes", and sample identifiers under "meta/Sample_geo_accession".
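To check that a custom H5 file follows this layout before running PrismEXP, a small sketch with h5py is shown here. The helper names `write_toy_h5` and `missing_datasets` are hypothetical and not part of the package, the toy values are made up, and the genes-by-samples orientation of the matrix is an assumption.

```python
import os
import tempfile

import h5py
import numpy as np

# Dataset paths named in the text above.
REQUIRED_DATASETS = ("data/expression", "meta/genes", "meta/Sample_geo_accession")


def write_toy_h5(path):
    """Write a tiny H5 file using the expected layout (toy values only)."""
    # Assumption: genes are rows and samples are columns.
    counts = np.array([[10, 0], [5, 3], [0, 7]], dtype=np.int32)
    with h5py.File(path, "w") as f:
        # Intermediate groups ("data", "meta") are created automatically.
        f.create_dataset("data/expression", data=counts)
        f.create_dataset("meta/genes", data=np.array([b"TP53", b"BRCA1", b"EGFR"]))
        f.create_dataset("meta/Sample_geo_accession", data=np.array([b"GSM1", b"GSM2"]))


def missing_datasets(path):
    """Return the required dataset paths that are absent from the file."""
    with h5py.File(path, "r") as f:
        return [name for name in REQUIRED_DATASETS if name not in f]


toy_path = os.path.join(tempfile.mkdtemp(), "toy_matrix.h5")
write_toy_h5(toy_path)
print(missing_datasets(toy_path))
```

An empty list means all three datasets were found; any path printed would need to be added to the file.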

Precomputed PrismEXP predictions for annotation from Enrichr gene set libraries can be accessed from here: https://maayanlab.cloud/prismexp
The PrismEXP Appyter for all Enrichr libraries can be accessed here: https://appyters.maayanlab.cloud/PrismEXP/.

NOTE PrismEXP requires a large gene expression repository. The code expects gene expression as gene counts. Data compatible with PrismEXP can be downloaded from the ARCHS4 website.

Mouse Data (717,966 samples): https://s3.dev.maayanlab.cloud/archs4/archs4_gene_mouse_v2.1.2.h5
Human Data (620,825 samples): https://s3.dev.maayanlab.cloud/archs4/archs4_gene_human_v2.1.2.h5

Installation

Install the Python package directly from GitHub using pip.

$ pip install git+https://github.com/MaayanLab/prismexp.git

Quick usage example

import urllib.request
import prismx as px

urllib.request.urlretrieve("https://s3.dev.maayanlab.cloud/archs4/archs4_gene_human_v2.1.2.h5", "human_matrix.h5")

work_dir = "/home/maayanlab/code/prismexp"
h5_file = "human_matrix.h5"
gmt_file = px.load_library("GO_Biological_Process_2021")

cluster_number = 100

px.create_correlation_matrices(work_dir, h5_file, cluster_count=cluster_number, verbose=True)
px.features(work_dir, gmt_file, threads=4, verbose=True)
px.train(work_dir, gmt_file, verbose=True)
px.predict(work_dir, gmt_file, verbose=True)

Usage

Create gene correlation matrices

Creating gene-gene correlation matrices requires four steps:

1. Download the ARCHS4 gene expression data: https://s3.dev.maayanlab.cloud/archs4/archs4_gene_human_v2.1.2.h5 (test data is included in the package)
2. Filter genes with low expression
3. Partition gene expression profiles into a set of distinct clusters
4. Calculate gene-gene correlation within each cluster
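The gene filtering step can be pictured with a small NumPy sketch. It is an illustration only, not the package's implementation: the function name `filter_genes` is hypothetical, the genes-by-samples orientation is an assumption, and the rule used (keep a gene when enough samples reach a minimum read count) is one reading of the `read_threshold` and `sample_threshold` parameters documented for `create_correlation_matrices`.

```python
import numpy as np


def filter_genes(counts, read_threshold=20, sample_threshold=0.01):
    """Return a boolean mask of genes to keep.

    counts: 2D array with genes as rows and samples as columns (assumed).
    A gene is kept when at least `sample_threshold` of the samples have
    `read_threshold` or more reads for it.
    """
    counts = np.asarray(counts)
    fraction_expressed = (counts >= read_threshold).mean(axis=1)
    return fraction_expressed >= sample_threshold


toy_counts = np.array([
    [50, 30, 0, 25],   # reaches 20 reads in 3 of 4 samples
    [0, 1, 2, 0],      # never reaches 20 reads
    [0, 0, 100, 0],    # reaches 20 reads in 1 of 4 samples
])
mask = filter_genes(toy_counts, read_threshold=20, sample_threshold=0.5)
print(mask)
```

With a threshold of 0.5, only the first toy gene passes; the default of 0.01 is far more permissive.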

Create gene annotation predictions

The gene-gene correlation matrices are a prerequisite for creating the predictions.

1. Provide a GMT file. (Sample GMT files can be found at: https://maayanlab.cloud/Enrichr/#stats)
   - Each line of a GMT file is tab-separated and starts with a gene set name, followed by a description, followed by gene symbols. Example: potassium ion import (GO:0010107) \t description \t SLC12A3 \t KCNJ5 \t SLC12A4 \t KCNJ6 \t ...
2. Create gene expression cluster-wise predictions
3. Assemble cluster-based predictions
4. Apply the trained PrismEXP machine learning model
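The GMT layout described in the first step can be read with a few lines of plain Python. This is a minimal sketch for inspecting a library by hand; `parse_gmt` is a hypothetical helper and not a function of the package.

```python
def parse_gmt(lines):
    """Parse GMT-formatted lines into {gene set name: [gene symbols]}."""
    library = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        # First field is the name, second the description, the rest are genes.
        name, _description, *genes = fields
        library[name] = [g.strip() for g in genes if g.strip()]
    return library


toy_gmt = [
    "potassium ion import (GO:0010107)\tdescription\tSLC12A3\tKCNJ5\tSLC12A4\tKCNJ6\n",
    "\n",
]
library = parse_gmt(toy_gmt)
print(library)
```

To parse a file on disk, pass an open file handle instead of the toy list, for example `parse_gmt(open("library.gmt"))`, where the file name is a placeholder.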

Code example

The following example downloads the ARCHS4 gene expression compendium and builds the gene expression clusters (the code below sets cluster_number = 100). Depending on the hardware, this process can take a considerable amount of time. It also requires about 1GB of disk space per gene expression cluster. In addition to the per-cluster gene-gene correlation matrices, the algorithm also computes a correlation matrix across clusters. Memory consumption depends on the number of clusters, but should stay below 8GB.

Python3

I) Compute correlation matrices

The number of clusters affects the overall quality of the gene annotation predictions. Prediction quality improves in proportion to the logarithm of the number of clusters, while adding more clusters increases the runtime of the algorithm. If possible, we recommend 200-300 clusters; beyond 300 clusters, improvements are marginal.

This is the first step taken by PrismEXP. We first identify N = cluster_number clusters of samples to partition the ARCHS4 gene expression compendium. We then limit each cluster to a maximum of sample_count = 5,000 samples. After normalizing the gene expression, PrismEXP computes the pairwise correlations between all gene pairs in each cluster resulting in N correlation matrices.

import urllib.request
import prismx as px

urllib.request.urlretrieve("https://s3.dev.maayanlab.cloud/archs4/archs4_gene_human_v2.1.2.h5", "human_matrix.h5")

work_dir = "/home/maayanlab/code/prismexp/"
h5_file = "human_matrix.h5"

cluster_number = 100

px.create_correlation_matrices(work_dir,
                               h5_file,
                               cluster_count=cluster_number, 
                               sample_count=5000, 
                               cluster_gene_count=1000,
                               reuse_clustering=False,
                               verbose=True)

`create_correlation_matrices(work_dir: str, h5_file: str, cluster_count: int=100, read_threshold: int=20, sample_threshold: float=0.01, filter_samples: int=2000, min_avg_reads_per_gene: int=2, cluster_method: str="minibatch", cluster_gene_count: int=1000, sample_count: int=5000, reuse_clustering: bool=False, correlation_method: str="pearson", verbose: bool=True)`

Parameter	Type	Default	Description
work_dir	str		The directory to save the resulting clustering and correlation matrices.
h5_file	str		The path to the h5 file containing the gene expression data.
cluster_count	int	100	The number of clusters to use for the sample clustering.
read_threshold	int	20	The minimum number of reads a gene must have in at least a `sample_threshold` fraction of samples to be kept.
sample_threshold	float	0.01	The minimum fraction of samples that must contain at least `read_threshold` reads of a gene for the gene to be kept.
filter_samples	int	2000	The maximum number of samples to use for gene filtering.
min_avg_reads_per_gene	int	2	The average number of reads per gene for a sample to be considered in the clustering. Can be used to remove samples with very low library size.
cluster_method	str	"minibatch"	The clustering method to use. Options are "minibatch" and "kmeans". minibatch is much faster.
cluster_gene_count	int	1000	The number of genes to use for the sample clustering.
sample_count	int	5000	The maximum number of samples to use for calculating the correlation matrices.
reuse_clustering	bool	False	Whether to reuse the existing clustering results in the work directory.
correlation_method	str	"pearson"	The correlation method to use. Options are "pearson" and "spearman". Pearson correlation is faster and requires less memory.
verbose	bool	True	Whether to print progress messages.
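Conceptually, the core of this step is one gene-gene correlation matrix per sample cluster, with each cluster capped at `sample_count` samples. The NumPy sketch below illustrates that idea on toy data. It is not the package's code: `cluster_correlations` is a hypothetical name, the cluster labels are assumed to be given already (the clustering itself is not shown), and the expression matrix is assumed to be normalized with genes as rows.

```python
import numpy as np


def cluster_correlations(expression, labels, sample_count=5000, seed=0):
    """Compute one gene-gene Pearson correlation matrix per sample cluster.

    expression: 2D array, genes x samples (assumed orientation).
    labels: 1D integer array with one cluster label per sample.
    """
    expression = np.asarray(expression, dtype=float)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    matrices = {}
    for cluster in np.unique(labels):
        columns = np.flatnonzero(labels == cluster)
        # Cap the number of samples used per cluster.
        if len(columns) > sample_count:
            columns = rng.choice(columns, size=sample_count, replace=False)
        # Rows are variables for np.corrcoef, so this is gene-by-gene.
        matrices[int(cluster)] = np.corrcoef(expression[:, columns])
    return matrices


toy_expression = np.array([
    [1.0, 2.0, 3.0, 3.0, 2.0, 1.0],
    [2.0, 4.0, 6.0, 1.0, 2.0, 3.0],
    [3.0, 2.0, 1.0, 1.0, 2.0, 3.0],
])
toy_labels = np.array([0, 0, 0, 1, 1, 1])
matrices = cluster_correlations(toy_expression, toy_labels)
print(matrices[0][0, 1], matrices[1][0, 1])
```

In the toy data the first two genes are positively correlated in cluster 0 and negatively correlated in cluster 1, which is the kind of cluster-specific signal a single global correlation matrix would average away.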

II) Calculate average correlations between a gene and a gene set for given gene set library

This is the feature step. PrismEXP iterates over the previously generated correlation matrices and computes the average correlation (features) for the given gene set library. Features are required for model training and also for making predictions. For example, computing the features for the GO Biological Processes library is demonstrated below:

import prismx as px

work_dir = "/home/maayanlab/code/prismexp/"

# load Enrichr library to use
gmt_file = px.load_library("GO_Biological_Process_2021")

# calculate the features that are used for model training and prediction
px.features(work_dir, gmt_file, threads=4, verbose=True)

`features(work_dir: str, gmt_file: str, intersect: bool=False, threads: int=2, verbose: bool=False)`

Parameter	Type	Default	Description
work_dir	str		Path to the directory containing the correlation matrices.
gmt_file	str		Path to the gmt file containing the gene set library.
intersect	bool	False	If True, only includes unique genes present in all gene sets in the feature matrix.
threads	int	2	Number of threads to use for parallel processing.
verbose	bool	False	If True, prints progress information.
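To make the feature definition concrete, the sketch below computes the average correlation between every gene and the members of each gene set for a single correlation matrix. It is an illustration under assumptions, not the package's implementation: `gene_set_features` is a hypothetical name, the values are toy data, and for simplicity a member gene's correlation with itself is included in its own average.

```python
import numpy as np


def gene_set_features(correlation, genes, gene_sets):
    """Average correlation of every gene with the members of each gene set.

    correlation: square gene-gene matrix whose rows and columns follow `genes`.
    gene_sets: dict mapping a gene set name to a list of gene symbols.
    Returns (names, features); features has shape (n_genes, n_gene_sets).
    """
    correlation = np.asarray(correlation, dtype=float)
    index = {gene: i for i, gene in enumerate(genes)}
    names = list(gene_sets)
    features = np.full((len(genes), len(names)), np.nan)
    for j, name in enumerate(names):
        # Ignore gene set members that are missing from the matrix.
        members = [index[g] for g in gene_sets[name] if g in index]
        if members:
            features[:, j] = correlation[:, members].mean(axis=1)
    return names, features


toy_genes = ["A", "B", "C"]
toy_correlation = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]
toy_sets = {"set1": ["A", "B"], "set2": ["C", "D"]}
names, features = gene_set_features(toy_correlation, toy_genes, toy_sets)
print(names)
print(features)
```

Repeating this for each cluster-specific correlation matrix would give one feature value per cluster for every gene and gene set pair.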

III) Train a prediction model with the GO Biological Processes gene set library

The gene set library needs to be the same as the one used in the prior feature generation step.

import prismx as px

work_dir = "/home/maayanlab/code/prismexp/"

gmt_file = px.load_library("GO_Biological_Process_2021")

# build a training data set and train model
model = px.train(work_dir, gmt_file, training_size=300000, 
            test_train_split=0.1, sample_positive=40000,
            sample_negative=200000, random_state=1, verbose=True)

`train(work_dir: str, gmt_file: str, training_size: int=200000, test_train_split: float=0.1, sample_positive: int=20000, sample_negative: int=80000, random_state: int=42, verbose: bool=False)`

Parameter	Type	Default	Description
work_dir	str		Path to the directory containing the correlation matrices.
gmt_file	str		Path to the gmt file containing the gene set library.
training_size	int	200000	The number of gene sets to use for training.
test_train_split	float	0.1	The proportion of the training data to use for testing.
sample_positive	int	20000	The number of positive samples to use in the balanced training data.
sample_negative	int	80000	The number of negative samples to use in the balanced training data.
random_state	int	42	The seed for the random number generator.
verbose	bool	False	If True, prints progress information.
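As a rough picture of what the sampling parameters control, the following sketch draws positive and negative (gene, gene set) pairs from a toy library and holds out a fraction for testing. This is a hypothetical illustration of the parameter meanings only. The helpers `sample_training_pairs` and `split_pairs` are not part of the package, and the package's own sampling and model training may differ.

```python
import random


def sample_training_pairs(library, genes, sample_positive, sample_negative, random_state=42):
    """Sample (gene, gene set, label) triples; label 1 means the gene is a member."""
    rng = random.Random(random_state)
    positives, negatives = [], []
    for name, members in library.items():
        member_set = set(members)
        for gene in genes:
            if gene in member_set:
                positives.append((gene, name, 1))
            else:
                negatives.append((gene, name, 0))
    # Never request more pairs than are available.
    chosen = rng.sample(positives, min(sample_positive, len(positives)))
    chosen += rng.sample(negatives, min(sample_negative, len(negatives)))
    rng.shuffle(chosen)
    return chosen


def split_pairs(pairs, test_train_split=0.1):
    """Hold out the first `test_train_split` fraction of pairs for testing."""
    n_test = int(round(len(pairs) * test_train_split))
    return pairs[n_test:], pairs[:n_test]


toy_genes = ["A", "B", "C", "D", "E"]
toy_library = {"s1": ["A", "B"], "s2": ["C"]}
pairs = sample_training_pairs(toy_library, toy_genes, sample_positive=2, sample_negative=4)
train_pairs, test_pairs = split_pairs(pairs, test_train_split=0.5)
print(len(train_pairs), len(test_pairs))
```

Using a fixed `random_state` makes the sampled pairs reproducible between runs.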

Once the model is trained it can be applied to any gene set library of choice. Models trained with the GO BP library were tested on all other gene set libraries in Enrichr.

IV) Predict gene functions

The prediction step of the model can be used across different libraries. There is also a low risk of overfitting the model, so it can be trained on and applied to the same gene set library. In this example the model was trained on GO Biological Processes but applied to the KEGG pathways library. The prediction step recomputes the features unless explicitly instructed to reuse them (skip_features=True). The prediction is saved as a feather file at {work_dir}/predictions/{gmt_file}.f

import prismx as px

work_dir = "/home/maayanlab/code/prismexp/"

gmt_file = px.load_library("KEGG_2021_Human")

px.predict(work_dir, gmt_file, step_size=500, verbose=True)

To read the prediction matrix (genes as rows and gene sets as columns):

import pandas as pd
import feather

work_dir = "/home/maayanlab/code/prismexp/"

predictions = pd.read_feather(work_dir+"/predictions/KEGG_2021_Human.f").set_index("index")
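Once the matrix is loaded, a common next step is to list the highest-scoring genes for one gene set. The sketch below does this on a toy DataFrame with the same orientation (genes as rows, gene sets as columns). The helper `top_genes`, the gene set names, and the scores are made up for illustration, and it assumes that a higher score means a stronger predicted association.

```python
import pandas as pd


def top_genes(predictions, gene_set, known_members=(), n=3):
    """Return the n highest-scoring genes for one gene set, skipping known members."""
    scores = predictions[gene_set].drop(labels=list(known_members), errors="ignore")
    return scores.sort_values(ascending=False).head(n)


# Toy prediction matrix: genes as rows, gene sets as columns.
toy_predictions = pd.DataFrame(
    {"Glycolysis": [0.9, 0.1, 0.7, 0.4], "Apoptosis": [0.2, 0.8, 0.3, 0.6]},
    index=["PKM", "TP53", "ALDOA", "BAX"],
)
candidates = top_genes(toy_predictions, "Glycolysis", known_members=["PKM"], n=2)
print(candidates)
```

Passing the genes already annotated to the gene set as `known_members` leaves only candidate new members in the ranking.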

`predict(work_dir: str, gmt_file: str, model=None, step_size: int=1000, intersect: bool=False, normalize: bool=False, verbose: bool=False, skip_features: bool=False, threads: int=2)`

Parameter	Type	Default	Description
work_dir	str		Path to the directory containing the correlation matrices and precomputed model.
gmt_file	str		Path to the gmt file containing the gene set library.
model	lightGBM model	`None`	The prediction model to use. If `None`, loads the model from the workdir.
step_size	int	1000	The number of samples to process at a time.
intersect	bool	False	If True, only includes unique genes present in all gene sets in the feature matrix.
normalize	bool	False	If True, normalizes the final prediction values using a z-score.
verbose	bool	False	If True, prints progress information.
skip_features	bool	False	If True, skips the feature computation step.
threads	int	2	Number of threads to use for parallel processing.
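The `normalize` option is described as a z-score normalization of the final prediction values. A minimal sketch of such a normalization follows, assuming the scores are standardized per gene set (per column); the axis the package uses is not stated here, and `zscore_columns` is a hypothetical name.

```python
import numpy as np


def zscore_columns(scores):
    """Standardize each column to mean 0 and standard deviation 1."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean(axis=0)
    std = scores.std(axis=0)
    std[std == 0] = 1.0  # constant columns stay at zero instead of dividing by zero
    return (scores - mean) / std


toy_scores = np.array([
    [1.0, 5.0],
    [2.0, 5.0],
    [3.0, 5.0],
])
normalized = zscore_columns(toy_scores)
print(normalized)
```

After this transformation, scores from different gene sets are on a comparable scale.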

Bridge gene set enrichment analysis (bridgeGSEA)

PrismEXP gene set predictions can be used to enhance gene set enrichment analysis and to identify novel genes in enriched pathways and biological processes.

import prismx as px
import prismx.gsea as pxgsea
import blitzgsea as blitz
import urllib.request
import feather
import pandas as pd

work_dir = "/home/maayanlab/code/prismexp/"

url = "https://github.com/MaayanLab/blitzgsea/raw/main/testing/ageing_muscle_gtex.tsv"
urllib.request.urlretrieve(url, "ageing_muscle_gtex.tsv")

# read signature as pandas dataframe
signature = pd.read_csv("ageing_muscle_gtex.tsv")

# use enrichr submodule to retrieve gene set library
library = blitz.enrichr.get_library("GO_Biological_Process_2021")

# load PrismEXP predictions
predictions = pd.read_feather(work_dir+"/predictions/GO_Biological_Process_2021.f").set_index("index")

result = px.bridgegsea.bridge_gsea(signature, library, predictions)
