Simons Foundation Mouse Project

Author: Edward Huang

TCGA Preprocessing

Split the TCGA dataset into multiple networks, each corresponding to a specific type of cancer. This should be run before anything else.
```
$ python split_tcga_dataset.py
```

Downloading annotations

Download GO annotations. Go to biomart. Choose database -> "Ensembl Genes 87" -> Mus musculus genes/Homo sapiens genes

"Attributes" -> "Gene" -> uncheck "Ensembl Transcript ID" -> "External" -> check "GO Term Name" and "GO domain"

Hit "Results" at the top, tick "unique results only", and export to file as TSV. Each export is named mart_export.txt. Rename the mouse export to ensmusg_to_go.txt and move it to ./data/mouse_data. Rename the human export to ensg_to_go.txt and move it to ./data/tcga_data.
Download DBGAP annotations. Go to http://veda.cs.uiuc.edu/TCGA_classify/msigdb/gene_sets/dbgap_all/, download dbgap.edge, rename it to dbgap.txt, and move it to ./data/. To get the translation file, go to ensembl.org/biomart and build this query: "Dataset" -> "Homo sapiens genes (GRCh38.p7)"; "Filters" -> "Multi Species Comparisons" -> "Orthologous Mouse Genes: Only"; "Attributes" -> "Ensembl Gene ID" (uncheck the transcript ID). Add another dataset, "[Ensembl genes 87] Mouse genes", then hit "Results". Tick "unique results only" and export as TSV. Rename the resulting mart_export.txt to ensg_to_ensmusg.txt and move it to ./data/mouse_data/.
Download DisGeNET (GWAS) annotations. Go to http://www.disgenet.org/web/DisGeNET/menu/downloads, download the curated gene-disease associations, and move them to ./data/. Then go to biomart (as for GO and DBGAP): "Dataset" -> "Homo sapiens genes"; "Attributes" -> "Gene ID" (uncheck the transcript ID) and "EntrezGene ID". Export, rename the file to ensg_to_entrez.txt, and move it to ./data/tcga_data/. Only the human genes are needed: for mouse, we translate Entrez -> ENSG -> ENSMUSG, because the direct Entrez-ENSMUSG database is quite sparse. Get the ensg_to_hgnc.txt file in a similar manner.
Download CTD gene-pathway associations. Go to http://ctdbase.org/downloads/ and download the TSV files for the different gene associations. The gene-pathway associations are in CTD_genes_pathways.tsv. Move it to ./data/.
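
The Entrez -> ENSG -> ENSMUSG translation described for the DisGeNET annotations amounts to composing two lookup tables. A minimal sketch, assuming both tables have already been read into dictionaries of sets; the function name and the toy IDs are made up for illustration and are not taken from the repository:

```python
def compose_mappings(entrez_to_ensg, ensg_to_ensmusg):
    """Chain Entrez -> ENSG and ENSG -> ENSMUSG into Entrez -> ENSMUSG.

    Both inputs map an ID to a set of IDs, since one gene can have
    several counterparts. Entrez IDs with no mouse ortholog are dropped.
    """
    entrez_to_ensmusg = {}
    for entrez_id, ensg_ids in entrez_to_ensg.items():
        mouse_ids = set()
        for ensg_id in ensg_ids:
            mouse_ids.update(ensg_to_ensmusg.get(ensg_id, ()))
        if mouse_ids:
            entrez_to_ensmusg[entrez_id] = mouse_ids
    return entrez_to_ensmusg


# Toy placeholder IDs (hypothetical, not real gene identifiers).
toy_entrez_to_ensg = {"E1": {"ENSG_TOY_1"}, "E2": {"ENSG_TOY_2", "ENSG_TOY_3"}, "E3": {"ENSG_TOY_4"}}
toy_ensg_to_ensmusg = {"ENSG_TOY_1": {"ENSMUSG_TOY_1"}, "ENSG_TOY_3": {"ENSMUSG_TOY_2"}}

toy_result = compose_mappings(toy_entrez_to_ensg, toy_ensg_to_ensmusg)
print(toy_result)
```

In this toy input, E3 has a human ID but no mouse ortholog, so it does not appear in the result.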

Creating the gene network

Plot the standard deviation distribution of the genes, and write to file. Only keeps genes with standard deviation > 0.1.
```
$ python standard_deviation_hist.py mouse/all/tcga
```
If the command line argument is 'all', runs separately for each TCGA cancer type.
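
The filtering step can be sketched as follows. This is only an illustration of the "standard deviation > 0.1" rule, not the code in standard_deviation_hist.py: the function name, the dictionary input format, and the use of the sample (rather than population) standard deviation are all assumptions.

```python
import statistics

STD_THRESHOLD = 0.1  # cutoff stated in this README


def filter_high_std_genes(expression, threshold=STD_THRESHOLD):
    """Return {gene: std} for genes whose expression std exceeds threshold.

    `expression` maps a gene ID to its list of expression values.
    Assumption: sample standard deviation (statistics.stdev) is used.
    """
    kept = {}
    for gene, values in expression.items():
        if len(values) < 2:
            continue  # sample standard deviation is undefined for one value
        std = statistics.stdev(values)
        if std > threshold:
            kept[gene] = std
    return kept


# Toy data (made up): GENE_A is nearly constant, GENE_B varies a lot.
toy_expression = {
    "GENE_A": [1.0, 1.02, 0.98, 1.01],
    "GENE_B": [0.2, 1.5, 0.9, 2.3],
}
high_std = filter_high_std_genes(toy_expression)
print(sorted(high_std))
```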
Compute Pearson coefficients between gene expression values to find correlated genes. Output file is high_std_network.txt.
```
$ python gene_edge_weights.py mouse/tcga/all
```
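
For reference, a pairwise Pearson computation can be sketched in plain Python. This is not the implementation in gene_edge_weights.py; the function names, the input format, and the `min_abs_r` cutoff are hypothetical, and the README does not say whether or how the real script thresholds edges.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant vector has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def correlated_edges(expression, min_abs_r=0.9):
    """Return (gene1, gene2, r) for every pair with |r| >= min_abs_r.

    `min_abs_r` is an illustrative cutoff, not a value from the repository.
    """
    edges = []
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if abs(r) >= min_abs_r:
            edges.append((g1, g2, r))
    return edges


# Toy data (made up): GENE_A and GENE_B rise together, GENE_C does not.
toy_expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0],
    "GENE_B": [2.0, 4.0, 6.0, 8.5],
    "GENE_C": [4.0, 1.0, 3.0, 2.0],
}
edges = correlated_edges(toy_expression)
for g1, g2, r in edges:
    print(g1, g2, round(r, 3))
```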
Create GO dictionary JSON files for each gene type. Must run with the argument mf_go_go to create the MF GO-GO dictionary.
```
$ python dump_label_dictionaries.py mouse/all/tcga/mf_go_go go/dbgap/gwas/nci
```
The last argument is optional if the first argument is mf_go_go.

WGCNA Pre-processing

Must have previously run split_tcga_dataset.py and standard_deviation_hist.py.
```
$ python preprocess_wgcna.py mouse/tcga
```
Move results from preprocessing to working directory of R. Run wgcna.R in 64-bit R (you can just copy paste the contents into the R shell). Move output (%s_module_membership.txt) to ./data/wgcna_data. Takes roughly 45 minutes per dataset.

Clustering pipeline

The script below runs every step in this section. Prerequisites:
Run create_clustering_input.py before running evaluate_clustering for WGCNA, so that WGCNA has the "true" network for evaluating the in/out ratio.
Run ./wgcna/wgcna.R before running simulated_annealing.py, so that we know how many clusters to use.
Run everything for WGCNA before plotting, so that there is something to plot for WGCNA.

$ python full_pipeline.py mouse/tcga/tcga_idx wlogv/wgcna run_num

Adding GO nodes and formatting for clustering

Create four files: a network and a real network, each with and without GO labels. RUNNUM in the filenames indicates the run number. For networks with GO labels, we add in the full set of MF terms. Output files: network_go_RUNNUM.txt, real_network_go_RUNNUM.txt, network_no_go_RUNNUM.txt, and real_network_no_go_RUNNUM.txt.
```
$ python create_clustering_input.py mouse/tcga/tcga_idx run_num -b<bootstrap-optional>
```

Clustering

Compile the clustering code inside the sim_anneal folder. If the build fails with a static error for EdgeWeightThreshold, add static in front of its declaration in cs-grn.h.
```
$ cd makedir
$ rm *
$ cmake ..
$ make
```
orth.txt just needs to contain at least one gene in the network.
Run the simulated annealing clustering code to cluster the networks.
```
$ python simulated_annealing.py data_type objective_function run_num go/no_go go_num <if go>
```
Run the clustering on networks without GO only once: because we use a random seed, it will be the same network for any given percentage of the raw network.
Run the Perl script evaluate_clustering.pl to evaluate cluster densities. An "Illegal division by zero" error usually means a file doesn't exist.
```
$ python evaluate_clustering.py data_type objective_function run_num
```

Cluster analysis

Compute GO enrichments for each clustering.

$ python compute_label_enrichments.py data_type objective_function run_num go/dbgap/gwas
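
The README does not say which statistical test compute_label_enrichments.py uses. One common choice for this kind of enrichment is a one-sided hypergeometric test, sketched below purely as an illustration; the function name and arguments are hypothetical and this should not be read as the script's actual method.

```python
from math import comb


def hypergeom_enrichment_pvalue(n_universe, n_labeled, n_cluster, n_overlap):
    """P(overlap >= n_overlap) when drawing n_cluster genes at random.

    n_universe: genes in the network
    n_labeled:  genes in the network annotated with the term
    n_cluster:  genes in the cluster
    n_overlap:  cluster genes annotated with the term
    """
    total = comb(n_universe, n_cluster)
    tail = 0
    for i in range(n_overlap, min(n_labeled, n_cluster) + 1):
        # comb returns 0 when the second argument exceeds the first,
        # so impossible overlaps contribute nothing.
        tail += comb(n_labeled, i) * comb(n_universe - n_labeled, n_cluster - i)
    return tail / total


# Toy numbers (made up): 10 genes, 4 labeled, a cluster of 3 that are all labeled.
p_value = hypergeom_enrichment_pvalue(10, 4, 3, 3)
print(p_value)
```

A smaller p-value means the term is more over-represented in the cluster than expected by chance.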

Check if genes in clusters labeled by the most enriched BP term in that cluster roughly have the same in/(in + out) as genes not labeled by the term.
```
$ python cheating_evaluation.py data_type objective_function run_num
```
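
The in/(in + out) ratio mentioned above can be sketched for an arbitrary gene set as follows. The definitions here are assumptions: "in" is taken as the total weight of edges with both endpoints in the set and "out" as the total weight of edges with exactly one endpoint in the set. The exact definitions used by evaluate_clustering.pl and cheating_evaluation.py are not shown in this README.

```python
def in_out_ratio(edges, genes):
    """Return in / (in + out) for a gene set, or None if it has no edges.

    edges: iterable of (gene1, gene2, weight)
    genes: set of gene IDs (for example, a cluster or the labeled genes in it)
    """
    in_weight = 0.0
    out_weight = 0.0
    for g1, g2, weight in edges:
        inside = (g1 in genes) + (g2 in genes)  # number of endpoints in the set
        if inside == 2:
            in_weight += weight
        elif inside == 1:
            out_weight += weight
    total = in_weight + out_weight
    if total == 0:
        return None  # avoid dividing by zero for isolated gene sets
    return in_weight / total


# Toy network (made up): a path a - b - c - d.
toy_edges = [("a", "b", 1.0), ("b", "c", 0.5), ("c", "d", 2.0)]
ratio = in_out_ratio(toy_edges, {"a", "b"})
print(ratio)
```

Comparing this ratio for the genes labeled by the top term against the ratio for the remaining genes in the same cluster gives the kind of check described above.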

Analyze the properties of the clusterings.

$ python cluster_info_summary.py data_type objective_function run_num

Plotting in-density versus top GO enrichment.

$ python plot_best_clusters.py data_type run_num go/go_auc/dbgap

Plotting box plots: one for enrichment, one for in/(in + out). Plots for run_num values 1-10.
```
$ python box_plot_density_and_enrichment.py data_type clustering_method
```

Other tries

ProSNet

Must first run create_clustering_input.py. Directly clustering on ProSNet gives poor in-density/out-density results.

Run ProSNet on the networks created by create_clustering_input.py.

$ python low_dimensional_nodes_prosnet.py mouse/tcga_idx run_num
$ python prosnet_kmeans.py mouse/tcga_idx run_num
$ python evaluate_clustering.py mouse/tcga_idx prosnet run_num
$ python compute_label_enrichments.py mouse/tcga_idx prosnet run_num go/dbgap/gwas/kegg/ctd
$ python cluster_info_summary.py mouse/tcga_idx prosnet run_num

Cluster-One and MCL

Runs the full pipeline after Sheng sends clusters in ./Sheng/Module/. Does not plot.
```
$ python evaluate_co_and_mcl.py mouse/tcga_cancer_index run_num
```
