Author: Edward Huang
-
Split the TCGA dataset into multiple networks, each corresponding to a specific type of cancer. This should be run before anything else.
$ python split_tcga_dataset.py
-
Download GO annotations. Go to BioMart. Choose database -> "Ensembl Genes 87" -> Mus musculus genes or Homo sapiens genes (one export per species).
"Attributes" -> "Gene" -> uncheck "Ensembl Transcript ID" -> "External" -> check "GO Term Name" and "GO domain"
Hit "Results" at the top, tick "unique results only", and export to file as TSV. Each export is named mart_export.txt. Rename the mouse export to ensmusg_to_go.txt and move it to ./data/mouse_data. Rename the human export to ensg_to_go.txt and move it to ./data/tcga_data.
-
Download DBGAP annotations. Go to http://veda.cs.uiuc.edu/TCGA_classify/msigdb/gene_sets/dbgap_all/, download dbgap.edge, rename it to dbgap.txt, and move it to ./data/.
To get the translation files, obtain mart_export.txt by going to ensembl.org/biomart:
Dataset -> Homo sapiens genes (GRCh38.p7)
Filters -> Multi Species Comparisons -> Orthologous Mouse Genes: Only
Attributes -> Ensembl Gene ID, uncheck transcript ID
Add another dataset, [Ensembl genes 87] Mouse genes, then hit "Results".
Export as TSV, tick "unique results only". Move to ./data/mouse_data/ and rename to ensg_to_ensmusg.txt.
-
Download DisGeNET (GWAS) annotations. Go to http://www.disgenet.org/web/DisGeNET/menu/downloads, download the curated gene-disease associations, and move the file to ./data/.
Go to BioMart (as for GO and DBGAP):
Dataset -> Homo sapiens genes
Attributes -> Gene ID, uncheck transcript, check EntrezGene ID
Export, rename to ensg_to_entrez.txt, and move to ./data/tcga_data/.
Only get human genes: for mouse, we translate Entrez -> ENSG -> ENSMUSG, because the direct Entrez-ENSMUSG database is quite sparse. Get the ensg_to_hgnc.txt file in a similar manner.
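The Entrez -> ENSG -> ENSMUSG translation can be sketched as the composition of two one-to-many mappings. This is only an illustration: the helper names and gene identifiers are made up, and it assumes ensg_to_entrez.txt lists the ENSG ID in the first column.

```python
from collections import defaultdict


def build_mapping(pairs):
    """Collect (key, value) pairs into a dict of sets, skipping empty fields."""
    mapping = defaultdict(set)
    for key, value in pairs:
        if key and value:
            mapping[key].add(value)
    return mapping


def compose(first, second):
    """Chain two one-to-many mappings: a -> b and b -> c give a -> c."""
    chained = defaultdict(set)
    for a, bs in first.items():
        for b in bs:
            chained[a] |= second.get(b, set())
    # Drop keys that found no target in the second mapping.
    return {a: cs for a, cs in chained.items() if cs}


# Made-up identifiers, for illustration only.
ensg_entrez_rows = [("ENSG_A", "101"), ("ENSG_B", "102"), ("ENSG_C", "")]
ensg_ensmusg_rows = [("ENSG_A", "ENSMUSG_X"), ("ENSG_A", "ENSMUSG_Y")]

# Assumed column order is ENSG first, so flip each row to key on Entrez.
entrez_to_ensg = build_mapping((entrez, ensg) for ensg, entrez in ensg_entrez_rows)
ensg_to_ensmusg = build_mapping(ensg_ensmusg_rows)
entrez_to_ensmusg = compose(entrez_to_ensg, ensg_to_ensmusg)
```

Here Entrez ID 102 is dropped because ENSG_B has no mouse ortholog in the sample rows.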
-
Download CTD gene-pathway associations. Go to http://ctdbase.org/downloads/, and download the tsv files for different gene associations. Gene-pathway associations are under CTD_genes_pathways.tsv. Move it to ./data/.
-
Plot the standard deviation distribution of the genes, and write to file. Only keeps genes with standard deviation > 0.1.
$ python standard_deviation_hist.py mouse/all/tcga
If the command line argument is 'all', runs on each TCGA cancer type separately.
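A minimal sketch of the standard deviation filter, assuming expression values are held as a dict from gene to a list of floats. The function name and sample data are hypothetical, and whether the script uses the sample or the population standard deviation is not stated here; the sketch uses the sample version.

```python
import statistics

STD_THRESHOLD = 0.1  # cutoff stated in the notes above


def high_std_genes(expression, threshold=STD_THRESHOLD):
    """Return only the genes whose expression standard deviation exceeds threshold."""
    return {
        gene: values
        for gene, values in expression.items()
        if len(values) > 1 and statistics.stdev(values) > threshold
    }


# Made-up expression values, for illustration only.
expression = {
    "gene_a": [1.0, 1.0, 1.0, 1.0],      # constant, std is 0
    "gene_b": [0.0, 1.0, 2.0, 3.0],      # std well above the cutoff
    "gene_c": [5.00, 5.05, 5.00, 5.05],  # std below the cutoff
}
kept = high_std_genes(expression)
```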
-
Compute Pearson coefficients between gene expression values to find correlated genes. Output file is high_std_network.txt.
$ python gene_edge_weights.py mouse/tcga/all
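A minimal sketch of computing Pearson coefficients for every gene pair. The function names and data are hypothetical, and the real script may filter or threshold the edges; returning 0.0 for a constant vector is an assumption of this sketch.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant vector as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def correlation_edges(expression):
    """Return (gene1, gene2, r) for every unordered pair of genes."""
    return [
        (g1, g2, pearson(expression[g1], expression[g2]))
        for g1, g2 in combinations(sorted(expression), 2)
    ]


# Made-up expression values, for illustration only.
expression = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [2.0, 4.0, 6.0, 8.0],
    "gene_c": [4.0, 3.0, 2.0, 1.0],
}
edges = correlation_edges(expression)
```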
-
Create GO dictionary JSON files for each gene type. Must run with argument mf_go_go to create the MF GO-GO dictionary.
$ python dump_label_dictionaries.py mouse/all/tcga/mf_go_go go/dbgap/gwas/nci
The last argument is optional if the first argument is mf_go_go.
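A rough sketch of turning a BioMart TSV export into a label dictionary and serializing it as JSON. The column order (gene ID, GO term name, GO domain), the term -> genes direction, and the function name are assumptions of this sketch, not necessarily what dump_label_dictionaries.py does.

```python
import csv
import io
import json
from collections import defaultdict


def read_go_dictionary(handle, domain=None):
    """Map each GO term name to the sorted list of genes annotated with it."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    go_to_genes = defaultdict(set)
    for row in reader:
        if len(row) < 3:
            continue
        gene, term, term_domain = row[:3]
        if not term:
            continue  # gene without a GO annotation
        if domain is not None and term_domain != domain:
            continue
        go_to_genes[term].add(gene)
    return {term: sorted(genes) for term, genes in go_to_genes.items()}


# Made-up rows standing in for ensg_to_go.txt, for illustration only.
sample = io.StringIO(
    "Ensembl Gene ID\tGO Term Name\tGO domain\n"
    "ENSG_A\tprotein binding\tmolecular_function\n"
    "ENSG_B\tprotein binding\tmolecular_function\n"
    "ENSG_A\tcell cycle\tbiological_process\n"
    "ENSG_C\t\t\n"
)
bp_dict = read_go_dictionary(sample, domain="biological_process")
as_json = json.dumps(bp_dict, sort_keys=True)
```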
-
Must have previously run split_tcga_dataset.py and standard_deviation_hist.py.
$ python preprocess_wgcna.py mouse/tcga
-
Move results from preprocessing to working directory of R. Run wgcna.R in 64-bit R (you can just copy paste the contents into the R shell). Move output (%s_module_membership.txt) to ./data/wgcna_data. Takes roughly 45 minutes per dataset.
This script runs everything below in this section. Notes:
Must run create_clustering_input.py before running evaluate_clustering.py for WGCNA, so that WGCNA has the "true" network for evaluating the in/out ratio.
Must run ./wgcna/wgcna.R before running simulated_annealing.py, so that we know how many clusters to use.
Must run everything for WGCNA before plotting, so that there is something to plot for WGCNA.
$ python full_pipeline.py mouse/tcga/tcga_idx wlogv/wgcna run_num
-
Create 4 files: a network and a real network, each with and without GO labels. Output files are network_go_RUNNUM.txt, real_network_go_RUNNUM.txt, network_no_go_RUNNUM.txt, and real_network_no_go_RUNNUM.txt, where RUNNUM is the run number. For networks with GO labels, we add in the full set of MF terms.
$ python create_clustering_input.py mouse/tcga/tcga_idx run_num -b<bootstrap-optional>
-
Compile the clustering code inside the sim_anneal folder. If you get a static error for EdgeWeightThreshold, add static in front of its declaration in cs-grn.h.
$ cd makedir
$ rm *
$ cmake ..
$ make
orth.txt just needs to contain at least one gene in the network.
-
Run the simulated annealing clustering code to cluster on the networks.
$ python simulated_annealing.py data_type objective_function run_num go/no_go go_num <if go>
Run the clustering on networks without GO only once; since we use a random seed, the network is the same for any given percentage of the raw network.
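To illustrate why a seeded run is repeatable, here is a small sketch that samples a fixed fraction of edges with a seeded generator. The function, the seed value, and the idea that the network is built by sampling edges this way are all assumptions for illustration; the actual construction lives in the pipeline scripts.

```python
import random


def sample_edges(edges, fraction, seed=0):
    """Pick a fixed fraction of edges using a private, seeded generator."""
    rng = random.Random(seed)  # same seed gives the same sequence of draws
    count = int(len(edges) * fraction)
    return rng.sample(edges, count)


# Made-up edge list, for illustration only.
edges = [("g%d" % i, "g%d" % (i + 1)) for i in range(10)]
first = sample_edges(edges, 0.5)
second = sample_edges(edges, 0.5)
```

Both calls return the same five edges, so repeating the clustering on that network would add nothing new.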
-
Runs the Perl script evaluate_clustering.pl to evaluate cluster densities. Illegal division by zero usually means a file doesn't exist.
$ python evaluate_clustering.py data_type objective_function run_num
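A sketch of one way to compute an in/(in + out) ratio per cluster from weighted edges. The exact density definition used by evaluate_clustering.pl may differ, and the names and data here are hypothetical. Returning None for a cluster with no edges avoids the division by zero mentioned above.

```python
from collections import defaultdict


def in_fraction_per_cluster(edges, cluster_of):
    """For each cluster, return in / (in + out) edge weight, or None if it has no edges."""
    in_weight = defaultdict(float)
    out_weight = defaultdict(float)
    for u, v, weight in edges:
        cu, cv = cluster_of.get(u), cluster_of.get(v)
        if cu is None or cv is None:
            continue  # edge touches a gene that was not clustered
        if cu == cv:
            in_weight[cu] += weight
        else:
            # An edge between clusters counts as "out" for both of them.
            out_weight[cu] += weight
            out_weight[cv] += weight
    result = {}
    for cluster in set(cluster_of.values()):
        total = in_weight[cluster] + out_weight[cluster]
        result[cluster] = in_weight[cluster] / total if total > 0 else None
    return result


# Made-up network and clustering, for illustration only.
edges = [("a", "b", 1.0), ("b", "c", 0.5), ("c", "d", 1.0)]
cluster_of = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 3}
ratios = in_fraction_per_cluster(edges, cluster_of)
```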
-
Compute GO enrichments for each clustering.
$ python compute_label_enrichments.py data_type objective_function run_num go/dbgap/gwas
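The statistical test used for enrichment is not stated in these notes. One common choice is the hypergeometric tail probability, sketched below with only the standard library; treat the function and the numbers as illustrative assumptions, not as what compute_label_enrichments.py does.

```python
from math import comb


def hypergeom_tail(population, labeled, cluster_size, overlap):
    """P(X >= overlap) when drawing cluster_size genes from population,
    of which `labeled` carry the term."""
    upper = min(labeled, cluster_size)
    total = comb(population, cluster_size)
    hits = sum(
        comb(labeled, k) * comb(population - labeled, cluster_size - k)
        for k in range(overlap, upper + 1)
    )
    return hits / total


# Made-up counts: 10 genes, 5 labeled with the term, a cluster of 5 genes.
p_all_labeled = hypergeom_tail(10, 5, 5, 5)  # every cluster gene carries the term
p_any = hypergeom_tail(10, 5, 5, 0)          # at least zero labeled genes
```

A smaller tail probability means the overlap between the cluster and the labeled genes is less likely to arise by chance.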
-
Check if genes in clusters labeled by the most enriched BP term in that cluster roughly have the same in/(in + out) as genes not labeled by the term.
$ python cheating_evaluation.py data_type objective_function run_num
-
Analyze the properties of the clusterings.
$ python cluster_info_summary.py data_type objective_function run_num
-
Plotting in-density versus top GO enrichment.
$ python plot_best_clusters.py data_type run_num go/go_auc/dbgap
-
Plotting box plots: one for enrichment, one for in/(in + out). Plots run_nums 1-10.
$ python box_plot_density_and_enrichment.py data_type clustering_method
Must first run create_clustering_input.py. Directly clustering on ProSNet gives poor in-density/out-density results.
-
Run ProSNet on the networks created by create_clustering_input.py.
$ python low_dimensional_nodes_prosnet.py mouse/tcga_idx run_num
$ python prosnet_kmeans.py mouse/tcga_idx run_num
$ python evaluate_clustering.py mouse/tcga_idx prosnet run_num
$ python compute_label_enrichments.py mouse/tcga_idx prosnet run_num go/dbgap/gwas/kegg/ctd
$ python cluster_info_summary.py mouse/tcga_idx prosnet run_num
-
Runs the full pipeline after Sheng sends clusters in ./Sheng/Module/. Does not plot.
$ python evaluate_co_and_mcl.py mouse/tcga_cancer_index run_num