Onco Knowledge Explorer

Explore cancer data interactively.

Raw dataset
- Directed edges (TDI pairs; the same path in different patients counts as a different edge): 11,769,129 (filtered -> 1+3,281,018).
Reliable inputs:
- Github/inputData/ensemble.txt: 1+3,281,018
Metrics of ensemble.txt
- size patients: 4452/4468

can	blca	brca	coad	esca	gbm	hnsc	kirc	kirp	lihc	luad	lusc	ov	prad	read	stad	ucec
#pat	200	841	182	149	201	458	424	168	147	383	136	319	398	77	176	193

size SGA sets: 559
size DEG sets: 8,844
size (SGA | DEG): 9,162 ; size (SGA & DEG) = 241
size Directed edges(SGA->DEG pairs) = 33,149
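
The set sizes above can be recomputed with a short script. This is only a sketch, not the code that produced the numbers: it assumes ensemble.txt is tab-separated with one header row, with the SGA in field 5 and the DEG in field 6 (as the cut commands later in this README suggest). The function name and the returned keys are made up for illustration.

```python
import csv

def ensemble_metrics(lines, sga_col=4, deg_col=5):
    """Count unique SGAs, DEGs, their union/intersection and unique SGA->DEG pairs.

    `lines` is any iterable of tab-separated text lines with one header row.
    Column indices are 0-based, so 4 and 5 correspond to `cut -f5` and `cut -f6`.
    """
    reader = csv.reader(lines, delimiter="\t")
    next(reader, None)  # skip the header row (the "1+" in the record counts)
    sgas, degs, edges = set(), set(), set()
    for row in reader:
        if len(row) <= max(sga_col, deg_col):
            continue  # skip blank or truncated rows
        sga, deg = row[sga_col], row[deg_col]
        sgas.add(sga)
        degs.add(deg)
        edges.add((sga, deg))  # the same pair in several patients counts once
    return {
        "sga": len(sgas),
        "deg": len(degs),
        "union": len(sgas | degs),
        "intersection": len(sgas & degs),
        "edges": len(edges),
    }
```

Usage would be along the lines of `with open("ensemble.txt") as fh: print(ensemble_metrics(fh))`.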

Next steps:

It looks like an SGA2DEG edge-weight threshold (e.g., 1) should also be considered when constructing the network: 0.1438, 0.3306 and 0.3183 of the sga2deg edges in brca, gbm and ov, respectively, have an edge weight of 1.0. In pancan, every edge weight is larger than 5.0.
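
A minimal sketch of the check described above: measure what fraction of edges sit at weight 1.0, then drop edges at or below a threshold. The edge representation (a dict from (sga, deg) to weight), the function names, and the choice of a strict "greater than" comparison are all assumptions made for illustration.

```python
def fraction_at_weight(edges, weight=1.0):
    """Fraction of edges whose weight equals `weight` (0.0 for an empty graph)."""
    if not edges:
        return 0.0
    hits = sum(1 for w in edges.values() if w == weight)
    return hits / len(edges)

def apply_threshold(edges, threshold=1.0):
    """Keep only edges whose weight is strictly greater than `threshold`."""
    return {pair: w for pair, w in edges.items() if w > threshold}
```

Whether the cut should be strict or inclusive depends on how the weights are defined, so the comparison is worth revisiting once that is settled.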

Build up different affinity metrics of SGA.
Figure out the KEGG pathway.
GoTerm also has pathway information, and seems to cover more than KEGG pathway, e.g., http://amigo.geneontology.org/amigo/term/GO:0007219 (Notch signaling pathway).
Run spectral clustering.
Run t-SNE using scaled 8k-Dim 0/1 vector.
Run hierarchical clustering on DEG and cluster the SGA.

Check the PI3K etc. pathway.

Try ProPPR rule to classify.

Compare the results.

Next? Try GoTerm merge and clustering.

SGA profile -> spectral clustering Weight CC positive predictive value

Go Enrichment analysis.

What I haven't tried:

I didn't try to measure the closeness of the SGA & DEG by considering the occurrence.

I didn't try to plot the occurrence of genes from a specific pathway on the figure generated by t-SNE.

Use the existing algorithm from Dr. Lu to cluster the GoTerm, and then test with our ProPPR rules.
Compare the SGA network in different cancers.
Continue to test with pathway, although they overlap little.
Transitive reduction.
Test with organelle structure rather than biological process.
Consider the directions of the perturbations? (Future direction)
(Try with the unfiltered datasets of TDI pairs)
Check the overlap of SGA & DEG w/ KEGG pathway.
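
For the transitive reduction item in the list above, a minimal sketch follows. It assumes the edge set forms a directed acyclic graph; since some genes appear both as SGA and as DEG, the real network may contain cycles, in which case this simple approach does not give a well-defined result. The function name is hypothetical.

```python
from collections import defaultdict

def transitive_reduction(edges):
    """Return the transitive reduction of a DAG given as (u, v) pairs.

    An edge u->v is dropped when v can still be reached from u through
    some other path of length two or more.
    """
    edges = set(edges)
    succ = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)

    def reachable_without_direct_edge(u, v):
        # Depth-first search from u's other successors, looking for v.
        stack = [w for w in succ[u] if w != v]
        seen = set(stack)
        while stack:
            node = stack.pop()
            if node == v:
                return True
            for nxt in succ.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    return {(u, v) for u, v in edges if not reachable_without_direct_edge(u, v)}
```

For example, with edges a->b, b->c and a->c, the direct edge a->c is implied by the other two and is removed.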

*TDI_Results_new.csv: 1 + 11,769,129 records (no duplication, containing unit SGA).

Lowercase the records: tr A-Z a-z < TDI_Results_new.csv > out (also no duplication, containing unit SGA).

min posterior = 0.011143. (sort -t$',' -k 4g TDI_Results_new.csv).
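
The same minimum can be read off without sorting the whole file. The sketch below assumes a comma-separated file with one header row and the posterior in the fourth field, as the sort command above suggests. The function name is hypothetical.

```python
import csv

def min_posterior(lines, column=3):
    """Smallest numeric value in the given 0-based column, or None if there is none."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row
    smallest = None
    for row in reader:
        try:
            value = float(row[column])
        except (IndexError, ValueError):
            continue  # skip short rows and non-numeric cells
        if smallest is None or value < smallest:
            smallest = value
    return smallest
```

Usage would be along the lines of `with open("TDI_Results_new.csv") as fh: print(min_posterior(fh))`.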

*TDI_Results_filter_no_unit.csv (1+3,281,866) [remove duplicated entities]-> (1+3,281,018) [overlap with TDI_Results_new.csv]-> ensemble.txt (1+3,281,018)

tr A-Z a-z < TDI_Results_filter_no_unit.csv > out

min posterior = 0.1

SGAs.txt (559 filtered SGAs); DEGs.txt (8,844 filtered DEGs)

Overlap:242; SGAs & hsa05200(cancer) = 25/179; DEGs & hsa05200 = 100/179;

cut -f5 ensemble.txt | tail -n +2 | sort | uniq > SGAs.txt

cut -f6 ensemble.txt | tail -n +2 | sort | uniq > DEGs.txt

X (the sort here will change MARCH9 to 9-mar) cut -f2 -d',' TDI_Results_new_lc.csv | tail -n +2 | sort | uniq > SGAl.txt

SGAl.txt (19,850 unfiltered SGAs); DEGl.txt (19,414 unfiltered DEGs);

Overlap:14,984; SGAs & hsa05200(cancer) = 168/179; DEGs & hsa05200 = 171/179;

cut -f2 -d',' TDI_Results_new_lc.csv | tail -n +2 | sort -g | uniq > SGAl.txt

cut -f3 -d',' TDI_Results_new_lc.csv | tail -n +2 | sort -g| uniq > DEGl.txt

Test with GO database for community detection.
Examine the GO terms:
SGA
cdkn2b go:0000079(B)-
cdkn2b go:0000086(B)
cdkn2b go:0004861(M)
cdkn2b go:0005515(M)
cdkn2b go:0005634(C)
cdkn2b go:0005654(C)
cdkn2b go:0005737(C)
cdkn2b go:0005829(C)
cdkn2b go:0007050(B)-+
cdkn2b go:0007093(B)
cdkn2b go:0008285(B)-
cdkn2b go:0019901(M)
cdkn2b go:0030219(B)
cdkn2b go:0030511(B)-+
cdkn2b go:0031668(M)
cdkn2b go:0031670(B)
cdkn2b go:0042326(B)-
cdkn2b go:0045944(B)
cdkn2b go:0048536(B)-+
cdkn2b go:0050680(B)
cdkn2b go:0071901(B)-
cdkn2b go:2000134(B)

It seems that these terms are not ancestors or descendants of one another under the 'is_a' relationship.

rbm17 go:0000166 rbm17 go:0000380 rbm17 go:0003723 rbm17 go:0005515 rbm17 go:0005681 rbm17 go:0043234

We could use the 'is_a' relationship to cluster the DEGs. Further, we can propagate along SGA -> DEG -> GOTerm -> the GoTerm subsets under go:0008150 (biological_process).
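
One way to sketch the 'is_a' propagation: walk upward from a term through its is_a parents and report which direct children of the root it falls under. The input format (child, parent pairs, like the isA facts listed further down), the function names, and the default root ID spelling are assumptions for illustration.

```python
from collections import defaultdict

def build_parents(is_a_pairs):
    """Map each term to the set of its direct is_a parents."""
    parents = defaultdict(set)
    for child, parent in is_a_pairs:
        parents[child].add(parent)
    return parents

def ancestors(term, parents):
    """All terms reachable from `term` by following is_a edges upward."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def top_level_classes(term, parents, root="go0008150"):
    """Direct children of `root` that `term` is, or descends from."""
    candidates = ancestors(term, parents) | {term}
    return {t for t in candidates if root in parents.get(t, ())}
```

Because GO is a DAG rather than a tree, a term can fall under several top-level classes, which is why sets are returned instead of a single label.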
Examine the class of SGA through GOTerm, or through DEG and then GOTerm, compare the differences.

Try different clustering methods, e.g., spectral clustering, to check the differences between cancers.
Test with BRCA, GBM, OV.

python analysis02.py --inputData /usr1/public/yifeng/Github/outputData --outputData /usr1/public/yifeng/Github/outputData --cancer gbm

BRCA: 480,577 pat,sga,deg records -> aggregate 22,499, 21,141 SGA edges, 529 nodes.

GBM: 151,030 -> 7,150, 3,043 SGA edges, 313 nodes.

OV: 110,125 -> 9,397, 6,251 SGA edges, 418 nodes.

It seems that merely aggregating the total overlap count is not enough to distinguish the clusters...

try different scoring methods, e.g., Jaccard.
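
To make the contrast concrete, here is a sketch that computes both the raw overlap count and the Jaccard score for every pair of SGAs, based on the DEG sets they point to. How analysis03.py actually defines the score is not shown in this README, so treating each SGA as the set of its DEG targets is an assumption, as are the names below.

```python
from collections import defaultdict
from itertools import combinations

def sga_similarity(records):
    """Score SGA pairs by shared DEG targets.

    `records` is an iterable of (patient, sga, deg) tuples.
    Returns {(sga_a, sga_b): (overlap_count, jaccard)} for pairs that
    share at least one DEG; pair keys are in sorted order.
    """
    targets = defaultdict(set)
    for _patient, sga, deg in records:
        targets[sga].add(deg)  # repeated pairs across patients collapse here

    scores = {}
    for a, b in combinations(sorted(targets), 2):
        shared = len(targets[a] & targets[b])
        if shared == 0:
            continue
        union = len(targets[a] | targets[b])
        scores[(a, b)] = (shared, shared / union)
    return scores
```

Jaccard normalizes by the union, so an SGA with very many DEG targets no longer dominates simply because it overlaps with everything.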

python analysis03.py --inputData /usr1/public/yifeng/Github/outputData --outputData /usr1/public/yifeng/Github/outputData --cancer gbm

The numbers of edges and nodes are the same as with the raw method.

It seems to be better.

This can also be seen from the distribution of the (SGA,SGA) weights: some edges have a much larger weight than the rest, instead of the noise seen in the simple-overlap version. The simple-overlap version might improve if we set a threshold. Judging from the figure alone, though, I am not sure after the experiment that Jaccard is actually better; another source of evidence might be needed.

Remember to check duplicated edges among patients.

Our method is able to analyze different types of cancer with different numbers of patients.
Test within one subset:

- molecular function: molecular activities of gene products
- cellular component: where gene products are active
- biological process: pathways and larger processes made up of the activities of multiple gene products

t-SNE. The result is hard to interpret: even after carefully tuning the perplexity, the 2D embedding is hard to cluster into different classes. The example of BRCA is shown below:

There might be some other parameters to tune, e.g., the calculation of weight and distance.

Also we can consider the role within through patient-wise rule.

Note that we may want to filter out the GoTerms that are not related to biological process in the goAnn.cfacts file: already filtered.

It would be pretty interesting to check the results from three parts: simplified (SGA,SGA) weighted.
Check with SGA->GoTerm classification.

isA go0000003 go0008150
isA go0001906 go0008150
isA go0002376 go0008150
isA go0007610 go0008150
isA go0008152 go0008150
isA go0009987 go0008150
isA go0022414 go0008150
isA go0022610 go0008150
isA go0023052 go0008150
isA go0032501 go0008150
isA go0032502 go0008150
isA go0040007 go0008150
isA go0040011 go0008150
isA go0044699 go0008150
isA go0044848 go0008150
isA go0048511 go0008150
isA go0050896 go0008150
isA go0051179 go0008150
isA go0051704 go0008150
isA go0065007 go0008150
isA go0071840 go0008150
isA go0098743 go0008150
isA go0098754 go0008150
isA go0099531 go0008150

Merely classifying the SGAs based on the GoTerm does not seem to be consistent with the graph classification.

Check with SGA->DEG->GoTerm classification. It does not seem to work well...
Throw away the new dataset, use the Github/inputData/ensemble.txt: 1+3,281,018
- New-YX_TDIresult_PanCancerAtlas_CT01_114.csv: (10,199,110+1 records, 6549+1 tumors)
- Old-YX_TDI_SGA_DEG_Tumor.csv: (3,281,018+1 records, 4452+1 tumors)
- Overlap: (396,715+1 records, 4239+1 tumors)
Noisy-or > 0.9
- train edges = 31,304
- test edges = 31,293 (overlap: 29,871)
- base graph = 32,726 (overlap: 31,172)
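
The README does not say how the noisy-or score is computed. A common reading, assumed here, is that the per-patient posteriors p_i of the same SGA->DEG pair are combined as 1 - prod(1 - p_i), and pairs scoring above 0.9 are kept. The function names and the record layout below are hypothetical.

```python
from collections import defaultdict

def noisy_or(probabilities):
    """Combine independent probabilities: 1 - prod(1 - p_i)."""
    remaining = 1.0
    for p in probabilities:
        remaining *= 1.0 - p
    return 1.0 - remaining

def confident_edges(records, cutoff=0.9):
    """Keep SGA->DEG pairs whose noisy-or score exceeds `cutoff`.

    `records` is an iterable of (sga, deg, posterior) tuples, one per patient.
    """
    grouped = defaultdict(list)
    for sga, deg, posterior in records:
        grouped[(sga, deg)].append(posterior)

    kept = {}
    for pair, posteriors in grouped.items():
        score = noisy_or(posteriors)
        if score > cutoff:
            kept[pair] = score
    return kept
```

Under this reading a pair seen in many patients can pass the cutoff even when each individual posterior is modest, which is the usual motivation for noisy-or aggregation.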

EOF.
