- Problem
  - Sharing of entities across different KBs
- After integration
  - Elimination of redundant information
  - Integration into a larger knowledge graph
  - Completion of one KB with another KB
- Condition: two knowledge graphs -> sub-graphs
- Goal: integrating the knowledge graphs -> complete graph
- Entity set:
  $$\mathcal{E}=\mathcal{E}_1\cup\mathcal{E}_2\cup\mathcal{E}_s$$
- Sub-graph 1:
  $$\mathcal{G}_1 = \{(e_i^{h1},r_i^1,e_i^{t1}) \mid e_i^{h1},e_i^{t1}\in\mathcal{E}_1\cup\mathcal{E}_s,\ r_i^1\in\mathcal{R}_1\}$$
- Sub-graph 2:
  $$\mathcal{G}_2 = \{(e_i^{h2},r_i^2,e_i^{t2}) \mid e_i^{h2},e_i^{t2}\in\mathcal{E}_2\cup\mathcal{E}_s,\ r_i^2\in\mathcal{R}_2\}$$
- Complete graph:
  $$\mathcal{G} = \{(e_i^h,r_i,e_i^t) \mid e_i^h,e_i^t\in\mathcal{E},\ r_i\in\mathcal{R}\}$$
- Train/test division:
  $$\mathcal{E}_1=\mathcal{E}_1^{train}\cup\mathcal{E}_1^{test}$$
  $$\mathcal{E}_2=\mathcal{E}_2^{train}\cup\mathcal{E}_2^{test}$$
  $$\mathcal{E}_s=\mathcal{E}_s^{train}\cup\mathcal{E}_s^{test}$$
- Problem setting
  - $\mathcal{E}_1^{train}, \mathcal{E}_2^{train}, \mathcal{E}_s^{train}$ and the unions $\mathcal{E}_1^{test}\cup\mathcal{E}_s^{test}, \mathcal{E}_2^{test}\cup\mathcal{E}_s^{test}$ are known
  - $\mathcal{G}_1, \mathcal{G}_2$ are known
  - $\mathcal{E}_1^{test}, \mathcal{E}_2^{test}, \mathcal{E}_s^{test}$ are unknown individually
- Task: identify the elements of $\mathcal{E}_1^{test}, \mathcal{E}_2^{test}, \mathcal{E}_s^{test}$, i.e. decide for each test entity whether it is shared or private to its sub-graph
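The setting above can be made concrete with a toy example. The sketch below is illustrative only: the entity ids, the 0.6 train fraction, and the helper `split_entities` are assumptions introduced here, not part of the original setup.

```python
import random


def split_entities(entities, train_fraction, rng):
    """Shuffle a set of entities and split it into (train, test) sets."""
    items = sorted(entities)  # sort first so the shuffle is reproducible
    rng.shuffle(items)
    cut = round(len(items) * train_fraction)
    return set(items[:cut]), set(items[cut:])


rng = random.Random(0)

# Toy, pairwise-disjoint entity sets (hypothetical ids).
E1 = {f"a{i}" for i in range(10)}  # private to sub-graph 1
E2 = {f"b{i}" for i in range(10)}  # private to sub-graph 2
Es = {f"s{i}" for i in range(10)}  # shared by both sub-graphs

E1_train, E1_test = split_entities(E1, 0.6, rng)
E2_train, E2_test = split_entities(E2, 0.6, rng)
Es_train, Es_test = split_entities(Es, 0.6, rng)

# Known at test time: only the unions, not how they split.
known_test_g1 = E1_test | Es_test
known_test_g2 = E2_test | Es_test

# Hidden ground truth the task has to recover:
# True means "shared", False means "private to sub-graph 1".
labels_g1 = {e: (e in Es_test) for e in known_test_g1}
```

Under this reading, the task is a binary decision per test entity, with the train sets serving as labelled examples.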
- Sampling two sub-graphs from FB15K
- Sampling hyper-parameters:
  - Sampling method: see next section
  - Overlap rate:
    $$\frac{|\mathcal{E}_s|}{|\mathcal{E}_s|+|\mathcal{E}_1|}$$
  - Train rate:
    $$\frac{|\mathcal{E}_s^{train}|+|\mathcal{E}_1^{train}|}{|\mathcal{E}_s|+|\mathcal{E}_1|}$$
- All experiments introduced below are based on this dataset
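As a quick check of the two rates defined above, here is a minimal sketch that evaluates them on made-up set sizes. The function names and the numbers are assumptions chosen for illustration; they are not the values used in the experiments.

```python
def overlap_rate(n_shared, n_private_1):
    """|E_s| / (|E_s| + |E_1|): share of sub-graph 1's entities that are shared."""
    return n_shared / (n_shared + n_private_1)


def train_rate(n_shared_train, n_private_1_train, n_shared, n_private_1):
    """(|E_s^train| + |E_1^train|) / (|E_s| + |E_1|)."""
    return (n_shared_train + n_private_1_train) / (n_shared + n_private_1)


# Hypothetical sizes: 300 shared and 700 private entities in sub-graph 1,
# of which 180 shared and 420 private entities are used for training.
example_overlap = overlap_rate(300, 700)
example_train = train_rate(180, 420, 300, 700)
```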
- Three basic parts
  - FB15K (13583 entities, 592213 triples)
  - DBpedia (crawled online)
  - DBs-FBs -> download available
- Crawled DBpedia
  - only entities in FB15K
    - without 'wikiPage' links: (12730 entities, 112391 triples) -> used
    - with 'wikiPage' links: (13934 entities, 685392 triples)
  - one more layer connected to entities in FB15K
    - with 'wikiPage' links: (6411218 entities, 38597471 triples)
    - without 'wikiPage' links: (4349425 entities, 12787660 triples)
- After mapping
  - 15580 DB items
  - 13499 FB items (13583 in raw FB15K)
- Examples:
- Properties:
  - One-to-many mapping relation between the two datasets
  - Overlap rate
- Overlap
  - DB, FB -> converted to undirected graphs
  - Adjacency matrices:
    - FB matrix: $\mathcal{M}_f$
    - DB matrix: $\mathcal{M}_d$
    - Overlap matrix: $\mathcal{M}_o = \mathcal{M}_f \mathbin{\&} \mathcal{M}_d$ (element-wise AND)
    - All matrix: $\mathcal{M}_a = \mathcal{M}_f + \mathcal{M}_d$
- Result:
  - overlap / DB = sum($\mathcal{M}_o$) / sum($\mathcal{M}_d$) = 0.614
  - overlap / FB = sum($\mathcal{M}_o$) / sum($\mathcal{M}_f$) = 0.063
  - overlap / all = sum($\mathcal{M}_o$) / sum($\mathcal{M}_a$) = 0.114
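The overlap computation can be sketched on two tiny graphs. This is a minimal illustration, assuming both graphs already share one node indexing after the DB-FB mapping; the edge lists and helper names are made up here, and the code does not reproduce the reported figures.

```python
def undirected_adjacency(edges, n):
    """Build a symmetric 0/1 adjacency matrix from (head, tail) index pairs."""
    m = [[0] * n for _ in range(n)]
    for h, t in edges:
        m[h][t] = 1
        m[t][h] = 1
    return m


def total(m):
    """Sum of all matrix entries."""
    return sum(sum(row) for row in m)


# Hypothetical toy edges over 5 aligned nodes.
fb_edges = [(0, 1), (1, 2), (2, 3), (0, 3), (1, 3)]
db_edges = [(1, 0), (2, 3), (3, 4)]
n = 5

M_f = undirected_adjacency(fb_edges, n)
M_d = undirected_adjacency(db_edges, n)

# Overlap: element-wise AND. All: element-wise sum.
M_o = [[f & d for f, d in zip(rf, rd)] for rf, rd in zip(M_f, M_d)]
M_a = [[f + d for f, d in zip(rf, rd)] for rf, rd in zip(M_f, M_d)]

overlap_db = total(M_o) / total(M_d)
overlap_fb = total(M_o) / total(M_f)
overlap_all = total(M_o) / total(M_a)
```

Converting to undirected graphs first means an edge counts as shared whenever both KBs link the same pair of entities, regardless of relation type or direction.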
- Graph Sampling
- Knowledge Base Integration
  - Make dummy data
  - Sample three parts from the set of all nodes
  - Sample $\mathcal{E}_1, \mathcal{E}_2, \mathcal{E}_s$ from $\mathcal{E}$ (Venn diagram)
  - Analyze the effect of the dataset on the task
  - Uniform random sampling
  - Neighborhood sampling
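The two sampling strategies named above might look as follows. This is a sketch under stated assumptions: the slides do not define neighborhood sampling, so it is taken here to mean breadth-first expansion from a random seed node, and the function names and the ring graph are hypothetical.

```python
import random
from collections import deque


def uniform_sample(nodes, k, rng):
    """Pick k nodes uniformly at random, ignoring graph structure."""
    return set(rng.sample(sorted(nodes), k))


def neighborhood_sample(adj, k, rng):
    """Collect k nodes by BFS from a random seed (assumed definition).

    If a connected component is exhausted before k nodes are found,
    the search restarts from another random seed. Assumes k <= len(adj).
    """
    seeds = sorted(adj)
    rng.shuffle(seeds)
    sampled = set()
    for seed in seeds:
        if len(sampled) >= k:
            break
        if seed in sampled:
            continue
        sampled.add(seed)
        queue = deque([seed])
        while queue and len(sampled) < k:
            node = queue.popleft()
            for neighbor in sorted(adj[node]):
                if neighbor not in sampled:
                    sampled.add(neighbor)
                    queue.append(neighbor)
                    if len(sampled) >= k:
                        break
    return sampled


# Toy undirected ring of 8 nodes, stored as node -> set of neighbors.
adj = {i: {(i - 1) % 8, (i + 1) % 8} for i in range(8)}
rng = random.Random(0)
uni = uniform_sample(adj, 4, rng)
nbh = neighborhood_sample(adj, 4, rng)
```

With this definition, a neighborhood sample keeps locally connected structure, whereas a uniform sample may scatter nodes across the graph.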
- Graphs generated by different sampling methods have different properties
  - Zipf's-law exponent characterizing the distribution
  - Density
- Zipf exponent
  - Raw graph: 0.47
  - Uniform sampling sub-graph: 0.50
  - Neighborhood sub-graph: 0.59
- Density: $d = \frac{m}{n(n-1)}$
  - where $n$ is the number of nodes and $m$ is the number of edges in $G$
  - Result:
    - Raw graph: 0.00221
    - Uniform sampling sub-graph: 0.00229
    - Neighborhood sub-graph: 0.0016
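Both statistics are easy to compute. The sketch below implements the density formula as given and, as an assumption, estimates the Zipf exponent as the negated least-squares slope of log(value) against log(rank); the slides do not say which distribution or fitting method was used, so `zipf_exponent` is only one plausible choice.

```python
import math


def density(n_nodes, n_edges):
    """d = m / (n (n - 1)), the fraction of possible directed edges present."""
    return n_edges / (n_nodes * (n_nodes - 1))


def zipf_exponent(values):
    """Negated slope of a least-squares line through (log rank, log value).

    Assumed estimator; expects at least two positive values,
    for example the node degrees of a graph.
    """
    ranked = sorted((v for v in values if v > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(v) for v in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return -covariance / variance


# Synthetic check: values that follow value = 100 / rank^0.5 exactly.
synthetic = [100 / rank ** 0.5 for rank in range(1, 51)]
fitted = zipf_exponent(synthetic)
toy_density = density(4, 6)  # 4 nodes, 6 of the 12 possible directed edges
```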
- Pipeline model: TransE + classification neural network
  - Idea: mapping between two embedding spaces
  - Weakness: incremental pipeline model -> increasing error
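For reference, the first stage of the pipeline relies on the TransE scoring idea: a triple (h, r, t) is plausible when the embedding of h plus the embedding of r lies close to the embedding of t. The sketch below shows only that score with hand-picked 2-d vectors; it is not the training code behind these slides, and the vectors are made up.

```python
import math


def transe_score(head, relation, tail):
    """L2 distance ||h + r - t||; a lower score means a more plausible triple."""
    return math.sqrt(
        sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail))
    )


# Hypothetical 2-d embeddings.
h = [0.1, 0.2]
r = [0.3, 0.1]
t_true = [0.4, 0.3]   # equals h + r up to rounding, so the score is near 0
t_corrupt = [1.0, 1.0]

score_true = transe_score(h, r, t_true)
score_corrupt = transe_score(h, r, t_corrupt)
```

Because the classifier in the second stage consumes these embeddings, any error in them carries over, which is the weakness noted above.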
- Classification neural network
  - Non-concatenated
  - Concatenated
  - Half-concatenated
- Non-concatenated NN