kgpm

Mining neighbors, paths, and path patterns from a knowledge graph and a set of seed nodes

A detailed description of the motivation and the algorithms of kgpm is available in the related article.

Citing kgpm

When citing kgpm, please use the following reference:

Pierre Monnin, Emmanuel Bresso, Miguel Couceiro, Malika Smaïl-Tabbone, Amedeo Napoli, and Adrien Coulet. "Tackling scalability issues in mining path patterns from knowledge graphs: a preliminary study". In: 1st international conference "Algebras, graphs and ordered sets" (ALGOS 2020). Ed. by Miguel Couceiro, Pierre Monnin, and Amedeo Napoli. Nancy, France, Aug. 2020. url: https://arxiv.org/pdf/2007.08821.pdf.

@inproceedings{Monnin2020kgpm,
	author = {Monnin, Pierre and Bresso, Emmanuel and Couceiro, Miguel and Sma{\"i}l-Tabbone, Malika and Napoli, Amedeo and Coulet, Adrien},
	title = {{Tackling scalability issues in mining path patterns from knowledge graphs: a preliminary study}},
	editor    = {Miguel Couceiro and Pierre Monnin and Amedeo Napoli},
	booktitle = {{1st international conference ``Algebras, graphs and ordered sets'' (ALGOS 2020)}},
	address = {Nancy, France},
	year = {2020},
	month = Aug,
	url = {https://arxiv.org/pdf/2007.08821.pdf},
}

`query_graph.py`

Python script to query a knowledge graph and perform its canonicalization. The script outputs:

Files representing the canonical knowledge graph (in rdf_to_canonical_index, canonical_to_rdf_index, canonical_graph_adjacency, canonical_graph_inv_adjacency, rdf_nodes_cache_manager.csv, predicates_cache_manager.csv)
Statistics about the knowledge graph before and after canonicalization (in graphs_statistics.md)

Parameters:

--configuration: path of the JSON configuration file
--max-rows: maximum number of rows the SPARQL endpoint can return
--output: base directory for output files
--self-signed-ssl: allow self-signed SSL certificates
--debug: print debug statements
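
The `--max-rows` option implies that a result set larger than the endpoint's row limit has to be retrieved in several requests. The sketch below shows one common way to do this, `LIMIT`/`OFFSET` paging. It is an illustration only and not the code of `query_graph.py`: the names `with_paging`, `paginate`, and `fetch_page` are hypothetical, and the script may page its queries differently.

```python
from typing import Callable, Iterator, List


def with_paging(query: str, limit: int, offset: int) -> str:
    """Append LIMIT/OFFSET clauses to a SPARQL SELECT query (hypothetical helper).

    Note: paging is only stable if the query has a deterministic ORDER BY.
    """
    return f"{query} LIMIT {limit} OFFSET {offset}"


def paginate(fetch_page: Callable[[int, int], List[dict]], max_rows: int) -> Iterator[dict]:
    """Yield every row by requesting pages of at most max_rows rows.

    fetch_page(limit, offset) is an assumed callable that sends one request
    to the endpoint and returns the rows of that page.
    """
    offset = 0
    while True:
        rows = fetch_page(max_rows, offset)
        yield from rows
        # A page shorter than max_rows means the result set is exhausted.
        if len(rows) < max_rows:
            break
        offset += max_rows
```

Here `max_rows` plays the role of the value given to `--max-rows`: a page that comes back full signals that another request is needed.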

`extract_features.py`

Python script to mine neighbors, paths, and path patterns from a canonical knowledge graph and a set of seed nodes.

Parameters:

--configuration: path of the JSON configuration file
--graph: base directory for the input graph files
--dataset-csv: CSV file with the seed nodes URIs (column 0) and class labels (column 1)
--dataset-name: name of the data set
--output: base directory for output files (statistics, SciPy matrix of nodes x features, column names file, and NumPy vector of class labels)
-d: maximum degree to allow expansion (disabled with d = -1)
--lmin: minimum support for features
--lmax: maximum support for features
--kmin: minimum k to test (i.e., number of traversed edges, size of paths and path patterns)
--kmax: maximum k to test
--tmin: minimum t to test (i.e., level of generalization in class hierarchies); t = -1 disables type generalization, t = 0 only allows generalization to owl:Thing
--tmax: maximum t to test
--undirected: whether only out arcs (false) or all arcs (true) are traversed
--meaningful: additional biomedical filtering strategies:
- p: only select features containing a pathway
- g: only select features containing a gene or a GO class
- m: only select features containing a MeSH class
- pg: disjunction of p and g
- pgm: disjunction of p, g, and m
- all: test all previous filters (thus, 5 outputs)
- no_check: disable the additional filtering
--debug: print debug statements
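
To make the roles of `k`, `-d`, and `--undirected` concrete, here is a minimal sketch of path enumeration from one seed node. It is not the implementation of `extract_features.py`: the function name `enumerate_paths`, the dictionary-based adjacency format, the `^-1` suffix for inverse arcs, and the choice to skip already visited nodes are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Arc = Tuple[str, str]  # (predicate, neighbor), an assumed in-memory format


def enumerate_paths(
    adjacency: Dict[str, List[Arc]],
    inv_adjacency: Dict[str, List[Arc]],
    seed: str,
    kmin: int,
    kmax: int,
    d: int = -1,
    undirected: bool = False,
) -> List[tuple]:
    """Return paths of kmin..kmax edges as tuples (seed, p1, n1, p2, n2, ...)."""
    results = []
    frontier = [(seed,)]
    for k in range(1, kmax + 1):
        next_frontier = []
        for path in frontier:
            node = path[-1]
            arcs = list(adjacency.get(node, []))
            if undirected:
                # Also follow incoming arcs, marked here with a "^-1" suffix.
                arcs += [(p + "^-1", n) for p, n in inv_adjacency.get(node, [])]
            if d != -1 and len(arcs) > d:
                continue  # degree above the limit: do not expand this node
            for predicate, neighbor in arcs:
                if neighbor in path[::2]:
                    continue  # assumption: a path never revisits a node
                next_frontier.append(path + (predicate, neighbor))
        if k >= kmin:
            results.extend(next_frontier)
        frontier = next_frontier
    return results


# Toy graph: a -r-> b -s-> c
adjacency = {"a": [("r", "b")], "b": [("s", "c")]}
inv_adjacency = {"b": [("r", "a")], "c": [("s", "b")]}
print(enumerate_paths(adjacency, inv_adjacency, "a", kmin=1, kmax=2))
```

With `undirected=True`, the same call started from `c` would also follow the inverse arcs back to `b` and `a`, which is why the number of candidate features can grow quickly with `k`.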

`subgraph_statistics.py`

Python script to compute statistics about the subgraph reachable from a set of seed nodes in a canonical knowledge graph. It outputs a Markdown file containing the number of neighbors and types reachable from the seed nodes.

Parameters:

--configuration: path of the JSON configuration file
--graph: base directory for the input graph files
--dataset-csv: CSV file with the seed nodes URIs (column 0) and class labels (column 1)
--dataset-name: name of the data set
--output: base directory for output files (Markdown files)
-d: maximum degree to allow expansion (disabled with d = -1)
--undirected: whether only out arcs (false) or all arcs (true) are traversed
--detailed: enable detailed statistics, i.e., the number of neighbors and types reachable w.r.t. k and t until the full neighborhood is reached. By default, only the maximum numbers of reachable neighbors and types in the full neighborhood are output (k and t are not given).
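
The difference between the default and the `--detailed` output can be illustrated with a breadth-first traversal that records how many distinct neighbors are reachable within each `k`. This is a sketch under assumptions, not the code of `subgraph_statistics.py`: the function name `reachable_per_k` and the adjacency format are hypothetical, seed nodes are not counted as neighbors, only outgoing arcs are followed, and types and the `-d` limit are left out.

```python
from typing import Dict, Iterable, List, Tuple


def reachable_per_k(adjacency: Dict[str, List[Tuple[str, str]]], seeds: Iterable[str]) -> List[int]:
    """Return a list whose item k-1 is the number of distinct non-seed nodes
    reachable within k edges; the list stops once no new node is reached."""
    seed_set = set(seeds)
    visited = set(seed_set)
    frontier = set(seed_set)
    counts = []
    while frontier:
        next_frontier = set()
        for node in frontier:
            for _predicate, neighbor in adjacency.get(node, []):
                if neighbor not in visited:
                    visited.add(neighbor)
                    next_frontier.add(neighbor)
        if next_frontier:
            counts.append(len(visited) - len(seed_set))
        frontier = next_frontier
    return counts


# Toy graph with a cycle: a -> b -> c -> a
adjacency = {"a": [("r", "b")], "b": [("r", "c")], "c": [("r", "a")]}
counts = reachable_per_k(adjacency, ["a"])
print(counts)      # per-k counts, comparable to a detailed report
print(counts[-1])  # size of the full neighborhood, comparable to the default report
```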

Configuration

An example of a JSON configuration file is given in configuration.json.example. Keys are:

server-address: address of the SPARQL endpoint to query
url-json-conf-attribute: URL attribute to use to get JSON results
url-json-conf-value: value of the url-json-conf-attribute to get JSON results
url-default-graph-attribute: URL attribute to use to define the default graph
url-default-graph-value: value of url-default-graph-attribute to define the default graph
url-query-attribute: URL attribute to use to define the query
timeout: timeout value for HTTP requests
username: username to use if HTTP authentication is required (empty otherwise)
password: password to use if HTTP authentication is required (empty otherwise)
path_predicates_blacklist: blacklist of URIs or prefixes of predicates not to traverse
types_blacklist: blacklist of URIs or prefixes of types not to use in path generalization
types_expansion_blacklist: blacklist of URIs or prefixes of types whose instances cannot be traversed
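
Since the three blacklists accept "URIs or prefixes", an entry can be read as matching any URI that starts with it. The sketch below illustrates that reading. The helper `is_blacklisted` is hypothetical, the blacklist values are placeholders, and representing each blacklist as a JSON array of strings is an assumption; configuration.json.example remains the reference for the actual format.

```python
import json
from typing import List

# Placeholder fragment: only the blacklist keys are shown, with made-up values.
configuration = json.loads("""
{
    "path_predicates_blacklist": ["http://www.w3.org/2002/07/owl#"],
    "types_blacklist": ["http://example.org/ontology/Deprecated"],
    "types_expansion_blacklist": []
}
""")


def is_blacklisted(uri: str, blacklist: List[str]) -> bool:
    """True if uri equals a blacklist entry or starts with one (prefix match)."""
    return any(uri.startswith(entry) for entry in blacklist)


predicate = "http://www.w3.org/2002/07/owl#sameAs"
print(is_blacklisted(predicate, configuration["path_predicates_blacklist"]))
```

Under this reading, a single namespace entry excludes every predicate or type of that namespace, while a full URI excludes only that resource (and any URI that extends it).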

Dependencies

tqdm
numpy
bitarray
scipy
