TimKettenacker/datainference-v2

Version 2.0 contains a complete use case in combination with Neo4j.

One particularly underrated area of data management is the representation and use of context. Classical data modelling falls short because it focuses on the technical necessities of feeding application engines rather than on interoperability and inference. Graphs and ontologies promise to structure knowledge. Yet relying solely on the creation and maintenance of ontologies is likely to fall on deaf ears in a corporate environment; I have seen this in my own project work as a consultant. My recommendation is instead to go for a combined effort: the abundance of textual content generated in companies, and the need to extract information from it, provides a perfect, management-approved playground for coupling Natural Language Processing with ontology engineering. The result is a true, contextual representation of the entities discovered in raw text, created by linking them to an ontology.

Setting up an ontology from scratch is quite cumbersome, so in this project I turned to an existing one and decided to leverage Wikidata (http://www.wikidata.org/wiki/Wikidata:Main_Page). Wikidata is a project sponsored by the Wikimedia Foundation with the goal of structuring the content of Wikipedia so that it is accessible and editable by both humans and machines. Hence, it can support a broad range of use cases. I decided to go with a classical one: staffing consultants for an upcoming project. People on the business side frequently struggle to concretize a vague notion or, the other way around, to generalize from a detailed term to the overall picture. For example, an HR person tasked with finding someone with the skill "MongoDB" will most likely overlook a person with "Neo4j" skills, although the skillsets are probably similar. The connecting information is either missing or too difficult to collect in a timely manner.
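To illustrate how Wikidata can supply that missing connecting information, the sketch below resolves two skill labels to Wikidata items and asks for the classes they share via instance-of/subclass-of paths. It uses the public Wikidata search API and SPARQL endpoint; the helper functions and the choice of example labels are my own assumptions for illustration, not code from this repository.

```python
# Minimal sketch: resolve two skill labels to Wikidata items and look for
# shared superclasses, illustrating how the ontology connects related skills.
# Endpoints are the public Wikidata APIs; helper names are illustrative.
import requests

WD_API = "https://www.wikidata.org/w/api.php"
WD_SPARQL = "https://query.wikidata.org/sparql"

def find_item(label):
    """Return the Q-id of the best Wikidata match for a label, e.g. 'Neo4j'."""
    params = {"action": "wbsearchentities", "search": label,
              "language": "en", "format": "json"}
    hits = requests.get(WD_API, params=params).json()["search"]
    return hits[0]["id"] if hits else None

def shared_classes(qid_a, qid_b):
    """Find classes both items reach via instance-of/subclass-of (P31/P279*)."""
    query = f"""
    SELECT DISTINCT ?class ?classLabel WHERE {{
      wd:{qid_a} wdt:P31/wdt:P279* ?class .
      wd:{qid_b} wdt:P31/wdt:P279* ?class .
      SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
    }}"""
    resp = requests.get(WD_SPARQL, params={"query": query, "format": "json"})
    return [b["classLabel"]["value"] for b in resp.json()["results"]["bindings"]]

if __name__ == "__main__":
    mongo, neo = find_item("MongoDB"), find_item("Neo4j")
    print(shared_classes(mongo, neo))  # both typically roll up to database-related classes
```

Both skills typically roll up to a common database-related class, which is exactly the kind of connection the HR scenario above is missing.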

The code first reads in one-pagers from PowerPoint slides. It then creates a topic model from the input of each slide, where each slide represents a consultant profile. The topics from the profiles are then enriched with structures retrieved from Wikidata, so that profile topics are matched with fitting item descriptions and brought into the structure proposed by the Wikidata ontology. The representation is stored in Neo4j.
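A minimal sketch of the first two steps could look as follows, assuming the profiles sit in a local folder of .pptx files with one slide per consultant; the folder name, package choices (python-pptx, scikit-learn) and parameter values are illustrative assumptions rather than the repository's actual code.

```python
# Sketch: extract text from one-pager slides and fit a simple topic model.
from pathlib import Path
from pptx import Presentation                      # pip install python-pptx
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def slide_texts(pptx_path):
    """Concatenate all text frames of each slide into one document."""
    prs = Presentation(str(pptx_path))
    for slide in prs.slides:
        chunks = [shape.text for shape in slide.shapes if shape.has_text_frame]
        yield " ".join(chunks)

# one document per consultant one-pager (assumed folder "profiles")
docs = [text for f in Path("profiles").glob("*.pptx") for text in slide_texts(f)]

# simple bag-of-words LDA topic model over the profile texts
vectorizer = CountVectorizer(stop_words="english", max_df=0.9)
dtm = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=5, random_state=0).fit(dtm)

# top terms per topic become candidate labels to look up in Wikidata
terms = vectorizer.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(f"topic {topic_idx}: {top}")
```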

Information objects are then compared with each other in the graph in order to find the best match for an incoming request. A dataset from Kaggle containing job descriptions is used to simulate the request (https://www.kaggle.com/PromptCloudHQ/us-technology-jobs-on-dicecom).
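Assuming the profiles and the simulated requests have been loaded into Neo4j as nodes connected through shared Wikidata topics, the comparison could be expressed as a Cypher query issued through the official Python driver. The node labels, relationship types and credentials below are placeholders, not the repository's actual schema.

```python
# Hypothetical matching sketch over an assumed schema:
# (:Consultant)-[:HAS_TOPIC]->(:Topic)<-[:REQUIRES]-(:Request)
from neo4j import GraphDatabase                    # pip install neo4j

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

MATCH_QUERY = """
MATCH (r:Request {id: $request_id})-[:REQUIRES]->(t:Topic)<-[:HAS_TOPIC]-(c:Consultant)
RETURN c.name AS consultant, count(t) AS shared_topics
ORDER BY shared_topics DESC
LIMIT 5
"""

def best_matches(request_id):
    """Rank consultants by the number of Wikidata topics shared with a request."""
    with driver.session() as session:
        result = session.run(MATCH_QUERY, request_id=request_id)
        return [(rec["consultant"], rec["shared_topics"]) for rec in result]

print(best_matches("some-request-id"))   # e.g. a list of (consultant, score) pairs
driver.close()
```

Ranking by the raw count of shared topics is only the simplest possible score; topic-model weights or the depth of the shared Wikidata class could refine it.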

A range of machine learning algorithms can be evaluated to perform and tune that matching. Visualization could be done with the GRANDstack (GraphQL, React, Apollo, Neo4j).
