A Pattern Induced RDF Statement Extraction System: natural language processing for structured data extraction (Python, Java).

riccardoangius/pairses

PaIRSES

Products of a Bachelor's thesis dissertation, presented on 2013-09-23

Abstract

The large quantity of structured data that DBPedia extracts from the Infoboxes of Wikipedia articles is a good starting point for further data extraction. The project aims to collect and catalogue the natural language patterns with which that data is presented in the running text of Wikipedia articles, and to exploit these patterns to obtain and store an analogous data set (i.e. RDF statements pertaining to the predicates associated with these patterns) from both Wikipedia and external text sources. Mimicking the natural human approaches described by current chunking theories of language acquisition, the experimental algorithms developed for the purpose employ the Stanford Typed Dependencies model and reach a precision of 0.26 and a recall of 0.26. These figures result from tests on 200 sentences sampled from the same corpus of 51,536 Wikipedia articles about human settlements (cities, towns, etc.) that was used for collecting patterns, in conjunction with a training set retrieved article by article from DBPedia. They do not count as retrieved any statement obtained by matching a pattern to the sentence it originated from.
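
The evaluation rule described in the abstract (precision and recall over extracted statements, ignoring any statement produced by matching a pattern against the very sentence it was learned from) can be illustrated with a minimal Python sketch. This is not the thesis code: the `Statement` and `Extraction` types, the `evaluate` function, the sentence identifiers, and the `ex:` toy triples are all hypothetical names and data introduced here for illustration only.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Statement(NamedTuple):
    """An RDF-style triple (hypothetical representation)."""
    subject: str
    predicate: str
    obj: str


class Extraction(NamedTuple):
    """A statement plus where it came from (hypothetical bookkeeping)."""
    statement: Statement
    sentence_id: int        # sentence the pattern was matched against
    pattern_source_id: int  # sentence the pattern was originally learned from


def evaluate(extractions: Iterable[Extraction],
             gold: Set[Statement]) -> Tuple[float, float]:
    """Return (precision, recall), skipping self-matches.

    A self-match is an extraction whose pattern was learned from the
    same sentence it is being applied to; it is not counted as retrieved.
    """
    retrieved = {
        e.statement
        for e in extractions
        if e.sentence_id != e.pattern_source_id
    }
    true_positives = len(retrieved & gold)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Toy data, invented for this example only.
a = Statement("ex:Turin", "ex:country", "ex:Italy")
b = Statement("ex:Turin", "ex:region", "ex:Piedmont")
c = Statement("ex:Lyon", "ex:country", "ex:France")
d = Statement("ex:Lyon", "ex:region", "ex:Rhone")
wrong = Statement("ex:Lyon", "ex:country", "ex:Italy")

gold = {a, b, c, d}
extractions = [
    Extraction(a, sentence_id=1, pattern_source_id=2),      # counted, correct
    Extraction(b, sentence_id=2, pattern_source_id=2),      # self-match, skipped
    Extraction(wrong, sentence_id=3, pattern_source_id=1),  # counted, incorrect
]

precision, recall = evaluate(extractions, gold)
print(precision, recall)
```

In this toy run only two extractions count as retrieved, one of which is in the gold set of four, so the skipped self-match neither raises precision nor recall. How the thesis actually stores patterns and provenance is not stated in this README, so the bookkeeping above is an assumption.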
