jauco/virtuoso-quad-log

Generate a log of all added/deleted quads in the Virtuoso triple store


Background

The availability of massive quantities of digital sources (textual, audio-visual and structured data) for research is revolutionizing the humanities. Top-quality humanities scholarship of today and tomorrow is therefore only possible with the use of sophisticated ICT tools. CLARIAH aims to offer humanities scholars a ‘Common Lab’ that provides them access to large collections of digital resources and innovative user-friendly processing tools, thus enabling them to carry out ground-breaking research to discover the nature of human culture.

-- http://www.clariah.nl/en/voorstel/proposal-summary

In practice, this comes down to a lot of tooling and infrastructure for creating, sharing and discovering resources and research results.

To bring all this data together, we group the tooling into work packages (WP3, 4 and 5) and have a separate work package that harvests their data and links it together, as explained at https://github.com/CLARIAH/wp2-interaction

Approach

Each work package will be able to present its information encoded in the RDF data model (exact encodings have not yet been specified). A crawler will read this information and store it in its own database. The individual work packages are the owners of the data, not the harvester. This means that the harvester's database can be deleted and regenerated at will by re-harvesting from the providing data sources.

This repository details the harvesting protocol and its place among the other protocols. It also contains a reference implementation for the open-source Virtuoso RDF server.

Available protocols

There are roughly two parts to the harvesting protocol: the interaction between the harvester and the provider, and the data format in which the data is encoded.

For the interaction we have evaluated the following approaches:

  • OAI-PMH: focused on metadata, not the data itself.
  • Atom: features that we need, such as marking an item as retracted, are only available as extensions, and finding tooling that supports the proper extensions is therefore hard ("supports Atom" is not a precise enough requirement).
  • Sitemaps: doesn't allow for retractions and requires full re-indexing on every crawl.
  • OAI-ResourceSync: seems to address our use case exactly, according to the motivating examples. The spec is a bit large for our use case, but our servers only need to implement a subset, and a client that fully implements the spec is already available (see the sketch after this list).
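
For illustration, a minimal ResourceSync ChangeList could look like the sketch below (the hostname, file names and timestamps are made up). Each entry carries a change attribute (created, updated or deleted), so retractions can be propagated and a client only has to fetch what changed since its last visit.

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist" from="2016-01-01T00:00:00Z"/>
  <url>
    <loc>http://example.com/datadir/rdfpatch-0001</loc>
    <lastmod>2016-01-02T13:00:00Z</lastmod>
    <rs:md change="created"/>
  </url>
  <url>
    <loc>http://example.com/datadir/rdfpatch-0002</loc>
    <lastmod>2016-01-03T09:30:00Z</lastmod>
    <rs:md change="created"/>
  </url>
</urlset>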

If you know of other sync frameworks that fit the bill better: Let us know!

For the data encoding we settled on RDF as the information model, using RDF quads for named graphs as detailed in the section Do Graphs need Naming? of RDF Triples in XML. That still leaves a large number of media types for encoding the data. We have a few requirements for the media types:

  1. Handling of blank nodes
  2. Allowing both assertions and retractions to be modelled
  3. Allowing named graphs to be modelled
  4. You should be able to evaluate an assertion/retraction with minimal knowledge of the statements around it (because the files get big and will only grow), and preferably without having to query the current data store

A few notable requirements that we don't have are:

  1. The document does not need to allow for a round trip (importing the generated document into the RDF store need not be idempotent)
  2. The document does not need to live in the global context, but rather defines its own:
    • if the node _:b1 is mentioned during the first crawl, a subsequent reference to it in a later crawl still refers to the same node
    • if two different logs (from different repositories) refer to _:b1 they refer to two different nodes

A few media types and which of these requirements they support are listed below (an x marks support):

media type   Blank nodes  assertions and retractions  named graph support  state dependency
JSON-LD      x                                        x
TRIG         x                                        x
N3           x                                        x
RDF Patch    *            x                           x                    on the log, for tracking blank nodes
TurtlePatch               x                           x
Sparql       x            x                           x                    on the data store; allows arbitrary processing
SparqlPatch  x            x                           x                    on the data store; less arbitrary processing, but still a high runtime complexity of node matching
LD Patch     x            x                           x                    on the data store; path following instead of node matching

*) RDF Patch only supports "store scoped" blank nodes, meaning that a specially encoded blank node in the document will always refer to the same node in the graph, but between documents these identifiers will refer to different nodes.

We're therefore leaning towards RDF Patch, though that specification has been stale since the LDP WG went with the LD Patch approach.
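
As a concrete sketch of what a patch file in our log could look like (RDF Patch syntax as in the 2013 draft; all URIs and the graph name are invented for illustration): A asserts a quad, D retracts one, and each line can be evaluated on its own, without querying the data store.

# assert a quad in the named graph <http://example.org/graph/persons>
A <http://example.org/person/1> <http://xmlns.com/foaf/0.1/name> "Alice" <http://example.org/graph/persons> .
# retract a previously asserted quad
D <http://example.org/person/1> <http://xmlns.com/foaf/0.1/age> "41" <http://example.org/graph/persons> .
# store-scoped blank node: _:b1 denotes the same store node in every patch from this log
A <http://example.org/person/1> <http://xmlns.com/foaf/0.1/knows> _:b1 <http://example.org/graph/persons> .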

Quickstart

To launch a self-contained sandbox that you can play around in, run the playground.sh script in this repo and open the link that is printed in the console for further instructions.

./playground.sh

To connect the logger to a production virtuoso server, you can pass it the connection details as environment variables using the -e flag.

docker run -i -t --rm -v $PWD/data:/datadir \
	-e="VIRTUOSO_ISQL_ADDRESS=127.0.01" \
	-e="VIRTUOSO_ISQL_PORT=1111" \
	-e="VIRTUOSO_USER=dba" \
	-e="VIRTUOSO_PASSWORD=dba" \
	jauco/virtuoso-quad-log

The first time it runs, the container will ask to install a stored procedure onto the Virtuoso server.

The quad-log will generate a set of files in the volume that you map to /datadir, which you can host using any static file server.
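
For example (one possible setup; any static file server will do, this one assumes Python 3 is installed on the host):

# serve the generated rdf-patch files over plain HTTP on port 8080
cd data && python3 -m http.server 8080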

To advertise the logs, you should provide either a robots.txt or a Source Description at the location that you submit to Work Package 2. See http://www.openarchives.org/rs/1.0/resourcesync for more information, or contact us! (The playground advertises the logs using the hidden folder .well-known.)
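
As a sketch (hostname and paths are placeholders), a minimal Source Description served at /.well-known/resourcesync would point to a capability list:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="description"/>
  <url>
    <loc>http://example.com/capabilitylist.xml</loc>
    <rs:md capability="capabilitylist"/>
  </url>
</urlset>

The robots.txt alternative uses the standard Sitemap: directive to point at the resource list instead.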

Things to test and do (aka issues/tickets)

  • offsets with multiple trx files (is the offset global or file-specific? how should the offset be handled after a checkpoint has run? what if multiple checkpoints have run between grabs?)

  • non-default literals (values typed as a date, for example)

  • literals vs hyperlinks

  • blank nodes

  • handle the fact that the last trx file might still be changing (e.g. by skipping the current transaction log)

  • check if CheckpointAuditTrail is enabled when running this logger (cfg_item_value)

  • multiple trx files (wrapper script)

  • add .well-known

  • add a static file server to the readme

  • remove checkpoint statement before committing and deploying

  • virtuoso server with existing data

  • try multiple insertion strategies and see if we can trigger all cases in the log (LOG_INSERT, LOG_INSERT_SOFT etc.)

  • escaping literals (at least newlines and quotes, check the nquads spec)

  • make it stateful so we don't re-parse the same files over and over again

  • being able to go over 50k rdf-patch files using resource-list indexes (a single resource list is capped at 50,000 entries)

  • make the rs update process atomic
