jauco/virtuoso-quad-log

Generate a log of all added/deleted quads in the Virtuoso triple store


Background

The availability of massive quantities of digital sources (textual, audio-visual and structured data) for research is revolutionizing the humanities. Top-quality humanities scholarship of today and tomorrow is therefore only possible with the use of sophisticated ICT tools. CLARIAH aims to offer humanities scholars a ‘Common Lab’ that provides them access to large collections of digital resources and innovative user-friendly processing tools, thus enabling them to carry out ground-breaking research to discover the nature of human culture.

-- http://www.clariah.nl/en/voorstel/proposal-summary

In practice, this comes down to a lot of tooling and infrastructure for creating, sharing and discovering resources and research results.

To bring all this data together, we group the tooling into work packages (WP3, 4 and 5) and have a separate work package that harvests their data and links it together, as explained at https://github.com/CLARIAH/wp2-interaction

Approach

Each work package will be able to present its information encoded in the RDF data model (exact encodings have not yet been specified). A crawler will read this information and store it in its own database. The individual work packages are the owners of the data, not the harvester. This means that the harvester's database can be deleted and regenerated at will by re-harvesting from the providing data sources.

This repository details the harvesting protocol and its place among the other protocols. It also contains a reference implementation for the open-source Virtuoso RDF server.

Available protocols

There are roughly two parts to the harvesting protocol: the interaction between the harvester and the provider, and the data format in which the data is encoded.

For the interaction we have evaluated the following approaches:

  • OAI-PMH: focused on metadata, not the data itself.
  • Atom: features that we need, such as marking an item as retracted, are only available as extensions, and finding tooling that supports the proper extensions is therefore hard ("supports Atom" is not a precise enough requirement).
  • Sitemaps: doesn't allow for retractions and requires full re-indexing on every crawl.
  • OAI-ResourceSync: seems to address our use case exactly, according to the motivating examples. The spec is a bit large for our use case, but our servers only need to implement a subset, and a client that fully implements the spec is already available (see the sketch after this list).
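
For illustration, a minimal ResourceSync ChangeList could look like the sketch below (the hostname, file names and timestamps are made up). Each entry carries a change attribute (created, updated or deleted), so retractions can be propagated and a client only has to fetch what changed since its last visit.

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist" from="2016-01-01T00:00:00Z"/>
  <url>
    <loc>http://example.com/datadir/rdfpatch-0001</loc>
    <lastmod>2016-01-02T13:00:00Z</lastmod>
    <rs:md change="created"/>
  </url>
  <url>
    <loc>http://example.com/datadir/rdfpatch-0002</loc>
    <lastmod>2016-01-03T09:30:00Z</lastmod>
    <rs:md change="created"/>
  </url>
</urlset>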

If you know of other sync frameworks that fit the bill better: Let us know!

For the data encoding we settled on RDF as the information model, using RDF quads for named graphs as detailed in the section Do Graphs need Naming? of RDF Triples in XML. That still leaves a large number of media types for encoding the data. We have a few requirements for the media types:

  1. Handling of blank nodes
  2. Allowing both assertions and retractions to be modelled
  3. Allowing named graphs to be modelled
  4. You should be able to evaluate an assertion/retraction with minimal knowledge of the statements around it (because the files get big and will only grow), and preferably without having to query the current data store

A few notable requirements that we don't have are:

  1. The document does not need to allow for a round trip (importing the generated document into the RDF store need not be idempotent)
  2. The document does not need to live in the global context, but rather defines its own:
    • if the node _:b1 is mentioned during the first crawl, a subsequent reference to it in a later crawl still refers to the same node
    • if two different logs (from different repositories) refer to _:b1 they refer to two different nodes

A few media types and which of these requirements they support are listed below (an x marks support):

media type   Blank nodes  assertions and retractions  named graph support  state dependency
JSON-LD      x                                        x
TRIG         x                                        x
N3           x                                        x
RDF Patch    *            x                           x                    on the log, for tracking blank nodes
TurtlePatch               x                           x
Sparql       x            x                           x                    on the data store; allows arbitrary processing
SparqlPatch  x            x                           x                    on the data store; less arbitrary processing, but still a high runtime complexity of node matching
LD Patch     x            x                           x                    on the data store; path following instead of node matching

*) RDF Patch only supports "store scoped" blank nodes, meaning that a specially encoded blank node in the document will always refer to the same node in the graph, but between documents these identifiers will refer to different nodes.

We're therefore leaning towards RDF Patch, though that specification has been stale since the LDP WG went with the LD Patch approach.
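
As a concrete sketch of what a patch file in our log could look like (RDF Patch syntax as in the 2013 draft; all URIs and the graph name are invented for illustration): A asserts a quad, D retracts one, and each line can be evaluated on its own, without querying the data store.

# assert a quad in the named graph <http://example.org/graph/persons>
A <http://example.org/person/1> <http://xmlns.com/foaf/0.1/name> "Alice" <http://example.org/graph/persons> .
# retract a previously asserted quad
D <http://example.org/person/1> <http://xmlns.com/foaf/0.1/age> "41" <http://example.org/graph/persons> .
# store-scoped blank node: _:b1 denotes the same store node in every patch from this log
A <http://example.org/person/1> <http://xmlns.com/foaf/0.1/knows> _:b1 <http://example.org/graph/persons> .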

Quickstart

To launch a self-contained sandbox that you can play around in, run the playground.sh script in this repo and open the link that is printed in the console for further instructions.

./playground.sh

To connect the logger to a production virtuoso server, you can pass it the connection details as environment variables using the -e flag.

docker run -i -t --rm -v $PWD/data:/datadir \
	-e="VIRTUOSO_ISQL_ADDRESS=127.0.01" \
	-e="VIRTUOSO_ISQL_PORT=1111" \
	-e="VIRTUOSO_USER=dba" \
	-e="VIRTUOSO_PASSWORD=dba" \
	jauco/virtuoso-quad-log

The first time it runs, the container will ask to install a stored procedure onto the Virtuoso server.

The quad-log will generate a set of files in the volume that you map to /datadir, which you can host using any static file server.
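
For example (one possible setup; any static file server will do, this one assumes Python 3 is installed on the host):

# serve the generated rdf-patch files over plain HTTP on port 8080
cd data && python3 -m http.server 8080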

To advertise the logs, you should provide either a robots.txt or a Source Description at the location that you submit to Work Package 2. See http://www.openarchives.org/rs/1.0/resourcesync for more information, or contact us! (The playground advertises the logs using the hidden folder .well-known.)
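
As a sketch (hostname and paths are placeholders), a minimal Source Description served at /.well-known/resourcesync would point to a capability list:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="description"/>
  <url>
    <loc>http://example.com/capabilitylist.xml</loc>
    <rs:md capability="capabilitylist"/>
  </url>
</urlset>

The robots.txt alternative uses the standard Sitemap: directive to point at the resource list instead.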

Things to test and do (aka issues/tickets)

  • offsets with multiple trx files (is the offset global or file-specific? how should the offset be handled after a checkpoint has run? what if multiple checkpoints have run between grabs?)

  • non-default literals (values typed as a date, for example)

  • literals vs hyperlinks

  • blank nodes

  • handle the fact that the last trx file might still be changing (e.g. by skipping the current transaction log)

  • check if CheckpointAuditTrail is enabled when running this logger (cfg_item_value)

  • multiple trx files (wrapper script)

  • add .well-known

  • add a static file server to the readme

  • remove checkpoint statement before committing and deploying

  • virtuoso server with existing data

  • try multiple insertion strategies and see if we can trigger all cases in the log (LOG_INSERT, LOG_INSERT_SOFT etc.)

  • escaping literals (at least newlines and quotes, check the nquads spec)

  • make it stateful so we don't re-parse the same files over and over again

  • being able to go over 50k rdf-patch files using resource-list indexes (a single resource list is capped at 50,000 entries)

  • make the rs update process atomic
