edsu/europeana-crawler

europeana-crawler is a simple proof-of-concept script for extracting
RDFa metadata from Europeana record pages, using the sitemap that
Europeana makes available for search engine crawlers. The triples for
each resource are persisted as a file on the filesystem, using a
pairtree to distribute the files evenly across subdirectories.
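
As a rough illustration of the pairtree idea, here is a minimal sketch of
mapping a resource identifier to a nested directory path. It is not the code
used by crawl.py: the function name pairtree_path, the "store" root directory
and the example identifier are all hypothetical, and only three of the
character substitutions from the Pairtree specification are applied.

```python
import os


def pairtree_path(identifier, root="store", shorty=2):
    """Map an identifier to a nested directory path (simplified sketch).

    Assumption: only three single-character substitutions from the
    Pairtree spec are applied here; the spec's hex-escaping of other
    special characters is omitted.
    """
    cleaned = (
        identifier.replace("/", "=").replace(":", "+").replace(".", ",")
    )
    # Split the cleaned identifier into segments of `shorty` characters,
    # so each segment becomes one directory level.
    parts = [cleaned[i:i + shorty] for i in range(0, len(cleaned), shorty)]
    return os.path.join(root, *parts)


if __name__ == "__main__":
    # Hypothetical identifier, for illustration only.
    print(pairtree_path("record/abc123"))
```

Because every directory level consumes only two characters of the
identifier, no single directory ends up holding a very large number of
entries, which is the property the crawler relies on.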

To run the crawler you'll need to install a few dependencies, either in
a virtualenv or globally on your system. The instructions here use a
virtualenv (steps 1-3); steps 4-6 then run the crawler, follow its log,
and aggregate the results into a single file:

1. virtualenv --no-site-packages ENV
2. source ENV/bin/activate
3. pip install -r requirements.pip
4. ./crawl.py
5. tail -f crawl.log
6. ./aggregate.py > europeana.nt
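
The aggregation step can be pictured as walking the directory tree and
concatenating the per-resource files into one stream. The sketch below is
only an assumption about how such a step could work, not the contents of
aggregate.py: the function name aggregate and the "store" root directory are
hypothetical, and it assumes each file already holds line-oriented triples
encoded as UTF-8.

```python
import os
import sys


def aggregate(root, out=sys.stdout):
    """Write the non-blank lines of every file under root to out.

    Returns the number of files read. Assumes (hypothetically) that
    each file holds one triple per line.
    """
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a deterministic order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8") as fh:
                for line in fh:
                    if line.strip():
                        # Make sure each line ends with exactly one newline.
                        out.write(line.rstrip("\n") + "\n")
            count += 1
    return count


if __name__ == "__main__":
    # "store" is a hypothetical root directory for the pairtree.
    aggregate("store")
```

Redirecting the output of a script like this to a file mirrors the shape of
step 6 above.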

Questions, comments:
Ed Summers <ehs@pobox.com>
