edsu/europeana-crawler
europeana-crawler is a simple proof-of-concept script for extracting RDFa metadata from Europeana record pages, using the sitemap that Europeana makes available for search engine crawlers. The triples for each resource are persisted as a file on the filesystem, using a pairtree to distribute the files evenly across subdirectories.

To run the crawler you'll need to install a few dependencies. You can install them in a virtualenv or globally on your system. The instructions here use a virtualenv:

1. virtualenv --no-site-packages ENV
2. source ENV/bin/activate
3. pip install -r requirements.pip
4. ./crawl.py
5. tail -f crawl.log
6. ./aggregate.py > europeana.nt

Questions, comments: Ed Summers <ehs@pobox.com>
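To illustrate the pairtree layout mentioned above, here is a minimal sketch of how an identifier can be mapped to a nested directory path under the Pairtree convention. It is not taken from crawl.py, which may well rely on a pairtree library and behave differently; the function names `clean_id` and `pairtree_path`, and the `pairtree_root` default, are hypothetical and chosen only for this example.

```python
import os

# Characters that the Pairtree convention hex-encodes as ^hh
# (in addition to anything outside visible ASCII).
_HEX_ENCODE = set('"*+,<=>?\\^|')

# Single-character substitutions applied after hex-encoding.
_SWAP = str.maketrans({"/": "=", ":": "+", ".": ","})


def clean_id(identifier):
    """Return a filesystem-safe form of an identifier (hypothetical helper)."""
    out = []
    for byte in identifier.encode("utf-8"):
        ch = chr(byte)
        if byte < 0x21 or byte > 0x7E or ch in _HEX_ENCODE:
            out.append("^%02x" % byte)
        else:
            out.append(ch)
    return "".join(out).translate(_SWAP)


def pairtree_path(identifier, root="pairtree_root"):
    """Split the cleaned identifier into two-character directory names."""
    cleaned = clean_id(identifier)
    pairs = [cleaned[i:i + 2] for i in range(0, len(cleaned), 2)]
    return os.path.join(root, *pairs)


if __name__ == "__main__":
    # On a POSIX system this prints:
    # pairtree_root/ar/k+/=1/30/30/=x/t1/2t/3
    print(pairtree_path("ark:/13030/xt12t3"))
```

Because each directory level is named by only two characters of the identifier, no single directory accumulates more than a bounded number of entries, which is what keeps the files spread evenly.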
About
a simple crawler of the RDFa in Europeana