VTUL/news-to-saf

This script processes a zipped folder of Virginia Tech News articles sent by University Relations.
We parse them into individual HTML files with accompanying images,
then batch import them into the VT News collection in VTechWorks, https://vtechworks.lib.vt.edu/handle/10919/19073.

University Relations has asked for a 5-year lag before harvest. 
This collection contains items from 2003-2010. Items were last loaded ~2015-10-29. 
Since University Relations now uses a different content management system for VT News,
we will have to modify the script.

Each item in the VT News collection, https://vtechworks.lib.vt.edu/handle/10919/19073,
contains an HTML file and its associated image files.
For instance, “Science magazine features sustainability in the Caribbean,”
https://vtechworks.lib.vt.edu/handle/10919/63880, contains the bitstreams

122010-science-fallmag.html, 
https://vtechworks.lib.vt.edu/bitstream/handle/10919/63880/122010-science-fallmag.html?sequence=1&isAllowed=y, 

and 

M_122010-science-fallmag.jpg, 
https://vtechworks.lib.vt.edu/bitstream/handle/10919/63880/M_122010-science-fallmag.jpg?sequence=2&isAllowed=y.

For example, 122010-science-fallmag.html contains 

    <p><img src="images/M_122010-science-fallmag.jpg" alt="Fall 2010 College of Science Magazine" /></p>

which loads M_122010-science-fallmag.jpg when the URL for 122010-science-fallmag.html is selected.
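As a rough illustration of how one article and its images could be laid out as a DSpace Simple Archive Format (SAF) item, here is a minimal Python sketch. It is not the repository's actual script: `ImageCollector` and `build_saf_item` are hypothetical names, the source layout (an `images/` folder next to the HTML file) is assumed from the example above, and only a title is written to `dublin_core.xml`, whereas a real import would likely carry more metadata.

```python
from html.parser import HTMLParser
from pathlib import Path
import shutil
from xml.sax.saxutils import escape


class ImageCollector(HTMLParser):
    """Collect the src attribute of every <img> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def build_saf_item(html_path: Path, item_dir: Path, title: str) -> list[str]:
    """Copy an article and its images into one flat SAF item directory.

    Returns the bitstream filenames in the order written to `contents`.
    """
    item_dir.mkdir(parents=True, exist_ok=True)

    collector = ImageCollector()
    collector.feed(html_path.read_text(encoding="utf-8"))

    # The HTML file goes first, followed by each image it references.
    shutil.copy(html_path, item_dir / html_path.name)
    names = [html_path.name]
    for src in collector.sources:
        image = html_path.parent / src  # assumes src is relative to the HTML file
        if image.is_file() and image.name not in names:
            shutil.copy(image, item_dir / image.name)
            names.append(image.name)

    # SAF `contents` file: one bitstream filename per line.
    (item_dir / "contents").write_text("\n".join(names) + "\n", encoding="utf-8")

    # Minimal Dublin Core record; only the title is filled in for this sketch.
    (item_dir / "dublin_core.xml").write_text(
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        "<dublin_core>\n"
        f'  <dcvalue element="title" qualifier="none">{escape(title)}</dcvalue>\n'
        "</dublin_core>\n",
        encoding="utf-8",
    )
    return names
```

Note that the images are copied into the item directory without their `images/` prefix, while the `src` attributes in the HTML are left untouched.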


The link to images/M_122010-science-fallmag.jpg actually resolves to M_122010-science-fallmag.jpg.
This is a feature of DSpace: when an HTML file is the primary content file of an item,
the DSpace webapp resolves relative links in the HTML document
to the base filename (no path) within that item.
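The mapping described above can be mimicked in a few lines of Python. This is only a sketch of the naming rule as stated here, not DSpace's own code; `bitstream_name` is a hypothetical helper name.

```python
from posixpath import basename
from urllib.parse import urlsplit


def bitstream_name(relative_link: str) -> str:
    """Return the base filename (no path) that a relative link maps to."""
    # urlsplit drops any query string or fragment before the path is trimmed.
    return basename(urlsplit(relative_link).path)


print(bitstream_name("images/M_122010-science-fallmag.jpg"))
# prints: M_122010-science-fallmag.jpg
```

A check like this could be used to confirm that no two linked resources in one article collapse to the same base filename before packaging.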

I think there should be either DSpace documentation on the subject,
or else notes in the old script where the HTML was tweaked and/or associated resources were moved.

Notes:

https://wiki.duraspace.org/plugins/servlet/mobile?contentId=68064792#content/view/68064792

“HTML content in items”,
https://wiki.duraspace.org/display/DSDOC5x/Application+Layer#ApplicationLayer-HTMLContentinItems

About

Create a DSpace SAF package from a VT News Archive
