Harshdeep1996/Harshdeep1996.github.io

Libretti Rolandi Entity Extraction

Contents

The repository is organised as follows:

  • code: contains all the code that extracts entities from the coperte and the title metadatum and links them to external/internal sources.

To reproduce the results, run the scripts in this folder in numeric order:

python 01_scrapper.py
python 02_place_extraction.py
python 03_fuzzy_place_extraction.py
python 04_composers_extraction.py
python 05_location_extraction.py
python 06_title_extraction.py
python 07_genre_extraction.py
python 08_occasion_extraction.py
python 09_quick_fixes.py
  • scraper: downloads the manifests of the libretti into the folder manifests

  • place extraction: OCRs the coperte of the libretti and extracts a tentative city name; stores a csv file with the existing metadata and the extracted city in the folder data

  • fuzzy place extraction: extracts a tentative city name using fuzzy matching; stores a new csv file in the folder data

  • composers extraction: extracts composer names from the coperte and titles; stores a new csv file in the folder data

  • location extraction: extracts the location of the representation (i.e. the name of the theater, church, ...); stores a new csv file in the folder data

  • title extraction: extracts the bare title from the title metadatum; stores a new csv file in the folder data

  • genre extraction: extracts the opera genre from the title; stores a new csv file in the folder data

  • occasion extraction: extracts the occasion of the representation (e.g. carnival, fair); stores a new csv file in the folder data

  • quick fixes: improves the composer extraction and the wikimedia linking; stores a new csv file in the folder data

  • data: contains all the produced csv files, ordered from oldest to most recent (librettos_8 is the final version). It also contains a ground truth with the expected and observed entities for 20 randomly chosen libretti.
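
The fuzzy place extraction step listed above can be illustrated with a minimal sketch. It is not the code from 03_fuzzy_place_extraction.py: it assumes a small hypothetical list of city names (KNOWN_CITIES), a hypothetical helper fuzzy_city, and the standard library's difflib as the matcher, simply to show how an OCR misreading can still be mapped to a known city.

```python
from difflib import get_close_matches

# Hypothetical list of known cities; the repository's real list is not shown here.
KNOWN_CITIES = ["Venezia", "Milano", "Roma", "Napoli", "Firenze", "Bologna"]


def fuzzy_city(ocr_text, cities=KNOWN_CITIES, cutoff=0.8):
    """Return the first known city that closely matches a token of the OCR text."""
    lowered = {city.lower(): city for city in cities}
    for token in ocr_text.split():
        # Drop surrounding punctuation and compare case-insensitively.
        word = token.strip(".,;:()[]").lower()
        match = get_close_matches(word, list(lowered), n=1, cutoff=cutoff)
        if match:
            return lowered[match[0]]
    return None  # no token was similar enough to any known city


if __name__ == "__main__":
    # "VENEZLA" imitates an OCR error for "VENEZIA".
    print(fuzzy_city("IN VENEZLA, MDCCXLII."))
```

The cutoff of 0.8 is an arbitrary choice for this sketch; a lower value tolerates noisier OCR at the cost of more false matches.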

Visualization

  • index.html: the entry page; it provides the skeleton of the visualization, which the JavaScript code then builds upon.

  • code/scripts: contains the Python scripts that preprocess and prepare the data for the visualization, e.g. collecting all common composer or title links.

  • js/mapIntegration.js: builds the page structure through the DOM and contains most of the visualization logic, e.g. mapping theaters, visualizing links, and browsing the librettos over time.

  • css/style.css: the single CSS file that styles the visualization.
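
As an illustration of the kind of preprocessing mentioned for code/scripts, the sketch below computes links between libretti that share a composer. It is an assumption about how such links could be derived, not the repository's script: the function name composer_links, the field names id and composer, and the sample rows are all made up for the example.

```python
from collections import defaultdict
from itertools import combinations


def composer_links(rows):
    """Return sorted (id_a, id_b, composer) triples for libretti sharing a composer."""
    by_composer = defaultdict(list)
    for row in rows:
        composer = row.get("composer")
        if composer:  # skip rows where no composer was extracted
            by_composer[composer].append(row["id"])

    links = []
    for composer, ids in by_composer.items():
        # Every unordered pair of libretti with the same composer is one link.
        for id_a, id_b in combinations(sorted(ids), 2):
            links.append((id_a, id_b, composer))
    return sorted(links)


if __name__ == "__main__":
    # Invented sample rows, not taken from the data folder.
    sample = [
        {"id": "lib1", "composer": "Vivaldi"},
        {"id": "lib2", "composer": "Galuppi"},
        {"id": "lib3", "composer": "Vivaldi"},
        {"id": "lib4", "composer": ""},
    ]
    print(composer_links(sample))
```

The same grouping works for title links by swapping the field that is used as the key.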

To develop the visualization locally

You can work on the visualization locally with the existing code base. To avoid Cross-Origin Resource Sharing (CORS) errors, copy the Python script below into the parent directory and run it there, so that your machine serves the data over HTTP while you work locally.

#!/usr/bin/env python3
"""Serve the current directory over HTTP with a permissive CORS header."""
from http.server import HTTPServer, SimpleHTTPRequestHandler, test
import sys

class CORSRequestHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        # Allow pages from any origin to fetch the served files.
        self.send_header('Access-Control-Allow-Origin', '*')
        super().end_headers()

if __name__ == '__main__':
    # Optional first argument: port number (defaults to 8000).
    port = int(sys.argv[1]) if len(sys.argv) > 1 else 8000
    test(CORSRequestHandler, HTTPServer, port=port)
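
To confirm that a handler like the one above really sends the CORS header, one option is a small self-check such as the sketch below. It redefines the handler so that it is self-contained, starts the server on a free port in a background thread, and reads the response header back. The helper name cors_header_for is hypothetical and not part of the repository.

```python
import threading
import urllib.error
import urllib.request
from http.server import HTTPServer, SimpleHTTPRequestHandler


class CORSRequestHandler(SimpleHTTPRequestHandler):
    def end_headers(self):
        self.send_header("Access-Control-Allow-Origin", "*")
        super().end_headers()

    def log_message(self, format, *args):
        pass  # silence per-request logging during the check


def cors_header_for(path="/"):
    """Start a temporary server and return the CORS header it sends for `path`."""
    server = HTTPServer(("127.0.0.1", 0), CORSRequestHandler)  # port 0: pick a free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    # Bypass any configured proxy so the request reaches the local server directly.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    url = f"http://127.0.0.1:{server.server_port}{path}"
    try:
        with opener.open(url) as response:
            return response.headers.get("Access-Control-Allow-Origin")
    except urllib.error.HTTPError as err:
        # Error responses also pass through end_headers, so the header is still set.
        return err.headers.get("Access-Control-Allow-Origin")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(cors_header_for())
```

Note that the wildcard origin is convenient for local development only; it is not meant for a publicly reachable server.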

Authors

  • Harshdeep
  • Aurel Maeder
  • Ludovica Schaerf
