AVResearcherXL

AVResearcherXL is a tool based on AVResearcher, a prototype aimed at allowing media researchers to explore the metadata associated with large numbers of audiovisual broadcasts. AVResearcher lets them compare and contrast the characteristics of the search results for two topics, across time and in terms of content. Broadcasts can be searched and compared not only on the basis of traditional catalogue descriptions, but also in terms of spoken content (subtitles) and social chatter (tweets associated with broadcasts). AVResearcher is a new and ongoing valorisation project at the Netherlands Institute for Sound and Vision.

In addition to the exploration of audiovisual broadcasts, AVResearcherXL allows users to search and compare different document collections. AVResearcherXL also implements a new design, the option to show relative counts in its timeline visualisation, and multiple views on result sets.

AVResearcherXL is developed by Dispectu B.V.

Requirements

  • Python 2.7
    • pip
    • virtualenv
  • Elasticsearch > 1.1
  • Relational database (e.g. SQLite, MySQL or PostgreSQL)
  • A webserver with WSGI or proxy capabilities

Installing AVResearcherXL

  1. Clone the repository:
$ git clone git@github.com:beeldengeluid/AVResearcherXL.git
$ cd AVResearcherXL
  2. Create a virtualenv, activate it and install the required Python packages:
$ virtualenv ~/my_pyenvs/avresearcherxl
$ source ~/my_pyenvs/avresearcherxl/bin/activate
$ pip install -r requirements.txt
  3. Create a local settings file to override the default settings specified in settings.py. In the next steps we describe the minimal settings that need to be changed to get the application up and running. Have a look at the comments in settings.py for an overview of all available settings.
$ vim local_settings.py
  4. When running the application in a production environment, set DEBUG to False.
  5. Set the SECRET_KEY for the installation (this key is used to sign cookies). A good random key can be generated as follows:
>>> import os
>>> os.urandom(24)
'\x86\xb8f\xcc\xbf\xd6f\x96\xf0\x08v\x90\xed\xad\x07\xfa\x01\xd0\\L#\x95\xf6\xdd'
  6. Configure the connections to the Elasticsearch instance(s) and the log index:
ES_SEARCH_CONFIG = {'hosts': ['index_host1', 'index_host2'], 'port': 9200}
ES_LOG_CONFIG = {'hosts': ['logging_host'] , 'port': 9200}
ES_LOG_INDEX = 'avresearcher_logs'

See the comments in avresearcher/settings.py for more advanced configuration options and examples.

  7. Set the options of the indexed collections (COLLECTIONS_CONFIG).
  8. Provide the URI of the database. The SQLAlchemy documentation provides information on how to structure the URI for different databases. To use an SQLite database named avresearcher.db, set DATABASE_URI to sqlite:///avresearcher.db.
  9. Load the schema in the database configured in the previous step:
./manage.py init_db
  10. Provide the settings of the SMTP server that should be used to send notification emails during registration:
MAIL_SERVER = 'localhost'
MAIL_PORT = 25
MAIL_USE_TLS = False
MAIL_USE_SSL = False
MAIL_USERNAME = None
MAIL_PASSWORD = None

If you don't want to run an SMTP server, you'll have to create user accounts from the command line. Issue python manage.py create_user --help to find out how.

  11. Use a WSGI application server (such as uWSGI or Gunicorn) to run the Flask application. Make sure to serve static assets directly through the webserver.
$ pip install gunicorn
$ gunicorn --bind 0.0.0.0 -w 4 wsgi:app
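Pulling the configuration steps above together, a minimal local_settings.py could look like the sketch below. All host names and index names are placeholders taken from the examples above, and the secret key is a dummy; substitute your own values.

```python
# local_settings.py -- minimal example overriding settings.py.
# All host names below are placeholders; replace them with your own.

# Never run with the debugger enabled in production.
DEBUG = False

# Dummy value: generate your own key with os.urandom(24).
SECRET_KEY = 'replace-me-with-24-random-bytes'

# Elasticsearch instances used for searching and for query logging.
ES_SEARCH_CONFIG = {'hosts': ['index_host1', 'index_host2'], 'port': 9200}
ES_LOG_CONFIG = {'hosts': ['logging_host'], 'port': 9200}
ES_LOG_INDEX = 'avresearcher_logs'

# SQLite database file created relative to the application.
DATABASE_URI = 'sqlite:///avresearcher.db'

# SMTP settings for registration emails.
MAIL_SERVER = 'localhost'
MAIL_PORT = 25
MAIL_USE_TLS = False
MAIL_USE_SSL = False
MAIL_USERNAME = None
MAIL_PASSWORD = None
```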

Running the text analysis tasks

The package contains several text analysis tasks to generate the terms used in the 'descriptive terms' facet. Make sure that the collection you wish to use is fully indexed in Elasticsearch before running the analysis tasks.

  1. Install the required packages:
$ pip install -r requirements-text-analysis.txt
  2. Tokenize the source text by starting a producer that grabs the text and one or more consumers that perform the actual tokenization and lemmatization:
$ ./manage.py analyze_text tokenize producer "immix_source/*.json" immix_summaries
$ ./manage.py analyze_text tokenize consumer "immix_analyzed/summaries" immix_summaries
  3. Create a (Gensim) dictionary of the tokenized text:
$ ./manage.py analyze_text create_dictionary "immix_analyzed/summaries/*/*.txt" "gensim_data/immix_summaries.dict"
  4. Optionally, prune the dictionary:
$ ./manage.py analyze_text prune_dictionary gensim_data/immix_summaries.dict gensim_data/immix_summaries_pruned.dict --no_below 10 --no_above .10
  5. Construct the corpus in the Matrix Market format:
$ ./manage.py analyze_text construct_corpus "immix_analyzed/summaries/*.tar.gz" gensim_data/immix_summaries_pruned.dict gensim_data/immix_summaries.mm
  6. Construct the TF-IDF model:
$ ./manage.py construct_tfidf_model gensim_data/immix_summaries.mm gensim_data/immix_summaries.tfidf_model
  7. Add the top N 'most descriptive' terms to each indexed document:
$ ./manage.py analyze_text index_descriptive_terms "immix_analyzed/summaries/*.tar.gz"  gensim_data/immix_summaries_pruned.dict gensim_data/immix_summaries.tfidf_model gensim_data/immix_summaries.tfidf_model 'quamerdes_immix_20140920' 'text_descriptive_terms' 10
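The pipeline above boils down to weighting each term by its frequency in a document and by how rare it is across the collection, then keeping the highest-scoring terms. The following self-contained sketch illustrates the idea in plain Python; it is a conceptual illustration, not the Gensim code the commands actually run.

```python
import math
from collections import Counter

def tfidf_top_terms(docs, n):
    """Return the n highest-scoring TF-IDF terms for each tokenized document."""
    num_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    top = []
    for doc in docs:
        tf = Counter(doc)
        # Classic TF-IDF: term frequency times log inverse document frequency.
        scores = {t: tf[t] * math.log(num_docs / df[t]) for t in tf}
        top.append([t for t, _ in sorted(scores.items(),
                                         key=lambda kv: -kv[1])[:n]])
    return top

docs = [
    ['radio', 'broadcast', 'news'],
    ['radio', 'broadcast', 'sports'],
    ['radio', 'interview', 'music'],
]
# Terms shared by every document ('radio') score zero; rare terms win.
print(tfidf_top_terms(docs, 1))  # prints [['news'], ['sports'], ['interview']]
```

Pruning the dictionary with --no_below and --no_above, as in step 4, removes terms whose document frequency is so low or so high that their TF-IDF scores are uninformative.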

License

Copyright 2014 Dispectu B.V. Parts copyright 2015 Netherlands eScience Center.

AVResearcherXL is distributed under the terms of the Apache 2.0 License (see the file LICENSE).
