AVResearcherXL

AVResearcherXL is a tool based on AVResearcher, a prototype aimed at allowing media researchers to explore the metadata associated with large numbers of audiovisual broadcasts. AVResearcher lets them compare and contrast the characteristics of the search results for two topics, across time and in terms of content. Broadcasts can be searched and compared not only on the basis of traditional catalogue descriptions, but also in terms of spoken content (subtitles) and social chatter (tweets associated with broadcasts). AVResearcher is a new and ongoing valorisation project at the Netherlands Institute for Sound and Vision.

In addition to the exploration of audiovisual broadcasts, AVResearcherXL allows users to search and compare different document collections. AVResearcherXL also implements a new design, the option to show relative counts in its timeline visualisation, and multiple views on result sets.

AVResearcherXL is developed by Dispectu B.V.

Requirements

  • Python 2.7
    • pip
    • virtualenv
  • Elasticsearch > 1.1
  • Relational database (e.g. SQLite, MySQL or PostgreSQL)
  • A webserver with WSGI or proxy capabilities

Installing AVResearcherXL

  1. Clone the repository:
$ git clone git@github.com:beeldengeluid/AVResearcherXL.git
$ cd AVResearcherXL
  2. Create a virtualenv, activate it and install the required Python packages:
$ virtualenv ~/my_pyenvs/avresearcherxl
$ source ~/my_pyenvs/avresearcherxl/bin/activate
$ pip install -r requirements.txt
  3. Create a local settings file to override the default settings specified in settings.py. In the next steps we describe the minimal settings that need to be changed to get the application up and running. Have a look at the comments in settings.py for an overview of all available settings.
$ vim local_settings.py
  4. When running the application in a production environment, set DEBUG to False.
  5. Set the SECRET_KEY for the installation (this key is used to sign cookies). A good random key can be generated as follows:
>>> import os
>>> os.urandom(24)
'\x86\xb8f\xcc\xbf\xd6f\x96\xf0\x08v\x90\xed\xad\x07\xfa\x01\xd0\\L#\x95\xf6\xdd'
  6. Configure the connections to the Elasticsearch instance(s) and the log index:
ES_SEARCH_CONFIG = {'hosts': ['index_host1', 'index_host2'], 'port': 9200}
ES_LOG_CONFIG = {'hosts': ['logging_host'] , 'port': 9200}
ES_LOG_INDEX = 'avresearcher_logs'

See the comments in avresearcher/settings.py for more advanced configuration options and examples.

  7. Set the options of the indexed collections (COLLECTIONS_CONFIG).
  8. Provide the URI of the database. The SQLAlchemy documentation provides information on how to structure the URI for different databases. To use an SQLite database named avresearcher.db, set DATABASE_URI to sqlite:///avresearcher.db.
  9. Load the schema in the database configured in the previous step:
./manage.py init_db
  10. Provide the settings of the SMTP server that should be used to send notification emails during registration:
MAIL_SERVER = 'localhost'
MAIL_PORT = 25
MAIL_USE_TLS = False
MAIL_USE_SSL = False
MAIL_USERNAME = None
MAIL_PASSWORD = None

If you don't want to run an SMTP server, you'll have to create user accounts from the command line. Issue python manage.py create_user --help to find out how.

  11. Use a WSGI application server (such as uWSGI or Gunicorn) to run the Flask application. Make sure to serve static assets directly through the webserver.
$ pip install gunicorn
$ gunicorn --bind 0.0.0.0 -w 4 wsgi:app
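Pulling the configuration steps above together, a minimal local_settings.py could look like the sketch below. All host names and index names are placeholders taken from the examples above, and the secret key is a dummy; substitute your own values.

```python
# local_settings.py -- minimal example overriding settings.py.
# All host names below are placeholders; replace them with your own.

# Never run with the debugger enabled in production.
DEBUG = False

# Dummy value: generate your own key with os.urandom(24).
SECRET_KEY = 'replace-me-with-24-random-bytes'

# Elasticsearch instances used for searching and for query logging.
ES_SEARCH_CONFIG = {'hosts': ['index_host1', 'index_host2'], 'port': 9200}
ES_LOG_CONFIG = {'hosts': ['logging_host'], 'port': 9200}
ES_LOG_INDEX = 'avresearcher_logs'

# SQLite database file created relative to the application.
DATABASE_URI = 'sqlite:///avresearcher.db'

# SMTP settings for registration emails.
MAIL_SERVER = 'localhost'
MAIL_PORT = 25
MAIL_USE_TLS = False
MAIL_USE_SSL = False
MAIL_USERNAME = None
MAIL_PASSWORD = None
```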

Running the text analysis tasks

The package contains several text analysis tasks to generate the terms used in the 'descriptive terms' facet. Make sure that the collection you wish to use is fully indexed in Elasticsearch before running the analysis tasks.

  1. Install the required packages:
$ pip install -r requirements-text-analysis.txt
  2. Tokenize the source text by starting a producer that grabs the text and one or more consumers that perform the actual tokenization and lemmatization:
$ ./manage.py analyze_text tokenize producer "immix_source/*.json" immix_summaries
$ ./manage.py analyze_text tokenize consumer "immix_analyzed/summaries" immix_summaries
  3. Create a (Gensim) dictionary of the tokenized text:
$ ./manage.py analyze_text create_dictionary "immix_analyzed/summaries/*/*.txt" "gensim_data/immix_summaries.dict"
  4. Optionally, prune the dictionary:
$ ./manage.py analyze_text prune_dictionary gensim_data/immix_summaries.dict gensim_data/immix_summaries_pruned.dict --no_below 10 --no_above .10
  5. Construct the corpus in the Matrix Market format:
$ ./manage.py analyze_text construct_corpus "immix_analyzed/summaries/*.tar.gz" gensim_data/immix_summaries_pruned.dict gensim_data/immix_summaries.mm
  6. Construct the TF-IDF model:
$ ./manage.py construct_tfidf_model gensim_data/immix_summaries.mm gensim_data/immix_summaries.tfidf_model
  7. Add the top N 'most descriptive' terms to each indexed document:
$ ./manage.py analyze_text index_descriptive_terms "immix_analyzed/summaries/*.tar.gz"  gensim_data/immix_summaries_pruned.dict gensim_data/immix_summaries.tfidf_model gensim_data/immix_summaries.tfidf_model 'quamerdes_immix_20140920' 'text_descriptive_terms' 10
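The pipeline above boils down to weighting each term by its frequency in a document and by how rare it is across the collection, then keeping the highest-scoring terms. The following self-contained sketch illustrates the idea in plain Python; it is a conceptual illustration, not the Gensim code the commands actually run.

```python
import math
from collections import Counter

def tfidf_top_terms(docs, n):
    """Return the n highest-scoring TF-IDF terms for each tokenized document."""
    num_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    top = []
    for doc in docs:
        tf = Counter(doc)
        # Classic TF-IDF: term frequency times log inverse document frequency.
        scores = {t: tf[t] * math.log(num_docs / df[t]) for t in tf}
        top.append([t for t, _ in sorted(scores.items(),
                                         key=lambda kv: -kv[1])[:n]])
    return top

docs = [
    ['radio', 'broadcast', 'news'],
    ['radio', 'broadcast', 'sports'],
    ['radio', 'interview', 'music'],
]
# Terms shared by every document ('radio') score zero; rare terms win.
print(tfidf_top_terms(docs, 1))  # prints [['news'], ['sports'], ['interview']]
```

Pruning the dictionary with --no_below and --no_above, as in step 4, removes terms whose document frequency is so low or so high that their TF-IDF scores are uninformative.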

License

Copyright 2014 Dispectu B.V. Parts copyright 2015 Netherlands eScience Center.

AVResearcherXL is distributed under the terms of the Apache 2.0 License (see the file LICENSE).
