Information system for media monitoring and analysis. Project under program-targeted funding (ПЦФ) BR05236839.

KindYAK/NLPMonitor

About

@article{26583204_325243423_2019, 
    author = {Vladimir Barakhnin and Olga Kozhemyakina and Ravil Mukhamediev and Yulia Borzilova and Kirill Yakunin}, 
    keywords = {natural language processing, streaming word processing, text analysis information system, development of a text corpus processing system},
    title = {The design of the structure of the software system for processing text document corpus},
    year = {2019},
    number = {4 Vol.13},
    pages = {60-72},
    url = {https://bijournal.hse.ru/en/2019--4 Vol.13/325243423.html},
}

A media-monitoring system that covers the following tasks:

  • Parsing of news web sites using custom configurable Spider (Scrapy)
  • Storage (Redis, PostgreSQL, Elasticsearch)
  • NLP data preprocessing (PyMorphy2, NLTK, Gensim)
  • Topic modelling (LDA, BigARTM, ETM), including dynamic models (Custom DTM, DETM)
  • Classification of documents according to arbitrary criteria (M4A, traditional ML approach)
  • Visualization (Django, HTML+CSS+JS, Plotly, MapBox)
  • Automatic report generation (LaTeX+Jinja2)

Architecture

All components of the system are implemented as Docker containers. This lets components and subsystems run independently, makes them interchangeable, and allows the system to scale easily.

Airflow is the ETL subsystem. It runs the scraping spiders (Scrapy), whose output is stored in PostgreSQL as persistent structured SQL storage. Data obtained through preprocessing, modification and modelling is stored in Elasticsearch, which is the main storage for the precomputed results needed to display dashboards and reports.

Interface

[Interface screenshots: Topic Document Dynamics Criteria Dynamics; analytics1; analytics2]

The system also provides visualization tools, such as the dynamics of topic publications in the media according to various criteria, histograms of criteria value distributions, distributions among sources, etc.
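
As a rough illustration of what "dynamics of publications of topics" can mean in practice, the sketch below sums per-document topic weights into monthly totals. The function name `topic_dynamics`, the input layout and the sample topics are assumptions made for this example; in the system itself such precomputed results are stored in Elasticsearch.

```python
from collections import defaultdict
from datetime import date


def topic_dynamics(docs):
    """Sum per-document topic weights into monthly totals.

    docs: iterable of (publication_date, {topic_id: weight}) pairs.
    Returns {topic_id: {"YYYY-MM": total_weight}}.
    """
    dynamics = defaultdict(lambda: defaultdict(float))
    for published, topic_weights in docs:
        month = published.strftime("%Y-%m")
        for topic_id, weight in topic_weights.items():
            dynamics[topic_id][month] += weight
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {topic: dict(months) for topic, months in dynamics.items()}


# Hypothetical sample data: three documents with topic weights.
docs = [
    (date(2019, 1, 5), {"economy": 0.75, "sports": 0.25}),
    (date(2019, 1, 20), {"economy": 0.5, "politics": 0.5}),
    (date(2019, 2, 3), {"sports": 1.0}),
]
dynamics = topic_dynamics(docs)
```

Each inner dictionary is one time series that a plotting library such as Plotly could draw as a line per topic.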

Dynamic Topic Modelling

Mapping DTM is a custom algorithm for analyzing topic dynamics based on context semantic mapping (Context Fuzzy Jaccard). It makes it possible to visualize the lifecycle of topics, analyze changes in vocabulary, and classify topics by their dynamic characteristics in order to distinguish events, informational attacks, long-term trends, etc.
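
The exact definition of Context Fuzzy Jaccard is not given here, so the sketch below uses the plain fuzzy (weighted) Jaccard similarity, sum of minima over sum of maxima, and leaves out the "context" part entirely. It shows how topics from two consecutive time slices might be linked. The names `fuzzy_jaccard` and `map_topics`, the threshold of 0.3 and the sample topics are assumptions for illustration, not the project's implementation.

```python
def fuzzy_jaccard(topic_a, topic_b):
    """Weighted Jaccard similarity of two {word: weight} topic descriptions."""
    words = set(topic_a) | set(topic_b)
    intersection = sum(min(topic_a.get(w, 0.0), topic_b.get(w, 0.0)) for w in words)
    union = sum(max(topic_a.get(w, 0.0), topic_b.get(w, 0.0)) for w in words)
    return intersection / union if union else 0.0


def map_topics(prev_topics, next_topics, threshold=0.3):
    """Link each earlier topic to its most similar later topic.

    A topic whose best similarity is below the threshold maps to None,
    which can be read as "this topic has no successor".
    """
    mapping = {}
    for prev_id, prev_words in prev_topics.items():
        best_id, best_score = None, 0.0
        for next_id, next_words in next_topics.items():
            score = fuzzy_jaccard(prev_words, next_words)
            if score > best_score:
                best_id, best_score = next_id, score
        mapping[prev_id] = best_id if best_score >= threshold else None
    return mapping


# Hypothetical topics from two consecutive time slices.
prev_topics = {
    "t1": {"bank": 0.5, "loan": 0.5},
    "t2": {"match": 0.5, "goal": 0.5},
}
next_topics = {
    "n1": {"bank": 0.5, "credit": 0.5},
    "n2": {"election": 1.0},
}
mapping = map_topics(prev_topics, next_topics)
```

Chaining such mappings across slices gives a per-topic lifecycle, and comparing the linked word distributions exposes vocabulary changes.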

Analytics Dashboard

Dashboards are sets of configurable widgets that render the visualizations described above. A dashboard can be configured to a client's needs without additional development. Monitoring objects are defined in a special NER request language, which filters information based on any given entities. An example of such a request is 1(Machine Learning) AND 1(Deep | Convolutional), which requires the phrase "Machine learning" to be present in a text, along with either "Deep" or "Convolutional". This language makes it possible to filter the corpus flexibly in order to analyze different entities such as persons, organisations, locations and topics.
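
To make the request example concrete, here is a minimal sketch of an evaluator for requests of that one shape: clauses joined by AND, each written as N(alt1 | alt2). The grammar of the real request language is not documented here. In particular, reading the leading number as a minimum count of matches is an assumption, as are the name `matches_request` and the use of case-insensitive substring matching.

```python
import re

# One clause: an integer followed by parenthesised alternatives, e.g. 1(Deep | Convolutional).
CLAUSE = re.compile(r"^\s*(\d+)\((.*)\)\s*$")


def matches_request(request, text):
    """Return True if every AND-clause of the request is satisfied by the text.

    Assumption: the number before the parentheses is the minimum number of
    occurrences required across all alternatives in that clause.
    """
    lowered = text.lower()
    for clause in request.split(" AND "):
        parsed = CLAUSE.match(clause)
        if parsed is None:
            raise ValueError(f"Unsupported clause: {clause!r}")
        min_hits = int(parsed.group(1))
        alternatives = [alt.strip().lower() for alt in parsed.group(2).split("|")]
        # Plain substring counting; a real system would match recognised entities.
        hits = sum(lowered.count(alt) for alt in alternatives if alt)
        if hits < min_hits:
            return False
    return True


request = "1(Machine Learning) AND 1(Deep | Convolutional)"
accepted = matches_request(request, "Deep networks dominate machine learning benchmarks")
rejected = matches_request(request, "Machine learning with decision trees")
```

Substring matching is deliberately naive: "deep" would also match "deeper", which an entity-based matcher would avoid.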

Geo Dashboard

Practical uses

Media Analytics can be applied to industrial tasks such as:

  1. A competitor to the ALEM MEDIA MONITORING service for:

    Monitoring of media space (news websites, social networks, TV, etc.)

    Reputation management, public opinion analysis, PR policies assessment and optimization

    Decision making support

    Configurable reporting and dashboards

    NER requests filtering

    Estimation of marketing campaign KPIs, comparative analysis of competitors, etc.

  2. A service for finding the most relevant bloggers/influencers for advertising in social networks: YouTube, Instagram, Facebook, etc. An example of such a service is GetBlogger

    Social network parsing, filtering by bloggers/authors popularity

    Topic modelling of the corpora, obtaining topic embedding for separate publications and aggregating them to bloggers'/authors' topic embeddings

    Creating a model which accepts textual information about business or product as an input, and outputs the most relevant bloggers/authors
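
The blogger-matching idea above can be sketched with a simple nearest-neighbour stand-in for the model: average each author's publication topic vectors, then rank authors by cosine similarity to the topic vector of a business description. This assumes the topic vector for the input text has already been inferred by a topic model. The function names and sample data are hypothetical, and the source does not specify what kind of model is actually intended.

```python
import math
from collections import defaultdict


def author_embeddings(publications):
    """Average the topic vectors of each author's publications.

    publications: iterable of (author, topic_vector) pairs with equal-length vectors.
    """
    sums = {}
    counts = defaultdict(int)
    for author, vector in publications:
        if author not in sums:
            sums[author] = [0.0] * len(vector)
        sums[author] = [s + v for s, v in zip(sums[author], vector)]
        counts[author] += 1
    return {a: [s / counts[a] for s in total] for a, total in sums.items()}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_authors(query_vector, embeddings):
    """Return authors ordered from most to least similar to the query vector."""
    return sorted(
        embeddings,
        key=lambda author: cosine(query_vector, embeddings[author]),
        reverse=True,
    )


# Hypothetical data: topic vectors over three topics.
publications = [
    ("alice", [0.8, 0.2, 0.0]),
    ("alice", [0.6, 0.4, 0.0]),
    ("bob", [0.0, 0.1, 0.9]),
]
embeddings = author_embeddings(publications)
ranking = rank_authors([0.9, 0.1, 0.0], embeddings)
```

Filtering by popularity, as the list mentions, would be applied to the author set before ranking.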
