In the last 5 years, shoppers have written more than 5 billion reviews on the Web.
Manufacturers know that it is essential to understand what customers say in order to create better products.
Much of the work on understanding customer reviews has focused on sentiment analysis, especially aspect-based sentiment analysis, which identifies the product features that a review considers positive or negative.
Far less work has focused on extracting other meaningful information.
We propose a technique to extract the most relevant insights for a manufacturer from a set of text reviews.
The most relevant insights are the thoughts, opinions, or sentences that customers express most often or least often in a corpus of reviews.
In this way, manufacturers get a clear overview of what the majority of their online customers think about their products.
The algorithm is based on 2 main concepts:
- sentence modeling: to transform sentences and represent them as vectors
- clustering: to organize sentence vectors into groups, where each group includes all the sentences with almost the same meaning
The algorithm can be described with the following steps:
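To make the idea of sentence modeling concrete, here is a minimal sketch in which each sentence becomes a vector and similar sentences end up close together. It uses a plain bag-of-words count as a stand-in for a learned sentence embedding; that substitution, and the names `sentence_vector` and `cosine_similarity`, are assumptions made for illustration and are not part of the proposed implementation. Unlike a learned embedding, word counts cannot recognize paraphrases that share no words.

```python
import math
import re
from collections import Counter


def sentence_vector(sentence):
    """Map a sentence to a sparse bag-of-words vector (word -> count)."""
    return Counter(re.findall(r"[a-z0-9']+", sentence.lower()))


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(count * v[word] for word, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


a = sentence_vector("The battery lasts all day")
b = sentence_vector("Battery lasts all day long")
c = sentence_vector("Shipping was slow")

# Sentences that share most of their words score higher than unrelated ones.
print(cosine_similarity(a, b) > cosine_similarity(a, c))
```

A clustering step can then group sentences whose vectors are close to each other.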
- the text of each review is divided into sentences
- each sentence is transformed into a sentence vector
- all the sentence vectors are organized into groups by a clustering algorithm such as MeanShift
- the largest clusters/groups are selected (they contain the most frequently occurring sentences in the input set of reviews)
- the centroid of each cluster is selected as one of the most relevant insights
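The clustering and selection steps above can be sketched as follows. This is a small, pure-Python flat-kernel mean shift written for illustration, not the prototype's code; a real implementation would more likely rely on a library. The function names, the `bandwidth` value, and the 2-D toy vectors are all hypothetical. Since a centroid is a vector and not a sentence, the sketch assumes the insight reported for a cluster is the sentence whose vector lies closest to the centroid.

```python
import math


def mean_shift(points, bandwidth, max_iter=50):
    """Flat-kernel mean shift: returns a cluster label for each point."""
    modes = []
    for point in points:
        current = list(point)
        for _ in range(max_iter):
            # Move to the mean of all points within `bandwidth` of the current position.
            neighbors = [p for p in points if math.dist(p, current) <= bandwidth]
            shifted = [sum(axis) / len(neighbors) for axis in zip(*neighbors)]
            if math.dist(shifted, current) < 1e-6:
                break
            current = shifted
        modes.append(current)

    # Points whose modes ended up close together belong to the same cluster.
    centers, labels = [], []
    for mode in modes:
        for index, center in enumerate(centers):
            if math.dist(mode, center) <= bandwidth / 2:
                labels.append(index)
                break
        else:
            centers.append(mode)
            labels.append(len(centers) - 1)
    return labels


def top_insights(sentences, vectors, bandwidth, top_k=2):
    """Return (sentence, cluster size) for the `top_k` largest clusters."""
    clusters = {}
    for index, label in enumerate(mean_shift(vectors, bandwidth)):
        clusters.setdefault(label, []).append(index)
    largest = sorted(clusters.values(), key=len, reverse=True)[:top_k]

    insights = []
    for members in largest:
        centroid = [sum(axis) / len(members)
                    for axis in zip(*(vectors[i] for i in members))]
        # Assumption: report the sentence closest to the cluster centroid.
        closest = min(members, key=lambda i: math.dist(vectors[i], centroid))
        insights.append((sentences[closest], len(members)))
    return insights


# Hypothetical sentences with hand-made 2-D vectors standing in for real embeddings.
sentences = ["Battery lasts all day", "Great battery life", "The battery is excellent",
             "Shipping was slow", "Delivery took too long", "The box was blue"]
vectors = [(0.0, 0.0), (0.3, 0.0), (0.1, 0.1), (5.0, 5.0), (5.2, 5.0), (9.0, 0.0)]

for sentence, size in top_insights(sentences, vectors, bandwidth=1.0):
    print(size, sentence)
```

The bandwidth controls how close two sentences must be to count as having almost the same meaning, so it would need tuning on real sentence vectors.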
[TBD - libraries, programming languages, etc.]
Simple prototype:
- takes as input an array of documents, possibly read from a file
- runs the NLP pipeline
- returns the best insights, printed on the command line or written to a file
- Sentence Splitter on each document (1h)
- Doc2Vec representation on each document, at sentence level (16h)
- Clustering on the whole corpus of sentences (16h)
- Selection of the best insights from the clusters resulting from previous step (8h)
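As an illustration of the sentence-splitting task, the following sketch splits each document with a simple regular expression and flattens the result into one corpus of sentences. The rule used here (a sentence ends at `.`, `!` or `?` followed by whitespace) and the function names are assumptions; the rule will mis-split abbreviations such as "e.g.", so a dedicated NLP sentence splitter may be preferable in practice.

```python
import re

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(document):
    """Split one review into sentences, dropping empty fragments."""
    return [part.strip() for part in _SENTENCE_END.split(document) if part.strip()]


def corpus_sentences(documents):
    """Flatten an array of documents into one list of sentences."""
    return [sentence
            for document in documents
            for sentence in split_sentences(document)]


# Hypothetical input: an array of review texts.
reviews = [
    "Great phone. The battery lasts all day!",
    "Shipping was slow. Would I buy it again? Yes.",
]
for sentence in corpus_sentences(reviews):
    print(sentence)
```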
- First execution on a real dataset
- Better definition of the set of experiments to run and of the metrics
- First measurement of metrics
[TBD]
[TBD]
cd project-folder
python main.py train -incremental --input-dataset ./dataset.txt
cd project-folder
python main.py run ./input-folder
cd project-folder
python main.py visualize ./input-folder