Introduction and general information
The goal of this task is to apply data mining techniques to build a small-scale application that lets its intended end users (i.e., people who benefit from the results produced by data mining algorithms) upload a new data set through a Web interface and mine it with at least one algorithm that was developed or experimented with.
Application description
For this task I developed the "Yelp wizard" application, which lets users explore a dataset of restaurant reviews to find and choose a new restaurant to visit, based on the topics and cuisines that interest them.
Dataset description
For this task we use Yelp's review data set:
- "yelp_academic_dataset_review.json"
- "yelp_academic_dataset_business.json"
with 703508 reviews for 14035 businesses. Each business is linked to a set of categories, such as the type of business, cuisines and so on. Before topic mining, we pre-process these files, keep only the reviews for venues in the "Restaurant" category, and extract the set of cuisines, which gives 239 cuisines in total. By default, the application operates on this whole dataset, but a user can also upload any subset of it for evaluation.
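The pre-processing step described above can be sketched as follows. This is a minimal illustration, not the application's actual code. It assumes that both files contain one JSON object per line, that each record has a `business_id` field, and that businesses have a `categories` field. The function names, the default category label, and the choice to treat every remaining category as a cuisine are assumptions made for the example.

```python
import json


def load_restaurant_cuisines(business_lines, category="Restaurants"):
    """Map business_id -> set of other category labels for restaurants.

    `category` is an assumed label; adjust it to match the data set.
    """
    restaurants = {}
    for line in business_lines:
        line = line.strip()
        if not line:
            continue
        business = json.loads(line)
        categories = business.get("categories") or []
        # Assumption: some releases store categories as one comma-separated string.
        if isinstance(categories, str):
            categories = [c.strip() for c in categories.split(",")]
        if category in categories:
            # Simplification: every remaining label is treated as a cuisine.
            restaurants[business["business_id"]] = set(categories) - {category}
    return restaurants


def filter_reviews(review_lines, restaurant_ids):
    """Yield only the reviews that belong to one of the given restaurants."""
    for line in review_lines:
        line = line.strip()
        if not line:
            continue
        review = json.loads(line)
        if review.get("business_id") in restaurant_ids:
            yield review


# Tiny in-memory sample standing in for the two JSON files.
sample_businesses = [
    '{"business_id": "b1", "categories": ["Restaurants", "Italian"]}',
    '{"business_id": "b2", "categories": ["Dentists"]}',
]
sample_reviews = [
    '{"business_id": "b1", "text": "Great pasta"}',
    '{"business_id": "b2", "text": "Painless visit"}',
]

restaurants = load_restaurant_cuisines(sample_businesses)
reviews = list(filter_reviews(sample_reviews, restaurants))
```

With real data, the sample lists would be replaced by open file handles for the two JSON files, since both functions accept any iterable of lines.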
Functions and goals
The key goal of the application is to let a user find a restaurant of interest using data mining algorithms instead of the standard filters available on the Yelp service. The application therefore has two key functions:
- Topic mining - lets the user apply the LDA algorithm with different parameters to all reviews in the dataset and choose a topic of interest. Based on this choice, the application shows a list of cuisines and a list of restaurants for which the chosen topic has the highest weight. This lets the user choose a restaurant by topic and by the key words of each topic, instead of a normal keyword search.
- Text similarities - lets the user choose a cuisine based on a measure of similarity between cuisines, computed from the review texts. This lets the user explore the set of cuisines and find an interesting one based on the similarity between texts.
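The two functions can be illustrated with a small standard-library sketch. It is not the application's implementation: the topic weights are assumed to come from an already trained LDA model, and cosine similarity over raw word counts is only one possible similarity measure, chosen here as an assumption. All names and sample values are hypothetical.

```python
import math
from collections import Counter


def restaurants_for_topic(topic_weights, chosen_topic):
    """Return the ids whose highest-weighted topic is `chosen_topic`.

    `topic_weights` maps a restaurant id to a list of per-topic weights,
    e.g. as produced by an LDA model (assumed to be computed elsewhere).
    """
    return sorted(
        rid
        for rid, weights in topic_weights.items()
        if max(range(len(weights)), key=weights.__getitem__) == chosen_topic
    )


def cosine_similarity(text_a, text_b):
    """Cosine similarity between two texts using raw word counts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical per-restaurant topic weights for a 3-topic model.
topic_weights = {
    "r1": [0.7, 0.2, 0.1],
    "r2": [0.1, 0.8, 0.1],
    "r3": [0.6, 0.3, 0.1],
}
matches = restaurants_for_topic(topic_weights, chosen_topic=0)

# Hypothetical concatenated review texts per cuisine.
cuisine_texts = {
    "Italian": "pasta pizza wine",
    "Pizza": "pizza cheese pasta",
    "Japanese": "sushi ramen",
}
italian_vs_pizza = cosine_similarity(cuisine_texts["Italian"], cuisine_texts["Pizza"])
italian_vs_japanese = cosine_similarity(cuisine_texts["Italian"], cuisine_texts["Japanese"])
```

In this sample, topic 0 selects "r1" and "r3", and "Italian" scores closer to "Pizza" than to "Japanese" because those two texts share words. A real pipeline would typically tokenize, remove stop words, and weight terms (for example with TF-IDF) before comparing texts.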
Toolkit and libraries
To develop the application, I used:
- Python - as the general-purpose language
- Flask - as the application server
- amCharts - for charts and visualisation
- Bootstrap - for user interface
For data mining I used the following tools and libraries for Python:
- Sklearn - for classification
- Gensim - for text processing
- Numpy - for some additional tools
- NLTK - for text processing
User guide
[https://github.com/denisafanasev/CS598_YelpWizard/blob/master/docs/YelpWizard_User_guide.pdf]
Run
flask run --host=0.0.0.0 &