Ratings Predictor

Zocdoc Potential ratings for reviews

@Author: Santosh Sonawane

NOTE:

I am no longer maintaining this repo. Please feel free to use it and send a PR if you have any improvements. This project was done as part of a Machine Learning course that I attended at General Assembly, NY: https://github.com/ga-students/DAT-23-NYC

This project explores different machine learning techniques to predict whether a user is satisfied, given their review comments. Assumption: a user who gives a rating of 4 or 5 is satisfied.
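The assumption above can be written as a small labelling function. This is only a sketch: the name `is_satisfied` and the idea that ratings arrive as strings (as they would when read from a CSV file) are assumptions, not taken from the project code.

```python
# Threshold taken from the stated assumption: a rating of 4 or 5 means satisfied.
SATISFIED_THRESHOLD = 4


def is_satisfied(overall_rating, threshold=SATISFIED_THRESHOLD):
    """Return True when the overall rating counts as 'satisfied'.

    Hypothetical helper: accepts ints, floats, or numeric strings.
    """
    return float(overall_rating) >= threshold


# Convert a few ratings into binary labels (1 = satisfied, 0 = not satisfied).
labels = [int(is_satisfied(r)) for r in ["5", "4", "3", "1"]]
print(labels)  # [1, 1, 0, 0]
```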

Data is fetched by scraping zocdoc.com for all the primary care physicians in Manhattan.

Requires:

BeautifulSoup DOM parser (only if you want to use the extractor/scraper)
pickle
scikit-learn (Anaconda)
Flask

How to run:

cd SRC/WEB_SERVER
python main.py
This trains the machine learning model using the zocdoc files in DATA and starts a Flask web server.

Directory Structure:

README.md - Self.
DATA Data stored as CSV with the columns DocName,DocDegree,Location,PatientName,ReviewDate,OverallRating,BedsideMannerRating,WaitTimeRating,Review
- zocdoc2.csv 112,000 records
- train.csv 60,000 records
- test.csv 52,000 records
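As a rough illustration of the column layout above, the files could be read with the standard `csv` module. This sketch assumes the files have no header row and that the columns appear in the order listed. The helper `read_reviews` and the two sample rows are made up for the example and do not come from the data set.

```python
import csv
import io

# Column order as listed in this README.
COLUMNS = [
    "DocName", "DocDegree", "Location", "PatientName", "ReviewDate",
    "OverallRating", "BedsideMannerRating", "WaitTimeRating", "Review",
]


def read_reviews(handle, has_header=False):
    """Read review rows from an open file into a list of dicts (hypothetical helper)."""
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # skip the header row if the file has one
    # Keep only rows with the expected number of fields.
    return [dict(zip(COLUMNS, row)) for row in reader if len(row) == len(COLUMNS)]


# Made-up sample rows standing in for DATA/train.csv.
sample = io.StringIO(
    'Dr. Example,MD,"New York, NY",A. Patient,2015-01-01,5,5,4,"Great visit, very kind."\n'
    'Dr. Example,MD,"New York, NY",B. Patient,2015-01-02,2,3,1,Waited too long.\n'
)
rows = read_reviews(sample)
print(rows[0]["OverallRating"], rows[0]["Review"])
```

With a real file, the same function would be called as `read_reviews(open(path, newline=""))`, where `path` points at one of the CSV files in DATA.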
SRC Python and IPython source code
- There are the following components as of now:
  - WebScrapper - Scrapes zocdoc.com and stores every doctor's HTML page.
  - HTMLExtractor - Extracts and cleans up information from the HTML pages and stores it in CSV format.
  - AnalyzerAndPredictor - Analyzes the data for insights and has models for predicting ratings from reviews.
- web_scrapper.py Provides general APIs for scraping the web.
- zoc_docscrapper.py Uses web_scrapper.py to fetch all doctors' pages as HTML.
- html_extractor.py Provides general APIs for extracting data from HTML.
- doctor_reviews_extractor.py Uses html_extractor.py to extract the data and store it in CSV format.
- data_analysis_and_predictions.ipynb IPython Notebook that runs through the data and implements Logistic Regression to predict whether a user is satisfied or not. The data set is skewed towards positives (91% positives), so accuracy and recall are not the best metrics. Accuracy, precision and recall were tested using cross-validation and the input test set, resulting in ~99% accuracy.
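To see why the skew matters, consider a trivial classifier that always predicts "satisfied". The sketch below uses 100 synthetic labels with the same 91% positive share. The labels and the helper `basic_metrics` are illustrative only and are not part of the notebook.

```python
def basic_metrics(y_true, y_pred):
    """Compute accuracy, precision and recall for binary labels (1 = satisfied)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Share of the unsatisfied users that were correctly identified.
    negative_recall = tn / (tn + fp) if (tn + fp) else 0.0
    return accuracy, precision, recall, negative_recall


# Synthetic labels: 91 satisfied users, 9 unsatisfied users.
y_true = [1] * 91 + [0] * 9
# A "model" that ignores the review text and always predicts satisfied.
y_pred = [1] * 100

accuracy, precision, recall, negative_recall = basic_metrics(y_true, y_pred)
print(accuracy, precision, recall, negative_recall)  # 0.91 0.91 1.0 0.0
```

The always-positive baseline reaches 91% accuracy and perfect recall while finding none of the unsatisfied users, so any reported score is best compared against this baseline and read together with the metrics for the negative class.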

Enhancements:

Try other classifiers
Use a TF-IDF vectorizer and trigrams. Try increasing the number of features
Use dimensionality reduction to reduce the features
Try multiclass classification techniques to predict ratings as numbers
Try polynomial features to predict numbers as continuous values (this model is too slow)
Develop a web interface and web server that will take user text and predict the user's happiness
See how this model applies to Yelp reviews, where users give ratings between 1 and 5
Quantify user reviews: we could potentially scrape blog posts about a doctor and predict what rating the author would give
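
As a starting point for the TF-IDF and trigram idea, a scikit-learn pipeline might look like the sketch below. It is not the notebook's code: the toy reviews and labels are invented, and the parameter values (`ngram_range=(1, 3)`, `max_features=50000`, `max_iter=1000`) are assumptions chosen only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Invented toy data: 1 = satisfied (rating 4 or 5), 0 = not satisfied.
reviews = [
    "The doctor was kind and listened carefully",
    "Very friendly staff and a short wait",
    "Great visit, I would come back again",
    "Excellent bedside manner and clear advice",
    "Waited two hours and felt rushed",
    "Rude front desk and a very long wait",
]
labels = [1, 1, 1, 1, 0, 0]

model = Pipeline([
    # Unigrams, bigrams and trigrams, capped at an assumed feature budget.
    ("tfidf", TfidfVectorizer(ngram_range=(1, 3), max_features=50000)),
    # Plain logistic regression, as in the notebook description.
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(reviews, labels)

# Predict the label and class probabilities for an unseen review.
new_review = ["The wait was long but the doctor was kind"]
predictions = model.predict(new_review)
probabilities = model.predict_proba(new_review)
print(predictions, probabilities)
```

On the real data set, the dimensionality-reduction idea could be tried by placing a step such as `TruncatedSVD` between the vectorizer and the classifier. Whether that helps here is untested.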

turbosantosh/MLReviewsToRatings
