This repo is intended for beginners in Python and ML. To view a production-ready application, and better coding see: GitHub: hisia
Couple of talks used this repo as a demo.
- PyData CPH: Talk on Building and Deploying Danish Sentiment Model (26-07-2018) at GiG
- Data Innovation Summit: Talk on Build Sentiment Model in Less Than 30 Minutes (14-03-2019) at Data Innovation Summit
This project is far from being done (mostly the flask apps). It is intended for academic reason only. It is not my fault, if you mess something up on your machine :). There exists typos everywhere, do point them out.
Make sure you have pipenv. If you do not, you can get it via pip install (pip --version has to be >= 9.0.1).
pip install pipenv
Clone this repository, and enter the project folder. Execute pipenv install to install all packages.
git clone https://github.com/Proteusiq/DanishSentiments.git
cd DanishSentiments
pipenv install
pipenv shell
To run the app, naviage to flask_app. and execute:
cd flask_app
python app.py
You are good to go :) Flask app should be running on port 5000. On your browser, head to localhost:5000.
You can train the SGDClassifier by navigating to flask_app folder and run.
cd flask_app
python db_admin.py train
The script will wait for Return Key to execute the code. When Return Key is registered, a simple Stochastic Gradient Descent Classifier would be train. Model score: 92%. Training takes less than 6 minutes on Windows 10, 64bit 16GB RAM.
Training data came from TrustPilot Reviews. I wrote a simple helper function TrustPilotReader, in case you want more training data or wish to train a different language model, e.g. Norwegian Sentiment Model :).
If everything went well, HashVectorizer.pkl and SGDClassifier.pkl would have been modified.
This model updates itself as users affirm or disaffirm the predictions. Database of users inputs stores new features and targets that can be used to train another model or bulk retratining :)
**NB:**This project is under development. To get current version, use:
git pull
- Data Gathering, Exploration and Cleaning(EDA_Sentiment.ipynb)
- Finding simple logit model that is fast and continous-trainable
- Serve the model to the outside world via Flask app and api
- Model continous learning with users interaction.
- Database to store users input for bulk model retraining.
Note: app.py is running on debugging mode. This is to allow changes. If you want to put the model in production, make sure to set debugging to False.
N.B: This project was build in Python 3.6, and uses f-formating, that might cause issues with lower Python version. lower python version will throw:
f'Positive: With {pos_proba:.1%} Probability'
If you want to use lower Python version, just git clone the project and change f-formating string to normal
.format() e.g.
'Positive: With {:.1%} Probability'.format(pos_proba)
- Gather users input for model retraining
- Rewrite the flask_app to actually do what it is suppose to do
- Grid search better parameter for partial_fit models
- A better tokenizer (remove places and peoples names)
- Clean code everywhere :)