Building Danish Sentiment Models

This repo is intended for beginners in Python and ML. To view a production-ready application, and better coding see: GitHub: hisia

Couple of talks used this repo as a demo.

PyData CPH: Talk on Building and Deploying Danish Sentiment Model (26-07-2018) at GiG
Data Innovation Summit: Talk on Build Sentiment Model in Less Than 30 Minutes (14-03-2019) at Data Innovation Summit

Disclaimers

This project is far from being done (mostly the flask apps). It is intended for academic reason only. It is not my fault, if you mess something up on your machine :). There exists typos everywhere, do point them out.

How-tos & Requirement

Make sure you have pipenv. If you do not, you can get it via pip install (pip --version has to be >= 9.0.1).

pip install pipenv

Clone this repository, and enter the project folder. Execute pipenv install to install all packages.

git clone https://github.com/Proteusiq/DanishSentiments.git
cd DanishSentiments
pipenv install
pipenv shell

Model Training & Deployment

To run the app, naviage to flask_app. and execute:

cd flask_app
python app.py

You are good to go :) Flask app should be running on port 5000. On your browser, head to localhost:5000.

You can train the SGDClassifier by navigating to flask_app folder and run.

cd flask_app
python db_admin.py train

The script will wait for Return Key to execute the code. When Return Key is registered, a simple Stochastic Gradient Descent Classifier would be train. Model score: 92%. Training takes less than 6 minutes on Windows 10, 64bit 16GB RAM.

Training data came from TrustPilot Reviews. I wrote a simple helper function TrustPilotReader, in case you want more training data or wish to train a different language model, e.g. Norwegian Sentiment Model :).

If everything went well, HashVectorizer.pkl and SGDClassifier.pkl would have been modified.

This model updates itself as users affirm or disaffirm the predictions. Database of users inputs stores new features and targets that can be used to train another model or bulk retratining :)

**NB:**This project is under development. To get current version, use:

git pull

Structure

Data Gathering, Exploration and Cleaning(EDA_Sentiment.ipynb)
Finding simple logit model that is fast and continous-trainable
Serve the model to the outside world via Flask app and api
Model continous learning with users interaction.
Database to store users input for bulk model retraining.

Note: app.py is running on debugging mode. This is to allow changes. If you want to put the model in production, make sure to set debugging to False.

Pending Documentation ...

N.B: This project was build in Python 3.6, and uses f-formating, that might cause issues with lower Python version. lower python version will throw:

SyntaxError ERROR

f'Positive: With {pos_proba:.1%} Probability'

If you want to use lower Python version, just git clone the project and change f-formating string to normal .format() e.g. 'Positive: With {:.1%} Probability'.format(pos_proba)

TODO:

Gather users input for model retraining
Rewrite the flask_app to actually do what it is suppose to do
Grid search better parameter for partial_fit models
A better tokenizer (remove places and peoples names)
Clean code everywhere :)

Name		Name	Last commit message	Last commit date
Latest commit History 122 Commits
flask_app		flask_app
images		images
models		models
notebooks		notebooks
pybleau		pybleau
.gitignore		.gitignore
HashVectorizer.pkl		HashVectorizer.pkl
LICENSE		LICENSE
Pipfile		Pipfile
Pipfile.lock		Pipfile.lock
README.md		README.md
SGDclassifier.pkl		SGDclassifier.pkl
bert-embeddings.ipynb		bert-embeddings.ipynb
sentiment_data		sentiment_data
stops.pkl		stops.pkl

License

Proteusiq/DanishSentiments

Folders and files

Latest commit

History

Repository files navigation

Building Danish Sentiment Models

Disclaimers

How-tos & Requirement

Model Training & Deployment

Structure

Pending Documentation ...

SyntaxError ERROR

TODO:

About

Resources

License

Stars

Watchers

Forks

Languages