This project was done during my internship at Cyberlink, where I was responsible for analyzing customer service data with machine learning techniques.
During my summer internship, I worked on the following:
- Supervised text classification
- Unsupervised text clustering with different sentence representations and clustering algorithms
  - Sentence representations: Doc2Vec, TF-IDF
  - Clustering algorithms: Kmeans, DBSCAN
- Topic modeling with LDA
- Finding related questions based on BERT
Note: Pictures of the related analysis results are stored in the `analysis/picture/` folder.
The customer service dataset contains around 100,000 customer feedback entries, each with a subject, a question sentence, and a user-selected question type.
I use a simple GRU model to classify sentences into their groups. The input is the question sentence (a string) and the output is its class; the ground truth is the user-selected question type. A minimal sketch of such a classifier is shown below.
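For illustration, here is a minimal PyTorch sketch of a GRU text classifier. This is a toy under assumptions (the vocabulary size, dimensions, and class count are hypothetical), not the actual model defined in `src/modules/net.py`:

```python
import torch
import torch.nn as nn

class GRUClassifier(nn.Module):
    # Embed token ids, run a GRU, classify from the final hidden state
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)     # (batch, seq_len, embed_dim)
        _, last_hidden = self.gru(embedded)      # last_hidden: (1, batch, hidden_dim)
        return self.fc(last_hidden.squeeze(0))   # (batch, num_classes) logits

# Hypothetical sizes: 20k-word vocab, 300-d embeddings, 30 question types
model = GRUClassifier(vocab_size=20000, embed_dim=300, hidden_dim=128, num_classes=30)
logits = model(torch.randint(1, 20000, (4, 25)))  # a batch of 4 padded sentences
```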
- Preprocessing:

  ```
  python ./src/preprocess.py <directory/which/contain/model/config.json> [-e ./path/to/embedding.pkl]
  ```

  The last argument `-e` is optional. Because building `embedding.pkl` takes a long time, you can use this argument to point to an existing `embedding.pkl` and save time. Ex:

  ```
  python ./src/preprocess.py ./model/lstm_model -e ./model/lstm_model/embedding.pkl
  ```
- Training:

  ```
  python ./src/train.py <directory/which/contain/model/config.json>
  ```

  Ex:

  ```
  python ./src/train.py ./model/lstm_model
  ```
- Run tensorboard:

  ```
  tensorboard --logdir tensorboard
  ```
To get more insight into this customer service dataset, I use unsupervised text clustering to see whether it can discover anything interesting.
To cluster text, we first need to build a vector representation of each sentence, so I experiment with two sentence representation methods.
After each sentence is turned into a vector, I use Kmeans and DBSCAN to cluster the data and compare the results; a sketch of the whole pipeline follows the list below.
- Sentence representations: Doc2Vec, TF-IDF
- Clustering algorithms: Kmeans, DBSCAN
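As a rough sketch of this pipeline (the sentences and parameters are toy stand-ins, not the notebooks' actual settings, and the `dv` attribute assumes the gensim 4.x API), both representations can be built and fed to either clustering algorithm:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, DBSCAN
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Stand-in for the ~100k customer question sentences
sentences = [
    "my video will not export",
    "how do I renew my license",
    "the app crashes on startup",
    "export fails with an error code",
]

# --- TF-IDF representation: sparse bag-of-words vectors ---
tfidf_vecs = TfidfVectorizer(stop_words="english").fit_transform(sentences)

# --- Doc2Vec representation: dense learned vectors per sentence ---
tagged = [TaggedDocument(s.lower().split(), [i]) for i, s in enumerate(sentences)]
d2v = Doc2Vec(tagged, vector_size=50, min_count=1, epochs=20)
d2v_vecs = np.array([d2v.dv[i] for i in range(len(sentences))])

# --- Clustering ---
# KMeans needs the cluster count up front (2 here only for the toy data)
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(tfidf_vecs)
# DBSCAN infers the cluster count from density; label -1 marks noise
dbscan_labels = DBSCAN(eps=0.9, min_samples=2, metric="cosine").fit_predict(tfidf_vecs)
# The Doc2Vec vectors can be clustered the same way
d2v_kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(d2v_vecs)
```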
Note: The graphs below show the clustering results using the Doc2Vec sentence representation with the Kmeans clustering algorithm.
I also use LDA, a generative probabilistic model, to discover latent topics. The results below are visualized with pyLDAvis. A minimal sketch of this step is shown next.
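Here is a minimal gensim sketch of the topic modeling step, with hypothetical toy documents and topic count (the real experiment lives in `analysis/Topic Model.ipynb`):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Stand-in tokenized customer questions
tokenized = [
    ["video", "export", "fails"],
    ["renew", "license", "subscription"],
    ["app", "crashes", "startup"],
]
dictionary = Dictionary(tokenized)
corpus = [dictionary.doc2bow(doc) for doc in tokenized]

# 3 topics fits the toy data; the real topic count must be tuned per dataset
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, passes=10, random_state=0)
for topic_id, words in lda.print_topics():
    print(topic_id, words)

# Visualization as in the project (module name assumes pyLDAvis >= 3.x):
# import pyLDAvis.gensim_models
# pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
```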
If we can find a related question and its corresponding answer, we can return the most similar answer to the customer before they submit their question.
In this way, we can solve the customer's problem more quickly and also reduce the number of repeated questions that need a human to answer.
Here, we use BERT, the state-of-the-art NLP model of 2018, to build sentence representations, and simply use cosine similarity to find related questions (a minimal sketch is shown below). Model Code
The above method is very naive and has plenty of room for improvement, so we simply treat it as a baseline to gauge its potential. Performance Analysis
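The following is a minimal sketch of this retrieval idea using the Hugging Face `transformers` library. Mean pooling over `bert-base-uncased` is an assumption for illustration, not necessarily the project's exact setup (see the Related Question notebooks for that):

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    # Mean-pool BERT's last hidden states, ignoring padding tokens
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**batch).last_hidden_state          # (batch, seq, 768)
    mask = batch["attention_mask"].unsqueeze(-1)          # (batch, seq, 1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Hypothetical question bank and incoming query
past_questions = ["how do I renew my license", "video export fails with an error"]
query = ["my export keeps failing"]

# The most similar past question is the candidate for an instant answer
scores = cosine_similarity(embed(query), embed(past_questions))[0]
print(past_questions[scores.argmax()])
```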
Cyberlink-Intern/
├── analysis
│ ├── Customer Service Data Clustering.ipynb
│ ├── data_analysis.ipynb
│ ├── Data Preprocessing.ipynb
│ ├── LSTM Predict Confusion Matrix.ipynb
│ ├── picture
│ │ ├── ConfusionMatrix_2.jpg
│ │ ├── Confusion Matrix.jpg
│ │ ├── ConfusionMatrix.jpg
│ │ ├── ConfusionMatrix_newdata.jpg
│ │ ├── Confusion Matrix.png
│ │ ├── ConfusionMatrix_RelatedQuestion.png
│ │ ├── data_distribution.jpg
│ │ ├── data_distribution_ordered.jpg
│ │ ├── Normalized_ConfusionMatrix_2.jpg
│ │ ├── Normalized_ConfusionMatrix_newdata.jpg
│ │ └── Normalized_ConfusionMatrix_RelatedQuestion.png
│ ├── Related Question Analysis.ipynb
│ ├── Related Question.ipynb
│ └── Topic Model.ipynb
├── data
│ └── emptydata.xlsx
├── model
│ └── lstm_model
│ ├── config.json
│ └── log.json
├── README.md
├── requirements.txt
├── src
│ ├── callbacks.py
│ ├── metric.py
│ ├── modules
│ │ └── net.py
│ ├── mypredictor.py
│ ├── predict.py
│ ├── preprocess.py
│ └── train.py
└── tensorboard
└── lstm_model