hsinlichu/Customer-Service-Data-Analysis-with-Machine-Learning-Technique

Customer Service Data Analysis with Machine Learning Technique

This project was done during my internship at Cyberlink, where I was responsible for analyzing customer service data with machine learning techniques.

During my summer internship, I worked on the following:

  1. Supervised text classification

  2. Unsupervised text clustering with different sentence representations and clustering algorithms.

    Sentence representation

    • Doc2Vec
    • TF-IDF

    Clustering Algorithm

    • Kmeans
    • DBSCAN
  3. Topic Modeling - LDA

  4. Find related questions based on BERT.

Note: Pictures of the related analysis results are stored in the analysis/picture/ folder.

Dataset

The customer service dataset contains around 100,000 customer feedback entries, each with a subject, a question sentence, and a user-selected question type.

Supervised text classification

I use a simple GRU model to classify each question sentence into its class. The input is the question string and the output is its predicted class. The ground truth is the user-selected question type.
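To make the idea concrete, here is a minimal sketch of a GRU forward pass followed by a softmax over classes, written with the Python standard library only. It is not the code in src/modules/net.py: the class name TinyGRUClassifier, all dimensions, and the token ids are hypothetical, the weights are random and untrained, and bias terms are omitted for brevity.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class TinyGRUClassifier:
    """Hypothetical, untrained GRU text classifier (illustration only)."""

    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        self.embedding = rand_matrix(vocab_size, embed_dim, rng)
        # Input and recurrent weights for the update (z), reset (r),
        # and candidate (n) parts of the cell.
        self.W = {g: rand_matrix(hidden_dim, embed_dim, rng) for g in "zrn"}
        self.U = {g: rand_matrix(hidden_dim, hidden_dim, rng) for g in "zrn"}
        self.out = rand_matrix(num_classes, hidden_dim, rng)

    def step(self, x, h):
        """One GRU time step: returns the new hidden state."""
        wx = {g: matvec(self.W[g], x) for g in "zrn"}
        uh = {g: matvec(self.U[g], h) for g in "zrn"}
        z = [sigmoid(a + b) for a, b in zip(wx["z"], uh["z"])]
        r = [sigmoid(a + b) for a, b in zip(wx["r"], uh["r"])]
        n = [math.tanh(a + ri * b) for a, ri, b in zip(wx["n"], r, uh["n"])]
        # Blend the candidate state with the previous state.
        return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

    def predict_proba(self, token_ids):
        """Run the GRU over the tokens and return class probabilities."""
        h = [0.0] * self.hidden_dim
        for token in token_ids:
            h = self.step(self.embedding[token], h)
        logits = matvec(self.out, h)
        top = max(logits)
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        return [e / total for e in exps]


model = TinyGRUClassifier(vocab_size=10, embed_dim=4, hidden_dim=6, num_classes=3)
probs = model.predict_proba([1, 5, 2, 7])  # hypothetical token ids
predicted_class = probs.index(max(probs))
```

In a real setup the weights would be learned by minimizing cross-entropy against the user-selected question type, and the token ids would come from the preprocessing step.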

How to run

  • Preprocessing: python ./src/preprocess.py <directory/which/contain/model/config.json> [-e ./path/to/embedding.pkl]

    The last argument, -e, is optional. Because building embedding.pkl takes a long time, you can pass this argument with the path to an existing embedding.pkl to save time.

    Ex: python ./src/preprocess.py ./model/lstm_model -e ./model/lstm_model/embedding.pkl

  • Training: python ./src/train.py <directory/which/contain/model/config.json>

    Ex: python ./src/train.py ./model/lstm_model

  • Run tensorboard: tensorboard --logdir tensorboard

Model

Unsupervised text clustering

To gain more insight into this customer service dataset, I use unsupervised text clustering to see whether it can discover anything interesting.

To cluster text, we first need to build a vector that represents each sentence. I therefore experiment with two sentence representation methods.

After turning each sentence into a vector, I use Kmeans and DBSCAN to cluster the data and compare the results.

Sentence representation

  • Doc2Vec
  • TF-IDF

Clustering Algorithm

  • Kmeans
  • DBSCAN
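The vectorize-then-cluster pipeline described above can be sketched as follows, using TF-IDF vectors and Kmeans with the standard library only. This is an illustration, not the notebook's code: the example sentences are made up, the helper names are hypothetical, and the farthest-first centroid initialization is an assumption chosen to keep the example deterministic.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn whitespace-tokenized sentences into L2-normalized TF-IDF vectors."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({word for doc in tokenized for word in doc})
    doc_freq = Counter(word for doc in tokenized for word in set(doc))
    idf = {w: math.log(len(tokenized) / doc_freq[w]) + 1.0 for w in vocab}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vec = [tf[w] / len(doc) * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain Kmeans with farthest-first initialization (an assumption)."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        # Next centroid: the point farthest from all centroids chosen so far.
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Made-up sentences standing in for customer questions.
sentences = [
    "cannot install the software on windows",
    "install fails on windows with error",
    "refund request for my order",
    "how do i get a refund for my order",
]
labels = kmeans(tfidf_vectors(sentences), k=2)
```

Swapping tfidf_vectors for Doc2Vec vectors, or kmeans for DBSCAN, leaves the rest of the pipeline unchanged, which is what makes the comparison between representations and algorithms straightforward.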

Result and code

Note: The graphs below show the clustering result using the Doc2Vec sentence representation with the Kmeans clustering algorithm.

Topic Modeling - LDA

I also use LDA, a generative probabilistic model, to uncover latent topics. The results below are visualized with pyLDAvis.
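For readers unfamiliar with LDA, the following is a minimal sketch of fitting it by collapsed Gibbs sampling, using the standard library only. It is not the code in Topic Model.ipynb: the function name lda_gibbs, the hyperparameters alpha and beta, and the tiny tokenized documents are all assumptions made for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over pre-tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]              # topic counts per doc
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                            # tokens per topic
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        topics = []
        for word in doc:
            z = rng.randrange(num_topics)
            topics.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[word]] += 1
            topic_total[z] += 1
        assignments.append(topics)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w = word_id[word]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed per-document topic distributions.
    theta = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    return theta, topic_word, vocab


# Made-up tokenized documents standing in for customer questions.
docs = [
    ["install", "error", "windows", "install"],
    ["refund", "order", "payment", "refund"],
    ["windows", "install", "crash"],
    ["payment", "order", "refund"],
]
theta, topic_word, vocab = lda_gibbs(docs, num_topics=2)
```

Each row of theta is one document's mixture over topics, and each row of topic_word holds the word counts from which a topic's most characteristic words can be read off.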

Find related questions based on BERT.

If we can find a related question and its corresponding answer, we can return the most similar one to the customer before the customer submits the question.

In this way, we can solve the customer's problem more quickly and reduce the number of repeated questions that need a human to answer.

Here, we use BERT, a state-of-the-art NLP model in 2018, to build sentence representations, and simply use cosine similarity to find related questions. Model Code

The method above is very naive and leaves a lot of room for improvement, so it serves simply as a baseline to show its potential. Performance Analysis
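The cosine-similarity lookup at the heart of this baseline can be sketched as below. The three-dimensional vectors are hypothetical stand-ins for BERT sentence embeddings (real ones have hundreds of dimensions), and the function names are illustrative rather than taken from the notebooks.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_related(query_vec, question_vecs, top_k=3):
    """Return (index, similarity) pairs for the top_k most similar questions."""
    scored = [
        (index, cosine_similarity(query_vec, vec))
        for index, vec in enumerate(question_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings of previously answered questions.
stored_embeddings = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.8, 0.3, 0.1],
]
# Hypothetical embedding of the question the customer is typing.
query_embedding = [1.0, 0.2, 0.0]

ranked = most_related(query_embedding, stored_embeddings, top_k=2)
related_indices = [index for index, _ in ranked]  # -> [0, 2]
```

Because cosine similarity ignores vector length, it compares the direction of two embeddings only, which is a common default for sentence vectors.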

File Tree

Cyberlink-Intern/
├── analysis
│   ├── Customer Service Data Clustering.ipynb
│   ├── data_analysis.ipynb
│   ├── Data Preprocessing.ipynb
│   ├── LSTM Predict Confusion Matrix.ipynb
│   ├── picture
│   │   ├── ConfusionMatrix_2.jpg
│   │   ├── Confusion Matrix.jpg
│   │   ├── ConfusionMatrix.jpg
│   │   ├── ConfusionMatrix_newdata.jpg
│   │   ├── Confusion Matrix.png
│   │   ├── ConfusionMatrix_RelatedQuestion.png
│   │   ├── data_distribution.jpg
│   │   ├── data_distribution_ordered.jpg
│   │   ├── Normalized_ConfusionMatrix_2.jpg
│   │   ├── Normalized_ConfusionMatrix_newdata.jpg
│   │   └── Normalized_ConfusionMatrix_RelatedQuestion.png
│   ├── Related Question Analysis.ipynb
│   ├── Related Question.ipynb
│   └── Topic Model.ipynb
├── data
│   └── emptydata.xlsx
├── model
│   └── lstm_model
│       ├── config.json
│       └── log.json
├── README.md
├── requirements.txt
├── src
│   ├── callbacks.py
│   ├── metric.py
│   ├── modules
│   │   └── net.py
│   ├── mypredictor.py
│   ├── predict.py
│   ├── preprocess.py
│   └── train.py
└── tensorboard
    └── lstm_model
