steward-sentiment-anlysis-bot

Sentiment analysis and NER based on Google BERT. The NER code lives under /ner (see documentation here for its usage guide). Model training and deployment are done with AWS SageMaker.

Data

The data used by this solution has two parts: the pretrained model and the sentiment analysis dataset used for fine-tuning.

  • The base model is the pretrained BERT model released by Google; model download link
  • The sentiment analysis dataset consists of short-text Sina Weibo comments, about 100,000 of them; the public data is already labeled with positive/negative sentiment
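The fine-tuning data is read by the task processor in run_classifier.py; the exact file layout is not spelled out here, so the following is only a rough sketch that assumes a tab-separated file of label and text columns (the path, file name, and column order are assumptions, not fixed by this repository):

import csv

# Hypothetical layout: one labeled Weibo comment per line, "label<TAB>text",
# where label is 1 (positive) or 0 (negative).
def load_sentiment_tsv(path="data/train.tsv"):
    examples = []
    with open(path, encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) != 2:
                continue  # skip malformed lines
            label, text = row
            examples.append((text, int(label)))
    return examples

examples = load_sentiment_tsv()
print(len(examples), examples[0])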

Model

The model is an end-to-end binary classifier; model paper

Google released BERT in 2018, and it substantially outperforms the models used before it. Training BERT from scratch, however, is extremely expensive; for most users there is no need to train it yourself — loading the pretrained model is enough to complete the downstream task.

Features

  • CPU/GPU Support
  • Multi-GPU Support: tf.distribute.MirroredStrategy is used for multi-GPU support; it mirrors variables so training can be distributed across multiple devices and machines. The maximum batch_size per GPU is roughly the same as for single-GPU BERT, so the global batch_size scales with the number of GPUs (see the arithmetic sketch after this list).
    • Assume: num_train_examples = 32000
    • Situation 1 (multi-gpu): train_batch_size = 8, num_gpu_cores = 4, num_train_epochs = 1
      • global_batch_size = train_batch_size * num_gpu_cores = 32
      • iteration_steps = num_train_examples * num_train_epochs / train_batch_size = 4000
    • Situation 2 (single-gpu): train_batch_size = 32, num_gpu_cores = 1, num_train_epochs = 4
      • global_batch_size = train_batch_size * num_gpu_cores = 32
      • iteration_steps = num_train_examples * num_train_epochs / train_batch_size = 4000
    • With synchronous gradient updates, the training result of situation 1 is equivalent to that of situation 2.
  • SavedModel Support
  • SageMaker Training/Deploy Support
  • TF Serving Support - SavedModel Export
  • Custom Loss Support for Unbalanced Datasets
  • Multi-Class Support
  • Multi-Label Support
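As a quick check of the batch-size and step arithmetic in the multi-GPU bullet above (this only restates the formulas from the list; it is not code from this repository):

def training_steps(num_train_examples, train_batch_size, num_gpu_cores, num_train_epochs):
    # Formulas as stated in the Features list above.
    global_batch_size = train_batch_size * num_gpu_cores
    iteration_steps = num_train_examples * num_train_epochs // train_batch_size
    return global_batch_size, iteration_steps

num_train_examples = 32000

# Situation 1 (multi-GPU): per-GPU batch 8 on 4 GPUs, 1 epoch.
print(training_steps(num_train_examples, 8, 4, 1))   # (32, 4000)

# Situation 2 (single GPU): batch 32 on 1 GPU, 4 epochs.
print(training_steps(num_train_examples, 32, 1, 4))  # (32, 4000)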

Dependencies

Quick Start Guide

Train

Steps for training with SageMaker BYOC (bring your own container):

  • Download the pretrained model into the ./source/bert/pretrain_model directory (the model is about 364.20 MB):
wget -P ./source/bert/pretrain_model https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip
cd ./source/bert/pretrain_model
unzip chinese_L-12_H-768_A-12.zip 

run binary classification


source activate tensorflow_p36
export BERT_BASE_DIR=./bert/pretrain_model/chinese_L-12_H-768_A-12


nohup python -u ./bert/run_classifier.py \
  --data_dir='../data' \
  --task_name='chnsenticorp' \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --output_dir=./output/ \
  --do_train=true \
  --do_eval=true \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=200 \
  --train_batch_size=16 \
  --learning_rate=5e-5 \
  --num_train_epochs=5.0 \
  --save_checkpoints_steps=100 \
  --weight_list='1,1' > train.log 2>&1 &

A shell script is also available (see shell_scripts/run_two_classifier.sh).

run multi-class classification

Here we use a three-class example; you can change the number of classes by defining your own processor class (see the processor sketch after this example).


source activate tensorflow_p36
export BERT_BASE_DIR=./bert/pretrain_model/chinese_L-12_H-768_A-12


nohup python -u ./bert/run_classifier.py \
  --data_dir='../data' \
  --task_name='GTProcessor' \
  --vocab_file=$BERT_BASE_DIR/vocab.txt \
  --bert_config_file=$BERT_BASE_DIR/bert_config.json \
  --output_dir=./output/ \
  --do_train=true \
  --do_eval=true \
  --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
  --max_seq_length=200 \
  --train_batch_size=16 \
  --learning_rate=5e-5 \
  --num_train_epochs=1.0 \
  --save_checkpoints_steps=100 \
  --weight_list='1,1,1' > train.log 2>&1 &
  

A shell script is also available (see shell_scripts/run_all.sh).
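To use a different label set than the three-class example, the usual approach with BERT's run_classifier.py is to add your own processor class and register it under a new --task_name. The sketch below assumes the DataProcessor / InputExample interface from the upstream google-research/bert run_classifier.py; the class name and file layout are illustrative, not the ones in this repository:

# Hypothetical processor to add inside bert/run_classifier.py; DataProcessor,
# InputExample, and tokenization are already defined/imported in that file.
import os

class ThreeClassProcessor(DataProcessor):
    """Reads tab-separated 'label<TAB>text' files with labels 0, 1, 2."""

    def get_train_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

    def get_dev_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, "dev.tsv")), "dev")

    def get_labels(self):
        # One entry per class; pass a matching --weight_list such as '1,1,1'.
        return ["0", "1", "2"]

    def _create_examples(self, lines, set_type):
        examples = []
        for i, line in enumerate(lines):
            guid = "%s-%d" % (set_type, i)
            label = tokenization.convert_to_unicode(line[0])
            text = tokenization.convert_to_unicode(line[1])
            examples.append(InputExample(guid=guid, text_a=text, text_b=None, label=label))
        return examples

After adding the class, register it in the processors dict in main() and pass the new --task_name together with a --weight_list of matching length.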

run multi-gpu classification

Here we use a three-class example; you can change the number of classes in the same way as above.


source activate tensorflow_p36
export BERT_BASE_DIR=./bert/pretrain_model/chinese_L-12_H-768_A-12

nohup python -u ./bert/run_custom_classifier.py \
    --task_name='gt' \
    --do_lower_case=true \
    --do_train=true \
    --do_eval=true \
    --do_predict=true \
    --save_for_serving=true \
    --data_dir='../data' \
    --vocab_file=$BERT_BASE_DIR/vocab.txt \
    --bert_config_file=$BERT_BASE_DIR/bert_config.json \
    --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
    --max_seq_length=128 \
    --train_batch_size=32 \
    --learning_rate=2e-5 \
    --num_train_epochs=1.0 \
    --use_gpu=true \
    --num_gpu_cores=4 \
    --use_fp16=false \
    --output_dir='./outputs' > train.log 2>&1 &
  

A shell script is also available (see shell_scripts/run_multi_gpu.sh).

  • Build the training/inference image from the Dockerfile and push it to ECR; note that you need to switch to the project root first:
cd ./source
sh build_and_push.sh bert-sentiment-anylsis
  • Launch the training job from source/bert/tensorflow_bring_your_own.ipynb; the resulting model artifacts are saved to S3.

At this point you can see the corresponding Training Job in your SageMaker console.
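The notebook essentially wraps the SageMaker Python SDK Estimator around the image pushed to ECR in the previous step. A stripped-down, hypothetical version of that call (the image URI, bucket names, and instance type are placeholders; the image_uri argument name follows SageMaker Python SDK v2):

import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # valid inside a SageMaker notebook instance

estimator = Estimator(
    image_uri="<account-id>.dkr.ecr.<region>.amazonaws.com/bert-sentiment-anylsis:latest",
    role=role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path="s3://<your-bucket>/model",  # model.tar.gz ends up here
    sagemaker_session=session,
)

# The training channel is mounted in the container under /opt/ml/input/data/training.
estimator.fit({"training": "s3://<your-bucket>/data"})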

Deploy

  • Use EndpointDeploy.py to create an endpoint from the model artifacts, the Docker image, and source/bert/run_classifier.py:
cd ./source
python EndpointDeploy.py \
--ecr_image_path="847380964353.dkr.ecr.us-east-1.amazonaws.com/bert-sentiment-anylsis:latest" \
--model_s3_path="s3://sagemaker-us-east-1-847380964353/model/model.tar.gz" \
--instance_type="ml.m4.xlarge"

At this point you can see the corresponding endpoint in your SageMaker console.
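Once the endpoint is in service it can be called through the SageMaker runtime API. The sketch below uses boto3; the endpoint name and the request/response payload format are assumptions and must match how the serving code under source/bert/ parses requests:

import json
import boto3

runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")

# Endpoint name and payload schema are placeholders; adjust to your deployment.
response = runtime.invoke_endpoint(
    EndpointName="bert-sentiment-anylsis-endpoint",
    ContentType="application/json",
    Body=json.dumps({"data": ["这个产品真的很好用", "物流太慢了,体验很差"]}),
)

print(json.loads(response["Body"].read().decode("utf-8")))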

Bot - a bot deployed with Docker

The bot consists of a Dockerfile, the task.py script, and the related dependencies; the directory layout is as follows:

bot--|--dependency--|--extract_features.py
     |              |--modeling.py
     |              |--tokenization.py
     |              |--vocab.txt
     |--Dockerfile
     |--task.py (main entry point)

Run the following commands on any EC2 instance to build the Docker image and run the corresponding bot task:

cd ./bot 
docker build -t ${DOCKER_IMAGE_NAME} .
docker run ${DOCKER_IMAGE_NAME}
