This repository implements data preprocessing and machine learning models to optimize the PBS Scheduler.
NCAR Workload: Cheyenne, April 2019 (M2, M3, M4)
Benchmark workloads from http://www.cs.huji.ac.il/labs/parallel/workload/logs.html
Cornell Theory Center IBM SP2 (CTC)
Swedish Royal Institute of Technology IBM SP2 (KTH)
San Diego Supercomputer Center Blue Horizon (SDSC)
Python 3.6.2
Numpy
Pandas
Pytorch 1.0.1
Scikit-learn 0.21.1
Tensorflow 1.13.1
To parse and combine raw accounting logs into CSV, run the following command:
python3 pbs_parser_seaprate_v3.py accounting outloc
where
accounting : directory with accounting logs (previous historical PBS data)
outloc : output directory with combined accounting logs (in csv format)
To extract additional features from the accounting logs, run the following command:
python3 process_data.py inloc outloc
where
inloc : directory with combined accounting logs (from pbs parser step)
outloc : output directory with added features
To filter suspicious jobs from the data, run the following command:
python3 filter_data.py inloc outloc
where
inloc : directory with combined accounting logs (from the previous process_data step)
outloc : output directory with filtered data
To split data into different weeks, run the following command:
python3 weekly_split.py pre_data_loc post_data_loc
where
pre_data_loc : directory containing pre-processed CSV data
post_data_loc : directory containing post-processed CSV data
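As an illustration, the weekly split can be sketched with pandas by grouping records on the ISO week number. This is a minimal sketch, not the actual implementation of weekly_split.py; the "start" column name is an assumption for illustration.

```python
import pandas as pd

# Minimal sketch: group job records by ISO week number.
# The "start" column name is an assumption, not the script's real schema.
df = pd.DataFrame({
    "start": pd.to_datetime(["2019-04-01", "2019-04-03", "2019-04-10"]),
    "jobid": [1, 2, 3],
})
for week, group in df.groupby(df["start"].dt.isocalendar().week):
    # in weekly_split.py each group would be written to its own CSV
    # under post_data_loc
    print(int(week), len(group))
```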
The scripts in this section produce charts for visualizing PBS accounting data. The analysis computes each job's misprediction (requested walltime minus actual runtime) and divides jobs into 6 bins (0-15 minutes, 15min-1h, 1h-3h, 3h-7h, >=7h, underprediction).
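The binning rule can be sketched as a plain Python helper. This is a minimal illustration of the 6 bins, not the exact implementation used by the plotting scripts.

```python
def bin_misprediction(requested_s, actual_s):
    """Assign a job to one of the 6 misprediction bins.

    mispred = requested walltime - actual runtime, both in seconds.
    A negative value means the user underpredicted the runtime.
    """
    mispred = requested_s - actual_s
    if mispred < 0:
        return "underprediction"
    hours = mispred / 3600.0
    if hours < 0.25:
        return "0-15min"
    if hours < 1:
        return "15min-1h"
    if hours < 3:
        return "1h-3h"
    if hours < 7:
        return "3h-7h"
    return ">=7h"
```

For example, a job that requested 2 hours but ran only 15 minutes has a 1.75-hour misprediction and falls in the 1h-3h bin.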
This script produces a breakdown of users' mispredictions by a user-selected feature. Run the following command to execute the script:
python3 create_feature_plots.py data_path feature
where
data_path : directory containing CSV data
feature : the feature to break down by (e.g. queue, resources_used.walltime, etc.)
A successful run produces 2 types of plots:
1. Pie plot: for an individual period dissected by the feature (e.g. Monday through Friday if the feature is day_week)
2. Stacked column chart: for all periods dissected by the feature
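The two plot types can be sketched with matplotlib. The values below are illustrative only, and matplotlib is assumed to be installed alongside the listed dependencies.

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Illustrative bin shares for one period (values are made up).
labels = ["0-15min", "15min-1h", "1h-3h", "3h-7h", ">=7h", "underprediction"]
shares = [30, 25, 15, 10, 5, 15]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.pie(shares, labels=labels)  # pie plot for one period
# Stacked column chart across two periods (e.g. Monday, Tuesday).
ax2.bar(["Mon", "Tue"], [20, 25], label=labels[0])
ax2.bar(["Mon", "Tue"], [10, 15], bottom=[20, 25], label=labels[1])
ax2.legend()
fig.savefig("example_plots.png")
```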
This script extends the previous script with two-level feature filters. Currently, feature2 acts as the outer filter and feature1 as the inner filter:
python3 create_multi-index_feature_plots.py feature1 feature2
where feature1 and feature2 stand for the two different features (e.g. queue, resources_used.walltime, etc.)
A successful run produces 2 types of plots:
1. Pie plot: Contribution of each feature2 with more than 2% of job volume
2. Stacked column chart: for each feature2 with components of feature1
(See example in img directory)
Current list of supported plotting features:
time_day
week_month
day_week
user
account
queue
To generate a two-dimensional plot analysis (by two features), run the following command:
python3 overall_plot.py --multi_dim_plot=True --multi_dim_x_feature='mispred_ratio' --multi_dim_y_feature='num_jobs' --num_top=10 --groupby_val='account' --data_path='../apr_2019_full/'
Argument parameters:
--multi_dim_plot whether to plot multi_dim or not, default: False
--multi_dim_x_feature feature on the x_axis, default: mispred_ratio
--multi_dim_y_feature feature on the y_axis, default: num_jobs
--num_top keep the top users ranked by the product of the two specified dimensions, default: 0 (no filter)
--groupby_val data point representation (i.e. user/ account), default: user
--data_path location storing parsed accounting csv of interest
To generate the overall plot, run the following command:
python3 overall_plot.py --data_path='../apr_2019_full/' --groupby_val='user' --overall_distr_plot=True --overall_feature='user_mispred'
Argument parameters:
--data_path location storing parsed accounting csv of interest
--groupby_val data point representation (i.e. user/ account), default: user
--overall_distr_plot whether to plot overall distribution or not
--overall_feature feature to plot by
Current list of supported plotting features:
Resource_List.walltime
resources_used.walltime
resources_used.cput
user_mispred (user-predicted running time - actual running time)
mispred_ratio (misprediction divided by user-predicted running time)
mispred_ratio_runtime (misprediction divided by actual running time)
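The three misprediction features reduce to simple differences and ratios; a minimal sketch (walltimes in seconds, not taken from the repository's code):

```python
def mispred_features(requested_s, actual_s):
    """Compute the three misprediction features for one job.

    requested_s: user-requested walltime (Resource_List.walltime), seconds
    actual_s:    actual runtime (resources_used.walltime), seconds
    """
    user_mispred = requested_s - actual_s              # user_mispred
    mispred_ratio = user_mispred / requested_s         # over requested time
    mispred_ratio_runtime = user_mispred / actual_s    # over actual runtime
    return user_mispred, mispred_ratio, mispred_ratio_runtime
```

For a job that requested 2 hours and ran 1 hour, this gives (3600, 0.5, 1.0).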
Current list of supported groupby_val features:
user
account
To train either a feed-forward network (keyword: ff), a bi-directional long short-term memory network (Bi-LSTM) (keyword: rnn), a convolutional neural network (keyword: cnn), or a residual neural network (keyword: resnet), follow the example below (shown for the feed-forward network):
python3 train.py --batch_size=64 --num_epochs=100 --hidden_size=128 --ckpt=False --train_path='../training_small/' --ckpt_path='../best_ff_model/' --test_path='../testing_small/' --model_type='ff' --dropout=0.8 --device='cuda:0' --old=True
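The feed-forward variant can be sketched in PyTorch. This is an illustrative model only: hidden_size=128, dropout=0.8, and batch_size=64 mirror the example flags, while the input feature count is an assumption.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Minimal feed-forward runtime predictor (illustrative sketch)."""

    def __init__(self, in_features=16, hidden_size=128, dropout=0.8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden_size),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(hidden_size, 1),  # predicted runtime
        )

    def forward(self, x):
        return self.net(x)

model = FeedForward()
out = model(torch.randn(64, 16))  # one batch of batch_size=64
```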
To train the default model proposed by "Machine Learning Predictions for Underestimation of Job Runtime on HPC System" (Guo, Nomura, Barton, Zhang, and Matsuoka, 2018), run:
python3 rf_xgboost.py train_path test_path rf_report xgb_report old
where
train_path, test_path : directories containing CSVs of training and testing data
rf_report, xgb_report : directories for the summary results of RF and XGBoost respectively
old: whether the data comes from benchmark workload or not (True: benchmark workload, False: NCAR workload)
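As a toy illustration of the random-forest half of this baseline, using scikit-learn: the data and label rule below are synthetic stand-ins, not the features or target used by rf_xgboost.py.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: predict whether a job's requested walltime
# underestimates its true runtime (the label here is a made-up rule).
rng = np.random.default_rng(0)
X = rng.random((200, 4))             # illustrative job features
y = (X[:, 0] < 0.5).astype(int)      # synthetic "underestimated" label
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
train_acc = clf.score(X, y)
```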
Implementation of the proposed domain adaptation model from "Unsupervised Domain Adaptation by Backpropagation" (Ganin and Lempitsky, 2015):
python3 train_transfer.py --batch_size=32 --num_epochs=200 --hidden_size=128 --ckpt=False --train_path='../training_small/' --ckpt_path='../best_dann_model/' --test_path='../testing_small/' --model_type='dann' --dropout=0.8 --device='cuda:0'
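At the core of the DANN model is the gradient-reversal layer; a minimal PyTorch sketch of that layer alone (the lambda factor is the paper's adaptation weight, and this is not the repository's implementation):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negated, scaled gradient backward."""

    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing to the feature extractor.
        return -ctx.lambd * grad_output, None

x = torch.ones(3, requires_grad=True)
GradReverse.apply(x, 0.5).sum().backward()
# x.grad is now -0.5 everywhere
```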
Implementation of the proposed domain adaptation model from "Learning Transferable Features with Deep Adaptation Networks" (Long, Cao, Wang, Jordan, 2015)
python3 train_transfer.py --batch_size=32 --num_epochs=200 --hidden_size=128 --ckpt=False --train_path='../training_small/' --ckpt_path='../best_dan_model/' --test_path='../testing_small/' --model_type='dan' --dropout=0.8 --device='cuda:0'
Implementation based on the proposed model from "Joint Domain Alignment and Discriminative Feature Learning for Unsupervised Deep Domain Adaptation" (Chen et al., 2018):
python3 train_transfer.py --batch_size=32 --num_epochs=200 --hidden_size=128 --ckpt=False --train_path='../training_small/' --ckpt_path='../best_dan_model/' --test_path='../testing_small/' --model_type='dan' --dropout=0.8 --device='cuda:0' --old=True
Transfer results between NCAR monthly workloads (source --> target):

Methods | RF | XGBoost | FC | Bi-LSTM | CNN | DCORAL |
---|---|---|---|---|---|---|
M2-->M3 | 0.467 | 0.054 | 0.389 | 0.357 | 0.412 | 0.875 |
M3-->M2 | 0.396 | 0.295 | 0.591 | 0.610 | 0.211 | 0.365 |
M2-->M4 | 0.258 | 0.041 | 0.105 | 0.129 | 0.295 | 0.398 |
M4-->M2 | 0.372 | 0.326 | 0.498 | 0.456 | 0.349 | 0.383 |
M3-->M4 | 0.849 | 0.377 | 0.772 | 0.882 | 0.129 | 0.698 |
M4-->M3 | 0.747 | 0.277 | 0.835 | 0.873 | 0.505 | 0.735 |
Average | 0.515 | 0.228 | 0.532 | 0.551 | 0.317 | 0.576 |
Transfer results between benchmark workloads (source --> target):

Methods | RF | XGBoost | FC | Bi-LSTM | CNN | DCORAL |
---|---|---|---|---|---|---|
CTC-->KTH | 0.254 | 0.089 | 0.191 | 0.245 | 0.193 | 0.385 |
KTH-->CTC | 0.462 | 0.118 | 0.528 | 0.303 | 0.276 | 0.426 |
CTC-->SDSC | 0.292 | 0.058 | 0.092 | 0.161 | 0.143 | 0.241 |
SDSC-->CTC | 0.256 | 0.611 | 0.204 | 0.210 | 0.285 | 0.395 |
KTH-->SDSC | 0.563 | 0.273 | 0.695 | 0.886 | 0.287 | 0.384 |
SDSC-->KTH | 0.614 | 0.159 | 0.749 | 0.799 | 0.212 | 0.511 |
Average | 0.407 | 0.218 | 0.410 | 0.434 | 0.233 | 0.390 |
https://github.com/chenchao666/JDDA-Master (Tensorflow)
https://github.com/yunjey/pytorch-tutorial